
Dolly 2.0 trained on a 100% human-generated and open source dataset of prompts & responses

Databricks announced the release of what it calls the first open source instruction-tuned language model, Dolly 2.0. It was trained using a methodology similar to InstructGPT's, but with a claimed higher-quality dataset that is 100% open source.

This model is free to use, including for commercial purposes, because every part of the model is 100% open source.

Open Source Instruction Training

What makes ChatGPT able to follow directions is the training it receives using techniques outlined in the InstructGPT research paper.

The breakthrough demonstrated with InstructGPT is that language models don't need to keep growing ever larger to follow instructions better.

By using human-evaluated question-and-answer training, OpenAI was able to train a better language model using one hundred times fewer parameters than the previous model, GPT-3.

Databricks used a similar approach to create a prompt-and-response dataset they call databricks-dolly-15k.

Their prompt/response dataset was created without scraping web forums or Reddit.

databricks-dolly-15k is a dataset created by Databricks employees: 15,000 original, human-generated prompt and response pairs designed to train the Dolly 2.0 language model in the same way that the ChatGPT model was trained using InstructGPT.
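The dataset is distributed as one JSON record per line. A minimal sketch of reading and tallying such records by category (the field names `instruction`, `context`, `response`, and `category` follow the published dataset schema; the two sample records below are made-up illustrations, not taken from the dataset itself):

```python
import json
from collections import Counter

# Illustrative records in the databricks-dolly-15k JSONL schema
# (instruction, context, response, category). These examples are
# invented for demonstration, not quoted from the dataset.
sample_jsonl = """\
{"instruction": "Name three uses for a paper clip.", "context": "", "response": "Holding papers, resetting devices, serving as an improvised hook.", "category": "brainstorming"}
{"instruction": "Is a tomato a fruit or a vegetable?", "context": "Botanically, a tomato is the berry of the plant Solanum lycopersicum.", "response": "Botanically, it is a fruit.", "category": "closed_qa"}"""

# Parse each line as a standalone JSON object.
records = [json.loads(line) for line in sample_jsonl.splitlines()]

# Count how many examples fall into each instruction category.
by_category = Counter(r["category"] for r in records)
print(by_category)
```

Tallying the `category` field like this is how one would check that all the behavioral categories described below are represented in the training data.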

The GitHub page for the dataset explains how they did it:

“databricks-dolly-15k is an open source dataset of instruction-following records used in training databricks/dolly-v2-12b that was generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

…Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category.

The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.

Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and only select questions they could be reasonably expected to answer correctly.”

Databricks claims that this may be the very first human-generated instruction dataset created to train a language model to follow instructions, just as ChatGPT does.

The challenge was to create a 100% original dataset that had zero ties to ChatGPT or any other source with a restrictive license.

Employees were incentivized by a contest to contribute the 15,000 prompts and responses across seven task categories, such as brainstorming, classification, and creative writing.

Databricks asserts that the databricks-dolly-15k training set may be superior to the dataset used to train ChatGPT.

They note that although their dataset is smaller than the one used to train the Stanford Alpaca model, their model performed better because their data is higher quality.

They write:

“Dolly 2.0 model, based on EleutherAI’s pythia-12b, exhibited high-quality instruction following behavior. In hindsight, this isn’t surprising.

Many of the instruction tuning datasets released in recent months contain synthesized data, which often contains hallucinations and factual errors.

databricks-dolly-15k, on the other hand, is generated by professionals, is high quality, and contains long answers to most tasks.

…we don’t expect Dolly to be state-of-the-art in terms of effectiveness.

However, we do expect Dolly and the open source dataset will act as the seed for a multitude of follow-on works, which may serve to bootstrap even more powerful language models.”
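Instruction-tuned models like Dolly are prompted at inference time with a fixed template that separates the instruction from the slot where the model writes its response. A minimal sketch of such a template; the exact marker strings and intro wording here are assumptions modeled on common instruction-tuning formats, not quoted from the Dolly codebase:

```python
# Hypothetical instruction-prompt template in the style used for
# instruction-tuned models; the exact wording is an assumption.
INTRO = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"

def build_prompt(instruction: str) -> str:
    """Assemble the text fed to the model; the model's generated
    continuation after RESPONSE_KEY is the answer."""
    return f"{INTRO}\n\n{INSTRUCTION_KEY}\n{instruction}\n\n{RESPONSE_KEY}\n"

prompt = build_prompt("Summarize the InstructGPT training approach in one sentence.")
print(prompt)
```

Training on prompt/response pairs rendered through a template like this is what teaches the base model (here, pythia-12b) to treat the instruction as a task rather than as text to continue.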

Open Source Language Model Named Dolly 2.0 Trained Similarly To ChatGPT

15.04.2023