Large language models (LLMs) generate text by predicting the most probable next word, or token, in a sequence. Before an LLM can be used productively, it must first be trained through a process called pretraining.
A common pretraining approach is self-supervised next-token prediction: the model reads large amounts of unlabeled text and repeatedly tries to predict the word that comes next.
For example, given the sequence The dog chased the… the model might predict bird as the next word, even though the actual next word in the text was cat. The loss (or error) between the prediction and the expected token is calculated, and the model weights are adjusted so that the model predicts better on the next pass.
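As a rough sketch of this loop, the snippet below computes the next-token loss for a single sentence and applies one optimizer step. It uses a small pretrained GPT-2 checkpoint from the Hugging Face transformers library purely for convenience; real pretraining starts from randomly initialized weights and runs this loop over trillions of tokens on large GPU clusters.

```python
# Minimal sketch of the self-supervised next-token objective.
# GPT-2 is used here only so the example is small and runnable;
# actual pretraining starts from random weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# The model is trained to predict each token from the tokens before it.
batch = tokenizer("The dog chased the cat", return_tensors="pt")

# With labels set to the input ids, the library shifts them internally so
# position t is scored on predicting token t+1 (cross-entropy loss).
outputs = model(**batch, labels=batch["input_ids"])
loss = outputs.loss

# Backpropagate the error and adjust the weights: the
# "calculate the loss, then update" loop described above.
loss.backward()
optimizer.step()
optimizer.zero_grad()
```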
Training Datasets
Vast amounts of data are required to train generative large language models. The data within these corpora influences the output of a model, so it is critical to evaluate the training data to mitigate model bias. The following datasets have been used to train some of the world’s leading LLMs (a short example of loading one of them follows the table):
Dataset Name | Description | Download | Models |
---|---|---|---|
BookCorpus | Scraped text from 11,000 unpublished books turned into a 985 million-word dataset. It was initially created to align the storylines of books with their movie adaptations. | https://huggingface.co/datasets/bookcorpus | RoBERTa, XLNet, T5 |
C4 | A 750 GB English corpus derived from Common Crawl. It uses heuristic methods to extract only natural-language data while removing gibberish text, and it has undergone heavy deduplication to improve its quality. | https://huggingface.co/datasets/c4 | MPT-7B, T5 |
Common Crawl | Petabytes of data collected over 8 years of web crawling across billions of pages. The corpus contains raw web page data, metadata extracts, and text extracts with light filtering. | https://commoncrawl.org/ | |
CommonsenseQA | General-domain crowd-sourced questions with high semantic complexity that require the use of prior knowledge. | https://allenai.org/data/commonsenseqa | |
MedMCQA | Real-world medical entrance exam questions. | https://huggingface.co/datasets/medmcqa | |
MedQA | Questions from US (USMLE) medical board exams. | https://huggingface.co/datasets/bigbio/med_qa | |
OpenBookQA | Science and broad common-knowledge questions that require multi-step reasoning and rich text comprehension. | https://allenai.org/data/open-book-qa | |
Red Pajama | 1.2 trillion tokens extracted from Common Crawl, C4, GitHub, books, and other sources, assembled with a transparent, open approach to data curation. | https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2 | MPT-7B, OpenLLaMA |
RefinedWeb | A massive corpus of deduplicated and filtered tokens from Common Crawl. The dataset contains more than 5 trillion tokens of textual data, of which 600 billion are made publicly available. It was developed as an initiative to train the Falcon-40B model with smaller but high-quality datasets. | https://huggingface.co/datasets/tiiuae/falcon-refinedweb | Falcon-40B |
ROOTS | A 1.6 TB multilingual dataset curated from text in 59 languages, built from heavily deduplicated and filtered data from Common Crawl, GitHub code, and other crowdsourced initiatives. Created to train the BigScience Large Open-science Open-access Multilingual (BLOOM) language model. | https://huggingface.co/bigscience-data | BLOOM |
StarCoderData | A programming-centric dataset built from 783 GB of code written in 86 programming languages. It also contains 250 billion tokens extracted from GitHub and Jupyter Notebooks. | https://huggingface.co/datasets/bigcode/starcoderdata | StarCoder, StableCode, Salesforce CodeGen |
StrategyQA | General-domain crowd-sourced questions with high semantic complexity that require implicit reasoning and multi-step answer strategies. Answers are yes/no. | https://allenai.org/data/strategyqa | |
The Pile | An 800 GB corpus curated from 22 diverse datasets, mostly from academic or professional sources, designed to enhance a model’s generalization across a broad range of domains. | https://pile.eleuther.ai/ | GPT-Neo, LLaMA, OPT |
WebText2 | The text of web pages from all outbound Reddit links from posts with 3+ upvotes. | https://openwebtext2.readthedocs.io/en/latest/ | |
Wikipedia | All Wikipedia pages in the English language. | https://huggingface.co/datasets/wikipedia | |
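As a quick illustration of how these corpora are typically accessed, the sketch below streams a few records from C4 using the Hugging Face datasets library. The dataset path ("allenai/c4") and the "en" configuration are assumptions based on the current Hugging Face Hub layout; check the dataset card linked above for the exact path and available configurations.

```python
# Stream a few C4 records without downloading the full ~750 GB corpus.
from datasets import load_dataset

# Streaming mode iterates over the dataset lazily.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    # Each C4 record is expected to contain a "text" field with the page text.
    print(example["text"][:200])
    if i >= 2:
        break
```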
Training Challenges
LLMs are trained on massive text corpora, often at least 1,000 GB in size. The models trained on these datasets are themselves very large, containing billions of parameters. Training such large models on such large corpora requires infrastructure and hardware that support many GPUs.
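As a rough illustration of what multi-GPU training involves, the sketch below wraps a placeholder model in PyTorch's DistributedDataParallel so that each GPU processes its own shard of the batch and gradients are averaged across devices. The model, data, and hyperparameters are stand-ins rather than a real LLM configuration, and the script assumes it is launched with torchrun (for example `torchrun --nproc_per_node=8 train.py`); models with tens of billions of parameters additionally need tensor or pipeline parallelism, which this sketch does not show.

```python
# Minimal data-parallel training sketch (one process per GPU).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")             # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real LLM has billions of parameters and would
    # not fit on a single GPU without further model parallelism.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(8, 1024, device=local_rank)  # each rank gets its own shard
    loss = model(x).pow(2).mean()
    loss.backward()                              # gradients are all-reduced across GPUs
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```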
Aside from the infrastructure, training LLMs is often cost-prohibitive. Companies and research institutions invest millions of dollars to set up that infrastructure and train LLMs from scratch. For example, GPT-3 is estimated to have cost around $4.6 million to train from scratch.
Scaling laws were introduced to determine the optimal amount of data required to train a model of a particular size. In 2022, DeepMind proposed scaling laws relating the optimal model size and dataset size (number of tokens) in the paper Training Compute-Optimal Large Language Models. These are popularly known as the Chinchilla or Hoffmann scaling laws. They suggest that the number of tokens used to train an LLM should be roughly 20 times the number of parameters in the model.
For example, about 1,400B (1.4T) tokens should be used to train a 70B-parameter model (20 text tokens per parameter).
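This rule of thumb translates directly into a back-of-the-envelope calculation; the helper below (an illustrative function, not part of any library) simply multiplies the parameter count by 20.

```python
# Chinchilla rule of thumb: ~20 training tokens per model parameter.
def chinchilla_optimal_tokens(num_parameters: float, tokens_per_param: float = 20.0) -> float:
    """Return the approximate compute-optimal number of training tokens."""
    return num_parameters * tokens_per_param

# 70B parameters -> about 1.4 trillion tokens, matching the example above.
print(f"{chinchilla_optimal_tokens(70e9):.3e} tokens")  # ~1.400e+12
```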
From Pretraining to Instruction
It is important to note that the output of the pretraining phase is very raw. The model may not reliably follow instructions and often produces output that shows little awareness of human expectations. If pretraining is performed well, the model should have learned human language in much of its variety and complexity, but to interact usefully with an LLM, supervised fine-tuning (SFT) is required to further refine the model.
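To make the contrast concrete, the snippet below shows the general shape of an instruction/response training example used during SFT. The field names and prompt template are illustrative assumptions; real projects use their own schemas and chat templates.

```python
# An illustrative (made-up) instruction/response pair for SFT.
sft_example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models are pretrained on vast text corpora...",
    "output": "LLMs learn language patterns by predicting the next token over huge datasets.",
}

# During SFT the pair is rendered into a single prompt/completion string and the
# model is trained with the same next-token loss used in pretraining, but only
# on curated, human-aligned examples.
prompt = (
    f"### Instruction:\n{sft_example['instruction']}\n\n"
    f"### Input:\n{sft_example['input']}\n\n"
    f"### Response:\n"
)
completion = sft_example["output"]
print(prompt + completion)
```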