Large language models (LLMs) generate text by predicting the most probable next word or token in a sequence. Before an LLM can be put to practical use, it must first be trained through a process called pretraining.

Below is a common self-supervised pretraining approach for training LLMs:

  1. An untrained LLM is set up with a randomly initialized set of weights.
  2. A large corpus of text is collected from various sources to serve as the training dataset, and it is provided to the model in small chunks. Increased training dataset diversity has been shown to improve general cross-domain knowledge and downstream generalization capability in large-scale language models. Previously, Common Crawl was the go-to dataset for training LLMs. Common Crawl contains raw web page data, extracted metadata, and text extractions collected since 2008, and its size is measured in petabytes (1 petabyte = 1e6 GB). LLMs trained on this dataset produced effective results but failed to generalize well across other tasks. Hence, a new dataset called The Pile was created from 22 diverse, high-quality datasets. It combines existing data sources with newly built datasets and totals about 825 GB.
  3. Because the data is derived from the Internet, it must be preprocessed, or cleaned. Preprocessing includes removing HTML code, fixing spelling mistakes, filtering out toxic or biased data, converting emoji into their text equivalents, and deduplicating data.
  4. To be used for LLM training, the corpus must be converted into tokens. Words are broken down into subwords using tokenization algorithms such as Byte Pair Encoding (BPE) or WordPiece. The resulting input and output token pairs can then be sent to the model for training.
  5. Given the prepared, tokenized data, the LLM (typically the decoder portion of the transformer architecture) attempts to predict the next word or token of the provided sequence from the dataset. Although the training environment knows the correct sequence and can therefore determine whether the model’s generation is accurate, all future tokens in the sequence are masked (hidden) from the model during training, forcing it to predict the next token based on the values of its parameters. After the degree of error (loss) between the generated text and the dataset is measured, the model’s parameter values are adjusted iteratively so that the loss decreases and its next-token predictions improve.
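Steps 3 and 4 above can be illustrated with a minimal sketch. It assumes a tiny in-memory corpus and covers only a small part of real preprocessing (tag stripping and exact deduplication), followed by a toy version of BPE that learns merges by repeatedly joining the most frequent adjacent symbol pair. The function names (`clean`, `deduplicate`, `learn_bpe_merges`, `tokenize`) and the sample documents are hypothetical and chosen for illustration; production tokenizers work on bytes, handle word boundaries, and are far more efficient.

```python
import re
from collections import Counter


def clean(doc: str) -> str:
    """Strip HTML tags and collapse whitespace (a small subset of step 3)."""
    no_tags = re.sub(r"<[^>]+>", " ", doc)
    return re.sub(r"\s+", " ", no_tags).strip()


def deduplicate(docs: list[str]) -> list[str]:
    """Drop exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(docs))


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words: list[str], num_merges: int) -> list[tuple]:
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Every word starts out as a tuple of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def tokenize(word: str, merges: list[tuple]) -> list[str]:
    """Split a word into subword tokens by replaying the learned merges."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Made-up sample documents, including one exact duplicate.
raw_docs = [
    "<p>the lower  lowest</p>",
    "<p>the lower  lowest</p>",
    "<b>low</b> lower",
]
docs = deduplicate([clean(d) for d in raw_docs])
merges = learn_bpe_merges(" ".join(docs).split(), num_merges=2)
print(docs)
print(merges)
print(tokenize("lowest", merges))
```

With only two merges the learned vocabulary is very small; increasing `num_merges` produces longer subword units, which is the trade-off a real tokenizer tunes through its vocabulary size.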



For example, given the sequence “The dog chased the…”, the model might predict “bird” as the next word and then be told that the answer is wrong because the actual next word was “cat”. The loss (or error) relative to the expected result is then calculated, and the model weights are adjusted so that the model predicts better on the next training iteration.
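The dog-and-cat example can be worked through numerically. The sketch below is a deliberate simplification: it assumes a hypothetical three-word vocabulary and treats the model’s raw scores (logits) for this one context as the only trainable parameters, whereas a real LLM adjusts billions of weights that produce those scores. The loss is the standard cross-entropy, and the update is plain gradient descent; the starting scores and learning rate are made up.

```python
import math

VOCAB = ["bird", "cat", "ball"]  # hypothetical three-word vocabulary


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Loss is small when the correct token gets high probability."""
    return -math.log(probs[target_index])


def sgd_step(logits, target_index, lr):
    """One gradient-descent update of the logits."""
    probs = softmax(logits)
    # For softmax + cross-entropy: d(loss)/d(logit_i) = p_i - 1[i == target]
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return [x - lr * g for x, g in zip(logits, grads)]


def predict(logits):
    """Return the vocabulary word with the highest score."""
    return VOCAB[max(range(len(logits)), key=lambda i: logits[i])]


# Made-up starting scores for the context "The dog chased the".
logits = [2.0, 1.0, 0.0]
target = VOCAB.index("cat")  # the actual next word in the dataset

prediction_before = predict(logits)
loss_before = cross_entropy(softmax(logits), target)

for _ in range(20):
    logits = sgd_step(logits, target, lr=0.5)

prediction_after = predict(logits)
loss_after = cross_entropy(softmax(logits), target)

print(prediction_before, round(loss_before, 3))
print(prediction_after, round(loss_after, 3))
```

The model starts out preferring “bird”; each update lowers the loss, and after a few steps “cat” becomes the top prediction.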


Training Datasets

Vast amounts of data are required to train generative AI large language models (LLMs). The data within the various corpora influences the output of a model, so it is critical to evaluate the training data to mitigate model bias. The following datasets have been used to train some of the world’s leading LLMs:


Dataset Name | Description
BookCorpus | Text scraped from about 11,000 unpublished books, forming a 985 million-word dataset. It was initially created to align storylines in books with their movie interpretations. The dataset was used for training LLMs like RoBERTa, XLNet, and T5.
C4 | A 750 GB English corpus derived from Common Crawl. It uses heuristic methods to extract only natural language data while removing gibberish text. C4 has also undergone heavy deduplication to improve its quality. Language models like MPT-7B and T5 are pretrained with C4.
Common Crawl | Contains petabytes of data collected by crawling billions of web pages since 2008. The corpus contains raw web page data, metadata extracts, and text extracts with light filtering.
CommonsenseQA | General-domain, crowd-sourced questions with high semantic complexity that require the use of prior knowledge.
MedMCQA | Real-world medical entrance exam questions.
MedQA | Questions from US medical board exams (USMLE).
OpenBookQA | Science and broad common-knowledge questions that require multi-step reasoning and rich text comprehension.
Red Pajama | 1.2 trillion tokens extracted from Common Crawl, C4, GitHub, books, and other sources. Red Pajama’s openly documented data was used to train MPT-7B and OpenLLaMA.
RefinedWeb | A massive corpus of deduplicated and filtered tokens from the Common Crawl dataset. The dataset has more than 5 trillion tokens of textual data, of which 600 billion are publicly available. It was developed as an initiative to train the Falcon-40B model with smaller but high-quality datasets.
ROOTS | A 1.6 TB multilingual dataset curated from text in 59 languages. Created to train the BigScience Large Open-science Open-access Multilingual (BLOOM) language model. ROOTS uses heavily deduplicated and filtered data from Common Crawl, GitHub code, and other crowdsourced initiatives.
Starcoder | A programming-centric dataset built from 783 GB of code written in 86 programming languages. It also contains 250 billion tokens extracted from GitHub and Jupyter Notebooks. Salesforce CodeGen, StarCoder, and StableCode were trained with Starcoder data to enable better program synthesis.
StrategyQA | General-domain, crowd-sourced questions with high semantic complexity that require implicit reasoning and multi-step answer strategies. Answers are yes/no.
The Pile | An 825 GB corpus that enhances a model’s generalization capability across a broader context. It was curated from 22 diverse datasets, mostly from academic or professional sources. Used, in whole or in part, to train GPT-Neo, LLaMA, and OPT.
WebText2 | The text of web pages from all outbound Reddit links in posts with 3+ upvotes.
Wikipedia | All Wikipedia pages in the English language.

Training Challenges

LLMs are trained on massive text corpora, typically at least 1,000 GB in size. The models trained on these datasets are very large, containing billions of parameters. Training such large models on a corpus of this size requires infrastructure that supports multiple GPUs.
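A rough back-of-the-envelope calculation shows why a single GPU is not enough. The sketch below rests on stated assumptions that are not from the text above: each parameter is stored in 2 bytes (16-bit precision), 1 GB is taken as 1e9 bytes, and the 80 GB accelerator is a hypothetical device size. It counts only the memory needed to hold the weights; training also needs room for gradients, optimizer state, and activations, so real requirements are considerably higher.

```python
import math


def weight_memory_gb(num_parameters: int, bytes_per_parameter: int = 2) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


def min_gpus_for_weights(
    num_parameters: int, gpu_memory_gb: float, bytes_per_parameter: int = 2
) -> int:
    """Lower bound on the GPU count: weights only, ignoring training overhead."""
    needed = weight_memory_gb(num_parameters, bytes_per_parameter)
    return math.ceil(needed / gpu_memory_gb)


# Assumed example: a 70-billion-parameter model on hypothetical 80 GB GPUs.
params = 70_000_000_000
needed_gb = weight_memory_gb(params)
gpu_count = min_gpus_for_weights(params, gpu_memory_gb=80)
print(f"{needed_gb:.0f} GB of weights -> at least {gpu_count} GPUs")
```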


Aside from the infrastructure, training LLMs is often cost-prohibitive. Companies and research institutions invest millions of dollars to set up that infrastructure and train LLMs from scratch. For example, it is estimated that GPT-3 cost around $4.6 million to train from scratch.


Scaling laws were introduced to determine how much data is optimal for training a model of a particular size. In 2022, DeepMind proposed scaling laws relating the optimal model size and dataset size (number of tokens) in the paper Training Compute-Optimal Large Language Models. These are popularly known as the Chinchilla or Hoffmann scaling laws. They state that the number of tokens used to train an LLM should be about 20 times the number of parameters in the model.


For example, 1,400B (1.4T) tokens should be used to train a 70B-parameter LLM (20 text tokens per parameter).
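The rule of thumb is simple enough to express as a short calculation. The sketch below uses the 20-tokens-per-parameter ratio exactly as stated above; the function names are hypothetical, and the ratio is an approximation rather than a precise constant.

```python
TOKENS_PER_PARAMETER = 20  # rule-of-thumb ratio quoted in the text


def optimal_tokens(num_parameters: int) -> int:
    """Training tokens suggested for a model of the given size."""
    return TOKENS_PER_PARAMETER * num_parameters


def optimal_parameters(num_tokens: int) -> int:
    """Model size suggested for a dataset with the given token count."""
    return num_tokens // TOKENS_PER_PARAMETER


# A 70B-parameter model, as in the example above.
print(optimal_tokens(70 * 10**9))

# Reading the rule the other way: a 1.2-trillion-token dataset.
print(optimal_parameters(1200 * 10**9))
```

Read in reverse, the same ratio suggests that a 1.2-trillion-token corpus such as Red Pajama is matched to a model of roughly 60B parameters.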


From Pretraining to Instruction

It is important to note that the output of the pretraining phase is very raw. If pretraining is performed correctly, the model should have a broad grasp of human language in its variety and complexity. However, it may not reliably follow instructions and often produces output that appears to have no concept of how to meet human expectations. To make an LLM practical to interact with, supervised fine-tuning (SFT) is required to further enhance the model.







©2023 The Horizon