Textbooks Are All You Need II: phi-1.5 technical report

Source: Microsoft. We continue the investigation into the power of smaller Transformer-based language models as initiated by TinyStories — a 10 million parameter model that can produce coherent English — and the follow-up work on phi-1, a 1.3 billion parameter model with Python coding performance close to the state-of-the-art. The latter work proposed to use […]

MADLAD-400: A Multilingual And Document-Level Large Audited Dataset

Source: Google. We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations revealed by self-auditing MADLAD-400, and the role data auditing had in the dataset creation process. We then train and release a 10.7B-parameter multilingual machine translation model on 250 billion tokens covering […]

From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting

Source: Salesforce. Selecting the “right” amount of information to include in a summary is a difficult task. A good summary should be detailed and entity-centric without being overly dense and hard to follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with what we refer to as a “Chain of Density” (CoD) […]
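The iterative densification idea behind CoD can be sketched in a few lines. The prompt wording and the `call_llm` function below are illustrative assumptions, not the paper's verbatim template or a real API: the sketch only shows the loop structure of starting from a sparse summary and repeatedly rewriting it at the same length while folding in missing entities.

```python
def build_cod_prompt(article: str, n_steps: int = 5) -> str:
    """Construct a single CoD-style prompt asking for increasingly dense
    summaries (paraphrased wording, not the paper's exact template)."""
    return (
        "Article:\n" + article + "\n\n"
        f"Generate {n_steps} increasingly dense summaries of the article.\n"
        "Step 1: write an initial sparse summary (~80 words) covering few entities.\n"
        f"Steps 2-{n_steps}: identify 1-3 informative entities from the article "
        "missing from the previous summary, then rewrite the summary at the SAME "
        "length so it also covers them (fuse, compress, remove filler).\n"
        "For each step, return the missing entities and the new summary."
    )


def chain_of_density(article: str, call_llm, n_steps: int = 5) -> list[str]:
    """Run densification as one model call per step.

    `call_llm` is a hypothetical stand-in for a GPT-4 completion call:
    it takes a prompt string and returns the model's text response.
    """
    # Step 1: sparse, entity-light starting summary.
    summary = call_llm("Summarize in ~80 words, mentioning few entities:\n" + article)
    summaries = [summary]
    # Steps 2..n: rewrite at constant length while adding missing entities.
    for _ in range(n_steps - 1):
        summary = call_llm(
            "Previous summary:\n" + summary +
            "\n\nRewrite it at the same length, adding 1-3 informative "
            "entities from this article that it currently misses:\n" + article
        )
        summaries.append(summary)
    return summaries
```

In the single-prompt variant, `build_cod_prompt` asks the model to perform all steps in one response; the loop variant trades extra calls for tighter control over each densification step.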