Day: September 9, 2023

MADLAD-400: A Multilingual And Document-Level Large Audited Dataset

Source: Google We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations revealed by self-auditing MADLAD-400, and the role data auditing had in the dataset creation process. We then train and release a 10.7B-parameter multilingual machine translation model on 250 billion tokens covering […]

DAILY LINKS TO YOUR INBOX

PROMPT ENGINEERING

Prompt Engineering Guides

ShareGPT

 

©2024 The Horizon