Cohere for AI, the nonprofit research lab established by Cohere in 2022, has recently unveiled Aya, an open-source large language model supporting 101 languages, more than double the number covered by existing models. This feat was achieved through the collaborative efforts of over 3,000 participants from 119 countries. According to Sara Hooker, VP of research at Cohere, the project turned out to be a monumental endeavor, yielding a rich dataset of over 513 million instruction fine-tuning annotations.
Hooker emphasized the tremendous value of this dataset, calling it ‘gold dust’ for the success of large language models. In performance tests, the Aya model outperformed well-known models like mT0 and Bloomz by a considerable margin while expanding coverage to more than 50 previously unsupported languages, including Somali and Uzbek.
The release of Aya marks a significant milestone in the advancement of multilingual AI capabilities, with experts like Ivan Zhang praising the project’s ambition to serve a more diverse linguistic audience beyond English. Cohere for AI aims to bridge the gap in multilingual data availability, enabling researchers to apply large language models to languages and cultures often overlooked by existing models.
Aleksa Gordic, a former Google DeepMind researcher, commends Aya and similar multilingual data initiatives as essential steps toward building high-quality language-specific models. While acknowledging the need for more efforts in this direction, Gordic stresses the importance of a global research community and government support to preserve linguistic diversity in the evolving AI landscape.
Cohere for AI’s Aya model and datasets are already accessible on Hugging Face, signaling a significant advancement in the democratization of AI technology for a more inclusive and linguistically diverse future.
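For readers who want to try the released model, a minimal sketch using the Hugging Face transformers library is shown below. The checkpoint identifier "CohereForAI/aya-101" and the encoder-decoder (seq2seq) loading path are assumptions based on the public release; check the model card on Hugging Face for the exact repository name, architecture, and hardware requirements, as the model is large and typically needs a GPU.

```python
# Minimal sketch: loading and prompting the Aya model via Hugging Face transformers.
# Assumption: the checkpoint is published as "CohereForAI/aya-101" and is an
# encoder-decoder model; verify the identifier and architecture on the model card.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "CohereForAI/aya-101"  # assumed repository name; confirm on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Any of the 101 supported languages can appear in the prompt or the requested output.
prompt = "Translate to Somali: Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```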