A ‘Shocking’ Amount of the Web Is Already AI-Translated Trash, Scientists Determine

Key Points:

  • More than half of the sentences on the web have been translated into two or more languages, raising concerns about training large language models on that text.
  • Sentences in high-resource languages tend to be translated more accurately; sentences in low-resource languages show higher average parallelism but lower translation quality.
  • The prevalence of low-quality machine-translated content raises questions about how well large language models can be developed for lower-resource languages.

Summary:

The internet is filled with machine-translated content, particularly in languages spoken in Africa and elsewhere in the Global South, raising concerns about the quality of large language models trained on it.

Researchers found that more than half of the sentences on the web have been translated into two or more languages, with widely varying quality, which they consider a serious problem for training large language models.

The study indicates that high-resource languages tend to have more accurate translations and an average parallelism of 4, where parallelism is the number of languages in which a given sentence appears. Low-resource languages have an average parallelism of 8.6, and their translations tend to be much lower in quality.
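
To make the parallelism figures concrete, here is a minimal Python sketch of how a per-language average could be computed. It assumes parallelism means the number of languages in a group of mutually translated sentences; the toy data and the function name average_parallelism are hypothetical illustrations, not the researchers' data or method.

```python
from collections import defaultdict

# Hypothetical toy data: each dict is one group of sentences that are
# translations of one another, keyed by language code.
translation_groups = [
    {"en": "Hello", "fr": "Bonjour"},
    {"en": "Thanks", "fr": "Merci", "de": "Danke", "sw": "Asante"},
    {"en": "Hi", "fr": "Salut", "de": "Hallo", "es": "Hola",
     "sw": "Habari", "yo": "Bawo"},
]

def average_parallelism(groups):
    """Return, per language, the mean size of the groups it appears in.

    Assumption: a group's parallelism is simply the number of
    languages it contains.
    """
    totals = defaultdict(int)  # sum of group sizes per language
    counts = defaultdict(int)  # number of groups per language
    for group in groups:
        size = len(group)
        for lang in group:
            totals[lang] += size
            counts[lang] += 1
    return {lang: totals[lang] / counts[lang] for lang in totals}

avg = average_parallelism(translation_groups)
for lang in sorted(avg):
    print(lang, avg[lang])
```

In this toy data, a language that appears only in the largest group ends up with the highest average, which mirrors the pattern the summary describes for low-resource languages.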


©2024 The Horizon