The internet is filled with machine-translated content, particularly in languages spoken in Africa and the Global South, raising concerns about the quality of the large language models trained on that content.
Researchers found that over half of the sentences on the web have been translated into two or more languages, with varying degrees of quality. Because large language models are trained largely on text scraped from the web, low-quality translations can end up in their training data.
The study indicates that high-resource languages have an average parallelism of 4 (roughly, the number of languages in which a given translated sentence appears) and tend to have more accurate translations. Low-resource languages have an average parallelism of 8.6, and their translations tend to be much worse in quality.
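
As a rough illustration of how an average-parallelism figure like those above could be computed, the following Python sketch assumes that parallelism means the number of languages in a set of mutually translated sentences. The function name `average_parallelism` and the toy data are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def average_parallelism(translation_sets):
    """Map each language to the mean size of the translation sets it appears in.

    Each element of `translation_sets` is a collection of language codes for
    one sentence and all of its translations (an assumed data layout).
    """
    sizes = defaultdict(list)
    for langs in translation_sets:
        unique = set(langs)  # ignore duplicate language codes
        for lang in unique:
            sizes[lang].append(len(unique))
    return {lang: sum(v) / len(v) for lang, v in sizes.items()}


# Made-up toy data: each set lists the languages one sentence appears in.
toy_sets = [
    {"en", "fr"},
    {"en", "fr", "de"},
    {"en", "fr", "de", "es", "wo", "xh"},
    {"en", "fr", "de", "es", "pt", "it", "wo", "xh"},
]

result = average_parallelism(toy_sets)
for lang in sorted(result):
    print(lang, round(result[lang], 2))
```

In this toy data the lower-resource language codes appear only in the larger sets, so their average comes out higher than that of English, which mirrors the pattern the study describes.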