drahardja
#AI #LLM is gunking up the web, especially for lesser-represented languages. Spammers are creating garbage English-language content using LLMs, then translating it into multiple languages at the same time using machine translation, presumably to generate clickbait ad revenue in all of them.
In English, such gunk accounts for some 9% of total sampled web content. But in languages with less representation on the Internet, the figures could be much higher. In Malay, it’s something like 26%, and in Swahili it’s nearly HALF of everything found on the web.
Paper [pdf]: “A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism”