drahardja, (edited)
@drahardja@sfba.social

AI is gunking up the web, especially for lesser-represented languages. Spammers are creating garbage English-language content using LLMs, then machine-translating it into many languages at once, presumably to generate clickbait ad revenue in all of them.

In English, such gunk accounts for some 9% of total sampled web content. But in languages with less representation on the Internet, the figures could be much higher. In Malay, it’s something like 26%, and in Swahili it’s nearly HALF of everything found on the web.

Paper [pdf]: “A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism”

https://arxiv.org/pdf/2401.05749.pdf

carkner,
@carkner@klezmor.im

@drahardja I started noticing this a few years ago. I would write a Wikipedia article about a niche historical figure, then go back a few months later to see if I had missed anything, and in Google searches I would find a bunch of Thai or other-language results which, I quickly realized, were machine translations of my article.

simon,

@drahardja @bill Holy shit 9% of English is already terrifying. That's way worse than I imagined it would be.

alberto_cottica,
@alberto_cottica@mastodon.green

@drahardja something similar was predicted by Peter Watts in the Rifters trilogy.
