alecm, to ArtificialIntelligence

“Suffice it to say that everyone in possession of a copy of the LAION-5B images has hundreds if not thousands of instances of CSAM” | …so that’s 0.0001% of the content, then

So David Thiel at Stanford has posted a much-reported paper/story which tells us that the dataset which drives Stable Diffusion and a bunch of other AI systems has scraped:

hundreds if not thousands of instances of CSAM (and a much larger number of instances of NCII more broadly)

https://www.threads.net/


…and it struck me to ask “how many images are there in LAION-5B so we can get a percentage?”

It turns out that the number of images in LAION-5B is five billion – hence the 5B:

LAION-5B was released in early 2022 by a German nonprofit that has received funding from several AI startups. The dataset comprises more than 5 billion images scraped from the web and accompanying captions. It’s an upgraded version of earlier AI training dataset, called LAION-400M, that was published by the same nonprofit a few months earlier and includes about 400 million images.

https://siliconangle.com/2023/12/20/researchers-find-csam-images-laion-5b-ai-training-dataset/


So if we generously interpret “…if not thousands…” to mean “five thousand” then some simple maths tells us that this is 0.0001% of the content, or literally “one in a million”.
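As a quick sanity check, that arithmetic can be sketched in Python. The inputs are the assumptions already stated: 5,000 images as a generous reading of “thousands”, and a round 5 billion for the size of the dataset. The variable names are purely illustrative.

```python
# Assumed figures from the text: a generous "five thousand" flagged images
# out of a round five billion images in LAION-5B.
flagged_images = 5_000
total_images = 5_000_000_000

fraction = flagged_images / total_images
percentage = fraction * 100

print(f"{percentage:.4f}%")              # share of the dataset, as a percentage
print(f"1 in {round(1 / fraction):,}")   # the same figure expressed as "one in N"
```

Swap in your own guess for `flagged_images` to see how the percentage moves.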

This is the “needle in a haystack” ballpark – again, literally, if a heavyweight darning needle weighs 1 gram, then one million needles would weigh 1000kg, and the largest 4x4x8 haybales max-out at 2000lb / a little over 900kg.

The US Food & Drug Administration permits “defects” of up to “[an] Average of 9 mg or more rodent excreta pellets and/or pellet fragments per kilogram” – which works out as:

(9mg / 1kg) * 100 = 0.0009%

So there can be up to 9x as much mouse poop, proportionally, in the flour which makes your bread as there generously is CSAM in the LAION-5B dataset.
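For anyone who wants to check the comparison, here is a minimal sketch in Python. It assumes the FDA figure of 9 mg per kilogram quoted above and the generous estimate of 5,000 images out of 5 billion; the variable names are illustrative only.

```python
# FDA figure quoted above: 9 mg of rodent excreta per kilogram of product.
excreta_mg = 9
product_mg = 1_000_000  # 1 kg expressed in milligrams

fda_percentage = excreta_mg / product_mg * 100

# Assumed estimate from earlier in the post: 5,000 images out of 5 billion.
csam_percentage = 5_000 / 5_000_000_000 * 100

ratio = fda_percentage / csam_percentage

print(f"FDA level:         {fda_percentage:.4f}%")
print(f"LAION-5B estimate: {csam_percentage:.4f}%")
print(f"Ratio:             {ratio:.1f}x")
```

Converting the kilogram to milligrams first keeps both sides of the division in the same unit, which is the step most easily fumbled when doing this by hand.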

“But this is all guesswork on your part / One image is one too many…”

The numbers are all above. Feel free to nitpick. Pick your own percentages. The FDA acknowledges that poop in food is unavoidable, and the unstated goal of “Zero CSAM in a scraped dataset” is probably likewise unattainable. Thiel himself acknowledges:

While it’s not surprising that a crawl of the public internet will contain some CSAM, there’s no reason to go gather data on that scale without appropriate safeguards. The project that seeded the LAION sets made some efforts to filter content with CLIP, but it didn’t do enough.

https://www.threads.net/


Perhaps some enterprising journalist should ask Thiel “how much would be enough?” and then go ask the FDA the same question?


https://alecmuffett.com/article/108656
