Copyright lawsuits against Meta and OpenAI mention shadow libraries, including Library Genesis, as sources of training data

cross-posted from: lemmy.world/post/1330512

Below are direct quotes from the filings.

OpenAI

As noted in Paragraph 32, supra, the OpenAI Books2 dataset can be estimated to contain about 294,000 titles. The only “internet-based books corpora” that have ever offered that much material are notorious “shadow library” websites like Library Genesis (aka LibGen), Z-Library (aka B-ok), Sci-Hub, and Bibliotik. The books aggregated by these websites have also been available in bulk via torrent systems. These flagrantly illegal shadow libraries have long been of interest to the AI-training community: for instance, an AI training dataset published in December 2020 by EleutherAI called “Books3” includes a recreation of the Bibliotik collection and contains nearly 200,000 books. On information and belief, the OpenAI Books2 dataset includes books copied from these “shadow libraries,” because those are the most sources of trainable books most similar in nature and size to OpenAI’s description of Books2.

Meta

Bibliotik is one of a number of notorious “shadow library” websites that also includes Library Genesis (aka LibGen), Z-Library (aka B-ok), and Sci-Hub. The books and other materials aggregated by these websites have also been available in bulk via torrent systems. These shadow libraries have long been of interest to the AI-training community because of the large quantity of copyrighted material they host. For that reason, these shadow libraries are also flagrantly illegal.

This article from Ars Technica covers a few more details. Filings are viewable at the law firm’s site here.

ArkyonVeil, 10 months ago (edited 10 months ago)

Absolutely peeved that, according to the law, libraries in a digital format literally cannot exist without being illegal. Archive.org only managed to exist as a library because they enforced DRM that limited available rentals to the books they “bought” and had copies of.

This is because physical Libraries allow you to borrow their own copies, thus you can even read copyrighted material without asking for permission from the rights holder. So they could argue in court that the DRM only emulated the real thing.

Come COVID and they decide to be nice to people by temporarily stripping the rental bollocks. Their reward for a good deed is a sledgehammer to the stomach.

It matters not, books shall be, and remain forever free (For those that need them). One way or another. All I know is that I’ll never buy a book if I’m treated as a criminal.

Vendetta9076, 10 months ago

Shadow libraries is such a metal name. Information should be free ya fucks. Eat shit.

RedCanasta, 10 months ago

“Flagrantly Illegal”

😂

LeZero, 10 months ago

Shadow librarian money gang

Vendetta9076, 10 months ago

We love free information.

SinJab0n, 10 months ago

I’m ok with PEOPLE reading books in any way for self improvement.

But when a FUCKING COMPANY starts screwing with shit like this, that’s when they’ve crossed the line.

Vendetta9076, 10 months ago

Sure, but you understand that publishers don’t give a fuck about any of that. They find any way they can to shut these things down. Not to mention the things on Sci-Hub and LibGen should be free public knowledge to anyone or anything that wants it. It’s full of tax-funded research papers and textbooks. That information should belong to everyone and everything. That’s not a crossed line. That’s consistency.

SinJab0n, 10 months ago

I agree with you, it should be free to every PERSON who wants it.

As I said before, that’s the fundamental difference between individuals and a company stealing.

Vendetta9076, 10 months ago

We don’t agree. It’s not stealing, and companies should have access to the same free information.

FactorSD, 10 months ago

You really think people would spend a lifetime writing books if they couldn’t make money from it?

Things which are free have no value, neither economic nor societal. Even when we pirate stuff, at least our society encourages creative labour.

Vendetta9076, 10 months ago

What makes you assume that’s what I think at all? Also, things that are free can bring tons of economic and societal value. That blanket statement is utterly moronic.

VubDapple, 10 months ago

Some people would write books for free if they didn’t need to work to support themselves. Fame and the prestige of being a recognized expert are enough reward.

MrsEaves, 10 months ago

Hell, I’d do it just because I like sharing information and helping others out. Plus it’s a big project with a sense of accomplishment.

Kissaki, 10 months ago

I can see economic, but what do you mean by no societal value?

Free access allows people to participate in culture and society that otherwise couldn’t. That seems like a positive.

ErgodicTangle, 10 months ago

Not even economic. There’s all of the textbooks on LibGen. Having access to those means even poor people can get a shot at learning with expensive textbooks. Having easier access to education means the population can be more productive and work in high impact fields.

rustic_tiddles, 10 months ago

No but this isn’t really limiting sales of the book in any way. I buy real used books, I buy new books sometimes. I go through a few audible credits a month. I also pirate books if I feel like it. I’ve had books I bought and gotten rid of, then years later decided to pirate it and read it again. Anyway used books are so ridiculously cheap it’s very rare for me to buy a book new, often it’s a gift for a friend.

I also use ChatGPT almost every day, and while I have asked it for the summary to a book I didn’t feel like reading, it has never once replaced “reading a book” in my life. You can also get the summary to most books on wikipedia if that’s all you want.

DieterParker, 10 months ago

Exactly that. Old CDs and books change owners for little to no money all the time. I have accumulated hundreds of CDs that were about to get thrown away, without spending anything. I will rip and share them on Soulseek eventually.

crunchpaste, 10 months ago

That being said, does anyone know where said torrents can be found, and how big they are?

huojtkeg, 10 months ago

annas-archive.org/about

ancuuiqter, 10 months ago

Mentioning this since the project Anna’s Archive compiles several datasets and their corresponding torrents.

Anna’s Archive, whose aim is to “archive all the books in the world, and make them widely accessible,” pulls from a number of shadow library sources; the project provides its own torrent links (via Tor) for Library Genesis, Z-lib, Internet Archive, among others, plus Library Genesis’s torrents. In the datasets linked below, you can click on a given source and find its onion site or the torrents provided by the shadow library itself (in the case of Library Genesis, for example).

Anna’s Archive datasets

…almost all files shown on Anna’s Archive are available through torrents. Below is a list of the different data sources that we use, with links to their torrents. Our own torrents are available on Tor.

Sources include

Internet Archive Digital Lending Library

Libgen.li comics

Z-Library scrape

ISBNdb scrape

Libgen auxiliary data

Libgen.rs

Libgen.li (includes Sci-Hub)

crunchpaste, 10 months ago

Thanks a lot.
