KathyReid,
@KathyReid@aus.social avatar

Like many other technologists, I gave my time and expertise for free to Stack Overflow because the content was licensed CC-BY-SA - meaning that it was a public good. It brought me joy to help people figure out why their code wasn't working, or to assist with a bug.

Now that a deal has been struck with OpenAI to scrape all the questions and answers on Stack Overflow to train models like ChatGPT - without attribution to authors (as required under the CC-BY-SA license under which Stack Overflow content is licensed), and with the results sold back to us (even though the SA clause requires derivative works to be shared under the same license) - I have issued a Data Deletion request to Stack Overflow to disassociate my username from the content I have produced, and am closing my account, just like I did with Reddit, Inc.

https://policies.stackoverflow.co/data-request/

The data I helped create is going to be bundled into an LLM and sold back to me.

In a single move, Stack Overflow has alienated its community - which is also its main source of competitive advantage - in exchange for token lucre.

Stack Exchange, Stack Overflow's former instantiation, used to fulfill a psychological contract - help others out when you can, in the expectation that others may in turn assist you in the future. Now it's not an exchange, it's a one-way extraction.

Programmers now join artists and copywriters, whose works have been snaffled up to create generative AI solutions.

The silver lining I see: once OpenAI creates LLMs that generate code - as Microsoft has done with Copilot on GitHub - where will people go to get help with the bugs those generative AI models introduce, particularly given the "downward pressure on code quality" that the recent GitClear report attributes to these tools?

While this is just one more example of enshittification, it's also a salient lesson for folks - if your community is your source of advantage, don't upset it.

craigduncan,
@craigduncan@mastodon.au avatar

@KathyReid

I feel your disappointment, but it's best to find a non-commercial community, and there's good reason to put effort into building them; Mastodon seems a good place to start looking.

When stripped bare, Stack Overflow is a profit-making software company that runs a website encouraging people to improve its product for free. The strategy of these companies seems to be to take over public-interest websites and pretend it's business as usual. However, while paying lip service to altruism, they want to increase profits and market share. This is exemplified in their self-serving PR statement about attribution from back in February:
https://stackoverflow.blog/2024/02/29/defining-socially-responsible-ai-how-we-select-api-partners/

In that statement, they use the word "power" a lot. That's what's really important to them.

Drusniel,

@KathyReid Scraping can be done anyway, and was already done - so your data was captured even if you delete it. For SO it makes sense to have an alternative source of revenue, given that many programmers are using AI to find answers and their site will be scraped anyway; there are companies doing that and selling the data, and they will have records, history, etc.

alper,
@alper@rls.social avatar

@KathyReid Maybe this would make it possible to sue to force the OpenAI model to become open?

njsg,
@njsg@social.sdf.org avatar

@KathyReid So what they're doing with data deletion requests is just deleting the PII in the attribution? They're not deleting the content?

KathyReid,
@KathyReid@aus.social avatar

@njsg right, that's my understanding. As part of the ToS of Stack Overflow, they own the content - but what I have requested here is that my username, which was very similar to my real name, is disassociated from the content I have produced.

They're not deleting the content, because they own it.

njsg,
@njsg@social.sdf.org avatar

@KathyReid According to the ToS, they don't own the content. They are licensed the content (under CC BY-SA, now 4.0), but authors retain copyright ownership.

KathyReid,
@KathyReid@aus.social avatar

@njsg that is a good point, so there may also be copyright violations here.

rythur,
@rythur@mastodon.social avatar

@KathyReid

This is a pure and unadulterated decline in trust happening in real time, and it's palpably saddening.

We should all just stop using the internet now. That would rid most societal problems in one night.

This AI stuff is merely theft. That's all.

KathyReid,
@KathyReid@aus.social avatar

@rythur You raise an excellent point about trust in a time of generative AI - and whether we can trust what we see on the internet.

The second and third order impacts of this are also huge. The early days of the internet were based on trust - the early internet was literally constructed - built - on people trusting each other.

The lack of trust means people take fewer risks - it will inhibit innovation.

MoBaAusbWerk120,
@MoBaAusbWerk120@zug.network avatar

@KathyReid Are any of the contributors thinking of suing SO for license/copyright infringement yet?

KathyReid,
@KathyReid@aus.social avatar

@MoBaAusbWerk120 Good question, not that I know of. I also think it would be an enormous undertaking.

hcs,
@hcs@mathstodon.xyz avatar

@KathyReid can you sue them for violating the license?

KathyReid,
@KathyReid@aus.social avatar

@hcs probably not, because SO owns the copyright in the material - so it's a copyright vs creative commons interplay

futurebird,
@futurebird@sauropods.win avatar

@KathyReid

Will this impact all of the "overflows", e.g. MathOverflow?

patrickleavy,
@patrickleavy@mastodon.social avatar

@KathyReid someone needs to fork the web

KathyReid,
@KathyReid@aus.social avatar

@patrickleavy well, except it's now populated with so much LLM-generated bullshit it would be impossible to tell what's LLM-generated and what's human-generated.

highvizghilliesuit,
@highvizghilliesuit@newsie.social avatar

@KathyReid To be clear, I think your outrage is valid here.

Are you or anyone reading this comment aware of a good primer on how LLMs are actually trained? My (possibly wrong) understanding is that the trained models don't contain any copyrighted work per se, but rather the patterns of characters of a very large number of works. This is why they hallucinate, and also why, even though hundreds of terabytes of text are fed into them, even the largest models are only around 100 GB.

bornach,
@bornach@masto.ai avatar

@highvizghilliesuit @KathyReid
Just do an internet search on Transformers, "Attention is all you need", GPT, BERT, etc. There are many great tutorials covering different levels of detail. This video is more of an overview:
https://youtu.be/Rx-5AGHNu7M

They do in fact encode the copyrighted work into their neural network weights and biases, and can be prompted to regenerate entire passages of text.
https://www.patronus.ai/blog/introducing-copyright-catcher

But it is all linear algebra under the hood
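The memorization effect bornach describes can be seen in miniature with a toy model. This is only a hedged sketch, not how production LLMs work - real models learn billions of continuous weights by gradient descent rather than discrete counts - but it shows how "parameters" learned from text can end up regenerating the training text verbatim from a short prompt:

```python
from collections import defaultdict

# Toy character-level n-gram "model" (order 2). When the training corpus is
# small relative to the model's capacity, the learned parameters effectively
# encode the corpus itself and greedy decoding reproduces it verbatim.

corpus = "to be or not to be"

# "Training": count which character follows each two-character context.
counts = defaultdict(lambda: defaultdict(int))
for i in range(len(corpus) - 2):
    counts[corpus[i:i + 2]][corpus[i + 2]] += 1

def complete(prompt, n):
    """Greedy decoding: repeatedly append the most likely next character."""
    out = prompt
    for _ in range(n):
        nxt = counts.get(out[-2:])
        if not nxt:  # unseen context: the model has nothing to say
            break
        out += max(nxt, key=nxt.get)
    return out

print(complete("to", 16))  # → "to be or not to be"
```

Scaled up, this is the concern the Patronus AI post documents: given enough capacity and repeated exposure, passages from the training set can be extracted with the right prompt.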

ErikJonker,
@ErikJonker@mastodon.social avatar

@KathyReid If your code was used for a fully (!) open-source LLM or AI model, would you think that's OK?

kellogh,
@kellogh@hachyderm.io avatar

@ErikJonker @KathyReid i stopped contributing to SO about 10 years ago bc i realized i was generating value for a company and little to none for myself. at best, someone sees my profile and thinks, “gee, 8000 points, wow”. but i haven’t ever been able to convert that into jobs or money, whereas i have with open source contributions.

if anything, i’m a little surprised that people are just now realizing the cost of “free” corporate services

KathyReid,
@KathyReid@aus.social avatar

@kellogh @ErikJonker That's a good point. Your example is one where SO hoards the power and profits from contributors. There's another type of scale happening here with OpenAI - they're essentially eating Stack Overflow's profits by vacuuming up the text into an LLM.

It's a concentration effect.

How do individuals effectively resist this type of power concentration?

ErikJonker,
@ErikJonker@mastodon.social avatar

@kellogh @KathyReid ...true, it's also a bit strange to be shocked that companies like OpenAI use it as training data. My assumption is that all the data I publish freely will be used in that context...

kellogh,
@kellogh@hachyderm.io avatar

@ErikJonker @KathyReid right, i’m pretty sure it was already being used. by every model. just they’re actually getting paid for it now

KathyReid,
@KathyReid@aus.social avatar

@ErikJonker good question. By fully open source, I assume the weights, biases, source data, and training algorithm are all openly available.

This would be a situation I am a lot more comfortable with, but it still would not fulfil the requirements of the CC-BY license (requiring attribution).

If the LLM was used with RAG, and RAG was used to provide attribution, I think I would be comfortable with that.
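A sketch of what attribution-preserving RAG could look like. Everything here (the documents, authors, and URLs) is made up for illustration, and the keyword-overlap "retriever" is a stand-in for a real embedding search; the point is only that retrieved snippets can carry their author and license along with the text:

```python
# Hypothetical document store: each snippet keeps its provenance.
DOCS = [
    {"text": "use zip to iterate two lists in parallel",
     "author": "alice", "url": "https://example.org/a/1", "license": "CC BY-SA 4.0"},
    {"text": "a list comprehension builds a list from an iterable",
     "author": "bob", "url": "https://example.org/a/2", "license": "CC BY-SA 4.0"},
]

def retrieve(query, k=1):
    """Rank documents by naive word overlap with the query (a stand-in for
    an embedding-based search) and return the top k."""
    words = set(query.lower().split())
    scored = sorted(DOCS,
                    key=lambda d: len(words & set(d["text"].split())),
                    reverse=True)
    return scored[:k]

def answer_with_attribution(query):
    hits = retrieve(query)
    context = "\n".join(h["text"] for h in hits)
    sources = "; ".join(f'{h["author"]} ({h["url"]}, {h["license"]})'
                        for h in hits)
    # In a real system `context` would be passed to the LLM as grounding,
    # and `sources` shown to the user alongside the generated answer.
    return f"{context}\n\nSources: {sources}"

print(answer_with_attribution("how do I iterate two lists in parallel"))
```

Because attribution is attached at retrieval time rather than baked into model weights, the BY requirement could at least plausibly be satisfied for the retrieved material.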

wraptile,
@wraptile@fosstodon.org avatar

@KathyReid personally I disagree. For the record, I have over 20k points on Stack Overflow, and the gamification and attribution were never the goal - just a nice extra.

The goal was always to share as much free information as possible to raise everyone up with the tide, and LLMs align with that goal. It's a bit sad that we don't get attribution for our labor, but I think it's unethical to insist on it when the primary goal is clearly being fulfilled in a significantly more efficient fashion 🙏

krans,
@krans@mastodon.me.uk avatar

@wraptile @KathyReid No, throwing plausible-looking bullshit that looks like answers at users does not in any way align with the goal to “raise everyone up with the tide,” and is actually just force feeding everyone lowest-common-denominator pap.

bornach,
@bornach@masto.ai avatar

@krans @wraptile @KathyReid
"Raise everyone up with the tide" would be releasing their training weights and biases as open source as required by CC-BY-SA but OpenAI has just stated they have no intention of doing this
https://youtu.be/lQNEnVVv4OE

Their lawyers will claim fair use and that their Terms and Conditions mean the user has taken on all risk of any copyright infringement
https://youtu.be/fOTuIhOWFXU

wraptile,
@wraptile@fosstodon.org avatar

@krans @KathyReid it works though. LLMs are brilliant assistants for everyone learning programming and to claim otherwise is just sheer ignorance. Don't make perfect the enemy of good and all.

scruss,
@scruss@xoxo.zone avatar

@wraptile @krans @KathyReid don't make me post the "Copying and Pasting from Stack Overflow" ORLY meme.

LLMs don't teach you to program. They teach you to consume without discernment

KathyReid,
@KathyReid@aus.social avatar

@scruss @wraptile @krans

Strong agree.

I think there's also a danger here that by not writing code, and going through the learning journey that writing code provides, people are less able to debug code, and understand what it's doing.

It's a form of abstraction where the complexity - writing code - is abstracted away for faster development. But what do we lose in that process?

In a way, there will be a higher dependency on people who have coded for decades to be able to do debugging and more complex programming tasks.

It's like cars - as they've become easier to drive, they're harder to debug and fix, so there's an increased dependency on mechanics (and in turn, on car manufacturers who don't let mechanics do as much).

OmegaPolice,
@OmegaPolice@hachyderm.io avatar

@KathyReid So, if they listed all Stack Overflow users in a credits roll on their derived models, BY would be taken care of, right?

Not sure SA applies here (IANAL); pretty sure nobody ever has licensed their manual work created by taking answers and snippets from SO under a CC license for that reason alone.

So ... while I think the war against hyping dumb AI applications is just and necessary, I don't think that this (SO content for OpenAI) is a battle worth engaging in. 🤔

x0,
@x0@dragonscave.space avatar

@KathyReid My personal takeaway from this, along with other examples not only related to AI, is that they don't actually care about these licenses at all. CC licenses, FOSS licenses - they just give them an excuse to use the things covered, regardless of the terms, because who's going to challenge them? Has there actually been any legal repercussion, in any form, for a company breaching the terms of an open license? If there has, please correct me - I'm drawing cynical conclusions and haven't done actual research on this.

It would also be worth looking into whether things that are free for noncommercial use but require a token license for commercial use are actually respected that way by commercial users, rather than pirated by design. I would at least hope that is the case. My own belief about how all private industry operates is that "unless you have teeth, you are for chewing." This also crops up with morality vs. law in, say, accessibility.

mastodonmigration,
@mastodonmigration@mastodon.online avatar

@KathyReid

Simple question. Would it be allowed under the license terms to scrape the entire Stack Overflow dataset and recreate the entire thing as a fedi app?

mapto,
@mapto@qoto.org avatar

@mastodonmigration @KathyReid ignorant as I am, this sounds legally feasible but economically challenging. That's a lot of content, and maintaining it on a live platform carries notable costs. Not a reason to stop someone from trying - rather an argument to plan ahead.

Here's a starting point: https://www.kaggle.com/datasets/stackoverflow/stackoverflow

Paxxi,
@Paxxi@hachyderm.io avatar

@mastodonmigration @KathyReid afaik you don't need to scrape it, they publish the database periodically https://archive.org/details/stackexchange

blogdiva,
@blogdiva@mastodon.social avatar

@KathyReid Sincere question: shouldn't Creative Commons turn this into a class action lawsuit? What they're doing is so huge and out of proportion that groups of individuals can't really tackle it alone.

j3j5,
@j3j5@hachyderm.io avatar

@blogdiva @KathyReid apparently, it isn't as clear as it seems that they're breaking the license, see this post about images https://creativecommons.org/2023/02/17/fair-use-training-generative-ai/

mapto,
@mapto@qoto.org avatar

@j3j5 @blogdiva @KathyReid this appears to be only an opinion, which 1) doesn't seem very well informed to me, and 2) only mentions code-related issues without addressing them.

More precisely, clearly when talking of code, this is not part of the process used:
"Diffusion models like Stable Diffusion and Midjourney take these inputs, add “noise” to them, corrupting them, and then train neural networks to remove the corruption."
Neither is this:
"This is because using the digitized books as part of the database provided information about the books and did not use them for their creative content"

Also, regarding the court's inverse question, it seems to me that this is an extremely valid concern:
"On this point, the court wrote that the better question to answer was not how much of the works [company] copied, but instead how much was available to users". Due to the granularity of SO, this could easily be 100%, even though this would need to be illustrated (=proven) as in the NYT case.

Could anyone identify "substantial, non infringing uses" of the SO data? Here using CC might make it difficult to establish what was the original interest of the contributor.

Final note: it remains a bit unclear to me what the relevance of the Oracle vs Google case is, according to the author.

bornach,
@bornach@masto.ai avatar

@mapto @j3j5 @blogdiva @KathyReid

This assumption made by Wolfson:
"they do not reproduce images in their data sets"
is on very shaky ground, especially when it comes to Large Language Models.

Patronus AI found several examples of LLMs generating passages of copyrighted books
https://www.patronus.ai/blog/introducing-copyright-catcher
One might be able to chain together a sequence of text completion prompts to regenerate entire chapters.

mapto,
@mapto@qoto.org avatar

@j3j5 @blogdiva @KathyReid excuse me - actually, on second thought, BERT and GPT do exactly this corrupt-and-train process on their datasets. This certainly weakens my interpretation. Apologies for the confusion.

bornach,
@bornach@masto.ai avatar

@mapto @j3j5 @blogdiva @KathyReid
Not sure what the relevance of corrupt-and-train is to the legal argument being made here. Wolfson claims "they do not piece together new images from bits of images from their training data" but one could argue that neither is transcoding a Disney movie into a lossy MPEG format. Each frame is regenerated from discrete cosine transforms and motion vectors. Error correction happens during storage. Does that make it fair use?

astrojuanlu,
@astrojuanlu@social.juanlu.space avatar

@j3j5 @blogdiva @KathyReid tl;dr: CC licenses aren't above copyright law. We need to wait for judges to settle whether training AI models falls within the "fair use" doctrine. If they decide it does, there's nothing to be done.

DoesntExist,
@DoesntExist@mastodon.social avatar

@astrojuanlu @j3j5 @blogdiva @KathyReid

Hi.

As someone who has worked on hundreds of licenses, I can tell you that they 100% trump copyright law, which is why licenses must specifically call out things like fair use.

The CC BY-SA specifically allows for derivative use. The issue here is it also requires attribution for that use.

blogdiva,
@blogdiva@mastodon.social avatar

@DoesntExist @astrojuanlu @j3j5 @KathyReid does derivative include closing off access to the content behind a paywall? has that been turned into an interpretation of the license?

DoesntExist,
@DoesntExist@mastodon.social avatar

@blogdiva @astrojuanlu @j3j5 @KathyReid

That would be the sort of thing a court would decide. Copyright simply has nothing to do with it. A court has to interpret the license.

The lack of attribution is the obvious problem. The paywall is an open question.

j3j5,
@j3j5@hachyderm.io avatar

@DoesntExist @blogdiva @astrojuanlu @KathyReid what about the Share-Alike part? Output of LLMs does not take into account the original license and, iiuc, they go as far as to "give you ownership" of the output as prompt author. Isn't that an obvious breach as well? (assuming the answer is considered a derivative)

DoesntExist,
@DoesntExist@mastodon.social avatar

@j3j5 @blogdiva @astrojuanlu @KathyReid

The weakness of CC licenses is the violators have deeper pockets for better lawyers and high profit motivation.

They're doing a cost-benefit analysis up front.

The corps are the pirates.

j3j5,
@j3j5@hachyderm.io avatar

@DoesntExist @blogdiva @astrojuanlu @KathyReid well, yeah, it's the same weakness in FOSS generally 😥

I think every day about this thread talking about how OpenAI is going to kill the Commons because of exactly this. The only possible defenses are data poisoning or not sharing, which both suck for the Commons.

https://mastodon.social/@mcc/112209121196262534

DoesntExist,
@DoesntExist@mastodon.social avatar

@j3j5 @blogdiva @astrojuanlu @KathyReid

They profit by killing sharing.

j3j5, (edited )
@j3j5@hachyderm.io avatar

@DoesntExist @blogdiva @astrojuanlu @KathyReid that's what I thought, but honestly, lately I can't shake the feeling that I'm training their for-profit, climate-killing machines every time I share something on the internet (regardless of licenses). What used to be a feeling of helping somebody has been replaced with a fear of "how will They™ use this against me?"

KathyReid,
@KathyReid@aus.social avatar

@j3j5 @DoesntExist @blogdiva @astrojuanlu

That's an excellent point. Each training run of a large LLM consumes an enormous amount of electricity AND an enormous amount of fresh (clean, potable) water.

The purpose of a system is what it does - Stafford Beer.

KathyReid,
@KathyReid@aus.social avatar

@j3j5 @DoesntExist @blogdiva @astrojuanlu

Strong agree. A lot of Elinor Ostrom's work around governance of the commons - a response to Garrett Hardin's "tragedy of the commons" - relied on mechanisms of co-operation between institutions.

One of the key challenges I see here is that corporations like OpenAI now have a lot more power than even groups of institutions - lawmakers, governments, civil society. We've seen that recently with the way Meta has influenced government policy around paying to share content from commercial news agencies.

There's also a paradox here - increased production of work in the Commons is good for OpenAI, because it provides them with more data. However, the way in which the Commons is used - to create for-profit products like ChatGPT - serves as a constraint on people donating creative material to the commons.

KathyReid,
@KathyReid@aus.social avatar

@j3j5 @DoesntExist @blogdiva @astrojuanlu

IMHO the key issue here is whether an LLM trained on CC material is a "derivative work" under the relevant CC license.

@creativecommons provides a good blog post here on the interplay between copyright and creative commons licenses, and how they intersect with AI training:
https://creativecommons.org/2023/08/18/understanding-cc-licenses-and-generative-ai/

Because copyright law is different in each country, the interplay between copyright and creative commons is also different.

KathyReid,
@KathyReid@aus.social avatar

@blogdiva @DoesntExist @astrojuanlu @j3j5

Good question. In the CC-SA clause, share alike means that the derivative work has to be licensed in the same way:

"ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. "

The complication as I see it is that they are relying on copyright law - because Stack Overflow holds the copyright and can license the content as it sees fit - rather than on the Creative Commons restrictions.

There's also a question here of whether LLMs / AI models are considered derivative works.

IIUC, Creative Commons' position is that they are:
https://creativecommons.org/2023/08/18/understanding-cc-licenses-and-generative-ai/

KathyReid,
@KathyReid@aus.social avatar

@DoesntExist @astrojuanlu @j3j5 @blogdiva

and for the derivative works to be licensed in the same way (share alike)

KathyReid,
@KathyReid@aus.social avatar

@astrojuanlu @j3j5 @blogdiva

Except that copyright laws are different in different countries - not all countries have a fair use exemption in copyright law

KathyReid,
@KathyReid@aus.social avatar

@blogdiva Right, but their position seems to be very generative AI friendly, which aligns with their remix, reuse ethos.

They are unlikely to sue because generative AI fulfils parts of the mission of Creative Commons - to use creative works in new, creative ways.

LLS,
@LLS@wandering.shop avatar

@KathyReid @blogdiva Where’s the creative part though?

KathyReid,
@KathyReid@aus.social avatar

@LLS @blogdiva Right, so this comes down to the definition of creativity.

If a person re-mixes content in a new or unique way, we consider that creative. Possibly derivative, but creative.

If an LLM does it, is it still creative?

I would argue no, because I see LLMs as bullshit generators that regurgitate what they were fed on, but others are likely to take a different philosophical view.

LLS,
@LLS@wandering.shop avatar

@KathyReid @blogdiva So true. Yeah, until there’s actual sapient intelligence, it’s hard to view “AI” as anything but recombining devices akin to food processors or wood chippers.
