KathyReid
@KathyReid@aus.social

Doing a #PhD https://aus.social/@anucybernetics in #opensource #voice and #data #bias #FairML. Into #linux, #IoT. Built @SenseBreast. She/her pronouns. Ex @mycroft_ai https://fosstodon.org/@linuxaustralia @deakin @mozilla
Living in Australia on Waddawurrung land but with connections in #Northumberland
#MastoAdmin for fediverse.au


DanielEriksson, to random

@KathyReid
Small world - I have students at the ANU Research School of Biology (Williams lab, structural biology of plant innate immunity).

Seems I'll need to explore more of the campus next time I'm in Canberra!

KathyReid,

@DanielEriksson small world indeed! 👋 from many thousands of km away

KathyReid, to stackoverflow

Like many other technologists, I gave my time and expertise for free to #StackOverflow because the content was licensed CC-BY-SA - meaning that it was a public good. It brought me joy to help people figure out why their #ASR code wasn't working, or assist with a #CUDA bug.

Now that a deal has been struck with #OpenAI to scrape all the questions and answers on Stack Overflow to train #GenerativeAI models like #LLMs - without attribution to authors, as the CC-BY-SA license requires, and with the result sold back to us, even though the SA clause requires derivative works to be shared under the same license - I have issued a Data Deletion request to Stack Overflow to disassociate my username from my contributions, and am closing my account, just as I did with Reddit, Inc.

https://policies.stackoverflow.co/data-request/

The data I helped create is going to be bundled in an #LLM and sold back to me.

In a single move, Stack Overflow has alienated its community - which is also its main source of competitive advantage - in exchange for token lucre.

Stack Exchange, the network that Stack Overflow became part of, used to fulfil a psychological contract: help others out when you can, in the expectation that others may in turn assist you in the future. Now it's not an exchange, it's #enshittification.

Programmers now join artists and copywriters, whose works have been snaffled up to create #GenAI solutions.

The silver lining I see is this: once OpenAI creates LLMs that generate code - as Microsoft has done with Copilot on GitHub - where will developers go to get help with the bugs those generative AI models introduce? Particularly given the recent GitClear report on the "downward pressure on code quality" caused by these tools.

While this is just one more example of #enshittification, it's also a salient lesson for #DevRel folks - if your community is your source of advantage, don't upset them.

KathyReid,

@njsg right, that's my understanding. Under Stack Overflow's ToS, they own the content - but what I have requested here is that my username, which was very similar to my real name, be disassociated from the content I have produced.

They're not deleting the content, because they own it.

KathyReid,

@j3j5 @DoesntExist @blogdiva @astrojuanlu

That's an excellent point. Each training run of a large LLM consumes an enormous amount of electricity AND an enormous amount of fresh (clean, potable) water.

The purpose of a system is what it does - Stafford Beer.

KathyReid,

@j3j5 @DoesntExist @blogdiva @astrojuanlu

Strong agree. A lot of Elinor Ostrom's work on governance of the commons - work that pushed back against the "tragedy of the commons" framing - relied on mechanisms of co-operation between institutions.

One of the key challenges I see here is that corporations like OpenAI now have a lot more power than even groups of institutions - lawmakers, governments, civil society. We've seen that recently with the way Meta has influenced government policy around paying to share content from commercial news agencies.

There's also a paradox here: increased production of work in the Commons is good for OpenAI, because it provides them with more data. However, the way in which the Commons is being used - to create for-profit products - serves as a disincentive for people to donate creative material to the commons in the first place.

KathyReid,

@j3j5 @DoesntExist @blogdiva @astrojuanlu

IMHO the key issue here is whether an LLM trained on CC material is a "derivative work" under the relevant CC license.

@creativecommons provides a good blog post here on the interplay between copyright and creative commons licenses, and how they intersect with AI training:
https://creativecommons.org/2023/08/18/understanding-cc-licenses-and-generative-ai/

Because copyright law is different in each country, the interplay between copyright and creative commons is also different.

KathyReid,

@blogdiva @DoesntExist @astrojuanlu @j3j5

Good question. In CC licenses, the SA (ShareAlike) clause means that a derivative work has to be licensed in the same way:

"ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. "

The complication, as I see it, is that they are relying on copyright law: because Stack Overflow holds the copyright, it can license the content as it sees fit, rather than being bound by the Creative Commons restrictions.

There's also the question of whether LLMs / AI models are considered derivative works.

IIUC, Creative Commons' position is that they are:
https://creativecommons.org/2023/08/18/understanding-cc-licenses-and-generative-ai/

KathyReid,

@DoesntExist @astrojuanlu @j3j5 @blogdiva

and for the derivative works to be licensed in the same way (share alike)

KathyReid,

@rythur You raise an excellent point about trust in a time of generative AI - and whether we can trust what we see on the internet.

The second- and third-order impacts of this are also huge. The early days of the internet were based on trust - the network was literally built on people trusting each other.

The lack of trust means people take fewer risks - it will inhibit innovation.

KathyReid,

@MoBaAusbWerk120 Good question, not that I know of. I also think it would be an enormous undertaking.

KathyReid,

@kellogh @ErikJonker That's a good point. Your example is one where SO hoards the power and profits from contributors. There's another type of scale happening here with OpenAI - they're essentially eating Stack Overflow's profits by vacuuming its text up into an LLM.

It's a concentration effect.

How do individuals effectively resist this type of power concentration?

KathyReid,

@scruss @wraptile @krans

Strong agree.

I think there's also a danger here: by not writing code, and not going through the learning journey that writing code provides, people become less able to debug code and to understand what it's doing.

It's a form of abstraction where the complexity - writing code - is abstracted away for faster development. But what do we lose in that process?

In a way, this creates a higher dependency on people who have coded for decades to do the debugging and the more complex programming tasks.

It's like cars - as they've become easier to drive, they're harder to debug and fix, so there's an increased dependency on mechanics (and in turn, on car manufacturers who don't let mechanics do as much).

KathyReid,

@hcs probably not, because SO owns the copyright in the material - so it's a copyright vs creative commons interplay

KathyReid,

@patrickleavy well, except it's now populated with so much LLM-generated bullshit it would be impossible to tell what's LLM-generated and what's human-generated.

KathyReid,

@astrojuanlu @j3j5 @blogdiva

Except that copyright laws are different in different countries - not all countries have a fair use exemption in copyright law

KathyReid,

@ErikJonker good question. By fully open source, I mean that the weights and biases, the source data, and the training algorithm are all openly available.

This would be a situation I am a lot more comfortable with, but it still would not fulfil the requirements of the CC-BY license (which requires attribution).

If the LLM were used with RAG, and the retrieval step were used to provide attribution, I think I would be comfortable with that.
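
As a sketch of that idea - with an entirely hypothetical corpus, function names, and URLs, not any real product's pipeline - a retrieval layer can carry author and license metadata alongside each passage and emit CC-BY-style attribution with the answer:

```python
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    author: str   # original contributor, as CC-BY attribution requires
    url: str      # link back to the source post
    license: str  # e.g. "CC-BY-SA 4.0"

# Toy corpus standing in for retrieved documents (hypothetical examples).
CORPUS = [
    Passage("Use cudaMemcpyAsync with streams to overlap transfers.",
            "example_user_1", "https://example.org/q/1", "CC-BY-SA 4.0"),
    Passage("Normalise audio to 16 kHz mono before feeding the ASR model.",
            "example_user_2", "https://example.org/q/2", "CC-BY-SA 4.0"),
]

def retrieve(query: str, corpus: list[Passage], k: int = 2) -> list[Passage]:
    """Rank passages by naive keyword overlap with the query."""
    q_tokens = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda p: len(q_tokens & set(p.text.lower().split())),
                    reverse=True)
    return scored[:k]

def answer_with_attribution(query: str) -> str:
    """Return retrieved text plus an attribution block for each source."""
    passages = retrieve(query, CORPUS)
    # A real system would pass the passages to an LLM here; we just quote them.
    body = "\n".join(f"- {p.text}" for p in passages)
    credits = "\n".join(f"  Source: {p.author} ({p.url}), {p.license}"
                        for p in passages)
    return f"{body}\nAttribution:\n{credits}"

print(answer_with_attribution("overlap CUDA transfers with streams"))
```

Because the retriever knows which passages it surfaced, attribution falls out of the metadata rather than needing to be recovered from model weights - which is exactly what makes this approach more compatible with CC-BY than plain LLM training.
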

KathyReid,

@blogdiva Right, but their position seems to be very generative AI friendly, which aligns with their remix, reuse ethos.

They are unlikely to sue because generative AI fulfils parts of the mission of Creative Commons - to use creative works in new, creative ways.

shlee, to random
KathyReid,

@shlee oof sorry

KathyReid, to stackoverflow

I just issued a data deletion request to #StackOverflow to erase all of the associations between my name and the questions, answers and comments I have on the platform.

One of the key ways in which #RAG works to supplement #LLMs is based on proven associations. Higher ranked Stack Overflow members' answers will carry more weight in any #LLM that is produced.

By asking for my name to be disassociated from the textual data, I remove a semantic relationship that is helpful for determining which tokens of text to weight in an #LLM.

If you sell out your user base without consultation, expect a backlash.

KathyReid,

@dtbell91 yep

KathyReid,

@kellogh which part, scraping for OpenAI? Absolutely

KathyReid,

@kellogh ah no, they have not confirmed that the tokens are weighted by the points of the user who authored the post - but if I were building an LLM from SO, that's how I would approach it, because higher-points users are likely to have more reliable answers and better-phrased questions.

KathyReid,

@arestelle @kellogh that's an excellent point - they just replace my username with a random string, but the points of that username are still associated with the random string.

KathyReid,

@sean To clarify, I'm not saying they do carry more weight, but I am predicting that when they tokenise the text in SO to train LLMs on, they will give more weight to text created by high-ranking users.

Also, highly-ranked users are likely to have more text in the SO corpus, because they are highly ranked (and have therefore answered lots of questions, or a small number of questions well).

That is, when the text is tokenised, high-ranked user-generated text will make up more of the tokens. If I can break that association, then it makes OpenAI's job harder.
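
To illustrate the kind of data preparation I'm speculating about here (a hypothetical sketch - neither OpenAI nor Stack Overflow has published such a pipeline), one could weight each post's sampling probability by its author's points, so that high-reputation text dominates the training mix:

```python
import random

# Hypothetical (post_text, author_points) pairs standing in for SO content.
posts = [
    ("Answer from a 100k-rep user", 100_000),
    ("Answer from a 5k-rep user", 5_000),
    ("Answer from a 50-rep user", 50),
]

def sample_training_posts(posts, n, seed=0):
    """Sample post texts with probability proportional to author reputation."""
    rng = random.Random(seed)
    texts = [text for text, _ in posts]
    weights = [points for _, points in posts]
    return rng.choices(texts, weights=weights, k=n)

sample = sample_training_posts(posts, n=1000)
counts = {text: sample.count(text) for text, _ in posts}
print(counts)
```

Under this scheme the 100k-rep user's text makes up roughly 95% of the sampled data, which is why breaking the username-to-points association degrades the value of the corpus for this kind of preparation.
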

KathyReid,

@sean right but I'm guessing that OpenAI will write custom tokenisers for SO content, which probably would take into account user rank info ... So it's not the ML, it's the data preparation.
