I am once again encouraging #Fediverse #admins to add a clause to their Terms of Use prohibiting the use of their server’s data for machine training, because at some point in the near future there is likely going to be a lawsuit against some major company for scraping and exploiting users’ data, and we should make sure we have a legal leg to stand on.
That's almost certainly going to require a lawyer to write it, if you want to word it in such a way that it has any hope of being enforceable in a suit at some future time.
@drahardja @seb Meant to respond to your earlier post (a couple weeks ago IIRC) to strongly endorse this suggestion and to encourage developers, especially those of us who offer open APIs, to do the same. (We already have, thanks to your suggestion.)
@drahardja @seb Feels like this might need a collective effort, perhaps per jurisdiction. Something an admin can just roll out, like a code license, without it becoming an exercise in legalese. Great suggestion to have everyone start thinking about this.
@feld Are you suggesting that consuming data served by a privately-owned server is the same thing as taking a picture from a public space? Because it’s not.
@victor I'd be glad to. I spent four miserable days writing my own TOS, and then I made them Creative Commons, or whatever that one is that says, "Take what you need, Fam." 🥰🥰🥰
@victor *but not til July 6th for me. Gonna attempt this "relax" thing ppl without ADHD can do tomorrow, and then the 5th is the mister's birthday and we are gonna watch a whole movie. 😂😂😂🥂
@interstellarenigma By the time a lawsuit emerges, I hope it would be clear how we can prove it. My guess is there will likely be some range of IP addresses that have been used exclusively for grabbing data for training, and we can show a pattern of access from that address.
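To make the point above concrete, here is a minimal sketch of the kind of access-log analysis that could surface such a pattern. The log records, field layout, endpoint path, and threshold are all hypothetical illustrations, not taken from any real server's logs:

```python
from collections import Counter

# Hypothetical pre-parsed access-log records as (ip, path) tuples.
records = [
    ("203.0.113.7", "/api/v1/statuses/101"),
    ("203.0.113.7", "/api/v1/statuses/102"),
    ("203.0.113.7", "/api/v1/statuses/103"),
    ("198.51.100.2", "/api/v1/timelines/home"),
]

def flag_bulk_readers(records, threshold=3):
    """Return IPs whose volume of status fetches meets the threshold.

    A real analysis would also consider time windows, user agents, and
    whether the IP ever performs any interactive (write) action at all --
    an address that only ever reads posts in bulk is the pattern described.
    """
    counts = Counter(ip for ip, path in records if "/statuses/" in path)
    return sorted(ip for ip, n in counts.items() if n >= threshold)

print(flag_bulk_readers(records))  # → ['203.0.113.7']
```

Evidence like this would of course only show *a* scraping pattern; tying a given IP range to a specific training pipeline is the part that would likely come out in discovery.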
@drahardja @seb I really would like to do this but I honestly am not sure how to add that in a way that would hold up in a court. I'm not a lawyer :blobNervous:
Is there an example, or better yet, a pre-written clause that admins can use that has been vetted?
@seb I’m willing to chip in to pay lawyers to write something up, or at least consult as to the viability of such a clause. If you know of a lawyer who’s familiar with this space of IP law (especially in your part of the world) please recommend them here.
@drahardja @seb I agree, and yet at the same time, unlike FB federation (which is exclusively harmful and should never be allowed) - I think that we need to address machine-learning scraping by creating marked, flagged, opt-in consent areas (which no one should be able to unintentionally stumble into) to be intentionally scraped, on purpose, ideally with fair compensation for participation.
while doing what you say and using such a clause to keep scrapers out of everywhere else, especially anywhere that could expose vulnerable people, sensitive content, etc.
(I say this because machine learning is already biased hard toward cishet white normie sources at best, and outright hate groups at worst. Leftist, or at the very least non-hateful, representative, and diverse content needs to be introduced into datasets at wide scale.)
@mybarkingdogs @drahardja @seb I was thinking about this in recent days when I saw another post talking about stopping scraping of fediverse data. Of course many reasons to not allow it, but based on how much more pleasant the discourse is on here, I’d sure rather ML algos were trained on data from here rather than most other comment-based parts of the internet.
What converted me there was seeing that a horrific stalker site, which I won't name but which is responsible for multiple deaths, swattings, and more, is part of OpenAI's training set. As well as a couple of other awful things.
And as tempting as it is to respond to everything Big Tech does by just saying NO and drawing a deep line in the sand (and sometimes, like with Facebook, that's valid - when there's no way to get any advantage for ourselves, and every way to be exploited and subsumed),
other times it's actually necessary to fight and resist in other ways - whether by crapflooding data that could be used to do harm (such as personally identifying info) to protect people or, in this case, intentionally providing data to pull the Overton window of datasets away from literal harassment collectives.
If only so that someone, somewhere, against all good advice, will ask something running off these datasets about the people the bigots hate.
@Brendanjones I don’t buy this argument. The main beneficiaries of “AI” today are corporations that have huge investments in compute, who will surely hoard all profit that may arise from products trained on our data. Why should we make their products more enriching or pleasant to use? Don’t fall for the “my tech is inevitable” trope that technologists always use; it is not inevitable that AI will permeate into our lives in any more meaningful measure than spam or “smart” TVs do today: that is, as technological means to scam and surveil us to even greater degrees.
I, for one, am not interested in contributing to this technology in its present form.
@drahardja don’t worry, I’m not actually suggesting to offer up our data. You don’t need to convince me. It was merely a think out loud hypothetical of “how different would ML algos be if they were trained on the (more polite and left-leaning) fediverse content?”
@drahardja @seb
I don't think such a clause will technically work.
Not only are all the readers machines (so what does "learning" or "training" mean, exactly?), but using machine learning to classify toots as spam or not is very likely to become a necessity in the future. Saying "no machine learning" would cut that off at the knees.
The best way to prevent others from using my writing is to keep it private -- email, or closed forums. Public writing is, by necessity, public.
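For a sense of what even the simplest version of the spam classification mentioned above involves, here is a toy Laplace-smoothed naive Bayes sketch. The training phrases are invented placeholders; a real moderation tool would need far more data, and, per the rest of this thread, consent from the users whose posts it learns from:

```python
import math
from collections import Counter

# Toy, hand-invented training examples -- purely illustrative.
spam = ["buy followers now", "cheap followers buy now"]
ham = ["lovely sunset photo today", "great discussion about federation"]

def count_words(docs):
    words = Counter(w for d in docs for w in d.split())
    return words, sum(words.values())

spam_words, spam_total = count_words(spam)
ham_words, ham_total = count_words(ham)
vocab = set(spam_words) | set(ham_words)

def log_likelihood(words, total, doc):
    # Laplace-smoothed log-likelihood of the document under one class.
    return sum(
        math.log((words[w] + 1) / (total + len(vocab)))
        for w in doc.split()
    )

def looks_like_spam(doc):
    return log_likelihood(spam_words, spam_total, doc) > \
           log_likelihood(ham_words, ham_total, doc)

print(looks_like_spam("buy cheap followers"))  # → True
print(looks_like_spam("sunset photo"))         # → False
```

The point made above stands: a blanket "no machine learning" clause would arguably forbid even a filter this simple, which is why a carve-out for moderation (as suggested downthread) matters.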
@supernovae @StompyRobot @drahardja @seb they could probably carefully word exceptions, like that it's not allowed unless for moderation purposes and only from posts that have been reported (just an example, not the words I think should actually be put there lol I have no idea)
@dozymoe @seb I’m not convinced that ML is that useful for moderation or spam patrol (see the mess that is Facebook automoderation), but that is a good point. What’s important, I think, is that the admin and users explicitly consent to any training use. A per-user setting (just like opt-in searchability) might be the right balance to strike.
@drahardja @seb on this, what I'm most worried about is post expiry time. If my post is set to delete after a month, that should be mandated in the protocol. It's not meant to be copied and kept indefinitely.
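For context on the protocol side of this: ActivityPub does already define a Delete activity, and the ActivityStreams vocabulary defines a Tombstone object that a server can leave in place of a deleted post. A federated deletion looks roughly like this (the actor and object URLs are placeholder examples):

```json
{
  "@context": "https://www.w3.org/ns/activitystreams",
  "type": "Delete",
  "actor": "https://example.social/users/alice",
  "object": {
    "type": "Tombstone",
    "id": "https://example.social/users/alice/statuses/123",
    "deleted": "2024-07-01T00:00:00Z"
  }
}
```

The gap is exactly the one raised above: honoring a Delete is voluntary for remote servers, and a scraper that ignores the protocol entirely is never bound by it, which is why the ToS/legal angle comes up.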
@drahardja @seb Unless the instance's ToS require you to license them to do stuff with your posts including sublicense to scrapers, the default is that any such use is infringement. Making it explicitly against ToS could be nice, but should not be necessary.