Announcing OpenLemmyStats.org: Publicly Queryable Vote History + Other Hidden Data for Any Lemmy User!

What’s stopping me from doing this? Here we go:

I’m going to start an instance and federate with everyone who will allow it, which is most instances including this one, I believe.

Then I’m going to feed all that data into my new website, called Open Lemmy Stats, where anyone can query the user data I’ve accumulated. The homepage will be rife with insights, leaderboards and all kinds of data on prolific users.

Additionally, I’ll display a snapshot/profile of a random user by feeding that user’s data to GPT-4 to make inferences about the user’s political affiliations and display the results.

Worst of all, I’m not going to out my instance for everyone to know it as the one to defederate. In fact, I’m spinning up a few instances that will host innocuous communities that I plan to mod and support to give my instances cover for their true purpose: redundant fediverse datastreams for my site, Open Lemmy Stats.

I’ll also have a store where anyone can buy my collected fediverse data for a handsome sum.

Just kidding, I’m not doing any of this. But someone absolutely will, or is already working on it. They’ll make a good bit of money too, I’d bet.

This is inspired by a recent post on youshouldknow@lemmy.world where someone highlighted what kind of data instance admins have access to, even for users not on their instance.

I wanted to share this to start a discussion that I find interesting. I’m interested in your thoughts, or in hearing more on why this may or may not be possible and, if it is, maybe some ideas on how to fix it. Obviously such a site would be problematic, but it would no doubt be popular for oh so many reasons.

Edit: typo, I called admins adminis. Corrected.

Edit 2: wanted to credit the post I was referencing from YSK, here it is - lemmy.world/post/1033769

BilboSwaggins,

And any EU citizen could proceed to sue the shit out of you and anyone who uses that data, based on GDPR. Especially once you not only collect it, but also run any kind of inference on it.
Would be interesting to see where that ends. Once you start selling it, you act as some kind of company with a commercial interest and thereby clearly fall under GDPR. If they’ve never given their consent to your data processing, it would be best if your servers stand on some offshore oil rig and your bank account is somewhere in the Bahamas, I guess...

booty_flexx,

There are handsome penalties for violating copyright, but torrent trackers are still thriving. I expect similar legal evasion tactics from sites like OpenLemmyStats.

gloriousspearfish,

Hey, I completely agree with you, in that the most interesting discussions are among groups where I don’t agree with everyone. This is where I learn and grow as a person.

But in saying that, aren’t you also saying that some people, like you and me, would not use such a database to filter out the users we do not agree with?

And wouldn’t it be logical to conclude that people who like to build and stay in their echo chambers would be no more inclined to listen to different opinions just because they lack a more efficient tool for sorting out people they disagree with?

What I am saying is, all information that is technically available will be collected and analysed. Better to make a public and open platform showing everything, so that everyone can see exactly what can be collected and surmised from the already public information, than to keep users blind to what information they actually leak publicly.

danc4498,

Do the instance owners read your DMs? Does Reddit read your DMs? You never really know.

rcmaehl,

Joke’s on them. I already know what’s in my DMs. /j

xaxl,

Just post everything in public and never have to worry about it in the first place.

BoxOfFeet,

Wait… the Lemmy logo is a Lemming?? I’ve spent the last 6 days thinking it was a gerbil. And this whole thing was referencing the Lemmiwinks episode of South Park.

GreenCrush,

I was told it was a lemur…

_finger_,

That’s a fucking cute as hell illustration. I’d wear that

booty_flexx,

I’d love to take credit but that was midjourney (and all the artists that feed its capabilities.)

I think my prompt was “a logo featuring a mouse holding a magnifying glass”

I’ve since realized that I should have said lemming instead of mouse, but a dummy like me can only do so much.

_finger_,

Man that is so amazingly clean for AI, really scary

GamingChairModel,

I don’t think that site would be problematic. After all, we’re just talking about custom interfaces to analyze public data.

A big part of the solution is that users should have an awareness that their activity is public. Every once in a while someone gets burned not knowing that anyone can view what a specific Twitter user or Instagram user liked (like politicians liking risque thirst trap photos).

Another is easy alts and throwaways, with tips to avoid correlations:

  • Don’t use the same verified email address
  • Don’t reuse usernames, including across platforms
  • Try not to use the same instances, such that instance admins can see whether login activity is coming from the same place, unless you absolutely trust that the admins won’t analyze your data OR inadvertently leak their records.
  • Be aware of the techniques used to correlate users: analysis of timestamps, linguistic/grammatical quirks, etc.
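To make the timestamp point concrete, here is a minimal sketch of how two accounts might be compared by time of day. Everything in it is an assumption for illustration: the timestamps are made up, the function names are hypothetical, and real correlation work would use far more data and better statistics than a simple cosine similarity over hourly histograms.

```python
from collections import Counter
from datetime import datetime, timezone
from math import sqrt

HOUR = 3600
DAY = 24 * HOUR

def hourly_profile(timestamps):
    """Build a 24-slot histogram of activity by UTC hour from Unix timestamps."""
    counts = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps
    )
    return [counts.get(hour, 0) for hour in range(24)]

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up activity: "main" and "alt" are both active around 20:00-22:00 UTC,
# while "stranger" is active around 03:00-05:00 UTC.
main = [0 * DAY + 20 * HOUR, 1 * DAY + 21 * HOUR, 2 * DAY + 22 * HOUR]
alt = [5 * DAY + 20 * HOUR, 6 * DAY + 21 * HOUR, 7 * DAY + 22 * HOUR]
stranger = [0 * DAY + 3 * HOUR, 1 * DAY + 4 * HOUR, 2 * DAY + 5 * HOUR]

same_person_score = cosine_similarity(hourly_profile(main), hourly_profile(alt))
different_person_score = cosine_similarity(hourly_profile(main), hourly_profile(stranger))
```

With this toy data the main/alt pair scores close to 1 and the main/stranger pair scores 0. The takeaway for throwaways is that posting on the same schedule as your main account is itself a signal.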

This is a public place, so people should be aware that this is a public place. That means they can still find this useful space, as with many other public places, but should be aware that the more they do on this platform, the easier it is to correlate with a real life identity.

booty_flexx, (edited)

Those are good practices if you have privacy concerns.

we’re just talking about custom interfaces to analyze public data

Semi-public. As it stands, only instance admins have access to per-user vote data. Possibly API users too, but I’m not sure the Lemmy API has an endpoint that exposes per-user vote data; I believe it just gives you a tally of the up/down votes on posts and comments, not who cast each vote. Either way, most people don’t have the skillset to host their own instance and process the data into something meaningful and easy to digest.

You could make the argument that semi-public is basically public, but I think there is some nuance to be explored:

Once a site like open lemmy stats launches, it becomes trivial for any user to query that data, who upvoted what, who downvoted what, when they up/downvoted it, etc.

There’s a difference between something being available to people motivated enough to get it vs it reaching critical mass and being trivial to access by anyone with a browser. How the data is ultimately used, whether it is used nefariously or not, is going to be up to the people that access openlemmystats and what they wish to use it for.

Which has me considering an analogy. Without expressly intending to make this political, please consider the statement “guns don’t kill people, people kill people”: “Openlemmystats doesn’t harass political dissenters! The people who use it do!” One could argue that openlemmystats wouldn’t do anything inherently bad; it’s the people who would use it. Just like with guns, there will likely be debate on whether the world would be better without openlemmystats, or whether we should start doing things to make it impossible for openlemmystats-alike sites to exist.

That said, I mostly agree with you, and I appreciate your privacy suggestions/best practices, good stuff!

Edit: for the record, I think “guns don’t kill people, people do” is a stupid statement, but I thought it was an interesting analogy. That is to say nothing of my feelings on gun control, I’m just not a fan of distilling complex issues into dismissive one line statements.

GamingChairModel,

Thinking about this some more, I don’t mean to put everything on the user.

The platform itself, through its design and architecture and settings, should also do stuff to make super detailed analysis more difficult:

  • Don’t log unnecessary metadata, such as views/visits, clicks, scrolls, time spent on specific posts, etc. Information that is never observed/logged can’t be shared/published.
  • Don’t share unnecessary information with other instances. For example, with an update to the protocol, an instance might be able to hide which local users voted for what in local threads, while maintaining the proper count internally of what the vote totals are, who has already voted, etc. Non-local users would have to have their votes publicly known, though.
  • Make the public nature of each action obvious. Make votes more obviously public through the interface (perhaps by allowing people to view who upvoted or downvoted). Make people’s comment history and like history easy to view within the native interface, so that people understand that the information isn’t private to begin with.
  • Commit to deletion in a public, auditable way. Let instance administrators know that being a good citizen on the fediverse requires adherence to norms about privacy and deletion, and have watchdogs publish stats on how long it takes for an instance to delete a comment or vote or whether it retains edit/delete history.

lily33,

That last point is completely impossible. Don’t forget that I don’t have to run the official lemmy software on my instance. I can make changes: for example, I can add a feature to my instance like “log every post in a separate, local database before deleting it from lemmy”. Nobody else but me will know this feature exists. Or (to be AGPL compliant) have a separate tool to regularly back up my lemmy database, undoing deletions.

As for the second point: I’d say making local votes private and non-local public will be worse for privacy due to causing confusion.

booty_flexx,

Great point and ideas, I hope to see things like this introduced as the lemmy project matures

aaaa,

I appreciate the illustration (and even warning) here. I predict things like this will just lead to more people having throwaway accounts. Now instead of just having throwaway accounts for posting shameful stories, you’ll also find people with their “commenting” accounts separate from their “voting” accounts.

The more I see kbin users calling people out for downvoting them, the faster I expect the votes to just become gamed instead of natural. Anything that’s used to draw attention to the way people vote will make this worse.

We’re in the early stages, but as soon as we start seeing communities that ban users based on their voting records, people will just find other ways to obscure things, which will make it even harder for instance admins to address massive misuse of the voting system.

booty_flexx,

I definitely expect a drawn out game of whack a mole as lemmy devs, instance admins and key contributors start seeing stuff like this pop up, and they develop tools or tech to mitigate abuse, until another exploit is found by bad actors, rinse and repeat.

Some say it’s an inherent flaw with federation/activitypub but I expect/hope it progresses the way other vulnerable tech has.

For example, in the early days of wifi it was pretty trivial to packet sniff (a practice that lets you peer into other folks’ network activity). Now most sites encrypt their transmitted data, and while the packets could still be sniffed over an unsecured network, the data within stays safe because it’s encrypted (assuming most sites that deal with sensitive data now encrypt, which in my experience, they do).

Furthermore, wifi as a technology has gone through many iterations, each one bringing with it better and stronger security, to the point where the average Joe can set up a secure home network by following the quick start guide included with their router, which these days is essentially plug in, power on, choose a password, and authenticate with your devices.

I expect activitypub and fedi tech to develop in the same way: releasing patches and updates and amending the standard to combat/mitigate abuse of an open federated platform. It’s gonna take time though.

Edit: typos

aaaa,

I think the biggest concern is getting all participating instances to agree on how to handle the issue.

We’ll start to see more fragmentation of the Fediverse as different instance owners have different views on what should be done. But many of the measures to fight this will only work if all participating instances do the same, whether actively, or by using a new version of the federation standard. Some instances may think the way is to be more transparent, while others may think the way is to obscure the votes more. Now you’ll have the “transparent” fediverse and the “obscure” fediverse with fundamental disagreements with each other on the way things work.

It’s interesting times ahead. Personally, I don’t think federation is the simple answer to all our social media woes like some folks around seem to think. There’s a lot that needs to be addressed, which will be uncovered as more companies like Meta try to get in on it.

booty_flexx,

biggest concern is getting all participating instances to agree

I see what you mean, that is true if the responsibility ultimately ends up falling on instance owners.

Which is why I’m hoping that the developments instead occur on the Lemmy project itself and other fediverse project code bases. Lemmy devs and contributors will hopefully work on privacy and security as the Lemmy project matures. If instance admins are keeping their instances mostly up to date, there is virtually no (dis)agreement to be had: the mitigation patches will be loaded on the next update.

Of course, anyone can fork lemmy or manually remove these changes from their instance, and some admins may simply refuse to update. But that would reflect badly on them: privacy-minded users may choose to move to another instance that has updated to the latest/most secure version of Lemmy, and other instance owners can choose to defederate from instances that leave themselves vulnerable to issues that have been patched out.

Prasaedonium,

I’m a data nerd even though I’m still a noob, so this sounds amazing

impulse,

Nope, it’s an absolute nightmare. The post basically outlines how you could feasibly exploit data across a majority of the Lemmy network without much effort at all.

With a bit more effort you could also link the Lemmy accounts to the users’ emails, as becoming an admin is as simple as hosting your own instance and getting users to join.

Boom you have a business case of profiling people on Lemmy and selling those profiles to advertisers, stalkers and perverts alike.

booty_flexx,

it’s an absolute nightmare

Indeed! I felt it was important to illustrate this, to jump-start discussion and hopefully motivate some talented/passionate devs to start thinking about this. Not that they haven’t, but there’s been a lot of handwaving on lemmy this week when someone brings up the vulnerabilities of the fediverse. I wanted to further illustrate the possibilities.

I’m encouraged by seeing folks like yourself taking the implications seriously (not to say you ever didn’t take it seriously)

fulano,

Seeing most reactions to your post, I can only think that most lemmy users don’t care about privacy at all. Or, at least, they didn’t fully understand the implications.

Erk,

For myself, I’ve already just assumed this stuff is public. I don’t know why I’d assume it was private, in fact. I have a few different accounts and I use them for different things, but anything I want to keep off the public internet doesn’t go on the public internet, on Lemmy or Reddit or Facebook or anywhere. It’s 2023, I think most people have some understanding of this already. Threatening to out data I already assumed no privacy on is not terribly threatening.

fulano,

Assuming everything is public, on one hand, can help develop better practices, but, on the other, can lead us to stop fighting for our privacy, so I’m always cautious with it.

About the upvotes/downvotes, they give a lot of information about you, and your pattern can be so unique that a new account could be identified by it. It can also be used for doxing. Having public votes can also lead to metadrama, just as happens in places like Facebook with their like system. And don’t forget that it takes just a small mistake to have your identity leaked, and then you have this data available and tied to your person, exposing your psychological behavior and positions on every topic.
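As a rough illustration of how a voting pattern could single out a new account, here is a toy sketch that compares sets of votes with Jaccard similarity. The account names, post IDs and the (post_id, direction) representation are all made up for the example, and the function names are hypothetical; a real analysis would need much more data and would be noisier than this.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def closest_match(new_votes, known_accounts):
    """Return (name, score) for the known account whose votes overlap most.

    Assumes known_accounts is a non-empty dict of name -> set of votes.
    """
    scored = ((name, jaccard(new_votes, votes)) for name, votes in known_accounts.items())
    return max(scored, key=lambda pair: pair[1])

# Each vote is a hypothetical (post_id, direction) pair: +1 upvote, -1 downvote.
known_accounts = {
    "alice": {(1, 1), (2, -1), (3, 1), (4, 1)},
    "bob": {(1, -1), (5, 1), (6, 1)},
}
new_account_votes = {(1, 1), (2, -1), (3, 1), (7, 1)}

best_name, best_score = closest_match(new_account_votes, known_accounts)
```

In this made-up data the new account shares three of five distinct votes with "alice" and none with "bob", so "alice" comes out as the closest match. A high overlap is only a hint, not proof, but it shows why a fresh username alone doesn't hide a distinctive voting habit.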

Another thing worth mentioning is the email used to join lemmy. This is basically public, eliminating the expected anonymity for a lot of people (remember, most people aren’t tech-savvy enough to create a fake one). In time, bots and trolls will become more common and most instances will probably ban fake or temporary emails, forcing users to use real ones.

It all might not be a great issue now, when we’re small, but if we expect to grow, I think these things will need to be addressed at some point.

booty_flexx,

Yeah, I almost want to make it now to drive the point home to those folks. (Edit: emphasis on almost)

who cares if they can see my public posts

That misses the whole point. Open Lemmy Stats probably wouldn’t display your posts (lemmy itself does that); it would display all of the analytical inferences to be made from those posts, votes and other activity, revealing more about you than you intended or were even aware of. That isn’t readily public in the way some folks are making it out to be: it takes some work to get that data, and you need sysadmin/database/programming skills to make it manageable and useful. OpenLemmyStats would let anyone of any skill level query data about you that would otherwise require them to be, at a minimum, an instance admin to get to.

fulano,

I like the idea too, but I’d prefer to wait and see what the official devs think about it, and if adding privacy measures is part of the roadmap. Lemmy is still too new and things are still unstable.

boopdepop,

I guess you forgot to link it? Here we go

lily33,

Frankly, I think someone should actually do that. Except maybe use open source AI instead of ChatGPT.

The fact is, in a federated setting all this data will be accessible. For example, if lemmy tried to hide who made each vote, and just federate totals, that would allow my malicious instance to report 1M upvotes for my post.
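Here is a toy model of that problem, as a sketch only. It is not how Lemmy or ActivityPub actually implement voting; the class names and instance names are made up. It just contrasts a receiver that trusts reported totals with one that counts named voters.

```python
class TotalsOnlyTally:
    """Hypothetical receiver that trusts whatever total each instance reports."""

    def __init__(self):
        self.totals = {}  # instance name -> reported upvote count

    def receive(self, instance, reported_upvotes):
        # Nothing here can be checked: the number is taken on faith.
        self.totals[instance] = reported_upvotes

    def score(self):
        return sum(self.totals.values())


class PerVoterTally:
    """Hypothetical receiver that is told who voted and counts each voter once."""

    def __init__(self):
        self.voters = set()

    def receive(self, voter):
        # Duplicate votes from the same account collapse into one entry.
        self.voters.add(voter)

    def score(self):
        return len(self.voters)


totals = TotalsOnlyTally()
totals.receive("honest.example", 12)
totals.receive("malicious.example", 1_000_000)  # an unverifiable claim

per_voter = PerVoterTally()
for voter in ["a@honest.example", "a@honest.example", "b@honest.example"]:
    per_voter.receive(voter)
```

Per-voter counting doesn’t stop a malicious instance from inventing fake accounts, but each fake vote then has a name attached that other admins can inspect, rate-limit or ban. With totals only, there is nothing to inspect.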

When lemmy tries to hide this data, all this does is instill a false sense of privacy with users. IMHO the best thing is to make all this de facto public data, officially public, so everyone knows and can act accordingly.

As for privacy, I’d say the best thing to do is, keep your account anonymous.

Nataratata,

I actually think something like that would be great. Not only for Reddit but also for the data your internet provider, email provider, WhatsApp, Google, Apple, etc. have. People don’t realise that these companies have all kinds of sensitive data. And with “companies” I don’t mean some abstract organisation but literal employees, as in people.

I am more shocked about people being shocked that Lemmy instances can see your upvotes.

yarn,

That YSK thread from yesterday inspired me to create a new account with an anonymous relay email, instead of my personal email. I’m not sure how much I would’ve actually had to worry about if I kept using my personal email, but I figure it’s better to be safe than sorry.

I also probably could’ve just changed the email in my first account instead of creating a brand new account, but I don’t really know how data is persisted or anything. That was another case of better to be safe than sorry.

redcalcium,

Lmao I was wondering if this will be the beginning of the new era of karmawhoring in lemmy because now you could figure out your total karma without busting out a calculator.
