gimulnautti, to ai avatar

With regard to this, I think it is reasonable to assume the big AI companies will scrape everything.

So it makes sense to build new forums of expression only behind walls of strong authentication.

And when I say strong, I mean it.

To use such a system, the user would have to provide a pulse, electrical conductivity of the skin, and other life signs — and do so long enough, via an implementation vetted by the most heavily scrutinized security available — to even be permitted to log in.

kzimmermann, to til avatar

#TIL that @JasonPunyon curated and compiled a whopping archive of answers from #StackOverflow and assorted #StackExchange Q&A sites in a minimal sqlite format, where they can be downloaded and analyzed offline:

Amazing effort and great idea. Reminded me of the archives that #Kiwix kept of it (alongside Wikipedia and similar projects), but more streamlined and cross-platform. Nice.

kiwix, avatar

@kzimmermann An even better place to browse the Kiwix library of books for offline usage:

alecm, to AdobePhotoshop

Zuckerman vs. Zuckerberg: why and how this is a battle over the public understanding of APIs, and why Zuckerman needs to lose and Meta needs to win

Imagine that you’re a cool, high-school, technocultural teenager; you’ve been raised reading Cory Doctorow’s “Little Brother” series, you have a 3D printer, a soldering iron, you hack on Arduino control systems for fun, and you really, really want a big strobe light in your bedroom to go with the music that you blast out when your parents are away.

So you build a stepper motor with a wheel and a couple of little arms, link it to a microphone circuit which does an FFT of ambient sound, and hot-glue the whole thing to your bedroom lightswitch so that the wheel’s arms can flick the lightswitch on-and-off in time to the beat.
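The beat-detection step the teenager hacks together — an FFT over ambient sound to find the dominant pulse — can be sketched in a few lines of Python. This is a toy illustration using NumPy and a synthesized signal, not anyone's actual firmware:

```python
import numpy as np

def dominant_frequency(samples, sample_rate):
    """Return the strongest frequency component (Hz) of an audio buffer."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC bin

# Synthesize one second of a 2 Hz "beat" (120 BPM) for demonstration.
rate = 1000
t = np.arange(rate) / rate
signal = np.sin(2 * np.pi * 2 * t)
print(dominant_frequency(signal, rate))  # ~2.0 Hz
```

In the real build, that returned frequency would drive the stepper motor's flicking cadence — which is exactly how the switch ends up being actuated far faster than it was designed for.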

If you’re lucky the whole thing will work for a minute or two and then the switch will break, because it wasn’t designed to be flicked on-and-off ten times per second; or maybe you’ll blow the lightbulb. If you’re very unlucky the entire switch and wiring will get really hot, arc, and set fire to the building. And if you share, distribute, and encourage your friends to do the same then you’re likely to be held liable in one of several ways if any of them suffer cost or harm.

Who am I?

My name’s Alec. I am a long-term blogger and an information, network and cyber security expert. From 1992-2009 I worked for Sun Microsystems, from 2013-16 I worked for Facebook, and today I am a full-time stay at home dad and part-time consultant. For more information please see my “about” page.

What does this have to do with APIs?

Before I begin I want to acknowledge the work of Kin Lane, The API Evangelist, who has been writing about the politics of APIs for many years. I will not claim that Kin and I share the same views on everything, but we appear to overlap perspectives on a bunch of topics and a lot of the discussion surrounding his work resonates with my perspectives. Go read his stuff, it’s illuminating.

So what is an API? My personal definition is broad but I would describe an API as any mechanism that offers a public or private contract to observe (query, read) or manipulate (set, create, update, delete) the state of a resource (device, file, or data).

In other words: a light switch. You can use it to turn the light on if it’s off, or off if it’s on, and maybe there’s a “dimmer” to set the brightness if the bulb is compatible; but light switches have their physical limitations and expected modes of use, and they need to be chosen or designed to fit the desired usage model and purpose.
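The light-switch-as-API analogy can be made literal with a toy sketch (the class and method names here are illustrative, not any real library):

```python
class LightSwitch:
    """A light switch as an API: a contract to observe or manipulate state."""

    def __init__(self):
        self._on = False
        self._brightness = 100  # percent; only meaningful when the light is on

    def is_on(self):
        # Observe (query/read) the resource's state.
        return self._on

    def toggle(self):
        # Manipulate (update) the resource's state.
        self._on = not self._on
        return self._on

    def set_brightness(self, pct):
        # The optional "dimmer" — only valid within physical limits.
        if not 0 <= pct <= 100:
            raise ValueError("outside the switch's physical limits")
        self._brightness = pct

switch = LightSwitch()
switch.toggle()
print(switch.is_on())  # True
```

The contract includes its limits: `set_brightness(150)` raises an error, just as a physical dimmer has a stop at the end of its travel.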

Perhaps to some this definition sounds a little too broad, because it would literally include (e.g.) “in-browser HTML widgets and ‘submit’ buttons for deleting friendships” as an “API”; but the history of computing is rife with human-interface elements being repurposed as application interfaces. Banking is a classic example: it was once fashionable to link new systems to old backend mainframes by using software that pretends to be a traditional IBM 3270 terminal, screen-scraping the responses to queries which the new system “typed” into the terminal.

The modern equivalent for web-browsers is called Selenium WebDriver and is widely used by both automated software testers and criminal bot-farms, to name but two purposes.
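Screen-scraping, in any era, amounts to treating human-interface elements as a machine interface. A minimal stdlib sketch against a hypothetical HTML fragment shows both how easy and how fragile it is — the markup and class names below are invented, and any redesign of the page silently breaks the scraper:

```python
from html.parser import HTMLParser

class PriceScraper(HTMLParser):
    """Pull text out of elements the page never promised as an interface."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Rely on a presentation detail (a CSS class) as if it were a contract.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())
            self._in_price = False

# A hypothetical auction-listing fragment; in the wild this HTML can
# change at any moment, which is exactly why scraping is fragile.
page = ('<ul><li><span class="price">$9.99</span></li>'
        '<li><span class="price">$4.50</span></li></ul>')
scraper = PriceScraper()
scraper.feed(page)
print(scraper.prices)  # ['$9.99', '$4.50']
```

Tools like Selenium WebDriver automate a whole browser rather than parsing raw HTML, but the underlying bargain — depending on an interface that was designed for human eyes — is the same.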

So yes: the tech industry — or perhaps: the tech hacker/user community — has a long history of wiring programmable motors to light switches and hoping that their house does not catch on fire… but we should really aspire to do better than that… and that’s where we come to the history of EBay and Twitter.

History of Public APIs

In the early 2000s there was a proliferation of platforms that offered various services — “I can buy books over the internet? That’s amazing!” — and this was all before the concept of a “Public API” was invented.

People wanted to “add-value” or “auto-submit” or “retrieve data” from those platforms, or even to build “alternative clients”; so they examined the HTML, reverse-engineered the functions of Internal or Private APIs which made the platform work, wrote and shared ad-hoc tools that posted and scraped data, and published their work as hackerly acts of radical empowerment “on behalf of the users” … except for those tools which stole or misused your data.

Kin Lane particularly describes the launch of the Public APIs for EBay in November 2000 and for Twitter in September 2006; about the former he writes:

The eBay API was originally rolled out to only a select number of licensed eBay partners and developers. […] The eBay API was a response to the growing number of applications that were already relying on its site either legitimately or illegitimately. The API aimed to standardize how applications integrated with eBay, and make it easier for partners and developers to build a business around the eBay ecosystem.


…and regarding the latter:

On September 20, 2006 Twitter introduced the Twitter API to the world. Much like the release of the eBay API, Twitter’s API release was in response to the growing usage of Twitter by those scraping the site or creating rogue APIs.


…both of which hint at some issues:

  1. an ecosystem of ad-hoc tools that attempt to blindly and retrospectively track EBay’s own platform development would not offer standardisation across the tools that use those APIs, and so would actually limit the potential for third-party client development; each tool would be working with different assumed “contracts” of behaviour that were never meant to be fixed or exposed to the public, and each would duplicate the others’ work
  2. proliferation of man-in-the-middle “services” that would act “on your behalf” — and with your credentials — on the Twitter and EBay platforms, presented both a massive trust and security risk to the user (fraudulent purchases? fake tweets? stolen credentials?) with consequent reputational risk to the platform

Why do Public APIs exist?

In short: to solve these problems. Kin Lane writes a great summary on the pros-and-cons of Public APIs and how they are used both to enable, but also to (possibly unfairly) limit, the power of third party clients that offer extra value to a platform’s users.

But at the most fundamental level: Public APIs exist in order to formalise contracts of adequate means by which third parties can observe or manipulate “state” (e.g. user data, postings, friendships, …) on the platform.

By offering a Public API, the platform also frees itself to develop and use Private APIs which can service other or new aspects of platform functionality, and it is in a position to build and “ring-fence” the Public API service in the expectation of both heavy use and abuse being submitted through it.

Similarly: the Private APIs can be engineered more simply to act like domestic light-switches: to be used in limited ways and at human speeds; it turns out that this can be important for matters like privacy and safety.

Third parties benefit from Public APIs by having a guaranteed set of features to work with, proper documentation of API behaviour, confidence that the API will behave in a way that they can reason about, and an API lifecycle management process which enables them to make their own guarantees regarding their work.

What is the Zuckerman lawsuit?

First, let me start with a few references:

The shortest summary of the lawsuit that I have heard from one of its ardent supporters, is that the lawsuit:

[…] seeks immunity from [the Computer Fraud and Abuse Act] and [the Digital Millennium Copyright Act] [for legal] claims [against third parties or users] for automating a browser [to use Private APIs to obtain extra “value” from a website] and [the lawsuit also] does not seek state mandated APIs, or, indeed, any APIs

(private communication)

To make a strawman analogy so that we can defend its accuracy:

Let’s build and distribute motors to flick lightswitches on and off to make strobe lights, because what’s the worst that could happen? And we want people to have a fundamental right to do this, because Section 230 says we have such a right. We won’t be requiring any new switches to be installed, we just want to be allowed to use the ones that are already there, so it’s easy and low-cost to ask for, and there’s no risk to us doing this. But we also want legal immunity just in case what we provide happens to burn someone’s house down.

In other words: a return to the ways of the early 2000s, where scraping data and poking undocumented Private APIs was an accepted way to hack extra value into a website platform. To a particular mindset — especially the “big tech is irredeemably evil” folk — this sounds great, because clearly Meta intentionally prevents you from having full, automated remote control over your user data on the grounds that it’s terribly valuable to them, and their having it keeps you addicted, so it helps them make money.

And you know what? To a very limited extent I agree with that premise — or at least that some of the Facebook user-interface is unnecessarily painful to use.

E.g. I feel there is little (some, but little) practical excuse for the heavy user friction which Facebook imposes upon editing of the “topics you may be interested in receiving adverts about”; but the way to address this is not to encourage proliferation of browser plugins (of dubious provenance regarding privacy and regulatory compliance, let alone uncertain behaviour) which manipulate undocumented Private APIs.

Apart from any other reason, as alluded above, Private APIs are built in the expectation of being used in a particular way — e.g. by humans, at a particular cadence and frequency — and on advanced platforms like Facebook they are engineered with those expectations enforced by rate limits not only for efficiency but also for availability, security and privacy reasons.

This is something which I partially described in a presentation on behalf of Facebook at PasswordCon in 2014, but the short version is: if an API is expected to be used primarily by a human being, then for security and trust purposes it makes sense to limit it to human rates of activity.

If you start driving these Private APIs at rates which are inhuman — 10s or 100s of actions per second — then you should and will expect them to either be rate-limited, or else possibly break the platform in much the same way that flicking a lightswitch at such a rate would break that lightswitch or bulb.
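Rate-limiting an API to "human speeds" is commonly implemented with something like a token bucket. Here is a minimal sketch — the rates are illustrative, not anything Facebook actually uses:

```python
import time

class TokenBucket:
    """Cap an API at roughly human rates: e.g. 1 action/sec, burst of 3."""

    def __init__(self, rate=1.0, burst=3):
        self.rate = rate          # tokens replenished per second
        self.capacity = burst     # maximum short-term burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens in proportion to elapsed time, up to capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=1.0, burst=3)
results = [bucket.allow() for _ in range(10)]  # a rapid 10-call burst
print(sum(results))  # only the burst allowance (~3) gets through
```

A human flicking a light switch never hits this limit; a script driving the same endpoint at 100 calls per second is mostly refused — which is the point.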

With this we can describe the error in one of the proponent’s claims: We aren’t requiring any new [APIs] to be installed, we just want to be allowed to use the ones that are already there — but if the Private API is neither intended nor capable of being driven at automated speeds then either something (the platform?) will break, or else there will be loud demands that the Private APIs be re-engineered to remove “bottlenecks” (rate limits) to the detriment of availability and security.

But if you will be calling for the formalisation of Private APIs to provide functionality, why are you not instead calling for an obligation upon the platform to provide a Public API?

Private APIs are not Public APIs, and Public APIs may demand registration

The general theme of the lawsuit is to demand that any API which a platform implements — even undocumented Private ones — should be legally treated as a Public API, open for use by third-party implementors, without any reciprocal obligation that the third-party client obtain an “API Key” to identify itself, or abide by particular behaviour or rate limits.

In short: all APIs, both Public and Private, should become “fair game” to third-party implementors, and the platforms should have no business distinguishing between one third party and another, even when one or more of them is malicious.

This is a dangerous proposal. Platforms innovate new functionality and change their Private API behaviour at a relatively rapid speed, and there is currently nothing to prevent that; but if a true “right to use” for a Private API becomes somehow enshrined, what happens next?

Obviously: any behaviour which interferes with a public right-to-use is illegal, so it will therefore become illegal to change or remove Private APIs — or at the very least any attempt to do so will lead to claims of “anticompetitive behaviour” and yet more punitive lawsuits. The free-speech rights of the platform will be abridged by compulsion to never change APIs, or to support legacy, publicly-used-yet-undocumented APIs forever more.

So, again, why not cut this Gordian knot by compelling platforms to make available a Public API that supports the desired functionality? After all, even Mastodon obligates developers of third-party apps to register their apps before use; but somehow big platforms should accept any and all non-human usage of Private APIs without discrimination?


I don’t want to keep flogging this horse, so I am just going to try and summarise in a few bullets:

  1. Private APIs exist to provide functionality to directly support a platform; they are implemented in ways which reflect their expected (usually: human) modes of use, they are not publicly documented, they can come and go, and this is normal and okay
  2. Public APIs exist to provide functionality to support third-party value-add to a platform; they are documented and offer some form of public “contract” or guarantee of behaviour, capability, and reliability. They are often designed in expectation of automated or bulk usage.
  3. Private APIs do not offer such a public contract; they are not meant to be built upon other than by the platform itself. They are meant to be able to “go away” without fuss, but if their use is a guaranteed “right” then how can they ever be deprecated?
  4. If third parties want to start using Private APIs as if they were Public APIs then the Private APIs will probably need to be re-engineered to support the weight of automated or bulk usage; but if they are going to be re-engineered anyway, why not push for them to become Public APIs?
  5. If Private APIs are not re-engineered and their excessive automated use by third party tools breaks the platform, why should the tool-user or the tool-provider not be held at least partly responsible as would happen in any other form of intentional or unintentional Denial-of-Service attack?
  6. If some (in-browser) third party tools claim to be acting “for the public good” then presumably they will have no problem in identifying themselves in order to differentiate themselves from (in-browser) evil cookie-stealing malware and worms; but to differentiate themselves would require use of an API Key and a Public API — so why are the third-party tool authors not calling to have the necessary Public APIs?
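Points 2 and 6 together describe what an API-key gate buys the platform: clients that identify themselves and get per-client limits. A hypothetical sketch (the key names and limits below are invented for illustration):

```python
# Hypothetical Public API gate: clients must register for a key, which
# lets the platform tell a research tool apart from a cookie-stealing worm
# and apply per-client rate limits.
REGISTERED_KEYS = {
    "research-tool-v1": {"limit_per_min": 60},
    "partner-app":      {"limit_per_min": 600},
}
usage = {}  # requests seen per key in the current minute

def handle_request(api_key):
    client = REGISTERED_KEYS.get(api_key)
    if client is None:
        return 401, "unknown client: register for an API key"
    used = usage.get(api_key, 0)
    if used >= client["limit_per_min"]:
        return 429, "rate limit exceeded for this key"
    usage[api_key] = used + 1
    return 200, "ok"

print(handle_request("research-tool-v1"))  # (200, 'ok')
print(handle_request("evil-scraper"))      # (401, ...)
```

A tool genuinely acting "for the public good" loses nothing by presenting a key; the platform gains the ability to throttle, audit, or revoke each client individually instead of guessing from traffic patterns.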

Just because an academic says “I wrote a script and I think it will work, and that I [or one of your users] should be allowed to run it against your service without fear of reprisal, even though [we] don’t understand how the back-end system will scale with it” — does not mean that they should be permitted to do so willy-nilly, neither against Facebook nor against your local community Mastodon instance.

r_alb, to privacy avatar

Another data broker is telling me that they have a „legitimate interest“ in scraping and selling my data because they need it for their business. 🙄 That is not enough.
When someone claims legitimate interest, they have to show that your rights and freedoms do not outweigh their interests. „We want to because money!“ does not quite do that!

Time to prepare my next complaint.

#privacy #DataProtection #GDPR #DataBrokers #BigData #PrivacyMatters

pac, to OSINT avatar

At , our mission is to detect and characterise information manipulation involving foreign actors in the French public debate.

We are recruiting a full-stack developer to strengthen the technical team. Expect a fair amount of data collection, but also in-house tool development.


f_moncomble, to linguistics avatar

And another one for fellow linguists interested in compiling corpora of digital discourse: MastoScraper takes advantage of the Mastodon API to collect toots based on a keyword search.
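For collectors taking the sanctioned route, Mastodon exposes keyword search through its public API (`GET /api/v2/search`). A small sketch of building such a request — actually sending it requires an access token, and full-text status search depends on the instance's configuration:

```python
from urllib.parse import urlencode, urljoin

def build_search_url(instance, keyword, limit=20):
    """Build a Mastodon v2 search request for statuses matching a keyword."""
    params = urlencode({"q": keyword, "type": "statuses", "limit": limit})
    return urljoin(instance, "/api/v2/search") + "?" + params

url = build_search_url("https://mastodon.social", "linguistics")
print(url)
# https://mastodon.social/api/v2/search?q=linguistics&type=statuses&limit=20
```

Going through the documented API — with an app registered on the instance — means the server can apply its normal rate limits, which is precisely the distinction the essay above draws between Public and Private APIs.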
Here goes, feedback welcome!


@f_moncomble @linguistics I sure scraped my bit of data for linguistic research, but I am contemplating how ethical it is. Is scraping justified for research purposes, and not for commercial goals? Who is responsible for the data that is scraped with your tool? I think that is an interesting discussion to have!

beach, to random avatar

I understand the importance of alt text for images. But as I was drafting my image description just now a question occurred to me – in adding alt text to images am I also inadvertently making it easier for the AI bots to scrape my artwork?

NatureMC, avatar

@beach At least in the Fediverse there shouldn't be any scraping of that kind (?) Does anybody know more? @FediTips @feditips ?

Scraping bots don't need ALT-texts for scraping (they have image recognition).

At the moment, the only ways to protect your work are dedicated blocking tools and banning AI bots on your website in robots.txt

BeAware, to fediverse avatar

It's SO strange to me how people are so adamant that they don't want their data being sold/abused.

Yet, we have a literal data harvester in front of our own eyes, with NewsMast out in the open using Fedi posts for their "news app", and nobody is batting an eye...🤦‍♂️

fpb, avatar

@BeAware – All data that is publicly available on a website will be scraped for commercial use, violating intellectual property rights and data privacy rights. Such is the nature of today's Web, and the Fediverse is a part of it.

flipsideza, to ai avatar

If you don't want those bots scraping your website/blog... time to add a bunch of Disallow: / rules to your robots.txt!!
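A robots.txt along those lines, checked with Python's stdlib parser. GPTBot and CCBot are the user-agent names published by OpenAI's and Common Crawl's crawlers; note that compliance with robots.txt is voluntary on the crawler's part:

```python
from urllib.robotparser import RobotFileParser

# Block known AI-training crawlers while leaving the site open to others.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("GPTBot", "/blog/post"))      # False
print(parser.can_fetch("Mozilla/5.0", "/blog/post")) # True
```

A well-behaved crawler runs exactly this check before fetching a page; a badly behaved one simply ignores the file, which is why robots.txt is a request, not a defence.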

metin, (edited ) to ai avatar

Just spent at least two hours deleting all of my work from Tumblr, before their AI scraping shit hits the fan, although it's probably too late. In that case, the deletion functions as a gesture of protest.

This shameless large-scale intellectual property theft by greedy tech business assholes everywhere is starting to make the internet pretty annoying. 😖

dalonso, to ai Spanish avatar

Politico website design embraces AI crawlers

"Politico has given its Europe, German and French websites a redesign which improves readability for the web crawlers that feed into generative AI large language models (LLMs)."

#Politico #AI #Scraping #LLM #ArtificialIntelligence

conspirator0, to random avatar

None of these people exist, but you can buy their books on Amazon anyway. Here's an article on Amazon authors with GAN-generated faces and the books these authors publish (which appear to be almost entirely devoid of original human-created content).

NatureMC, avatar

@conspirator0 If the content is AI-generated, it's nearly "fine". But what happens if these fictional authors and AI personas scrape copyrighted, original writers' content? In music it already happens:
BTW, the problem doesn't exist only on Amazon but also in science!

atineoSE, to random

Mr. Musk adds a new infamy to his growing record by going head-on against #scraping performed by a nonprofit.

Great piece on where the practice stands by @themarkup

Gargron, to threads avatar

If for whatever reason you never wish to interact with Threads, you can personally block it for your account. This hides all posts and profiles from Threads, prevents anyone from Threads from following you, and stops your posts from being delivered to or fetched by Threads. Simply click the "Block domain" option on any Threads profile or post you see in Mastodon.


@tasket If they want to scrape all your information, they will; it is publicly available on the internet. Wouldn't be surprised if they already did it with their AI. No, it does not matter if you add noindex to your robots.txt or profile.

Scrapers gonna scrape; the machine goes brbrbr.

@Gargron @Oozenet @jcrabapple

smach, to python avatar

If you want to get started doing Web scraping with Python, this tutorial is very well done.
By Cody Winchester for the NICAR @IRE_NICAR (National Institute for Computer-Assisted Reporting, that is, data journalism) conference earlier this year

@IRE_NICAR @python

rwg, to random avatar

I've said this before, but I will say it again: those of you doing research on the fediverse using scraping or similar techniques, keep in mind that this place is not one thing, but tens of thousands of communities, each with their own terms/codes of conduct. So just grabbing all the posts you can likely violates many communities' norms or terms.


i0null, to random
strypey, avatar

> I'm no advocate of IP law but commercial derivatives should absolutely be compensated, especially given the market capital of these companies

Cory Doctorow (@pluralistic) wrote a good column about regulation of web scraping, and reads it as a podcast here:


PFCOGAN, to Cybersecurity

A New Tool Helps Artists Thwart AI—With a Middle Finger.
Kudurru, the new tool from the creator of Have I Been Trained?, can help artists block web scrapers and even “poison” the scraping by sending back the wrong image.

oaklandprivacy, to privacy avatar

23andMe says private user data is up for sale after being scraped

...23andMe officials... confirmed that private data for some of its users is... up for sale. The cause of the leak... is data scraping, a technique that essentially reassembles large amounts of data by systematically extracting smaller amounts of information available to individual users

The crime forum post claimed the attackers obtained “13M pieces of data.”

itnewsbot, to security avatar

23andMe says private user data is up for sale after being scraped

pluralistic, to privacy avatar

This week on my podcast, I read my recent @medium column, "How To Think About Scraping: In privacy and labor fights, copyright is a clumsy tool at best," which proposes ways to retain the benefits of scraping without the privacy and labor harms that sometimes accompany it:


If you'd like an essay-formatted version of this thread to read or share, here's a link to it on my surveillance-free, ad-free, tracker-free blog:


Recommended emulator frontend + scraping tool for multiregion collection?

So I have a collection of cartridges from US, Europe, and Japan. I've extracted ROMs from all of them I can so far and I'm trying to come up with a collection folder I can backup that can be dropped into an emulator frontend (e.g. Emulation Station) with correct information....
