If you can't figure out how to block four entities who you have identified by IP from scraping your website such that it is causing that level of damage to your business, I don't know what to tell you.
I read that document with the plaintiff's original filing.
Technically, it is right about the fact that the content providers, namely the people writing Twitter's tweets, never gave consent to this type of mining by whoever owns and operates the crawlers. In the EU, that might be a logical argument. Not sure about Texas, though.
While there may or may not be a legal debate in Texas on this (I have my doubts, and I suspect that vigilantibus non dormientibus is going to come up if they can ever find these accounts to get them into court in the first place), I'm more looking at it from a technical perspective.
If you have four entities that you have identified as coming from Akamai on a single ASN and you can't figure out how to rate limit or block that, this is not the field for you.
I also give a bit of a side eye to identifying based on a single IP address coming from a cloud provider's network.
Either you have identified them well enough that you know there are 1-4 entities associated with 4 IPs, in which case blocking or rate limiting them is trivial.
OR they are engaging in tactics to evade your defenses, in which case, without a date range, it's "good luck with that" for the IPs, and there's no guarantee the logs still exist.
I'm not a website admin, but I heard from a friend who is that it is not trivial to stop crawlers. Most respect robots.txt entries, but it is unknown whether Common Crawl and other crawlers used to mine AI training data follow these rules. There is a new standard in the works.
So filtering these IP addresses might not do the job if it is unknown what type of crawl they do.
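For reference, here is a minimal sketch of what "respecting robots.txt" means from the crawler's side, using Python's standard `urllib.robotparser`. The robots.txt content and the `is_allowed` helper are made up for illustration; `CCBot` is the user agent Common Crawl publishes for its crawler. Nothing here forces a crawler to run this check, which is the whole problem.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: block Common Crawl's bot entirely,
# and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a well-behaved crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("CCBot", "SomeOtherBot"):
        for url in ("https://example.com/page", "https://example.com/private/x"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

The check is purely voluntary on the crawler's part, so it only filters out the polite ones.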
Since you are referencing an unknown friend as a source I'm going to do something I rarely do: I'm going to pull rank.
I'm a software engineer with nearly 20 years of professional experience working as an SRE at Google on their identity infrastructure. Before I was an SRE I spent nearly 7 years as a data engineer working to identify and eliminate bot traffic, including scrapers.
In essence: If it isn't easy, then the id in the lawsuit is likely nonsense.
Logs like that are often not kept very long and so "a random IP for a transient VM in a cloud provider's network" is not identifying. It would be "we think there are 4, but there might be 400, or 1, we don't know, but they used these four at some point?" At a minimum you would have to say "the IP address associated with this account for these times," and logs may not still exist.
But if it is 4 semi-constant IPs, then we're back to "this is trivial."
No need to pull rank, since I said I am not a web designer or a website admin.
I just happened to read an article by that website admin discussing crawling to mine learning corpora for machine learning, and whether those crawlers follow robots.txt directives or not. So I chimed in to ask whether it is trivial or not.
My experience is in the receiving domain of such dataminers. So the legal case per se is interesting to me.
@hrefna That part. They're all Akamai IP addresses from the same ASN. Would've been cheaper and faster if they'd just picked up the phone and asked them if they were actually VMs or (more likely) syndication servers for the embedded tweets.
@DataDrivenMD Yeah, my immediate thought was "he misidentified a syndicator, didn't he." Either that or he's very confused about someone who twitter has an extant contract with. -.-
It's just… utterly predictable.
But if you give me a set of Akamai IP addresses under one ASN I can have those rate limited in no time at all with virtually no impact elsewhere. Better still if they have any other consistent behaviors.
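A rough sketch of that idea, assuming a sliding-window limit shared by everything inside a known prefix. The prefixes below are reserved documentation ranges standing in for whatever the ASN actually announces, and `NetworkRateLimiter` is a hypothetical name, not any particular product's API. In practice this would live in the load balancer or WAF rather than in application code.

```python
import ipaddress
from collections import defaultdict, deque

# Placeholder prefixes (RFC 5737 documentation ranges), NOT real Akamai ranges.
LIMITED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


class NetworkRateLimiter:
    """Sliding-window limiter that counts requests per network, not per IP."""

    def __init__(self, networks, max_requests, window_seconds):
        self.networks = list(networks)
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # network -> timestamps of allowed requests

    def _match(self, ip):
        addr = ipaddress.ip_address(ip)
        for net in self.networks:
            if addr in net:
                return net
        return None

    def allow(self, ip, now):
        """Return True if the request at time `now` (seconds) should be served."""
        net = self._match(ip)
        if net is None:
            return True  # traffic outside the listed prefixes is untouched
        hits = self._hits[net]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = NetworkRateLimiter(LIMITED_NETWORKS, max_requests=2, window_seconds=60)
    for ip, t in [("192.0.2.10", 0), ("192.0.2.11", 1), ("192.0.2.12", 2), ("203.0.113.5", 2)]:
        print(ip, t, limiter.allow(ip, t))
```

Keying the counter on the network rather than the individual address means hopping between IPs inside the same prefix does not reset the limit, and addresses outside the list never see it.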
@elan @hrefna seems like the most obvious thing to do. Ultimately, I think this was just another expensive distraction tactic to get people to stop talking about all the other expensive stupid shit he's done.
@DataDrivenMD @hrefna I don't understand the filing, nor am I versed in court processes.
Why wouldn't they file suit against Linode and seek the identities behind the IP addresses?
There is sufficient case law on this subject, as it's very similar to movie and audio piracy lawsuits that seek the same end result: identify the John Doe who is responsible for the prohibited activity.
They are not filing against Linode, they are filing against 4 John Does. They'll try to get more through discovery with Linode.
One of two things is true:
Those are transient IPs, in which case they need to give a date range/other behavior associated with the scraping campaign, because that could be a whole bunch of companies and lost to a TTL.
They are firmly fixed, in which case id is easy but the damages are nonsense and they could fix this trivially.
Everyone here is aware of what it takes, and also that it requires more than just the IP address unless several other things are simultaneously true and/or a great deal more information is supplied than what's in the lawsuit. If those things are true then it undermines the lawsuit in other ways.
I don't know why you feel the need to explain how you worked in a cloud company. I worked for six years in GCP before my current role.