If you can't figure out how to block four entities who you have identified by IP... - Twitter

hrefna, 11 months ago (edited 11 months ago)

If you can't figure out how to block four entities who you have identified by IP from scraping your website such that it is causing that level of damage to your business, I don't know what to tell you.

#Twitter #BirdSite

JohnLaRooy, 11 months ago

@hrefna is it true that the CEO is so dense that they need to know his exact location for the new gravity wave experiments?

katharina_buholzer, 11 months ago

@hrefna
@wfaler

I read that document with the plaintiff's original filing.

Technically, it is right that the content providers, namely twitter's tweets, never gave consent to this type of mining by whoever owns and operates the crawlers. In the EU, that might be a logical argument. Not sure about Texas, though.

hrefna, 11 months ago

@katharina_buholzer

While there may or may not be a legal debate in Texas on this (I have my doubts, and I suspect that vigilantibus non dormientibus is going to come up if they can ever find these accounts to get them into court in the first place), I'm more looking at it from a technical perspective.

If you have four entities that you have identified as coming from Akamai on a single ASN and you can't figure out how to rate limit or block that, this is not the field for you.

@wfaler
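
To illustrate the technical point above, here is a minimal sketch of rate limiting traffic by source network, assuming the offending addresses are already known. The class name `NetworkRateLimiter`, its parameters, and the 203.0.113.0/24 documentation range are hypothetical and are not taken from the lawsuit or from any real deployment. In practice this would more likely be configured at a CDN, load balancer, or firewall than written in application code.

```python
import ipaddress
import time


class NetworkRateLimiter:
    """Token bucket per flagged network; traffic from other networks passes."""

    def __init__(self, networks, rate_per_sec, burst):
        # `networks` is a list of CIDR strings (hypothetical input).
        self.networks = [ipaddress.ip_network(n) for n in networks]
        self.rate = rate_per_sec
        self.burst = burst
        # Per-network bucket state: (tokens remaining, time of last update).
        self.state = {net: (float(burst), None) for net in self.networks}

    def _match(self, ip):
        addr = ipaddress.ip_address(ip)
        for net in self.networks:
            if addr in net:
                return net
        return None

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        net = self._match(ip)
        if net is None:
            return True  # not a flagged network, so not limited here
        tokens, last = self.state[net]
        if last is not None:
            # Refill in proportion to elapsed time, capped at the burst size.
            tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.state[net] = (tokens - 1.0, now)
            return True
        self.state[net] = (tokens, now)
        return False


limiter = NetworkRateLimiter(["203.0.113.0/24"], rate_per_sec=1.0, burst=2)
```

Setting `burst` to 0 turns the same structure into an outright block for the flagged networks, while every other client is unaffected.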

hrefna, 11 months ago

@katharina_buholzer

I also give a bit of a side eye to identifying based on a single IP address coming from a cloud provider's network.

Either you have identified them well enough that you know that there are 1-4 entities associated with 4 IPs, in which case blocking or rate limiting them is trivial.

OR they are engaging in tactics to evade your defenses, in which case, without a date range for the IPs, "good luck with that," and there's no guarantee the logs still exist.

@wfaler

katharina_buholzer, 11 months ago

@hrefna
@wfaler

I am not a web designer nor website admin.

I heard from a friend who is that it is not trivial to stop crawlers. Most respect the robots.txt entries, but it is unknown whether Common Crawl and others used for mining AI training data follow these rules. There is a new standard in the works.

So filtering these IP addresses might not do the job if it is unknown what type of crawl they do.
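
For context on the robots.txt point, a small sketch using Python's standard `urllib.robotparser`, assuming a made-up robots.txt and made-up crawler names (`ExampleBot`, `OtherBot`). It shows that the check is something a crawler runs on itself: the rules are advisory, and a crawler that never performs this check is not stopped by them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return whether a well-behaved crawler would consider `url` fetchable."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler calls something like this before each request.
print(is_allowed("ExampleBot", "https://example.com/page"))
print(is_allowed("OtherBot", "https://example.com/page"))
```

Enforcement against a crawler that ignores these rules has to happen on the server side, for example by blocking or rate limiting the source addresses.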

hrefna, 11 months ago

@katharina_buholzer

Since you are referencing an unknown friend as a source I'm going to do something I rarely do: I'm going to pull rank.

I'm a software engineer with nearly 20 years of professional experience working as an SRE at Google on their identity infrastructure. Before I was an SRE I spent nearly 7 years as a data engineer working to identify and eliminate bot traffic, including scrapers.

In essence: If it isn't easy, then the id in the lawsuit is likely nonsense.

@wfaler

hrefna, 11 months ago

@katharina_buholzer

Logs like that are often not kept very long and so "a random IP for a transient VM in a cloud provider's network" is not identifying. It would be "we think there are 4, but there might be 400, or 1, we don't know, but they used these four at some point?" At a minimum you would have to say "the IP address associated with this account for these times," and logs may not still exist.

But if it is 4 semi-constant IPs, then we're back to "this is trivial."

@wfaler
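
A rough sketch of why a bare IP address is not identifying on a cloud network, assuming the provider keeps a table of address assignments over time. The `Lease` record, the tenant names, the dates, and the 203.0.113.x addresses are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Lease:
    ip: str
    tenant: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def tenants_for(leases: List[Lease], ip: str,
                when: Optional[datetime] = None) -> List[str]:
    """Tenants that held `ip`: all of them, or only the holder at `when`."""
    matches = [lease for lease in leases if lease.ip == ip]
    if when is not None:
        matches = [lease for lease in matches if lease.start <= when < lease.end]
    return sorted({lease.tenant for lease in matches})


leases = [
    Lease("203.0.113.10", "tenant-a", datetime(2023, 6, 1), datetime(2023, 6, 3)),
    Lease("203.0.113.10", "tenant-b", datetime(2023, 6, 3), datetime(2023, 6, 20)),
    Lease("203.0.113.11", "tenant-c", datetime(2023, 6, 1), datetime(2023, 6, 20)),
]

print(tenants_for(leases, "203.0.113.10"))
print(tenants_for(leases, "203.0.113.10", when=datetime(2023, 6, 10)))
```

Without a timestamp the lookup returns every tenant that ever held the address; with one it narrows to a single tenant, and only if the assignment records for that period still exist.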

katharina_buholzer, 11 months ago

@hrefna @wfaler

No need for pulling rank, since I said I am not a web designer nor a website admin.

I just happened to read an article by that website admin discussing crawling in order to mine learning corpora for machine learning and whether they follow robots.txt directives or not. So I chimed in, raising the question of whether it is trivial or not.

My experience is in the receiving domain of such dataminers. So the legal case per se is interesting to me.

Musk must have fired the wrong talent, I guess.

edk, 11 months ago

@hrefna "find another hobby"?

DataDrivenMD, 11 months ago

@hrefna That part. They're all Akamai IP addresses from the same ASN. Would've been cheaper and faster if they'd just picked up the phone and asked them if they were actually VMs or (more likely) syndication servers for the embedded tweets.

hrefna, 11 months ago

@DataDrivenMD Yeah, my immediate thought was "he misidentified a syndicator, didn't he." Either that or he's very confused about someone who twitter has an extant contract with. -.-

It's just… utterly predictable.

But if you give me a set of Akamai IP addresses under one ASN I can have those rate limited in no time at all with virtually no impact elsewhere. Better still if they have any other consistent behaviors.

elan, 11 months ago

@DataDrivenMD @hrefna what are we talking about here?

hrefna, 11 months ago

@elan twitter is suing four john does. https://www.documentcloud.org/documents/23875001-x-corp-data-scraping-lawsuit?responsive=1&title=1

@DataDrivenMD

hrefna, 11 months ago

@elan

Specifically the entities "associated with"

23.239.23.31
194.195.210.128
23.239.17.31
23.239.20.149

No times on those four IP addresses or any other identifying information I can see.

They want the "maximum rate allowed" for… scraping.

@DataDrivenMD

elan, 10 months ago

@hrefna @DataDrivenMD so linode / akamai to unmask the customer?

DataDrivenMD, 10 months ago

@elan @hrefna seems like the most obvious thing to do. Ultimately, I think this was just another expensive distraction tactic to get people to stop talking about all the other expensive stupid shit he's done

elan, 10 months ago

@DataDrivenMD @hrefna I don't understand the filing nor am I versed in court processes.

Why wouldn't they file suit against Linode and seek the identities behind the IP addresses?

There is sufficient case law on this subject, as it's very similar to movie and audio piracy lawsuits that seek the same end result: identify the John Doe who is responsible for the prohibited activity.

Do they expect the judge to know the identities?

hrefna, 10 months ago

@elan

They are not filing against Linode; they are filing against 4 John Does, and they'll try to get more through discovery with Linode.

One of two things is true:

Those are transient IPs, in which case they need to give a date range/other behavior associated with the scraping campaign, because that could be a whole bunch of companies and lost to a TTL.

They are firmly fixed, in which case id is easy but the damages are nonsense and they could fix this trivially.

@DataDrivenMD

elan, 10 months ago

@hrefna @DataDrivenMD so discovery would translate to subpoenas for third parties not named in the suit?

I guess that kind of makes sense

leviramsey, 11 months ago

@hrefna @elan @DataDrivenMD

There are a lot of executive types who think that IPv4 identification is magical.

elan, 10 months ago

@leviramsey @hrefna @DataDrivenMD those IPs belong to a cloud company.

As someone who works for a cloud company, let me tell you, we know exactly who had the IP at a specific point in time.

hrefna, 10 months ago

@elan

Everyone here is aware of what it takes, and also that it requires more than just the IP address unless several other things are simultaneously true and/or a great deal more information is supplied than what's in the lawsuit. If those things are true then it undermines the lawsuit in other ways.

I don't know why you feel the need to explain how you worked in a cloud company. I worked for six years in GCP before my current role.

@leviramsey @DataDrivenMD

elan, 10 months ago

@hrefna

I may be misinterpreting your response because it's text and tone is lost, but your reply comes off as hostile and quite rude.

If I'm incorrect, please accept my apologies.

@leviramsey @DataDrivenMD

hrefna, 10 months ago

@elan

Let me flip that a bit.

When you said "As someone who works for a cloud company, let me tell you…" what did you mean to convey? To whom?

Of the three possibilities here I see:

An akka dev (Lightbend specifically?) who presumably knows about distributed systems

A researcher in infosec and cofounder of "coders against covid"

A Google SRE who used to work as a data engineer in Google Cloud

Which of these three was your "let me tell you" directed at?

@leviramsey @DataDrivenMD

elan, 10 months ago

@hrefna @leviramsey @DataDrivenMD

I can see that. It's a colloquialism. Not intended to be taken as mansplaining, reply guy-ing, or anything similar.
