gruez a day ago

Why? The only thing that's vaguely objectionable is the fact the consent screen's wording of "download public web data from the internet" omits important information on what's actually happening and the associated risks. Otherwise I'm not sure how you can come up with a principled justification of the ban beyond just "AI scrapers bad" or "hiding identity". Tor relays and VPNs are basically doing the same thing, except with clearer disclosure about what actually goes on.

tadfisher a day ago

Does there need to be a principled justification beyond that? I used to be on the side of the traffic, as in, it does not matter where traffic originates as long as it's not abusive. But the fact is that too many scrapers exist which are, in fact, bad. Their behavior is bad, their programming is bad, and they result in way too high costs for free infrastructure, thus they are morally bad.

I expect AT&T and Comcast to offer a residential proxy service any day now.

topranks 20 hours ago

Absolutely.

Bear in mind the scrapers wouldn’t need to use these proxies were they not being blocked by the sites they are scraping. So it’s being used to evade blocks.

For some content, scraper traffic now outweighs traffic from real users, driving up costs and pushing sites towards more closed models.

Wikipedia, for example, makes its content available for free. If you start hammering the site, they will rate limit you to keep the lights on. If you need the data fast and in bulk, they have a paid program to get it without scraping. But some prefer to neither adhere to reasonable request limits nor pay for their use of the infra; instead they choose to pay these grifters to avoid the rate limits.

ff317 a day ago

From the content hosting side (getting reamed by scrapers overloading infrastructure), the problem is that we have to be able to set "reasonable" ratelimits to share finite network uplink and server cpu resources between all of our real users and these scrapers.

When you can identify the nature of the traffic (quickly, in realtime, based on simple deterministic rules), you can protect the resources: you can rate- or concurrency-limit the AI scrapers in the name of saving resources for the real humans, effectively putting the scrapers in a lower priority band (which is how it generally worked for search engine scrapers before!).
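To make the priority-band idea concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular site does it: the User-Agent markers, the band names, and the token-bucket rates are all made-up assumptions, and the classifier only works when the client identifies itself honestly.

```python
from dataclasses import dataclass

# Hypothetical markers: a real deployment would maintain its own rules.
SCRAPER_UA_MARKERS = ("bot", "crawler", "spider")


def classify(user_agent: str) -> str:
    """Simple deterministic rule: label a request by its User-Agent string."""
    ua = user_agent.lower()
    return "scraper" if any(m in ua for m in SCRAPER_UA_MARKERS) else "human"


@dataclass
class TokenBucket:
    rate: float       # tokens refilled per second
    capacity: float   # maximum burst size
    tokens: float = 0.0
    last: float = 0.0  # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def make_bands() -> dict:
    """One shared bucket per traffic class; the numbers are assumptions."""
    return {
        "human": TokenBucket(rate=1000.0, capacity=1000.0, tokens=1000.0),
        "scraper": TokenBucket(rate=50.0, capacity=50.0, tokens=50.0),
    }


def admit(bands: dict, user_agent: str, now: float) -> bool:
    """Admit or reject a request against the bucket for its class."""
    return bands[classify(user_agent)].allow(now)
```

With these assumed numbers, a burst of 100 requests from a self-identified bot gets through only up to the scraper band's budget, while the same burst from browsers is admitted in full. The whole scheme depends on `classify` getting a truthful signal.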

The problem is they're using resiproxies to disperse and whitewash their traffic, making it extremely difficult to tell their requests apart from the legitimate human requests. They're basically lying to us about the origin, and thus denying us the ability to put them in a lower priority band than humans.

They may scrape us at, say, 25K reqs/second, but it's coming from 50K random residential eyeball IPs at an average rate of only 0.5 reqs/second/IP, and then they're intentionally lying with the UA and headers and other fingerprint details as best they can to "blend in" with the humans so that we can't differentiate.
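The arithmetic here is worth spelling out. The sketch below uses the figures from the comment (25K reqs/second spread over 50K IPs); the per-IP limit of 2 reqs/second and the synthetic `ip-N` labels are assumptions chosen purely for illustration.

```python
def per_ip_rate(total_rps: float, ip_count: int) -> float:
    """Average request rate seen from each IP when load is spread evenly."""
    return total_rps / ip_count


def ips_over_limit(rps_by_ip: dict, limit_rps: float) -> list:
    """Return the IPs a naive per-IP rate limiter would flag."""
    return [ip for ip, rps in rps_by_ip.items() if rps > limit_rps]


# Figures from the comment: 25K reqs/second across 50K residential IPs.
TOTAL_RPS = 25_000
IP_COUNT = 50_000
# Assumed per-IP limit, generous enough for a human browsing session.
PER_IP_LIMIT_RPS = 2.0

rate = per_ip_rate(TOTAL_RPS, IP_COUNT)
# Synthetic labels stand in for real addresses.
traffic = {f"ip-{i}": rate for i in range(IP_COUNT)}

flagged = ips_over_limit(traffic, PER_IP_LIMIT_RPS)
aggregate = sum(traffic.values())
print(f"per-IP rate: {rate} req/s, flagged: {len(flagged)}, aggregate: {aggregate} req/s")
```

Every individual IP stays well under the assumed limit, so a per-IP limiter flags nothing, even though the aggregate load is still the full 25K reqs/second.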

Let's do an analogy: Imagine if there was a neighborhood grocery store you and all your neighbors rely on for food. It's cheap because they keep their margins low, and more importantly the next store down the road is like 50 miles further away. That store 50 miles down the road also charges double the price. Now they've decided to play arbitrage: they load up 100 employees in the back of an air conditioned semi, clothe them to look like local shoppers, park it 3 blocks from your neighborhood store hidden inside a fenced property, and have them all go in and buy out all the inventory in the store over the course of a couple hours. The store just looks like it's having a great sales day at first. All these customers waiting in line, each getting just a few things at a time. But two hours later, the store shelves are empty, the semi is loaded up, and they're headed 50 miles back to double the price and sell it to someone else. You go in to buy some veggies to cook dinner and there's nothing to buy.

We've been playing this game with AI scrapers and resiproxies for way too long, and someone needs to hold them accountable for their fraud.

gruez a day ago

All the arguments you made apply to VPNs or Tor as well. I'm sure rightsholders would be very happy if VPNs were banned, because that gets rid of one avenue for pirating with impunity. Same goes for every ad network ever, which has to fight click fraud.

drdexebtjl a day ago

How exactly does that also apply to VPN or Tor?

Who's using VPNs and Tor to blend in their automated scraping traffic with real human traffic?

Who's using multiple VPNs or Tor exit nodes to avoid rate limits?

No one, but I would have no problem with that being illegal too.

topranks 20 hours ago

VPN ranges at least are obvious so that’s different.

Tor less so, but it doesn’t seem to be commonly used for this kind of abuse.

bigfishrunning a day ago

This is why I don't run a Tor endpoint: possibly objectionable traffic that I don't control would be sourced from my network. All it takes is one horrible request to come from your IP and you're on a list.

thephyber a day ago

Perhaps.

But if these are popular apps / APIs, then the number of affected households is significant. Authorities / investigators will have to treat IPs as likely proxies rather than as the location of the human initiating the request.