tptacek 6 days ago

Respectfully, I think it's you missing the point here. None of this is to say you shouldn't use Anubis, but Tavis Ormandy is offering a computer science critique of how it purports to function. You don't have to care about computer science in this instance! But you can't dismiss it because it's computer science.

Consider:

An adaptive password hash like bcrypt or Argon2 uses a work function to apply asymmetric costs to adversaries (attackers who don't know the real password). Both users and attackers have to apply the work function, but the user gets ~constant value for it (they know the password, so to a first approx. they only have to call it once). Attackers have to iterate the function, potentially indefinitely, in the limit obtaining 0 reward for infinite cost.
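
To make that asymmetry concrete, a rough sketch (stdlib PBKDF2 standing in for bcrypt/Argon2; the iteration count and the guess list are arbitrary):

    import hashlib, os

    def slow_hash(password: bytes, salt: bytes) -> bytes:
        # Iterated hashing is the work function; the iteration count sets its cost.
        return hashlib.pbkdf2_hmac("sha256", password, salt, 200_000)

    salt = os.urandom(16)
    stored = slow_hash(b"correct horse", salt)

    # Legitimate user: knows the password, so they pay the work function once per login.
    assert slow_hash(b"correct horse", salt) == stored

    # Attacker: pays the same work function once per guess, with no guarantee of ever
    # getting a hit; cost grows with the guess list, reward may be zero.
    for guess in (b"123456", b"password", b"letmein"):
        slow_hash(guess, salt)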

A blockchain cryptocurrency uses a work function principally as a synchronization mechanism. The work function itself doesn't have a meaningfully separate adversary. Everyone obtains the same value (the expected value of attempting to solve the next round of the block commitment puzzle) for each application of the work function. And note in this scenario most of the value returned from the work function goes to a small, centralized group of highly-capitalized specialists.

A proof-of-work-based antiabuse system wants to function the way a password hash functions. You want to define an adversary and then find a way to incur asymmetric costs on them, so that the adversary gets minimal value compared to legitimate users.

And this is in fact how proof-of-work-based antispam systems function: the value of sending a single spam message is so low that the EV of applying the work function is negative.
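
Back-of-the-envelope, with entirely made-up numbers for the stamp cost and the spam payout:

    cpu_cost_per_stamp = 0.00002  # dollars of CPU/electricity per proof-of-work stamp (made up)
    value_per_spam     = 0.00001  # expected return per spam message sent (made up)
    value_per_legit    = 1.0      # value of an email you actually wanted to send (made up)

    # Spammer: expected value per message is negative, and volume only makes it worse.
    print((value_per_spam - cpu_cost_per_stamp) * 1_000_000)  # about -10 dollars per million messages

    # Legitimate sender: pays the same stamp a handful of times a day and barely notices.
    print(value_per_legit - cpu_cost_per_stamp)               # still about 1.0 per message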

But here we're talking about a system where legitimate users (human browsers) and scrapers get the same value for every application of the work function. The cost:value ratio is unchanged; it's just that everything is more expensive for everybody. You're getting the worst of both worlds: user-visible costs and a system that favors large centralized well-capitalized clients.

There are antiabuse systems that do incur asymmetric costs on automated users. Youtube had (has?) one. Rather than simply attaching a constant extra cost for every request, it instead delivered a VM (through JS) to browsers, and programs for that VM. The VM and its programs were deliberately hard to reverse, and changed regularly. Part of their purpose was to verify, through a bunch of fussy side channels, that they were actually running on real browsers. Every time Youtube changed the VM, the bots had to do large amounts of new reversing work to keep up, but normal users didn't.

This is also how the Blu-Ray BD+ system worked.

The term of art for these systems is "content protection", which is what I think Anubis actually wants to be, but really isn't (yet?).

The problem with "this is good because none of the scrapers even bother to do this POW yet" is that you don't need an annoying POW to get that value! You could just write a mildly complicated Javascript function, or do an automated captcha.

sugarpimpdorsey 6 days ago | parent | next [-]

A lot of these passive anti-abuse systems rely on the rather bold assumption that making a client perform a computation is expensive for a bot but not for me as an ordinary user.

According to whom or what data exactly?

AI operators are clearly well-funded operations, and to them the cost in electricity and CPU time is negligible. Software like Anubis, and nearly all of its near-identical predecessors, grants you access after a single "proof", so you then have free rein to scrape the whole site.

The best physical analogy is those shopping carts where you have to insert a quarter to unlock one, and you presumably get the quarter back when you return the cart.

The group of people this doesn't affect is the well-funded: a quarter is a small price to pay for leaving your cart in the middle of the parking lot.

Those who suffer the most are the ones who can't find a quarter in the cupholder and end up filling their arms with groceries.

Would you be richer if they didn't charge you a quarter? (For these anti-bot tools you're paying the electric company, not the site owner.) Maybe. But if you're Scrooge McDuck, who's counting?

tptacek 6 days ago | parent | next [-]

Right, that's the point of the article. If you can tune asymmetric costs on bots/scrapers, it doesn't matter: you can drive bot costs to infinity without doing so for users. But if everyone's on a level playing field, POW is problematic.

account42 6 days ago | parent | prev | next [-]

I like your example because quarters for shopping carts are not universal. Some societies have either accepted shopping cart shrinkage as a cost of doing business or have found better ways to deter it.

Almondsetat 6 days ago | parent | prev [-]

Scrapers are orders of magnitude faster than humans at browsing websites. If the challenge takes 1 second but a human stays on the page for 3 minutes, it's negligible. But if the challenge takes 1 second and the scraper does its job in 5 seconds, you already have a 20% slowdown.
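
Spelling that arithmetic out:

    challenge_seconds = 1.0
    human_dwell_seconds = 180.0  # a human reads the page for ~3 minutes
    scraper_dwell_seconds = 5.0  # a scraper grabs the page and moves on

    print(challenge_seconds / human_dwell_seconds)    # ~0.006, well under 1% overhead
    print(challenge_seconds / scraper_dwell_seconds)  # 0.2, the 20% slowdown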

mewpmewp2 6 days ago | parent | next [-]

By that logic you could just make your website in general load slower to make scraping harder.

Almondsetat 6 days ago | parent [-]

No, because in this case there are cookies involved. If the scraper accepts cookies then it's trivial to detect it and block it. If it doesn't, it will have to solve the challenge every single time.
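
One plausible shape for that detection (just a sketch of a per-cookie rate limit, not anything Anubis is known to ship):

    import time
    from collections import defaultdict

    REQUESTS_PER_MINUTE_LIMIT = 60  # made-up threshold: no human clicks this fast

    seen = defaultdict(list)  # challenge-cookie id -> recent request timestamps

    def allow(cookie_id: str, now: float | None = None) -> bool:
        # Sliding one-minute window, counted per issued challenge cookie.
        now = time.time() if now is None else now
        window = [t for t in seen[cookie_id] if now - t < 60]
        window.append(now)
        seen[cookie_id] = window
        return len(window) <= REQUESTS_PER_MINUTE_LIMIT

A scraper that keeps the cookie trips the limit quickly; one that throws it away pays the proof-of-work again on every request.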

rfoo 6 days ago | parent | prev | next [-]

Scrapers do not care about a 20% slowdown. All they care about is being able to scale up, and this does not block any scale-up attempt.

xena 6 days ago | parent | prev | next [-]

For what it's worth, kernel.org seems to be running an old version of Anubis that predates the current challenge generation method. Previously it took information about the user request, hashed it, and then relied on that hash being deterministic to avoid having to store state. This didn't scale and was prone to issues like the one in the OP.

The modern version of Anubis as of PR https://github.com/TecharoHQ/anubis/pull/749 uses a different flow. Minting a challenge generates state including 64 bytes of random data. This random data is sent to the client and used on the server side in order to validate challenge solutions.
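
Roughly, that flow looks like this (a sketch, not the actual Anubis code; the SHA-256 leading-zero check stands in for the real difficulty check):

    import hashlib, os, secrets

    DIFFICULTY = 4                     # hex digits of leading zeros a solution needs (illustrative)
    challenges: dict[str, bytes] = {}  # server-side state, keyed by challenge id

    def mint_challenge() -> tuple[str, str]:
        cid = secrets.token_hex(16)
        data = os.urandom(64)          # the 64 bytes of per-challenge random state
        challenges[cid] = data
        return cid, data.hex()         # both go to the client

    def validate(cid: str, nonce: str) -> bool:
        data = challenges.pop(cid, None)  # unknown or already-used ids fail
        if data is None:
            return False
        digest = hashlib.sha256(data + nonce.encode()).hexdigest()
        return digest.startswith("0" * DIFFICULTY)

    # Client side: grind nonces until the digest clears the difficulty bar.
    cid, hex_data = mint_challenge()
    data = bytes.fromhex(hex_data)
    nonce = 0
    while not hashlib.sha256(data + str(nonce).encode()).hexdigest().startswith("0" * DIFFICULTY):
        nonce += 1
    print(validate(cid, str(nonce)))  # True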

The core problem here is that kernel.org isn't upgrading their version of Anubis as it's released. I suspect this means they're also vulnerable to GHSA-jhjj-2g64-px7c.

account42 6 days ago | parent | next [-]

OP is a real human user trying to make your DRM work with their system. That you consider this to be an "issue" that should be fixed says a lot.

tptacek 6 days ago | parent | prev [-]

Right, I get that. I'm just saying that over the long term, you're going to have to find asymmetric costs to apply to scrapers, or it's not going to work. I'm not criticizing any specific implementation detail of your current system. It's good to have a place to take it!

I think that's the valuable observation in this post. Tavis can tell me I'm wrong. :)

landhar 6 days ago | parent | prev | next [-]

> But here we're talking about a system where legitimate users (human browsers) and scrapers get the same value for every application of the work function. The cost:value ratio is unchanged; it's just that everything is more expensive for everybody. You're getting the worst of both worlds: user-visible costs and a system that favors large centralized well-capitalized clients.

Based on my own experience fighting these AI scrapers, I feel that the way they are actually implemented means that, in practice, there is an asymmetry between the work the scrapers have to do and the work humans do.

The pattern these scrapers follow is that they are highly distributed. I’ll see a given {ip, UA} pair make a request to /foo, immediately followed by _hundreds_ of requests from completely different {ip, UA} pairs to all the links on that page (i.e. /foo/a, /foo/b, /foo/c, etc.).

This is a big part of what makes these AI crawlers such a challenge for us admins. There isn’t a whole lot we can do to apply regular rate limiting techniques: the IPs are always changing and are no longer limited to corporate ASNs (I’m now seeing IPs belonging to consumer ISPs and even cell phone companies), and the User Agents all look genuine. But when looking through the logs you can see that all these seemingly unrelated requests are actually working together to perform a BFS traversal of your site.

Given this pattern, I believe that’s what makes the Anubis approach actually work well in practice. A given user encounters the challenge once, when accessing the site for the first time, and can then navigate through it without incurring any further cost, while the AI scrapers need to solve the challenge for every single one of their “nodes” (or whatever they would call their {ip, UA} pairs). From a site reliability perspective, I don’t even care whether the crawlers manage to solve the challenge or not. That it slows them down enough to rate limit them as a network is enough.
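
Back-of-the-envelope, with a made-up fleet size:

    challenge_seconds = 1.0
    nodes = 10_000  # {ip, UA} pairs the crawler rotates through (made up)

    # A human solves the challenge once and keeps the cookie for the rest of the visit.
    human_overhead = challenge_seconds * 1

    # The crawler pays the challenge once per node, or once per request if it refuses cookies.
    botnet_overhead = challenge_seconds * nodes

    print(human_overhead, botnet_overhead)  # 1.0 vs 10000.0 seconds of extra work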

To be clear: I don’t disagree with you that the cost incurred by regular human users is still high. But I don’t think it’s fair to say that the cost to the adversary isn’t asymmetric. It wouldn’t be if the AI crawlers hadn’t converged on an implementation that behaves like a DDoS botnet.

akoboldfrying 6 days ago | parent | prev | next [-]

The (almost only?) distinguishing factor between genuine users and bots is the total volume of requests, but this can still be used for asymmetric costs. If botPain > botPainThreshold and humanPain < humanPainThreshold then Anubis is working as intended. A key point is that those inequalities look different at the next level of detail. A very rough model might be:

botPain = nBotRequests * cpuWorkPerRequest * dollarsPerCpuSecond

humanPain = c_1 * max(elapsedTimePerRequest) + c_2 * avg(elapsedTimePerRequest)
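
Plugging made-up numbers into that model, just to show its shape (collapsing nBotRequests * cpuWorkPerRequest into a single total):

    dollars_per_cpu_second = 0.00001
    c1, c2 = 10.0, 1.0  # c1 >> c2: the worst single page load dominates human annoyance

    def bot_pain(total_cpu_seconds: float) -> float:
        # Bots only care about the total CPU they burn across all requests.
        return total_cpu_seconds * dollars_per_cpu_second

    def human_pain(per_page_seconds: list) -> float:
        return c1 * max(per_page_seconds) + c2 * sum(per_page_seconds) / len(per_page_seconds)

    # The same 5 CPU-seconds charged over a 20-page session, packaged two ways:
    one_big_check   = [5.0] + [0.0] * 19  # today's style: one heavy check up front
    many_small_ones = [0.25] * 20         # option 1: frequent but light checks

    print(bot_pain(5.0), human_pain(one_big_check))    # same bot pain, human pain ~50.3
    print(bot_pain(5.0), human_pain(many_small_ones))  # same bot pain, human pain ~2.8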

The article points out that the botPain Anubis currently generates is unfortunately much too low to hit any realistic threshold. But if the cost model I've suggested above is in any way realistic, then useful improvements would include:

1. More frequent but less taxing computation demands (this assumes c_1 >> c_2)

2. Parallel computation (this improves the human experience with no effect for bots)

ETA: Concretely, regarding (1), I would tolerate 500ms lag on every page load (meaning forget about the 7-day cookie), and wouldn't notice 250ms.

tptacek 6 days ago | parent [-]

That's exactly what I'm saying isn't happening: the user pays some cost C per article, and the bot pays exactly the same cost C. Both obtain the same reward. That's not how Hashcash works.

akoboldfrying 6 days ago | parent [-]

I'm saying your notion of "the same cost" is off. They pay the same total CPU cost, but that isn't the actual perceived cost in each case.

tptacek 6 days ago | parent [-]

Can you flesh that out more? In the case of AI scrapers it seems especially clear: the model companies just want tokens, and are paying a (one-time) cost of C for N tokens.

Again, with Hashcash, this isn't how it works: most outbound spam messages are worthless. The point of the system is to exploit the negative exponent on the attacker's value function.

remexre 6 days ago | parent | next [-]

The scraper breaking every time a new version of Anubis is deployed, until new anti-Anubis features are implemented, is the point; if the scrapers were well-engineered by a team that cared about the individual sites they're scraping, they probably wouldn't be so pathological towards forges.

The human-labor cost of working around Anubis is unlikely to be paid unless it affects enough data to be worth dedicating time to, and the data they're trying to scrape can typically be obtained "respectfully" in those cases -- instead of hitting the git blame route on every file of every commit of every repo, just clone the repos and run it locally, etc.

tptacek 6 days ago | parent [-]

Sure, but if that's the case, you don't need the POW, which is what bugs people about this design. I'm not objecting to the idea of anti-bot content protection on websites.

akoboldfrying 6 days ago | parent | prev [-]

Perhaps I caused confusion by writing "If botPain > botPainThreshold and humanPain < humanPainThreshold then Anubis is working as intended", as I'm not actually disputing that Anubis is currently ineffective against bots. (The article makes that point and I agree with it.) I'm arguing against what I take to be your stronger claim, namely that no "Anubis-like" countermeasure (meaning no countermeasure that charges each request the same amount of CPU in expectation) can work.

I claim that the cost for the two classes of user are meaningfully different: bots care exclusively about the total CPU usage, while humans care about some subjective combination of average and worst-case elapsed times on page loads. Because the sheer number of requests done by bots is so much higher, there's an opportunity to hurt them disproportionately according to their cost model by tweaking Anubis to increase the frequency of checks but decrease each check's elapsed time below the threshold of human annoyance.

seba_dos1 6 days ago | parent | prev | next [-]

> The term of art for these systems is "content protection", which is what I think Anubis actually wants to be, but really isn't (yet?).

No, that's missing the point. Anubis is effectively a DDoS protection system; all the talk about AI bots comes from the fact that the latest wave of DDoS attacks was initiated by AI scrapers, whether intentionally or not.

If these bots cloned git repos instead of unleashing hordes of the dumbest bots on Earth pretending to be thousands and thousands of users browsing through the git blame web UI, there would be no need for Anubis.

tptacek 6 days ago | parent [-]

I'm not moralizing, I'm talking about whether it can work. If it's your site, you don't need to justify putting anything in front of it.

seba_dos1 6 days ago | parent [-]

Did you accidentally reply to a wrong comment? (not trying to be snarky, just confused)

The only "justification" there would be is that it keeps a server online that struggled under load before deploying it. That's the whole reason major FLOSS projects and code forges have deployed Anubis. Nobody cares about bots downloading FLOSS code or kernel mailing list archives; they care about keeping their infrastructure running and whether it's being DDoSed or not.

tptacek 6 days ago | parent [-]

I just said you didn't have to justify it. I don't care why you run it. Run whatever you want. The point of the post is that regardless of your reasons for running it, it's unlikely to work in the long run.

seba_dos1 6 days ago | parent [-]

And what I said is that the most visible deployments of Anubis weren't made to get a content protection system of any kind, so it doesn't have to work that way at all for them. As long as the server doesn't struggle under load anymore after deploying Anubis, it's a win - and so far it works.

(and frankly, it likely will only need to work until the bubble bursts, making "the long run" irrelevant)

rfoo 6 days ago | parent [-]

> and frankly, it likely will only need to work until the bubble bursts, making "the long run" irrelevant

Now I get why people are being so weirdly dismissive about the whole thing. Good luck, it's not going to "burst" any time soon.

Or rather, a "burst" would not change the world in the direction you want it to be.

seba_dos1 6 days ago | parent [-]

Not exactly sure what you're talking about. The problem is caused by tons of shitty companies cutting corners to collect training data as fast as possible, fueled by easy money that you get by putting "AI" somewhere in your company's name.

As soon as the investment boom is over, this will be largely gone. LLMs will continue to be trained and data will continue to be scraped, but that alone isn't the problem. Search engine crawlers somehow manage not to DDoS the servers they pull data from; competent AI scrapers can do the same. In fact, a competent AI scraper wouldn't even be stopped by Anubis as it is right now, and yet Anubis works pretty well in practice. Go figure.

account42 6 days ago | parent | prev [-]

> There are antiabuse systems that do incur asymmetric costs on automated users. Youtube had (has?) one. Rather than simply attaching a constant extra cost for every request, it instead delivered a VM (through JS) to browsers, and programs for that VM. The VM and its programs were deliberately hard to reverse, and changed regularly. Part of their purpose was to verify, through a bunch of fussy side channels, that they were actually running on real browsers. Every time Youtube changed the VM, the bots had to do large amounts of new reversing work to keep up, but normal users didn't.

That depends on what you count as normal users, though. Users who want to use alternative players also have to deal with this, and since yt-dlp (and youtube-dl before it) have been able to provide a solution for those users, and bots can just do the same, I'm not sure I'd call the scheme successful in any way.