| ▲ | tensor 9 hours ago |
| Not on topic, but wow the internet has very quickly devolved into: click -> "making sure you're not a bot", click -> "making sure you're a human", click -> "COOKIES COOKIES COOKIES", click -> "cloudflare something something" |
|
| ▲ | thresh 8 hours ago | parent | next [-] |
| We had to set it up on parts of the VideoLAN infra so the service would remain usable. Otherwise it was under constant DDoS by the AI bots. |
| |
| ▲ | nijave 7 hours ago | parent | next [-] | | While I do sympathize with the AI DDoS situation, it'd be nice if there were a solution that lets them work so they can pull official docs. For instance: MCP, static sites that are easy to scale, or a cache in front of a dynamic site engine. | | |
| ▲ | thresh 6 hours ago | parent [-] | | Of course, static websites are the best solution to that problem. Our documentation and main website are not fronted by this protection, so they're still accessible to the scrapers. |
| |
| ▲ | hectormalot 7 hours ago | parent | prev | next [-] | | Maybe I’m naive about this, but I didn’t expect AI scrapers to be that big of a load. I mean, it’s not as if they need to scrape the same site at 1000+ QPS, and even then I wouldn’t expect them to download all the media and images either. What am I missing that explains the gap between this and a “constant DDoS” of the site? | | |
| ▲ | thresh 6 hours ago | parent | next [-] | | You can't really cache the dynamic content produced by forges like GitLab and, say, web forums like phpBB. So every request goes through the slow path. Media/JS is of course cached on the edge, so that's not an issue. Even though the amount of AI requests isn't that high - generally in the hundreds per second, tops, for our services combined - that's still a load that causes issues for legitimate users/developers. We've seen it grow from somewhat reasonable to pretty much 99% of the responses we serve. Can it be solved by throwing more hardware at the problem? Sure. But it's not sustainable, and the reasonable approach in our case is to filter off the parasitic traffic. | | |
| ▲ | hectormalot 2 hours ago | parent | next [-] | | Thanks, appreciate the details. 99% is far above what I expected, and if it specifically hits hard-to-cache data then I can see how that brings a system to its knees. | |
| ▲ | fragmede 5 hours ago | parent | prev [-] | | You kind of can, though. You serve cached assets and then use JavaScript to modify the page for the individual user. The specific user actions can't be cached, but the rest can. | | |
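A minimal sketch of that cache-the-shell, personalize-client-side pattern (hypothetical markup, placeholder names, and field names - not any project's actual code). In a browser this would run against the DOM after load; here it is written as a pure string transform so the idea stands alone:

```javascript
// Hypothetical sketch: the page shell is identical for every visitor and
// can be cached at the edge; only small per-user fragments are filled in
// client-side, so the origin never serves the full page on the slow path.
function personalize(cachedShell, user) {
  return cachedShell
    .replace('{{username}}', user.name)          // per-user greeting
    .replace('{{unread}}', String(user.unread)); // per-user counter
}

// The same cached shell is served to everyone...
const shell = '<p>Hello {{username}}, you have {{unread}} unread posts.</p>';
// ...and personalized per user on the client.
const page = personalize(shell, { name: 'alice', unread: 3 });
```

The origin then only answers one tiny per-user request (the `user` object) instead of rendering the whole forum page per hit.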
| ▲ | davidron 2 hours ago | parent | next [-] | | Totally. Remember that Slashdot in the 1990s served a dynamic page from a handful of servers - with horsepower dwarfed by a Nintendo Switch - to a user base capable of bringing major properties down. | |
| ▲ | Avamander 2 hours ago | parent | prev [-] | | The "can't" comes from the fact that VLC is not going to rewrite their forum software or software forge. Software written in PHP is in most cases frankly still abysmally slow and inefficient. WordPress runs something like 70% of the web and you can really feel it from the 1500ms+ TTFB most sites have. phpBB is not much better: pathetic throughput at best, and it hasn't gotten better in decades. I don't know how GitLab became so disgustingly slow. But yeah, I'm not surprised bots can easily bring it to its knees. |
|
| |
| ▲ | nijave 7 hours ago | parent | prev | next [-] | | I think there are a few things at play here:
- AI scrapers will pull a bunch of docs from many sites in parallel (so instead of a human request where someone picks a single Google result, it hits a bunch of sites)
- AI will crawl the site looking for the correct answer, which may hit a handful of pages
- AI sends requests in quick succession (big bursts instead of a small trickle over a longer time)
- Personal assistants may crawl the site repeatedly, scraping everything (we saw a fair bit of this at work; they announced themselves with user agents)
- At work (B2B SaaS webapp) we also found that the personal-assistant variety tended to hammer really computationally expensive data export and reporting endpoints, generally without filters. While our app technically supported it, it was very inorganic traffic.
That said, I don't think the solution is blanket blocks. Really it's exposing that sites are poorly optimized for emerging technology. | |
| ▲ | Y-bar 7 hours ago | parent | prev | next [-] | | They are a scourge: they never rate-limit themselves, there are a hundred of them, and a significant number don’t respect robots.txt. Many of them also end up on our meta no-index,no-follow search pages, leading to cost overruns on our Algolia usage. We spend way more time adjusting WAF rules and other bot controls than we should. | |
| ▲ | eipi10_hn 3 hours ago | parent | prev | next [-] | | Yes, it's that BIG of a load: https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/ | | |
| ▲ | hectormalot 2 hours ago | parent [-] | | Thanks. I imagine there is (a) a lot of interest in scraping source code, and (b) many requests to forges hitting expensive paths. 99% of volume though, wow - much more than expected. |
| |
| ▲ | 7 hours ago | parent | prev [-] | | [deleted] |
| |
| ▲ | stefantalpalaru 4 hours ago | parent | prev | next [-] | | [dead] | |
| ▲ | nerdralph 7 hours ago | parent | prev [-] | | I highly doubt there is no other technically feasible option to block the AI bots.
You end up blocking not just bots, but many humans too. When I clicked on the link and the bot block came up, I just clicked back.
I think HN posts should have warnings when the site blocks you from seeing it until you somehow, maybe, prove you are human. | | |
| ▲ | goobatrooba 7 hours ago | parent | next [-] | | I'm sure there are many solutions for many problems, but expecting a small FOSS development team to know or implement them all is rather unreasonable. I think the world gains more if the VideoLAN team focuses on their amazing, free contribution to the world than if they spend the same time trying to figure out how to save you two clicks. We all hate that this is happening, but you don't need to attack everyone who is unfortunately caught up in it. | |
| ▲ | overfeed 7 hours ago | parent | prev | next [-] | | > I highly doubt there is no other technically feasible option to block the AI bots. If you have discovered such an option, you could get very wealthy: minimizing friction for humans in e-commerce is valuable. If you're a drive-by critic not vested in the project, then yours is an instance of talk being cheap. | |
| ▲ | thresh 7 hours ago | parent | prev [-] | | I'm all ears on how we can fix it otherwise. Keep in mind that those kinds of services:
- should not be MITMed by CDNs
- are generally run by volunteers with zero budget, money- and time-wise | | |
| ▲ | nerdralph 5 hours ago | parent [-] | | First off, don't block the first connection of the day from a given IP. Rate-limit/block from there, for example the way sshguard does it. I've seen several posts on HN and elsewhere showing that many bots can be fingerprinted and blocked based on HTTP headers and TLS. For the bots that perfectly match the fingerprint of an interactive browser and don't trigger rate limits, use hidden links to tarpits and zip bombs. Many of these have been discussed on HN. Here's the first one that came to memory:
https://news.ycombinator.com/item?id=42725147 |
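The "don't block the first connection, rate-limit from there" idea could be sketched roughly like this - a hypothetical in-memory sliding-window limiter, not sshguard's actual algorithm:

```javascript
// Hypothetical sliding-window limiter: the first request from any IP is
// always allowed; an IP only starts getting blocked once it exceeds
// `limit` requests within the last `windowMs` milliseconds.
function makeLimiter(limit, windowMs) {
  const hits = new Map(); // ip -> timestamps of recent requests
  return function allow(ip, now = Date.now()) {
    // Keep only timestamps inside the window, then record this request.
    const recent = (hits.get(ip) || []).filter(t => now - t < windowMs);
    recent.push(now);
    hits.set(ip, recent);
    return recent.length <= limit; // true = serve, false = block
  };
}

const allow = makeLimiter(3, 1000); // at most 3 requests/second per IP
// A scraper burst from one IP trips the limit on the 4th hit...
const burst = [0, 100, 200, 300].map(t => allow('203.0.113.9', t));
// ...while an unrelated visitor's first request still goes through.
const firstTimer = allow('198.51.100.7', 300);
```

A human clicking around normally stays under the threshold, so they never see a challenge; only sustained bursts get filtered.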
|
|
|
|
| ▲ | port11 9 hours ago | parent | prev | next [-] |
| The internet is such a Tragedy of the Commons… its citizens that act selfishly and in bad faith will slowly make it unusable. |
| |
| ▲ | codedokode 8 hours ago | parent | next [-] | | No, it is because citizens allow themselves to be treated like this. | |
| ▲ | esseph 8 hours ago | parent | prev | next [-] | | > its citizens that act selfishly and in bad faith will slowly make it unusable It's rarely been the citizens that have been the problem, but the governments and companies that seek to use the network connection for their overwhelming benefit. Re (above): > Not on topic, but wow the internet has very quickly devolved into: click -> "making sure you're not a bot", click -> "making sure you're a human", click -> "COOKIES COOKIES COOKIES", click -> "cloudflare something something" | | |
| ▲ | fastball 7 hours ago | parent [-] | | wat. The protections in place that the OP is talking about are almost entirely due to (not government and company) bad actors. |
| |
| ▲ | honktime 8 hours ago | parent | prev [-] | | It's pretty explicitly not a tragedy of the commons. It's a tragedy of the ruling class abusing the resources of the 'commons' to extract value. There is nothing 'commons' about trillion-dollar companies extracting all available value from the labor of the working class. That's just the tragedy that'll bring about the death of society, the same tragedy that brings all other tragedies. | |
| ▲ | throw-the-towel 8 hours ago | parent | next [-] | | The commons in question is the internet itself. | |
| ▲ | amusingimpala75 8 hours ago | parent | prev | next [-] | | Thank you for describing the tragedy of the commons | | | |
| ▲ | dyauspitr 8 hours ago | parent | prev [-] | | There’s definitely lots of problems with the ruling class and wealth disparity. Perhaps the defining problems of our current age. That being said, so many of the plebs suck. Like 2% will ruin everything for everyone. | | |
| ▲ | throw-the-towel 8 hours ago | parent [-] | | While a lot of the plebs do suck, a pleb who sucks causes far fewer problems than a big corp that sucks, simply by virtue of not having as many resources. | |
| ▲ | dyauspitr 6 hours ago | parent [-] | | I agree. But whether you agree with me or not, most paradigm-shifting changes come from billionaires/corps because they are the only ones with the money to pull off massive shifts. Most innovation is not grassroots; it's heavily funded by the “elites”. This is how most successful countries have operated for at least the last 100 years. So billionaires add a lot of value even as they cause a lot of pain. The solution in my mind is that we absolutely need uncapped billionaires, but they need to be effectively taxed (not like 90%, but closer to 50%) and they have to have absolutely no influence on the government. |
|
|
|
|
|
| ▲ | notenlish 7 hours ago | parent | prev | next [-] |
| Nearly every single website I'm not logged into these days wants me to "confirm I'm not a bot". It is incredibly annoying, but what can you do? AI scrapers ruined the web. |
|
| ▲ | pixelpoet 7 hours ago | parent | prev | next [-] |
| No one's even clicking anymore, everything implores me to tap or swipe these days, and everything is optimised for humans with one eye above the other. Then I press the X to close the all-caps banner commanding me to install the app, upon which I get sent to the app store. Users of the website refer to it as an app. |
|
| ▲ | rayiner 8 hours ago | parent | prev | next [-] |
| Wow I’m glad it’s not just me. I thought my IP block had gotten caught up in some known spamming or something. |
|
| ▲ | tomwheeler 6 hours ago | parent | prev | next [-] |
| At least this one was significantly faster than Cloudflare and required no action on my part. |
|
| ▲ | tosti 9 hours ago | parent | prev | next [-] |
| I get exactly none of that. Is your adblocker still working? |
| |
|
| ▲ | oybng 8 hours ago | parent | prev [-] |
| renders your gigabit connection pointless |