tommek4077 4 hours ago

How do they get overloaded? Is the website too slow? I have quite a big wiki online and barely see any impact from bots.

stinky613 4 hours ago | parent | next [-]

A year or two ago I personally encountered scraping bots that crawled every possible resultant page from a given starting point. So if a bot scraped a search results page, it would also scrape every single distinct combination of facets on that search (including nonsensical combinations, e.g. "products where weight < 2 lbs AND weight > 2 lbs").

We ended up having to block entire ASNs and several subnets (lots from Facebook IPs, interestingly)
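
For anyone needing to do something similar, here is a minimal sketch (not necessarily what we ran) of one way to turn an ASN into blocklist entries: expand the ASN into its currently announced prefixes via RIPEstat's announced-prefixes endpoint and print nginx-style deny rules. AS32934 (Facebook/Meta) is just the example here, and the exact response shape is assumed from RIPEstat's public API docs.

    # Sketch: expand an ASN into announced prefixes and emit nginx-style deny rules.
    # The response shape ("data" -> "prefixes" -> "prefix") is assumed from RIPEstat's
    # announced-prefixes data call; adjust if the API differs.
    import json
    import urllib.request

    def deny_rules_for_asn(asn: str) -> list[str]:
        url = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource={asn}"
        with urllib.request.urlopen(url, timeout=30) as resp:
            data = json.load(resp)
        prefixes = {p["prefix"] for p in data["data"]["prefixes"]}
        return [f"deny {prefix};" for prefix in sorted(prefixes)]

    if __name__ == "__main__":
        # AS32934 is Facebook/Meta's main ASN, used here purely as an example.
        print("\n".join(deny_rules_for_asn("AS32934")))

The output can be dropped into an nginx include file; the same prefix list works equally well for an ipset or firewall rule set.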

chao- 3 hours ago | parent [-]

I have encountered this same issue with faceted search results and individual inventory listings.

switz 4 hours ago | parent | prev | next [-]

If you have a lot of pages, AI bots will scrape every single one on a loop; wikis generally don't have anywhere near as many pages as a site whose page count grows with an auto-incrementing entity primary ID. I have a few million pages on a tiny website and it gets hammered by AI bots all day long. I can handle it, but it's a nuisance and they're basically just scraping garbage (statistics pages of historical matches, or user pages that have essentially no content).

Many of them don't even self-identify and end up scraping with shrouded user-agents or via bot-farms. I've had to block entire ASNs just to tone it down. It also hurts good-faith actors who genuinely want to build on top of our APIs because I have to block some cloud providers.

I would guess that I'm getting anywhere from 10-25 AI bot requests (maybe more) per real user request - and at scale that ends up being quite a lot. I route bot traffic to separate pods just so it doesn't hinder my real users' experience[0]. Keep in mind that they're hitting deeply cold links so caching doesn't do a whole lot here.

[0] this was more of a fun experiment than anything explicitly necessary, but it's proven useful in ways I didn't anticipate
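
For the curious, here's a rough sketch of the kind of user-agent classification such a split can hang off of. The patterns and upstream names below are illustrative, not my actual rules, and in practice you'd combine this with IP/ASN checks, since many scrapers spoof browser user-agents.

    # Sketch of request classification for a bot/human traffic split.
    # The user-agent substrings and upstream names are assumptions for illustration;
    # the actual routing to separate pods lives in the proxy/ingress layer.
    import re

    BOT_UA_PATTERN = re.compile(
        r"(GPTBot|ClaudeBot|CCBot|Bytespider|Amazonbot|PerplexityBot|"
        r"bot|crawler|spider)",
        re.IGNORECASE,
    )

    UPSTREAMS = {
        "human": "app-pods.internal:8080",   # hypothetical pool names
        "bot": "bot-pods.internal:8080",
    }

    def pick_upstream(user_agent: str) -> str:
        """Return which backend pool a request should be proxied to."""
        if not user_agent or BOT_UA_PATTERN.search(user_agent):
            return UPSTREAMS["bot"]
        return UPSTREAMS["human"]

    # pick_upstream("Mozilla/5.0 ... GPTBot/1.0")        -> "bot-pods.internal:8080"
    # pick_upstream("Mozilla/5.0 (Windows NT 10.0) ...") -> "app-pods.internal:8080"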

tommek4077 3 hours ago | parent [-]

How many requests per second do you get? I also see a lot of bot traffic, but nowhere near enough to hit the servers significantly, and I render most stuff on the server directly.

blell 4 hours ago | parent | prev | next [-]

In these discussions no one will admit this, but the answer is generally yes: websites written in Python and the like.

Qwertious 2 hours ago | parent | next [-]

It's not "written too slow" if you e.g. only get 50 users a week, though. If bots add so much load that you need to go optimise your website for them, then that's a bot problem not a website problem.

tclancy 4 hours ago | parent | prev [-]

Yes, yes, it's definitely that people don't know what they're doing, and not that they're operating at a scale or on a problem you are not. MetaBrainz cannot cache all of these links, as most of them are hardly ever hit. Try to assume good intent.

tommek4077 3 hours ago | parent [-]

But serving HTML is unbelievably cheap, isn't it?

chlorion an hour ago | parent [-]

It adds up very quickly.

roblh 4 hours ago | parent | prev | next [-]

There are a lot of factors: how well your content lends itself to being cached by a CDN, the tech you (or your predecessors) chose to build it with, and how many unique pages you have. Even with pretty aggressive caching, having a couple million pages indexed adds up real fast, especially if you weren't fortunate enough to inherit a project using a framework that makes server-side rendering easy.
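
A quick back-of-the-envelope (with purely made-up numbers, not anyone's real traffic) shows why a long tail of pages defeats caching:

    # Illustrative numbers only: why millions of rarely-hit pages mean low cache hit rates.
    total_pages = 2_000_000
    cached_pages = 50_000                      # hypothetical CDN/cache working set
    hit_rate = cached_pages / total_pages      # ~2.5% of bot requests served from cache
    render_ms = 150                            # hypothetical server render time per page
    bot_rps = 50                               # hypothetical bot request rate
    origin_rps = bot_rps * (1 - hit_rate)      # ~48.75 requests/s still reach the app
    print(f"cache hit rate: {hit_rate:.1%}, origin load: {origin_rps:.1f} req/s, "
          f"~{origin_rps * render_ms / 1000:.1f} render-seconds of CPU per second")

With a uniform crawl over that many URLs, nearly every bot request is a cache miss, so the origin eats most of the load no matter how good the CDN is.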

kpcyrd 4 hours ago | parent | prev [-]

The API seems to be written in Perl: https://github.com/metabrainz/musicbrainz-server

jjgreen 4 hours ago | parent [-]

Time for a vinyl-style Perl revival ...