thethingundone 20 hours ago

I own a forum which currently has 23k online users, all of them bots. The last new post in that forum is from _2019_. Its topic is also very niche. Why are so many bots there? This site should have basically been scraped a million times by now, yet those bots seem to fetch the stuff live, on the fly? I don’t get it.

sethops1 19 hours ago | parent | next [-]

I have a site with a complete and accurate sitemap.xml describing when its ~6k pages were last updated (on average, maybe weekly or monthly). What do the bots do? They scrape every page continuously, 24/7, because of course they do. The amount of waste going into this AI craze is just obscene. It's not even good content.
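
For contrast, here's roughly what a well-behaved crawler could do with that sitemap: a minimal sketch in Python (the sitemap URL in the usage line is a placeholder) that parses <lastmod> and refetches only pages changed since the last visit, instead of hammering all ~6k pages around the clock.

    # Sketch: list only the sitemap URLs whose <lastmod> is newer than the
    # previous crawl. Assumes W3C-style dates (YYYY-MM-DD, optionally with
    # a time part, which we ignore).
    import xml.etree.ElementTree as ET
    from datetime import date
    import requests

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def urls_changed_since(sitemap_url: str, since: date) -> list[str]:
        root = ET.fromstring(requests.get(sitemap_url, timeout=10).content)
        changed = []
        for entry in root.findall("sm:url", NS):
            loc = entry.findtext("sm:loc", namespaces=NS)
            lastmod = entry.findtext("sm:lastmod", namespaces=NS)
            # Compare on the date part only, sidestepping timezone quirks.
            if loc and lastmod and date.fromisoformat(lastmod[:10]) > since:
                changed.append(loc)
        return changed

    # e.g. urls_changed_since("https://example.com/sitemap.xml", date(2024, 1, 1))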

thisislife2 12 hours ago | parent | next [-]

If you are in the US, have you considered suing them for robots.txt / copyright violations? AI companies are currently flush with cash from VCs, and there may be a few big law firms willing to fight a lawsuit against them on your behalf. AI companies have already lost some copyright cases.

happymellon 10 hours ago | parent [-]

Based on traffic you could tell whether an IP or request structure is coming from a bot, but how would you reliably tell which company is DDoSing you?

chrismorgan 9 hours ago | parent [-]

It should be at least theoretically possible: each IP address belongs to a routing prefix assigned to some organisation, and you can look that up easily. That organisation should have some sort of abuse channel, or at the very least a legal system should be able to compel it to cooperate and give up the information it's required to have.
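
A minimal sketch of that lookup in Python, assuming the public rdap.org bootstrap redirector (it forwards the query to the responsible registry, e.g. ARIN or RIPE; the exact fields returned vary by registry):

    # Sketch: ask RDAP which organisation an IP block belongs to, and pull
    # out any published abuse contacts.
    import requests

    def ip_owner(ip: str) -> dict:
        resp = requests.get(f"https://rdap.org/ip/{ip}", timeout=10)
        resp.raise_for_status()
        data = resp.json()
        abuse = [e for e in data.get("entities", [])
                 if "abuse" in e.get("roles", [])]
        return {
            "network": data.get("name"),    # registry's name for the block
            "handle": data.get("handle"),
            "abuse_entities": abuse,        # vCard records, when published
        }

    print(ip_owner("8.8.8.8"))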

n1xis10t 19 hours ago | parent | prev [-]

It would be interesting if someone made a map depicting the locations of the IP addresses that are sending so many requests, maybe over the course of a day.

giantrobot 17 hours ago | parent [-]

Maps That Are Just Datacenters

tokioyoyo 16 hours ago | parent | prev | next [-]

Large-scale scraping tech is not as sophisticated as you'd think. A significant chunk of it is "get as much as possible, categorize and clean up later". Man, I really want the real web of the 2000s back, when things felt more or less "real"... how can we even get there?

tmnvix an hour ago | parent | next [-]

A curated web directory, kind of like Yahoo had. The internet according to the Dewey system, with pages somehow rated for quality by actual humans (maybe something to learn from Wikipedia's approach here?).

n1xis10t 16 hours ago | parent | prev | next [-]

If people started making search engines again and there were more competition for Google, I think things would be pretty sweet.

nephihaha 5 hours ago | parent | next [-]

There are other search engines; they've just been marginalised. Even something as mainstream as Bing has been pushed to the side.

tokioyoyo 16 hours ago | parent | prev | next [-]

Because of the financial incentives, it would still end up with people doing things to drive traffic to their websites though, no? Maybe because the web was smaller, and people looked at it as a means "to explore curiosity", it kind of worked differently in the olden days... maybe I just got old, but I don't want to believe that.

n1xis10t 15 hours ago | parent [-]

By “doing things to drive traffic to their website” do you mean trying to do SEO type things to manipulate search engine rankings? If so, I think that there are probably ways to rank that are immune to tampering.

Don’t worry, you’re not just old. The internet kind of sucks now.

makapuf 11 hours ago | parent [-]

Google was neat in that you didn't see the content keyword spam either on the websites or the portal home pages. The Web was already full of shit (first ad banner was 1994? By 1999 you already had punch the monkey as classy content), but it was more ... organic and you could easily skip it.

PunchyHamster 8 hours ago | parent | prev [-]

It's a few orders of magnitude harder given the amount of SEO spam that's prevalent, and that's just gonna get worse with AI.

thethingundone 16 hours ago | parent | prev | next [-]

I would understand that, but it seems they don't store the stuff and instead re-fetch the same content every hour.

tokioyoyo 16 hours ago | parent [-]

I'm assuming a quick hash check to see if there's any change? Between scrapers, "most up-to-date data" is fairly valuable nowadays as well.
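
Something like this sketch, presumably. Note it still downloads the full page just to decide nothing changed, which is exactly the waste being complained about upthread; a politer crawler would send If-None-Match / If-Modified-Since and skip the body entirely on a 304.

    # Sketch: hash each page body and only reprocess when the digest moves.
    import hashlib
    import requests

    seen: dict[str, str] = {}  # url -> digest from the previous crawl

    def changed(url: str) -> bool:
        body = requests.get(url, timeout=10).content
        digest = hashlib.sha256(body).hexdigest()
        if seen.get(url) == digest:
            return False
        seen[url] = digest
        return True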

idiotsecant 12 hours ago | parent | prev [-]

Have you ever listened to the 'high water mark' monologue from Fear and Loathing? It's pretty much just that. It was a unique time and it was neat that we got to see it, but it can't possibly happen again.

https://www.youtube.com/watch?v=vUgs2O7Okqc

symbogra 8 hours ago | parent [-]

Thanks for reminding me about that, what a great monologue. I didn't really understand it when I was younger, but now I feel the same thing with regard to software engineering. There was a golden age which finally broke at the end of the 2010s.

thethingundone 19 hours ago | parent | prev | next [-]

The bots identify themselves as Google, Bing and Yandex. I can't verify whether the forum attributes them by IP address or just trusts their user agent. It could basically be anyone.

n1xis10t 19 hours ago | parent [-]

Interesting. When it was just normal search engines I didn't hear of people having this problem, so either there are a bunch of people pretending to be Bing, Google, and Yandex, or those companies have gotten a lot more aggressive.

bobbiechen 18 hours ago | parent | next [-]

There are lots of people pretending to be Google and friends. They far outnumber the real Googlebot, etc., and most people don't check the reverse DNS/IP lists - it's tedious to do this even for well-behaved crawlers that publish how to ID themselves. So much for User Agent.
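
The check the big crawlers document is forward-confirmed reverse DNS. A minimal sketch in Python (the domain list is illustrative, not exhaustive):

    # Sketch: resolve the IP to a hostname, confirm it's under a domain the
    # crawler operator publishes, then resolve the hostname forward and make
    # sure it maps back to the same IP.
    import socket

    CRAWLER_DOMAINS = (".googlebot.com", ".google.com", ".search.msn.com")

    def is_real_crawler(ip: str) -> bool:
        try:
            host = socket.gethostbyaddr(ip)[0]             # reverse lookup
        except socket.herror:
            return False
        if not host.endswith(CRAWLER_DOMAINS):
            return False
        try:
            return ip in socket.gethostbyname_ex(host)[2]  # forward confirm
        except socket.gaierror:
            return False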

happymellon 10 hours ago | parent [-]

> So much for User Agent.

User agent has been abused for so long that I can't remember a time when it wasn't.

Anyone else remember having to fake being a Windows machine so that YouTube/Netflix would serve you content better than standard def, or banking portals that blocked you if your agent didn't say you were Internet Explorer?

wooger 8 hours ago | parent [-]

I mean, forget that: all modern desktop browsers (at least) still start their user agent with the string 'Mozilla/5.0', even in a world where Chrome is so dominant.

reallyhuh 18 hours ago | parent | prev | next [-]

What are the proportions for the attributions? Is it equally distributed or lopsided towards one of the three?

giantrobot 17 hours ago | parent | prev [-]

Normal search engine spiders did, and do, cause problems, but not on the scale of AI scrapers. Search engine spiders tend to follow robots.txt, look at sitemap.xml, and generally try to throttle requests. You'll find some that are poorly behaved, but they tend to get blocked and either die out or get fixed and behave better.

The AI scrapers are atrocious. They just blindly blast every URL on a site with no throttling. They are terribly written and managed: the same scraper will hit the same site multiple times a day, or even an hour. They also don't pay any attention to context, so they'll happily blast git repo hosts and hit expensive endpoints.

They're like a constant DoS attack. They're hard to block at the network level because they span different hyperscalers' IP blocks.

n1xis10t 17 hours ago | parent [-]

Puts on tinfoil hat: maybe it isn't AI scrapers but actually a massive DoS attack, and it's a conspiracy to get people not to self-host.

danpalmer 20 hours ago | parent | prev | next [-]

How do you define a user, and how do you define online?

If the forum considers unique cookies to be a user and creates a new cookie for any new cookie-less request, and if it considers a user to be online for 1 hour after their last request, then actually this may be one scraper making ~6 requests per second. That may be a pain in its own way, but it's far from 23k online bots.
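
Back-of-the-envelope, with the 1-hour window assumed here (and the 15-minute window the forum turns out to use, per the replies below):

    # Sketch: implied request rate if every cookieless request mints a new
    # "user" and a user stays "online" for the whole window.
    users = 23_000
    for window_minutes in (60, 15):
        rate = users / (window_minutes * 60)
        print(f"{window_minutes} min window -> {rate:.1f} req/s")
    # 60 min window -> 6.4 req/s
    # 15 min window -> 25.6 req/s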

crote 19 hours ago | parent | next [-]

That's still 518,400 requests per day. For static content. And it's a niche forum, so it's not exactly going to have millions of pages.

Either there are indeed hundreds or thousands of AI bots DDoSing the entire internet, or a couple of bots are needlessly hammering it over and over and over again. I'm not sure which option is worse.

n1xis10t 19 hours ago | parent [-]

Imagine if all this scraping was going into a search engine with a massive index, or a bunch of smaller search engines that a meta-search engine could be built on top of. That would be a lot cooler.

thethingundone 19 hours ago | parent | prev [-]

AFAIK it keeps a user counted as online for 5 or 15 minutes (I think 5). It’s a Woltlab Burning Board.

Edit: it’s 15 minutes.

danpalmer 18 hours ago | parent [-]

And what is a "user"?

thethingundone 16 hours ago | parent [-]

Whatever the forum software Woltlab Burning Board considers a user. If I recall correctly, it tries to build an identifier based on PHP session IDs, so most likely simply cookies.

danpalmer 14 hours ago | parent [-]

This is exactly my point. Scrapers typically don't store cookies, so every single request is likely to be a "new" user as far as the forum software is concerned.

Couple that with 15-minute session times, and it could just be one entity scraping the forum at roughly 30 requests per second. One scraper going moderately fast sounds far less bad than 29,000 bots.

It still sounds excessive for a niche site, but I'd guess this is sporadic, or that the forum software has a page structure that accidentally traps scrapers, which is quite easy to do.

mrweasel 7 hours ago | parent | prev | next [-]

Why pay for storage when you do it for them?

GaryBluto 6 hours ago | parent | prev | next [-]

Why do you keep it operating? Is it the aquarium value?

csomar 4 hours ago | parent | prev | next [-]

Sure you do by now. You are the hard drive.

sandblast 20 hours ago | parent | prev | next [-]

Are you sure the counter is not broken?

thethingundone 19 hours ago | parent [-]

Yes, it’s running on a Woltlab Burning Board since forever.

andrepd 20 hours ago | parent | prev [-]

When you have trillions of dollars being poured into your company by the financial system, and when furthermore there are no repercussions for behaving however you please, you tend not to care about that sort of "waste".