andai 10 hours ago

Can someone help me understand where all this traffic is coming from? Are there thousands of companies all doing it simultaneously? How come even small sites get hammered constantly? At some point haven't you scraped the whole thing?

marginalia_nu 8 hours ago | parent | next [-]

> At some point haven't you scraped the whole thing?

Git forges will expose a version of every file at every commit in the project's history. If you have a medium-sized project consisting of, say, 1,000 files and 10,000 commits, the crawler will identify on the order of 10 million URLs (1,000 × 10,000), roughly the same order of magnitude as English Wikipedia, just for that one project. This is also very expensive for the git forge, as it needs to reconstruct the historical files from a bunch of commits.

Git forges interact spectacularly poorly with naively implemented web crawlers unless the crawlers include logic to avoid exhaustively crawling them. You honestly get a pretty long way just by excluding URLs with long base64-like path elements, which isn't hard, but it's also not obvious.
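
For illustration, here is a rough sketch of that kind of filter (Python; the URL patterns, threshold, and forge layout are made up for the example, not what any particular crawler actually uses):

    import re
    from urllib.parse import urlparse

    # Heuristic: treat a path segment as "hash-like" if it is a long hex string
    # (e.g. a 40-char git commit id) or a long base64-ish token.
    HASHLIKE = re.compile(r"[0-9a-fA-F]{7,64}|[A-Za-z0-9+/=_-]{24,}")

    def skip_url(url: str) -> bool:
        """Return True if any path segment looks like a commit/blob id."""
        return any(HASHLIKE.fullmatch(seg)
                   for seg in urlparse(url).path.split("/") if seg)

    # A per-commit file view gets skipped; the repo front page still gets crawled.
    skip_url("https://forge.example/repo/blob/3f8a2c19e4b7a6f5c0d1e2b3a4f5c6d7e8f9a0b1/README.md")  # True
    skip_url("https://forge.example/repo")  # False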

input_sh 9 hours ago | parent | prev | next [-]

> How come even small sites get hammered constantly?

Because big sites have decades of experience fighting against scrapers and have recently upped their game significantly (even when doing so carries some SEO costs) so that they're the only ones that can train AI on their own data.

So now, when you're starting from scratch and your goal is to gather as much data as possible, targeting smaller sites with weak or non-existent scraping protection is the path of least resistance.

andai 6 hours ago | parent [-]

No, I meant like: if you have a blog with 10 posts, do they just scrape the same 10 pages thousands of times?

Because people are reporting constant traffic, which would imply that the site is being scraped millions of times per year. How does that make any sense? Are there millions of AI companies?

marcthe12 5 hours ago | parent [-]

Basically the scrapers do not bother to cache your website, or if they do, it is with an insanely low TTL. They also do not specialize by content type, so the worst-hit sites are things like git hosting, due to the BFS-style scrape (following every link). The worst part is that a lot of this is done via tunneling, so the IP can be different each time or come from residential IPs, which makes it annoying to block.
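
For contrast, a tiny sketch of the per-URL cache with a sane TTL that such scrapers skip (Python; in-memory and hypothetical, just to show the idea):

    import time

    CACHE_TTL = 24 * 60 * 60          # refetch a given page at most once a day
    _cache: dict[str, tuple[float, bytes]] = {}

    def fetch_with_cache(url: str, fetch) -> bytes:
        """Reuse a cached copy if it is still fresh; otherwise hit the site once."""
        hit = _cache.get(url)
        if hit and time.time() - hit[0] < CACHE_TTL:
            return hit[1]             # fresh enough: no request reaches the site
        body = fetch(url)             # `fetch` stands in for whatever HTTP client is used
        _cache[url] = (time.time(), body)
        return body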

bingo-bongo 9 hours ago | parent | prev | next [-]

AI companies scrape to:

- have data to train on

- update the data more or less continuously

- answer queries from users on the fly

With a lot of AI companies, that generates a lot of scraping. Also, some of them behave terribly when scraping or are just bad at it.

adastra22 9 hours ago | parent [-]

Why don’t they scrape once though?

blell 9 hours ago | parent [-]

1) It may be out of date. 2) Storing it costs money.

reppap 9 hours ago | parent | prev | next [-]

It's not just companies either, a lot of people run crawlers for their home lab projects too.

m0llusk 7 hours ago | parent | prev | next [-]

It isn't only companies, it is a mass social movement. Anyone with basic coding experience can download some basic learning apparatus and start feeding it material. The latest LLMs make it extremely easy to compose code that scrapes internet sites, so only the most minimal skills are required. Because everything is "AI" now, aspiring young people are encouraged to do this in order to gain experience so they can get jobs and careers in the new AI-driven economy.

devsda 9 hours ago | parent | prev [-]

Maybe the teams developing AI crawlers are dogfooding and using the AI itself (and its small context window) to keep track of the sites that have already been scraped. /s