andai · 10 hours ago
Can someone help me understand where all this traffic is coming from? Are there thousands of companies all doing it simultaneously? How come even small sites get hammered constantly? At some point haven't you scraped the whole thing?

marginalia_nu · 8 hours ago
> At some point haven't you scraped the whole thing?

Git forges will expose a version of every file at every commit in the project's history. If you have a medium-sized project consisting of, say, 1,000 files and 10,000 commits, the crawler will identify a number of URLs on the same order of magnitude as English Wikipedia, just for that one project. This is also very expensive for the git forge, as it needs to reconstruct the historical files from a bunch of commits.

Git forges interact spectacularly poorly with naively implemented web crawlers, unless the crawlers put in logic to avoid exhaustively crawling git forges. You honestly get a pretty long way just by excluding URLs with long base64-like path elements, which isn't hard, but it's also not obvious.
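A rough sketch of what that kind of filter could look like (Python; the helper name and regexes are mine and deliberately crude, so expect some false positives on long ordinary path segments):

    import re
    from urllib.parse import urlparse

    # Crude heuristics (hypothetical names): skip any URL whose path contains
    # a segment that looks like a commit SHA or a long base64-ish blob
    # reference, which is what exhaustive crawls of a git forge's
    # /blob/<sha>/... URLs tend to produce.
    HASH_LIKE = re.compile(r'^[0-9a-fA-F]{7,64}$')
    BASE64_LIKE = re.compile(r'^[A-Za-z0-9+/_-]{20,}={0,2}$')

    def looks_like_forge_blob(url: str) -> bool:
        return any(HASH_LIKE.match(seg) or BASE64_LIKE.match(seg)
                   for seg in urlparse(url).path.split('/'))

    # A crawler frontier would drop the first URL before ever fetching it:
    print(looks_like_forge_blob(
        "https://forge.example/repo/blob/3f785b2c9a1d4e6f8a0b1c2d3e4f5a6b7c8d9e0f/src/main.c"))  # True
    print(looks_like_forge_blob("https://forge.example/repo/releases"))       # False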

input_sh · 9 hours ago
> How come even small sites get hammered constantly?

Because big sites have decades of experience fighting scrapers and have recently upped their game significantly (even when doing so carries some SEO cost), so that they're the only ones who can train AI on their own data. So now, when you're starting from scratch and your goal is to gather as much data as possible, targeting smaller sites with weak or non-existent scraping protection is the path of least resistance.

bingo-bongo · 9 hours ago
AI companies scrape to:

- have data to train on
- update that data more or less continuously
- answer queries from users on the fly

With a lot of AI companies, that generates a lot of scraping. Also, some of them behave terribly when scraping, or are just bad at it.

reppap · 9 hours ago
It's not just companies either; a lot of people run crawlers for their home lab projects too.

m0llusk · 7 hours ago
It isn't only companies, it's a mass social movement. Anyone with basic coding experience can download some basic learning apparatus and start feeding it material. The latest LLMs make it extremely easy to compose code that scrapes internet sites, so only the most minimal skills are required. Because everything is "AI" now, aspiring young people are encouraged to do this in order to gain experience so they can get jobs and careers in the new AI-driven economy.
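For a sense of how little code that takes, here's a minimal sketch of that sort of naive crawler (placeholder URL, third-party requests library). The notable thing is everything it leaves out: no robots.txt check, no real rate limit, no caching, no revisit policy, which is why a few thousand of these running at once feels like a flood to a small site.

    import re
    import time
    from urllib.parse import urljoin

    import requests  # third-party: pip install requests

    def naive_crawl(start_url: str, max_pages: int = 100) -> None:
        seen, queue = set(), [start_url]
        while queue and len(seen) < max_pages:
            url = queue.pop()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            # Naive link extraction: follows everything it finds,
            # including every /blob/<sha>/ URL on a git forge.
            for href in re.findall(r'href="([^"]+)"', html):
                queue.append(urljoin(url, href))
            time.sleep(0.1)  # far less courtesy than most sites would want

    # naive_crawl("https://example.com/")  # placeholder start URL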

devsda · 9 hours ago
Maybe the teams developing AI crawlers are dogfooding and using the AI itself (and its small context) to keep track of which sites have already been scraped. /s