goku12 3 days ago
Please remember that an LLM accessing a website isn't the problem here. The problem is the scraping bots that saturate server bandwidth (a DoS attack of sorts) to collect data for training LLMs. An LLM solving a captcha or an Anubis-style proof-of-work challenge isn't a big concern, because the worst it will do with the collected data is cache it for later analysis and reporting. Unlike the crawlers, LLMs have no incentive to suck up huge amounts of data like a giant vacuum cleaner.
TeMPOraL 3 days ago
Scraping was a thing before LLMs; there's a whole separate arms race around it for regular competition and "industrial espionage" reasons. I'm not really sure why model training would become a noticeable fraction of scraping activity: there are only a few players on the planet that can afford to train decent LLMs in the first place, and they're not going to re-scrape the content they already have ad infinitum.