johnea 6 days ago
My biggest bitch is that it requires JS and cookies... Although the long-term problem is the business model of servers paying for all network bandwidth.

Actual human users have consumed a minority of total net bandwidth for decades: https://www.atom.com/blog/internet-statistics/ Part 4 shows bots out-using humans in 1996 8-/

What are "bots"? This needs to include googleadservices, PII sharing for profit, real-time ad auctions, and other "non-user" traffic.

The difference between that and the LLM training-data scraping is that the previous non-human traffic was assumed, by site servers, to increase their human traffic through search-engine ranking, and thus their revenue. However, the current training-data scraping is likely to have the opposite effect: capturing traffic with LLM summaries instead of redirecting it to the original source sites.

This is the first major disruption to the internet's model of finance since ad revenue took over after the dot bomb. So far, it's in the same category as the environmental disaster in progress: ownership is refusing to acknowledge the problem and insisting on business as usual. Rational predictions are that it's not going to end well...
jerf 6 days ago | parent
"Although the long term problem is the business model of servers paying for all network bandwidth." Servers do not "pay for all the network bandwidth" as if they are somehow being targeted for fees and carrying water for the clients that are somehow getting it for "free". Everyone pays for the bandwidth they use, clients, servers, and all the networks in between, one way or another. Nobody out there gets free bandwidth at scale. The AI scrapers are paying lots of money to scrape the internet at the scales they do. | |||||||||||||||||||||||||||||||||||||||||||||||
Hizonner 6 days ago | parent
> The difference between that and the LLM training data scraping

Is the traffic that people are complaining about really training traffic?

My SWAG would be that there are maybe on the order of dozens of foundation models trained in a year. If you assume the training runs are maximally inefficient, cache nothing, and crawl every Web site 10 times for each model trained, that means maybe a couple of hundred full-content downloads for each site in a year. But really they probably do cache, they probably try to avoid downloading assets they don't actually want to put into the training hopper, and I'm not sure how many times they feed any given page through in a single training run. That doesn't seem like enough traffic to be a really big problem.

On the other hand, if I ask ChatGPT Deep Research to give me a report on something, it runs around the Internet like a ferret on meth and maybe visits a couple of hundred sites (but only a few pages on each site). It'll do that a whole lot faster than I'd do it manually, it's probably less selective about what it visits than I would be... and I'm likely to ask for a lot more such research from it than I'd be willing to do manually. And the next time a user asks for a report, it'll do it again, often on the same sites, maybe with caching and maybe not.

That's not training; the results won't be used to update any neural network weights, and won't really affect anything at all beyond the context of a single session. It's "inference scraping", if you will. It's even "user traffic" in some sense, although not in the sense that there's much chance the user is going to see a site's advertising. It's conceivable the bot might check the advertising for useful information, but of course the problem there is that it's probably learned that's a waste of time.

Not having given it much thought, I'm not sure how that distinction affects the economics of the whole thing, but I suspect it does. So what's really going on here? Anybody actually know?
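The back-of-envelope comparison above can be put as a toy calculation. All numbers here are assumptions taken from the comment's own hedged figures ("dozens" of models, 10 crawls each, "a couple of hundred" sites per query); the query volume and how often a given site appears in a research run are pure guesses for illustration:

```python
# Toy estimate: annual training-scrape load vs. inference-scrape load
# on a single site. Every figure is an assumption, not a measurement.

# Training side: "on the order of dozens" of foundation models per year,
# worst case crawling every site 10 times with no caching.
models_per_year = 24           # assumed: "dozens"
crawls_per_model = 10          # assumed worst case from the comment
training_hits = models_per_year * crawls_per_model  # full-site crawls/year

# Inference side: each deep-research query visits a couple of hundred
# sites, "only a few pages on each site". Suppose this particular site
# shows up in some small fraction of all such queries.
queries_per_day = 1_000_000    # assumed global query volume (guess)
site_hit_fraction = 0.001      # assumed: site appears in 0.1% of queries
pages_per_visit = 3            # "a few pages on each site"
inference_hits = queries_per_day * site_hit_fraction * pages_per_visit * 365

print(f"training crawls/year:    {training_hits}")
print(f"inference page hits/yr:  {inference_hits:,.0f}")
```

Under these made-up inputs the training side works out to a few hundred crawls a year while the inference side is in the millions of page hits, which is the asymmetry the comment is gesturing at: even generous training assumptions produce far less load than repeated per-query research traffic.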