n1xis10t 19 hours ago
That’s so many scrapers. There must be a ton of companies with very large document collections at this point, and it really sucks that they don’t at least do us the courtesy of indexing them and making them available for keyword search; instead they only do AI. It’s kind of crazy how much scraping goes on and how little search engine development goes on. I guess search engines aren’t fashionable. It reminds me of this article about search engines mysteriously disappearing: https://archive.org/details/search-timeline I try to share that article as much as possible; it’s interesting.
kstrauser 18 hours ago
So! Much! Scraping! They were downloading every commit multiple times, and fetching every file as seen at each of those commits, and trying to download archives of all the code, and hitting `/me/my-repo/blame` endpoints as their IP's first-ever request to my server, and other unlikely stuff. My scraper dudes, it's a git repo. You can fetch the whole freaking thing if you wanna look at it. Of course, that would require work and context-aware processing on their end, and it's easier for them to shift the expense onto my little server and make me pay for their misbehavior. | ||||||||
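The complaint above is that scrapers crawl every commit page, `/blame` endpoint, and archive link over HTTP when one git operation would hand them the entire history. A minimal sketch of what that looks like (the remote URL is hypothetical, so this demo stands in a tiny local repo for it):

```shell
#!/bin/sh
set -e

# Stand-in for the remote: build a tiny throwaway repo. In real use the
# source would be a URL like https://git.example.com/me/my-repo.git
# (hypothetical).
tmp=$(mktemp -d)
git init -q "$tmp/my-repo"
(
  cd "$tmp/my-repo"
  echo "hello" > README
  git add README
  git -c user.email=demo@example.com -c user.name=demo \
      commit -qm "add README"
)

# One bare mirror clone transfers every commit, branch, and tag at once --
# no crawling of per-commit pages, blame endpoints, or tarball links.
git clone -q --mirror "$tmp/my-repo" "$tmp/mirror.git"

# Any file at any commit can then be read locally, with zero HTTP requests:
git -C "$tmp/mirror.git" show HEAD:README

# Keeping the mirror fresh is one incremental fetch, not a full re-crawl:
git -C "$tmp/mirror.git" fetch -q --prune origin
```

A mirror clone plus periodic `fetch --prune` is strictly cheaper for both sides than re-requesting every commit view through the web frontend.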
PeterStuer 7 hours ago
Or it's anti-DDoS/bot companies using ultra-cheap scraping services to annoy you into signing up for their "free" anti-bot protection, so they can then charge the few real AI scrapers for access to your site.
miki123211 8 hours ago
But there is a lot of search engine development going on; it's just that the results of the new search engines are fed straight into AI instead of being displayed in the legacy ten-links-per-page view.
mrweasel 5 hours ago
> There must be a ton of companies with very large document collections at this point

See, I don't think there is; I don't think they want that expense. It's basically the Linus Torvalds philosophy of data storage: if it's on the Internet, I don't need a backup. While I have absolutely no proof of this, I'd guess that many AI companies just crawl the Internet constantly, never saving any of the data. We're seeing some of these scrapers go to great lengths to circumvent any and all forms of caching; they aren't interested in having a two-week-old copy of anything.