senko 6 hours ago
A full, up-to-date index of the searchable web should be a public commons. This would not only enable better competition in search but also fix the "AI scrapers" problem: there's no need to scrape if the data has already been scraped.

Crawling is technically a solved problem, as witnessed by everyone and their dog seemingly crawling everything. If the effort were pooled, it would be cheaper and less resource-intensive. The secret sauce is in what happens afterwards, anyway.

Here's the idea in more detail: https://senkorasic.com/articles/ai-scraper-tragedy-commons

I'm under no illusion something like that will happen... but it could.
moebrowne 5 hours ago
Isn't this what Common Crawl is already doing?
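For reference, Common Crawl publishes both the raw crawl archives and a queryable CDX index. A minimal sketch of looking up captures for a URL, assuming the public index endpoint at index.commoncrawl.org (the collection name below is only an example; pick an actual crawl listed on that site):

    # Query the Common Crawl CDX index for captures of a URL.
    # This is an illustrative sketch, not an official client.
    import json
    import urllib.parse
    import urllib.request

    def cc_index_lookup(url_pattern, collection="CC-MAIN-2024-33"):
        query = urllib.parse.urlencode({"url": url_pattern, "output": "json"})
        endpoint = f"https://index.commoncrawl.org/{collection}-index?{query}"
        # Send an explicit User-Agent; some servers reject the default one.
        req = urllib.request.Request(endpoint, headers={"User-Agent": "cc-index-sketch/0.1"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            # The CDX API returns one JSON object per line, one per capture.
            return [json.loads(line) for line in resp.read().decode().splitlines()]

    for record in cc_index_lookup("example.com/*")[:5]:
        print(record.get("timestamp"), record.get("url"), record.get("status"))

Each record points at the WARC file, offset, and length where the captured page lives, so consumers can pull page content without re-crawling the site themselves.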
azornathogron 5 hours ago
Is crawling really solved? Any naive crawler will run into the problem that servers can give different responses to different clients, which means a site can show the crawler something different from what it shows real users.

That turns crawling into an antagonistic problem: crawler developers need to be continually on the lookout for new ways servers can maliciously poison or mislead the index. Otherwise you'll return junk results from spammers who lied to the crawler.

I've never done it, so maybe it's easier than I imagine, but I wouldn't be quick to assume that crawling is solved.
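A toy illustration of the cloaking problem described above: fetch the same URL under two different User-Agent headers and compare what comes back. The user-agent strings and the similarity threshold are made up for the sketch; real detection also has to cope with IP-based cloaking, JavaScript rendering, and ordinary personalisation, which is part of why this stays an arms race.

    # Naive cloaking check: does the server send the "crawler" something
    # very different from what it sends a "browser"? Illustrative only.
    import hashlib
    import urllib.request

    BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    CRAWLER_UA = "ExampleBot/1.0 (+https://example.org/bot)"  # hypothetical bot identity

    def fetch(url, user_agent):
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read()

    def looks_cloaked(url):
        as_browser = fetch(url, BROWSER_UA)
        as_crawler = fetch(url, CRAWLER_UA)
        # Identical bodies: clearly not cloaked.
        if hashlib.sha256(as_browser).digest() == hashlib.sha256(as_crawler).digest():
            return False
        # Wildly different response sizes hint the crawler got a different page.
        ratio = min(len(as_browser), len(as_crawler)) / max(len(as_browser), len(as_crawler), 1)
        return ratio < 0.5  # arbitrary threshold; real detection is much harder

    print(looks_cloaked("https://example.com/"))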