| ▲ | walls 6 hours ago |
| A huge amount of the web is only crawlable with a Googlebot user-agent and requests from specific source IPs. |
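| A minimal sketch of the robots.txt half of that gating, using only the Python standard library; the example URL and the "MyNewBot" crawler name are placeholders, and the source-IP allowlisting happens server-side so it can't be shown here: |

    # Compare what a site's robots.txt permits for Googlebot versus an
    # unknown crawler. Many sites allow only Googlebot (and a few other
    # big crawlers) and disallow everyone else.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # placeholder site
    rp.read()

    page = "https://example.com/some/article"     # placeholder page
    for agent in ("Googlebot", "MyNewBot"):        # "MyNewBot" is hypothetical
        print(agent, "allowed:", rp.can_fetch(agent, page))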
|
| ▲ | Imustaskforhelp 5 hours ago | parent | next [-] |
| > And given you-know-what, the battle to establish a new search crawler will be harder than ever. Crawlers are now presumed guilty of scraping for AI services until proven innocent. |
| I have always wondered: how does the Wayback Machine work? Is there no way we could take the Wayback archive and build an index on top of it somehow? |
| |
| ▲ | ghm2199 5 hours ago | parent [-] | | You can read https://hackernoon.com/the-long-now-of-the-web-inside-the-in... for a nice look into their infrastructure. One could theoretically build it. A few things stand out: |
| 1. IIUC it depends a lot on "Save Page Now" democratization, which could work, but it's not like a crawler. |
| 2. In the absence of Alexa they depend quite heavily on Common Crawl, which is quite crazy because there is literally no other place to go. I don't think they can use Google's syndicated API, because they would then start showing ads in their database, garbage that would strain their tiny storage budget. |
| 3. Minor from a software engineering perspective but important for the company's survival: since they are an archive of record, converting that into an index would need a good legal team to argue it out with Google, and the DoJ's recent ruling in their favor would help there. |
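| On the "index on top of the Wayback archive" question: the Internet Archive exposes a public CDX API that enumerates captures, which is the natural starting point. A rough sketch with a placeholder domain (none of this is taken from the linked article): |

    # List archived captures for a URL prefix via the Wayback CDX API,
    # then build the snapshot URL each row points at. Indexing would mean
    # fetching and parsing those snapshots.
    import json
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({
        "url": "example.com/*",        # placeholder URL prefix
        "output": "json",
        "limit": "20",
        "filter": "statuscode:200",    # only successful captures
    })
    with urllib.request.urlopen("https://web.archive.org/cdx/search/cdx?" + params) as resp:
        rows = json.load(resp)

    if rows:
        header, captures = rows[0], rows[1:]
        for row in captures:
            rec = dict(zip(header, row))
            snapshot = f"https://web.archive.org/web/{rec['timestamp']}/{rec['original']}"
            print(rec["timestamp"], rec["original"], snapshot)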
|
|
| ▲ | deepsquirrelnet 5 hours ago | parent | prev | next [-] |
| I do not know a lot about this subject, but couldn't you make a pretty decent index off of Common Crawl? It seems to me the bar is so low that you wouldn't need to have everything, especially if your goal isn't monetization with ads. |
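| The indexing half really is simple in principle; here is a toy in-memory inverted index over made-up documents, just to show the data structure (everything below is illustrative, not from the thread): |

    # Build a term -> documents inverted index and answer an AND query.
    # Real engines add ranking, dedup, sharding, and incremental updates.
    import re
    from collections import defaultdict

    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    docs = {  # made-up example documents
        "doc1": "common crawl publishes large monthly web archives",
        "doc2": "a search engine needs a fresh crawl and an inverted index",
    }

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)

    def search(query):
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(index.get(terms[0], set()))
        for term in terms[1:]:
            result &= index.get(term, set())
        return result

    print(search("inverted index"))  # -> {'doc2'}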
| |
| ▲ | ghm2199 5 hours ago | parent [-] | | I think someone commented on another thread about SerpAPI the other day that Common Crawl is quite small. It would be a start, but I think the key to an index people will actually use is freshness of the results. You need good recall for a search engine; precision tuning and re-ranking won't help otherwise. |
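| For a sense of what working off Common Crawl looks like: the public index server lists the available crawls and where each captured page sits inside a WARC file, so freshness is bounded by the crawl cadence. A rough sketch with a placeholder domain: |

    # Find the newest Common Crawl crawl, query its URL index for a domain,
    # and show where each record's WARC bytes live on data.commoncrawl.org.
    import json
    import urllib.parse
    import urllib.request

    def json_lines(url):
        with urllib.request.urlopen(url) as resp:
            return [json.loads(line) for line in resp.read().decode().splitlines() if line]

    # collinfo.json lists available crawls; this assumes the newest is first.
    with urllib.request.urlopen("https://index.commoncrawl.org/collinfo.json") as resp:
        latest = json.load(resp)[0]["id"]

    query = urllib.parse.urlencode({"url": "example.com/*", "output": "json", "limit": "5"})
    records = json_lines(f"https://index.commoncrawl.org/{latest}-index?{query}")

    for rec in records:
        start = int(rec["offset"])
        end = start + int(rec["length"]) - 1
        warc_url = "https://data.commoncrawl.org/" + rec["filename"]
        print(rec["url"], rec["timestamp"], warc_url, f"bytes={start}-{end}")
        # A Range request for those bytes returns a gzipped WARC record with
        # the archived response, which is what you would parse and index.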
|
|
| ▲ | charcircuit 5 hours ago | parent | prev | next [-] |
| If a crawler offered enough money, it could be allowed in too. It's not like Google has exclusive crawling rights. |
| |
| ▲ | Nextgrid an hour ago | parent [-] | | There is a logistics problem here: even if you had enough money to pay, how would you get in touch with every single site to let them know you're happy to pay? It's not like site operators routinely scan their error logs for your failed crawling attempts and the offer in your user-agent string. Even if they see it, it's a classic chicken & egg problem: it's not worth the site operator's time to engage with your offer until your search engine is popular enough to matter, but your search engine will never become popular enough to matter if it doesn't have a critical mass of sites to begin with. |
|
|