Havoc 18 hours ago

What blows my mind is that this is functionally a solved problem.

The big search crawlers have been around for years & manage to mostly avoid nuking sites into oblivion. Then the AI gang shows up - supposedly the smartest guys around - and suddenly we're re-inventing the wheel on crawling and causing carnage in the process.

jeroenhd 18 hours ago | parent | next [-]

Search crawlers have the goal of directing people towards the websites they crawl. They have a symbiotic relationship, so they put in (some) effort not to blow websites out of the water with their crawling, because a website that's offline is useless for your search index.

AI crawlers don't care about directing people towards websites. They intend to replace websites, and are only interested in copying whatever information is on them. They are greedy crawlers that would only benefit from knocking a website offline after they're done, because then the competition can't crawl the same website.

The goals are different, so the crawlers behave differently, and websites need to deal with them differently. In my opinion the best approach is to use robots.txt to ban any crawler that's not directly attached to a search engine, and to use offensive techniques to take out crawlers that ignore your preferences. Anything from randomly generated text to straight-up ZIP bombs is fair game when it comes to malicious crawlers.
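
For what it's worth, here is a rough sketch of what the robots.txt half of that could look like, checked with Python's standard urllib.robotparser. The allow-list (Googlebot, Bingbot) and the "ExampleAIBot" name are assumptions for illustration only, and this only affects crawlers that actually honor robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: allow two search engine crawlers, disallow everyone else.
# The choice of which bots to allow is an assumption, not a recommendation.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/post"
    for bot in ("Googlebot", "Bingbot", "ExampleAIBot"):  # last name is made up
        print(bot, is_allowed(bot, page))
```

Anything that ignores this file has to be handled server-side, which is where the more aggressive measures come in.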

dmix 8 hours ago | parent | next [-]

FWIW when I research stuff through ChatGPT I click on the source links all the time. It usually only summarizes stuff. For example, if you're shopping for a certain product it won't bring you to the store page where all the reviews are. It will just make a top-ten-list type thing quickly.

freetonik 16 hours ago | parent | prev [-]

>Search crawlers have the goal of directing people towards the websites they crawl. They have a symbiotic relationship, so they put in (some) effort not to blow websites out of the water with their crawling, because a website that's offline is useless for your search index.

Ultimately not true. Google started showing pre-parsed "quick cards" instead of links a long time ago. The incentives of ad-driven search engines are to keep the visitors on the search engine rather than direct them to the source.

marginalia_nu 11 hours ago | parent [-]

> The incentives of ad-driven search engines are to keep the visitors on the search engine rather than direct them to the source.

It's more complicated than that. Google's incentives are to keep the visitors on the search engine only if the search result doesn't have Google ads. Though it's ultimately self-defeating, I think, and the reason for their decline in perceived quality. If you go back to the BackRub whitepaper from 1998, you'll find Brin and Page outlining this exact perverse incentive as the reason why their competitors sucked.

marginalia_nu 16 hours ago | parent | prev [-]

I think it's largely the mindset of moving fast and breaking things that's at fault. If you ship a crawler at "good enough", it will not behave well.

Building a competent well-behaved crawler is a big effort that requires relatively deep understanding of more or less all web tech, and figuring out a bunch of stuff that is not documented anywhere and not part of any specs.
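
To give a flavor of just one small piece of that, here is a minimal sketch of per-host politeness: spacing out requests to the same host so the crawler never hammers a single site. The HostThrottle class, its method names, and the 5-second default are all made up for illustration; a real crawler would also need robots.txt handling, backoff on errors, and much more.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay: float = 5.0, clock=time.monotonic):
        self.min_delay = min_delay  # assumed minimum gap between hits, in seconds
        self.clock = clock          # injectable so the logic can be tested
        self._next_allowed = {}     # host -> timestamp of next permitted request

    def wait_time(self, url: str) -> float:
        """Reserve a request slot for url's host and return seconds to wait first."""
        host = urlsplit(url).hostname or ""
        now = self.clock()
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + self.min_delay
        return start - now


if __name__ == "__main__":
    throttle = HostThrottle(min_delay=2.0)
    for url in ("https://a.example/1", "https://a.example/2", "https://b.example/1"):
        delay = throttle.wait_time(url)
        time.sleep(delay)  # a real crawler would schedule instead of blocking
        print(f"fetching {url} after waiting {delay:.1f}s")
```

The delay is tracked per host rather than globally, so a slow crawl of one site does not hold up requests to others.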