arjie 4 hours ago

Well, imagine the best case and that you're a cooperative bot writer who does not intend to harm website owners. Okay, so you follow robots.txt and all that. That's straightforward.

But it's not like you're writing a "metabrainz crawler" and a "metafilter crawler" and a "wiki.roshangeorge.dev crawler". You're presumably trying to write a general Internet crawler. You encounter a site that is clearly an HTTP view into some git repo (say). How do you know to just `git clone` the repo to archive the data, rather than browsing the HTTP view page by page?

As you can see, I've got a lot of crawlers hitting my blog as well, but it's a MediaWiki instance. I'd gladly host a MediaWiki dump for them to take, but then they'd have to know this was a MediaWiki-based site. How do I tell them that? The humans running the program don't know my site exists. Their bot just browses the universe, finds links, and does things.
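To be fair, MediaWiki does announce itself: pages carry a `<meta name="generator" content="MediaWiki ...">` tag by default. But that only helps if the crawler's authors decided to special-case MediaWiki in the first place. A rough Python sketch of what that kind of detection could look like (my wiki's URL is just a stand-in, and the matching is deliberately naive):

    # Sketch: detect a MediaWiki site from its generator meta tag and
    # decide to look for the API or a published dump instead of crawling
    # every page view. Nothing here is standard crawler behaviour; it's
    # just one way a cooperative bot could act on the tag.
    import re
    import urllib.request

    def looks_like_mediawiki(url: str) -> bool:
        with urllib.request.urlopen(url) as resp:
            html = resp.read(200_000).decode("utf-8", errors="replace")
        match = re.search(r'<meta name="generator" content="([^"]+)"', html)
        return bool(match and match.group(1).startswith("MediaWiki"))

    if looks_like_mediawiki("https://wiki.roshangeorge.dev/"):
        print("MediaWiki detected; prefer api.php or a published dump")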

In the MetaBrainz case, it's not like the crawler writer even knows MetaBrainz exists. The site is probably just linked from somewhere on the web the crawler is exploring. There's no "if MetaBrainz, do this" logic anywhere in there.

robots.txt is a bit of a blunt instrument, and friendly bot writers should follow it. But even assuming they do, there's no way for them to know that "inefficient path A to the data" and "efficient path B to the data" lead to the same data when both are visible to their bot, unless they write a YourSite-specific crawler.
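To make the "blunt instrument" point concrete: the entire vocabulary of robots.txt is allow/disallow rules plus, at best, a Sitemap pointer. The paths and sitemap URL below are only illustrative, but this is roughly all I can express with it:

    User-agent: *
    # Keep bots off edit/history/special-page views served through index.php
    Disallow: /index.php
    Allow: /wiki/
    Sitemap: https://wiki.roshangeorge.dev/sitemap.xml
    # There is no directive that can say "everything under /wiki/ is also
    # available as a single dump over there", which is the missing piece.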

What I want is a way to say "the canonical URL for the data on A is B; you can save us both trouble by just fetching B". In practice, none of this is a problem for me. I cache requests at Cloudflare, and I have MediaWiki caching generated pages, so I can easily weather the bot traffic. But I want to enable good bot writers to save their own resources. It's not reasonable for me to expect them to write a me-crawler, but if there is a format for specifying these rules, I'm happy to comply.
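Purely as a strawman (this is not an existing standard, and every field name here is invented), the kind of thing I'd happily publish is a tiny hints file at some agreed-upon well-known path, mapping my crawlable URL space to a bulk download:

    # Hypothetical /.well-known/bulk-data.txt, made up for illustration.
    # "This URL space is canonically available as this dump."
    Pages: https://wiki.roshangeorge.dev/wiki/*
    Dump: https://wiki.roshangeorge.dev/dumps/latest.xml.gz
    Dump-Format: mediawiki-xml
    Refresh: weekly

A general crawler would only need to check that one file per host, and in exchange it could skip tens of thousands of page fetches.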