qualeed 2 days ago

There's no reason not to respect it.

If your browser behaves, it's not going to be excluded in robots.txt.

If your browser doesn't behave, you should at least respect robots.txt.

If your browser doesn't behave, and you continue to ignore robots.txt, that's just... shitty.

lolinder 2 days ago | parent [-]

> If your browser behaves, it's not going to be excluded in robots.txt.

No, it's common practice to allow Googlebot and deny all other crawlers by default [0].

This is within their rights when it comes to true scrapers, but it's part of why I'm very uncomfortable with the idea of applying robots.txt to what are clearly user agents. It sets a precedent where it's not inconceivable that we have websites curating allowlists of user agents like they already do for scrapers, which would be very bad for the web.

[0] As just one example: https://www.404media.co/google-is-the-only-search-engine-tha...
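
To illustrate the allowlist pattern described above, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URL, the `is_allowed` helper, and the `SomeOtherBot` name are all made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot is allowed, everything else is denied.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str = "https://example.com/article") -> bool:
    """Return True if the hypothetical robots.txt above permits this agent."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, is_allowed(agent))
```

Under a file like this, any agent that is not on the list falls through to the `*` rule, which is the default-deny behavior the comment is worried about.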

qualeed 2 days ago | parent [-]

> clearly user agents

I'm not sure I agree that an AI-aided browser, one that will scrape sites and aggregate that information, should be classified as "clearly" a user agent.

If this browser were to gain traction and end up being abusive to the web, that would be bad too.

Where do you draw the line of crawler vs. automated "user agent"? Is it a certain number of web requests per minute? How are you defining "true scraper"?

lolinder 2 days ago | parent [-]

I draw the line where robotstxt.org (the semi-official home of robots.txt) draws the line [0]:

> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

To me "recursive" is key: it transforms the traffic pattern from one that strongly resembles that of a human to one that touches every page on the site, breaks caching by visiting pages humans wouldn't typically, and produces not just a little bit more but orders of magnitude more traffic.

I was persuaded in another subthread that Nxtscape should respect robots.txt if a user issues a recursive request. I don't think it should if the request is "open these 5 subreddits and summarize the most popular links uploaded since yesterday", because the resulting traffic pattern is nearly identical to what I'd have done by hand (especially if the browser implements proper rate limiting, which I believe it should).

[0] https://www.robotstxt.org/faq/what.html
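
A rough sketch of the difference in traffic pattern, assuming a tiny in-memory link graph in place of a real site. The `SITE` dictionary and both function names are hypothetical, and rate limiting is left out to keep it short.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": [],
    "/a/2": ["/"],
    "/b/1": [],
}


def fetch_listed(pages):
    """User-directed request: retrieve only the pages explicitly named."""
    return [page for page in pages if page in SITE]


def crawl_recursive(start):
    """Robot in the quoted sense: retrieve a page, then everything it references."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in SITE.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    print(len(fetch_listed(["/a", "/b"])), "requests for the listed pages")
    print(len(crawl_recursive("/")), "requests for the recursive crawl")
```

Even on this toy graph the recursive traversal touches every page, while the listed request stays at the size of the list the user gave it.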