lolinder | 2 days ago
> If your browser behaves, it's not going to be excluded in robots.txt.

No, it's common practice to allow Googlebot and deny all other crawlers by default [0]. That's within a site's rights when it comes to true scrapers, but it's part of why I'm very uncomfortable with the idea of applying robots.txt to what are clearly user agents. It sets a precedent where it's not inconceivable that websites start curating allowlists of user agents the way they already do for scrapers, which would be very bad for the web.

[0] As just one example: https://www.404media.co/google-is-the-only-search-engine-tha...
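For concreteness, the default posture lolinder describes — allow Google's crawler, deny everyone else — might look like the following hypothetical robots.txt (the structure follows the standard Robots Exclusion Protocol; which bots a real site names is of course up to that site):

```
# Hypothetical robots.txt illustrating an allowlist posture.

# Googlebot: an empty Disallow means nothing is disallowed.
User-agent: Googlebot
Disallow:

# Every other user agent: the entire site is disallowed.
User-agent: *
Disallow: /
```

Under this policy, any crawler that honors robots.txt and does not identify as Googlebot is excluded from the whole site, regardless of how well it behaves.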
qualeed | 2 days ago | parent
> clearly user agents

I'm not sure I agree that an AI-aided browser that scrapes sites and aggregates that information is "clearly" a user agent. If this browser were to gain traction and end up being abusive to the web, that would be bad too.

Where do you draw the line between a crawler and an automated "user agent"? Is it a certain number of web requests per minute? How are you defining "true scraper"?