mattigames 2 days ago
> It's meant for automated scrapers that recursively retrieve all pages on your website, _which this browser is not doing at all_

AFAIK this is false: this browser can do things like "summarize all the cooking recipes linked in this page" and therefore act exactly like a scraper (even if at a smaller scale than most scrapers).

If tomorrow all phones and computers magically had an ad-blocking browser installed and set as the default browser, a big chunk of the economy would collapse. So while I can see the philosophical value of "What a user does with a page after it has entered their browser is their own prerogative", the pragmatist in me knows that if all users cared about that and enforced it, it would have grave repercussions for the livelihoods of many.
lolinder 2 days ago
https://www.robotstxt.org/faq/what.html

> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

There's nothing recursive about "summarize all the cooking recipes linked on this page". That's a single-level iterative loop.

I will grant that I should alter my original statement: if OP wanted to respect robots.txt when the browser receives a request that should be interpreted as an instruction to recursively fetch pages, then I'd think that's an appropriate use of robots.txt, because that's not materially different from implementing a web crawler by hand in code.

But that represents a tiny subset of the queries that will go through a tool like this, and respecting robots.txt for non-recursive requests would lead to silly outcomes like the browser refusing to load reddit.com [0].
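To make the single-level vs. recursive distinction concrete, here is a rough sketch. Everything in it is hypothetical: the `LINKS` graph, the `example.com` URLs, the `example-bot` agent name, and the robots.txt rules are made up for illustration, and nothing is fetched over the network. It is not how OP's browser works, just one way the two behaviors could differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory site: page URL -> URLs it links to.
LINKS = {
    "https://example.com/recipes": [
        "https://example.com/r/soup",
        "https://example.com/r/bread",
    ],
    "https://example.com/r/soup": ["https://example.com/private/notes"],
    "https://example.com/r/bread": [],
    "https://example.com/private/notes": [],
}

# Hypothetical robots.txt for the same site.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


def linked_pages(start):
    """Single-level loop: only the pages linked directly from `start`."""
    return list(LINKS.get(start, []))


def crawl(url, robots, agent="example-bot", seen=None):
    """Recursive traversal: follow links transitively, consulting robots.txt."""
    seen = [] if seen is None else seen
    if url in seen or not robots.can_fetch(agent, url):
        return seen
    seen.append(url)
    for link in LINKS.get(url, []):
        crawl(link, robots, agent, seen)
    return seen


robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

print(linked_pages("https://example.com/recipes"))
print(crawl("https://example.com/recipes", robots))
```

In this sketch only `crawl` checks robots.txt, which matches the position above: the check applies when links are followed transitively, not when a user asks about the pages linked from the one in front of them.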