lolinder 2 days ago

> I know it's not completely true, I know reader mode can help you bypass the ads _after_ you've already had a peek at the cluttered version

What about reader mode that is auto-configured to turn on immediately on landing on specific domains? Is that a robot for the purposes of robots.txt?

https://addons.mozilla.org/en-US/firefox/addon/automatic-rea...

And also, just to confirm, I'm to understand that if I'm navigating the internet with an ad blocker then you believe that I should respect robots.txt because my user agent is now a robot by virtue of using an ad blocker?

Is that also true if I browse with a terminal-based browser that simply doesn't render JavaScript or images?

mattigames 2 days ago

If you are using an ad blocker, then by definition you are intentionally breaking the behavior intended by the creator of any given website (for personal gain). In that context, any discussion about robots.txt or any other behavior that the creator expects is a moot point.

Auto-configured reader mode and the like are so uncommon that they're not even on the radar of most websites. If they were, browser developers would probably try to create a solution that satisfies both parties, like putting the ads at the end and requiring them to be text-only, among other guidelines. But it's not popular. The same thing happens with terminal-based browsers: a lot of the most visited websites in the world don't even work without JS enabled.

On the other hand, this AI stuff seems to envision a larger user base, so it could become a concern, and therefore the role of robots.txt or other anti-bot features could have some practical implications.

lolinder 2 days ago

> If you are using an ad blocker, then by definition you are intentionally breaking the behavior intended by the creator of any given website (for personal gain). In that context, any discussion about robots.txt or any other behavior that the creator expects is a moot point.

I'm not asking if you believe ad blocking is ethical; I got that you don't. I'm asking if it turns my browser into a scraper that should be treated as such, which is an orthogonal question to the ethics of the tool in the first place.

I strongly disagree that user agents of the sort shown in the demo should count as robots. Robots.txt is designed for bots that produce tons of traffic to discourage them from hitting expensive endpoints (or to politely ask them to not scrape at all). I've responded to incidents caused by scraper traffic and this tool will never produce traffic in the same order of magnitude as a problematic scraper.

If we count this as a robot for the purposes of robots.txt we're heading down a path that will end the user agent freedom we've hitherto enjoyed. I cannot endorse that path.

For me the line is simple, and it's the one defined by robotstxt.org [0]: "A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced. ... Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images)."

If the user agent is acting on my instructions and accessing a specific and limited subset of the site that I asked it to, it's not a web scraper and should not be treated as such. The defining feature of a robot is the amount of traffic it produces, not what my user agent does with the information it pulls.

[0] https://www.robotstxt.org/faq/what.html
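For anyone who hasn't looked at how a well-behaved crawler actually consults robots.txt, here is a rough sketch using Python's standard-library `urllib.robotparser`. The robots.txt contents, the `ExampleCrawler` and `BadBot` agent names, and the example.com URLs are all made up for illustration; this only shows the kind of per-agent, per-path check a crawler would run before each fetch, not how any particular tool behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for this example (not from any real site).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Crawl-delay: 10

User-agent: BadBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# A generic crawler is asked to stay away from the expensive endpoint...
print(parser.can_fetch("ExampleCrawler", "https://example.com/search/results"))
# ...but may fetch ordinary pages, ideally pausing between requests.
print(parser.can_fetch("ExampleCrawler", "https://example.com/article/1"))
print(parser.crawl_delay("ExampleCrawler"))
# A specifically named bot is politely asked not to crawl at all.
print(parser.can_fetch("BadBot", "https://example.com/article/1"))
```

Note that nothing here is enforced: the file is a request, and whether a given user agent ought to run this check at all is exactly the question being argued above.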