varenc 5 hours ago
An agent acting on my behalf, following my specific and narrowly scoped instructions, should not obey robots.txt because it's not a robot/crawler. Just like how a single cURL request shouldn't follow robots.txt. (It also shouldn't generate any more traffic than a regular browser user.) Unfortunately "mass scraping the internet for training data" and an "LLM-powered user agent" get lumped together too much as "AI Crawlers". The user agent shouldn't actually be crawling.
saurik 3 hours ago
If your specific and narrowly scoped instructions cause the agent, acting on your behalf, to click a link that clearly isn't going to help it--a link that only gets clicked by scrapers because they are blindly downloading everything they can find without any real goal--then, frankly, you might as well be blocked too, as your narrowly scoped instructions must literally have been something like "scrape this website without paying any attention to what you are doing": an actual agent--just like an actual human--wouldn't find or click that link. (And none of that has anything to do with robots.txt.)
mcv 4 hours ago
If it's a robot it should follow robots.txt. And if it's following invisible links it's clearly crawling. Sure, a bad site could use this to screw with people, but bad sites have done that since forever in various ways. But if this technique helps against malicious crawlers, I think it's fair. The only downside I can see is that Google might mark you as a malware site. But again, they should be obeying robots.txt.
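(To make "it should follow robots.txt" concrete: here's a minimal sketch of a client that checks robots.txt before fetching, using Python's standard library. The site URL and agent name are placeholder assumptions, not anything specific to the article.)

    from urllib.parse import urljoin
    from urllib.robotparser import RobotFileParser

    SITE = "https://example.com"      # placeholder site
    AGENT = "MyAgent/1.0"             # hypothetical user-agent name

    # Fetch and parse the site's robots.txt once.
    rp = RobotFileParser()
    rp.set_url(urljoin(SITE, "/robots.txt"))
    rp.read()

    url = urljoin(SITE, "/some/page")
    if rp.can_fetch(AGENT, url):
        print("allowed:", url)        # go ahead and fetch
    else:
        print("disallowed by robots.txt, skipping:", url)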
hyperhopper 4 hours ago
Confused as to what you're asking for here. You want a robot acting out of spec to not be treated as a robot acting out of spec, because you told it to? How does that make you any different from the bad-faith LLM actors they are trying to block?
kijin 4 hours ago
How does a server tell an agent acting on behalf of a real person from the unwashed masses of scrapers? Do agents send a special header or token that other scrapers can't easily copy?

They get lumped together because they're more or less indistinguishable and cause similar problems: server load spikes, increased bandwidth, increased AWS bill ... with no discernible benefit for the server operator such as increased user engagement or ad revenue.

Now all automated requests are considered guilty until proven innocent. If you want your agent to be allowed, it's on you to prove that you're different. Maybe start by slowing down your agent so that it doesn't make requests any faster than the average human visitor would.
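(That last suggestion is easy to sketch: one page per request, an identifying User-Agent, and a human-ish pause between requests. The header value and delay range below are illustrative assumptions, using the third-party requests library.)

    import random
    import time

    import requests

    # Hypothetical identifying user-agent; a real one should point at a page
    # explaining what the agent is and how to block it.
    AGENT = "ExampleAssistant/0.1 (+https://example.com/about-this-agent)"

    def polite_get(url, min_delay=2.0, max_delay=6.0):
        """Fetch one page, then pause roughly as long as a human reader might."""
        resp = requests.get(url, headers={"User-Agent": AGENT}, timeout=10)
        time.sleep(random.uniform(min_delay, max_delay))
        return resp

    for u in ["https://example.com/a", "https://example.com/b"]:
        print(u, polite_get(u).status_code)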
AmbroseBierce 4 hours ago
Maybe your agent is smart enough to determine that going against the wishes of the website owner can be detrimental to your relationship with that owner, and therefore to the likelihood of the website continuing to exist, so it is prioritizing your long-term interests over your short-term ones.