mcv · 4 hours ago
If it's a robot, it should follow robots.txt. And if it's following invisible links, it's clearly crawling. Sure, a bad site could use this to screw with people, but bad sites have done that since forever in various ways. If this technique helps against malicious crawlers, I think it's fair. The only downside I can see is that Google might mark you as a malware site. But again, they should be obeying robots.txt.
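For what "follow robots.txt" means in practice: a compliant crawler checks each URL against the site's rules before fetching it. Here's a minimal sketch using Python's stdlib `urllib.robotparser`; the robots.txt content and the crawler name are made up for illustration (a real crawler would fetch `/robots.txt` from the site instead of hard-coding it):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt served by the site. A trap page linked only
# via invisible links would typically live under a disallowed path.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant crawler runs this check before every request:
print(rp.can_fetch("MyCrawler/1.0", "https://example.com/private/trap.html"))  # False
print(rp.can_fetch("MyCrawler/1.0", "https://example.com/index.html"))         # True
```

A crawler that skips this check and follows the hidden link into the disallowed path is exactly the kind of bot the trap is meant to catch.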
varenc · 4 hours ago
Should cURL follow robots.txt? What makes browser software not a robot? Should `curl <URL>` ignore robots.txt but `curl <URL> | llm` respect it? The line gets blurrier with things like OAI's Atlas browser. It's just re-skinned Chromium that works as a regular browser, but you can ask an LLM about the content of the page you just navigated to. The decision to use an LLM on that page is made after the page load. Doing the same thing without rendering the page doesn't seem meaningfully different.

In general, robots.txt is for headless automated crawlers fetching many pages, not for software performing a specific request on behalf of a user. If there's a 1:1 mapping between a user's request and a page load, it's not a robot. An LLM-powered user agent (browser) wouldn't follow invisible links, or any links, because it's not crawling.
droopyEyelids · 4 hours ago
Your web browser is a robot, and always has been. Even using netcat to manually type your GET request is a robot in some sense, since a machine is translating your ASCII and moving it between computers. The significant difference isn't whether a robot is doing the actions for you; it's whether the robot is acting as a user agent for a human or not.