michaelcampbell 7 hours ago

I also wonder; it's a normal scraper mechanism doing the scraping, right? Not necessarily an LLM in the first place, so the wholesale data-sucking isn't going to "read" the file even if it IS accessed? Or is this file meant to be "read" by an LLM long after the entire site has been scraped?
hamdingers 4 hours ago

Yes. It's a basic scraper that fetches the document, parses it for URLs using regex, then fetches all of those, and repeats forever. I've done honeypot tests with links in HTML comments, links in JavaScript comments, routes that only appear in robots.txt, etc. All of them get hit.
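The crawler described above can be sketched in a few lines. This is a hypothetical illustration, not any particular bot's code: it regexes anything URL-shaped out of the raw response (so links in comments are found just like real anchors), enqueues every new URL, and repeats. The `fetch` callable is injected so the sketch runs against a toy in-memory "site" instead of the real network.

```python
import re
from collections import deque

# Matches anything URL-shaped anywhere in the text, including inside
# HTML/JS comments -- which is why comment honeypots still get hit.
URL_RE = re.compile(r'https?://[^\s"\'<>)]+')

def extract_urls(text):
    """Pull every URL-looking string out of raw text."""
    return URL_RE.findall(text)

def crawl(start_url, fetch, limit=100):
    """Breadth-first crawl: fetch, scrape URLs, enqueue, repeat.

    `fetch` is a callable (url -> body text), injected so this
    sketch needs no real network access.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        visited.append(url)
        for link in extract_urls(fetch(url)):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Toy "site": the link hidden in an HTML comment is discovered
# exactly like the visible anchor, mirroring the honeypot result.
pages = {
    "https://example.com/": ('<a href="https://example.com/a">a</a>'
                             '<!-- https://example.com/hidden -->'),
    "https://example.com/a": "",
    "https://example.com/hidden": "",
}
order = crawl("https://example.com/", lambda u: pages.get(u, ""))
```

After running, `order` contains all three URLs, including the comment-only one, in breadth-first discovery order.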
reconnecting 7 hours ago

Absolutely. I assume there are data brokers, or AI companies themselves, constantly scraping the entire internet with non-AI crawlers and then processing the data in some way for use in training. But even with that process, llms.txt sees no significant requests, so there's no reason to think anyone actually uses it.
giancarlostoro 5 hours ago

I think it depends. LLMs can now look things up on the fly to get around the whole "this model was last updated in December 2025" problem of dated information. I've literally told Claude before to look something up after it accused me of making up fake news.