deadbabe 5 days ago:
Not every bot that ignores your robots.txt is necessarily using that data. What some bots do is scrape the whole site first, then check which parts are covered by robots.txt, and store that portion of the site under an "ignored" flag. That way, if your robots.txt changes later, they don't have to scrape the whole site again; they can just turn off the ignored flag.
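A minimal sketch of the flagging approach described above, using Python's stdlib `urllib.robotparser`. The function name `flag_pages`, the page store, and the `"ignored"` key are hypothetical illustrations, not any real crawler's API:

```python
# Hypothetical sketch: scrape everything, then mark pages disallowed by
# robots.txt with an "ignored" flag instead of discarding them, so a
# later robots.txt change only requires flipping flags, not re-crawling.
from urllib.robotparser import RobotFileParser


def flag_pages(robots_txt: str, pages: dict) -> dict:
    """Annotate already-scraped pages with an 'ignored' flag based on robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        url: {"content": body, "ignored": not parser.can_fetch("*", url)}
        for url, body in pages.items()
    }


pages = {
    "https://example.com/public": "<html>ok</html>",
    "https://example.com/private/page": "<html>hidden</html>",
}
robots = "User-agent: *\nDisallow: /private/"
store = flag_pages(robots, pages)
# store["https://example.com/private/page"]["ignored"] is True;
# if robots.txt later drops the Disallow, only the flags change.
```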
cyphar 5 days ago:
Ah, so the NSA defence then -- "it's not bulk collection because it only counts as collection when we look at it". | ||
nvader 5 days ago:
Not every intruder who enters your home is necessarily a burglar. | ||
imtringued 5 days ago:
Your post-rationalization just doubles down on the stance that these crawlers are abusive and poorly developed. You're also under the misconception that people are worried about their data, when they're actually worried about the load from a poorly configured crawler. The crawler will scrape the whole website at regular intervals anyway, so what is the point of an "optimization" that optimizes for highly infrequent events?