▲ | muppetman 3 days ago
I have this in my Apache conf for a site I don't want indexed/archived etc.

  Header set X-Robots-Tag "noindex, nofollow, noarchive, nositelinkssearchbox, nosnippet, notranslate, noimageindex"

Of course, only the beeping Internet Archive totally ignored it and scraped my site. And now, despite me trying many times, they won't remove it. It seems to mostly work; I also have Anubis in front of it now to keep the scrapers at bay.

(It's a personal diary website, started in 2000 before the term "blog" existed [EDIT: Not true - see below comment]. I know it's public content, I just don't want it to be publicly searchable.)
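For anyone wanting to confirm that a header like the one above is actually being served, here is a minimal sketch in Python. The function names (`parse_x_robots_tag`, `missing_directives`, `fetch_x_robots_tag`) are hypothetical helpers made up for illustration, and the parsing assumes the simple comma-separated form shown in the comment, not the variant that prefixes a specific crawler name.

```python
import urllib.request


def parse_x_robots_tag(value: str) -> set[str]:
    """Split a comma-separated X-Robots-Tag value into lowercase directives."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def missing_directives(value: str, required: list[str]) -> list[str]:
    """Return the required directives that are absent from the header value."""
    present = parse_x_robots_tag(value)
    return [d for d in required if d.lower() not in present]


def fetch_x_robots_tag(url: str) -> str:
    """Hypothetical helper: send a HEAD request and join any X-Robots-Tag headers."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return ", ".join(resp.headers.get_all("X-Robots-Tag") or [])


if __name__ == "__main__":
    # Header value copied from the Apache directive in the comment above.
    header = (
        "noindex, nofollow, noarchive, nositelinkssearchbox, "
        "nosnippet, notranslate, noimageindex"
    )
    print(missing_directives(header, ["noindex", "noarchive"]))
```

Note that this only verifies what the server sends. Whether a given crawler honors the header is up to that crawler, which is the complaint in this thread.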
▲ | worble 3 days ago
> Of course, only the beeping Internet Archive totally ignored it and scraped my site. And now, despite me trying many times, they won't remove it.

In all honesty, if you're hosting it on the internet, why is this a problem? If you didn't want it to be backed up, why is it publicly accessible at all?

I'm glad the Internet Archive will keep hosting this content even when the original is long gone. Let's say I'd read your website and wanted to look it up one day in the far future, only to find the domain had expired many years earlier. I'd be damn glad at least one organization had kept it readable.
▲ | muppetman 3 days ago
>> And now, despite me trying many times, they won't remove it.

> Good! It's literally the Internet Archive and you published it on the internet. That was your choice.

> As a general rule, people shouldn't get to remove things from the historical record.

> Sometimes we make exceptions for things that were unlawful to publish in the first place -- e.g. defamation, national secrets, certain types of obscene photos -- where there's a larger harm otherwise.

> But if you make something public, you make it public. I'm sorry you seem to at least partially regret that decision, but as a general rule, it's bad for humanity to allow people to erase things from what are now historical records we want to preserve.

But it's my content - it's not your content. I don't regret my decision; anything I really don't want public is behind a login. The website is still there, still getting crawled. What really upsets me the MOST though is that IA won't even reply to my requests to tell me "We're not going to remove it" - your reply (I am assuming from your wording you have some relationship with them, apologies if that's not the case) is the only information I've got! (Thanks)

[Note: the reply was from user crazygringo but I can't find it now, almost like they... removed it? It was public though and I'm SURE they won't mind me archiving it here for them.]
▲ | bayindirh 3 days ago
I have recently found out that the snapshots have a "why?" field. The archiver might not be the Internet Archive itself, but Common Crawl, Archive Team, etc. pushing your site to the Internet Archive. Look at the reason, and get mad at the correct people. It might be the Archive itself, but just be sure.
▲ | AnonC 3 days ago
> Of course, only the beeping Internet Archive totally ignored it and scraped my site. And now, despite me trying many times, they won't remove it.

Try using robots.txt to get it removed or excluded from the Internet Archive. The organization went back and forth on respecting robots.txt a couple of times, but it started respecting it (again) some years ago. Several years ago I was also frustrated by its refusal to remove some content taken from a site I owned, but later the change to follow robots.txt was implemented (and my site was removed). The FAQ has more information on how this works (there may be caveats). [1]

[1] https://support.archive-it.org/hc/en-us/articles/208001096-R...
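As a rough illustration of the robots.txt suggestion, here is a minimal Python sketch that checks whether a given robots.txt would exclude a particular crawler, using the standard library's `urllib.robotparser`. The user-agent name `ia_archiver` is an assumption based on what the Wayback Machine has historically documented. Check the linked FAQ for the name that currently applies. `is_blocked` and the `example.com` URLs are hypothetical names for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: "ia_archiver" is the user agent historically associated
# with Wayback Machine exclusions; verify against the current FAQ.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ia_archiver", "Googlebot"):
        blocked = is_blocked(ROBOTS_TXT, agent, "https://example.com/diary/")
        print(agent, "blocked" if blocked else "allowed")
```

This only tests how a compliant parser reads the file. As the comment notes, whether and when the Archive acts on it has changed over time.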
▲ | blueg3 3 days ago
The term "blog" existed in 1999, and "weblog" in '97.
▲ | asdefghyk 3 days ago
RE "...Of course, only the beeping Internet Archive totally ignored it and scraped my site. And now, despite me trying many times, they won't remove it...."

Why would you NOT want the Internet Archive to scrape your website? (I'm clueless - thank you)