muppetman 3 days ago

I have this in my Apache conf for a site I don't want indexed/archived etc.

Header set X-Robots-Tag "noindex, nofollow, noarchive, nositelinkssearchbox, nosnippet, notranslate, noimageindex"

Of course, only the beeping Internet Archive totally ignored it and scraped my site. And now, despite me trying many times, they won't remove it.

It seems to mostly work, I also have Anubis in front of it now to keep the scrapers at bay.

(It's a personal diary website, started in 2000 before the term "blog" existed [EDIT: Not true - see below comment]. I know it's public content, I just don't want it searchable public)
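
For anyone who wants to check what a header like the one above actually asks for, here is a minimal Python sketch that splits an X-Robots-Tag value into its directives. The function name `parse_x_robots_tag` is hypothetical, and the parsing assumes a plain comma-separated list with no per-bot prefixes (such as `googlebot: noindex`). Note that the header is only a request; each crawler decides whether to honor it.

```python
def parse_x_robots_tag(value: str) -> set[str]:
    """Split a comma-separated X-Robots-Tag value into lowercase directives.

    Assumption: no user-agent prefixes and no directives that contain commas.
    """
    return {part.strip().lower() for part in value.split(",") if part.strip()}


# The header value quoted in the comment above.
header = (
    "noindex, nofollow, noarchive, nositelinkssearchbox, "
    "nosnippet, notranslate, noimageindex"
)
directives = parse_x_robots_tag(header)

print(sorted(directives))
print("noarchive" in directives)  # True: the header does ask not to be archived
```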

worble 3 days ago | parent | next [-]

> Of course, only the beeping Internet Archive totally ignored it and scraped my site. And now, despite me trying many times, they won't remove it.

In all honesty, if you're hosting it on the internet, why is this a problem? If you didn't want it to be backed up, why is it publicly accessible at all? I'm glad the Internet Archive will keep hosting this content even when the original is long gone.

Let's say I'd read your website and wanted to look it up one day in the far future, only to find the domain had expired. I'd be damn glad at least one organization had kept it readable.

muppetman 3 days ago | parent [-]

A totally fair question. The simple answer is that I want to be in control of my content. Yes, I know it being public means I've already "lost control" in that you can scrape my website and that's that. But you scraping my website vs an anyone-can-search-it website like IA are two different things. IA claim they will honour removal requests, but then roundly fail to do so. And then have the gall to email me and ask me to donate.

Additionally, when I die, I want my website to go dark and that's that. It's a diary, it's very very mundane. My tech blog I post to, sure, I'm 200% happy to have that scraped/archived. For my diary, I keep very up-to-date offline copies that my family have access to, should I tip over tomorrow.

I realise this goes against the usual Internet wisdom, and I'm sure there's more than one Chinese AI/bot out there that's scraped it and I have zero control over. But where I allegedly do have control, I'd like to exercise it. I don't think that's an unfair/ridiculous request.

muppetman 3 days ago | parent | prev | next [-]

>> And now, despite me trying many times, they won't remove it.

>Good! It's literally the Internet Archive and you published it on the internet. That was your choice.

>As a general rule, people shouldn't get to remove things from the historical record.

>Sometimes we make exceptions for things that were unlawful to publish in the first place -- e.g. defamation, national secrets, certain types of obscene photos -- where there's a larger harm otherwise.

>But if you make something public, you make it public. I'm sorry you seem to at least partially regret that decision, but as a general rule, it's bad for humanity to allow people to erase things from what are now historical records we want to preserve.

But it's my content - it's not your content. I don't regret my decision, anything I really don't want public is behind a login. The website is still there, still getting crawled.

What really upsets me the MOST though is IA won't even reply to my requests to tell me "We're not going to remove it" - your reply (I am assuming from your wording you have some relationship with them, apologies if that's not the case) is the only information I've got! (Thanks)

[Note reply was from user crazygringo but I can't find it now, almost like they... removed it? It was public though and I'm SURE they won't mind me archiving it here for them.]

yjftsjthsd-h 3 days ago | parent [-]

> Note reply was from user crazygringo but I can't find it now, almost like they... removed it? It was public though and I'm SURE they won't mind me archiving it here for them.

So... you believe that your and IA's behavior is or is not okay? Because it's a touch odd to start playing the other side now.

muppetman 3 days ago | parent [-]

I am obviously being a dick to prove my point on what a pathetic argument "It was public there's NOTHING we can do now" is.

yjftsjthsd-h 3 days ago | parent [-]

Being a hypocrite doesn't make your point, it undermines it. Also, if that's your position you really need to stop posting on this site, since after a short initial window HN doesn't let you delete comments.

bayindirh 3 days ago | parent | prev | next [-]

I have recently found out that the snapshots have a "why?" field. The archiver might not be the Internet Archive itself, but Common Crawl, Archive Team, etc. pushing your site to the Internet Archive.

Look at the reason, and get mad at the correct people.

It might be the archive themselves, but just be sure.

muppetman 3 days ago | parent [-]

Thanks - wasn't aware. (why: certificate-transparency, open-research-datasets, webwidecrawl)

I still don't fathom why they just _ignore_ the request not to be scraped with the above headers. It's rude.

AnonC 3 days ago | parent | prev | next [-]

> Of course, only the beeping Internet Archive totally ignored it and scraped my site. And now, despite me trying many times, they won't remove it.

Try using robots.txt to get it removed or excluded from The Internet Archive. The organization went back and forth on respecting robots.txt a couple of times, but it started respecting it (again) some years ago.

Several years ago I was also frustrated by its refusal to remove some content taken from a site I owned, but later the change to follow robots.txt was implemented (and my site was removed).

The FAQ has more information on how this works (there may be caveats). [1]

[1] https://support.archive-it.org/hc/en-us/articles/208001096-R...
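
As a rough illustration of the robots.txt approach, the sketch below uses Python's standard `urllib.robotparser` to show how a rule aimed at one crawler is evaluated. The `ia_archiver` token is the user agent historically associated with the Wayback Machine; whether the Archive honors it today is an assumption here, so check the FAQ. The `example.com` URL and the `is_allowed` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "ia_archiver", "https://example.com/diary/"))   # False
print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/diary/"))  # True
```

This only shows what a well-behaved crawler would conclude from the file. Like the X-Robots-Tag header, robots.txt is advisory and cannot force a crawler to comply.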

3 days ago | parent | prev | next [-]
[deleted]
blueg3 3 days ago | parent | prev | next [-]

The term "blog" existed in 1999, and "weblog" in '97.

muppetman 3 days ago | parent [-]

Thank you - I started my diary in Oct 2000 and I didn't hear the term until after then. Or I chose to ignore it, it's that long ago I can't recall :) I have updated my comment above.

asdefghyk 3 days ago | parent | prev | next [-]

RE "...Of course, only the beeping Internet Archive totally ignored it and scraped my site. And now, despite me trying many times, they won't remove it...."

Why would you NOT want the Internet Archive to scrape your website? (I'm clueless - thank you)

muppetman 3 days ago | parent [-]

It's a personal diary - very mundane. I don't _want_ to pollute search with the fact I struggled with getting my socks on yesterday because of my bad back.

Yes I could password protect it (and any really personal content is locked behind being logged in, AI hasn't scraped that) but I _like_ being able to share links with people without having to also share passwords.

I realise the HN crowd is very much "More eyeballs are better for business" but this isn't business. This is a tiny, 5 hits a month (that's not me writing it) website.

3 days ago | parent | prev [-]
[deleted]