| ▲ | mdaniel 3 days ago |
| No, I meant the .warc.zst files on archive.org that were the result of the ArchiveTeam's work. However, it seems they're under some kind of embargo, which is the first time I've ever seen a private link on archive.org |
|
| ▲ | rafram 3 days ago | parent | next [-] |
| I can see some reasonable arguments for not publishing the full dataset. People undoubtedly shortened lots of links to unlisted videos/documents/pages under the assumption that the short link, like the original link, would be unguessable. |
| |
| ▲ | mdaniel 3 days ago | parent [-] | | Then why go to the trouble of archiving them and uploading them to a public archive site, only to keep them secret? I'm sure pastebin is filled with people's AWS credentials, too, but you don't see them randomly denying access to listings | | |
| ▲ | rafram 3 days ago | parent [-] | | Because then you can access the archived destination if you already know the short URL. You just can't get a full list of potentially sensitive short URL/destination pairs. | | |
| ▲ | mdaniel 3 days ago | parent | next [-] | | You are aware of which thread you're discussing this in, right? The one where a bunch of like-minded souls enumerated all the address space in a few weeks? The sibling link above that queries Wayback's warc index shows at least the first several are only 6 alnum wide, so it's no wonder the ArchiveTeam got them in reasonable time. Picking one at random, it seems the super sekrit deets you're safeguarding include buyrussia21.co.kr which, yes, is for sure very, very secret | | |
| ▲ | brokensegue 3 days ago | parent [-] | | I asked them why they did this. The answer, surprisingly, is that they fear that if they release the full dumps they will get blocked because of the AI scraping wars. | | |
| ▲ | cedws 3 days ago | parent [-] | | Feels like a bit of a kick in the teeth that I contributed towards archiving something that I don’t even get access to. What happens if they disappear? The dataset is gone forever. |
|
| |
| ▲ | yreg a day ago | parent | prev [-] | | Yeah, what they did is probably the best way to handle it. |
|
|
|
|
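For a rough sense of the "6 alnum wide" point raised above, here is a back-of-the-envelope sketch. It assumes the short codes are case-sensitive (62 possible characters per position), and the request rate is purely hypothetical; neither figure comes from the thread.

```python
import string

# Assumption: codes are case-sensitive alphanumerics, i.e. 26 + 26 + 10 = 62 symbols.
ALPHABET = string.ascii_letters + string.digits


def keyspace(length: int, alphabet_size: int = len(ALPHABET)) -> int:
    """Number of distinct codes of exactly `length` characters."""
    return alphabet_size ** length


def days_to_enumerate(length: int, requests_per_second: float) -> float:
    """Days needed to try every code once at a constant aggregate request rate."""
    return keyspace(length) / requests_per_second / 86_400


if __name__ == "__main__":
    print(f"{keyspace(6):,} possible 6-character codes")
    # 10,000 req/s is a made-up aggregate rate for a distributed crawl.
    print(f"{days_to_enumerate(6, 10_000):.1f} days at a hypothetical 10,000 req/s")
```

Each extra character multiplies the space by 62, so the conclusion changes quickly if the codes are longer than the first few samples suggest.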
| ▲ | viliml 2 days ago | parent | prev [-] |
| Tangentially related, but I've seen Twitter links that used to be on the Wayback Machine disappear from it at some point, presumably due to a personal request from the owner. |
| |
| ▲ | corobo 2 days ago | parent [-] | | Pretty sure you can nuke all your domain's old content by blocking archive.org in robots.txt |
|
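As a minimal sketch of what "blocking archive.org in robots.txt" could look like, the snippet below parses a sample robots.txt with Python's standard `urllib.robotparser` and checks which crawlers it shuts out. The `ia_archiver` user agent is an assumption here (it is the name the Wayback Machine has historically been associated with), and `is_blocked` is a hypothetical helper. Whether the archive still honors such rules, retroactively or at all, is not something this code can confirm.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt: deny one crawler everything, allow everyone else.
# Assumption: "ia_archiver" is the user agent the archive crawler matches on.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `robots_txt` forbids `user_agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ia_archiver", "SomeOtherBot"):
        verdict = "blocked" if is_blocked(ROBOTS_TXT, agent, "https://example.com/page") else "allowed"
        print(f"{agent}: {verdict}")
```

This only shows how the rules parse locally; it says nothing about what any given crawler actually does with them.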