mdaniel 3 days ago

You are aware of which thread you're discussing this in, right? The one where a bunch of like-minded souls enumerated all the address space in a few weeks?

The sibling link above that queries Wayback's warc index shows at least the first several are only 6 alphanumeric characters wide, so it's no wonder the ArchiveTeam got them in reasonable time.
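For a rough sense of scale, here is a back-of-the-envelope sketch of how big a 6-character alphanumeric space is. The thread does not say whether the codes are case-sensitive, so both alphabets below are assumptions, and `days_to_enumerate` with its request rate is a hypothetical helper, not a figure from ArchiveTeam.

```python
import string

# Assumed alphabets: the thread only says "alnum", not whether case matters.
LOWER_ALNUM = string.ascii_lowercase + string.digits  # 36 symbols
MIXED_ALNUM = string.ascii_letters + string.digits    # 62 symbols


def keyspace(alphabet: str, length: int) -> int:
    """Number of distinct codes of exactly `length` symbols."""
    return len(alphabet) ** length


def days_to_enumerate(total: int, requests_per_second: float) -> float:
    """Days needed to try every code at a sustained aggregate request rate."""
    return total / requests_per_second / 86_400


if __name__ == "__main__":
    for name, alphabet in (("lowercase+digits", LOWER_ALNUM),
                           ("mixed case+digits", MIXED_ALNUM)):
        total = keyspace(alphabet, 6)
        # 10_000 req/s is an illustrative aggregate rate, not a measured one.
        print(name, total, round(days_to_enumerate(total, 10_000), 1))
```

Either way the space is in the billions to tens of billions, which a distributed volunteer effort can plausibly walk in weeks.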

Picking one at random, it seems the super sekrit deets you're safeguarding include buyrussia21.co.kr which, yes, is for sure very, very secret

brokensegue 3 days ago | parent [-]

I asked them why they did this. The answer, surprisingly, is that they fear that if they release the full dumps they will get blocked because of the AI scraping wars.

cedws 3 days ago | parent | next [-]

Feels like a bit of a kick in the teeth that I contributed towards archiving something that I don’t even get access to. What happens if they disappear? The dataset is gone forever.

brokensegue 2 days ago | parent [-]

You get access to it via the wayback machine

mdaniel 2 days ago | parent | prev | next [-]

This whole thread is starting to read like some kind of misguided practical joke. I also recognize that it may seem like this is directed toward you, but I'm not shooting the messenger; I'm just anchoring my reply under this new information. Sorry about that.

But, ok, let's continue in good faith.

scenario 1: they don't want to uncork the .warc files because it will potentially leak the means and methods of the Archive Warrior or its usages

scenario 2: they don't want to expose the target of the redirects because it will feed the boundaries of the ravenous AI slurp machines

If it's scenario 1, then CSV exists and allows mapping from the 00aa11 codes to the "location:" header, no means and methods necessary
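A minimal sketch of what such a CSV export could look like, assuming you already hold the raw HTTP response headers for each short code. The function names, the input shape, and the `example.com` target are all hypothetical; this does not read .warc files and is not ArchiveTeam's tooling.

```python
import csv
import io
from typing import Dict, Optional


def location_of(raw_headers: str) -> Optional[str]:
    """Return the Location header value from a raw HTTP response, if present."""
    # Skip the status line, then match the header name case-insensitively.
    for line in raw_headers.splitlines()[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "location":
            return value.strip()
    return None


def to_csv(responses: Dict[str, str]) -> str:
    """Map short code -> redirect target as CSV text (hypothetical format)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["code", "location"])
    for code, raw in sorted(responses.items()):
        # Codes with no redirect target get an empty second column.
        writer.writerow([code, location_of(raw) or ""])
    return buf.getvalue()


if __name__ == "__main__":
    sample = {
        "00aa11": "HTTP/1.1 301 Moved Permanently\r\n"
                  "Location: http://example.com/\r\n"
                  "Content-Length: 0\r\n\r\n",
    }
    print(to_csv(sample), end="")
```

Only the code and the redirect target end up in the output, so nothing about how the crawl was performed is carried along.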

If it's scenario 2, then what the hell were they expecting to happen? Embargo the .warc until the AI hype blows over so their great-grandchildren can read about how the Internet was back in the day? I guess the real question is "archive for whom?" Right now, unless they have a back-channel way to feed the Wayback Machine's boundary using the .warc files, so that it secretly populates the Wayback without wholesale feeding the AI boundary, this whole thing is just mysterious.

brokensegue 2 days ago | parent | next [-]

I think you're missing some key information. The warcs do not just contain the location header information, and their methods are fully public/open source, so scenario 1 makes no sense.

Sure, maybe the warcs will be unlocked at some point in the future. This is a fairly small volunteer effort. I doubt there is some "unlock in 100 years" feature on IA.

nicolas_17 2 days ago | parent | prev | next [-]

Yes exactly, Wayback Machine can use the warc files despite them being blocked for direct download.

osiride 2 days ago | parent | prev [-]

[dead]

globular-toast 3 days ago | parent | prev [-]

Who fears they will get blocked by whom?

brokensegue 2 days ago | parent [-]

ArchiveTeam getting blocked by hosts that want to protect their data from AI companies (presumably because they want to extract money from them).