giancarlostoro 6 hours ago

Friendly reminder that ArchiveBox exists to let you self-host your own archive service.

https://github.com/ArchiveBox/ArchiveBox

I dream of a day when ArchiveBox becomes a fleet of homelabs all over the world, making it drastically harder to block them all.

nikisweeting 29 minutes ago | parent | next [-]

I've been mulling over how to take ArchiveBox in this direction for years, but it's a really hard problem to tackle because of privacy. https://docs.sweeting.me/s/cookie-dilemma

Most content is going behind logins these days, and if you include the PII of the person doing the archiving in the archives, then it's A. really easy for providers to block that account and B. potentially dangerous, since it can dox the person doing the archiving. The problem with removing PII from logged-in sites is that it's not as simple as stripping some EXIF data: the HTML and JS are littered with secret tokens, usernames, user-specific notifications, etc. that would reveal the identity of the archivist and can't be removed without breaking page behavior on replay.

My latest progress is that it might be possible to anonymize logged-in snapshots by using the intersection of two different logged-in snapshots, making them easier to share over a distributed system like BitTorrent or IPFS without doxxing the archivist.

More here: https://github.com/pirate/html-private-set-intersection
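
To make the intersection idea concrete, here is a minimal sketch. It assumes two snapshots of the same page captured by two different accounts and keeps only the lines both share, in order. This is a line-level illustration using Python's difflib, not how html-private-set-intersection actually works; the function name `common_lines` and the sample HTML are hypothetical.

```python
from difflib import SequenceMatcher


def common_lines(snapshot_a: str, snapshot_b: str) -> str:
    """Keep only the lines that appear, in the same order, in both snapshots."""
    a = snapshot_a.splitlines()
    b = snapshot_b.splitlines()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    kept = []
    # Each matching block is a run of lines identical in both snapshots.
    for block in matcher.get_matching_blocks():
        kept.extend(a[block.a:block.a + block.size])
    return "\n".join(kept)


# Hypothetical snapshots of one page, captured while logged in as two users.
alice = """<html>
<body>
<span class="user">alice</span>
<input type="hidden" name="csrf" value="a1b2c3">
<article>Shared article text</article>
</body>
</html>"""

bob = """<html>
<body>
<span class="user">bob</span>
<input type="hidden" name="csrf" value="z9y8x7">
<article>Shared article text</article>
</body>
</html>"""

shared = common_lines(alice, bob)
print(shared)
```

Anything that differs between the two captures (username, CSRF token) drops out, while the shared article survives. A naive line diff like this would likely still break replay on real pages, and it cannot remove anything the two accounts happen to have in common.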

e2le 22 minutes ago | parent | prev | next [-]

Out of curiosity, does ArchiveBox integrate some way of verifying the contents of the archived page(s) are legitimate and unmodified?

nikisweeting 18 minutes ago | parent [-]

ArchiveBox open source does not, but I have set it up for paying clients in the past using TLSNotary. This is actually a very hard problem and is not as simple as saving traffic hashes + original SSL certs (because HTTPS connections use a symmetric key after the initial handshake, the archivist can forge server responses and claim the server sent things that it did not).
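
A toy illustration of the forgery problem, assuming a heavily simplified model where each message is authenticated with an HMAC under one shared session key. Real TLS record protection is more involved, and every name here (`session_key`, `tag`, `verify`) is hypothetical, but the core issue is the same: whoever holds the symmetric key can produce a valid tag for any message.

```python
import hashlib
import hmac
import secrets

# Stand-in for the symmetric key that BOTH client and server hold after the handshake.
session_key = secrets.token_bytes(32)


def tag(key: bytes, message: bytes) -> bytes:
    """Compute an authentication tag for a message under the shared key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, mac: bytes) -> bool:
    """Check a tag in constant time."""
    return hmac.compare_digest(tag(key, message), mac)


# The server authenticates what it really sent.
real_response = b"HTTP/1.1 200 OK\r\n\r\nThe server really sent this."
real_tag = tag(session_key, real_response)

# The archivist holds the same key, so they can tag a response the server never sent.
forged_response = b"HTTP/1.1 200 OK\r\n\r\nThe server never sent this."
forged_tag = tag(session_key, forged_response)

real_ok = verify(session_key, real_response, real_tag)
forged_ok = verify(session_key, forged_response, forged_tag)
print(real_ok, forged_ok)  # both verify, so the tag proves nothing to a third party
```

Both messages verify under the same key, which is why a saved transcript plus the session keys cannot prove to anyone else what the server actually sent.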

There is only one reasonable approach that I know of as of today: https://tlsnotary.org/docs/intro, and it still involves trusting a third party with reputation (though it cleverly uses a zk algorithm so that the third party doesn't have to see the cleartext). Anyone claiming to provide "verifiable" web archives is likely lying or overstating it unless they are using TLSNotary or a similar approach. I've seen far too many companies make impossible claims about "signed" or "verified" web archives over the last decade. Be very critical any time you see someone claiming that unless they talk explicitly about the "TLS Non-Repudiation Problem" and how they solve it: https://security.stackexchange.com/questions/103645/does-ssl...

codedokode 6 hours ago | parent | prev [-]

I think the opposite: people reading in the news that the FBI is after archiving sites will not want to launch their own, except maybe the radical types.

wartywhoa23 4 hours ago | parent [-]

1960s: FBI/CIA invents the term "conspiracy theorist"

2020s: FBI/CIA invents the term "radical archivist"