nikisweeting 2 hours ago
I've been mulling over how to take ArchiveBox in this direction for years, but it's a really hard problem to tackle because of privacy. https://docs.sweeting.me/s/cookie-dilemma

Most content is going behind logins these days, and if you include the PII of the person doing the archiving in the archives, then it's A. really easy for providers to block that account, and B. potentially dangerous, because it can dox the person doing the archiving.

The problem with removing PII from logged-in sites is that it's not as simple as stripping some EXIF data: the HTML and JS are littered with secret tokens, usernames, user-specific notifications, etc. that would reveal the identity of the archivist and can't be removed without breaking page behavior on replay.

My latest progress is that it might be possible to anonymize logged-in snapshots by keeping only the intersection of two different logged-in snapshots, making them easier to share over a distributed system like BitTorrent or IPFS without doxxing the archivist. More here: https://github.com/pirate/html-private-set-intersection
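To make the intersection idea concrete, here is a rough sketch, not the approach used in the linked repo (which may work quite differently). It assumes the two snapshots are captures of the same page made from two unrelated accounts, tokenizes each one, and keeps only the token runs that appear in both, replacing everything else with a placeholder. The names `tokenize`, `intersect_snapshots`, and `PLACEHOLDER` are made up for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical marker written wherever the two snapshots disagree.
PLACEHOLDER = "[REDACTED]"


def tokenize(html: str) -> list[str]:
    """Split HTML into whole tags, whitespace runs, and text words."""
    # The trailing "<" alternative catches a stray "<" that never closes.
    return re.findall(r"<[^>]*>|\s+|[^<\s]+|<", html)


def intersect_snapshots(snapshot_a: str, snapshot_b: str) -> str:
    """Keep only the token runs common to both snapshots (naive sketch)."""
    tokens_a = tokenize(snapshot_a)
    tokens_b = tokenize(snapshot_b)
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)

    kept = []
    for op, a_start, a_end, _b_start, _b_end in matcher.get_opcodes():
        if op == "equal":
            # Identical in both captures, so assumed not account-specific.
            kept.extend(tokens_a[a_start:a_end])
        else:
            # Differs between accounts: drop it rather than risk leaking PII.
            kept.append(PLACEHOLDER)
    return "".join(kept)


# Toy example: same page captured while logged in as two different users.
alice_view = '<div data-user="alice"><p>Hello alice</p><p>Public article text</p></div>'
bob_view = '<div data-user="bob"><p>Hello bob</p><p>Public article text</p></div>'

print(intersect_snapshots(alice_view, bob_view))
```

Two caveats with a sketch like this: anything that happens to be identical in both captures (a shared IP address, the same browser fingerprint) survives the intersection, and redacting whole tokens will often break page behavior on replay, which is exactly the hard part described above.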