sanqui 3 hours ago
That's neat! In my opinion, the WARC format is quite tricky and underspecified, especially since HTTP/2 introduced new semantics. It encodes too much in-band and requires rewriting the server's data. A mitmproxy capture is higher fidelity and supports capturing modern features such as WebSockets. I think if we could wrap Kage's crawler interactions with mitmproxy and store its capture (the intercepted traffic), we could potentially end up with a nice new archival format.
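To make the "in-band" point concrete, here is a minimal sketch of what a WARC/1.1 response record looks like on disk. The helper names (`h2_response_as_http1_text`, `build_warc_response`) are hypothetical and this is not Kage's code or any particular tool's output. It assumes the common approach of re-serializing an HTTP/2 response as HTTP/1.1-style text before storing it, which is the kind of rewriting mentioned above: HTTP/2 is binary-framed and has no textual status line, so one has to be synthesized.

```python
import uuid
from datetime import datetime, timezone
from http import HTTPStatus


def h2_response_as_http1_text(status, headers, body):
    """Synthesize HTTP/1.1-style text for a response that arrived over HTTP/2.

    Assumption for illustration: the archiver flattens the response this way.
    The original framing, stream IDs, and header ordering guarantees are lost.
    """
    lines = [f"HTTP/1.1 {status} {HTTPStatus(status).phrase}"]
    lines += [f"{name}: {value}" for name, value in headers]
    return "\r\n".join(lines).encode("latin-1") + b"\r\n\r\n" + body


def build_warc_response(target_uri, http_bytes):
    """Wrap serialized HTTP bytes in a single WARC/1.1 'response' record."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    warc_headers = [
        "WARC/1.1",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http;msgtype=response",
        f"Content-Length: {len(http_bytes)}",
    ]
    head = "\r\n".join(warc_headers).encode("utf-8") + b"\r\n\r\n"
    # The HTTP status line, HTTP headers, and body all sit inside the record
    # block, followed by the two CRLFs that terminate a WARC record.
    return head + http_bytes + b"\r\n\r\n"


payload = h2_response_as_http1_text(200, [("content-type", "text/plain")], b"hi")
record = build_warc_response("https://example.com/", payload)
print(record.decode("latin-1"))
```

Everything a reader needs (status, headers, body) has to be parsed back out of that one byte blob, whereas a proxy capture keeps the exchange as structured flow data.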
tamnd 3 hours ago
I tried to follow well-known formats first, such as WARC and ZIM from Kiwix, so we could benefit from existing tooling support. For my own custom data format, I have a lot of private code that I plan to release soon. It is optimized for compression, fast lookups, and more. I have been working on it for two years. This is part of a larger, ambitious umbrella project: I am building Google from scratch (all open source), something that anyone can host, including the crawler, indexer, storage, and serving layers. Stay tuned!