tamnd 3 hours ago
I'm working with WARC too, in the format used by Common Crawl! Converting it to Markdown saves a lot of space, though that is for a different purpose and a different project: https://github.com/tamnd/ccrawl-cli
sanqui 3 hours ago
That's neat! In my opinion, the WARC format is quite tricky and underspecified, especially since HTTP/2 introduced new semantics. It encodes too much in-band and requires rewriting the server's data. A mitmproxy capture is higher fidelity and supports capturing modern features such as WebSockets. I think if we could route Kage's crawler interactions through mitmproxy and store its capture (the intercepted traffic), we could make a potentially nice new archival format.