Remix.run Logo
tamnd 3 hours ago

I'm working on WARC too, with format from Common Crawl!

By converting it to Markdown, we save a lot of space, but it is for a different purpose and a different project: https://github.com/tamnd/ccrawl-cli

sanqui 3 hours ago | parent [-]

That's neat! In my opinion, the WARC format is quite tricky and underspecified especially since HTTP2 introduced new semantics. It encodes too much in-band and requires rewriting of the server data. A mitmproxy capture is higher fidelity and supports capturing modern features such as WebSockets. I think if we could wrap Kage's crawler interactions by it and store its capture (the intercepted traffic), we could make a potentially nice new archival format.

tamnd 3 hours ago | parent [-]

I tried to follow well-known formats first, such as WARC and ZIM from Kiwix, so we could benefit from existing tooling support.

For my own custom data format, I have a lot of private code that I plan to release soon. It is optimized for compression, fast lookups, and more. I have been working on it for two years. This is part of a larger, ambitious umbrella project: I am building Google from scratch (all open source), something that anyone can host, including the crawler, indexer, storage, and serving layers. Stay tuned!

sanqui 3 hours ago | parent | next [-]

I'm a fan of compatibility with established formats!

Sounds awesome. There is a lot of untapped potential with respect to efficiently archiving and indexing websites. I saw the impressive things Marginalia Search is doing in this area (the blog is great when it gets technical). There is also a lot of very complete archives of websites out there which are not being indexed at all, and I would love to make them available for researchers. In any case, I'm interested in your project!

Prime_Axiom 2 hours ago | parent | prev [-]

Looking forward to the next project! I love these kinds of archiving tools.