Paul-E 16 hours ago
That's pretty neat! I did something similar. I built a tool[1] to import the Project Arctic Shift dumps[2] of Reddit into SQLite. It was mostly an exercise to experiment with Rust and SQLite (HN's two favorite topics). If you skip building an FTS5 index and import without WAL (--unsafe-mode), importing every Reddit comment and submission takes a bit over 24 hours and produces a ~10TB DB. SQLite offers a lot of cool JSON features that would let you store the raw JSON and operate on it directly, but I eschewed them in favor of parsing only once at load time; that also lets me normalize the data a bit. I find that building the DB is pretty "fast", but queries run much faster if I vacuum the DB immediately after building it. The vacuum operation is actually slower than the original import, taking a few days to finish.

[1] https://github.com/Paul-E/Pushshift-Importer

[2] https://github.com/ArthurHeitmann/arctic_shift/blob/master/d...
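For anyone curious what this setup looks like in code, here's a minimal sketch of that kind of unsafe-mode bulk import using rusqlite. The table schema, column names, and sample row are made up for illustration; this is not Pushshift-Importer's actual code.

    // Sketch: fast bulk import into SQLite, trading durability for speed.
    use rusqlite::{params, Connection, Result};

    fn main() -> Result<()> {
        let mut conn = Connection::open("reddit.db")?;

        // The "--unsafe-mode" trade-off: no journal, no fsyncs. A crash
        // mid-import can corrupt the DB, but you can just delete and rerun.
        conn.pragma_update(None, "journal_mode", "OFF")?;
        conn.pragma_update(None, "synchronous", "OFF")?;

        // Hypothetical schema for illustration only.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS comment (
                 id     TEXT PRIMARY KEY,
                 author TEXT,
                 body   TEXT
             )",
            [],
        )?;

        // Parse each JSON record once at load time and insert normalized
        // columns, rather than storing raw JSON and querying it later
        // with json_extract().
        let tx = conn.transaction()?;
        {
            let mut stmt = tx.prepare(
                "INSERT INTO comment (id, author, body) VALUES (?1, ?2, ?3)",
            )?;
            // A real importer would stream the dump files through this loop.
            stmt.execute(params!["t1_abc123", "some_user", "comment text"])?;
        }
        tx.commit()?;
        Ok(())
    }

Batching many inserts per transaction with a prepared statement is where most of the speed comes from; the pragmas mostly remove fsync stalls.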
Xyra 2 hours ago
Holy cow, I didn't know getting Reddit data was that straightforward. I'm building public read-only SQL+vector databases optimized for exploring high-quality public commons with Claude Code (https://exopriors.com/scry), so I can't wait until some funding source comes in and I can upgrade to a $1500/month Hetzner server and pay the ~$1k to embed all of that.
s_ting765 15 hours ago
You could check out SQLite's auto_vacuum pragma, which reclaims space without rebuilding the entire DB: https://sqlite.org/pragma.html#pragma_auto_vacuum
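A rough sketch of the incremental variant, again assuming rusqlite. One caveat from the linked docs: auto_vacuum has to be set before the first table is created, otherwise switching an existing file still requires one full VACUUM.

    // Sketch: incremental auto_vacuum instead of a multi-day full VACUUM.
    use rusqlite::{Connection, Result};

    fn main() -> Result<()> {
        let conn = Connection::open("reddit.db")?;

        // 0 = NONE, 1 = FULL, 2 = INCREMENTAL. Must be set on a fresh DB
        // (or be followed by a full VACUUM to convert an existing one).
        conn.pragma_update(None, "auto_vacuum", 2)?;

        // ... create tables and run the import ...

        // Reclaim up to N free pages per call, instead of rewriting the
        // whole multi-terabyte file in one go:
        conn.execute_batch("PRAGMA incremental_vacuum(10000);")?;
        Ok(())
    }

Worth noting, though: per the same docs page, auto_vacuum only moves free pages to the end of the file and doesn't defragment, so it may not reproduce the query speedup that the full post-import VACUUM gives.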