| ▲ | simonw 15 hours ago |
| Don't miss how this works. It's not a server-side application - this code runs entirely in your browser using SQLite compiled to WASM, but rather than fetching the full 22GB database it uses a clever hack that retrieves just the "shards" of the SQLite database needed for the page you are viewing. I watched it in the browser network panel and saw it fetch: https://hackerbook.dosaygo.com/static-shards/shard_1636.sqlite.gz
https://hackerbook.dosaygo.com/static-shards/shard_1635.sqlite.gz
https://hackerbook.dosaygo.com/static-shards/shard_1634.sqlite.gz
as I paginated to previous days. It's reminiscent of that brilliant sql.js-httpvfs trick from a few years ago: https://github.com/phiresky/sql.js-httpvfs - only that one used HTTP range headers, whereas this one uses sharded files instead. The interactive SQL query interface at https://hackerbook.dosaygo.com/?view=query asks you to select which shards to run the query against; there are 1636 total. |
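(A minimal sketch of that shard-fetch pattern in the browser, purely for illustration - the shard numbering, table name, and sql.js wiring here are guesses, not the site's actual code:)

```typescript
import initSqlJs from "sql.js"; // sql.js: SQLite compiled to WASM

// Hypothetical: map the page being viewed to one static shard file.
const SHARD_BASE = "https://hackerbook.dosaygo.com/static-shards/";

async function loadShard(shardId: number) {
  // Fetch only the shard needed for the current page, not the whole database.
  const res = await fetch(`${SHARD_BASE}shard_${shardId}.sqlite.gz`);
  // Assumes the host serves the .gz bytes raw (no Content-Encoding), so we decompress ourselves.
  const stream = res.body!.pipeThrough(new DecompressionStream("gzip"));
  const bytes = new Uint8Array(await new Response(stream).arrayBuffer());

  // Open the shard as an in-memory SQLite database.
  const SQL = await initSqlJs({ locateFile: (f) => `https://sql.js.org/dist/${f}` });
  return new SQL.Database(bytes);
}

// Usage: query one shard's items (table and column names are made up for illustration).
const db = await loadShard(1636);
const rows = db.exec("SELECT id, title FROM items ORDER BY id DESC LIMIT 30");
console.log(rows);
```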
|
| ▲ | ncruces 11 hours ago | parent | next [-] |
| A read-only VFS doing this can be really simple, with the right API… This is my VFS:
https://github.com/ncruces/go-sqlite3/blob/main/vfs/readervf... And using it with range requests:
https://pkg.go.dev/github.com/ncruces/go-sqlite3/vfs/readerv... And having it work with a Zstandard-compressed SQLite database is one library away:
https://pkg.go.dev/github.com/SaveTheRbtz/zstd-seekable-form... |
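(The Go packages above do this properly; as a language-agnostic illustration, this is roughly the one primitive a read-only range-request VFS needs - a sketch, not ncruces' API:)

```typescript
// Sketch of the core primitive a read-only HTTP VFS needs: "read N bytes at offset".
// SQLite asks its VFS for specific pages; each page read maps to one Range request.
async function readAt(url: string, offset: number, length: number): Promise<Uint8Array> {
  const res = await fetch(url, {
    headers: { Range: `bytes=${offset}-${offset + length - 1}` },
  });
  if (res.status !== 206) throw new Error("server does not support range requests");
  return new Uint8Array(await res.arrayBuffer());
}

// e.g. read the 100-byte SQLite header, then the second page of a 4 KiB-page database.
const header = await readAt("https://example.com/data.sqlite", 0, 100);
const page2 = await readAt("https://example.com/data.sqlite", 4096, 4096);
```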
| |
| ▲ | keepamovin an hour ago | parent | next [-] | | Your page is served over a SQLite VFS with Range requests? Let's try this here. | |
| ▲ | pdyc 6 hours ago | parent | prev [-] | | This doesn't cache the data, right? It would always fetch from the network? By any chance do you know of a solution/extension that caches the data? It would make it so much more efficient. | | |
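(One possible browser-side approach, not something the VFS or the site does for you: wrap the fetches in the Cache API so repeat visits are served from disk. A rough sketch:)

```typescript
// Sketch: cache shard/range fetches in the browser's Cache Storage.
// Fits immutable static files like these shards; eviction is up to the browser.
async function cachedFetch(url: string): Promise<Response> {
  const cache = await caches.open("sqlite-shards-v1");
  const hit = await cache.match(url);
  if (hit) return hit;                              // served locally, no network
  const res = await fetch(url);
  if (res.ok) await cache.put(url, res.clone());    // store a copy for next time
  return res;
}
```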
|
|
| ▲ | keepamovin 6 hours ago | parent | prev | next [-] |
| Thanks! I'm glad you enjoyed the sausage being made. There's a little easter egg if you click on the compact disc icon. And I just now added a 'me' view. Enter your username and it will show your comments/posts on any day. So you can scrub back through your 2006 - 2025 retrospective using the calendar buttons. |
|
| ▲ | nextaccountic 13 hours ago | parent | prev | next [-] |
| Is there anything more production-grade built around the same idea of HTTP range requests as that sqlite thing? This has so much potential |
| |
| ▲ | Humphrey 12 hours ago | parent | next [-] | | Yes — PMTiles is exactly that: a production-ready, single-file, static container for vector tiles built around HTTP range requests. I’ve used it in production to self-host Australia-only maps on S3. We generated a single ~900 MB PMTiles file from OpenStreetMap (Australia only, up to Z14) and uploaded it to S3. Clients then fetch just the required byte ranges for each vector tile via HTTP range requests. It’s fast, scales well, and bandwidth costs are negligible because clients only download the exact data they need. https://docs.protomaps.com/pmtiles/ | | |
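(For a sense of the client side, roughly what reading one tile looks like with the pmtiles JS library - the archive URL and tile coordinates are made up, and the API names are from memory, so treat it as a sketch:)

```typescript
import { PMTiles } from "pmtiles"; // JS client that issues the Range requests for you

// Hypothetical archive: one ~900 MB file sitting on S3 or any static host.
const archive = new PMTiles("https://example-bucket.s3.amazonaws.com/australia.pmtiles");

// The client first fetches the header/root directory (a small byte range),
// then each tile lookup fetches only the byte range holding that tile.
const header = await archive.getHeader();
console.log("zoom range:", header.minZoom, "-", header.maxZoom);

const tile = await archive.getZxy(14, 15000, 9700); // z, x, y
if (tile) {
  console.log("tile bytes:", tile.data.byteLength);
}
```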
| ▲ | simonw 12 hours ago | parent | next [-] | | PMTiles is absurdly great software. | | |
| ▲ | hyperbolablabla 2 hours ago | parent | next [-] | | My only gripe is that the tile metadata is stored as JSON, which I get is for compatibility with existing software, but it means that for e.g. a simple C program to implement the full spec, you need to ship a JSON parser on top of the PMTiles parser itself. | |
| ▲ | Humphrey 12 hours ago | parent | prev [-] | | I know right! I'd never heard of HTTP Range requests until PMTiles - but gee it's an elegant solution. | | |
| ▲ | keepamovin 2 hours ago | parent [-] | | Hadn't seen PMTiles before, but that matches the mental model exactly! I chose physical file sharding over Range requests on a single db because it felt safer for 'dumb' static hosts like CF: less risk of a single 22GB file getting stuck or cached weirdly. Maybe it would work |
|
| |
| ▲ | nextaccountic 9 hours ago | parent | prev [-] | | That's neat, but... is it just for cartographic data? I want something like a db with indexes | | |
| ▲ | jtbaker 4 hours ago | parent [-] | | Look into using duckdb with remote http/s3 parquet files. The parquet files are organized as columnar vectors, grouped into chunks of rows. Each row group stores metadata (e.g. min/max statistics) about the rows it contains, which the query engine can use to prune out data that doesn't need to be scanned. https://duckdb.org/docs/stable/guides/performance/indexing LanceDB has a similar mechanism for operating on remote vector embeddings/text search. It's a fun time to be a dev in this space! |
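(A sketch of what that looks like with DuckDB's Node bindings - the Parquet URL and columns are hypothetical:)

```typescript
import duckdb from "duckdb"; // Node bindings; duckdb-wasm exposes the same SQL in the browser

const db = new duckdb.Database(":memory:");

// DuckDB reads the Parquet footer first, then fetches only the row groups and
// columns the query needs via HTTP range requests. Recent builds auto-load the
// httpfs extension for http(s) URLs; older ones need `INSTALL httpfs; LOAD httpfs;` first.
db.all(
  `SELECT "by" AS author, count(*) AS n
   FROM read_parquet('https://example.com/hn-items-2024.parquet')  -- hypothetical file
   WHERE score > 100   -- row groups whose min/max stats rule this out are skipped
   GROUP BY author
   ORDER BY n DESC
   LIMIT 10`,
  (err, rows) => {
    if (err) throw err;
    console.log(rows);
  }
);
```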
|
| |
| ▲ | simonw 13 hours ago | parent | prev | next [-] | | There was a UK government GitHub repo that did something interesting with this kind of trick against S3 but I checked just now and the repo is a 404. Here are my notes about what it did: https://simonwillison.net/2025/Feb/7/sqlite-s3vfs/ Looks like it's still on PyPI though: https://pypi.org/project/sqlite-s3vfs/ You can see inside it with my PyPI package explorer: https://tools.simonwillison.net/zip-wheel-explorer?package=s... | | | |
| ▲ | __turbobrew__ 8 hours ago | parent | prev | next [-] | | gdal vsis3 dynamically fetches chunks of rasters from s3 using range requests. It is the underlying technology for several mapping systems. There is also a file format to optimize this https://cogeo.org/ | |
| ▲ | 12 hours ago | parent | prev | next [-] | | [deleted] | |
| ▲ | ericd 13 hours ago | parent | prev | next [-] | | This is somewhat related to a large dataset browsing service a friend and I worked on a while back - we made index files, and the browser ran a lightweight query planner to fetch static chunks which could be served from S3/torrents/whatever. It worked pretty well, and I think there’s a lot of potential for this style of data serving infra. | |
| ▲ | omneity 9 hours ago | parent | prev | next [-] | | I tried to implement something similar to optimize sampling semi-random documents from (very) large datasets on Huggingface; unfortunately their API doesn't support range requests well. | |
| ▲ | mootothemax 7 hours ago | parent | prev | next [-] | | This is pretty much what is so remarkable about parquet files: not only do you get seekable data, you can fetch only the columns you want too. I believe there are also indexing opportunities (not necessarily via e.g. hive partitioning) but frankly, I'm kinda out of my depth on it. | |
| ▲ | tlarkworthy 6 hours ago | parent | prev | next [-] | | Parquet/iceberg | |
| ▲ | 6510 11 hours ago | parent | prev [-] | | I want to see a bittorrent version :P | | |
|
|
| ▲ | maxloh 7 hours ago | parent | prev | next [-] |
| I am curious why they don't use a single file and HTTP Range Requests instead. PMTiles (a single-file format used to distribute OpenStreetMap tiles) uses that. |
| |
| ▲ | keepamovin 6 hours ago | parent [-] | | This would be a neat idea to try. Want to add a PR? Bench different "hackends" to see how DuckDB, SQLite shards, or range queries perform? |
|
|
| ▲ | meander_water 9 hours ago | parent | prev | next [-] |
| I love this so much; on my phone this is much faster than actual HN (I know it's only a read-only version). Where did you get the 22GB figure from? On the site it says: > 46,399,072 items, 1,637 shards, 8.5GB, spanning Oct 9, 2006 to Dec 28, 2025 |
| |
|
| ▲ | sodafountan 8 hours ago | parent | prev | next [-] |
| The GitHub page is no longer available, which is a shame because I'm really interested in how this works. How was the entirety of HN stored in a single SQLite database? In other words, how was the data acquired? And how does the page load instantly if there's 22GB of data that has to be downloaded to the browser? |
| |
| ▲ | keepamovin 6 hours ago | parent [-] | | You can see it now; I forgot to make it public. 1. download_hn.sh - bash script that queries BigQuery and saves the data to *.json.gz. 2. etl-hn.js - does the sharding and the ID -> shard map, plus the user stats shards. 3. Then either npx serve docs or upload to CloudFlare Pages. The ./toool/s/predeploy-checks.sh script basically runs the entire pipeline. You can do it unattended with AUTO_RUN=true | | |
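(Very roughly, the sharding step boils down to bucketing items by ID and recording which bucket each ID landed in - a sketch of the idea only; the bucket size and types below are made up, not the actual etl-hn.js logic:)

```typescript
// Sketch: assign items to fixed-size ID buckets and build the ID -> shard map
// the front end later uses to decide which shard_NNNN.sqlite.gz to fetch.
const ITEMS_PER_SHARD = 30_000; // made-up number; real shard sizing may differ

interface Item { id: number; type: string; by?: string; time: number; title?: string; text?: string; }

function shardFor(id: number): number {
  return Math.floor(id / ITEMS_PER_SHARD); // contiguous ID ranges land in the same shard
}

function buildShards(items: Item[]): Map<number, Item[]> {
  const shards = new Map<number, Item[]>();
  for (const item of items) {
    const s = shardFor(item.id);
    if (!shards.has(s)) shards.set(s, []);
    shards.get(s)!.push(item); // each group is later written out as its own SQLite file
  }
  return shards;
}
```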
|
|
| ▲ | tehlike 15 hours ago | parent | prev [-] |
| VFS support is amazing. |