srameshc 5 days ago

I love SQLite, and I'm in no way trying to devalue it; the author's method is an excellent approach to getting analytical speed out of SQLite. But I've been loving DuckDB for similar analytical workloads, since it's built for exactly these tasks. Like SQLite, DuckDB reads from a single file, and it processes large datasets at extreme speed. I work on my MacBook M2, where I've been dealing with about 20 million records, and it's fast, very fast.

Loading data into DuckDB is super easy, which surprised me:

SELECT avg(sale_price), count(DISTINCT customer_id) FROM '/my-data-lake/sales/2024/*.json';

You can also load data into a JSON-typed column and query it with Postgres-style syntax: col->>'$.key'
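
A minimal sketch of what that looks like (the table and values here are made up; the ->> operator comes from DuckDB's bundled json extension):

-- hypothetical table with a JSON-typed column
CREATE TABLE events (payload JSON);
INSERT INTO events VALUES ('{"customer_id": 42, "sale_price": 9.99}');

-- Postgres-style path extraction; ->> returns the value as text
SELECT payload->>'$.customer_id' AS customer_id FROM events;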

loa_observer 5 days ago | parent | next [-]

DuckDB is super fast for analytic tasks, especially when you use it with a visual EDA tool like PyGWalker. It lets you handle visualizations and EDA over millions of rows in seconds.

But I would say comparing DuckDB and SQLite is a little unfair. I would still use SQLite to build systems in most cases, and use DuckDB only for analytics. You can hardly make a smooth deployment across many platforms if your app embeds DuckDB.

trueno 5 days ago | parent [-]

Depending on the size and needs of the distributed system or application, I'm really excited about Postgres + pg_lake. Postgres has blown my mind with how well it handles concurrent writes, at least for the kinds of things I build/support for my org, and the pg_lake extension adds the ability to, honestly, work like a data-lake-style analytics engine. It switches whether a transaction goes down the normal query path or through DuckDB, which brings giga-aggregation-type queries to massive datasets.

Someone should smush SQLite + DuckDB together and do that kind of switching depending on query type.
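
For what it's worth, DuckDB's sqlite extension already gets partway there: you can attach a SQLite file and run DuckDB's analytical engine over its tables, though without the automatic per-query switching. A sketch (the file and table names are made up):

-- attach an existing SQLite database file via DuckDB's sqlite extension
INSTALL sqlite;
LOAD sqlite;
ATTACH 'app.db' AS app (TYPE sqlite);

-- run an analytical query with DuckDB's engine over the SQLite data
SELECT avg(sale_price), count(DISTINCT customer_id) FROM app.sales;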

mikepurvis 5 days ago | parent | prev | next [-]

Whoa. Is that first query building an index of random filesystem json files on the fly?

NortySpock 5 days ago | parent [-]

It's not an index; it's just (probably parallel) file reads.

That being said, it would be trivial to tweak the above query into two steps: one reading the data into a DuckDB table, and a second querying that table.
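
Something like this (a sketch; the path comes from the comment upthread, and the table name is made up):

-- step 1: materialize the JSON files into a DuckDB table
CREATE TABLE sales AS
SELECT * FROM read_json_auto('/my-data-lake/sales/2024/*.json');

-- step 2: run the aggregation against the table instead of the raw files
SELECT avg(sale_price), count(DISTINCT customer_id) FROM sales;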

lame_lexem 5 days ago | parent | prev [-]

Can we all agree to never store datasets uncompressed? DuckDB supports reading many compression formats.
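
For example, if the JSON files were gzipped (an assumption here; DuckDB detects gzip from the .gz extension when reading CSV and JSON):

SELECT avg(sale_price), count(DISTINCT customer_id)
FROM read_json_auto('/my-data-lake/sales/2024/*.json.gz');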

hawk_ 5 days ago | parent [-]

How much impact do the various compression formats have on query performance?