tjchear 13 hours ago

I’ve not used duckdb before nor do I do much data analysis so I am curious about this one aspect of processing medium sized json/csv with it: the data are not indexed, so any non-trivial query would require a full scan. Is duckdb so fast that this is never really a problem for most folks?

RobinL 12 hours ago | parent | next [-]

It is true that for json and csv you need a full scan, but there are several mitigations.

The first is simply that it's fast - for example, DuckDB has one of the best csv readers around, and it's parallelised.
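To make that concrete, here is a minimal sketch in Python of pointing DuckDB's reader at a csv (the file name is made up):

    import duckdb

    # read_csv_auto sniffs the delimiter, header and column types, and the
    # reader runs in parallel across the file. 'events.csv' is a made-up name.
    duckdb.sql("""
        SELECT COUNT(*) AS n_rows
        FROM read_csv_auto('events.csv')
    """).show()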

Next, engines like DuckDB are optimised for aggregate analysis, where your single query processes a lot of rows (often a significant % of all rows). That means that a full scan is not necessarily as big a problem as it first appears. It's not like a transactional database where often you need to quickly locate and update a single row out of millions.
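A hedged sketch of the kind of one-pass aggregate these engines are built for (file and column names are hypothetical):

    import duckdb

    # One pass over the file, aggregating most of the rows: a full scan is the
    # expected access pattern here, not a worst case.
    duckdb.sql("""
        SELECT category,
               COUNT(*)    AS n,
               AVG(amount) AS avg_amount
        FROM read_csv_auto('events.csv')
        GROUP BY category
        ORDER BY n DESC
    """).show()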

In addition, engines like DuckDB have predicate pushdown, so if your data is stored in parquet format you do not need to scan every row: the parquet files themselves hold metadata (per-column min/max statistics for each row group) about the values they contain.

Finally, parquet is a columnar format, so the engine only needs to scan the columns your query actually touches, rather than processing whole rows when you are only interested in one or two columns.
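A small sketch showing both of those parquet benefits at once: only the referenced columns are read, and the filter is checked against the file's row-group statistics (file and column names are hypothetical):

    import duckdb

    # Only the referenced columns are read from the parquet file (columnar
    # layout), and the WHERE clause is compared against per-row-group min/max
    # statistics, so row groups that cannot match are skipped without scanning.
    duckdb.sql("""
        SELECT user_id, amount
        FROM 'events.parquet'
        WHERE event_date >= DATE '2024-06-01'
    """).show()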

mr_toad 3 hours ago | parent | prev | next [-]

Many analytical queries require full scans of fact tables anyway, so indexes are less useful. Joins are usually to dimensional tables, which are quite small. Snowflake doesn’t use indexes at all, and it’s built for handling the largest volumes of data.

However, you wouldn’t want to use either for transaction processing; the lack of indexes would really hurt.
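To illustrate the fact/dimension point above, a hedged sketch of the typical star-schema query shape (table and column names are made up):

    import duckdb

    # Typical star-schema shape: scan the big fact table once, join to a small
    # dimension table, aggregate. No index is needed on the fact side.
    duckdb.sql("""
        SELECT d.region, SUM(f.amount) AS total
        FROM 'sales_fact.parquet' AS f
        JOIN 'region_dim.parquet' AS d USING (region_id)
        GROUP BY d.region
    """).show()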

ayhanfuat 12 hours ago | parent | prev | next [-]

If you are going to query it frequently, then json/csv might become an issue. I think the reason it doesn't become a problem for duckdb/polars users is that we generally convert them to parquet after the first read.
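For example, a one-off conversion along those lines might look like this (file names are made up):

    import duckdb

    # Read the csv/json once, write it back out as parquet, then point all
    # later queries at the parquet files.
    duckdb.sql("""
        COPY (SELECT * FROM read_csv_auto('events.csv'))
        TO 'events.parquet' (FORMAT PARQUET)
    """)
    duckdb.sql("""
        COPY (SELECT * FROM read_json_auto('events.json'))
        TO 'events_json.parquet' (FORMAT PARQUET)
    """)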

biophysboy 13 hours ago | parent | prev | next [-]

Zonemaps are created for columns automatically. I process somewhat large tables w/ duckdb regularly (100M rows) and never have any problems.
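For context, a minimal sketch of loading into a native DuckDB table, where those zonemaps (per-row-group min/max stats) are maintained automatically (file, table and column names are hypothetical):

    import duckdb

    # Loading into a native DuckDB table; min/max zonemaps are maintained per
    # row group automatically, so selective filters can skip whole blocks.
    con = duckdb.connect('analytics.duckdb')
    con.execute("CREATE TABLE events AS SELECT * FROM read_csv_auto('events.csv')")
    con.sql("""
        SELECT COUNT(*) AS recent
        FROM events
        WHERE event_date >= DATE '2024-06-01'
    """).show()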

riku_iki 12 hours ago | parent [-]

that's true for duckdb native tables, but the question was about json.

biophysboy 10 hours ago | parent [-]

They said json and csv - it handles both!

riku_iki 10 hours ago | parent [-]

Whether it handles them depends on size. But what I was trying to say is that there are no zonemaps for json.

0cf8612b2e1e 9 hours ago | parent | prev | next [-]

Visidata is in Python and has offered “real time” analytics of flat files for a long time. Computers are stupidly fast. You can do a lot of operations within a time window of a few seconds.

simlevesque 12 hours ago | parent | prev | next [-]

But when you store your json or csv in a columnar format, if you have say 10 columns, each column is stored separately on your disk instead of all together. So a scan for one column only needs to read a tenth of the disk space used for the data. Obviously this depends on the columns' content.

gdulli 12 hours ago | parent [-]

But you can have a surprisingly large amount of data before the inefficiency you're talking about becomes untenable.

akhundelar 13 hours ago | parent | prev | next [-]

Not a duckdb user, but I use polars a lot (mentioned in the article).

Depends on your definition of medium sized, but for tables of hundreds of thousands of rows and ~30 columns, these tools are fast enough to run queries instantly or near instantly even on laptop CPUs.
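For instance, a rough sketch of that kind of polars query on a medium-sized csv (file and column names are made up):

    import polars as pl

    # Lazy scan: polars reads only what the query needs and runs the
    # aggregation across cores.
    result = (
        pl.scan_csv("events.csv")
          .filter(pl.col("amount") > 0)
          .group_by("category")
          .agg(pl.col("amount").sum().alias("total"))
          .collect()
    )
    print(result)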

mpalmer 13 hours ago | parent | prev [-]

I guess the question is: how much is medium? DuckDB can handle quite a lot of data without breaking a sweat. Certainly if you prefer writing SQL for certain things, it's a no-brainer.