It has become a favourite tool for me as well.

I work with scientists who research BC's coastal environment, from airborne observation of glaciers to autonomous drones in the deep sea. We've got heaps of data.

A while back I took a leap of faith with DuckDB as the data-processing engine for a new tool we're using to transform and validate biodiversity data. The goal is to take heaps of existing datasets and convert them to valid Darwin Core data. Keyword being valid.

DuckDB is such an incredible tool in this context. Essentially I dynamically build duckdb tables from schemas describing the data, then import it into the tables. If it fails, it explains why on a row-by-row basis (as far as it's able to, at least). Once the raw data is in, transformations can occur. This is accomplished entirely in DuckDB as well. Finally, validations are performed using application-layer logic if the transformation alone isn't assurance enough.

I've managed to build an application that's way faster, way more capable, and much easier to build than I expected. And it's portable! I think I can get the entire core running in a browser. Field researchers could run this on an iPad in a browser, offline!

This is incredible to me. I've had so much fun learning to use DuckDB better. It's probably my favourite discovery in a couple of years.

And yeah, this totally could have been done any number of different ways. I had prototypes which took much different routes. But the cool part here is I can trust DuckDB to do a ton of heavy lifting. It comes with the cost of some things happening in SQL that I'd prefer it didn't sometimes, but I'm content with that tradeoff. In cases where I'm missing application-layer type safety, I use parsing and tests to ensure my DB abstractions are doing what I expect. It works really well!

edit: For anyone curious, the point of this project is to allow scientists to analyze biodiversity and genomic data more easily using common rather than bespoke tools, as well as publish it to public repositories. Publishing is a major pain point because people in the field typically work very far from the Darwin Core spec :) I'm very excited to polish it a bit and get it in the hands of other organizations.

▲

anotherpaul 20 minutes ago | parent [-]

What's the advantage over using Polars for the same task? It seems to me the natural competitor here and I vastly prefer the Polars syntax over SQL any day. So I was curious if I should try duckdb or stick with polars

	▲	microflash 8 minutes ago \| parent [-]
		Familiarity with SQL is a plus in my opinion. Also, DuckDB has SDKs in more languages compared to Polars.