Remix clone Hacker News

	▲	wenc 2 hours ago
		> RE: duckdb. I have a wonderful time with ChatGPT talking to duckdb but I have kept it to inmemory db only. Do you set up some system prompt that tell it to keep a duckdb database locally on disk in the current folder?

		No, I don't use DuckDB's database format at all. DuckDB for me is more like an engine to work with CSV/Parquet (similar to `jq` for JSON, and `grep` for strings). Also I don't use web-based chat (you mentioned ChatGPT) -- all these interactions are through agents like Kiro or Claude Code.

		I often have CSVs that are 100s of MBs and there's no way they fit in context, so I tell Opus to use DuckDB to sample data from the CSV. DuckDB works way better than any dedicated CSV tool because it packs a full database engine that can return aggregates, explore the limits of your data (max/min), figure out categorical data levels, etc.

		For Parquet, I just point DuckDB to the 100s of GBs of Parquet files in S3 (our data lake), and it's blazing fast at introspecting that data. DuckDB is one of the best Parquet query engines on the planet (imo better than Apache Spark) despite being just a tiny little CLI tool.

		One of the use cases is debugging results from an ML model artifact (which is more difficult than debugging code). For instance, let's say a customer points out a weird result in a particular model prediction. I highlight that weird result, and tell Opus to work backwards to trace how the ML model (I provide the training code and inference code) arrived at that number.

		Surprisingly, Opus 4.6 does a great job using DuckDB to figure out how the input data produced that one weird output. If necessary, Opus will even write temporary Python code to call the inference part of the ML model and run inference on a sample to verify assumptions. If the assumptions turn out to be wrong, Opus will change strategies. It's like watching a really smart junior work through the problem systematically.

		Even if Opus doesn't end up nailing the actual cause, it gets into the proximity of the real cause and I can figure out the rest (usually it's not the ML model itself, but some anomaly in the input). This has saved me so much time in deep-diving weird results.

		Not only that, I can have confidence in the deep-dive because I can just rerun the exact DuckDB SQL to convince myself (and others) of the source of the error, and that it's not something Opus hallucinated. CLI tools are deterministic and transparent that way (unlike MCPs, which are black boxes).
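
For anyone who wants to try the CSV workflow described above, here is a minimal sketch of the kind of queries involved. It is not the exact setup from this comment. It only builds DuckDB SQL strings (a row sample, a `SUMMARIZE` profile with min/max per column, and the levels of one categorical column) and optionally hands one statement to the `duckdb` CLI with no database file, so nothing is persisted. The helper names `sql_literal`, `profiling_queries`, and `run_duckdb`, the file `events.csv`, and the column `status` are all hypothetical.

```python
import shutil
import subprocess


def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling any single quotes."""
    return "'" + value.replace("'", "''") + "'"


def profiling_queries(csv_path: str, column: str, sample_rows: int = 20) -> dict[str, str]:
    """Build small queries whose results fit in an agent's context window."""
    src = f"read_csv_auto({sql_literal(csv_path)})"
    col = '"' + column.replace('"', '""') + '"'  # quoted identifier
    return {
        # A handful of random rows instead of the whole file.
        "sample": f"SELECT * FROM {src} USING SAMPLE {int(sample_rows)} ROWS;",
        # Per-column min, max, null percentage, approximate distinct count, etc.
        "summary": f"SUMMARIZE SELECT * FROM {src};",
        # Most frequent levels of one categorical column.
        "levels": (
            f"SELECT {col}, count(*) AS n FROM {src} "
            f"GROUP BY {col} ORDER BY n DESC LIMIT 50;"
        ),
    }


def run_duckdb(sql: str) -> str:
    """Run one statement with the duckdb CLI (in-memory) and return CSV text."""
    exe = shutil.which("duckdb")
    if exe is None:
        raise RuntimeError("duckdb CLI not found on PATH")
    result = subprocess.run(
        [exe, "-csv", "-c", sql], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file and column, for illustration only.
    for name, sql in profiling_queries("events.csv", "status").items():
        print(f"-- {name}\n{sql}")
```

Because each query is plain SQL, the same statement an agent ran can be pasted into the CLI later to check a finding by hand, which is the transparency point made above.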