RobinL 6 hours ago
Worse in some ways, better in others. DuckDB is often an excellent tool for this kind of task. Since it can run parallelized reads, I imagine it's often faster than command-line tools, and with easier-to-understand syntax.
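
For a flavour of what that looks like (a minimal sketch; the log filenames and columns are invented), DuckDB spreads the scan over a glob of files across threads by itself:

    duckdb <<'SQL'
    -- the glob is read in parallel; .gz is decompressed on the fly
    SELECT status, count(*) AS hits
    FROM read_csv_auto('access_log_*.csv.gz')
    GROUP BY status
    ORDER BY hits DESC;
    SQL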
|
briHass 5 hours ago
More importantly, you have your data in a structured format that can be easily inspected at any stage of the pipeline using a familiar tool: SQL. I've been using this pattern (scripts or code that execute commands against DuckDB) to process data more recently, and the ability to do deep investigations on the data as you're designing the pipeline (or when things go wrong) is very useful. Doing it with a code-based solution (reading data into objects in memory) makes it much harder to view the data: using debugging tools to inspect objects on the heap is painful compared to being able to JOIN/WHERE/GROUP BY your data.
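
A rough sketch of that pattern (database, table, and column names invented): materialise each stage into a table in a DuckDB file, so that when something looks off you can open the file and JOIN/WHERE/GROUP BY the intermediate data directly:

    duckdb pipeline.duckdb <<'SQL'
    -- stage 1: raw load
    CREATE OR REPLACE TABLE raw AS
    SELECT * FROM read_csv_auto('input/*.csv');

    -- stage 2: cleaning / derivation
    CREATE OR REPLACE TABLE cleaned AS
    SELECT * FROM raw WHERE status IS NOT NULL;
    SQL

    # later, when investigating, open the same file interactively
    # and query raw/cleaned with plain SQL
    duckdb pipeline.duckdb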
|
mrgoldenbrown 5 hours ago
IMHO the main point of the article is that the typical unix command pipeline IS already parallelized. The bottleneck in the example was maxed-out disk IO, which I don't think DuckDB can help with.

chuckadams 4 hours ago
Pipes are parallelized when you have unidirectional data flow between stages. They really kind of suck for fan-out and joining, though. I do love a good long pipeline of do-one-thing-well utilities, but that design still has major limits. To me, the main advantage of pipelines is not so much the parallelism but that they are streams that process "lazily". On the other hand, unix sockets combined with socat can perform some real wizardry, but I never quite got the hang of that style.
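
For what it's worth, a hedged sketch (log format and filenames invented) of where it hurts: tee can fan a stream out over named pipes easily enough, but you end up parking results in files and adding an explicit join point yourself:

    mkfifo p1 p2
    grep -c ' 500 ' < p1 > errors.count &                      # consumer 1
    cut -d' ' -f1 < p2 | sort -u | wc -l > uniq_ips.count &    # consumer 2
    tee p1 < access.log > p2                                   # fan the stream out
    wait                                                       # explicit join point
    paste errors.count uniq_ips.count
    rm p1 p2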

mdavidn 2 hours ago
Pipelines are indeed one flow, and that works most of the time, but shell scripts make parallel tasks easy too. The shell provides tools to spawn subshells in the background and wait for their completion. Then there are utilities like xargs -P and make -j.
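
A couple of hedged one-liners along those lines (file names invented):

    # bounded parallelism with GNU xargs: 4 gzips at a time
    find . -name '*.log' -print0 | xargs -0 -P 4 -n 1 gzip

    # hand-rolled: background subshells, then wait for all of them
    # (no concurrency cap here, unlike xargs -P or make -j)
    for f in *.log; do
      ( gzip "$f" ) &
    done
    wait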

Linux-Fan 2 hours ago
UNIX provides the Makefile as the go-to tool when a simple pipeline is not enough. GNU make makes this even more powerful by being able to generate rules on the fly. If the tool of interest works with files (like the UNIX tools do), it fits very well. If the tool doesn't work with single files, I have had some success using Makefiles for generic processing tasks by creating, as part of the target, a marker file recording that a given task was complete.
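
A bare-bones GNU make sketch of that marker-file trick (the tool and paths are made up): the target is a stamp file rather than a real output, so make still knows which inputs are done and can run the rest in parallel with -j:

    # Makefile (recipe lines must start with a tab)
    SRCS := $(wildcard data/*.csv)
    DONE := $(SRCS:.csv=.done)

    all: $(DONE)

    %.done: %.csv
    	process-one-file $<    # hypothetical tool with no output file of its own
    	touch $@               # marker: this input has been handled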
|
|