skybrian 18 hours ago
> three or four layers of subqueries, each one filtering or aggregating the results of the previous one, totaling over 5000 lines of code

In a better language, this would be a pipeline. Pipelines are conceptually simple but annoying to debug, compared to putting intermediate results in a variable or file. Are there any debuggers that let you look at intermediate results of pipelines without modifying the code?
physicles 7 hours ago
I wrote some tooling to help debug SQL queries with many CTEs. It parses the SQL, finds all the CTEs, and prints the result of each CTE formatted as CSV. If the .sql file changes on disk, it reruns the query and tells you which CTEs' output changed. Saved me hours of debugging.
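The core trick such tooling automates can also be done by hand: keep the WITH clause intact and swap the final SELECT for a probe against whichever CTE you want to inspect. A minimal sketch, with hypothetical table and CTE names:

    -- Hypothetical query: two CTEs feeding a final SELECT.
    WITH shipped_orders AS (
        SELECT order_id, customer_id
        FROM orders                     -- assumed base table
        WHERE status = 'shipped'
    ),
    order_counts AS (
        SELECT customer_id, COUNT(*) AS n_orders
        FROM shipped_orders
        GROUP BY customer_id
    )
    -- Replace the query's original final SELECT with a probe
    -- against the CTE whose output you want to see:
    SELECT * FROM order_counts;         -- or: SELECT * FROM shipped_orders;

A tool just has to parse out the CTE names, generate one such probe query per CTE, run each, and dump the rows as CSV.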
tremon 18 hours ago
This is not a pipeline in the control-flow sense: the full query is compiled into a single processing statement, and the query compiler is free to remove and/or reorder any of the subqueries as it sees fit. The intermediate results during query execution (e.g. temp table spools) do not follow the structure of the original query, because CTEs and subqueries are not execution boundaries. It's more accurate to compare this to a C compiler that performs aggressive link-time optimization, including new rounds of copy elision, loop unrolling and dead code elimination.

If you want to build a pipeline and store each intermediate result, most tooling will make that easy for you. E.g. in dbt, just put each subquery in its own file, and the processing engine will schedule each intermediate result after the one it depends on, as in the sketch below. Just make sure you have enough storage available: it's not uncommon for intermediate results to be hundreds of times larger than the end result (e.g. when you perform a full table join in the first CTE and do the target filtering in another).
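A minimal dbt sketch of that split, with hypothetical model and table names (dbt's ref() is its real dependency mechanism; the rest is illustrative):

    -- models/shipped_orders.sql (hypothetical): formerly the first CTE.
    -- dbt materializes each model as its own relation in the warehouse.
    select order_id, customer_id
    from raw.orders                    -- assumed source table
    where status = 'shipped'

    -- models/order_counts.sql (hypothetical): formerly the second CTE.
    -- ref() declares the dependency, so dbt builds shipped_orders first.
    select customer_id, count(*) as n_orders
    from {{ ref('shipped_orders') }}
    group by customer_id

Because each model lands as its own table or view, every intermediate result can be queried directly, which is exactly what makes this style easier to debug than one 5000-line statement.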