namibj 2 days ago

Both reasons together :D

But yeah, you can snatch traversal statistics from your index accesses as you're executing, and if those hit planner-provided thresholds, you go do whatever the planner said to do/check in case of that specific trigger, then feed your statistics and the extra queried info back to the planner to get your plan adjusted.
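
As a toy illustration of that feedback loop (all names here are made up, not any particular engine's API): the planner attaches a cardinality threshold to the scan, and once the executor crosses it, it reports its observed statistics back and continues under an adjusted plan.

    class Plan:
        def __init__(self, key, threshold, strategy):
            self.key = key
            self.replan_threshold = threshold
            self.strategy = strategy  # e.g. "index_nested_loop" vs. "hash_join"

    class Planner:
        def adjust(self, plan, stats):
            # Toy policy: the estimate was badly off, so switch strategy and
            # stop checking the threshold for the rest of this scan.
            return Plan(plan.key, float("inf"), "hash_join")

    def scan_with_feedback(index, plan, planner):
        observed, rows = 0, []
        for row in index.get(plan.key, []):
            observed += 1
            if observed > plan.replan_threshold:
                stats = {"rows_seen": observed, "key": plan.key}
                plan = planner.adjust(plan, stats)  # feed stats back, get an adjusted plan
            rows.append(row)
        return rows, plan

    index = {"k": list(range(1000))}  # toy "index": key -> matching rows
    rows, plan = scan_with_feedback(index, Plan("k", 10, "index_nested_loop"), Planner())
    print(len(rows), plan.strategy)   # 1000 hash_join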

Doing this without creating unpredictable performance bugs, let alone correctness bugs, is going to be hard, though. Even more so when you consider it also has to be fast, because otherwise it wouldn't be worth using.

You may want to check out WCOJ (worst-case optimal join) technology, where you can run a multi-way join with cardinalities of a "could well be >10, we don't know, and it'll vary across data/keys anyway" nature:

1. You index all data as needed for perfect index-lookup joins on ANY non-stupid join ordering the query could be run as.

2. You prepare separate indices that tell you, for each key, the number of values a single-key lookup join would return, for each such possible index-lookup join included in (1).

3. To query, you start with a key or key range that can be fed to an index from (1). Then, for each partial tuple at each step between individual joins, you ask the candidate indices of (2) which relation would be the cheapest to expand with, use the corresponding index of (1) to expand your tuple (as told by the fine-grained cardinality counts of (2)), and then choose the next relation to expand with (a toy sketch of this follows below).
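
A minimal sketch of steps (1)-(3) with made-up dict-based indices (nothing here is a real engine's API): data[rel][key] plays the role of the lookup indices from (1), counts[rel][key] the per-key cardinality indices from (2), and at each step the counts decide which relation to expand the current partial tuple with.

    def expand_cheapest(partial, key, candidates, data, counts):
        """One expansion step for a single partial tuple."""
        # (2): per-key cardinality counts pick the cheapest relation for THIS key.
        rel = min(candidates, key=lambda r: counts[r].get(key, 0))
        # (1): the actual index-lookup join against the chosen relation.
        return rel, [partial + (v,) for v in data[rel].get(key, ())]

    data = {  # lookup indices from (1): relation -> key -> matching values
        "R": {"k1": ["r1", "r2", "r3"], "k2": ["r4"]},
        "S": {"k1": ["s1"], "k2": ["s2", "s3"]},
    }
    # cardinality indices from (2): relation -> key -> number of matches
    counts = {rel: {k: len(vs) for k, vs in idx.items()} for rel, idx in data.items()}

    for key in ("k1", "k2"):
        rel, expanded = expand_cheapest((key,), key, ["R", "S"], data, counts)
        print(key, "-> expand via", rel, ":", expanded)  # k1 picks S, k2 picks R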

If you ever have a choice where you know the cardinality of a key in (1) is always 0 or 1, you don't need a (2)-index for that (1)-index, and you always apply those lookups greedily, preferring the ones with a higher chance of killing the currently evaluated partial tuple before it propagates through further indices.
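
Continuing the toy style above, a hedged sketch of that greedy shortcut: unique holds the cardinality-0-or-1 indices (at most one match per key, so no count index needed), and hit_rate is an assumed, illustrative estimate of how often each one matches; the filters most likely to kill the tuple run first.

    def apply_unique_filters(partial, key, unique, hit_rate):
        # Try the filters most likely to reject the tuple first (lowest assumed hit rate).
        for rel in sorted(unique, key=lambda r: hit_rate[r]):
            match = unique[rel].get(key)
            if match is None:
                return None                   # tuple dies here, nothing left to expand
            partial = partial + (match,)      # cardinality 1: extend in place
        return partial

    unique = {"U": {"k1": "u1"}, "V": {"k1": "v1", "k2": "v2"}}
    hit_rate = {"U": 0.5, "V": 0.9}           # assumed fraction of keys each filter matches
    print(apply_unique_filters(("k1",), "k1", unique, hit_rate))  # survives both filters
    print(apply_unique_filters(("k2",), "k2", unique, hit_rate))  # killed by U -> None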

That alone, btw., is sufficient to get the worst-case optimal runtime for triangle listing/counting in arbitrary edge lists: O(n^(3/2)) (incidentally equal to the maximum possible number of triangles) instead of the naive/classic O(n^2), with n being the number of edges.
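
For concreteness, a small scalar, hash-set based Python version of that triangle count (so nowhere near the vectorized/columnar constants discussed below): for every edge (u, v) it scans the smaller of the two adjacency sets and probes the larger, so the work per edge is min(deg(u), deg(v)), which sums to O(n^(3/2)) over n edges.

    from collections import defaultdict

    def count_triangles(edges):
        adj = defaultdict(set)
        for u, v in edges:
            if u != v:
                adj[u].add(v)
                adj[v].add(u)
        count, seen = 0, set()
        for u, v in edges:
            if u == v or (min(u, v), max(u, v)) in seen:
                continue                      # skip self-loops and duplicate edges
            seen.add((min(u, v), max(u, v)))
            small, large = sorted((adj[u], adj[v]), key=len)
            count += sum(1 for w in small if w in large)  # cost: min(deg(u), deg(v))
        return count // 3                     # each triangle is found once per edge

    print(count_triangles([(1, 2), (2, 3), (3, 1), (3, 4)]))  # -> 1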

Constants here for the triangle case are on the order of 10~50, depending on how hard you want to argue for vectorizing and using flat-array "columnar" data structures instead of in-memory structures that allow incremental updating without brute-force rebuilding from scratch at every change. E.g., supporting adding (but not deleting) edges and efficiently getting updated results is something that doesn't fit the most aggressive performance optimizations, which is what gives the ~50 constant factor vs. a "mere" ~10 when you want to support incremental updates.

(Break-even is where constant*n^(3/2) = n^2, i.e. n = constant^2: at constant=10 break-even would be at n=100; at constant=50 it would be at n=2500. IMO it's therefore quite relevant in many practical situations, given how quickly those asymptotics start paying off.)