adsharma 6 hours ago

Are you talking about the Andy Pavlo bet here?

https://news.ycombinator.com/item?id=29737326

The Kuzu folks took some of these discussions and implemented them: SIP, ASP joins, factorized joins, and WCOJ (worst-case optimal joins).

Internally it's structured very similarly to DuckDB, apart from the differences noted above.

DuckDB 1.5 implemented sideways information passing (SIP), and LadybugDB is bringing in support for DuckDB node tables.

So the idea that graph databases have shaky internals stems primarily from pre-2021 incumbents.

4 more years to go to 2030!

jandrewrogers 4 hours ago | parent | next

I wasn't referring to the Pavlo bet, but I would make the same one! Poor algorithmic and architectural scalability is a serious bottleneck. I was part of a research program working on the fundamental computer science of high-scale graph databases ~15 years ago. Even back then we could show that the architectures you mention couldn't scale even in theory. Just about everyone has been rehashing the same basic design for decades.

As I like to point out, for two decades DARPA has offered to pay many millions of dollars to anyone who can demonstrate a graph database that can handle a sparse trillion-edge graph. That data model easily fits on a single machine. No one has been able to claim the money.

Inexplicably, major advances in this area 15-20 years ago under the auspices of government programs never bled into the academic literature, even though they materially improved the situation. (This case is the best example I've seen of obviously valuable advanced research that became lost for mundane reasons, which is pretty wild if you think about it.)

adsharma 2 hours ago | parent | next

> many millions of dollars to anyone who can demonstrate a graph database that can handle a sparse trillion-edge graph.

I wonder why no one has claimed it. It's possible to compress large graphs to about 1 byte per edge via graph-reordering techniques, so a trillion-edge graph becomes ~1 TB, which can fit on high-end machines.

Obviously it won't handle high write rates and mutations well, but with Apache Arrow-based compression it's certainly possible to handle read-only and read-mostly graphs.
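(To make the ~1 byte per edge claim concrete, here is a minimal sketch of the usual recipe: reorder node IDs so neighbors get nearby IDs, then delta-encode each sorted adjacency list and pack the gaps as varints. The helper names and the example list are mine, not from the thread.)

```python
# Hypothetical sketch: after a good graph reordering, adjacency-list
# gaps are mostly small, so most delta-encoded edges fit in one byte.

def varint_encode(n: int) -> bytes:
    """LEB128-style variable-length encoding: 7 payload bits per byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)
            return bytes(out)

def encode_adjacency(neighbors: list[int]) -> bytes:
    """Delta-encode a sorted neighbor list, then varint-pack the gaps."""
    out = bytearray()
    prev = 0
    for v in sorted(neighbors):
        out += varint_encode(v - prev)
        prev = v
    return bytes(out)

# The first gap is large (2 bytes), the rest are single-byte deltas:
neighbors = [1000, 1001, 1003, 1007, 1010, 1042]
encoded = encode_adjacency(neighbors)
print(len(encoded), "bytes for", len(neighbors), "edges")  # 7 bytes for 6 edges
```

At that density the arithmetic in the comment works out: 10^12 edges at ~1 byte each is ~1 TB.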

Also, the single-machine constraint feels artificial. For any columnar database written in the last 5 years, implementing object-store support is table stakes.

jandrewrogers 14 minutes ago | parent

Achieving adequate performance at 1T edges in one aspect requires severe tradeoffs in other aspects, making every implementation impractical at that scale. You touched on a couple of the key issues from when I was working in this domain.

There is no single-machine constraint, just the observation that we routinely run non-graph databases at similar scale on single machines without issue. It doesn't scale on in-memory supercomputers either, so the hardware details are unrelated to the problem:

- A graph database with good query performance typically has terrible write performance. It doesn't matter how fast queries are if it takes too long to get data into the system. At this scale there can be no secondary indexing structures into the graph; you need a graph-cutting algorithm that is efficient for both scalable writes and join recursion. This was solved.

- Graph workloads break cache replacement algorithms for well-understood theoretical reasons. Avoiding disk just removes one layer of broken caching among many but doesn't address the abstract purpose for which a cache exists. This is why in-memory systems still scale poorly. We've known how to solve this in theory since at least the 1980s. The caveat is that it is surprisingly difficult to fully reduce to practice in software, especially at scale, so no one really has. This is a work in progress.

- Most implementations use global synchronization barriers when parallelizing algorithms such as BFS, which greatly increases resource consumption while throttling hardware scalability and performance. My contribution to the research was actually in this area: I discovered a way to efficiently use error-correction algorithms to elide the barriers. I think there is room to make this even better, but I don't think anyone has worked on it since.

The pathological cache replacement behavior is the real killer here. It is what is left even if you don't care about write performance or parallelization.

I haven't worked in this area for many years but I do keep tabs on new graph databases to see if someone is exploiting that prior R&D, even if developed independently.

rossjudson an hour ago | parent | prev

I guess it all depends on the meaning of the word "handle", and what the use cases are.

adsharma 5 hours ago | parent | prev

Source: https://www.theregister.com/2023/03/08/great_graph_debate_we...

> There are some additional optimizations that are specific to graphs that a relational DBMS needs to incorporate: [...]

This is essentially what Kuzu implemented, and what DuckDB tried to implement (DuckPGQ), without touching relational storage.

The jury is still out on which is the better approach.