Show HN: Dendrite – O(1) KV cache forking for tree-structured LLM inference (github.com)
3 points by RyeCatcher 2 hours ago | 1 comment

RyeCatcher 2 hours ago
Hey HN, author here. Happy to answer questions.

Why Rust? The ownership model makes the CoW block table provably safe: the borrow checker enforces that you can't alias a block while it's being written. That's not just style; it eliminates a whole class of correctness bugs that plague Python/C++ inference engines.

How is this different from vLLM's paged attention? vLLM pages memory in fixed blocks to avoid fragmentation. Dendrite does that too, but adds O(1) KV cache forking via copy-on-write: when you branch a beam or MCTS node, you get a shallow pointer copy (~500ns) instead of copying the full KV cache. The deeper the tree, the bigger the win.

TurboQuant: Google published this last week (ICLR 2026). We already have a Rust implementation of the PolarQuant + QJL pipeline in `cache/compress.rs`. Measured 3x compression at head_dim=128 on CPU; the paper claims 6x with per-head grouping (coming).

Status: research-grade, not production. No Python bindings yet (tracked in issues), no FlashAttention kernels. Best fit today: tree-structured search (MCTS, beam, speculative decoding) where you want to control the stack.
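To make the fork-via-copy-on-write idea concrete, here's a minimal sketch using `Arc` and `Arc::make_mut`. This is not Dendrite's actual code or API; `Block`, `Sequence`, `BLOCK_TOKENS`, and `HEAD_DIM` are hypothetical names, and it only shows the mechanism: a fork clones the block table (pointer copies, refcount bumps), and a write after a fork copies just the one block it touches.

```rust
use std::sync::Arc;

const BLOCK_TOKENS: usize = 16; // tokens per block (assumed)
const HEAD_DIM: usize = 4;      // toy head dimension

#[derive(Clone)]
struct Block {
    kv: Vec<f32>, // flattened K/V entries for up to BLOCK_TOKENS tokens
}

#[derive(Clone)]
struct Sequence {
    table: Vec<Arc<Block>>, // block table: shared pointers to blocks
    len: usize,             // tokens written so far
}

impl Sequence {
    fn new() -> Self {
        Sequence { table: Vec::new(), len: 0 }
    }

    // Fork = shallow copy of the block table; no KV data is copied.
    fn fork(&self) -> Self {
        self.clone() // cloning Vec<Arc<Block>> only bumps refcounts
    }

    // Append one token's KV vector, copying the last block only if shared.
    fn append(&mut self, kv: &[f32; HEAD_DIM]) {
        if self.len % BLOCK_TOKENS == 0 {
            self.table.push(Arc::new(Block { kv: Vec::new() }));
        }
        let last = self.table.last_mut().unwrap();
        // Arc::make_mut clones the block iff another sequence shares it (CoW).
        Arc::make_mut(last).kv.extend_from_slice(kv);
        self.len += 1;
    }
}

fn main() {
    let mut root = Sequence::new();
    root.append(&[1.0; HEAD_DIM]);

    let mut child = root.fork(); // pointer copies only
    assert!(Arc::ptr_eq(&root.table[0], &child.table[0])); // block is shared

    child.append(&[2.0; HEAD_DIM]); // triggers copy-on-write
    assert!(!Arc::ptr_eq(&root.table[0], &child.table[0])); // now diverged
    assert_eq!(root.table[0].kv.len(), HEAD_DIM); // parent unaffected
    println!("fork shares blocks until first write");
}
```

The key design point is that the fork cost scales with the number of table entries (pointer copies), not with the KV data itself, and the borrow rules around `Arc::make_mut` are exactly where Rust's aliasing guarantees do the safety work mentioned above.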
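On the compression side, here's an illustrative sketch of the simplest form of KV cache quantization: per-vector symmetric int8, which gets roughly 4x over f32 at the cost of one scale per vector. This is deliberately not the PolarQuant + QJL pipeline in `cache/compress.rs` (those are more involved); `quantize`/`dequantize` are hypothetical names, just to show the basic trade-off being made.

```rust
// Symmetric per-vector int8 quantization: f32 -> i8 plus one f32 scale.
// Illustrative only; real pipelines (PolarQuant, QJL) are more sophisticated.

fn quantize(v: &[f32]) -> (Vec<i8>, f32) {
    let max = v.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if max == 0.0 { 1.0 } else { max / 127.0 };
    (v.iter().map(|x| (x / scale).round() as i8).collect(), scale)
}

fn dequantize(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&x| x as f32 * scale).collect()
}

fn main() {
    let head_dim = 128;
    let v: Vec<f32> = (0..head_dim).map(|i| (i as f32 * 0.1).sin()).collect();

    let (q, scale) = quantize(&v);
    let v2 = dequantize(&q, scale);

    // Round-trip error is bounded by half a quantization step.
    let max_err = v
        .iter()
        .zip(&v2)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    assert!(max_err <= scale * 0.5 + 1e-6);

    // 4 bytes/elem -> 1 byte/elem, plus 4 bytes of scale per vector.
    println!("compressed {}B -> {}B", v.len() * 4, q.len() + 4);
}
```

Higher ratios (the 3x measured, 6x claimed above) come from going below 8 bits and from grouping, e.g. sharing statistics per head, which is why per-head grouping matters in the paper.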