Show HN: KVBoost – chunk-level KV cache reuse for HuggingFace, 5–48x faster TTFT (pythongiant.github.io)
18 points by pythongiant 3 hours ago | 12 comments
stpedgwdgfhgdd an hour ago
I just don't get why people choose Python and not, e.g., Go for high-performance problems.

hexnuts 2 hours ago
Bad site design. If I can't scroll to see the next slide, that's just broken.
x0ruman an hour ago
The functionality is impressive, but the website needs some work.
sakex an hour ago
Is this based on paged attention with hashing of the pages?
pythongiant 3 hours ago
KVBoost is a chunk-level KV cache reuse library for HuggingFace models (pip install kvboost). It supports two recompute strategies (selective boundary and CacheBlend), int8/int4 KV quantization for a 2–4x RAM reduction, disk-backed cold storage, and 11 architectures including Llama, Qwen, Gemma, Mistral, and Phi. On Qwen2.5-3B we measured a 47.9x TTFT speedup on an 8-turn conversation and 21x on code context reuse. We also measured it at 100–743x faster than MLX and 3–41x faster than vLLM-MLX, including interior chunk reuse, where vLLM gets zero hits. Outputs are token-for-token identical to baseline under greedy decoding. It works best on 3B+ models with 500+ tokens of shared context. GitHub: https://github.com/pythongiant/KVBoost
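
To make the "interior chunk reuse" point concrete, here is a toy sketch of why the choice of cache key matters. It is not KVBoost's API or implementation: the chunk size, the function names, and both hashing schemes are assumptions made up for illustration. It only counts which chunk keys would hit when the same document appears after two different headers, comparing a prefix-chained key (each chunk's key depends on everything before it, as prefix caches typically do) with a content-only key.

```python
import hashlib

CHUNK = 4  # hypothetical chunk size in tokens; real systems use far larger chunks


def chunks(tokens, size=CHUNK):
    """Split tokens into full fixed-size chunks; a trailing partial chunk is dropped."""
    return [tuple(tokens[i:i + size]) for i in range(0, len(tokens) - size + 1, size)]


def prefix_keys(tokens):
    """Prefix-chained keys: each key hashes the chunk together with the previous key."""
    keys, parent = [], ""
    for chunk in chunks(tokens):
        parent = hashlib.sha256((parent + repr(chunk)).encode()).hexdigest()
        keys.append(parent)
    return keys


def content_keys(tokens):
    """Content-only keys: each key depends solely on the tokens inside the chunk."""
    return [hashlib.sha256(repr(chunk).encode()).hexdigest() for chunk in chunks(tokens)]


def count_hits(cached_tokens, new_tokens, key_fn):
    """Count chunks of new_tokens whose key is already present for cached_tokens."""
    cache = set(key_fn(cached_tokens))
    return sum(key in cache for key in key_fn(new_tokens))


doc = list(range(100, 108))   # 8 tokens of shared context = 2 chunks
first = [1, 2, 3, 4] + doc    # shared document after one header
second = [9, 9, 9, 9] + doc   # same document after a different header

print("prefix-chained hits:", count_hits(first, second, prefix_keys))
print("content-only hits:", count_hits(first, second, content_keys))
```

In this toy the headers are exactly one chunk long, so the shared document stays chunk-aligned; a real system would also have to handle misaligned content. Note too that reusing a chunk by content alone ignores the different tokens that preceded it, which is presumably the gap the recompute strategies mentioned above are meant to close.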