yjtpesesu2 8 hours ago:
How does this differ from anything llama.cpp offers, regarding offloading layers? The repo consistently refers to "DDR4". Is there a reason DDR5 won't work with this? | ||||||||
svnt 8 hours ago:
The readme opens with this:

> I have an RTX 5070 with 12 GB VRAM and I wanted to run glm-4.7-flash:q8_0, which is a 31.8 GB model. The standard options are:
> Offload layers to CPU — works, but drops token/s by 5–10× because CPU RAM has no CUDA coherence. You end up waiting.
> Use a smaller quantization — you lose quality. At q4_0 the model is noticeably worse on reasoning tasks.
> Buy a bigger GPU — not realistic for consumer hardware. A 48 GB card costs more than a complete workstation.
> None of those felt right, so I built an alternative: route the overflow memory to DDR4 via DMA-BUF, which gives the GPU direct access to system RAM over PCIe 4.0 without a CPU copy involved.

And then it limps home with this caveat on the closest thing to a benchmark:

> The PCIe 4.0 link (~32 GB/s) is the bottleneck when the model overflows VRAM. The best strategy is to shrink the model until it fits — either with EXL3 quantization or ModelOpt PTQ — and use GreenBoost's DDR4 pool for KV cache only.

I think the reason it refers to DDR4 is because that's how the user explained it to their coding agent. LLMs are great at perpetuating unnecessary specificity.
kcb 8 hours ago:
CUDA has had managed memory that pages between VRAM and system RAM for a decade. The problem is that doing so is unusably slow for AI purposes. This seems like an unnecessary layer here.
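For reference, the managed memory kcb is describing is CUDA unified memory via `cudaMallocManaged`. A minimal sketch (the kernel, array size, and launch geometry here are illustrative, not taken from the project):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Toy kernel: double every element in place.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr;

    // cudaMallocManaged returns one pointer valid on both host and device;
    // the driver migrates pages between system RAM and VRAM on demand.
    cudaMallocManaged(&x, n * sizeof(float));

    for (int i = 0; i < n; ++i) x[i] = 1.0f;   // host writes, pages live in RAM

    scale<<<(n + 255) / 256, 256>>>(x, n);     // device touches pages -> fault-driven migration to VRAM
    cudaDeviceSynchronize();

    printf("%f\n", x[0]);                      // host read migrates the page back
    cudaFree(x);
    return 0;
}
```

On Pascal and newer GPUs this allows oversubscription (allocating more managed memory than the card has VRAM), but each page fault costs a PCIe round trip, which is exactly the slowness kcb is pointing at when model weights spill past VRAM.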
| ||||||||
xienze 8 hours ago:
Presumably it means that software doesn't have to implement the same sort of layer-offloading support itself. It'll "just work" as if you had X GB of VRAM all along.
| ||||||||