tgrowazay 10 hours ago
LLM decode speed is roughly <memory_bandwidth> / <model_size> tok/s. DDR4 tops out at about 27 GB/s; DDR5 can do around 40 GB/s. So for a 70B model at 8-bit quant (~70 GB of weights), you will get around 0.3-0.5 tokens per second from RAM alone.
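As a back-of-the-envelope sketch of that estimate (using the bandwidth figures from the comment, which are rough assumptions, not measurements):

```python
# Decode-speed estimate: every generated token streams the full weight set
# from RAM, so tokens/s ~= memory bandwidth / model size in bytes.
def est_tok_per_s(bandwidth_gb_s: float, params_b: float,
                  bytes_per_param: float = 1.0) -> float:
    model_gb = params_b * bytes_per_param  # 70B params at 8-bit quant ~= 70 GB
    return bandwidth_gb_s / model_gb

print(est_tok_per_s(27, 70))  # DDR4: roughly 0.39 tok/s
print(est_tok_per_s(40, 70))  # DDR5: roughly 0.57 tok/s
```

This ignores KV-cache reads and compute, so real numbers land a bit lower.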
uf00lme 9 hours ago
Channels matter a lot: quad-channel DDR4 is going to beat dual-channel DDR5 most of the time.
someguy2026 9 hours ago
DRAM speed is one thing, but you should also account for the data rate of the PCIe bus (and/or VRAM speed). But yes, holding the model "lukewarm" in DRAM rather than on NVMe storage is obviously faster.
vlovich123 10 hours ago
Faster than the 0.2 tok/s this approach manages.
zozbot234 9 hours ago
Should be active param size, not total model size; the two differ for MoE models.
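To illustrate why the distinction matters, a sketch with rough Mixtral 8x7B numbers (~47B total params, ~13B active per token; both figures are approximate assumptions):

```python
# For MoE models each generated token only streams the *active* experts'
# weights, so bandwidth / active_size is the right estimate, not
# bandwidth / total_size.
def moe_tok_per_s(bandwidth_gb_s: float, active_params_b: float,
                  bytes_per_param: float = 1.0) -> float:
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

# DDR5 (~40 GB/s) at 8-bit quant:
print(moe_tok_per_s(40, 13))  # active-param estimate: roughly 3 tok/s
print(moe_tok_per_s(40, 47))  # naive total-size estimate: under 1 tok/s
```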
xaskasdf 8 hours ago
yeah, actually I'm badly bottlenecked since my mobo only has PCIe 3.0 :(