| ▲ | tyfon 10 hours ago |
I didn't really understand the performance table until I saw that the top entries were 8B models. But 5 seconds per token is quite slow, yeah. I guess this is for low-RAM machines? I'm pretty sure my 5950X with 128 GB of RAM can run this faster on the CPU with some layers / prefill offloaded to the 3060 GPU I have. I also see they claim the process is compute-bound at 2 seconds per token, but that doesn't seem right for a 3090?
| ▲ | tgrowazay 10 hours ago | parent [-] |
LLM decode speed is roughly <memory_bandwidth> / <model_size> tok/s, since generating each token requires streaming all of the model's weights through memory once. DDR4 tops out at about 27 GB/s, and DDR5 can do around 40 GB/s. So for a 70B model at 8-bit quant (~70 GB of weights), you will get around 0.3-0.5 tokens per second using RAM alone.
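A minimal sketch of that back-of-the-envelope math in Python (the function name is made up, and the bandwidth figures are just the ones from this comment, not measurements):

    # Rough decode-speed ceiling: each generated token streams all model
    # weights through memory once, so tok/s ~= bandwidth / model size.
    def est_tokens_per_sec(params_b, bits_per_weight, bandwidth_gb_s):
        model_size_gb = params_b * bits_per_weight / 8  # 70B at 8-bit -> ~70 GB
        return bandwidth_gb_s / model_size_gb

    for name, bw in [("DDR4 ~27 GB/s", 27), ("DDR5 ~40 GB/s", 40)]:
        print(f"{name}: ~{est_tokens_per_sec(70, 8, bw):.2f} tok/s for 70B @ 8-bit")
    # prints ~0.39 tok/s for DDR4 and ~0.57 tok/s for DDR5

Note this is only the memory-bandwidth ceiling for batch-1 decoding; prompt prefill and batched inference are compute-bound and follow different math.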
| |||||||||||||||||||||||||||||||||||||||||