modeless 9 days ago

What's the best speed people have gotten on 4090s?

asabla 9 days ago | parent | next

I'm on a 5090, so it's not an apples-to-apples comparison, but I'm getting ~150 t/s for the 20B version with a ~16,000-token context.

steinvakt2 9 days ago | parent | next

And flash attention doesn't work on the 5090 yet, right? So currently the 4090 is probably faster, no?

PeterStuer 9 days ago | parent | next

I don't think the 4090 has native 4-bit support, which will probably have a significant impact.
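
If you want to check what your card reports, CUDA compute capability is a decent proxy. A rough sketch with PyTorch; the version-to-FP4 mapping is my assumption based on FP4 tensor cores arriving with Blackwell:

    import torch  # assumes a CUDA build of PyTorch

    # Ada (RTX 40xx) reports SM 8.9; Blackwell is SM 10.x (data center)
    # and SM 12.x (RTX 50xx). FP4 tensor cores arrived with Blackwell,
    # so older cards have to dequantize 4-bit weights before the matmul.
    major, minor = torch.cuda.get_device_capability(0)
    print(f"SM {major}.{minor}, native FP4: {major >= 10}")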

diggan 9 days ago | parent | prev

> And flash attention doesn't work on the 5090 yet, right?

Flash attention works with GPT-OSS + llama.cpp (tested on 1d72c8418) on another Blackwell card (the RTX Pro 6000), so I think it should work on the 5090 as well; it's the same architecture, after all.
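
If anyone wants to reproduce: the llama-cpp-python bindings expose the flash attention toggle directly. A minimal sketch, where the model path and context size are placeholders:

    from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build)

    llm = Llama(
        model_path="gpt-oss-20b.gguf",  # placeholder path to your GGUF file
        n_gpu_layers=-1,                # offload every layer to the GPU
        n_ctx=16384,
        flash_attn=True,                # the toggle in question
    )
    print(llm("Hello", max_tokens=32)["choices"][0]["text"])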

modeless 9 days ago | parent | prev

Cool, what software?

asabla 9 days ago | parent

Initial testing has only been done with ollama. I plan to test llama.cpp and vLLM when I have enough time.
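
For what it's worth, the ollama Python client makes a rough throughput check easy. A sketch; the model tag is an assumption (use whatever you pulled), and eval_duration comes back in nanoseconds:

    import ollama  # pip install ollama; assumes a local ollama server is running

    resp = ollama.chat(
        model="gpt-oss:20b",  # assumed tag; check `ollama list` for yours
        messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    )

    # the final response carries token counts and timings
    tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"~{tps:.0f} t/s")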

ActorNightly 9 days ago | parent | prev

You can't fit the model into a 4090 without quantization; it's like 64 gigs.

For home use, Gemma 27B QAT is king. It's almost as good as DeepSeek R1.
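
The back-of-the-envelope math, in case it's useful (parameter counts rounded; the MXFP4 bits-per-weight figure is approximate and ignores activations and KV cache):

    def weights_gb(params_b: float, bits_per_weight: float) -> float:
        """Weight-only footprint in GB, ignoring runtime overhead."""
        return params_b * bits_per_weight / 8

    print(weights_gb(117, 16))    # gpt-oss-120b at BF16: ~234 GB
    print(weights_gb(117, 4.25))  # ~62 GB at MXFP4-ish rates ("like 64 gigs")
    print(weights_gb(21, 4.25))   # gpt-oss-20b: ~11 GB, hence it fits a 16 GB card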

SirMaster 8 days ago | parent | next

You don't really need it all to fit in VRAM, thanks to the efficient MoE architecture and llama.cpp.

The 120B is running at 20 tokens/sec on my 5060 Ti 16GB with 64 GB of system RAM. Personally, I find 20 tokens/sec quite usable, but for some that may not be enough.
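
For the curious, the trick is that llama.cpp can pin the always-active tensors (attention, router, KV cache) on the GPU while the big MoE expert tensors stay in system RAM. A sketch of launching it that way from Python; flag names per recent llama.cpp builds, and the path is a placeholder:

    import subprocess

    # --override-tensor forces the MoE expert weights (".ffn_*_exps")
    # onto the CPU; everything else goes to the GPU via --n-gpu-layers.
    subprocess.run([
        "llama-server",
        "-m", "gpt-oss-120b.gguf",           # placeholder path
        "--n-gpu-layers", "999",
        "--override-tensor", ".ffn_.*_exps.=CPU",
        "--ctx-size", "16384",
    ])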

dexterlagan 8 days ago | parent

I have a similar setup but with 32 GB of RAM. Do you partially offload the model to RAM? Do you use LM Studio or something else to achieve this? Thanks!

modeless 9 days ago | parent | prev

The 20B one fits.

steinvakt2 9 days ago | parent

Does it fit on a 5080 (16gb)?

jwitthuhn 9 days ago | parent | next

Haven't tried it myself, but it looks like it probably does. The weight files total 13.8 GB, which leaves you a little room to hold your context.
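
Rough math on the leftover, if it helps: the KV cache is what eats the remaining ~2 GB. A sketch of the estimate; the layer/head numbers below are illustrative assumptions, not the published gpt-oss-20b config:

    def kv_cache_gb(layers, kv_heads, head_dim, ctx, bytes_per_elem=2):
        """K and V tensors per layer, fp16 by default."""
        return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

    # illustrative GQA-style config (assumed):
    print(kv_cache_gb(layers=24, kv_heads=8, head_dim=64, ctx=16384))  # ~0.8 GB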

northern-lights 9 days ago | parent | prev

It fits on a 5070 Ti, so it should fit on a 5080 as well.