segmondy 11 hours ago
Folks have more money than sense. gpt-oss-120b full quant runs on my quad 3090 at 100 tok/sec, and that's with llama.cpp; with vllm it would probably hit 150 tok/sec, and that's without batching.
amarshall 11 hours ago
You're almost certainly (definitely, in fact) confusing the 120b and 20b models. | |||||||||||||||||||||||
Aurornis 9 hours ago
> gpt-oss-120b full quant runs on my quad 3090

A 120B model cannot fit on 4 x 24GB GPUs at full quantization. Either you're confusing this with the 20B model, or you have 48GB modded 3090s.
ericd 10 hours ago
How're you fitting a model made for 80 gig cards onto a GPU with 24 gigs at full quant? | |||||||||||||||||||||||
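The back-of-envelope arithmetic behind these replies can be sketched as follows. This is a rough estimate only: the 20% overhead factor for KV cache and runtime buffers is an assumption, and real memory use depends on context length, quantization format, and inference engine.

```python
def weights_vram_gb(params_billions: float, bytes_per_param: float,
                    overhead: float = 1.2) -> float:
    """Rough VRAM (GB) to hold model weights, with an assumed ~20%
    overhead for KV cache, activations, and runtime buffers."""
    return params_billions * bytes_per_param * overhead

# Quad 3090 setup: 4 cards x 24 GB each
available_gb = 4 * 24  # 96 GB

# 120B parameters at 16-bit (2 bytes/param) vastly exceeds 96 GB,
# which is the point amarshall and Aurornis are making.
need_120b = weights_vram_gb(120, 2.0)
print(f"120B @ 16-bit: ~{need_120b:.0f} GB needed, "
      f"{available_gb} GB available, fits: {need_120b <= available_gb}")

# 20B parameters at 16-bit does fit comfortably, consistent with
# the suggestion that the 20b model was meant instead.
need_20b = weights_vram_gb(20, 2.0)
print(f"20B @ 16-bit:  ~{need_20b:.0f} GB needed, "
      f"{available_gb} GB available, fits: {need_20b <= available_gb}")
```

Note that lower-precision quantizations change the picture (e.g. 4-bit weights are 0.5 bytes/param), which is why "full quant" is the crux of the disagreement.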