| ▲ | ericd 13 hours ago |
| Was that cheaper than a Blackwell 6000? But yeah, 4x Blackwell 6000s are ~$32-36k, so not sure where the other $30k is going.
|
| ▲ | bastawhiz 12 hours ago | parent | next [-] |
| I bought the A100s used for a little over $6k each. |
| |
| ▲ | ericd 12 hours ago | parent [-] |
| Oh, why'd you go that route? Considering going beyond 80 gigs with NVLink or something?
|
|
| ▲ | segmondy 12 hours ago | parent | prev [-] |
| Folks have more money than sense. gpt-oss-120b at full quant runs on my quad 3090s at 100 tok/sec, and that's with llama.cpp; with vLLM it would probably run at 150 tok/sec, and that's without batching.
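(For anyone curious what the vLLM side of that would look like, here is a minimal sketch of tensor-parallel serving across four cards. The model id is the actual Hugging Face release; the memory and context settings are guesses, and it's unclear how well the MXFP4 path runs on Ampere cards like the 3090.)

    from vllm import LLM, SamplingParams

    # Shard the model across four GPUs with tensor parallelism.
    llm = LLM(
        model="openai/gpt-oss-120b",    # native MXFP4 checkpoint, roughly 60 GB of weights
        tensor_parallel_size=4,         # one shard per card
        gpu_memory_utilization=0.90,    # guess; leaves headroom for the KV cache
        max_model_len=8192,             # guess; shorter context means less KV cache per GPU
    )

    out = llm.generate(
        ["Explain KV caching in one short paragraph."],
        SamplingParams(temperature=0.7, max_tokens=256),
    )
    print(out[0].outputs[0].text)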
| |
| ▲ | amarshall 12 hours ago | parent | next [-] |
| You're almost certainly (definitely, in fact) confusing the 120b and 20b models.
| ▲ | Aurornis 10 hours ago | parent | prev | next [-] |
| > gpt-oss-120b full quant runs on my quad 3090
| A 120B model cannot fit on 4 x 24GB GPUs at full quantization. Either you're confusing this with the 20B model, or you have 48GB modded 3090s.
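(Rough weights-only arithmetic behind this, using figures not from the thread: gpt-oss-120b has roughly 117B total parameters, and its native MXFP4 format averages about 4.25 bits per weight.)

    # Back-of-envelope VRAM math, weights only (ignores KV cache and activations).
    params = 117e9                   # ~117B total parameters (assumed figure)
    print(params * 2 / 1e9)          # unquantized bf16: ~234 GB, far beyond 4 x 24 GB
    print(params * 4.25 / 8 / 1e9)   # native MXFP4: ~62 GB, which does fit in 96 GB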
| ▲ | ericd 12 hours ago | parent | prev [-] |
| How're you fitting a model made for 80 gig cards onto a GPU with 24 gigs at full quant?
| ▲ | Havoc 3 hours ago | parent | next [-] |
| He said quad 3090, not a single one.
| ▲ | zozbot234 12 hours ago | parent | prev [-] |
| Offloading the MoE expert layers to CPU inference is the easiest way, though it's a bit of a drag on performance.
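(A toy sketch of that idea in PyTorch: routing and everything else on the GPU, the big expert FFNs left in system RAM. This only illustrates the technique, not llama.cpp's actual implementation, and all the names and sizes here are made up.)

    import torch
    import torch.nn as nn

    dev = "cuda" if torch.cuda.is_available() else "cpu"

    class OffloadedMoE(nn.Module):
        """MoE block whose router lives on the GPU and whose expert weights stay on CPU."""
        def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts).to(dev)   # tiny, lives on the GPU
            self.experts = nn.ModuleList([                        # big weights stay in system RAM
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)])
            self.top_k = top_k

        @torch.no_grad()
        def forward(self, x):                                      # x: [tokens, d_model] on the GPU
            weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
            out = torch.zeros_like(x)
            x_cpu = x.cpu()                                        # activations cross to the CPU once
            for e, expert in enumerate(self.experts):
                hit = (idx == e).any(-1)                           # tokens routed to expert e
                if hit.any():
                    y = expert(x_cpu[hit.cpu()])                   # expert matmul runs on the CPU
                    w = weights[hit][idx[hit] == e].unsqueeze(-1)  # gate weight for this expert
                    out[hit] += w * y.to(x.device)                 # only the results go back to the GPU
            return out

    moe = OffloadedMoE()
    print(moe(torch.randn(16, 512, device=dev)).shape)             # torch.Size([16, 512])

The point is that only the per-token activations cross the PCIe bus; the multi-gigabyte expert weights never leave system RAM, which is why it fits but also why it drags on throughput.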
| ▲ | ericd 11 hours ago | parent [-] |
| Yeah, I'd just be pretty surprised if they were getting 100 tokens/sec that way. EDIT: Either they edited that to say "quad 3090s", or I just missed it the first time.
|
|
|