Remix clone Hacker News

ericd 13 hours ago

Was that cheaper than a Blackwell 6000?

But yeah, 4x Blackwell 6000s are ~32-36k, not sure where the other $30k is going.

bastawhiz 12 hours ago | parent | next [-]

I bought the A100s used for a little over $6k each.

ericd 12 hours ago | parent [-]

Oh, why'd you go that route? Considering going beyond 80 gigs with nvlink or something?

segmondy 12 hours ago | parent | prev [-]

Folks have more money than sense. gpt-oss-120b at full quant runs on my quad 3090 at 100 tk/sec, and that's with llama.cpp; with vLLM it will probably run at 150 tk/sec, and that's without batching.

amarshall 12 hours ago | parent | next [-]

You're almost certainly (definitely, in fact) confusing the 120b and 20b models.

Aurornis 10 hours ago | parent | prev | next [-]

> gpt-oss-120b full quant runs on my quad 3090

A 120B model cannot fit on 4 x 24GB GPUs at full quantization.

Either you're confusing this with the 20B model, or you have 48GB modded 3090s.
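
For a rough sense of the sizes involved, here is a back-of-envelope sketch in Python. It assumes a nominal 120e9 parameter count and counts weights only (no KV cache, activations, or runtime overhead). The helper `weight_gib` and the bit-widths tried are illustrative assumptions, not figures from any model card.

```python
def weight_gib(params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB (weights only, no KV cache or overhead)."""
    return params * bits_per_param / 8 / 2**30


NOMINAL_PARAMS = 120e9   # assumption: nominal "120B" parameter count
TOTAL_VRAM_GIB = 4 * 24  # four 24 GB cards, treated as GiB for simplicity

# Compare a few hypothetical bit-widths against the combined VRAM.
for bits in (16, 8, 4):
    need = weight_gib(NOMINAL_PARAMS, bits)
    verdict = "fits" if need < TOTAL_VRAM_GIB else "does not fit"
    print(f"{bits:>2}-bit weights: ~{need:.0f} GiB -> {verdict} in {TOTAL_VRAM_GIB} GiB")
```

Under these assumptions, whether the weights fit depends on which bit-width "full quant" refers to, and the thread leaves that open.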

ericd 12 hours ago | parent | prev [-]

How're you fitting a model made for 80 gig cards onto a GPU with 24 gigs at full quant?

Havoc 3 hours ago | parent | next [-]

He said quad 3090 not single

zozbot234 12 hours ago | parent | prev [-]

Offloading the MoE layers to CPU inference is the easiest way, though it's a bit of a drag on performance.

ericd 11 hours ago | parent [-]

Yeah, I'd just be pretty surprised if they were getting 100 tokens/sec that way. EDIT: Either they edited that to say "quad 3090s", or I just missed it the first time.