ericd 13 hours ago

How're you fitting a model made for 80 gig cards onto a GPU with 24 gigs at full quant?

Havoc 5 hours ago | parent | next [-]

He said quad 3090 not single
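For rough numbers (my assumptions, not figures from the thread): a ~120B-parameter MoE quantized around 4.25 bits per weight needs on the order of 60 GB for the weights alone, which overflows a single 24 GB 3090 but fits comfortably across four:

```python
# Back-of-envelope VRAM check. All figures here are illustrative
# assumptions, not measurements from this thread.
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a quantized model."""
    return n_params * bits_per_weight / 8 / 1024**3

params = 117e9   # assumed ~117B total parameters for the MoE
bpw = 4.25       # assumed ~4.25 bits/weight, typical for a 4-bit quant

need = weight_gb(params, bpw)
single_3090 = 24           # GiB of VRAM on one RTX 3090
quad_3090 = 4 * single_3090

print(f"weights: ~{need:.0f} GiB, one 3090: {single_3090} GiB, quad: {quad_3090} GiB")
assert need > single_3090  # doesn't fit one card
assert need < quad_3090    # fits four with headroom for KV cache
```

This ignores KV cache and activation memory, so real headroom is smaller, but the quad-vs-single distinction is clear either way.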

ericd an hour ago | parent [-]

Yeah, pretty sure that was edited in after I commented, since the 150 tok/s figure was also new, but I could've just missed it.

zozbot234 13 hours ago | parent | prev [-]

Offloading the MoE layers to CPU inference is the easiest way, though it's a bit of a drag on performance
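For anyone wanting to try this: with llama.cpp the usual trick is to keep attention and dense layers on the GPU and push the (large but sparsely activated) expert tensors to system RAM via a tensor-override pattern. A sketch from memory; flag spellings and the tensor-name regex vary by version and model, so check `--help` on your build:

```shell
# Hypothetical invocation: offload MoE expert tensors to CPU,
# keep everything else on GPU. Model path is a placeholder.
./llama-server \
  -m ./model-q4.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU"
```

Because only a few experts fire per token, the CPU-resident weights see far less traffic than their size suggests, which is why this degrades throughput instead of destroying it.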

ericd 13 hours ago | parent [-]

Yeah, I'd just be pretty surprised if they were getting 100 tokens/sec that way.

EDIT: Either they edited that to say "quad 3090s", or I just missed it the first time.

segmondy 32 minutes ago | parent [-]

You are correct, I did forget to add "quad." You should join us in r/LocalLLaMA

and check out what other people are getting. You're welcome.

https://www.reddit.com/r/LocalLLaMA/comments/1nunq7s/gptoss1... https://www.reddit.com/r/LocalLLaMA/comments/1p4evyr/most_ec...
