| ▲ | ericd 13 hours ago |
| How're you fitting a model made for 80 gig cards onto a GPU with 24 gigs at full quant? |
|
| ▲ | Havoc 5 hours ago | parent | next [-] |
| He said quad 3090, not single. |
| |
| ▲ | ericd an hour ago | parent [-] |
| Yeah, I'm pretty sure that was edited in after I commented, because the 150 tokens/sec figure was also new, but I could've just missed it. |
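For anyone following the arithmetic, here is a rough sketch of why quad vs. single matters. It assumes "made for 80 gig cards" means roughly 80 GB of weights, which is my reading and not something stated in the thread, and it ignores KV cache, activations, and multi-GPU splitting overhead. The function names are made up for illustration.

```python
def total_vram_gb(per_gpu_gb: float, num_gpus: int) -> float:
    """Pooled VRAM across identical GPUs (ignores per-GPU overhead)."""
    return per_gpu_gb * num_gpus


def fits(model_gb: float, per_gpu_gb: float, num_gpus: int) -> bool:
    """True if the weights alone fit in the pooled VRAM."""
    return model_gb <= total_vram_gb(per_gpu_gb, num_gpus)


# Assumed ~80 GB of weights; a 3090 has 24 GB of VRAM.
single = fits(80, 24, 1)  # one card: 24 GB available
quad = fits(80, 24, 4)    # four cards: 96 GB pooled
print(single, quad)
```

In practice the margin is tighter than this suggests, since context and runtime buffers also need room on the cards.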
|
|
| ▲ | zozbot234 13 hours ago | parent | prev [-] |
| Offloading the MoE layers to CPU inference is the easiest way, though it's a bit of a drag on performance. |
| |
| ▲ | ericd 13 hours ago | parent [-] |
| Yeah, I'd just be pretty surprised if they were getting 100 tokens/sec that way. EDIT: Either they edited that to say "quad 3090s", or I just missed it the first time. |
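A minimal sketch of the offload idea being discussed: keep the dense (attention and shared) weights on the GPU and spill whatever expert weights don't fit into CPU memory. The sizes below are hypothetical placeholders, not measurements of any real model, and `split_moe` is an illustrative helper, not any inference engine's API.

```python
def split_moe(dense_gb: float, expert_gb: float, vram_gb: float) -> dict:
    """Keep dense weights on GPU; spill expert weights to CPU as needed."""
    if dense_gb > vram_gb:
        raise ValueError("dense weights alone exceed VRAM")
    # Fill the remaining VRAM with as many expert weights as will fit.
    experts_on_gpu = min(expert_gb, vram_gb - dense_gb)
    return {
        "gpu_gb": dense_gb + experts_on_gpu,
        "cpu_gb": expert_gb - experts_on_gpu,
    }


# Hypothetical split: 10 GB dense, 70 GB experts, one 24 GB card.
plan = split_moe(dense_gb=10, expert_gb=70, vram_gb=24)
print(plan)
```

Whatever lands in `cpu_gb` gets computed on the CPU for each token that routes to those experts, which is where the performance drag comes from and why 100 tokens/sec would be surprising with a large share offloaded.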
|