thewebguyd 7 hours ago

I'd go for at least 32GB. It'll fit in 24GB, but that leaves you little to no room for context, and that's at 4-bit quantization.

If you want to run unquantized, you definitely need 128GB.
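As a rough sanity check on numbers like these, weight memory scales as parameter count times bits per weight. The sketch below assumes a hypothetical 30B-parameter dense model (the thread does not state the actual size) and nominal bit widths; real quantization formats carry some per-block overhead, and the KV cache for context comes on top of this.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    Ignores KV cache, activations, and quantization metadata overhead.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Hypothetical parameter count, chosen only for illustration.
N_PARAMS = 30e9

for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{label}: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB for weights")
```

Whatever is left over after the weights is what you have for context, which is why a model that technically fits can still be uncomfortable to use.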

Catloafdev 6 hours ago | parent | next [-]

Nobody runs unquantized; there's literally no reason to. Q8 is the largest anyone actually runs on consumer hardware for inference.

bityard 3 hours ago | parent | prev [-]

Halving the precision of the weights is not a free lunch...

Catloafdev an hour ago | parent [-]

Q8 is virtually lossless. The quantization is much more noticeable around Q4 and below. Going from FP16 to Q8 on consumer hardware gives roughly 2x the speed at ~99.99% of the quality.

bitexploder 6 hours ago | parent | prev | next [-]

It also comes down to inference speed, not just "can I run this". An 8-bit quant is quite a bit slower than a 4-bit one on an M5 Pro.
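One common rule of thumb for why fewer bits means faster generation: single-stream decoding is usually memory-bandwidth-bound, so each token requires reading roughly all the weights once. The sketch below assumes a dense model, a hypothetical 30B parameters, and a hypothetical 300 GB/s of memory bandwidth (not a measured figure for any particular chip). It gives an upper bound, not a benchmark.

```python
def max_decode_tokens_per_s(bandwidth_gb_s: float,
                            n_params: float,
                            bits_per_weight: float) -> float:
    """Upper bound on decode speed if every weight is read once per token."""
    weight_gb = n_params * bits_per_weight / 8 / 1e9
    return bandwidth_gb_s / weight_gb


# Both values are illustrative assumptions, not specs from the thread.
BANDWIDTH_GB_S = 300.0
N_PARAMS = 30e9

for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    tps = max_decode_tokens_per_s(BANDWIDTH_GB_S, N_PARAMS, bits)
    print(f"{label}: at most ~{tps:.0f} tokens/s")
```

Under these assumptions, halving the bits per weight doubles the ceiling, which lines up with both the "FP16 to Q8 is 2x" claim and Q8 feeling slow next to Q4.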
