I'd go for at least 32GB. It'll fit in 24GB, but that leaves you little to no room for context, and that's at 4-bit quantization.

If you want to run unquantized, you definitely need 128GB.

Nobody runs unquantized; there's literally no reason to. Q8 would be the largest anyone actually runs on consumer hardware for inference.
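
For a rough sense of where figures like these come from, here is a back-of-the-envelope sketch of weight size versus bits per weight. The 30B parameter count is a made-up placeholder (the thread never states the model size), and `weight_gib` is a hypothetical helper. It counts weights only: context (KV cache), runtime overhead, and the per-block scale data that real quantized formats carry all come on top.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Hypothetical parameter count, chosen only for illustration.
PARAMS_B = 30.0

for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{label}: ~{weight_gib(PARAMS_B, bits):.1f} GiB of weights")
```

Each halving of precision halves the weight footprint, which is why a model that barely fits at 4-bit needs roughly four times the memory at FP16 before any context is counted.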

bityard 3 hours ago | parent | prev [-]

Halving the precision of the weights is not a free lunch...

Catloafdev an hour ago | parent [-]

Q8 is virtually lossless. Quantization loss becomes much more noticeable around Q4 and below. Going from FP16 to Q8 on consumer hardware gives about 2x the speed at ~99.99% of the quality.

bitexploder 6 hours ago | parent | prev | next [-]

It also comes down to inference speed, not "can I run this". 8-bit quant is quite a bit slower on an M5 Pro.
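
The speed point can be sketched with a common rule of thumb: if token generation is memory-bandwidth bound, each new token has to read roughly all of the weights once, so the ceiling is bandwidth divided by weight size. This is an assumption, not a measurement. It ignores KV cache reads, compute limits, and anything that reads fewer weights per token, and the 300 GB/s figure is a placeholder rather than the spec of any particular chip. `decode_tokens_per_sec` is a hypothetical helper.

```python
def decode_tokens_per_sec(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s / weight_gb


# Placeholder numbers for illustration only.
BANDWIDTH_GB_S = 300.0
for label, weight_gb in [("FP16", 60.0), ("Q8", 30.0), ("Q4", 15.0)]:
    rate = decode_tokens_per_sec(weight_gb, BANDWIDTH_GB_S)
    print(f"{label}: at most ~{rate:.0f} tokens/s")
```

Under that model, halving the bytes per weight doubles the ceiling, which fits both the "2x the speed" claim for FP16 to Q8 and the observation that 8-bit is noticeably slower than 4-bit on the same machine.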
