| ▲ | GaggiX 10 days ago | |||||||
> That's technically encoding Isn't that just projecting the patches into the d_model-sized vectors that the model takes? > I am assuming that involves of quantization A 12B model in 16GB seems very reasonable to me; int8 is top quality for running models. | ||||||||
| ▲ | WhitneyLand 10 days ago | |||||||
I don’t think so: the HF weights are bf16, which means 24GB plus cache/overhead. It sounds like marketing spin, where the performance claims are based on the bf16 weights and the “runs in 16GB” claim is based on a totally different, quantized version. | ||||||||
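The 24GB figure is just parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming decimal gigabytes (1e9 bytes) and counting weights only; the helper name `weight_memory_gb` is made up for illustration, and KV cache and runtime overhead come on top of these numbers:

```python
# Bytes per parameter for the two formats mentioned in the thread.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Weights only, in decimal GB; excludes KV cache and runtime overhead."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


n_params = 12e9      # "12B" as quoted in the thread
system_gb = 16       # the "runs in 16GB" claim

for dtype in ("bf16", "int8"):
    gb = weight_memory_gb(n_params, dtype)
    print(f"{dtype}: {gb:.0f} GB of weights, {gb / system_gb:.0%} of {system_gb} GB")
```

Under these assumptions the bf16 weights alone exceed 16GB, while the int8 weights fit only with little room left for anything else.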
| ||||||||
| ▲ | minimaxir 10 days ago | |||||||
The guide describes it as projection, although there is apparently an extra step: "A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input." 12B at int8 would take up 12GB of memory, or 75% of system memory on a 16GB machine, which technically fits but the OS will not like that. EDIT: On my 18GB MacBook Pro, LM Studio reports a "partial GPU offload" for the int8 MLX weights. Can't test because the `gemma_unified` architecture is NYI. | ||||||||
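For readers unfamiliar with the projection step and the quoted "factorized coordinate lookup", here is a rough pure-Python sketch of the general idea. It is not the model's actual code: the function names, shapes, and the choice to add the X and Y embeddings to the projected patch (rather than, say, concatenating them) are all assumptions made for illustration.

```python
import random


def make_matrix(rows, cols, rng):
    """Hypothetical helper: a rows x cols matrix of small random weights."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def embed_patches(patches, grid_w, w_proj, x_table, y_table):
    """Project flattened patches to d_model and attach position information.

    patches: flattened patch vectors in row-major grid order.
    w_proj:  d_model x patch_dim projection matrix.
    x_table: one d_model vector per grid column (the "X matrix").
    y_table: one d_model vector per grid row (the "Y matrix").
    """
    embeddings = []
    for idx, patch in enumerate(patches):
        row, col = divmod(idx, grid_w)
        # Linear projection: patch_dim -> d_model.
        projected = [sum(p * w for p, w in zip(patch, w_row)) for w_row in w_proj]
        # Factorized lookup (assumed additive): column embedding + row embedding.
        embeddings.append(
            [v + x + y for v, x, y in zip(projected, x_table[col], y_table[row])]
        )
    return embeddings


rng = random.Random(0)
grid_h, grid_w, patch_dim, d_model = 2, 3, 4, 5
patches = make_matrix(grid_h * grid_w, patch_dim, rng)
tokens = embed_patches(
    patches,
    grid_w,
    make_matrix(d_model, patch_dim, rng),
    make_matrix(grid_w, d_model, rng),
    make_matrix(grid_h, d_model, rng),
)
print(len(tokens), len(tokens[0]))
```

The point of factorizing is that the lookup stores one vector per row plus one per column, instead of one per (row, column) position.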
| ||||||||