WithinReason 10 hours ago

It's on the page:

  Precision  Quantization Tag  File Size
  1-bit      UD-IQ1_M          10 GB
  2-bit      UD-IQ2_XXS        10.8 GB
             UD-Q2_K_XL        12.3 GB
  3-bit      UD-IQ3_XXS        13.2 GB
             UD-Q3_K_XL        16.8 GB
  4-bit      UD-IQ4_XS         17.7 GB
             UD-Q4_K_XL        22.4 GB
  5-bit      UD-Q5_K_XL        26.6 GB
  16-bit     BF16              69.4 GB
Aurornis 9 hours ago | parent | next [-]

Additional VRAM is needed for context.

This is a MoE model with only ~3B active parameters per token, which works well with partial CPU offload. So in practice you can run the -A(N)B models on systems that have a little less VRAM than you need. The more you offload to the CPU, the slower it gets, though.
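As a back-of-the-envelope illustration of the trade-off (the 48-layer count and the 4 GB KV-cache figure below are made-up numbers for the sketch, not this model's real ones):

```python
# Rough sketch, not llama.cpp's actual accounting: weights are split roughly
# evenly across layers, and the KV cache for the context travels with the
# layers it belongs to, so offloading layers to the CPU shrinks both shares.
def vram_needed_gb(file_size_gb: float, n_layers: int,
                   gpu_layers: int, kv_cache_gb: float) -> float:
    weight_share = file_size_gb * gpu_layers / n_layers
    kv_share = kv_cache_gb * gpu_layers / n_layers
    return weight_share + kv_share

# e.g. the 17.7 GB UD-IQ4_XS file, assuming 48 layers, 40 of them on the GPU,
# and ~4 GB of KV cache for the context:
print(round(vram_needed_gb(17.7, 48, 40, 4.0), 1))  # → 18.1
```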

Glemllksdf 9 hours ago | parent [-]

Isn't that some kind of gamble if you offload random experts onto the CPU?

Or is it only whole layers that get offloaded? But that would affect all experts.

dragonwriter 8 hours ago | parent [-]

Pretty sure all partial offload systems I’ve seen work by layers, but there might be something else out there.
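Besides whole layers, llama.cpp can also override where individual tensors live, which is a common way to run MoE models: keep attention on the GPU and push all expert FFN weights to the CPU. The invocation below is a sketch; the model filename and the tensor-name regex are assumptions about this model's tensor naming, so check your build's `--override-tensor` docs:

```shell
# Hypothetical invocation. -ngl 99 offloads all layers to the GPU, then
# -ot (--override-tensor) forces tensors matching the regex -- the MoE
# expert FFN weights -- to stay in CPU memory. Every expert's weights are
# on the CPU, so there is no per-token gamble about which expert happened
# to be offloaded; routing just always pays the CPU cost for expert FFNs.
llama-server -m model-UD-Q4_K_XL.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU"
```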

est 8 hours ago | parent | prev | next [-]

I really want to know what M, K, XL, and XS mean in this context and how to choose between them.

I searched all the unsloth docs and there seems to be no explanation at all.

tredre3 4 hours ago | parent | next [-]

Q4_K is a type of quantization. It means that all weights will be at minimum 4 bits, using the K method.

But if you're willing to give more bits to only certain important weights, you get to preserve a lot more quality for not that much more space.

The S/M/L/XL is what tells you how many tensors get to use more bits.

The difference between S and M is generally noticeable (on benchmarks). The difference between M and L/XL is less so, let alone in real use (ymmv).

Here's an example of the contents of Q4_K_S, Q4_K_M, and Q4_K_L:

    S
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  136 tensors
    llama_model_loader: - type q5_0:   43 tensors
    llama_model_loader: - type q5_1:   17 tensors
    llama_model_loader: - type q6_K:   15 tensors
    llama_model_loader: - type q8_0:   55 tensors
    M
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  106 tensors
    llama_model_loader: - type q5_0:   32 tensors
    llama_model_loader: - type q5_K:   30 tensors
    llama_model_loader: - type q6_K:   15 tensors
    llama_model_loader: - type q8_0:   83 tensors
    L
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  106 tensors
    llama_model_loader: - type q5_0:   32 tensors
    llama_model_loader: - type q5_K:   30 tensors
    llama_model_loader: - type q6_K:   14 tensors
    llama_model_loader: - type q8_0:   84 tensors
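To see why M is only slightly larger than S, here's a rough sketch using the tensor counts above. A count-weighted average is a crude proxy (real tensors differ in size), and the bits-per-weight figures are approximate nominal values for each quant type; the tiny f32 tensors (norms, biases) are left out:

```python
# Approximate effective bits per weight for each quant type (assumption:
# these are nominal figures, e.g. q4_K carries scale overhead beyond 4 bits).
BITS = {"q4_K": 4.5, "q5_0": 5.5, "q5_1": 6.0, "q5_K": 5.5,
        "q6_K": 6.6, "q8_0": 8.5}

# Tensor counts from the llama_model_loader output above (f32 excluded).
S = {"q4_K": 136, "q5_0": 43, "q5_1": 17, "q6_K": 15, "q8_0": 55}
M = {"q4_K": 106, "q5_0": 32, "q5_K": 30, "q6_K": 15, "q8_0": 83}

def mean_bits(mix: dict) -> float:
    """Count-weighted mean bits per weight across the quantized tensors."""
    total = sum(mix.values())
    return sum(BITS[t] * n for t, n in mix.items()) / total

# M bumps ~60 tensors from 4-5 bits up to 5-8 bits, so the average only
# moves from roughly 5.7 to 6.1 bits -- hence the small file-size gap.
print(round(mean_bits(S), 2), round(mean_bits(M), 2))
```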
huydotnet 8 hours ago | parent | prev | next [-]

They are different quantization types; you can read more here: https://huggingface.co/docs/hub/gguf#quantization-types

arcanemachiner 2 hours ago | parent | prev [-]

Just start with q4_k_m and figure out the rest later.

JKCalhoun 8 hours ago | parent | prev | next [-]

"16-bit BF16 69.4 GB"

Is that (BF16) a 16-bit float?

adrian_b 2 hours ago | parent | next [-]

The IEEE standard FP16 is an older 16-bit format, which has balanced exponent and significand sizes.

It was initially supported by GPUs, where it is especially useful for storing the color components of pixels. For geometry data, FP32 is preferred.

In CPUs, some support was first added in 2012, in Intel Ivy Bridge (the F16C conversion instructions). Better support is provided in some server CPUs, and starting next year also in the desktop AMD Zen 6 and Intel Nova Lake.

BF16 is a format introduced by Google, intended only for AI/ML applications, not for graphics, so it was initially implemented in some of the Intel server CPUs and only later in GPUs. Unlike FP16, which is balanced, BF16 has great dynamic range but very low precision. This is fine for ML but inappropriate for most other applications.

Nowadays, most LLMs are trained predominantly in BF16, with a small number of parameters kept in FP32 for higher precision.

Then, from the biggest model that uses BF16, smaller quantized models are derived, which use 8 bits or less per parameter, trading off accuracy for size and speed.

mtklein 8 hours ago | parent | prev | next [-]

Yes, it's a "Brain float", basically an ordinary 32-bit float with the low 16 mantissa bits cut off. Exact same range as fp32, lower precision, and not the same as the other fp16, which has fewer exponent bits and more mantissa bits.
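The "cut off the low 16 bits" description can be shown directly by bit manipulation (a sketch of the format, assuming truncation; real hardware conversion usually rounds to nearest rather than truncating):

```python
import struct

# Reinterpret a float32 as its 32-bit pattern, keep only the high 16 bits
# (1 sign bit + 8 exponent bits + top 7 mantissa bits), and reinterpret back.
# The surviving 16 bits are exactly the bfloat16 representation.
def to_bf16(x: float) -> float:
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(to_bf16(3.14159265))  # → 3.140625: same magnitude, far fewer digits
```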

Gracana 8 hours ago | parent | prev | next [-]

https://en.wikipedia.org/wiki/Bfloat16_floating-point_format

Yes, however it's a different format from standard fp16: it trades precision for greater dynamic range.

WithinReason 8 hours ago | parent | prev [-]

yes, it has 8 exponent bits like float32 instead of 5 like float16

palmotea 9 hours ago | parent | prev [-]

Thanks! I'd scanned the main content but I'd been blind to the sidebar on the far right.