verdverm 8 hours ago

https://huggingface.co/moonshotai/Kimi-K2.6

Is this the same model?

Unsloth quants: https://huggingface.co/unsloth/Kimi-K2.6-GGUF

(work in progress, no gguf files yet, header message saying as much)

SwellJoe 6 hours ago | parent | next [-]

A trillion parameters is wild. That's not going to quantize to anything normal folks can run. Even at 1-bit, it's going to be bigger than what a Strix Halo or DGX Spark can run. Though I guess streaming from system RAM and disk makes it feasible to run it locally at <1 token per second, or whatever. GLM 5.1, at 754B parameters, is already beyond any reasonable self-hosting hardware (1-bit quantization is 206GB). Maybe a Mac Studio with 512GB can run them at very low-bit quantizations, also pretty slowly.
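The raw arithmetic behind these sizes is simple; a quick sketch (note that real quant files come out larger than the naive figure because some tensors are kept at higher precision, which is why a "1-bit" GLM quant lands at 206GB rather than ~94GB):

```python
# Back-of-envelope memory needed just to hold the weights at a given
# quantization. Parameter counts are the ones quoted in the thread;
# KV cache, activations, and mixed-precision overhead are ignored.
def model_gb(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1e9  # decimal GB

for name, params in [("Kimi-K2.6", 1.1e12), ("GLM (754B)", 754e9)]:
    for bits in (16, 8, 4, 1):
        print(f"{name} @ {bits}-bit: {model_gb(params, bits):,.0f} GB")
# e.g. 1.1T params at 16-bit is 2,200 GB; at 4-bit, 550 GB.
```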

justinclift 16 minutes ago | parent | next [-]

Looks like it. This quant ( https://huggingface.co/inferencerlabs/Kimi-K2.6-MLX-3.6bit ) says:

> Q3.6 typically achieves useable accuracy in our coding test and fits within a 512GB memory budget

This one ( https://huggingface.co/mlx-community/Kimi-K2.6-MoE-Smart-Qua... ) though says it fits on a 192GB mac:

> M3/M4 Ultra 192GB+ (fits in ~150GB)

jauntywundrkind 6 hours ago | parent | prev [-]

A huge dual-socket Epyc system used to be able to get to 1TB without difficulty: 16 DIMMs of 64GB each, doable for ~$3000, with considerable memory bandwidth.

Our hope these days seems to be that maybe, perhaps, possibly, High Bandwidth Flash works out: instead of the 4 or 8 flash channels of today's drives (or a few more on the highest-end ones), having many dozens of channels.

Ideally that can sit very near the inference hardware. PCIe 7.0 at x16 is roughly 2 Tb/s (~256 GB/s per direction), which is obviously nowhere remotely near enough throughput here. The difficulty is that NAND has been trying to be super dense, so as you scale channels you would normally scale NAND capacity too, and now instead of a 2TB drive you have a 200TB drive priced way beyond consumer means. Still, I think HBF is perhaps the only shot at the most important thing in computing moving from mainframe back to consumer, and of course the models are going to balloon again if this does hit, probably before consumers ever get a chance.
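The bandwidth argument can be made concrete. A rough ceiling on decode speed when expert weights stream over a link, assuming (as a hypothetical, roughly K2-class figure) ~32B active parameters per token at 4 bits/weight; in practice, caching hot experts in RAM raises these numbers:

```python
# Upper bound on tokens/sec if every active weight must cross the link
# once per token. 32e9 active params is an assumption, not a spec value.
ACTIVE_PARAMS = 32e9
BYTES_PER_TOKEN = ACTIVE_PARAMS * 4 / 8  # int4 -> 16 GB read per token

for link, gb_per_s in [("PCIe 5.0 x16", 64), ("PCIe 7.0 x16", 256), ("HBM-class", 3000)]:
    print(f"{link}: ~{gb_per_s * 1e9 / BYTES_PER_TOKEN:.1f} tok/s ceiling")
```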

Balinares 7 hours ago | parent | prev | next [-]

Quite curious how well real usage will back the benchmarks, because even if it's only Opus ballpark, open weights Opus ballpark is seismic.

gpm 7 hours ago | parent | prev [-]

Huh, so the metadata says 1.1 trillion parameters, each 32 or 16 bits.

But the files are only roughly 640GB in size (~10GB * 64 files, slightly less in fact). Shouldn't they be closer to 2.2TB?

johndough 6 hours ago | parent | next [-]

The bulk of Kimi-K2.6's parameters are stored with 4 bits per weight, not 16 or 32. There are a few parameters that are stored with higher precision, but they make up only a fraction of the total parameters.
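The file size itself pins down the average precision; a one-liner using the ~640GB figure from upthread:

```python
# Average bits per weight implied by the checkpoint size.
total_bytes = 640e9   # rough total of the shards
params = 1.1e12       # parameter count from the metadata
print(f"~{total_bytes * 8 / params:.1f} bits per weight on average")
# ~4.7 bits/weight: consistent with int4 experts plus a bf16 minority.
```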

gpm 6 hours ago | parent [-]

Huh, cool. I guess that makes a lot of sense with all the success the quantization people have been having.

So am I misunderstanding "Tensor type F32 · I32 · BF16" or is it just tagged wrong?

rockinghigh 4 hours ago | parent | next [-]

The MoE experts are quantized to int4; other weights, like the shared expert weights, are excluded from quantization and stay in bf16.

liuliu 4 hours ago | parent | prev [-]

The I32 tensors are eight 4-bit values packed into one int32.
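A minimal sketch of that packing, low nibble first (the actual nibble order in the checkpoint is an assumption here):

```python
# Pack eight 4-bit values (0..15) into one 32-bit word and unpack them
# again -- eight nibbles exactly fill an int32.
def pack8(nibbles):
    assert len(nibbles) == 8 and all(0 <= n < 16 for n in nibbles)
    word = 0
    for i, n in enumerate(nibbles):
        word |= n << (4 * i)  # nibble i occupies bits [4i, 4i+4)
    return word

def unpack8(word):
    return [(word >> (4 * i)) & 0xF for i in range(8)]

vals = [3, 15, 0, 7, 1, 9, 12, 5]
assert unpack8(pack8(vals)) == vals  # round-trips losslessly
```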

coder543 6 hours ago | parent | prev [-]

The description specifically says:

"Kimi-K2.6 adopts the same native int4 quantization method as Kimi-K2-Thinking."