terataiijo 8 hours ago

lmao they are so fast yooo

ttul 8 hours ago | parent | next [-]

Yes. How do they do it? Literally they must have PagerDuty set up to alert the team the second one of the labs releases anything.

beernet 8 hours ago | parent | next [-]

They obviously collaborate with some of the labs prior to the official release date.

sigbottle 8 hours ago | parent [-]

That... is a more plausible explanation I didn't think of.

danielhanchen 8 hours ago | parent [-]

Yes we collab with them!

qskousen 5 hours ago | parent [-]

Sorry, this is a bit of a tangent, but I noticed you also released UD quants of ERNIE-Image the same day it was released, which as I understand requires generating a bunch of images. I've been working on something similar with my CLI program ggufy, and was curious if you had any info you could share on the kind of compute you put into that, and whether you generate full images or look at latents?

sigbottle 8 hours ago | parent | prev [-]

Is quantization a mostly solved pipeline at this point? I thought architectures were varied and weird enough that you can't just click a button, say "go optimize these weights", and go. New models ship new code they want to run, right? So you'd have to analyze that code and insert the quantization at the right places, automatically, then make sure that doesn't degrade perf?

Maybe I just don't understand how quantization works, but I thought it was a very nasty problem involving a lot of plumbing.
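Like, I'd assume each new architecture needs a hand-maintained recipe along these lines - a totally made-up sketch, the tensor patterns and quant types are just illustrative:

    # Hypothetical per-architecture quant recipe - everything here is made up
    # for illustration. The point: someone has to decide, per tensor pattern,
    # which precision is safe for this particular model family.
    import re

    MADE_UP_RECIPE = [
        (r"blk\.\d+\.attn_.*",    "Q4_K"),  # attention weights tolerate 4-bit
        (r"blk\.\d+\.ffn_down.*", "Q6_K"),  # down-projections tend to be sensitive
        (r".*ssm_.*",             "Q8_0"),  # state-space tensors need high precision
        (r".*",                   "Q4_K"),  # default for everything else
    ]

    def pick_quant_type(tensor_name: str, recipe) -> str:
        """Return the quant type of the first pattern matching the tensor name."""
        for pattern, qtype in recipe:
            if re.fullmatch(pattern, tensor_name):
                return qtype
        return "F16"  # fall back to unquantized

    print(pick_quant_type("blk.61.ffn_down_exps", MADE_UP_RECIPE))  # -> Q6_K

If that's roughly right, every new model family means someone hand-tuning that table.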

bildung 8 hours ago | parent | prev | next [-]

Bad QA :/ They had a bunch of broken quantizations in their last few releases

danielhanchen 8 hours ago | parent [-]

1. Gemma-4 - we re-uploaded 4 times. The first 3 were for 10-20 llama.cpp bug fixes, and we had to notify people to download the corrected ones. The 4th was an official Gemma chat template improvement from Google themselves.

2. Qwen3.5 - we shared our 7TB of research artifacts showing which layers not to quantize. All providers' quants were under-optimized, not broken - the ssm_out and ssm_* tensors were the issue. We're now the best in terms of KLD and disk space (rough sketch of what KLD measures at the end of this comment).

3. MiniMax 2.7 - we swiftly fixed a NaN PPL (perplexity) issue. We found it in all quants regardless of provider, so it affected everyone, not just us. We wrote a post on it and shipped a fix; others have applied our fix to their quants, while some haven't updated yet.

Note we've also fixed bugs in many OSS models - Gemma 1, Gemma 3, Llama (chat template fixes), Mistral, and many more.

Unfortunately quants sometimes break, but we fix them quickly, and 95% of the time the cause is out of our hands.

When something does break, we fix it swiftly and write up a blog on what happened. Other providers then take our blogs and re-apply our fixes.
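(For the curious: "KLD" above is the KL divergence between the full-precision model's next-token distribution and the quant's - lower is better. A minimal sketch of the idea, not our actual evaluation harness:)

    import numpy as np

    def mean_kld(logits_fp: np.ndarray, logits_q: np.ndarray) -> float:
        """Mean KL divergence D(P_fp || P_q) across token positions.

        Rows = positions, columns = vocab logits. A broken quant (e.g. an
        overflowing tensor) shows up as a huge value - or as NaN, which is
        exactly the NaN PPL symptom mentioned above."""
        def softmax(x):
            x = x - x.max(axis=-1, keepdims=True)
            e = np.exp(x)
            return e / e.sum(axis=-1, keepdims=True)

        p, q = softmax(logits_fp), softmax(logits_q)
        return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))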

rohansood15 7 hours ago | parent | next [-]

Thanks for all the amazing work Daniel. I remember you guys being late to OH because you were working on weights released the night before - and it's great to see you guys keep up the speed!

danielhanchen 7 hours ago | parent [-]

Oh thanks haha :) We try our best to get model releases out the door! :) Hope you're doing great!

bildung 8 hours ago | parent | prev [-]

Fair enough, appreciate the detailed response! Can you elaborate on why other quantizations weren't affected (e.g. bartowski's)? Simply because they were straight Q4 etc. for every layer?

danielhanchen 7 hours ago | parent [-]

No, Bartowski's are more affected (38% NaN) than ours (22%) for MiniMax 2.7 - see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax...

We already fixed ours. Bart hasn't yet, but is working on it following our findings.

blk.61.ffn_down_exps in Q4_K or Q5_K failed - it must be Q6_K, otherwise it overflows.

For the others, yes - some layers don't work below a certain precision. E.g. for Qwen3.5, ssm_out must be Q4-Q6_K at minimum.

ssm_alpha and ssm_beta must be Q8_0 or higher.

Again, Bart and others apply our findings - see https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwe...
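If you want to audit a GGUF yourself, the gguf Python package (maintained in the llama.cpp repo) can list each tensor's quant type. A rough sketch - the floor table below just mirrors the findings in this thread, it's not exhaustive:

    from gguf import GGUFReader  # pip install gguf

    # Maps a tensor-name substring to quant types known to misbehave for it,
    # per the findings above (illustrative, not an exhaustive list).
    TOO_LOW = {
        "ffn_down_exps": {"Q4_K", "Q5_K"},          # must be Q6_K or it overflows
        "ssm_out":       {"Q2_K", "Q3_K"},          # minimum Q4-Q6_K
        "ssm_alpha":     {"Q4_K", "Q5_K", "Q6_K"},  # must be Q8_0 or higher
        "ssm_beta":      {"Q4_K", "Q5_K", "Q6_K"},
    }

    reader = GGUFReader("model.gguf")
    for t in reader.tensors:
        qtype = t.tensor_type.name  # e.g. "Q4_K", "Q8_0", "F16"
        for substring, bad_types in TOO_LOW.items():
            if substring in t.name and qtype in bad_types:
                print(f"{t.name}: {qtype} is below the safe floor")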

bildung 7 hours ago | parent [-]

Thanks again, TIL

danielhanchen 7 hours ago | parent [-]

Thanks!

ekianjo 8 hours ago | parent | prev [-]

yeah and often their quants are broken. They had to update their Gemma-4 quants like 4 times in the past 2 weeks.

danielhanchen 8 hours ago | parent [-]

No, it's not our fault. Re our 4 uploads: the first 3 were due to llama.cpp fixing bugs - out of our control (we're llama.cpp contributors, but not the main devs). We could have waited, but it's best to re-upload once multiple (10-20) bugs are fixed.

The 4th was Google themselves improving the Gemma chat template for tool calling.

https://github.com/ggml-org/llama.cpp/issues/21255 was another issue: CUDA 13.2 was broken. That was NVIDIA's CUDA compiler itself breaking - fully out of our hands - but we still provided a solution for it.