Aurornis 5 hours ago

> and it can fall back to similar examples in the training data easily.

This is an underrated consideration when evaluating the small models: The further you deviate from standard example code, the more their weaknesses show.

My experience is that Qwen3.6 produced some amazing results for a small model when I tried it with simple apps that are widely reproduced everywhere. If you want a React TODO app or to set up a little boilerplate app with shadcn and other popular tools, it will produce something that looks not too bad.

Then when I started straying outside of common tasks and into some of my more niche work, it would spin for hours and go in circles before finally producing some groan-inducing output that wasn't usable.

If you're looking for a model to help with simple refactoring or small tasks where you provide very explicit instructions for exactly what you want, but you don't want to do all of the typing yourself, they can do a lot of good work, though. But you're right that once you get into long context sessions involving topics off the beaten path, the weaknesses are very apparent.

The quantizations that are popular for making these models fit on smaller hardware make the problems worse. When you read about it online, there is almost a consensus that 4-bit quants are lossless and that you can use q8_0/q8_0 kv cache quantization without any real loss, but in my experience with real projects there's a substantial degradation in long context performance with any of these quants.
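
For a sense of what kv cache quantization actually saves, here is a rough back-of-the-envelope sketch. It assumes a plain transformer with grouped-query attention, and the model dimensions are hypothetical placeholders, not the real numbers for any Qwen or Gemma model. The q8_0 figure comes from its block layout: 32 int8 values plus one fp16 scale, so 34 bytes per 32 values.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_value):
    """Approximate kv cache size for a standard attention stack.

    K and V each store n_kv_heads * head_dim values per token per layer,
    hence the leading factor of 2.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


F16_BYTES = 2.0        # 16 bits per value
Q8_0_BYTES = 34 / 32   # 32 int8 values + one fp16 scale per block

if __name__ == "__main__":
    # Hypothetical dimensions, chosen only for illustration.
    dims = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=131072)
    for name, width in (("f16", F16_BYTES), ("q8_0", Q8_0_BYTES)):
        gib = kv_cache_bytes(**dims, bytes_per_value=width) / 2**30
        print(f"{name}: {gib:.2f} GiB")
```

Under these assumptions q8_0 cuts the cache to a bit over half the f16 size, which is why it is tempting on small hardware. The sketch says nothing about the quality cost, which is the part that has to be tested.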

CMay 4 hours ago

This is my experience too. Qwen optimizes for a lot of scenarios, which masks its weaker generalization compared to US frontier models.

Never go below an fp16 kv cache unless you've already tested it in advance with your model on a verified task that you know it can successfully complete. People should also test the difference using the exact same seed value so they can see how the tokens diverge. If you have memory constraints, sometimes you can still use an fp16 kv cache and use storage for an agentic buffer to work your task with mixed abstractions rather than having everything in memory.
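
A minimal sketch of the same-seed comparison, assuming you have already captured the generated token IDs from two runs (one with an fp16 kv cache, one quantized) using the same prompt, seed, and sampler settings. How you capture them depends on your inference stack; `first_divergence` is a hypothetical helper name, not part of any library.

```python
from itertools import zip_longest


def first_divergence(baseline, candidate):
    """Return the index of the first position where two token sequences
    differ, or None if they are identical.

    A length mismatch counts as a divergence at the end of the shorter
    sequence, because zip_longest pads it with None.
    """
    for i, (a, b) in enumerate(zip_longest(baseline, candidate)):
        if a != b:
            return i
    return None


if __name__ == "__main__":
    # Made-up token IDs standing in for two real runs.
    fp16_run = [101, 7592, 2088, 2003, 1037, 3231]
    q8_run = [101, 7592, 2088, 2001, 1037, 3231]
    print(first_divergence(fp16_run, q8_run))
```

An early divergence on a task the fp16 run completes correctly is a reason to look closer, though a divergence alone does not prove the quantized output is worse. Checking the final result against the known-good task is still the real test.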

For 4-bit weight quants, Gemma 4 31B QAT is where people should be looking instead of Qwen 3.6.