mattmanser 2 hours ago

That's a 4-bit quant, which the thread OP specifically called out as rubbish.

It's the Q4_K_XL part of the name, for those not in the know.

stymaar an hour ago | parent | next [-]

Anyone calling Qwen3.6-35B-A3B-Q4_K_XL “rubbish” has no idea what they are talking about.

embedding-shape an hour ago | parent | next [-]

I'd agree that quality degrades a lot between Q8 and Q4, to the point of being borderline unusable: Q4 models start to fail even at tool-calling syntax. Personally I'd say Q8 is as low as you want to go.

c0rruptbytes an hour ago | parent | prev | next [-]

q4 isn't rubbish, but it's a compromise that gets you good value. q6 is essentially a no-compromise quantization, and in my experience it's what i'd recommend for MoEs in agentic workflows
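
For anyone weighing the size side of that trade-off, here is a back-of-the-envelope sketch. It assumes roughly 35B total parameters (taken from the model name) and approximate bits-per-weight figures for the llama.cpp Q4_K, Q6_K and Q8_0 block formats. Mixed quants such as Q4_K_XL pick a different format per tensor, so their real file sizes will differ, and this ignores KV cache and runtime overhead entirely. The function names are made up for illustration.

```python
# Rough weight-only size estimate per quantization level.
# ASSUMPTION: bits-per-weight values are approximations for the plain
# llama.cpp block formats; mixed "XL" quants vary per tensor.
BITS_PER_WEIGHT = {"Q4_K": 4.5, "Q6_K": 6.5625, "Q8_0": 8.5}


def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Size of the weights alone in GB (10^9 bytes), no KV cache or overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def size_table(n_params: float) -> dict[str, float]:
    """Estimated weight size in GB for each quant level, rounded to 0.1 GB."""
    return {
        name: round(weight_size_gb(n_params, bpw), 1)
        for name, bpw in BITS_PER_WEIGHT.items()
    }


if __name__ == "__main__":
    # ASSUMPTION: 35e9 total parameters, read off the "35B" in the model name.
    for name, gb in size_table(35e9).items():
        print(f"{name}: ~{gb} GB")
```

Under those assumptions the step from q4 to q6 costs on the order of 9 GB of extra weights for a model this size, which is the "value" part of the compromise.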

greenavocado an hour ago | parent | prev [-]

He's probably calling me out for this comment https://news.ycombinator.com/item?id=48557579

greenavocado an hour ago | parent | prev [-]

I typically find myself using a context of between 150k and 500k tokens with GPT models, so local models are simply not enough and I stopped using them.

stymaar an hour ago | parent | next [-]

That's way higher than their optimal ceiling (and absolutely suboptimal from a token cost point of view). Why are you doing that?

greenavocado an hour ago | parent [-]

You're 100% right, and it's even more severe than that: I daily drive on xhigh. I really try to avoid it, but when reconciling APIs across two large codebases you really start pressing north of 200k. I find myself topping out at 800k sometimes, and that's with careful context management. I actually had to drop to GPT 5.4 for 1M context in my subscription because GPT 5.5 tops out at 272k. Hitting 800k context is better than repeatedly hitting, let's say, 200k out of 272k with multiple rounds of compaction. I run Can's snapcompact, and while it's better than normal compaction, it still lobotomizes the model more than running with a very high context window.

c0rruptbytes an hour ago | parent | prev [-]

large contexts degrade the performance - attention doesn't work well for large windows like that and cloud models are kind of hacking it

local models do involve some context engineering to get okay results, but it's not that rough