Aurornis 3 hours ago

If you're new to this: All of the open source models are playing benchmark optimization games. Every new open weight model comes with promises of being as good as something SOTA from a few months ago, then it disappoints in actual use.

I've been playing with Qwen3-Coder-Next and the Qwen3.5 models since they were each released.

They are impressive, but they are not performing at Sonnet 4.5 level in my experience.

I have observed that they're configured to be very tenacious. If you can carefully constrain the goal with some tests they need to pass and frame it in a way to keep them on track, they will just keep trying things over and over. They'll "solve" a lot of these problems in the way that a broken clock is right twice a day, but there's a lot of fumbling to get there.
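A minimal sketch of that workflow, as I understand it: constrain the task with a test command and let the model retry until the tests pass, with a cap on attempts. This is a hypothetical harness for illustration, not any particular agent's API; `generate_patch` is a stand-in for whatever applies the model's proposed change.

```python
import subprocess

def run_until_green(generate_patch, test_cmd, max_attempts=5):
    """Keep asking the model for a new attempt until the test suite passes.

    `generate_patch` is a placeholder for applying a model-proposed change;
    it receives the previous failure output as feedback.
    """
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        generate_patch(feedback)  # apply the model's next attempt
        result = subprocess.run(test_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return attempt  # tests pass: accept this attempt
        feedback = result.stdout + result.stderr  # feed failures back
    return None  # gave up after max_attempts
```

The tests act as the hard constraint: the loop only terminates on a green suite or on giving up, which matches the "keep trying things over and over" behavior described above.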

That said, they are impressive for open source models. It's amazing what you can do with self-hosted now. Just don't believe the hype that these are Sonnet 4.5 level models because you're going to be very disappointed once you get into anything complex.

kir-gadjello 2 hours ago | parent | next [-]

Respectfully, from my experience and a few billion tokens consumed, some open source models really are strong and useful. Specifically StepFun-3.5-flash: https://github.com/stepfun-ai/Step-3.5-Flash

I'm working on a pretty complex Rust codebase right now, with hundreds of integration tests and nontrivial concurrency, and stepfun powers through.

I have no relation to stepfun; I'm saying this purely out of deep respect for the team that managed to pack this performance into a 196B/11B-active envelope.

aappleby 2 hours ago | parent [-]

What are you running that model on?

FuckButtons 33 minutes ago | parent | next [-]

A 3 bit quant will run on a 128gb MacBook Pro, it works pretty well.

nl 29 minutes ago | parent [-]

A 3 bit quant is quite a lot weaker than the OpenRouter version the OP is using.

kir-gadjello 2 hours ago | parent | prev [-]

I just use OpenRouter; it's free for now. But I would pay $30-100 to use it 24/7.

aappleby 2 hours ago | parent [-]

Ah, I thought you meant you were running it locally.

wolvoleo 2 hours ago | parent | prev | next [-]

All models are doing that, not only the open source ones.

I bet the cloud ones are doing it a lot more, because they can also tweak the runtime side, which the open source ones can't.

dimgl 29 minutes ago | parent | prev | next [-]

I'm using Qwen 3.5 27b on my 4090, and let me tell you: this is the first time I've been seriously blown away by coding performance on a local model. Local models are almost always unusable for this. Not this time, though...

chaboud 2 hours ago | parent | prev | next [-]

"When a measure becomes a target, it ceases to be a good measure."

Goodhart's law shows up with people, in system design, in processor design, in education...

Models are going to be overfit to the tests unless scruples or practical application realities intervene. It's a tale as old as machine learning.

rudhdb773b an hour ago | parent | prev | next [-]

Are there any up-to-date offline/private agentic coding benchmark leaderboards?

If the tests haven't been published anywhere and are sufficiently different from standard problems, I would think the benchmarks would be robust to intentional over-optimization.

Edit: These look decent and generally match my expectations:

https://www.apex-testing.org/

noosphr 2 hours ago | parent | prev | next [-]

It's not just the open source ones.

The only benchmarks worth anything are dynamic ones which can be scaled up.
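One way to read "dynamic" here: generate problems procedurally from a seed at evaluation time, so every run is fresh and the exact instances can't have leaked into a static training set. A toy sketch of the idea (my illustration, not an existing benchmark):

```python
import random

def make_task(seed):
    """Generate a fresh, self-checking arithmetic task from a seed.

    Because tasks are derived from the seed at eval time, the exact
    instances can't appear verbatim in a static training set, and the
    difficulty can be scaled up by widening the operand ranges.
    """
    rng = random.Random(seed)
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    prompt = f"What is {a} * {b}?"
    answer = str(a * b)
    return prompt, answer

def score(model_fn, n=100, seed0=0):
    """Fraction of freshly generated tasks the model answers correctly."""
    correct = 0
    for i in range(n):
        prompt, answer = make_task(seed0 + i)
        if model_fn(prompt).strip() == answer:
            correct += 1
    return correct / n
```

The same pattern generalizes to harder task families (program synthesis, constraint puzzles) as long as the generator can verify its own answers.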

amelius 3 hours ago | parent | prev | next [-]

Are you saying that the benchmarks are flawed?

And could quantization maybe partially explain the worse than expected results?

TrainedMonkey 3 hours ago | parent | next [-]

No, what he is saying is that benchmarks are static and there is tremendous reputational and financial pressure to make benchmark number go up. So you add specific problems to training data... The result is that the model is smarter, but the benchmarks overstate the progress. Sure there are problem sets designed to be secret, but keeping secrets is hard given the fraction of planetary resources we are dedicating to making the AI numbers go up.
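A crude check for the contamination described above is word-level n-gram overlap between a benchmark item and the training corpus. This is just an illustrative sketch; real decontamination pipelines are considerably more involved (normalization, hashing at scale, fuzzy matching).

```python
def ngrams(text, n=8):
    """Set of word-level n-grams, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(benchmark_item, training_docs, n=8):
    """Fraction of the item's n-grams that appear in any training doc.

    A high score suggests the benchmark problem (or something very
    close to it) was seen during training.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in training_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```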

I have two comments of my own to add. The first is that there is a problem-alignment issue at play: the benchmarks are mostly self-contained problems with well-defined solutions and specific prompt language, while human tasks are open ended, with messy prompts and a lot of steering. The second is that it would be interesting to test older models on brand-new benchmarks to see how they compare.

Aurornis 3 hours ago | parent | next [-]

> No, what he is saying is that benchmarks are static and there is tremendous reputational and financial pressure to make benchmark number go up.

That's a much better way to say it than I did.

These models are known for being open weights, but they're still products that Alibaba Cloud is trying to sell. They have product managers and PR and marketing people under pressure to get people using them.

This Venture Beat article is basically a PR piece for the models and Alibaba Cloud hosting. The pricing table is right in the article.

It's cool that they release the models for us to use, but don't think they're operating entirely altruistically. They're playing a business game just like everyone else.

amelius 2 hours ago | parent | prev [-]

There should be a way to turn the questions we ask LLMs into benchmarks.

That way, we can have a benchmark that is always up to date.
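A minimal version of that idea: log real prompts, paired with vetted reference answers, into a rolling eval set, then grade models only against items newer than their training cutoff. Hypothetical sketch; consent and privacy scrubbing are assumed to happen upstream.

```python
import json
import time

def log_interaction(path, prompt, reference_answer):
    """Append a real user prompt and a vetted reference answer
    as one JSONL eval record."""
    record = {"ts": time.time(), "prompt": prompt, "reference": reference_answer}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def load_recent(path, since_ts):
    """Eval items logged after `since_ts` -- a benchmark slice that is
    newer than any given model's training cutoff."""
    items = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if record["ts"] > since_ts:
                items.append(record)
    return items
```

Filtering by timestamp is what keeps the benchmark "always up to date": old items age out of scoring as models train on newer data.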

Aurornis 3 hours ago | parent | prev | next [-]

The models perform better on the benchmarks than they do on general tasks.

The benchmarks are public. They're guaranteed to be in the training sets by now. So the benchmarks are no longer an indicator of general performance because the specific tasks have been seen before.

> And could quantization maybe explain the worse than expected results?

You can use the models through various providers on OpenRouter cheaply without quantization.

girvo 3 hours ago | parent | prev [-]

Flawed? Possibly, but I think it's more that any kind of benchmark then becomes a target, and is inherently going to be a "lossy" signal of the model's actual ability in practice.

Quantisation doesn't help, but even running full-fat versions of these models through various cloud providers, they still don't match Sonnet in actual agentic coding use, at least in my experience.

eurekin 2 hours ago | parent | prev | next [-]

Very good point. I've been playing with them too and came to the same conclusion.

crystal_revenge 2 hours ago | parent | prev | next [-]

> they always disappoint in actual use.

I’ve switched to using Kimi 2.5 for all of my personal usage and am far from disappointed.

Aside from being much cheaper than the big names (yes, I’m not running it locally, but like that I could) it just works and isn’t a sycophant. Nice to get coding problems solved without any “That’s a fantastic idea!”/“great point” comments.

At least with Kimi my understanding is that beating benchmarks was a secondary goal to good developer experience.

jackblemming 3 hours ago | parent | prev | next [-]

Death by KPIs. Management makes it too risky to do anything but benchmaxx. It will be the death of American AI companies too. Eventually, people will notice models aren’t actually getting better and the money will stop flowing. However, this might be a golden age of research as cheap GPUs flood the market and universities have their own clusters.
