NitpickLawyer | 6 hours ago
New, unbenched problems are really the only way to differentiate the models, and every time I see one, the result is along the same lines: models from the top labs are neck and neck, and the rest of the bunch are nowhere near. That should kinda calm down the "opus killer" marketing we've seen these past few months every time a new model releases, especially the small ones from China. It's funny that even one of the strongest research labs in China (DeepSeek) has said there's still a gap to Opus, after releasing a humongous 1.6T model, yet the internet goes crazy and we now have people claiming [1] that a 27B dense model is "as good as Opus".

I'm a huge fan of local models and have been using them regularly ever since Devstral 1 released, but you really have to adapt to their limitations if you want to do anything productive. Same with the other "cheap" "opus killers" from China: some work, some look like they work but go haywire at first contact with a real, non-benchmarked task.
adrian_b | 6 hours ago
Benchmarks for LLMs without complete information about the tested models are hard to interpret. For the OpenAI and Anthropic models it is clear that they have been run by their owners, but for the other models there are a great number of options for running them, which may serve the full models or only quantized variants, with very different performance. For instance, the model list contains both "moonshotai/kimi-k2.6" and "kimi-k2.6", with very different results, but there is no information about what the difference between these two labels is, even though they refer to the same LLM.

Moreover, as others have said, such a benchmark does not prove that a certain cheaper model cannot solve a problem. It happened not to solve it within the benchmark, but running it multiple times, possibly with adjusted prompts, may still solve the problem. While for commercial models running them many times can be too expensive, when you run an LLM locally you can afford many more attempts than when you are worried about token prices or subscription limits.
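To make the "run it again" point concrete, here is a minimal sketch of a retry loop against a locally hosted model, assuming an OpenAI-compatible server (e.g. llama.cpp's server or Ollama) on localhost; the endpoint, model name, and check() function are placeholders, not anything from the benchmark in question:

    # Minimal pass@k-style sketch: retry a task against a local model
    # until a candidate passes a caller-supplied check (e.g. unit tests).
    # Assumes an OpenAI-compatible local server; endpoint and model name
    # below are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

    def attempt(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="local-model",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
            temperature=0.8,  # some sampling variance so retries differ
        )
        return resp.choices[0].message.content

    def solve(task: str, check, max_tries: int = 8) -> str | None:
        # Tokens are effectively free locally, so multiple attempts
        # (possibly with an adjusted prompt per attempt) cost only time.
        for _ in range(max_tries):
            candidate = attempt(task)
            if check(candidate):  # e.g. run the produced code against tests
                return candidate
        return None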
lgienapp | an hour ago
I feel like many of the "as good as opus" crowd would achieve the same with Sonnet, tbh. Tasks that actually reach the ceiling of what Opus can do are maybe 10% of the total; the rest is compute wasted on a model that is too strong for whatever they default to it for. Hence they see little drop in output quality when trying out smaller open models.
anuramat | 3 hours ago
"almost as good as opus at writing python/js/... when given a spec" might be enough for a lot of people, especially if its 10x cheaper | ||||||||||||||||||||||||||||||||
cmrdporcupine | 6 hours ago
The question isn't whether it's "as good as Opus" but whether there exists something that costs a tenth as much to run yet can still competently write code.

Honestly, I was "happy" with December 2025-era AI or even earlier. Yes, what's come after has been smarter, faster, and cleverer, but the biggest boost in productivity was just the release of Opus 4.5 and GPT 5.2/5.3. And yes, it might be a competitive disadvantage for an engineer not to have access to the SOTA models from Anthropic/OpenAI, but at the same time I feel like the missing piece at this point is improvements in the tooling/harness/review tools, not better-yet models. They already write more than we can keep up with.