0xbadcafebee 3 hours ago
These posts are going to be a constant for the next year, because there's no objective way to compare models (past low-level numbers like token generation speed, average reasoning-token count, number of parameters, active experts, etc.). They're all quite different in a lot of ways, they're used for many different things by different people, and they're not deterministic. So you're constantly gonna see benchmarks and tests and proclamations of "THIS model beat THAT model!", with people racing around trying to find the best one. But there is no best one. There's just the best one for you, based on whatever your criteria are. It's likely we'll end up in a "Windows vs. macOS vs. Linux" style world, where people stick to their camps that do a particular thing a particular way.
pjerem 2 hours ago
The news is not the way models are compared, it's that Kimi K2.6 (and I'd add Deepseek v4 Pro) are more or less equivalent to Opus, and that's already pretty big. They are open source and cost waaaay less per token than American models.

I'm using them right now on the $20 Ollama cloud plan and I can actually work with them on my side projects without hitting the limits too often. With the Claude Pro $20 plan my usage can barely survive one or two prompts. And I chose Ollama cloud just because their CLI is convenient to use, but there are a lot of other providers for those models, so you aren't even stuck with shitty conditions and usage rules.

To me that's a pretty bad thing for the American economy.
yorwba 2 hours ago
There are objective ways to compare models. They involve repeated sampling and statistical analysis to determine whether the results are likely to hold up in the future or whether they're just a fluke. If you fine-tune each model to achieve its full potential on the task you expect to be giving it, the rankings produced by different benchmarks even agree to a high degree: https://arxiv.org/abs/2507.05195

The author didn't do any of that. They ran each model once on each of 13 (so far) problems and then chose to highlight the results for the 12th problem. That's not even p-hacking, because they didn't stop to think about p-values in the first place.

LLM quality is highly variable across runs, so running each model once tells you about as much about which one is better as flipping two coins once and having one come up heads and the other tails tells you about whether one of them is more biased than the other.
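The coin-flip point is easy to make concrete. Below is a stdlib-only sketch: the 0.60/0.55 "true pass rates" are invented for illustration, and the exact test is a plain Fisher's exact test on a 2x2 pass/fail table, not anything from the linked paper.

```python
import random
from math import comb

# Two hypothetical models with true pass rates of 0.60 and 0.55 on some task.
# How often does a single run per model rank the WORSE model strictly ahead?
random.seed(0)
p_a, p_b = 0.60, 0.55
trials = 100_000
worse_wins = sum(
    1 for _ in range(trials)
    if (random.random() < p_b) and not (random.random() < p_a)
)
# Expected fraction is P(B passes) * P(A fails) = 0.55 * 0.40 = 0.22,
# i.e. a single run crowns the worse model roughly a fifth of the time.
print(f"worse model wins outright: {worse_wins / trials:.1%}")

def fisher_exact_p(a_pass: int, a_fail: int, b_pass: int, b_fail: int) -> float:
    """Two-sided Fisher's exact test on a 2x2 pass/fail table.

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more likely than the observed one.
    """
    n = a_pass + a_fail + b_pass + b_fail
    row = a_pass + a_fail        # total runs of model A
    col = a_pass + b_pass        # total passes across both models
    def prob(k: int) -> float:
        return comb(col, k) * comb(n - col, row - k) / comb(n, row)
    p_obs = prob(a_pass)
    return sum(
        prob(k)
        for k in range(max(0, row + col - n), min(row, col) + 1)
        if prob(k) <= p_obs + 1e-12
    )

# One run each, A passes, B fails: p = 1.0, i.e. no evidence whatsoever.
print(fisher_exact_p(1, 0, 0, 1))
# 90/100 vs 70/100 passes is a different story.
print(fisher_exact_p(90, 10, 70, 30))
```

With one run each the test can never distinguish the models; you need tens of runs before a real gap in pass rates shows up as a small p-value.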
taegee 8 minutes ago
While I partially agree with you, there IS work being done to make the metrics comparable, e.g.: https://ghzhang233.github.io/blog/2026/03/05/train-before-te... It just hasn't been widely adopted yet, and it might be in each vendor's interest that it stays that way for a while. It's basically like p-hacking.
charcircuit 17 minutes ago
A pretty simple one would be to have every model try to one-shot every ticket your company has, and then measure the acceptance rate for each model.
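Even then the acceptance rate is just a proportion estimated from a finite number of tickets, so it needs an error bar before two models can be called different. A small sketch with a Wilson score interval (the 14/40 and 11/40 counts are made up for illustration):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)

# Hypothetical backlog: model A got 14 of 40 tickets accepted, model B 11 of 40.
a_lo, a_hi = wilson_interval(14, 40)
b_lo, b_hi = wilson_interval(11, 40)
print(f"model A: [{a_lo:.2f}, {a_hi:.2f}]")
print(f"model B: [{b_lo:.2f}, {b_hi:.2f}]")
# The intervals overlap heavily, so 40 tickets can't separate these two.
```

At 40 tickets each interval spans roughly 25 percentage points, which is wider than the gap between most frontier models on coding benchmarks.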
verve_rat 2 hours ago
My theory is we will end up in a similar spot to hiring people. You can look at a CV (benchmarks), but you won't know for sure until you've worked with them for six months. We as an industry cannot determine whether one software engineer is objectively better than another on practically any dimension, so why do we think we can come to an objective ranking of models?
chrisandchris 2 hours ago
That was my thought too.

> The Word Gem Puzzle is a sliding-tile letter puzzle. The board is a rectangular grid (10×10, 15×15, 20×20, 25×25, or 30×30) filled with letter tiles and one blank space.

Just last week my superior asked me to implement that for a customer. /s

Maybe some real, real-world task would be good? Add some database, some REST, some random JS framework and let it figure out a full-stack task instead of creating some rectangles?
ljlolel 2 hours ago
[flagged]