falcor84 10 hours ago
That looks impressive, but some of the numbers are a bit out of date. On Terminal-Bench 2, for example, the leader is currently "Codex CLI (GPT-5.1-Codex)" at 57.8%, beating this new release.
NitpickLawyer 9 hours ago
What's more impressive is that I find Gemini 2.5 still relevant in day-to-day usage, despite being so low on those benchmarks compared to Claude 4.5 and GPT 5.1. There's something that Gemini has that makes it a great model in real cases; I'd call it generalisation on its context or something. If you give it the proper context (or it digs through the files in its own agent) it comes up with great solutions, even if their own coding thing is hit and miss sometimes. I can't wait to try 3.0; hopefully it continues this trend. Raw numbers in a table don't mean much. You can only get a true feeling once you use it on existing code, in existing projects. Anyway, the top labs keeping each other honest is great for us, the consumers.
sigmar 9 hours ago
That's a different model that isn't in the chart. They're not going to include hundreds of fine-tunes in a chart like this.