▲

dmd 6 days ago

Gemini 2.5 Pro gets 64% on SWE-bench Verified. Sonnet 3.7 gets 70%.

They are reporting that GPT-4.1 gets 55%.

▲

egeozcan 6 days ago | parent | next [-]

Very interesting. For my use cases, Gemini's responses beat Sonnet 3.7's like 80% of the time (gut feeling, didn't collect actual data). It beats Sonnet 100% of the time when the context gets above 120k.

▲

int_19h 6 days ago | parent [-]

As usual with LLMs. In my experience, all those metrics are useful mainly to tell which models are definitely bad, but they don't tell you much about which ones are good, and especially not how the good ones stack up against each other in real-world use cases.

Andrej Karpathy famously quipped that he only trusts two LLM evals: Chatbot Arena (which has humans blindly compare and score responses), and the r/LocalLLaMA comment section.

▲

ezyang 6 days ago | parent [-]

Lmarena isn't that useful anymore lol

▲

int_19h 6 days ago | parent [-]

I actually agree with that, but it's generally better than other scores. Also, the quote is like a year old at this point. In practice you have to evaluate the models yourself for any non-trivial task.

▲

hmottestad 6 days ago | parent | prev [-]

Are those with «thinking» or without?

▲

sanxiyn 6 days ago | parent | next [-]

Sonnet 3.7's 70% is without thinking, see https://www.anthropic.com/news/claude-3-7-sonnet

▲

aledalgrande 6 days ago | parent | prev | next [-]

The thinking tokens (even just 1024) make a massive difference in real world tasks with 3.7 in my experience

▲

chaos_emergent 6 days ago | parent | prev | next [-]

Based on their release cadence, I suspect that o4-mini will compete on price, performance, and context length with the rest of these models.

▲

hecticjeff 6 days ago | parent [-]

o4-mini, not to be confused with 4o-mini

▲

energy123 6 days ago | parent | prev [-]

With