make3 8 hours ago

Absolutely not on par; you're smoking.

dkhenry 7 hours ago | parent | next [-]

You make a compelling argument, but thankfully I have data to back up my anecdotal experience.

This comparison shows them neck and neck: https://benchlm.ai/compare/claude-sonnet-4-5-vs-gemma-4-31b

As does this one: https://llm-stats.com/models/compare/claude-sonnet-4-6-vs-ge...

And the pelican benchmark even shows them pretty close: https://simonwillison.net/2026/Apr/2/gemma-4/ https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/

Also, this isn't a fringe statement; you can see that most people who have done an evaluation agree with me.

jmward01 6 hours ago | parent | next [-]

I think one area I find hard to get around is context length. Everything self-hosted is so limited on context that it's marginal to use. Additionally, I think that the tools (like Claude Code) are clearly in the training mix for Anthropic's models, so they seem to get a boost over other models pushed into that environment. That being said, open-source and local inference is -really- good and only going to get better. There is no doubt that the current frontier business model is not sustainable.

make3 an hour ago | parent | prev [-]

If you look at the detailed numbers in the benchmarks you shared, Sonnet 4.5 crushes Gemma 4. Somehow the first link doesn't run Sonnet on the multimodal benchmark; that's why the top score looks close. It beats Gemma on every benchmark they actually ran. The arena in the second link shows that it actually destroys Gemma 4 as well; it's not close.

lostmsu 8 hours ago | parent | prev [-]

Just to be clear, did you notice the parent said 4.5?

cmorgan31 7 hours ago | parent | next [-]

They are also on par on a lot of classification tasks. I did have to actually use Gemma 4 and fine-tune it a bit, but that is part of the value add.

make3 an hour ago | parent | prev [-]

I did, what's your point?