thot_experiment an hour ago

Benchmarks only give you the roughest idea of how models compare in real-world use; beyond maybe sorting models into a few buckets, they're essentially useless. The only way to gain an understanding of something as complex as how an LLM integrates with your workflow is by doing it and measuring across many trials. I've been running Opus 4.7 in Claude Code and Gemma 4 31b in parallel on projects for hours a day this past week. Opus 4.7 is definitely better, but for many things they're roughly equivalent: some edge cases are just up to chance, and either model may stumble onto the solution, while some areas of my work reliably trip up both models, and there I get better mileage out of writing the code the old-fashioned way. I understand that I'm just one data point, but I'm not writing CRUD apps here; I'm doing DSP and weird color math in shaders. I don't think any of it is hard, and the stuff I do think is hard none of the models are good at yet. But idk, they just don't seem that disparate from one another.
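To make "measuring across many trials" concrete, here's roughly the kind of tally I mean (a minimal sketch; the task names and recorded outcomes are made up for illustration):

    import collections

    # Hypothetical paired-trial log: for each task, record whose output
    # I actually kept ("opus", "gemma", "tie", or "neither").
    trials = [
        ("fir_filter",  "opus"), ("fir_filter",  "opus"),
        ("oklab_blend", "tie"),  ("oklab_blend", "gemma"),
        ("gamut_map",   "neither"), ("gamut_map", "neither"),
    ]

    counts = collections.Counter(outcome for _, outcome in trials)
    total = len(trials)
    for outcome, n in counts.most_common():
        print(f"{outcome}: {n}/{total} ({n/total:.0%})")

Crude, but after enough trials the per-task pattern tells you more about your own workflow than any leaderboard does.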

FWIW I think Gemma 4 31b is more likely to be of use to me than Sonnet. Maybe it's a skill issue, but while I love Opus 4.7 (undisputed king), Sonnet seems borderline useless; I basically think of it as on the same level as Qwen 35b MoE.

cbg0 an hour ago

"essentially useless" is a gross overstatement. Your personal benchmarks will always provide you with the most value, but disregarding standardized benchmarks because you care more about vibes is not exactly scientific.