| ▲ | guilamu 3 hours ago |
| Just tested it on my homemade Wordpress+GravityForms benchmark and it's one of the worst models on the leaderboard performance-wise, and the worst value-wise: https://github.com/guilamu/llms-wordpress-plugin-benchmark I know it's only a single benchmark, but I don't understand how it can be so bad... |
|
| ▲ | goldenarm 3 hours ago | parent | next [-] |
| gemma4-e4b is 50% better than gemma4-26b in your benchmark, something's wrong |
| |
| ▲ | guilamu 3 hours ago | parent [-] | | Yes, those two models were tested on my own PC (local inference using my own CPU/GPU). So something may be bugged in my setup. gemma4-26b should be far better than gemma4-e4b. | | |
| ▲ | embedding-shape 2 hours ago | parent [-] | | Sounds like maybe a worse quantization was used on the bigger model? Quantization matters a lot for quality; basically anything below Q8 is borderline unusable. If the quantization isn't already specified in the benchmark, it probably should be. |
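A quick way to see why the two local models might have ended up on different quantization levels is to compare their approximate file/VRAM footprints. The sketch below uses rough average bits-per-weight figures for common llama.cpp GGUF quant formats (these averages, and the function names, are illustrative assumptions, not exact values):

```python
# Rough size sketch for common GGUF quantization levels.
# Bits-per-weight figures are approximate averages; real file
# sizes vary with tensor layout and metadata overhead.

APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def approx_size_gb(params_billions: float, quant: str) -> float:
    """Estimate model size in GB: (params * bits_per_weight) / 8 bytes."""
    bpw = APPROX_BITS_PER_WEIGHT[quant]
    # 1e9 params * (bpw / 8) bytes per param ~= that many GB
    return params_billions * bpw / 8

# A ~26B model at Q8 needs roughly 4x the memory of a small model,
# so it often gets downloaded at a much lower quant by default.
for quant in APPROX_BITS_PER_WEIGHT:
    print(f"26B at {quant}: ~{approx_size_gb(26, quant):.1f} GB")
```

If the 26B model only fits on the GPU at Q4 while the e4b model runs at Q8 or F16, the quality gap between the two can easily invert, which would explain the benchmark anomaly above.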
|
|
|
| ▲ | ac29 3 hours ago | parent | prev | next [-] |
| Your benchmark has Opus 4.7 performing significantly worse than Sonnet 4.6. Even if true on your benchmark, that is not representative of the overall performance of the models. |
| |
| ▲ | guilamu 3 hours ago | parent [-] | | Yes, Opus 4.7 fast (no reasoning) did a worse job than Sonnet 4.6 high (with reasoning) according to Gemini 3.1 Pro's evaluation. | | |
| ▲ | ac29 3 hours ago | parent [-] | | Your table doesn't indicate reasoning vs. non-reasoning, or the reasoning level | | |
| ▲ | guilamu 3 hours ago | parent [-] | | When nothing is noted, it's max reasoning (xhigh in Copilot Chat in VS Code, if available). The models not available on Copilot were tested through opencode (max reasoning), and deepseek v4 was tested through Cline (with max reasoning too). |
|
|
|
|
| ▲ | mosselman 3 hours ago | parent | prev | next [-] |
| You even traveled in time to deliver us this benchmark. I really like this benchmarking. Have you evaluated the judge benchmark somehow? I'd love to set up my own similar benchmark. |
| |
| ▲ | guilamu 3 hours ago | parent [-] | | Haha, just fixed the date! I haven't evaluated the judge benchmark. You have everything needed in the repo to do so, though, so be my guest. It took me a bit of time to put all this together, and I won't have much more time to dedicate to it for a couple of weeks. BTW, if you explore the repo, sorry for all the French files... |
|
|
| ▲ | DrProtic 3 hours ago | parent | prev [-] |
| Seems like a benchmark for how good a model is at vibe coding. Your prompt is extremely slim, yet you score it on a bunch of features. |
| |
| ▲ | guilamu 3 hours ago | parent [-] | | Yes, the prompt is slim by design. I might be wrong, but the point was to see what the model can do "on its own". The eval prompt is quite extensive: https://github.com/guilamu/llms-wordpress-plugin-benchmark/b... | | |
| ▲ | DrProtic 2 hours ago | parent [-] | | That’s the thing: not everyone wants that, or values a model based on it. But I guess it works for you, and the benchmark achieves it. I personally develop with a very detailed spec, and I want nothing more and nothing less than the spec. I found 5.4/5.5 much better at following a spec, while Opus makes some things up, which aligns with your benchmark, but that makes 5.4/5.5 better for me while worse for you. | | |
| ▲ | guilamu an hour ago | parent [-] | | Yeah, as I said, this benchmark is for my use case only, a single use case, which is obviously not representative of everybody's needs. What strikes me as very strange, though, is that zero models were able to just use the search input already present on the GravityForms forms list page; all of them created a second input. Also, I know it's not in the prompt, but adding a Ctrl+F shortcut to a search input? Is that so crazy? I don't know. |
|
|
|