guilamu 5 hours ago
Yes, the prompt is slim by design. I might be wrong, but the point was to see what the model can do "on its own". The eval prompt is quite extensive: https://github.com/guilamu/llms-wordpress-plugin-benchmark/b...
DrProtic 4 hours ago
That’s the thing: not everyone judges a model on that basis. But I guess it works for you, and the benchmark achieves that. I personally develop with a very detailed spec, and I want nothing more and nothing less than what the spec says. I found 5.4/5.5 much better at following a spec, while Opus makes some things up. That aligns with your benchmark, but it makes 5.4/5.5 better for me and worse for you.