DrProtic 3 hours ago

Seems like a benchmark for how good a model is at vibe coding.

Your prompt is extremely slim yet you score it on a bunch of features.

guilamu 3 hours ago | parent [-]

Yes, the prompt is slim by design. I might be wrong, but the point was to see what the model can do "on its own".

The eval prompt is quite extensive: https://github.com/guilamu/llms-wordpress-plugin-benchmark/b...

DrProtic 2 hours ago | parent [-]

That’s the thing: not everyone values a model based on that. But I guess it works for you, and your benchmark achieves it.

I personally develop with a very detailed spec, and I want nothing more and nothing less than what the spec says.

I found 5.4/5.5 much better at following a spec, while Opus makes some things up. That aligns with your benchmark, but it makes 5.4/5.5 better for me and worse for you.

guilamu an hour ago | parent [-]

Yeah, as I said, this is a benchmark for my use case only, a single use case, which is obviously not representative of everybody's needs.

What strikes me as very strange, though, is that no model was able to just use the search input already present on the Gravity Forms forms list page; they all created a second input.

Also, I know it's not in the prompt, but adding a Ctrl+F shortcut to a search input? Is that so crazy? I don't know.