simianwords 3 hours ago

Hi! The challenge was ChatGPT but even then it looks like you used the weakest version of Gemini.

the_snooze an hour ago

>I stress test commercially deployed LLMs like Gemini and Claude with trivial tasks

I did exactly what I said I did. I'm using these systems the way they're designed and advertised. I'm following the happy path with tasks that are small, trivial, and easy to check. This is the charitable approach. Yet the system creaks under the lightest load. If Google wants to put on a better show with stronger models, then they should make those the default.

You don't need to make excuses for shoddy engineering from multi-billion-dollar corporations. And you're quite welcome to run the same prompt on ChatGPT and evaluate it on your own time.