embedding-shape 20 hours ago
> Like, do you run a proper experiment where you hand the same task to multiple models several times and compare the results?

This is what I do. I have a little TUI that fires off Claude Code, Codex, Gemini, Qwen Coder and AMP in separate containers for most tasks I do (although I've started to use AMP less and less), and it returns the last message of each reply and/or a git diff of exactly what they did. Then I compare them side by side. If all of them got something wrong, I update the prompt and fire them off again. I always start from zero and always include the full context of the task in the first message; they're all non-interactive sessions.

Sometimes I run 3x Codex instead of different agents, just to double-check that all of them would do the same thing. If they go off and do different things from each other, I know the initial prompt isn't specific/strict enough, and again I iterate.
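A minimal sketch of what a harness like this could look like, assuming each agent exposes some non-interactive CLI, the repo is copied into a scratch directory, and each run happens in a throwaway container. The image names, agent invocations, repo path, and `run_agent` helper are all placeholders, not the commenter's actual tool:

```python
#!/usr/bin/env python3
"""Sketch of a multi-agent comparison harness.

Assumptions (not from the original comment): each agent has a CLI that
accepts a prompt non-interactively, the repo is mounted into a throwaway
Docker container, and the result is judged by `git diff`. The image names
and agent commands below are placeholders; substitute the real ones.
"""
import shutil
import subprocess
import tempfile
from pathlib import Path

REPO = Path("~/src/myproject").expanduser()  # hypothetical repo path

# Placeholder invocations -- swap in each agent's actual non-interactive mode.
AGENTS = {
    "claude-code": ["claude", "--prompt"],   # assumed
    "codex":       ["codex", "--prompt"],    # assumed
    "gemini":      ["gemini", "--prompt"],   # assumed
    "qwen-coder":  ["qwen", "--prompt"],     # assumed
}

def run_agent(name: str, cmd: list[str], prompt: str) -> str:
    """Copy the repo to a scratch dir, run one agent in a container, return the diff."""
    workdir = Path(tempfile.mkdtemp(prefix=f"{name}-"))
    repo_copy = workdir / "repo"
    shutil.copytree(REPO, repo_copy)  # fresh copy: every run starts from zero
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{repo_copy}:/work", "-w", "/work",
         f"{name}-image",            # placeholder image with the agent installed
         *cmd, prompt],
        check=False,
    )
    # The container edited the mounted copy; the diff shows exactly what it did.
    return subprocess.run(
        ["git", "-C", str(repo_copy), "diff"],
        capture_output=True, text=True,
    ).stdout

if __name__ == "__main__":
    task = "Fix the off-by-one in pagination and add a regression test."  # example prompt
    for agent_name, agent_cmd in AGENTS.items():
        print(f"===== {agent_name} =====")
        print(run_agent(agent_name, agent_cmd, task) or "(no changes)")
```

Running the same prompt through every entry in `AGENTS` (or the same agent several times) and eyeballing the diffs side by side is the comparison step; if the diffs diverge, the prompt likely isn't specific enough yet.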
dotancohen 20 hours ago
Please share! I'd much rather help develop your solution than vibe code one of my own )) Honestly, I'd love to try that. My Gmail username is the same as my HN username.
handfuloflight 18 hours ago
What's this costing you?
versteegen 16 hours ago
So how do the models compare in your experience?