9dev · 20 hours ago
I always wonder how people make qualitative statements like this. There are so many variables! Is it my prompt? The task? The specific model version? A good or bad branch out of the non-deterministic solution space? Like, do you run a proper experiment where you hand the same task to multiple models several times and compare the results? Not snark by the way, I’m asking in earnest how you pick one model over another. | |||||||||||||||||||||||||||||||||||
embedding-shape · 20 hours ago
> Like, do you run a proper experiment where you hand the same task to multiple models several times and compare the results?

This is what I do. I have a little TUI that fires off Claude Code, Codex, Gemini, Qwen Coder and AMP in separate containers for most tasks I do (although I've started to use AMP less and less), and it returns the last message of what each one replied and/or a git diff of what exactly it did. Then I compare them side by side. If all of them got something wrong, I update the prompt and fire them off again. I always start from zero and always include the full context of what I'm doing in the first message; they're all non-interactive sessions.

Sometimes I do 3x Codex instead of different agents, just to double-check that all of them would do the same thing. If they go off and do different things from each other, I know the initial prompt isn't specific/strict enough, and again I iterate.
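For illustration, a minimal sketch of that fan-out-and-compare loop might look like the following. None of this is the commenter's actual TUI: the agent CLI invocations are placeholder assumptions you'd swap for whatever your installed tools accept, and a plain scratch directory stands in for the per-agent containers.

    #!/usr/bin/env python3
    """Sketch of the fan-out-and-compare idea: same prompt to several
    coding agents, each in its own copy of the repo, then collect the
    final reply and the git diff from each one."""
    import shutil, subprocess, tempfile
    from pathlib import Path

    REPO = Path(".").resolve()  # run from inside the repo you want edited

    # Placeholder non-interactive invocations; "{prompt}" is filled in below.
    # Replace these with the actual commands/flags your agent CLIs use.
    AGENTS = {
        "claude-code": ["claude", "-p", "{prompt}"],
        "codex":       ["codex", "exec", "{prompt}"],
        "gemini":      ["gemini", "-p", "{prompt}"],
    }

    def run_agent(name: str, argv: list[str], prompt: str) -> tuple[str, str]:
        """Copy the repo to a scratch dir, run one agent there non-interactively,
        and return (last message printed by the agent, git diff of its edits)."""
        workdir = Path(tempfile.mkdtemp(prefix=f"{name}-"))
        shutil.copytree(REPO, workdir, dirs_exist_ok=True)  # includes .git
        cmd = [a.replace("{prompt}", prompt) for a in argv]
        proc = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
        diff = subprocess.run(["git", "diff"], cwd=workdir,
                              capture_output=True, text=True).stdout
        return proc.stdout.strip(), diff

    if __name__ == "__main__":
        prompt = "Fix the flaky test in tests/test_sync.py; explain the root cause."
        for name, argv in AGENTS.items():
            reply, diff = run_agent(name, argv, prompt)
            print(f"=== {name} ===\n{reply}\n--- diff ---\n{diff}\n")

Running it prints each agent's final message and its diff back to back, which is enough to eyeball whether they all converged on the same change or the prompt needs tightening.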
energy123 · 17 hours ago
I have sent the same prompt to GPT-5.2 Thinking and Gemini 3.0 Pro many times, because I subscribe to both. GPT-5.2 Thinking (with extended thinking selected) is significantly better in my testing on software problems with ~40k tokens of context. I attribute this to thinking time: with GPT-5.2 Thinking I can coax out 5+ minutes of thinking, while Gemini 3.0 Pro only gives me about 30 seconds.

The main problem with the Plus sub in ChatGPT is that you can't send more than 46k tokens in a single prompt, and attaching files doesn't help either, because the VM blocks the model from accessing the attachments if there are already ~46k tokens in the context.
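If you want to know whether a prompt is under that ~46k cap before pasting it in, a rough token count is enough. This is just an illustration, not the commenter's workflow; it assumes the o200k_base tokenizer (used by recent OpenAI models) is a close enough proxy, since the exact tokenizer for these model versions isn't published.

    # Rough pre-flight check against the ~46k-token prompt cap described above.
    import tiktoken

    PLUS_PROMPT_CAP = 46_000  # figure from the comment above, not an official limit

    def fits_in_plus_prompt(text: str) -> bool:
        enc = tiktoken.get_encoding("o200k_base")  # proxy tokenizer, see note above
        n = len(enc.encode(text))
        print(f"{n} tokens (cap ~{PLUS_PROMPT_CAP})")
        return n <= PLUS_PROMPT_CAP

    fits_in_plus_prompt(open("my_prompt.txt").read())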
enraged_camel · 20 hours ago
Last night I gave one of the flaky tests in our test suite to three different models, using the exact same prompt. Gemini 3 and Gemini 3 Flash identified the root cause and nailed the fix. GPT 5.1 Codex misdiagnosed the issue and attempted a weird fix, despite my prompt saying “don’t write code, simply investigate.”

I run these tests regularly, and Codex has not impressed me. Not even once. At best it’s on par, but most of the time it just fails miserably.

Languages: JavaScript, Elixir, Python