gamegoblin 4 hours ago

I routinely leave Codex running for a few hours overnight to debug stuff

If you have a deterministic unit test that reproduces the bug through your app's front door, but you have no idea how the bug is actually happening, it's an ideal use case: the agent just grinds through the slog of sticking debug prints everywhere, testing hypotheses, etc.
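A toy version of what I mean (the function and the bug are invented for illustration, but the shape is real: deterministic, and it exercises the public entry point only):

    # Hypothetical example: a classic shared-mutable-default bug.
    def add_tag(tag, tags=[]):  # BUG: the default list is shared across calls
        tags.append(tag)
        return tags

    def test_add_tag_starts_fresh_each_call():
        assert add_tag("a") == ["a"]  # passes
        assert add_tag("b") == ["b"]  # fails every run: returns ["a", "b"]

Once it fails the same way on every run, the agent can grind on it unattended.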

nikkwong 3 hours ago | parent | next [-]

I have a hard time understanding how that would work. For me, I typically interface with coding agents through Cursor. The flow is like this: ask it something -> it works for a minute or two -> I verify and fix by asking it again, etc., until we're at a happy place with the code. How do you keep it from going down a bad path and never pulling itself out of it?

The important role for me, as a SWE, in the process is to verify that the code does what we actually want it to do. If you remove yourself from the process by letting it run on its own overnight, how does it know it's doing what you actually want it to do?

Or is it more like your use case: you can say "here's a failing test—do whatever you can to fix it and don't stop until you do"? I could see that limited case working.

gamegoblin 9 minutes ago | parent | next [-]

I use Codex CLI or Claude Code

I don't even necessarily ask it to fix the bug — just identify the bug

Like if I've made a change that is causing some unit test to fail, it can just run off and figure out where I made an off-by-one error or whatever in my change.
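For a made-up example of the kind of thing it tracks down:

    # Invented example: the test pins the behavior, the agent hunts the bug.
    def window_sums(xs, k):
        # BUG: stops one window early; should be range(len(xs) - k + 1)
        return [sum(xs[i:i + k]) for i in range(len(xs) - k)]

    def test_window_sums_includes_final_window():
        assert window_sums([1, 2, 3, 4], 2) == [3, 5, 7]  # gets [3, 5]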

woah 3 hours ago | parent | prev | next [-]

For some reason, setting up agents in a loop with a solid prompt and a fresh context each iteration seems to produce higher-quality work on larger or more difficult tasks than the chat interface does. It's like the agent doesn't have to spend half its time trying to guess what you want.

zem an hour ago | parent | prev | next [-]

It's more like "this function is crashing with an inconsistent file format error. Can you figure out how a file with the wrong format got this far into the pipeline?" In cases like that, the fix is usually pretty easy once you have the one code path out of several thousand nailed down.

p1esk 3 hours ago | parent | prev | next [-]

“here's a failing test—do whatever you can to fix it”

Bad idea. It can modify the code so that the test passes but everything else is now broken.

vel0city 2 hours ago | parent | prev [-]

You do things like ralph loops.

https://github.com/snarktank/ralph

It's constantly restarting itself, looking at the current state of things, re-reading what the request was and what it tried and failed at in the past (at a higher level), and trying again and again.
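The loop itself is dead simple. A rough Python sketch of the shape (not the actual ralph script; the claude -p call is just one example of invoking an agent non-interactively, and the file names are made up):

    import pathlib
    import subprocess

    PROMPT = pathlib.Path("PROMPT.md")   # fixed task description, never changes
    NOTES = pathlib.Path("progress.md")  # scratchpad the agent updates each pass
    NOTES.touch()

    while True:
        # Fresh process each iteration = fresh context window. The only memory
        # carried forward is whatever the agent wrote into progress.md.
        prompt = PROMPT.read_text() + "\n\nPrior progress:\n" + NOTES.read_text()
        result = subprocess.run(["claude", "-p", prompt],
                                capture_output=True, text=True)
        if "DONE" in result.stdout:  # the prompt tells it to print DONE when finished
            break

Because each pass starts clean, one bad hypothesis can't poison the whole run; the next iteration re-reads the notes and takes a different angle.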

tsss 4 hours ago | parent | prev | next [-]

How can you afford that?

wahnfrieden 3 hours ago | parent [-]

It costs $200 a month

addaon 3 hours ago | parent | prev [-]

> it's an ideal use case

This is impressive: you've completely mitigated the risk of learning or understanding.

arcanemachiner 3 hours ago | parent [-]

Or they've freed up time for more useful endeavours that would otherwise have been spent on drudgery.

I don't discount the value of blood, sweat and tears spent on debugging those hard issues, and the lessons learned from doing so, but there is a certain point where it's OK to take a pass and just let the robots figure it out.