| ▲ | wongarsu 3 days ago | ||||||||||||||||
In coding agents that would be "the test keeps failing and I can't fix it - let's delete the test" or "I can't fix this bug, let's delete the feature" If you measure success by unit test failures or by the presence of the bug those behaviors can obscure that the LLM wasn't able to do the intended fix. Of course a closer inspection will still reveal what happened, but using proxy measurements to track success is dangerous, especially if the LLM knows about them or if the task description implies improving that metric "a unit test is failing, fix that" | |||||||||||||||||
| ▲ | flir 3 days ago | parent [-] | ||||||||||||||||
Sure, but the discussion here is around "in production"? I'm trying to imagine a scenario and I'm coming up short. | |||||||||||||||||
| |||||||||||||||||