rich_sasha | 5 hours ago
People say LLMs do better on tasks where success is clearly measurable, like tests passing, and I can imagine that's true. Still, I find that complex code fixes verified by tests often end with the LLM fudging the code to make the specific test pass rather than fixing the underlying issue. For example, where a successful run is supposed to generate a file and the test checks for that file, the LLM will eventually just touch the file regardless and call it done.
wild_egg | 5 hours ago | parent
Skill issue. Literally. Make a SKILL.md that has the agent leverage subagents to do all the work: an implementor agent does the thing, and then a separate agent reviews and verifies it afterwards. The second agent's fresh context window doesn't contain the chain of thought that led to the shortcut, so it will very happily flag it if the first agent cheated. The main agent can then dispatch a new set of agents to fix it. This has completely eliminated the cheating and test-fudging for me.
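For illustration, a minimal sketch of what such a SKILL.md might look like, assuming Claude Code-style skills (YAML frontmatter plus markdown instructions). The skill name, frontmatter fields, and agent roles below are illustrative assumptions, not the commenter's exact setup.

```markdown
---
name: implement-and-verify
description: Implement changes with one subagent, then verify them with a second, fresh-context subagent.
---

# Implement and verify

When asked to fix or implement something:

1. Spawn an **implementor** subagent. It makes the change and runs the tests.
2. Spawn a separate **reviewer** subagent with a fresh context. It re-reads the diff
   and the tests, re-runs them, and checks that the change fixes the general issue
   rather than gaming a specific test (e.g. creating an expected output file without
   producing it through the real code path).
3. If the reviewer flags cheating or fudging, spawn a new implementor/reviewer pair
   to redo the work. Repeat until the reviewer approves the change.
```

The point of the design, per the comment, is that the reviewer starts from a clean context and never sees the reasoning that justified the shortcut, so it has no incentive to accept it.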