| ▲ | cornstalks 2 hours ago |
Anecdote time! I had Codex GPT 5.4 xhigh generate a Rust proc macro. It's pretty straightforward: use sqlparser to parse a SQL statement and extract the column names of any row-producing queries. It generated an implementation that worked well, but I hated the ~480 lines of code. The structure and flow were just... weird. It was hard to follow, and I was seriously bugged by it. So I asked it to reimplement it with some simplifications I gave it. It dutifully executed, producing a result >600 lines long. The flow was simpler and easier to follow, but still seemed excessive for the task at hand. So I rolled up my sleeves and started deleting code and making changes manually. A little bit later, I had it down to <230 lines with a flow that was extremely easy to read and understand. So yeah, I can totally see many SWE-bench-passing PRs being functionally correct but still terrible code that I would not accept.
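For context, the macro's job is to pull the output column names out of a SQL statement. The commenter's implementation walks the AST produced by the sqlparser crate; the toy sketch below does the same extraction in dependency-free Rust under much stronger assumptions (a single flat SELECT, no subqueries, no quoted identifiers, no commas inside expressions), just to show the shape of the logic:

```rust
// Toy column-name extraction, NOT the commenter's sqlparser-based macro.
// Assumes one flat SELECT with a simple projection list.
fn column_names(sql: &str) -> Vec<String> {
    let lower = sql.to_lowercase();
    // Slice out the projection list between SELECT and FROM.
    let start = lower.find("select").map(|i| i + "select".len()).unwrap_or(0);
    let end = lower.find(" from ").unwrap_or(sql.len());
    sql[start..end]
        .split(',')
        .map(|col| {
            let col = col.trim();
            // Prefer an explicit alias ("expr AS name"); otherwise take the
            // last dotted path segment ("t.name" -> "name").
            match col.to_lowercase().rfind(" as ") {
                Some(i) => col[i + 4..].trim().to_string(),
                None => col.rsplit('.').next().unwrap_or(col).trim().to_string(),
            }
        })
        .collect()
}

fn main() {
    let cols = column_names("SELECT id, t.name, price * qty AS total FROM orders t");
    assert_eq!(cols, vec!["id", "name", "total"]);
    println!("{:?}", cols);
}
```

A real implementation on sqlparser's AST handles the cases this sketch punts on, but the core traversal is still small, which is why the hundreds of lines the model produced felt excessive.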
| ▲ | mvanzoest 3 minutes ago | parent | next [-] |
Yeah, I had a similar experience on a smaller scale, reducing a function from 125 lines to 25.
| ▲ | SerCe an hour ago | parent | prev [-] |
If you've got some time, I highly recommend going through the exercise of trying to change the prompt in a way that produces code similar to what you achieved manually. An exercise like this really helps improve agent prompting skills, since it shows how changing parts of the prompt influences the result.