
LLMs like Opus, Gemini 3, and GPT-5.2/5.1-Codex-max are phenomenal for coding and have only very recently crossed the gap from "eh" to genuinely fantastic when left to operate on their own agentically. The major trade-off is cost: I ran up $200 per provider after blowing through 'pro' tier limits during a single week of hacking over the holidays.

Unfortunately, it's still surprisingly easy for these models to fall into really stupid maintainability traps.

For instance, today Opus added a feature that needed access to a db. It failed because the db (SQLite) wasn't local to the executable at runtime. Its solution was a 100-line function to resolve a relative path and handle every error and variation.

I hit ESC and said "... just accept a flag for --localdb <file>". It responded with "oh, that's a much cleaner implementation. Good idea!", then implemented my approach and deleted all the hacks it had scattered about.
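
To make the contrast concrete, here's a minimal sketch of the flag-based approach (hypothetical CLI in Python with argparse; the actual project isn't shown in this thread). The caller passes the SQLite path explicitly, so no path-resolution heuristics are needed:

    import argparse
    import sqlite3

    def main() -> None:
        # Hypothetical sketch: take the SQLite path explicitly via --localdb
        # instead of guessing it relative to the executable's location.
        parser = argparse.ArgumentParser()
        parser.add_argument("--localdb", required=True,
                            help="path to the local SQLite database file")
        args = parser.parse_args()

        # The caller decides where the db lives; no 100-line resolver needed.
        conn = sqlite3.connect(args.localdb)
        try:
            pass  # feature logic would go here
        finally:
            conn.close()

    if __name__ == "__main__":
        main()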

This... is why LLMs are still not senior engineers. They do plainly stupid things. They're still absurdly powerful and helpful, but if you want maintainable code, you really have to pay attention.

Another common failure mode is a polluted context.

I asked Opus to implement a feature by looking up the spec. It looked up the wrong one (the v2 API instead of v3) -- I had only said "latest spec". It then fell into the classic LLM circular troubleshooting as we went around in four loops trying to figure out why the calculations were failing.

I killed the session, asked a fresh instance to "figure out why the calculation was failing", and it found the problem straight away. The previous instance would have gone in circles for eternity because its worldview had been polluted by assumptions it had made and could not shake.
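
A rough sketch of that reset, using a hypothetical llm_client helper (not any real SDK), just to illustrate starting a new conversation with a narrow prompt instead of dragging the old context along:

    # Hypothetical helper, not a real API: restart in a fresh conversation so
    # earlier (wrong) assumptions don't carry over into the debugging attempt.
    def debug_in_fresh_session(llm_client, failing_output: str) -> str:
        prompt = (
            "Figure out why the calculation is failing.\n\n"
            "Observed output:\n" + failing_output
        )
        # New session: no history from the polluted run is included.
        return llm_client.new_session().ask(prompt)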

This is a second way in which LLMs are rigid and robotic in their thinking and approach: they keep taking the wrong path even when told not to. Further reading on 'debugging decay': https://arxiv.org/abs/2506.18403

All this said, the number of failure scenarios keeps shrinking. We've gone from "a problem or hallucination every other code block" to "a problem every 200-1000 code blocks".

They're now in the sweet spot of acting as a massive accelerator. If you're not using them, you'll simply deliver slower.