taormina 2 days ago

Grok hasn't gotten better. OpenAI hasn't gotten better. Claude Code with Opus and Sonnet I swear are getting actively worse. Maybe you only use them for toy projects, but attempting to get them to do real work in my real codebase is an exercise in frustration. Yes, I've done meaningful prompting work, and I've set up all the CLAUDE.md files, and then it proceeds to completely ignore everything I said, all of the context I gave, and just craps out something completely useless. It has accomplished a small amount of meaningful work, exactly enough that I think I'm neutral instead of in the negative in terms of work:time compared to having just done it all myself.

I get to tell myself that it's worth it because at least I'm "keeping up with the industry," but I honestly just don't get the hype train one bit. Maybe I'm too senior? Maybe the frameworks I use, despite being completely open source and available as training data for every model on the planet, are too esoteric?

And then the top post today on the front page is telling me that my problem is that I'm bothering to supervise, and that I should be writing an agent framework so that it can spew out the crap in record time... But I need to know what is absolute garbage and what needs to be reverted. I will admit that my usual pattern has been to try to prompt it into better test coverage/specific feature additions/etc. on the nights and weekends, and then focus my daytime working hours on reviewing what was produced. About half the time I review it and have to heavily clean it up to make it usable, but more often than not, I revert the whole thing and just start on it myself from scratch. I don't see how this counts as "better".

danenania 2 days ago | parent [-]

It can definitely be difficult and frustrating to try to use LLMs in a large codebase—no disagreement there. You have to be very selective about the tasks you give them and how they are framed. And yeah, you often need to throw away what they produced when they go in the wrong direction.

None of that means they’re getting worse though. They’re getting better; they’re just not as good as you want them to be.

taormina 2 days ago | parent [-]

I mean, this really isn't a large codebase, this is a small-medium sized codebase as judged by prior jobs/projects. 9000 lines of code?

When I give them the same task I gave them the day before, and the output is noticeably worse than it was under the last model version, is that better? When the day-by-day performance feels like it's degrading?

They are definitely not as good as I would like them to be, but that's to be expected when the professionals hyping them up are the same ones begging for money.