cyberax 5 hours ago

I did not see that?

It's way more _proactive_ than the old models, sometimes in ways it shouldn't really be proactive. But it produces _more_ slop than 4.8, and I have not seen any real breakthroughs from it.

Edit: to give an example, I'm working on integrating a self-hosting auth provider into our app. So I gave it a prompt to create a "bootstrap" script that would create pre-configured settings for the local installation.

Fable did it. And then proceeded (unprompted) to test it by killing the running server, removing the database, re-initializing and (trying) to verify that the bootstrap produced identical results.

Well, yeah. Great. I can see how this "bias for action" works for security research and one-shot projects, not so sure about regular development.

I just tried that with Opus, and it produced a similar bootstrap script but did not start the test by itself.

0000000000100 an hour ago

Ah, that I will admit. It gets shit done one way or another haha. This is why a sandboxed environment and a reproducible test DB are key here. I give my Claude read-only access to my dev DB, which really removes the growing temptation it has to “cheat”, e.g. doing something hacky and fixing the DB manually in a way that doesn’t solve the problem everywhere.
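To make the read-only idea concrete, here is a minimal sketch. It assumes SQLite purely as a stand-in, since the actual database in this setup isn't stated; `open_read_only`, `demo`, `dev.db` and the `users` table are hypothetical names for illustration. With a server database the equivalent would usually be a dedicated role that only has read permissions.

```python
import os
import sqlite3
import tempfile


def open_read_only(path: str) -> sqlite3.Connection:
    """Open an SQLite database so that any write attempt fails."""
    # mode=ro is SQLite's URI parameter for a read-only connection.
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def demo() -> str:
    """Seed a throwaway DB, then show that a read-only handle rejects writes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dev.db")  # hypothetical dev database

        # Seed the database with a normal read-write connection.
        rw = sqlite3.connect(path)
        rw.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        rw.execute("INSERT INTO users (name) VALUES ('alice')")
        rw.commit()
        rw.close()

        # This is the handle you would give to the agent.
        ro = open_read_only(path)
        try:
            # Reads still work.
            rows = ro.execute("SELECT name FROM users").fetchall()
            assert rows == [("alice",)]
            # A manual "fix" of the data is rejected by the database itself.
            try:
                ro.execute("UPDATE users SET name = 'patched'")
            except sqlite3.OperationalError as exc:
                return str(exc)
            return "write unexpectedly succeeded"
        finally:
            ro.close()


if __name__ == "__main__":
    print(demo())
```

The point is that the restriction lives in the connection, not in the prompt, so a manual patch of the data simply fails and the only way forward is to fix the code.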

Personally I love when the AI has this amount of problem solving. But you have to build the environment around it that encourages solving problems right the first time, versus taking the easy way out and hacking out a solution.

It’s all about constraining the behavior of the LLM into productive and permanent directions. The more advanced it gets, the more it feels like designing engineering processes rather than coding. Personally it’s a fun change of pace, and it’s giving me a lot of opportunities to look at the project I’m working on through a wider lens. I find having to pump out features makes you myopic in a sense. I really miss the control I had when writing it all by hand, but I love just being able to build software. At the end of the day, what do you want? That’s the question I’ve had to grapple with recently.

Personally I don’t mind switching gears to the bigger picture of why the software exists and what purpose it serves.

gmueckl 2 hours ago

This honestly sounds like a tweaked system prompt more than anything. Maybe it is an attempt to make the model appear stronger?