benterix 8 months ago

OK, I see one glaring problem with this approach (having used both Claude 3.7 and o3). When they talk about 50% reliability, there is a hidden cost: you cannot know beforehand whether the response (or a series of them) is leading you to the actual, optimal, or good-enough solution, or towards a blind alley (where the solution doesn't work at all, works terribly, or, worst of all, works only for the cases tested). This becomes more or less clear after you check the solution, but not before.

So, because most engineering tasks I'm dealing with are quite complex and would require multiple prompts, there is always the cost of taking into account the fact that it will go bollocks at some point. Frankly, most of them do at some point. It's not evident for simple tasks, but for more complex ones they simply start inserting their BS in spite of an often excellent start. But you are already "80% done". What do you do? Start from scratch with a different approach? Everybody has their own strategies (starting a new thread with the contents generated so far, etc.) but there's always a human cost associated.