bityard 7 hours ago

My hypothesis is that some of this is a perceived quality drop due to "luck of the draw" when it comes to the non-deterministic nature of LLM output.

A couple of weeks ago, I wanted Claude to write a low-stakes personal productivity app for me. I wrote an essay describing how I wanted it to behave and told Claude pretty much, "Write an implementation plan for this." The first iteration was _beautiful_ and was everything I had hoped for, except for one part that went in a different direction than I intended, because my description of how to go about it was too ambiguous.

I corrected that ambiguity in my essay but instead of having Claude fix the existing implementation plan, I redid it from scratch in a new chat because I wanted to see if it would write more or less the same thing as before. It did not--in fact, the output was FAR worse even though I didn't change any model settings. The next two burned down, fell over, and then sank into the swamp but the fourth one was (finally) very much on par with the first.

I'm taking from this that it's often okay (and probably good) to simply have Claude re-do tasks to get a higher-quality output. Of course, if you're paying for your own tokens, that might get expensive in a hurry...

coffeefirst 4 hours ago | parent | next [-]

This is my theory too. There’s a predictable cycle where the models “get worse.” They probably don’t. A lot of people just take a while to really hit hard against the limitations.

And once you get unlucky you can’t unsee it.

skirmish 5 hours ago | parent | prev | next [-]

So will we have to do what image generation people have been doing for ages: generate 50 versions of output for the prompt, then pick the best manually? Anthropic must be licking its figurative chops hearing this.
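The "generate many, pick the best" workflow is essentially best-of-N sampling. A minimal sketch of the loop, where `generate` is a hypothetical stand-in for a real LLM call (no actual API is invoked) and the scoring function is a toy substitute for manual picking:

```python
import random

def generate(prompt: str, temperature: float, seed: int) -> str:
    # Hypothetical stand-in for an LLM call: the same prompt with
    # different seeds yields different outputs, mimicking the
    # nondeterminism of temperature-based sampling.
    rng = random.Random(seed)
    words = ["alpha", "beta", "gamma", "delta"]
    length = 1 + rng.randrange(int(1 + temperature * 3))
    return " ".join(rng.choice(words) for _ in range(length))

def best_of_n(prompt: str, n: int, score) -> str:
    # Generate n candidates and keep the highest-scoring one --
    # the "generate 50, pick the best manually" workflow, with the
    # eyeballing replaced by a scoring function.
    candidates = [generate(prompt, temperature=0.8, seed=i) for i in range(n)]
    return max(candidates, key=score)

# Toy example: prefer longer outputs (a stand-in for human judgment).
best = best_of_n("write my app plan", n=50, score=len)
```

The catch, of course, is that every one of those N candidates costs tokens, which is exactly the "licking its chops" point.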

motoroco 4 hours ago | parent [-]

I have to agree with OP: in my experience it is usually more productive to start over than to try correcting output early on. Deeper into a project, it gets a bit harder to pull off a switch. I sometimes fork my chats before attempting a correction so that I can resume the original just in case (yes, I know you can double-tap Esc, but the restoration has failed for me a few times in the past, so I generally avoid it).

billywhizz 2 hours ago | parent | prev | next [-]

You probably could have written the low-stakes productivity app in a fraction of the time you wasted on this.

gilrain 7 hours ago | parent | prev [-]

> My hypothesis is that some of this is a perceived quality drop due to "luck of the draw" when it comes to the non-deterministic nature of [LLM] output.

I think you must have learned that they’re more nondeterministic than you had thought, but then wrongly connected your new understanding to the recent model degradation. Note: they’ve been nondeterministic the whole time, while the widely-reported degradation is recent.

bityard 6 hours ago | parent | next [-]

Er, no, I am fully aware that LLMs have always been non-deterministic.

gilrain 6 hours ago | parent [-]

Your argument seems to be that a statistically improbable number of people all experienced ultimately random, poor outputs, leading to a mere misperception of model degradation… but this is not supported by reality, in which a different cause was found, so I was trying to connect your dots.

zamadatix 5 hours ago | parent | next [-]

Not everyone is reporting their experience, and the number of users is not constant. On the former, the noisiest will always be those who experience an issue; on the latter, there are more people than ever using Claude Code regularly.

Combine these things under the strongest interpretation, rather than an easy-to-attack one, and it's very reasonable to posit that a critical mass has been reached: enough people report issues that others try their own investigations, while the negative outliers get the most online attention.

I'm not convinced this is the story (or, at least, the biggest part of it) myself, but I'm not ready to declare it illogical either.

bityard 5 hours ago | parent | prev | next [-]

No, that is not my argument, in fact I don't have any argument whatsoever. It was just a plausible observation that I felt like sharing. There's nothing further to read into it, I don't have a horse in this race.

furyofantares 5 hours ago | parent | prev [-]

Not really. They said "some of this is a perceived quality drop," and that's almost certainly correct: _some_ of it is that.

When everyone's talking about the real degradation, you'll also get everyone who experiences "random"[1] degradation thinking they're experiencing the same thing, and chiming in as well.

[1] I also don't think we're talking about the more technical type of nondeterminism here (temperature, etc.), but the nondeterminism where I can't really determine when I have a good context and when I don't, and in some cases can't tell why an LLM is capable of one thing but not another. So when I switch to a task I think is equally easy and it fails, or when my context has some meaningless-to-me (random-to-me) variation that causes it to fail instead of succeed, I can't determine the cause. And so I bucket myself with the crowd that's experiencing real degradation and chime in.
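For reference, the "technical" nondeterminism set aside above comes from temperature-scaled sampling. A minimal, stdlib-only sketch with toy logits showing how temperature reshapes the token distribution (and hence the output variance):

```python
import math

def softmax(logits, temperature):
    # Dividing logits by the temperature before softmax sharpens the
    # distribution when temperature is low (mass concentrates on the
    # top token) and flattens it when temperature is high.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                 # toy scores for three tokens
cold = softmax(logits, temperature=0.1)  # nearly deterministic: top token dominates
hot = softmax(logits, temperature=2.0)   # high variance: probability spread out
```

At `temperature=0.1` the top token gets essentially all the probability mass; at `temperature=2.0` the same logits give a much flatter distribution, which is where run-to-run output variation comes from.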

pydry 6 hours ago | parent | prev [-]

I wonder how well the "good" versions would have held up if you had thrown awkward edge cases at them.