ctoth 3 hours ago

Yeah, LOL, tell me I'm holding it wrong again. Actually, Boris, I am tracking what is happening here. I see it, and I'm keeping receipts[0]. This started with the 4.6 rollout, specifically with the unearned confidence and not reading as much between writes. The flail quotient has gone right the hell up. If your evals aren't showing that, then bully for your evals, I reckon.

[0]: https://github.com/ctoth/claude-failures

lambda 2 hours ago | parent | next [-]

I guess one of the things I don't understand is how you expect a stochastic model, sold as a proprietary SaaS, with a proprietary (though briefly leaked) client, to be predictable in its behavior.

It seems like people are expecting LLM-based coding to work in a predictable and controllable way. And, well, no, that's not how it works, especially when you're using a proprietary SaaS model where you can't control the exact model used, the inference setup it's running on, the harness, the system prompts, etc. It's all just vibes; you're vibe coding and expecting consistency.

Now, if you were running a local weights model on your own inference setup, with an open source harness, you'd at least have some more control of the setup. Of course, it's still a stochastic model, trained on who knows what data scraped from the internet and generated from previous versions of the model; there will always be some non-determinism. But if you're running it yourself, you at least have some control and can potentially bisect configuration changes to find what caused particular behavior regressions.
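
To make that concrete, here's a minimal sketch of what that control looks like (the model name and revision are placeholders, not real checkpoints): pin the weights, fix the seed, decode greedily, and two runs of the same prompt are about as repeatable as this stuff gets, so any behavior change traces back to something you changed.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Pin an exact weights revision so the model can't silently change under you.
    MODEL = "some-org/some-7b-model"  # placeholder, not a real checkpoint
    REV = "abc123"                    # placeholder commit hash
    tok = AutoTokenizer.from_pretrained(MODEL, revision=REV)
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=REV)

    torch.manual_seed(0)  # fixed seed; greedy decoding below avoids sampling anyway
    inputs = tok("Refactor this function...", return_tensors="pt")
    out = model.generate(**inputs, do_sample=False, max_new_tokens=128)
    print(tok.decode(out[0], skip_special_tokens=True))

When a regression shows up, you diff the things that can change (weights revision, harness commit, sampling params) between the last good run and the first bad one. That's exactly the bisect you can't do against a hosted API.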

dev_l1x_be 33 minutes ago | parent | next [-]

The problem is degradation. It was working much better before. Many people, including a well-known example[0], my circle of friends, and me, were working on projects around the Opus 4.6 rollout, and suddenly our workflows started to degrade like crazy. If I did not have many quality gates between an LLM session and production (a sketch of what I mean is below the links), I would have faced certain data loss and production outages, just like some famous company did. The fun part is that the same workflow that was reliably passing the quality gates before suddenly failed on something trivial. I cannot pinpoint exactly what changed in Claude, but the degradation is there for sure. We are currently evaluating alternatives as an escape hatch (Kimi, ChatGPT, Qwen, and Nemotron are the best candidates so far). Before the Claude leak, the only issue with alternatives was how well the agentic coding tool integrated with the model and its tool use, and there are already several improvements happening, like [1]. I am hoping the gap narrows and we can move off permanently. No more hoops, and no more "you are right, I should not have attempted to delete the production database" moments.

[0]: https://x.com/theo/status/2041111862113444221

[1]: https://x.com/_can1357/status/2021828033640911196
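
By "quality gate" I mean checks like this minimal sketch. The function and the destructive-statement list are made up for illustration; a real setup would also have tests, review, and staged rollouts.

    import re
    import sys

    # Hypothetical gate: refuse agent-generated SQL that contains obviously
    # destructive statements. This runs in CI before anything touches prod.
    DESTRUCTIVE = [
        r"\bDROP\s+(TABLE|DATABASE)\b",
        r"\bTRUNCATE\b",
        r"\bDELETE\s+FROM\s+\w+\s*;",  # DELETE with no WHERE clause
    ]

    def gate(sql_text: str) -> list[str]:
        """Return the destructive patterns found; empty list means clean."""
        return [p for p in DESTRUCTIVE if re.search(p, sql_text, re.IGNORECASE)]

    if __name__ == "__main__":
        hits = gate(sys.stdin.read())
        if hits:
            print("Blocked by gate:", *hits, sep="\n  ")
            sys.exit(1)  # fail the pipeline; a human has to look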

stavros 43 minutes ago | parent | prev [-]

Same as how I expect a coin to come up heads 50% of the time.

malfist 2 hours ago | parent | prev | next [-]

It also completely ignores the increase in behavioral tracking metrics. A 68% increase in swearing at the LLM for doing something wrong needs to be addressed, and isn't just "you're holding it wrong".

alchemist1e9 an hour ago | parent [-]

I'm thinking a great marketing line for local/self-hosted LLMs in the future: "You can swear at your LLM and nobody will care!"

bcherny an hour ago | parent | prev | next [-]

Christopher, would you be able to share the transcripts for that repo by running /bug? That would make the reports actionable for me to dig in and debug.

quietsegfault 2 hours ago | parent | prev | next [-]

I’m not sure being confrontational like this really helps your case. There are real people responding, and even if you’re frustrated it doesn’t pay off to take that frustration out on the people willing to help.

ctoth 2 hours ago | parent | next [-]

Fair point on tone. It's a bit of a bind, isn't it? When you come with a well-researched issue, as OP did, you get this bland corporate nonsense: "don't believe your lyin' eyes, we didn't change anything major, you can fix it in settings."

How do you communicate in such a way that you are actually heard when this is the default wall you hit?

The author is in this thread saying every suggested setting is already maxed. The response is "try these settings." What's the productive version of pointing out that the answer doesn't address the evidence? Genuine question. I linked my repo because it's the most concrete example I have.

enraged_camel 9 minutes ago | parent | next [-]

I read the entire performance degradation report in the OP, and Boris's response, and it seems that the overwhelming majority of the report's findings can indeed be explained by the `showThinkingSummaries` option having recently been turned off by default.
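
For anyone who wants to flip it back on, something like this should do it. The key name comes from this thread; that it's honored in the user-level ~/.claude/settings.json is my assumption, so verify before relying on it.

    import json
    from pathlib import Path

    # Assumption: Claude Code reads user settings from ~/.claude/settings.json
    # and honors the showThinkingSummaries key mentioned in this thread.
    path = Path.home() / ".claude" / "settings.json"
    settings = json.loads(path.read_text()) if path.exists() else {}
    settings["showThinkingSummaries"] = True
    path.parent.mkdir(exist_ok=True)
    path.write_text(json.dumps(settings, indent=2))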

wonnage 2 hours ago | parent | prev [-]

Just use a different tool or stop vibe coding; it's not that hard. I really don't understand the logic of filing bug reports against the black box of AI.

geysersam 17 minutes ago | parent [-]

People file tickets against closed-source "black box" systems all the time. You could just as well say: stop using MS SQL, just use a different tool, it's not that hard.

malfist 2 hours ago | parent | prev | next [-]

Is somebody saying "you're holding it wrong" one of the "people willing to help"?

TeMPOraL 37 minutes ago | parent | next [-]

They are if you are, in fact, holding it wrong.

As has usually been the case for most of the few years LLMs have existed in this world.

Think not of iPhone antennas; think of a humble hammer. A hammer has three ends you can hold it by, and no amount of UI/UX and product-design thinking will make the end you like to hold a good choice when you want to drive a Torx screw.

Retr0id 2 hours ago | parent | prev [-]

You're holding it absolutely right!

BigTTYGothGF 2 hours ago | parent | prev | next [-]

The stated policy of HN is "don't be mean to the openclaw people"; let's see if it generalizes.
