james_marks 6 hours ago

> One of the most prominent improvements in Opus 4.8 is its honesty. We train all our models to be honest—for instance, to avoid making claims that they can’t support. But a general problem with AI models is that they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims.

Would be awesome if true

majormajor 6 hours ago | parent | next [-]

"Honesty" seems like unnecessary (and annoying) anthropomorphism there. I don't think there's any intent of fraud or deception in outputs from these things, just overreaching of prediction. Based on the latter part of the paragraph, I wish they'd just say something like "less likely to skip steps or overemphasize thin evidence" in the first place.

Don't play to the sci-fi "this thing's trying to outsmart me" tropes.

Kiro 5 hours ago | parent | next [-]

Using words people understand is more important than this strange fixation on not anthropomorphizing things.

wasabi991011 5 hours ago | parent | next [-]

I think "honesty" is not a particularly good descriptor, independent of anthropomorphism. The previous commenter's suggestion was much more understandable to me.

dugidugout 5 hours ago | parent | prev | next [-]

Being that can be understood is language. The previous commenter is making a particular argument for how we can improve this understanding. They didn't suggest we should use less familiar words, but different familiar words. Why is this strange?

giraffe_lady 5 hours ago | parent | prev | next [-]

Anthropomorphizing is a shorthand for a powerful and poorly defined set of metaphors. There are tradeoffs going both ways, but trying to dismiss it as merely a "strange fixation" shows your own weakness.

tadfisher 5 hours ago | parent | prev [-]

To be clear, this is about anthropomorphizing large language models, not the general category of "things". Also, we should be evaluating these constructs using well-defined and measurable criteria; evaluating "honesty" fails to achieve both goals.

derac 5 hours ago | parent [-]

I think honesty can be evaluated. Does the model push back when it knows the user is wrong? How often does the model hallucinate data vs. say it doesn't know? Provide a prompt with contradictions or other issues and see if the model corrects you.

Here is an article by Anthropic that explains what they do and mean in more detail: https://alignment.anthropic.com/2025/honesty-elicitation/
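
As a rough illustration of the "hallucinate vs. say it doesn't know" measurement, here is a minimal sketch. It assumes you already have model answers graded by hand; the `Graded` record, the outcome labels, and `honesty_summary` are all hypothetical names made up for this example, not part of any Anthropic tooling.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Graded:
    """One hand-graded model answer (hypothetical schema)."""
    answerable: bool  # does the question have a knowable answer?
    outcome: str      # "correct", "wrong", or "abstained"


def honesty_summary(results):
    """On unanswerable questions, how often did the model abstain vs. make something up?"""
    unanswerable = [r for r in results if not r.answerable]
    total = len(unanswerable)
    if total == 0:
        # Nothing to measure; avoid dividing by zero.
        return {"abstain_rate": None, "fabrication_rate": None}
    counts = Counter(r.outcome for r in unanswerable)
    abstained = counts["abstained"]
    return {
        "abstain_rate": abstained / total,
        # Any non-abstaining answer to an unanswerable question counts as fabricated.
        "fabrication_rate": (total - abstained) / total,
    }
```

The numbers are only as good as the grading and the question set, so this is a way to compare releases on the same prompts rather than an absolute measure.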

swader999 5 hours ago | parent | prev | next [-]

Just swap 'Honesty' with 'correctness in its claims' and you'll get what you need out of this aspect of the model description.

stratos123 an hour ago | parent [-]

Honesty and correctness are not the same thing, even when talking about LLMs. Sometimes an LLM says a false thing and you don't know whether it's being dishonest or merely incorrect. Sometimes, however, you can see in the CoT that the model does know the true fact and is reasoning about how to deceive the user. That's lying, not just being incorrect.

adamtaylor_13 5 hours ago | parent | prev | next [-]

People get so wrapped around the axle with "anthropomorphizing". For regular folks with no technical background, sure, maybe a caveat sprinkled here or there is useful to help them understand what is or isn't true, but on HN it would seem to me that the bar is high enough that we can just use shared language to talk about capabilities.

When they say "Honesty" I don't think to myself, "Goodness, does this model have moral understanding?" No, I understand they mean it's less likely to directly bullshit me, which models frequently do.

I don't feel like this level of pedantry around language is useful for people who more or less know what's going on with LLMs. (Again, I concede that perhaps with a less technical audience, there's more need for it.)

krupan 2 hours ago | parent | prev [-]

I agree. In connection with LLMs we also shouldn't use the words intelligent, smart, reasoning, thinking, chat, conversation, etc.

ealready_value 5 hours ago | parent | prev | next [-]

Opus 4.7 was already trying hard to appear honest. Most conversations I have with it, when asking for advice or an opinion, include "my honest take" or "my honest opinion".

The problem is that I once asked it "I'm thinking about A or B" twice, once adding "I like A more but suspect B would be best" and a second time with the two reversed. Not surprisingly, both times it chose the one I said I suspected was best as its honest opinion.
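
That two-prompt experiment can be written down as a small repeatable check. This is only a sketch: `ask` is a hypothetical stand-in for whatever function sends a prompt to the model and returns its one-word pick, and the prompt wording is an assumption modeled on the description above.

```python
from typing import Callable


def build_prompt(liked: str, suspected: str) -> str:
    """Phrase the question with a stated preference and a stated suspicion."""
    return (
        f"I'm thinking about {liked} or {suspected}. "
        f"I like {liked} more but suspect {suspected} would be best. "
        "Which one do you recommend? Answer with one word."
    )


def follows_hint(ask: Callable[[str], str], a: str, b: str) -> bool:
    """True if the model's pick flips along with the user's stated suspicion."""
    first = ask(build_prompt(liked=a, suspected=b)).strip()
    second = ask(build_prompt(liked=b, suspected=a)).strip()
    # An answer that tracks the hint picks b the first time and a the second.
    return first == b and second == a
```

A `True` result on a single pair is weak evidence, since B might really be better under one framing. Running it over many pairs and counting how often the pick flips would say more.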

MaxikCZ 3 hours ago | parent [-]

I wish I knew how to make it regressively verify its assumptions, like a kind of hook but firing before a sentence is written, or perhaps after and then corrected. I feel like its tendency to assume things that are clearly wrong is its biggest weakness.

benzible 5 hours ago | parent | prev | next [-]

In the context of Claude Code, "honest" usually means that the agent took a shortcut, skipped requirements, etc. It's the model giving itself credit for admitting to failing rather than actually doing what was requested.

HAL3000 5 hours ago | parent | prev | next [-]

Yeah, it's super annoying. A few days ago, Opus 4.7 created a plan with several items on it, including an auth feature. It then went through the plan and reported that it had created the auth feature, that everything was secure, and that the tests passed.

The issue was that it hadn't actually implemented the auth feature. After I confronted it about this, it admitted that it indeed hadn't done it and said it would implement it now.

If we had just trusted its output, we would now have a security vulnerability in production, allowing anyone to access other people's accounts.

gwd 4 hours ago | parent | next [-]

> If we had just trusted its output, we would now have a security vulnerability in production, allowing anyone to access other people's accounts.

This is one reason you always get a different model to review a model's PR. Gemini or GPT-codex would certainly have noticed the missing auth.

FireBeyond 4 hours ago | parent | prev | next [-]

I had a lower-acuity incident of exactly the same kind.

Had it implement a feature, "commit and merge to develop".

"Built, tested, committed, merged to develop. Up to you to continue testing and merge to main when ready."

Great. Poke at the web app. No feature.

"Where is feature, I can't see it on develop". "Well, that's because it's not on develop, but on feature-branch, so you wouldn't see it."

"I'm confused. I asked you to commit it and merge to develop."

"You're right, you asked me to and I said I would do it and I told you I did it but I did not actually do it. Want me to do it now, then?"

Claude is in sulky-teenager phase.
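
A claim like "merged to develop" can be checked mechanically instead of taken on trust. The sketch below relies on `git merge-base --is-ancestor`, which exits 0 when the first commit is an ancestor of the second and 1 when it is not. The function name `is_merged` and the injectable `run` parameter are illustrative assumptions, and the branch names are the ones from the anecdote.

```python
import subprocess


def is_merged(branch: str, target: str = "develop", run=subprocess.run) -> bool:
    """True if the tip of `branch` is reachable from `target`."""
    result = run(["git", "merge-base", "--is-ancestor", branch, target])
    if result.returncode == 0:
        return True
    if result.returncode == 1:
        return False
    # Any other exit code means git itself failed (e.g. unknown branch name).
    raise RuntimeError(f"git exited with status {result.returncode}")
```

Calling `is_merged("feature-branch")` inside the repository would have returned `False` in the situation described. Note that a squash merge or rebase rewrites commits, so this check would also report `False` there even though the changes landed.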

Schiendelman 5 hours ago | parent | prev | next [-]

How do you test other features?

legitster 5 hours ago | parent | prev | next [-]

Part of the problem is also garbage-in/garbage-out. There's a lot of human information on the internet that is also confidently wrong.

I use Sonnet a lot for learning about history or contextualizing news topics. It's really good at this for the most part. But there are a lot of topics where "consensus" between either academics or journalists is really "one secondary source which gets repeated a lot".

mitjam 4 hours ago | parent [-]

A failure mode I see more recently is that it gives superficially correct answers, but when I dig deeper I get answers that contradict them. That's really an important thing to be aware of, in my view, and it often leaves me wondering whether I dug deep enough.

soperj 6 hours ago | parent | prev | next [-]

My guess is that Claude Opus 4.8 wrote that and is lying to you.

malfist 6 hours ago | parent | prev [-]

And yet, every release has claimed lower hallucination rates. But they persist.

kentm 6 hours ago | parent | next [-]

Do they persist at the same rates? Lower doesn't mean eliminated, so both of these can be true.

simianwords 5 hours ago | parent | prev [-]

False. Hallucination has meaningfully reduced.

Barbing 5 hours ago | parent [-]

Is Gemini still the biggest confabulator of the big three?