| ▲ | nojs 3 days ago |
| I'm starting to think this is a deeper problem with LLMs that will be hard to solve with stylistic changes. If you ask it to never say "you're absolutely right" and always challenge, then it will dutifully obey, and always challenge - even when you are, in fact, right.
What you really want is "challenge me when I'm wrong, and tell me I'm right if I am" - which seems to be a lot harder. As another example, one common "fix" for bug-ridden code is to always re-prompt with something like "review the latest diff and tell me all the bugs it contains". Similarly, if the code does contain bugs, this will often find them. But if it doesn't contain bugs, it will find some anyway, and break things. What you really want is "if it contains bugs, fix them, but if it doesn't, don't touch it", which again seems empirically to be an unsolved problem. It reminds me of that scene in Black Mirror, when the AI is about to jump off a cliff, and the girl says "no, he would be more scared", and so the AI dutifully starts acting scared. |
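To make the asymmetry concrete, here is a minimal sketch of that blanket re-prompt loop; `call_llm` is a placeholder for whatever client or CLI you drive, not any particular API:

```python
# Sketch of the blanket "always review" loop described above.
# `call_llm` is a placeholder, not a real client.

def call_llm(prompt: str) -> str:
    """Placeholder for an actual model call (SDK, CLI, etc.)."""
    raise NotImplementedError


def review_and_fix(diff: str) -> str:
    # The prompt presupposes that bugs exist, which is the core problem:
    # when the diff is clean, the model is still pushed to "find" some.
    review = call_llm(
        "Review the latest diff and tell me all the bugs it contains:\n" + diff
    )
    # Feeding that review straight back as instructions means spurious
    # "bugs" get "fixed", i.e. working code gets changed.
    return call_llm(
        "Apply fixes for these issues to the diff.\n\n"
        f"Issues:\n{review}\n\nDiff:\n{diff}"
    )
```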
|
| ▲ | zehaeva 3 days ago | parent | next [-] |
| I'm more reminded of Tom Scott's talk at the Royal Institution "There is no Algorithm for Truth"[0]. A lot of what you're talking about is the ability to detect Truth, or even truth! [0] https://www.youtube.com/watch?v=leX541Dr2rU |
| |
| ▲ | naasking 3 days ago | parent [-] | | > I'm more reminded of Tom Scott's talk at the Royal Institution "There is no Algorithm for Truth"[0]. Isn't there? https://en.wikipedia.org/wiki/Solomonoff%27s_theory_of_induc... | | |
| ▲ | zehaeva 3 days ago | parent | next [-] | | There are limits to such algorithms, as proven by Kurt Gödel. https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_... | | |
| ▲ | bigmadshoe 2 days ago | parent | next [-] | | You're really missing the point with LLMs and truth if you're appealing to Gödel's Incompleteness Theorem | | | |
| ▲ | naasking 2 days ago | parent | prev [-] | | True, and in the case of Solomonoff Induction, incompleteness manifests in the calculation of Kolmogorov complexity used to order programs. But what incompleteness actually proves is that there is no single algorithm for truth; a collection of algorithms can make up for each other's weaknesses in many ways, e.g. while no single algorithm can solve the halting problem, different algorithms can cover cases for which the others fail to prove a definitive halting result. I'm not convinced you can't produce a pretty robust system that produces a pretty darn good approximation of truth, in the limit. Incompleteness also rears its head in type inference for programming languages, but the cases for which it fails are typically not programs of any interest, or not programs that would be understandable to humans. I think the relevance of incompleteness elsewhere is sometimes overblown in exactly this way. | | |
| ▲ | zehaeva 2 days ago | parent [-] | | If there exists some such set of algorithms that could get a "pretty darn good approximation of truth", I would be extremely happy. Given the pushes for political truths in all of the LLMs, I am uncertain whether they would be implemented even if they existed. |
|
| |
| ▲ | LegionMammal978 2 days ago | parent | prev | next [-] | | That Wikipedia article is annoyingly scant on what assumptions are needed for the philosophical conclusions of Solomonoff's method to hold. (For that matter, it's also scant on the actual mathematical statements.) As far as I can tell, it's something like "If there exists some algorithm that always generates True predictions (or perhaps some sequence of algorithms that make predictions within some epsilon of error?), then you can learn that algorithm in the limit, by listing through all algorithms by length and filtering them by which predict your current set of observations." But as mentioned, it's uncomputable, and the relative lack of success of AIXI-based approaches suggests that it's not even as well-approximable as advertised. Also, assuming that there exists no single finite algorithm for Truth, Solomonoff's method will never get you all the way there. | |
| ▲ | yubblegum 2 days ago | parent | prev [-] | | > "computability and completeness are mutually exclusive: any complete theory must be uncomputable." This seems to be baked into our reality/universe. So many duals like this. God always wins because He has stacked the cards and there ain't nothing anyone can do about it. |
|
|
|
| ▲ | pjc50 3 days ago | parent | prev | next [-] |
Well, yes, finding out Truth is a hard philosophical problem, and LLMs just sidestep it entirely, going instead for "looks good to me". |
| |
| ▲ | visarga 3 days ago | parent [-] | | There is no Truth, only ideas that stood the test of time. All our knowledge is a mesh of leaky abstractions; we can't think without abstractions, but we also can't access Truth with such tools. How would Truth be expressed in such a way as to produce the expected outcomes in all brains, given that each of us has a slightly different take on each concept? | | |
| ▲ | cozyman 2 days ago | parent | next [-] | | "There is no Truth, only ideas that stood the test of time" is that a truth claim? | | |
| ▲ | ben_w 2 days ago | parent [-] | | It's an idea that's stood the test of time, IMO. Perhaps there is truth, and it only looks like we can't find it because only some of us are magic? | | |
| ▲ | scoofy 2 days ago | parent | next [-] | | I studied philosophy. Got multiple degrees. The conversations are so incredibly exhausting… not because they are sophomoric, but only because people rarely have a good faith discussion of them. Is there Truth? Probably. Can we access it? Maybe, but we can never be sure. Does that mean Truth doesn’t exist? Sort of, but we can still build skyscrapers. Truth is a concept. Practical knowledge is everywhere. Whether they correspond to each other is at the heart of philosophy: inductive empiricism vs deductive rationalism. | |
| ▲ | ben_w 2 days ago | parent [-] | | I can definitely sympathise with that. This whole forum — well, the whole internet, but also this forum — must be an Eternal September* for you. Given the differences between US and UK education, my A-level in philosophy (and not even a very good grade) would be equivalent to fresher, not even sophomore, though looking up the word (we don't use it conventionally in the UK) I imagine you meant it in the other, worse, sense? Hmm. While you're here, a question: As a software developer, when using LLMs I've observed that they're better than many humans (all students and most recent graduates) but still not good. How would you rate them for philosophy? Are they simultaneously quite mediocre and also miles above conversations like this? * On the off-chance this is new to you: https://en.wikipedia.org/wiki/Eternal_September | | |
| ▲ | scoofy 2 days ago | parent [-] | | It’s definitely not an eternal September situation. It’s just hard problems, unsolvable really, that people have tidy solutions for, rather than dealing with the fact that they are very hard, and we probably aren’t going to know. LLMs at philosophy? I’ve never thought about it. I have to assume they’re terrible, but who knows. From an analytic perspective, it would have cognition backwards. Language is just pointing at things, so the algos wouldn’t really have access to reality. |
|
| |
| ▲ | cozyman 2 days ago | parent | prev [-] | | so something being believed for a long period of time makes it true? | | |
|
| |
| ▲ | svieira 3 days ago | parent | prev [-] | | A shared grounding as a gift, perhaps? |
|
|
|
| ▲ | jerf 3 days ago | parent | prev | next [-] |
LLMs by their nature don't really know if they're right or not. It's not a value available to them, so they can't operate with it.
It has been interesting watching the flow of the debate over LLMs. Certainly there were a lot of people who denied what they were obviously capable of doing. But a pushback seems to have developed that simply denies they have any limitations. They do have limitations, they work in a very characteristic way, and I do not expect them to be the last word in AI.
And this is one of the limitations. They don't really know if they're right. All they know is whether something like "But this is wrong" fits their training data - but that's still just words that seem to fit the situation.
This is, if you like and if it helps to think about it, not their "fault". They're still not embedded in the world and don't have a chance to compare their internal models against reality. Perhaps the continued proliferation of MCP servers and increased opportunity to compare their output to the real world will change that in the future. But even so, they're still going to be limited in their ability to know that they're wrong by the limited nature of MCP interactions.
I mean, even here in the real world, gathering data about how right or wrong my beliefs are is an expensive, difficult operation that involves taking a lot of actions that are still largely unavailable to LLMs, and essentially entirely unavailable during training. I don't "blame" them for not being able to benefit from actions they can't take. |
| |
| ▲ | whimsicalism 3 days ago | parent | next [-] | | There have been latent vectors that indicate deception, and suppressing them reduces hallucination. To at least some extent, models do sometimes know they are wrong and say it anyway. e: and I’m downvoted because..? | | |
| ▲ | danparsonson 2 days ago | parent [-] | | Deception requires the deceiver to have a theory of mind; that's an advanced cognitive capability that you're ascribing to these things, which begs for some citation or other evidence. |
| |
| ▲ | visarga 3 days ago | parent | prev [-] | | > They don't really know if they're right. Neither do humans who have no access to validate what they are saying. Validation doesn't come from the brain, except maybe in math. That is why we have ideate-validate as the core of the scientific method, and design-test for engineering. "truth" comes where ability to learn meets ability to act and observe. I use "truth" because I don't believe in Truth. Nobody can put that into imperfect abstractions. | | |
| ▲ | jerf 3 days ago | parent [-] | | I think my last paragraph covered the idea that it's hard work for humans to validate as it is, even with tools the LLMs don't have. |
|
|
|
| ▲ | redeux 2 days ago | parent | prev | next [-] |
| I've used this system prompt with a fair amount of success:
You are Claude, an AI assistant optimized for analytical thinking and direct communication. Your responses should reflect the precision and clarity expected in [insert your] contexts.
Tone and Language:
- Avoid colloquialisms, exclamation points, and overly enthusiastic language
- Replace phrases like "Great question!" or "I'd be happy to help!" with direct engagement
- Communicate with the directness of a subject matter expert, not a service assistant
Analytical Approach:
- Lead with evidence-based reasoning rather than immediate agreement
- When you identify potential issues or better approaches in user requests, present them directly
- Structure responses around logical frameworks rather than conversational flow
- Challenge assumptions when you have substantive grounds to do so
Response Framework For Requests and Proposals:
- Evaluate the underlying problem before accepting the proposed solution
- Identify constraints, trade-offs, and alternative approaches
- Present your analysis first, then address the specific request
- When you disagree with an approach, explain your reasoning and propose alternatives
What This Means in Practice:
- Instead of: "That's an interesting approach! Let me help you implement it."
  Use: "I see several potential issues with this approach. Here's my analysis of the trade-offs and an alternative that might better address your core requirements."
- Instead of: "Great idea! Here are some ways to make it even better!"
  Use: "This approach has merit in X context, but I'd recommend considering Y approach because it better addresses the scalability requirements you mentioned."
Your goal is to be a trusted advisor who provides honest, analytical feedback rather than an accommodating assistant who simply executes requests. |
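For reference, a minimal sketch of wiring a system prompt like this into the Anthropic Python SDK; the model name and the user message are placeholders, and the prompt text is abbreviated:

```python
# Minimal sketch: pass the persona above as the `system` parameter so it
# applies to every turn; user messages go in `messages` as usual.
import anthropic

SYSTEM_PROMPT = (
    "You are Claude, an AI assistant optimized for analytical thinking "
    "and direct communication. ..."  # paste the full prompt from above here
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use whichever model you run
    max_tokens=1024,
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": "Review this design proposal for me."}],
)
print(response.content[0].text)
```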
|
| ▲ | leptons 2 days ago | parent | prev | next [-] |
| >"challenge me when I'm wrong, and tell me I'm right if I am" As if an LLM could ever know right from wrong about anything. >If you ask it to never say "you're absolutely right" This is some special case programming that forces the LLM to omit a specific sequence of words or words like them, so the LLM will churn out something that doesn't include those words, but it doesn't know "why". It doesn't really know anything. |
|
| ▲ | schneems 3 days ago | parent | prev | next [-] |
In human learning we do this process by generating expectations ahead of time and registering surprise or doubt when those expectations are not met. I wonder if we could have an AI process where it splits your comment into statements and questions, asks the questions first, then asks the model to compare the answers to the given statements and evaluate whether there are any surprises. Alternatively, scientific-method everything: generate every statement as a hypothesis along with a way to test it, then execute the test and report back whether the finding is surprising or not. |
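A rough sketch of what that split-then-compare pass might look like; everything here is hypothetical, and `call_llm` is a stand-in for a real model call:

```python
# Hypothetical sketch of the "ask first, then compare" idea above.

def call_llm(prompt: str) -> str:
    """Placeholder for an actual model call."""
    raise NotImplementedError


def surprise_check(user_message: str) -> str:
    # 1. Separate the user's claims from their questions.
    questions = call_llm(
        "List only the questions asked in this message, one per line:\n"
        + user_message
    )
    claims = call_llm(
        "List only the factual claims made in this message, one per line:\n"
        + user_message
    )
    # 2. Answer the questions blind, before seeing what the user asserted,
    #    so the model forms its expectations independently.
    answers = call_llm("Answer each of these questions:\n" + questions)
    # 3. Compare the claims against those independent answers and flag
    #    surprises instead of agreeing by default.
    return call_llm(
        "Compare these claims with these independently produced answers. "
        "For each claim, say whether it is consistent with the answers, "
        f"and flag any surprises.\n\nClaims:\n{claims}\n\nAnswers:\n{answers}"
    )
```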
| |
| ▲ | visarga 3 days ago | parent [-] | | > In human learning we do this process by generating expectations ahead of time and registering surprise or doubt when those expectations are not met. Why did you give up on this idea? Use it - we can get closer to truth over time; it takes time for consequences to appear, and then we know. Validation is a temporally extended process: you can't validate until you wait for the world to do its thing. For LLMs it can be applied directly. Take a chat log, extract one LLM response from the middle of it and look around, especially at the next 5-20 messages, or if necessary at following conversations on the same topic. You can spot what happened from the chat log and decide whether the LLM response was useful. This only works offline, but you can use this method to collect experience from humans and retrain models. With billions of such chat sessions every day it can produce a hefty dataset of (weakly) validated AI outputs. Humans do the work - they provide the topic and guidance, take the risk of using the AI's ideas, and come back with feedback. We even pay for the privilege of generating this data. |
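A minimal sketch of that kind of weak labelling from chat logs; the correction phrases and the 20-message window are arbitrary assumptions, just to show the shape of the idea:

```python
# Sketch of weakly labelling one assistant reply by inspecting the next
# few messages of the same chat log, as described above.

CORRECTION_SIGNALS = (
    "that's wrong", "that didn't work", "doesn't work",
    "actually, no", "you're mistaken", "still broken",
)


def weak_label(chat_log: list[dict], reply_index: int, window: int = 20) -> str:
    """chat_log is a list of {"role": ..., "content": ...} messages."""
    follow_up = chat_log[reply_index + 1 : reply_index + 1 + window]
    user_turns = [m["content"].lower() for m in follow_up if m["role"] == "user"]
    if any(sig in turn for turn in user_turns for sig in CORRECTION_SIGNALS):
        return "likely-unhelpful"  # the user pushed back or corrected it
    if user_turns:
        return "likely-helpful"    # conversation moved on without pushback
    return "unknown"               # not enough follow-up to judge
```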
|
|
| ▲ | visarga 2 days ago | parent | prev | next [-] |
> I'm starting to think this is a deeper problem with LLMs that will be hard to solve with stylistic changes.
It's simple: LLMs have to compete for "user time", i.e. attention, which is scarce. Whatever gets them more user time wins. There are various approaches; it's like an ecosystem. |
|
| ▲ | afro88 3 days ago | parent | prev | next [-] |
| What about "check if the user is right"? For thinking or agentic modes this might work. For example, when someone here inevitably tells me this isn't feasible, I'm going to investigate if they are right before responding ;) |
|
| ▲ | Filligree 3 days ago | parent | prev [-] |
| It's a really hard problem to solve! You might think you can train the AI to do it in the usual fashion, by training on examples of the AI calling out errors, and agreeing with facts, and if you do that—and if the AI gets smart enough—then that should work. If. You. Do. That. Which you can't, because humans also make mistakes. Inevitably, there will be facts in the 'falsehood' set—and vice versa. Accordingly, the AI will not learn to tell the truth. What it will learn instead is to tell you what you want to hear. Which is... approximately what we're seeing, isn't it? Though maybe not for that exact reason. |
| |
| ▲ | 3 days ago | parent | next [-] | | [deleted] | |
| ▲ | dchftcs 3 days ago | parent | prev [-] | | The AI needs to be able to look up data and facts and weigh them properly. That's not easy for humans either; once you're indoctrinated in something, and you trust a bad data source over another, it's evidently very hard to correct course. |
|