I saw an article recently that talked about stringing likely inferences together but ending up with an unreliable outcome because enough 0.9 probabilities one after the other lead to an unlikely conclusion.

Edit: Couldn't find the article, but AI referenced Baysian "Chain of reasoning fallacy".

▲ godelski 7 days ago | parent | next [-]

I think you have this oversimplified. Stringing together inferences can take us in either direction. It really depends on how things are being done and this isn't always so obvious or simple. But just to show both directions I'll give two simple examples (real world holds many more complexities)

It is all about what is being modeled and how the inferences string together. If these are being multiplied, then yes, this is going to decreases as xy < x and xy < y for every x,y < 1.

But a good counter example is the classic Bayesian Inference example[0]. Suppose you have a test that detects vampirism with 95% accuracy (Pr(+|vampire) = 0.95) and has a false positive rate of 1% (Pr(+|mortal) = 0.01). But vampirism is rare, affecting only 0.1% of the population. This ends up meaning a positive test only gives us a 8.7% likelihood of a subject being a vampire (Pr(vampire|+). The solution here is that we repeat the testing. On our second test Pr(vampire) changes from 0.001 to 0.087 and Pr(vampire|+) goes to 89% and a third getting us to about 99%.

[0] Our equation is

                  Pr(+|vampire)Pr(vampire)
  Pr(vampire|+) = ------------------------
                           Pr(+)

And the crux is Pr(+) = Pr(+|vampire)Pr(vampire) + Pr(+|mortal)(1-Pr(vampire))

▲ p1necone 7 days ago | parent | next [-]

Worth noting that solution only works if the false positives are totally random, which is probably not true of many real world cases and would be pretty hard to work out.

▲ godelski 7 days ago | parent [-]

Definitely. Real world adds lots of complexities and nuances, but I was just trying to make the point that it matters how those inferences compound. That we can't just conclude that compounding inferences decreases likelihood

▲ Dylan16807 7 days ago | parent [-]

Well they were talking about a chain, A->B, B->C, C->D.

You're talking about multiple pieces of evidence for the same statement. Your tests don't depend on any of the previous tests also being right.

▲ godelski 7 days ago | parent [-]

Be careful with your description there, are you sure it doesn't apply to the Bayesian example (which was... illustrative...? And not supposed to be every possible example?)? We calculated f(f(f(x))), so I wouldn't say that this "doesn't depend on the previous 'test'". Take your chain, we can represent it with h(g(f(x))) (or (f∘g∘h)(x)). That clearly fits your case for when f=g=h. Don't lose sight of the abstractions.

▲ Dylan16807 7 days ago | parent [-]

So in your example you can apply just one test result at a time, in any order. And the more pieces of evidence you apply, the stronger your argument gets.

f = "The test(s) say the patient is a vampire, with a .01 false positive rate."

f∘f∘f = "The test(s) say the patient is a vampire, with a .000001 false positive rate."

In the chain example f or g or h on its own is useless. Only f∘g∘h is relevant. And f∘g∘h is a lot weaker than f or g or h appears on its own.

This is what a logic chain looks like, adapted for vampirism to make it easier to compare:

f: "The test says situation 1 is true, with a 10% false positive rate."

g: "If situation 1 then situation 2 is true, with a 10% false positive rate."

h: "If situation 2 then the patient is a vampire, with a 10% false positive rate."

f∘g∘h = "The test says the patient is a vampire, with a 27% false positive rate."

So there are two key differences. One is the "if"s that make the false positives build up. The other is that only h tells you anything about vampires. f and g are mere setup, so they can only weaken h. At best f and g would have 100% reliability and h would be its original strength, 10% false positive. The false positive rate of h will never be decreased by adding more chain links, only increased. If you want a smaller false positive rate you need a separate piece of evidence. Like how your example has three similar but separate pieces of evidence.

▲ godelski 7 days ago | parent [-]

Again, my only argument was that you can have both situations occur. We could still construct a f∘g∘h to increase probability if we want. I'm not saying it cannot go down, I'm saying there's no absolute rule you can follow.

▲ Dylan16807 7 days ago | parent [-]

I don't think you can make a chain of logic f∘g∘h where the probability of the combined function is higher than the probability of f or g or h on their own.

Chain of logic meaning that only the last function updates the probability you care about, and the preceeding functions give you intermediate information that is only useful to feed into the next function.

It is an absolute rule you can follow, as long as you're applying it the way it was intended, to a specific organization of functions. It's not any kind of combining, it's A->B->C->D combining. As opposed to multiple pieces that each independently imply D.

Just because you can use ∘ in both situations doesn't make them the same. Whether x∘y∘z is chaining depends on what x and y and z do. If all of them update the same probability, that's not chaining. If removing any of them would leave you with no information about your target probability, then it's chaining.

TL;DR: ∘ doesn't tell you if something is a chain, you're conflating chains with non-chains, the rule is useful when it comes to chains

▲ godelski 6 days ago | parent [-]

I'm not disagreeing with you. You understand that, right?

The parent was talking about stringing together inferences. My argument *was how you string them together matters*. That's all. I said "context matters."

I tried to reiterate this in my previous comment. So let's try one more time. Again, I'm not going to argue you're wrong. I'm going to argue that more context is needed to determine if likelihood increases or decreases. I need to stress this before moving on.

Let's go one more comment back, when I'm asking you if you're sure that this doesn't apply to the Bayesian case too. My point here was that, again, context matters. Are these dependent or independent? My whole point is that we don't know which direction things will go in without additional context. I __am not__ making the point that it always gets better like in the Bayesian example. The Bayesian case was _an example_. I also gave an example for the other case. So why focus on one of these and ignore the other?

  > ∘ doesn't tell you if something is a chain

∘ is the composition operator (at least in this context and you also interpreted it that way). So yes, yes it does. It is the act of chaining together functions. Hell, we even have "the chain rule" for this. Go look at the wiki if you don't believe me, or any calculus book. You can go into more math and you'll see the language change to use maps to specify the transition process.

  >  It's not any kind of combining, it's A->B->C->D combining.

Yes, yes it does. The *events* are independent but the *states* are dependent. Each test does not depend on the previous test, making the tests independent, but our marginal is! Hell, you see this in basic Markov Chains too. The decision process does not depend on other nodes in the chain but the state does. If you want to draw our Bayesian example as a chain you can do so. It's going to be really fucking big because you're going to need to calculate all potential outcomes making it both infinitely wide and infinitely deep, but you can. The inference process allows us to skip all those computations and lets us focus on only performing calculations for states we transition into.

Just ask yourself, how did you get to state B? *You drew arrows for a reason*. But arrows only tell us about a transition occurring, they do not tell us about that transition process. They lack context.

  > you're conflating chains with non-chains

No, you're being too strict in your definition of "chain". Which, brings us back to my first comment.

Look, we can still view both situations from the perspective of Markov Chains. We can speak about this with whatever language we want but if you want chains let's use something that is clearly a chain. Our classic MC is the easy case, right? Our state only depends on the previous state, right? P(x_{t}|x_{t-1}). Great, just like the Bayesian case (our state is dependent but our transition function is independent). So we can also have higher order MCs, depending on any n previous state. We can extend our transition function too. P(x_{t}|x_{t-1},...,x_0) = Q. We don't have to restrict ourselves to Q(x_{t-1}), we can do whatever the hell we want. In fact, our simple MC process is going to be equivalent to Q(x_{t-1},...,x_0) it is just that nothing ends up contributing except for that x_{t-1}. The process is still the same, but the context matters.

  >  It's not any kind of combining, it's A->B->C->D combining. ***As opposed to multiple pieces that each independently imply D.***

This tells me you drew your chain wrong. If multiple things are each contributing to D independently then that is not A->B->C->D (or as you wrote the first time: `A->B, B->C, C->D`, which is equivalent!) you instead should have written something like A -> C <- B. Or using all 4 letters

       B
       |
       v
  A -> D <- C

These are completely different things! This is not a sequential process. This is not (strictly) composition.

And yet, again, we still do not know if these are decreasing. They will decrease if A,B,C,D ∈ ℙ AND our transition functions are multiplicative (∏ x_i < x_j ∀ j ; where x_i ∈ ℙ), but this will not happen if the transition function is additive (∑ x_i ≥ x_j ∀ j ; where x_i ∈ ℙ)

We are still entirely dependent upon context.

Now, we're talking about LLMs, right? Your conversation (and CoT) is much closer to the Bayesian case than our causal DAG with dependence. Yes, the messages in the conversation transition us through states, but the generation is independent. The prompt and context lengthen, but this is not the same thing as the events being dependent. The LLM response is an independent event. Like the BI case the state has changed, but the generation event is identical (i.e. independent). We don't care how we got to the current state! You don't need to have the conversation with the LLM. Every inference from the LLM is independent, even if the state isn't. The inference only depends on the tokens currently in context. Assuming you turn on deterministic mode (setting seeds identically), you could generate an identical output by passing the conversation (and properly formatting) into a brand new fresh prompt. That shows that the dependence is on state, not inference. Just like our Bayesian example you'd generate the same output if you start from the same state. The independence is because we don't care how we got to that state, only that we are at that state (same with simple MCs). There are added complexities that can change this but we can't go there if we can't get to this place first. We'd need to have this clear before we can add complexities like memory and MoEs because the answer only gets more nuanced.

So again, our context really matters here and the whole conversation is about how these subtleties matter. The question was, if those errors compound. I hope you see that that's not so simple to answer. *Personally*, I'm pretty confident they will in current LLMs, because they rely far too heavily on their prompting (it'll give you incorrect answers if you prime it that way despite being able to give correct answers with better prompting) but this isn't a necessary condition now, is it?

TLDR: We can't determine if likelihood increases or decreases without additional context

▲ Dylan16807 6 days ago | parent [-]

I'll try to keep this simple.

> I'm not disagreeing with you. You understand that, right?

We disagree about whether context can make a difference, right?

> The parent was talking about stringing together inferences. My argument was how you string them together matters. That's all. I said "context matters."

> TLDR: We can't determine if likelihood increases or decreases without additional context

The situations you describe where inference acts differently do not fall under the "stringing together"/"chaining" they were originally talking about. Context never makes their original statement untrue. Chaining always makes evidence weaker.

To be extra clear, it's not about whether the evidence pushes your result number up or down, it's that the likelihood of the evidence itself being correct drops.

> It is the act of chaining together functions.

They were not talking about whether something is composition or not. When they said "string" and "chain" they were talking about a sequence of inferences where each one leads to the next one.

Composition can be used in a wide variety of contexts. You need context to know if composition weakens or strengthens arguments. But you do not need context to know if stringing/chaining weakens or strengthens.

> No, you're being too strict in your definition of "chain".

No, you're being way too loose.

> This tells me you drew your chain wrong. If multiple things are each contributing to D independently then that is not A->B->C->D

??? Of course those are different. That's why I wrote "as opposed to".

> I also gave an example for the other case. So why focus on one of these and ignore the other?

I'm focused on the one you called a "counter example" because I'm arguing it's not an example.

If you specifically want me to address "If these are being multiplied, then yes, this is going to decreases as xy < x and xy < y for every x,y < 1." then yes that's correct. I never doubted your math, and everyone agrees about that one.

TL;DR:

At this point I'm mostly sure we're only disagreeing about the definition of stringing/chaining? If yes, oops sorry I didn't mean to argue so much about definitions. If not, then can you give me an example of something I would call a chain where adding a step increases the probability the evidence is correct?

And I have no idea why you're talking about LLMs.

▲ godelski 6 days ago | parent [-]

  > I'm mostly sure we're only disagreeing about the definition of stringing/chaining?

Correct.

  > No, you're being way too loose.

Okay, instead of just making claims and for me to trust you, go point to something concrete. I've even tried to google, but despite my years of study in statistics, metric theory, and even mathematical logic I'm at a loss in finding your definition.

I'm aware of the Chain Rule of Probability, but this isn't the only type you will find the term "chain" in statistics. Hell, the calculus Chain Rule is still used there too! So forgive me for being flustered but you are literally arguing to me that a Markov Chain isn't a chain. Maybe I'm having a stroke, but I'm pretty sure the word "chain" is in Markov Chain.

▲ Dylan16807 6 days ago | parent [-]

> Okay, instead of just making claims and for me to trust you, go point to something concrete. I've even tried to google, but despite my years of study in statistics, metric theory, and even mathematical logic I'm at a loss in finding your definition.

Let's look again at what we're talking about:

>>> I think it’s that people tend to build up “logical” conclusions where they think each step is a watertight necessity that follows inevitably from its antecedents, but actually each step is a little bit leaky, leading to runaway growth in false confidence.

>> As a former mechanical engineer, I visualize this phenomenon like a "tolerance stackup". Effectively meaning that for each part you add to the chain, you accumulate error.

> I saw an article recently that talked about stringing likely inferences together but ending up with an unreliable outcome because enough 0.9 probabilities one after the other lead to an unlikely conclusion.

> Edit: Couldn't find the article, but AI referenced Baysian "Chain of reasoning fallacy".

The only term in there you could google is "tolerance stackup". The rest is people making ad-hoc descriptions of things. Except for "Chain of reasoning fallacy" which is a fake term. So I'm not surprised you didn't find anything in google, and I can't provide you anything from google. There is nothing "concrete" to ask for when it comes to some guy's ad-hoc description, you just have to read it and do your best.

And everything I said was referring back to those posts, primarily the last one by robocat. I was not introducing anything new when I used the terms "string" and "chain". I was not referring to any scientific definitions. I was only talking about the concept described by those three posts.

Looking back at those posts, I will confidently state that the concept they were talking about does not include markov chains. You're not having a stroke, it's just a coincidence that the word "chain" can be used to mean multiple things.

▲ godelski 6 days ago | parent [-]

I googled YOUR terms. And if you read my messages you'd notice that I'm not a novice when it comes to math. Hell, you should have gotten that from my very first comment. I was never questioning if I had a stroke, I was questioning your literacy.

  > I was not referring to any scientific definitions.

Yet, you confidently argued against ones that were stated.

If you're going to speak out your ass, at least have the decency to let everyone know first.

	▲	Dylan16807 6 days ago \| parent [-]
		They were never my terms. They were the terms from the people that were having a nice conversation before you interrupted. You told them they were wrong, that it could go either way. That's not true. What they were talking about cannot go either way. You were never talking about the same thing as them. I gave you the benefit of the doubt by thinking you were trying to talk about the same thing as them. Apparently I shouldn't have. You can't win this on definitions. They were talking about a thing without using formal definitions, and you replied to them with your own unrelated talk, as if it was what they meant. No. You don't get to change what they meant. That's why I argued against your definition. Your definition is lovely in some other conversation. Your definition is not what they meant, and cannot override what they meant.

▲ wombatpm 7 days ago | parent | prev | next [-]

Can’t you improve thing if you can calibrate with a known good vampire? You’d think NIST or the CDC would have one locked in a basement somewhere.

	▲	godelski 7 days ago \| parent \| next [-]
		IDK, probably? I'm just trying to say that iterative inference doesn't strictly mean decreasing likelihood. I'm not a virologist or whoever designs these kinds of medical tests. I don't even know the right word to describe the profession lol. But the question is orthogonal to what's being discussed here. I'm only guessing "probably" because usually having a good example helps in experimental design. But then again, why wouldn't the original test that we're using have done that already? Wouldn't that be how you get that 95% accurate test? I can't tell you the biology stuff, I can just answer math and ML stuff and even then only so much.
	▲	weard_beard 7 days ago \| parent \| prev \| next [-]
		GPT6 would come faster but we ran out of Casandra blood.
	▲	ethbr1 7 days ago \| parent \| prev [-]
		The thought of a BIPM Reference Vampire made me chuckle.

▲ tintor 7 days ago | parent | prev [-]

Assuming your vampire tests are independent.

	▲	godelski 7 days ago \| parent [-]
		Correct. And there's a lot of other assumptions. I did make a specific note that it was a simplified and illustrative example. And yes, in the real world I'd warn about being careful when making i.i.d. assumptions, since these assumptions are made far more than people realize.

▲ 7 days ago | parent | prev [-]

[deleted]