A Markov Chain trained by only a single article of text will very likely just regurgitate entire sentences straight from the source material. There just isn't enough variation in sentences.

But then, Markov Chains fall apart when the source material is very large. Try training a chain based on Wikipedia. You'll find that the resulting output becomes incoherent garbage. Increasing the context length may increase coherence, but at the cost of turning into just simple regurgitation.

In addition to the "attention" mechanism that another commenter mentioned, it's important to note that Markov Chains are discrete in their next token prediction while an LLM is more fuzzy. LLMs have latent space where the meaning of a word basically exists as a vector. LLMs will generate token sequences that didn't exist in the source material, whereas Markov Chains will ONLY generate sequences that existed in the source.

This is why it's impossible to create a digital assistant, or really anything useful, via Markov Chain. The fact that they only generate sequences that existed in the source mean that it will never come up with anything creative.

▲ thfuran 2 days ago | parent | next [-]

>Markov Chains will ONLY generate sequences that existed in the source.

A markov chain of order N will only generate sequences of length N+1 that were in the training corpus, but it is likely to generate sequences of length N+2 that weren't (unless N was too large for the training corpus and it's degenerate).

▲

Sohcahtoa82 2 days ago | parent | next [-]

Well yeah, but N+2 but the generation of the +2 loses the first part of N.

If you use a context window of 2, then yes, you might know that word C can follow words A and B, and D can follow words B and C, and therefore generate ABCD even if ABCD never existed.

But it could be that ABCD is incoherent.

For example, if A = whales, B = are, C = mammals, D = reptiles.

"Whales are mammals" is fine, "are mammals reptiles" is fine, but "Whales are mammals reptiles" is incoherent.

The longer you allow the chain to get, the more incoherent it becomes.

"Whales are mammals that are reptiles that are vegetables too".

Any 3-word fragment of that sentence is fine. But put it together, and it's an incoherent mess.

	▲	Y_Y 2 days ago \| parent [-]
		That are reptiles!

▲

Isamu 2 days ago | parent | prev | next [-]

Right, you can generate long sentences from a first-order markov model, and all of the transitions from one word to the next be in the training set but the full generated sentence may not.

▲

Jensson a day ago | parent | prev [-]

> A markov chain of order N will only generate sequences of length N+1 that were in the training corpus

Depends on how you trained it, an LLM is also a markov chain.

▲ johnisgood 2 days ago | parent | prev | next [-]

> The fact that they only generate sequences that existed in the source mean that it will never come up with anything creative.

I have seen the argument that LLMs can only give you what its been trained on, i.e. it will not be "creative" or "revolutionary", that it will not output anything "new", but "only what is in its corpus".

I am quite confused right now. Could you please help me with this?

Somewhat related: I like the work of David Hume, and he explains it quite well how we can imagine various creatures, say, a pig with a dragon head, even if we have not seen one ANYWHERE. It is because we can take multiple ideas and combine them together. We know how dragons typically look like, and we know how a pig looks like, and so, we can imagine (through our creativity and combination of these two ideas) how a pig with a dragon head would look like. I wonder how this applies to LLMs, if they even apply.

Edit: to clarify further as to what I want to know: people have been telling me that LLMs cannot solve problems that is not in their training data already. Is this really true or not?

▲ jldugger 2 days ago | parent | next [-]

Well, there's kind of two answers here:

1. To the extent that creativity is randomness, LLM inference samples from the token distribution at each step. It's possible (but unlikely!) for an LLM to complete "pig with" with the token sequence "a dragon head" just by random chance. The temperature settings commonly exposed control how often the system takes the most likely candidate tokens.

2. A markov chain model will literally have a matrix entry for every possible combination of inputs. So a 2 degree chain will have n^2 weights, where N is the number of possible tokens. In that situation "pig with" can never be completed with a brand new sentence, because those have literal 0's in the probability. In contrast, transformers consider huge context windows, and start with random weights in huge neural network matrices. What people hope happens is that the NN begins to represent ideas, and connections between them. This gives them a shot at passing "out of distribution" tests, which is a cornerstone of modern AI evaluation.

▲ thesz a day ago | parent | next [-]

  > A markov chain model will literally have a matrix entry for every possible combination of inputs.

The less frequent prefixes are usually pruned away and there is a penalty score to add to go to the shorter prefix. In the end, all words are included into the model's prediction and typical n-gram SRILM model is able to generate "the pig with dragon head," also with small probability.

Even if you think about Markov Chain information as a tensor (not matrix), the computation of probabilities is not a single lookup, but a series of folds.

▲ vrighter a day ago | parent | prev | next [-]

A markov chain model does not specify the implementation details of the function that takes a previous input (and only a previous input) and outputs a probability distribution. You could put all possible inputs into an llm (there's finitely many) and record the resulting output from each input in a table. "Temperature" is applied to the final output, not inside the function.

▲ theGnuMe a day ago | parent | prev | next [-]

You can have small epsilons instead of zeros.

▲

ruined a day ago | parent [-]

what, for all possible words?

▲

3eb7988a1663 a day ago | parent [-]

Instead of a naive dense matrix, you can use some implementation that allows sparsity. If element does not exist, gets a non-zero value which can still be sampled. Which theoretically enables all outputs.

	▲	ruined 8 hours ago \| parent [-]
		i think at that point it's definitionally not a markov chain anymore. how do you sample an open set of unknown values?

▲ otabdeveloper4 a day ago | parent | prev [-]

Re point 1: no, "temperature" is not an inherent property of LLM's.

The big cloud providers use the "temperature" setting because having the assistant repeat to you the exact same output sequence exposes the man behind the curtain and breaks suspension of disbelief.

But if you run the LLM yourself and you want the best quality output, then turning off "temperature" entirely makes sense. That's what I do.

(The downside is that the LLM can then, rarely, get stuck in infinite loops. Again, this isn't a big deal unless you really want to persist with the delusion that an LLM is a human-like assistant.)

	▲	czl 16 hours ago \| parent [-]
		I mostly agree with your intuition, but I’d phrase it a bit differently. Temperature 0 does not inherently improve “quality”. It just means you always pick the highest probability token at each step, so if you run the same prompt n times you will essentially get the same answer every time. That is great for predictability and some tasks like strict data extraction or boilerplate code, but “highest probability” is not always “best” for every task. If you use a higher temperature and sample multiple times, you get a set of diverse answers. You can then combine them, for example by taking the most common answer, cross checking details, or using one sample to critique another. This kind of self-ensemble can actually reduce hallucinations and boost accuracy for reasoning or open ended questions. In that sense, somewhat counterintuitively, always using temperature 0 can lead to lower quality results if you care about that ensemble style robustness. One small technical nit: even with temperature 0, decoding on a GPU is not guaranteed to be bit identical every run. Large numbers of floating point ops in parallel can change the order of additions and multiplications, and floating point arithmetic is not associative. Different kernel schedules or thread interleavings can give tiny numeric differences that sometimes shift an argmax choice. To make it fully deterministic you often have to disable some GPU optimizations or run on CPU only, which has a performance cost.

▲ withinboredom a day ago | parent | prev | next [-]

I’m working on a new type of database. There are parts I can use an LLM to help with, because they are common with other databases or software. Then there are parts it can’t help with, if I try, it just totally fails in subtle ways. I’ve provided it with the algorithm, but it can’t understand that it is a close variation of another algorithm and it shouldn’t implement the other algorithm. A practical example, is a variation of Paxos that only exists in a paper, but it consistently it will implement Paxos instead of this variation, no matter what you tell it.

Even if you point out that it implemented vanilla Paxos, it will just go “oh, you’re right, but the paper is wrong; so I did it like this instead”… the paper isn’t wrong, and instead of discussing the deviation before writing, it just writes the wrong thing.

▲ pama 2 days ago | parent | prev | next [-]

LLMs have the ability to learn certain classes of algorithms from their datasets in order to reduce errors when compressing their pretraining data. If you are technically inclined, read the reference: https://arxiv.org/abs/2208.01066 (optionally followup work) to see how llms can pick up complicated algorithms from training on examples that could have been generated by such algorithms (in one of the cases the LLM is better than anything we know; in the rest it is simply just as good as our best algos). Learning such functions from data would not work with Markov chains at any level of training. The LLMs in this study are tiny. They are not really learning a language, but rather how to perform regression.

	▲	thesz a day ago \| parent [-]
		Transformers are performing (soft, continuous) beam search inside them, the width of beam being not bigger than number of k-v pairs in attention mechanism. In my experience, having a Markov Chain to be equipped with the beam search greatly improve MC's predictive power, even if Markov Chain is ARPA 3-gram model, heavily pruned. What is more, Markov Chains are not restricted to immediate prefixes, you can use skip grams as well. How to use them and how to mix them into a list of probabilities is shown in the paper on Sparse Non-negative Matrix Language Modeling [1]. [1] https://aclanthology.org/Q16-1024/ I think I should look into that link of yours later. Have slimmed over it, I should say it... smells interesting at some places. For one example, decision trees learning is performed with greedy algorithm which, I believe, does not use oblique splits whereas transformers inherently learn oblique splits.

▲ koliber 2 days ago | parent | prev | next [-]

Here's how I see it, but I'm not sure how valid my mental model is.

Imagine a source corpus that consists of:

Cows are big. Big animals are happy. Some other big animals include pigs, horses, and whales.

A Markov chain can only return verbatim combinations. So it might return "Cows are big animals" or "Are big animals happy".

An LLM can get a sense of meaning in these words and can return ideas expressed in the input corpus. So in this case it might say "Pigs and horses are happy". It's not limited to responding with verbatim sequences. It can be seen as a bit more creative.

However, LLMs will not be able to represent ideas that it has not encountered before. It won't be able to come up with truly novel concepts, or even ask questions about them. Humans (some at least) have that unbounded creativity that LLMs do not.

▲

vidarh 2 days ago | parent | next [-]

> However, LLMs will not be able to represent ideas that it has not encountered before. It won't be able to come up with truly novel concepts, or even ask questions about them. Humans (some at least) have that unbounded creativity that LLMs do not.

There's absolutely no evidence to support this claim. It'd require humans to exceed the Turing computable, and we have no evidence that is possible.

▲

koliber 2 days ago | parent | next [-]

If you tell me that trees are big, and trees are made of hard wood, I as a human am capable of asking whether trees feel pain. I don't think what you said is false and I am not familiar with computational theory to be able to debate it. People occasionally have novel creative insights that do not derive from past experience or knowledge, and that is what I think of when I think of creativity.

Humans created novel concepts like writing literally out of thin air. I like how the book "Guns, Steels, and Germs" describes that novel creative process and contrasts it via a disseminative derivation process.

▲

vidarh 2 days ago | parent | next [-]

> People occasionally have novel creative insights that do not derive from past experience or knowledge, and that is what I think of when I think of creativity.

If they are not derived from past experience or knowledge, then unless humans exceed the Turing computable, they would need to be the result of randomness in one form or other. There's absolutely no reason why an LLM can not do that. The only reason a far "dumber" pure random number generator based string generator "can't" do that is because it would take too long to chance on something coherent, but it most certainly would keep spitting out novel things. The only difference is how coherent the novel things are.

▲

Jensson a day ago | parent [-]

> If they are not derived from past experience or knowledge

Every animal is born with intuition, you missed that part.

▲

vidarh a day ago | parent [-]

So knowledge encoded in the physical structure of the brain.

You're missing the part where unless there is unknown physics going on in the brain that breaks maths as me know it, there is no mechanism for a brain to exceed the Turing computable, in which case any Turing complete system is comptationally equivalent to it.

▲

a day ago | parent | next [-]

[deleted]

▲

arowthway a day ago | parent | prev | next [-]

Turing machines are deterministic, brain might not be because of quantum mechanics happening. Of course there is no proof that this is related to creativity.

▲

vidarh a day ago | parent [-]

Turing machines are deterministic if all their inputs are deterministic, which they do not need to be, and if we allow them to be. Indeed, by default, LLMs are by default not deterministic because we intentionally inject randomness.

	▲	arowthway a day ago \| parent [-]
		It doesn't mean we can accurately simulate the brain by swapping its source of nondeterminism with any other PRNG or TRNG. It might just so happen that to simulate ingenuity you have to simulate the universe first.

▲

johnisgood a day ago | parent | prev [-]

This Turing completeness equivalence is misleading. While all Turing-complete systems can theoretically compute the same class of functions, this says nothing about computational complexity, physical constraints, practical achievability in finite time, or the actual algorithms required. A Turing machine that can theoretically simulate a brain does not mean we know how to do it or that it is even feasible. This is like arguing that because weather systems and computers both follow physical laws, you should be able to perfectly simulate weather on your laptop.

Additionally, "No mechanism to exceed Turing computable" is a non-sequitur. Even granting that brains do not perform hypercomputation, this does not support your conclusion that artificial systems are "computationally equivalent" to brains in any practical sense. We would need: (1) complete understanding of brain algorithms, (2) the actual data/weights encoded in neural structures, (3) sufficient computational resources, and (4) correct implementation. None of these follow from Turing completeness alone, I believe.

More importantly, you completely dodged the actual point about intuition. Jensson's point is about evolutionary encoding vs. learned knowledge. Intuition represents millions of years of evolved optimization encoded in brain structure and chemistry. You acknowledge this ("knowledge encoded in physical structure") but then pivot to an irrelevant theoretical CS argument rather than addressing whether we can actually replicate such evolutionary knowledge in artificial systems.

Your original claim was "If they are not derived from past experience or knowledge" which creates a false dichotomy. Animals are born with innate knowledge encoded through evolutionary optimization. This is not learned from individual experience, yet it is still knowledge, specifically, it is millions of years of selection pressure encoded in neural architecture, reflexes, instincts, and cognitive biases.

So, for example: a newborn animal has never experienced a predator but knows to freeze or flee from certain stimuli. It has built-in heuristics for threat assessment, social behavior, spatial reasoning, and countless other domains that cost generations to develop through survival pressure.

Current AI systems lack this evolutionary substrate. They are trained on human data over weeks or months, not evolved over millions of years. We do not even know how to encode this type of knowledge artificially or even fully understand what knowledge is encoded in biological systems. Turing completeness does not bridge this gap any more than it bridges the gap between a Turing machine and actual weather.

Correct me if I'm misinterpreting your argument.

	▲	alansammarone 14 hours ago \| parent [-]
		I...I am very interested in this subject. There's a lot to unpack in your comment, but I think it's really pretty simple. > this does not support your conclusion that artificial systems are "computationally equivalent" to brains in any practical sense. You're making a point about engineering or practicality, and in that sense, you are absolutely correct. That's not the most interesting part of the question, however. > This is like arguing that because weather systems and computers both follow physical laws, you should be able to perfectly simulate weather on your laptop. Yes, that's exactly what I'd argue, and...hm.. yes, I think that's clearly true. Whether it takes 10 minutes or 10^100 minutes, 1~ or 10^100 human lifetimes to do so, it's irrelevant. Units (including human lifetimes) are arbitrary, and I think fundamental truths probably won't depend on such arbitrary things as how long a particular collection of atoms in a particular corner of the universe (i.e. humans) happens to be stable for. Ratios are closer to being fundamental, but I digress. To put it a different way - we think we know what the speed of light is. Traveling at v = 0.1c or at v = (1 - 10^(-100))c are equivalent in a fundamental sense, it's an engineering problem. Now, traveling at v = c...that's very different. That's interesting.

▲

c22 a day ago | parent | prev [-]

Wouldn't this insight derive from many past experiences of feeling pain yourself and the knowledge that others feel it too?

▲

somenameforme a day ago | parent | prev | next [-]

Turing computability is tangential to his claim, as LLMs are obviously not carrying out the breadth of all computable concepts. His claim can be trivially proven by considering the history of humanity. We went from a starting point of having literally no language whatsoever, and technology that would not have expanded much beyond an understanding of 'poke him with the pointy side'. And from there we would go on to discover the secrets of the atom, put a man on the Moon, and more. To say nothing of inventing language itself.

An LLM trained on this starting state of humanity is never going to do anything except remix basically nothing. It's never going to discover the secrets of the atom, or how to put a man on the Moon. Now whether any artificial device could achieve what humans did is where the question of computability comes into play, and that's a much more interesting one. But if we limit ourselves to LLMs, then this is very straight forward to answer.

▲

vidarh a day ago | parent [-]

> Turing computability is tangential to his claim, as LLMs are obviously not carrying out the breadth of all computable concepts

They don't need to. To be Turing complete a system including an LLM need to be able to simulate a 2-state 3-symbol Turing machine (or the inverse). Any LLM with a loop can satisfy that.

If you think Turing computability is tangential to this claim, you don't understand the implications of Turing computability.

> His claim can be trivially proven by considering the history of humanity.

Then show me a single example where humans demonstrably exceeding the Turing computable.

We don't even know any way for that to be possible.

▲

somenameforme a day ago | parent [-]

This is akin to claiming that a tic-tac-toe game is turing complete since after all we could simply just modify it to make it not a tic tac toe game. It's not exactly a clever argument.

And again there are endless things that seem to reasonably defy turing computability except when you assume your own conclusion. Going from nothing, not even language, to richly communicating, inventing things with no logical basis for such, and so is difficult to even conceive as a computable process unless again you simply assume that it must be computable. For a more common example that rapidly enters into the domain of philosophy - there is the nature of consciousness.

It's impossible to prove that such is Turing computable because you can't even prove consciousness exists. The only way I know it exists is because I'm most certainly conscious, and I assume you are too, but you can never prove that to me, anymore than I could ever prove I'm conscious to you. And so now we enter into the domain of trying to computationally imagine something which you can't even prove exists, it's all just a complete nonstarter.

-----

I'd also add here that I think the current consensus among those in AI is implicit agreement with this issue. If we genuinely wanted AGI it would make vastly more sense to start from as little as possible because it'd ostensibly reduce computational and other requirements by many orders of magnitude, and we could likely also help create a more controllable and less biased model by starting from a bare minimum of first principles. And there's potentially trillions of dollars for anybody that could achieve this. Instead, we get everything dumped into token prediction algorithms which are inherently limited in potential.

▲

vidarh a day ago | parent [-]

> This is akin to claiming that a tic-tac-toe game is turing complete since after all we could simply just modify it to make it not a tic tac toe game. It's not exactly a clever argument.

No, it is nowhere remotely like that. It is claiming that a machine capable of running a Turing machine is in fact capable of running any other Turing machine. In other words, it is pointing out the principle of Turing equivalence.

> And again there are endless things that seem to reasonably defy turing computability

Show us one. We have no evidence of any single one.

> It's impossible to prove that such is Turing computable because you can't even prove consciousness exists.

Unless you can show that humans exceeds the Turing computable, "consciousness" however you define it is either possible purely with a Turing complete system or can not affect the outputs of such a system. In either case this argument is irrelevant unless you can show evidence we exceed the Turing computable.

> I'd also add here that I think the current consensus among those in AI is implicit agreement with this issue. If we genuinely wanted AGI it would make vastly more sense to start from as little as possible because it'd ostensibly reduce computational and other requirements by many orders of magnitude, and we could likely also help create a more controllable and less biased model by starting from a bare minimum of first principles. And there's potentially trillions of dollars for anybody that could achieve this. Instead, we get everything dumped into token prediction algorithms which are inherently limited in potential.

This is fundamentally failing to engage with the argument. There is nothing in the argument that tells us anything about the complexity of a solution to AGI.

▲

somenameforme a day ago | parent [-]

LLMs are not capable of simulating turing machines - their output is inherently and inescapably probabilistic. You would need to fundamentally rewrite one to make this possible, at which point it is no longer an LLM.

And as I stated, you are assuming your own conclusion to debate the issue. You believe that nothing is incomputable, and are tying that assumption into your argument as an assumption. It's not on me to prove your assumption is wrong, it's on you to prove that it's correct - proving a negative is impossible. E.g. - I'm going to assume that there is an invisible green massless goblin on your shoulder named Kyzirgurankl. Prove me wrong. Can you give me even the slightest bit of evidence against it? Of course you cannot, yet absence of evidence is not evidence of absence, so the burden of my claim rests on me.

And so now feel free to prove that consciousness is computable, or even replicating humanity's successes from a comparable baseline. Without that proof you must understand that you're not making some falsifiable claim of fact, but simply appealing to your own personal ideology or philosophy, which is of course completely fine (and even a good thing), but also a completely subjective opinion on matters.

	▲	johnisgood 19 hours ago \| parent [-]
		After having read your comment, I feel I should have left my comment under this thread. I will just refer to it instead: https://news.ycombinator.com/item?id=46003870. This was my reply to your parent. I agree with you.

▲

Fargren a day ago | parent | prev [-]

You are making a big assumption here, which is that LLMs are the main "algorithm" that the human brain uses. The human brain can easily be a Turing machine, that's "running" something that's not an LLM. If that's the case, we can say that the fact that humans can come up with novel concept does not imply that LLMs can do the same.

▲

vidarh a day ago | parent [-]

No, I am not assuming anything about the structure of the human brain.

The point of talking about Turing completeness is that any universal Turing machine can emulate any other (Turing equivalence). This is fundamental to the theory of computation.

And since we can easily show that both can be rigged up in ways that makes the system Turing complete, for humans to be "special", we would need to be able to be more than Turing complete.

There is no evidence to suggest we are, and no evidence to suggest that is even possible.

▲

Fargren a day ago | parent [-]

An LLM is not a universal Turing machine, though. It's a specific family of algorithms.

You can't build an LLM that will factorize arbitrarily large numbers, even in infinite time. But a Turing machine can.

▲

vidarh a day ago | parent [-]

To make a universal Turing machine out of an LLM only requires a loop and the ability to make a model that will look up a 2x3 matrix of operations based on context and output operations to the context on the basis of them (the smallest Turing machine has 2 states and 3 symbols or the inverse).

So, yes, you can.

Once you have a (2,3) Turing machine, you can from that build a model that models any larger Turing machine - it's just a question of allowing it enough computation and enough layers.

It is not guaranteed that any specific architecture can do it efficiently, but that is entirely besides the point.

	▲	Fargren a day ago \| parent \| next [-]
		LLMs cannot loop (unless you have a counterexample?), and I'm not even sure they can do a lookup in a table with 100% reliability. They also have finite context, while a Turing machine can have infinite state.
	▲	johnisgood a day ago \| parent \| prev [-]
		Are you saying that LLMs are Turing complete or did I misunderstand it?

▲

marcellus23 2 days ago | parent | prev | next [-]

> A Markov chain can only return verbatim combinations. So it might return "Cows are big animals" or "Are big animals happy".

Just for my own edification, do you mean "Are big animals are happy"? "animals happy" never shows up in the source text so "happy" would not be a possible successor to "animals", correct?

	▲	koliber a day ago \| parent [-]
		Please forgive me. I am not a Markov chain.

▲

fragmede 2 days ago | parent | prev [-]

> However, LLMs will not be able to represent ideas that it has not encountered before.

Sure they do. We call them hallucinations and complain that they're not true, however.

▲

koliber 2 days ago | parent | next [-]

Hmmm. Didn't think about that.

In people there is a difference between unconscious hallucinations vs. intentional creativity. However, there might be situations where they're not distinguishable. In LLMs, it's hard to talk about intentionality.

I love where you took this.

	▲	gishh 2 days ago \| parent [-]
		A hallucination isn’t a creative new idea, it’s blatantly wrong information, provably. If an LLM had actual intellectual ability it could tell “us” how we can improve models. They can’t. They’re literally defined by their token count and they use statistics to generate token chains. They’re as creative as the most statistically relevant token chains they’ve been trained on by _people_ who actually used intelligence to type words on a keyboard.

▲

johnisgood 19 hours ago | parent | prev | next [-]

Hallucinations are not novel ideas. They are novel combinations of tokens constrained by learned probability distributions.

I have mentioned Hume before, and will do so again. You can combine "golden" and "mountain" without seeing a golden mountain, but you cannot conjure "golden" without having encountered something that gave you the concept.

LLMs may generate strings they have not seen, but those strings are still composed entirely from training-derived representations. The model can output "quantum telepathic blockchain" but each token's semantic content comes from training data. It is recombination, not creation. The model has not built representations of concepts it never encountered in training; it is just sampling poorly constrained combinations.

Can you distinguish between a false hallucination and a genuinely novel conceptual representation?

▲

anonzzzies a day ago | parent | prev [-]

Or, 10000000s times a day while coding all over the world and it hallucinating something it never saw before which turned out to be the thing needed.

▲ hugkdlief 2 days ago | parent | prev | next [-]

> we can imagine various creatures, say, a pig with a dragon head, even if we have not seen one ANYWHERE. It is because we can take multiple ideas and combine them together.

Funny choice of combination, pig and dragon, since Leonardo Da Vinci famously imagined dragons themselves by combining lizards and cats: https://i.pinimg.com/originals/03/59/ee/0359ee84595586206be6...

▲

johnisgood 2 days ago | parent [-]

Hah, interesting. Pig and dragon just sort of came to mind as I was writing the comment. :D But we can pretty much imagine anything, can't we? :)

I should totally try to generate images using AI with some of these prompts!

	▲	johnisgood 19 hours ago \| parent [-]
		FWIW, the results should be "good enough", considering they most likely have "pig" and "dragon" in their training data. I elaborated here on this: https://news.ycombinator.com/item?id=46006535.

▲ andoando 2 days ago | parent | prev | next [-]

That little quip from Hume has influenced my thinking so much Im happy to see it again

	▲	johnisgood a day ago \| parent [-]
		I agree, I love him and he has been a very influential person in my life. I started reading him from a very young age in my own language because his works in English were too difficult for me at the time. It is always nice to see someone mention him. FWIW I do not think he used the "pig with dragon head" example, it just came to my mind, but he did use an example similar to it when he was talking about creativity and the combining of ideas where there was a lack of impression (i.e. we have not actually seen one anywhere [yet we can imagine it]).

▲ astrange 2 days ago | parent | prev | next [-]

> Edit: to clarify further as to what I want to know: people have been telling me that LLMs cannot solve problems that is not in their training data already. Is this really true or not?

That is not true and those people are dumb. You may be on Bluesky too much.

If your training data is a bunch of integer additions and you lossily compress this into a model which rediscovers integer addition, it can now add other numbers. Was that in the training data?

▲

spookie a day ago | parent | next [-]

It was in the training data. There is implicit information in the way you present each addition. The context provided in the training data is what allows relationships to be perceived and modelled.

If you don't have that in your data you don't have the results.

▲

johnisgood a day ago | parent | prev | next [-]

I am not on Bluesky AT ALL. I have seen this argument here on HN, which is the only "social media" website I use.

▲

throwawaysoxjje a day ago | parent | prev [-]

I mean, you just said it was.

	▲	astrange 16 hours ago \| parent [-]
		It wasn't necessarily. You could redefine the "true meaning" of the training data such that it wasn't an addition operation but was actually some other one, with the same data, and then the generalization would be wrong. https://en.wikipedia.org/wiki/Gettier_problem

▲ umanwizard 2 days ago | parent | prev | next [-]

> I have seen the argument that LLMs can only give you what its been trained on, i.e. it will not be "creative" or "revolutionary", that it will not output anything "new", but "only what is in its corpus".

People who claim this usually don’t bother to precisely (mathematically) define what they actually mean by those terms, so I doubt you will get a straight answer.

	▲	DaiPlusPlus a day ago \| parent [-]
		How can anyone "mathematically" define "revolutionary"?

▲ franciscator 2 days ago | parent | prev | next [-]

Creativity need to be better defined. And the rest is a learning problem. If you keep on training, learning what you see ...

▲ dboreham 2 days ago | parent | prev | next [-]

> LLMs can only give you what its been trained on, i.e. it will not be "creative" or "revolutionary", that it will not output anything "new", but "only what is in its corpus

That's not true. Or at least it's only a true as for a human that read all the books in the world. That human only has seen that training data. But somehow it can come up with the Higgs Boson, or whatever.

	▲	coderatlarge 2 days ago \| parent [-]
		well the people who did the Higgs boson theory worked and re-worked for years all the prior work about elementary particles and arguably did a bunch of re-mixing of all the previous “there might be a new elementary particle here!” work until they hit on something that convinced enough peers that it could be validated in a real-world experiment. by which i mean to say that it doesn’t seem completely implausible that an llm could generate the first tentative papers in that general direction. perhaps one could go back and compute the likelihood of the first papers on the boson given only the corpus to date before it as researchers seem to be trying to do with the special relativity paper which is viewed as a big break with physics beforehand.

▲ sleepybrett 18 hours ago | parent | prev | next [-]

I think it's more about multidimensionality than anything

▲ godelski 2 days ago | parent | prev | next [-]

  > I have seen the argument that LLMs can only give you what its been trained

There's confusing terminology here and without clarification people talk past one another.

"What its been trained on" is a distribution. It can produce things from that distribution and only things from that distribution. If you train on multiple distributions, you get the union of the distribution, making a distribution.

This is entirely different from saying it can only reproduce samples which it was trained on. It is not a memory machine that is surgically piecing together snippets of memorized samples. (That would be a mind bogglingly impressive machine!)

A distribution is more than its samples. It is the things between too. Does the LLM perfectly capture the distribution? Of course not. But it's a compression machine so it compresses the distribution. Again, different from compressing the samples, like one does with a zip file.

So distributionally, can it produce anything novel? No, of course not. How could it? It's not magic. But sample wise can it produce novel things? Absolutely!! It would be an incredibly unimpressive machine if it couldn't and it's pretty trivial to prove that it can do this. Hallucinations are good indications that this happens but it's impossible to do on anything but small LLMs since you can't prove any given output isn't in the samples it was trained on (they're just trained on too much data).

  > people have been telling me that LLMs cannot solve problems that is not in their training data already. Is this really true or not?

Up until very recently most LLMs have struggled with the prompt

  Solve:
  5.9 = x + 5.11

This is certainly in their training distribution and has been for years, so I wouldn't even conclude that they can solve problems "in their training data". But that's why I said it's not a perfect model of the distribution.

  > a pig with a dragon head

One needs to be quite careful with examples as you'll have to make the unverifiable assumption that such a sample does not exist in the training data. With the size of training data this is effectively unverifiable.

But I would also argue that humans can do more than that. Yes, we can combine concepts, but this is a lower level of intelligence that is not unique to humans. A variation of this is applying a skill from one domain into another. You might see how that's pretty critical to most animals survival. But humans, we created things that are entirely outside nature require things outside a highly sophisticated cut and paste operation. Language, music, mathematics, and so much more are beyond that. We could be daft and claim music is simply cut and paste of songs which can all naturally be reproduced but that will never explain away the feelings or emotion that it produces. Or how we formulated the sounds in our heads long before giving them voice. There is rich depth to our experiences if you look. But doing that is odd and easily dismissed as our own familiarity deceives us into our lack of.

▲

XenophileJKO a day ago | parent | next [-]

The limit of a LLM "distribution" effectively is actually only at the token level though once the model has consumed enough language. Which is why those out of distribution tokens are so problematic.

From that point on the model can infer linguistics even on purely encountered words, concepts. I would even propose in context inferred meaning based on context, just like you would do.

It builds conceptual abstractions of MANY levels and all interrelated.

So imagine giving it a task like "design a car for a penguin to drive". The LLM can infer what kinda of input does a car need, what anatomy does a penguin have and it can wire it up descriptively. It is an easy task for an LLM. When you think about the other capabilities like introspection, and external state through observation (any external input), there really are not many fundamental limits on what they can do.

(Ignore image generation, it is an important distinction on how an image is made, end to end sequence vs. pure diffusion vs. hybrid.)

	▲	godelski 17 hours ago \| parent [-]
		I think you've confused some things. Pay careful note to what I'm calling a distribution. There are many distributions at play here but I'm referring to two specific ones that are clear from context. I think you've also made a leap in logic. The jury's still out on whether LLMs have internalized some world model or not. It's quite difficult to distinguish memorization from generalization. It's impossible to do when the "test set" is spoiled. You also need to remember that we train for certain attributes. Does the LLM actually have introspection or does it just appear that way because that's how it was optimized (which we definitely optimize it for that). Is there a difference? The duck test only lets us conclude something is probably a duck, not that it isn't a sophisticated animatronic that we just can't distinguish but someone or something else could.

▲

astrange 2 days ago | parent | prev [-]

> This is entirely different from saying it can only reproduce samples which it was trained on. It is not a memory machine that is surgically piecing together snippets of memorized samples. (That would be a mind bogglingly impressive machine!)

You could create one of those using both a Markov chain and an LLM.

https://arxiv.org/abs/2401.17377

	▲	godelski 16 hours ago \| parent [-]
		Though I enjoyed that paper, it's not quite the same thing. There's a bit more subtly to what I'm saying. To do a surgical patching you'd have to actually have a rich understanding of language but just not have the actual tools to produce words themselves. Think like the SciFi style robots that pull together clips or recordings to speak. Bumblebee from transformers might be the most well known example. But think hard about that because it requires a weird set of conditions and a high level of intelligence to perform the search and stitching. But speaking of Markov, we get that in LLMs through generation. We don't have conversations with them. Each chat is unique since you pass it the entire conversation. There's no memory. So the longer your conversations go the larger the token counts. That's Markovian ;)

▲ a day ago | parent | prev | next [-]

[deleted]

▲ thaumasiotes 2 days ago | parent | prev | next [-]

>> The fact that they only generate sequences that existed in the source

> I am quite confused right now. Could you please help me with this?

This is pretty straightforward. Sohcahtoa82 doesn't know what he's saying.

▲ Sohcahtoa82 2 days ago | parent [-]

I'm fully open to being corrected. Just telling me I'm wrong without elaborating does absolutely nothing to foster understanding and learning.

▲ thaumasiotes 2 days ago | parent [-]

If you still think there's something left to explain, I recommend you read your other responses. Being restricted to the training data is not a property of Markov output. You'd have to be very, very badly confused to think that it was. (And it should be noted that a Markov chain itself doesn't contain any training data, as is also true of an LLM.)

More generally, since an LLM is a Markov chain, it doesn't make sense to try to answer the question "what's the difference between an LLM and a Markov chain?" Here, the question is "what's the difference between a tiny LLM and a Markov chain?", and assuming "tiny" refers to window size, and the Markov chain has a similarly tiny window size, they are the same thing.

▲ astrange 2 days ago | parent | next [-]

An LLM is not a Markov chain of the input tokens, because it has internal computational state (the KV cache and residuals).

An LLM is a Markov process if you include its entire state, but that's a pretty degenerate definition.

▲

Jensson a day ago | parent [-]

> An LLM is a Markov process if you include its entire state, but that's a pretty degenerate definition.

Not any more degenerate than a multi word bag of words markov chain, its exactly the same concept: you input a context of words / tokens and get a new word / token, the things you mention there are just optimizations around that abstraction.

	▲	astrange a day ago \| parent [-]
		The difference is there are exponentially more states than an n-gram model. It's really not the same thing at all. An LLM can perform nearly arbitrary computation inside its fixed-size memory. https://arxiv.org/abs/2106.06981 (An LLM with tool use isn't a Markov process at all of course.)

▲ johnisgood 2 days ago | parent | prev | next [-]

He said LLMs are creative, yet people have been telling me that LLMs cannot solve problems that is not in their training data. I want this to be clarified or elaborated on.

▲ shagie 2 days ago | parent [-]

Make up a fanciful problem and ask it to solve it. For example, https://chatgpt.com/s/t_691f6c260d38819193de0374f090925a is unlikely to be found in the training data - I just made it up. Another example of wizards and witches and warriors and summoning... https://chatgpt.com/share/691f6cfe-cfc8-8011-b8ca-70e2c22d36... - I doubt that was in the training data either.

Make up puzzles of your own and see if it is able to solve it or not.

The blanket claim of "cannot solve problems that are not in its training data" seems to be something that can be disproven by making up a puzzle from your own human creativity and seeing if it can solve it - or for that matter, how it attempts to solve it.

It appears that there is some ability for it to reason about new things. I believe that much of this "an LLM can't do X" or "an LLM is parroting tokens that it was trained on" comes from trying to claim that all the material that it creates was created before, by a human and any use of an LLM is stealing from some human and thus unethical to use.

( ... and maybe if my block world or wizards and warriors and witches puzzle was in the training data somewhere, I'm unconsciously copying something somewhere else and my own use of it is unethical. )

▲ wadadadad 2 days ago | parent | next [-]

This is an interesting idea, but as you stated, it's all logic; it's hard to come up with an idea where you don't have to explain concepts yet still is dissimilar enough to be in the training.

In your second example with the wizards- did you notice that it failed to follow the rules? Step 3, the witch was summoned by the wizard. I'm curious as to why you didn't comment either way on this.

On a related note, instead of puzzles, what about presenting riddles? I would argue that riddles are creative, pulling bits and pieces of meaning from words to create an answer. If AI can solve riddles not seen before, would that count as creative and not solving problems in their dataset?

Here's one I created and presented (the first incorrect answer I got was Escape Room; I gave it 10 attempts and it didn't get the answer I was thinking of):

---

Solve the riddle:

Chaos erupts around

The shape moot

The goal is key

▲ shagie 2 days ago | parent [-]

The challenge is: for someone who is convinced that an LLM is only presenting material that they've seen before that was created by some human, how do you show them something that hasn't been seen before?

(Digging in old chats one from 2024 this one is amusing ... https://chatgpt.com/share/af1c12d5-dfeb-4c76-a74f-f03f48ce3b... was a fun one - epic rap battle between Paul Graham and Commander Taco )

Many people seem to believe that the LLM is not much more than a collage of words that it stole from other places and likewise images are a collage of images stolen from other people's pictures. (I've had people on reddit (which tends to be rather AI hostile outside of specific AI subs) downvote me for explaining how to use an LLM as an editor for your own writing or pointing out that some generative image systems are built on top of libraries where the company had rights (e.g. stock photography) to all the images)

With the wizards, I'm not interested necessarily in the correct solution, but rather how it did it and what the representation of the response was. I selected everything with 'W' to see how it handled identifying the different things.

As to riddles... that's really a question of mind reading. Your riddle isn't one that I can solve. Maybe if you told me the answer I'd understand how you got from the answer to the question, but I've got no idea how to go from the hint to a possible answer (does that make me an LLM?)

I feel its a question much more along some other classic riddles...

    “What have I got in my pocket?" he said aloud. He was talking to himself, but Gollum thought it was a riddle, and he was frightfully upset. "Not fair! not fair!" he hissed. "It isn't fair, my precious, is it, to ask us what it's got in its nassty little pocketsess?”

What do I have in my pocket? (and then a bit of "what would it do with that prompt?") https://chatgpt.com/s/t_691fa7e9b49081918a4ef8bdc6accb97

At this point, I'm much more of the opinion that some people are on "team anti-ai" and that it has become part of their identity to be against anything that makes use of AI to augment what a human can do unaided. Attempting to show that it's not a stochastic parrot or next token predictors (anymore than humans are) or that it can do things that help people (when used responsibly by the human) gets met with hostility.

I believe that this comes from the group identity and some of the things of group dynamics. https://gwern.net/doc/technology/2005-shirky-agroupisitsownw...

> The second basic pattern that Bion detailed is the identification and vilification of external enemies. This is a very common pattern. Anyone who was around the open source movement in the mid-1990s could see this all the time. If you cared about Linux on the desktop, there was a big list of jobs to do. But you could always instead get a conversation going about Microsoft and Bill Gates. And people would start bleeding from their ears, they would get so mad.

> ...

> Nothing causes a group to galvanize like an external enemy. So even if someone isn’t really your enemy, identifying them as an enemy can cause a pleasant sense of group cohesion. And groups often gravitate toward members who are the most paranoid and make them leaders, because those are the people who are best at identifying external enemies.

▲ wadadadad 21 hours ago | parent [-]

I don't think riddles are necessarily 'solvable' in that there's only one right answer; the very fact that they're open to interpretation, but when you get the 'right' answer it (hopefully) makes sense. So if an AI/LLM can answer such a nebulous thing correctly- that's more of the angle I was going at.

Regarding the wizards example, I'm a bit confused; I was thinking that the best way to judge answers for problem solving/creativity was for correctness. I'll think more on whether the 'thought process' counts in and of itself.

The answer to my riddle is 'ball'.

▲ shagie 17 hours ago | parent | next [-]

Perfect correctness is what you'd expect from a computer. I could write a program that solved it - and that would be an indication of my creativity as a human solving something that I haven't encountered before. Incidentally, that's also how it approached solving the block problem (by writing a program).

If you ask me the goat, wolf, cabbage problem I'd be able to recite (as an xkcd fan https://xkcd.com/1134/ and https://xkcd.com/2348/ and the exploration of what else it could do). However, if someone hasn't seen the problem before it could be a useful tool at seeing how they approach solving it.

The question of how does it tackle a new problem is one of creativity and exploration of thought in a new (untrained) domain.

A possible claim of "well, it's been trained on the meta-problem of how to solve problems that weren't in its training set" would get a side eye.

For the "ball" being the answer... consider the second response to https://chatgpt.com/share/6920b9e2-764c-8011-a14a-012e97573f... (make sure you click on the "Thought for 1m 5s" to get the internal process)

▲ johnisgood 19 hours ago | parent | prev [-]

How did you get "ball" from your riddle? I read it and I have no idea! :(

▲ shagie 16 hours ago | parent [-]

In my sibling comment, I linked the chat session where I prompted ChatGPT for possible answers and reasoning.

https://chatgpt.com/share/6920b9e2-764c-8011-a14a-012e97573f...

    Given the following riddle, identify the object to which it refers.
    #
    Chaos erupts around
    The shape moot
    The goal is key
    #
    Identify 10 different possible answers and identify the reasoning behind each guess and why it may or may not be correct.

The second item in the possible answers:

    Soccer ball
    Why it fits:
        “Chaos erupts around”: Players cluster and scramble around the ball; wherever it goes, chaos follows.
        “The shape moot”: Modern footballs vary in panel design and surface texture, but they must all be broadly spherical; to the game itself, variations in cosmetic shape are mostly irrelevant.
        “The goal is key”: Everyone’s objective is to get the ball into the goal.
    Why it might not be correct:
        The third line emphasizes the goal, which points more strongly to the scoring structure or concept of scoring rather than the ball.

▲ Ardren 2 days ago | parent | prev [-]

I think your example works, as it does try to solve a problem it hasn't seen (though it is very similar to existing problems).

... But, CharGPT makes several mistakes :-)

> Wizard Teleport: Wz1 teleports himself and Wz2 to Castle Beta. This means Wz1 has used his only teleport power.

Good.

> Witch Summon: From Castle Beta, Wi1 at Castle Alpha is summoned by Wz1. Now Wz1 has used his summon power.

Wizzard1 cannot summon.

> Wizard Teleport: Now, Wz2 (who is at Castle Beta) teleports back to Castle Alpha, taking Wa1 with him.

Warrior1 isn't at Castle beta

> Wizard Teleport: Wz2, from Castle Alpha, teleports with Wa2 to Castle Beta.

Wizzard2 has already teleported

▲ purple_turtle 2 days ago | parent | prev [-]

1) being restricted to exact matches in input is definition of Markov Chains

2) LLMs are not Markov Chains

▲

saithound 2 days ago | parent | next [-]

A Markov chain [1] is a discrete-time stochastic process, in which the value of each variable depends only on the value of the immediately preceding variable, and not any variables in the past.

LLMs are most definitely (discrete-time) Markov chains in this sense: the variables take their values in the context vectors, and the distribution of the new context window depends only on what was sampled previously context.

A Markov chain is a Markov chain, no matter how you implement it in a computer, whether as a lookup table, or an ordinary C function, or a one-layer neural net or a transformer.

LLMs and Markov text generators are technically both Markov chains, so some of the same math applies to both. But that's where the similarities end: e.g. the state space of an LLM is a context window, whereas the state space of a Markov text generator is usually an N-tuple of words.

And since the question here is how tiny LLMs differ from Markov text generators, the differences certainly matter here.

[1] https://en.wikipedia.org/wiki/Discrete-time_Markov_chain

▲

thaumasiotes 2 days ago | parent | prev [-]

> 1) being restricted to exact matches in input is definition of Markov Chains

Here's wikipedia:

> a Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

A Markov chain is a finite state machine in which transitions between states may have probabilities other than 0 or 1. In this model, there is no input; the transitions occur according to their probability as time passes.

> 2) LLMs are not Markov Chains

As far as the concept of "Markov chains" has been used in the development of linguistics, they are seen as a tool of text generation. A Markov chain for this purpose is a hash table. The key is a sequence of tokens (in the state-based definition, this sequence is the current state), and the value is a probability distribution over a set of tokens.

To rephrase this slightly, a Markov chain is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then for the following token you should choose t_1 with probability p_1, t_2 with probability p_2, etc...".

Then, to tie this back into the state-based definition, we say that when we choose token t_k, we emit that token into the output, and we also dequeue the first token from our representation of the state and enqueue t_k at the back. This brings us into a new state where we can generate another token.

A large language model is seen slightly differently. It is a function. The independent variable is a sequence of tokens, and the dependent variable is a probability distribution over a set of tokens. Here we say that the LLM answers the question "if the last N tokens of a fixed text were s_1, s_2, ..., s_N, what is the following token likely to be?".

Or, rephrased, the LLM is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then the following token will be t_1 with probability p_1, t_2 with probability p_2, etc...".

You might notice that these two tables contain the same information organized in the same way. The transformation from an LLM to a Markov chain is the identity transformation. The only difference is in what you say you're going to do with it.

▲

Sohcahtoa82 2 days ago | parent | next [-]

> Here we say that the LLM answers the question "if the last N tokens of a fixed text were s_1, s_2, ..., s_N, what is the following token likely to be?".

> Or, rephrased, the LLM is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then the following token will be t_1 with probability p_1, t_2 with probability p_2, etc...".

This is an oversimplication of an LLM.

The output layer of an LLM contains logits of all tokens in the vocabulary. This can mean every word, word fragment, punctuation mark, or whatever emoji or symbol it knows. Because the logits are calculated through a whole lot of floating point math, it's very likely that most results will be non-zero. Very close to zero, but still non-zero.

This means that gibberish options for the next token have non-zero probabilities of being chosen. The only reason they don't in reality is because of top-k sampling, temperature, and other filtering that's done on the logits before actually choosing a token.

If you present a s_1, s_2, ... s_N to a Markov Chain when that series was never seen by the chain, then the resulting probabilities is an empty set. But if you present them to an LLM, it gets fed into a neural network and eventually a set of logits comes out, and you can still choose the next token based on it.

▲

thaumasiotes a day ago | parent [-]

> This means that gibberish options for the next token have non-zero probabilities of being chosen. The only reason they don't in reality is because of top-k sampling, temperature, and other filtering that's done on the logits before actually choosing a token.

> If you present a s_1, s_2, ... s_N to a Markov Chain when that series was never seen by the chain

No, you're confused.

The chain has never seen anything. The Markov chain is a table of probability distributions. You can create it by any means you see fit. There is no such thing as a "series" of tokens that has been seen by the chain.

	▲	Sohcahtoa82 20 hours ago \| parent [-]
		> The chain has never seen anything. The Markov chain is a table of probability distributions. You can create it by any means you see fit. There is no such thing as a "series" of tokens that has been seen by the chain. When I talk about the chain "seeing" a sequence, I mean that the sequence existed in the material that was used to generate the probability table. My instinct is to believe that you know this, but are being needlessly pedantic. My point is that if you're using a context length of two, if you prompt a Markov Chain with "my cat", but the sequence "my cat was" never appeared in the training material, than a Markov Chain will never choose "was" as the next word. This property is not true for LLMs. If you prompt an LLM with "my cat", then "was" has a non-zero chance of being chosen as the next word, even if "my cat was" never appeared in the training material.

▲

purple_turtle 2 days ago | parent | prev [-]

Maybe technically LLM can be converted to an equivalent Markov Chain.

The problem is that even for modest context sizes the size of Markov Chain would by hilariously and monstrously large.

You may as well tell that LLM and a hash table is the same thing.

▲

a day ago | parent | next [-]

[deleted]

▲

thaumasiotes a day ago | parent | prev [-]

As I just mentioned in the comment you're responding to, the way you convert an LLM into an equivalent Markov chain is by doing nothing, since it already is one.

> You may as well tell that LLM and a hash table is the same thing.

No. You may as well say that a hash table and a function are the same thing. And this is in fact a common thing to say, because they are the same thing.

An LLM is a significantly more restricted object than a function is.

	▲	purple_turtle a day ago \| parent [-]
		> LLM is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then the following token will be t_1 with probability p_1, t_2 with probability p_2, etc...". No, LLM is not a lookup table for all possible inputs.

▲ Sohcahtoa82 2 days ago | parent | prev [-]

LLMs can absolutely create things that are creative, at least for some definition of "creative".

For example, I can ask an LLM to create a speech about cross-site scripting the style of Donald Trump:

> Okay, folks, we're talking about Cross-Site Scripting, alright? I have to say, it's a bit confusing, but let's try to understand it. They call it XSS, which is a fancy term. I don't really know what it means, but I hear it's a big deal in the tech world. People are talking about it, a lot of people, very smart people. So, Cross-Site Scripting. It's got the word "scripting" in it, which sounds like it's about writing, maybe like a script for a movie or something. But it's on the internet, on these websites, okay? And apparently, it's not good. I don't know exactly why, but it's not good. Bad things happen, they tell me. Maybe it makes the website look different, I don't know. Maybe it makes things pop up where they shouldn't. Could be anything! But here's what I do know. We need to do something about it. We need to get the best people, the smartest people, to look into it. We'll figure it out, folks. We'll make our websites safe, and we'll do it better than anyone else. Trust me, it'll be tremendous. Thank you.

Certainly there's no text out there that contains a speech about XSS from Trump. There's some snippets here and there that likely sound like Trump, but a Markov Chain simply is incapable of producing anything like this.

▲

0cf8612b2e1e 2 days ago | parent | next [-]

Sure that specific text does not exist, but the discrete tokens that went into it would have been.

If you similarly trained a Markov chain at the token level on a LLM sized corpus, it could make the same. Lacking an attention mechanism, the token probabilities would be terribly non constructive for the effort, but it is not impossible.

▲

Sohcahtoa82 2 days ago | parent [-]

Let's assume three things here:

1. The corpus contains every Trump speech.

2. The corpus contains everything ever written about XSS.

3. The corpus does NOT contain Trump talking about XSS, nor really anything that puts "Trump" and "XSS" within the same page.

A Markov Chain could not produce a speech about XSS in the style of Trump. The greatest tuning factor for a Markov Chain is the context length. A short length (like 2-4 words) produces incoherent results because it only looks at the last 2-4 words when predicting the next word. This means if you prompted the chain with "Create a speech about cross-site scripting the style of Donald Trump", then even with a 4-word context, all the model processes is "style of Donald Trump". But the time it reached the end of the prompt, it's already forgotten the beginning of it.

If you increase the context to 15, then the chain would produce nothing because "Create a speech about cross-site scripting in the style of Donald Trump" has never appeared in its corpus, so there's no data for what to generate next.

The matching in a Markov Chain is discrete. It's purely a mapping of (series of tokens) -> (list of possible next tokens). If you pass in a series of tokens that was never seen in the training set, then the list of possible next tokens is an empty set.

	▲	johnisgood 19 hours ago \| parent \| next [-]
		An LLM should be able to produce speech about XSS in the style of Trump though, assuming it knows enough about both "XSS" and "Trump", and that is sufficient.
	▲	0cf8612b2e1e a day ago \| parent \| prev [-]
		At the token, not word level, it would be possible for a Markov chain. It never has to know about Trump or XSS, only that it sees tokens like “ing”, “ed”, “is”, and so forth. Given a LLM size corpus, which will have ~all token-to-token pairs with some non-zero frequency, the above could be generated. The actual probabilities will be terrible, but it is not impossible.

▲

johnisgood 2 days ago | parent | prev [-]

Oh, of course, what I want answered did not have much to do with Markov Chain, but LLMs, because I saw this argument often against LLMs.

▲ vrighter a day ago | parent | prev | next [-]

Building a static table of seen inputs is just one way of building a state transition table for a markov chain. It is just an implementation detail of a function (in the mathematical sense of the word. no side effects) that takes in some input and outputs a probability distribution.

You could make the table bigger and fill in the rest yourself. Or you could use machine learning to do it. But then the table would be too huge to actually store due to combinatorial explosion. So we find a way to reduce that memory cost. How about we don't precompute the whole table and lazily evaluate individual cells as/when they are needed? You achieve that by passing it through the machine-learned function (a trained network is a fixed function with no side effects.) You might say but that's not the same thing!!! But remember that the learned network will always output the same distribution if given the same input, because it is a function. Let's say you have a context size of 1000 and have 10 possible tokens. There are 10^1000 possible inputs. A huge number, but most importantly, a finite number. So you could in theory feed them all in one at a time and record the result in a table. While we can't really do this in practice, because the resulting table would be huge. It is, for all mathematical purposes, equivalent and you could, in theory, freely transform one into the other.

Et voila! You have built a markov chain anyway. Previous input goes in, magic happens inside (whichever implementation you used to implement the function doesn't matter), and a probability distribution comes out. It's a markov chain. It doesn't quack and walk like a duck. It IS an actual duck.

▲ lukan a day ago | parent | prev | next [-]

"This is why it's impossible to create a digital assistant, or really anything useful, via Markov Chain. The fact that they only generate sequences that existed in the source mean that it will never come up with anything creative."

That reads like anything useful needs to be creative? I would disagree here. A digital assistent, in control of a automatic door for example should not be creative, but stupidly do exactly as told. "Open the door". "Close the door". And the "creativity" of KI agents I see rather as a danger here.

▲

cwillu a day ago | parent [-]

Most physical problems require some level of creativity: your door opening robot should be able to handle some level of dust on the handle, some level of slipperiness of the floor, some amount of packages blocking where it wants to stand, and crucially: some level of all of those things where it gives up rather than causing damage.

You can't open-loop everything, and the edge cases in a closed-loop absolutely explodes.

Yes, a “digital assistant“ responsible only for handling a door is manageable, but even “get a pot, fill it with water, and boil it” gets _remarkably_ complicated if you need reduce all the edge cases to known regions of behaviour that you can pre-program responses to.

	▲	lukan a day ago \| parent [-]
		Yes, but I want those sensor input and fail modes handled with deterministic classical algorithms without (LLM) creativity, otherwise I envision doors that refuse to open on certain dates, because somewhere in its training corpus was a satirical sci fi story about depressed smart doors. (And no, I was not talking about a robot butler, but a automatic door. Infrared triggered, but enhanced with a text mode for control)

▲ papyrus9244 2 days ago | parent | prev | next [-]

> This is why it's impossible to create a digital assistant, or really anything useful, via Markov Chain. The fact that they only generate sequences that existed in the source mean that it will never come up with anything creative.

Or, in other words, a Markov Chain won't hallucinate. Having a system that only repeats sentences from it's source material and doesn't create anything new on its own is quite useful on some scenarios.

	▲	Sohcahtoa82 2 days ago \| parent \| next [-]
		> Or, in other words, a Markov Chain won't hallucinate. It very much can. Remember, the context windows used for Markov Chains are usually very short, usually in the single digit numbers of words. If you use a context length of 5, then when asking it what the next word should be, it has no idea what the words were before the current context of 5 words. This results in incoherence, which can certainly mean hallucinations.
	▲	stavros a day ago \| parent \| prev \| next [-]
		A Markov chain certainly will not hallucinate, because we define hallucinations as garbage within otherwise correct output. A Markov chain doesn't have enough correct output to consider the mistakes "hallucinations", but in a sense that nothing is a hallucination when everything is one.
	▲	vrighter a day ago \| parent \| prev [-]
		You can very easily inject wrong information into the state transition function. And machine learning can and regularly does do so. That is not a difference between an LLM and a markov chain.

▲ dragonwriter 2 days ago | parent | prev | next [-]

> A Markov Chain trained by only a single article of text will very likely just regurgitate entire sentences straight from the source material.

Strictly speaking, this is true of one particular way (the most straightforward) to derive a Markov chain from a body of text; a Markov chain is just a probabilistic model of state transitions where the probability of each possible next state is dependent only on the current state. Having the states be word sequence of some number of words, overlapping by all but one word, and having the probabilities being simply the frequency with which the added word in the target state follows the sequence in the source state in the training corpus is one way you can derive a Markov chain from a body of text, but not the only one.

	▲	JPLeRouzic a day ago \| parent [-]
		> A Markov Chain trained by only a single article of text will very likely just regurgitate entire sentences straight from the source material. I see this as a strength; try training an LLM on a 42KB text to see if it can produce a coherent output.

▲ psychoslave 2 days ago | parent | prev | next [-]

> You'll find that the resulting output becomes incoherent garbage.

I also do that kind of things with LLM. The other day, I don't remember the prompt (something casual really, not trying to trigger any issue) but le chat mistral started to regurgitate "the the the the the...".

And this morning I was trying a some local models, trying to see if they could output some Esperanto. Well, that was really a mess of random morphs thrown together. Not syntactically wrong, but so out of touch with any possible meaningful sentence.

▲

lotyrin 2 days ago | parent [-]

Yeah, some of the failure modes are the same. This one in particular is fun because even a human, given "the the the" and asked to predict what's next will probably still answer "the". How a Markov chain starts the the train and how the LLM does are pretty different though.

▲

arowthway a day ago | parent | next [-]

I wonder if "X is not Y - its' Z" LLM shibboleth is just an artifact of "is not" being a third most common bigram starting with is, just after "is a" and "is the" [0]. It doesn't follow as simply as it does with markov chains, but maybe this is where the tendency originated, and later was trained and RLHFed into the shape that kind of makes sense instead of getting eliminated.

[0] https://books.google.com/ngrams/graph?content=is+*

▲

psychoslave a day ago | parent | prev [-]

I never saw any human starting to loop "the" as a reaction to any utterance though.

Personally my concern is more about the narrative that LLM are making "chain of thoughts", can "hallucinante" and that people should become "AI complement". They are definitely making nice inferences most of the time, but they are also totally different thing compared to human thoughts.

▲

stavros a day ago | parent [-]

I've definitely seen (and have myself) gotten stuck in phrase loops. We call it "stuttering".

▲

psychoslave a day ago | parent [-]

Good point, we human definitely have defects too. I’ll reflect on this, though this doesn’t make me consider that chip inferences to be fully analog to what happen in humans when they are thinking (or any animal/entity ongoing a thought process).

	▲	stavros a day ago \| parent [-]
		Yeah, fair. I'm just not generally a fan of the perfect world fallacy. Something might not be perfect, but it still might be as good as the alternative.

▲ vjerancrnjak 2 days ago | parent | prev | next [-]

If you learn with Baum Welch you can get nonzero ood probabilities.

Something like Markov Random Field is much better.

Not sure if anyone managed to create latent hierarchies from chars to words to concepts. Learning NNs is far more tinkery than brutality of probabilistic graphical models.

▲ anshumankmr a day ago | parent | prev | next [-]

>This is why it's impossible to create a digital assistant, or really anything useful, via Markov Chain. The fact that they only generate sequences that existed in the source mean that it will never come up with anything creative.

Its funny cause they say the same thing about LLMs (sort of)

▲ ssivark 2 days ago | parent | prev | next [-]

Uhhh... the above comment has a bunch of loose assertions that are not quite true, but with a enough truthiness that makes them hard to refute. So I'll point to my other comment for a more nuanced comparison of Markov models with tiny LLMs: https://news.ycombinator.com/item?id=45996794

	▲	nazgul17 a day ago \| parent [-]
		To add to this, the system offering text generation, i.e. the loop that builds the response one token at a time generated by a LLM (and at the same time feeds the LLM the text generated so far) is a Markov Model, where the transition matrix is replaced by the LLM, and the state space is the space of all texts.

▲ Kinrany a day ago | parent | prev [-]

Markov Chains could be applied on top of embeddings just as well though