thaumasiotes 2 days ago

>> The fact that they only generate sequences that existed in the source

> I am quite confused right now. Could you please help me with this?

This is pretty straightforward. Sohcahtoa82 doesn't know what he's saying.

Sohcahtoa82 2 days ago | parent [-]

I'm fully open to being corrected. Just telling me I'm wrong without elaborating does absolutely nothing to foster understanding and learning.

thaumasiotes 2 days ago | parent [-]

If you still think there's something left to explain, I recommend you read your other responses. Being restricted to the training data is not a property of Markov output. You'd have to be very, very badly confused to think that it was. (And it should be noted that a Markov chain itself doesn't contain any training data, as is also true of an LLM.)

More generally, since an LLM is a Markov chain, it doesn't make sense to try to answer the question "what's the difference between an LLM and a Markov chain?" Here, the question is "what's the difference between a tiny LLM and a Markov chain?", and assuming "tiny" refers to window size, and the Markov chain has a similarly tiny window size, they are the same thing.

astrange 2 days ago | parent | next [-]

An LLM is not a Markov chain of the input tokens, because it has internal computational state (the KV cache and residuals).

An LLM is a Markov process if you include its entire state, but that's a pretty degenerate definition.

Jensson a day ago | parent [-]

> An LLM is a Markov process if you include its entire state, but that's a pretty degenerate definition.

Not any more degenerate than a multi-word bag-of-words Markov chain; it's exactly the same concept: you input a context of words / tokens and get a new word / token. The things you mention there are just optimizations around that abstraction.
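
A minimal sketch of that shared abstraction in Python (the interface and names are illustrative, not any real library's API): both a word-level Markov text generator and an LLM can sit behind the same "context in, next-token distribution out" function, and the sampling loop around it is identical.

    import random

    def generate(next_dist, context, steps):
        # next_dist is any callable mapping a tuple of tokens to a dict of
        # {token: probability}; it could be backed by a lookup table or by
        # a neural network. The loop does not care which.
        out = list(context)
        for _ in range(steps):
            dist = next_dist(tuple(out))
            if not dist:
                break
            tokens, weights = zip(*dist.items())
            out.append(random.choices(tokens, weights=weights)[0])
        return out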

astrange a day ago | parent [-]

The difference is that there are exponentially more states than in an n-gram model. It's really not the same thing at all. An LLM can perform nearly arbitrary computation inside its fixed-size memory.

https://arxiv.org/abs/2106.06981

(An LLM with tool use isn't a Markov process at all of course.)
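
To put rough, purely illustrative numbers on "exponentially more states" (the vocabulary and window sizes below are assumptions, not any particular model's):

    import math

    vocab = 50_000        # assumed vocabulary size
    ngram_window = 2      # a small word-level Markov text generator
    llm_window = 8_192    # an assumed LLM context length, in tokens

    ngram_states = vocab ** ngram_window                  # 2.5e9 possible 2-word contexts
    llm_digits = int(llm_window * math.log10(vocab)) + 1  # digits in vocab ** llm_window

    print(f"{ngram_states:.3e}")                          # 2.500e+09
    print(f"~10^{llm_digits - 1} possible contexts for the LLM")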

johnisgood 2 days ago | parent | prev | next [-]

He said LLMs are creative, yet people have been telling me that LLMs cannot solve problems that are not in their training data. I want this to be clarified or elaborated on.

shagie 2 days ago | parent [-]

Make up a fanciful problem and ask it to solve it. For example, https://chatgpt.com/s/t_691f6c260d38819193de0374f090925a is unlikely to be found in the training data - I just made it up. Another example of wizards and witches and warriors and summoning... https://chatgpt.com/share/691f6cfe-cfc8-8011-b8ca-70e2c22d36... - I doubt that was in the training data either.

Make up puzzles of your own and see whether it is able to solve them or not.

The blanket claim of "cannot solve problems that are not in its training data" seems to be something that can be disproven by making up a puzzle from your own human creativity and seeing if it can solve it - or for that matter, how it attempts to solve it.

It appears that there is some ability for it to reason about new things. I believe that much of this "an LLM can't do X" or "an LLM is parroting tokens that it was trained on" comes from trying to claim that all the material it creates was created before by a human, and that any use of an LLM is therefore stealing from some human and thus unethical.

( ... and maybe if my block world or wizards and warriors and witches puzzle was in the training data somewhere, I'm unconsciously copying something somewhere else and my own use of it is unethical. )

wadadadad 2 days ago | parent | next [-]

This is an interesting idea, but as you stated, it's all logic; it's hard to come up with an idea that doesn't require explaining concepts yet is still dissimilar enough not to appear in the training data.

In your second example with the wizards, did you notice that it failed to follow the rules? In step 3, the witch was summoned by the wizard. I'm curious as to why you didn't comment either way on this.

On a related note, instead of puzzles, what about presenting riddles? I would argue that riddles are creative, pulling bits and pieces of meaning from words to create an answer. If an AI can solve riddles it hasn't seen before, would that count as being creative rather than solving problems from its training data?

Here's one I created and presented (the first incorrect answer I got was Escape Room; I gave it 10 attempts and it didn't get the answer I was thinking of):

---

Solve the riddle:

Chaos erupts around

The shape moot

The goal is key

shagie 2 days ago | parent [-]

The challenge is: for someone who is convinced that an LLM is only presenting material that they've seen before that was created by some human, how do you show them something that hasn't been seen before?

(Digging through old chats, this one from 2024 was a fun one ... https://chatgpt.com/share/af1c12d5-dfeb-4c76-a74f-f03f48ce3b... - an epic rap battle between Paul Graham and Commander Taco.)

Many people seem to believe that the LLM is not much more than a collage of words that it stole from other places, and likewise that generated images are a collage of images stolen from other people's pictures. (I've had people on reddit, which tends to be rather AI-hostile outside of specific AI subs, downvote me for explaining how to use an LLM as an editor for your own writing, or for pointing out that some generative image systems are built on top of libraries where the company had rights to all the images, e.g. stock photography.)

With the wizards, I'm not necessarily interested in the correct solution, but rather in how it did it and what the representation of the response was. I named everything with 'W' to see how it handled identifying the different things.

As to riddles... that's really a question of mind reading. Your riddle isn't one that I can solve. Maybe if you told me the answer I'd understand how you got from the answer to the question, but I've got no idea how to go from the hint to a possible answer (does that make me an LLM?)

I feel it's a question much more along the lines of some other classic riddles...

    “What have I got in my pocket?" he said aloud. He was talking to himself, but Gollum thought it was a riddle, and he was frightfully upset. "Not fair! not fair!" he hissed. "It isn't fair, my precious, is it, to ask us what it's got in its nassty little pocketsess?”
What do I have in my pocket? (and then a bit of "what would it do with that prompt?") https://chatgpt.com/s/t_691fa7e9b49081918a4ef8bdc6accb97

At this point, I'm much more of the opinion that some people are on "team anti-AI" and that it has become part of their identity to be against anything that makes use of AI to augment what a human can do unaided. Attempting to show that it isn't a stochastic parrot or a next-token predictor (any more than humans are), or that it can do things that help people (when used responsibly by the human), gets met with hostility.

I believe that this comes from the group identity and some of the things of group dynamics. https://gwern.net/doc/technology/2005-shirky-agroupisitsownw...

> The second basic pattern that Bion detailed is the identification and vilification of external enemies. This is a very common pattern. Anyone who was around the open source movement in the mid-1990s could see this all the time. If you cared about Linux on the desktop, there was a big list of jobs to do. But you could always instead get a conversation going about Microsoft and Bill Gates. And people would start bleeding from their ears, they would get so mad.

> ...

> Nothing causes a group to galvanize like an external enemy. So even if someone isn’t really your enemy, identifying them as an enemy can cause a pleasant sense of group cohesion. And groups often gravitate toward members who are the most paranoid and make them leaders, because those are the people who are best at identifying external enemies.

wadadadad 21 hours ago | parent [-]

I don't think riddles are necessarily 'solvable' in the sense that there's only one right answer; they're open to interpretation, but when you get the 'right' answer it (hopefully) makes sense. So if an AI/LLM can answer such a nebulous thing correctly, that's more the angle I was going for.

Regarding the wizards example, I'm a bit confused; I was thinking that the best way to judge answers for problem solving/creativity was by correctness. I'll think more on whether the 'thought process' counts in and of itself.

The answer to my riddle is 'ball'.

shagie 17 hours ago | parent | next [-]

Perfect correctness is what you'd expect from a computer. I could write a program that solved it - and that would be an indication of my creativity as a human solving something that I haven't encountered before. Incidentally, that's also how it approached solving the block problem (by writing a program).

If you ask me the goat, wolf, and cabbage problem, I'd be able to recite the answer (as an xkcd fan: https://xkcd.com/1134/ and https://xkcd.com/2348/ and the exploration of what else it could do). However, if someone hasn't seen the problem before, it could be a useful tool for seeing how they approach solving it.

The question of how does it tackle a new problem is one of creativity and exploration of thought in a new (untrained) domain.

A possible claim of "well, it's been trained on the meta-problem of how to solve problems that weren't in its training set" would get a side eye.

For the "ball" being the answer... consider the second response to https://chatgpt.com/share/6920b9e2-764c-8011-a14a-012e97573f... (make sure you click on the "Thought for 1m 5s" to get the internal process)

johnisgood 19 hours ago | parent | prev [-]

How did you get "ball" from your riddle? I read it and I have no idea! :(

shagie 16 hours ago | parent [-]

In my sibling comment, I linked the chat session where I prompted ChatGPT for possible answers and reasoning.

https://chatgpt.com/share/6920b9e2-764c-8011-a14a-012e97573f...

    Given the following riddle, identify the object to which it refers.
    #
    Chaos erupts around
    The shape moot
    The goal is key
    #
    Identify 10 different possible answers and identify the reasoning behind each guess and why it may or may not be correct.
The second item in the possible answers:

    Soccer ball
    Why it fits:
        “Chaos erupts around”: Players cluster and scramble around the ball; wherever it goes, chaos follows.
        “The shape moot”: Modern footballs vary in panel design and surface texture, but they must all be broadly spherical; to the game itself, variations in cosmetic shape are mostly irrelevant.
        “The goal is key”: Everyone’s objective is to get the ball into the goal.
    Why it might not be correct:
        The third line emphasizes the goal, which points more strongly to the scoring structure or concept of scoring rather than the ball.
Ardren 2 days ago | parent | prev [-]

I think your example works, as it does try to solve a problem it hasn't seen (though it is very similar to existing problems).

... But ChatGPT makes several mistakes :-)

> Wizard Teleport: Wz1 teleports himself and Wz2 to Castle Beta. This means Wz1 has used his only teleport power.

Good.

> Witch Summon: From Castle Beta, Wi1 at Castle Alpha is summoned by Wz1. Now Wz1 has used his summon power.

Wizard1 cannot summon.

> Wizard Teleport: Now, Wz2 (who is at Castle Beta) teleports back to Castle Alpha, taking Wa1 with him.

Warrior1 isn't at Castle Beta.

> Wizard Teleport: Wz2, from Castle Alpha, teleports with Wa2 to Castle Beta.

Wizard2 has already teleported.

purple_turtle 2 days ago | parent | prev [-]

1) being restricted to exact matches in the input is the definition of Markov Chains

2) LLMs are not Markov Chains

saithound 2 days ago | parent | next [-]

A Markov chain [1] is a discrete-time stochastic process in which the value of each variable depends only on the value of the immediately preceding variable, and not on any variables further in the past.

LLMs are most definitely (discrete-time) Markov chains in this sense: the variables take their values in the space of context windows, and the distribution of the new context window depends only on the previously sampled one.

A Markov chain is a Markov chain, no matter how you implement it in a computer, whether as a lookup table, or an ordinary C function, or a one-layer neural net or a transformer.

LLMs and Markov text generators are technically both Markov chains, so some of the same math applies to both. But that's where the similarities end: e.g. the state space of an LLM is a context window, whereas the state space of a Markov text generator is usually an N-tuple of words.

And since the question here is how tiny LLMs differ from Markov text generators, the differences certainly matter here.

[1] https://en.wikipedia.org/wiki/Discrete-time_Markov_chain

thaumasiotes 2 days ago | parent | prev [-]

> 1) being restricted to exact matches in the input is the definition of Markov Chains

Here's wikipedia:

> a Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

A Markov chain is a finite state machine in which transitions between states may have probabilities other than 0 or 1. In this model, there is no input; the transitions occur according to their probability as time passes.
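
For concreteness, a toy chain in exactly that sense, sketched in Python (the states and probabilities are made up for illustration): no input, just probabilistic transitions as time passes.

    import random

    transitions = {
        "sunny": {"sunny": 0.8, "rainy": 0.2},
        "rainy": {"sunny": 0.4, "rainy": 0.6},
    }

    def run(state, steps):
        # Walk the chain: each next state depends only on the current one.
        history = [state]
        for _ in range(steps):
            nxt = transitions[state]
            state = random.choices(list(nxt), weights=list(nxt.values()))[0]
            history.append(state)
        return history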

> 2) LLMs are not Markov Chains

Insofar as the concept of "Markov chains" has been used in the development of linguistics, they are seen as a tool for text generation. A Markov chain for this purpose is a hash table. The key is a sequence of tokens (in the state-based definition, this sequence is the current state), and the value is a probability distribution over a set of tokens.

To rephrase this slightly, a Markov chain is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then for the following token you should choose t_1 with probability p_1, t_2 with probability p_2, etc...".

Then, to tie this back into the state-based definition, we say that when we choose token t_k, we emit that token into the output, and we also dequeue the first token from our representation of the state and enqueue t_k at the back. This brings us into a new state where we can generate another token.
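
A minimal sketch of that table-plus-queue view in Python, assuming a tiny window of N = 2 word tokens (everything below is illustrative):

    import random
    from collections import defaultdict, deque

    N = 2  # window size: the state is the last N tokens

    def build_table(tokens):
        # The lookup table: key = last N tokens, value = counts that we
        # normalize into a probability distribution when sampling.
        table = defaultdict(lambda: defaultdict(int))
        for i in range(len(tokens) - N):
            table[tuple(tokens[i:i + N])][tokens[i + N]] += 1
        return table

    def generate(table, seed, steps):
        state = deque(seed, maxlen=N)   # the current state
        out = list(seed)
        for _ in range(steps):
            dist = table.get(tuple(state))
            if not dist:                # no row for this state: nothing to emit
                break
            tokens, weights = zip(*dist.items())
            token = random.choices(tokens, weights=weights)[0]
            out.append(token)
            state.append(token)         # enqueue the new token; maxlen dequeues the oldest
        return out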

A large language model is seen slightly differently. It is a function. The independent variable is a sequence of tokens, and the dependent variable is a probability distribution over a set of tokens. Here we say that the LLM answers the question "if the last N tokens of a fixed text were s_1, s_2, ..., s_N, what is the following token likely to be?".

Or, rephrased, the LLM is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then the following token will be t_1 with probability p_1, t_2 with probability p_2, etc...".

You might notice that these two tables contain the same information organized in the same way. The transformation from an LLM to a Markov chain is the identity transformation. The only difference is in what you say you're going to do with it.
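
Under the same toy assumptions, the identity transformation can be read off directly: any next-token function, however it is implemented internally, defines one transition-table row per context, and enumerating the rows (feasible only for toy vocabularies) yields the explicit table.

    from itertools import product

    def materialize(next_dist, vocab, n):
        # Hypothetical helper: record the model's distribution for every
        # length-n context. The table has len(vocab) ** n rows, which is why
        # nobody actually stores an LLM this way.
        return {ctx: next_dist(ctx) for ctx in product(vocab, repeat=n)}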

Sohcahtoa82 2 days ago | parent | next [-]

> Here we say that the LLM answers the question "if the last N tokens of a fixed text were s_1, s_2, ..., s_N, what is the following token likely to be?".

> Or, rephrased, the LLM is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then the following token will be t_1 with probability p_1, t_2 with probability p_2, etc...".

This is an oversimplification of an LLM.

The output layer of an LLM contains logits for every token in the vocabulary. That can mean every word, word fragment, punctuation mark, or whatever emoji or symbol it knows. Because the logits are calculated through a whole lot of floating-point math, it's very likely that most of them will be non-zero. Very close to zero, but still non-zero.

This means that gibberish options for the next token have non-zero probabilities of being chosen. The only reason they don't in reality is because of top-k sampling, temperature, and other filtering that's done on the logits before actually choosing a token.

If you present an s_1, s_2, ..., s_N to a Markov chain when that sequence was never seen by the chain, the resulting set of probabilities is empty. But if you present it to an LLM, it gets fed into a neural network, a set of logits eventually comes out, and you can still choose the next token based on them.
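
A hedged sketch of that filtering step (the logits are an assumed {token: score} dict, not the output of any real model): softmax gives every token a non-zero probability, and temperature plus top-k are what keep the gibberish options from ever being drawn.

    import math
    import random

    def sample(logits, temperature=0.8, top_k=40):
        # Scale by temperature, keep only the top-k tokens, then softmax.
        scaled = {t: v / temperature for t, v in logits.items()}
        top = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        m = max(v for _, v in top)
        weights = [math.exp(v - m) for _, v in top]   # numerically stable softmax
        tokens = [t for t, _ in top]
        return random.choices(tokens, weights=weights)[0]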

thaumasiotes a day ago | parent [-]

> This means that gibberish options for the next token have non-zero probabilities of being chosen. The only reason they don't in reality is because of top-k sampling, temperature, and other filtering that's done on the logits before actually choosing a token.

> If you present an s_1, s_2, ..., s_N to a Markov chain when that sequence was never seen by the chain

No, you're confused.

The chain has never seen anything. The Markov chain is a table of probability distributions. You can create it by any means you see fit. There is no such thing as a "series" of tokens that has been seen by the chain.

Sohcahtoa82 20 hours ago | parent [-]

> The chain has never seen anything. The Markov chain is a table of probability distributions. You can create it by any means you see fit. There is no such thing as a "series" of tokens that has been seen by the chain.

When I talk about the chain "seeing" a sequence, I mean that the sequence existed in the material that was used to generate the probability table.

My instinct is to believe that you know this, but are being needlessly pedantic.

My point is that, with a context length of two, if you prompt a Markov chain with "my cat" but the sequence "my cat was" never appeared in the training material, then the Markov chain will never choose "was" as the next word. This property does not hold for LLMs. If you prompt an LLM with "my cat", then "was" has a non-zero chance of being chosen as the next word, even if "my cat was" never appeared in the training material.
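
A toy illustration of that asymmetry (the corpus and counts are made up):

    # Window-2 table built from a corpus that never contains "my cat was".
    table = {("my", "cat"): {"sat": 3, "ran": 1}}

    print(table[("my", "cat")].get("was", 0))   # 0: "was" can never be chosen
    # An LLM's softmax over its whole vocabulary would instead assign "was"
    # some small but non-zero probability for the same prompt.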

purple_turtle 2 days ago | parent | prev [-]

Maybe technically an LLM can be converted to an equivalent Markov chain.

The problem is that even for modest context sizes, the resulting Markov chain would be hilariously, monstrously large.

You may as well say that an LLM and a hash table are the same thing.

thaumasiotes a day ago | parent | prev [-]

As I just mentioned in the comment you're responding to, the way you convert an LLM into an equivalent Markov chain is by doing nothing, since it already is one.

> You may as well say that an LLM and a hash table are the same thing.

No. You may as well say that a hash table and a function are the same thing. And this is in fact a common thing to say, because they are the same thing.

An LLM is a significantly more restricted object than a function is.

purple_turtle a day ago | parent [-]

> LLM is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then the following token will be t_1 with probability p_1, t_2 with probability p_2, etc...".

No, an LLM is not a lookup table over all possible inputs.