| ▲ | wavemode 3 days ago |
| I have the exact same problem with this article that I had with the previous one - the author fails to provide any data on the frequency of illegal moves. Thus it's impossible to draw any meaningful conclusions. It would be similar to if I claimed that an LLM is an expert doctor, but in my data I've filtered out all of the times it gave incorrect medical advice. |
|
| ▲ | rcxdude 3 days ago | parent | next [-] |
| I don't think this is super relevant. I mean, it would be interesting (especially if there was a meaningful difference in the number of illegal move attempts between the different approaches, doubly so if that didn't correlate with the performance when illegal moves are removed), but I don't think it really affects the conclusions of the article: picking randomly from the set of legal moves makes for a truly terrible chess player, so clearly the LLMs are bringing something to the party such that sampling from their output performs significantly better. Splitting hairs about the capability of the LLM on its own (i.e. insisting on defining attempts at an illegal move as a game loss for the purposes of rating) seems pretty much beside the point. |
|
| ▲ | timjver 3 days ago | parent | prev | next [-] |
| > It would be similar to if I claimed that an LLM is an expert doctor, but in my data I've filtered out all of the times it gave incorrect medical advice. Computationally it's trivial to detect illegal moves, so it's nothing like filtering out incorrect medical advice. |
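A minimal sketch of how cheap that legality filtering is, assuming the python-chess library (exception details vary slightly between versions):

    import chess

    def is_legal(board: chess.Board, move_san: str) -> bool:
        """Return True if the SAN string is a legal move in this position."""
        try:
            board.parse_san(move_san)  # raises ValueError on illegal or unparseable moves
            return True
        except ValueError:
            return False

    board = chess.Board()              # standard starting position
    print(is_legal(board, "e4"))       # True
    print(is_legal(board, "Ke2"))      # False - the king is blocked by its own pawn

Running every generated move through a check like this is a few lines of code, which is the sense in which it is "trivial" compared to validating medical advice.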
| |
| ▲ | KK7NIL 3 days ago | parent | next [-] | | > Computationally it's trivial to detect illegal moves You're strictly correct, but the rules for chess are infamously hard to implement (as anyone who's tried to write a chess program will know), leading to minor bugs in a lot of chess programs. For example, there's this old myth about vertical castling being allowed due to ambiguity in the ruleset: https://www.futilitycloset.com/2009/12/11/outside-the-box/
(Probably not historically accurate). If you move beyond legal positions into who wins when one side flags, the rules state that the other side should be awarded a victory if checkmate was possible with any legal sequence of moves.
This is so hard to check that no chess program tries to implement it, instead using simpler rules to achieve a very similar but slightly more conservative result. | | |
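To make the "simpler, slightly more conservative rule" concrete, here is a rough sketch of how a program typically scores a flag fall, assuming python-chess (whose insufficient-material check is documented as conservative in exactly this sense):

    import chess

    # White has just run out of time. The full rule asks whether the opponent
    # could deliver checkmate by ANY series of legal moves; programs approximate
    # that with a cheap material test instead.
    board = chess.Board("8/8/8/4n3/8/5k2/8/6K1 w - - 0 1")  # bare king vs king + knight

    flagged, opponent = chess.WHITE, chess.BLACK
    if board.has_insufficient_material(opponent):
        print("draw - the opponent could never have won anyway")
    else:
        print("the opponent wins on time")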
| ▲ | adelineJoOs 3 days ago | parent | next [-] | | That link was new to me, thanks! However: I wrote some chess program myself (nothing big, hobby level) and I would not call it hard to implement. Just harder than what someone might assume initially. But in the end, it is one of the simpler simulations/algorithms I did. It is just the state of the board, the state of the game (how many turns, castle rights, past positions for the repetition rule, ...) and picking one rule set if one really wants to be exact. (thinking about which rule set is correct would not be meaningful in my opinion - chess is a social construct, with only parts of it being well defined. I would not bother about the rest, at least not when implementing it) By the way: I read "Computationally it's trivial" as more along the lines of "it has been done before, it is efficient to compute, one just has to do it" versus "this is new territory, one needs to come up with how to wire up the LLM output with an SMT solver, and we do not even know if/how it will work." |
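The game state described above fits in a handful of fields; a rough sketch (field names are illustrative, not taken from any particular engine):

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class GameState:
        board: list                      # 8x8 grid, e.g. "wK", "bP", or None for empty squares
        white_to_move: bool = True
        castling_rights: set = field(default_factory=lambda: {"K", "Q", "k", "q"})
        en_passant_square: Optional[str] = None
        halfmove_clock: int = 0          # for the fifty-move rule
        fullmove_number: int = 1
        position_counts: dict = field(default_factory=dict)  # position key -> count, for threefold repetition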
| ▲ | admax88qqq 3 days ago | parent | prev | next [-] | | > You're strictly correct, but the rules for chess are infamously hard to implement Come on. Yeah they're not trivial but they've been done numerous times. There's been chess programs for almost as long as there have been computers. Checking legal moves is a _solved problem_. Detecting valid medical advice is not. The two are not even remotely comparable. | | |
| ▲ | KK7NIL 3 days ago | parent [-] | | > Detecting valid medical advice is not. The two are not even remotely comparable. Uh? Where exactly did I signal my support for LLM's giving medical advice? |
| |
| ▲ | elif 3 days ago | parent | prev | next [-] | | We implemented a whole chess engine in Lisp during 3rd year; implementing the legal move/state checking was actually trivial. |
| ▲ | rco8786 3 days ago | parent | prev [-] | | I got a kick out of that link. Had certainly never heard of "vertical castling" previously. |
| |
| ▲ | wavemode 3 days ago | parent | prev [-] | | As I wrote in another comment - you can write scripts that correct bad math, too. But we don't use that to claim that LLMs have a good understanding of math. | | |
| ▲ | ben_w 3 days ago | parent | next [-] | | I'd say that's because we don't understand what we mean by "understand". Hardware that accurately performs maths faster than all of humanity combined is so cheap as to be disposable, but I've yet to see anyone claim that a Pi Zero has "understanding" of anything. An LLM can display the viva voce approach that Turing suggested[0], and do it well. Ironically for all those now talking about "stochastic parrots", the passage reads: """… The game (with the player B omitted) is frequently used in practice under the name of viva voce to discover whether some one really understands something or has ‘learnt it parrot fashion’. …" Showing that not much has changed on the philosophy of this topic since it was invented. [0] https://academic.oup.com/mind/article/LIX/236/433/986238 | | |
| ▲ | danparsonson 3 days ago | parent [-] | | > I'd say that's because we don't understand what we mean by "understand". I'll have a stab at it. The idea of LLMs 'understanding' maths is that, once having been trained on a set of maths-related material, the LLM will be able to generalise to solve other maths problems that it hasn't encountered before. If an LLM sees all the multiplication tables up to 10x10, and then is correctly able to compute 23x31, we might surmise that it 'understands' multiplication - i.e. that it has built some generalised internal representation of what multiplication is, rather than just memorising all possible answers. Obviously we don't expect generalisation from a Pi Zero without specifically being coded for it, because it's a fixed function piece of hardware. Personally I think this is highly unlikely given that maths and natural language are very different things, and being good at the latter does not bear any relationship to being good at the former (just ask anyone who struggles with maths - plenty of people do!). Not to mention that it's also much easier to test for understanding of maths because there is (usually!) a single correct answer regardless of how convoluted the problem - compared to natural language where imitation and understanding are much more difficult to tell apart. |
| |
| ▲ | SpaceManNabs 3 days ago | parent | prev | next [-] | | I don't know. I have talked to a few math professors, and they think LLMs are as good as a lot of their peers when it comes to hallucinations and being able to discuss ideas on very niche topics, as long as the context is fed in. If Tao is calling some models "a mediocre, but not completely incompetent [...] graduate student", then they seem to understand math to some degree to me. | |
| ▲ | lupire 3 days ago | parent | next [-] | | Tao said that about a model brainstorming ideas that might be useful, not explaining complex ideas or generating new ideas or selecting a correct idea from a list of brainstormed ideas. Not replacing a human. | | |
| ▲ | adelineJoOs 3 days ago | parent [-] | | > Not replacing a human. Obviously not, but that is tangential to this discussion, I think. A hammer might be a useful tool in certain situations, and surely it does not replace a human (but it might make a human in those situations more productive, compared to a human without a hammer). > generating new ideas Is brainstorming not an instance of generating new ideas? I would strongly argue so. And whether the LLM does "understand" (or whatever ill-defined, ill-measurable concept one wants to use here) anything about the ideas it produces, and how they might be novel - that is not important either. If we assume that Tao is adequately assessing the situation and truthfully reporting his findings, then LLMs can, at the current state, at least occasionally be useful in generating new ideas, at least in mathematics. |
| |
| ▲ | fijiaarone 3 days ago | parent | prev [-] | | Being as good as a professor at confidently hallucinating nonsense when you don't know the answer is a very high level skill. |
| |
| ▲ | fijiaarone 3 days ago | parent | prev [-] | | Actually, LLMs do call scripts that correct bad math, and have gotten progressively better because of it. It's another special case example. |
|
|
|
| ▲ | sigmar 3 days ago | parent | prev | next [-] |
| Don't think that analogy works unless you could write a script that automatically removes incorrect medical advice, because then you would indeed have an LLM-with-a-script that was an expert doctor (which you can do for illegal chess moves, but obviously not for evaluating medical advice) |
| |
| ▲ | wavemode 3 days ago | parent | next [-] | | You can write scripts that correct bad math, too. In fact most of the time ChatGPT will just call out to a calculator function. This is a smart solution, and very useful for end users! But, still, we should not try to use that to make the claim that LLMs have a good understanding of math. | | |
| ▲ | afro88 3 days ago | parent | next [-] | | If a script were applied that corrected "bad math" and now the LLM could solve complex math problems that you can't one-shot throw at a calculator, what would you call it? | | |
| ▲ | sixfiveotwo 3 days ago | parent | next [-] | | It's a good point. But this math analogy is not quite appropriate: there's abstract math and arithmetic. A good math practitioner (LLM or human) can be bad at arithmetic, yet good at abstract reasoning. The latter doesn't (necessarily) require the former. In chess, I don't think that you can build a good strategy if it relies on illegal moves, because tactics and strategies are tied. |
| ▲ | danparsonson 3 days ago | parent | prev | next [-] | | If I had wings, I'd be a bird. Applying a corrective script to weed out bad answers is also not "one-shot" solving anything, so I would call your example an elaborate guessing machine. That doesn't mean it's not useful, but that's not how a human being does maths, when they understand what they're doing - in fact you can readily program a computer to solve general maths problems correctly the first time. This is also exactly the problem with saying that LLMs can write software - a series of elaborate guesses is undeniably useful and impressive, but without a corrective guiding hand, ultimately useless, and not demonstrating generalised understanding of the problem space. The dream of AI is surely that the corrective hand is unnecessary? |
| ▲ | at_a_remove 3 days ago | parent | prev [-] | | Then you could replace the LLM with a much cheaper RNG and let it guess until the "bad math filter" let something through. I was once asked by one of the Clueless Admin types if we couldn't just "fix" various sites such that people couldn't input anything wrong. Same principle. |
| |
| ▲ | vunderba 3 days ago | parent | prev | next [-] | | Agreed. It's not the same thing and we should strive for precision (LLMs are already opaque enough as it is). An LLM that recognizes an input as "math" and calls out to a NON-LLM to solve the problem vs an LLM that recognizes an input as "math" and also uses next-token prediction to produce an accurate response ARE DIFFERENT. | |
| ▲ | henryfjordan 3 days ago | parent | prev [-] | | At what point does "knows how to use a calculator" equate to knowing how to do math? Feels pretty close to me... | | |
| ▲ | Tepix 3 days ago | parent [-] | | Well, LLMs are bad at math but they're ok at detecting math and delegating it to a calculator program. It's kind of like humans. |
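A minimal sketch of that detect-and-delegate pattern (the CALC(...) trigger and the tiny expression evaluator are illustrative stand-ins, not how any particular product implements tool calls):

    import ast
    import operator
    import re

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def calc(expr):
        """Safely evaluate a plain arithmetic expression like '23*31'."""
        def ev(node):
            if isinstance(node, ast.BinOp):
                return OPS[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.Constant):
                return node.value
            raise ValueError("unsupported expression")
        return ev(ast.parse(expr, mode="eval").body)

    def expand_tool_calls(model_output):
        # The model only has to notice the math and emit a tool call such as
        # CALC(23*31); the actual arithmetic is done by ordinary code.
        return re.sub(r"CALC\((.+?)\)", lambda m: str(calc(m.group(1))), model_output)

    print(expand_tool_calls("23 x 31 is CALC(23*31)."))  # -> "23 x 31 is 713."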
|
| |
| ▲ | kcbanner 3 days ago | parent | prev [-] | | It would be possible to employ an expert doctor, instead of writing a script. | | |
| ▲ | ben_w 3 days ago | parent [-] | | Which is cheaper: 1. having a human expert create every answer, or 2. having an expert check 10 answers, each of which has a 90% chance of being right, and then manually redoing the one which was wrong? Now add the complications that: • option 1 also isn't 100% correct • nobody knows which things in option 2 are correlated or not, and if those are or aren't correlated with human errors, so we might be systematically unable to even recognise the errors • even if we could, humans not only get lazy without practice but also get bored if the work is too easy, so a short-term study in efficiency changes doesn't tell you things like "after 2 years you get mass resignations by the competent doctors, while the incompetent just say 'LGTM' to all the AI answers" |
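Putting toy numbers on that comparison (every rate and cost here is made up purely for illustration):

    n_answers     = 10
    cost_write    = 1.0    # expert writes one answer from scratch (arbitrary units)
    cost_review   = 0.2    # expert checks one machine-drafted answer
    p_draft_wrong = 0.10   # 1 in 10 drafts has to be redone by the expert

    option1 = n_answers * cost_write                                   # expert writes everything
    option2 = n_answers * (cost_review + p_draft_wrong * cost_write)   # expert reviews, rewrites misses

    print(option1, option2)  # 10.0 vs 3.0 with these made-up numbers

Which is exactly the kind of short-term arithmetic that the complications above (correlated errors, reviewer fatigue) can quietly invalidate.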
|
|
|
| ▲ | og_kalu 3 days ago | parent | prev | next [-] |
| 3-turbo-instruct makes about 5 or fewer illegal moves in 8205. It's not in this article, but turbo-instruct has been evaluated before: https://github.com/adamkarvonen/chess_gpt_eval |
|
| ▲ | teleforce 3 days ago | parent | prev | next [-] |
| > It would be similar to if I claimed that an LLM is an expert doctor, but in my data I've filtered out all of the times it gave incorrect medical advice Sharp eyes. Similarly, Andrew Ng and his Stanford University team pulled the same trick with an overfit training-to-testing ratio in his famous cardiologist-level paper published in Nature Medicine [1]. The training ratio is more than 99% and the testing ratio less than 1%, which fails AI validation 101. The paper would not stand in most AI conferences, but it was published in Nature Medicine, one of the highest-impact-factor journals there is, and it has many citations for AI in healthcare and medicine. [1] Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network: https://www.nature.com/articles/s41591-018-0268-3 |
|
| ▲ | Der_Einzige 3 days ago | parent | prev | next [-] |
| Correct - Dynamic grammar-based/constrained sampling can be used to, at each time-step, force the model to only make valid moves (and you don't have to do it in the prompt like this article does!!!) I have NO idea why no one seems to do this. It's a similar issue with LLM-as-judge evaluations. Often they are begging to be combined with grammar-based/constrained/structured sampling. So much good stuff in LLM land goes unused for no good reason! There are several libraries for implementing this easily: outlines, guidance, lm-format-enforcer, and likely many more. You can even do it now with OpenAI! Oobabooga text gen webUI literally has chess as one of its candidate examples of grammar-based sampling!!! |
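A rough sketch of what that looks like for chess, using python-chess for move generation; the constrained-decoding call itself is shown only as a comment because the outlines/guidance APIs change between versions, so treat it as pseudocode for "mask every token that doesn't continue a legal move":

    import chess

    def legal_sans(board):
        """All legal moves in the current position, as SAN strings."""
        return [board.san(m) for m in board.legal_moves]

    board = chess.Board()

    # At each ply the "grammar" is just the set of currently legal moves, so a
    # constrained sampler literally cannot emit an illegal move, e.g. roughly:
    #
    #     move = generate.choice(model, legal_sans(board))(prompt)   # outlines-style, hypothetical
    #
    # Crude stand-in without such a library: re-sample until the output is legal.
    def pick_move(sample_fn, board, prompt, max_tries=10):
        options = set(legal_sans(board))
        for _ in range(max_tries):
            candidate = sample_fn(prompt)
            if candidate in options:
                return candidate
        return sorted(options)[0]  # give up and play some legal move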
|
| ▲ | theptip 3 days ago | parent | prev | next [-] |
| This is a crazy goal-post move. TFA is proving a positive capability, and rejecting the null hypothesis that “LLMs can’t think, they just regurgitate”. Making some illegal moves doesn’t invalidate the demonstrated situational logic intelligence required to play at Elo 1800. (Another angle: a human on Chess.com also has any illegal move they try to make ignored.) |
| |
| ▲ | photonthug 3 days ago | parent | next [-] | | > Making some illegal moves doesn’t invalidate the demonstrated situational logic intelligence That’s exactly what it does. 1 illegal move in 1 million or 100 million or any other sample size you want to choose means it doesn’t understand chess. People in this thread are really distracted by the medical analogy so I’ll offer another: you’ve got a bridge that allows millions of vehicles to cross, and randomly falls down if you tickle it wrong, maybe a car of rare color. One key aspect of bridges is that they work reliably for any vehicle, and once they fail they don’t work with any vehicle. A bridge that sometimes fails and sometimes doesn’t isn’t a bridge as much as a death trap. | | |
| ▲ | og_kalu 3 days ago | parent | next [-] | | >1 illegal move in 1 million or 100 million or any other sample size you want to choose means it doesn’t understand chess Highly rated chess players make illegal moves. It's rare but it happens. They don't understand chess ? | | |
| ▲ | photonthug 3 days ago | parent | next [-] | | > Then no human understands chess Humans with correct models may nevertheless make errors in rule applications. Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect, incomplete, or totally absent models. Without using a word like “understands” it seems clear that the same apparent mistake has different causes.. and model errors are very different from model-application errors. In a math or physics class this is roughly the difference between carry-the-one arithmetic errors vs using an equation from a completely wrong domain. The word “understands” is loaded in discussion of LLMs, but everyone knows which mistake is going to get partial credit vs zero credit on an exam. | | |
| ▲ | og_kalu 3 days ago | parent | next [-] | | >Humans with correct models may nevertheless make errors in rule applications.
Ok >Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect or incomplete models. I don't know why people continue to force the wrong abstraction. LLMs do not work like 'machines'. They don't 'follow rules' the way we understand normal machines to 'follow rules'. >so when they fail to apply rules correctly, it means they have incorrect or incomplete models. Everyone has incomplete or incorrect models. It doesn't mean we always say they don't understand. Nobody says Newton didn't understand gravity. >Without using a word like “understands” it seems clear that the same apparent mistake has different causes.. and model errors are very different from model-application errors. It's not very apparent no. You've just decided it has different causes because of preconceived notions on how you think all machines must operate in all configurations. LLMs are not the logic automatons in science fiction. They don't behave or act like normal machines in any way. The internals run some computations to make predictions but so does your nervous system. Computation is substrate-independent. I don't even know how you can make this distinction without seeing what sort of illegal moves it makes. If it makes the sort high rated players make then what ? | | |
| ▲ | photonthug 3 days ago | parent [-] | | I can’t tell if you are saying the distinction between model errors and model-application errors doesn’t exist or doesn’t matter or doesn’t apply here. | | |
| ▲ | og_kalu 3 days ago | parent [-] | | I'm saying: - Generally, we do not say someone does not understand just because of a model error. The model error has to be sufficiently large or the model sufficiently narrow. No-one says Newton didn't understand gravity just because his model has an error in it, but we might say he didn't understand some aspects of it. - You are saying the LLM is making a model error (rather than an application error) only because of preconceived notions of how 'machines' must behave, not on any rigorous examination. | |
| ▲ | photonthug 3 days ago | parent [-] | | Suppose you're right, the internal model of game rules is perfect but the application of the model for next-move is imperfect. Unless we can actually separate the two, does it matter? Functionally I mean, not philosophically. If the model was correct, maybe we could get a useful version of it out by asking it to write a chess engine instead of acting as a chess engine. But when the Prolog code for that is as incorrect as the illegal chess move was, will you say again that the model is correct, but the usage of it merely resulted in minor errors? > You are saying the LLM is making a model error (rather than an application error) only because of preconceived notions of how 'machines' must behave, not on any rigorous examination. Here's an anecdotal examination. After much talk about LLMs and chess, and math, and formal logic, here's the state of the art, simplified from dialog with gpt today: > blue is red and red is blue. what color is the sky?
>> <blah blah, restates premise, correctly answer "red"> At this point fans rejoice, saying it understands hypotheticals and logic. Dialogue continues.. > name one red thing
>> <blah blah, restates premise, incorrectly offers "strawberries are red"> At this point detractors rejoice, declare that it doesn't understand. Now the conversation devolves into semantics or technicalities about prompt-hacks, training data, weights. Whatever. We don't need chess. Just look it, it's broken as hell. Discussing whether the error is human-equivalent isn't the point either. It's broken! A partially broken process is no solid foundation to build others on. And while there are some exceptions, an unreliable tool/agent is often worse than none at all. | | |
| ▲ | og_kalu 3 days ago | parent [-] | | >It's broken! A partially broken process is no solid foundation to build others on. And while there are some exceptions, an unreliable tool/agent is often worse than none at all. Are humans broken ? Because our reasoning is a very broken process. You say it's no solid foundation ? Take a look around you. This broken processor is the foundation of society and the conveniences you take for granted. The vast vast majority of human history, there wasn't anything even remotely resembling a non-broken general reasoner. And you know the funny thing ? There still isn't. When people like you say LLMs don't reason, they hold them to a standard that doesn't exist. Where is this non-broken general reasoner in anywhere but fiction and your own imagination? >And while there are some exceptions, an unreliable tool/agent is often worse than none at all. Since you are clearly meaning unreliable to be 'makes no mistake/is not broken' then no human is a reliable agent.
Clearly, the real exception is when an unreliable agent is worse than nothing at all. |
|
|
|
| |
| ▲ | bawolff 3 days ago | parent | prev | next [-] | | This feels more like a metaphysical argument about what it means to "know" something, which is really irrelevant to what is interesting about the article. | |
| ▲ | sixfiveotwo 3 days ago | parent | prev [-] | | > Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect, incomplete, or totally absent models. That's assuming that, somehow, a LLM is a machine. Why would you think that? | | |
| ▲ | photonthug 3 days ago | parent [-] | | Replace the word with one of your own choice if that will help us get to the part where you have a point to make? I think we are discussing whether LLMs can emulate chess playing machines, regardless of whether they are actually literally composed of a flock of stochastic parrots.. | | |
| ▲ | sixfiveotwo 3 days ago | parent | next [-] | | That's simple logic. Quoting you again: > Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect, incomplete, or totally absent models. If this line of reasoning applies to machines, but LLMs aren't machines, how can you derive any of these claims? "A implies B" may be right, but you must first demonstrate A before reaching conclusion B.. > I think we are discussing whether LLMs can emulate chess playing machines That is incorrect. We're discussing whether LLMs can play chess. Unless you think that human players also emulate chess playing machines? | |
| ▲ | XenophileJKO 3 days ago | parent | prev [-] | | Engineers really have a hard time coming to terms with probabilistic systems. |
|
|
| |
| |
| ▲ | benediktwerner 3 days ago | parent | prev [-] | | Try giving a random human 30 chess moves and asking them to make a non-terrible legal move. Average humans quite often try to make illegal moves even when the board is right in front of them. There are even plenty of cases where people reported a bug because the chess application didn't let them make an illegal move they thought was legal. And the sudden comparison to something that's safety critical is extremely dumb. Nobody said we should tie the LLM to a nuclear bomb that explodes if it makes a single mistake in chess. The point is that it plays at a level far, far above making random legal moves or even average humans. To say that that doesn't mean anything because it's not perfect is simply insane. | |
| ▲ | photonthug 3 days ago | parent [-] | | > And the sudden comparison to something that's safety critical is extremely dumb. Nobody said we should tie the LLM to a nuclear bomb that explodes if it makes a single mistake in chess. But it actually is safety critical very quickly whenever you say something like “works fine most of the time, so our plan going forward is to dismiss any discussion of when it breaks and why”. A bridge failure feels like the right order of magnitude for the error rate and effective misery that AI has already quietly caused with biased models where one in a million resumes or loan applications is thrown out. And a nuclear bomb would actually kill less people than a full on economic meltdown. But I’m sure no one is using LLMs in finance at all right? It’s so arrogant and naive to ignore failure modes that we don’t even understand yet.. at least bridges and steel have specs. Software “engineering” was always a very suspect name for the discipline but whatever claim we had to it is worse than ever. |
|
| |
| ▲ | wavemode 3 days ago | parent | prev [-] | | It's not a goalpost move. As I've already said, I have the exact same problem with this article as I had with the previous one. My goalposts haven't moved, and my standards haven't changed. Just provide the data! How hard can it be? Why leave it out in the first place? |
|
|
| ▲ | GuB-42 3 days ago | parent | prev | next [-] |
| > Thus it's impossible to draw any meaningful conclusions. It would be similar to if I claimed that an LLM is an expert doctor, but in my data I've filtered out all of the times it gave incorrect medical advice Not really, you can try to make illegal moves in chess, and usually, you are given a time penalty and get to try again, so even in a real chess game, illegal moves are "filtered out". And for the "medical expert" analogy, let's say that you compare two systems based on the well-being of the patients after they follow the advice. I think it is meaningful even if you filter out advice that is obviously inapplicable, for example because it refers to non-existent body parts. |
|
| ▲ | koolala 3 days ago | parent | prev | next [-] |
| I want to see graphs of moves the author randomly made too.
Maybe even plotting a random-move player on the performance graphs vs. the AIs. It's beginner chess and beginners make moves at random all the time. |
| |
| ▲ | benediktwerner 3 days ago | parent [-] | | 1750 elo is extremely far from beginner chess. The random mover bot on Lichess has like 700 rating. And the article does show various graphs of the badly playing models which will hardly play worse than random but are clearly far below the good models. |
|
|
| ▲ | hansvm 3 days ago | parent | prev | next [-] |
| There's a subtle distinction though; if you're able to filter out illegal behavior, the move quality conditioned on legality can be extremely different from arbitrary move quality (and, as you might see in LLM json parsing, conditioning per-token can be very different from conditioning per-response). If you're arguing that the singularity already happened then your criticism makes perfect sense; these are dumb machines, not useful yet for most applications. If you just want to use the LLM as a tool though, the behavior when you filter out illegal responses (assuming you're able to do so) is the only reasonable metric. Analogizing to a task I care a bit about: Current-gen LLMs are somewhere between piss-poor and moderate at generating recipes. With a bit of prompt engineering most recipes pass my "bar", but they're still often lacking in one or more important characteristics. If you do nothing other than ask it to generate many options and then as a person manually filter to the subset of ideas (around 1/20) which look stellar, it's both very effective at generating good recipes, and they're usually much better than my other sources of stellar recipes (obviously not generally applicable because you have to be able to tell bad recipes from good at a glance for that workflow to make sense). The fact that most of the responses are garbage doesn't really matter; it's still an improvement to how I cook. |
|
| ▲ | sixo 3 days ago | parent | prev | next [-] |
| When I play chess I filter out all kinds of illegal moves. I also filter out bad moves. A human is more like "recursively thinking of ideas and then evaluating them with another part of your model", so why not let the LLMs do the same? |
| |
| ▲ | skydhash 3 days ago | parent [-] | | Because that’s not what happens? We learn through symbolic meaning and rules which then form a consistent system. Then we can have a goal and continuously evaluate if we’re within the system and transitioning towards that goal. The nice thing is that we don’t have to compute the whole simulation in our brains and can start again from the real world. The more you train, the better your heuristics become and the more your efficiency increases. The internal model of an LLM is statistical text. Which is linear and fixed. Not great other than generating text similar to what was ingested. | |
| ▲ | fl7305 3 days ago | parent | next [-] | | > The internal model of a LLM is statistical text. Which is linear and fixed. Not great other than generating text similar to what was ingested. The internal model of a CPU is linear and fixed. Yet, a CPU can still generate an output which is very different from the input. It is not a simple lookup table, instead it executes complex algorithms. An LLM has large amounts of input processing power. It has a large internal state. It executes "cycle by cycle", processing the inputs and internal state to generate output data and a new internal state. So why shouldn't LLMs be capable of executing complex algorithms? | | |
| ▲ | skydhash 3 days ago | parent [-] | | It probably can, but how will those algorithms be created? And the representation of both input and output. If it’s text, the most efficient way is to construct a formal system. Or a statistical model if ambiguous and incorrect results are ok in the grand scheme of things. The issue is always input consumption, and output correctness. In a CPU, we take great care with data representation and protocol definition, then we do formal verification on the algorithms, and we can be pretty sure that the outputs are correct. So the issue is that the internal model (for a given task) of LLMs is not consistent enough and the referential window (keeping track of each item in the system) is always too small. | |
| ▲ | fl7305 3 days ago | parent [-] | | Neural networks can be evolved to do all sorts of algorithms. For example, controlling an inverted pendulum so that it stays balanced. > In a CPU, we take great care with data representation and protocol definition, then we do formal verification on the algorithms, and we can be pretty sure that the output are correct. Sure, intelligent design makes for a better design in many ways. That doesn't mean that an evolved design doesn't work at all, right? |
|
| |
| ▲ | hackinthebochs 3 days ago | parent | prev [-] | | >The internal model of a LLM is statistical text. Which is linear and fixed. Not at all. Like seriously, not in the slightest. | | |
| ▲ | skydhash 3 days ago | parent [-] | | What does it encode? Images? Scent? Touch? Some higher dimensional qualia? | | |
| ▲ | hackinthebochs 3 days ago | parent [-] | | Well, a simple description is that they discover circuits that reproduce the training sequence. It turns out that in the process of this, they recover relevant computational structures that generalize the training sequence. The question of how far they generalize is certainly up for debate. But you can't reasonably deny that they generalize to a certain degree. After all, most sentences they are prompted on are brand new and they mostly respond sensibly. Their representation of the input is also not linear. Transformers use self-attention which relies on the softmax function, which is non-linear. |
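The non-linearity is easy to check directly; a trivial numpy illustration:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())   # subtract the max for numerical stability
        return e / e.sum()

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([3.0, 2.0, 1.0])

    # A linear map f satisfies f(a + b) == f(a) + f(b); softmax does not.
    print(softmax(a + b))           # [0.333, 0.333, 0.333]
    print(softmax(a) + softmax(b))  # roughly [0.755, 0.489, 0.755]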
|
|
|
|
|
| ▲ | falcor84 3 days ago | parent | prev [-] |
| I would argue that it's more akin to filtering out the chit-chat with the patient, where the doctor explained things in an imprecise manner, keeping only the formal and valid medical notation. |
| |
| ▲ | caddemon 3 days ago | parent | next [-] | | There is no legitimate reason to make an illegal move in chess though? There are reasons why a good doctor might intentionally explain things imprecisely to a patient. | | |
| ▲ | hnthrowaway6543 3 days ago | parent [-] | | > There is no legitimate reason to make an illegal move in chess though? If you make an illegal move and the opponent doesn't notice it, you gain a significant advantage. LLMs just have David Sirlin's "Playing to Win" as part of their training data. | | |
| ▲ | fluoridation 3 days ago | parent [-] | | You raise an interesting point. If the filtered out illegal moves were disadvantageous, it could be that if the model had been allowed to make any moves it wanted it would have played to a much worse level than it did. |
|
| |
| ▲ | ses1984 3 days ago | parent | prev [-] | | It’s like the doctor saying, “you have cancer? Oh you don’t? Just kidding. Parkinson’s. Oh it’s not that either? How about common cold?” | | |
| ▲ | falcor84 3 days ago | parent [-] | | But the difference is that valid bad moves (equivalents of "cancer") were included in the analysis; it's only invalid ones (like "your body is kinda outgrowing itself") that were excluded from the analysis. | |
| ▲ | ses1984 3 days ago | parent [-] | | What makes a chess move invalid is the state of the board. I don’t think moves like “pick up the pawn and throw it across the room” were considered. | | |
| ▲ | toast0 3 days ago | parent [-] | | That's a valid move in Monopoly though. Although it's much preferred to pick up the table and throw it. |
|
|
|
|