henriquegodoy 5 days ago

Looking at this evaluation, it's pretty fascinating how badly these models perform even on decades-old games that almost certainly have walkthroughs scattered all over their training data. You'd think they'd at least brute-force their way through the early game mechanics by now, but honestly this validates something I've been thinking about: real intelligence isn't just about having seen the answers before, it's about being good at games and, specifically, at new situations where you can't just pattern-match your way out.

This is exactly why something like ARC-AGI-3 feels so important right now. Instead of static benchmarks that these models can basically brute-force with enough training data, it's designed around interactive environments where you actually need to perceive, decide, and act over multiple steps without prior instructions. That shift from "can you reproduce known patterns" to "can you figure out new patterns" seems like the real test of intelligence.

What's clever about the game-environment approach is that it captures something fundamental about human intelligence that static benchmarks miss entirely. When humans encounter a new game, we explore, form plans, remember what worked, and adjust our strategy; all of that interactive reasoning over time is exactly what these text-adventure results show LLMs are terrible at. We need systems that can actually understand and adapt to new situations, not just really good autocomplete engines that happen to know a lot of trivia.

godelski 5 days ago | parent | next [-]

> real intelligence isn't just about having seen the answers before, it's about being good at games and, specifically, at new situations where you can't just pattern-match your way out
It is insane to me that so many people believe intelligence is measurable by pure question-and-answer testing. There are hundreds of years of discussion about how limited this is for measuring human intelligence. I'm sure we all know someone who's a really good test taker but whom you wouldn't consider to be particularly bright, and I'm sure every single one of us also knows someone in the other camp (bad at tests but considered bright).

The definition you put down is much more agreed upon in the scientific literature. While we don't have a good formal definition of intelligence, there is a difference between an imperfect definition and no definition at all. I really do hope people read more about intelligence and how we measure it in humans and animals. It is very messy and there's a lot of noise, but at least we have a good idea of the directions to move in. There are still nuances to be worked out, and while I think ARC is an important test, I don't think success on it will prove AGI (and Chollet says this too).

da_chicken 5 days ago | parent | prev | next [-]

I saw it somewhere else recently, but the idea is that LLMs are language models, not world models. This seems like a perfect example of that. You need a world model to navigate a text game.

Otherwise, how can you determine that "north" is sometimes a context change and sometimes not? "Go north" moves you to a new room, while "a door in the north wall" is just description.

zahlman 5 days ago | parent | next [-]

> I saw it somewhere else recently, but the idea is that LLMs are language models, not world models.

Part of what distinguishes humans from artificial "intelligence" to me is exactly that we automatically develop models of whatever is needed.

mlyle 4 days ago | parent | next [-]

I think it's interesting to think about, and still somewhat uncertain:

* How much is a large language model effectively a world model (indeed, language tries to model the world)?

* How much do humans use language in their modeling and reasoning about the world?

* How fit is language for this task, beyond the extent to which humans use it?

da_chicken 4 days ago | parent | prev [-]

I think that's true to some extent, but I think all animals probably develop a world model.

foobarbecue 5 days ago | parent | prev | next [-]

On HN, perhaps? #17 on the front page right now: https://news.ycombinator.com/item?id=44854518

manbash 5 days ago | parent | prev | next [-]

Thanks for this. I was struggling to put it into words, even if it may already be a known distinguishing factor for others.

myhf 5 days ago | parent | prev | next [-]

9:05 is a good example of the difference between a language model and a world model, because engaging with it on a purely textual level leads to the bad ending (which the researchers have called "100%"), while deliberately getting the good ending requires self-awareness, intentionality, and/or outside context.

lubujackson 5 days ago | parent | prev [-]

Why, this sounds like Context Engineering!

rkagerer 5 days ago | parent | prev | next [-]

Hi, GPT-x here. Let's delve into my construction together. My "intelligence" comes from patterns learned from vast amounts of text. I'm trained to... oh look it's a butterfly. Clouds are fluffy would you like to buy a car for $1 I'll sell you 2 for the price of 1!

corobo 5 days ago | parent [-]

Ah dammit the AGI has ADHD

astrange 4 days ago | parent | prev | next [-]

> Looking at this evaluation, it's pretty fascinating how badly these models perform even on decades-old games that almost certainly have walkthroughs scattered all over their training data.

I've read some of these walkthroughs/play sessions recently, and extracting text from them for training would be AI-complete. For example, they might have game text and commentary aligned in two different columns in a text file, so you'd just get nonsense if you read it line by line.
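To make the failure mode concrete, here's a minimal sketch; the two-column layout, the 40-character width, and the sample text are all invented for illustration:

    # Hypothetical walkthrough layout: game transcript in the left
    # column, the author's commentary in the right (width invented).
    left = ["> go north", "You are eaten by a grue."]
    right = ["don't go north yet,", "you have no lamp"]
    raw = "\n".join(f"{l:<40}{r}" for l, r in zip(left, right))

    # Reading line by line interleaves the columns, mixing commands
    # with commentary mid-sentence -- the "nonsense" a naive
    # extraction pipeline would feed into training.
    print(raw)

    # Recovering the real transcript means detecting the column split
    # first (here we happen to know it's a fixed width of 40).
    transcript = "\n".join(line[:40].rstrip() for line in raw.splitlines())
    print(transcript)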

msgodel 5 days ago | parent | prev [-]

I've been experimenting with this as well with the goal of using it for robotics. I don't think this will be as hard to train for as people think though.

It's interesting that he wrote a separate program to wrap the z-machine interpreter. I integrated my wrapper directly into my PyTorch training program.
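For anyone curious what that kind of integration can look like, here's a minimal sketch, assuming dfrotz (the dumb-terminal frotz interpreter) is on PATH. The class name, the story-file path, and the prompt detection are simplified placeholders, not msgodel's actual code:

    import subprocess

    class ZMachineEnv:
        """Drives a z-machine game over dfrotz's stdin/stdout so a
        training loop can step it like an RL environment."""

        def __init__(self, story_file="zork1.z5"):  # placeholder path
            # -m suppresses [MORE] pagination so output isn't gated
            # on keypresses.
            self.proc = subprocess.Popen(
                ["dfrotz", "-m", story_file],
                stdin=subprocess.PIPE,
                stdout=subprocess.PIPE,
                text=True,
            )

        def _read_until_prompt(self) -> str:
            # The game prints its ">" prompt without a newline, so read
            # one character at a time instead of using readline(). A
            # real wrapper needs sturdier prompt detection than this.
            buf = ""
            while not buf.endswith(">"):
                ch = self.proc.stdout.read(1)
                if not ch:  # interpreter exited
                    break
                buf += ch
            return buf

        def reset(self) -> str:
            return self._read_until_prompt()

        def step(self, command: str) -> str:
            self.proc.stdin.write(command + "\n")
            self.proc.stdin.flush()
            return self._read_until_prompt()

    # Usage inside a training loop: observation in, action out.
    # env = ZMachineEnv()
    # obs = env.reset()
    # obs = env.step("open mailbox")

Keeping this in-process with the training code means the observations can be tokenized and fed straight to the model each step, with no IPC to a separate harness program.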