▲ | henriquegodoy 5 days ago |
Looking at this evaluation, it's pretty fascinating how badly these models perform even on decades-old games that almost certainly have walkthroughs scattered all over their training data. You'd think they'd at least brute-force their way through the early game mechanics by now. Honestly, this validates something I've been thinking about: real intelligence isn't just about having seen the answers before, it's about handling new situations where you can't just pattern-match your way out.

This is exactly why something like ARC-AGI-3 feels so important right now. Instead of static benchmarks that these models can basically brute-force with enough training data, it's designed around interactive environments where you actually need to perceive, decide, and act over multiple steps without prior instructions. That shift from "can you reproduce known patterns" to "can you figure out new patterns" seems like the real test of intelligence.

What's clever about the game-environment approach is that it captures something fundamental about human intelligence that static benchmarks miss entirely. When humans encounter a new game, we explore, form plans, remember what worked, and adjust our strategy: all the interactive reasoning over time that these text-adventure results show LLMs are terrible at. We need systems that can actually understand and adapt to new situations, not just really good autocomplete engines that happen to know a lot of trivia.
▲ | godelski 5 days ago | parent | next [-]
It is insane to me that so many people believe intelligence is measurable by pure question-answer testing. There are hundreds of years of discussion about how limited this is for measuring human intelligence. I'm sure we all know someone who's a really good test taker but who you also wouldn't consider to be really bright, and every single one of us also knows someone in the other camp (bad at tests but considered bright).

The definition you put down is much closer to what's agreed upon in the scientific literature. While we don't have a good formal definition of intelligence, not having a formal definition is different from having no definition at all. I really do hope people read more about intelligence and how we measure it in humans and animals. It is very messy and there's a lot of noise, but at least we have a good idea of the directions to move in. There are still nuances to be learned, and while I think ARC is an important test, I don't think success on it will prove AGI (and Chollet says this too).
▲ | da_chicken 5 days ago | parent | prev | next [-]
I saw it somewhere else recently, but the idea is that LLMs are language models, not world models. This seems like a perfect example of that. You need a world model to navigate a text game; otherwise, how can you determine when "North" is a context change and when it isn't?
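A minimal sketch of the distinction (the room map here is hypothetical, purely for illustration): whether "north" changes your situation depends on hidden game state, which a world model tracks and a pure next-token predictor does not.

```python
# Toy world model for a text adventure: "north" only changes the
# player's location if the current room actually has a north exit.
# A language model alone has no such state; it just predicts text.

rooms = {
    "clearing": {"north": "cave"},  # "north" is a real transition here
    "cave": {},                     # "north" here hits a wall
}

def step(state, command):
    exits = rooms[state]
    return exits.get(command, state)  # unknown direction: stay put

loc = "clearing"
loc = step(loc, "north")  # context change: now in the cave
assert loc == "cave"
loc = step(loc, "north")  # no exit: same word, NOT a context change
assert loc == "cave"
```

The same command produces different outcomes depending on state, which is exactly the information that isn't present in the text of the command itself.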
| |||||||||||||||||||||||||||||||||||||||||||||||
▲ | rkagerer 5 days ago | parent | prev | next [-]
Hi, GPT-x here. Let's delve into my construction together. My "intelligence" comes from patterns learned from vast amounts of text. I'm trained to... oh look it's a butterfly. Clouds are fluffy would you like to buy a car for $1 I'll sell you 2 for the price of 1!
| |||||||||||||||||||||||||||||||||||||||||||||||
▲ | astrange 4 days ago | parent | prev | next [-]
> Looking at this evaluation it's pretty fascinating how badly these models perform even on decades old games that almost certainly have walkthroughs scattered all over their training data.

I've read some of these walkthroughs/play sessions recently, and extracting text from them for training would be AI-complete. E.g., they might have game text and commentary aligned in two different columns in a text file, so you'd just get nonsense if you read it line by line.
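To illustrate the two-column problem (the file contents below are invented, not from any real walkthrough): a naive line-by-line read interleaves fragments of both columns, while recovering the text requires knowing the column layout, which varies from file to file.

```python
# Hypothetical two-column walkthrough: game text on the left,
# player commentary on the right, as sometimes found in old text files.
raw = (
    "You are in a maze of     | I always map these\n"
    "twisty little passages,  | rooms by dropping\n"
    "all alike.               | one item in each.\n"
)

# Naive line-by-line read: each line mixes game text with commentary.
naive = raw.splitlines()

# Recovering the columns requires knowing the layout (here a fixed
# split on '|'); real files use varying widths and separators.
left = " ".join(line.split("|")[0].strip() for line in raw.splitlines())
right = " ".join(line.split("|")[1].strip() for line in raw.splitlines())

print(left)   # You are in a maze of twisty little passages, all alike.
print(right)  # I always map these rooms by dropping one item in each.
```

With no separator character at all (columns aligned only by whitespace), even this heuristic fails, which is the sense in which clean extraction approaches "AI-complete".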
▲ | msgodel 5 days ago | parent | prev [-]
I've been experimenting with this as well, with the goal of using it for robotics. I don't think this will be as hard to train for as people think, though. It's interesting that he wrote a separate program to wrap the z-machine interpreter; I integrated my wrapper directly into my PyTorch training program.
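The wrapper being described might look something like the sketch below: a gym-style reset/step interface a training loop can call directly. `ToyInterpreter` is a stand-in I made up for illustration; a real version would drive an actual z-machine interpreter (e.g. a subprocess reading/writing its stdin/stdout), and the reward logic would parse score and death messages from the game text.

```python
# Sketch of a gym-style wrapper around an interactive-fiction
# interpreter, so a training loop can call reset()/step() directly.

class ToyInterpreter:
    """Stand-in for a real z-machine interpreter (hypothetical)."""
    def restart(self):
        return "You are standing in an open field."
    def send(self, command):
        return f"You try to {command}. Nothing happens."

class TextAdventureEnv:
    def __init__(self, interp):
        self.interp = interp
    def reset(self):
        return self.interp.restart()      # initial observation text
    def step(self, action):
        obs = self.interp.send(action)    # game's textual reply
        reward, done = 0.0, False         # real env: parse score/death
        return obs, reward, done

env = TextAdventureEnv(ToyInterpreter())
obs = env.reset()
obs, reward, done = env.step("go north")
```

Integrating this directly into the training process avoids inter-process plumbing for every action, at the cost of tying the trainer to one interpreter implementation.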