I’m missing two things from the article:
- the testing prompt (were the LLMs instructed to progress in the game, as opposed to just exploring? The author said smarter LLMs were more likely to explore)
- a benchmark against humans