modeless 2 hours ago

Claude Fable 5 beats Pokémon FireRed using only vision: https://www.youtube.com/watch?v=CIQBP1w4B1M

uludag 2 hours ago | parent | next [-]

Any suggestion on how I should calibrate my cynicism towards this?

I can imagine Anthropic running this experiment multiple times and picking the most impressive one. Or I could imagine this entire run costing $1000+ in tokens. Or maybe they tried a bunch of Pokemon games and it couldn't even finish some of them. Or is it just able to do this because it has an immense amount of FireRed training data, and if you were to give it an "original" Pokemon game, where it actually had to navigate novel circumstances, it would fail?

modeless an hour ago | parent [-]

Every model has encyclopedic knowledge of Pokémon FireRed, of course. Knowledge is not ability. This is the first model with the ability to apply that knowledge to beat the game without assistance.

I highly doubt they focused on FireRed specifically in pretraining or posttraining. But we'll see when the ARC-AGI-3 results come out. That will measure its performance on unseen games. Based on this I expect the ARC-AGI-3 score to be SOTA.

milkkarten an hour ago | parent | prev | next [-]

No reasoning shown. No explanation of any training information. Using vision only should be an easier version of the task (given training).

There are many standardized evals to do this correctly, and Anthropic ignored them to provide an 18-second sped-up video of a 50-hour run?

Yeah, I don't trust this until they provide a live run by a 3rd party with full reasoning traces in real time. The reason we all liked the Gemini Plays Pokemon style runs was that they were live and couldn't be faked.

svcphr 2 hours ago | parent | prev | next [-]

Bold move putting in the lvl 3 Pidgey against Gary's Blastoise at the end there (~14sec in... integer timestamps insufficient here).

suddenlybananas 2 hours ago | parent | prev | next [-]

Is there any more detail about this besides the very fast slideshow?

modeless 2 hours ago | parent [-]

Seems like the harness was minimal, with no extra game state or maps available. Apparently just the screen image. Seems like it took 50 hours of in-game time, which according to Google is at the high end of a normal human playthrough. No idea how long it took in real time though.

ex-aws-dude 2 hours ago | parent | prev [-]

I mean that’s AGI confirmed right?