from a foundational research perspective, the pokemon benchmark is one of the most important ones.

these models are trained on a static task, text generation, which is to say the state they are operating in does not change as they operate. but now that they are out we are implicitly demanding they do dynamic tasks like coding, navigation, operating in a market, or playing games. this are tasks where your state changes as you operate

an example would be that as these models predict the next word, the ground truth of any further words doesnt change. if it misinterprets the word bank in the sentence "i went to the bank" as a river bank rather than a financial bank, the later ground truth wont change, if it was talking about the visit to the financial bank before, it will still be talking about that regardless of the model's misinterpretation. But if a model takes a wrong turn on the road, or makes a weird buy in the stock market, the environment will react and change and suddenly, what it should have done as the n+1th move before isnt the right move anymore, it needs to figure out a route of the freeway first, or deal with the FOMO bullrush it caused by mistakenly buying alot of stock

we need to push against these limits to set the stage for the next evolution of AI, RL based models that are trained in dynamic reactive environments in the first place

▲

hansmayer 2 months ago | parent [-]

Honestly I have no idea what is this supposed to mean, and the high verbosity of whatever it is trying to prove is not helping it. To repeat: We already tried making computers play games. Ever heard of Deep Blue, and ever heard of it again since the early 2000s?

▲

lechatonnoir 2 months ago | parent | next [-]

Here's a summary for you:

llm trained to do few step thing. pokemon test whether llm can do many step thing. many step thing very important.

▲

hansmayer 2 months ago | parent [-]

Are you showing off how the the extensive LLM usage impaired your writing and speaking capabilities?

	▲	lechatonnoir a month ago \| parent \| next [-]
		I am mocking you, but you didn't get it.
	▲	drdeca 2 months ago \| parent \| prev [-]
		You complained about the high verbosity.

▲

Rudybega 2 months ago | parent | prev [-]

The state space for actions in Pokemon is hilariously, unbelievably larger than the state space for chess. Older chess algorithms mostly used Brute Force (things like minimax) and the number of actions needed to determine a reward (winning or losing) was way lower (chess ends in many, many, many fewer moves than Pokemon).

Successfully navigating through Pokemon to accomplish a goal (beating the game) requires a completely different approach, one that much more accurately mirrors the way you navigate and goal set in real world environments. That's why it's an important and interesting test of AI performance.

▲

hansmayer 2 months ago | parent [-]

Thats all wishful thinking, with no direct relation to the actual use cases. Are you going to use it to play games for you? Here is a much more reliable test: Would you blindly copy and paste the code the GenAI spits out at you? Or blindly trust the recommendations it makes about your terraform code ? Unless you are a complete beginner, you would not, because it sometimes generates downright the opposite of what you asked it to do. It is because the tool is guessing the outputs and not really knowing what it all means. It just "knows" what character sequences are most likely (probability-wise) to follow the previous sequence. Thats all there is to it. There is no big magic, no oracle having knowledge you dont etc. So unless you tell me you are ready to blindly use whatever the GenAI playing pokemon tells you to do, I am sorry, but you are just fooling yourself. And in the case you are ready to blindly follow it - I sure hope you are ready for a life of an Eloi?

▲

Rudybega 2 months ago | parent [-]

All of that is totally unrelated to the point I'm trying to make.

Pokemon is interesting because it's a test of whether these models can solve long time horizon tasks.

That's it.

▲

hansmayer 2 months ago | parent [-]

Ok, well now that you phrase it clearly like that, it makes much more sense, so it's a test of being able to keep a relatively long context-length. Another incremental improvement I suppose.

	▲	Rudybega a month ago \| parent [-]
		It's not really a function of maintaining coherency across context length. It's more about whether the model can accomplish a long time horizon task when the context length of a single message isn't even close to sufficient for keeping track of the all the things that have occurred in pursuit of the task's completion. Basically, the model has to keep some notes about its overall goals and current progress. Then the context window has to be seeded with the relevant sections from these notes to accomplish sub goals that help with the completion of the overall goal (beat the game). The interesting part here is whether the models can even do this. A single context window isn't even close to sufficient to store all the things the model has done to drive the next action, so you have to figure out alternate methods and see if the model itself is smart enough to maintain coherency using those methods.