jameshart | 5 days ago
| Nothing in the article mentioned how good the LLMs were at even entering valid text adventure commands into the games. If an LLM responds to “You are standing in an open field west of a white house” with “okay, I’m going to walk up to the house”, and just gets back “THAT SENTENCE ISN'T ONE I RECOGNIZE”, it’s not going to make much progress. |
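Roughly, a Zork-style parser only accepts commands built from a fixed verb vocabulary, so conversational output bounces off it. A toy sketch of that pre-check (the verb list here is made up for illustration, not the actual Infocom vocabulary):

    # Toy sketch: reject anything that doesn't start with a known verb.
    # KNOWN_VERBS is hypothetical, not the real parser's vocabulary.
    KNOWN_VERBS = {"go", "walk", "open", "take", "look", "read", "enter"}

    def is_recognized(command: str) -> bool:
        words = command.lower().strip().split()
        return bool(words) and words[0] in KNOWN_VERBS

    for attempt in ["okay, I'm going to walk up to the house",
                    "go north",
                    "open mailbox"]:
        print(f"> {attempt}" if is_recognized(attempt)
              else "THAT SENTENCE ISN'T ONE I RECOGNIZE")

(A real Infocom parser also checks nouns and syntax, so even a verb-led sentence can fail if the phrasing is off.)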
|
throwawayoldie | 5 days ago
| "You're absolutely right, that's not a sentence you recognize..." |
|
kqr | 5 days ago
| The previous article (linked in this one) gives an idea of that. |
jameshart | 5 days ago

I did see that. But since it focused on how Claude handled that particular prompt format, it's not clear whether the LLMs that scored low here were failing to produce valid input, struggling with that specific prompt/output structure, or operating the text adventure competently but struggling to build a world model and solve the puzzles.
kqr | 5 days ago

Ah, I see what you mean. Yeah, there was too much output from too many models at once (combined with not enough spare time) to do useful qualitative analysis on every model's performance.
|
|