I agree, I don't understand why this is a useful test. It's a borderline trick question, it's worded weirdly. What does it demonstrate?

▲

rkomorn 3 hours ago | parent [-]

I don't know if it demonstrates anything, but I do think it's somewhat natural for people to want to interact with tools that feel like they make sense.

If I'm going to trust a model to summarize things, go out and do research for me, etc, I'd be worried if it made what looks like comprehension or math mistakes.

I get that it feels like a big deal to some people if some models give wrong answers to questions like this one, "how many rs are in strawberry" (yes: I know models get this right, now, but it was a good example at the time), or "are we in the year 2026?"

▲

jrowen 3 hours ago | parent [-]

In my experience the tools feel like they make sense when I use them properly, or at least I have a hard time relating the failure modes to this walk/drive thing with bizarre adversarial input. It just feels a little bit like garbage in, garbage out.

	▲	rkomorn 2 hours ago \| parent [-]
		Okay, but when you're asking a model to do things like summarizing documents, analyzing data, or reading docs and producing code, etc, you don't necessarily have a lot of control over the quality of the input.