dwroberts 4 days ago

> Note: VERBAL models are asked using the verbalized test prompt. VISION models are asked the test image instead without any text prompts.

Just glancing at the bar graphs, the vision models mostly suck across the board on each question, whereas the verbal ones do OK.

And today's example of clock faces (#17) does a good job of demonstrating why: when a lot of the diagrams are explained verbally, the problems become significantly easier to solve.

Maybe it's just me, but in #17 it's not immediately obvious that those are even supposed to represent clocks, and yet the verbal prompt turns each one into a clock time for the model (e.g. 1:30), which feels like 50% of the problem being solved before the model does anything at all.