lordgrenville 2 hours ago

Almost every comment here is appealing to personal experience. By contrast, OP refers to two studies that compare performance on some kind of standardised test over a range of models.

Can't speak to how good those tests are, but they can't be worse than anecdotal evidence for something as vague/subjective as LLM performance.

nijave an hour ago

I'll respond with more anecdotal evidence: the Llama family has been terrible at following directions in all the tests I've done--not sure about the other models in RULER.

In the Chroma results, they look at Sonnet 4, which was also terrible in my experience. The same prompt that worked perfectly in Sonnet 4.5 would fail miserably in Sonnet 4.

Would be good to see newer tests with both SOTA and open-weight models. The SOTA ones always seem to follow directions and stay on topic better, but it'd be good to have some data to back it up.

bhy 2 hours ago

But the studies are from 2024 and 2025. They don’t apply to current Claude models.