philipbjorge 6 days ago

If you're looking to test an LLM's ability to solve a coding task without prior knowledge of the task at hand, I don't think their benchmark is super useful.

If you care about understanding relative performance between models at solving known problems and producing the correct output format, it's pretty useful.

- Even for well-known problems, we see a wide spread in quality between models (5% to 75% correctness)
- Additionally, we see a wide spread in models' ability to produce responses in the formats they were instructed to use

At the end of the day, benchmarks are pretty fuzzy, but I always welcome a formalized benchmark as a means to understand model performance over vibe checking.