Remix clone Hacker News

	▲	philipbjorge 6 days ago
		If you're looking to test an LLMs ability to solve a coding task without prior knowledge of the task at hand, I don't think their benchmark is super useful. If you care about understanding relative performance between models for solving known problems and producing correct output format, it's pretty useful. - Even for well-known problems, we see a large distribution of quality between models (5 to 75% correctness) - Additionally, we see a large distribution of model's ability to produce responses in formats they were instructed in At the end of the day, benchmarks are pretty fuzzy, but I always welcome a formalized benchmark as a means to understand model performance over vibe checking.