Remix clone Hacker News

	▲	Invictus0 6 hours ago
		How is everyone monitoring the skill/utility of all these different models? I am overwhelmed by how many there are, and by the challenge of tracking their capabilities across so many different modalities.
	▲	redman25 6 hours ago
		https://www.swebench.com
		https://swe-rebench.com
		https://livebench.ai/#/
		https://eqbench.com/#
		https://contextarena.ai/?needles=8
		https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
		https://artificialanalysis.ai/leaderboards/models
		https://gorilla.cs.berkeley.edu/leaderboard.html
		https://github.com/lechmazur/confabulations
		https://dubesor.de/benchtable
		https://help.kagi.com/kagi/ai/llm-benchmark.html
		https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard
	▲	spoaceman7777 5 hours ago
		This is the best summary, in my opinion. You can also see the individual scores on the benchmarks they use to compute their overall scores. The overview mode is nice and simple, though: it breaks things down into an intelligence ranking, a coding ranking, and an agentic ranking. https://artificialanalysis.ai/