Remix clone Hacker News


	▲	molticrystal 5 days ago
		For those curious about a few of the metrics: besides $/token, tokens/s, latency, and context size, they use the results from MMLU-Pro (Reasoning & Knowledge); GPQA Diamond (Scientific Reasoning); Humanity's Last Exam (Reasoning & Knowledge); LiveCodeBench (Coding); SciCode (Coding); HumanEval (Coding); MATH-500 (Quantitative Reasoning); AIME 2024 (Competition Math); and Chatbot Arena (selectively used).
	▲	NitpickLawyer 5 days ago
		> Humanity's Last Exam (Reasoning & Knowledge)
		An article yesterday said that ~30% of the chemistry/biology questions on HLE were either wrong, misleading, or highly contested in the scientific literature.