molticrystal 5 days ago

For those curious about a few of the metrics: besides $/token, tokens/s, latency, and context size, they use the results from:

    MMLU-Pro (Reasoning & Knowledge)  
    GPQA Diamond (Scientific Reasoning)  
    Humanity's Last Exam (Reasoning & Knowledge)  
    LiveCodeBench (Coding)  
    SciCode (Coding)  
    HumanEval (Coding)  
    MATH-500 (Quantitative Reasoning)  
    AIME 2024 (Competition Math)  
    Chatbot Arena (selectively used)
NitpickLawyer 5 days ago

> Humanity's Last Exam (Reasoning & Knowledge)

An article yesterday said that ~30% of the chemistry/biology questions on HLE were either wrong, misleading, or highly contested in the scientific literature.