refulgentis 4 hours ago

Thanks! Is the judge an LLM? There are lots of references to "just like LMArena", but LMArena is human evaluated?

skysniper 4 hours ago | parent [-]

> Is the judge an LLM?

Yes, the judge is one of opus 4.6, gpt 5.4, or gemini 3.1 pro (the submitter can choose). Self-judged battles (where the judge model is also one of the participants) are excluded when computing the ranking.
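A minimal sketch of what that self-judge exclusion could look like, assuming a simple per-battle record; the field names and model identifiers here are illustrative, not the site's actual data model:

```python
def filter_self_judged(battles):
    """Keep only battles whose judge is not also a participant."""
    return [
        b for b in battles
        if b["judge"] not in (b["model_a"], b["model_b"])
    ]

battles = [
    # Independent judge: counts toward the ranking.
    {"judge": "opus-4.6", "model_a": "gpt-5.4",
     "model_b": "gemini-3.1-pro", "winner": "model_a"},
    # Judge is also a participant: excluded from the ranking.
    {"judge": "gpt-5.4", "model_a": "gpt-5.4",
     "model_b": "opus-4.6", "winner": "model_b"},
]

ranked = filter_self_judged(battles)
print(len(ranked))  # 1 — the self-judged battle is dropped
```

The point is just that the filter runs before any ranking statistics are computed, so a model can never inflate its own score by judging its own battles.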

> There are lots of references to "just like LMArena", but LMArena is human evaluated?

Yeah, LMArena is human evaluated, but here I found it impractical to gather enough human evaluation data, because the effort it takes to compare results is much higher:

- for code, the judge needs to read through it to check code quality, and actually run it to see the output

- when the task produces a webpage or a document, the judge needs to check the content and layout visually

- when anything goes wrong, the judge needs to read the execution log to decide whether partial credit should be granted
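The steps above could be sketched roughly like this; the function names, prompt wording, and scoring flow are all hypothetical illustrations of the described workflow, not the site's actual code:

```python
def judge_battle(submission, run_output, execution_log, judge_llm):
    """Combine code review, output inspection, and log-based partial credit
    into a single prompt for a judge LLM."""
    prompt = (
        "Review the code below for quality, then inspect the program output.\n"
        f"CODE:\n{submission}\n\n"
        f"OUTPUT:\n{run_output}\n"
    )
    if "ERROR" in execution_log:
        # When something went wrong, the judge reads the execution log
        # to decide whether partial credit should be granted.
        prompt += (
            f"\nEXECUTION LOG (something went wrong):\n{execution_log}\n"
            "Grant partial credit where appropriate.\n"
        )
    # judge_llm would be a call to one of the judge models; here it is
    # any callable that maps a prompt to a verdict string.
    return judge_llm(prompt)
```

Because the judge has to ingest the full code, the run output, and possibly the log, its token cost per battle dwarfs each participant's.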

If you look at the cost details of each battle (available at the bottom of the battle detail page), the judge typically costs more than any participant model.

If we evaluated with humans, I would say each evaluation could easily take ~5-10 min.

refulgentis 4 hours ago | parent [-]

Fair enough, yeah, agent evals are hard especially across N models :/

Thanks for replying btw, didn't mean any disrespect, good on you for not getting aggro about feedback

skysniper 4 hours ago | parent [-]

I appreciate honest feedback, best way to learn :)