_boffin_ 2 days ago

What was the main focus when training this model? Beyond the ELO score, the models (31B / 26B-A4) appear to be underperforming on some of the typical benchmarks by a wide margin. Do you believe there's an issue with the tests, or that the results are misleading (e.g. comparison models benchmaxxing)?

Thank you for the release.

BoorishBears 2 days ago | parent [-]

Benchmarks are a pox on LLMs.

You can use this model for about five seconds and see that its reasoning is in a league well above any Qwen model, yet people still assume benchmarks that are openly being used for training remain relevant.

girvo a day ago | parent | next [-]

They really are. Benchmaxxing is real… but the Qwen 3.5 series of models is still very impressive. I'm looking forward to trying out Gemma.

j45 2 days ago | parent | prev [-]

You definitely have to try each model on your own use case personally; many models can be trained to perform better on these tests, but that improvement might not transfer to your workload.