From Table 3, it appears that DeepSeek R1 has the highest eval scores.
It's a 671B model vs. 405B, so obviously "larger".