christianstump an hour ago

Please indicate which other models you would like to see included. (And I agree that the context window limitations were not reasonable.) Finally: running these few prompts would have cost $10-20k had I run them myself via the API. (And the company didn't ask to contribute; I asked whether they would be willing to do so, just saying.)

jona-f 8 minutes ago

Kimi K2.6 and MiMo 2.5 Pro are ahead of DeepSeek V4 in other benchmarks. Anyhow, great work. The benchmark seems to show strong separation between models, so it should be very useful for improving the math capabilities of the next generation of AI. I'm more interested in the prompt engineering/orchestration and technical details (what I can do without millions), but I get that you are mathematicians, so your focus is naturally on the math. Sorry for the nagging.