gertlabs a day ago

One-shot performance often indicates the hardest problems a model will be able to understand. We run an evaluation that tests both agentic and one-shot performance, and we find that Chinese models are almost universally very good at using tools and a harness to iterate toward a better solution, whereas their initial responses rank relatively low.

Compare that to Gemini models, which show impressive fluid intelligence on the first response but fail to call tools or explore correctly, which limits their usefulness for agentic coding.

Neither will be great for coding in a computational chemistry repo, for different reasons, but the model with strong one-shot performance will be less likely to make subtle errors indicative of poor understanding, so we weight both capabilities into each model's final score.
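
To make the idea of weighting both capabilities concrete, here is a minimal sketch. It is not the actual formula behind the rankings: the `ModelScores` type, the `combined_score` function, the 0-to-1 score range, and the default 50/50 weight are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelScores:
    """Hypothetical per-model results, each normalized to the range 0..1."""
    one_shot: float  # quality of the first response, no tools
    agentic: float   # quality after iterating with tools in a harness


def combined_score(scores: ModelScores, one_shot_weight: float = 0.5) -> float:
    """Blend the two capabilities into one number with a linear weight.

    The 0.5 default is an arbitrary illustrative choice, not a published value.
    """
    if not 0.0 <= one_shot_weight <= 1.0:
        raise ValueError("one_shot_weight must be between 0 and 1")
    return (one_shot_weight * scores.one_shot
            + (1.0 - one_shot_weight) * scores.agentic)


# Made-up numbers: a strong first responder vs. a strong iterator.
strong_first_try = ModelScores(one_shot=0.8, agentic=0.4)
strong_iterator = ModelScores(one_shot=0.4, agentic=0.8)
print(combined_score(strong_first_try), combined_score(strong_iterator))
```

With equal weights the two made-up profiles tie; shifting `one_shot_weight` up or down is what would separate them.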

The latest Anthropic and OpenAI models excel in both domains.

Data at https://gertlabs.com/rankings

mycall 19 hours ago

> The latest Anthropic and OpenAI models excel in both domains.

Is that because OpenAI models are not a single model but a cluster of models that specialize in different domains?

gertlabs 19 hours ago

By domain, I really meant "tool calling" and "one-shot fluid intelligence".

Anthropic models were the original leaders in tool calling and agentic work, even when other models felt significantly smarter (Claude Sonnet 3.5 vs Gemini 2.5 Pro, for example). OpenAI models were the opposite: they started out smart (more correct solutions on the first try) and got better at exploring and iterating with tools in 2026. The latest releases (Opus 4.5+ and GPT 5.4+) excel at both.