The biggest increase is on LiveCodeBench Pro: 2887. The rest are roughly in line with Opus 4.6, either slightly better or slightly worse.
But is it still terrible at tool calls in actual agentic flows?