Your benchmark has Opus 4.7 performing significantly worse than Sonnet 4.6. Even if true on your benchmark, that is not representative of the overall performance of the models.

guilamu 3 hours ago | parent [-]

Yes, Opus 4.7 fast (no reasoning) did a worse job than Sonnet 4.6 high (with reasoning), according to the Gemini 3.1 Pro evaluation.

ac29 3 hours ago | parent [-]

Your table doesn't indicate reasoning vs. non-reasoning, or the reasoning level.

guilamu 3 hours ago | parent [-]

When nothing is noted, it's max reasoning (xhigh in Copilot Chat in VS Code, if available). The models not available on Copilot were tested through opencode (max reasoning), and DeepSeek V4 was tested through Cline (with max reasoning too).