"in this paper we primarily evaluate the LLM itself without external tool calls."

Maybe this is a factor?

No tools were used.

	chromacity 3 days ago
		IIRC, web chat interfaces often call tools or run code without surfacing that in any obvious way.