"in this paper we primarily evaluate the LLM itself without external tool calls."
Maybe this is a factor?
No tools were used.
IIRC, web chat often uses tools / code without surfacing this information in any obvious way.