pton_xd 3 days ago

"in this paper we primarily evaluate the LLM itself without external tool calls."

Maybe this is a factor?

simianwords 3 days ago | parent [-]

No tools were used.

chromacity 3 days ago | parent [-]

IIRC, web chat often uses tools / code without surfacing this information in any obvious way.