edude03 7 hours ago

Essentially, the more turns you have, the more likely the agent is to fail, since the error compounds per turn. Agentic models are tuned for "long horizon tasks", i.e. being able to go many, many turns on the same problem without failing.
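To put rough numbers on the compounding, here is a minimal sketch. It assumes every turn succeeds independently with the same probability, which is a simplification of mine and not something the benchmark states; the function name and the 0.99 figure are made up for illustration.

```python
def success_probability(per_turn_success: float, turns: int) -> float:
    """Chance that every one of `turns` turns succeeds.

    Assumption (not from the benchmark): turns are independent and
    share the same per-turn success rate, so the rates multiply.
    """
    if not 0.0 <= per_turn_success <= 1.0:
        raise ValueError("per_turn_success must be between 0 and 1")
    if turns < 0:
        raise ValueError("turns must be non-negative")
    return per_turn_success ** turns


# Hypothetical 99% per-turn success rate at a few task lengths.
for n in (1, 10, 50, 100):
    print(n, round(success_probability(0.99, n), 3))
```

Under that assumption, even a 99% per-turn rate leaves only about a 37% chance of getting through 100 turns cleanly, which is why long-horizon tuning matters.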

zamadatix 7 hours ago | parent [-]

Much appreciated, but I mean more around "what do the error bars in the figure represent" than what the turn scaling itself is.

esafak 7 hours ago | parent | next [-]

For the tasks in SWE-Bench Pro they obtained a distribution of agent turns, summarized as a box plot. The box likely describes the interquartile range, while the whiskers describe some other range. You'd have to read their report to be sure. https://en.wikipedia.org/wiki/Box_plot

jsnell 7 hours ago | parent | prev [-]

That's a box plot, so those are not error bars but a visualization of the distribution of a metric (min, max, median, 25th percentile, 75th percentile).

The benchmark consists of a bunch of tasks. The chart shows the distribution of the number of turns taken over all those tasks.
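As a rough sketch of what goes into such a plot, the five numbers can be computed from a list of per-task turn counts. The turn counts and the function name below are hypothetical, and the whiskers here are taken as min and max; the actual figure may use a different whisker convention (1.5 times the interquartile range is common) or a different quartile method.

```python
from statistics import quantiles


def five_number_summary(turns):
    """Return min, quartiles, and max for a list of turn counts."""
    data = sorted(turns)
    # n=4 yields the three quartile cut points; "inclusive" treats the
    # data as the whole population rather than a sample of it.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}


# Hypothetical turn counts, one per task (made up for illustration).
turns_per_task = [10, 20, 30, 40, 50]
print(five_number_summary(turns_per_task))
```

The box spans q1 to q3 with a line at the median, so none of these values say anything about measurement error; they only describe how spread out the turn counts are across tasks.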