zamadatix 7 hours ago

Much appreciated, but I mean more around "what do the error bars in the figure represent" than what the turn scaling itself is.

esafak 7 hours ago | parent | next [-]

For the tasks in SWE-Bench Pro they obtained a distribution of agent turns, summarized as the box plot. The box likely describes the interquartile range, while the whiskers describe some other range. You'd have to read their report to be sure. https://en.wikipedia.org/wiki/Box_plot

jsnell 7 hours ago | parent | prev [-]

That's a box plot, so those are not error bars but a visualization of the distribution of a metric (min, max, median, 25th percentile, 75th percentile).

The benchmark consists of a bunch of tasks. The chart shows the distribution of the number of turns taken over all those tasks.
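
To make that concrete, here is a minimal sketch of how the five numbers behind such a box plot could be computed from per-task turn counts. The `turns_per_task` values and the `five_number_summary` helper are made up for illustration and are not data or code from the report. Whisker conventions also vary (min/max versus 1.5x the interquartile range), so this assumes the min/max reading.

```python
import statistics

def five_number_summary(values):
    """Return (min, q1, median, q3, max) for a list with at least two numbers."""
    # n=4 yields the three quartile cut points; "inclusive" treats the
    # data as the whole population rather than a sample of it.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical turn counts, one per benchmark task (not real SWE-Bench Pro data).
turns_per_task = [12, 18, 25, 31, 40, 44, 57, 63, 90]

lo, q1, med, q3, hi = five_number_summary(turns_per_task)
print(f"min={lo} q1={q1} median={med} q3={q3} max={hi}")
```

The box would span `q1` to `q3` with a line at `med`, and under the min/max assumption the whiskers would reach `lo` and `hi`. None of those are error bars; they just describe how spread out the turn counts are across tasks.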