| ▲ | esafranchik 8 hours ago |
| Is the benchmark measuring one-shot retrieval accuracy, or coding agent response accuracy? |
|
| ▲ | stephantul 8 hours ago | parent [-] |
| Hey! Co-author here. The benchmark currently only measures retrieval accuracy. We’re interested in measuring it end to end, and in optimizing things like the prompt and tools for that, but we just haven’t gotten around to it. |
| |
| ▲ | esafranchik 7 hours ago | parent [-] | | Two follow-ups: 1) How do you compare accuracy? By checking whether the answer is in any of the returned grep/BM25/semble snippets? 2) How do you measure token use without the agent, prompt, and tools? | | |
| ▲ | stephantul 7 hours ago | parent [-] | | 1) Yes! Though the metric is NDCG, not accuracy.
2) We assume that if the correct answer is in the returned snippets, the agent does not need to read further. | | |
| ▲ | esafranchik 7 hours ago | parent [-] | | Wouldn't NDCG/token results vary wildly depending on the agent's query and the number of returned items? e.g. agents often run `grep -m 5 "QUERY"` with different queries, instead of one big grep for all items. | | |
| ▲ | stephantul 7 hours ago | parent [-] | | The same holds for semble: the agent can fire off many different semble queries with different k/parameters. I guess the point we’re trying to make is that you need fewer semble queries to achieve the same outcome, compared to grep+readfile calls. |
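
For readers unfamiliar with the metric discussed above, here is a minimal sketch of NDCG@k with binary relevance, where a snippet counts as relevant if it contains the answer. This is the textbook formulation, not necessarily the benchmark's exact implementation. The function names, the snippet ids, and the binary-relevance choice are all assumptions made for illustration.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is discounted by log2(rank + 1), ranks starting at 1."""
    return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains))


def ndcg_at_k(retrieved_ids, relevant_ids, k):
    """NDCG@k with binary relevance (hypothetical helper, not the benchmark's code).

    retrieved_ids: ranked list of snippet ids returned by the tool (assumed unique).
    relevant_ids:  set of snippet ids that contain the answer.
    """
    gains = [1.0 if snippet_id in relevant_ids else 0.0 for snippet_id in retrieved_ids[:k]]
    # The ideal ranking places every relevant snippet at the top of the list.
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# The same result list scores differently depending on rank and on k.
top_hit = ndcg_at_k(["a", "b", "c"], {"a"}, k=3)     # answer at rank 1
second_hit = ndcg_at_k(["b", "a", "c"], {"a"}, k=3)  # answer at rank 2
cut_off = ndcg_at_k(["b", "a", "c"], {"a"}, k=1)     # answer falls outside the cutoff
```

Because the score depends on both the position of the hit and the cutoff `k`, it changes with the query and the number of returned items, which is the sensitivity raised in the thread.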
|
|
|
|