| ▲ | esafranchik 7 hours ago |
| Two follow-ups: 1) How do you compare accuracy? By checking whether the answer appears in any of the returned grep/BM25/semble snippets? 2) How do you measure token use without the agent, prompt, and tools? |
|
| ▲ | stephantul 7 hours ago |
| 1) Yes! Strictly speaking it’s not accuracy, though, but NDCG.
2) We assume that if the correct answer is in the returned snippets, the agent does not need to read any further. |
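For anyone who wants to see what those two measurements could look like, here is a minimal sketch. It is not the semble benchmark code. The function names, the whitespace token counter, and the choice to compute the ideal ranking from the returned snippets only are all assumptions made for illustration.

```python
import math


def ndcg_at_k(relevances, k):
    """NDCG@k for one query.

    relevances[i] is the relevance (e.g. 0 or 1) of the i-th returned snippet.
    Assumption: the ideal ranking is built from the returned snippets only,
    so relevant items that were never retrieved are not penalised here.
    """
    def dcg(rels):
        # Rank 0 gets discount log2(2) = 1, rank 1 gets log2(3), and so on.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def tokens_until_hit(snippets, is_relevant, count_tokens=lambda s: len(s.split())):
    """Tokens read up to and including the first relevant snippet.

    Encodes the stated assumption that the agent stops reading once a
    returned snippet contains the answer. If nothing is relevant, every
    snippet is counted. count_tokens is a hypothetical stand-in for a
    real tokenizer.
    """
    total = 0
    for snippet in snippets:
        total += count_tokens(snippet)
        if is_relevant(snippet):
            break
    return total
```

With this kind of setup, a tool that ranks the right snippet first scores higher on NDCG and lower on tokens read, with no agent loop involved.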
| |
| ▲ | esafranchik 7 hours ago | | Wouldn't the NDCG and token results vary wildly depending on the agent's query and the number of returned items? For example, agents often run `grep -m 5 "QUERY"` several times with different queries, rather than one big grep for all items. | | |
| ▲ | stephantul 7 hours ago | | The same holds for semble: the agent can fire off many different semble queries with different k and other parameters. I guess the point we’re trying to make is that reaching the same outcome takes fewer semble queries than grep+readfile calls. |
|
|