▲ | jsnell 6 days ago
| https://aider.chat/docs/leaderboards/ shows 73% rather than 69% for Gemini 2.5 Pro? Looks like they also added the cost of the benchmark run to the leaderboard, which is quite cool. Cost per output token is no longer representative of the actual cost when the number of tokens can vary by an order of magnitude for the same problem just based on how many thinking tokens the model is told to use. |
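Back-of-the-envelope, with made-up prices and token counts purely to illustrate the effect (this is not actual Gemini 2.5 Pro pricing or real benchmark numbers):

    # Hypothetical numbers, only to show how thinking tokens can dominate the bill.
    PRICE_PER_OUTPUT_TOKEN = 10.00 / 1_000_000  # assume $10 per million output tokens

    def run_cost(visible_tokens: int, thinking_tokens: int) -> float:
        # Thinking tokens are billed like output tokens, so they add directly to the cost.
        return (visible_tokens + thinking_tokens) * PRICE_PER_OUTPUT_TOKEN

    # Same problem, same visible answer length, different thinking budgets:
    low = run_cost(visible_tokens=2_000, thinking_tokens=1_000)
    high = run_cost(visible_tokens=2_000, thinking_tokens=30_000)
    print(f"low thinking budget:  ${low:.2f}")   # $0.03
    print(f"high thinking budget: ${high:.2f}")  # $0.32, ~10x more at the same per-token price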
|
| ▲ | anotherpaulg 6 days ago | parent | next [-] |
Aider author here. Based on some DMs with the Gemini team, they weren't aware that aider supports a "diff-fenced" edit format, which is specifically tuned to work well with Gemini models, so they didn't think to try it when they ran the aider benchmarks internally. Beyond that, I spend significant energy tuning aider to work well with top models. That is in fact the entire reason for aider's benchmark suite: to quantitatively measure and improve how well aider works with LLMs. Aider makes various adjustments to how it prompts and interacts with nearly every top model, to provide the best possible AI coding results.
▲ | BonoboIO 6 days ago | parent | next [-]
Thank you for providing such amazing tools for us. Aider is a godsend when working with a large codebase to get an overview.
▲ | modeless 6 days ago | parent | prev [-]
Thanks, that's interesting info. It seems to me that such tuning, while it makes Aider more useful and makes the benchmark useful in the specific context of deciding which model to use in Aider itself, reduces the benchmark's value for evaluating overall model quality in other tools or contexts, which is how people use it today. Models that get more tuning will outperform models that get less tuning, and existing models will have an advantage over new ones by virtue of already being tuned.
▲ | jmtulloss 6 days ago | parent [-]
I think you could argue the other side too... All of these models do better and worse with subtly different prompting that is non-obvious and unintuitive. Anybody using different models for "real work" is going to be tuning their prompts specifically to a model. Aider (without inside knowledge) can't possibly max out a given model's ability, but it can provide a reasonable approximation of what somebody can achieve with some effort.
|
|
|
| ▲ | modeless 6 days ago | parent | prev [-] |
There are different scores reported by Google for "diff" and "whole" modes, and the other models' scores were for "diff", so I chose the "diff" score. It's hard to make a real apples-to-apples comparison.
▲ | jsnell 6 days ago | parent | next [-]
The 73% on the current leaderboard is using "diff", not "whole". (Well, diff-fenced, but the difference is just the location of the filename.)
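To make the filename-location difference concrete, here is a rough sketch of the two layouts as I understand them from aider's docs (the file path and code below are made up, and the real prompts wrap more instructions around these blocks; the "whole" format, for comparison, simply has the model return the entire updated file):

    # Sketch of aider's "diff" vs "diff-fenced" edit formats (simplified, invented example).
    SEARCH_REPLACE = (
        "<<<<<<< SEARCH\n"
        "def old_function():\n"
        "    pass\n"
        "=======\n"
        "def new_function():\n"
        "    pass\n"
        ">>>>>>> REPLACE"
    )

    # "diff": the file path sits on its own line *before* the opening fence.
    diff_block = "path/to/file.py\n```python\n" + SEARCH_REPLACE + "\n```"

    # "diff-fenced": the same block, except the file path moves *inside* the fence,
    # which is the variant reportedly tuned for Gemini models.
    diff_fenced_block = "```python\npath/to/file.py\n" + SEARCH_REPLACE + "\n```"

    print(diff_block, diff_fenced_block, sep="\n\n")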
▲ | tcdent 6 days ago | parent | prev [-]
They just pick the best performer out of the built-in modes they offer. It's an interesting data point about the model's behavior, but even more so it's a recommendation for how to configure the model for optimal performance. I do consider this to be an apples-to-apples benchmark, since they're evaluating real-world performance.
|