▲ modeless · 6 days ago

Numbers for SWE-bench Verified, Aider Polyglot, cost per million output tokens, output tokens per second, and knowledge cutoff month/year:

                  SWE   Aider   Cost   Fast   Fresh
    Claude 3.7    70%   65%     $15    77     8/24
    Gemini 2.5    64%   69%     $10    200    1/25
    GPT-4.1       55%   53%     $8     169    6/24
    DeepSeek R1   49%   57%     $2.2   22     7/24
    Grok 3 Beta   ?     53%     $15    ?      11/24

I'm not sure this is really an apples-to-apples comparison, as it may involve different test scaffolding and levels of "thinking". Tokens-per-second numbers are from https://artificialanalysis.ai/models/gpt-4o-chatgpt-03-25/pr... and I'm assuming 4.1 runs at the speed of 4o, given that the "latency" graph in the article puts them at the same latency.

Is it available in Cursor yet?
▲ anotherpaulg · 6 days ago

I just finished updating the aider polyglot leaderboard [0] with GPT-4.1, mini, and nano. My results basically agree with OpenAI's published numbers.

Results, with other models for comparison (* marks the new models):

    Model                           Score   Cost
    Gemini 2.5 Pro Preview 03-25    72.9%   $ 6.32
    claude-3-7-sonnet-20250219      64.9%   $36.83
    o3-mini (high)                  60.4%   $18.16
    Grok 3 Beta                     53.3%   $11.03
  * gpt-4.1                         52.4%   $ 9.86
    Grok 3 Mini Beta (high)         49.3%   $ 0.73
  * gpt-4.1-mini                    32.4%   $ 1.99
    gpt-4o-2024-11-20               18.2%   $ 6.74
  * gpt-4.1-nano                     8.9%   $ 0.43

Aider v0.82.0 is also out with support for these new models [1]. Aider wrote 92% of the code in this release, a tie with v0.78.0 from 3 weeks ago.

[0] https://aider.chat/docs/leaderboards/
[1] https://aider.chat/HISTORY.html
▲ pzo · 6 days ago

Did you benchmark the combo DeepSeek R1 + DeepSeek V3 (0324)? There is a combo in 3rd place, DeepSeek R1 + claude-3-5-sonnet-20241022, and the new V3 beats Claude 3.5, so in theory R1 + V3 should be in 2nd place. Just curious if that would be the case.
▲ purplerabbit · 6 days ago

What model are you personally using in your aider coding? :)
▲ jsnell · 6 days ago

https://aider.chat/docs/leaderboards/ shows 73% rather than 69% for Gemini 2.5 Pro? Looks like they also added the cost of the benchmark run to the leaderboard, which is quite cool.

Cost per output token is no longer representative of the actual cost when the number of tokens can vary by an order of magnitude for the same problem, just based on how many thinking tokens the model is told to use.
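To make that point concrete, here is a toy calculation; the price and token counts below are invented for illustration, not taken from any benchmark:

    # Toy example: same $/token price, very different real cost once
    # "thinking" tokens are counted. All numbers are made up.
    PRICE_PER_M_OUTPUT = 10.00   # dollars per million output tokens

    answer_tokens = 2_000        # the visible answer, same in both runs

    for label, thinking_tokens in [("brief thinking", 1_000),
                                   ("extended thinking", 30_000)]:
        total = answer_tokens + thinking_tokens
        cost = total / 1_000_000 * PRICE_PER_M_OUTPUT
        print(f"{label}: {total:,} tokens -> ${cost:.3f}")

    # brief thinking: 3,000 tokens -> $0.030
    # extended thinking: 32,000 tokens -> $0.320 (over 10x, same price)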
▲ anotherpaulg · 6 days ago

Aider author here. Based on some DMs with the Gemini team, they weren't aware that aider supports a "diff-fenced" edit format, and that it is specifically tuned to work well with Gemini models. So they didn't think to try it when they ran the aider benchmarks internally.

Beyond that, I spend significant energy tuning aider to work well with top models. That is in fact the entire reason for aider's benchmark suite: to quantitatively measure and improve how well aider works with LLMs. Aider makes various adjustments to how it prompts and interacts with nearly every top model, to provide the best possible AI coding results.
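For anyone wanting to reproduce this, aider exposes the edit format as a CLI flag. A rough sketch; the Gemini model alias shown is illustrative and may differ from the current name, so check aider's docs:

    # Assumes aider >= 0.82.0; model alias is illustrative.
    aider --model gemini/gemini-2.5-pro-preview-03-25 --edit-format diff-fenced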
▲ BonoboIO · 6 days ago

Thank you for providing such amazing tools for us. Aider is a godsend when working with a large codebase to get an overview.
▲ modeless · 6 days ago

Thanks, that's interesting info. It seems to me that such tuning, while making Aider more useful, and making the benchmark useful in the specific context of deciding which model to use in Aider itself, reduces the value of the benchmark for evaluating overall model quality in other tools or contexts, which is how people use it today. Models that get more tuning will outperform models that get less tuning, and existing models will have an advantage over new ones by virtue of already being tuned.
▲ jmtulloss · 6 days ago

I think you could argue the other side too. All of these models do better or worse with subtly different prompting that is non-obvious and unintuitive. Anybody using different models for "real work" is going to be tuning their prompts specifically to a model. Aider (without inside knowledge) can't possibly max out a given model's ability, but it can provide a reasonable approximation of what somebody can achieve with some effort.
▲ modeless · 6 days ago

There are different scores reported by Google for "diff" and "whole" modes, and the others were "diff", so I chose the "diff" score. Hard to make a real apples-to-apples comparison.
▲ jsnell · 6 days ago

The 73% on the current leaderboard is using "diff", not "whole". (Well, diff-fenced, but the difference is just the location of the filename.)
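A simplified sketch of the difference, paraphrased from aider's edit-format docs (exact markers may vary by version): "diff" puts the file path outside the fenced block, while "diff-fenced" moves it inside the fence.

    # "diff": file path outside the fence
    path/to/file.py
    ```
    <<<<<<< SEARCH
    old_line()
    =======
    new_line()
    >>>>>>> REPLACE
    ```

    # "diff-fenced": file path inside the fence
    ```
    path/to/file.py
    <<<<<<< SEARCH
    old_line()
    =======
    new_line()
    >>>>>>> REPLACE
    ```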
▲ tcdent · 6 days ago

They just pick the best performer out of the built-in modes they offer. That's an interesting data point about the model's behavior, but even more so it's a recommendation for how to configure the model for optimal performance. I do consider this an apples-to-apples benchmark, since they're evaluating real-world performance.
▲ meetpateltech · 6 days ago

Yes, it is available in Cursor [1] and Windsurf [2] as well.

[1] https://twitter.com/cursor_ai/status/1911835651810738406
[2] https://twitter.com/windsurf_ai/status/1911833698825286142
▲ tomjen3 · 6 days ago

It's available for free in Windsurf, so you can try it out there.

Edit: Now also in Cursor.
▲ ilrwbwrkhv · 6 days ago

Yup, GPT-4.1 isn't good at all compared to the others. I tried a bunch of different scenarios; for me the winners are:

  - DeepSeek for general chat and research
  - Claude 3.7 for coding
  - Gemini 2.5 Pro Experimental for deep research

In terms of price, DeepSeek is still absolutely fire! OpenAI is in trouble, honestly.
▲ torginus · 5 days ago

One task I do is feed the models the text of entire books and ask them various questions about it ("what happened in Chapter 4", "what did character X do in the book", etc.). GPT-4.1 is the first model that has provided a human-quality answer to these questions. It seems to be the first model that can follow plotlines and character motivations accurately. Since text processing is a very important use case for LLMs, I'd say that's quite noteworthy.
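A minimal sketch of that book-Q&A workflow, assuming the OpenAI Python SDK, that the whole book fits in the model's context window, and a hypothetical local file book.txt:

    # Ask GPT-4.1 a question about a full book pasted into the prompt.
    # Assumes the OpenAI Python SDK (1.x) with OPENAI_API_KEY set;
    # "book.txt" is a hypothetical stand-in for the book text.
    from openai import OpenAI

    client = OpenAI()

    with open("book.txt", encoding="utf-8") as f:
        book = f.read()

    question = "What did character X do in the book?"

    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system",
             "content": "Answer using only the book text provided."},
            {"role": "user", "content": f"{book}\n\nQuestion: {question}"},
        ],
    )
    print(resp.choices[0].message.content)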
▲ soheil · 6 days ago

Yes, on both Cursor and Windsurf. https://twitter.com/cursor_ai/status/1911835651810738406