mdasen 7 hours ago

I'm a bit skeptical.

Cursor's benchmark finds that Cursor's model (Composer 2.5) is basically as good as Opus 4.8 max and GPT-5.5 xhigh, but at a fraction of the price.

Artificial Analysis' testing shows Composer 2.5 to be pretty far behind: https://artificialanalysis.ai/agents/coding-agents. Look at the DeepSWE benchmark (which is probably the hardest to game at this point): GPT-5.5 xhigh gets a 64, Opus 4.8 max gets 56, and Composer 2.5 gets 16.

I don't doubt that Cursor works well for some people. It's beating DeepSeek v4 Pro in the DeepSWE benchmark and that's a very capable model. But I'm skeptical of the claims that it's a competitor for Opus 4.8 and GPT-5.5. It just seems convenient that their model does so well on their own benchmark while third-party benchmarks have it far behind. Maybe it's a really great benchmark and a better measure than third-party ones - I'd love for a cheap model to do as well as the expensive ones.

justachillguy 3 hours ago | parent | next [-]

Naturally, given it’s their benchmark they have overfitted their model somewhat to it.

famouswaffles 7 hours ago | parent | prev | next [-]

Cursor sessions are pretty much what composer models are RL'd on. This bench and the training data are/should be basically the same distribution.

burmanm 5 hours ago | parent | prev | next [-]

DeepSWE is slightly flawed in the sense that it uses only its own harness, and that causes issues for models that are not correctly supported by it. There's a huge amount of evidence that the harness plays a big role in how these models work, and yet DeepSWE entirely removes that (and has probably only been tested to work fine with some favourite model of theirs).

There are also issues with cost calculation (as that harness doesn't use caches) and so on, as reported in their GitHub issues.

None of the benchmarks are perfect, but that does explain a lot of the variations between benchmarks.

muzani 4 hours ago | parent | prev | next [-]

Anecdotally, I find Composer 2.5 to be useless. I do use light LLMs like Claude Haiku and some of Cursor's older free models, but Composer is negative productivity for me.

maxdo 4 hours ago | parent [-]

The opposite for me: I use Composer for everything, such as triggering and monitoring a 10-step release process. It's a very capable model.

vorticalbox 2 hours ago | parent [-]

This is my finding too. I have moved to it fully for most planning and coding.

For most tasks it is capable and very cheap; a day's worth of tasks costs about $10.

urbsgpw 43 minutes ago | parent [-]

Same here. Maybe I'm underusing it a bit, because for anything that is a bit more complex I tend to err on the safe side and go with Anthropic, but I wonder if that's just a placebo effect because I pay more for it.

I do feel that they've really upped their game with composer this year though.

datadrivenangel 6 hours ago | parent | prev | next [-]

For lighter interactive agentic coding, where you type stuff into an IDE and a minute or three later get results back for review, composer 2.5 is honestly pretty great. The results get notably worse for larger tasks though.

anon7000 3 hours ago | parent [-]

Agreed. It’s worse than Opus of course. But Opus takes more than 10x longer to give you something to look at. I’m not kidding, I “benchmarked” a real ticket I was working on. Opus 4.7 took more than 30min. Opus 4.8 took over an hour. Composer 2.5 took 5min on the exact same prompt & local setup. My subjective review is that composer’s code was only like 10-20% worse. It still worked, it was just a bit less clean and a little more hacky. But it’s not like Opus is flawless either. At the end of the day, if it takes an hour to get to draft code I can look at and iterate on… that’s fucking impossible for me. Unless it did an excellent job. But as long as I still need to review and follow up with changes, Opus is just too slow. It’s really frustrating because it’s a lot slower than it was 6mo ago, and not noticeably better. Fable seems a step in the right direction but is $$$$

WinstonSmith84 4 hours ago | parent | prev | next [-]

That benchmark seems to match my experience. GPT 5.5 is significantly better than Opus 4.8; last time I tried Composer 2.5 it was truly dumb; and Fable looks to me to be on par with GPT 5.5, but different overall. The best outcome comes from an LLM peer review between GPT and Opus (now Fable).

apothegm 2 hours ago | parent | prev | next [-]

Composer writes the worst, stupidest, most naive and straight-up brain-dead code you could imagine. Fast and cheap is about all it’s got going for it. I mostly use it for “sort these lines alphabetically” and stuff that’s a smidge too complex for regex find/replace.

simondotau 2 hours ago | parent [-]

I primarily use Composer. I wanted to build something from scratch recently and, thinking I was missing out on something, I got Opus to build it. I wasn't blown away. I gave the same prompts to Composer and the code it came up with was different but similar in quality. I ended up going with the Composer code because its faster response time made it easier to iterate on improvements.

ciaf 6 hours ago | parent | prev | next [-]

By the same token, Fable 5 is given a score of 77 vs 76 for GPT 5.5

whazor 5 hours ago | parent | prev [-]

I mean, they train their model on their training data. So it should score well on their own benchmark.