| ▲ | mcv 2 days ago |
| Opus 4.5 ate through my Copilot quota last month, and it's already halfway through it for this month. I've used it a lot, for really complex code. And my conclusion is: it's still not as smart as a good human programmer. It frequently got stuck, went down wrong paths, ignored what I told it to do and did something wrong instead, or even repeated a previous mistake I had already corrected. Yet in other ways, it's unbelievably good. I can give it a directory full of code to analyze, and it can tell me it's an implementation of Sugiyama's graph layout algorithm (as used in dagre) and immediately identify the file with the error. That's unbelievably impressive. Unfortunately it can't fix the error. The error was one of the many errors it made during previous sessions. So my verdict is that it's great for code analysis, and it's fantastic for injecting some book knowledge on complex topics into your programming, but it can't tackle those complex problems by itself. Yesterday and today I was upgrading a bunch of unit tests because of a dependency upgrade, and while it was occasionally very helpful, it also regularly got stuck. I got a lot more done than usual in the same time, but I do wonder whether it wasn't all too much. Wasn't there an easier way to do this? I didn't look for one, because every step of the way, Opus's solution seemed obvious and easy, and I had no idea how deep a pit it was getting me into. I should have been more critical of the direction it was pointing me in. |
|
| ▲ | hawtads 2 days ago | parent | next [-] |
| Copilot and many coding agents truncate the context window and use dynamic summarization to keep their own costs low. That's how they are able to offer flat-fee plans. You can see some of the context limits here: https://models.dev/ If you want the full capability, use the API with something like opencode. You will find that a single PR can easily rack up three digits of API costs. |
| |
| ▲ | verdverm 2 days ago | parent | next [-] | | Getting off of their plans and prompts is so worth it; I know from experience. I'm paying less and getting more so far, paying by token as a heavy gemini-3-flash user. It's a really good model. This is the future (distillations into fast models that are good enough for 90% of tasks), not mega models like Claude. Those will still be created for distillation and the harder problems. | |
| ▲ | mcv 2 days ago | parent | prev [-] | | Maybe not, then. I'm afraid I have no idea what those numbers mean, but it looks like Gemini and ChatGPT 4 can handle a much larger context than Opus, and Opus 4.5 is cheaper than older versions. Is that correct? Because I could be misinterpreting that table. | | |
| ▲ | esperent 2 days ago | parent | next [-] | | I don't know about GPT4 but the latest one (GPT 5.2) has a 200k context window while Gemini has 1M, five times larger. You'll want to stay within the first 100k tokens on all of them to avoid hitting quotas very quickly, though (either start a new task or compact when you reach that point), so in practice there's no difference. I've been cycling between a couple of $20 accounts to avoid running out of quota and the latest models from all of them are great. I'd give GPT 5.2 codex the slight edge, but not by a lot. The latest Claude is about the same too, but the limits on the $20 plan are too low for me to bother with. The last week has made me realize how close these are to being commodities already. Even the agents' CLIs are nearly the same, bar some minor quirks (although I've hit more bugs in Gemini CLI, but each time I can just save a checkpoint and restart). The real differentiating factor right now is quota and cost. | | |
| ▲ | mcv 13 hours ago | parent [-] | | > You'll be wanting to stay within the first 100k on all of them I must admit I have no idea how to do that or what that even means. I get that bigger context window is better, but what does it mean exactly? How do you stay within that first 100k? 100k what exactly? | | |
| ▲ | hawtads 11 hours ago | parent [-] | | Okay, here's the tl;dr: attention-based neural network architectures (on which the majority of LLMs are built) have a unit economic cost that scales roughly as n^2, i.e. quadratically, for both memory and compute. In other words, the longer the context window, the more expensive it is for the upstream provider. That's one cost. The second cost is that you have to resend the entire context every time you send a new message. So the context is basically (where a, b, and c are messages): first context: a; second context: a->b; third context: a->b->c. From the developer's point of view it's a mostly stateless process (there are some short-term caching mechanisms, YMMV by provider; that's why "cached" messages, especially system prompts, are cheaper). The state, i.e. the context window string, is managed by the end-user application (in other words, the coding agent, the IDE, the ChatGPT UI client, etc.). The per-token price you pay is an amortized (averaged) cost of memory+compute, while the actual cost is mostly quadratic with respect to each marginal token. The longer the context window, the more expensive things are.
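To make the statelessness concrete, here's a minimal Python sketch (not any particular provider's SDK; call_llm and the 4-characters-per-token heuristic are made-up stand-ins) of a client that owns the history and resends all of it on every turn:

```python
# Minimal sketch of a stateless chat loop: the client owns the context
# and resends the entire history with every request.
# `call_llm` is a hypothetical stand-in for a real provider API call.

def call_llm(messages):
    # placeholder: in reality this is an HTTP request to the provider
    return "ok"

def rough_token_count(messages):
    # very rough heuristic: ~4 characters per token
    return sum(len(m["content"]) for m in messages) // 4

history = [{"role": "system", "content": "You are a coding agent."}]

for user_turn in ["a", "b", "c"]:  # first context: a, then a->b, then a->b->c
    history.append({"role": "user", "content": user_turn})
    reply = call_llm(history)      # the FULL history goes over the wire again
    history.append({"role": "assistant", "content": reply})
    print(f"this turn sent ~{rough_token_count(history)} tokens of context")
```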
Because of the above, AI agent providers (especially those that charge flat-fee subscription plans) are incentivized to keep costs low by limiting the maximum context window size. (And if you think about it, your API pricing is a quadratic cost curve projected onto a straight line: a flat fee per token. So the model hosting provider may make more profit when users send shorter contexts than when they constantly saturate the window. YMMV of course, but it's a race to the bottom right now for LLM unit economics.) They do this by interrupting a task halfway through and generating a "summary" of the task progress, then prompting the LLM again with a fresh prompt plus the "summary" so far, and the LLM restarts the task from where it left off. Of course, text is a poor representation of the LLM's internal state, but it's the best option AI applications have so far to keep costs low. Another thing to keep in mind is that LLMs perform worse the larger the input. This is due to a variety of factors (mostly, I think, because there isn't enough training data to saturate the massive context window sizes). The general graph of LLM performance versus context length looks something like this:
https://cobusgreyling.medium.com/llm-context-rot-28a6d039965...
https://research.trychroma.com/context-rot There are a bunch of tests and benchmarks (commonly referred to as "needle in a haystack") for measuring LLM performance at large context window sizes, but it's still an open area of research. https://cloud.google.com/blog/products/ai-machine-learning/t... The thing is, generally speaking, you will get slightly better performance if you can squeeze all your code and your problem into the context window, because the LLM gets a "whole picture" view of your codebase/problem instead of a game of broken-telephone summaries every few dozen thousand tokens. Take this with a grain of salt, as the field is changing rapidly and it might not be valid in a month or two. Keep in mind that if the problem you are solving requires you to saturate the LLM's entire context window, a single request can cost you dollars. And if you are using a 1M+ context window model like Gemini, you can rack up costs fairly rapidly. |
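The summarize-and-restart trick works roughly like this (a toy Python sketch: summarize() here just concatenates snippets, where a real agent would ask the model itself to write the summary; the token budget is an assumed number):

```python
# Toy sketch of context compaction: once the history exceeds a token
# budget, replace the older part with a summary plus the latest turns.
# A real agent would have the LLM write the summary; here it's faked.

TOKEN_BUDGET = 100_000  # assumed budget to keep per-request cost sane

def rough_token_count(messages):
    return sum(len(m["content"]) for m in messages) // 4

def summarize(messages):
    # stand-in for "prompt the LLM for a summary of progress so far"
    return "Summary of earlier work: " + " / ".join(m["content"][:40] for m in messages)

def maybe_compact(history):
    if rough_token_count(history) <= TOKEN_BUDGET:
        return history
    older, recent = history[:-4], history[-4:]   # keep the last few turns verbatim
    return [{"role": "system", "content": summarize(older)}] + recent
```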
|
| |
| ▲ | cma 2 days ago | parent | prev [-] | | You need to find where context actually breaks down. Claude was better at holding up even when Gemini had 5x more context on paper, but both have improved with the latest releases. |
|
|
|
| ▲ | deanc 2 days ago | parent | prev | next [-] |
| People are completely missing the point about agentic development. The model is obviously a huge factor in the quality of the output, but the real magic lies in how the tools manage and inject context into it, as well as in the tooling around it. I switched from Copilot to Cursor at the end of 2025, and it was absolute night and day in terms of how the agents behaved. |
| |
| ▲ | port3000 2 days ago | parent | next [-] | | Interesting you have this opinion yet you're using Cursor instead of Claude Code. By the same logic, you should get even better results directly using Anthropic's wrapper for their own model. | | |
| ▲ | deanc 2 days ago | parent [-] | | My employer doesn't allow Claude Code yet. I'm fully aware, from speaking to peers, that they are getting even better performance out of Claude Code. |
| |
| ▲ | causal a day ago | parent | prev [-] | | In my experience GPT-5 is also much more effective in the Cursor context than the Codex context. Cursor deserves props for doing something right under the hood. |
|
|
| ▲ | zmmmmm 2 days ago | parent | prev | next [-] |
| Yes, just using AI for code analysis is way underappreciated, I think. Even the people most sceptical of using it for coding should try it out as a tool for Q&A-style code interrogation as well as for generating documentation. I would say it zero-shots documentation generation better than most human efforts, to the point that it raises the question of whether it's worth keeping the documentation around in the first place. Obviously it can make mistakes, but I would say they are below the threshold of human mistakes, from what I've seen. |
| |
| ▲ | sfink 2 days ago | parent [-] | | (I haven't used AI much, so feel free to ignore me.) This is one thing I've tried using it for, and I've found this to be very, very tricky. At first glance, it seems unbelievably good. The comments read well, they seem correct, and they even include some very non-obvious information. But almost every time I sit down and really think about a comment that includes any of that more complex analysis, I end up discarding it. Often, it's right but it's missing the point, in a way that will lead a reader astray. It's subtle and I really ought to dig up an example, but I'm unable to find the session I'm thinking about. This was with ChatGPT 5, fwiw. It's totally possible that other models do better. (Or even newer ChatGPT; this was very early on in 5.) Code review is similar. It comes up with clever chains of reasoning for why something is problematic, and initially convinces me. But when I dig into it, the review comment ends up not applying. It could also be the specific codebase I'm using this on? (It's the SpiderMonkey source.) | | |
| ▲ | zmmmmm 2 days ago | parent | next [-] | | My main experience is with Anthropic models. I've had some encounters with inaccuracies, but my general experience has been amazing. I've cloned completely foreign git repos, cranked up the tool and just said "I'm having this bug, give me an overview of how X and Y work", and it will create great high-level conceptual outlines that let me dive straight in, where without it I would spend a long time just flailing around. I do think an essential skill is developing just the right level of scepticism. It's not really different to working with a human, though. If a human tells me X or Y works in a certain way, I always allow a small margin of possibility that they are wrong. | |
| ▲ | imp0cat 2 days ago | parent [-] | | But have you actually thoroughly checked the documentation it generated? My experience suggests it can often be subtly wrong. |
| |
| ▲ | mcv 13 hours ago | parent | prev [-] | | They do have a knack for missing the point. Even Opus 4.5 can laser focus on the wrong thing. It does take skill and experience to interpret them correctly and set them straight when they go wrong. Even so, for understanding what happens in a big chunk of code, they're pretty great. |
|
|
|
| ▲ | yieldcrv 2 days ago | parent | prev | next [-] |
| It acts differently when you use it through a third-party tool. Try it again using Claude Code and a subscription to Claude. It can run as a chat window in VS Code and Cursor too. |
| |
| ▲ | mcv 2 days ago | parent [-] | | My employer gets me a Copilot subscription with access to Claude, not a subscription to Claude Code, unfortunately. | | |
| ▲ | yieldcrv 2 days ago | parent [-] | | At this point I would suggest getting a $20 subscription to start and seeing if you can expense it. The tooling is almost as important as the model. | |
| ▲ | mcv 13 hours ago | parent [-] | | Security and approval are considered more important here. Just getting approval for neo4j, for the clearest-ever use case for it, took a year. I'm not going to spend my energy on getting approval for Claude Code. |
|
|
|
|
| ▲ | josu 2 days ago | parent | prev | next [-] |
| >So my verdict is that it's great for code analysis, and it's fantastic for injecting some book knowledge on complex topics into your programming, but it can't tackle those complex problems by itself. I don't think you've seen the full potential. I'm currently #1 on 5 different very complex computer engineering problems, and I can't even write a "hello world" in rust or cpp. You no longer need to know how to write code, you just need to understand the task at a high level and nudge the agents in the right direction. The game has changed. - https://highload.fun/tasks/3/leaderboard - https://highload.fun/tasks/12/leaderboard - https://highload.fun/tasks/15/leaderboard - https://highload.fun/tasks/18/leaderboard - https://highload.fun/tasks/24/leaderboard |
| |
| ▲ | johndough 8 hours ago | parent | next [-] | | All the naysayers here clearly have no idea. Your large matrix multiplication implementation is quite impressive! I set up a benchmark loop (sketched after the list below) and let GPT-5.1-Codex-Max experiment for a bit (not 5.2/Opus/Gemini, because they are broken in Copilot), but it seems to be missing something crucial. With a bit of encouragement, it has implemented: - padding from 2000 to 2048 for easier power-of-two splitting
- two-level Winograd matrix multiplication with tiled matmul for last level
- unrolled AVX2 kernel for 64x64 submatrices
- 64 byte aligned memory
- restrict keyword for pointers
- better compiler flags (clang -Ofast -march=native -funroll-loops -std=c++17)
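The benchmark loop itself was nothing exotic, roughly along these lines (a sketch, not my exact harness; the file names and the best-of-5 timing are made up):

```python
# Rough sketch of the benchmark loop: rebuild the candidate with the
# flags above, run it against a fixed input, keep the best wall time.
# File names ("matmul.cpp", "input.bin") are placeholders.
import subprocess, time

def build(src="matmul.cpp", out="./matmul"):
    subprocess.run(
        ["clang++", "-Ofast", "-march=native", "-funroll-loops",
         "-std=c++17", src, "-o", out],
        check=True,
    )

def bench(out="./matmul", runs=5):
    best = float("inf")
    for _ in range(runs):
        with open("input.bin", "rb") as f:
            start = time.perf_counter()
            subprocess.run([out], stdin=f, stdout=subprocess.DEVNULL, check=True)
            best = min(best, time.perf_counter() - start)
    return best

if __name__ == "__main__":
    build()
    print(f"best of {5}: {bench():.3f} s")
```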
But yours is still easily 25% faster. Would you be willing to write a bit about how you set up your evaluation and which tricks Claude used to solve it? | |
| ▲ | zarzavat 2 days ago | parent | prev | next [-] | | How are you qualified to judge its performance on real code if you don't know how to write a hello world? Yes, LLMs are very good at writing code, they are so good at writing code that they often generate reams of unmaintainable spaghetti. When you submit to an informatics contest you don't have paying customers who depend on your code working every day. You can just throw away yesterday's code and start afresh. Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash. | | |
| ▲ | josu 2 days ago | parent | next [-] | | I know what it's like to run a business and build complex systems. That's not the point. I used highload as an example because it seems like an objective rebuttal to the claim that "it can't tackle those complex problems by itself." And regarding this: "Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash": again, a combination of LLMs/agents with some guidance (from someone with no prior experience in this type of high-performance programming) was able to beat all the human software developers who have taken these challenges. | |
| ▲ | VMG 2 days ago | parent | prev | next [-] | | > Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash. The skill of "a human software developer" is in fact a very wide distribution, and your statement is true for an ever-shrinking tail end of it. | |
| ▲ | FeepingCreature 2 days ago | parent | prev [-] | | > How are you qualified to judge its performance on real code if you don't know how to write a hello world? The ultimate test of all software is "run it and see if it's useful for you." You do not need to be a programmer at all to be qualified to test this. | | |
| ▲ | LucaMo 2 days ago | parent [-] | | What I think people get wrong (especially non-coders) is that they believe the limitation of LLMs is building a complex algorithm.
That issue was in reality solved a long time ago. The real issue is building a product. Think about microservices spread across different projects, using APIs that are not perfectly documented or whose documentation is massive, etc. Honestly, I don't know what commenters on Hacker News are building, but a few months back I was hoping to use AI to build the interaction layer with Stripe to handle multiple products and delayed cancellations via subscription schedules. Everything is documented; the documentation is a bit scattered across pages, but the information is out there.
At the time there was Opus 4.1, so I used that. It wrote 1000 lines of non-functional code with zero reusability after several prompts. I then asked ChatGPT whether it was possible without using schedules; it told me yes (even though it isn't), and when I told Claude to recode it, it started coding random stuff that doesn't exist.
I built everything to be functional and reusable myself, in approximately 300 lines of code. The above is a software engineering problem. Reimplementing a JSON parser using Opus is neither fun nor useful, so it should not be used as a metric. | |
| ▲ | josu a day ago | parent | next [-] | | > The above is a software engineering problem. Reimplementing a JSON parser using Opus is neither fun nor useful, so it should not be used as a metric. I've also built a BitTorrent implementation in Rust from the specs, keeping the binary under 1MB. It supports all active and accepted BEPs: https://www.bittorrent.org/beps/bep_0000.html Again, I literally don't know how to write a hello world in Rust. I also vibe-coded a trading system that is connected to 6 trading venues. That was a fun weekend project, but it ended up making +20k of pure arbitrage with just 10k of working capital. I'm not sure that one proves my point, because while I don't consider myself a programmer, I did use Python, a language I'm somewhat familiar with. So yeah, I get what you are saying, but I don't agree. I used highload as an example because it is an objective way of showing that a combination of LLMs/agents with some guidance (from someone with no prior experience in this type of high-performance programming) was able to beat all the human software developers who have taken these challenges. | |
| ▲ | B56b a day ago | parent | prev | next [-] | | This hits the nail on the head. There's a marked difference between a JSON parser and a real world feature in a product. Real world features are complex because they have opaque dependencies, or ones that are unknown altogether. Creating a good solution requires building a mental model of the actual complex system you're working with, which an LLM can't do. A JSON parser is effectively a book problem with no dependencies. | | |
| ▲ | josu a day ago | parent [-] | | You are looking at this wrong. Creating a JSON parser is trivial. The thing is that my one-shot attempt was 10x slower than my final solution. Creating a parser for this challenge that is 10x more efficient than a simple approach does require a deep understanding of what you are doing. It requires optimizing the hot loop (among other things) in ways that 90-95% of software developers wouldn't know how to do. It requires a deep understanding of the AVX2 instruction set. Here you can read more about these challenges: https://blog.mattstuchlik.com/2024/07/12/summing-integers-fa... |
| |
| ▲ | FeepingCreature a day ago | parent | prev [-] | | You need to give it search and tool calls and the ability to test its own code and iterate. I too could not oneshot an interaction layer with Stripe without tools. It also helps to make it research a plan beforehand. |
|
|
| |
| ▲ | throw1235435 a day ago | parent | prev | next [-] | | If that is true, then all the commentary around software people still having jobs due to "taste" and other nice words is just that: commentary. In the end the higher-level stuff still needs someone to learn it (e.g. learning the AVX2 architecture, knowing what tech to work with), but it requires, IMO, significantly less practice than coding, which in itself was a gate. The skill morphs more into being a tech expert rather than a coding expert. I'm not sure what this means for the future of SWEs yet, though. I don't see higher levels of staff in big businesses bothering to do this, and at some scale I don't see founders still wanting to manage all of these agents and processes (they have better things to do at higher levels). But I do see the barrier of learning to code gone, meaning it probably becomes just like any other job. | |
| ▲ | dajoh 2 days ago | parent | prev | next [-] | | >I'm currently #1 on 5 different very complex computer engineering problems Ah yes, well known very complex computer engineering problems such as: * Parsing JSON objects, summing a single field * Matrix multiplication * Parsing and evaluating integer basic arithmetic expressions And you're telling me all you needed to do to get the best solution in the world to these problems was talk to an LLM? | | |
| ▲ | josu 2 days ago | parent [-] | | Lol, the problem is not finding a solution, the problem is solving it in the most efficient way. If you think you can beat an LLM, the leaderboard is right there. |
| |
| ▲ | ModernMech a day ago | parent | prev [-] | | None of the problems you've shown there are anything close to "very complex computer engineering problems", they're more like "toy problems with widely-known solutions given to students to help them practice for when they encounter actually complex problems". |
|
|
| ▲ | Davidzheng 2 days ago | parent | prev | next [-] |
| If it can consistently verify whether the error persists after a fix, you can run 10,000 parallel instances of fixer agents (okay, maybe you can't budget-wise, but theoretically) and then verify afterwards (this is in line with how the IMO/IOI models reportedly work). |
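In code, that best-of-N-plus-verifier idea is roughly the following sketch (propose_fix and error_still_present are hypothetical stand-ins for the fixer agent and the verifier, not real APIs):

```python
# Sketch of best-of-N repair with verification: fan out many independent
# fix attempts and keep only the candidates the verifier accepts.
# `propose_fix` and `error_still_present` are hypothetical stand-ins.
from concurrent.futures import ThreadPoolExecutor

def propose_fix(bug_report, seed):
    # stand-in for one independent fixer-agent run (e.g. an LLM call)
    return f"candidate patch #{seed} for: {bug_report}"

def error_still_present(patch):
    # stand-in for the verifier: apply the patch, rerun the failing test
    return len(patch) % 2 == 0  # dummy check so the sketch runs

def repair(bug_report, n=10_000):
    with ThreadPoolExecutor(max_workers=64) as pool:
        candidates = list(pool.map(lambda s: propose_fix(bug_report, s), range(n)))
    return [p for p in candidates if not error_still_present(p)]

surviving = repair("graph layout is wrong in dagre port", n=100)
print(f"{len(surviving)} candidate fixes passed verification")
```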
|
| ▲ | NSPG911 2 days ago | parent | prev | next [-] |
| > Opus 4.5 ate through my Copilot quota last month Sure, Copilot charges 3x tokens for using Opus 4.5, but how were you still able to use up half the allocated tokens not even one week into January? I thought my using up 50% was mad (inline completions + opencode); that's even worse. |
| |
| ▲ | mcv 13 hours ago | parent [-] | | I have no idea. Careless use, I guess. I was fixing a bunch of mocks in some once-great but now poorly maintained code, and I wasn't really feeling it so I just fed everything to Claude. Opus, unfortunately. I could easily have downgraded a bit. |
|
|
| ▲ | 2 days ago | parent | prev [-] |
| [deleted] |