| ▲ | jmuguy 3 hours ago |
| Given your knowledge on this - do you think we'll see an open source model with Opus levels of capability? IMO if/when this happens - I would 100% stop using Anthropic. |
|
| ▲ | Greenpants 3 hours ago | parent | next [-] |
| Let me put it like this. I started with local LLMs when ChatGPT still used GPT-3.5. I was amazed how my MacBook with 8GB RAM could run openhermes2.5-mistral: a 7b parameter model that could generate short stories that sort of made sense. Incredible! Two years later, and I'm running Qwen3.6 35b agentically to develop the start of a repository and automatically run tests to then improve on itself. I never thought we'd get here so quickly with LLMs back then. I'm pretty sure in two years we'll have current Opus-like quality in the 30-100b parameter model range. But at that point, Opus 6.3 will reason along for us so much better still, that we'll still look at those models in awe. It's great to look ahead, but let's not forget to appreciate how effective the current local models already are :) |
| |
| ▲ | jmuguy 3 hours ago | parent [-] | | Haha well I ask because I don't really want/need anything beyond Opus most of the time. And I'm paranoid that Anthropic is going to be forced to charge the true cost of all this before too long. | | |
| ▲ | Greenpants 2 hours ago | parent [-] | | The other upside of running local LLMs is that there's no cloud provider to suddenly charge more for the same, or even less, model use. It's personal, but I prefer CapEx over OpEx for this. If you can purchase a device upfront that runs a decent local LLM, you get the peace of mind that your setup won't suddenly change over time and can only get better. |
|
|
|
| ▲ | lambda 3 hours ago | parent | prev | next [-] |
| If you believe the benchmarks, Qwen 3.6 35B-A3B already outperforms Claude 4 Opus. Now, to some degree the open source models do some benchmaxxing, and bigger models with more params may always feel like they have more depth. But anyhow, right now you have something that is arguably comparable to Claude 4 Opus on your laptop. I can't really compare them myself because I never used Claude 4 Opus. It looks like Claude 4 Opus is still available on OpenRouter, so you could try it out and compare them yourself if you're interested. It will likely always be the case that there are proprietary cloud models that are more powerful than what you can run on a laptop. You can just do a whole lot more with terabytes of VRAM on multi-GPU clusters than you can do on a laptop. So for folks who must have the most capable model, you're probably not going to want to leave Anthropic. But right now, the models you can run on your laptop are comparable to the cloud models that were popular when vibecoding and Claude Code first took off. |
| |
| ▲ | MrScruff 3 hours ago | parent [-] | | You really need to take the benchmarks with a massive pinch of salt. I’ve been testing local LLMs since the original llama and there’s nothing I’ve tried that is in the same category as Opus. | | |
| ▲ | lambda 3 hours ago | parent [-] | | Which Opus? They certainly outperform Claude 3 Opus. Anyhow, feel free to try them out head to head on OpenRouter. I'd love to see someone write up their results, of a modern local sized open source model vs. frontier models from ~a year ago, on something other than the standard benchmarks. | | |
| ▲ | mapontosevenths 2 hours ago | parent | next [-] | | There's a guy on YouTube named Bijan Bowen who tests all the models (open and frontier) on a series of one/few shot programming exercises, and has been doing so for a long while now. You can pretty much watch him compare the results for any two models you're likely to be interested in. I'm not affiliated, I just like his style and have found it handy. I know it's not very rigorous, but it's good enough for me and I've found his examples to pretty closely match the results I see in real life. | | |
| ▲ | lambda an hour ago | parent [-] | | OK, it looks like he did a browser OS test with both Claude 4 Opus and Qwen 3.6 35B-A3B. Claude 4 Opus: https://youtu.be/J7omabtqnBM?t=193 Qwen 3.6 35B-A3B: https://youtu.be/gVU-DQeqkI0?t=215 Qwen 3.6 produced far more working functionality than Claude 4 Opus did. Obviously, this is just one test of a single one-shot prompt for a silly toy OS, but yeah, this particular test shows Qwen 3.6 running locally dramatically outperforming Claude 4 Opus, which was a frontier model a year ago. |
| |
| ▲ | MrScruff 3 hours ago | parent | prev [-] | | I’m normally comparing frontier open/cheap models against frontier closed source. I use deepseek/glm regularly, they’re fine and you can get real work done with them but it’s super obvious when you switch back to opus or even sonnet. A 3B active param MoE model is not comparable. | | |
| ▲ | lambda 40 minutes ago | parent [-] | | Yeah. I was pointing out that local 3b active models outperform frontier models from a year ago. Will this trend continue? Who knows. Both the frontier and local models will probably continue to get better. Which one will hit the top of the S-curve first? Hard to say, really. But what you can do right now locally is better than what you could do a year ago on the frontier, and lots of people were already using it pretty heavily a year ago. However, November is when most folks agree that the frontier models got good enough for much of their work. Local models aren't quite there yet (where by "local" I mean "can run at reasonable speed and quant on a system costing less than $10,000 with today's RAM and GPU prices"). The biggest open weights models are getting there, but those require something like an 8x H100 server to reasonably run. It's likely that there will always be a gap between frontier and local if you're comparing models at the same time; you can just do a lot more with terabytes of HBM than gigabytes of DDR. But will local models get good enough to be usable for useful work? For many folks, they already are. |
|
|
|
|
|
| ▲ | zozbot234 3 hours ago | parent | prev | next [-] |
| People can't seem to agree on what "Opus class" even means (the latest Opus is apparently pretty weak) but DeepSeek Pro, Kimi and GLM all are quite capable. |
| |
| ▲ | computerex 3 hours ago | parent [-] | | Nothing compares to Opus when it comes to "taste" in web design in my experience. Nothing compares to opus in very difficult HPC/model inference development. I worked on this with opus: https://github.com/computerex/dlgo OpenAI was offering 2x usage at one point and I still used opus just because it's so much more effective. |
|
|
| ▲ | rvnx 3 hours ago | parent | prev [-] |
| To me, totally yes. Even further: if they keep to their existing route, over time people will stop using Anthropic. More and more specialized and ultra-performant chips are going to flood the consumer market, especially once new hardware foundries start producing (well, if we don't die from WW3 in the interval). Ten years from now, when even basic computers have 128 GB of memory and phones have super optimized tuned models, what will be the point of Anthropic? Just use Gemma/Gemini/Siri or whatever. Pornography and uncensored models are also pushing toward local models. It's not like people's needs grow exponentially; the needs follow an asymptote instead (they are capped). The real revolution is offline robots and self-driving cars, but LLMs are already quite maxed. For programmers, now, what Anthropic offers is like a 3% improvement on a known test (like this pelican riding a bicycle), or on questions leaked from benchmark insiders. It's OK but not revolutionary (Fable was better but it was unusable, easily 20 minutes per prompt due to overthinking). |