Aurornis | 5 hours ago
> This model? You can run it at Q4 with 70GB of VRAM. This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).

The one thing I would want everyone curious about local LLMs to know is that being able to run a model and being able to run a model fast are two very different thresholds. You can get these models to run on a 128GB Mac, but first you have to establish whether Q4 retains enough quality (models have different sensitivities to quantization) and how fast it actually runs.

For async work and background tasks, prompt processing and token generation speeds matter less, but a lot of Mac Studio buyers have discovered the hard way that it's not going to be as responsive as working with a model hosted in the cloud on proper hardware.

For most people without hard requirements for on-site processing, the best way to use this model would be going through one of the OpenRouter-hosted providers and paying by token.

> This beats the latest Sonnet while running locally

Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.
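For the pay-per-token route, OpenRouter exposes an OpenAI-compatible endpoint, so the client code is trivial. A minimal sketch (the model slug below is a placeholder; check openrouter.ai/models for the real one):

    # pip install openai -- OpenRouter speaks the OpenAI-compatible API
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",  # your OpenRouter key
    )

    # Placeholder slug -- substitute the model being discussed here
    resp = client.chat.completions.create(
        model="some-org/some-open-weight-model",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(resp.choices[0].message.content)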
nijave | 19 minutes ago
> Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.

This has been my experience as well. I've been testing an agent built with Strands Agents which receives a load balancer latency alert and is expected to query logs with AWS Athena (Trino), then drill down with Datadog spans/traces to find the root cause. Admittedly, "devops" domain knowledge is important here.

My notes so far (harness sketched below):

    "us.anthropic.claude-sonnet-4-6"              # working, good results
    "us.anthropic.claude-sonnet-4-20250514-v1:0"  # has problems following the prompt instructions
    "us.anthropic.claude-sonnet-4-5-20250929-v1:0"  # working, good results
    "us.anthropic.claude-opus-4-5-20251101-v1:0"
    "us.anthropic.claude-opus-4-6-v1"             # best results, slower, more expensive
    "amazon.nova-pro-v1:0"                        # completely fails
    "openai.gpt-oss-120b-1:0"                     # tool calling broken
    "zai.glm-5"                                   # seems to work pretty well, a little slow, more expensive than Sonnet
    "minimax.minimax-m2.5"                        # didn't diagnose correctly
    "zai.glm-4.7"                                 # good results but high tool call count, more expensive than Sonnet
    "mistral.mistral-large-3-675b-instruct"       # misdiagnosed--somehow claimed a Prometheus scrape issue was involved
    "moonshotai.kimi-k2.5"                        # identified the right endpoints but interpreted trace data/root cause incorrectly
    "moonshot.kimi-k2-thinking"                   # identified endpoint, 1 correct root cause, 1 missing index hallucination

All models were run on AWS Bedrock. I let Claude Code w/ Opus 4.7 iterate over the agent prompt but didn't try to optimize per model.

Really the only thing that came close to Sonnet 4.5 was GLM-5. The real kicker: Sonnet is also the cheapest, since it supports prompt caching.

The Kimi ones were close to working but didn't quite make the mark.
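The harness looks roughly like this. A minimal sketch assuming the Strands Agents SDK's Agent/@tool interface; the tool bodies are stubs, not the real Athena/Datadog integrations:

    # pip install strands-agents
    from strands import Agent, tool

    @tool
    def query_athena(sql: str) -> str:
        """Query the ALB access logs via Athena (Trino SQL)."""
        # Stub -- the real tool runs the query with boto3 and returns rows
        return "p99 latency spike on /api/search between 14:02 and 14:09"

    @tool
    def fetch_datadog_spans(service: str, time_range: str) -> str:
        """Pull matching spans/traces from Datadog for drill-down."""
        # Stub -- the real tool hits the Datadog API
        return "slow spans dominated by one backend query"

    agent = Agent(
        model="us.anthropic.claude-sonnet-4-5-20250929-v1:0",  # swapped per test run
        system_prompt="You are an on-call SRE. Find the root cause of the alert.",
        tools=[query_athena, fetch_datadog_spans],
    )
    agent("ALB latency alarm: TargetResponseTime p99 > 2s for 10 minutes")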
simjnd | 4 hours ago
> The one thing I would want everyone curious about local LLMs to know is that being able to run a model and being able to run a model fast are two very different thresholds. You can get these models to run on a 128GB Mac, but first you have to establish whether Q4 retains enough quality (models have different sensitivities to quantization) and how fast it actually runs.

Very valid. This is an active area of research, and there are already a lot of options to try out today:

- People have successfully used TurboQuant to quantize the model weights themselves (TQ3_4S), not just the context KV cache, achieving smaller sizes than Q4 (~3.5 bpw) with much better PPL and faster decoding.

- Importance-weighted quantization (e.g. IQ4) also gives way better PPL, KLD, etc. at the same size as Q4.

- DFlash (block diffusion for speculative decoding) needs a good draft model compatible with the big model, but can provide up to a 5x uplift in decoding speed (though usually in the 2-2.5x range).

- Forcing a model's thinking to obey a simple grammar has been shown to improve results with drastically less thinking output (faster effective result generation), although the effect has been most pronounced on smaller models.

We should be skeptical, but it's definitely trending in the right direction, and I wouldn't be surprised if we are indeed able to run it at acceptable speeds (quick sanity check below).

> Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.

This hasn't been my experience. After Anthropic started their shenanigans I switched to exclusively using open-weights models via OpenRouter and OpenCode, and I can't really tell a difference (for better or for worse).
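For the "does the quant hold up and how fast is it" question, a minimal timing sketch, assuming a GGUF export of the model and the llama-cpp-python bindings (the model path is a placeholder):

    # pip install llama-cpp-python
    import time
    from llama_cpp import Llama

    # Placeholder path -- point at your Q4/IQ4 GGUF export
    llm = Llama(model_path="model-IQ4_XS.gguf", n_ctx=8192, n_gpu_layers=-1)

    t0 = time.time()
    out = llm("Explain the tradeoffs of 4-bit quantization.", max_tokens=256)
    dt = time.time() - t0

    n = out["usage"]["completion_tokens"]
    print(f"{n} tokens in {dt:.1f}s -> {n / dt:.1f} tok/s")
    # llama.cpp also logs separate prompt-eval vs. eval timings to stderr;
    # for quality, compare perplexity against the full-precision baseline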
| |||||||||||||||||||||||||||||||||||||||||
Computer0 | an hour ago
Sure, but for a casual conversational use case I have not found speed to be a huge barrier. I chatted with a 100B model running on DDR5 alone on a plane recently and it was fine. It's mainly that I cannot do data classification and coding tasks in a timely manner.
zozbot234 | 5 hours ago
Cloud hardware is not inherently more "proper" than what's being proposed here; there's nothing wrong per se with targeting slower inference speeds in an on-prem, single-user context.
| |||||||||||||||||||||||||||||||||||||||||