vb-8448 11 hours ago

> Use cloud models only when they’re genuinely necessary.

The problem is that it's much easier to use the SOTA models (especially when they're subsidized) than to spend time tuning the knobs on a local one.

I just realized this with coding agents: yeah, you probably shouldn't always use the latest model at xhigh, but you end up doing it anyway because you get the job done in less time, with less "effort", and at basically the same price.

I guess we'll see a real effort toward local AI only when the major vendors start billing based on actual token usage.

lelanthran 10 hours ago | parent | next [-]

> The problem is that it's much easier to use the SOTA models (especially when they're subsidized) than to spend time tuning the knobs on a local one.

That's not a problem, that's a feature; I have something like 8 tabs open to different free-tier providers. ChatGPT, Claude and Gemini are the SOTA ones.

I have no problem maxing one out, then moving to the next. I can do this all day, having them implement specific functions (or classes) in my code. The thing is, because I actually know how to write and design software, I don't need to run an agent in a loop to produce everything in a day; I can use the web chatbots with copy/paste to literally generate thousands of lines of code per hour while still keeping a strong mental model of the code, so I can go in and change whatever I need to.[1]

---------------------

[1] Just did that this morning on a Python project: because I designed what I needed, each generation was me prompting for a single function. So when I needed to add something this morning, I didn't even bother asking a chatbot to do it; I just went directly to the correct place and did it.

You can't do that if you generate the entire thing from specs.
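To make that granularity concrete, here's a hypothetical sketch (the prompt and function are invented for illustration; they're not from the actual project): each generation is one well-specified function that slots into a design I already own.

```python
# Hypothetical single-function prompt:
# "Write a Python function parse_iso_dates(lines) that extracts all ISO-8601
#  dates (YYYY-MM-DD) from an iterable of strings and returns them as
#  datetime.date objects, skipping anything that merely looks like a date."
import re
from datetime import datetime

def parse_iso_dates(lines):
    """Return all valid ISO-8601 (YYYY-MM-DD) dates found in the given lines."""
    pattern = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
    found = []
    for line in lines:
        for match in pattern.findall(line):
            try:
                found.append(datetime.strptime(match, "%Y-%m-%d").date())
            except ValueError:
                pass  # e.g. 2024-13-40 matches the regex but is not a date
    return found
```

Because the function boundary and name came from my design, adding behavior later (say, a second date format) is a hand edit in an obvious place, not a re-generation.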

vb-8448 10 hours ago | parent [-]

We're talking about local AI here, and having all these SOTA models basically for free is blocking the progress of local or independent third-party setups.

lelanthran 10 hours ago | parent [-]

Maybe I should have clarified what the feature is (after re-reading my post, I see that I basically just stopped after adding the footnote).

The feature of using all these SOTAs to exhaustion on the free tiers is burning their VC money!

The more I use for free, the more of their money I burn, the closer we'll get to actual 3rd-party and independent setups (local or otherwise).

RataNova 10 hours ago | parent | prev | next [-]

The path of least resistance usually wins, especially when the pricing hides the real cost.

Analemma_ 11 hours ago | parent | prev [-]

I'm also just not seeing good performance from local models. Every time a thread about LLMs comes up, there are tons of people in the comments insisting that they're getting just as good results from the latest DeepSeek/qwen/whatever as with Opus, and that just hasn't been my experience at all: open-source models just fall over completely compared to Claude when asked to do anything remotely complicated.

I have a sneaking suspicion this is kinda like the situation with Linux in the 90s, where it kinda worked but it reeeeeally wasn't ready for the home user, but you had a lot of people who would insist to your face everything was fine, mostly for ideological reasons.

kgeist 10 hours ago | parent | next [-]

It depends a lot on how you run those models; I think a lot of the disagreement comes from that. A lot of people run local models with incredibly small context windows (which makes an agentic LLM go in circles), use very small quants (like 4-bit => huge degradation), don't set the recommended sampling parameters (like top-p/temperature), or download GGUFs with broken chat templates. And then they claim model X is bad :)
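For example, here's a minimal sketch of what I mean, assuming a local llama.cpp server and its OpenAI-compatible endpoint; the model file, port, and parameter values are placeholders, so substitute whatever the model card actually recommends:

```python
# Assumes a llama.cpp server started with a decent context window and the
# vendor-recommended sampling settings, e.g.:
#   llama-server -m model.gguf -c 32768 --temp 0.7 --top-p 0.8
# (model.gguf and the values are placeholders; check the model card)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

resp = client.chat.completions.create(
    model="local",      # llama-server serves whatever model it was launched with
    messages=[{"role": "user", "content": "Refactor this function: ..."}],
    temperature=0.7,    # recommended value, not whatever the client defaults to
    top_p=0.8,
    max_tokens=2048,
)
print(resp.choices[0].message.content)
```

Run the same model with a 4k context, a tiny quant, and default sampling, and you're benchmarking your own misconfiguration, not the model.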

I'm currently running both Sonnet 4.6 and Qwen 3.6-27b on the same codebase (via OpenCode, with parameters carefully tuned for a good quality/context-size ratio), and on this project they both struggle with complex, non-trivial tasks and both work flawlessly otherwise. Sonnet 4.6 understands the intent better when my task is ambiguously formulated, but otherwise the gap is pretty small for coding under a harness.

lelanthran 10 hours ago | parent | prev | next [-]

> Every time a thread about LLMs comes up, there are tons of people in the comments insisting that they're getting just as good results from the latest DeepSeek/qwen/whatever as with Opus, and that just hasn't been my experience at all: open-source models just fall over completely compared to Claude when asked to do anything remotely complicated.

Different usage patterns: you want to issue a single spec, then walk away and come back later (after it has consumed $10k worth of API tokens inside your $200/month subscription) to a finished product.

Many people issue a spec for a single function, a single class, or similar. When you break it down like that, the advantage of SOTA models shrinks.
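As purely illustrative arithmetic (the per-token prices below are assumptions, not any vendor's actual rates), the "$10k inside a $200 subscription" figure is not far-fetched for a spec-and-walk-away agent loop:

```python
# Illustrative only: assumed Opus-class metered prices, not real rates.
input_price = 15 / 1_000_000    # $ per input token (assumption)
output_price = 75 / 1_000_000   # $ per output token (assumption)

# An agent loop re-reads large chunks of the codebase on every iteration,
# so input tokens dominate. These counts are hypothetical.
input_tokens = 500_000_000
output_tokens = 30_000_000

metered = input_tokens * input_price + output_tokens * output_price
print(f"metered cost: ${metered:,.0f} vs a $200/month subscription")
# => metered cost: $9,750 vs a $200/month subscription
```

Prompting for one function at a time, by contrast, keeps both the context and the bill small enough that the model tier barely matters.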

vb-8448 10 hours ago | parent [-]

My experience is that in medium/big codebases, even with single functions, going with xhigh is basically better from a user perspective (faster to get the result, and you can trust it), while with lower models (e.g. Sonnet instead of Opus) you always have to carefully review the output, because 1 time in 10 it will hallucinate, you won't catch it immediately, and at some point it will bite you.

lelanthran 9 hours ago | parent [-]

> My experience is that in medium/big codebases, even with single functions, going with xhigh is basically better from a user perspective (faster to get the result, and you can trust it), while with lower models (e.g. Sonnet instead of Opus) you always have to carefully review the output, because 1 time in 10 it will hallucinate,

What do you mean "trust it"? It sounds like you want to vibe-code (never look at the output), and maybe for that you need SOTA, but like I said in a different comment, I can easily generate 1000s of lines of code per hour just prompting the chatbots.

I don't, because I actually review everything, but I can, and some of those chatbots are actually SOTA anyway.

vb-8448 9 hours ago | parent [-]

With SOTA models I can just set up the instructions (even if they're a bit fuzzy), go away for 10 or 15 minutes, come back, and just check the result and adjust where necessary (most of the time small adjustments are needed, but the overall work is pretty good).

With subpar models I have to be more careful about the instructions and check step by step, because the path they choose is wrong, or they do things I didn't ask for, or the agent gets stuck in a loop somewhere.

catlifeonmars 6 hours ago | parent [-]

A lot of people aren't using agents that way. Not saying it's not a legitimate use or anything, just that I think the use cases are different. And yeah, maybe for your specific use case SOTA hosted models are the right choice.

bilbo0s 10 hours ago | parent | prev [-]

This.

I've begun to suspect that most people are probably running different hardware. Sure, if you run the latest DeepSeek flash on your brand-new M5 with 128GB, maybe you get acceptable performance?

But honestly, how many people have an extra $9000 lying around these days?

Right now, running local models with acceptable performance is kind of a luxury. I wish the people who always say "This is great!" would realize that not everyone has their hardware.

vb-8448 9 hours ago | parent [-]

Actually, even with $9k of hardware you won't get good enough performance. There's an interesting video from antirez trying to run DeepSeek v4 flash at 2-bit quantization on an M3 Max with 128GB... and the result is kind of disappointing: as soon as the context starts growing, you're down to around 20 tokens/s.
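As a back-of-the-envelope sketch of why that hurts (all numbers below are illustrative assumptions, not measurements from the video):

```python
# Why ~20 tokens/s is painful for agentic work: an agent makes many calls
# per task, and generation time alone adds up. Hypothetical numbers.
decode_tps = 20           # generation speed once the context has grown
tokens_per_reply = 1_500  # a typical diff plus explanation
calls_per_task = 15       # iterations in a single agentic task

seconds = calls_per_task * tokens_per_reply / decode_tps
print(f"~{seconds / 60:.0f} min of pure generation per task")  # ~19 min
```

And that's before counting the prompt-processing (prefill) time on each call.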

zozbot234 9 hours ago | parent | prev [-]

Prefill performance used to be the real bottleneck on antirez's DS4 setup, and that's been greatly improved by now; it doesn't perceptibly slow down with growing context.