0xbadcafebee 16 hours ago

Not sure why you got downvoted. 95% of people should be paying for a subscription. It's far cheaper, far more scalable, and far less hassle.

Local AI only makes sense for a couple of use cases:

  - Privacy
  - Constant churning on tokens
  - Latency
  - Availability
Local AI is "cheaper" when you already have the hardware sitting around, like an old MacBook or gaming GPU, or when the API cost is too high to bear (subscriptions will all run out if you churn 24/7). I'm surprised companies are still selling their old MacBooks to employees when they could be turning them into Beowulf clusters for cheap AI compute on long-running jobs (the cost is just electricity).

If usage-based pricing is killing your vibe, find a cheaper subscription with higher limits. Here's a list of them compared on price-per-request-limit: https://codeberg.org/mutablecc/calculate-ai-cost/src/branch/...

xscott 16 hours ago | parent | next [-]

I think you're right about the cost/benefit trade-off in general, but I do wonder how much of the "compaction" Codex and Claude do is to keep the context fresh, and how much is to save (them) runtime costs.

If you've got a 1M-token context, but they constantly summarize it down to something much smaller, is it really 1M tokens of benefit? With a local model, you can use all 256k tokens on your own terms. I don't have any benchmarks to know for sure, though.

0xbadcafebee 11 hours ago | parent [-]

I think you might be a bit confused about compaction? The LLM API endpoint does not do compaction; it's an external agent harness that does it. And the Codex/Claude agents aren't constantly summarizing it down; they generally wait until you get to about 3/4 of the maximum context size.

Compaction doesn't save them money, it just makes it possible for you to continue a session. If you compact a session too many times, besides the fact that the model basically stops being useful, you eventually just cannot do anything else in the session because all the context is taken up by compaction notes. But if you don't compact it, pretty soon the session is completely unusable because it can't output any more tokens. You can disable compaction in those agents if you want to see the difference.

Also, using a lot of context can make the model perform poorly, so compaction can improve results. If you have a much larger context size, it means you have more headroom before the model starts to perform poorly (as it grows closer to max context size). A larger context also lets you do things like handle larger documents or reason over a larger amount of data without having to break it up into subtasks. Eventually we want models' context to get much bigger so we can do more things in a session. (Some research is being done to see if we can get rid of the limit entirely)
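The client-side version of that loop can be sketched roughly like this. This is a toy illustration only; the function names, the 3/4 threshold, the character-based token estimate, and the keep-the-last-four-turns policy are all my assumptions, not any vendor's actual implementation:

```python
# Hypothetical sketch of client-side compaction in an agent harness.
# All names and thresholds here are illustrative assumptions.

MAX_CONTEXT_TOKENS = 256_000
COMPACT_THRESHOLD = 0.75  # compact once ~3/4 of the window is used


def count_tokens(messages):
    # Crude stand-in for a real tokenizer: ~1 token per 4 characters.
    return sum(len(m["content"]) // 4 for m in messages)


def summarize(messages):
    # Stand-in for an LLM call that condenses the older transcript
    # into a single "compaction note" message.
    return {"role": "system",
            "content": f"[compaction note: summary of {len(messages)} messages]"}


def maybe_compact(messages):
    """Replace older messages with a summary when near the context limit."""
    if count_tokens(messages) < MAX_CONTEXT_TOKENS * COMPACT_THRESHOLD:
        return messages  # plenty of headroom, leave the history intact
    keep = messages[-4:]             # retain the most recent turns verbatim
    note = summarize(messages[:-4])  # condense everything older
    return [note] + keep
```

Note how repeated compaction degrades the session exactly as described above: each pass replaces real history with a summary-of-a-summary, so the compaction notes gradually crowd out useful context.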

cmrx64 10 hours ago | parent | next [-]

The LLM API endpoint does do compaction. OpenAI definitely supports server-side compaction, both explicit and automatic, and this is different from what could be implemented purely client-side: https://developers.openai.com/api/docs/guides/compaction (and there were rumors a few months ago on HN about how activation-preserving/latent it is, vs. just summarization). Anthropic as well, in beta (new to me): https://platform.claude.com/docs/en/build-with-claude/compac...

xscott 10 hours ago | parent | prev [-]

The names for the pieces are confusing, so it's easy to talk past each other. For instance, you're saying "Codex the agent," which isn't a thing now. The model is currently GPT-5.5, and at one point it was GPT-5.3-Codex, so when I say "Codex," I mean the macOS "harness." Similar for Claude Code vs. Claude Opus/Sonnet.

Anyways, I don't know the specifics well enough to argue with you on anything, but there is a cost for input tokens, and you see/pay it when you use the API directly or through OpenRouter. Maybe you looked at the leaked source for Claude Code and can tell me definitively otherwise, but Anthropic's and OpenAI's incentives for when to compact are not always aligned with the user's, depending on the pricing plan.

ls612 14 hours ago | parent | prev | next [-]

I recently set up a Gemma 4 "heretic" fine-tune on my MacBook, more to prove that I could than anything else, and it's probably around 4o-level performance imo. Not fit for any real work. That said, the fact that 4o was frontier two years ago and today I can match it on local hardware, uncensored, is pretty impressive.

otabdeveloper4 14 hours ago | parent | prev [-]

> 95% of people should be paying for a subscription.

Subscription plans are the "first hit is free" plans. Real pricing once subscriptions are phased out in a year or two is gonna be orders of magnitude more.

0xbadcafebee 11 hours ago | parent [-]

Actually, subscription plans will be here indefinitely. The cost of inference will only go down over time, and subscriptions are the end-game for all businesses because they're recurring revenue. Most subscribers don't use all the capacity, and there are limits imposed, so the financials work out. It's the same basic model as residential internet and mobile phones, but cheaper, because there's an order of magnitude (or two) less support and maintenance.
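The oversubscription math behind that claim can be sketched with back-of-the-envelope arithmetic. Every figure below is a made-up illustrative assumption, not real pricing or cost data from any provider:

```python
# Toy oversubscription economics: why flat-rate plans can be profitable
# even when the worst-case user would cost more than they pay.
# All numbers are hypothetical assumptions.

price_per_month = 20.00      # subscription price (assumed)
cost_per_1m_tokens = 2.00    # provider's inference cost (assumed)
token_limit = 30_000_000     # monthly cap per subscriber (assumed)
avg_utilization = 0.15       # most subscribers use a fraction of the cap

worst_case_cost = (token_limit / 1_000_000) * cost_per_1m_tokens
avg_cost = worst_case_cost * avg_utilization
avg_margin = price_per_month - avg_cost

print(f"worst-case cost/subscriber: ${worst_case_cost:.2f}")  # $60.00
print(f"average cost/subscriber:    ${avg_cost:.2f}")         # $9.00
print(f"average margin:             ${avg_margin:.2f}")       # $11.00
```

Under these assumed numbers a maxed-out subscriber loses the provider money, but the average one is profitable, which is exactly why the caps exist: they bound the worst case while typical usage carries the margin.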