lelanthran a day ago

> Just your GPU not counting the rest of the system costs 4 years of subscription

With my existing setup for non-coding tasks (the GPU is a 3060 12GB, bought before I wanted local LLM inference, but now used for that purpose anyway), the GPU alone was a once-off ~$350 cost (https://www.newegg.com/gigabyte-windforce-oc-gv-n3060wf2oc-1...).

It gives me literally unlimited requests, not the pseudo-unlimited requests I get from ChatGPT, Claude and Gemini.
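
Back-of-envelope, the one-off cost beats the recurring one fairly quickly. Everything below except the $350 GPU price is an assumption on my part (subscription tier, power draw, electricity rate, daily hours), so treat it as a sketch:

    # Break-even: one-off GPU purchase vs a monthly subscription.
    # Only the $350 GPU price is real; the subscription price, power
    # draw, electricity rate and daily hours are assumed for illustration.

    GPU_COST_USD = 350.0
    SUB_USD_PER_MONTH = 20.0        # assumed entry-tier subscription
    POWER_DRAW_KW = 0.17            # assumed average draw under load
    ELECTRICITY_USD_PER_KWH = 0.15  # assumed residential rate
    HOURS_PER_DAY = 8.0             # assumed daily usage

    # Monthly electricity cost of running the card.
    elec_per_month = POWER_DRAW_KW * HOURS_PER_DAY * 30 * ELECTRICITY_USD_PER_KWH

    # Months until the one-off purchase beats the subscription.
    break_even_months = GPU_COST_USD / (SUB_USD_PER_MONTH - elec_per_month)
    print(f"electricity: ${elec_per_month:.2f}/month")
    print(f"break-even:  {break_even_months:.1f} months")

At those assumed numbers the card pays for itself in roughly two years, and every request after that costs only marginal electricity.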

> and with the sub you get the new models where your existing hardware will likely not be able to run it at all.

I'm not sure about that. Why wouldn't new models run on a four-year-old GPU? Wasn't a primary selling point of the newer models that they "use less computation for inference"?

Now, of course there are limitations, but for non-coding usage (of which there is a lot) this cheap setup appears to be fine.

> It's closer to $3k to build a machine that you can reasonably use, which is 12 whole years of subscription. It's not hard to see why no one is doing it.

But there are people doing it. Lots, actually, and not just for research purposes. With costs apparently still falling, self-hosting gets more viable with each passing month, not less.

The calculus looks even better when you have a small group (say 3-5 developers) needing inference for an agent; then you can get a 5060 Ti with 16GB of VRAM for slightly over $1000. The limited VRAM means it won't perform as well, but at that performance the agent will still be capable of writing 90% of boilerplate, making edits, etc.
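
Rough per-seat math (the team size, per-seat subscription price and hardware lifetime are my assumptions for illustration; only the ~$1000 card figure is from above):

    # Per-seat cost of one shared local inference box vs subscriptions.
    # The ~$1000 hardware figure is from the comment; team size,
    # subscription price and horizon are assumed.

    HARDWARE_USD = 1000.0
    TEAM_SIZE = 4                  # mid-point of the 3-5 developers above
    SUB_USD_PER_SEAT_MONTH = 20.0  # assumed per-developer subscription
    HORIZON_MONTHS = 36            # assumed useful life of the card

    local_per_seat = HARDWARE_USD / TEAM_SIZE / HORIZON_MONTHS
    print(f"local:        ${local_per_seat:.2f}/seat/month")
    print(f"subscription: ${SUB_USD_PER_SEAT_MONTH:.2f}/seat/month")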

These companies (Anthropic, OpenAI, etc.) are at the bottom of the value chain, because they are selling tokens, not solutions. When you can generate your own tokens continuously, 24x7, does it matter if you generate them at half the speed?
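
To put illustrative numbers on the 24x7 point (both token rates and both daily-hours figures are assumptions, not benchmarks):

    # Daily token output: a slower local box running around the clock
    # vs a faster hosted model used interactively within rate limits.
    # All four figures below are assumed for illustration.

    LOCAL_TOK_PER_S = 25.0   # assumed local rate ("half speed")
    HOSTED_TOK_PER_S = 50.0  # assumed hosted "full speed" rate
    LOCAL_HOURS = 24.0       # local box runs 24x7
    HOSTED_HOURS = 2.0       # assumed interactive use per day

    local_daily = LOCAL_TOK_PER_S * LOCAL_HOURS * 3600
    hosted_daily = HOSTED_TOK_PER_S * HOSTED_HOURS * 3600
    print(f"local:  {local_daily / 1e6:.1f}M tokens/day")
    print(f"hosted: {hosted_daily / 1e6:.1f}M tokens/day")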

tick_tock_tick a day ago | parent [-]

> does it matter if you generate at half the speed?

Yes, massively. It's not even linear: 1/2 speed is probably 1/8 or less of the value of "full speed". It's going to be even more pronounced as "full speed" gets faster.

lelanthran a day ago | parent [-]

> Yes, massively. It's not even linear: 1/2 speed is probably 1/8 or less of the value of "full speed". It's going to be even more pronounced as "full speed" gets faster.

I don't think that's true for most use-cases (content generation, including artwork, code/software, reading material, summarising, etc.). Something that takes a day without an LLM might take only 30 minutes with GPT-5 (artwork), or maybe an hour with Claude Code.

Does the user really care that their full-day artwork task now takes one hour instead of 30 minutes? Or that their full-day coding task now takes two hours instead of one?
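
A toy wall-clock calculation (all durations are assumed, purely illustrative): once human review and iteration dominate, halving generation speed barely moves the total:

    # Total task time when human time dominates generation time.
    # Both durations are assumed numbers, not measurements.

    HUMAN_MINUTES = 45.0       # assumed review/edit/prompting time per task
    FULL_SPEED_GEN_MIN = 15.0  # assumed generation time at "full speed"

    full_total = HUMAN_MINUTES + FULL_SPEED_GEN_MIN
    half_total = HUMAN_MINUTES + 2 * FULL_SPEED_GEN_MIN  # half speed doubles it
    print(f"full speed: {full_total:.0f} min")
    print(f"half speed: {half_total:.0f} min ({half_total / full_total:.2f}x)")

At those assumed numbers, "half speed" costs the user 1.25x the wall-clock time, nowhere near an 8x loss of value.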

After all, from day one of the ChatGPT release, literally no one complained that it was too slow (and it was much slower than it is now).

Right now no one is asking for faster token generation; everyone is asking for more accurate solutions, even at the expense of speed.