| ▲ | chadash 3 hours ago |
| > Will LLMs be cheaper than humans once the subsidies for tokens go away? At this point we have little visibility to what the true cost of tokens is now, let alone what it will be in a few years time. It could be so cheap that we don’t care how many tokens we send to LLMs, or it could be high enough that we have to be very careful.

We do have some idea. Kimi K2 is a relatively high-performing open-source model. People have it running at 24 tokens/second on a pair of Mac Studios, which costs about $20k. That setup draws less than a kW of power, so the roughly $0.08-0.15 per hour spent on electricity is negligible compared to a developer's time. This might be the cheapest setup for running it locally, but it's almost certain that the cost per token is far lower with specialized hardware at scale.

In other words, a near-frontier model is running at a cost that a (somewhat wealthy) hobbyist can afford. And it's hard to imagine that the hardware costs don't come down quite a bit. I don't doubt that tokens are heavily subsidized, but I think this might be overblown [1].

[1] Training models is still extraordinarily expensive, and that is certainly being subsidized, but you can amortize that cost over a lot of inference, especially once we reach a plateau of ideas and stop running training runs as frequently. |
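For a rough sense of scale, here is a back-of-envelope cost per million tokens at those numbers, assuming $0.10/kWh and amortizing the $20k over three years of 24/7 use (both figures are assumptions):

awk 'BEGIN {
  tok_per_hr = 24 * 3600                # 24 tokens/s sustained
  power_hr   = 1.0 * 0.10               # ~1 kW at an assumed $0.10/kWh
  hw_hr      = 20000 / (3 * 8760)       # $20k amortized over an assumed 3 years of 24/7 use
  printf "power: $%.2f/Mtok  hardware: $%.2f/Mtok\n", power_hr / tok_per_hr * 1e6, hw_hr / tok_per_hr * 1e6
}'
# prints: power: $1.16/Mtok  hardware: $8.81/Mtok

Under those assumptions it is roughly $10 per million tokens all-in; the hardware term dominates, so lower utilization pushes the per-token cost up proportionally.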
|
| ▲ | embedding-shape 3 hours ago | parent | next [-] |
| > a near-frontier model

Is Kimi K2 near-frontier though? At least when run in an agent harness, and for general coding questions, it seems pretty far from it. I know what the benchmarks say; they always say it's great and close to frontier models, but is this others' impression in practice? Maybe my prompting style works best with GPT-type models, but I'm just not seeing it for the type of engineering work I do, which is fairly typical stuff. |
| |
| ▲ | crystal_revenge 2 hours ago | parent | next [-] | | I’ve been running K2.5 (through the API) as my daily driver for coding through Kimi Code CLI and it’s been pretty much flawless. It’s also notably cheaper and I like the option that if my vibe coded side projects became more than side projects I could run everything in house. I’ve been pretty active in the open model space and 2 years ago you would have had to pay 20k to run models that were nowhere near as powerful. It wouldn’t surprise me if in two more years we continue to see more powerful open models on even cheaper hardware. | | |
| ▲ | vuldin 2 hours ago | parent | next [-] | | I agree with this statement. Kimi K2.5 is at least as good as the best closed source models today for my purposes. I've switched from Claude Code w/ Opus 4.5 to OpenCode w/ Kimi K2.5 provided by Fireworks AI. I never run into time-based limits, whereas before I was running into daily/hourly/weekly/monthly limits all the time. And I'm paying a fraction of what Anthropic was charging (from well over $100 per month to less than $50 per month). | | | |
| ▲ | embedding-shape 2 hours ago | parent | prev | next [-] | | > it’s been pretty much flawless

So above and beyond frontier models? Because they certainly aren't "flawless" yet, or we have very different understandings of that word. | |
| ▲ | crystal_revenge an hour ago | parent [-] | | I have increasingly changed my view on LLMs and what they're good for. I still strongly believe LLMs cannot replace software engineers (they can assist, yes, but software engineering requires too much 'other' stuff that LLMs really can't do), but LLMs can replace the need for software.

During the day I work on building systems that move lots of data around, where context and understanding of the business problem is everything. There I largely use LLMs for assistance, because I need the system to be robust, scalable, maintainable by other people, and adaptable to a large range of future needs. LLMs will never be flawless in a meaningful sense in this space (at least in my opinion).

When I'm using Kimi, I'm using it for purely vibe-coded projects where I don't look at the code (and if I do, I consider that a sign I'm not thinking about the problem correctly). Are these programs robust, scalable, generalizable, adaptable to future use cases? No, not at all. But they don't need to be; they need to serve a single user for exactly the purpose I have. There are tasks that used to take me hours that now run in the background while I'm at work.

In this latter sense I say "flawless" because 90% of my requests solve the problem on the first pass, and in the 10% of cases where there is some error, it is resolved in a single follow-up request, and I never have to look at the code. For me that "don't have to look at the code" is a big part of my definition of "flawless". | |
| ▲ | mhitza an hour ago | parent [-] | | Your definition of flawless is fine for you, but it requires a big asterisk. Without it being called out, look at how your message would have read to someone who's not in the know about LLM limitations; it would have contributed further to the disillusionment with the field and the gaslighting that big companies are already doing. |
|
| |
| ▲ | varispeed an hour ago | parent | prev [-] | | Depends what you count as flawless. From my perspective even GPT 5.2 produces mostly garbage-grade code (yes, it often works, but it is nowhere near suitable for production) and takes several iterations to get it to a remotely workable state. | |
| ▲ | crystal_revenge an hour ago | parent [-] | | > not suitable for anywhere near production

This is what I've increasingly come to see as the wrong way to understand how LLMs are changing things. I fully agree that LLMs are not suitable for creating production code. But the bigger question you need to ask is 'why do we need production code?' (and to be clear, there are and always will be cases where we do, just increasingly fewer of them).

The entire paradigm of modern software engineering is fairly new. I mean, it wasn't until the invention of the programmable microprocessor that we even had the concept of software, and that was less than 100 years ago. Even if you go back to the 80s, a lot of software didn't need to be distributed or serve an endless variety of users. I've been reading a lot of old Common Lisp books recently, and it's fascinating how often you're really programming Lisp for yourself and your own experiments.

But since the advent of the web, and of scaling software to many users with diverse needs, we've increasingly needed to maintain systems that have all the assumed properties of "production" software. Scalable, robust, adaptable software is only a requirement because it was previously infeasible for individuals to build non-trivial systems that solved more than one or two personal problems. Even software engineers couldn't write their own text editor and still have enough time left to write other software. All of the standard requirements of good software exist for reasons that are increasingly becoming less relevant.

You shouldn't rely on agents/LLMs to write production code, but you also should increasingly question "do I need production code?" | |
|
| |
| ▲ | fullstackchris 3 hours ago | parent | prev [-] | | Regardless, it's been 3 years since the release of ChatGPT. Literally 3. Imagine how much low-hanging fruit (or even big breakthroughs), things like quantization, will get folded into the pricing in just 5 more years. No doubt in my mind that the "price per token" will head towards 0. |
|
|
| ▲ | lambda 2 hours ago | parent | prev | next [-] |
| You don't even need to go this expensive. An AMD Ryzen Strix Halo (AI Max+ 395) machine with 128 GiB of unified RAM will set you back about $2500 these days. I can get about 20 tokens/s on Qwen3 Coder Next at an 8-bit quant, or 17 tokens/s on Minimax M2.5 at a 3-bit quant. Now, these models are a bit weaker, but they're in the realm of Claude Sonnet to Claude Opus 4: roughly 6-12 months behind SOTA, on something that's well within a personal hobby budget. |
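If you want to reproduce numbers like these on your own hardware, llama.cpp's bundled llama-bench tool is the easy way; a minimal invocation looks something like this (the model path and quant here are just placeholders):

llama-bench -m ./Qwen3-Coder-Next-Q8_0.gguf -ngl 99 -p 512 -n 128

The tg (generation) row it prints is the one comparable to the tokens/s figures above; the pp row measures prompt processing separately.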
| |
| ▲ | sosodev 3 minutes ago | parent | next [-] | | I was testing the 4-bit Qwen3 Coder Next on my 395+ board last night. IIRC it was maintaining around 30 tokens a second even with a large context window. I haven't tried Minimax M2.5 yet. How do its capabilities compare to Qwen3 Coder Next in your testing? I'm working on getting a good agentic coding workflow going with OpenCode and I had some issues with the Qwen model getting stuck in a tool calling loop. | |
| ▲ | nyrikki an hour ago | parent | prev | next [-] | | It is crazy to me that it is that slow. 4-bit quants don't lose much with Qwen3 Coder Next, and unsloth/Qwen3-Coder-Next-UD-Q4_K_XL gets 32 tps on a 3090 (24 GB) in a VM with a 256k context size under llama.cpp. Similarly, unsloth/gpt-oss-120b-GGUF:F16 gets 25 tps and gpt-oss-20b gets 195 tps!!! The advantage is that you can use the APU for booting and pass the GPU through to a VM, so you get nice, safer VMs for agents at the same time, all while using DDR4, IMHO. | |
| ▲ | lambda an hour ago | parent [-] | | Yeah, this is an AMD laptop integrated GPU, not a discrete NVIDIA GPU on a desktop. Also, I haven't really done much to try tweaking performance, this is just the first setup I've gotten that works. | | |
| ▲ | nyrikki an hour ago | parent [-] | | The memory bandwidth of the laptop CPU is better for fine-tuning, but MoE models really work well for inference. I won't use a public model for my secret sauce; there's no reason to help the foundation models with it. Even an old 1080 Ti works well for FIM in IDEs. IMHO the above setup works well for boilerplate, and even the SOTA models fail on the domain-specific portions. While I lucked out and foresaw the huge price increases, you can still find some good deals. Old gaming computers work pretty well, especially if you have Claude Code churn locally on the boring parts while you work on the hard parts. | |
| ▲ | lambda 41 minutes ago | parent [-] | | Yeah, I have a lot of problems with the idea of handing our ability to write code over to a few big Silicon Valley companies, and I also have privacy concerns, environmental concerns, etc., so I've refused to touch any agentic coding until I could run open-weights models locally. I'm still not sold on the idea, but this allows me to experiment with it fully locally, without paying rent to some companies I find quite questionable. I know exactly how much power I'm drawing, and the money is already spent; I'm not spending hundreds a month on a subscription. And yes, the Strix Halo isn't the only way to run models locally for a relatively affordable price; it's just the one I happened to pick, mostly because I already needed a new laptop, and that 128 GiB of unified RAM is pretty nice even when I'm not using most of it for a model. |
|
|
| |
| ▲ | cowmix 2 hours ago | parent | prev [-] | | If you don't mind saying, what distro and/or Docker container are you using to get Qwen3 Coder Next going? | |
| ▲ | nyrikki an hour ago | parent | next [-] | | I can't answer for the OP but it works fine under llama.cpp's container. | |
| ▲ | lambda an hour ago | parent | prev [-] | | I'm running Fedora Silverblue as my host OS; this is the kernel:

$ uname -a
Linux fedora 6.18.9-200.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Feb 6 21:43:09 UTC 2026 x86_64 GNU/Linux
You also need to set a few kernel command-line parameters to allow it to use most of your memory as graphics memory. I have the following on my kernel command line; those are each 110 GiB expressed as a number of pages (I figure leaving 18 GiB or so for CPU memory is probably a good idea):

ttm.pages_limit=28835840 ttm.page_pool_size=28835840
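For anyone checking the arithmetic, that page count is just 110 GiB divided by the page size (standard 4 KiB pages assumed):

echo $(( 110 * 1024 * 1024 * 1024 / 4096 ))   # prints 28835840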
Then I'm running llama.cpp in the official llama.cpp Docker containers. The Vulkan one works out of the box. I had to build the container myself for ROCm; the published llama.cpp container has ROCm 7.0, but I need 7.2 to be compatible with my kernel. I haven't actually compared the speed directly between Vulkan and ROCm yet; I'm pretty much at the point where I've just gotten everything working. In a checkout of the llama.cpp repo:

podman build -t llama.cpp-rocm7.2 -f .devops/rocm.Dockerfile --build-arg ROCM_VERSION=7.2 --build-arg ROCM_DOCKER_ARCH='gfx1151' .
Then I run the container with something like:

podman run -p 8080:8080 --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined --security-opt label=disable \
    --rm -it -v ~/.cache/llama.cpp/:/root/.cache/llama.cpp/ -v ./unsloth:/app/unsloth llama.cpp-rocm7.2 \
    --model unsloth/MiniMax-M2.5-GGUF/UD-Q3_K_XL/MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf \
    --jinja --ctx-size 16384 --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --port 8080 --host 0.0.0.0 -dio
Still getting my setup dialed in, but this is working for now.

Edit: Oh, yeah, you had asked about Qwen3 Coder Next. That command was:

podman run -p 8080:8080 --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined --security-opt label=disable \
--rm -it -v ~/.cache/llama.cpp/:/root/.cache/llama.cpp/ -v ./unsloth:/app/unsloth llama.cpp-rocm7.2 -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q6_K_XL \
--jinja --ctx-size 262144 --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --port 8080 --host 0.0.0.0 -dio
(As mentioned, I'm still just getting this set up, so I've been going back and forth between using `-hf` to pull directly from Hugging Face and using `uvx hf download` in advance; sorry that these commands are a bit messy. The problem with using `-hf` in llama.cpp is that you'll sometimes get surprise updates where it has to download many gigabytes before starting up.) |
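If you'd rather pre-fetch, something along these lines should work (the include pattern and local directory here are guesses; check the repo's actual layout on Hugging Face):

uvx hf download unsloth/Qwen3-Coder-Next-GGUF --include "UD-Q6_K_XL/*" --local-dir ./unsloth/Qwen3-Coder-Next-GGUF

Then point --model at the downloaded .gguf instead of using -hf, and startup won't depend on the network.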
|
|
|
| ▲ | 28 minutes ago | parent | prev | next [-] |
| [deleted] |
|
| ▲ | consp 3 hours ago | parent | prev | next [-] |
| $20k for such a setup for a hobbyist? You can leave the "somewhat" away and go into the sub-1% region globally. A kW of power is still $2k/year at least for me; not that I expect it to run continuously, but it's still not negligible when you can get by on $100-200 a year with cheap subscriptions. |
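For reference, that figure checks out at typical European electricity prices (rate assumed):

awk 'BEGIN { printf "$%.0f/year\n", 1 * 24 * 365 * 0.23 }'   # 1 kW running 24/7 at an assumed $0.23/kWh, about $2015/year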
| |
| ▲ | dec0dedab0de 2 hours ago | parent | next [-] | | There are plenty of normal people with hobbies that cost much more. Off the top of my head, recreational vehicles like racecars and motorcycles, but I'm sure there are others. You might be correct when you say the global 1%, but that's still 83 million people. | |
| ▲ | markb139 2 hours ago | parent [-] | | I used to think photography was an expensive hobby until my wife got back into the horse world. |
| |
| ▲ | simonw 2 hours ago | parent | prev [-] | | "a (somewhat wealthy) hobbyist" |
|
|
| ▲ | manwe150 2 hours ago | parent | prev | next [-] |
| Reminder to others that $20k is a one-time startup cost, amortized at perhaps $2-4k/year (plus power). That's in the realm of a mere gym membership for a family around me. |
| |
| ▲ | vuggamie an hour ago | parent [-] | | So 5-10 years to amortize the cost. You could get 10 years of Claude Max and your $20k could stay in the bank in case the robots steal your job or you need to take an ambulance ride in the US. |
|
|
| ▲ | blibble an hour ago | parent | prev | next [-] |
| > And it's hard to imagine that the hardware costs don't come down quite a bit.

Have you paid any attention to the hardware situation over the last year? This week they've bought up the 2026 supply of disks. |
|
| ▲ | newsoftheday 3 hours ago | parent | prev | next [-] |
| > a cost that a (somewhat wealthy) hobbyist can afford

$20,000 is a lot to drop on a hobby. Probably less than 10%, maybe less than 5%, of all hobbyists could afford that. |
| |
| ▲ | xboxnolifes 27 minutes ago | parent | next [-] | | Up front, yeah. But people with hobbies on the more expensive end can definitely put out $4k a year. I'm thinking of people who have a workshop and like to buy new tools and start projects. | |
| ▲ | charcircuit 2 hours ago | parent | prev [-] | | You can rent a computer from someone else to greatly reduce the spend. If you just pay for tokens, it will be cheaper than buying the entire computer outright. |
|
|
| ▲ | msp26 an hour ago | parent | prev | next [-] |
| Horrific comparison point. LLM inference is way more expensive locally for single users than running batch inference at scale in a datacenter on actual GPUs/TPUs. |
| |
| ▲ | AlexandrB an hour ago | parent [-] | | How is that horrific? It sets an upper bound on the cost, which turns out to be not very high. |
|
|
| ▲ | qaq 2 hours ago | parent | prev | next [-] |
| If I remember correctly, Dario claimed that AI inference gross profit margins are 40-50%. |
| |
| ▲ | gjk3 28 minutes ago | parent [-] | | Why do you people trust what he has to say? Like omg dude. These folks play with numbers all the time to suit their narrative. They are not independently audited. What do you think scares them about going public? Things like this. They cannot massage the numbers the same way they do in the private market. The naivete on here is crazy tbh. |
|
|
| ▲ | PlatoIsADisease 2 hours ago | parent | prev [-] |
| > 24 tokens/second

This is marketing, not reality. Give it a few lines of code and it becomes unusable. |