| ▲ | pronik 11 hours ago |
| They will be, and that moment is not that far off. We've got the progression in place already: first, only large data centers could run performant LLMs; we are now firmly in "a bunch of servers with a couple of H100s each" territory, slowly moving into "128 GB VRAM on a MacBook Pro or a Strix Halo". Within the next year, the pattern of "expensive remote LLM for planning, local slow-but-faster-than-human LLM for execution" will become the norm for companies, slowly moving to "using a local LLM for everything is good enough". And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed. The question will be: how much of the current compute-capacity craze will local hosting give the kiss of death to, and what that means for the market. |
|
| ▲ | reisse 9 hours ago | parent | next [-] |
| > They will be, and that moment is not that far off. It's here, right now. I'm running quantized Qwen and Gemma on a decent but three-year-old gaming rig (think RTX 3080 12GB and 32 GB RAM). Yes, it's slow, and it has a small context window. But it can (given a proper harness) run through my trip photos and categorize them. It can OCR receipts and summarize spending. It can answer simple questions, analyze code, and even write code when little context is required. I could probably get a half-decent autocomplete out of it, if I bothered with VS Code integration. "128 GB VRAM on a MacBook Pro or a Strix Halo" is already a minimum viable setup for agentic coding, I think. > And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed. Currently, it works exactly the other way around. The cloud versions are orders of magnitude cheaper than self-hosting, because sharing utilizes servers much more efficiently. A company can spend half a million bucks on a rig running GLM 5.1, and get data security, flexibility and lack of censorship, but oh, it's so expensive compared to Anthropic per-seat plans. |
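The kind of harness described above can be sketched in a few lines of Python against a local OpenAI-compatible endpoint (llama.cpp's llama-server and Ollama both expose one). The endpoint URL, model name, and category list below are illustrative assumptions, not the commenter's actual setup:

```python
# Hypothetical sketch of a photo-categorization harness: send each image to a
# local vision model behind an OpenAI-compatible endpoint and file it under
# the returned label. Model name and categories are assumptions.
import base64
import json
import urllib.request

CATEGORIES = ["food", "landscape", "people", "receipt", "other"]

def build_request(image_bytes: bytes, model: str = "gemma-3-12b-it") -> dict:
    """Build a chat payload asking the model for a single category word."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Answer with exactly one word from: " + ", ".join(CATEGORIES)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "temperature": 0,
    }

def parse_category(reply: str) -> str:
    """Map a (possibly chatty) model reply onto a known folder name."""
    word = reply.strip().lower().strip(".")
    return word if word in CATEGORIES else "other"

def categorize(image_bytes: bytes,
               url: str = "http://localhost:8080/v1/chat/completions") -> str:
    """One round trip to the local server; returns a folder name."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_request(image_bytes)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
    return parse_category(reply)
```

The harness does the reliability work here (constrained prompt, strict parsing with a fallback bucket), which is exactly why a small local model is good enough for this kind of task.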
| |
| ▲ | dust1n 8 minutes ago | parent | next [-] | | Can you share how you use it to categorize trip photos? | |
| ▲ | digitaltrees 7 hours ago | parent | prev | next [-] | | I built my own IDE and run my own model specifically to have private agentic coding. I can still access model APIs, but I can be purely local if I want to. It’s amazing. | | | |
| ▲ | datadrivenangel 8 hours ago | parent | prev | next [-] | | In my experience, once you get to ~30 gigs of RAM for a model like Gemma4, the rest of the 128GB of memory is simply nice to have. The speed and costs are what make it tough, though, because it's slower and more expensive than the same model served on a big accelerator card, and it's going to be worse than a frontier model. | | |
| ▲ | digitaltrees 7 hours ago | parent [-] | | I wonder if it really needs to be worse. I am playing with the idea of fine tuning a model on my exact stack and coding patterns. I suspect I could get better performance by training “taste” into a model rather than breadth. | | |
| ▲ | andy_ppp 4 hours ago | parent | next [-] | | Fine tuning these models (at least with PPO or equivalent) requires even more VRAM than inference does, potentially 2-3 times more. | |
| ▲ | epicureanideal 4 hours ago | parent | prev [-] | | I also wonder about JS only, Python only, etc models. Maybe the future is a selection of local, specific stack trained models? | | |
| ▲ | andy_ppp 4 hours ago | parent [-] | | These models' ability to generalise at coding will likely get worse if you remove high-quality training data like all of Python. |
|
|
| |
| ▲ | DrewADesign 5 hours ago | parent | prev | next [-] | | Multiple gazillion-dollar companies each seem to be spending to ensure that they alone pretty much dominate all knowledge work, with customers eating up their tokens like Cookie Monster. I wonder if any of them could survive as LLM providers if they not only failed to do that, but the entire industry ended up selling what the current Cookie Monster would call a “sometimes snack,” for very special occasions? | |
| ▲ | winocm 8 hours ago | parent | prev | next [-] | | Perhaps I am the odd one out here, but a small part of me wants to see what happens when you run a proprietary SOTA model on a laptop. | | |
| ▲ | pianopatrick 3 hours ago | parent | next [-] | | Currently I'm testing something like this just to see what happens. I have an old laptop with 4GB of RAM. I attached a USB drive with Gemma 4 31B model (which is 32.6 GB). Currently the laptop is running llama.cpp and trying to respond to a prompt by streaming the model from disk. The USB drive light is flickering, showing something is happening. It's been about 8 hours since I entered the prompt and I've gotten about 10 tokens back so far. I'm going to leave it running overnight and see what happens. | | |
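A quick back-of-envelope check (the effective read speed below is inferred from the reported rate, not measured) suggests the bottleneck is exactly what you'd expect: a dense model streamed from disk has to reread roughly all of its weights for every generated token:

```python
# Back-of-envelope check on the numbers above: ~10 tokens in ~8 hours while
# streaming a 32.6 GB checkpoint from a USB drive. Assumes a dense model
# (full weight reread per token); the implied read speed is an estimate.
model_bytes = 32.6e9          # checkpoint size reported above
elapsed_s = 8 * 3600          # ~8 hours since the prompt
tokens_out = 10               # tokens received so far

seconds_per_token = elapsed_s / tokens_out                  # 2880 s/token
effective_read_mb_s = model_bytes / seconds_per_token / 1e6 # ~11 MB/s

print(f"{seconds_per_token:.0f} s/token, ~{effective_read_mb_s:.0f} MB/s effective read")
```

Under those assumptions the drive is delivering only around 11 MB/s effective throughput, which is consistent with a slow USB disk plus filesystem overhead.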
| ▲ | stuaxo 2 minutes ago | parent [-] | | Nice. What did you use to do this, something standard like llama.cpp, something else like vllm, or your own contraption? |
| |
| ▲ | amelius an hour ago | parent | prev | next [-] | | You burn your lap? | |
| ▲ | reisse 8 hours ago | parent | prev | next [-] | | Nothing special? I mean, the inference engine might need some tweaks to support whatever compute is available. But then, if you set up a few terabytes of disk as swap, and replace the RAM with bigger sticks if possible, it should work? Slowly, of course, but there's no reason it shouldn't. | | |
| ▲ | reverius42 7 hours ago | parent [-] | | The big difference will be measuring seconds per token instead of tokens per second. | | |
| |
| ▲ | yfw 8 hours ago | parent | prev | next [-] | | You can if you have enough ram slots? | |
| ▲ | SilentM68 4 hours ago | parent | prev [-] | | Not sure if this is exactly the scenario you envision, but I run ComfyUI on an Acer Helio 300 laptop from four years ago. It has 16GB RAM and an NVIDIA GeForce RTX 2060 w/6144MiB of VRAM, and I have generated a few images using the "NetaYumev35_pretrained_all_in_one.safetensors" @ 10.6GB checkpoint (well beyond the 6GB capacity of the RTX 2060 card). That being said, it takes more than 10 minutes to complete the task. Of course, I have to turn off all other apps and browser tabs or hibernate them. If I don't, the laptop's fans begin to spin up like an airplane propeller. It's worth mentioning that I've tried to do this with other IDEs and all seem to fail with some error or another, usually an out-of-VRAM issue; I've only gotten it to work with ComfyUI. I use an anaconda environment (though I would have preferred a "uv" environment) on Linux, and automate the startup sequence with the following script (start_comfy.sh) rather than manually activating the environment in the terminal each time:

    #!/bin/bash
    #
    # temporary shell version
    eval "$(conda shell.bash hook)"
    conda activate comfy-env
    comfy launch -- --lowvram --cpu-vae

Here are some of the images:
https://imgbox.com/nqjYhdx3
https://imgbox.com/93vSWFic
https://imgbox.com/qs1898dz I'm hesitant to increase the sizes of the renders as that will surely stress my laptop's components. | | |
| ▲ | t_mahmood 2 hours ago | parent [-] | | That's exactly why I'm not running local: to not stress my components. It seems we are in for a long haul with this AI bubble (can't wait for it to pop), so I need to make sure I survive this madness, as I sure can't afford to replace anything right now. |
|
| |
| ▲ | antidamage 7 hours ago | parent | prev | next [-] | | This is my exact setup as well and dear lord gemma is absolutely batshit insane. I'm trying to get a self-reflection and confidence loop going now, but it does feel like it's not the local resources, it's the limits of the training. Dedicated coding or dedicated real-world task models would be a good optimisation. | |
| ▲ | yieldcrv 6 hours ago | parent | prev [-] | | I need to see these proper harnesses. I tried oMLX and OpenCode a few weeks ago and the 65k context window was useless: it tried to analyze a very small codebase before going full-on agentic and ran out of context window immediately. I don't have time to tweak 1,000 permutations of settings just to re-prove that it's not as smart as Opus 4.6. I need out-of-the-box multimodal behavior as simple as typing claude in the command line, and it's so not there yet. But I'm open to seeing what people's workflows are. | | |
| ▲ | phamilton 5 hours ago | parent | next [-] | | I'm running opencode with qwen3.6-35b-a3b at a 3-bit quant. I also have qwen3.5-0.8b used for context compaction. I run with 128k context. It's usable. I set it loose on the postgres codebase, told it to find or build a performance benchmark for the bloom filter index and then identify a performance improvement. It took a long time (overnight), but eventually presented an alternate hashing algorithm with experimental data on false positive rate, insertion speed and lookup speed. There wasn't a clear winner, but it was a reasonable find with rigorous data. | | |
| ▲ | Balinares 2 hours ago | parent [-] | | Do you encounter looping issues at such low quants? How do you deal with those? |
| |
| ▲ | cyberax an hour ago | parent | prev | next [-] | | I'm playing with a tape drive for backups, so I asked a local model to rewrite LTFS ( https://github.com/LinearTapeFileSystem/ltfs ) in Go. I gave it the reference C implementation, the LTFS spec from SNIA, and asked it to use the C implementation to verify the correctness of the Go code. LTFS is a pretty straightforward spec, so it made a very reasonable port within about 2 days. It's now working on implementing the iSCSI initiator (client) to speak with my tape drive directly, without involving the kernel. Edit: the model is Qwen3.6-35B | |
| ▲ | nullsanity 6 hours ago | parent | prev [-] | | Hey man, you can just say "I'm lazy, so I'm staying with the cloud. if I wanted to use my brain, I wouldn't be using AI, gosh" - it's much shorter. |
|
|
|
| ▲ | root_axis 7 hours ago | parent | prev | next [-] |
You are greatly underestimating the hardware requirements for productive local LLMs. Research consistently shows that parameter count sets the practical ceiling for a model's reliability. Quantized models with double-digit-billion parameter counts will never be reliable enough to achieve results in the realm of something like Opus 4.6.
| |
| ▲ | thot_experiment 3 hours ago | parent | next [-] | | Flat wrong. Q6 Gemma 31b feels a lot like Opus 4.5 to me when run in a harness so it can retrieve information and ground itself. The gap is not that big for a lot of use cases. Qwen MoE is fast as fuck locally for things that are oneshottable. I have subscriptions to all the major providers right now, and since Gemma 4 and Qwen 3.6 came out I haven't hit limits a single time. I'm actually super surprised by the number of things I try with Gemma 4 with the intent of seeing how it fails and then having Claude do it, only to come away with something perfectly usable from the local model. | | |
| ▲ | cbg0 2 hours ago | parent | next [-] | | Your n=1 might not be very relevant outside your personal use. In less contaminated benchmarks Gemma 4 is way below Sonnet 4.5, let alone Opus models: https://swe-rebench.com/ | | |
| ▲ | thot_experiment 40 minutes ago | parent | next [-] | | Benchmarks only give you the roughest idea of how models compare in real world use. They're essentially useless beyond maybe classifying models into a few buckets. The only way you gain an understanding of something as complex as how an LLM integrates with your workflow is by doing it and measuring across many trials. I've been running Opus 4.7 in Claude Code and Gemma 4 31b in parallel on projects for hours a day this past week, Opus 4.7 is definitely better, but for many things they are roughly equivalent, there are some things on the edge that are just up to chance, and either model may stumble across the solution, and there are some areas of my work that reliably trip up both models and I get better mileage out of writing code the old fashioned way. I understand that I'm just one data point, but I'm not writing CRUD apps here, I'm doing DSPs and weird color math in shaders, I don't think any of it is hard, and the stuff that I think is hard none of the models are good at yet, but idk, they just don't seem that extremely disparate from one another. FWIW I think Gemma 4 31b is more likely to be of use to me than Sonnet, idfk, maybe it's a skill issue but I love Opus 4.7, undisputed king, but Sonnet seems borderline useless and I basically think of it as on the same level as Qwen 35b MoE. | |
| ▲ | larodi 2 hours ago | parent | prev | next [-] | | I’m building a pipeline and testing against gemma4 and Gemini’s 3-1 flash. Both are very good on certain tasks and even n-way clustering works almost perfect almost always. But they diverge greatly on other particular ones whenever the ViT tower and the apriori knowledge of the world is crucial. I wish Gemma was on par but both me and Google know they not. | |
| ▲ | onion2k 2 hours ago | parent | prev [-] | | You do need to ask whether or not Sonnet or Opus are overkill for a lot of work though. If Gemma4 with some human effort can achieve the same result as Sonnet then it's arguably a lot more cost effective as you're paying for the person to operate each one regardless. | | |
| ▲ | thot_experiment an hour ago | parent [-] | | I 100% agree with your philosophy but I wanna note that I genuinely find Gemma 4 31b to be better than Sonnet. To be clear, this makes NO sense to me, so I'm probably just high and making stuff up or just biased by a small sample size since I don't use Sonnet that often. I find that Gemma 4 makes the sort of "dumb AI" mistakes Sonnet makes less often, especially in agentic mode. I genuinely don't know how that can be true but Sonnet feels much more like "autocomplete" and Gemma 4 feels like "some facsimile of thought". |
|
| |
| ▲ | root_axis 2 hours ago | parent | prev | next [-] | | Sorry but you're just seeing what you want to see. The idea that a 31b model is anywhere even in the ballpark of something like Opus 4.5 is just absurd on its face. | | |
| ▲ | thot_experiment an hour ago | parent | next [-] | | False. The absolute capability is irrelevant; with the proper harness, 31b is more than adequate for a very large portion of the tasks I ask AI to do. The metric isn't how good the model is at Erdos problems, it's how reliably it can remove drudgery from my life. It just autonomously reverse engineered a bluetooth protocol with minimal intervention; its ability to react to data and ground itself is constantly impressive to me. I do a ton of testing with these models, and today I had Gemma answer a physics problem that Opus 4.7 gave up on. With a decent harness and context, the set of tasks where both their capabilities are good enough is surprisingly large. The tasks I have that stump Gemma often also stump Opus 4.7. | | |
| ▲ | amelius an hour ago | parent [-] | | This is like saying that 640kB is enough for anybody. | | |
| ▲ | thot_experiment 27 minutes ago | parent [-] | | No, it isn't. I am saying that the set of tasks that can be completed by Opus 4.7 has a surprisingly large overlap with the set of tasks that can be completed by Gemma 31B. It is meaningfully equivalent in many cases. (Of course, if I'm being honest, 640kB is fine; I'm sure tons of the world's commerce is handled by less, for example. The delta between a system with 640kB of RAM and a modern one is near nil for many people; the UX on a PoS terminal does not require more than that, for example, and the Hacker News UX could also be roughly the same.) |
|
| |
| ▲ | BoredomIsFun 36 minutes ago | parent | prev [-] | | It would be true if model providers did not throttle their models. I do not have definitive proof that they do, but the rumors are abundant. |
| |
| ▲ | alfiedotwtf 2 hours ago | parent | prev [-] | | I’m guessing Qwen3.6 for agentic coding and Gemma4 for non-coding stuff? | | |
| ▲ | thot_experiment an hour ago | parent [-] | | No, exactly the opposite, actually. Qwen3.6 is too imprecise for long-running agentic tasks; it doesn't have the same ability to check itself as Gemma does in my testing. I keep Qwen MoE in VRAM by default because there are tons of tasks I trust it to oneshot and its 90 tok/sec is unparalleled, but for anything where I don't want to have to intervene too much, it can't be trusted. |
|
| |
| ▲ | wincy 7 hours ago | parent | prev | next [-] | | Won’t these H100s drop in price in a few years? With the data center build-out, surely these will become 1/10th the price and you’ll be able to set up a local LLM as good as Opus 4.7. Even if frontier models become more advanced and memory-hungry, you could use the same power draw as your oven to run a current-day frontier model as needed? If I could drop $10,000 to have an effectively permanent Opus 4.7 subscription today, I would. | |
| ▲ | root_axis 6 hours ago | parent | next [-] | | > Won’t these H100s drop in price in a few years Doubtful. The increase in demand is greatly outpacing supply, and all signs point to a continued acceleration in demand > If I could drop $10,000 to have an effectively permanent opus 4.7 subscription today, I would. lol well obviously, but realistically that price point is going to be closer to $100k, with a perpetual $1k a month in power costs. | | |
| ▲ | wincy 6 hours ago | parent [-] | | Cool, thanks for the information. I guess they drive prices down by massively parallelizing requests on, say, an 8x H100 array, so the cost is spread across many users. So if I wanted to use it for, say, 8 hours a day in my theoretical world, it'd be too expensive. My work definitely wouldn't pay $100,000 for a server farm even if it'd give an AI to all our employees; you'd have to have engineers, a colocation space, basically all the problems that companies didn't like and went to AWS to avoid. | |
| ▲ | root_axis 5 hours ago | parent [-] | | Well $100k was a generous guesstimate for some time in the future where something like an Opus 4.7 is old news. If we think about the near future, something like Kimi2.6 is within the realm of Opus 4.6 today, but requires closer to $700k in hardware to run. |
| |
| ▲ | dyauspitr 2 hours ago | parent | prev | next [-] | | Why? These models are going to keep drastically improving, and given all the new data centers, token prices will probably drop a lot in the future. Seems shortsighted given the absurd timelines these things have been improving on. | |
| ▲ | aaronblohowiak 2 hours ago | parent | prev [-] | | taalas!!! |
| |
| ▲ | 33MHz-i486 6 hours ago | parent | prev [-] | | opus 4.7 caliber models are trillions of params, and a single instance would likely run on multiple h200s. $100k of hardware. not coming to your laptop anytime soon. |
| |
| ▲ | segmondy 6 hours ago | parent | prev | next [-] | | Joke's on you. We are already running Deepseekv4Flash, Mimo2.5, MiniMax2.7, and Qwen3-397B locally on very affordable hardware. These models are in the realm of Opus4.6. For those of us a bit crazy, we are running KimiK2.6, GLM5.1 and more ... | |
| ▲ | root_axis 6 hours ago | parent | next [-] | | I have two A100s and have been playing with local models for years. There's definitely moments where they are quite impressive, but small context sizes and unreliability become immediately obvious. > For those of us a bit crazy, we are running KimiK2.6, GLM5.1 Yes, those can compare to Opus, but you can't run those unquantized for less than $400k in hardware. | | |
| ▲ | doctorpangloss 6 hours ago | parent [-] | | Two Mac Studio M3 Ultra 512GB and 1 USB cable can run all those models - maybe about $30,000 in hardware - and based on my benchmarks, those Mac Studios were twice as fast as the A100s on Deepseek v4 Flash, which has a quantization but not really a lossy one. | | |
| ▲ | root_axis 5 hours ago | parent [-] | | That cannot run KimiK2.6 or GLM5.1 i.e models within the ballpark of anything offered by frontier companies. |
|
| |
| ▲ | binyu 6 hours ago | parent | prev [-] | | They all definitely still fall short of Opus 4.6, though. They are good but fail on extremely complex tasks, in contrast with a frontier model that will keep trying until it succeeds or exhausts the solution space. | |
| ▲ | julianlam 5 hours ago | parent [-] | | Not by much, and moving goalposts makes for a bad comparison. Local open weight models are already more powerful than frontier models from only a year back. If you believe what you read here, the gap is closing fast. |
|
| |
| ▲ | stubish 3 hours ago | parent | prev | next [-] | | It depends on what you mean by 'productive'. The article mainly seems to be about targeting consumer-level hardware, such as the Neural Processing Unit you need for a 'Copilot PC'. Windows Recall is (was?) one such local AI application. If Microsoft get their way and my next PC has one, I look forward to using it for 'productive' purposes such as playing games and handling natural-language stuff, leaving my GPU free for GPUing. | |
| ▲ | CuriouslyC 6 hours ago | parent | prev | next [-] | | Parameter size gets you world knowledge and better persistence of behavior as context grows. Both of those things can be engineered around to a large degree, and the latest Qwen models show that small models can be quite smart in narrow domains and short time windows. | | |
| ▲ | alfiedotwtf 2 hours ago | parent [-] | | … maybe we should just teach models how to get their world knowledge from a local Postgres connection! Then the model can be tiny, and it can query to its little heart desires AND run on commodity hardware TODAY! |
| |
| ▲ | byzantinegene 7 hours ago | parent | prev | next [-] | | i would argue we don't need anything near Opus to be productive. Sonnet is plenty productive enough | | |
| ▲ | root_axis 6 hours ago | parent | next [-] | | I use Opus 4.6 as an example because it's the LLM that has been widely recognized by the public as being reliably capable of doing real work across many domains. However, the same logic applies to Opus 4.5 and even previous generations. These models have huge parameter counts and large context sizes, there's no training technique that can compensate for those qualities in small and quantized models. | |
| ▲ | JumpCrisscross 7 hours ago | parent | prev [-] | | > we don't need anything near Opus to be productive. Sonnet is plenty productive enough For niche applications, sure. For general use, I think the tendency towards the best model being used for everything will–to the model publishers' delight–continue. It's just much easier to get a feel for Opus and then do everything with it, versus switch back and forth and keep track of how Haiku came up with novel ways to dumbfuck this Sunday evening. |
| |
| ▲ | josteink an hour ago | parent | prev [-] | | > You are greatly underestimating the current hardware requirements for productive local LLMs. Fixed that for you. Right now most models are based on floating-point weights and probabilities, which are "expensive" to do math on. Microsoft has researched 1-bit LLMs which can run much more efficiently, and on much cheaper hardware[1]. If this research is reproducible and reusable outside their research models, the cost of running self-hosted LLMs will drop by an order of magnitude once it hits mainstream. [1] https://github.com/microsoft/BitNet |
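For intuition, here is a toy, heavily simplified sketch of the BitNet b1.58 idea (this illustrates the principle, not Microsoft's actual implementation): absmean-quantize weights to {-1, 0, +1}, after which a matrix-vector product needs only additions and subtractions:

```python
# Toy illustration of why "1-bit" (really 1.58-bit ternary) weights are cheap:
# after absmean quantization each weight is -1, 0, or +1, so a matvec needs
# no floating-point multiplies per weight, just adds/subs and one final scale.

def quantize_ternary(row):
    """Absmean quantization: scale by mean |w|, then round each weight to {-1,0,1}."""
    scale = sum(abs(w) for w in row) / len(row) or 1.0
    q = [max(-1, min(1, round(w / scale))) for w in row]
    return q, scale

def ternary_matvec(weight_rows, x):
    """y = Wx using only adds/subs per row, plus one scale multiply at the end."""
    y = []
    for row in weight_rows:
        q, scale = quantize_ternary(row)
        acc = 0.0
        for qi, xi in zip(q, x):
            if qi == 1:
                acc += xi        # +1 weight: add
            elif qi == -1:
                acc -= xi        # -1 weight: subtract
                                 #  0 weight: skip entirely
        y.append(acc * scale)
    return y
```

In real BitNet models the quantization happens at training time, which is why the claim depends on the research being reproducible rather than on post-hoc quantization of existing float models.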
|
|
| ▲ | emadb 3 hours ago | parent | prev | next [-] |
Do you think small models will arrive? I mean, if I need to write a web application in TypeScript, why should I use a model that knows all the programming languages and can answer questions about almost anything? I just need a small, performant model that knows how to write web applications in TypeScript. That would be very helpful and easy to run on my laptop.
| |
| ▲ | driese 2 hours ago | parent | next [-] | | For the same reason that a human who is fluent in five languages can probably express themselves better in each of them than a human who speaks only one, while also having a more nuanced understanding of grammar in general.
From what I know, training on a more diverse dataset makes a model better overall. | | |
| ▲ | amelius an hour ago | parent [-] | | This might be an interesting research question: can you train a model on many languages, and then extract a much smaller model that knows only one language without much loss of quality? |
| |
| ▲ | thot_experiment 3 hours ago | parent | prev [-] | | Depends on your laptop: if it's a Strix Halo or a MacBook with a decent amount of RAM, that day arrived about 6 months ago. Today, if you can run Gemma 31b, you're golden for your basic workslop code; you can do most of it with local models. Heck, for a lot of the tier of programming you might encounter in the average job, Qwen 35b MoE is good enough, and it can hit 100 tok/s on decent hardware. |
|
|
| ▲ | pier25 8 hours ago | parent | prev | next [-] |
How soon do you reckon most people will be able to afford 128-256GB of RAM?
| |
| ▲ | cpt_sobel 39 minutes ago | parent | next [-] | | RAM prices are currently so out of reach because the big players are hoarding every chip they can get their hands on, but if/when the market realizes that locally deployed LLMs are the way to go, maybe (hopefully?) more chips will become available to consumers at lower prices. | |
| ▲ | Schiendelman 8 hours ago | parent | prev | next [-] | | Other than this recent spike, it's been trending cheaper continuously for decades. In a few years 128GB will be as affordable as 12GB (what flagship phones have now) is today. | | |
| ▲ | pier25 7 hours ago | parent | next [-] | | I'm sure it will happen but I don't think it will be soon. 10 years ago I was using 16GB in my MBP and today it's 48GB. It's just a 3x increase during mostly a bonanza period. | | |
| ▲ | DennisP 6 hours ago | parent | next [-] | | For most of that time, I don't think many people had much use for more ram than that. If demand picks up, companies will provide it. And the Mac Studio was available with 512GB until ram got scarce and they cut the max in half recently. | | |
| ▲ | pier25 6 hours ago | parent [-] | | The Mac Studio is a high end computer that the majority can't afford or justify its expense. There's plenty of demand for RAM right now. We'll see how this turns out. |
| |
| ▲ | an hour ago | parent | prev [-] | | [deleted] |
| |
| ▲ | amelius an hour ago | parent | prev [-] | | That "spike" could be a wall ... |
| |
| ▲ | discordance 6 hours ago | parent | prev [-] | | “Gradually, then suddenly” |
|
|
| ▲ | elbasti 7 hours ago | parent | prev | next [-] |
| > The question will be: how much of the current compute capacity craze will local hosting give the kiss of death to and what that means for the market. This will depend on how much inference happens for consumer (desktop, local) vs enterprise ("cloud"), vs consumer mobile (probably also cloud). I would assume that the proportion of "consumer, local" is small relative to enterprise and mobile. |
| |
| ▲ | stubish 3 hours ago | parent [-] | | I think the proportion is small because someone has to pay for the cloud services. When phones, PCs and Desktops ship with NPUs whole new markets open up for all that stuff people want but not enough to pay for. |
|
|
| ▲ | inf3cti0n95 6 hours ago | parent | prev | next [-] |
Certainly, I don't think data centers are the way here. I'd guess it'll most likely be local AI processing, with everything else becoming an API. In the case of the GPTs and Claudes of the world, they'll just be offering indexing APIs and knowledge bases on top of their LLMs.
|
| ▲ | RataNova 11 hours ago | parent | prev | next [-] |
The biggest impact of local models may simply be that they prevent remote inference from becoming the only game in town.
|
| ▲ | dnnddidiej 5 hours ago | parent | prev | next [-] |
Except you will want the frontier to compete. Local models are useful, but you will always need $$$ to be in the same order of magnitude as the frontier, and also $$$ for the same token speed. The question is: would you choose to save $10 a day if it causes your inference to slow down 10x and wastes 2 hours a day waiting on stuff?
|
| ▲ | dakolli 11 hours ago | parent | prev [-] |
This is simply delusional. It costs $20-30k a month to run Kimi 2.6, and the tokens are sold for $3 per million. To sell tokens profitably you'd need to be able to run inference at 150 tokens per second for less than $1,000 USD a month. I don't think people realize how expensive it is to host decently capable models, and how much their use of those models is subsidized. You can only squeeze so many parameters onto consumer-grade hardware that's actually affordable (two 4090s is not consumer grade, and neither is a 128GB MacBook; both are incredibly expensive for the average person, and the models you can still run are not "good enough", they are still essentially useless). People are betting their competency on a future where billionaires are forever generous, subsidizing inference at a 10:1 or 20:1 loss ratio. Guess what: that WILL end, and probably soon. This idea that companies can afford to give you access to $2 million in GPUs for 5 hours a day at a rate of $200.00 a month is simply unsustainable. Right now they are trying to get you hooked, DON'T FALL FOR IT. Study, work hard, sweat, and you'll reap the benefits. The guy making handmade watches, one a month, in Switzerland makes a whole lot more than the guy running a manufacturing line making 50k in China. Just write your own fkin code, people. Don't bet your future on having access to some billionaire's thinking machine. Intelligence, knowledge, and competency aren't fungible; the LLM hype is a lie to convince you that they are.
| |
| ▲ | zozbot234 11 hours ago | parent | next [-] | | No one runs SOTA models 24/7 for individual use or even for a single household or small business, whereas you can run your own hardware basically 24/7 for AI inference. With the new DeepSeek V4 series and its uniquely memory-light KV cache you can even extend this to parallel inference in order to hide memory bandwidth bottlenecks and increase compute intensity. This is perhaps not so useful on a 128GB or 96GB RAM Apple Silicon device (I've seen recent reports of DS4 runs with even one agent flow hitting serious thermal and power limits on these devices, so increasing compute intensity will probably not be helpful there) but it will become useful with 64GB devices or lower that have to stream from a slow disk, or with things like the DGX Spark or to a lesser extent Strix Halo, that greatly overprovision compute while being bottlenecked on memory bandwidth. | | |
| ▲ | doctorpangloss 6 hours ago | parent | next [-] | | deepseek v4 flash on mlx at 1m context runs at 20 t/s decode on a mac studio m3 ultra with 512gb of RAM | | |
| ▲ | alfiedotwtf 2 hours ago | parent | next [-] | | What is everyone running DeepSeek v4 Flash with?! It’s currently unsupported on Llama.cpp and vllm doesn’t support GPU+CPU MoE, so unless all of you have an array of DGX Sparks in your bedroom, what’s the secret sauce?! | | | |
| ▲ | dakolli 5 hours ago | parent | prev [-] | | Just because you read it on a GitHub repo doesn't make it true; it also doesn't take into account CPU temps and the inevitable throttling you'll encounter. | |
| ▲ | doctorpangloss 5 hours ago | parent [-] | | I ran it on my own device, haha. I don't comprehend why people are in such disbelief at how much better this stuff runs on a Mac Studio than on NVIDIA hardware with 1/5th the VRAM. Look, what can I say? NVIDIA is a bigger rip-off than Apple is! | |
| ▲ | platevoltage 5 hours ago | parent [-] | | Which is good, because Nvidia pulling a Micron and ceasing consumer hardware production is right around the corner. |
|
|
| |
| ▲ | 11 hours ago | parent | prev [-] | | [deleted] |
| |
| ▲ | NitpickLawyer 10 hours ago | parent | prev | next [-] | | API prices are most likely not subsidised. A brief look at openrouter can tell you that. There are plenty of providers with zero reason to subsidise that sell models at roughly the same average price. So the model works for them (otherwise they wouldn't do it). | |
| ▲ | ai_fry_ur_brain 9 hours ago | parent [-] | | They are subsidized, heavily. This is simple math; there are lots of reasons to subsidize. Please go look up the hardware requirements to run your favorite model at a given tok/s, multiply that by 86,400 (seconds in a day), then divide by 1M and multiply by the $ per 1M tokens, then ask yourself if there's any possibility they could be profitable or even close to break-even. You are going off vibes alone; this is easily verified, please go verify. What makes you think they have zero reason to subsidize? Because the providers aren't household names, you assume they wouldn't operate at a loss? What's your logic here? You make no sense. | | |
| ▲ | hibikir 6 hours ago | parent | next [-] | | The volume of API tokens many large companies are using through, say, AWS Bedrock is quite high. We've seen leaks of the bills for real-world use cases. It's not unreasonable to see normal individual subscriptions as possibly subsidized... but do we think someone like Anthropic is going to be subsidizing seven-, eight-, or even nine-figure monthly bills from megacorps? Because said megacorps will swap to a competitor immediately, so the subsidy is unlikely to buy loyalty or anything. If Anthropic and OpenAI are subsidizing metered API usage, their model is going to end up just as successful as MoviePass. They are burning enough money on training costs already. | | |
| ▲ | dakolli 6 hours ago | parent [-] | | Large companies are paying an arm and a leg, but I'm still certain that even at $15.00 per million tokens they are not profitable. If you have a machine running at 150 tok/s, you can only make $5,820 a month at $15 per 1M running 24/7. It costs a hell of a lot more than $6k a month to run Claude 4.7 at 150 tok/s on that machine 24/7. This math is a bit off because you have input tokens too, but regardless it's still not profitable, especially given how long it takes to turn around a request, and the caching is probably not all that profitable either. | | |
| ▲ | NitpickLawyer 4 hours ago | parent [-] | | You are all over this thread, but you have no idea how inference works, and it's obvious. Your napkin math is off because you don't know what to add up; you lack the necessary background. And yet you persist and reply all over this thread. I don't get it. Serving models on dedicated hardware is not the same as your at-home 150 t/s thing. Inference is measured in thousands of tokens/s in aggregate (i.e., for all the sessions in parallel). That's how they make money. |
|
| |
| ▲ | CuriouslyC 6 hours ago | parent | prev [-] | | Anthropic and OpenAI make money on API calls, margins have been reported in public filings. Subs are subsidized. | | |
| ▲ | dakolli 6 hours ago | parent [-] | | That's not possible; read my comment above. These are private companies, and there are no public filings regarding their profitability in any sense. You're just making things up. If you have a machine running at 150 tok/s, you can only make $5,820 a month at $15 per 1M running 24/7. It costs a hell of a lot more than $6k a month to run Claude 4.7 at 150 tok/s on that machine 24/7. This math is a bit off because you have input tokens too, but regardless it's still not profitable, especially given how long it takes to turn around a request, and the caching is probably not all that profitable either. | | |
| ▲ | mtone 4 hours ago | parent [-] | | You're forgetting a critical factor: concurrency. If a given piece of hardware serves a single request at 150 tokens/s, it can also serve 20-30 requests at 100 tokens/s each. Suddenly your $5K becomes $100K/month, enough to recoup the cost of the hardware in a year or so. The reason it works: each pass that reads the model weights from memory (the memory-bound part) to calculate the next token can update multiple requests (the compute-bound part) at once. It's also much more energy-efficient per token. [1] https://aimultiple.com/gpu-benchmark | | |
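The concurrency arithmetic in this comment can be sketched directly. The inputs ($15 per 1M tokens, 150 tok/s single-stream, 25 streams at 100 tok/s each) are the thread's hypotheticals, not actual provider figures:

```python
# Napkin math: batched serving multiplies revenue from the same hardware.
SECONDS_PER_MONTH = 86_400 * 30
PRICE_PER_M_TOKENS = 15.0  # $ per 1M output tokens (hypothetical rate)

def monthly_revenue(streams: int, tps_per_stream: float) -> float:
    """Revenue if the box runs 24/7 selling every generated token."""
    tokens = streams * tps_per_stream * SECONDS_PER_MONTH
    return tokens / 1e6 * PRICE_PER_M_TOKENS

single  = monthly_revenue(1, 150)    # ≈ $5,832/month: the "unprofitable" case
batched = monthly_revenue(25, 100)   # ≈ $97,200/month with 25-way batching
print(round(single), round(batched))
```

The ~17x gap between the two figures is the whole argument: per-token economics look hopeless for a single stream but plausible once weight-reads are amortized across a batch.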
| ▲ | dakolli 4 hours ago | parent [-] | | Interesting, I didn't know about this, but it makes sense after reading the article. They are benchmarking on a single GPU with a 20B-param model. Does it scale across 60 H100s over NVLink/NVSwitch? I would be interested to see those benchmarks. The idea that everyone is spinning up $2 million in GPUs to scan their email inbox, search the web, or avoid learning something still seems ridiculous to me regardless. |
|
|
|
|
| |
| ▲ | CamperBob2 10 hours ago | parent | prev | next [-] | | It cost 20-30k a month to run Kimi 2.6. The tokens are sold for $3 per mm. Not if you're OK with 4-bit quantization. More like $30K-$50K one time. Spring for 8 RTX6000s instead of 4, and you can use the full-precision K2.6 weights ( https://github.com/local-inference-lab/rtx6kpro/blob/master/... ). | | |
| ▲ | reissbaker 10 hours ago | parent | next [-] | | RTX 6000 Pro retails for $10k so an 8x is $80k before anything else in the computer, and long-context will have... pretty bad performance (20+ seconds of waiting before any tokens come out), but it's true it technically works. I don't think cloud models are going away; the hardware for good perf is expensive and higher param count models will remain smarter for a looong time. Even if the hardware cost for kind-of-usable perf fell to only $10k, cloud ones will be way faster and you'd need a lot of tokens to break even. | | |
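The "you'd need a lot of tokens to break even" claim can be made concrete with a rough sketch. Every input here is an assumption for illustration: the $80k rig figure from this comment, the $3/M cloud rate quoted upthread, and hypothetical usage and power numbers:

```python
def breakeven_months(hw_cost: float, tokens_per_month_m: float,
                     cloud_price_per_m: float,
                     power_cost_month: float = 0.0) -> float:
    """Months until the hardware outlay is recouped by avoided cloud spend."""
    saved = tokens_per_month_m * cloud_price_per_m - power_cost_month
    return float('inf') if saved <= 0 else hw_cost / saved

# An $80k 8x RTX 6000 Pro rig, a heavy agentic user burning 500M tokens/month
# at a hypothetical $3/M cloud rate, with ~$300/month in power:
m = breakeven_months(80_000, 500, 3.0, 300)
print(round(m, 1))  # ≈ 66.7 months before the rig pays for itself
```

Even under generous usage assumptions the payback period runs to years, which is the core of the argument that cloud inference stays cheaper for most users.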
| ▲ | zozbot234 10 hours ago | parent | next [-] | | > I don't think cloud models are going away; the hardware for good perf is expensive I think local AI will win in its niche by repurposing users' existing hardware, especially as cloud hardware itself gets increasingly bottlenecked in all sorts of ways and the price of cloud tokens rises. You don't have to care about "bad" performance when you've got dedicated hardware that runs your workloads 24/7. Time-critical work that also requires the latest and greatest model can stay on the cloud, but a vast amount of AI work just isn't that critical. | | |
| ▲ | reissbaker 6 hours ago | parent | next [-] | | Users do not have an existing $80k of hardware, are not going to buy $80k of hardware for worse performance than paying $100/month, and models are continuing to grow in size while memory grows in price. | | |
| ▲ | zozbot234 43 minutes ago | parent | next [-] | | You said you need $80k in hardware for "good performance". I'm saying the local AI inference workflow will be a lot more flexible about performance than that, and can get away with something vastly cheaper and in line with what the user owns already. | |
| ▲ | otabdeveloper4 3 hours ago | parent | prev [-] | | > paying $100/month There will not ever be a monthly subscription for LLM tokens. The economics isn't there. Local tokens will always be cheaper. |
| |
| ▲ | ai_fry_ur_brain 9 hours ago | parent | prev [-] | | "I think" Well, your thinking is completely vibes-based and not grounded in any reality I exist in. | |
| |
| ▲ | otabdeveloper4 3 hours ago | parent | prev | next [-] | | > higher param count models will remain smarter for a looong time They're not smarter, they just know more stuff. You probably don't need knowledge about Pokemon or the Diamond Sutra in your enterprise coding LLM. The "smarts" comes from post-training, especially around tool use. | | |
| ▲ | anon7725 3 hours ago | parent [-] | | If the smarts came from post-training, we could show significant gains by doing that post-training again for previous generations of models. But we know that isn’t happening - effective post training is necessary but not sufficient for model performance. |
| |
| ▲ | alfiedotwtf 2 hours ago | parent | prev [-] | | If 8 x RTX 6000 is getting you 20s before initial token, how are cloud vendors doing this? |
| |
| ▲ | zozbot234 10 hours ago | parent | prev [-] | | 4-bit quantization is native for Kimi 2.x series. | | |
| ▲ | CamperBob2 10 hours ago | parent [-] | | You're right, I was thinking of Qwen. K2.6 will run at UD-Q2_K_XL precision on 4x RTX6000 boards, but I have no idea if it's worthwhile. |
|
| |
| ▲ | hparadiz 11 hours ago | parent | prev | next [-] | | Posts like this are so funny to me. I'm staring at a mountain of old hardware right now that cost about $20k ten years ago. I have to pay someone now to come haul it away. What makes you think the current new hardware won't end up with the same fate? > Just write your own fkin code people Bro is nostalgic for googling random Stack Overflow threads for 10 days to figure out a bug the agent fixes in an hour. | | |
| ▲ | HWR_14 3 hours ago | parent | next [-] | | Do you have any old laptop ram? | | |
| ▲ | hparadiz 3 hours ago | parent [-] | | It's old rack mounts. Only one of them has some ECC DDR4 worth something. |
| |
| ▲ | cindyllm 11 hours ago | parent | prev | next [-] | | [dead] | |
| ▲ | dakolli 11 hours ago | parent | prev [-] | | I'm just saying that the agent that can fix your bugs actually costs $100-150 an hour to run, and you're getting it for essentially $200.00 a month. The cost of cloud compute hasn't actually gone down all that much for old hardware; it still costs $500.00 a year to rent a four-core i7-7700K that's 10 years old. Don't expect much more valuable hardware, like modern GPUs, to deflate in price all that quickly. There are three fabs in the world that make GDDR7, and they aren't going to be selling their stock to consumers going forward; it will be purchased almost entirely by datacenters and stay in them until EOL. Your brain is going to atrophy (this is proven), they'll raise the price to something that's closer to break-even, and you'll be forced to pay it because you no longer have those muscles. | | |
| ▲ | hparadiz 10 hours ago | parent [-] | | The architectural problems I deal with day in day out leave no room for atrophy. This is just cope. | | |
| ▲ | platevoltage 5 hours ago | parent [-] | | You're going to see major cope once that bargain $200/month plan goes away, and every person or company that has embedded these services into their workflows gets to see their actual costs. | | |
| ▲ | hparadiz 3 hours ago | parent [-] | | Have you actually tried this stuff or are you just saying stuff you hear on the internet? |
|
|
|
| |
| ▲ | nullc 11 hours ago | parent | prev [-] | | > two 4090s is not consumer grade I think that is a very narrow perspective. Enormous numbers of consumers own $50,000 cars, but a pair of $2000 GPUs is "not consumer"? I agree with your view that cheap tokens on SOTA are a trap-- people should use local AI or no AI. | | |
| ▲ | ac29 9 hours ago | parent | next [-] | | > Enormous numbers of consumers own $50,000 cars, but a pair of $2000 GPUs is "not consumer"? $50k is about the median price of a new car in the US. I'd guess >99.9% of people do not own $4000 of GPUs. I consider myself a computer person and I don't think I even own $4000 of computer hardware in total | | |
| ▲ | swiftcoder 2 hours ago | parent | next [-] | | > I consider myself a computer person and I dont think I even own $4000 of computer hardware in total A top-spec MacBook Pro is >$4k, so I assure you that plenty of computer people do own $4k of computer hardware. Hell, most tech folks are wandering around with a ~$1k smartphone in their pocket too. | |
| ▲ | janalsncm 3 hours ago | parent | prev | next [-] | | Fwiw you can finance a car over something like 7 years now. So a lot of people will be paying like $750 per month, not $50k lump sum. | |
| ▲ | zozbot234 9 hours ago | parent | prev | next [-] | | Plenty of gamers own serious GPU rigs that are reusable (at least to some extent) for local AI inference. That's almost certainly more than 0.1% of the population. | |
| ▲ | nullc 9 hours ago | parent | prev [-] | | I guess I wasn't clear-- I wasn't so much making the point that people do own $4000 in GPUs (though I suspect you are massively underestimating the number who do; before the current market conditions this would have been more like $2500 in GPUs...), but that they certainly could, per the evidence of car ownership. A car is super useful; so is an AI. But even if we decide cars are incomparably more useful, a great many people pay much more than $4000 over the minimum viable car, and that's money that could be deployed toward private, secure, and autonomous AI capability. A few thousand dollars in computing is consumer hardware, or at least could easily be, with more reason and awareness driving adoption. People spend a LOT of money on things less useful than a local copy of qwen3.6-27b can be. |
| |
| ▲ | dakolli 11 hours ago | parent | prev [-] | | I would still question how useful a local model is even with $10k in GPUs. I certainly haven't seen any great uses myself from these smaller models (<500B parameters), except claims from people who are totally enamored with AI, for whom basically any output from an LLM impresses them like a toddler entertained by the sound their velcro shoes make. | | |
| ▲ | robot-wrangler 10 hours ago | parent | next [-] | | Probably you're focused on coding agents? I bet someone could use that kind of hardware to filter snarky comments | |
| ▲ | nullc 10 hours ago | parent | prev [-] | | Here is an example-- I'm running hermes + qwen3.6-27b on a workstation GPU (an older RTX A6000 which gets 55 tok/s, though people run this model on more limited hardware). A friend and I had previously worked on an entropy extraction scheme and he recently got around to making a writeup about our work: https://wuille.net/posts/binomial-randomness-extractors/ I instructed the agent to read the URL, implement the technique in C++ for 32-bit registers, then make a SIMD version that interleaves several extractors in parallel for better performance. It implemented it (not hard since there was an implementation there that it read), then wrote more extensive tests. Then it vectorized it. It got confused a few times during debugging because the algorithm uses some number theory tricks so that overflows of intermediate products don't matter, and it was obviously trained a lot on ordinary code where such overflows are usually fatal. I instructed it to comment the code explaining why the overflows are fine and had it continue, which mostly solved its confusion. It successfully got the initial 12MB/s scalar implementation to about 48MB/s. Then I told it to keep optimizing until it reached 100MB/s. I came back the next day and it had stopped after 6 hours when it achieved just over 100MB/s. Reading what it did: it went off looking at disassembly, figured out what hardware it was running on, read microarch timing tables online, made some better decisions, tried a lot of things that didn't work, etc. (And of course, the implementation is correct.) I'm pretty skeptical about AI and borderline hateful of many people who (ab)use it and are deluded by it-- but I think this experience shows that a small local model can be objectively useful. 
(oh and this experience was also while I only had the model running at 19 tok/s) Running the model in a loop where it can get feedback from actually testing stuff allows you to make progress in spite of making many mistakes. I could have done this work myself, but I didn't have to, and I certainly spent less time checking in and prodding it than it would have taken me to do it. In my case I wondered how much faster parallel extractors using SIMD might be-- an idle curiosity that would have gone unanswered if not for the AI. | | |
| ▲ | ai_fry_ur_brain 9 hours ago | parent [-] | | This is maybe the first time I've seen someone claim to do something useful with such a small model. Congrats, but you're in the 0.0001% that's not just frying their brains, fapping to their local models, or doing various magic tricks like a toddler entertained by playing with velcro. At the end of the day you lost an opportunity to improve yourself and exercise your brain; maybe the opportunity cost is worth it, idk, but I'm going to keep taking things slow. Handmade Swiss watches > mass-manufactured imitations. Handmade clothes > Walmart clothes. | | |
| ▲ | otabdeveloper4 3 hours ago | parent | next [-] | | Sounds like you're coping for the vendor lock-in you cornered yourself into. | |
| ▲ | nullc 9 hours ago | parent | prev [-] | | This is a change that's been happening gradually over time-- I don't think I could have done this on a local model that could run on a consumer-class GPU a couple months ago. There are plenty of other uses that people have been making for a long time-- e.g. I know someone who uses a fine-tuned local model to sort their incoming email and scan their outgoing messages for accidental privacy leaks. I don't agree with your assessment of an opportunity lost-- I got my reps in on the original work, and the AI gave an incremental step forward which made the whole exercise somewhat more valuable to me at minimal additional cost. I think this improves the cost vs. benefit in a way that makes me more likely to try other pointless activities, knowing that when I run out of gas I can toss it to AI to try some variations. Sometimes you're also 27 steps deep on a nested subproblem and you're really just trying to solve something. Even in fine craftsmanship not every step needs to be about maximum craftsmanship. :) Sometimes it's just good to get something done. I think this is much like any other tool. One can carve furniture using only hand tools, but the benefits of a router are hard to dispute. Both approaches exist in the world and sometimes both are used in concert. As far as people frying their brains with AI -- you don't need local models for that; plenty of people are driving themselves into deep, personally and socially destructive delusion just using the chat interfaces. | | |
| ▲ | ai_fry_ur_brain 9 hours ago | parent [-] | | I do think post-training smaller open-source models for very narrow tasks is largely overlooked, and there'll be lots of value there if one puts in the effort. However, in a lot of cases we're just completing a circle back to deterministic behavior at 1000x the memory/compute requirements just to avoid writing regex. I agree with you, there's a way to use them responsibly, like your router analogy; I just think most aren't doing this correctly and it's a slippery slope. I'll contend that you probably have used them responsibly in your example. | | |
|
|
|
|
|
|