| ▲ | kristopolous 3 hours ago |
We need a new word: not "local model" but "my own computer's model", CapEx-based. The distinction matters because some "we support local models" tools mean things like ollama orchestration, or use the llama.cpp libraries to connect to models on the same physical machine. That's not my definition of local. Mine is "local network", so call it the "LAN model" until we come up with something better. "Self-host" exists, but that usually means "open-weights" rather than saying anything about clamping the performance of the machine running the model.

It should be defined as roughly sub-$10k, using Steve Jobs' megapenny unit: essentially, classify machines by how many megapennies of spend it takes to run a model without OOMing. That's what I mean when I say local: running inference for "free" on hardware I control that costs at most single-digit thousands of dollars, and that, if I were feeling fancy, could potentially fine-tune on the scale of days. A modern 5090 build-out with a Threadripper, NVMe, and 256GB RAM will run you about $10k +/- $1k. The MLX route is about $6,000 out the door after tax (M3 Ultra, 60-core, with 256GB).

Lastly, it's not just "number of parameters". Not all 32B Q4_K_M models load at the same rate or use the same amount of memory. The internal architecture matters, and active parameter count + quantization is becoming a poorer approximation given the SOTA innovations. What might be needed is a standardized eval benchmark run against standardized hardware classes, with basic real-world tasks like tool calling, code generation, and document processing. There are plenty of "good enough" models out there for a large category of everyday tasks; now I want to find out what runs best.

Take a Gen 6 ThinkPad P14s/MacBook Pro and a 5090/Mac Studio, run the benchmark, and then we can report something like "time-to-first-token / tokens-per-second / memory-used / total-time-of-test" and rate that independently from how accurate the model was.
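A minimal sketch of what one probe of that benchmark could measure, assuming a streaming generator interface (`generate_fn` here is a hypothetical stand-in — adapt it to whatever llama.cpp, MLX, or an OpenAI-compatible local server exposes):

```python
import time

def run_probe(generate_fn, prompt):
    """Time one generation and report the proposed standardized metrics.

    generate_fn is assumed to be a streaming generator yielding one token
    per iteration (hypothetical interface, not a real library API).
    """
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in generate_fn(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start  # time-to-first-token
        n_tokens += 1
    total = time.perf_counter() - start
    return {
        "ttft_s": ttft,                      # time-to-first-token
        "tokens_per_s": n_tokens / total,    # overall throughput
        "total_s": total,                    # total-time-of-test
        "tokens": n_tokens,
    }
```

Memory used would come from the runtime's own reporting (or RSS sampling), and accuracy would be scored separately, per the comment's suggestion to keep speed and quality independent.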
| ▲ | zozbot234 2 hours ago | parent | next [-] |
You can run plenty of models on a $10K machine, or even a lot less than that; it all depends on how long you're willing to wait for results. Streaming weights from SSD storage using mmap() is already a reality when running the largest and sparsest models. You can save even more on memory by limiting KV caching at the cost of extra compute, and there may be ways to push RAM savings even higher simply by tweaking the extent to which model activations are recomputed as needed.
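The mmap() point can be illustrated in a few lines: map the checkpoint and touch only a slice, and the OS pages in just those bytes rather than loading the whole file into RAM up front. This is a toy sketch of the principle (a throwaway zero-filled file stands in for a real checkpoint); runtimes like llama.cpp apply the same idea to their on-disk tensor formats:

```python
import mmap
import os
import tempfile

def read_weight_slice(path, offset, length):
    """Map the whole file but physically read only the requested slice.

    The OS faults in pages lazily, so untouched regions never occupy RAM.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]

# Demo against a 1 MiB throwaway file standing in for a model checkpoint.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 1024 * 1024)
    path = tmp.name

chunk = read_weight_slice(path, offset=4096, length=16)
os.unlink(path)
```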
| ▲ | openclawai an hour ago | parent | prev | next [-] |
For context on what cloud API costs look like when running coding agents: with Claude Sonnet at $3/$15 per 1M tokens, a typical agent loop with ~2K input tokens and ~500 output per call, 5 LLM calls per task, and 20% retry overhead (common with tool use), you're looking at roughly $0.05-0.10 per agent task. At 1K tasks/day that's ~$1.5K-3K/month in API spend.

The retry overhead is where the real costs hide. Most cost comparisons assume perfect execution, but tool-calling agents fail parsing, need validation retries, etc. I've seen retry rates push effective costs 40-60% above baseline projections.

Local models trading 50x slower inference for $0 marginal cost start looking very attractive for high-volume, latency-tolerant workloads.
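A back-of-envelope check of those figures (all inputs are the assumptions stated above, not measured data):

```python
# Sonnet-class pricing from the comment: $3 / $15 per 1M tokens.
INPUT_PRICE = 3 / 1_000_000    # $ per input token
OUTPUT_PRICE = 15 / 1_000_000  # $ per output token

def cost_per_task(in_tok=2_000, out_tok=500, calls=5, retry_overhead=0.20):
    """Cost of one agent task: per-call token cost, times calls per task,
    inflated by the retry overhead."""
    per_call = in_tok * INPUT_PRICE + out_tok * OUTPUT_PRICE
    return per_call * calls * (1 + retry_overhead)

task = cost_per_task()       # ~$0.081 per task, inside the $0.05-0.10 range
monthly = task * 1_000 * 30  # 1K tasks/day over 30 days -> ~$2.4K/month
```

With the 40-60% retry rates mentioned further down, `retry_overhead=0.5` pushes the per-task cost to about $0.10, the top of the quoted range.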
| ▲ | echelon 2 hours ago | parent | prev | next [-] |
I don't even need "open weights" running on hardware I own. I am fine renting an H100 (or whatever), as long as I theoretically have access to and own everything running. I do not want my career to become dependent upon Anthropic.

Honestly, the best thing for "open" might be for us to build open pipes, services, and models where we can rent cloud. Large models will outpace small models: LLMs, video models, "world" models, etc. I'd even be fine time-sharing a running instance of a large model in a large cloud, as long as all the constituent pieces are open and I could (in theory) distill it, run it myself, spin up my own copy, etc.

I do not deny that big models are superior. But I worry about the power the large hyperscalers are getting while we focus on small "open" models that really can't match the big ones. We should focus on competing with large models, not artisanal homebrew stuff that is irrelevant.
| ▲ | christkv 2 hours ago | parent | prev | next [-] |
I won't need a heater with that running in my room.
| ▲ | bigyabai 3 hours ago | parent | prev [-] |
OOM is a pretty terrible benchmark too, though. You can build a DDR4 machine that "technically" loads 256GB models for maybe $1,000 used, but then you've got to account for the compute aspect, and that's constrained by a number of different variables. A super-sparse model might run great on that DDR4 machine, whereas a dense 32B model would cause it to chug. There's just not a good way to visualize the compute needed, with all the nuance that exists.

I think that trying to create these abstractions is what leads to people impulse-buying resource-constrained hardware and getting frustrated. The autoscalers have a huge advantage in this field that homelabbers will never be able to match.
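The sparse-vs-dense gap can be made concrete with a rough memory-bandwidth ceiling: token-by-token decode must stream the active weights from RAM once per generated token, so an upper bound on speed is bandwidth divided by bytes read per token. A sketch with illustrative numbers (the 50 GB/s dual-channel DDR4 figure is a ballpark assumption, and the bound ignores KV-cache traffic and compute limits, so real throughput is lower):

```python
def decode_ceiling_tok_s(active_params, bits_per_weight, bandwidth_gb_s):
    """Rough upper bound on decode speed for a bandwidth-bound machine:
    each generated token streams the active weights from memory once."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

DDR4_BW = 50  # GB/s, ballpark for a dual-channel DDR4 desktop (assumption)

dense_32b = decode_ceiling_tok_s(32e9, 4, DDR4_BW)  # ~3 tok/s: it chugs
sparse_moe = decode_ceiling_tok_s(3e9, 4, DDR4_BW)  # ~33 tok/s: usable
```

Both models could fit in 256GB of RAM and pass an "OOM benchmark", yet differ by an order of magnitude in throughput — which is the comment's point about needing more than a memory-fit abstraction.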