| ▲ | mitjam 7 hours ago |
| As a heavy user of OpenAI, Anthropic, and Google AI APIs, I’m increasingly tempted to buy a Mac Studio (M3 Ultra or M4 Pro) as a contingency in case the economics of hosted inference change significantly. |
|
| ▲ | utopiah 4 hours ago | parent | next [-] |
| Don't buy anything physical yet: benchmark the models you could run on your prospective hardware on a (neo)cloud provider like HuggingFace first. Only if you believe the quality is up to your expectations should you buy. The test itself should cost you about $100 and a few hours at most.
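| A minimal sketch of that kind of pre-purchase test, using the huggingface_hub client against a hosted copy of a model you might run locally; the model ID and prompt are placeholders, and the point is to judge output quality and rough speed before spending on hardware, not to reproduce Mac Studio throughput exactly.
|
|     import time
|     from huggingface_hub import InferenceClient  # pip install huggingface_hub
|
|     # Placeholder model ID: substitute whatever you are considering running locally.
|     client = InferenceClient(model="meta-llama/Llama-3.1-70B-Instruct")
|
|     prompt = "Summarize the trade-offs of running LLM inference at home."
|     start = time.time()
|     out = client.text_generation(prompt, max_new_tokens=512)
|     elapsed = time.time() - start
|
|     # Crude speed estimate (word count as a token proxy); judge quality by reading the output.
|     print(out)
|     print(f"{len(out.split()) / elapsed:.1f} words/sec, {elapsed:.1f}s total")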
|
| ▲ | boredatoms 7 hours ago | parent | prev | next [-] |
| If there's a market crash, there could be a load of cheap H100s hitting eBay.
| |
| ▲ | wmf 6 hours ago | parent [-] |
| You can't run those at home.
|
| ▲ | SoKamil 6 hours ago | parent [-] |
| Why?
|
| ▲ | wmf 6 hours ago | parent [-] |
| Because they are extremely loud, consume 8-10 kW, and probably cost $20K used.
|
| ▲ | kccqzy 5 hours ago | parent [-] |
| The 8-10 kW isn't a big deal anymore given the prevalence of electric vehicles and home charging. A decade ago very few homes had this kind of hookup. Now it's reasonably common, and if not, electricians wouldn't bat an eye at installing one.
|
| ▲ | MandieD 4 hours ago | parent | next [-] |
| In the winter in northern Europe or the colder parts of North America, as part of a radiator system? Kind of works! Any other time and place? The power to run it, plus the power to cool it.
|
| ▲ | alt227 5 hours ago | parent | prev [-] |
| But the cost of running them is.
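| For a rough sense of what "the cost of running them" means, a back-of-the-envelope sketch: the 10 kW figure is from the comment above, while the $0.30/kWh rate and 8 hours/day of use are assumptions, not numbers anyone in the thread gave.
|
|     # Rough running cost for an 8-10 kW GPU box (tariff and duty cycle are assumed).
|     power_kw = 10          # from the comment above
|     price_per_kwh = 0.30   # assumption: adjust to your local electricity tariff
|     hours_per_day = 8      # assumption: part-time use, not 24/7
|     daily = power_kw * price_per_kwh * hours_per_day
|     print(f"~${daily:.0f}/day, ~${daily * 365:,.0f}/year")  # ~$24/day, ~$8,760/year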
|
| ▲ | pram 7 hours ago | parent | prev | next [-] |
| FWIW the M5 appears to be an actual large leap for LLM inference with the new GPU and Neural Accelerator. So I'd wait for the Pro/Max before jumping on an M3 Ultra.
| |
|
| ▲ | mifreewil 6 hours ago | parent | prev | next [-] |
| You'd want to get something like an RTX Pro 6000 (~$8,500-$10,000) or at least an RTX 5090 (~$3,000); that's the easiest thing to do. Alternatively, a cluster of some lower-end GPUs, or a DGX Spark (~$3,000); there are some better options from manufacturers other than Nvidia.
| |
| ▲ | mitjam 6 hours ago | parent [-] |
| Yes, I also considered the RTX 6000 Pro Max-Q, but it's quite expensive and probably only makes sense if I can use it for other workloads as well. Interestingly, its price hasn't gone up since last summer here in Germany.
| ▲ | storus 6 hours ago | parent [-] |
| I have a Mac Studio with 512GB RAM, 2x DGX Spark, and an RTX 6000 Pro WS (planning to buy a few more in the Max-Q version next). I'm wondering if we'll ever see local inference as "cheap" as it is right now, given RAM/SSD price trends.
| ▲ | clusterhacks 5 hours ago | parent [-] |
| Good grief. I'm here cautiously telling my workplace to buy a couple of DGX Sparks for dev/prototyping and you have better hardware in hand than my entire org.
| What kind of experiments are you doing? Did you try out exo with a DGX doing prefill and the Mac doing decode? I'm also totally interested in hearing what you have learned working with all this gear. Did you buy all this stuff out of pocket to work with?
| ▲ | storus 3 hours ago | parent [-] |
| Yeah, exo was one of the first things I tried. The Mac Studio has decent throughput, roughly at the level of a 3080, which is great for token generation, and the Sparks have decent compute, either for prefill or for running non-LLM models that need compute (Segment Anything, Stable Diffusion, etc.). The RTX 6000 Pro just crushes them all (it's essentially like having 4x 3090 in a single GPU).
| I bought 2 Sparks to also play with Nvidia's networking stack and learn their ecosystem, though they are a bit of a mixed bag as they don't expose some Blackwell-specific features that make a difference.
| I bought it all to be able to run local agents (I write AI agents for a living) and develop my own ideas fully. Also, I was wrapping up grad studies at Stanford, so they came in handy for some projects there. I bought it all out of pocket but can amortize it in taxes.
| ▲ | clusterhacks an hour ago | parent [-] |
| Very cool - thanks for the info. That you are writing AI agents for a living is fascinating to hear. We aren't even really looking at how to use agents internally yet. I think local agents are incredibly off the radar at my org, despite some really good candidates as supplemental resources for internal apps.
| What's deployment look like for your agents? You're clearly exploring a lot of different approaches...
|
| ▲ | mohsen1 7 hours ago | parent | prev | next [-] |
| The thing is, GLM 4.7 easily does the work Opus was doing for me, but to run it fully you'd need much bigger hardware than a Mac Studio. $10k buys you a lot of API calls from z.ai or Anthropic. It's just not economically viable to run a good model at home.
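| To put "$10k buys you a lot of API calls" into rough numbers, here is a hedged sketch; the per-token prices and the input/output mix below are illustrative assumptions, not quotes from z.ai or Anthropic.
|
|     # How many output tokens a $10k hardware budget could buy as API calls instead.
|     budget_usd = 10_000
|     price_in_per_m = 3.0    # assumption: $ per 1M input tokens
|     price_out_per_m = 15.0  # assumption: $ per 1M output tokens
|     input_per_output = 5    # assumption: coding-agent-like ratio of input to output tokens
|     cost_per_m_out = price_out_per_m + input_per_output * price_in_per_m
|     print(f"~{budget_usd / cost_per_m_out:,.0f}M output tokens")  # ~333M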
| |
| ▲ | zozbot234 6 hours ago | parent | next [-] |
| You can cluster Mac Studios using Thunderbolt connections and enable RDMA for distributed inference. This will be slower than a single node but is still the best bang for the buck for doing inference on very large models.
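| As a minimal sketch of what a two-Mac setup looks like in code, the snippet below uses MLX's distributed module (which can run over a Thunderbolt ring) to check that both nodes see each other; it assumes the script is started on each host with MLX's distributed launcher, and it is only a connectivity check, not the RDMA configuration or a full inference pipeline.
|
|     import mlx.core as mx
|
|     # Each node contributes its rank; all_sum should return the same total on every node,
|     # which confirms the link between the Mac Studios is actually carrying traffic.
|     group = mx.distributed.init()   # backend (ring/MPI) is chosen by how the job was launched
|     x = mx.array([float(group.rank())])
|     total = mx.distributed.all_sum(x)
|     mx.eval(total)
|     print(f"rank {group.rank()} of {group.size()}: sum of ranks = {total.item()}")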
| ▲ | mitjam 6 hours ago | parent | prev [-] |
| True: I think local inference is still far more expensive for my use case due to batching effects and my relatively sporadic, hourly usage. That said, I also didn't expect hardware prices (RTX 5090, RAM) to rise this quickly.
|
|
| ▲ | storus 6 hours ago | parent | prev | next [-] |
| An M3 Ultra plus a DGX Spark is, right now, roughly what an M5 Ultra will be at some unknown future date. You can just buy those two, connect them together using exo, and have M5 Ultra performance/memory right away. And who knows what an M5 Ultra will cost given the RAM/SSD price explosion?
|
| ▲ | PlatoIsADisease 6 hours ago | parent | prev [-] |
| There is a reason no one uses Apple for local models. Be careful not to fall for marketing and fanboyism. Just look at what people are actually using. Don't rely on a few people who tested a few short prompts with short completions. |
| |
| ▲ | mitjam 6 hours ago | parent [-] |
| Yes, I'm using smaller models on a Mac M2 Ultra 32GB and they work well, but larger models and coding use might not be a good fit for the architecture after all.
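| For reference, running a smaller model on Apple silicon is only a few lines with mlx-lm; the model ID below is just an example from the mlx-community hub, and the snippet assumes `pip install mlx-lm` on an Apple-silicon Mac.
|
|     from mlx_lm import load, generate
|
|     # Example 4-bit model; pick one that fits your machine's unified memory.
|     model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
|     reply = generate(model, tokenizer,
|                      prompt="Write a Python function that reverses a string.",
|                      max_tokens=256, verbose=True)  # verbose prints tokens/sec
|     print(reply)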
|