| ▲ | reilly3000 2 days ago |
| dang I wish I could share md tables. Here’s a text edition:
For $50k the inference hardware market forces a trade-off between capacity and throughput:

* Apple M3 Ultra Cluster ($50k): Maximizes capacity (3TB). It is the only option in this price class capable of running 3T+ parameter models (e.g., Kimi K2), albeit at low speeds (~15 t/s).

* NVIDIA RTX 6000 Workstation ($50k): Maximizes throughput (>80 t/s). It is superior for training and inference but is hard-capped at 384GB VRAM, restricting model size to <400B parameters.

To achieve both high capacity (3TB) and high throughput (>100 t/s) requires a ~$270,000 NVIDIA GH200 cluster and data center infrastructure. The Apple cluster provides 87% of that capacity for 18% of the cost. |
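A quick back-of-envelope check of those ratios, as a sketch in Python. All inputs are the figures quoted in the comment above; the GH200 cluster capacity is inferred from the "87% of that capacity" claim rather than stated directly.

    # Sanity-check of the quoted cost/capacity ratios (figures from the comment above).
    apple_cost, apple_tb = 50_000, 3.0
    gh200_cost = 270_000
    gh200_tb = apple_tb / 0.87                       # ~3.4 TB implied by the 87% figure

    print(f"cost ratio:       {apple_cost / gh200_cost:.0%}")   # ~19%, i.e. the quoted ~18%
    print(f"capacity ratio:   {apple_tb / gh200_tb:.0%}")       # 87% by construction
    print(f"$ per TB (Apple): {apple_cost / apple_tb:,.0f}")    # ~$16,700/TB
    print(f"$ per TB (GH200): {gh200_cost / gh200_tb:,.0f}")    # ~$78,300/TB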
|
| ▲ | mechagodzilla 2 days ago | parent | next [-] |
| You can keep scaling down! I spent $2k on an old dual-socket Xeon workstation with 768GB of RAM - I can run DeepSeek-R1 at ~1-2 tokens/sec. |
| |
| ▲ | Weryj 2 days ago | parent | next [-] | | Just keep going! 2TB of swap disk for 0.0000001 t/sec | | |
| ▲ | kergonath 2 days ago | parent [-] | | Hang on, starting benchmarks on my Raspberry Pi. | | |
| ▲ | euroderf 2 days ago | parent | next [-] | | By the year 2035, toasters will run LLMs. | |
| ▲ | pickle-wizard a day ago | parent | prev [-] | | On a lark a friend set up Ollama on an 8GB Raspberry Pi with one of the smaller models. It worked but it was very slow. IIRC it did 1 token/second. |
|
| |
| ▲ | jacquesm 2 days ago | parent | prev | next [-] | | I did the same, then put in 14 3090's. It's a little bit power-hungry but fairly impressive performance-wise. The hardest parts are power distribution and riser cards, but I found good solutions for both. | | |
| ▲ | r0b05 2 days ago | parent | next [-] | | I think 14 3090's are more than a little power hungry! | | |
| ▲ | jacquesm 2 days ago | parent [-] | | to the point that I had to pull an extra circuit... but it's three-phase so I'm good to go even if I'd like to go bigger. I've limited power consumption to what I consider the optimum: each card will draw ~275 watts (you can very nicely configure this on a per-card basis). The server itself also uses some for the motherboard. The whole rig is powered from four 1600W supplies; the GPUs are divided 5/5/4 and the motherboard is connected to its own supply. It's a bit close to the edge for the supplies that have five 3090's on them, but so far it has held up quite well, even with higher ambient temps.

Interesting tidbit: at 4 lanes/card throughput is barely impacted; 1 or 2 is definitely too low. 8 would be great but the CPUs don't have that many lanes. I also have a Threadripper which should be able to handle that much RAM, but at current RAM prices that's not interesting (that server I could populate with RAM that I still had that fit that board, plus some more I bought from a refurbisher). | | |
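For anyone wanting to replicate the per-card power cap described above, here is a minimal sketch of setting it programmatically via NVML (pip install nvidia-ml-py); this is the same knob nvidia-smi -pl exposes, the 275 W target mirrors the comment, and applying limits normally requires root.

    # Cap every visible NVIDIA GPU at ~275 W via NVML (sketch; needs root privileges).
    from pynvml import (
        nvmlInit, nvmlShutdown, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
        nvmlDeviceGetPowerManagementLimitConstraints, nvmlDeviceSetPowerManagementLimit,
    )

    TARGET_MILLIWATTS = 275_000  # ~275 W per card, as described in the comment above

    nvmlInit()
    try:
        for i in range(nvmlDeviceGetCount()):
            handle = nvmlDeviceGetHandleByIndex(i)
            lo, hi = nvmlDeviceGetPowerManagementLimitConstraints(handle)
            # Clamp to what the card actually allows before applying the limit.
            nvmlDeviceSetPowerManagementLimit(handle, max(lo, min(hi, TARGET_MILLIWATTS)))
    finally:
        nvmlShutdown()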
| ▲ | nonplus a day ago | parent [-] | | What PCIe version are you running? Normally I would not mention one of these, but you have already invested in all the cards, and it could free up some space if any of the lanes being used now are 3.0. If you can afford the 16 (PCIe 3) lanes, you could get a PLX switch ("PCIe Gen3 PLX Packet switch X16 - x8x8x8x8" on eBay for like $300) and get 4 of your cards up to x8. | | |
| ▲ | jacquesm a day ago | parent [-] | | All are PCIe 3.0. I wasn't aware of those switches at all, in spite of buying my risers and cables from that source! Unfortunately all of the slots on the board are x8, there are no x16 slots at all. So that switch would probably work, but I wonder how big the benefit would be: you will probably see effectively an x4 -> (x4 / x8) -> (x8 / x8) -> (x8 / x8) -> (x8 / x4) -> x4 pipeline, and then on to the next set of four boards. It might run faster on account of the three passes that are double the speed they are right now, as long as the CPU does not need to talk to those cards and all transfers are between layers on adjacent cards (very likely), and with even more luck (due to timing and lack of overlap) it might run the two x4 passes at approaching x8 speeds as well. And then of course you need to do this a couple of times because four cards isn't enough, so you'd need four of those switches. I have not tried having a single card with fewer lanes in the pipeline, but that should be an easy test to see what the effect of such a constriction on throughput would be.

But now you have me wondering to what extent I could bundle 2 x8 into an x16 slot and then use four of these cards inserted into a fifth! That would be an absolutely unholy assembly, but it has the advantage that you would need far fewer risers: just one x16 to x8/x8 run in reverse (which I have no idea is even possible, but I see no reason right away why it would not work unless there are more driver chips between the slots and the CPUs, which may be the case for some of the farthest slots). PCIe is quite amazing in terms of the topology tricks that you can pull off with it, and c-payne's stuff is extremely high quality. | | |
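For a rough sense of what each link width in that pipeline is worth, here is a small PCIe 3.0 bandwidth sketch (8 GT/s per lane with 128b/130b encoding; protocol overhead is ignored, so real transfers land a bit lower).

    # Ballpark per-direction bandwidth for PCIe 3.0 links of different widths.
    GBPS_PER_LANE = 8 * 128 / 130 / 8     # ~0.985 GB/s per lane, per direction

    for lanes in (1, 2, 4, 8, 16):
        print(f"x{lanes:<2} ~ {lanes * GBPS_PER_LANE:.1f} GB/s")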
| ▲ | nonplus a day ago | parent [-] | | If you end up trying it, please share your findings! I've basically been putting this kind of gear in my cart, then deciding I don't want to manage more than the 2 3090s, 4090 and A5000 I have now, and then I take the PLX out of my cart. Seeing as you have the cards already, it could be a good fit! | | |
| ▲ | jacquesm a day ago | parent [-] | | Yes, it could be. Unfortunately I'm a bit distracted by both paid work and some more urgent stuff but eventually I will get back to it. By then this whole rig might be hopelessly outdated but we've done some fun experiments with it and have kept our confidential data in-house which was the thing that mattered to me. | | |
| ▲ | r0b05 a day ago | parent [-] | | Yes, the privacy is amazing, and there's no rate limiting so you can be as productive as you want. There are also tons of learnings in this exercise. I have just 2x 3090's and I've learnt so much about PCIe and hardware that it makes the creative process that much more fun. The next iteration of these tools will likely be more efficient, so we should be able to run larger models at a lower cost. For now though, we'll run nvidia-smi and keep an eye on those power figures :) | | |
| ▲ | jacquesm 21 hours ago | parent [-] | | You can tune that power down to what gives you the best token count per joule, which I think is a very important metric by which to optimize these systems and by which you can compare them as well. I have a hard time understanding all of these companies that toss their NDAs and client confidentiality to the wind and feed newfangled AI companies their corporate secrets with abandon. You'd think there would be a more prudent approach to this. |
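A trivial sketch of that tokens-per-joule metric; the throughput and wattage numbers below are made-up placeholders for illustration, not measurements from this rig.

    # Tokens per joule as a way to compare power-limit settings.
    def tokens_per_joule(tokens_per_second: float, total_watts: float) -> float:
        # 1 watt == 1 joule/second, so t/s divided by W gives tokens per joule.
        return tokens_per_second / total_watts

    # e.g. 14 cards at two different power caps, plus ~300 W for the host (placeholders):
    print(tokens_per_joule(tokens_per_second=22, total_watts=14 * 350 + 300))  # near stock limits
    print(tokens_per_joule(tokens_per_second=20, total_watts=14 * 275 + 300))  # capped at 275 W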
|
|
|
|
|
|
| |
| ▲ | tucnak a day ago | parent | prev [-] | | You get occasional accounts of 3090 home-superscalers where they put up eight, ten, fourteen cards. I normally attribute this to obsessive-compulsive behaviour. What kind of motherboard did you end up using, and what's the bi-directional bandwidth you're seeing? Something tells me you're not using EPYC 9005's with up to 256x PCIe 5.0 lanes per socket or something... Also: I find it hard to believe the "performance" claims when your rig is pulling 3 kW from the wall (assuming undervolting at 200W per card?). The electricity costs alone would surely make this intractable, i.e. the same as running six washing machines all at once. | |
| ▲ | jacquesm a day ago | parent [-] | | I love your skepticism of what I consider to be a fairly normal project; this is not to brag, simply to document. And I'm way above 3 kW, more likely 5000 to 5500 watts with the GPUs running as high as I'll let them, or thereabouts, but I only have one power meter and it maxes out at 2500 watts or so. This is using two Xeons in a very high-end but slightly older motherboard. When it runs, the space that it is in becomes hot enough that even in the winter I have to use forced air from outside, otherwise it will die.

As for electricity costs, I have 50 solar panels and on a good day they more than offset the electricity use; at 2 pm (solar noon here) I'd still be pushing 8 kW extra back into the grid. This obviously does not work out so favorably in the winter.

Building a system like this isn't very hard, it is just a lot of money for a private individual, but I can afford it. I think this build is a bit under $10K, so a fraction of what you'd pay for a commercial solution, but obviously far less polished and still less performant. But it is a lot of bang for the buck and I'd much rather have this rig at $10K than the first commercial solution available at a multiple of this. I wrote a bit about power efficiency in the run-up to this build when I only had two GPUs to play with: https://jacquesmattheij.com/llama-energy-efficiency/

My main issue with the system is that it is physically fragile; I can't transport it at all, you basically have to take it apart, move the parts and re-assemble it on the other side. It's just too heavy, and the power distribution is messy, so you end up with a lot of loose wires and power supplies. I could make a complete enclosure for everything, but this machine is not running permanently, and when I need the space for other things I just take it apart and store the GPUs in their original boxes until the next at-home AI project. Putting it all together is about 2 hours of work. We call it Frankie, on account of how it looks.

edit: one more note, the noise it makes is absolutely incredible and I would not recommend running something like this in your house unless you are (1) crazy or (2) have a separate garage where you can install it. | |
| ▲ | tucnak 12 hours ago | parent [-] | | Thanks for replying, and your power story does make more sense all things considered. I'm no stranger to homelabbing; in fact right now I'm running both an IBM POWER9 system (really power-hungry) and an AMD EPYC 8004 system, both watercooled now while trying to bring the noise down. The whole rack, along with 100G switches and NIC/FPGAs, is certainly keeping us warm in the winter! And it's only dissipating up to 1.6 kW (mostly thanks to the ridiculous efficiency of the 8434PN CPU, which is something like 48 cores at 150W). I cannot imagine dissipating 5 kW at home! | |
| ▲ | jacquesm 3 hours ago | parent [-] | | I stick the system in my garage when it is working... I very enthusiastically put it together on the first iteration (with only 8 GPUs) in the living room while the rest of the family was holidaying, but that very quickly turned out to be a mistake. It has a whole pile of high-speed fans mounted in the front and the noise was roughly comparable to sitting in a jet about to take off. One problem that move caused was that I didn't have a link to the home network in the garage, and the files that go to and from that box are pretty large, so in the end I strung a UTP cable through a crazy path of little holes everywhere until it reached the switch in the hallway cupboard. The devil is always in the details... Running a POWER9 in the house is worthy of a blog post :)

As for Frankie: I fear his days are numbered. I've already been eyeing more powerful solutions, and for the next batch of AI work (most likely large-scale video processing and model training) we will probably put something better together, otherwise it will simply take too long. I almost bought a fully populated second-hand NVIDIA AI workstation, but the seller was more than a little bit shady and kept changing the story about how they got it and what they wanted for it. In the end I abandoned that because I didn't feel like being used as a fence for what was looking more and more like stolen property. But buying something like that new is out of the ballpark for me; at 20 to 30% of list I might do it, assuming the warranty transfers, and that's not a complete fantasy: there are enough research projects that have this kind of gear and sell it off when the project ends.

People joke I don't have a house but a series of connected workshops, and that's not that far off the mark :) |
|
|
|
| |
| ▲ | ternus 2 days ago | parent | prev | next [-] | | And if you get bored of that, you can flip the RAM for more than you spent on the whole system! | |
| ▲ | a012 2 days ago | parent | prev | next [-] | | And heat the whole house in parallel | |
| ▲ | rpastuszak 2 days ago | parent | prev | next [-] | | Nice! What do you use it for? | | |
| ▲ | mechagodzilla 2 days ago | parent [-] | | 1-2 tokens/sec is perfectly fine for 'asynchronous' queries, and the open-weight models are pretty close to frontier-quality (maybe a few months behind?). I frequently use it for a variety of research topics, doing feasibility studies for wacky ideas, some prototypy coding tasks. I usually give it a prompt and come back half an hour later to see the results (although the thinking traces are sufficiently entertaining that sometimes it's fun to just read as it comes out). Being able to see the full thinking traces (and pause and alter/correct them if needed) is one of my favorite aspects of being able to run these models locally. The thinking traces are frequently just as or more useful than the final outputs. |
| |
| ▲ | fatata123 2 days ago | parent | prev [-] | | [dead] |
|
|
| ▲ | icedchai 2 days ago | parent | prev | next [-] |
For $50K, you could buy 25 Framework desktop motherboards (128GB VRAM each w/Strix Halo, so over 3TB total). Not sure how you'll cluster all of them, but it might be fun to try. ;) |
| |
| ▲ | sspiff 2 days ago | parent | next [-] | | There is no way to achieve a high-throughput, low-latency connection between 25 Strix Halo systems. After accounting for storage and networking, there are barely any PCIe lanes left to link two of them together. You might be able to use USB4, but I'm unsure how the latency is for that. | |
| ▲ | 0manrho 2 days ago | parent | next [-] | | In general I agree with you, the IO options exposed by Strix Halo are pretty limited, but if we're getting technical you can tunnel PCIe over USB4v2 by the spec in a way that's functionally similar to Thunderbolt 5. That gives you essentially 3 sets of native PCIe4x4 from the chipset and an additional 2 sets tunnelled over USB4v2. TB5 and USB4 controllers are not made equal, so in practice YMMV. Regardless of USB4v2 or TB5, you'll take a minor latency hit. Strix Halo IO topology: https://www.techpowerup.com/cpu-specs/ryzen-ai-max-395.c3994

Framework's mainboard implements 2 of those PCIe4x4 GPP interfaces as M.2 PHYs, which you can use with a passive adapter to connect a standard PCIe AIC (like a NIC or DPU). It also, interestingly, exposes that 3rd x4 GPP as a standard x4-length PCIe CEM slot, though the system/case isn't compatible with actually installing a standard PCIe add-in card in there without getting hacky with it, especially as it's not an open-ended slot.

You absolutely could slap 1x SSD in there for local storage and then attach up to 4x RDMA-supporting NICs to a RoCE-enabled switch (or InfiniBand if you're feeling special) to build out a Strix Halo cluster (and you could do similar with Mac Studios, to be fair). You could get really extra by using a DPU/SmartNIC that allows you to boot from an NVMe-oF SAN to leverage all 5 sets of PCIe4x4 for connectivity without any local storage, but that hits a complexity/cost threshold I doubt most people want to cross. Or, if they are willing to cross that threshold, they'd also be looking at other solutions better suited to it that don't require as many workarounds.

Apple's solution is better for a small cluster, both in pure connectivity terms and also with respect to its memory advantages, but Strix Halo is doable. However, in both cases, scaling up beyond 3 or especially 4 nodes you rapidly enter complexity and cost territory that is better served by nodes that are less restrictive, unless you have some very niche reason to use either Macs (especially non-Pro) or Strix Halo specifically. |
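For scale, rough per-link numbers for the interconnect options mentioned above (a ballpark sketch; encoding is factored in but protocol and tunnelling overhead are not, so real figures will be lower).

    # Approximate per-direction link bandwidth for a Strix Halo node's IO options.
    pcie4_lane = 16 * 128 / 130 / 8            # ~1.97 GB/s per PCIe 4.0 lane
    print(f"PCIe 4.0 x4 (per M.2/NIC slot): {4 * pcie4_lane:.1f} GB/s")
    print(f"USB4 v2 (80 Gbps symmetric):    {80 / 8:.1f} GB/s raw")
    print(f"Thunderbolt 5 (120 Gbps boost): {120 / 8:.1f} GB/s peak, one direction")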
| ▲ | bee_rider 2 days ago | parent | prev | next [-] | | Do they need fast storage, in this application? Their OS could be on some old SATA drive or whatever. The whole goal is to get them on a fast network together; the models could be stored on some network filesystem as well, right? | | |
| ▲ | pests 2 days ago | parent [-] | | It's more than just the model weights. During inference there would be a lot of cross-talk as each node broadcasts its results and gathers up what it needs from the others for the next step. |
| |
| ▲ | icedchai 2 days ago | parent | prev [-] | | I figured, but it's good to have confirmation. |
| |
| ▲ | 3abiton 2 days ago | parent | prev [-] | | You could use llama.cpp's RPC mode over a "network" via a USB4/Thunderbolt connection. |
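For the curious, a rough sketch of what llama.cpp's RPC mode looks like in practice, driven from Python; it assumes llama.cpp was built with the RPC backend enabled (GGML_RPC=ON), and the hostnames, port, and model path below are placeholders, not values from this thread.

    # Sketch: run a model across several boxes with llama.cpp's RPC backend.
    import subprocess

    WORKERS = ["10.0.0.2", "10.0.0.3"]   # e.g. Strix Halo nodes reachable over the USB4/TB link
    PORT = 50052

    # On each worker you would first start the RPC server (run there, not here),
    # with something like: rpc-server --host 0.0.0.0 --port 50052

    # On the head node, point llama-cli at the workers so layers get offloaded
    # across the cluster instead of fitting into a single machine's memory.
    rpc_arg = ",".join(f"{host}:{PORT}" for host in WORKERS)
    subprocess.run([
        "llama-cli",
        "-m", "models/big-model.gguf",   # placeholder model path
        "--rpc", rpc_arg,
        "-ngl", "99",                    # offload as many layers as possible
        "-p", "Hello from the cluster",
    ])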
|
|
| ▲ | 3abiton 2 days ago | parent | prev | next [-] |
What's the math on the $50k NVIDIA cluster? My understanding is these things cost ~$8k each and you can get at least 5 for $40k; that's around half a TB. That being said, for inference Macs still remain the best, and the M5 Ultra will be even better value with its better PP. |
| |
| ▲ | reilly3000 2 days ago | parent [-] | |
• GPUs: 4x NVIDIA RTX 6000 Blackwell (96GB VRAM each). Cost: 4 × $9,000 = $36,000
• CPU: AMD Ryzen Threadripper PRO 7995WX (96-core). Cost: $10,000
• Motherboard: WRX90 chipset (supports 7x PCIe Gen5 slots). Cost: $1,200
• RAM: 512GB DDR5 ECC Registered. Cost: $2,000
• Chassis & Power: Supermicro or specialized workstation case + 2x 1600W PSUs. Cost: $1,500
• Total Cost: ~$50,700

It's a bit maximalist, but if you had to spend $50k it's going to be about as fast as you can make it. | | |
|
|
| ▲ | FuckButtons 2 days ago | parent | prev | next [-] |
Are you factoring in the above comment about the as-yet-unimplemented parallel speedup? For on-prem inference without any kind of ASIC, this seems like quite a bargain, relatively speaking. |
|
| ▲ | conradev 2 days ago | parent | prev | next [-] |
| Apple deploys LPDDR5X for the energy efficiency and cost (lower is better), whereas NVIDIA will always prefer GDDR and HBM for performance and cost (higher is better). |
| |
| ▲ | _zoltan_ 2 days ago | parent [-] | | the GH/GB compute has LPDDR5X - a single or dual GPU shares 480GB, depending on whether it's GH or GB, in addition to the HBM memory, with NVLink C2C - it's not bad! | |
| ▲ | wtallis 2 days ago | parent [-] | | Essentially, the Grace CPU is a memory and IO expander that happens to have a bunch of ARM CPU cores filling in the interior of the die, while the perimeter is all PHYs for LPDDR5 and NVLink and PCIe. | | |
| ▲ | rbanffy a day ago | parent | next [-] | | > have a bunch of ARM CPU cores filling in the interior of the die

The main OS needs to run somewhere. At least for now. | |
| ▲ | wtallis a day ago | parent [-] | | Sure, but 72x Neoverse V2 (approximately Cortex-X3) is a choice that seems more driven by convenience than by any real need for an AI server to have tons of somewhat slow CPU cores. | |
| ▲ | _zoltan_ a day ago | parent [-] | | there are use cases where those cores are used for aux processing. there is more to these boxes than AI :-) |
|
| |
| ▲ | _zoltan_ 2 days ago | parent | prev [-] | | fully agree! with MGX and CX8 we see PCIe root moving to the NIC, which is very exciting. |
|
|
|
|
| ▲ | dsrtslnd23 2 days ago | parent | prev | next [-] |
| what about a GB300 workstation with 784GB unified mem? |
| |
| ▲ | rbanffy a day ago | parent | next [-] | | That thing will be extremely expensive, I guess. And neither the CPU nor the GPU has that much memory on its own. It's not a great workstation either - macOS is a lot more comfortable to use. |
| ▲ | wmf a day ago | parent | prev [-] | | $95K | | |
| ▲ | dsrtslnd23 6 hours ago | parent | next [-] | | do you have a source for that? I've been trying to find pricing information but haven't been successful yet. |
| ▲ | rbanffy a day ago | parent | prev [-] | | I miss the time when you could go to Apple's website and build the most obscene computer possible. With the M series, the options got a lot more limited. IIRC, an x86 Mac Pro with 1.5 TB of RAM, a big GPU and the two accelerators would yield an eye-watering hardware bill. Now you need to add eight $5K monitors to get something similarly ludicrous. |
|
|
|
| ▲ | yieldcrv 2 days ago | parent | prev [-] |
15 t/s is way too slow for anything but chatting (call and response), and you don't need a 3T parameter model for that. Wake me up when the situation improves. |
| |