| ▲ | bastawhiz 3 days ago |
| There's no way the red v2 is doing anything with a 120b parameter model. I just finished building a dual-A100 AI homelab (80GB VRAM combined, with NVLink), otherwise similar specs. 120b only fits with very heavy quantization, enough to make the model schizophrenic in my experience. And there's no room for KV cache, so you'll OOM around 4k of context. I'm running a 70b model now that's okay, but it's still fairly tight, and I've got 16GB more VRAM than the red v2. I'm also confused why this is 12U; my whole rig is 4U. The green v2 has better GPUs, but for $65k I'd expect a much better CPU and 256GB of RAM. It's not like a Threadripper 7000 is going to break the bank. I'm glad this exists but it's... honestly pretty perplexing |
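As a rough sanity check on that claim, here's the back-of-envelope weight-memory arithmetic for a 120B-parameter dense model (illustrative numbers only; the bits-per-parameter figures are assumptions that include rough overhead for quantization metadata):

```python
# Approximate weight footprint of a 120B-parameter model at common
# quantization levels. Bits/param values are assumed, with rough
# overhead included for quantization block metadata.
PARAMS = 120e9

def weights_gb(bits_per_param: float) -> float:
    """Weight memory in GB (decimal) for a given bits-per-parameter."""
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("bf16", 16), ("q8", 8.5), ("q4", 4.5), ("q2", 2.6)]:
    print(f"{name}: ~{weights_gb(bits):.0f} GB")
```

Against 80GB of combined VRAM, even a 4-bit quant leaves little headroom for KV cache, which matches the ~4k-context OOM described above.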
|
| ▲ | oceanplexian 2 days ago | parent | next [-] |
| It will work fine but it's not necessarily insane performance. I can run a q4 of gpt-oss-120b on my Epyc Milan box that has similar specs and get something like 30-50 tok/sec by splitting it across RAM and GPU. The thing that's less useful is the 64G VRAM/128G system RAM config: even the large MoE models only need 20B for the router, so the rest of the VRAM is essentially wasted (mixing experts between VRAM and system RAM has basically no performance benefit). |
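For intuition on why ~30-50 tok/sec is plausible for a sparse MoE split across RAM and GPU, here's a memory-bandwidth-bound estimate. All figures are assumptions, not measurements: roughly 5.1B active parameters per token for gpt-oss-120b, ~0.55 bytes/param at q4, and ~200 GB/s for 8-channel DDR4 on an Epyc Milan:

```python
# Decode-speed ceiling when the active weights must be read from
# system RAM for every token (bandwidth-bound; assumed figures).
active_params = 5.1e9       # active parameters per token (sparse MoE)
bytes_per_param = 0.55      # ~q4 including quantization overhead
ram_bandwidth = 200e9       # bytes/sec, 8-channel DDR4-3200

bytes_per_token = active_params * bytes_per_param
ceiling_tok_per_sec = ram_bandwidth / bytes_per_token
print(f"~{ceiling_tok_per_sec:.0f} tok/s theoretical upper bound")
```

Real throughput lands below this ceiling, so the observed 30-50 tok/sec is in the right ballpark.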
| |
| ▲ | androiddrew 2 days ago | parent | next [-] | | Could you share what you are using for inference and how you are running it? I have a 64G VRAM/128G system RAM setup. | | |
| ▲ | sosodev 2 days ago | parent [-] | | Most people are using something in the llama.cpp family for inference; llama-server is my go-to. The Unsloth guides describe how to configure inference for your model of choice. |
| |
| ▲ | syntaxing 2 days ago | parent | prev | next [-] | | Splitting between RAM and GPU impacts performance more than you think. I would be surprised if the red box doesn't outperform you by 2-3X for both PP (prompt processing) and TG (token generation) | |
| ▲ | datadrivenangel 2 days ago | parent | prev [-] | | Yeah I've got the q4 gpt-oss-120b running at ~40-60 tokens per second on an M5 Pro. |
|
|
| ▲ | overfeed 2 days ago | parent | prev | next [-] |
| > I'm also confused why this is 12U. My whole rig is 4u. I imagine that's because they're buying a single SKU for the shell/case. Their answer to your question would probably be: in order to keep prices low and quality high, we don't offer any customization of the server dimensions |
| |
| ▲ | ottah 2 days ago | parent [-] | | That's just such a massively oversized server for the number of GPUs, and it's not like they're doing anything special either. I can buy an appropriately sized Supermicro chassis myself and throw some cards in it. They're really not adding enough value to justify overspending on anything. | | |
| ▲ | randomgermanguy 2 days ago | parent [-] | | The major selling point of the tinyboxes is that you're able to run them in your office without any hassle. I used to own a Dell Poweredge for my home-office, but those fans even on minimal setting kept me up at night |
|
|
|
| ▲ | ericd 2 days ago | parent | prev | next [-] |
| Was that cheaper than a Blackwell 6000? But yeah, 4x Blackwell 6000s are ~32-36k, not sure where the other $30k is going. |
| |
| ▲ | bastawhiz 2 days ago | parent | next [-] | | I bought the A100s used for a little over $6k each. | | |
| ▲ | ericd 2 days ago | parent [-] | | Oh, why'd you go that route? Considering going beyond 80 gigs with nvlink or something? | | |
| ▲ | bastawhiz a day ago | parent [-] | | When the costs come down, I'll add two H100s. Until I have more work to saturate the GPUs, they're really at the limit of what I can make time to use them for. Give me a year of writing code and I'll have the need! |
|
| |
| ▲ | segmondy 2 days ago | parent | prev [-] | | Folks have more money than sense. gpt-oss-120b at full quant runs on my quad 3090s at 100 tk/sec, and that's with llama.cpp; with vLLM it will probably run at 150 tk/sec, and that's without batching. | | |
| ▲ | Aurornis 2 days ago | parent | next [-] | | > gpt-oss-120b full quant runs on my quad 3090 A 120B model cannot fit on 4 x 24GB GPUs at full quantization. Either you're confusing this with the 20B model, or you have 48GB modded 3090s. | | |
| ▲ | segmondy 2 days ago | parent [-] | | Some of you folks on here love to argue. gpt-oss-120b was trained in 4 bits, so it pretty much takes up 60GB. | |
| ▲ | Aurornis 2 days ago | parent [-] | | Good point, but you still need KV cache and more. Fitting the model alone to RAM doesn’t get the job done. | | |
| ▲ | ColonelPhantom a day ago | parent | next [-] | | GPT-OSS is tailored to be extremely memory efficient. Not only does it natively use the ~4.25-bits-per-weight MXFP4 format, it also uses sliding window attention for half of its layers. It also doesn't have that many layers: only 36 for the 120B version and 24 for the 20B version. (The 120B is also much, much sparser than the 20B.) I found a Reddit comment claiming only 36 KiB of KV cache per token. With that, half a million tokens fits in 18 GB, which is less than one GPU. And three GPUs fit the parameters with room to spare (64 out of 72 GB). | |
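Reproducing the arithmetic above (taking the ~36 KiB/token KV figure from the cited Reddit claim, and MXFP4's ~4.25 bits per weight, as given assumptions):

```python
# KV cache and weight memory for gpt-oss-120b under the assumed figures.
kv_per_token_bytes = 36 * 1024      # ~36 KiB/token (claimed figure)
context_tokens = 500_000
params = 120e9
bits_per_weight = 4.25              # MXFP4

kv_gb = kv_per_token_bytes * context_tokens / 1e9
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"KV for 500k tokens: ~{kv_gb:.1f} GB, weights: ~{weights_gb:.2f} GB")
```

That works out to ~18.4 GB of KV (under one 24 GB card) and ~63.75 GB of weights (under three cards' 72 GB).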
| ▲ | segmondy 2 days ago | parent | prev [-] | | Yeah, it doesn't take much. I'm looking at it right now, KV cache is about 4gb of vram, compute buffer =~ 1.5gb at full 128k context. |
|
|
| |
| ▲ | integralid 2 days ago | parent | prev | next [-] | | Thanks for chiming in. I'm looking for a reasonably cheap local LLM machine, and multiple 3090s is exactly what I planned to buy. Do you have any recommendations or recommend any reading material before I decide to spend money on that? edit: Found your comment about /r/localllama, but if you have anything more to add I'm still very interested. | |
| ▲ | amarshall 2 days ago | parent | prev | next [-] | | You're almost certainly (definitely, in fact) confusing the 120b and 20b models. | | |
| ▲ | segmondy 2 days ago | parent [-] | | I'm most certainly not doing so. seg@seg-epyc:~/models$ du -sh * /llmzoo/models/* | sort -n
4.0K metrics.txt
4.0K opus
4.0K start_llama
8.2G nvidia_Orchestrator-8B-Q8_0.gguf
12K config.ini
34G Qwen3.5-27B
47G Qwen3.5-35B
51G Qwen3.5-27B-BF16
61G gpt-oss-120b-F16.gguf
65G Qwen3.5-35B-BF16
106G Qwen3.5-122B-Q6
117G GLM4.6V
175G MiniMax-M2.5
232G /llmzoo/models/small_models
240G Ernie4.5-300B
377G DeepSeekv3.2-nolight
380G /llmzoo/models/DeepSeek-V3.2-UD
400G /llmzoo/models/Qwen3.5-397B-Q8
424G /llmzoo/models/KimiK2Thinking
443G DeepSeek-Math-v2
443G DeepSeek-V3-0324-Q5
500G /llmzoo/models/GLM5-Q5
546G /llmzoo/models/KimiK2.5
| | |
| |
| ▲ | ericd 2 days ago | parent | prev [-] | | How're you fitting a model made for 80 gig cards onto a GPU with 24 gigs at full quant? | | |
| ▲ | zozbot234 2 days ago | parent | next [-] | | MoE layers offload to CPU inference is the easiest way, though a bit of a drag on performance | | |
| ▲ | ericd 2 days ago | parent [-] | | Yeah, I'd just be pretty surprised if they were getting 100 tokens/sec that way. EDIT: Either they edited that to say "quad 3090s", or I just missed it the first time. | | |
| ▲ | segmondy 2 days ago | parent [-] | | you are correct, I did forget to add quad. you should join us in r/localllama check out what other people are getting. you're welcome. https://www.reddit.com/r/LocalLLaMA/comments/1nunq7s/gptoss1...
https://www.reddit.com/r/LocalLLaMA/comments/1p4evyr/most_ec... | | |
| ▲ | ericd 2 days ago | parent [-] | | Thanks for the confirmation, wasn't sure if I was just going a bit senile heh. Yeah, I love /r/localllama, some of the best actual practitioners of this stuff on the internet. Also, crazy awesome frankenrigs to try and get that many huge cards working together. I was considering picking up a couple of the 48 gig 4090/3090s on an upcoming trip to China, but I just ended up getting one of the Max-Q's. But maybe the token throughput would still be higher with the 4090 route? Impressive numbers with those 3090s! What's the rig look like that's hosting all that? |
|
|
| |
| ▲ | Havoc 2 days ago | parent | prev [-] | | He said quad 3090 not single | | |
| ▲ | ericd 2 days ago | parent [-] | | Yeah, pretty sure that was edited in after I commented because 150 toks/sec was also new, but could’ve just missed it. |
|
|
|
|
|
| ▲ | gfiorav 2 days ago | parent | prev | next [-] |
| I think Hotz basically created super specific software for the GPUs that throws away anything that doesn't contribute to inference (not Turing complete, for example). |
|
| ▲ | Aurornis 2 days ago | parent | prev | next [-] |
| > There's no way the red v2 is doing anything with a 120b parameter model. I don't see the 120B claim on the page itself. Unless the page has been edited, I think it's something the submitter added. I agree, though: the only way you're running 120B models on that device is either extreme quantization or offloading layers to the CPU, and neither will be a good experience. These aren't a good value unless you compare them to fully supported offerings from the big players. It's going to be hard to target a market where most people know they can put together the exact same system for thousands of dollars less and have it assembled in an afternoon. RTX 6000 96GB cards are in stock at Newegg for $9000 right now, which leaves almost $30,000 for the rest of the system. Even with today's RAM prices it's not hard to do better than that CPU and 256GB of RAM when you have a $30,000 budget. |
|
| ▲ | zozbot234 2 days ago | parent | prev | next [-] |
| > And there's no room for kv, so you'll OOM around 4k of context. Can't you offload KV to system RAM, or even storage? It would make it possible to run with longer contexts, even with some overhead. AIUI, local AI frameworks include support for caching some of the KV in VRAM, using an LRU policy, so the overhead would be tolerable. |
| |
| ▲ | tcdent 2 days ago | parent | next [-] | | Not worth it. It is a very significant performance hit. With that said, people are trying to extend VRAM into system RAM or even NVMe storage, but as soon as you hit the PCI bus with the high bandwidth layers like KV cache, you eliminate a lot of the performance benefit that you get from having fast memory near the GPU die. | | |
| ▲ | zozbot234 2 days ago | parent [-] | | > With that said, people are trying to extend VRAM into system RAM or even NVMe storage Only useful for prefill (given the usual discrete-GPU setup; iGPU/APU/unified memory is different and can basically be treated as VRAM-only, though a bit slower) since the PCIe bus becomes a severe bottleneck otherwise as soon as you offload more than a tiny fraction of the memory workload to system memory/NVMe. For decode, you're better off running entire layers (including expert layers) on the CPU, which local AI frameworks support out of the box. (CPU-run layers can in turn offload to storage for model parameters/KV cache as a last resort. But if you offload too much to storage (insufficient RAM cache) that then dominates the overhead and basically everything else becomes irrelevant.) |
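The bus-bottleneck point can be made concrete with nominal bandwidth figures (assumed spec-sheet numbers; sustained throughput is lower in practice):

```python
# Rough bandwidth gap between on-GPU memory and the PCIe link that
# any VRAM <-> system-RAM offload has to cross.
pcie4_x16 = 32.0     # GB/s, PCIe 4.0 x16, one direction (nominal)
a100_hbm2e = 2039.0  # GB/s, A100 80GB HBM2e (nominal)

ratio = a100_hbm2e / pcie4_x16
print(f"On-card memory is ~{ratio:.0f}x faster than the bus")
```

Anything read across the bus every token is therefore dozens of times slower than keeping it in VRAM, which is why KV in system RAM only makes sense for cold data.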
| |
| ▲ | bastawhiz 2 days ago | parent | prev | next [-] | | The performance already isn't spectacular with it running all in vram. It'll obviously depend on the model: MoE will probably perform better than a dense model, and anything with reasoning is going to take _forever_ to even start beginning its actual output. | |
| ▲ | ranger_danger 2 days ago | parent | prev [-] | | I know llama.cpp can, it certainly improved performance on my RAM-starved GPU. |
|
|
| ▲ | packetlost 2 days ago | parent | prev | next [-] |
| This does not match my experience with ~120B models. I run Qwen3.5 122b A10B on about 80GB of VRAM just fine. |
| |
| ▲ | bastawhiz a day ago | parent [-] | | Qwen 3.5 is MoE, and you're almost certainly running a quantized version: 120B is well over 200GB at bf16, while with int4 you're looking at 60GB or so. Qwen uses relatively little KV (only about 2GB for 64k context), so you're not too snug. But if Qwen isn't cutting it for you, as it didn't for me, you're kind of in a pickle. For writing tasks, int4 was simply too chaotic, and I couldn't get it to use tools. Other options use more VRAM, and where you'd have a fair amount of buffer with Qwen, you're pressed with other big models. You're also not fine-tuning a 120b parameter model with 80GB, and you're probably not going to be able to abliterate it either, because it's MoE. |
|
|
| ▲ | ottah 2 days ago | parent | prev | next [-] |
| Honestly two RTX 8000s would probably have a better return on investment than the red v2. I have an eight-GPU server: five RTX 8000s, three RTX 6000 Adas. For basic inference, the 8000s aren't bad at all. I'm sure the green with four RTX Pro 6000s is dramatically faster, but there's a $25k markup I honestly don't understand. |
|
| ▲ | sosodev 2 days ago | parent | prev [-] |
| What models are you testing? A 120b model with hybrid attention should fit within 80GB of VRAM fine at a 4-bit quant. Also, 4-bit quants that are done well are generally fine; they certainly don't make the model unusable. |