oceanplexian 18 hours ago

It will work fine, but the performance isn't necessarily insane. I can run a q4 quant of gpt-oss-120b on my Epyc Milan box, which has similar specs, and get something like 30-50 tok/sec by splitting it across RAM and GPU.
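For intuition on why split-across-RAM-and-GPU numbers land in that range, a rough back-of-envelope sketch (all bandwidth and parameter figures below are illustrative assumptions, not measurements of any specific box): decode is roughly memory-bandwidth-bound, so tok/sec is about effective bandwidth divided by the bytes of active weights read per token.

```python
# Rough tok/s estimate for a MoE model whose active weights are read
# partly from GPU VRAM and partly from system RAM. All numbers are
# illustrative assumptions, not benchmarks.

GB = 1e9

active_params = 5.1e9    # params activated per token for gpt-oss-120b (assumed)
bytes_per_weight = 0.55  # ~4.4 bits/weight for a q4-style quant (assumed)
bytes_per_token = active_params * bytes_per_weight

def tok_per_sec(fraction_on_gpu, gpu_bw=900 * GB, ram_bw=200 * GB):
    """Decode speed if a fraction of each token's active weights is read
    from VRAM and the rest from system RAM (Epyc Milan-class DDR4
    bandwidth assumed). Time per token is the sum of both reads."""
    t = (fraction_on_gpu * bytes_per_token / gpu_bw
         + (1 - fraction_on_gpu) * bytes_per_token / ram_bw)
    return 1 / t

for f in (0.0, 0.5, 1.0):
    print(f"{f:.0%} of active weights in VRAM: ~{tok_per_sec(f):.0f} tok/s")
```

This is an upper bound (it ignores compute, KV-cache reads, and PCIe traffic), but it shows why the CPU-side bandwidth dominates once any meaningful share of the active weights lives in system RAM.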

The config that's less useful is the 64G VRAM/128G system RAM one: even the large MoE models only need around 20B for the router, so the rest of the VRAM is essentially wasted (mixing experts between VRAM and system RAM has basically no performance benefit).
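A quick sketch of that sizing argument (parameter counts and quant density are round-number assumptions for illustration, not exact figures for any model): the router/shared slice of a big MoE is small at q4, and everything past it is expert weights that each token only touches sparsely, so extra VRAM mostly ends up holding experts that are rarely read.

```python
# Illustrative memory split for a "120B"-class MoE at a q4-style quant.
# Parameter counts below are round-number assumptions, not exact figures.

GB = 1e9
bytes_per_weight = 0.55   # ~4.4 bits/weight for q4-ish quants (assumed)

total_params = 120e9      # headline parameter count (assumed)
dense_params = 20e9       # router + shared/attention weights (assumed,
                          # matching the ~20B figure in the comment above)
expert_params = total_params - dense_params

dense_gb = dense_params * bytes_per_weight / GB
expert_gb = expert_params * bytes_per_weight / GB

print(f"router/shared weights: ~{dense_gb:.0f} GB -> fits easily in VRAM")
print(f"expert weights:        ~{expert_gb:.0f} GB -> read from system RAM")
```

Under these assumptions the always-hot slice fits in a fraction of a 64G card, and the expert pool is too big to fit anyway, which is the "wasted VRAM" point.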

androiddrew an hour ago | parent | next [-]

Could you share what you are using for inference and how you are running it? I have a 64G VRAM/128G system RAM setup.

datadrivenangel an hour ago | parent | prev | next [-]

Yeah I've got the q4 gpt-oss-120b running at ~40-60 tokens per second on an M5 Pro.

syntaxing 16 hours ago | parent | prev [-]

Splitting between RAM and GPU impacts it more than you think. I would be surprised if the red box doesn't outperform you by 2-3X for both PP (prompt processing) and TG (token generation).