oceanplexian 18 hours ago

It will work fine, but the performance isn't necessarily insane. I can run a q4 quant of gpt-oss-120b on my Epyc Milan box, which has similar specs, and get something like 30-50 tok/sec by splitting it across RAM and GPU.
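For intuition on why split-across-RAM-and-GPU numbers land in that range, a rough back-of-envelope sketch (all bandwidth and parameter figures below are illustrative assumptions, not measurements of any specific box): decode is roughly memory-bandwidth-bound, so tok/sec is about effective bandwidth divided by the bytes of active weights read per token.

```python
# Rough tok/s estimate for a MoE model whose active weights are read
# partly from GPU VRAM and partly from system RAM. All numbers are
# illustrative assumptions, not benchmarks.

GB = 1e9

active_params = 5.1e9    # params activated per token for gpt-oss-120b (assumed)
bytes_per_weight = 0.55  # ~4.4 bits/weight for a q4-style quant (assumed)
bytes_per_token = active_params * bytes_per_weight

def tok_per_sec(fraction_on_gpu, gpu_bw=900 * GB, ram_bw=200 * GB):
    """Decode speed if a fraction of each token's active weights is read
    from VRAM and the rest from system RAM (Epyc Milan-class DDR4
    bandwidth assumed). Time per token is the sum of both reads."""
    t = (fraction_on_gpu * bytes_per_token / gpu_bw
         + (1 - fraction_on_gpu) * bytes_per_token / ram_bw)
    return 1 / t

for f in (0.0, 0.5, 1.0):
    print(f"{f:.0%} of active weights in VRAM: ~{tok_per_sec(f):.0f} tok/s")
```

This is an upper bound (it ignores compute, KV-cache reads, and PCIe traffic), but it shows why the CPU-side bandwidth dominates once any meaningful share of the active weights lives in system RAM.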

The config that's less useful is the 64G VRAM/128G system RAM one: even the large MoE models only need around 20B for the router, so the rest of the VRAM is essentially wasted (mixing experts between VRAM and system RAM has basically no performance benefit).
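A quick sketch of that sizing argument (parameter counts and quant density are round-number assumptions for illustration, not exact figures for any model): the router/shared slice of a big MoE is small at q4, and everything past it is expert weights that each token only touches sparsely, so extra VRAM mostly ends up holding experts that are rarely read.

```python
# Illustrative memory split for a "120B"-class MoE at a q4-style quant.
# Parameter counts below are round-number assumptions, not exact figures.

GB = 1e9
bytes_per_weight = 0.55   # ~4.4 bits/weight for q4-ish quants (assumed)

total_params = 120e9      # headline parameter count (assumed)
dense_params = 20e9       # router + shared/attention weights (assumed,
                          # matching the ~20B figure in the comment above)
expert_params = total_params - dense_params

dense_gb = dense_params * bytes_per_weight / GB
expert_gb = expert_params * bytes_per_weight / GB

print(f"router/shared weights: ~{dense_gb:.0f} GB -> fits easily in VRAM")
print(f"expert weights:        ~{expert_gb:.0f} GB -> read from system RAM")
```

Under these assumptions the always-hot slice fits in a fraction of a 64G card, and the expert pool is too big to fit anyway, which is the "wasted VRAM" point.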

androiddrew an hour ago | parent | next [-]

Could you share what you are using for inference and how you are running it? I have a 64G VRAM/128G system RAM setup.

datadrivenangel an hour ago | parent | prev | next [-]

Yeah I've got the q4 gpt-oss-120b running at ~40-60 tokens per second on an M5 Pro.

syntaxing 16 hours ago | parent | prev [-]

Splitting between RAM and GPU impacts it more than you think. I would be surprised if the red box doesn't outperform you by 2-3X for both PP (prompt processing) and TG (token generation).