Borealid | 3 days ago
A machine with 128GB of unified system RAM will run reasonable-fidelity quantizations (4-bit or more). If you ever want to answer this kind of question yourself, look at the size of the model files: loading a model usually takes about as much RAM as the model occupies on disk, plus a few gigabytes for the context window.

Qwen3.5-122B-A10B is 120GB. Quantized to 4 bits it is ~70GB. You can run a 70GB model in 80GB of VRAM or in 128GB of unified system RAM, and systems with that capacity cost a few thousand USD new.

If you are willing to sacrifice some performance, you can take advantage of the model being a mixture-of-experts and use disk space to get by with less RAM/VRAM, but inference speed will suffer.
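The rule of thumb above reduces to simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes, plus a few GB of context overhead. A quick sketch (the function name and the 4GB overhead constant are my own assumptions, not anything standard):

```python
# Back-of-envelope memory estimate for running an LLM:
# weights ~ params * bits_per_weight / 8 bytes, plus a few GB for the KV cache / context.

def estimated_gb(params_billion: float, bits_per_weight: int,
                 context_overhead_gb: float = 4.0) -> float:
    """Rough RAM/VRAM footprint in GB (1B params at 8 bits ~ 1 GB of weights)."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + context_overhead_gb

# A 122B-parameter model at 4-bit quantization:
print(estimated_gb(122, 4))  # ~65 GB, consistent with the ~70GB figure above
```

Actual quantized file sizes run a bit higher than this because some tensors (embeddings, norms) are usually kept at higher precision, which is why the real file lands nearer 70GB.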