SirMaster 8 days ago

You don't really need it to fit entirely in VRAM: with the MoE architecture only a small subset of experts is active per token, so llama.cpp can keep the expert weights in system RAM and still run at a reasonable speed.

The 120B runs at 20 tokens/sec on my 5060 Ti 16GB with 64GB of system RAM. Personally I find 20 tokens/sec quite usable, but for some maybe it's not enough.
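For anyone who wants to try this split, here is a minimal sketch of a llama.cpp invocation. The model filename, context size, and the layer count passed to --n-cpu-moe are all illustrative and need tuning for your hardware; --n-cpu-moe is also a fairly recent flag, so on older builds the equivalent is a tensor override like -ot ".ffn_.*_exps.=CPU":

    # Put everything on the GPU except the MoE expert tensors of the
    # first N layers, which stay in system RAM. The value 28 is a
    # guess; raise or lower it until the 16GB card is nearly full.
    llama-server \
      -m gpt-oss-120b.gguf \
      --n-gpu-layers 99 \
      --n-cpu-moe 28 \
      -c 8192

The attention and shared weights sit in VRAM where they are hit every token, while the big expert matrices live in RAM, which is why the speed stays usable despite the model being far larger than 16GB.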

dexterlagan 8 days ago | parent [-]

I have a similar setup but with 32 GB of RAM. Do you partially offload the model to RAM? Do you use LM Studio or another tool to achieve this? Thanks!