pdyc | 7 hours ago
Correct, but it should be some ratio of the model size: if the model weighs x GB, the max context would occupy roughly x times some constant of RAM. For the quantized version, assuming it's 18 GB at Q4, this Mac should be able to support 64-128k context.
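The KV-cache footprint can be estimated directly from the model's architecture rather than a flat ratio of weight size. A minimal sketch, where the layer count, KV-head count, and head dimension are hypothetical numbers for a 9B-class model with grouped-query attention, and an 8-bit (Q8_0-like, ~1 byte per element) cache is assumed:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    # 2 tensors (K and V) per layer, one vector of size
    # n_kv_heads * head_dim per token, for n_ctx tokens.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 2**30

# Hypothetical 9B-class model: 40 layers, 8 KV heads (GQA), head dim 128,
# 128k context, ~1 byte per element for an 8-bit quantized KV cache.
print(kv_cache_gib(40, 8, 128, 128 * 1024, 1))  # → 10.0 GiB
```

With grouped-query attention the cache grows with the (small) KV-head count rather than the full attention-head count, which is why a long context can fit in a modest amount of RAM on top of the weights.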
abhikul0 | 7 hours ago | parent
For the 9B model, I can use the full context with a Q8_0 KV cache. This uses around 16 GB while still leaving comfortable headroom. Output after exiting llama-server: