Remix clone Hacker News

	▲	sroussey a day ago
		How are 70b LLMs running on that?
	▲	cma a day ago
		Qwen Coder 32B Instruct is the state of the art for local LLM coding, and it will run with a smallish context on a 64GB laptop with partial GPU offload, probably at around 0.8 tok/sec. With a quantized version you can run larger contexts and go a bit faster: 1.4 tok/sec at 8-bit quant with offload to a 6GB laptop GPU. Speculative decoding has recently been added to lots of the runtimes and can give a 20-30% boost, with a 1-billion-parameter model generating the speculative token stream.
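		For a rough sense of the arithmetic behind those numbers, here is a back-of-the-envelope sketch. It counts only the weights (no KV cache, activations, or quantization metadata), and the function names are made up for illustration. Applying the 20-30% speculative-decoding boost to the 1.4 tok/sec figure is an assumption, not a measurement.

```python
def weight_gb(n_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


def boosted_rate(tok_per_sec: float, boost: float) -> float:
    """Throughput after a fractional speedup, e.g. boost=0.2 for +20%."""
    return round(tok_per_sec * (1 + boost), 2)


PARAMS_32B = 32_000_000_000  # nominal parameter count of a "32B" model

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_gb(PARAMS_32B, bits):.0f} GB")

# Assumed: the quoted 20-30% boost applied to the quoted 1.4 tok/sec.
print(boosted_rate(1.4, 0.2), boosted_rate(1.4, 0.3))
```

		At 16 bits the weights alone come to about 64 GB, which is why the unquantized model only barely fits with a small context, while 8-bit halves that and leaves room for a larger one.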
	▲	jocaal a day ago
		The free version of ChatGPT is better than your 70B LLM, so what's the point?