Remix clone Hacker News

	▲	mikeayles 5 hours ago
		For anyone wondering whether this can be used to accelerate LLM inference: sadly not. I've been trying to hit 100,000 tokens/s with a dumb 3.28M model, and even that is an order of magnitude too large to benefit. It appears to be focussed more on latency than throughput. Happy to be corrected.
	▲	ssivark 27 minutes ago
		When aiming for 100k tok/s, you would still have CUDA overheads (on the order of microseconds) -- which might become the bottleneck, even if you do everything else right with the inference architecture. How are you planning to overcome that? EDIT: Oh, on second read, do you mean you're running the model on an FPGA?
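		As a rough sanity check on the numbers in this comment, here is a minimal sketch of the per-token time budget. It assumes tokens are generated strictly one at a time (no batching), and the launch count and per-launch overhead passed in are hypothetical placeholders, not measurements.

```python
def per_token_budget_us(tokens_per_second: float) -> float:
    """Time available per token, in microseconds, for sequential decoding."""
    return 1e6 / tokens_per_second


def overhead_fraction(tokens_per_second: float,
                      launches_per_token: int,
                      launch_overhead_us: float) -> float:
    """Fraction of the per-token budget consumed by fixed launch overhead.

    launches_per_token and launch_overhead_us are assumed inputs;
    real values depend on the model, driver, and hardware.
    """
    budget = per_token_budget_us(tokens_per_second)
    return launches_per_token * launch_overhead_us / budget


if __name__ == "__main__":
    target = 100_000  # tokens/s, the figure from the thread
    print(f"budget per token: {per_token_budget_us(target)} us")
    # Hypothetical: one launch per token at 5 us each.
    print(f"overhead share: {overhead_fraction(target, 1, 5.0):.0%}")
```

		At 100k tok/s the budget is 10 µs per token, so under these assumptions even a single microsecond-scale launch per token takes a sizeable share of it, and a handful of launches would exceed it outright.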
	▲	ag2718 5 hours ago
		You're correct: this work is not very applicable to LLMs, and the focus here is primarily on latency.
	▲	ai_fry_ur_brain 2 hours ago
		Was anyone thinking this?