Remix clone Hacker News

	▲	ComputerGuru 6 hours ago
		I would have liked to see a bit more on the theory side of things, explaining optimal weight and inference splits, actual issues with existing drivers, etc., instead of what’s essentially just a recipe.
	▲	atq2119 5 hours ago
		Agreed. To put this in perspective, batch 1 token decode is bandwidth limited in theory, since every generated token has to read all of the weights from memory. Memory bandwidth of the RTX 3090 is listed as 936GB/s. The post isn't fully clear on which model they used and how big it is, but even assuming it perfectly filled the 24GB of that GPU, 30tok/s means the achieved bandwidth is at most 24GB × 30/s = 720GB/s, roughly 77% of the listed peak. There's a bunch of room for improvement here even without MTP, and those improvements should largely stack with MTP.
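		The back-of-the-envelope estimate above can be sketched in a few lines of Python. This is only a rough model: it assumes a dense model where each decoded token reads every weight exactly once, it ignores KV cache and activation traffic, and it treats the 24GB figure as an upper bound because the post doesn't state the real model size. The function and variable names are made up for illustration.

```python
def achieved_bandwidth_gb_s(model_size_gb: float, tokens_per_s: float) -> float:
    """Estimate memory bandwidth used by batch-1 decode.

    Assumption: each token reads all weights once, so
    bandwidth = model size * tokens per second.
    """
    return model_size_gb * tokens_per_s


PEAK_GB_S = 936.0  # listed RTX 3090 memory bandwidth, from the comment
MODEL_GB = 24.0    # assumed upper bound: model fills the whole 24GB card
TOK_S = 30.0       # decode speed quoted in the comment

achieved = achieved_bandwidth_gb_s(MODEL_GB, TOK_S)
utilization = achieved / PEAK_GB_S
# Under the same assumption, the theoretical decode ceiling for this model size.
ceiling_tok_s = PEAK_GB_S / MODEL_GB

print(f"achieved bandwidth: {achieved:.0f} GB/s")
print(f"utilization of peak: {utilization:.0%}")
print(f"bandwidth-bound ceiling: {ceiling_tok_s:.0f} tok/s")
```

		A smaller model would make the achieved bandwidth, and therefore the utilization, even lower at the same 30tok/s, so the 24GB case is the most favorable reading of the numbers.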
	▲	verdverm 6 hours ago
		I've been using https://spark-arena.com/leaderboard to glean this kind of information for DGX Spark, a sort of recipe book. The Nvidia forum has people talking about the things you wish to know. I see some of that on Discord/Reddit/et al, but it's less cohesive. I've switched from using the Spark as a way to run one model as best it can to running several support models for the md kb I'm working on.