Remix clone Hacker News

new | show | ask | jobs Github

	▲	thr3at-surfac3 3 days ago
		The unified memory architecture is what makes Strix Halo interesting for inference workloads. No PCIe bottleneck moving weights between CPU and GPU memory. For anyone getting started, the Unsloth UD quants are the way to go their imatrix calibration makes a real difference in output quality at Q6/Q8 compared to naive quantization. Curious about the ROCm vs Vulkan situation though. Has anyone benchmarked the prompt processing speed difference? For agentic workflows where you're constantly feeding new context, first-token latency matters more than raw tok/s.