wongarsu 3 hours ago

Approximately as challenging as training a regular 100B model from scratch. Maybe a bit more challenging, because there's less accumulated experience with it.

The key insight of the BitNet paper was that using their custom BitLinear layer in place of standard Linear layers (along with some additional training and architecture changes) led to much, much better results than quantizing an existing model down to 1.58 bits. So you end up doing a full training run in bf16 precision using the specially adapted model architecture.
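To make the "1.58 bits" part concrete, here is a rough sketch of the absmean ternary weight quantization described in the BitNet b1.58 paper: weights are scaled by their mean absolute value and rounded into {-1, 0, +1} on the forward pass, while the underlying master weights stay in higher precision during training. The function name is my own; this is an illustration of the quantization step, not the paper's actual implementation.

```python
import numpy as np

def absmean_ternary_quantize(w, eps=1e-6):
    # Scale weights by their mean absolute value, then round and
    # clip each entry to the nearest of {-1, 0, +1}.
    scale = np.mean(np.abs(w)) + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, scale = absmean_ternary_quantize(w)
# q now holds only values from {-1, 0, +1}; scale is kept so the
# quantized matmul output can be rescaled back.
```

In actual training the rounding is non-differentiable, so gradients are typically passed through it with a straight-through estimator, which is part of why this works during training but not as post-hoc quantization of a finished model.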