wongarsu 3 hours ago

Approximately as challenging as training a regular 100B model from scratch. Maybe a bit more challenging, because there's less accumulated experience with it.

The key insight of the BitNet paper was that using their custom BitLinear layer in place of standard Linear layers (along with some additional training and architecture changes) led to much, much better results than quantizing an existing model down to 1.58 bits. So you end up doing a full training run in bf16 precision using the specially adapted model architecture.
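To make the "1.58 bits" part concrete, here is a rough sketch of the absmean ternary weight quantization described in the BitNet b1.58 paper: weights are scaled by their mean absolute value and rounded into {-1, 0, +1} on the forward pass, while the underlying master weights stay in higher precision during training. The function name is my own; this is an illustration of the quantization step, not the paper's actual implementation.

```python
import numpy as np

def absmean_ternary_quantize(w, eps=1e-6):
    # Scale weights by their mean absolute value, then round and
    # clip each entry to the nearest of {-1, 0, +1}.
    scale = np.mean(np.abs(w)) + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, scale = absmean_ternary_quantize(w)
# q now holds only values from {-1, 0, +1}; scale is kept so the
# quantized matmul output can be rescaled back.
```

In actual training the rounding is non-differentiable, so gradients are typically passed through it with a straight-through estimator, which is part of why this works during training but not as post-hoc quantization of a finished model.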