rustyhancock 3 hours ago

Yes. I had to read it over twice; it does strike me as odd that there wasn't a base model to work with.

But it seems the biggest model available is 10B? Somewhat unusual and does make me wonder just how challenging it will be to train any model in the 100B order of magnitude.

wongarsu 3 hours ago | parent | next [-]

Approximately as challenging as training a regular 100B model from scratch. Maybe a bit more challenging, because there's less experience with it.

The key insight of the BitNet paper was that using their custom BitLinear layer instead of normal Linear layers (along with some other training and architecture changes) led to much, much better results than quantizing an existing model down to 1.58 bits. So you end up doing a full training run in bf16 precision using the specially adapted model architecture.
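To make the idea concrete, here is a minimal sketch of the "absmean" ternary quantization that BitNet b1.58 applies to weights on each forward pass, while the master weights themselves stay in full precision for the gradient updates. The function name and the toy values are mine, not the paper's code:

```python
# Sketch of BitNet b1.58-style absmean quantization: scale weights by the
# mean absolute value, then round and clip to the ternary set {-1, 0, +1}
# (~1.58 bits per weight). Master weights remain high precision; this
# runs on the fly in the forward pass during training.

def absmean_quantize(weights, eps=1e-5):
    """Quantize a flat list of floats to {-1, 0, +1} plus a scale factor."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps

    def round_clip(x):
        return max(-1, min(1, round(x)))

    return [round_clip(w / scale) for w in weights], scale

ternary, scale = absmean_quantize([0.9, -0.05, -1.2, 0.4])
# ternary is something like [1, 0, -1, 1]; the dequantized weight is
# roughly ternary[i] * scale
```

Since rounding has zero gradient almost everywhere, the actual training uses a straight-through estimator so gradients flow to the full-precision master weights.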

naasking an hour ago | parent | prev [-]

What's unusual about it? It seems pretty standard to train small models to validate an approach, and then show that training scales with model size up to 8B-14B parameter models, which is what they did.