akoboldfrying 3 days ago
They give some description of how their weights are stored: they pack 4 weights into an int8, which means their storage format isn't information-theoretically optimal (2 bits per weight instead of the optimal ~1.58 bits). But I don't know enough about LLM internals to know how material this is. Could anyone break down the steps further?
Fubwubs 2 days ago | parent
This model maps weights to ternary values {-1, 0, 1} (aka trits). One trit holds log(3)/log(2) ≈ 1.58 bits of information. A single trit stored on its own needs 2 bits, but because 3^5 = 243 ≤ 256, it is possible to pack 5 trits into 8 bits. This article explains it well: https://compilade.net/blog/ternary-packing

By using 4 ternary weights per 8 bits, the model is not quite as space-efficient as it could be in terms of information density: (4 × 1.58)/8 ≈ 0.79, versus (5 × 1.58)/8 ≈ 0.99 for the denser scheme.

However, there is currently no hardware acceleration for doing operations on 5 trits packed into 8 bits, so the weights would have to be packed and unpacked in software. Packing 5 weights into 8 bits requires slower, more complex packing/unpacking (a base-3 decode rather than simple bit shifts).
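To make the trade-off concrete, here's a minimal sketch of both schemes. The function names (pack4_2bit, pack5_base3, etc.) are my own for illustration, not anything from the model or the linked article; the point is just that the 2-bit format is pure shifts and masks, while the 5-trit format needs base-3 arithmetic.

```python
def pack4_2bit(trits):
    """Pack 4 trits into one byte at 2 bits each (the model's format).
    Offset {-1, 0, 1} to {0, 1, 2} and place each in its own 2-bit field."""
    byte = 0
    for i, t in enumerate(trits):
        byte |= (t + 1) << (2 * i)      # cheap: shift and OR
    return byte

def unpack4_2bit(byte):
    """Extract the 4 trits back out with shifts and masks."""
    return [((byte >> (2 * i)) & 0b11) - 1 for i in range(4)]

def pack5_base3(trits):
    """Pack 5 trits into one byte as a base-3 number (0..242 fits in 8 bits)."""
    value = 0
    for t in reversed(trits):           # treat trits as base-3 digits
        value = value * 3 + (t + 1)
    return value

def unpack5_base3(byte):
    """Recover the 5 trits; needs division/modulo by 3, not just bit ops."""
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)
        byte //= 3
    return trits

# Round-trip both formats:
assert unpack4_2bit(pack4_2bit([-1, 0, 1, 1])) == [-1, 0, 1, 1]
assert unpack5_base3(pack5_base3([-1, 0, 1, 1, -1])) == [-1, 0, 1, 1, -1]
```

The divisions in unpack5_base3 are what make the denser format awkward to vectorize, whereas the 2-bit fields of the 4-per-byte format map directly onto the shift/mask instructions SIMD hardware already has.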