WithinReason 3 hours ago

> a fundamentally different compute profile on commodity CPU

In what way? On modern processors, a Fused Multiply-Add (FMA) instruction generally has the same execution throughput as a basic addition instruction.

ismailmaj an hour ago | parent | next [-]

You drop the memory-throughput requirements because of the packed representation of bits, so the FMA itself can become the bottleneck, and you bypass the problem of needing to upconvert the bits to whatever FP format the FMA instruction expects.

Typically, for a 1-bit matmul you can get away with XORs and popcounts, which should have a better throughput profile than FMA once you take into account the SIMD nature of the inputs/outputs.
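A rough sketch of that XOR/popcount trick, assuming weights and activations live in {-1, +1} and are packed one per bit (the helper name `bit_dot` and the LSB-first packing are my assumptions, not from the thread):

```python
def bit_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two {-1,+1} vectors packed one element per bit.

    Encoding: bit value 1 means +1, bit value 0 means -1 (LSB = element 0).
    Each elementwise product is +1 when the bits match and -1 when they
    differ, so dot = (#matches) - (#mismatches) = n - 2 * popcount(a XOR b).
    """
    diff = (a_bits ^ b_bits) & ((1 << n) - 1)   # mismatching positions
    return n - 2 * bin(diff).count("1")          # popcount of the XOR

# a = [+1, -1, +1, +1] -> bits (LSB first) 1,0,1,1 -> 0b1101
# b = [+1, +1, -1, +1] -> bits (LSB first) 1,1,0,1 -> 0b1011
print(bit_dot(0b1101, 0b1011, 4))  # -> 0  (1 - 1 - 1 + 1)
```

On real hardware the same identity runs over wide SIMD registers (e.g. XOR plus a vector popcount), so one instruction pair covers hundreds of weight elements instead of the handful an FMA touches.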

ActivePattern 35 minutes ago | parent | prev | next [-]

The win is in how many weights you process per instruction and how much data you load.

So it's not that individual ops are faster; it's that the packed representation lets each instruction do more useful work, and you're moving far less data from memory to do it.
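The "how much data you load" part is easy to put numbers on. A back-of-envelope sketch, where the 7B parameter count is just an illustrative assumption:

```python
# Bytes of weight data streamed from memory per full pass over the model.
params = 7_000_000_000          # hypothetical 7B-parameter model

fp32_bytes   = params * 4       # 4 bytes per fp32 weight
packed_bytes = params // 8      # 1 bit per weight, 8 weights per byte

ratio = fp32_bytes // packed_bytes
print(ratio)  # -> 32: a 32x reduction in weight bytes moved
```

Since large-matmul inference on CPUs is usually memory-bandwidth-bound rather than ALU-bound, that reduction in traffic is where most of the win comes from.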

actionfromafar 2 hours ago | parent | prev [-]

BitNet's encoding is more information-dense per byte, perhaps? CPUs have relatively slow memory buses, so it would eke more use out of the available bandwidth?