ismailmaj 3 hours ago

The packed bit representation lowers the memory throughput requirement, so the FMA itself can become the bottleneck, and you bypass the problem of having to upconvert the bits to whatever floating-point format the FMA instruction needs.

Typically, for 1-bit matmul you can get away with XORs and popcounts, which should have a better throughput profile than FMA once you take the SIMD nature of the inputs and outputs into account.
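
To make the XOR and popcount idea concrete, here is a minimal sketch of a 1-bit dot product. It assumes each value is +1 or -1 and is stored as a single bit (1 for +1, 0 for -1); the helper names `pack_signs` and `binary_dot` are made up for illustration and are not from the repo. Real kernels would do the same thing on wide SIMD registers rather than Python integers.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an int; bit i is set when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +/-1 vectors packed as bits.

    Matching bits contribute +1 and differing bits contribute -1,
    so the result is n - 2 * popcount(a XOR b).
    """
    mismatches = bin(a_bits ^ b_bits).count("1")  # popcount of the XOR
    return n - 2 * mismatches


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
result = binary_dot(pack_signs(a), pack_signs(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
```

No multiply or float conversion appears anywhere: one XOR and one popcount handle every element packed in the word.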

WithinReason an hour ago | parent [-]

Yes, but this is not 1-bit matmul. It's 1.58 bits, with expensive unpacking.

ismailmaj 33 minutes ago | parent [-]

The title and the repo use "1-bit" when they mean 1.58-bit ternary values. That doesn't change any of my arguments (it's still XORs and popcounts).
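
A rough sketch of how the ternary case can stay within bitwise ops and popcounts, assuming each vector of values in {-1, 0, +1} is stored as two bit planes: one marking nonzero entries and one marking negative entries. That layout costs 2 bits per value rather than 1.58, and it is an assumption made for illustration, not the repo's actual format; a denser packing would need an unpacking step first. The names `pack_ternary` and `ternary_dot` are hypothetical.

```python
def popcount(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count("1")


def pack_ternary(values):
    """Pack values in {-1, 0, +1} into two bit planes: (nonzero, negative)."""
    nonzero = 0
    negative = 0
    for i, v in enumerate(values):
        if v != 0:
            nonzero |= 1 << i
        if v < 0:
            negative |= 1 << i
    return nonzero, negative


def ternary_dot(a, b):
    """Dot product of two ternary vectors given as (nonzero, negative) planes."""
    a_nz, a_neg = a
    b_nz, b_neg = b
    active = a_nz & b_nz                # positions where both values are nonzero
    differ = (a_neg ^ b_neg) & active   # active positions with opposite signs
    # Each active position contributes +1, or -1 when the signs differ.
    return popcount(active) - 2 * popcount(differ)


a = [1, 0, -1, -1, 1]
b = [-1, 1, -1, 0, 1]
result = ternary_dot(pack_ternary(a), pack_ternary(b))
reference = sum(x * y for x, y in zip(a, b))
```

Compared with the pure 1-bit case, this adds one AND for the zero mask and a second popcount, but still no multiplies.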