Remix.run Logo
pavpanchekha 15 hours ago

I loved this clever, weird, awesome paper, so a short summary.

In many dynamic languages some values are stored on the heap ("boxed") and represented as a pointer, while others are represented as an immediate value ("unboxed" or "immediate"). Pointer tagging is a common way to do that: the low bit of the value tells you the value's type, and some types are immediate while others are boxed.

Naturally, the tag bits have a fixed value, so can't be used to store data. So for example your language might offer 61-bit integer immediates instead of 64-bit integers; the other three bits are used for tags. Possibly, larger integers are stored on the heap and treated as a different type (for example Python 2.X had separate int and long types for these cases).

However, it's hard to use this strategy for floats, because floats need all 64 bits (or 32 bits for single-precision, same difference). There's a trick called "NaN boxing" which makes use of the large number of NaNs in the float representation, but read the paper if you want more on that.

The authors' insight is that, suppose you have a three-bit tag and 011 is the tag for floats. By totally random chance, _some_ floats will end in 011; you can represent those as immediates with those tag bits. Obviously, that's unlikely, though you can raise the chances by using, like, 010, 011, 100, and 101 all as float tags. Still, the low bits are a bad choice. But what about high bits? Most floats have one of a couple high bit patterns, because most floats are either 0 or between, say, 1e-100 and 1e100. Floats outside that range can be boxed but since they're really rare it's not a big cost to box them.

So basically, we use high bits as our tag bits and map all the common float prefixes to float tags. This allows unboxing the vast majority of floats, which leads to big speedups on float-heavy benchmarks.

A personal note: I've been working in numerics and floating-point for a decade now and have had to deal with float boxing both from a research point of view (lots of runtime analysis systems for floats), from a user point of view (using unboxed float vectors for significant speedup in my own software), and from a teaching point of view (discussing boxing in my compilers class, using NaN-boxing as an example of cleverness).

This idea is so simple, so crazy, so stupid, and works so well, but I never thought of it. Bravo to the authors.

daanx 13 hours ago | parent | next [-]

> This idea is so simple, so crazy, so stupid, and works so well, but I never thought of it. Bravo to the authors.

Thanks for the nice summary -- looking forward to read the paper!

The same idea of self-tagging is actually also used in Koka language [1] runtime system where by default the Koka compiler only heap allocates float64's when their absolute value is outside the range [2e-511,2e512) and not 0, infinity, or NaN (see [2]). This saves indeed many (many!) heap allocations for float intensive programs.

Since Koka only uses 1 bit to distinguish pointers from values, another slightly faster option is to only box negative float64's but of course, negative numbers are still quite common so it saves less allocations in general.

[1] https://koka-lang.github.io/koka/doc/book.html#sec-value-typ...

[2] https://github.com/koka-lang/koka/blob/dev/kklib/src/box.c#L...

ps. If you enjoy reading about tagging, I recently wrote a note on efficiently supporting seamless large integer arithmetic (as used in Koka as well) and discuss how certain hardware instructions could really help to speed this up [3]:

[3] https://www.microsoft.com/en-us/research/uploads/prod/2022/0... (ML workshop 2022)

gus_massa 5 hours ago | parent | prev | next [-]

Nice explanation but it took me a while to understand the trick. They are hidding the tag in the "exponent" of the float, not in the "mantisa"!

rtpg 14 hours ago | parent | prev | next [-]

Do all float operations need to reconfirm those bits afterwards though? I suppose if you have some sort of JIT you can end up with a bunch of unboxed floats and would only pay the cost on boundaries though

pansa2 14 hours ago | parent | next [-]

> reconfirm those bits afterwards

Thanks - I hadn't thought about that but it seems to be the main downside of this approach. The benefit of NaN-boxing is that it reassigns values that are otherwise unused - floating-point calculations will never generate NaNs with those bit patterns.

AlotOfReading 9 hours ago | parent [-]

An additional wrinkle is that NaNs are a bit unstable and can have large performance penalties. You can't let the NaNs ever escape into arithmetic and you may even have issues even storing them in a register.

phire 10 hours ago | parent | prev | next [-]

Yes, but there should be some optimisation opportunities.

Off the top of my head: Any multiply by a constant less than 1.0 will never overflow the unboxed range (but might underflow) and there should be times when it's provably better to check the inputs are inside a range, rather than checking the outputs.

It's worth pointing out that these overflow/underflow checks will be very cold (on typical code). They won't waste much in the way of branch-prediction resources.

I wonder if it's worth taking advantage of floating point overflow/underflow exceptions. I think a multiplication by 2^767 will trigger an exception if the value would overflow, and the corresponding multiply by 2^-765 will catch underflows.

It's tempting to allocate two more tags for floats (001 and 010), covering the entire range from -2^257 to +2^257. It will be rare to actually see those small floats near zero, but it could be worth eliminating the possibility of underflows.

ithkuil 5 hours ago | parent | prev | next [-]

You check the tag before doing float operations

pansa2 3 hours ago | parent [-]

And afterwards, because floating-point arithmetic can change the value of the tag. This isn't necessary with NaN-boxing, because it uses NaN bit patterns that the hardware never generates.

lifthrasiir 14 hours ago | parent | prev [-]

Only when they have to be boxed, but yes if you are talking about that.

ndjdjddjsjj 10 hours ago | parent [-]

You need to check after the floating point operation though just in case. Or after the boundary where you pass the float to something else expecting this scheme.

pansa2 14 hours ago | parent | prev | next [-]

> This allows unboxing the vast majority of floats, which leads to big speedups on float-heavy benchmarks.

NaN-boxing allows all floats to be unboxed though. The main benefit of the self-tagging approach seems to be that by boxing some floats, we can make space for 64-bit pointers which are too large for NaN-boxing.

The surprising part of the paper is that "some floats" is only a small minority of values - not, say, 50% of them.

skybrian 10 hours ago | parent | next [-]

A small minority, but apparently it includes all the floats you’re likely to use. It seems the insight is that you only need 8 bits of exponent in most cases. (And single-precision floating point only has 8 bits of exponent.)

Most double-precision floats are never used because they have high exponents.

pansa2 3 hours ago | parent [-]

> A small minority, but apparently it includes all the floats you’re likely to use.

Sorry, I meant a small minority need to be boxed - all the floats you're likely to use can remain unboxed.

adgjlsfhk1 12 hours ago | parent | prev [-]

50% means you only get 1 tag bit.

also you totally can fit 64 bit pointers inside a NaN. 46 bit pointers are only 48 bits and you have 53 bits of NaN payload. (you also could get an extra 3 bits if you only allow storing 8 byte aligned pointers unboxed)

pansa2 12 hours ago | parent [-]

> 50% means you only get 1 tag bit.

That's enough to distinguish between "unboxed float" and "something else", where the latter can have additional tag bits.

> [64-bit] pointers are only 48 bits and you have 53 bits of NaN payload.

The paper specifically talks about support for "high memory addresses that do not fit in 48 bits". If you don't have to handle those high addresses, I don't think this approach has any benefits compared to NaN-boxing.

dzaima 2 hours ago | parent [-]

Of note is that even if you have some massive ≥2^48 data sources, you could still quite likely get away with having NaN-boxed pointers to within the low-size heap, with an extra indirection for massive data. This only would break apart if you managed to reach around 2^45 distinct referenceable objects, which you probably shouldn't ever have (esp. in a GCd language).

skybrian 10 hours ago | parent | prev | next [-]

It’s clever, but not random chance. That would be too much of a coincidence. They rotate the floats to make it happen the way they want.

It’s hardly random that only 8 bits of exponent are needed for many calculations.

jonnycomputer 15 hours ago | parent | prev | next [-]

Thank you for the clear explanation!

10 hours ago | parent | prev [-]
[deleted]