Joker_vD 10 hours ago

So every function call will need to spill even more call-clobbered registers to the stack!

Like, I get that leaf functions with truly huge computational cores are a thing that would benefit from more ISA-visible registers, but... don't we have GPUs for that now? And TPUs? NPUs? Whatever those things are called?

cvoss 8 hours ago | parent | next [-]

With an increase in available registers, every value that a compiler might newly choose to keep in a register is a value that would previously have lived in the local stack frame anyway.

It's up to the compiler to decide how many registers it needs to preserve at a call. It's also up to the compiler to decide which registers shall be the call-clobbered ones. "None" is a valid choice here, if you wish.
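
(A rough sketch of that knob in practice, assuming GCC on x86-64; the file name, register picks and helper() are made up for illustration, and moving a register across the line is only safe if every object file involved is built the same way.)

    /* live_across_call.c -- one value live across an opaque call.
     * Whether t survives in a register or gets spilled depends on how
     * many registers the convention treats as call-saved, and GCC lets
     * you move that line:
     *
     *   gcc -O2 -S live_across_call.c                    # default split
     *   gcc -O2 -S -fcall-saved-r10 live_across_call.c   # r10 now call-saved
     *   gcc -O2 -S -fcall-used-rbx live_across_call.c    # rbx now clobbered
     */
    long helper(long a, long b);       /* opaque: forces a real call */

    long caller(long x, long y, long z)
    {
        long t = x * y + z;            /* live across the call below */
        long u = helper(x, y);
        return t + u;                  /* t had to survive the call somehow */
    }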

jandrewrogers 10 hours ago | parent | prev | next [-]

Most function calls are aggressively inlined by the compiler such that they are no longer "function calls". More registers will make that even more effective.
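
(A minimal sketch of that effect; square() and the file name are invented, and whether the call actually disappears depends on the compiler and optimization level.)

    /* inline_demo.c -- at -O2 most compilers fold square() into its
     * callers, so there is no call, no clobbered-register question,
     * and nothing to spill for it:
     *
     *   gcc -O2 -S inline_demo.c   # typically no "call" instruction left
     */
    static long square(long x) { return x * x; }

    long sum_of_squares(long a, long b)
    {
        return square(a) + square(b);
    }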

burnt-resistor 8 hours ago | parent [-]

That depends on whether something like LTO is possible and the function isn't declared with one of the plethora of calling conventions. What it means is that new calling conventions will be needed, and that this new platform will be able to pass arguments in registers for higher-arity functions.
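
(To make the arity point concrete: under today's x86-64 SysV convention only the first six integer arguments ride in registers (rdi, rsi, rdx, rcx, r8, r9), so g and h below already go via the stack. A register-richer ISA could define a convention, or let LTO invent a private one for internal functions, that keeps all eight in registers. The function is a made-up example.)

    /* Eight integer arguments: the last two are passed on the stack
     * under the x86-64 SysV ABI.  A register-richer convention could
     * pass all of them in registers, provided caller and callee agree. */
    long sum8(long a, long b, long c, long d,
              long e, long f, long g, long h)
    {
        return a + b + c + d + e + f + g + h;
    }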

throwaway17_17 10 hours ago | parent | prev | next [-]

Why does having more registers lead to spilling? I would assume, probably incorrectly, that more registers means less spill. Are you talking about calls inside other calls, which cause the outer-scope arguments to be preemptively spilled so the inner-scope data can be pre-placed in registers?

BeeOnRope 9 hours ago | parent | next [-]

More registers leads to less spilling not more, unless the compiler is making some really bad choices.

An easy way to see that is that the system with more registers can always use the same register allocation as the one with fewer, ignoring the extra registers, if that's profitable (i.e. it's not forced into using extra caller-saved registers if it doesn't want to).

Joker_vD 9 hours ago | parent | prev | next [-]

So, let's take a function with 40 alive temporaries at a point where it needs to call a helper function of, say, two arguments.

On a 16-register machine with 9 call-clobbered registers and 7 call-invariant ones (one of which is the stack pointer), we put 6 temporaries into call-invariant registers (so there are 6 spills in the prologue of this big function) and another 9 into the call-clobbered registers; 2 of those 9 are the helper function's arguments, but the 7 other temporaries have to be spilled to survive the call. And the remaining 25 temporaries live on the stack in the first place.

If we instead take a machine with 31 registers, 19 of them call-clobbered and 12 call-invariant (one of which is the stack pointer), we can put 11 temporaries into call-invariant registers (so there are 11 spills in the prologue of this big function) and another 19 into the call-clobbered registers; 2 of those 19 are the helper function's arguments, so the 17 other temporaries have to be spilled to survive the call. And the remaining 10 temporaries live on the stack in the first place.

So, there seems to be more spilling/reloading whether you count pre-emptive spills or the on-demand-at-the-call-site spills, at least to me.
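
(A quick reproduction of that arithmetic, taking the register splits above as given and counting values rather than individual load/store instructions, so the two cases can be compared side by side.)

    /* tally.c -- just the arithmetic from the 40-temporaries example. */
    #include <stdio.h>

    int main(void)
    {
        const int temps = 40, args = 2;
        /* { usable call-invariant regs, call-clobbered regs } */
        const int machines[2][2] = { { 6, 9 }, { 11, 19 } };
        const char *names[2]     = { "16 regs", "31 regs" };

        for (int i = 0; i < 2; i++) {
            int saved      = machines[i][0];
            int clobbered  = machines[i][1];
            int prologue   = saved;                     /* spilled once in the prologue */
            int call_site  = clobbered - args;          /* spilled around the call      */
            int stack_only = temps - saved - clobbered; /* never promoted to a register */

            printf("%s: %d prologue + %d call-site spills, %d stack-resident, "
                   "%d values in memory total\n",
                   names[i], prologue, call_site, stack_only,
                   prologue + call_site + stack_only);
        }
        return 0;
    }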

csjh 8 hours ago | parent | next [-]

You’re missing the fact that the compiler isn’t forced to fill every register in the first place. If it was less efficient to use more registers, the compiler simply wouldn’t use more registers.

The actual counter-argument here is that in either case the temporaries have to end up on the stack at some point anyway, so you'd need to look at the total number of loads/stores around the call site in general.

dahart 7 hours ago | parent | prev | next [-]

This argument doesn't make sense to me. Generally speaking, having more registers does not result in more spilling, it results in less spilling. Obviously, if you have 100 registers here, there's no spilling at all. And think through what happens in your example with a 4-register machine or a 1-register machine: all values must spill. You can demonstrate the general principle yourself by limiting the number of registers and then increasing it using the -ffixed-reg flags. In CUDA you can cap the register count and basically watch the number of spills go up by one every time you take away a register and go down by one every time you add one.
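
(A sketch of that experiment with GCC; the file name, register picks and helper() are invented, and the exact spill counts will vary by compiler and version. nvcc's --maxrregcount=N is the analogous knob on the CUDA side.)

    /* spillwatch.c -- take registers away and watch spills appear.
     *
     *   gcc -O2 -S spillwatch.c                                      # baseline
     *   gcc -O2 -S -ffixed-r12 spillwatch.c                          # one fewer register
     *   gcc -O2 -S -ffixed-r12 -ffixed-r13 -ffixed-r14 spillwatch.c  # three fewer
     *
     * Each -ffixed-<reg> removes a register from the allocator's pool;
     * grep the generated .s for rsp-relative stores to see the spills.
     */
    long helper(long, long);

    long busy(long a, long b, long c, long d)
    {
        long t0 = a * b, t1 = c * d, t2 = a + c, t3 = b + d;
        long t4 = helper(t0, t1);      /* several values live across this call */
        return t0 + t1 + t2 + t3 + t4;
    }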

mjevans 8 hours ago | parent | prev | next [-]

I recalled there were some new instructions added that greatly help with this. Unfortunately I'm not finding any good _webpages_ that describe the operation generally to give me a good overview / refresher. Everything seems to either directly quote published PDF documents or otherwise not present the information in a form that's effective for end use, e.g. https://www.felixcloutier.com/x86/ -- and availability is problematic for even slightly older silicon: https://en.wikipedia.org/wiki/X86-64

- XSAVE / XRSTOR

- XSAVEOPT / XRSTOR

- XSAVEC / XRSTOR

- XSAVES / XRSTORS

cv5005 8 hours ago | parent | prev [-]

A good compiler will only do that if register spilling is more efficient than using more stack variables, so I don't really see the problem.

CamelCaseCondo 10 hours ago | parent | prev [-]

OP is probably referring to the push all/pop all approach.

Joker_vD 9 hours ago | parent [-]

No, I'm not. I use a common "spill definitely-reused call-invariant registers in the prologue, spill call-clobbered registers that need to survive a call at precisely the call site" approach; see the sibling comment for the arithmetic.

saagarjha 6 minutes ago | parent [-]

That’s the push all/pop all approach.

bjourne 8 hours ago | parent | prev [-]

Most modern compilers for modern languages do an insane amount of inlining, so the problem you're mentioning isn't a big issue. And, basically, GPUs and TPUs can't handle branches. CPUs can.