Remix.run Logo
Someone a day ago

Some of those could be partially fixed in hardware. Examples:

- returns can run in as good as zero time

- when calling a word, the CPU could prefetch the cache line containing the next word to be called, on the assumption that it would be called soon (an assumption that would be correct for many “almost leaf” calls)

- the top of the stack could be kept in register-speed memory.

For an example, see https://users.ece.cmu.edu/~koopman/stack_computers/sec4_4.ht...:

“The internal structure of the NC4016 is designed for single clock cycle instruction execution. All primitive operations except memory fetch, memory store, and long literal fetch execute in a single clock cycle. This requires many more on-chip interconnection paths than are present on the Canonical Stack Machine, but provides much better performance.

[…]

The NC4016 subroutine return bit allows combining a subroutine return with other instructions in a similar manner. This results in most subroutine exit instructions executing "for free" in combination with other instructions. An optimization that is performed by NC4016 compilers is tail-end recursion elimination. Tail-end recursion elimination involves replacing a subroutine call/subroutine exit instruction pair by an unconditional branch to the subroutine that would have been called.”

(¿Almost?) all modern hardware is designed for other language paradigms, though, so these won’t help with hardware you can buy.