nasretdinov 6 hours ago

Nice to see common and natural patterns getting their performance improved. In theory, appending to a slice could be handled with stack growth alone, but that would require leaving large gaps between goroutine stacks and mapping them lazily on access, instead of moving goroutines to new contiguous blocks as is done right now. But given how many questionable changes that would require from the runtime, it's certainly not going to happen :)
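For readers who want something concrete, here is a minimal sketch of the kind of append pattern being discussed. The function `collectEven` is a hypothetical example, not code from the article. Whether the backing array ends up on the stack or the heap depends on the Go version and on escape analysis, so the allocation count it prints will vary.

```go
package main

import (
	"fmt"
	"testing"
)

// collectEven is a hypothetical example of the pattern: the slice starts
// empty and grows through append, so its final size is not known up front.
// The slice never leaves the function, which is what gives the compiler
// room to avoid the heap.
func collectEven(limit int) int {
	var evens []int
	for i := 0; i < limit; i++ {
		if i%2 == 0 {
			evens = append(evens, i)
		}
	}
	sum := 0
	for _, v := range evens {
		sum += v
	}
	return sum
}

func main() {
	// AllocsPerRun reports the average number of heap allocations per call.
	allocs := testing.AllocsPerRun(1000, func() { _ = collectEven(8) })
	fmt.Printf("sum=%d, heap allocations per call=%v\n", collectEven(8), allocs)
}
```

Running it under different Go releases is one way to see whether a given toolchain keeps the small backing array off the heap.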

ivanjermakov 6 hours ago | parent [-]

Having big stack frames is bad for cache locality. The stack is not something magical: it's mapped to the same physical memory as the heap and needs to be loaded. Pretty sure such an optimization would reduce performance in most cases.

wahern 4 hours ago | parent [-]

In the case where you're using the top of the stack as a, well, stack, I don't see the problem. It would only work if you're not interleaving processing of dynamically-sized objects and if function codegen works out. It's similar to TCO in the sense of maintaining certain invariants across calls (e.g. no temporaries need be preserved), and actually in languages with TCO, like Lua, you can hack an application-level stack data structure using tail recursion (and coroutines/threads if you need more than one) that can sometimes be more performant or more convenient than using a native data structure.

There's been at least one experiment (posted a few years ago to HN) where someone benchmarked a stackful coroutine implementation with hundreds of thousands (millions?) of stacks that could grow contiguously on demand up to, e.g., 2MB, but were initially minimally sized and didn't reserve the maximum stack size upfront. The bottleneck was the VMA bookkeeping: the syscalls, exploding the page table, TLB flushing, etc. In principle it could work well and be even more performant than existing solutions, and it might work better today given Linux 6.13's lightweight guard page feature, MADV_GUARD_INSTALL, but we probably still need more architectural support from the system (kernel, if not hardware) to make it performant and competitive with language-level solutions like goroutines, Rust async, etc.
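A rough sketch of the reserve-then-grow idea, assuming Linux and Go's `syscall` package. The `lazyStack` type and its methods are hypothetical names made up for illustration; this is not the benchmarked implementation, and it does not use MADV_GUARD_INSTALL. It reserves 2MB of address space as inaccessible and makes pages usable from the top down, the way a stack grows.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

const maxStack = 2 << 20 // 2MB reserved per stack, the figure from the comment

var pageSize = os.Getpagesize()

// lazyStack is a hypothetical type: a reserved address range of which only
// the topmost `committed` bytes are readable and writable.
type lazyStack struct {
	mem       []byte // the whole reservation, initially PROT_NONE
	committed int    // bytes made accessible so far, counted from the top
}

// reserve claims address space without making any of it accessible.
func reserve() (*lazyStack, error) {
	mem, err := syscall.Mmap(-1, 0, maxStack,
		syscall.PROT_NONE, syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		return nil, err
	}
	return &lazyStack{mem: mem}, nil
}

// grow makes at least n bytes at the top of the reservation accessible,
// rounding up to whole pages. Each real growth costs one mprotect syscall.
func (s *lazyStack) grow(n int) error {
	want := (n + pageSize - 1) / pageSize * pageSize
	if want > len(s.mem) {
		return fmt.Errorf("stack overflow: want %d bytes, reserved %d", want, len(s.mem))
	}
	if want <= s.committed {
		return nil
	}
	top := len(s.mem)
	region := s.mem[top-want : top-s.committed]
	if err := syscall.Mprotect(region, syscall.PROT_READ|syscall.PROT_WRITE); err != nil {
		return err
	}
	s.committed = want
	return nil
}

// release returns the whole reservation to the kernel.
func (s *lazyStack) release() error { return syscall.Munmap(s.mem) }

func main() {
	s, err := reserve()
	if err != nil {
		panic(err)
	}
	defer s.release()
	if err := s.grow(1); err != nil {
		panic(err)
	}
	s.mem[len(s.mem)-1] = 42 // the top page is now writable
	if err := s.grow(100000); err != nil {
		panic(err)
	}
	fmt.Println("committed bytes:", s.committed)
}
```

Every `grow` that changes protections is a syscall, and changing the protection of part of a mapping splits it into separate VMAs. Multiplied across hundreds of thousands of stacks, that is the bookkeeping cost described above.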