hardwaresofton | 3 days ago
> Personally I'm more annoyed of async-Rust itself than not having a blessed async solution in-tree. Having to just Arc<T> away things here and there because you can't do thread::scope(f) honestly just demonstrates how stackless coroutine is unreasonably hard to everyone.

Yeah, as annoying as this is, I think it actually played out to benefit Rust -- imagine if the churn we saw in tokio/async-std/smol/etc had played out in-tree? I think things might have been even worse. That said, stackless coroutines are certainly unreasonably hard.

> Back to the original topic, I bring this up because I believe the performance advantages claimed in these "fibers bad" papers are superficial, and the limit is almost the same (think 1.00 vs 1.02 level almost), even in languages which consider raw performance as a selling-point. In case you need the absolutely lowest overhead and latency you usually want the timing to be as deterministic as possible too, and it's not even a given in async-await solutions, you still need to be very careful about that.

Yeah, I don't think this is incorrect, and I'd love to see some numbers on it. The only thing that I can say definitively is that there is overhead to doing the literal stack switch. There's a reason async I/O got us past the C10k problem so handily.

One of the nice things about some recent Zig work was how clearly you can see how they do their stack switch -- you can literally jump into the Zig source code (on a branch, IIRC) and just read the ASM for various platforms that represents a user-space context switch.

Agree with the deterministic timing thing too -- this is one of the big points that people who only want to use threads (and are against tokio/etc) argue: the pure control and single-mindedness of a core against a problem is clearly simple and performant. Thread-per-core is still the top for performance, but IMO the ultimate is async-runtime-thread-per-core, because some (important) problems are embarrassingly concurrent.
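The Arc-vs-scope contrast from the quoted comment can be shown in a minimal sketch. `std::thread::scope` guarantees joins before it returns, so threads may borrow locals directly; async tasks on typical runtimes (e.g. tokio's `spawn`) require `'static` futures, which is what pushes people toward an `Arc` clone per task. The `Arc` half here just demonstrates the cloning pattern without pulling in a runtime:

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    let data = vec![1, 2, 3];

    // Scoped threads: each thread borrows `data` directly, no Arc needed,
    // because the scope joins all threads before returning.
    let sum: i32 = thread::scope(|s| {
        let handle = s.spawn(|| data.iter().sum());
        handle.join().unwrap()
    });
    assert_eq!(sum, 6);

    // Async-style sharing: a 'static task can't borrow `data`, so you
    // wrap it in an Arc and clone a handle to move into each task.
    let shared = Arc::new(data);
    let for_task = Arc::clone(&shared); // what you'd move into a spawned task
    assert_eq!(for_task.iter().sum::<i32>(), 6);
}
```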
> Let alone Python.

Yeah, I'm really trying not to comment much on Python because I'm out of my depth and I think there are... I mean, I'm of the opinion that JS (really TS) is the better scripting language (better bolt-on type systems, got threads faster, never had a GIL, lucked into being async-forward and getting all its users used to async behavior), but obviously Python is a powerhouse and a crucially important ecosystem (excluding the AI hype).
rfoo | 2 days ago | parent
> The only thing that I can say definitively is that there is overhead to doing the literal stack switch. There's a reason async I/O got us past the C10k problem so handily.

You can also say that not having to constantly allocate & deallocate stuff, and relying on a bump allocator (the stack) most of the time, more than compensates for the stack-switch overhead. Depends on workload of course :p

IMO it's more about memory, and nowadays it might just be path dependence. Back in C10k days address spaces were 32-bit (ok, 31-bit really), and 2**31 / 10k ~= 210KiB. That makes static-ish stack management really messy, so you really need to extract the (minimal) state explicitly and pack it on the heap. Now we happily run ASAN, which allocates 1TiB (2**40) of address space during startup for a bitmap of the entire AS (2**48), and nobody complains.
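The back-of-envelope above checks out; a tiny sketch of the same arithmetic (the 10k-connection count and the ~2 GiB usable space are the comment's assumptions, not measured values):

```rust
fn main() {
    // Assumed C10k-era budget: a 31-bit effective address space
    // divided evenly across 10,000 connection stacks.
    let address_space: u64 = 1 << 31; // 2 GiB
    let connections: u64 = 10_000;
    let per_connection_kib = address_space / connections / 1024;
    println!("~{} KiB of stack per connection", per_connection_kib); // ~209 KiB
}
```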