rfoo 3 days ago

Is there really something to lose? How often do we see "stackless coroutine" listed as an advantage in Rust-vs-Go network-programming flamewars?

hardwaresofton 3 days ago

Rust vs Go is IMO not even a reasonable discussion to have -- you really have to be careful about the axes on which you compare them for the thought exercise to make any sense.

If you want a productive backend language that developers can pick up quickly and use to write low-latency applications, Go is an easy winner.

If you want a systems language, then Rust is the only choice between the two. Rust is harder to pick up, but it produces faster, more correct code and obviously has a much stronger type system.

They could be directly comparable, but usually only if your goal is as abstract as "I need a language to write my backend in" -- and you can see how hard it is to pull general meaning from a question that abstract.

> How often do we see "stackless coroutine" listed as an advantage in Rust-vs-Go network-programming flamewars?

I'm just going to go out on a limb and say network programming is, bar none, more efficient in Rust. I could see C/C++ beating out Rust in that domain, but I would not expect Go to, especially if you're reaching for unsafe Rust.

Again, it also depends on what you mean by "network programming", and whether development speed/ecosystem/etc are a concern.

To avoid the flamewars and stay on topic: it's one of those fundamentally limiting choices -- most will never hit the limit (most apps really aren't going to do more than 1000 RPS of meaningful requests), but if you do actually need the perf it's quite disappointing.

People are very annoyed with Rust for not having a blessed async solution in-tree, and while that's produced a ton of churn, I think it was ultimately beneficial for reasons like this. You can do either one of these in Rust; the choice isn't made for you.

That said, the OP's suggestion seems to be adding virtual threads on top rather than swapping out asyncio for them, so maybe there's a world where people just use what they want, and Python can interop between the two as necessary.

rfoo 3 days ago

Good points.

Personally I'm more annoyed by async-Rust itself than by the lack of a blessed async solution in-tree. Having to Arc<T> things away here and there because you can't do thread::scope(f) honestly just demonstrates how unreasonably hard stackless coroutines are for everyone.
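
To make that concrete, here's a minimal sketch of the contrast (the async half is left as comments, since it assumes the tokio crate, which isn't part of std):

    use std::thread;

    fn main() {
        let data = vec![1, 2, 3];

        // OS threads: thread::scope proves to the compiler that the
        // spawned thread is joined before `data` can drop, so a plain
        // borrow works -- no Arc needed.
        thread::scope(|s| {
            s.spawn(|| println!("sum = {}", data.iter().sum::<i32>()));
        });

        // A typical async runtime requires spawned tasks to be 'static,
        // since nothing ties a task's lifetime to this stack frame, so
        // the borrow has to become shared ownership (sketch assuming
        // the tokio crate):
        //
        //     let data = std::sync::Arc::new(data);
        //     let d = std::sync::Arc::clone(&data);
        //     tokio::spawn(async move {
        //         println!("sum = {}", d.iter().sum::<i32>());
        //     });
    }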

Back to the original topic: I bring this up because I believe the performance advantages claimed in these "fibers bad" papers are superficial, and the limit is almost the same (think 1.00 vs 1.02 levels of almost), even in languages that treat raw performance as a selling point. And if you need the absolute lowest overhead and latency, you usually want the timing to be as deterministic as possible too, which isn't a given even in async-await solutions; you still need to be very careful about that.

Let alone Python.

hardwaresofton 3 days ago

> Personally I'm more annoyed by async-Rust itself than by the lack of a blessed async solution in-tree. Having to Arc<T> things away here and there because you can't do thread::scope(f) honestly just demonstrates how unreasonably hard stackless coroutines are for everyone.

Yeah, as annoying as this is, I think it actually played out to Rust's benefit -- imagine if the churn we saw in tokio/async-std/smol/etc. had played out in-tree? I think things might have been even worse.

That said, stackless coroutines are certainly unreasonably hard.

> Back to the original topic: I bring this up because I believe the performance advantages claimed in these "fibers bad" papers are superficial, and the limit is almost the same (think 1.00 vs 1.02 levels of almost), even in languages that treat raw performance as a selling point. And if you need the absolute lowest overhead and latency, you usually want the timing to be as deterministic as possible too, which isn't a given even in async-await solutions; you still need to be very careful about that.

Yeah, I don't think this is incorrect, and I'd love to see some numbers on it. The only thing that I can say definitively is that there is overhead to doing the literal stack switch. There's a reason async I/O got us past the C10k problem so handily.

One of the nice things about some recent Zig work was how clearly you can see how they do their stack switch -- you can literally jump into the Zig source code (on a branch, IIRC) and read the ASM for various platforms that implements a user-space context switch.

Agree with the deterministic-timing point too -- it's one of the big arguments from people who only want to use threads (and are against tokio/etc.): the pure control and single-mindedness of a core against a problem is clearly simple and performant. Thread-per-core is still the top for performance, but IMO the ultimate is an async runtime thread per core, because some (important) problems are embarrassingly concurrent.
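
Roughly that shape, as a sketch (assuming the tokio crate; actual core pinning, e.g. via the core_affinity crate, is omitted):

    use std::thread;

    fn main() {
        // One OS thread per core, each driving its own single-threaded
        // async runtime, so tasks never migrate across threads and
        // per-"core" state needs no synchronization.
        let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
        let handles: Vec<_> = (0..cores)
            .map(|id| {
                thread::spawn(move || {
                    let rt = tokio::runtime::Builder::new_current_thread()
                        .enable_all()
                        .build()
                        .expect("failed to build runtime");
                    rt.block_on(async move {
                        // Spawn and drive this core's share of the work here.
                        println!("runtime {id} up");
                    });
                })
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }
    }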

> Let alone Python.

Yeah, I'm really trying not to comment much on Python because I'm out of my depth and I think there are...

I mean, I'm of the opinion that JS (really TS) is the better scripting language (better bolt-on type systems, got threads faster, never had a GIL, lucked into being async-forward and getting all its users used to async behavior), but obviously Python is a powerhouse and a crucially important ecosystem (excluding the AI hype).

rfoo 2 days ago

> The only thing that I can say definitively is that there is overhead to doing the literal stack switch. There's a reason async I/O got us past the C10k problem so handily.

You can also say that relying on a bump allocator (the stack) most of the time, instead of constantly allocating and deallocating stuff, more than compensates for the stack-switch overhead. Depends on the workload, of course :p

IMO it's more about memory, and nowadays it might just be path dependence. Back in C10k days address spaces were 32-bit (OK, 31-bit really), and 2**31 / 10k ~= 210 KiB per connection. That makes static-ish stack management really messy, so you really needed to extract the (minimal) state explicitly and pack it on the heap.
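
A toy illustration of the two shapes (hypothetical names, just to show where the per-connection state lives):

    // Stackful / thread-per-connection style: scratch space is a local,
    // living in the coroutine's (or thread's) stack frame. Allocation is
    // a bump of the stack pointer; deallocation is free on return.
    fn handle_conn_stackful() -> usize {
        let mut buf = [0u8; 4096];
        buf[0] = 1;
        buf.iter().map(|&b| b as usize).sum()
    }

    // Extracted-state style (what C10k-era servers did by hand): the
    // per-connection state is packed explicitly and kept on the heap,
    // because there's no per-connection stack to keep it on.
    struct ConnState {
        buf: Box<[u8; 4096]>,
    }

    fn handle_conn_extracted(state: &mut ConnState) -> usize {
        state.buf[0] = 1;
        state.buf.iter().map(|&b| b as usize).sum()
    }

    fn main() {
        println!("{}", handle_conn_stackful());
        let mut state = ConnState { buf: Box::new([0u8; 4096]) };
        println!("{}", handle_conn_extracted(&mut state));
    }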

Now we happily run ASAN, which allocates 1 TiB (2**40) of address space during startup for a bitmap of the entire AS (2**48), and nobody complains.