hardwaresofton 3 days ago

If I'm understanding the suggestion, the proposed python virtual threads are ~= fibers ~= stackful coroutines.
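Roughly, the difference it would make in user code (the asyncio half below is real; the virtual-thread API sketched in the comments is made up purely to illustrate the shape):

    import asyncio

    # Today's stackless style: every suspension point is an explicit "await"
    async def fetch(url):
        await asyncio.sleep(0.1)          # stand-in for real async I/O
        return f"body of {url}"

    async def main():
        bodies = await asyncio.gather(*(fetch(u) for u in ["a", "b", "c"]))
        print(bodies)

    asyncio.run(main())

    # A fiber/virtual-thread style would look like ordinary blocking code,
    # with the runtime switching stacks under the hood (hypothetical API):
    #
    #   def fetch(url):
    #       time.sleep(0.1)               # "blocks" only this virtual thread
    #       return f"body of {url}"
    #
    #   vthreads.spawn(fetch, "a")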

I have this paper saved in my bookmarks as "fibers bad":

https://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p13...

AFAIK async/await and stackless coroutines are the most efficient way to do async operations, even if they are unwieldy and complicated. Is there something to be gained here other than usability?

Python is certainly in the business of trading efficiency and optimal solutions for readability and some notion of simplicity, and that has held it back (along with all the programmers who overindex on the pythonic way; it's incredibly sad that all of modern ML is essentially built on Python) IMO, but the language is certainly easy to write.

[EDIT] - wanted to add here: people spend a lot of characters complaining about tokio in Rust land, but I honestly think it's just fine. It was REALLY rough early on, but at this point the ergonomics are easy to pick up and it's quite performant out of the box. It's not perfect, but it's really quite pleasing to use and understand (i.e. running into a bug/surprising behavior almost always ends in understanding more about the fundamental tradeoffs and system design of async systems).

Swift doing something similar seems to be an endorsement of the approach. In fact, IIRC this might be where I saw that first paper? Maybe it was a HN comment that pointed to it:

https://forums.swift.org/t/why-stackless-async-await-for-swi...

Rust and Swift are the most impressive modern languages IMO, the sheer amount of lessons they've taken from previous generations of PL is encouraging.

gpderetta 2 days ago | parent | next [-]

That paper is specifically about C++, and even there not everybody agrees (there is still a proposal to add stackful coroutines). One claim in the paper is that stackful coroutines have a higher stack-switching cost. I disagree, but also this is completely irrelevant in Python, where spending a couple of additional nanoseconds is completely hidden by the inefficiency of the interpreter.

It is true that stackless coroutines are more memory efficient when you are running millions of lightweight tasks. But you won't be running millions of tasks (or even hundreds of thousands) in python, so it is a non-issue.
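If you want to put a rough number on "more memory efficient", here's a Unix-only sketch (ru_maxrss is KiB on Linux and bytes on macOS, so treat the result as ballpark):

    import asyncio, resource

    async def worker():
        await asyncio.sleep(0.5)

    async def main(n=100_000):
        before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        await asyncio.gather(*(worker() for _ in range(n)))
        after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        # Each asyncio task typically costs a couple of KiB of heap,
        # versus a whole reserved stack per task if these were OS threads.
        print(f"{n} tasks -> ~{(after - before) / n:.1f} KiB extra RSS per task")

    asyncio.run(main())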

There is really no reason for python to have chosen the async/await model.

janalsncm 2 days ago | parent | prev | next [-]

> Python is certainly in the business of trading efficiency and optimal solutions for readability and some notion of simplicity, and that has held it back

Sometimes a simple, suboptimal solution is faster in wall-clock time than an optimal one. You have to write the code and then execute it, after all.

As for why ML is dominated by Python, I feel like it just gets to the point quicker than other languages. In Java or TypeScript or even Rust there are just too many other things going on; writing even a basic training loop in Java would be a nightmare.
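For instance, a complete (toy) training loop in plain numpy, no framework, just to show how little ceremony is involved:

    import numpy as np

    # Toy data: 2 features, linearly separable labels
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)

    # Logistic regression by plain gradient descent
    w, b, lr = np.zeros(2), 0.0, 0.1
    for epoch in range(100):
        p = 1 / (1 + np.exp(-(X @ w + b)))     # sigmoid
        w -= lr * (X.T @ (p - y)) / len(y)     # gradient of the log loss
        b -= lr * (p - y).mean()

    print(f"train accuracy: {((p > 0.5) == y).mean():.2f}")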

hardwaresofton 2 days ago | parent [-]

I do wonder how much hand-wringing there is around tools like uv. The perf of Rust with the usability of Python, if everyone just ignores the binary blobs and FFI involved.

I personally think uv is a huge bright spot in the Python ecosystem and think others will follow suit (i.e. other langs embedding Rust in their tooling for speedups). Wonder what the rest of the ecosystem thinks.

Maybe the uv approach is really the best of both worlds: Python's DX with Rust's rigor.

llamavore 2 days ago | parent [-]

Totally agree, the FFI escape hatch and excellent tooling from Rust (maturin, PyO3, etc.) mean so many Python problems can just be solved with Rust. Which raises the question: has anyone tried doing a green-thread implementation in Rust? Maybe offload some of the dynamically eval'd Python code to a separate process, maybe with https://github.com/RustPython/RustPython
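The separate-process shape is easy to sketch with nothing but the stdlib (plain exec() in a child process standing in for RustPython or any other embedded interpreter):

    import multiprocessing as mp

    def run_snippet(src, out):
        # Stand-in for handing the snippet to a separate interpreter
        # (RustPython or otherwise); here it's just exec() in a child process.
        env = {}
        exec(src, env)
        out.put(env.get("result"))

    if __name__ == "__main__":
        q = mp.Queue()
        p = mp.Process(target=run_snippet, args=("result = sum(range(10))", q))
        p.start()
        print(q.get())   # 45
        p.join()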

rfoo 3 days ago | parent | prev [-]

Is there really something to lose? How often do we see "stackless coroutine" listed as an advantage in Rust-vs-Go network programming flamewars?

hardwaresofton 3 days ago | parent [-]

Rust vs Go is IMO not even a reasonable discussion to have -- you really have to be careful about the axes on which you compare them for the thought exercise to make any sense.

If you want a backend language that developers can pick up quickly, be productive in relatively fast, and write low-latency applications with, Go is an easy winner.

If you want a systems language, then Rust is the only choice between those two. Rust is harder to pick up, but produces more correct and faster code, with obviously a much stronger type system.

They could be directly comparable, but usually only if your goal is as abstract as "I need a language to write my backend in". And you can see how difficult it is to pull general meaning from a question that abstract.

> How often do we see "stackless coroutine" listed as an advantage in Rust-vs-Go network programming flamewars?

I'm just going to go out on a limb and say network programming is, bar none, more efficient in Rust. I could see C/C++ beating out Rust in that domain, but I would not expect Go to do so, especially if you're reaching for unsafe Rust.

Again, it also depends on what you mean by "network programming", and whether development speed/ecosystem/etc are a concern.

To avoid the flamewars and stay on topic: the choice of concurrency model is one of those fundamentally limiting things -- most will never hit the limit (most apps really aren't going to do more than 1000 RPS of meaningful requests), but if you do actually need the perf, it's quite disappointing.

People are very annoyed with Rust for not having a blessed async solution in-tree, and while it's produced a ton of churn I think it was ultimately beneficial for reasons like this. You can do either one of these in Rust, the choice isn't made for you.

That said, the OP's suggestion seems to be adding virtual threads on top rather than swapping out asyncio for virtual threads, so maybe there's a world where people just use what they want, and Python can interop between the two as necessary.

rfoo 3 days ago | parent [-]

Good points.

Personally I'm more annoyed by async Rust itself than by not having a blessed async solution in-tree. Having to just Arc<T> away things here and there because you can't do thread::scope(f) honestly just demonstrates how unreasonably hard stackless coroutines are for everyone.

Back to the original topic, I bring this up because I believe the performance advantages claimed in these "fibers bad" papers are superficial, and the practical limit is almost the same (think 1.00 vs 1.02 level), even in languages that consider raw performance a selling point. And if you need the absolute lowest overhead and latency, you usually want the timing to be as deterministic as possible too, which isn't a given in async/await solutions either; you still need to be very careful about that.

Let alone Python.

hardwaresofton 3 days ago | parent [-]

> Personally I'm more annoyed by async Rust itself than by not having a blessed async solution in-tree. Having to just Arc<T> away things here and there because you can't do thread::scope(f) honestly just demonstrates how unreasonably hard stackless coroutines are for everyone.

Yeah, as annoying as this is, I think it actually played out to benefit Rust -- imagine if the churn that we saw in tokio/async-std/smol/etc had played out in-tree? I think things might have been even worse.

That said, stackless coroutines are certainly unreasonably hard.

> Back to the original topic, I bring this up because I believe the performance advantages claimed in these "fibers bad" papers are superficial, and the practical limit is almost the same (think 1.00 vs 1.02 level), even in languages that consider raw performance a selling point. And if you need the absolute lowest overhead and latency, you usually want the timing to be as deterministic as possible too, which isn't a given in async/await solutions either; you still need to be very careful about that.

Yeah, I don't think this is incorrect, and I'd love to see some numbers on it. The only thing that I can say definitively is that there is overhead to doing the literal stack switch. There's a reason async I/O got us past the C10k problem so handily.
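For reference, the now-boring shape of it in Python: one cheap task per connection, everything multiplexed by the event loop instead of one OS thread per connection (a minimal echo server, not production code):

    import asyncio

    async def handle(reader, writer):
        # One lightweight task per connection instead of one OS thread each
        while data := await reader.read(1024):
            writer.write(data)
            await writer.drain()
        writer.close()
        await writer.wait_closed()

    async def main():
        server = await asyncio.start_server(handle, "127.0.0.1", 8888)
        async with server:
            await server.serve_forever()

    asyncio.run(main())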

One of the nice things about some recent Zig work is how clearly you can see how they do their stack switch -- you can literally jump into the Zig source code (on a branch, IIRC) and read the ASM for the user-space context switch on various platforms.

Agree with the deterministic timing thing too -- it's one of the big arguments from people who only want to use threads (and are against tokio/etc): the pure control and single-mindedness of a core against a problem is clearly simple and performant. Thread-per-core is still the top for performance, but IMO the ultimate is an async runtime running thread-per-core, because some (important) problems are embarrassingly concurrent.

> Let alone Python.

Yeah, I'm really trying not to comment much on Python because I'm out of my depth and I think there are...

I mean I'm of the opinion that JS (really TS) is the better scripting language (better bolt-on type systems, got threads faster, never had a GIL, lucked into being async-forward and getting all its users used to async behavior), but obviously Python is a powerhouse and a crucially important ecosystem (excluding the AI hype).

rfoo 2 days ago | parent [-]

> The only thing that I can say definitively is that there is overhead to doing the literal stack switch. There's a reason async I/O got us past the C10k problem so handily.

You can also say that not having to constantly allocate & deallocate stuff, and being able to rely on a bump allocator (the stack) most of the time, more than compensates for the stack-switch overhead. Depends on the workload of course :p

IMO it's more about memory, and nowadays it might just be path dependence. Back in the C10k days address spaces were 32-bit (ok, 31-bit really), and 2**31 / 10k ~= 210 KiB per connection. That makes static-ish stack management really messy, so you really needed to extract the (minimal) state explicitly and pack it on the heap.
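That explicit state extraction is basically what stackless coroutines (or, in Python terms, generators) automate: the suspended state is a small heap-allocated frame rather than a dedicated stack. A tiny sketch:

    import sys

    def parser():
        # Per-"connection" state lives in this frame's locals, on the heap
        total = 0
        while True:
            chunk = yield total
            total += len(chunk)

    g = parser()
    next(g)                      # prime the generator
    print(g.send(b"hello"))      # 5
    print(sys.getsizeof(g))      # on the order of a couple hundred bytes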

Now we happily run ASan, which allocates 1 TiB (2**40) of address space during startup for a bitmap of the entire AS (2**48), and nobody complains.