LegNeato 2 hours ago

We aren't focused on performance yet (it is often workload- and executor-dependent, and as the post says we currently do some inefficient polling), but Rust futures compile down to state machines, so they are a zero-cost abstraction.
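
To make the "state machines" part concrete, here is a rough hand-written desugaring of a tiny async fn. This is illustrative only: the function, names, and values are made up, and the real compiler output differs in detail.

    use std::future::Future;
    use std::pin::Pin;
    use std::task::{Context, Poll};

    // Conceptually, the async fn being desugared is:
    //
    //     async fn double_then_add(x: u32, y: u32) -> u32 {
    //         let doubled = compute_elsewhere(x).await; // one suspension point
    //         doubled + y
    //     }
    //
    // Hand-written equivalent: an enum with one variant per suspension point.
    // It is a plain value (no heap allocation, no virtual dispatch), which is
    // what the "zero-cost" claim refers to.
    enum DoubleThenAdd {
        Start { x: u32, y: u32 },
        Resumed { doubled: u32, y: u32 },
        Done,
    }

    impl Future for DoubleThenAdd {
        type Output = u32;

        fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
            // Only u32 fields, so the enum is Unpin and we can safely get &mut.
            let this = self.get_mut();
            // Move the current state out and decide what to do next, much like
            // the compiler-generated resume function does conceptually.
            match std::mem::replace(this, DoubleThenAdd::Done) {
                DoubleThenAdd::Start { x, y } => {
                    // Suspension point: stash intermediate state, request
                    // another poll, and yield back to the executor.
                    *this = DoubleThenAdd::Resumed { doubled: x * 2, y };
                    cx.waker().wake_by_ref();
                    Poll::Pending
                }
                DoubleThenAdd::Resumed { doubled, y } => Poll::Ready(doubled + y),
                DoubleThenAdd::Done => panic!("future polled after completion"),
            }
        }
    }

Driving DoubleThenAdd::Start { x: 3, y: 4 } to completion with any executor (e.g. futures::executor::block_on) yields 10. The point is just that await points become enum variants and polling advances the state, with nothing inherently requiring allocation or dynamic dispatch.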

The anticipated benefits are similar to the benefits of async/await on the CPU: better ergonomics for developers writing concurrent code, better utilization of shared/limited resources, and fewer concurrency bugs.

textlapse 2 hours ago | parent [-]

Warp divergence is expensive - essentially the hardware keeps issuing "don't run this code" (masked-off) instructions to maintain SIMT.

GPUs are still not practically Turing-complete, in the sense that there are strict restrictions on loops/goto/IO/waiting (there are a bunch of band-aids to make it pretend it's not a functional programming model).
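
For what it's worth, the "don't run code" cost can be sketched on the CPU side. Here is a scalar Rust emulation of masked (predicated) execution of a diverged if/else; the warp size, inputs, and instruction counting are made up for illustration, and this is not real GPU code.

    // Emulation of how a diverged warp executes an if/else under SIMT:
    // every lane steps through BOTH branch passes, and a per-lane mask
    // decides whether the result is committed. Masked-off lanes still
    // occupy issue slots, which is the divergence cost.
    fn main() {
        const WARP_SIZE: usize = 8; // made-up small warp for illustration
        let input: [i32; WARP_SIZE] = [1, -2, 3, -4, 5, -6, 7, -8];
        let mut output = [0i32; WARP_SIZE];
        let mut lane_instructions = 0;

        // The branch being emulated: if x >= 0 { x * 2 } else { -x }
        let mask: [bool; WARP_SIZE] = input.map(|x| x >= 0);

        // Pass 1: "then" branch, stepped by all lanes, committed where mask is true.
        for lane in 0..WARP_SIZE {
            lane_instructions += 1; // issued whether or not the lane is active
            if mask[lane] {
                output[lane] = input[lane] * 2;
            }
        }
        // Pass 2: "else" branch, same thing with the mask inverted.
        for lane in 0..WARP_SIZE {
            lane_instructions += 1;
            if !mask[lane] {
                output[lane] = -input[lane];
            }
        }

        println!("output = {:?}", output);
        // 16 lane-instructions issued for 8 lanes of useful work.
        println!("{} lane-instructions for {} lanes", lane_instructions, WARP_SIZE);
    }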

So I am not sure retrofitting a Ferrari to cosplay as an Amazon delivery van is useful other than as a tech showcase?

Good tech showcase though :)

zozbot234 an hour ago | parent [-]

I think you're conflating GPU 'threads' and 'warps'. GPU 'threads' are SIMD lanes that are all running with the exact same instructions and control flow (only with different filtering/predication), whereas GPU warps are hardware-level threads that run on a single compute unit. There's no issue with adding extra "don't run code" when using warps, unlike GPU threads.

textlapse 2 minutes ago | parent [-]

My understanding of warps (https://docs.nvidia.com/cuda/cuda-programming-guide/01-intro...) is that, on a divergent branch, you are essentially paying the cost of taking both branches.

I understand that with newer GPUs you could have thread blocks that distribute the warps in such a way that block A takes branch A while block B takes branch B, with sync/barrier, essentially relying on some smart "oracle" to schedule these in a way that still fits in the SIMT model.
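
Continuing the same kind of scalar emulation (made-up block and warp sizes, not real GPU code): if the branch condition is uniform within a block, every lane in each of its warps agrees, so no masked "both branches" pass is needed.

    // When the branch depends only on the block index, all lanes in a block
    // (and therefore in each of its warps) take the same path, so each warp
    // executes exactly one branch and never serializes both.
    fn main() {
        const BLOCKS: usize = 2;
        const WARP_SIZE: usize = 8; // made-up sizes for illustration
        let mut lane_instructions = 0;
        let mut work_a = 0;
        let mut work_b = 0;

        for block in 0..BLOCKS {
            // Uniform condition: identical for every lane in the block.
            let take_branch_a = block == 0;
            for _lane in 0..WARP_SIZE {
                lane_instructions += 1; // only the chosen branch is stepped
                if take_branch_a {
                    work_a += 1; // stand-in for branch A's work
                } else {
                    work_b += 1; // stand-in for branch B's work
                }
            }
        }

        // 16 lane-instructions for 16 lanes of useful work: no divergence tax.
        println!("{} lane-instructions, branch A x{}, branch B x{}",
                 lane_instructions, work_a, work_b);
    }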

It still doesn't feel Turing-complete to me. Is there an NVIDIA doc you can refer me to?