textlapse an hour ago

My understanding of warps (https://docs.nvidia.com/cuda/cuda-programming-guide/01-intro...) is that under divergence you essentially pay the cost of taking both branches.
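
For example, a minimal made-up CUDA kernel (the name and arithmetic are arbitrary) where even and odd lanes of the same warp diverge, so the warp serializes both paths:

    __global__ void divergent(float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0) {
            out[i] = i * 2.0f;  // warp runs this path with odd lanes masked off...
        } else {
            out[i] = i * 3.0f;  // ...then this path with even lanes masked off
        }
    }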

I understand that with newer GPUs you have clever partitioning/pipelining, such that block A takes branch A while block B takes branch B, with syncs/barriers, essentially relying on some smart 'oracle' to schedule these in a way that still fits the SIMT model.
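
Roughly, what I mean is something like this sketch (hypothetical kernel): the branch condition is uniform per block, so no warp ever pays for both paths:

    __global__ void blockwise(float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (blockIdx.x % 2 == 0) {
            out[i] = i * 2.0f;  // "branch A": every thread in the block takes it
        } else {
            out[i] = i * 3.0f;  // "branch B": uniform within the block, no divergence
        }
    }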

It still doesn't feel Turing complete to me. Is there an NVIDIA doc you can refer me to?

rowanG077 an hour ago | parent

That applies inside a single warp; notice the wording:

> In SIMT, all threads in the warp are executing the same kernel code, but each thread may follow different branches through the code. That is, though all threads of the program execute the same code, threads do not need to follow the same execution path.

This doesn't say anything about dependencies across multiple warps.
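
Concretely (a made-up sketch): if the branch condition is uniform per warp, each warp takes exactly one path and the serialization cost never shows up, even though the kernel as a whole branches:

    __global__ void warpwise(float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int warp_id = threadIdx.x / warpSize;  // 32 threads per warp on NVIDIA GPUs
        if (warp_id % 2 == 0) {
            out[i] = i * 2.0f;  // this whole warp takes branch A
        } else {
            out[i] = i * 3.0f;  // this whole warp takes branch B
        }
    }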

textlapse an hour ago | parent

It's definitely possible; I am not arguing against that.

I am just saying it's not as flexible or cost-free as it would be on a 'normal' von Neumann-style CPU.

I would love to see Rust-based code that obviates the need to write CUDA kernels (including compiling to different architectures). It feels icky to introduce things like async/await into a GPU programming model, which is very different from the traditional Rust programming model.

At the end of the day, you still have to worry about different architectures and the streaming nature of the hardware.
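
For instance, per-architecture specialization is usually handled at compile time with __CUDA_ARCH__ (a sketch; warp_sum is a made-up helper): __reduce_add_sync only exists on sm_80 and newer, so older targets fall back to a shuffle loop:

    __device__ unsigned warp_sum(unsigned v) {
    #if __CUDA_ARCH__ >= 800
        return __reduce_add_sync(0xffffffffu, v);  // hardware warp reduction (Ampere+)
    #else
        for (int offset = 16; offset > 0; offset /= 2)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        return __shfl_sync(0xffffffffu, v, 0);  // broadcast lane 0's total to all lanes
    #endif
    }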

I am very interested in this topic and curious to learn how the latest GPUs help manage this divergence problem.