positron26 8 hours ago

> Most popular codecs were designed decades ago, when video resolutions were far smaller. As resolutions have exploded, those fixed-size minimum units now represent a much smaller fraction of a frame — which means far more of them can be processed in parallel. Modern GPUs have also gained features enabling cross-invocation communication, opening up further optimization opportunities.

One only needs to look at GPU-driven rendering and ray tracing in shaders to deduce that shader cores and memory subsystems have become flexible enough to do work beyond lock-step uniform parallelism, where the only difference between threads was the thread ID.

Nobody strives for random-access memory read patterns, but the universal popularity of buffer device addresses and descriptor arrays can be taken as some proof that these indirections are no longer the friction for GPU architectures that they were ten years ago.

At the same time, the languages are no longer as restrictive as they once were. People are now recording commands on the GPU itself. That this kind of fiddly, serial work has become practical is a sign that the ergonomics of CPU programming have less of a relative advantage, and that cuts deeply into the tradeoff costs.

pandaforce 7 hours ago | parent | next [-]

Yeah, Vulkan is shedding most of its abstractions. Buffer bindings are no longer needed - just device addresses. Shaders don't need to be baked into a pipeline - you can use shader objects. Even images rarely provide any speedup over buffers, since the texel cache is no longer separate from the memory cache.

GPUs these days have massive caches, often hundreds of megabytes in size, on top of an already absurd number of registers. A random read will often pull in a full cache line, and the loaded values can be kept in registers and reused as needed between invocations.
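To make the device-address point concrete: instead of binding a buffer through a descriptor set, you can fetch a raw 64-bit GPU address once on the host and hand it to shaders. A fragment (not a complete program - it assumes a valid `VkDevice` named `device` and a `VkBuffer` named `buffer` created with the `VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT` usage flag, with its memory allocated using `VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT`):

```
/* Assumes `device` and `buffer` were set up as described above. */
VkBufferDeviceAddressInfo addr_info = {
    .sType = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO,
    .buffer = buffer,
};
VkDeviceAddress addr = vkGetBufferDeviceAddress(device, &addr_info);
/* `addr` can now be passed to a shader (e.g. via a push constant)
   and dereferenced there with GL_EXT_buffer_reference, with no
   descriptor binding for the buffer itself. */
```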

mort96 7 hours ago | parent | prev [-]

These GPUs are still big SIMD devices at their core though, no?

pandaforce 7 hours ago | parent | next [-]

Yes, but no. No, in that these days GPUs are entirely scalar from the point of view of a single invocation. Using vector types in shaders is pointless for speed - they will be no faster than scalar variables (dual-issue instruction dispatch on AMD GPUs is an exception).

But yes, in the sense that a collection of invocations all progressing in lockstep gets its arithmetic done by vector units. GPUs have just gotten really good at hiding what happens when execution paths branch between invocations.

positron26 5 hours ago | parent | prev [-]

SIMT is a distinct model. The ergonomics are wildly different. Instead of contracting a long iteration by packing its steps together to make them "wider", you rotate the iteration across cores.

The critical difference is ergonomic: SIMD and parallel programming are totally different disciplines that you have to design for separately, while SIMT is almost exactly the same skill set as parallel programming.

The fan-in / fan-out and iteration rotation are the key skills for SIMT.