dcrazy 2 days ago

It was, in fact, a problem. DX11 and earlier tried to solve it with DXBC, an intermediate bytecode format that all drivers could consume. The driver would only need to lower the bytecode to the GPU’s ISA, which is much faster than a full compilation from HLSL. Prior to the emergence of Vulkan, OpenGL didn’t ever try to solve this; GLSL was always the interface to the driver. (Nowadays SPIR-V support is available to OpenGL apps if your driver implements the GL_ARB_gl_spirv extension, and of course DXIL has replaced DXBC.)

Compilation stutter was perhaps less noticeable in the DX9/OpenGL 3 era because shaders were less capable, and games relied more on fixed-function features implemented directly in the driver. Nowadays, a lot of that legacy API surface is actually implemented by dynamically generated and compiled shaders, so you can get shader compilation hitches even when you aren't using shaders at all.

In the N64 era of consoles, games would write commands directly into memory shared with the GPU, usually via a library. In Nintendo's case, SGI provided two families of microcode called "Fast3D" and "Turbo3D". You'd call functions to build a "display list", which was just a buffer full of instructions that did the math you wanted the GPU to do.
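To make the display-list idea concrete, here is a minimal sketch of the record/interpret pattern described above. The opcodes and command layout are invented for illustration; they are not the real Fast3D/F3DEX encoding.

```python
# Toy model of an N64-style display list: "recording" appends packed
# commands to a buffer, and a GPU-side interpreter later walks the buffer.
# Opcode names and payloads are made up for this sketch.
CMD_SET_COLOR, CMD_TRIANGLE, CMD_END = 1, 2, 0

def record(dl, op, *args):
    """Append one command word to the display list buffer."""
    dl.append((op, args))

def execute(dl):
    """Stand-in for the GPU: interpret commands until CMD_END."""
    triangles = 0
    for op, args in dl:
        if op == CMD_END:
            break
        if op == CMD_TRIANGLE:
            triangles += 1
    return triangles

dl = []
record(dl, CMD_SET_COLOR, 0xFF0000FF)   # packed RGBA
record(dl, CMD_TRIANGLE, 0, 1, 2)       # three vertex indices
record(dl, CMD_TRIANGLE, 2, 1, 3)
record(dl, CMD_END)
print(execute(dl))  # 2
```

The key property is that recording is cheap and decoupled from execution, which is why the pattern survives today in command buffers in Vulkan and D3D12.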

kimixa 18 hours ago | parent [-]

Having worked on some of the latter part of that era of GPUs, I can say the "frontend" of the shader compiler was a pretty small fraction of the total time cost; most of it was in the later optimization passes, which are often extremely hardware-specific (so not really possible at the level of DXBC). Especially as hardware started to move away from the assumptions DXBC was designed around.

I think a big part of the user-visible difference in stutter is simply the expected complexity of shaders and the number of different shaders in an "average" scene: they're hundreds of times larger, and CPUs aren't hundreds of times faster. Many of the optimization algorithms used are also more-than-linear in the size of their input.
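A quick back-of-the-envelope illustration of that last point, using an assumed quadratic pass (say, a naive CSE scan that compares every pair of instructions); the instruction counts are made up, only the scaling matters.

```python
# If an optimization pass is O(n^2) in instruction count, a 100x larger
# shader costs far more than 100x the compile time.
def pairwise_comparisons(n_instructions: int) -> int:
    """Work done by a naive all-pairs scan over n instructions."""
    return n_instructions * (n_instructions - 1) // 2

small = pairwise_comparisons(100)     # a short DX9-era shader
large = pairwise_comparisons(10_000)  # a shader ~100x larger
print(large / small)  # 10100.0 -- about 10,000x the work for 100x the input
```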

Modern DXIL and SPIR-V are at a similar level of abstraction to DXBC, and certainly don't "solve" stutter.

dcrazy 12 hours ago | parent [-]

One advantage of contemporary bytecode implementations is that many optimizations can occur in the “middle end”—which is to say on the IR itself, before lowering to ISA.

kimixa 11 hours ago | parent [-]

Yes, many optimizations can be done at the vendor-neutral IR level, but my point is that on GPUs those tend to be among the computationally cheaper ones. The vast majority of the compiler's time (in my experience) was spent at levels lower than that, like register allocation (on GPUs, "registers" are normally shared among all resident waves, so there are trade-offs between using fewer registers and allowing more waves) or reordering instructions to hide latency from asynchronous units and higher-latency instructions. And all of those are very hardware-specific.
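The register/occupancy trade-off mentioned here can be sketched with a toy model. All the sizes below (register file, wave width, wave limit) are illustrative assumptions, not any real GPU's numbers.

```python
# Back-of-the-envelope occupancy model: one SIMD's register file is
# shared by all resident waves, so a shader that needs more registers
# per thread leaves room for fewer waves, and fewer waves means less
# latency hiding. Sizes below are invented for illustration.
def max_waves(regs_per_thread: int,
              reg_file_bytes: int = 65536,   # 64 KiB register file
              threads_per_wave: int = 64,
              hw_wave_limit: int = 10) -> int:
    regs_available = reg_file_bytes // 4     # 32-bit registers
    fit = regs_available // (regs_per_thread * threads_per_wave)
    return min(fit, hw_wave_limit)

print(max_waves(24))  # 10 -- light register pressure, capped by hardware limit
print(max_waves(84))  # 3  -- heavy shader, only 3 waves resident
```

This is why the allocator can't just minimize spills in isolation: saving a few registers can double occupancy, or buy nothing at all if you're already at the hardware wave limit.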

It's a classic example of the "first 50%" being relatively easy: an "optimizing" compiler can get pretty good results with fairly simple constant propagation, inlining, and dead code elimination. But that second 50% takes so much more effort.
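Those "easy" middle-end wins can be shown in a few lines. This is a minimal sketch of constant folding plus dead code elimination over an invented three-address IR; the tuple format and opcode names are assumptions for illustration.

```python
# Toy middle-end: fold constant adds, then delete instructions whose
# results are never used. Each instruction is (dest, op, arg_a, arg_b);
# string args are virtual registers, ints are immediates.
def fold_constants(prog):
    env, out = {}, []
    for dest, op, a, b in prog:
        a, b = env.get(a, a), env.get(b, b)   # propagate known constants
        if op == "add" and isinstance(a, int) and isinstance(b, int):
            env[dest] = a + b                  # fold: dest is now a constant
        else:
            out.append((dest, op, a, b))
    return out

def eliminate_dead(prog, live):
    out = []
    for dest, op, a, b in reversed(prog):      # backward liveness scan
        if dest in live:
            out.append((dest, op, a, b))
            live |= {v for v in (a, b) if isinstance(v, str)}
    return list(reversed(out))

prog = [
    ("t0", "add", 2, 3),      # folds to t0 = 5
    ("t1", "add", "t0", 4),   # becomes add 5, 4 -> folds to t1 = 9
    ("t2", "mul", "x", "x"),  # dead: t2 is never used
    ("r",  "mul", "x", "t1"), # t1 replaced by the constant 9
]
optimized = eliminate_dead(fold_constants(prog), {"r"})
print(optimized)  # [('r', 'mul', 'x', 9)]
```

Four instructions collapse to one with two trivially linear-ish passes; the expensive second half (allocation, scheduling) is where the hardware-specific pain lives.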