▲ | dragontamer 20 hours ago
1. Not a problem for GPUs. NVIDIA and AMD are both hard-coded to 32 lanes wide, i.e. 1024 bits. AMD can switch to a 64-wide mode for backwards compatibility with GCN. 1024 or 2048 bits seem to be the right values. Go too wide and you get branch divergence issues, so it doesn't seem to make sense to go bigger. In contrast, the systems that have flexible widths have never taken off. It's seemingly much harder to design a programming language for flexible-width SIMD.

2. Not a problem for GPUs. Note that kernels allocate custom numbers of registers: one kernel may use 56 registers, while another might use 200. All GPUs will run these two kernels simultaneously (256+ registers per CU or SM is commonly supported, so a 200-register kernel and a 56-register kernel can run together).

3. Not a problem for GPUs, or really for any SIMD in most cases. Tail handling is an O(1) problem in general and not a significant contributor to code length, size, or benchmarks.

Overall utilization issues are certainly a concern, but in my experience they are most often caused by branching. (Branching on GPUs is very inefficient and forces very low utilization.)
▲ | dzaima 20 hours ago | parent
Tail handling is not significant for loops with tons of iterations, but there are plenty of real-world situations where a loop might run for only about 5 iterations. Even at around 100 iterations, with a loop processing 8 elements at a time (i.e. 256-bit vectors, 32-bit elements), that's 12 vectorized iterations plus up to 7 scalar ones, which is still quite significant. At 1000 iterations the scalar tail can still be a couple percent, and it still doubles the L1/uop-cache space the loop takes. It's absolutely a significant contributor to code size (in scenarios where vectorized code in general is a significant contributor to code size, which admittedly is only very specialized software).
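To make the iteration counts above concrete, here is a rough sketch in Python. It is only an illustration of the arithmetic, not real SIMD code: `split_iterations` and `chunked_sum` are hypothetical helper names, and the slice-based "vector" step merely stands in for one 8-lane instruction.

```python
def split_iterations(n, lanes):
    """Return (vector_iterations, scalar_tail_iterations) for n elements."""
    return n // lanes, n % lanes


def chunked_sum(values, lanes=8):
    """Sum values using a 'vector' main loop followed by a scalar tail loop."""
    total = 0
    main_end = len(values) - len(values) % lanes
    # Main loop: each step stands in for one SIMD operation over `lanes` elements.
    for i in range(0, main_end, lanes):
        total += sum(values[i:i + lanes])
    # Tail loop: the leftover (fewer than `lanes`) elements, one at a time.
    # This second copy of the loop body is what adds to code size.
    for i in range(main_end, len(values)):
        total += values[i]
    return total


if __name__ == "__main__":
    for n in (5, 100, 1007):
        vec, tail = split_iterations(n, 8)
        print(f"n={n}: {vec} vector iterations, {tail} scalar tail iterations")
```

With 8 lanes, 5 elements never reach the vector loop at all, 100 elements give 12 vector iterations plus 4 scalar ones, and 1007 elements hit the worst case of 7 scalar iterations after 125 vector ones.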