▲ aengelke a day ago
I agree; and the article also seems to have quite a few technical flaws:

- Register width: we somewhat maxed out at 512 bits, with Intel going back to 256 bits for non-server CPUs. I don't see larger widths on the horizon (even if SVE theoretically supports up to 2048 bits, I don't know of any implementation with >512 bits). Larger bit widths are not beneficial for most applications, and the few applications that do benefit (e.g., some HPC codes) are nowadays served by GPUs.

- The post mentions available opcode space: while opcode space is limited, a reasonably well-designed ISA (e.g., AArch64) has enough holes for extensions. Adding new instructions doesn't require ABI changes, and while adding new registers requires some kernel changes, this is well understood at this point.

- "What is worse, software developers often have to target several SIMD generations" -- there is no way around this, though, unless auto-vectorization becomes substantially better. Adjusting the register width is not the big problem when porting code; making better use of the available instructions is.

- "The packed SIMD paradigm is that there is a 1:1 mapping between the register width and the execution unit width" -- no. E.g., AMD's Zen 4 does double pumping, and AVX was IIRC originally designed to support this as well (although Intel went directly for 256-bit units).

- "At the same time many SIMD operations are pipelined and require several clock cycles to complete" -- well, they are pipelined, but many SIMD instructions have the same latency as their scalar counterparts.

- "Consequently, loops have to be unrolled in order to avoid stalls and keep the pipeline busy." -- loop unrolling has several benefits, mostly reducing the overhead of the loop and avoiding data dependencies between loop iterations (see the sketch at the end of this comment). Larger basic blocks are also better for hardware, as every branch, even if predicted correctly, has a small penalty. "Loop unrolling also increases register pressure" -- it does, but code that really requires >32 registers is extremely rare, so a good instruction scheduler in the compiler can avoid spilling.

In my experience, dynamic vector sizes make code slower, because they inhibit optimizations. E.g., spilling a dynamically sized vector is like a dynamic stack allocation with a dynamic offset. I don't think SVE delivered any large benefits, either in terms of performance (there's not much hardware with SVE to begin with...) or compiler support. RISC-V pushes further in this direction; we'll see how this turns out.
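To make the unrolling point concrete, here is a minimal plain-C sketch (the function name is mine, not from the article): four independent accumulators break the loop-carried dependency on a single sum, so the pipelined FP adders stay busy -- at the cost of three extra live registers.

    #include <stddef.h>

    /* Dot product unrolled by 4: s0..s3 form independent dependency
       chains, so each add does not have to wait for the previous one. */
    float dot_unrolled(const float *a, const float *b, size_t n) {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i + 0] * b[i + 0];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; i++)            /* scalar tail */
            s0 += a[i] * b[i];
        return (s0 + s1) + (s2 + s3);
    }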
▲ camel-cdr a day ago
> we somewhat maxed out at 512 bits

Which still means you have to write your code at least thrice, which is two times more than with a variable-length SIMD ISA. Also, there are processors with larger vector lengths, e.g. 1024-bit: Andes AX45MPV, SiFive X380; 2048-bit: Akeana 1200; 16384-bit: NEC SX-Aurora, Ara, EPI.

> no way around this

You rarely need to rewrite SIMD code to take advantage of new extensions, unless somebody decides to create a new one with a larger SIMD width. This mostly happens when very specialized instructions are added. (A vector-length-agnostic sketch follows at the end of this comment.)

> In my experience, dynamic vector sizes make code slower, because they inhibit optimizations.

Do you have more examples of this? I don't see spilling as much of a problem, because you want to avoid it regardless, and codegen for dynamic vector sizes is pretty good in my experience.

> I don't think SVE delivered any large benefits

Well, all Arm CPUs except for the A64FX were built to execute NEON as fast as possible. x86 CPUs aren't built to execute MMX or SSE, or even the latest AVX, as fast as possible. Anyway, I know of one comparison between NEON and SVE: https://solidpixel.github.io/astcenc_meets_sve

> Performance was a lot better than I expected, giving between 14 and 63% uplift. Larger block sizes benefitted the most, as we get higher utilization of the wider vectors and fewer idle lanes.

> I found the scale of the uplift somewhat surprising as Neoverse V1 allows 4-wide NEON issue, or 2-wide SVE issue, so in terms of data-width the two should work out very similar.
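To illustrate the write-once property, a minimal vector-length-agnostic sketch using the Arm SVE ACLE intrinsics (assuming an SVE-enabled toolchain; the function name is mine): the same binary runs unchanged on 128-bit and 512-bit implementations, with the predicate from svwhilelt masking off the tail.

    #include <stdint.h>
    #include <arm_sve.h>

    /* dst[i] = a[i] + b[i]; the vector width is only known at run time. */
    void add_f32(float *dst, const float *a, const float *b, int64_t n) {
        for (int64_t i = 0; i < n; i += svcntw()) {    /* 32-bit lanes per vector */
            svbool_t pg = svwhilelt_b32_s64(i, n);     /* predicate covers the tail */
            svfloat32_t va = svld1_f32(pg, a + i);
            svfloat32_t vb = svld1_f32(pg, b + i);
            svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
        }
    }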
▲ bjourne 17 hours ago
> "Loop unrolling also increases register pressure" -- it does, but code that really requires >32 registers is extremely rare, so a good instruction scheduler in the compiler can avoid spilling. No, it actually is super common in hpc code. If you unroll a loop N times you need N times as many registers. For normal memory-bound code I agree with you, but most hpc kernels will exploit as much of the register file as they can for blocking/tiling. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
▲ xphos 15 hours ago
I think the variable-length stuff does solve encoding issues, and RISC-V takes big strides with the ideas around chaining and the vl/lmul/vtype registers. I think they would benefit from having 4 vtype registers, though. It's wasted scalar space, but how often do you actually rotate between 4 different vector types in main loop bodies? The answer is: pretty rarely. And you'd greatly reduce the swapping between vtypes. I think they'd need to find 1 more bit, but it's tough -- the encoding space for RVV isn't that large, which is a perk for sure.

Can't wait to see more implementations of RVV to actually test some of its ideas.
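For context, a minimal strip-mined loop with the RVV C intrinsics (assuming a toolchain with the v1.0 __riscv_ intrinsics; the function name is mine): every __riscv_vsetvl call compiles to a vsetvli that rewrites vl and vtype (element width e32, LMUL=8 here), which is exactly the per-loop vtype traffic being discussed.

    #include <stddef.h>
    #include <stdint.h>
    #include <riscv_vector.h>

    /* dst[i] = a[i] + b[i] at whatever vector length the hardware provides. */
    void add_i32(int32_t *dst, const int32_t *a, const int32_t *b, size_t n) {
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e32m8(n);   /* vsetvli: sets vl and vtype */
            vint32m8_t va = __riscv_vle32_v_i32m8(a, vl);
            vint32m8_t vb = __riscv_vle32_v_i32m8(b, vl);
            __riscv_vse32_v_i32m8(dst, __riscv_vadd_vv_i32m8(va, vb, vl), vl);
            a += vl; b += vl; dst += vl; n -= vl;
        }
    }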
▲ cherryteastain a day ago
The Fujitsu A64FX used in the Fugaku supercomputer implements SVE with a 512-bit vector width.
▲ deaddodo a day ago
> - Register width: we somewhat maxed out at 512 bits, with Intel going back to 256 bits for non-server CPUs. I don't see larger widths on the horizon (even if SVE theoretically supports up to 2048 bits, I don't know any implementation with >512 bits). Larger bit widths are not beneficial for most applications and the few applications that are (e.g., some HPC codes) are nowadays served by GPUs.

Just to address this: it's pretty evident why scalar values have stabilized at 64 bits and vectors at ~512 (though there are larger implementations).

Tell someone they only have 256 values to work with and they immediately see the limit; it's why old 8-bit code wasted so much time shuffling carries to compute larger values. Tell them they have 65536 values and that alleviates a large part of the problem, but you're still going to hit the limits frequently. Now you have up to 4294967296 values, and the limits are realistically only going to be hit in computational realms, so bump it up to 18446744073709551615. Now even most commodity computational limits are alleviated, and the compiler will handle the data shuffling for larger ones.

There was naturally going to be a point where there was enough static computational power on integers that it didn't make sense to continue widening them (at least, not at the previous rate). The same goes for vectorization, but in even more niche and specific fields.
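To illustrate the carry-shuffling cost, a small C sketch (a hypothetical function emulating what 8-bit code had to do): adding two 32-bit values one byte at a time, propagating the carry by hand, where a 32-bit ALU does the whole thing in a single add.

    #include <stdint.h>

    /* 32-bit addition built from four 8-bit adds with a hand-chained
       carry -- the work a wider ALU performs in one instruction. */
    uint32_t add32_via_8bit(uint32_t a, uint32_t b) {
        uint32_t result = 0;
        unsigned carry = 0;
        for (int byte = 0; byte < 4; byte++) {
            unsigned sum = ((a >> (8 * byte)) & 0xFF)
                         + ((b >> (8 * byte)) & 0xFF) + carry;
            carry = sum >> 8;                  /* carry into the next limb */
            result |= (uint32_t)(sum & 0xFF) << (8 * byte);
        }
        return result;
    }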