▲ | NullCascade 4 days ago | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
What is the actual process of identifying hotspots caused suboptimal compiler generated assembly? Would it ever make sense to write handwritten compiler intermediate representation like LLVM IR instead of architecture-specific assembly? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
▲ | astrange 4 days ago | parent | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
So the main issues here are not what people think they are. They generally aren't "suboptimal assembly", at least not what you can reasonably expect out of a C compiler. The factors are something like: - specialization: there's already a decent plain-C implementation of the loop, asm/SIMD versions are added on for specific hardware platforms. And different platforms have different SIMD features, so it's hard to generalize them. - predictability: users have different compiler versions, so even if there is a good one out there not everyone is going to use it. - optimization difficulties: C's memory model specifically makes optimization difficult here because video is `char *` and `char *` aliases everything. Also, the two kinds of features compilers add for this (intrinsics and autovectorization) can fight each other and make things worse than nothing. - taste: you could imagine a better portable language for writing SIMD in, but C isn't it. And on Intel C with intrinsics definitely isn't it, because their stuff was invented by Microsoft, who were famous for having absolutely no aesthetic taste in anything. The assembly is /more/ readable than C would be because it'd all be function calls with names like `_mm_movemask_epi8`. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
▲ | duped 4 days ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
Normally you spin up a tool like vtune or uprof to analyze your benchmark hotspots at the ISA level. No idea about tools like that for ARM. > Would it ever make sense to write handwritten compiler intermediate representation like LLVM IR instead of architecture-specific assembly? IME, not really. I've done a fair bit of hand-written assembly and it exclusively comes up when dealing with architecture-specific problems - for everything else you can just write C (unless you hit one of the edge cases where C semantics don't allow you to express something in C, but those are rare). For example: C and C++ compilers are really, really good at writing optimized code in general. Where they tend to be worse are things like vectorized code which requires you to redesign algorithms such that they can use fast vector instructions, and even then, you'll have to resort to compiler intrinsics to use the instructions at all, and even then, compiler intrinsics can lead to some bad codegen. So your code winds up being non-portable, looks like assembly, and has some overhead just because of what the compiler emits (and can't optimize). So you wind up just writing it in asm anyway, and get smarter about things the compiler worries about like register allocation and out-of-order instructions. But the real problem once you get into this domain is that you simply cannot tell at a glance whether hand written assembly is "better" (insert your metric for "better here) than what the compiler emits. You must measure and benchmark, and those benchmarks have to be meaningful. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
▲ | jcranmer 4 days ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
> Would it ever make sense to write handwritten compiler intermediate representation like LLVM IR instead of architecture-specific assembly? Not really. There are a couple of reasons to reach for handwritten assembly, and in every case, IR is just not the right choice: If your goal is to ensure vector code, your first choice is to try slapping explicit vectorize-me pragmas onto the loop. If that fails, your next effort is either to use generic or arch-specific vector intrinsics (or jump to something like ISPC, a language for writing SIMT-like vector code). You don't really gain anything in this use case from jumping to IR, since the intrinsics will satisfy your code. If your goal is to work around compiler suboptimality in register allocation or instruction selection... well, trying to write it in IR gives the compiler a very high likelihood of simply recanonicalizing the exact sequence you wrote to the same sequence the original code would have produced for no actual difference in code. Compiler IR doesn't add anything to the code; it just creates an extra layer that uses an unstable and harder-to-use interface for writing code. To produce the best handwritten version of assembly in these cases, you have to go straight to writing the assembly you wanted anyways. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
▲ | 4 days ago | parent | prev [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
[deleted] |