menaerus 2 days ago

> On the 48 core system, building linux peaks at about 48GB/s; LLVM peaks at something like 25GB/s

The LLVM peak is suspiciously low, since building LLVM is heavier than building the kernel? Anyway, on my machine, a dual-socket 2x22-core Skylake-X, for a pure release build without debug symbols (less memory pressure), I get ~60GB/s.

  $ python do_pair_combined.py out_clang_release
  Peak combined memory bandwidth found in block #180:
  S0_write: 8046.8 MB/s
  S0_read: 23098.2 MB/s
  S1_write: 7611.3 MB/s
  S1_read: 21231.3 MB/s
  Total: 59987.6 MB/s
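For context, do_pair_combined.py essentially scans the per-socket read/write bandwidth samples logged while the build runs and reports the sampling block with the highest combined total. Roughly the sketch below (the CSV log format here is my illustration, not necessarily what the profiling tool emits):

  import csv
  import sys

  # Rough sketch of the peak finder. Assumes a CSV log with one bandwidth
  # sample per row, in MB/s, one column per counter -- the exact format
  # depends on the profiling tool and is illustrative here.
  COLS = ["S0_write", "S0_read", "S1_write", "S1_read"]

  def main(path):
      best_total, best_block, best = 0.0, -1, None
      with open(path, newline="") as f:
          for block, row in enumerate(csv.DictReader(f)):
              vals = {c: float(row[c]) for c in COLS}
              total = sum(vals.values())
              if total > best_total:
                  best_total, best_block, best = total, block, vals
      print(f"Peak combined memory bandwidth found in block #{best_block}:")
      for c in COLS:
          print(f"  {c}: {best[c]:.1f} MB/s")
      print(f"  Total: {best_total:.1f} MB/s")

  if __name__ == "__main__":
      main(sys.argv[1])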
For a release build with debug symbols, which is much heavier and is what I normally use during development (so my experience is probably biased towards that workload), the peak is >50% higher: ~98GB/s.

  $ python do_pair_combined.py out_clang_relwithdeb
  Peak combined memory bandwidth found in block #601:
  S0_write: 11648.5 MB/s
  S0_read: 17347.9 MB/s
  S1_write: 31686.2 MB/s
  S1_read: 37532.7 MB/s
  Total: 98215.3 MB/s
I repeated the experiment with the Linux kernel, and I get almost the same figure as you do: ~48GB/s.

  $ python do_pair_combined.py out_kernel 
  Peak combined memory bandwidth found in block #329:
  S0_write: 8963.9 MB/s
  S0_read: 16584.1 MB/s
  S1_write: 7863.4 MB/s
  S1_read: 14371.0 MB/s
  Total: 47782.4 MB/s
That was the peak accumulated figure, but I was also interested in the single highest read/write bandwidth measured. For the LLVM/clang release build with debug symbols, I get ~32GB/s for write bandwidth and ~52GB/s for read bandwidth.

  $ python do_single.py out_clang_relwithdeb
  Peak memory_bandwidth_write: 31686.2 MB/s
  Peak memory_bandwidth_read: 52038.0 MB/s
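The do_single.py variant is simpler still: presumably just track the single highest write sample and read sample seen anywhere in the log, rather than the best combined block. A sketch under the same illustrative log format:

  import csv
  import sys

  # Sketch of a do_single.py-style scan: find the single highest write
  # sample and the single highest read sample anywhere in the log,
  # rather than the best combined block (log format illustrative again).
  peak_write = peak_read = 0.0
  with open(sys.argv[1], newline="") as f:
      for row in csv.DictReader(f):
          for col, val in row.items():
              if col.endswith("_write"):
                  peak_write = max(peak_write, float(val))
              elif col.endswith("_read"):
                  peak_read = max(peak_read, float(val))

  print(f"Peak memory_bandwidth_write: {peak_write:.1f} MB/s")
  print(f"Peak memory_bandwidth_read: {peak_read:.1f} MB/s")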
Those peaks are, btw, very close to what my socket can handle: store bandwidth is ~40GB/s, load bandwidth is ~80GB/s, and combined load-store bandwidth is ~65GB/s.
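One rough way to ballpark such per-socket ceilings is a numpy sketch like the one below; note it is single-threaded and ignores write-allocate traffic and NUMA placement, so treat its numbers as approximate:

  import time
  import numpy as np

  # Quick, rough ballpark of load/store bandwidth ceilings. Buffers are
  # much larger than the LLC so the traffic actually hits DRAM. Caveats:
  # single-threaded (one core rarely saturates a socket), ignores
  # write-allocate traffic, and ignores NUMA placement.
  N = 1 << 27          # 128M doubles = 1 GiB per array
  a = np.ones(N)
  b = np.empty(N)

  def gbps(nbytes, fn):
      t0 = time.perf_counter()
      fn()
      return nbytes / (time.perf_counter() - t0) / 1e9

  print(f"load : {gbps(a.nbytes, a.sum):.1f} GB/s")
  print(f"store: {gbps(b.nbytes, lambda: b.fill(0.0)):.1f} GB/s")
  print(f"copy : {gbps(a.nbytes + b.nbytes, lambda: np.copyto(b, a)):.1f} GB/s")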

So, I think it is not unreasonable to say that there are compiler workloads that can be limited by memory bandwidth. I have for sure worked with codebases even heavier than LLVM, and even though I did not do the measurements back then, my gut feeling was that the bandwidth was being consumed: some translation units would literally sit "compiling" for a few minutes with no visible progress.

I agree that random-access memory patterns, and the latency those patterns incur, are also a cost that needs to be added to this cost function.

My initial comment on this topic was that I don't really believe the bottleneck in compiling larger codebases (of course not on _any_ given machine) is on the compute side, and therefore I don't see how modules are going to fix any of this.