menaerus 2 days ago

> On the 48 core system, building linux peaks at about 48GB/s; LLVM peaks at something like 25GB/s

The LLVM peak is suspiciously low, since building LLVM is heavier than building the kernel? Anyway, on my machine, a dual-socket 2x22-core Skylake-X, for a pure release build without debug symbols (less memory pressure), I get ~60GB/s.

  $ python do_pair_combined.py out_clang_release
  Peak combined memory bandwidth found in block #180:
  S0_write: 8046.8 MB/s
  S0_read: 23098.2 MB/s
  S1_write: 7611.3 MB/s
  S1_read: 21231.3 MB/s
  Total: 59987.6 MB/s
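For context, do_pair_combined.py essentially scans the per-socket read/write bandwidth samples logged while the build runs and reports the sampling block with the highest combined total. Roughly the sketch below (the CSV log format here is my illustration, not necessarily what the profiling tool emits):

  import csv
  import sys

  # Rough sketch of the peak finder. Assumes a CSV log with one bandwidth
  # sample per row, in MB/s, one column per counter -- the exact format
  # depends on the profiling tool and is illustrative here.
  COLS = ["S0_write", "S0_read", "S1_write", "S1_read"]

  def main(path):
      best_total, best_block, best = 0.0, -1, None
      with open(path, newline="") as f:
          for block, row in enumerate(csv.DictReader(f)):
              vals = {c: float(row[c]) for c in COLS}
              total = sum(vals.values())
              if total > best_total:
                  best_total, best_block, best = total, block, vals
      print(f"Peak combined memory bandwidth found in block #{best_block}:")
      for c in COLS:
          print(f"  {c}: {best[c]:.1f} MB/s")
      print(f"  Total: {best_total:.1f} MB/s")

  if __name__ == "__main__":
      main(sys.argv[1])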
For a release build with debug symbols, which is much heavier and is what I normally use during development (so my experience is probably biased towards that workload), the peak is >50% higher: ~98GB/s.

  $ python do_pair_combined.py out_clang_relwithdeb
  Peak combined memory bandwidth found in block #601:
  S0_write: 11648.5 MB/s
  S0_read: 17347.9 MB/s
  S1_write: 31686.2 MB/s
  S1_read: 37532.7 MB/s
  Total: 98215.3 MB/s
I repeated the experiment with the Linux kernel, and I get almost the same figure as you do: ~48GB/s.

  $ python do_pair_combined.py out_kernel 
  Peak combined memory bandwidth found in block #329:
  S0_write: 8963.9 MB/s
  S0_read: 16584.1 MB/s
  S1_write: 7863.4 MB/s
  S1_read: 14371.0 MB/s
  Total: 47782.4 MB/s
That was the peak accumulated figure, but I was also interested in the single highest read/write bandwidth measured. For the LLVM/clang release build with debug symbols, I get ~32GB/s for write bandwidth and ~52GB/s for read bandwidth.

  $ python do_single.py out_clang_relwithdeb
  Peak memory_bandwidth_write: 31686.2 MB/s
  Peak memory_bandwidth_read: 52038.0 MB/s
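The do_single.py variant is simpler still: presumably just track the single highest write sample and read sample seen anywhere in the log, rather than the best combined block. A sketch under the same illustrative log format:

  import csv
  import sys

  # Sketch of a do_single.py-style scan: find the single highest write
  # sample and the single highest read sample anywhere in the log,
  # rather than the best combined block (log format illustrative again).
  peak_write = peak_read = 0.0
  with open(sys.argv[1], newline="") as f:
      for row in csv.DictReader(f):
          for col, val in row.items():
              if col.endswith("_write"):
                  peak_write = max(peak_write, float(val))
              elif col.endswith("_read"):
                  peak_read = max(peak_read, float(val))

  print(f"Peak memory_bandwidth_write: {peak_write:.1f} MB/s")
  print(f"Peak memory_bandwidth_read: {peak_read:.1f} MB/s")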
Those peaks are, btw, very close to what my socket can handle: store bandwidth is ~40GB/s, load bandwidth is ~80GB/s, and combined load-store bandwidth is ~65GB/s.
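One rough way to ballpark such per-socket ceilings is a numpy sketch like the one below; note it is single-threaded and ignores write-allocate traffic and NUMA placement, so treat its numbers as approximate:

  import time
  import numpy as np

  # Quick, rough ballpark of load/store bandwidth ceilings. Buffers are
  # much larger than the LLC so the traffic actually hits DRAM. Caveats:
  # single-threaded (one core rarely saturates a socket), ignores
  # write-allocate traffic, and ignores NUMA placement.
  N = 1 << 27          # 128M doubles = 1 GiB per array
  a = np.ones(N)
  b = np.empty(N)

  def gbps(nbytes, fn):
      t0 = time.perf_counter()
      fn()
      return nbytes / (time.perf_counter() - t0) / 1e9

  print(f"load : {gbps(a.nbytes, a.sum):.1f} GB/s")
  print(f"store: {gbps(b.nbytes, lambda: b.fill(0.0)):.1f} GB/s")
  print(f"copy : {gbps(a.nbytes + b.nbytes, lambda: np.copyto(b, a)):.1f} GB/s")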

So, I think it is not unreasonable to say that there are compiler workloads that can be limited by memory bandwidth. I have for sure worked with codebases even heavier than LLVM, and even though I did not do the measurements back then, my gut feeling was that the bandwidth was being consumed: some translation units would literally sit "compiling" for a few minutes with no visible progress.

I agree that random-access memory patterns, and the latency those patterns incur, are also a cost that needs to be added to this cost function.

My initial comment on this topic was that I don't really believe the bottleneck in compiling larger codebases (of course not on _any_ given machine) is on the compute side, and therefore I don't see how modules are going to fix any of this.