anarazel 3 days ago

That workstation has 2x10 cores / 20 threads. I also ran the test on a newer workstation with 2x24 cores with similar results, but I thought the older workstation was more interesting, as it has much worse memory bandwidth.

Sorry, but compilation is simply not memory bandwidth bound. There are significant memory latency effects, but bandwidth != latency. (A core stalled on a chain of dependent cache misses moves roughly one 64-byte line per ~100ns round trip, i.e. well under 1GB/s, which barely registers against a socket's bandwidth.)

menaerus 3 days ago | parent | next [-]

I doubt you can saturate the bandwidth with a dual-socket configuration with only 10 cores per socket. Perhaps with very recent cores, which I believe you don't have, but Intel's designs haven't been that good. What you're also measuring in your experiment, and what needs to be taken into account, is the latency across NUMA nodes, which is ridiculously high: 1.5x to 2x the local-node latency, usually amounting to ~130ns. Because of this, in NUMA configurations you usually need more (Intel) cores to saturate the bandwidth. I know because I have one sitting at my desk. Memory bandwidth saturation usually begins at ~20 cores with Intel designs that are roughly ~5 years old; I might be off with that number, but it's roughly in that range. Any further cores burning cycles just sit there waiting in line for the bus to become free.
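
If you want to eyeball the local-vs-remote gap yourself, here is a minimal single-threaded sketch (a hypothetical helper, assuming numpy and numactl are available; it won't come close to saturating a socket, but the ratio between the two runs is the interesting part):

  # bw_check.py - rough single-threaded streaming-read probe (hypothetical
  # helper, not one of the scripts from this thread). Run it pinned to
  # socket 0 against local and then remote memory and compare:
  #   numactl --cpunodebind=0 --membind=0 python bw_check.py
  #   numactl --cpunodebind=0 --membind=1 python bw_check.py
  import time
  import numpy as np

  N = 1 << 28                       # 256M float64s = a 2 GiB working set
  a = np.ones(N, dtype=np.float64)  # pages land on the node membind points at

  t0 = time.perf_counter()
  s = a.sum()                       # streams the whole array from DRAM
  dt = time.perf_counter() - t0
  print(f"read ~{a.nbytes / dt / 1e9:.1f} GB/s (checksum {s:.0f})")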

bluGill 3 days ago | parent | prev [-]

At 48 cores you are right at about the point where memory bandwidth becomes the limit. I suspect you are over the line, but by so little that it is impossible to measure with all the other noise. Get a larger machine and report back.

anarazel 3 days ago | parent [-]

On the 48-core system, building Linux peaks at about 48GB/s; LLVM peaks at something like 25GB/s.

The system has well over 450GB/s of memory bandwidth.

menaerus 2 days ago | parent [-]

> On the 48-core system, building Linux peaks at about 48GB/s; LLVM peaks at something like 25GB/s

The LLVM peak is suspiciously low, since building LLVM is heavier than the kernel? Anyway, on my machine, which is a dual-socket 2x22-core Skylake-X, for a pure release build without debug symbols (less memory pressure) I get ~60GB/s.

   # python do_pair_combined.py out_clang_release
   Peak combined memory bandwidth found in block #180:
   S0_write: 8046.8 MB/s
   S0_read: 23098.2 MB/s
   S1_write: 7611.3 MB/s
   S1_read: 21231.3 MB/s
   Total: 59987.6 MB/s
For a release build with debug symbols, which is much heavier and is what I normally use during development (so my experience is probably biased towards that workload), the peak is >50% larger: ~98GB/s.

  $ python do_pair_combined.py out_clang_relwithdeb
  Peak combined memory bandwidth found in block #601:
  S0_write: 11648.5 MB/s
  S0_read: 17347.9 MB/s
  S1_write: 31686.2 MB/s
  S1_read: 37532.7 MB/s
  Total: 98215.3 MB/s
I repeated the experiment with the Linux kernel, and I get almost the same figure as you do: ~48GB/s.

  $ python do_pair_combined.py out_kernel 
  Peak combined memory bandwidth found in block #329:
  S0_write: 8963.9 MB/s
  S0_read: 16584.1 MB/s
  S1_write: 7863.4 MB/s
  S1_read: 14371.0 MB/s
  Total: 47782.4 MB/s
Now, this was the peak accumulated bandwidth, but I was also interested in the single highest read/write bandwidth measured. For the LLVM/clang release build with debug symbols, this is what I get: ~32GB/s write bandwidth and ~52GB/s read bandwidth.

  $ python do_single.py out_clang_relwithdeb
    Peak memory_bandwidth_write: 31686.2 MB/s
    Peak memory_bandwidth_read: 52038.0 MB/s
This is btw very close to what my socket can handle: store bandwidth is ~40GB/s, load bandwidth is ~80GB/s, and combined load-store bandwidth is ~65GB/s.
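
(The do_pair_combined.py / do_single.py scripts aren't shown here; a minimal sketch of what such a peak-finding pass might look like, assuming the measurement log is a series of sample blocks with lines like "S0_read: 23098.2 MB/s" separated by blank lines:)

  # peak_combined.py - hypothetical stand-in for do_pair_combined.py.
  # Assumes the log is a sequence of sample blocks, each containing lines
  # like "S0_read: 23098.2 MB/s", separated by blank lines.
  import re
  import sys

  BW_LINE = re.compile(r"(S[01]_(?:read|write)):\s*([\d.]+)\s*MB/s")

  def blocks(path):
      block = {}
      with open(path) as f:
          for line in f:
              m = BW_LINE.search(line)
              if m:
                  block[m.group(1)] = float(m.group(2))
              elif block and not line.strip():  # blank line closes a block
                  yield block
                  block = {}
      if block:
          yield block

  # Pick the block whose combined read+write bandwidth across both sockets
  # is highest; a do_single.py equivalent would instead track the per-metric
  # maxima over all blocks.
  idx, best = max(enumerate(blocks(sys.argv[1])), key=lambda kv: sum(kv[1].values()))
  print(f"Peak combined memory bandwidth found in block #{idx}:")
  for name, mbps in best.items():
      print(f"  {name}: {mbps:.1f} MB/s")
  print(f"  Total: {sum(best.values()):.1f} MB/s")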

So, I think it is not unreasonable to say that there are compiler workloads that can be limited by memory bandwidth. I have certainly worked with codebases even heavier than LLVM, and even though I did not take measurements back then, my gut feeling was that the bandwidth was being consumed. Some translation units would literally sit "compiling" for a few minutes with no visible progress.

I agree that random memory access patterns, and the latency those patterns incur, are also a cost that needs to be added to this cost function.

My initial comment on this topic was: I don't really believe that the bottleneck in compiling larger codebases (not on _any_ given machine, of course) is on the compute side, and therefore I don't see how modules are going to fix any of this.