anarazel 3 days ago
That workstation has 2x10 cores / 20 threads. I also ran the test on a newer workstation with 2x24 cores, with similar results, but I thought the older workstation was more interesting, as it has much worse memory bandwidth. Sorry, but compilation is simply not memory bandwidth bound. There are significant memory latency effects, but bandwidth != latency.
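For anyone reading along, a minimal sketch (mine, not anarazel's) of the distinction being drawn here: a dependent pointer chase is bound by memory latency, while a sequential sum is bound by bandwidth and can hide latency behind hardware prefetching. The array size and iteration counts are arbitrary assumptions, just big enough to spill out of the caches; build with something like cc -O2.

    /* Sketch: memory latency (dependent loads) vs. bandwidth (streaming loads). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64 * 1024 * 1024)   /* 64M elements, well past any LLC */

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        size_t *chase = malloc(N * sizeof *chase);
        double *stream = malloc(N * sizeof *stream);
        if (!chase || !stream) return 1;

        /* Sattolo's algorithm: a single random cycle, so every load
         * depends on the address produced by the previous load. */
        for (size_t i = 0; i < N; i++) chase[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = chase[i]; chase[i] = chase[j]; chase[j] = t;
        }

        /* Latency: the CPU cannot start the next load early. */
        double t0 = now();
        size_t p = 0;
        for (size_t i = 0; i < N; i++) p = chase[p];
        double ns_per_load = (now() - t0) / N * 1e9;

        /* Bandwidth: independent sequential loads the prefetcher can stream. */
        for (size_t i = 0; i < N; i++) stream[i] = (double)i;
        t0 = now();
        double sum = 0;
        for (size_t i = 0; i < N; i++) sum += stream[i];
        double gbs = N * sizeof(double) / (now() - t0) / 1e9;

        printf("~%.0f ns per dependent load, ~%.1f GB/s streaming (p=%zu, sum=%f)\n",
               ns_per_load, gbs, p, sum);
        return 0;
    }

The two numbers move independently: a pointer-heavy workload like compilation mostly pays the first cost, a streaming workload mostly the second.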
menaerus 3 days ago
I doubt you can saturate the bandwidth with a dual-socket configuration with 10 cores per socket. Perhaps if you had very recent cores, which I believe you don't, but Intel's designs haven't been that good. What you're also measuring in your experiment, and what needs to be taken into account, is the latency across NUMA nodes, which is ridiculously high: 1.5x to 2x that of the local node, usually amounting to ~130ns. Because of this, in NUMA configurations you usually need more (Intel) cores to saturate the bandwidth. I know because I have one sitting at my desk. With Intel designs that are roughly ~5 years old, memory bandwidth saturation usually begins at ~20 cores. I might be off with that number, but it's roughly something like that. Any other cores burning cycles are just sitting there, waiting in line for the bus to become free.
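A hedged sketch of how one could check where saturation begins on a given box: a STREAM-style triad run at increasing thread counts, where aggregate GB/s stops scaling once the memory controllers are saturated. The array size and the serial first-touch initialization are illustrative assumptions, and on a dual-socket machine memory placement (e.g. via numactl, or parallel first-touch) changes the picture considerably. Build with something like cc -O2 -fopenmp.

    /* Sketch: triad bandwidth vs. thread count to find the saturation knee. */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1L << 26)   /* 64M doubles per array, ~1.5 GB total */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) return 1;
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        for (int threads = 1; threads <= omp_get_max_threads(); threads *= 2) {
            omp_set_num_threads(threads);
            double t0 = omp_get_wtime();
            #pragma omp parallel for
            for (long i = 0; i < N; i++)
                a[i] = b[i] + 3.0 * c[i];
            double secs = omp_get_wtime() - t0;
            /* Three arrays touched per iteration: read b, read c, write a. */
            double gbs = 3.0 * N * sizeof(double) / secs / 1e9;
            printf("%2d threads: %6.1f GB/s\n", threads, gbs);
        }
        free(a); free(b); free(c);
        return 0;
    }

Once the GB/s column stops growing as threads double, additional cores are only adding latency-bound stalls, not bandwidth.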
bluGill 3 days ago
At 48 cores you are right at about the point where memory bandwidth becomes the limit. I suspect you are over the line, but by so little that it is impossible to measure amid all the other noise. Get a larger machine and report back.