> ... if you look at (say Agner Fog's optimisation PDFs of instruction latency) ...

That.... doesn't seem true? At least for most architectures I looked at?

While true the latency for ADDPS and ADDPD are the same latency, using the zen4 example at least, the double variant only calculates 4 fp64 values compared to the single-precision's 8 fp32. Which was my point? If each double precision instruction processes a smaller number of inputs, it needs to be lower latency to keep the same operation rate.

And DIV also has a significntly lower throughput for fp32 vs fp64 on zen4, 5clk/op vs 3, while also processing half the values?

Sure, if you're doing scalar fp32/fp64 instructions it's not much of a difference (though DIV still has a lower throughput) - but then you're already leaving significant peak flops on the table I'm not sure it's a particularly useful comparison. It's just the truism of "if you're not performance limited you don't need to think about performance" - which has always been the case.

So yes, they do at least have a 2:1 difference in throughput on zen4 - even higher for DIV.

▲

adgjlsfhk1 3 hours ago | parent [-]

This depends largely on your operations. There is lots of performance critical code that doesn't vectorize smoothly, and for those operations, 64 bit is just as fast.

	▲	kimixa 16 minutes ago \| parent [-]
		Yes, if you're not FP ALU limited (which is likely the case if not vectorized), or data cache/bandwidth/thermally limited from the increased cost of fp64, then it doesn't matter - but as I said that's true for every performance aspect that "doesn't matter". That doesn't mean that there are no situations where it does matter today - which is what I feel is implied by calling it "Ancient".