The fact that it scales sub linearly with hardware is well known and in fact foundational to the scaling laws on which modern LLMs are built, ie performance scales remarkably closely to log(compute+data), over many orders of magnitude.