mikewarot | 5 days ago
> The main reason for the large energy costs of inference is that we are serving hundreds of millions of people with the same model. No humans have this type of scaling capability. Using CPUs or GPUs or even tensor units involves waiting for data to be moved between RAM and compute. It's my understanding that most of the power used in LLM compute is consumed at that stage, and I further believe that 95% savings are possible by merging memory and compute to build a universal computing fabric. Alternatively, I'm deep in old-man-with-a-goofy-idea territory. Only time will tell.
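A rough back-of-the-envelope sketch of the intuition that data movement dominates at batch size 1. All figures below are illustrative assumptions (per-operation energies in the spirit of the oft-cited Horowitz ISSCC 2014 numbers; model size picked arbitrarily), not measurements, and batching amortizes the weight traffic across users, which is exactly the scaling point in the quoted comment:

    # Energy of moving weights vs. computing with them for one decoded token,
    # single-stream (batch size 1). All constants are rough assumptions.
    PJ = 1e-12  # picojoule in joules

    params          = 70e9   # assumed 70B-parameter model
    bytes_per_param = 2      # fp16/bf16 weights
    flops_per_param = 2      # one multiply-accumulate per weight per token

    dram_pj_per_byte = 20    # assumed off-chip DRAM/HBM access energy
    flop_pj          = 0.5   # assumed energy per fp16 FLOP on an accelerator

    e_move    = params * bytes_per_param * dram_pj_per_byte * PJ  # J per token
    e_compute = params * flops_per_param * flop_pj * PJ           # J per token

    print(f"data movement: {e_move:.2f} J/token")
    print(f"compute:       {e_compute:.2f} J/token")
    print(f"movement share: {e_move / (e_move + e_compute):.0%}")

Under these assumptions the weight traffic accounts for the vast majority of per-token energy at batch 1; serving many requests per weight load is what pulls that share back down.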
pama | 3 days ago
There is room for improvement in inference, hence the presence of various startups in this space and the increased innovation in software. Large Nvidia clusters are still cost-optimal for scaling inference (they move most of the memory transfer of smaller setups out of the critical path), and their energy cost is trivial compared to the cost of the hardware, but these conditions may change. Training is nearly fully compute bound, and Nvidia/CUDA provide decent abstractions for it, at least for now. We still need new ideas if training is to scale another 10 orders of magnitude in compute, but those ideas may not be practical for another decade.
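A minimal roofline sketch of why training (huge token batches) ends up compute bound while small-batch decoding is bandwidth bound. The accelerator figures are rough assumptions, not any particular GPU's spec:

    # An op is compute bound when its arithmetic intensity (FLOPs per byte
    # moved from memory) exceeds the hardware ridge point.
    peak_flops = 1.0e15   # assumed ~1 PFLOP/s dense fp16/bf16
    peak_bw    = 3.3e12   # assumed ~3.3 TB/s HBM bandwidth
    ridge      = peak_flops / peak_bw   # ~300 FLOPs/byte to saturate compute

    def intensity(tokens_per_step, bytes_per_param=2):
        # Each weight read once from memory does ~2 FLOPs (multiply + add)
        # for every token in the batch.
        return 2 * tokens_per_step / bytes_per_param

    for tokens in (1, 16, 256, 4096):
        side = "compute bound" if intensity(tokens) > ridge else "memory bound"
        print(f"{tokens:5d} tokens/step: {intensity(tokens):7.0f} FLOPs/byte -> {side}")

With these assumed numbers, only large effective batches (as in training, or heavily batched serving) cross the ridge point, which is consistent with training being compute bound and small-batch inference being limited by memory bandwidth.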