| What's not mentioned is a comparison vs FPGAs. You can have a systolic, fully pipelined system for any data processing, not just vectorized SIMD. Every primitive is able to work independently of everything else. For example, if you have 240 DSP slices (which is far from outrageous even at low scale), a perfect design could use those as 240 cores at 100% throughput. No memory, caching, decoding, etc. overhead.
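As a rough illustration of the 240-slice claim, here is a back-of-envelope throughput calculation in Python; the 500 MHz clock and one MAC per slice per cycle are assumptions for the sketch, not figures from the comment above.

    # Back-of-envelope throughput for the fully pipelined FPGA design above.
    # Assumed numbers (not from the comment): 500 MHz fabric clock and one
    # multiply-accumulate (2 FLOPs) per DSP slice per cycle at 100% utilization.
    dsp_slices = 240
    clock_hz = 500e6
    flops_per_mac = 2

    peak_flops = dsp_slices * clock_hz * flops_per_mac
    print(f"Peak throughput: {peak_flops / 1e9:.0f} GFLOP/s")  # -> 240 GFLOP/s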
| True, but FPGAs are suitable only for things that will not be produced in great numbers, because their cost and power consumption are many times higher than those of an ASIC. For a company the size of Google, the development costs for a custom TPU are quickly recovered. Comparing a Google TPU with an FPGA is like comparing an injection-moulded part with a 3D-printed part. Unfortunately, the difference in performance between FPGAs and ASICs has greatly increased in recent years, because FPGAs have remained stuck on relatively ancient CMOS manufacturing processes, which are much less efficient than the state-of-the-art processes.
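A toy break-even model of the "quickly recovered" point, sketched in Python; every figure in it (NRE, unit costs, power draw, electricity price, fleet sizes) is an illustrative assumption rather than a real TPU or FPGA number.

    # Toy break-even model: custom ASIC vs off-the-shelf FPGA at fleet scale.
    # All numbers below are illustrative assumptions, not real TPU/FPGA figures.
    ASIC_NRE = 50e6                        # one-time design and mask cost, $
    ASIC_UNIT, FPGA_UNIT = 200, 2000       # $ per accelerator
    ASIC_WATTS, FPGA_WATTS = 200, 600      # power per accelerator
    DOLLARS_PER_KWH = 0.10
    YEARS = 3

    def lifetime_cost(units, unit_cost, watts, nre=0.0):
        """Hardware plus electricity over the deployment lifetime."""
        kwh = watts / 1000 * 24 * 365 * YEARS
        return nre + units * (unit_cost + kwh * DOLLARS_PER_KWH)

    for units in (1_000, 10_000, 100_000):
        asic = lifetime_cost(units, ASIC_UNIT, ASIC_WATTS, ASIC_NRE)
        fpga = lifetime_cost(units, FPGA_UNIT, FPGA_WATTS)
        print(f"{units:>7} units: ASIC ${asic / 1e6:7.1f}M   FPGA ${fpga / 1e6:7.1f}M")

With these assumed numbers the FPGA wins at small volumes because the NRE never amortizes, while at large volumes the ASIC's per-unit and power advantage dominates.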
| |
▲ | cpgxiii 3 hours ago | parent | next [-] | | > True, but FPGAs are suitable only for things that will not be produced in great numbers, because their cost and power consumption are many times higher than those of an ASIC. While common folk wisdom, this really isn't true. A surprising number of products ship with FPGAs inside, including ones designed to be "cheap". A great example of this is that Blackmagic, a brand known for being a "cheap" option in cinema/video gear, bases everything on Xilinx/AMD FPGAs (for some "software heavy" products they use the Xilinx/AMD Zynq line, which combines hard ARM cores with an FPGA). Pretty much every single machine vision camera on the market uses an FPGA for image processing as well. These aren't "one in every pocket" level products, but they are widely produced. > Unfortunately, the difference in performance between FPGAs and ASICs has greatly increased in recent years, because FPGAs have remained stuck on relatively ancient CMOS manufacturing processes This isn't true either. At the high end, FPGAs are made on whatever the best process available is. Particularly in the top-end models that combine programmable fabric with hard elements, it would be insane not to produce them on the best process available. The big hindrance with FPGAs is that, almost by definition, the cell structures needed to provide programmability are inherently more complex and less efficient than the dedicated circuits of an ASIC. That often means a big hit to maximum clock rate, with resulting consequences for any serial computation being performed. | | |
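A minimal sketch of the clock-rate consequence for serial work: when each operation depends on the previous one, wall-clock time is set by cycle time alone, so a slower fabric clock cannot be bought back with parallelism. The two clock frequencies below are assumptions for illustration.

    # Serial dependency chains are bounded by clock rate, not by parallelism.
    # Assumed, illustrative clocks: ~500 MHz FPGA fabric vs ~2 GHz ASIC logic.
    chain_length = 1_000_000   # dependent operations, one per cycle

    for name, clock_hz in (("FPGA fabric", 500e6), ("ASIC logic", 2e9)):
        milliseconds = chain_length / clock_hz * 1e3
        print(f"{name:11s}: {milliseconds:.2f} ms for {chain_length:,} serial ops")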
▲ | santaboom 2 hours ago | parent | next [-] | | All very informative, but I had some quibbles. While it is true that cheap and expensive FPGAs exist, an FPGA system to replace a TPU would not use a $0.50 or even a $100 FPGA; it would use a Versal or UltraScale+ FPGA that costs thousands, compared to the (rough guess) $100/die you might spend for the largest chip on the most advanced process. Furthermore, the overhead of an FPGA means every single one may support only a few million logic gates (maybe 2-5x that if you use hardened blocks), compared to billions of transistors on the largest chips in the most advanced node, so the cost per unit of logic is much, much higher. To the second point, afaik, leading-edge Versal FPGAs are on 7 nm, which is not ancient but is also not the cutting edge used for ASICs (N3). | |
▲ | adrian_b 2 hours ago | parent | prev [-] | | By high end I assume that you mean something like some of the AMD Versal series, which are made on a TSMC 7 nm process, like the AMD Zen 2 CPUs from 6 years ago. While TSMC 7 nm is much better than what most FPGAs use, it is still ancient in comparison with what current CPUs and GPUs use. Moreover, I have never seen such FPGAs sold for less than thousands of $, i.e. they are much more expensive than GPUs of similar throughput. Perhaps they are heavily discounted for big companies, but those are also the companies which could afford to design an ASIC with better energy efficiency. I always prefer to implement an FPGA solution over any alternatives, but unfortunately much too often the FPGAs with high enough performance also have high enough prices to make them unaffordable. The FPGAs with the highest performance that are still reasonably priced are the AMD UltraScale+ families, made with a TSMC 14 nm process, which is still good in comparison with most FPGAs, but is nevertheless a decade-old manufacturing process. |
| |
| ▲ | Neywiny 6 hours ago | parent | prev | next [-] | | When you can ASIC, yes, do an ASIC. But my point was that there was a lot of GPU comparison. GPUs are also not ASICs relative to AI. | | |
▲ | QuadmasterXLII 6 hours ago | parent [-] | | They're close; they're basically matmul ASICs | | |
▲ | Neywiny 5 hours ago | parent [-] | | Arguably so are the DSP-heavy FPGAs. And the unused logic will have a minimal static power draw relative to the unused but clocked graphics-only parts of the GPU. | | |
▲ | daxfohl 2 hours ago | parent [-] | | I have to imagine Google considered this and decided against it. I assume it's that all the high-perf matmul stuff needs to be ASIC'd out to get max performance, quality heat dissipation, etc. And for anything reconfigurable, a CPU-based controller or logic chip is sufficient and easier to maintain. FPGAs kind of sit in this very niche middle ground. Yes, you can optimize your logic so that the FPGA does exactly the thing that your use case needs, so your hardware maps more precisely to your use case than a generic TPU or GPU would. But what you gain in logic efficiency, you'll lose several times over in raw throughput to a generic TPU or GPU, at least for AI stuff, which is almost all matrix math. Plus, getting that efficiency isn't easy; FPGAs have a higher learning curve and a way slower dev cycle than writing TPU or GPU apps, and take much longer to compile and test than CUDA code, especially when they get dense and you have to start working around gate timing constraints and such. It's easy to get to a point where even a tiny change can exceed some timing constraint and you've got to rewrite a whole subsystem to get it to synthesize again. |
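To put a number on the "almost all matrix math" point, here is a rough FLOP breakdown for one transformer layer; the dimensions are assumed for illustration (roughly GPT-2-small-like) and are not taken from the thread.

    # Rough FLOP breakdown of one transformer layer, showing matmul dominance.
    # Assumed, illustrative dimensions (roughly GPT-2-small-like).
    d_model, d_ff, seq_len = 768, 3072, 1024

    # Matrix multiplies, counted as 2*m*n*k FLOPs each.
    proj = 4 * (2 * seq_len * d_model * d_model)   # Q, K, V, and output projections
    mlp = 2 * (2 * seq_len * d_model * d_ff)       # MLP up- and down-projections
    attn = 2 * (2 * seq_len * seq_len * d_model)   # attention scores and weighted sum
    matmul = proj + mlp + attn

    # Element-wise work (softmax, layernorm, activations), assumed ~10 FLOPs
    # per activation value, is tiny by comparison.
    elementwise = 10 * seq_len * (4 * d_model + d_ff)

    print(f"matmul share of layer FLOPs: {matmul / (matmul + elementwise):.1%}")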
|
|
| |
▲ | c-c-c-c-c 6 hours ago | parent | prev [-] | | FPGAs are not expensive when ordered in bulk; the volume prices you see on Mouser are way higher than the going rates. | |
▲ | monocasa 6 hours ago | parent [-] | | The actual cost of the part (within reason) doesn't matter all that much for a hyperscaler. The real cost is in perf/watt, for which an FPGA is around an order of magnitude worse for the same RTL.
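A quick sketch of why perf/watt dominates the purchase price at hyperscale; the fleet workload, energy per job, and electricity price below are assumptions for illustration.

    # Yearly energy cost of a ~10x perf/watt gap, for a fixed fleet workload.
    # All figures below are illustrative assumptions.
    JOBS_PER_SECOND = 1e9          # fixed aggregate workload the fleet must sustain
    ASIC_JOULES_PER_JOB = 0.01
    FPGA_JOULES_PER_JOB = 0.10     # ~10x worse for the same RTL
    DOLLARS_PER_KWH = 0.10
    SECONDS_PER_YEAR = 365 * 24 * 3600

    def yearly_energy_cost(joules_per_job):
        kwh = JOBS_PER_SECOND * joules_per_job * SECONDS_PER_YEAR / 3.6e6
        return kwh * DOLLARS_PER_KWH

    print(f"ASIC fleet energy: ${yearly_energy_cost(ASIC_JOULES_PER_JOB):,.0f}/yr")
    print(f"FPGA fleet energy: ${yearly_energy_cost(FPGA_JOULES_PER_JOB):,.0f}/yr")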
|
|