TPU Deep Dive

What's not mentioned is a comparison vs FPGAs. You can have a systolic, fully pipelined system for any data processing not just vectorized SIMD. Every primitive is able to work independently of everything else. For example, if you have 240 DSP slices (which is far from outrageous on low scale), a perfect design could use those as 240 cores at 100% throughput. No memory, caching, decoding, etc overhead.

▲

adrian_b an hour ago | parent [-]

True, but FPGAs are suitable only for things that will not be produced in great numbers, because their cost and power consumption are many times higher than those of an ASIC.

For a company of the size of Google, the development costs for a custom TPU are quickly recovered.

Comparing a Google TPU with an FPGA is like comparing an injection-moulded part with a 3D-printed part.

Unfortunately, the difference in performance between FPGAs and ASICs has greatly increased in recent years, because the FPGAs have remain stuck on relatively ancient CMOS manufacturing processes, which are much less efficient than the state-of-the-art CMOS manufacturing processes.

▲

Neywiny an hour ago | parent | next [-]

When you can ASIC, yes, do an ASIC. But my point was that there was a lot of GPU comparison. GPUs are also not ASICs relative to AI.

▲

QuadmasterXLII 38 minutes ago | parent [-]

They’re close, they’re basically matmul asics

	▲	Neywiny a few seconds ago \| parent [-]
		Arguably so are the DSP heavy FPGAs. And the unused logic will have a minimal static power draw relative to the unused but clocked G-only parts of the GPU.

▲

c-c-c-c-c an hour ago | parent | prev [-]

fpga's are not expensive when ordered in bulk, the volume prices you see on mouser are way higher than the going rates.

	▲	monocasa an hour ago \| parent [-]
		The actual cost of the part (within reason) doesn't matter all that much for a hyperscaler. The real cost is in the perf/watt, which an FPGA is around an order of magnitude worse for the same RTL.

▲

lanthissa 4 hours ago | parent | prev | next [-]

can someone help me understand how the following can be true:

1. TPU's are a serious competitor to nvidia chips.

2. Chip makers with the best chips are valued at 1-3.5T.

3. Google's market cap is 2T.

4. It is correct for google to not sell TPU's.

i have heard the whole, its better to rent them thing, but if they're actually good selling them is almost as good a business as every other part of the company.

▲

Velorivox 2 hours ago | parent | next [-]

Wall street undervalued Google even on day one (IPO). Bezos has said that some of the times the stock had been doing the worst were when the company was doing great.

So, to help you understand how they can be true: market cap is governed by something other than what a business is worth.

As an aside, here's a fun article that embarrasses wall street. [0]

[0] https://www.nbcnews.com/id/wbna15536386

▲

michaelt an hour ago | parent | prev | next [-]

nvidia, who make AI chips with kinda good software support, and who have sales reflecting that, is worth 3.5T

google, who make AI chips with barely-adequate software, is worth 2.0T

AMD, who also make AI chips with barely-adequate software, is worth 0.2T

Google made a few decisions with TPUs that might have made business sense at the time, but with hindsight haven't helped adoption. They closely bound TPUs with their 'TensorFlow 1' framework (which was kinda hard to use) then they released 'TensorFlow 2' which was incompatible enough it was just as easy to switch to PyTorch, which has TPU support in theory but not in practice.

They also decided TPUs would be Google Cloud only. Might make sense, if they need water cooling or they have special power requirements. But it turns out the sort of big corporations that have multi-cloud setups and a workload where a 1.5x improvement in performance-per-dollar is worth pursuing aren't big open source contributors. And understandably, the academics and enthusiasts who are giving their time away for free aren't eager to pay Google for the privilege.

Perhaps Google's market cap already reflects the value of being a second-place AI chipmaker?

▲

smokel 3 hours ago | parent | prev | next [-]

Selling them and supporting that in the field requires quite some infrastructure you'd have to build. Why go through all that trouble if you already make higher margins renting them out?

Also, if they are so good, it's best to not level the playing field by sharing that with your competitors.

Also "chip makers with the best chips" == Nvidia, there aren't many others. And Alphabet does more than just produce TPUs.

▲

radialstub 3 hours ago | parent | prev | next [-]

I believe Broadcom is also very involved in the making of the TPU's and networking infrastructure and they are valued at 1.2T currently. Maybe consider the combined value of Broadcom and Google.

	▲	lftl 2 hours ago \| parent [-]
		Wouldn't you also need to add TSMC to Nvidia's side in that case?

▲

mft_ 3 hours ago | parent | prev | next [-]

If they think they’ve got a competitive advantage vs. GPUs which benefits one of their core products, it would make sense to retain that competitive advantage for the long term, no?

▲

jeffbee an hour ago | parent | prev | next [-]

Like other Google internal technologies, the amount of custom junk you'd need to support to use a TPU is pretty extreme, and the utility of the thing without the custom junk is questionable. You might as well ask why they aren't marketing their video compression cards.

▲

rwmj 3 hours ago | parent | prev | next [-]

Aren't Google's TPUs a bit like a research project with practical applications as a nice side effect?

▲

dismalaf 3 hours ago | parent | prev [-]

Nvidia is selling a ton of chips on hype.

Google is saving a ton of money by making TPUs, which will pay off in the future when AI is better monetized, but so far no one is directly making a massive profit from foundation models. It's a long term play.

Also, I'd argue Nvidia is massively overvalued.

	▲	CalChris an hour ago \| parent [-]
		Common in gold rushes but then they are selling chips. Are they overvalued? Maybe. Are they profitable (something WeWork and Uber aren't) ? Yes, quite.

▲

trostaft 21 minutes ago | parent | prev | next [-]

Excellent write up, thank you. The benefits section was illustrative

▲

RossBencina 9 hours ago | parent | prev | next [-]

Can you suggest a good reference for understanding which algorithms map well onto the regular grid systolic arrays used by TPUs? The fine article says dese matmul and convolution are good, but is there anything else? Eigendecomposition? SVD? matrix exponential? Solving Ax = b or AX = B? Cholesky?

	▲	cdavid 7 hours ago \| parent \| next [-]
		SVD/eigendecomposition will often boil down to making many matmul (e.g. when using Krylov-based methods, e.g. Arnoldi, Krylov-schur, etc.), so I would expect TPU to work well there. GMRES, one method to solve Ax = b is also based on Arnoldi decomp.
	▲	WithinReason 9 hours ago \| parent \| prev \| next [-]
		Anything that you can express as 128x128 (but ideally much larger) dense matrix multiplication and nothing else
	▲	musebox35 8 hours ago \| parent \| prev [-]
		I think https://jax-ml.github.io/scaling-book/ is one of the best references to go through. It details how single device and distributed computations map to TPU hardware features. The emphasis is on mapping the transformer computations, both forwards and backwards, so requires some familiarity with how transformer networks are structured.

▲

cheptsov an hour ago | parent | prev | next [-]

It’s so ridiculous to see TPUs being compared to NVIDIA GPUs. IMO proprietary chips such as TPU had no future sure to the monopoly on the cloud services. There is no competition across the cloud services providers. The only way to access TPUs is through GCP. As the result nobody wants to use them regardless of the technology. This is the biggest fault of GCP. Further the road, the gap between NVIDIA GPUs and Google TPUs (call it „moat“ or CUDA) is going to grow.

The opposite situation is with AMD which are avoiding the mistakes of Google.

My hope though is that AMD doesn’t start to compete with cloud service providers, e.g. by introducing their own cloud.

	▲	hiddencost 34 minutes ago \| parent [-]
		TPUs will thrive regardless of public adoption; Google's internal demand for TPU is such that they could buy every TPU ever produced.

▲

serf 9 hours ago | parent | prev | next [-]

does that cooling channel have a NEMA stepper on it as a pump or metering valve?[0]

If so, wild. That seems like overkill.

[0]: https://henryhmko.github.io/posts/tpu/images/tpu_tray.png

	▲	fellowmartian 7 hours ago \| parent [-]
		definitely closed-loop, might even be a servo

▲

almostgotcaught 11 hours ago | parent | prev | next [-]

> In essence, caches allow hardware to be flexible and adapt to a wide range of applications. This is a large reason why GPUs are very flexible hardware (note: compared to TPUs).

this is correct but mis-stated - it's not the caches themselves that cost energy but MMUs that automatically load/fetch/store to cache on "page faults". TPUs don't have MMUs and furthermore are a push architecture (as opposed to pull).

▲

cdg007 2 hours ago | parent | prev | next [-]

What will competitors say?

▲

sgt101 6 hours ago | parent | prev | next [-]

ELI5: how (specifically) do GPU and TPU optimisations effect determinism in LLMs? Or is this all a myth?

▲

barrkel 6 hours ago | parent | next [-]

LLMs are generally deterministic. The token sampling step is usually randomized to some degree because it gets better results (creativity) and helps avoid loops, but you can turn that off (temp zero for simple samplers).

	▲	Der_Einzige a minute ago \| parent \| next [-]
		This belief (LLMs are deterministic except for sampler) is very wrong and will get you into hilariously large amounts of trouble for assuming it's true. Also greedy sampling considered harmful: https://arxiv.org/abs/2506.09501
	▲	sgeisenh 2 hours ago \| parent \| prev \| next [-]
		This is an oversimplification. When distributed, the nondeterministic order of additions during reductions can produce nondeterministic results due to floating point error. It’s nitpicking for sure, but it causes real challenges for reproducibility, especially during model training.
	▲	perching_aix 3 hours ago \| parent \| prev [-]
		+ can also just pin the seed instead, right?

▲

jpgvm 5 hours ago | parent | prev [-]

They don't affect determinism of the results but different architectures have different determinism guarantees with respect to performance, as a result of scheduling and other things.

TPUs share a similar lineage to the Groq TPU accelerators (disclaimer: I work at Groq) which are actually fully deterministic which means not only do you get deterministic output, you get it in a deterministic number of cycles.

There is a trade off though, making the hardware deterministic means you give up HW level scheduling and other sources of non-determinism. This makes the architecture highly dependent on a "sufficiently smart compiler". TPUs and processors like them are generally considered VLIW and are all similarly dependent on the compiler doing all the smart scheduling decisions upfront to ensure good compute/IO overlap and eliminating pipeline bubbles etc.

GPUs on the other hand have very sophisticated scheduling systems on the chips themselves along with stuff like kernel swapping etc that make them much more flexible, less dependent on the compiler and generally easier to reach a fairly high utilisation of the processor without too much work.

TLDR: TPUs MAY have deterministic cycle guarantees. GPUs (of the current generation/architectures) cannot because they use non-deterministic scheduling and memory access patterns. Both still produce deterministic output for deterministic programs.

▲

b0a04gl 2 hours ago | parent | prev | next [-]

tpu's predictable latency under scale. when you control the compiler, the runtime, the interconnect and the chip, you eliminate so much variance that you can actually schedule jobs efficiently at data center scale. so the obvious question why haven't we seen anyone outside Google replicate this full vertical stack yet? is it because the hardware's hard or because no one has nailed the compiler/runtime contract at that scale?

▲

frays 8 hours ago | parent | prev | next [-]

How can someone have this level of knowledge about TPUs without working at Google?

▲

ipsum2 7 hours ago | parent | next [-]

Everything thats in the blog post is basically well known already. Google publishes papers and gives talks about their TPUs. Many details are lacking though, and require some assumptions/best guesses. Jax and XLA are (partially) open source and give clues about how TPUs work under the hood as well.

https://arxiv.org/abs/2304.01433

https://jax-ml.github.io/scaling-book/

▲

musebox35 7 hours ago | parent | prev [-]

From the acknowledgment at the end, I guess the author has access to TPUs through https://sites.research.google/trc/about/

This is not the only way though. TPUs are available to companies operating on GCP as an alternative to GPUs with a different price/performance point. That is another way to get hands-on experience with TPUs.

	▲	erwincoumans 6 hours ago \| parent [-]
		A quick free way to access TPUs is through https://colab.research.google.com, Runtime / Change Runtime Type / v2-8 TPU

▲

ariwilson 7 hours ago | parent | prev [-]

Cool article!