desideratum 5 hours ago

The Scaling ML textbook also has an excellent section on TPUs. https://jax-ml.github.io/scaling-book/tpus/

jauntywundrkind 3 hours ago | parent [-]

I also enjoyed https://henryhmko.github.io/posts/tpu/tpu.html (https://news.ycombinator.com/item?id=44342977).

The work that XLA & schedulers are doing here is wildly impressive.

This feels drastically harder to work with than Itanium must have been: ~400-bit VLIW across extremely diverse execution units. The workload is different, it's not general purpose, but it's still awe-inspiring to know not just that they built the chip but that the software folks can actually make use of such a wildly weird beast.

I wish we saw more industry uptake for XLA. Uptake's not bad, per se: there's a bunch of different hardware it can target! But it's amazing secret sauce, it's open source, and it doesn't feel like there's the industry rally behind it that it deserves. It feels like Nvidia is only barely beginning to catch up, to dig a new moat, with the just-announced Nvidia Tiles; the overlap is huge. Afaik (please correct me if I'm wrong), XLA isn't at present particularly useful at scheduling across machines, is it? https://github.com/openxla/xla

alevskaya 2 hours ago | parent | next [-]

I do think it's a lot simpler than the problem Itanium was trying to solve. Neural nets are just way more regular in nature, even with block sparsity, compared to generic consumer pointer-hopping code. I wouldn't call it "easy", but we've found that writing performant NN kernels for a VLIW architecture chip is in practice a lot more straightforward than on other architectures.
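
To give a rough flavor of what kernel-writing looks like on this stack, here is a minimal sketch using Pallas (JAX's kernel DSL); it's purely illustrative and not from any production kernel, and the shapes and names are made up:

    import jax
    import jax.numpy as jnp
    from jax.experimental import pallas as pl

    def add_kernel(x_ref, y_ref, o_ref):
        # The kernel body is plain array math; the compiler handles
        # scheduling it onto the chip's vector/matrix units.
        o_ref[...] = x_ref[...] + y_ref[...]

    @jax.jit
    def add(x, y):
        return pl.pallas_call(
            add_kernel,
            out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
        )(x, y)

    x = jnp.ones((256, 256), jnp.float32)
    print(add(x, x)[0, 0])  # 2.0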

JAX/XLA does offer some really nice tools for doing automated sharding of models across devices, but for really large performance-optimized models we often handle the comms stuff manually, similar in spirit to MPI.
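
For the manual route, the MPI-flavored style looks roughly like shard_map with explicit collectives. A minimal sketch, where the 1-D mesh and the axis name 'x' are just placeholders:

    import functools
    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, PartitionSpec as P
    from jax.experimental import mesh_utils
    from jax.experimental.shard_map import shard_map

    # One 1-D mesh over all local devices; 'x' is an arbitrary axis name.
    mesh = Mesh(mesh_utils.create_device_mesh((jax.device_count(),)),
                axis_names=('x',))

    @functools.partial(shard_map, mesh=mesh, in_specs=P('x'), out_specs=P())
    def global_sum(block):
        # Each device reduces its local shard, then an explicit all-reduce
        # (lax.psum) combines the partial sums -- analogous to MPI_Allreduce.
        return jax.lax.psum(jnp.sum(block), axis_name='x')

    x = jnp.arange(jax.device_count() * 1024, dtype=jnp.float32)
    print(global_sum(x))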

desideratum 3 hours ago | parent | prev | next [-]

Thanks for sharing this. I agree w.r.t. XLA. I've been moving to JAX after many years of using torch, and XLA is kind of magic. I think torch.compile has quite a lot of catching up to do.

> XLA isn't at present particularly useful at scheduling across machines,

I'm not sure if you mean compiler-based distributed optimizations, but JAX does this with XLA: https://docs.jax.dev/en/latest/notebooks/Distributed_arrays_...
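
Concretely, the "let the compiler insert the comms" workflow looks something like the sketch below. The 4x2 mesh, axis names, and shapes are just an example assuming 8 local devices:

    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
    from jax.experimental import mesh_utils

    # Assumed: 8 local devices arranged as a 4x2 mesh; axis names are arbitrary.
    mesh = Mesh(mesh_utils.create_device_mesh((4, 2)),
                axis_names=('data', 'model'))

    x = jax.device_put(jnp.ones((4096, 4096)),
                       NamedSharding(mesh, P('data', 'model')))
    w = jax.device_put(jnp.ones((4096, 1024)),
                       NamedSharding(mesh, P('model', None)))

    @jax.jit
    def layer(x, w):
        # XLA inserts the collectives needed for this sharded matmul;
        # no explicit communication is written here.
        return jnp.tanh(x @ w)

    y = layer(x, w)
    print(y.sharding)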

cpgxiii 2 hours ago | parent | prev [-]

In Itanium's heyday, the compilers and libraries were pretty good at handling HPC workloads, which is really the closest thing anyone was running back then to modern NN training/inference. The problem with Itanium and its compilers was that people obviously wanted to run workloads that looked nothing like HPC (databases, web servers, etc.), and the architecture and compilers weren't very good at that. There have always been very successful VLIW-style architectures in more specialized domains (graphics, HPC, DSP, now NPUs); it just hasn't worked out well for general-purpose processors.