1980phipsi 6 hours ago

> It is also important to note that, until recently, the GenAI industry’s focus has largely been on training workloads. In training workloads, CUDA is very important, but when it comes to inference, even reasoning inference, CUDA is not that important, so the chances of expanding the TPU footprint in inference are much higher than those in training (although TPUs do really well in training as well – Gemini 3 the prime example).

Does anyone have a sense of why CUDA is more important for training than inference?

qcnguy 15 minutes ago | parent | next [-]

CUDA is just a better dev experience. Lots of training is experiments where developer/researcher productivity matters. Googlers get to use what they're given, others get to choose.

Once you settle on a design, building ASICs to accelerate it might make sense. But I'm not sure the gap is that big; the article says some things that aren't really true of datacenter GPUs (Nvidia's DC GPUs haven't wasted die area on graphics-related hardware for years).

augment_me an hour ago | parent | prev | next [-]

NVIDIA chips are more versatile. During training, you might need to schedule work to the SFU (the Special Function Unit that computes sin, cos, 1/sqrt(x), etc.), run epilogues, save intermediate computations, save gradients, and so on. When you train, you might need to collect data from various GPUs, so you need to support interconnects, remote SMEM writes, etc.

Once you have trained, you have feed-forward networks consisting of frozen weights that you can just program in and run data over. These weights can be duplicated across any number of devices and simply sit there running inference on new data.

If this turns out to be the dominant future use case for NNs (it is today), then Google is better positioned.
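A minimal JAX sketch (not from the comment; shapes and numbers are made up) of the contrast being described: a training step has to materialize gradients (and, across devices, sync them) before updating weights, while inference is a single forward pass over frozen weights.

    # Illustrative only: training step vs. frozen-weight inference.
    import jax
    import jax.numpy as jnp

    def forward(params, x):
        # One tiny feed-forward layer: matmul + nonlinearity.
        return jax.nn.gelu(x @ params["w"] + params["b"])

    def loss_fn(params, x, y):
        return jnp.mean((forward(params, x) - y) ** 2)

    params = {"w": 0.1 * jnp.ones((8, 8)), "b": jnp.zeros(8)}
    x, y = jnp.ones((4, 8)), jnp.zeros((4, 8))

    # Training: the backward pass materializes gradients (and intermediate
    # activations), which would also need to be synced across devices.
    grads = jax.grad(loss_fn)(params, x, y)
    params = jax.tree_util.tree_map(lambda p, g: p - 0.01 * g, params, grads)

    # Inference: forward pass only, weights frozen, nothing to sync.
    preds = jax.jit(forward)(params, x)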

grandmczeb 44 minutes ago | parent [-]

All of those are things you can do with TPUs.

rbanffy 2 hours ago | parent | prev | next [-]

This is a very important point - the market for training chips might be a bubble, but the market for inference is much, much larger. At some point we might have good-enough models and the need for new frontier models will cool down. The big power-hungry datacenters we are seeing are mostly geared towards training, while inference-only systems are much simpler and more power-efficient.

A real shame, BTW, that all that silicon doesn't do FP32 (very well). Once training is no longer in such demand, we could use all that number crunching for climate models and weather prediction.

Traster 2 hours ago | parent | prev | next [-]

Training is taking an enormous problem, trying to break it into lots of pieces, and managing the data dependencies between those pieces. It's solving one really hard problem. Inference is the opposite: lots of small independent problems. All of this "we have X many widgets connected to Y many high-bandwidth optical telescopes" is a training problem that they need to solve. Inference is "I have 20 tokens and I want to throw them at these 5,000,000 matrix multiplies, oh and I don't care about latency".
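To make the "lots of small independent problems" point concrete, a toy sketch (sizes are hypothetical): unrelated inference requests can simply be stacked into one batch and pushed through the same frozen weights, because row i of the output depends only on row i of the input.

    # Illustrative only: independent requests share one frozen matmul.
    import numpy as np

    hidden = 4096
    frozen_w = np.random.randn(hidden, hidden).astype(np.float32)

    requests = [np.random.randn(1, hidden).astype(np.float32) for _ in range(16)]

    # Stack unrelated requests; there is no data dependency between rows.
    batch = np.concatenate(requests, axis=0)     # (16, hidden)
    out = batch @ frozen_w                       # one matmul serves all 16

    # Peel each request's result back off independently.
    per_request = np.split(out, len(requests), axis=0)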

johnebgd 6 hours ago | parent | prev | next [-]

I think it's the same reason Windows is important to desktop computers: software was written to depend on it. The same goes for most of today's training software, which is built around CUDA. Even a version difference of CUDA can break things.

llm_nerd 6 hours ago | parent | prev | next [-]

It's just more common as a legacy artifact from when nvidia was basically the only option available. Many shops are designing models and functions, and then training and iterating on nvidia hardware, but once you have a trained model it's largely fungible. See how Anthropic moved their models from nvidia hardware to Inferentia to XLA on Google TPUs.

Further, it's worth noting that Ironwood, Google's v7 TPU, supports only up to BF16 (a 16-bit floating-point format with the range of FP32 minus the precision). Many training processes rely on larger types and quantize later, so this breaks a lot of assumptions. Yet Google surprised everyone and actually trained Gemini 3 with just that type, so I think a lot of people are reconsidering their assumptions.
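For readers unfamiliar with the format, a quick sketch of what "FP32 range minus the precision" means in practice (values are illustrative only):

    import jax.numpy as jnp

    # Range: BF16 keeps FP32's 8 exponent bits, FP16 does not.
    print(jnp.asarray(1e38, dtype=jnp.bfloat16))   # still finite, ~1e38
    print(jnp.asarray(1e38, dtype=jnp.float16))    # inf -- FP16 tops out near 6.5e4

    # Precision: only 7 mantissa bits, so small relative differences vanish.
    x = jnp.asarray(1.0, dtype=jnp.bfloat16) + jnp.asarray(0.001, dtype=jnp.bfloat16)
    print(x)                                       # rounds back to 1.0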

qeternity an hour ago | parent [-]

This is not the case for LLMs. FP16/BF16 training precision is standard, with FP8 inference very common. But labs are moving to FP8 training and even FP4.

baby_souffle 6 hours ago | parent | prev | next [-]

That quote left me with the same question. Something about having a decent amount of RAM on one board, perhaps? That's advantageous for training but less so for inference?

imtringued 5 hours ago | parent | prev | next [-]

When training a neural network, you usually play around with the architecture and need as much flexibility as possible. You need to support a large set of operations.

Another factor is that training is always done with large batches, while inference batching depends on the number of concurrent users. This means training tends to be compute-bound, where supporting the latest data types is critical, whereas inference speed is often bottlenecked by memory bandwidth, which does not lend itself to product differentiation. If you put the same memory into your chip as your competitor, the difference is going to be much smaller.
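A back-of-the-envelope sketch of that compute-bound vs. memory-bound split (all numbers hypothetical): for a single d-by-d weight matrix, arithmetic intensity grows linearly with batch size, which is roughly why big training batches keep the math units busy while small inference batches wait on memory.

    # Illustrative only: FLOPs per byte of weight traffic for one (d x d) layer.
    def arithmetic_intensity(batch, d, bytes_per_param=2):  # 2 bytes ~ BF16
        flops = 2 * batch * d * d              # multiply-accumulates
        bytes_moved = d * d * bytes_per_param  # weight traffic dominates at small batch
        return flops / bytes_moved

    for batch in (1, 8, 512):
        print(batch, arithmetic_intensity(batch, d=8192))
    # batch=1   -> ~1 FLOP/byte: far below any accelerator's ridge point (memory-bound)
    # batch=512 -> ~512 FLOPs/byte: enough reuse to be compute-bound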

NaomiLehman 6 hours ago | parent | prev [-]

Inference is often a static, bounded problem solvable by generic compilers. Training requires the mature ecosystem and numerical stability of CUDA to handle mixed-precision operations, unless you rewrite the software from the ground up like Google did. For most companies it's cheaper and faster to buy NVIDIA hardware.
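As an illustration of "static, bounded" (not the commenter's code; shapes are made up): a frozen forward pass with fixed shapes and dtypes can be compiled once by a generic compiler such as XLA and then replayed.

    import jax
    import jax.numpy as jnp

    def forward(w, x):
        return jax.nn.softmax(x @ w)

    w = jnp.ones((128, 128), dtype=jnp.bfloat16)   # frozen weights
    x = jnp.ones((1, 128), dtype=jnp.bfloat16)

    # Shapes and dtypes are fixed, so the whole graph is known ahead of time.
    compiled = jax.jit(forward).lower(w, x).compile()
    out = compiled(w, x)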

never_inline 6 hours ago | parent [-]

> static, bounded problem

What does it even mean in neural net context?

> numerical stability

It would also be nice to expand on this a bit.