coredog64 2 hours ago

My take is it's the inference efficiency. It's one thing to have a huge GPU cluster for training, but come inference time you don't need nearly so much. Having the TPU (and models purpose-built for the TPU) allows for the best cost when serving at hyperscale.
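To put rough numbers on that (everything below is an illustrative assumption, not a figure from the thread; it just uses the standard ~6*N*D training-FLOP and ~2*N per-token inference-FLOP approximations), cumulative serving compute overtakes the one-off training run surprisingly quickly at hyperscale:

    # Back-of-envelope FLOP accounting; all numbers are assumed for illustration.
    # Standard approximations: training takes ~6*N*D FLOPs for N parameters over
    # D tokens, and a forward pass takes ~2*N FLOPs per generated token.
    N = 100e9                              # assumed model size: 100B parameters
    D = 10e12                              # assumed training corpus: 10T tokens
    train_flops = 6 * N * D                # ~6e24 FLOPs, paid once

    tokens_per_day = 1e9 * 1_000           # assumed 1B requests/day, 1k tokens each
    serve_flops_per_day = 2 * N * tokens_per_day   # ~2e23 FLOPs, every day

    days_to_match = train_flops / serve_flops_per_day
    print(f"serving compute matches the training run after ~{days_to_match:.0f} days")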

martinald 16 minutes ago | parent

Yes, potentially - but the OG TPUs were actually very poorly suited to LLM usage: they were designed for far smaller models with much more parallelism in execution.

They've obviously adapted the design since, but optimising in hardware like that is a risk: if there's another jump in model architecture, a narrowly specialised set of hardware may not generalise well enough.

zozbot234 14 minutes ago | parent

Prefill has a lot of parallelism, and so does decode with a larger context (very common with agentic tasks). People like to say "old inference chips are no good for LLM use" but that's not really true.
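A quick sketch of that point (the dimensions below are made-up assumptions, and real kernels also batch across heads and requests): prefill scores the whole prompt in one large matmul, and even a single decode step still multiplies against the entire KV cache, so a long context leaves plenty of parallel work per step:

    import numpy as np

    # Assumed dimensions for illustration: d_model=1024, 4k-token context.
    d_model, n_ctx = 1024, 4096
    rng = np.random.default_rng(0)

    k_cache = rng.standard_normal((n_ctx, d_model)).astype(np.float32)

    # Prefill: every prompt token attends to every other token, so the scores
    # come from one big [n_ctx, d] x [d, n_ctx] matmul: highly parallel work.
    q_prefill = rng.standard_normal((n_ctx, d_model)).astype(np.float32)
    prefill_scores = q_prefill @ k_cache.T        # shape (4096, 4096)

    # Decode: only one new query token per step, but it still multiplies against
    # the full KV cache ([1, d] x [d, n_ctx]), so per-step work grows with
    # context length rather than vanishing.
    q_decode = rng.standard_normal((1, d_model)).astype(np.float32)
    decode_scores = q_decode @ k_cache.T          # shape (1, 4096)

    print(prefill_scores.shape, decode_scores.shape)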