coredog64 2 hours ago
My take is it's the inference efficiency. It's one thing to have a huge GPU cluster for training, but come inference time you don't need nearly so much. Having the TPU (and models purpose-built for TPU) gives them the best serving cost at hyperscale.
martinald 16 minutes ago
Yes, potentially - but the OG TPUs were actually poorly suited to LLM usage: they were designed for far smaller models with more parallelism in execution. They've obviously adapted the design since, but optimising that much in hardware is a risk - if there's another jump in model architecture, a narrow, specialised set of hardware may not generalise well enough.