▲ nl | 4 days ago
> GPUs are necessarily higher latency than TPUs for equivalent compute on equivalent data.

Where are you getting that? All the citations I've seen say the opposite, e.g.:

> Inference Workloads: NVIDIA GPUs typically offer lower latency for real-time inference tasks, particularly when leveraging features like NVIDIA's TensorRT for optimized model deployment. TPUs may introduce higher latency in dynamic or low-batch-size inference due to their batch-oriented design.

https://massedcompute.com/faq-answers/

> The only non-TPU fast models I'm aware of are things running on Cerebras can be much faster because of their CPUs, and Grok has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.

Both Cerebras and Grok have custom AI-processing hardware (not CPUs). The knowledge-grounding thing seems unrelated to the hardware, unless you mean something I'm missing.
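The batch-size point in the quote above can be made concrete with a toy latency model: an accelerator tuned for large batches can win on throughput while losing on small-batch latency. A minimal sketch, with entirely made-up numbers (these are NOT measured figures for any real GPU or TPU, and the "chips" are hypothetical):

```python
# Toy model: request latency = fixed dispatch/pipeline overhead + per-item compute.
# Chip A (hypothetical): low fixed overhead, higher per-item cost.
# Chip B (hypothetical): batch-oriented pipeline -- higher fixed overhead,
# lower per-item cost once the pipeline is full.

def request_latency_ms(batch_size: int,
                       fixed_overhead_ms: float,
                       per_item_ms: float) -> float:
    """Latency of one batched inference step under the toy model."""
    return fixed_overhead_ms + batch_size * per_item_ms

for batch in (1, 64):
    a = request_latency_ms(batch, fixed_overhead_ms=0.5, per_item_ms=0.10)
    b = request_latency_ms(batch, fixed_overhead_ms=2.0, per_item_ms=0.05)
    print(f"batch={batch:3d}  chip A: {a:.2f} ms  chip B: {b:.2f} ms")
```

At batch size 1 the low-overhead chip finishes sooner; at batch size 64 the batch-oriented chip pulls ahead because its fixed cost is amortized. The toy model only illustrates why "which is lower latency" depends on the batch regime, not on raw compute alone.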
▲ danpalmer | 4 days ago
I thought it was generally accepted that inference was faster on TPUs. This was one of my takeaways from the LLM scaling book (https://jax-ml.github.io/scaling-book/): TPUs just do less work, and data needs to move around less for the same amount of processing compared to GPUs. As far as I understand it, that would lead to lower latency. The citation link you provided takes me to a sales form, not an FAQ, so I can't see any further detail there.

> Both Cerebras and Grok have custom AI-processing hardware (not CPUs).

I'm aware of Cerebras' custom hardware, but I agree with the other commenter here that I haven't heard of Grok having any. My point about knowledge grounding was simply that Grok may be achieving its latency with guardrail/knowledge/safety trade-offs instead of custom hardware.
▲ mips_avatar | 4 days ago
I'm pretty sure xAI exclusively uses Nvidia H100s for Grok inference, though I could be wrong. I agree that I don't see why TPUs would necessarily explain the latency.