▲ nl | 4 days ago
> GPUs are necessarily higher latency than TPUs for equivalent compute on equivalent data.

Where are you getting that? All the citations I've seen say the opposite, e.g.:

> Inference Workloads: NVIDIA GPUs typically offer lower latency for real-time inference tasks, particularly when leveraging features like NVIDIA's TensorRT for optimized model deployment. TPUs may introduce higher latency in dynamic or low-batch-size inference due to their batch-oriented design.

https://massedcompute.com/faq-answers/

> The only non-TPU fast models I'm aware of are things running on Cerebras can be much faster because of their CPUs, and Grok has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.

Both Cerebras and Grok have custom AI-processing hardware (not CPUs). The knowledge-grounding thing seems unrelated to the hardware, unless you mean something I'm missing.
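The batch-size point in the quote above can be made concrete with a toy latency model: an accelerator tuned for large batches can win on throughput while losing on small-batch latency. A minimal sketch, with entirely made-up numbers (these are NOT measured figures for any real GPU or TPU, and the "chips" are hypothetical):

```python
# Toy model: request latency = fixed dispatch/pipeline overhead + per-item compute.
# Chip A (hypothetical): low fixed overhead, higher per-item cost.
# Chip B (hypothetical): batch-oriented pipeline -- higher fixed overhead,
# lower per-item cost once the pipeline is full.

def request_latency_ms(batch_size: int,
                       fixed_overhead_ms: float,
                       per_item_ms: float) -> float:
    """Latency of one batched inference step under the toy model."""
    return fixed_overhead_ms + batch_size * per_item_ms

for batch in (1, 64):
    a = request_latency_ms(batch, fixed_overhead_ms=0.5, per_item_ms=0.10)
    b = request_latency_ms(batch, fixed_overhead_ms=2.0, per_item_ms=0.05)
    print(f"batch={batch:3d}  chip A: {a:.2f} ms  chip B: {b:.2f} ms")
```

At batch size 1 the low-overhead chip finishes sooner; at batch size 64 the batch-oriented chip pulls ahead because its fixed cost is amortized. The toy model only illustrates why "which is lower latency" depends on the batch regime, not on raw compute alone.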
▲ danpalmer | 4 days ago
I thought it was generally accepted that inference was faster on TPUs. This was one of my takeaways from the LLM scaling book (https://jax-ml.github.io/scaling-book/): TPUs just do less work, and data needs to move around less for the same amount of processing compared to GPUs. As far as I understand it, that would lead to lower latency. The citation link you provided takes me to a sales form, not an FAQ, so I can't see any further detail there.

> Both Cerebras and Grok have custom AI-processing hardware (not CPUs).

I'm aware of Cerebras' custom hardware, but I agree with the other commenter here that I haven't heard of Grok having any. My point about knowledge grounding was simply that Grok may be achieving its latency with guardrail/knowledge/safety trade-offs instead of custom hardware.
▲ mips_avatar | 4 days ago
I'm pretty sure xAI exclusively uses Nvidia H100s for Grok inference, though I could be wrong. I agree that I don't see why TPUs would necessarily explain the latency.