danpalmer 5 days ago

Hardware is a factor here. GPUs are necessarily higher latency than TPUs for equivalent compute on equivalent data. There are lots of other factors, but latency specifically favours TPUs.

The only non-TPU fast models I'm aware of are things running on Cerebras, which can be much faster because of their CPUs, and Grok, which has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.

nl 4 days ago | parent | next [-]

> GPUs are necessarily higher latency than TPUs for equivalent compute on equivalent data.

Where are you getting that? All the citations I've seen say the opposite, eg:

> Inference Workloads: NVIDIA GPUs typically offer lower latency for real-time inference tasks, particularly when leveraging features like NVIDIA's TensorRT for optimized model deployment. TPUs may introduce higher latency in dynamic or low-batch-size inference due to their batch-oriented design.

https://massedcompute.com/faq-answers/
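
To illustrate what that "batch-oriented design" point means for latency, here is a toy model with made-up numbers (not a measurement of any real system): a serving path that waits to fill a batch pays queueing delay on top of compute time, while a batch-1 path pays only compute.

    # Toy model of per-request latency under batched serving.
    # All numbers are illustrative placeholders, not real measurements.
    def per_request_latency_ms(arrival_rate_rps, batch_size, batch_compute_ms):
        # Average wait to fill a batch is roughly half the fill time
        # for uniform arrivals, plus the compute time for the batch.
        fill_time_ms = 1000.0 * batch_size / arrival_rate_rps
        return 0.5 * fill_time_ms + batch_compute_ms

    print(per_request_latency_ms(100, 1, 20))   # 25.0 ms: batch-1 path
    print(per_request_latency_ms(100, 32, 20))  # 180.0 ms: batch-32 path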

> The only non-TPU fast models I'm aware of are things running on Cerebras, which can be much faster because of their CPUs, and Grok, which has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.

Both Cerebras and Grok have custom AI-processing hardware (not CPUs).

The knowledge grounding thing seems unrelated to the hardware, unless you mean something I'm missing.

danpalmer 4 days ago | parent | next [-]

I thought it was generally accepted that inference was faster on TPUs. This was one of my takeaways from the LLM scaling book: https://jax-ml.github.io/scaling-book/ – TPUs just do less work, and data needs to move around less for the same amount of processing compared to GPUs. This would lead to lower latency as far as I understand it.
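
Roughly, the book's framing is a roofline bound: a step can't finish faster than the slower of its compute time and its data-movement time, so a chip that has to move fewer bytes per step has a lower latency floor whenever the step is memory-bound. A minimal sketch with placeholder numbers (not real chip specs):

    # Roofline-style lower bound: a step can't be faster than the slower
    # of "do the FLOPs" and "move the bytes". Placeholder numbers only.
    def step_time_s(flops, bytes_moved, peak_flops, hbm_bw):
        compute_time = flops / peak_flops
        memory_time = bytes_moved / hbm_bw
        return max(compute_time, memory_time)

    # A batch-1 decode-style step: tiny FLOP count, lots of weight bytes,
    # so the memory term dominates and bytes moved sets the latency.
    print(step_time_s(flops=1.4e11, bytes_moved=1.4e11,
                      peak_flops=1e15, hbm_bw=3e12))  # ~0.047 s, memory-bound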

The citation link you provided takes me to a sales form, not an FAQ, so I can't see any further detail there.

> Both Cerebras and Grok have custom AI-processing hardware (not CPUs).

I'm aware of Cerebras' custom hardware. I agree with the other commenter here that I haven't heard of Grok having any. My point about knowledge grounding was simply that Grok may be achieving its latency with guardrail/knowledge/safety trade-offs instead of custom hardware.

nl 4 days ago | parent [-]

Sorry, I meant Groq's custom hardware, not Grok!

I don't see any latency comparisons in the link.

danpalmer 4 days ago | parent [-]

The link is just to the book; the details are scattered throughout. That said, the page on GPUs specifically speaks to some of the hardware differences, how TPUs are more efficient for inference, and which of those differences would lead to lower latency.

https://jax-ml.github.io/scaling-book/gpus/#gpus-vs-tpus-at-...

Re: Groq, that's a good point; I had forgotten about them. You're right that they too are doing a TPU-style systolic array processor for lower latency.

mips_avatar 4 days ago | parent | prev [-]

I'm pretty sure xAI exclusively uses Nvidia H100s for Grok inference, but I could be wrong. I agree that I don't see why TPUs would necessarily explain the latency.

danpalmer 4 days ago | parent [-]

To be clear, I'm only suggesting that hardware is a factor here; it's far from the only reason. The parent commenter corrected themselves: it was actually Groq, not Grok, that they were thinking of, and I believe they're right about that, as Groq is doing something similar to TPUs to accelerate inference.

jrk 4 days ago | parent | prev [-]

Why are GPUs necessarily higher latency than TPUs? Both require roughly the same arithmetic intensity and use the same memory technology at roughly the same bandwidth.
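
As a rough sanity check of the bandwidth point (ballpark public figures, and a model size picked only for illustration): a batch-1 decode step has to stream the weights out of HBM once per token, so the per-token latency floor is roughly weight bytes over bandwidth, and that floor comes out in the same ballpark on both chip families.

    # Rough per-token latency floor at batch size 1: every weight byte
    # must be read from HBM once per decode step. Bandwidths below are
    # approximate public figures; treat them as ballpark only.
    weight_bytes = 70e9 * 2          # e.g. a 70B-parameter model in bf16
    for name, bw in [("H100 (HBM3)", 3.35e12), ("TPU v5p", 2.77e12)]:
        ms = weight_bytes / bw * 1000
        print(f"{name}: >= {ms:.0f} ms/token on a single chip")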

eru 4 days ago | parent | next [-]

And our LLMs still have latencies well into the human-perceptible range. If there's any necessary, architectural difference in latency between TPU and GPU, I'm fairly sure it would be far below that.

danpalmer 4 days ago | parent | prev [-]

My understanding is that TPUs do not use memory in the same way. GPUs need to do significantly more store/fetch operations from HBM, whereas TPUs pipeline data through systolic arrays far more. From what I've heard, this generally improves latency and also reduces the overhead of supporting large context windows.
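
To make "pipeline data through systolic arrays" concrete, here's a toy output-stationary systolic-array simulation: operands enter at the array edges and hop between neighbouring PEs each cycle instead of every PE fetching its operands from memory. It's purely illustrative of the dataflow, not a claim about actual TPU vs GPU HBM traffic.

    import numpy as np

    def systolic_matmul(A, B):
        """Toy output-stationary systolic array computing A @ B.
        Each PE holds one output element; A values flow right and B
        values flow down, passed PE-to-PE rather than re-fetched."""
        n, k = A.shape
        _, m = B.shape
        C = np.zeros((n, m))
        a_reg = np.zeros((n, m))  # A operand currently held by each PE
        b_reg = np.zeros((n, m))  # B operand currently held by each PE
        for t in range(n + m + k - 2):
            # Shift: A operands move one PE right, B operands one PE down.
            a_reg[:, 1:] = a_reg[:, :-1].copy()
            b_reg[1:, :] = b_reg[:-1, :].copy()
            # Feed the left and top edges with skewed inputs.
            for i in range(n):
                a_reg[i, 0] = A[i, t - i] if 0 <= t - i < k else 0.0
            for j in range(m):
                b_reg[0, j] = B[t - j, j] if 0 <= t - j < k else 0.0
            # Every PE does one multiply-accumulate per cycle.
            C += a_reg * b_reg
        return C

    A = np.arange(6.0).reshape(2, 3)
    B = np.arange(12.0).reshape(3, 4)
    assert np.allclose(systolic_matmul(A, B), A @ B)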