| ▲ | danpalmer 5 days ago |
Hardware is a factor here. GPUs are necessarily higher latency than TPUs for equivalent compute on equivalent data. There are lots of other factors, but latency specifically favours TPUs. The only non-TPU fast models I'm aware of are things running on Cerebras, which can be much faster because of their CPUs, and Grok, which has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.
| ▲ | nl 4 days ago | parent | next [-] |
> GPUs are necessarily higher latency than TPUs for equivalent compute on equivalent data.

Where are you getting that? All the citations I've seen say the opposite, e.g.:

> Inference Workloads: NVIDIA GPUs typically offer lower latency for real-time inference tasks, particularly when leveraging features like NVIDIA's TensorRT for optimized model deployment. TPUs may introduce higher latency in dynamic or low-batch-size inference due to their batch-oriented design.

https://massedcompute.com/faq-answers/

> The only non-TPU fast models I'm aware of are things running on Cerebras, which can be much faster because of their CPUs, and Grok, which has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.

Both Cerebras and Grok have custom AI-processing hardware (not CPUs). The knowledge-grounding thing seems unrelated to the hardware, unless you mean something I'm missing.
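To make the "batch-oriented design" point concrete, here's a toy queueing sketch (in Python, with invented arrival rates and step times, nothing measured on real hardware) of the extra latency a request eats when the server waits to fill a fixed batch before launching a step:

    # Toy model of batch-fill latency. All numbers are invented; the point
    # is the shape of the curve, not the absolute values.
    import random

    def mean_latency_s(batch_size, arrival_rate_rps, step_time_s, n=10_000):
        # Requests arrive as a Poisson process; the server launches a step
        # only once batch_size requests are waiting. Compute contention is
        # ignored, so this isolates the delay added by batching itself.
        random.seed(0)
        t, pending, latencies = 0.0, [], []
        for _ in range(n):
            t += random.expovariate(arrival_rate_rps)  # next arrival time
            pending.append(t)
            if len(pending) == batch_size:
                finish = t + step_time_s  # batch is full: run one step
                latencies.extend(finish - a for a in pending)
                pending = []
        return sum(latencies) / len(latencies)

    for b in (1, 8, 32):
        print(f"batch={b:2d}  mean latency ~ {mean_latency_s(b, 50, 0.01):.3f} s")

At batch 1 the latency is just the step time; at larger batches most of it is waiting for the batch to fill, which is the trade-off the quoted FAQ is gesturing at.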
| ▲ | jrk 4 days ago | parent | prev [-] |
Why are GPUs necessarily higher latency than TPUs? Both require roughly the same arithmetic intensity and use the same memory technology at roughly the same bandwidth.
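For a rough sanity check of that claim: the roofline lower bound on a kernel's latency is max(FLOPs / peak FLOP/s, bytes moved / bandwidth), so two chips with the same peak compute and the same memory bandwidth share the same floor no matter what sits between the HBM and the ALUs. A sketch with made-up specs (not real GPU or TPU numbers):

    # Roofline floor for one batch-1 decode step. Specs are illustrative
    # placeholders, not measurements of any real GPU or TPU.

    def latency_floor_s(flops, bytes_moved, peak_flops, mem_bw_bps):
        # A kernel can finish no faster than the slower of its compute time
        # and its memory-traffic time.
        return max(flops / peak_flops, bytes_moved / mem_bw_bps)

    params = 7e9                 # hypothetical 7B-parameter model
    flops = 2 * params           # ~2 FLOPs per parameter per token
    bytes_moved = 2 * params     # every fp16 weight byte read once from HBM

    for name, peak, bw in [("chip A (GPU-like)", 1e15, 3e12),
                           ("chip B (TPU-like)", 1e15, 3e12)]:
        ms = latency_floor_s(flops, bytes_moved, peak, bw) * 1e3
        print(f"{name}: floor ~ {ms:.2f} ms")

Batch-1 decode is memory-bound on both, so the floor is set entirely by bandwidth; the GPU/TPU distinction doesn't enter into it.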