m4r1k 6 hours ago

Google's real moat isn't the TPU silicon itself—it's not about cooling, individual performance, or hyper-specialization—but rather the massive parallel scale enabled by their OCS interconnects.

To quote The Next Platform: "An Ironwood cluster linked with Google’s absolutely unique optical circuit switch interconnect can bring to bear 9,216 Ironwood TPUs with a combined 1.77 PB of HBM memory... This makes a rackscale Nvidia system based on 144 “Blackwell” GPU chiplets with an aggregate of 20.7 TB of HBM memory look like a joke."

Nvidia may have the superior architecture at the single-chip level, but for large-scale distributed training (and inference) they currently have nothing that rivals Google's optical switching scalability.
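To make the software side of that concrete, here is a minimal sketch (assuming a multi-chip TPU slice visible to JAX; shapes and sizes are made up) of how that scale is exposed to the programmer: you declare a mesh over every chip and let XLA handle the cross-chip collectives over the interconnect.

    import numpy as np
    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    devices = np.array(jax.devices())            # every chip in the slice
    mesh = Mesh(devices, axis_names=("data",))   # 1-D mesh across all chips
    sharding = NamedSharding(mesh, P("data"))    # shard the leading axis

    # Illustrative data; each chip holds 1/len(devices) of the rows.
    x = jax.device_put(jnp.ones((len(devices) * 1024, 512)), sharding)

    @jax.jit
    def step(x):
        # XLA inserts any cross-chip collectives needed for the reduction.
        return jnp.mean(x ** 2)

    print(step(x))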

thelastgallon 6 hours ago | parent | next [-]

Also, Google owns the entire vertical stack, which is what most people need. It can provide an entire spectrum of AI services far cheaper, at scale (and still profitable) via its cloud. Not every company needs to buy the hardware and build models, etc., etc.; what most companies need is an app store of AI offerings they can leverage. Google can offer this with a healthy profit margin, while others will eventually run out of money.

jauntywundrkind 4 hours ago | parent | next [-]

Google's work on JAX, PyTorch, TensorFlow, and the more general XLA underneath is exactly the kind of anti-moat everyone has been clamoring for.
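A rough illustration of what that anti-moat looks like in practice (a minimal sketch, nothing vendor-specific): the same jit-compiled JAX function runs on whatever backend XLA finds, CPU, GPU, or TPU, with no device-specific code.

    import jax
    import jax.numpy as jnp

    @jax.jit
    def predict(w, x):
        # XLA compiles this for whichever backend is present.
        return jnp.tanh(x @ w)

    x = jnp.ones((8, 128))
    w = jnp.ones((128, 32))

    print(jax.devices())        # e.g. CPU, CUDA GPU, or TPU cores
    print(predict(w, x).shape)  # (8, 32) on any of them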

morkalork 3 hours ago | parent [-]

Anti-moat like commoditizing the compliment?

sharpy 2 hours ago | parent | next [-]

If they get things like PyTorch to work well without caring what hardware it is running on, it erodes Nvidia's CUDA moat. Nvidia's chips are excellent, without doubt, but their real moat is the ecosystem around CUDA.
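As a rough sketch of that hardware-agnostic style (the torch_xla fallback is an assumption about the environment, not something every install has): the model code below never mentions CUDA directly, so the same script can target an NVIDIA GPU, a TPU host, or plain CPU.

    import torch

    def pick_device() -> torch.device:
        if torch.cuda.is_available():
            return torch.device("cuda")
        try:
            # Present only on TPU hosts with torch_xla installed.
            import torch_xla.core.xla_model as xm
            return xm.xla_device()
        except ImportError:
            return torch.device("cpu")

    device = pick_device()
    model = torch.nn.Linear(128, 32).to(device)
    x = torch.randn(8, 128, device=device)
    print(model(x).shape, device)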

qeternity 2 hours ago | parent [-]

PyTorch is only part of it. There is still a huge amount of CUDA that isn’t just wrapped by PyTorch and isn’t easily portable.

svara an hour ago | parent [-]

... but not in deep learning or am I missing something important here?

layer8 6 minutes ago | parent | prev [-]

*complement

gigatexal 3 hours ago | parent | prev [-]

With all this vertical integration, it's no wonder Apple and Google have such a tight relationship.

mrbungie 5 hours ago | parent | prev | next [-]

It's funny when you then read Nvidia's latest tweet [1] suggesting that their tech is still better, based on pure vibes like everything else in the (Gen)AI era.

[1] https://x.com/nvidianewsroom/status/1993364210948936055

qcnguy 19 minutes ago | parent | next [-]

Not vibes. TPUs have fallen behind or had to be redesigned from scratch many times as neural architectures and workloads evolved, whereas the more general purpose GPUs kept on trucking and building on their prior investments. There's a good reason so much research is done on Nvidia clusters and not TPU clusters. TPU has often turned out to be over-specialized and Nvidia are pointing that out.

pests 12 minutes ago | parent [-]

You say that like it's a bad thing. Nvidia architectures keep changing and getting more advanced as well, with specialized tensor operations, different accumulators and caches, etc. I see no issue with progress.

bigyabai an hour ago | parent | prev | next [-]

> based on pure vibes

The tweet gives their justification: CUDA isn't an ASIC. Nvidia GPUs were popular for crypto mining and protein folding, and now AI inference too. TPUs are tensor ASICs.

FWIW I'm inclined to agree with Nvidia here. Scaling up a systolic array is impressive but nothing new.
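For anyone unfamiliar, the systolic-array idea is the textbook one (toy NumPy illustration below, not TPU internals): a grid of multiply-accumulate cells streams operands through and builds the matmul as a sum of outer products.

    import numpy as np

    def systolic_matmul(A, B):
        # Each cell (i, j) accumulates A[i, t] * B[t, j]; in hardware the
        # t-loop is what flows through the array over successive cycles.
        n, k = A.shape
        _, m = B.shape
        C = np.zeros((n, m))
        for t in range(k):
            C += np.outer(A[:, t], B[t, :])
        return C

    A = np.random.rand(4, 3)
    B = np.random.rand(3, 5)
    assert np.allclose(systolic_matmul(A, B), A @ B)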

almostgotcaught 3 hours ago | parent | prev [-]

> NVIDIA is a generation ahead of the industry

a generation is 6 months

wmf 3 hours ago | parent [-]

For GPUs a generation is 1-2 years.

almostgotcaught 2 hours ago | parent [-]

no https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...

Arainach 2 hours ago | parent [-]

What in that article makes you think a generation is shorter?

* Turing: September 2018

* Ampere: May 2020

* Hopper: March 2022

* Lovelace (designed to work with Hopper): October 2022

* Blackwell: November 2024

* Next: December 2025 or later

With a single exception for Lovelace (arguably not a generation), there are multiple years between generations.

villgax 6 hours ago | parent | prev [-]

100 times more chips for equivalent memory, sure.

m4r1k 5 hours ago | parent | next [-]

Check the specs again. Per chip, TPU 7x has 192GB of HBM3e, whereas the NVIDIA B200 has 186GB.

While the B200 wins on raw FP8 throughput (~9000 vs 4614 TFLOPs), that makes sense given NVIDIA has optimized for the single-chip game for over 20 years. But the bottleneck here isn't the chip—it's the domain size.

NVIDIA's top-tier NVL72 tops out at an NVLink domain of 72 Blackwell GPUs. Meanwhile, Google is connecting 9216 chips at 9.6Tbps to deliver nearly 43 ExaFlops. NVIDIA has the ecosystem (CUDA, community, etc.), but until they can match that interconnect scale, they simply don't compete in this weight class.
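Quick back-of-envelope check of those pod-scale numbers (vendor figures quoted in this thread, not independent measurements):

    chips = 9216                     # Ironwood TPUs per pod
    hbm_per_chip_gb = 192            # HBM3e per chip
    fp8_tflops_per_chip = 4614       # dense FP8 per chip

    total_hbm_pb = chips * hbm_per_chip_gb / 1e6          # GB -> PB
    total_fp8_eflops = chips * fp8_tflops_per_chip / 1e6  # TFLOPs -> EFLOPs

    print(f"{total_hbm_pb:.2f} PB HBM")       # ~1.77 PB
    print(f"{total_fp8_eflops:.1f} EFLOPs")   # ~42.5 EFLOPs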

cwzwarich 3 hours ago | parent | next [-]

Isn’t the 9000 TFLOP/s number Nvidia’s relatively useless sparse FLOP count that is 2x the actual dense FLOP count?

PunchyHamster 2 hours ago | parent | prev [-]

Yet everyone uses NVIDIA, and Google is in the catch-up position.

The ecosystem is a MASSIVE factor and will remain one for all but the biggest models.

epolanski an hour ago | parent [-]

Catch-up in what, exactly? Google isn't building hardware to sell; they aren't in the same market.

Also, I feel you're missing that the question isn't how fast ONE GPU is versus ONE TPU; what matters is the cost for the same output. If I can fill a datacenter at half the cost for the same output, does it matter that I've used twice as many TPUs and that a single Nvidia Blackwell was faster? No...

And hardware cost isn't even the biggest problem; operational costs, mostly power and cooling, are another huge one.

So if you design a solution that fits your stack (it was designed for it) and optimize for your operational costs, you're light years ahead of a competitor using the more powerful solution that costs five times more in hardware and twice as much to operate.

All of this is more or less true for inference economics; I have no clue about training.
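A toy version of that arithmetic (every number below is an invented placeholder, purely to show the shape of the comparison, not real pricing or power data):

    # Hypothetical accelerator A: faster per chip, pricier, hotter.
    # Hypothetical accelerator B: slower per chip, cheaper, cooler.
    a = {"tokens_per_s": 2.0, "capex": 5.0, "watts": 2.0}
    b = {"tokens_per_s": 1.0, "capex": 1.0, "watts": 1.0}

    def cost_per_token(chip, power_price=1.0):
        # Relative units only: (hardware + energy) per unit of output.
        return (chip["capex"] + chip["watts"] * power_price) / chip["tokens_per_s"]

    print(cost_per_token(a))  # 3.5 per token-unit
    print(cost_per_token(b))  # 2.0 per token-unit: slower chip, cheaper output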

butvacuum an hour ago | parent [-]

Also, isn't memory a bit moot? At scale I thought that the ASICs frequently sat idle waiting for memory.

pests 9 minutes ago | parent [-]

You're doing operations on the data once it's been transferred to GPU memory: either shuffling it around various caches or processors, or feeding it into tensor cores and other matrix units. You don't want to be sitting idle.
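A minimal PyTorch sketch of that "don't sit idle" point (assumes a CUDA device; the sizes are arbitrary): batches are copied on a side stream from pinned host memory so the next transfer overlaps with the current compute.

    import torch

    assert torch.cuda.is_available()
    device = torch.device("cuda")
    copy_stream = torch.cuda.Stream()
    model = torch.nn.Linear(4096, 4096).to(device)
    batches = [torch.randn(1024, 4096).pin_memory() for _ in range(8)]

    def prefetch(batch):
        # Issue the host-to-device copy on the side stream.
        with torch.cuda.stream(copy_stream):
            return batch.to(device, non_blocking=True)

    nxt = prefetch(batches[0])
    for i in range(len(batches)):
        torch.cuda.current_stream().wait_stream(copy_stream)  # copy finished
        cur = nxt
        if i + 1 < len(batches):
            nxt = prefetch(batches[i + 1])   # overlaps with the matmul below
        out = model(cur)
    torch.cuda.synchronize()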

croon 6 hours ago | parent | prev | next [-]

Ironwood is 192GB, Blackwell is 96GB, right? Or am i missing something?

NaomiLehman 6 hours ago | parent | prev [-]

I think it's not about the cost but the limits of quickly accessible RAM