londons_explore 2 hours ago

So why only 30,000 tokens per second?

If the chip is designed as the article says, they should be able to do 1 token per clock cycle...

And whilst I'm sure the propagation time is long through all that logic, it should still be able to do tens of millions of tokens per second...

wmf an hour ago

You still need to do a forward pass per token, and each pass depends on the token produced by the previous one. With massive batching and full pipelining you might be able to break the dependencies and output one token per cycle, but clearly they aren't doing that.
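
To put rough numbers on the pipelining argument, here is a minimal back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name `tokens_per_second`, the 1 GHz clock, and the 1000-cycle pipeline depth are made up and are not figures from the article or the chip. The model is simply that one sequence cannot start its next forward pass until the previous token has left the pipeline, while independent sequences can occupy the remaining pipeline slots.

```python
def tokens_per_second(clock_hz: float, pipeline_depth_cycles: int,
                      sequences_in_flight: int) -> float:
    """Aggregate tokens/s for an idealized, fully pipelined forward pass.

    Hypothetical model, not the real chip: a forward pass takes
    pipeline_depth_cycles cycles end to end, and a sequence must wait for
    its current token before it can start the next pass.
    """
    if pipeline_depth_cycles < 1 or sequences_in_flight < 1:
        raise ValueError("depth and sequence count must be at least 1")
    # Only as many pipeline slots can be busy as there are independent
    # sequences, and never more than the pipeline has stages.
    busy_slots = min(sequences_in_flight, pipeline_depth_cycles)
    return clock_hz * busy_slots / pipeline_depth_cycles


if __name__ == "__main__":
    clock_hz = 1e9   # assumed 1 GHz clock, illustrative only
    depth = 1000     # assumed cycles per forward pass, illustrative only
    for batch in (1, 10, 1000):
        rate = tokens_per_second(clock_hz, depth, batch)
        print(f"{batch} sequence(s) in flight: {rate:,.0f} tokens/s")
```

Under these assumed numbers, a single sequence gets clock / depth tokens per second, and the one-token-per-cycle rate only appears once there are at least as many independent sequences in flight as pipeline stages.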

menaerus an hour ago

Reading from and writing to memory alone takes much more than a clock cycle.