| ▲ | Legend2440 a day ago |
LLMs are enormously bandwidth-hungry. You have to shuffle your 800GB neural network in and out of memory for every token, which can take more time and energy than actually doing the matrix multiplies. Even GPUs barely have enough bandwidth.
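The arithmetic behind this claim can be sketched quickly. Assuming a dense model where every weight must be read once per generated token (batch size 1, no caching tricks), decode speed is capped by memory bandwidth divided by model size. The 3 TB/s figure below is an illustrative accelerator bandwidth, not a number from the thread:

```python
# Back-of-envelope bandwidth estimate for dense LLM decoding.
# Assumption: each of the 800 GB of weights is read once per token.

WEIGHTS_BYTES = 800e9  # 800 GB of weights, as in the comment


def tokens_per_second(mem_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode speed:
    one full sweep over the weights per token."""
    return mem_bandwidth_bytes_per_s / WEIGHTS_BYTES


# Hypothetical accelerator with ~3 TB/s of memory bandwidth:
print(tokens_per_second(3e12))        # ≈ 3.75 tokens/s per device

# Bandwidth needed for 100 tokens/s on a single stream:
print(100 * WEIGHTS_BYTES / 1e12)     # 80.0 TB/s
```

This is why "more time/energy than the matmuls" is plausible: the compute per token is fixed, but every token forces a full pass over the weights.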
| ▲ | socketcluster 20 hours ago | parent | next [-] |
But even so, for a single user, the output rate of a very fast LLM is around 100 tokens per second. With graphics, we're talking about 2 million pixels, 60 times a second: roughly 120 million pixels per second for a standard high-res screen. Big difference between 100 tokens and 120 million pixels. And 24-bit pixels give 16 million possible colors, which for tokens is probably enough to represent every word in the combined vocabularies of every major national language on earth.

> You have to shuffle your 800GB neural network in and out of memory

Do you really, though? That seems more like a constraint imposed by graphics cards. A specialized AI chip would keep the weights and all parameters in memory/hardware right where they are and update them in situ, which seems a lot more efficient. I think people adopted this approach because graphics cards happen to have such high bandwidth, but it looks suboptimal. Ideally, only the inputs and outputs would need to move on and off the chip. The shuffling should be seen as an inefficiency, a tradeoff for a certain kind of flexibility in the software stack, but one that wastes a huge number of CPU cycles moving data between RAM, CPU cache, and graphics-card memory.
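The throughput comparison above checks out numerically. A quick sketch, assuming the "standard high res screen" means 1920×1080 at 60 Hz with 24-bit color as the comment describes:

```python
# Sanity check of the pixels-vs-tokens comparison.
# Assumption: "standard high res" = 1920x1080 at 60 Hz, 24-bit color.

pixels_per_frame = 1920 * 1080           # ≈ 2 million pixels
pixels_per_second = pixels_per_frame * 60
colors = 2 ** 24                          # distinct 24-bit pixel values

print(pixels_per_second)  # 124,416,000 -> the "120 million pixels/s"
print(colors)             # 16,777,216 -> the "16 million colors"
```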
| ▲ | Zambyte a day ago | parent | prev [-] |
This doesn't seem right. Where is it shuffling to and from? My drives aren't fast enough to reload the model from disk for every token, and I don't have enough system memory to unload models to.