southp 5 days ago
It's fascinating, even though my knowledge of LLMs is so limited that I don't really understand what's happening. I'm curious how the examples are plotted and how closely they resemble the real models, though. If one day we could reliably map an LLM into modules like this with an algorithm, does that mean we could turn LLMs into chips rather than data centers?
southp 2 days ago
I'm new to this area and I've learned a lot from the replies. Thanks for sharing, folks :) Just to clarify, when I said "to turn LLMs into chips", I didn't mean running them on a CPU/GPU/TPU or any other general-purpose computing unit, but hardwiring the entire LLM as a chip. Thinking about it again, the answer is likely yes, since the model is serializable. However, given how fast the models are evolving, the business value might be quite dim at the moment.
visarga 5 days ago
The resemblance is pretty good; they can't show every detail because the diagram would be hard to read, but the essential parts are there. I find the model to be extremely simple: you can write the attention equation on a napkin. This is the core idea:

    Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

The attention process itself is based on an all-to-all similarity calculation, Q * K^T.
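For anyone who wants to see it run, here's a minimal NumPy sketch of that equation (the shapes and names are illustrative, not taken from the article):

    import numpy as np

    def attention(Q, K, V):
        # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)        # all-to-all similarity
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)  # row-wise softmax
        return w @ V

    # toy example: 4 tokens, d_k = 8
    Q, K, V = (np.random.randn(4, 8) for _ in range(3))
    print(attention(Q, K, V).shape)            # (4, 8)

Each row of the softmax output is a weighting over all tokens, which is where the all-to-all part comes from.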
nl 5 days ago
LLMs already run on chips. You can run one on your phone. Having said that, it's interesting to point out that the modular structure is what allows CPU offload. It's fairly common to run some parts on the CPU and others on the GPU/NPU/TPU, depending on your configuration. This has some performance cost but allows more flexibility.
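For example, with Hugging Face transformers plus accelerate that GPU/CPU split is a single keyword argument (the model id below is just a placeholder):

    from transformers import AutoModelForCausalLM

    # device_map="auto" fills the GPU with as many layers as fit
    # and spills the remaining layers to CPU RAM
    model = AutoModelForCausalLM.from_pretrained(
        "some-org/some-llm",   # hypothetical model id
        device_map="auto",
    )
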
yapyap 5 days ago
In my understanding, the data centers are mostly for scaling, so that many people can use an LLM service at once, and for training, so that training a new LLM's weights won't take months to years because of GPU constraints. It's already possible to run an LLM off chips, depending of course on the LLM and the chip.
xwolfi 5 days ago
... you can run a good LLM on a MacBook.
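E.g. with llama-cpp-python and a local quantized GGUF file (the path here is a placeholder), something like:

    from llama_cpp import Llama

    # model_path is a placeholder; any quantized GGUF model works
    llm = Llama(model_path="./model.gguf", n_ctx=2048)
    out = llm("Explain attention in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])
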