| ▲ | kouteiheika 2 days ago | |||||||
> What does it mean that only 3B parameters are active at a time?
In a nutshell: LLMs generate tokens one at a time. "Only 3B parameters active at a time" means that for each of those tokens only 3B parameters need to be fetched from memory, instead of all 30B. | ||||||||
| ▲ | tgv 2 days ago | parent [-] | |||||||
Then I don't understand why it would matter. Or does it really mean that for each input token only 10% of the total network runs, and then a possibly different 10% for the next token, rather than running all ten 10% slices for every token? If so, any idea or pointer to how the selection works? | ||||||||
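On the selection question, here is a minimal sketch, assuming the model is a standard mixture-of-experts (MoE) network with top-k routing; the thread does not confirm the exact scheme. In that design a small "router" matrix scores every expert for the current token, and only the best-scoring experts are run, so only their weights have to be read from memory. All names and sizes below (`NUM_EXPERTS`, `TOP_K`, `DIM`, `moe_layer`) are hypothetical and picked to mirror the 10% figure, not taken from any real model.

```python
import math
import random

# Hypothetical sizes: 10 experts with 1 active per token, so roughly 10%
# of the expert weights are touched for each token.
NUM_EXPERTS = 10
TOP_K = 1
DIM = 4

rng = random.Random(0)


def rand_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


router = rand_matrix(NUM_EXPERTS, DIM)  # one score row per expert
experts = [rand_matrix(DIM, DIM) for _ in range(NUM_EXPERTS)]


def moe_layer(token_vec):
    """Run only the TOP_K best-scoring experts for this token."""
    scores = matvec(router, token_vec)
    chosen = sorted(range(NUM_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    weights = softmax([scores[i] for i in chosen])
    out = [0.0] * DIM
    for w, i in zip(weights, chosen):
        # Only experts[i] is read here; the other experts stay untouched.
        y = matvec(experts[i], token_vec)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, chosen


if __name__ == "__main__":
    for t in range(5):
        vec = [rng.gauss(0, 1) for _ in range(DIM)]
        _, chosen = moe_layer(vec)
        print(f"token {t}: experts used {chosen}")
```

So yes, under this assumption each token runs a different small subset rather than all ten slices. In MoE transformers the routing is typically done separately in every MoE layer, so the subset can also differ from layer to layer for the same token.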