zackangelo | 7 hours ago
17B per token. So when you're generating a single stream of text ("decoding"), 17B parameters are active for each token. If you're decoding multiple streams concurrently, it's 17B per stream (some tokens will route to the same experts, so there is some overlap). When the model is ingesting the prompt ("prefilling"), it processes many tokens at once, so the number of active parameters is larger.
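A minimal sketch of why this happens, assuming a generic top-k MoE router (the expert count, top-k value, and random routing below are illustrative assumptions, not the model's actual architecture): each token activates only a few experts, but a long prompt's tokens collectively touch many more distinct experts than any single decode step does.

```python
import random

# Illustrative MoE layer config (assumed, not the real model's numbers)
NUM_EXPERTS = 16   # experts per MoE layer
TOP_K = 1          # experts routed per token

def route(token_id: int) -> set[int]:
    """Stand-in router: deterministically pick TOP_K experts for a token."""
    rng = random.Random(token_id)
    return set(rng.sample(range(NUM_EXPERTS), TOP_K))

# Decoding one stream: each step activates only TOP_K experts' weights.
decode_steps = [route(t) for t in range(8)]

# Prefilling a long prompt: the union of experts touched across all
# prompt tokens is far larger, so more total weights are active at once.
prompt_tokens = range(256)
touched = set().union(*(route(t) for t in prompt_tokens))

print(f"experts active per decode step: {TOP_K}")
print(f"distinct experts touched during prefill: {len(touched)} / {NUM_EXPERTS}")
```

With enough prompt tokens, the prefill pass ends up exercising essentially every expert in the layer, which is why batch prefill needs the full weight set resident even though any one token only uses a small slice of it.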