| ▲ | zozbot234 2 hours ago | |
Ultimately it would amount to lazy-loading the model: the parameters themselves would be fetched from the network as needed, which still decreases time-to-first-token. It's true that "expert" choices will span most of the model regardless of any particular "subject" or "topic", but if all we care about is time-to-first-token, it's still a viable strategy.
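
To make the idea concrete, here is a rough sketch of fetch-on-first-use for expert weights. Everything in it is hypothetical: `LazyExpertStore`, `fake_network_fetch`, and the routing lists are illustrative stand-ins, not any real inference engine's API, and the "weights" are just small lists of floats.

```python
from typing import Callable, Dict, List


class LazyExpertStore:
    """Hypothetical cache that fetches an expert's weights on first use."""

    def __init__(self, fetch: Callable[[int], List[float]]) -> None:
        self._fetch = fetch                      # assumed network fetch function
        self._cache: Dict[int, List[float]] = {}
        self.fetch_count = 0                     # number of (simulated) network fetches

    def get(self, expert_id: int) -> List[float]:
        # Only go to the "network" if this expert has never been requested.
        if expert_id not in self._cache:
            self._cache[expert_id] = self._fetch(expert_id)
            self.fetch_count += 1
        return self._cache[expert_id]


def fake_network_fetch(expert_id: int) -> List[float]:
    # Stand-in for a real download; returns dummy weights for the expert.
    return [float(expert_id)] * 4


def run_token(store: LazyExpertStore, routed_experts: List[int]) -> None:
    # A token only touches the experts the router picked for it.
    for expert_id in routed_experts:
        store.get(expert_id)


store = LazyExpertStore(fake_network_fetch)

# First token: only two experts need to be resident before it can be produced.
run_token(store, [1, 5])
fetched_for_first_token = store.fetch_count

# Later tokens keep pulling in new experts, so the resident set grows
# toward the whole model, which is the caveat mentioned above.
for routing in ([5, 2], [7, 1], [0, 3], [4, 6]):
    run_token(store, routing)
fetched_overall = store.fetch_count
```

In this toy run the first token triggers two fetches while the full sequence ends up touching all eight experts, so the saving shows up in time-to-first-token and not in total bytes transferred.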