xscott 5 hours ago

Of course I can't be certain, but I think the "mixture of experts" design plays into it too. Metaphorically, there's a mid-level manager who looks at your prompt and tries to decide which experts it should be sent to. If he thinks you won't notice, he saves money by sending it to the undergraduate intern.

Just a theory.

victorbjorklund 5 hours ago | parent [-]

Note that MoE isn't different experts for different types of problems. Routing happens per token and isn't really connected to the problem type.

So if you send some Python code, the first token in a function can go to one expert, the second to another expert, and so on.
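
To make the per-token point concrete, here is a minimal sketch of top-k routing in plain Python. Everything in it is an assumption for illustration: the router scores are random numbers rather than the output of a learned layer, and the names `route_tokens`, `router_logits`, and `top_k` are hypothetical, not taken from any particular model or library.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_tokens(router_logits, top_k=2):
    """For each token, return its top_k (expert_index, weight) pairs."""
    routes = []
    for logits in router_logits:
        probs = softmax(logits)
        chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
        # Renormalize so the selected experts' weights sum to 1.
        norm = sum(probs[i] for i in chosen)
        routes.append([(i, probs[i] / norm) for i in chosen])
    return routes

# Hypothetical router scores for four tokens and four experts.
# In a real model these come from a learned router, not a random generator.
tokens = ["def", "add", "(", "a"]
rng = random.Random(0)
router_logits = [[rng.gauss(0, 1) for _ in range(4)] for _ in tokens]

routes = route_tokens(router_logits, top_k=2)
for tok, route in zip(tokens, routes):
    print(tok, route)
```

Each token gets its own expert selection, so neighbouring tokens in the same function can land on different experts regardless of what the prompt is about.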

dotancohen 2 hours ago | parent [-]

Can you back this up with documentation? I don't believe that this is the case.

pixelmelt an hour ago | parent [-]

Check out Unsloth's REAP models. You can outright delete a few of the lesser-used experts without the model going braindead, since they can all handle each token but some are better suited to do so.
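
A toy sketch of the idea, assuming the simplest possible criterion: count how often each expert is selected, drop the least-used one, and let its tokens fall back to the next-best expert. This is not the actual REAP scoring method, and the router scores and the helper `top_k_experts` are made up for illustration.

```python
from collections import Counter

def top_k_experts(logits, k, dropped=frozenset()):
    """Indices of the k highest-scoring experts, skipping dropped ones."""
    alive = [i for i in range(len(logits)) if i not in dropped]
    return sorted(alive, key=lambda i: logits[i], reverse=True)[:k]

# Hypothetical router scores: 5 tokens x 4 experts.
router_logits = [
    [2.0, 1.0, 0.1, -1.0],
    [0.5, 2.5, 0.2, -0.5],
    [1.5, 0.3, 1.0, -2.0],
    [0.2, 1.8, 1.9, -1.0],
    [2.2, 0.1, 0.3, 0.4],
]
num_experts = 4

# Count how often each expert is picked under top-2 routing.
usage = Counter(i for row in router_logits for i in top_k_experts(row, 2))
least_used = min(range(num_experts), key=lambda i: usage[i])

# "Delete" that expert: every token still finds two experts to run on.
pruned_routes = [top_k_experts(row, 2, dropped={least_used}) for row in router_logits]
print("dropped expert:", least_used)
print("routes after pruning:", pruned_routes)
```

The sketch only shows that routing stays well-defined after an expert is removed. Whether output quality holds up is an empirical question it cannot answer.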