mmmllm 5 days ago
Isn't that essentially how MoE models already work? Besides, if that were infinitely scalable, wouldn't we already have a subset of super-smart models at very high cost? And this would only apply to very few use cases anyway. For a lot of basic customer care work, programming, and quick research, I'd say LLMs are already quite good without running them 100x.
mcrutcher 5 days ago | parent | next
MoE models are pretty poorly named, since all the "experts" have the same architecture; "sparse activation" models would describe them better. "MoE" suggests heterogeneous experts that a trained "thalamus router" dispatches to, but that's not how they work.
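For concreteness, here is a minimal top-k routing sketch in PyTorch (class and parameter names are my own illustration, not any particular model's code). Every "expert" is a structurally identical feed-forward block, and the router is just a learned linear gate over tokens, not a semantic dispatcher:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoELayer(nn.Module):
        def __init__(self, dim, num_experts=8, top_k=2):
            super().__init__()
            # All experts have the same architecture; only their weights differ.
            self.experts = nn.ModuleList(
                [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                 for _ in range(num_experts)]
            )
            self.router = nn.Linear(dim, num_experts)  # learned gate, no built-in semantics
            self.top_k = top_k

        def forward(self, x):  # x: (num_tokens, dim)
            scores = self.router(x)                         # (num_tokens, num_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)  # pick k experts per token
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, k] == e                   # tokens routed to expert e
                    if mask.any():
                        out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
            return out

The "sparsity" is just that each token runs through top_k of num_experts identical blocks; there's no sense in which expert 3 "knows chemistry".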
amelius 5 days ago | parent | prev | next
> if that were infinitely scalable, wouldn't we have a subset of super-smart models already at very high cost

The compute/intelligence curve is not a straight line. It's more likely a curve that saturates, maybe at around 70% of human intelligence. More compute still means more intelligence, but you never reach 100% of human intelligence; it saturates well below that.
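To make that shape concrete, one hypothetical saturating form (purely illustrative, not a measured law) is

    I(c) = I_max * (1 - exp(-c / c0))

where c is compute and c0 sets how fast the gains flatten: each doubling of c still adds intelligence, but by ever-smaller amounts, and I(c) never exceeds the asymptote I_max.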
| |||||||||||||||||
mirekrusin 5 days ago | parent | prev
MoE is something different: it's a technique for activating only a small subset of parameters during inference. Whatever is good enough now can be made much better for the same cost (time, compute, dollars), and people will always choose better over worse.
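As a rough illustration with made-up numbers (8 experts, top-2 routing, layer widths loosely in the range of current open models), the parameters touched per token are a small fraction of the parameters stored:

    # Hypothetical sizes; only the ratio top_k/num_experts matters here.
    dim, hidden = 4096, 14336        # FFN input and hidden width
    num_experts, top_k = 8, 2        # experts stored vs. experts run per token
    per_expert = 2 * dim * hidden    # two weight matrices per expert FFN
    total_ffn = num_experts * per_expert   # what you pay for in memory
    active_ffn = top_k * per_expert        # what you pay for in compute per token
    print(active_ffn / total_ffn)    # 0.25 -- a quarter of the FFN weights per token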