jychang 7 hours ago
They didn't do something stupid like Llama 4's "one active expert", but 4 of 256 is still very sparse. It's not going to get close to DeepSeek or GLM-level performance unless they trained on the benchmarks. I don't think that was a good move; no other model uses a routing ratio this sparse.
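For scale, here's a quick back-of-envelope sketch in Python (the routed-expert counts for DeepSeek-V3, GLM-4.5, and Llama 4 Maverick are as I recall them from their published configs, so treat the exact numbers as assumptions):

  # Fraction of routed experts active per token for a few MoE models.
  # (active routed experts, total routed experts); shared experts excluded.
  configs = {
      "this model":       (4, 256),  # 4 routed of 256
      "DeepSeek-V3":      (8, 256),  # 8 routed of 256, +1 shared
      "GLM-4.5":          (8, 160),  # 8 routed of 160, +1 shared
      "Llama 4 Maverick": (1, 128),  # 1 routed of 128, +1 shared
  }
  for name, (active, total) in configs.items():
      print(f"{name:>16}: {active}/{total} = {active / total:.2%} active")

That puts 4/256 at about 1.6% of routed experts active per token, half of DeepSeek-V3's ratio and roughly a third of GLM-4.5's.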