miven · a day ago
I'm really glad that these HNet-inspired approaches are getting traction, I'm a big fan of that paper. Though I wonder how much of the gains here actually come from the 75% extra parameters relative to the baseline, even if the inference FLOPs are matched. I can't help but see this as just a different twist on the sparse parameter use idea that MoE models leverage, since those also gain performance at constant forward-pass FLOPs thanks to their extra parameters.
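To make the MoE comparison concrete, here's a rough back-of-the-envelope sketch (all sizes here are made up for illustration, not taken from either paper): with top-1 routing, per-token FFN FLOPs stay roughly constant while total parameter count scales with the number of experts.

```python
# Hypothetical sizes, purely illustrative.
d_model = 1024
d_ff = 4096
n_experts = 8   # assumed expert count
top_k = 1       # experts active per token

ffn_params = 2 * d_model * d_ff            # up + down projection of one FFN
dense_total_params = ffn_params
moe_total_params = n_experts * ffn_params  # total parameters grow with experts

# Forward-pass FLOPs per token (~2 * parameters actually touched)
dense_flops_per_token = 2 * ffn_params
moe_flops_per_token = 2 * top_k * ffn_params  # matched to dense when top_k = 1

print(f"params  — dense: {dense_total_params:,}  MoE: {moe_total_params:,}")
print(f"FLOPs/token — dense: {dense_flops_per_token:,}  MoE: {moe_flops_per_token:,}")
```

So the MoE holds 8x the parameters at the same per-token compute, which is the same "more capacity per FLOP" effect the 75% extra parameters could be providing here.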