miven a day ago

I'm really glad these HNet-inspired approaches are getting traction; I'm a big fan of that paper.

Though I wonder how much of the gain in this case actually comes from the 75% extra parameters relative to the baseline, even though inference FLOPs are matched.

I can't help but see this as just a different twist on the parameter-sparsity idea leveraged by MoE models, since those also gain performance at constant forward-pass FLOPs because of extra parameters, roughly as in the sketch below.
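
For concreteness, here's a minimal sketch of what I mean (plain PyTorch, all names and sizes are mine, not from the paper): a top-k MoE feed-forward layer whose parameter count grows with the number of experts while each token's forward FLOPs are fixed by k.

    import torch
    import torch.nn as nn

    class TopKMoE(nn.Module):
        # Illustrative top-k mixture-of-experts FFN; hyperparameters are arbitrary.
        def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(d_model, num_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            ])

        def forward(self, x):  # x: (num_tokens, d_model)
            scores = self.router(x)                     # (num_tokens, num_experts)
            weights, idx = scores.topk(self.k, dim=-1)  # each token picks its top-k experts
            weights = weights.softmax(dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.k):                  # only k experts run per token
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

Total parameters scale roughly linearly with num_experts, but every token only passes through k of them, so compute per token stays flat. That "more parameters at matched FLOPs" lever is what I suspect is doing some of the work here too.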