jszymborski a day ago
That stands to reason, but in practice I find that it's often not the case. My suspicion is that it's hard to establish that your method is superior to another if, for example, it takes 10-100x the compute to train a model. This is largely due to the fact that machine learning is currently a deeply empirical field. Nvidia isn't likely to start releasing updated firmware for an obscure architecture for which there is limited evidence of improvement, and even less adoption.
kouteiheika 14 hours ago | parent
Indeed. Especially since a lot of papers cherry-pick results that show some improvement just so they can publish, but their method doesn't work that well when it comes in contact with reality (e.g. see the deluge of papers claiming to have come up with an optimizer better than AdamW). And the majority of people don't even properly benchmark their new methods with respect to time overhead: no, it doesn't matter that your method achieves 1% better loss if it takes 10% longer to train, because if I'd trained for 10% longer without your method I'd get an even better loss. And don't even get me started on people not tuning their baselines. I've been burned way too many times by fancy new methods that claimed improvements, where I spent a ton of effort implementing them and they ended up being poop. Every person working in the field and pushing papers should read this blog post and apply what's written in it: https://kellerjordan.github.io/posts/muon/#discussion-solvin...
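The time-overhead point can be made concrete with a toy calculation. This is just a sketch under an assumed (and entirely made-up) power-law relationship between training time and loss, loss(t) = a * t^(-b); the constants below are illustrative, not measured, and the conclusion flips if the exponent is small enough:

```python
# Toy time-matched baseline comparison. Assumes (hypothetically) that
# loss follows a power law in wall-clock training time: loss(t) = a * t**(-b).
# All constants here are made up for illustration.

def loss(t, a=10.0, b=0.15):
    """Hypothetical loss after t units of wall-clock training time."""
    return a * t ** (-b)

T = 1000.0                     # baseline's training time budget
baseline = loss(T)             # baseline trained for time T
new_method = 0.99 * loss(T)    # "1% better loss", but 10% slower per step,
                               # so it actually consumes 1.1 * T of wall clock
fair_baseline = loss(1.1 * T)  # baseline given that same 1.1 * T budget

print(f"baseline at T:       {baseline:.4f}")
print(f"new method at 1.1*T: {new_method:.4f}")
print(f"baseline at 1.1*T:   {fair_baseline:.4f}")
```

With these particular made-up numbers the time-matched baseline comes out ahead of the "improved" method, which is exactly why comparing at equal step counts instead of equal wall clock can be misleading.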