energy123 5 days ago

It's a little more inductive bias. That's not necessarily a step backwards. You need the right amount of inductive bias for a given data size and model capacity, no more and no less. Transformers already build in the inductive bias of temporal locality by being causal.
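To make the causal point concrete, here's a rough sketch of what the mask does, in plain Python rather than any real framework. The function name `causal_attention_weights` and the uniform toy scores are made up for illustration; the only thing it's meant to show is that position i is structurally prevented from attending to any position after it, regardless of what the model learns.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax where position i only sees positions j <= i.

    `scores` is a square list of lists of raw attention scores.
    (Hypothetical helper for illustration, not a library API.)
    """
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]            # the causal mask: drop j > i
        peak = max(visible)                     # subtract max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


# Toy input: every score equal, so any asymmetry comes from the mask alone.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)

for row in weights:
    print([round(w, 3) for w in row])
```

Even with identical scores everywhere, the output is lower-triangular: that constraint is baked into the architecture rather than learned from data, which is the sense in which it counts as an inductive bias.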