bluecoconut 4 hours ago

I almost feel like this goes opposite to what attention is good at. This would be good at approximating all the places where attention is low / not sharp. Where attention/the exponential is key is when it selects out / needle-in-haystack / winner-takes-all focus (the word "attention" itself sorta implies this), and this is where the Taylor expansion would fail to represent the values well. This just... softens attention's ability to attend?

(I'm imagining that if the context has ~4-8 "similar" attention targets that should be sharply distinguished, regular attention learns to select the correct one, while this Taylor-approximation version would wash out any difference: they'd all be loosely attended to, and it'd fail to isolate the correct signal.)
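A back-of-envelope version of that worry (not the paper's actual construction, just swapping exp for a low-order Maclaurin truncation and watching what happens to a winner-takes-all score pattern; the score magnitudes are made up, and whether real pre-softmax scores ever get this large is exactly the open question):

    import numpy as np
    from math import factorial

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def taylor_softmax(x, P=4):
        # replace exp(x) with its first P+1 Maclaurin terms, then normalize
        # (an illustrative stand-in, not necessarily the paper's formulation)
        approx = sum(x**p / factorial(p) for p in range(P + 1))
        return approx / approx.sum()

    # one "correct" target, several near-misses, some background
    scores = np.array([10.0, 8.0, 8.0, 8.0, 8.0, 0.0, 0.0, 0.0])

    print(softmax(scores).round(3))         # winner gets ~0.65 of the mass
    print(taylor_softmax(scores).round(3))  # winner drops to ~0.35, the near-misses gain

(With scores of this size the raw truncation error is also huge, so the softness-vs-accuracy tradeoff really hinges on how big the pre-softmax scores get in practice.)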

Really wish this had some downstream tests -- apply it to a pretrained model and see how performance degrades, train a fresh one, etc. The tests are worth doing, but I somehow don't feel that hopeful this is the unlock required for sub-quadratic attention. It's possible that a freshly trained model with this learns to attend without the sharp attention signals, but that seems a bit dubious to me.

But also, maybe combining this with some other selective (sparse-attention) trick means the hybrid model gets both the "fuzzy long tail" of attention and the sharpness well represented, and all together it could actually be part of the larger solution.

energy123 4 hours ago | parent | next [-]

> this is where the Taylor expansion would fail to represent the values well.

"In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution"

seanhunter 4 hours ago | parent [-]

I read that too, but I wondered whether elementwise error is the right metric. Surely the real test should be to evaluate model performance for a conventional transformer and then for the same model with the attention mechanism replaced by this 4th-order Taylor approximation?

vlovich123 3 hours ago | parent [-]

Bounded elementwise error on the weights is, by definition, a stricter evaluation criterion than “performance” metrics obtained by running the model.
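A quick numerical illustration of that for a single attention layer (the sizes and eps here are arbitrary): if every attention weight is within eps of the exact one, each output coordinate can move by at most eps times a column sum of |V|, so the layer's output error is bounded no matter which downstream metric you pick.

    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 16, 8                              # arbitrary sequence length / head dim
    V = rng.normal(size=(T, d))               # value vectors
    A = rng.dirichlet(np.ones(T), size=T)     # "exact" attention weights (rows sum to 1)

    eps = 1e-3                                # assumed elementwise weight error
    A_approx = A + rng.uniform(-eps, eps, size=A.shape)

    output_err = np.abs(A_approx @ V - A @ V).max()
    bound = eps * np.abs(V).sum(axis=0).max() # worst-case per-coordinate bound
    print(output_err, bound)                  # output_err <= bound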

mapontosevenths 4 hours ago | parent | prev | next [-]

> This just... softens attentions ability to attend?

I think this does soften, but not linearly. That is to say, the fixed state-size limitation means it softens more the further it gets into the past.
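For context, this is the generic shape of a fixed-state linearized-attention update (not necessarily this paper's exact formulation, and the feature map here is made up): every past token gets folded into one constant-size matrix rather than stored individually.

    import numpy as np

    rng = np.random.default_rng(0)
    T, r, d = 12, 6, 4                         # toy sizes
    phi = lambda x: np.maximum(x, 0.0) + 1e-3  # hypothetical nonnegative feature map

    q = phi(rng.normal(size=(T, r)))
    k = phi(rng.normal(size=(T, r)))
    v = rng.normal(size=(T, d))

    state = np.zeros((r, d))   # fixed-size memory, independent of context length
    z = np.zeros(r)            # running normalizer
    for t in range(T):
        state += np.outer(k[t], v[t])          # token t is folded in, not stored
        z += k[t]
        out_t = q[t] @ state / (q[t] @ z)      # attention output for position t

Nothing in this plain recurrence decays old tokens explicitly; any extra softening of the distant past would have to come from how the feature map / expansion compresses things.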

tehsauce 4 hours ago | parent | prev [-]

Right, and when they compare to floating-point accuracy they seem to be using the number of significant digits supported by the mantissa, but the exponent is important too, no?

seanhunter 4 hours ago | parent [-]

When someone says the error is of a certain magnitude, they mean the absolute value of the difference between the two things. So what they're saying is that the values produced by their approximation are about as wrong as the difference between the exact values and those same values cast to float16. The exponent is most definitely important and would be included in that.
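Concretely, the absolute rounding error of a float16 cast grows with the exponent, so an "error of float16 magnitude" claim already has the exponent baked in (example values here are arbitrary):

    import numpy as np

    for v in [0.001, 0.123, 3.14159, 123.456, 12345.6]:   # arbitrary magnitudes
        err = abs(v - float(np.float16(v)))
        print(f"{v:>10}: |x - float16(x)| = {err:.3e}")

The printed error goes from roughly 4e-7 at the small end to about 1.6 at the large end: same relative precision, very different absolute magnitude.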