kristjansson 3 hours ago

> self-attention is efficiently computable to arbitrary precision with constant cost per token

This paper at least aspires to reproduce 'true' attention, which distinguishes it from many of the others. TBD if it's successful in that.

logicchains 3 hours ago | parent | next [-]

It can't be successful at that any more than 1+1 can equal 3. Fundamentally, if every token wants to be able to look at every previous token without loss of information, it must be O(n^2); N tokens looking at N tokens is quadratic. Any sub-quadratic attention must hence necessarily lose some information and be unable to support perfect recall on longer sequences.

orlp 38 minutes ago | parent | next [-]

> N tokens looking at N tokens is quadratic

Convolving two arrays can be done perfectly accurately in O(n log n), despite every element being combined with every other element.
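
To make that concrete, here's a small numpy sketch (array sizes and contents are arbitrary) comparing direct O(n^2) convolution against the O(n log n) FFT route; both give the same answer up to floating-point tolerance:

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.standard_normal(4096)
    b = rng.standard_normal(4096)

    # Direct convolution: every element of a meets every element of b, O(n^2).
    naive = np.convolve(a, b)

    # FFT route: transform, multiply pointwise, invert, O(n log n).
    n = len(a) + len(b) - 1
    fast = np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)

    assert np.allclose(naive, fast)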

Or consider the even more basic sum of products a[i] * b[j] for all possible i, j:

    # Naive accumulation over every (i, j) pair: O(len(a) * len(b)).
    total = 0
    for i in range(len(a)):
        for j in range(len(b)):
            total += a[i] * b[j]
This can be computed in linear time as sum(a) * sum(b).
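
A quick sanity check of the factored form (random inputs, sizes arbitrary):

    import random

    a = [random.random() for _ in range(1000)]
    b = [random.random() for _ in range(1000)]

    pairwise = sum(x * y for x in a for y in b)  # O(n*m), touches all pairs
    factored = sum(a) * sum(b)                   # O(n + m)

    assert abs(pairwise - factored) < 1e-6 * factored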

Your logic that 'the result contains terms of all pairs, therefore the algorithm must be quadratic' simply doesn't hold.

hellohello2 2 hours ago | parent | prev | next [-]

I'm not saying whether the paper is correct or not (since I can't tell), but I don't think your argument really holds. Consider applying it to multiplication:

Fundamentally, multiplication needs to look at every pair of digits from the two input numbers. It must be O(n^2); N digits looking at N other digits is quadratic. Any sub-quadratic multiplication must hence necessarily lose some information.
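
For a concrete counterexample, here's a minimal Karatsuba sketch (non-negative integers only, splitting on bit length; real big-integer libraries combine this with FFT-based methods), which multiplies exactly in roughly O(n^1.585) digit operations rather than O(n^2):

    def karatsuba(x: int, y: int) -> int:
        # Exact sub-quadratic multiplication of non-negative integers.
        if x < 10 or y < 10:
            return x * y
        m = max(x.bit_length(), y.bit_length()) // 2
        hi_x, lo_x = x >> m, x & ((1 << m) - 1)
        hi_y, lo_y = y >> m, y & ((1 << m) - 1)
        z0 = karatsuba(lo_x, lo_y)                           # low halves
        z2 = karatsuba(hi_x, hi_y)                           # high halves
        z1 = karatsuba(lo_x + hi_x, lo_y + hi_y) - z0 - z2   # cross terms via one multiply
        return (z2 << (2 * m)) + (z1 << m) + z0

    assert karatsuba(123456789, 987654321) == 123456789 * 987654321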

actionfromafar 2 hours ago | parent [-]

Doesn't that have to do with how many bits you allow in the actual calculation in physical reality?

hellohello2 35 minutes ago | parent [-]

Well, for multiplication, complexity is defined directly in terms of the number of digits/bits. For attention, complexity is defined in terms of the number of input vectors, which are all at fixed precision. I don't understand what happens to the method proposed in the paper at higher precision (since I don't understand the paper), but in reality it doesn't matter, since there is no value in anything over float16 for machine learning.

oasisaimlessly 2 hours ago | parent | prev | next [-]

That argument could also be used to say that the FFT's time complexity of O(n log n) should be impossible.

naasking an hour ago | parent | prev [-]

Your argument just assumes there is no latent structure that can be exploited. That's a big assumption.

energy123 3 hours ago | parent | prev [-]

It's like claims of room-temperature superconductors or Millennium Prize solutions. Earth-shattering if true. It'd be such a black swan. Terrible for Nvidia.

SeanAnderson 3 hours ago | parent [-]

Well, we solved one of the Millennium Prize problems (honestly kinda quickly) so maybe there's hope :)