Okay ... the answer really is "much better than other RNNs of any type". And transformers really do perform better: LSTM and GRU networks effectively have a context size of 1.
RNNs have a theoretically infinite context size, but in practice it’s limited by vanishing gradients due to too much recursion. That’s why the recursive units are usually LSTMs or GRUs that have explicit functionality to cut off the recursion, using a learnable threshold that’s a function of both the current and recursive inputs.
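To make the gating idea concrete, here is a minimal scalar sketch. It is not a real LSTM or GRU: the function name `gated_step`, the weights `w_x`, `w_h`, `b`, `u_x`, and the choice to build the candidate from the current input only are all assumptions made for illustration. The point is just that the gate `z` is computed from both the current input and the recurrent state, and that it decides how much of the old state survives.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_step(h_prev, x, w_x=1.0, w_h=1.0, b=0.0, u_x=1.0):
    """One step of a toy gated recurrent cell (hypothetical, scalar, simplified)."""
    # The gate is a function of both the current input and the recurrent input.
    z = sigmoid(w_x * x + w_h * h_prev + b)
    # Simplification: the candidate ignores h_prev, so z close to 1 cuts the recursion.
    candidate = math.tanh(u_x * x)
    # z close to 0 carries the old state through; z close to 1 overwrites it.
    return (1.0 - z) * h_prev + z * candidate

kept = gated_step(0.7, 0.3, b=-50.0)   # gate almost closed: old state is kept
reset = gated_step(0.7, 0.3, b=50.0)   # gate almost open: old state is dropped
print(kept, reset)
```

In a trained network the gate parameters are learned rather than set by hand, which is what lets the cell decide per step whether to keep or drop history.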
In your example, x[n-1] = f(x[n-2], x[n-1]). So really, your expression should be
Note that what people claim here is not true: going past the context window does NOT mean everything is forgotten. It means that you're back to RNN levels of performance.
In transformers, just like in RNNs, processing the nth token influences the nth token AND the (n+1)th token. So the influence of the nth token "slowly dies down". It influences the (n+1)th token a lot, the (n+2)th token a bit less and so on. The special thing about attention is that it doesn't start dying until you go past the context size.
Or with a bit of math notation. If a neural network is a function f, then:
RNN: f(x[n]) = f( f(x[n-1]), x[n] )
Transformer: f(x[n]) = f( sum(i=1...context size, f(x[n-i])), x[n] )
The difference between LSTMs and GRUs is in the structure of f. And "sum" is, like a lot of things in here, uh, basically accurate but missing a lot of detail.
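One way to see the difference between the two recurrences above is to count how many applications of f separate f(x[0]) from f(x[n]). The sketch below is only a toy dependency count, not a model: `min_hops` is a made-up helper, and the context size of 4 is an arbitrary choice. With a context of 1 it reproduces the RNN recurrence, where every extra token adds another hop. With a larger context, token 0 stays one hop away until you go past the window, and only then does the influence have to travel through intermediate outputs.

```python
def min_hops(n_tokens, context):
    """hops[n] = fewest applications of f between f(x[0]) and f(x[n]),
    assuming f(x[n]) reads the previous `context` outputs directly."""
    hops = [0]
    for n in range(1, n_tokens):
        # Outputs that f(x[n]) can see directly under this recurrence.
        window = hops[max(0, n - context):n]
        hops.append(1 + min(window))
    return hops

rnn = min_hops(10, context=1)        # RNN recurrence: only f(x[n-1]) is visible
windowed = min_hops(10, context=4)   # windowed recurrence, context size 4 (arbitrary)
print(rnn)       # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(windowed)  # [0, 1, 1, 1, 1, 2, 2, 2, 2, 3]
```

This says nothing about how strong the influence is at each hop. That depends on f and on the attention weights, which the "sum" glosses over.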