As you said, transformers use a linear amount of attention work per token; it's the n tokens each attending to n others that makes the total quadratic. There is no way around this other than something like a separate token that signals rerunning the SSM from the beginning. Then you have a dynamically scaling system that seamlessly switches between linear and quadratic complexity depending on the problem.
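To make the complexity point concrete, here's a toy sketch (my own illustration, not any particular model's implementation) counting the work done per token in causal attention versus an SSM:

```python
def attention_ops(n):
    # Token i attends to all i tokens so far: O(i) work per token.
    # Summing 1 + 2 + ... + n gives ~ n^2 / 2 total: quadratic.
    return sum(i for i in range(1, n + 1))

def ssm_ops(n):
    # An SSM updates a fixed-size state: O(1) work per token,
    # so the total is O(n): linear.
    return n

for n in (256, 1024, 4096):
    print(n, attention_ops(n), ssm_ops(n))
```

Each individual token's attention pass is cheap and linear in the context length; it's only the accumulation over all n tokens that produces the quadratic total, which is exactly the gap the rerun-from-the-beginning token would let you pay only when the problem demands it.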
MLA is probably the closest existing thing that sits in between the two.