Can you explain, like I'm five, why this matters as something distinct from how transformers are normally trained with autodiff, and what its possible applications are?
The paper speculates that it is analogous to gradient descent and empirically shows similar behavior, but it is not a rigorous proof of any kind.
The momentum experiment they ran also doesn't seem related: it just adds past values to V, which effectively extends the context length.
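To make the gradient-descent analogy concrete, here is a minimal numpy sketch (my own illustration, not code from the paper) of the well-known construction in which one gradient-descent step on an in-context linear-regression loss, starting from zero weights, coincides exactly with unnormalized linear attention over the context tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 8
X = rng.normal(size=(n, d))        # in-context inputs x_i
W_true = rng.normal(size=(1, d))
Y = X @ W_true.T                   # in-context targets y_i
x_q = rng.normal(size=(d, 1))      # query input

eta = 0.1
# One GD step on L(W) = 1/2 * sum_i ||W x_i - y_i||^2 from W = 0:
# grad_W L = -sum_i y_i x_i^T, so W_1 = eta * Y^T X
W1 = eta * Y.T @ X
pred_gd = W1 @ x_q

# Unnormalized linear attention with keys K = X, values V = eta * Y,
# and query x_q: prediction = V^T (K q)
pred_attn = (eta * Y).T @ (X @ x_q)

# The two predictions are identical (matrix associativity).
assert np.allclose(pred_gd, pred_attn)
```

This equivalence holds only for this toy linear setting with a specific choice of weights; it illustrates why the analogy is suggestive rather than a proof about softmax transformers in general.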