
Can you explain like I'm 5 why this matters, how it differs from the way transformers are normally trained with autodiff, and what its possible applications are?


I’m talking about attention-only transformers. Those aren’t trained with autodiff but still learn. The math is actually really cool.


> attention-only transformers

Can you share any good link on the subject?



Maybe I am missing something, but I don't see any learning without autodiff.


I thought you were asking about attention-only transformers. This paper touches on some of it: https://arxiv.org/abs/2212.10559v2
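
If it helps, here is the gist of the duality that paper leans on: one linear-attention read over the in-context examples can produce exactly the same prediction as one gradient-descent step on a least-squares loss over those examples. Rough toy sketch below (my own illustration, not code from the paper; eta, x_ctx, y_ctx, x_query are made-up names):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 8, 4                        # number of in-context examples, input dim
    eta = 0.1                          # step size of the implicit GD step

    x_ctx = rng.normal(size=(N, d))    # in-context inputs
    y_ctx = rng.normal(size=(N, 1))    # in-context targets
    x_query = rng.normal(size=(d, 1))  # query input

    # Gradient-descent view: start at W = 0, take one step on
    # L(W) = 1/(2N) * sum_i ||W x_i - y_i||^2, then predict for the query.
    delta_W = (eta / N) * (y_ctx.T @ x_ctx)
    pred_gd = delta_W @ x_query

    # Linear-attention view: keys are the context inputs, values are the
    # scaled targets, the query is the test input, and there is no softmax.
    K, V, Q = x_ctx, (eta / N) * y_ctx, x_query
    pred_attn = V.T @ (K @ Q)          # sum_i V_i * <K_i, Q>

    print(np.allclose(pred_gd, pred_attn))  # True: the two views agree

The softmax and the learned projections in a real transformer break the exact equality, which is roughly why the behavior is only argued to be similar rather than proven equivalent.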


The paper speculates that in-context learning is analogous to gradient descent and empirically shows the behavior is similar, but it is not a rigorous proof of any kind.

The momentum experiment they ran also does not seem related: it just adds past values to V, which extends the effective context length.
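
For concreteness, here is roughly what I mean by "adds past values to V" (my own reading, not their code; the decay factor gamma is my assumption):

    import numpy as np

    def attn(V, K, q):
        # plain softmax attention for a single query vector q
        s = K @ q
        w = np.exp(s - s.max())
        w /= w.sum()
        return V.T @ w

    def momentum_attn(V, K, q, gamma=0.9):
        # same attention output plus a decayed sum of the past value vectors;
        # the extra term ignores the query entirely, so it behaves like a
        # wider effective context rather than like optimizer momentum
        decay = gamma ** np.arange(len(V) - 1, -1, -1)  # older values decay more
        return attn(V, K, q) + V.T @ decay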


> but it is not a rigorous proof of any kind.

Such is the nature of early theories.




