Yup, if you follow this definition of attention, it makes sense.

The mixing step is computed with the trained weights, meaning the model does learn on its own when and what to emphasise in this "mixing" process.

Hypothetically speaking, if a token does not matter, its mixing weights could end up being effectively 0 (with a softmax they can never be literally 0, but they can get arbitrarily close).
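
For what it's worth, here's a minimal NumPy sketch of that mixing step (toy shapes, and all names here are mine for illustration, not from any particular paper). The rows of the softmax output are the mixing weights, and the trained Q/K projections behind the scores are what can push an irrelevant token's column toward 0:

    import numpy as np

    # Toy single-head attention. The softmax rows are the "mixing weights":
    # each output position is a weighted average (a mix) of the value vectors.

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)   # scores come from trained projections
        weights = softmax(scores)       # mixing weights; each row sums to 1
        return weights @ V, weights

    rng = np.random.default_rng(0)
    seq_len, d_model = 4, 8
    Q, K, V = rng.normal(size=(3, seq_len, d_model))

    out, w = attention(Q, K, V)
    print(w.round(3))   # a column of near-0s means that token is mixed out
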


