Yup, if you follow this definition of attention, it makes sense.

The mixing step is computed with the trained weights, meaning the model does learn on its own when and what to emphasise in this "mixing" process.

Hypothetically speaking, if a token does not matter, its mixing weights could end up being effectively 0 (with a softmax they can never be literally 0, but they can get arbitrarily close).
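
For what it's worth, here's a minimal NumPy sketch of that mixing step (toy shapes, and all names here are mine for illustration, not from any particular paper). The rows of the softmax output are the mixing weights, and the trained Q/K projections behind the scores are what can push an irrelevant token's column toward 0:

    import numpy as np

    # Toy single-head attention. The softmax rows are the "mixing weights":
    # each output position is a weighted average (a mix) of the value vectors.

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)   # scores come from trained projections
        weights = softmax(scores)       # mixing weights; each row sums to 1
        return weights @ V, weights

    rng = np.random.default_rng(0)
    seq_len, d_model = 4, 8
    Q, K, V = rng.normal(size=(3, seq_len, d_model))

    out, w = attention(Q, K, V)
    print(w.round(3))   # a column of near-0s means that token is mixed out
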


