
> far as I can tell, there is nothing about the training process of these models that would encourage them to make the output of any layer apart from (n-1) meaningful as the input of layer n

Wouldn't "pass-through" identity connections have exactly that effect? These are quite common in transformer models.



Yeah, that's what I meant by "initialised as identity and the training process did not get to change them much".


There are explicit residual connections in every transformer block. Look up "residual connections" on Google Images and you'll see them in the block diagrams.
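
For what it's worth, here is a minimal sketch of a pre-norm transformer block in PyTorch (dimensions, layer choices, and names are just illustrative, not any particular model's). The two `x = x + ...` lines are the residual ("pass-through") connections: each sub-layer only adds an update onto the stream, so the input of layer n stays in the same representation space as the output of every earlier layer.

    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        """Illustrative pre-norm transformer block with residual connections."""

        def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
            super().__init__()
            self.norm1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(d_model)
            self.ff = nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.GELU(),
                nn.Linear(d_ff, d_model),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Residual connection around self-attention: the input is passed
            # through unchanged and the sub-layer's output is added to it.
            h = self.norm1(x)
            attn_out, _ = self.attn(h, h, h)
            x = x + attn_out
            # Residual connection around the feed-forward sub-layer.
            x = x + self.ff(self.norm2(x))
            return x

    # The block maps (batch, sequence, d_model) -> same shape, so blocks stack.
    x = torch.randn(2, 16, 512)
    print(TransformerBlock()(x).shape)  # torch.Size([2, 16, 512])

If a sub-layer's weights stayed near zero, the block would behave as the identity on the residual stream, which is what the "initialised as identity" point above is getting at.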



