> far as I can tell, there is nothing about the training process of these models that would encourage them to make the output of any layer apart from (n-1) meaningful as the input of layer n
Wouldn't "pass-through" identity connections (i.e. residual/skip connections) have exactly that effect? These are quite common in transformer models.
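To make the point concrete, here is a minimal NumPy sketch of a residual connection (toy layer, hypothetical names): because the block's input is added directly to its output, every later layer receives the original signal plus each earlier layer's additive update, which is exactly what makes intermediate outputs meaningful beyond the immediately following layer.

```python
import numpy as np

def layer(x, W):
    """A toy sub-layer: a linear map followed by a ReLU."""
    return np.maximum(0.0, x @ W)

def residual_block(x, W):
    """Pass-through identity (residual) connection:
    the input x is added unchanged to the layer's output."""
    return x + layer(x, W)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.standard_normal((4, 4))

out = residual_block(x, W)

# The identity path survives intact: subtracting the layer's
# contribution recovers the original input.
print(np.allclose(out - layer(x, W), x))
```

Stacking such blocks means the network's "residual stream" carries the input forward unchanged by default, and each layer only learns a delta on top of it.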