> far as I can tell, there is nothing about the training process of these models that would encourage them to make the output of any layer apart from (n-1) meaningful as the input of layer n
Wouldn't "pass-through" identity connections (i.e. residual/skip connections) have exactly that effect? These are quite common in transformer models.
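To make the point concrete, here is a minimal NumPy sketch of a residual connection (toy layer, hypothetical names): because the block's input is added directly to its output, every later layer receives the original signal plus each earlier layer's additive update, which is exactly what makes intermediate outputs meaningful beyond the immediately following layer.

```python
import numpy as np

def layer(x, W):
    """A toy sub-layer: a linear map followed by a ReLU."""
    return np.maximum(0.0, x @ W)

def residual_block(x, W):
    """Pass-through identity (residual) connection:
    the input x is added unchanged to the layer's output."""
    return x + layer(x, W)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.standard_normal((4, 4))

out = residual_block(x, W)

# The identity path survives intact: subtracting the layer's
# contribution recovers the original input.
print(np.allclose(out - layer(x, W), x))
```

Stacking such blocks means the network's "residual stream" carries the input forward unchanged by default, and each layer only learns a delta on top of it.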