
I was wondering whether the benefit from longer training is a similar phenomenon to the double descent we see in other deep learning models. Training for a really long time can improve generalization (as can adding more parameters), but I don't know enough about LLM architecture to know if that's relevant here. My skim of the blog post led me to think it's proposing a different mechanism (scaling laws).

