
SWE-bench measures single tasks in isolation. In a real loop, the model usually loses track of what I was trying to do long before code quality becomes the issue.

