yaodub 14 days ago | on: Claude Fable 5

SWE-Bench measures single tasks in isolation. In a real loop the model usually loses track of what I was trying to do long before code quality becomes the issue.