
It's pretty funny to test AI models in-distribution, because they fail horribly once you push them a bit outside it [1].

I recently made LLMs play Minesweeper, and ALL the LLMs I tested had a pretty bad win-to-loss ratio. The only model that won more than 3 times was R1 (mind you, there were 50 games).

[1] https://snats.xyz/pages/articles/minesweeper_bench.html


