Only Gemini 2.0 Flash and Claude 3.5 Sonnet got this right for me, and only Claude seems to get such things right with good consistency. It seems to have a strategy for it and applies it well, while the other models are basically guessing.
Also, DeepSeek gets it right much like Claude does, but it's more verbose in standard mode and much more verbose in R1 deep-thinking mode. The reasoning is verbose but nearly 100% sensible.
Especially when given the follow-up question "how about 9.8?"
Gemini 2.0 Flash is also consistently correct when "Think step by step." is appended to the prompt.