Hacker News

I work in this space and Claude's ability to count pixels and interact with a screen using precise coordinates seems like a genuinely useful innovation that I expect will improve upon existing approaches.

Existing approaches tend to involve drawing marked bounding boxes around interactive elements and then asking the LLM to provide a tool call like `click('A12')`, where A12 remaps to the underlying HTML element and we perform some sort of Selenium/JS action. Using heuristics to draw those bounding boxes is tricky. Even performing the correct action can be tricky, since click handlers might be attached to a different DOM element than the one labeled.
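To make the remapping step concrete, here's a minimal sketch of that label-based approach. The function names, labels, and element structures are hypothetical (not from any specific framework), and the real dispatch would be a Selenium/JS click rather than an action log:

```python
# Hypothetical sketch of the bounding-box label approach: each
# interactive element gets a label like "A12", the LLM emits
# click('A12'), and we remap the label back to an element locator.

def build_label_map(elements):
    """Assign labels A1, A2, ... to detected interactive elements."""
    return {f"A{i}": el for i, el in enumerate(elements, start=1)}

def click(label, label_map, actions):
    """Resolve an LLM tool call like click('A12') to the real element.

    In practice this would dispatch a Selenium/JS click; here we just
    record the resolved CSS selector to show the remapping step.
    """
    el = label_map[label]
    actions.append(("click", el["selector"]))

# Made-up elements standing in for what box-drawing heuristics detect:
elements = [
    {"selector": "#search-box", "tag": "input"},
    {"selector": "#submit-btn", "tag": "button"},
]
label_map = build_label_map(elements)
actions = []
click("A2", label_map, actions)
print(actions)  # [('click', '#submit-btn')]
```

The fragile parts are exactly the ones the comment calls out: deciding which elements deserve a box, and whether the element you resolve is the one that actually owns the click handler.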

Avoiding this remapping from a visual element to an HTML element, and instead working with high-level operations like `click(x, y)` or `type("foo")` directly on the screen, will probably be more effective at automating use cases.
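For contrast, the coordinate-level interface needs only two primitives and no DOM knowledge at all. This is a hypothetical sketch (a real backend would drive the OS mouse and keyboard via an automation library; here the events are just logged, and the coordinates are invented):

```python
# Hypothetical sketch of coordinate-level screen actions: the model
# emits click(x, y) / type(text) directly against the screenshot,
# so no visual-to-DOM remapping step exists.

events = []

def click(x, y):
    """Click at absolute screen coordinates."""
    events.append(("click", x, y))

def type_text(text):
    """Type text at the current focus (named type_text to avoid
    shadowing Python's built-in type)."""
    events.append(("type", text))

# A hypothetical action sequence an LLM might emit for a search box:
click(412, 230)        # focus the input at those pixel coordinates
type_text("foo")
click(480, 230)        # press the adjacent submit button
print(events)
```

The trade is clear: the interface is simpler and generalizes beyond the browser, but correctness now rests entirely on the model's pixel-level localization.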

That being said, providing HTML to the LLM as context does tend to improve performance on top of just visual inference right now.

So I dunno... I'm more optimistic about Claude's approach and am very excited about it... especially if visual inference continues to improve.



Agreed. In the short term (X months) I expect HTML distillation + giving text to LLMs to win out... but in the long term (Y years), screenshot-only + pixels will definitely be the more "scalable" approach.

One very subtle advantage of doing HTML analysis is that you can cut out a decent number of LLM calls by doing static analysis of the page

For example, you don't need to click on a dropdown to understand the options behind it, or scroll down on a page to find a button to click.
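The dropdown case above can be shown with nothing but the stdlib HTML parser: the options are readable straight from the markup, with no click, no screenshot, and no extra LLM call. The page snippet is a made-up example:

```python
# Sketch of the static-analysis advantage: dropdown options come
# straight out of the HTML, so no LLM call or click is needed.
from html.parser import HTMLParser

class SelectOptions(HTMLParser):
    """Collect the text of <option> tags from a page."""

    def __init__(self):
        super().__init__()
        self.in_option = False
        self.options = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self.in_option = True

    def handle_endtag(self, tag):
        if tag == "option":
            self.in_option = False

    def handle_data(self, data):
        if self.in_option:
            self.options.append(data.strip())

# Hypothetical page fragment containing a closed dropdown:
html = """
<select id="country">
  <option>France</option>
  <option>Germany</option>
  <option>Japan</option>
</select>
"""
parser = SelectOptions()
parser.feed(html)
print(parser.options)  # ['France', 'Germany', 'Japan']
```

A screenshot-only agent would have to click the dropdown, take another screenshot, and spend another model call to learn the same three strings.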

Certainly, as LLMs get cheaper the extra LLM calls will matter less (similar to what we're seeing with solar panels, where the cost of the panel < the cost of labour now, but the reverse was true the preceding decade).


> Claude's ability to count pixels and interact with a screen using precise coordinates

I guess you mean its "Computer use" API that can (if I understand correctly) send a mouse click at specific coordinates?

I got excited thinking Claude can finally do accurate object detection, but alas no. Here's its output:

> Looking at the image directly, the SPACE key appears near the bottom left of the keyboard interface, but I cannot determine its exact pixel coordinates just by looking at the image. I can see it's positioned below the letter grid and appears wider than the regular letter keys, but I apologize - I cannot reliably extract specific pixel coordinates from just viewing the screenshot.

This is 3.5 Sonnet (their most current model).

And they explicitly call out spatial reasoning as a limitation:

> Claude’s spatial reasoning abilities are limited. It may struggle with tasks requiring precise localization or layouts, like reading an analog clock face or describing exact positions of chess pieces.

--https://docs.anthropic.com/en/docs/build-with-claude/vision#...

Since 2022 I occasionally dip in and test this use-case with the latest models but haven't seen much progress on the spatial reasoning. The multi-modality has been a neat addition though.


They report that they trained the model to count pixels, and judging by the accurate mouse clicks it produces, that seems to hold for at least some code paths.

> When a developer tasks Claude with using a piece of computer software and gives it the necessary access, Claude looks at screenshots of what’s visible to the user, then counts how many pixels vertically or horizontally it needs to move a cursor in order to click in the correct place. Training Claude to count pixels accurately was critical.
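The "counts how many pixels" step in the quote reduces to simple arithmetic once the model has localized the target: find the center of the target's bounding box and subtract the current cursor position. The coordinates below are invented for illustration:

```python
# Sketch of the pixel-counting step: given the cursor position and a
# target's bounding box on the screenshot, compute how far to move
# the cursor horizontally and vertically. The hard part is producing
# an accurate box in the first place, not this arithmetic.

def cursor_delta(cursor, box):
    """Return (dx, dy) from cursor to the center of box=(left, top, w, h)."""
    left, top, w, h = box
    cx, cy = left + w // 2, top + h // 2
    return cx - cursor[0], cy - cursor[1]

cursor = (100, 100)
space_key_box = (250, 540, 300, 48)   # hypothetical SPACE key bounds
dx, dy = cursor_delta(cursor, space_key_box)
print(dx, dy)  # 300 464
```

This is why the sibling comment's experiment matters: if the model can't localize the SPACE key to a box, no amount of careful pixel counting afterwards helps.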


Curious: what use cases do you use to test the spatial reasoning ability of these models?


I don't use LLMs that often, but I recently used Claude Sonnet and was more impressed than I was with ChatGPT for similar AWS CDK questions.

In your opinion is Claude in the lead now? Or is it still really just dependent on what use case/question you are trying to solve?




