even a specious attempt to define generally what "object" and "put away" mean is still 100s of PhD theses away
Is this part still true? There are widely available APIs (some of which even run at home on consumer-level hardware, to some extent) that can pick an object out of an image, describe what it might be useful for and where it could go.
Imagine you program a robot to "put away" a towel. Then it opens the door and finds there's a cup in the place already. Now what? Or a mouse. Or a piece of paper that looks like a towel in this lighting. Or a child.
Imagine the frustration if the robot kept returning to you saying "I cannot put this away". You'd get rid of the robot quickly. Reasoning at that level is so difficult.
But then imagine it was just a towel all along - oops, your perception system screwed up and now you put the towel in the dishwasher. Maybe this happens 1/1,000,000 times, but that person posts pictures on the internet and your company stock tanks.
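To put rough numbers on that 1/1,000,000 figure, here is a back-of-the-envelope sketch. The fleet size and task counts are made-up assumptions for illustration only, and it assumes every task fails independently with the same probability, which a real perception system won't exactly do.

```python
import math

def p_at_least_one(p_fail: float, n_tasks: int) -> float:
    """P(at least one failure in n independent tasks) = 1 - (1 - p)^n.

    Computed via log1p/expm1 so it stays accurate for very small p.
    """
    return -math.expm1(n_tasks * math.log1p(-p_fail))

P_FAIL = 1e-6  # the "1/1,000,000 times" from the comment above

# Hypothetical single household: 10 "put away" tasks a day for 10 years.
household_tasks = 10 * 365 * 10
household_risk = p_at_least_one(P_FAIL, household_tasks)

# Hypothetical fleet: 100,000 robots doing 10 tasks a day for one year.
fleet_tasks = 100_000 * 10 * 365
fleet_expected_failures = P_FAIL * fleet_tasks
fleet_risk = p_at_least_one(P_FAIL, fleet_tasks)

print(f"household: {household_risk:.1%} chance of ever seeing a failure")
print(f"fleet: about {fleet_expected_failures:.0f} failures expected per year")
```

Under those assumptions any single owner will almost never see the failure, but across the fleet it happens roughly daily, and it only takes one of those owners to post the pictures.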
Most robotics companies today still use traditional tracking and filtering (e.g. Kalman filters) to help with associating detected objects with tracks (objects over time). Solving this in a fully differentiable / ML-first way for multiple targets is still WIP at most companies, since deepnet-to-detect + filtering is still a strong baseline and there are still challenges to be solved.
Occlusions, short-lived tracks, misassociations, low frame rate + high-rate-of-change features (e.g. flashing lights) are all still very challenging when you get down to brass tacks.
It's definitely not a solved problem in general, especially in realtime.
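For anyone who hasn't seen the detect + filter baseline, here is a toy sketch of the idea, not anyone's production tracker. It assumes 1-D positions, a constant-velocity Kalman filter per track, and greedy nearest-neighbour association with a fixed gate. The `Track` class, `step` function, and all noise/gate values are hypothetical choices for illustration; real systems typically work in 2-D/3-D and use better association (e.g. Hungarian assignment).

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)

@dataclass
class Track:
    x: float                 # position estimate
    v: float = 0.0           # velocity estimate
    p00: float = 1.0         # covariance entries (symmetric 2x2)
    p01: float = 0.0
    p11: float = 10.0        # velocity starts out very uncertain
    misses: int = 0          # consecutive frames without a detection
    id: int = field(default_factory=lambda: next(_ids))

    def predict(self, dt: float, q: float = 0.01) -> None:
        # Constant-velocity motion model: x <- x + v*dt, P <- F P F^T + Q.
        self.x += self.v * dt
        self.p00 += 2 * dt * self.p01 + dt * dt * self.p11 + q
        self.p01 += dt * self.p11
        self.p11 += q

    def update(self, z: float, r: float = 0.25) -> None:
        # Standard Kalman update for a position-only measurement z.
        s = self.p00 + r
        k0, k1 = self.p00 / s, self.p01 / s
        y = z - self.x
        self.x += k0 * y
        self.v += k1 * y
        self.p11 -= k1 * self.p01
        self.p01 *= 1 - k0
        self.p00 *= 1 - k0
        self.misses = 0

def step(tracks, detections, dt=1.0, gate=2.0, max_misses=3):
    """Advance one frame: predict, associate, update, spawn, prune."""
    for t in tracks:
        t.predict(dt)
    # Candidate (distance, track, detection) pairs inside the gate, closest first.
    pairs = sorted(
        (abs(z - t.x), ti, di)
        for ti, t in enumerate(tracks)
        for di, z in enumerate(detections)
        if abs(z - t.x) <= gate
    )
    used_t, used_d = set(), set()
    for _, ti, di in pairs:
        if ti in used_t or di in used_d:
            continue
        tracks[ti].update(detections[di])
        used_t.add(ti)
        used_d.add(di)
    for ti, t in enumerate(tracks):
        if ti not in used_t:
            t.misses += 1    # coast on the prediction (e.g. occlusion)
    survivors = [t for t in tracks if t.misses <= max_misses]
    # Unmatched detections start new tracks.
    survivors += [Track(x=z) for di, z in enumerate(detections) if di not in used_d]
    return survivors

# One object moving at about +1 per frame, one sitting near 10.
# The stationary object is "occluded" (no detection) in the third frame.
frames = [[0.0, 10.0], [1.1, 9.9], [2.0], [2.9, 10.1], [4.0, 10.0]]
tracks = []
for detections in frames:
    tracks = step(tracks, detections)

for t in tracks:
    print(f"track {t.id}: x={t.x:.2f} v={t.v:.2f} misses={t.misses}")
```

In this easy case both tracks survive the dropped frame with their IDs intact. Put the two objects within a gate-width of each other, drop a few more frames, or add a false detection, and the greedy matcher will happily swap or spawn tracks, which is exactly the misassociation / short-lived-track pain described above.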
It's a lot easier to get started on something interesting and maybe even useful than it was even 10 years ago.
A lot of the "ah we can just use X API" falls apart pretty fast when you do risk analysis on a real system. Lots of these APIs do a decent job most of the time under somewhat ideal conditions; beyond that, things get hairy.
Robots could make a short stop or go slower to process an unclear picture, so that is probably not the problem. The image processing itself, though, is still way too unreliable. Under ideal conditions it mostly works, but have some light fog in the picture or strong sunlight and ... usually it all fails.
Otherwise Teslas would indeed have a full self-driving mode using only cameras.
>Robots could make a short stop or go slower to process an unclear picture
The costs of doing so are hugely dependent on the application. It is not, for example, an attractive strategy for an image-guided missile, though it's probably fine for an autonomous vacuum cleaner.
If someone could readily do it using GPT-4V with its apparent sentience, it would be happening already. So far there have been just a few demos, and they show obvious signs of manual programming, manual remote operation, and/or even VFX editing in some cases.