*even a specious attempt to define generally what "object" and "put away" mean i...

jvanderbot · on March 14, 2024

Imagine you program a robot to "put away" a towel. Then it opens the door and finds there's a cup in the place already. Now what? Or a mouse. Or a piece of paper that looks like a towel in this lighting. Or a child.

Imagine the frustration if the robot kept returning to you saying "I cannot put this away". You'd get rid of the robot quickly. Reasoning at that level is so difficult.

But then imagine it was just a towel all along - oops, your perception system screwed up and now you put the towel in the dishwasher. Maybe this happens 1/1,000,000 times, but that person posts pictures on the internet and your company stock tanks.

kajecounterhack · on March 14, 2024

Most robotics companies today still use traditional tracking and filtering (e.g. Kalman filters) to help with associating detected objects with tracks (objects over time). Solving this in a fully differentiable / ML-first way for multiple targets is still WIP at most companies, since deepnet-to-detect + filtering is still a strong baseline and there are still challenges to be solved.

Occlusions, short-lived tracks, misassociations, low frame rate + high-rate-of-change features (e.g. flashing lights) are all still very challenging when you get down to brass tacks.
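
To make the "detector + filtering" baseline concrete, here is a minimal sketch of a constant-velocity Kalman filter for a single target along one axis. It is an illustration under assumptions, not how any particular company does it: the names `Track1D` and `step`, the noise values `q` and `r`, and the gate of 9.0 (roughly 3 sigma) are all made up for the example. A missing detection (an occlusion) lets the track coast on its prediction, and a detection far outside the gate is rejected instead of dragging the track away.

```python
from dataclasses import dataclass


@dataclass
class Track1D:
    """Constant-velocity Kalman filter for one target along one axis."""
    pos: float
    vel: float = 0.0
    # Covariance P = [[p00, p01], [p10, p11]], stored as scalars.
    p00: float = 1.0
    p01: float = 0.0
    p10: float = 0.0
    p11: float = 1.0
    missed: int = 0  # consecutive frames without an accepted detection

    def predict(self, dt: float, q: float = 0.01) -> None:
        """x <- F x and P <- F P F^T + q*I, with F = [[1, dt], [0, 1]]."""
        a, b, c, d = self.p00, self.p01, self.p10, self.p11
        self.pos += dt * self.vel
        self.p00 = a + dt * (b + c) + dt * dt * d + q
        self.p01 = b + dt * d
        self.p10 = c + dt * d
        self.p11 = d + q

    def update(self, z: float, r: float = 0.25, gate: float = 9.0) -> bool:
        """Fuse a position measurement z; reject it if it fails the gate."""
        y = z - self.pos          # innovation
        s = self.p00 + r          # innovation variance
        if y * y / s > gate:      # too unlikely to belong to this track
            self.missed += 1
            return False
        k0, k1 = self.p00 / s, self.p10 / s  # Kalman gain
        a, b, c, d = self.p00, self.p01, self.p10, self.p11
        self.pos += k0 * y
        self.vel += k1 * y
        # P <- (I - K H) P with H = [1, 0]
        self.p00, self.p01 = (1 - k0) * a, (1 - k0) * b
        self.p10, self.p11 = c - k1 * a, d - k1 * b
        self.missed = 0
        return True


def step(track: Track1D, z, dt: float = 1.0) -> bool:
    """Advance one frame; z is None when the detector saw nothing."""
    track.predict(dt)
    if z is None:
        track.missed += 1  # coast on the prediction
        return False
    return track.update(z)


# Target moves one unit per frame; frames 8-10 are occluded.
track = Track1D(pos=0.0)
for k in range(1, 21):
    step(track, None if k in (8, 9, 10) else float(k))

# A wildly wrong detection should be gated out, not fused.
outlier_accepted = step(track, 100.0)
```

A real tracker would typically drop a track once `missed` exceeds some threshold, which is exactly where short-lived tracks and long occlusions start to hurt.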

ska · on March 14, 2024

It's definitely not a solved problem in general, especially in realtime.

It's a lot easier to get started on something interesting and maybe even useful than it was even 10 years ago.

A lot of the "ah we can just use X API" falls apart pretty fast when you do risk analysis on a real system. Lots of these APIs do a decent job most of the time under somewhat ideal conditions; beyond that things get hairy.

kaibee · on March 14, 2024

> that can pick an object out of an image

You have to do it in real time, from a video feed, and make sure that you're tracking the same unique instance of that object between frames.
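
As a rough sketch of what "tracking the same unique instance between frames" involves, here is a greedy IoU (intersection over union) matcher that carries IDs from one frame's boxes to the next. Everything in it is an assumption for illustration: the `IouTracker` class, the `(x1, y1, x2, y2)` box format, and the `min_iou` threshold of 0.3. It also drops unmatched tracks immediately, which a real system would not do.

```python
from itertools import count


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class IouTracker:
    """Hypothetical minimal tracker: greedy best-IoU-first association."""

    def __init__(self, min_iou=0.3):
        self.min_iou = min_iou
        self.tracks = {}       # track id -> box from the previous frame
        self._ids = count(1)   # source of fresh track ids

    def update(self, detections):
        """Return one track id per detection, in input order."""
        # Score every (track, detection) pair, best overlap first.
        pairs = sorted(
            ((iou(box, det), tid, i)
             for tid, box in self.tracks.items()
             for i, det in enumerate(detections)),
            reverse=True,
        )
        assigned = {}      # detection index -> track id
        used = set()       # track ids already matched this frame
        for score, tid, i in pairs:
            if score < self.min_iou:
                break      # remaining pairs overlap even less
            if tid in used or i in assigned:
                continue
            assigned[i] = tid
            used.add(tid)

        ids = []
        new_tracks = {}
        for i, det in enumerate(detections):
            tid = assigned.get(i)
            if tid is None:
                tid = next(self._ids)  # unmatched detection starts a track
            new_tracks[tid] = det
            ids.append(tid)
        self.tracks = new_tracks  # simplification: unmatched tracks are dropped
        return ids


tracker = IouTracker()
frame1 = tracker.update([(0, 0, 10, 10), (50, 50, 60, 60)])
# Same two objects, slightly moved and reported in the opposite order.
frame2 = tracker.update([(51, 50, 61, 60), (1, 0, 11, 10)])
# A detection that overlaps nothing gets a fresh id.
frame3 = tracker.update([(200, 200, 210, 210)])
```

Greedy matching on overlap alone breaks down when objects cross paths or move far between frames, which is why it is usually combined with a motion model and a global assignment step.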

lukan · on March 14, 2024

Robots could make a short stop or go slower to process an unclear picture, so that is probably not the problem. The image processing itself is still way too unreliable. Under ideal conditions it mostly works, but add some light fog or strong sunlight to the picture and it usually all fails.

Otherwise Teslas would indeed have a full self-driving mode using only cameras.

thfuran · on March 14, 2024

>Robots could make a short stop or go slower to process an unclear picture

The costs of doing so are hugely dependent on the application. It is not, for example, an attractive strategy for an image-guided missile, though it's probably fine for an autonomous vacuum cleaner.

YeGoblynQueenne · on March 14, 2024

And then you need to grasp it.

numpad0 · on March 15, 2024

If someone could readily do it using GPT-4V with its apparent sentience, it must be happening already. So far there have been just a few demos, and they show obvious signs of manual programming, manual remote operation, and/or even VFX editing in some cases.

transitionnel · on March 14, 2024

That language sounds born of hair-pulling disbelief.

If they can put ImageNet on a SoC, they can do it. [probably too big/watt]

Better yet: ImageNet bones on SoC, cacheable "Immediate Situation" fed by [the obvious logic programming that everyone glances past :) ]

transitionnel · on March 15, 2024

> This is how Cybernetics starts y'all. <