The "visual inputs" samples are extraordinary, and well worth paying extra attention to.
I wasn't expecting GPT-4 to be able to correctly answer "What is funny about this image?" for an image of a mobile phone charger designed to resemble a VGA cable - but it can.
(Note that they have a disclaimer: "Image inputs are still a research preview and not publicly available.")
you mean this http://karpathy.github.io/2012/10/22/state-of-computer-visio...?
Very funny to revisit. How primitive our tools were in comparison to now is astounding. It feels like the first flight of the Wright Brothers vs a jetliner. Imagenet was the new frontier. Simpler times...
I think the interesting thing here is the very, very surprising result that LLMs would be capable of abstracting the things in the second to last paragraph from the described experiences of amalgamated written human data.
It's the thing most people even in this thread don't seem to realize has emerged in research in the past year.
Give a Markov chain a lot of text about fishing and it will tell you about fish. Give GPT a lot of text about fishing and it turns out that it will probably learn how to fish.
World model representations are occurring in GPT. And people really need to start realizing there's already published research demonstrating that, as it goes a long way to explaining why the multimodal parts work.
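For anyone unsure what the Markov chain comparison above refers to, here is a minimal sketch of a word-level Markov chain text generator. The corpus and the names `build_chain` and `generate` are made up for illustration. The point is only that such a model picks the next word from what followed the current word in its training text, and holds no other state.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that directly follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, seed=0):
    """Walk the chain, sampling each next word by how often it followed the current one."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: this word never had a successor in the corpus
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


# Hypothetical toy corpus, not taken from any real dataset.
corpus = "cast the line and wait for the fish then reel the fish in"
chain = build_chain(corpus)
print(generate(chain, "the", 8))
```

Every adjacent word pair in the output already appears somewhere in the corpus, which is the sense in which a Markov chain can only "tell you about fish".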
If they are using popular images from the internet, then I strongly suspect the answers come from the text next to the known image. The man ironing on the back of the taxi has the same issue. https://google.com/search?q=mobile+phone+charger+resembling+...
I would bet good money that when we can test prompting with our own unique images, GPT4 will not give similar quality answers.
They literally sent it 1) a screenshot of the Discord session they were in and 2) an audience-submitted image.
It described the Discord image in incredible detail, including what was in it, which channels they were subscribed to, and how many users were there. And for the audience image, it correctly described it as an astronaut on an alien planet, with a spaceship on a distant hill.
99% of the comments here have no iota of a clue what they are talking about.
There's easily a 10:1 ratio of "it doesn't understand it's just fancy autocomplete" to the alternative, in spite of published peer reviewed research from Harvard and MIT researchers months ago demonstrating even a simplistic GPT model builds world representations from which it draws its responses and not simply frequency guessing.
Watch the livestream!?! But why would they do that, when they already know it's not very impressive and not worth their time beyond commenting on it online?
I imagine this is coming from some sort of monkey brain existential threat rationalization ("I'm a smart monkey and no non-monkey can do what I do"). Or possibly just an overreaction to very early claims of "it's alive!!!" in an age when it was still just a glorified Markov chain. But whatever the reason, it's getting old very fast.
>published peer reviewed research from Harvard and MIT researchers months ago
Curious, source?
EDIT: Oh, the Othello paper. Be careful extrapolating that too far. Notice they didn't ask it to play the same game on a board of arbitrary size (something easy for a model with world understanding to do).
In the livestream demo they did something similar but with a DALLE-generated image of a squirrel holding a camera and it still was able to explain why it was funny. As the image was generated by DALLE, it clearly doesn't appear anywhere on the internet with text explaining why it's funny. So I think this is perhaps not the only possible explanation.
It didn't correctly explain why it was funny though: which is that it's a squirrel "taking a picture of his nuts", nuts here being literal nuts and not the nuts we expect with phrasing like that.
What is funny is neither GPT-4 nor the host noticed that (or maybe the host noticed it but didn't want to bring it up due to it being "inappropriate" humor).
That interpretation never occurred to me either, actually. I suppose that makes more sense. But since it did not occur to me, I can give GPT4 some slack. It came up with the same explanation I would have.
That’s exactly their point though. It requires intuition to decide if a picture of feet is sexualized or not. Hence the “I know it when I see it” standard they mentioned.
I suppose it could do it from porn snapshots, kinda like the porn-id thing on Reddit. I can see more nefarious uses, like identifying car licence plates or faces from public cameras for digital stalking. I know it can be done right now with ALPRs, but they have to be manually designed with specialty camera setups. If this makes it ubiquitous then that would be a privacy/security nightmare.
Am I the only one who thought that GPT-4 got this one wrong? It's not simply that it's ridiculous to plug what appears to be an outdated VGA cable into a phone, it's that the cable connector does nothing at all. I'd argue that's what's actually funny. GPT-4 didn't mention that part as far as I could see.