
The "visual inputs" samples are extraordinary, and well worth paying extra attention to.

I wasn't expecting GPT-4 to be able to correctly answer "What is funny about this image?" for an image of a mobile phone charger designed to resemble a VGA cable - but it can.

(Note that they have a disclaimer: "Image inputs are still a research preview and not publicly available.")



Wow. I specifically remember "AIs will never be able to explain visual humor" as a confident prediction from the before times of 2020.


Yes! I remember the "Obama stepping on the scale" example that was used in that article. Would love to know how GPT-4 performs on that test.


You mean this? http://karpathy.github.io/2012/10/22/state-of-computer-visio... Very funny to revisit. It's astounding how primitive our tools were compared to now. It feels like the first flight of the Wright Brothers vs. a jetliner. ImageNet was the new frontier. Simpler times...


I think the interesting thing here is the very surprising result that LLMs are capable of abstracting the things in the second-to-last paragraph purely from amalgamated written human descriptions of experience.

It's the thing most people even in this thread don't seem to realize has emerged in research in the past year.

Give a Markov chain a lot of text about fishing and it will tell you about fish. Give GPT a lot of text about fishing and it turns out that it will probably learn how to fish.
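To make the contrast concrete, here is a minimal sketch (assuming nothing beyond the standard library, with a made-up toy corpus) of the Markov-chain side: a bigram table only records which word followed which, so everything it emits is surface-frequency guessing with no model of what the words refer to.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Bigram table: each word maps to the list of words observed after it."""
    chain = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def generate(chain, start, length, seed=0):
    """Emit words by repeatedly sampling a recorded successor of the last
    word -- pure frequency guessing, no representation of fishing itself."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = chain.get(out[-1])
        if not successors:
            break  # dead end: this word was never followed by anything
        out.append(rng.choice(successors))
    return " ".join(out)

corpus = "cast the line wait for a bite reel in the fish cast the net"
chain = build_chain(corpus)
print(generate(chain, "cast", 6))
```

Every pair of consecutive output words is, by construction, a bigram seen in the training text; the model can never do more than remix local word statistics, which is exactly the capability ceiling the comment is contrasting GPT against.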

World model representations are occurring in GPT. And people really need to start realizing there's already published research demonstrating that, as it goes a long way toward explaining why the multimodal parts work.


Especially funny since the author, Andrej Karpathy, wrote at the end of the 2012 article that

>we are very, very far and this depresses me. What is the way forward? :( Maybe I should just do a startup

and was a founding member of OpenAI just a few years later, in 2015.


And he just rejoined them in February.


Didn't realize this was from 2012, but yes this is definitely what I was thinking of.


They say there are 3 mirrors in the scene, but there are at least 5 - one of which can only be seen indirectly through one of the other mirrors!


If they are using popular images from the internet, then I strongly suspect the answers come from the text next to the known image. The man ironing on the back of the taxi has the same issue. https://google.com/search?q=mobile+phone+charger+resembling+...

I would bet good money that when we can test prompting with our own unique images, GPT4 will not give similar quality answers.

I do wonder how misleading their paper is.


Did you watch the livestream?

They literally sent it 1) a screenshot of the Discord session they were in and 2) an audience-submitted image.

It described the Discord image in incredible detail, including what was in it, which channels were subscribed to, and how many users were there. And for the audience image, it correctly described it as an astronaut on an alien planet, with a spaceship on a distant hill.

And that image looked like it was AI created!

These aren't images it's been "trained on".


99% of the commenters here don't have an iota of a clue what they're talking about.

There's easily a 10:1 ratio of "it doesn't understand, it's just fancy autocomplete" to the alternative, in spite of peer-reviewed research published months ago by Harvard and MIT researchers demonstrating that even a simplistic GPT model builds world representations from which it draws its responses, rather than simply guessing by frequency.

Watch the livestream?! But why would they do that, when they already "know" it's not very impressive and not worth their time beyond commenting on it online?

I imagine this is coming from some sort of monkey brain existential threat rationalization ("I'm a smart monkey and no non-monkey can do what I do"). Or possibly just an overreaction to very early claims of "it's alive!!!" in an age when it was still just a glorified Markov chain. But whatever the reason, it's getting old very fast.


>published peer reviewed research from Harvard and MIT researchers months ago

Curious, source?

EDIT: Oh, the Othello paper. Be careful extrapolating that too far. Notice they didn't ask it to play the same game on a board of arbitrary size (something easy for a model with world understanding to do).


In the livestream demo they did something similar with a DALL-E-generated image of a squirrel holding a camera, and it was still able to explain why it was funny. As the image was generated by DALL-E, it clearly doesn't appear anywhere on the internet with text explaining why it's funny. So I think this is perhaps not the only possible explanation.


It didn't correctly explain why it was funny though: which is that it's a squirrel "taking a picture of his nuts", nuts here being literal nuts and not the nuts we expect with phrasing like that.

What's funny is that neither GPT-4 nor the host noticed that (or maybe the host noticed but didn't want to bring it up, due to it being "inappropriate" humor).


That interpretation never occurred to me either, actually. I suppose it makes more sense. But since it did not occur to me, I can cut GPT-4 some slack: it came up with the same explanation I would have.


Can it identify porn vs e.g. family pics? Could it pass the "I'll know it when I see it" test?


Some people are sexually aroused by feet. How would YOU define "porn?"


Does it know what a "man of culture" is?


https://xkcd.com/468/

anything not on your list


That’s exactly their point though. It requires intuition to decide if a picture of feet is sexualized or not. Hence the “I know it when I see it” standard they mentioned.


I’d bet they pass images through a porn filter prior to even giving GPT-4 a chance to screw that up…


I suppose it could do it from porn snapshots, kinda like the porn-ID thing on Reddit. I can see more nefarious uses, like identifying car licence plates or faces from public cameras for digital stalking. I know it can be done right now with ALPRs, but those have to be purpose-built with specialty camera setups. If this makes it ubiquitous, that would be a privacy/security nightmare.



Yea it's incredible. Looks like tooling in the LLM space is quickly following suit: https://twitter.com/gpt_index/status/1635668512822956032


Am I the only one who thought GPT-4 got this one wrong? It's not simply that it's ridiculous to plug what appears to be an outdated VGA cable into a phone; it's that the cable connector does nothing at all. I'd argue that's what's actually funny. GPT-4 didn't mention that part, as far as I could see.



