Well, that didn't take long, did it? 50% on ARC public test set [1] less than a ...

godelski · on June 23, 2024

> Well, that didn't take long, did it? 50% on ARC public test set [1] less than a week after the announcement of the prize.

I think you also misunderstand the challenge, and the author very clearly misunderstands neurosymbolic AI, since that is exactly what he implements: he has the model generate programs and then searches over those programs. He also tries to challenge Francois's claims ("What it means about current LLMs") while he actively performs "claim 1" and misunderstands the context of "claim 3". For claim 3, the model weights are frozen, so there is no online learning. That is distinct from what's going on here, where he updates the model's priors before it answers, but whatever insights the model gains from this exercise do not persist after execution, i.e. there is no continual learning. "Claim 2" is just irrelevant.
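To make "generate programs and then search over those programs" concrete, here is a toy sketch. It is not the author's pipeline: the candidate programs are hand-written stand-ins for what an LLM would sample, and every name in it (CANDIDATES, search_programs, the example grids) is made up for illustration.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]

# Hypothetical candidate "programs". In the real setup these would be
# sampled from an LLM; here they are hard-coded grid transformations.
def identity(g: Grid) -> Grid:
    return [row[:] for row in g]

def flip_horizontal(g: Grid) -> Grid:
    return [row[::-1] for row in g]

def flip_vertical(g: Grid) -> Grid:
    return [row[:] for row in g[::-1]]

def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]

CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}

def search_programs(train_pairs, candidates) -> List[str]:
    """Return the names of candidates that reproduce every demonstration output."""
    return [
        name
        for name, fn in candidates.items()
        if all(fn(pair["input"]) == pair["output"] for pair in train_pairs)
    ]

# Made-up demonstration pairs in the same shape ARC tasks use.
train_pairs = [
    {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
    {"input": [[4, 5, 6]], "output": [[6, 5, 4]]},
]

survivors = search_programs(train_pairs, CANDIDATES)

# Apply the first surviving program to an unseen input.
test_input = [[7, 8], [9, 0]]
prediction = CANDIDATES[survivors[0]](test_input) if survivors else None
```

The symbolic part is the search: candidates are filtered by whether they satisfy the demonstrations exactly, and nothing the search "learns" survives once the script exits.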

A key part that is concerning to me is this

  > In addition to iterating on the training set, I also did a small amount of iteration on a 100 problem subset of the public test set.

The train and test sets are quite different, so if he learned anything from the test set then that invalidates the result. And as far as I can tell, he does combine... https://github.com/rgreenblatt/arc_draw_more_samples_pub/blo...

Potentially the confusion is that each data file itself contains a "train" key and a "test" key: "train" holds your sample pairs and "test" holds the actual input/output pair you have to solve. That is separate from the dataset split. You're only supposed to train on ARC-AGI/data/training, and you cannot use ARC-AGI/data/evaluation for anything other than... evaluation.
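A minimal sketch of the two different meanings of "train"/"test", assuming the usual ARC task layout (a JSON object with "train" and "test" lists of input/output grids). The inline task, the file name abc.json, and the helper names are all hypothetical.

```python
import json
from pathlib import PurePosixPath

# A made-up task in the shape of an ARC task file: "train" holds the
# demonstration pairs, "test" holds the pair(s) you must actually solve.
EXAMPLE_TASK_JSON = """
{"train": [{"input": [[0, 1]], "output": [[1, 0]]},
           {"input": [[2, 3]], "output": [[3, 2]]}],
 "test":  [{"input": [[4, 5]], "output": [[5, 4]]}]}
"""

def split_task(task: dict):
    """Return (demonstration pairs, query pairs) from a single task."""
    return task["train"], task["test"]

def usable_for_development(path: str) -> bool:
    """True only for files under the training directory of the dataset.

    This is the dataset-level split, which is independent of the
    "train"/"test" keys found inside every individual task file.
    """
    parts = PurePosixPath(path).parts
    return "training" in parts and "evaluation" not in parts

task = json.loads(EXAMPLE_TASK_JSON)
demos, queries = split_task(task)

ok_path = "ARC-AGI/data/training/abc.json"     # hypothetical file name
held_out = "ARC-AGI/data/evaluation/abc.json"  # hypothetical file name
```

Every file in both directories has "train" and "test" keys, so seeing a "train" key inside an evaluation file does not make that file fair game for iterating on your method.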

Not to mention that we don't know what data is in GPT. It would not be surprising if this dataset was in it. Maybe they filtered out the official repo, but there are plenty of examples around the web. Did they check for all such examples? If not, then the result is entirely invalidated.

There's a lot of reason to believe information leakage exists here.

So I'll wait for an open solution before I start to

> Re: academics - good ideas get noticed.

I also need to stress that ARC has been tested on LLMs for quite some time now. You can go see it in both the GPT-2 and GPT-3 papers, though these are different versions than the one in the current competition. That version has ARC-e and ARC-c, for easy and challenge. GPT-3 gets 68.8/51.4 "zero-shot" (I'm not confident) and the original LLaMA gets 78.9/56.0. So really, if people weren't aware of ARC prior to the video, then it demonstrates that they are not doing this kind of research or even reading the papers.

And I think we need to clearly differentiate academics from the general public. I'm including anyone with a "machine learning researcher" or "machine learning engineer" title in "academics." This is where all the building is happening, and these people should all be very aware of ARC. The public not knowing is a whole different story and isn't really all that important, is it? They're not the ones improving these systems (for the most part; there are of course always exceptions to the rule).