
I'm really tired of these papers and experiments.

You cannot test reasoning when you don't know what's in the training set. You have to be able to differentiate reasoning from memorization, and that's not trivial.

More to the point, the results seem to confirm that at least some memorization is going on. Do we really believe GPT hasn't been trained extensively on arithmetic in bases 10, 8, and 16? That seems like a terrible prior. Even if not explicitly, how much code has it read that performs these operations? How many web pages, tutorials, and Reddit posts cover octal and hex? They also haven't defined zero-shot correctly: arithmetic in these bases isn't zero-shot, it's explicitly in distribution...
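To make the prior concrete: handling bases 8 and 16 is built into every mainstream language, so code exercising exactly these bases is all over any web-scale corpus. A quick Python illustration (my own, just to show how ubiquitous this is):

```python
# Octal and hex conversions are language built-ins, so tutorials and
# ordinary code touching these bases are everywhere online.
a = int("17", 8)       # octal 17 -> decimal 15
b = int("ff", 16)      # hex ff  -> decimal 255
print(oct(a + 1))      # 0o20
print(hex(b + 1))      # 0x100
```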

I'm unsure about bases 9 and 11. It's pretty interesting that GPT-4 is much better at these. Anyone know why? Did they train on these, or on more bases generally? It doesn't seem unreasonable, but I don't know.

The experimentation is also extremely thin. The arithmetic benchmark has only 1,000 tests of adding two digits, which is certainly in the training data. I'm also unconvinced by the syntactic reasoning tasks, since the transformer (attention) architecture seems designed for exactly this, and I doubt those tasks are absent from training either. Caesar ciphers are likewise certainly in the training data.
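On the Caesar cipher point: the whole algorithm is a few lines, which is exactly why tutorial implementations are everywhere online. A sketch (my own, not from the paper):

```python
def caesar(text, shift):
    """Shift each letter by `shift` positions, wrapping around the alphabet."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # leave spaces and punctuation alone
    return ''.join(out)

print(caesar("attack at dawn", 3))   # dwwdfn dw gdzq
```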

The prompts are also odd, and I guess that's why they're in the appendix. For example, getting GPT to do better at math (or many other tasks) by having it write Python code is not novel.

There's some substance here, but this really doesn't seem like a lot of work for 12 people from a top university and a trillion-dollar company. It's odd to see that many authors when the experiments can be run in a fairly short time.



We can tell some of what's in the training set. One of the answers for the inductive reasoning test begins "begin from the rightmost digit". Look that phrase up on Google: it shows up in Chegg, Course Hero, and Brainly content for elementary arithmetic. Starting from those how-to articles, which are available for bases 2 and 10, you can probably generate the pattern for base 8.
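The "begin from the rightmost digit" recipe is the same schoolbook carry algorithm in every base, so generalizing the base-2/base-10 how-tos to base 8 is mostly a matter of substituting the base. A sketch of that recipe (my own illustration, not the paper's):

```python
def add_digits(x, y, base):
    """Schoolbook addition: begin from the rightmost digit, carry leftward.
    x and y are digit strings in the given base (2..16)."""
    digits = "0123456789abcdef"
    i, j, carry, out = len(x) - 1, len(y) - 1, 0, []
    while i >= 0 or j >= 0 or carry:
        s = carry
        if i >= 0:
            s += digits.index(x[i]); i -= 1
        if j >= 0:
            s += digits.index(y[j]); j -= 1
        out.append(digits[s % base])   # current digit
        carry = s // base              # carry to the next column
    return ''.join(reversed(out))

print(add_digits("17", "11", 8))   # 30  (15 + 9 = 24 = 0o30)
```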

This looks like an LLM doing the usual LLM thing: finding relevant items and combining them to fit. That doesn't require the level of abstraction and induction the authors impute to the model. Ordinary LLM behavior explains the result, once you've found the relevant training data.

People often solve problems that way too, of course.


That reminds me of an old paper about "Benny's Rules", a case study of a kid who seemed to do better than average on math tests when it came to final answers... but for all the wrong reasons, using an inferred set of arbitrary text-manipulation rules.

The intent was to point out that the educational approach was flawed, but I think there are interesting parallels to token processing in LLMs, which--unlike a human child--are built in such a way that crazy partial-fit rules are likely their only option.

> Benny believed that the fraction 5/10 = 1.5 and 400/400 = 8.00 because he believed the rule was to add the numerator and denominator and then divide by the number represented by the highest place value.

https://blog.mathed.net/2011/07/rysk-erlwangers-bennys-conce...


This is a problem with some tests. The students may detect a pattern in the test answers which reflects the work of those generating the answers, not the content.

See this article on SAT test prep.[1] The requirement that only one answer can be right means that wrong answers have easily identifiable properties.

[1] https://blog.prepscholar.com/the-critical-fundamental-strate...


You'll probably find this talk [1] interesting. They control all the training data for small LLMs and then perform experiments (including reasoning experiments).

[1] Physics of LLMs: https://www.youtube.com/watch?v=yBL7J0kgldU&t=7s


How do you define memorization and reasoning? There is a large grey area between them. Some say that if you memorize facts and algorithms and apply them to new data, it is memorization. Some say that it is reasoning.

More than that -- it's not clear that what humans do is not "just" memorization. We can always look at human experience mechanistically and say that we don't think -- we just memorized thinking patterns and apply them when speaking and "thinking".


  >  It's not clear that what humans do is not "just" memorization.
While I agree that there is a lot of gray in between, I think you are misrepresenting my comment. And I'm absolutely certain humans do more than memorization. Not all humans, but that's not the bar: some humans are brain-damaged, and some are in fact babies (and many scientists do agree that sentience doesn't appear at birth).

If you doubt me, I very much encourage you to dive into the history of science and develop deep knowledge of any subject, because you'll find this happens all the time. And if you apply a loose enough definition of memorization (one that wouldn't be generally agreed upon once you follow it to its logical conclusions), then yeah, everything is memorization. But everything is foo if I define everything to be foo, so let's not.


A lot of reasoning is similar to interpolation within a sparse set of observations. Memorization is rounding to the nearest known example; a basic guess is linear interpolation; and reasoning is discovering the simplest rule that explains all the observations, then using that rule to extrapolate.
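That spectrum can be made concrete with a toy example. Suppose the (unknown) ground truth is f(x) = 2x and we've only observed three points; the three strategies behave very differently (the names and setup here are my own illustration):

```python
# Observations of the unknown rule f(x) = 2x at three points.
obs = {1: 2, 3: 6, 5: 10}

def memorize(x):
    """Round to the nearest known example and return its stored answer."""
    nearest = min(obs, key=lambda k: abs(k - x))
    return obs[nearest]

def interpolate(x):
    """Linear interpolation between the two bracketing observations."""
    lo = max(k for k in obs if k <= x)
    hi = min(k for k in obs if k >= x)
    if lo == hi:
        return obs[lo]
    t = (x - lo) / (hi - lo)
    return obs[lo] + t * (obs[hi] - obs[lo])

def reason(x):
    """The simplest rule explaining all observations; extrapolates freely."""
    return 2 * x

print(memorize(4), interpolate(4), reason(100))   # 6 8.0 200
```

Memorization snaps to a stored answer, interpolation only works between known points, and only the recovered rule extrapolates to x = 100.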


>> Some say that if you memorize facts and algorithms and apply them to new data, it is memorization. Some say that it is reasoning.

Who are those "some" who say it is reasoning?

Here's a question. If you enter a command in your terminal do you expect your shell to have memorised the result of the command from some previous experience, or do you expect your shell to compute the result of your command according to its programming and your inputs? A rhetorical question: we all assume the latter.

Which one is closer to most people's informal conception of "reasoning": retrieving an answer from storage, or computing an answer from the inputs given to a program?

>> We can always look at human experience mechanistically and say that we don't think -- we just memorized thinking patterns and apply them when speaking and "thinking"

I think this confuses memorisation of the rules required to perform a computation (like a program stored in computer memory) with memorisation of the results of a computation. When we distinguish between memorisation and reasoning, we usually distinguish between computing a result from scratch and retrieving it from storage without re-computing it, as in caching, memoization, or a database lookup.

For a real-world example: we memorise our times tables, but we don't memorise the result of every sum x + y = z; instead we memorise a summation algorithm that we then use to derive the sum of two numbers.
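In code the distinction is exactly memoization versus computation: a cache hit retrieves a stored result, a cache miss runs the algorithm. A minimal sketch:

```python
from functools import lru_cache

def add(x, y):
    """Compute the sum from scratch each time (the 'algorithm')."""
    return x + y

@lru_cache(maxsize=None)
def cached_add(x, y):
    """First call computes; later calls retrieve the stored result."""
    return x + y

cached_add(2, 3)                      # computed (cache miss)
cached_add(2, 3)                      # retrieved from storage (cache hit)
print(cached_add.cache_info().hits)   # 1
```

Both functions return the same answers, which is part of why the distinction is so hard to probe from the outside.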


> Some say that if you memorize facts and algorithms and apply them to new data, it is memorization. Some say that it is reasoning.

Memorizing facts and algorithms is memorization. The rest of what you are talking about is not.

Applying existing knowledge to new data without deriving new information is generalization. An example is a semantic segmentation model classifying a car it has never seen; if the model was not trained on birds, it will never classify a bird as a bird.

Computation of decidable problems is a large, possibly the largest, subset of reasoning. Most humans do not struggle with solving decidable problems; the problem is that they are slow and can only handle small problem sizes. But most problems encountered in practice aren't one large decidable problem: they are chains of dozens to hundreds of small, heterogeneous problems seamlessly mixed with one another. LLMs struggle with decidable problems that are out of distribution, but you can give a human instructions for something they have never done before and they will follow them with no problem.

> More than that -- it's not clear that what humans do is not "just" memorization.

I hope it is clear that I did not memorize this message I am writing here and that it is the unique result of processes inside my brain that were not captured in the training process of an LLM.

>We can always look at human experience mechanistically and say that we don't think -- we just memorized thinking patterns and apply them when speaking and "thinking"

Again you are trying to twist this in an absurd direction. Let's come up with a teleoperated humanoid robot on Mars that is controlled by a human on Earth. The robot acts exactly like a human does. Does this mean the robot is now capable of reasoning and thinking like a human, simply because it is replaying a recording of the human's body and speech? This is the argument you are making. You're arguing that the robot's ability to replay a human's actions is equivalent to the processes that brought about that human action.


  > Let's come up with a teleoperated humanoid robot on Mars
One example I've always liked is from Star Trek: they have holodecks, and no one thinks those are sentient people, even though they are adaptive.

I don't care what Ilya said; mimicking a human does not make a human. If it looks like a duck, swims like a duck, and quacks like a duck, then it's probably a duck, but you haven't ruled out an advanced animatronic. In fact, I'm betting people could build an animatronic right now that would convince most people it's a duck, because most people just don't know the nuances of duck behavior.


I think your "advanced animatronic" is a duck until you can devise a test that cleanly separates it from a "real duck". A test of "duckness".

If today's LLMs are not really intelligent (hardly anyone is arguing that LLMs are literally human), then by all means devise your intelligence test: one that cleanly qualifies humans (and any other creatures you consider capable of intelligence) and disqualifies LLMs. In good faith, it should be something both can reasonably attempt.

I'll save you some time, but feel free to prove me wrong: you will not be able to do so. Not because LLMs can solve every problem under the Sun, but because at best you will create something that disqualifies LLMs and a good chunk of humans along the way.

Anybody who cannot manage this task (which should be very simple if the difference is as obvious as a lot of people claim) has no business saying LLMs aren't intelligent.


  > I think your "advanced animatronic" is a duck until you can devise a test that cleanly separates it from a "real duck". A test of "duckness".
I think you're reaching and willfully misinterpreting.

  > You will not be able to do so.
I am unable to because any test I make, or any other researcher in the field makes, will produce an answer you don't like.

River crossing puzzles are a common test. Sure, humans sometimes fail even the trivial variations, but the important part is how they fail. Humans guess the wrong answer. LLMs will tell you the wrong answer while describing steps that are correct and result in a different answer. It's the inconsistency and contradiction that's the demonstration of lack of reasoning and intelligence, not the failure itself.


>I think you're reaching and willfully misinterpreting.

That's the spirit of the popular phrase, isn't it? I genuinely never read it as saying that literally only those three properties were the bar.

>I am unable to because any test I make, or any other researcher in the field makes, will produce an answer you don't like.

This is kind of an odd response. I mean maybe. I'll be charitable and agree wholeheartedly.

But I don't know what my disappointment has to do with anything. If I could prove something I deeply believed to be true, and which would also be paper-worthy, I certainly wouldn't let the reactions of an internet stranger stop me.

>River crossing puzzles are a common test. Sure, humans sometimes fail even the trivial variations, but the important part is how they fail. Humans guess the wrong answer. LLMs will tell you the wrong answer while describing steps that are correct and result in a different answer. It's the inconsistency and contradiction that's the demonstration of lack of reasoning and intelligence, not the failure itself.

Inconsistency and contradiction between the reasoning people report (and even believe) and the decisions they make is such a common staple of human reasoning that we have a name for it... At worst you could say these contradictions don't always take the same form, but that just loops back to my original point.

Let me be clear here, if you want to look at results like these and say - "This is room for improvement", then Great!, I agree. But it certainly feels like a lot of people have a standard of reasoning (for machines) that only exists in fiction or their own imaginations. This general reasoning engine that makes neither mistake nor contradiction in output or process does not exist in real life whether you believe humans are the only beings capable of reasoning or are gracious enough to extend this capability to some of our animal friends.

Also, I've seen LLMs fail trivial variations of these logic puzzles, only to get them right when you present the problem in a way that doesn't look exactly like the puzzle they've almost certainly memorized. Sometimes it's as simple as changing the names involved. Isn't that fascinating? Humans have a similar cognitive shortcoming.


I think the results still tell us something.

Discrepancies in mathematical ability between the various bases would seem to suggest memorization as opposed to generalization.


>> The arithmetic questions only have 1000 tests where they add two digits. This is certainly in the training data.

Yeah, it's confirmation bias. People do that sort of thing all the time in machine learning research, especially in the recent surge of LLM-poking papers. If they didn't do it, they wouldn't have a paper, so we're going to see much more of it before the trend exhausts itself.



