buttered_toast's comments

buttered_toast · 2026-02-19T18:45:40 1771526740

I think we need to reevaluate what purpose these sorts of questions serve and why they're important in regards to judging intelligence.

The model getting it correct or not at any given instance isn't the point, the point is if the model ever gets it wrong we can still assume that it still has some semblance of stochasticity in its output, given that a model is essentially static once it is released.

Additionally, hey don't learn post training (except for in context which I think counts as learning to some degree albeit transient), if hypothetically it answers incorrectly 1 in 50 attempts, and I explain in that 1 failed attempt why it is wrong, it will still be a 1-50 chance it gets it wrong in a new instance.

This differs from humans, say for example I give an average person the "what do you put in a toaster" trick and they fall for it, I can be pretty confident that if I try that trick again 10 years later they will probably not fall for it, you can't really say that for a given model.

energy123 · 2026-02-19T18:56:36 1771527396

They're important but not as N=1. It's like cherry picking a single question from SimpleQA and going aha! It got it right! Meanwhile it's 8% lower score than some other model when evaluated on all questions.

buttered_toast · 2026-02-19T19:05:24 1771527924

Makes me wonder what people would consider better, a model that gets 92% of questions right 100% of the time, or a model that gets 95% of the questions right 90% of the time and 88% right the other 10%?

I think that's why benchmarking is so hard for me to fully get behind, even if we do it over say, 20 attempts and average it. For a given model, those 20 attempts could have had 5 incredible outcomes and 15 mediocre ones, whereas another model could have 20 consistently decent attempts and the average score would be generally the same.

We at least see variance in public benchmarks, but in the internal examples that's almost never the case.

buttered_toast · 2026-02-13T20:23:21 1771014201

I would interpret it as implying that the result was due to a lot more hand-holding that what is let on.

Was the initial conjecture based on leading info from the other authors or was it simply the authors presenting all information and asking for a conjecture?

Did the authors know that there was a simpler means of expressing the conjecture and lead GPT to its conclusion, or did it spontaneously do so on its own after seeing the hand-written expressions.

These aren't my personal views, but there is some handwaving about the process in such a way that reads as if this was all spontaneous involvement on GPTs end.

But regardless, a result is a result so I'm content with it.

lupsasca · 2026-02-13T21:43:23 1771019003

Hi I am an author of the paper. We believed that a simple formula should exist but had not been able to find it despite significant effort. It was a collaborative effort but GPT definitely solved the problem for us.

buttered_toast · 2026-02-13T21:48:47 1771019327

Oh that's really cool, I am not versed in physics by any means, can you explain how you believed there to be a simple formula but were unable to find it? What would lead you to believe that instead of just accepting it at face value?

lupsasca · 2026-02-13T21:57:50 1771019870

There are closely related "MHV amplitudes" which naively obey a really complicated formula, but for which there famously also exists a much simpler "Parke-Taylor formula". Alfredo had derived a complicated expression for these new "single-minus amplitudes" and we were hoping we could find an analogue of the simpler "Parke-Taylor formula" for them.

buttered_toast · 2026-02-13T22:00:49 1771020049

Thank you for taking the time to reply, I see you might have already answered this elsewhere so it's much appreciated.

lupsasca · 2026-02-13T23:26:11 1771025171

My pleasure---thank you for your interest!

etraql · 2026-02-13T22:41:27 1771022487

Do you also work at OpenAI? A comment pointing that out was flagged by the LLM marketers.

buttered_toast · 2026-02-13T23:32:22 1771025542

I think it says in the paper that he does, but it's also public knowledge.

https://www.linkedin.com/in/alex-lupsasca-9096a214/

lupsasca · 2026-02-14T05:52:22 1771048342

Correct, on both counts!

buttered_toast · 2026-02-13T20:02:47 1771012967

Absolutely no way this is true right? Ilya left around the time 4o was released. I can't imagine they haven't had a single successful run since then.

verdverm · 2026-02-13T20:21:58 1771014118

When's the last time they talked about it?

I heard this from people who know more than me

buttered_toast · 2026-02-13T20:25:18 1771014318

Can't say, just seems implausible, but I am a nobody anyways ¯\_(ツ)_/¯

verdverm · 2026-02-13T22:06:36 1771020396

I'm pretty sure it is widely known that the early 5.x series were built from 4.5 (unreleased). It seems more plausible the 5.x series is still in that continuation.

For some extra context, pre-training is ~1/3 of the training, where it gains the basic concepts of how tokens go together. Mid & late training are where you instill the kinds of anthropic behaviors we see today. I expect pre-training to increasingly become a lower percentage of overall training, putting aside any shifts of what happens in each phase.

So to me, it is plausible they can take the 4.x pre-training and keep pushing in the later phases. There is a lot of results out there to show scaling laws (limits) have not peaked yet. I would not be surprised to learn that Gemini 3 Deep Research had 50% late-training / RL

buttered_toast · 2026-02-13T22:19:20 1771021160

Okay I see what you mean, and yeah that sounds reasonable too. Do you have any context on that first part? I would like to know more about how/why they might not have been able to pursue more training runs.

verdverm · 2026-02-13T22:47:21 1771022841

I have not done it myself (don't have the dinero), but my understanding is that there are many runs, restarts, and adjustments at this phase. It's surprisingly more fragile than we know aiui

If you already have a good one, it's not likely much has changed since a year ago that would create meaningful differences at this phase (in data, arch is diff, I know less here). If it is indeed true, it's a datapoint to add to the others singling internal (everybody has some amount of this, not good when it makes the headlines)

Distillation is also a powerful training method. There are many ways to stay with the pack without having new pre-training runs. It's pretty much what we see from all of them with the minor versions. So coming back to it, the speculation is that OpenAi is still on their 4.x pre-train, but that doesn't impede all progress

buttered_toast · 2026-02-13T14:25:24 1770992724

Is there a way you can showcase a few of these?

simonw · 2026-02-13T15:30:31 1770996631

Not without people later saying "you shared that on Hacker News last year clearly the AI labs are training for it now!"

buttered_toast · 2026-02-13T15:54:23 1770998063

Couldn't you just make up new combinations, or new caveats indefinitely to mitigate that? It would be nice to see maybe 3-4 good examples for validation. I'd do it myself, but I don't have $200 to play around with this model.

simonw · 2026-02-13T16:28:42 1771000122

Here's what it gave me for a kakapo on a skateboard https://gist.github.com/simonw/5e2041c32333effd090e3df42b64d...

buttered_toast · 2026-02-13T16:40:52 1771000852

Thank you!