Could you explain how/why GRAM cannot be interpreted or aligned how current LLMs...

kmavm · 2026-05-28T18:47:34 1779994054

Crudely? Because you can't grep a sequence of latent states for variants of "If I kill all the puny humans, I can <achieve my current goal>."

onlyrealcuzzo · 2026-05-28T20:19:53 1779999593

Why do you need to grep latent space?

As long as it's giving the right outputs, who cares what's in latent space?

If the model thinks in latent space: "God I wish these people would die," and constantly does the right thing, who cares?

Additionally, if one of its latent spaces that it never explores is a psychopath -> who cares? The path never gets taken...

That's a lot of harmless people walking around with crazy thoughts...

noddybear · 2026-05-28T20:37:07 1780000627

Thinking ‘God I wish these people would die’ could increase its propensity to kill all people, even if that propensity is still vanishingly small almost all of the time.

A lot of people are walking around with crazy thoughts. Some of them do harm.

notrealyme123 · 2026-05-29T07:25:18 1780039518

Readable reasoning traces are a convenient thing, but they don't have to be true in any way. It's actually dangerous to think that.

randomNumber7 · 2026-05-28T23:40:00 1780011600

Tell me you never had a crazy thought and you are either a liar or a psychopath.

sometimelurker · 2026-05-28T19:55:05 1779998105

Sibling comment got to the main points before me, but to add to kmavm's reply: the attack surface for gradient descent to get the system to exchange "bad" information is much larger in latent reasoning models (like GRAM). You get ~3 OoM more bits per forward pass of the system coming back to itself (~17 bits per token in a standard CoT vs the whole residual stream of the model @ f16 = a few kb), and even if you could sift through all that for signs of misalignment, you just can't put a blockade on all of the bad things that leak through.
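For anyone who wants to check the arithmetic, here is a rough back-of-envelope sketch. The numbers are assumptions chosen for illustration, not GRAM's actual configuration: a vocabulary of 131,072 tokens (which gives exactly 17 bits per sampled token) and a hypothetical residual stream width of 4,096 values stored as f16. It only compares raw channel capacity, which is an upper bound on what the model could pass to itself, not a measure of what it actually uses.

```python
import math

# Assumed, illustrative values (not taken from any specific model):
VOCAB_SIZE = 131_072   # 2**17 tokens -> 17 bits per sampled CoT token
D_MODEL = 4_096        # hypothetical residual stream width
BITS_PER_VALUE = 16    # f16 storage


def bits_per_token(vocab_size: int) -> float:
    """Upper bound on information carried by one discrete token."""
    return math.log2(vocab_size)


def bits_per_latent_state(d_model: int, bits_per_value: int) -> int:
    """Raw capacity of one residual stream vector fed back to the model."""
    return d_model * bits_per_value


cot_bits = bits_per_token(VOCAB_SIZE)
latent_bits = bits_per_latent_state(D_MODEL, BITS_PER_VALUE)
ratio = latent_bits / cot_bits
orders_of_magnitude = math.log10(ratio)

print(f"CoT token:    {cot_bits:.0f} bits")
print(f"Latent state: {latent_bits} bits ({latent_bits // 8} bytes)")
print(f"Ratio:        {ratio:.0f}x (~{orders_of_magnitude:.1f} OoM)")
```

Under these assumptions the latent channel is a few thousand times wider than a single token, which is where the "~3 OoM" and "a few kb" figures come from. A different width or vocabulary shifts the exact ratio but not the order of magnitude.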

haldujai · 2026-05-28T20:38:26 1780000706

I think you’re overstating the impact of interpretability here. Your earlier points, that latent reasoning models can’t be trained very well and that discretization may be load-bearing rather than a readability tax, together with significant inference infra hurdles (e.g. batching, speculative decoding), have limited any serious attempts and reduced the theoretical advantage over CoT, at least in the near term.

sometimelurker · 2026-05-28T23:06:08 1780009568

> I think you’re overstating the impact of interpretability here

Outside of RLAIF, interpretability is the strongest way to do alignment right now. Alignment is important because otherwise LLMs are incentivized to learn power-seeking, dangerous behaviours [1]. A more down-to-earth example of alignment being important is that agents are incentivized to do tasks in the shortest way possible, and this way might not be what the user wants (I explain this further in another comment in this thread).

[1] https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...

haldujai · 2026-05-29T00:31:22 1780014682

You’re putting the cart before the horse - alignment is an unsolved challenge (there are proposed approaches and active research on this) but it is still not established (beyond theory) that latent reasoning is more capable than CoT on hard language reasoning, particularly at scale.

ACCount37 · 2026-05-28T20:06:22 1779998782

Most alignment methods nowadays don't rely on interpretability. And not all LLM vendors care much about alignment, especially not in China.

Those things being untrainable at scale is why they aren't around. Alignment is an afterthought.

sometimelurker · 2026-05-28T23:07:06 1780009626

China should care: https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...

ACCount37 · 2026-05-28T23:11:48 1780009908

As is, Chinese labs spend more effort on "rhetorical alignment to the party line" than alignment of any other kind.