
omg. So is the TL;DR:

- Avoiding building something that turns the universe to paper clips in order to satisfy a prompt is a problem they are genuinely struggling with now.

- They do it by spying on the words generated during CoT. "I can do this quickly by turning the Universe into paper clips. Wait - they won't like that. But there is no need to mention it." - SMACK!

- But you can speed things up immensely (3 orders of magnitude!) by skipping the output layer (and, I guess, compressing the context window / KV cache, otherwise a speed-up of 3 orders of magnitude seems impossible), which would give whoever pulled it off a huge advantage.

- Downside is humans can't see the CoT anymore, so they can't see what the machine is planning. Keeping the final output layer to spy doesn't work because the model uses its hidden reasoning to sanitise it.

How can this possibly go wrong?



Because it doesn't work like how you think at all. You're still thinking it works like Chain of Thought. It doesn't. And the difference is key!

It works by introducing probabilistic noise, and exploring N paths fully (each with noise) in parallel (all compressed).

It's reasoning at a much, much smaller (probabilistic) level than running everything through the expensive large model (deterministic) and sometimes catching that it said, "I think 1.12 is greater than 1.9 because 12 is bigger than 9, final answer".

The easiest way to think about it is: if you understand how hyper words work, it's as if it's searching for different versions of the hyper words that probabilistically would lead to better outcomes IF it fed them to the LLM, before it actually does.

That's not actually how it works exactly. But I think it is close enough to be helpful to understand where the gain is, a rough idea of what's happening (searching paths), and how it can potentially have huge orders of magnitude improvements (doing so without paying the full price of exploring the paths through the expensive and huge model).
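To make the "searching paths" idea concrete, here is a toy sketch of the general pattern only: perturb a latent vector with noise N times, score each candidate with something cheap, and keep the best one. It is not GRAM's actual mechanism. The names `perturb`, `cheap_score`, `best_of_n` and the `target` vector are all hypothetical, and a real system would not have a known target to score against.

```python
import random

def perturb(vec, scale, rng):
    """Return a copy of vec with Gaussian noise added to every dimension."""
    return [x + rng.gauss(0.0, scale) for x in vec]

def cheap_score(vec, target):
    """Hypothetical cheap scorer: negative squared distance to a target vector.
    Stands in for 'how good would the outcome be', without running a big model."""
    return -sum((a - b) ** 2 for a, b in zip(vec, target))

def best_of_n(latent, target, n_paths=8, scale=0.5, seed=0):
    """Explore n_paths noisy variants of latent and keep the best-scoring one.
    The unperturbed latent is included, so the result is never worse than it."""
    rng = random.Random(seed)
    candidates = [latent] + [perturb(latent, scale, rng) for _ in range(n_paths)]
    return max(candidates, key=lambda v: cheap_score(v, target))

latent = [1.0, 0.0, -1.0]   # toy 3-dimensional "hyper word"
target = [0.5, 0.5, -0.5]   # assumption: a known good direction, for illustration only
chosen = best_of_n(latent, target)
```

The point of the sketch is the cost structure: every candidate is scored by the cheap function, and only the single `chosen` vector would be handed to the expensive model.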

And also why it is so much harder to determine what it's "thinking".

If you aren't familiar with hyper words, this is an amazing series: https://youtu.be/eMlx5fFNoYc?si=49KHjn5IrVtyyaFq

The general idea is that a token is a multidimensional vector that represents a word -> think of "man" as [noun, singular, English, pronoun, masculine, contemporary, ...]. Each time it sees a new word, it mutates this word into some new token (often never seen before) that means something. That's how it can roll up a 1M line context into a shorter context and somehow keep most of the meaning: it mutates all the words into different words that individually mean nothing, but when put next to each other represent the thing you likely want to do, in a way the LLM can somehow make sense of.

Similarly, GRAM operates entirely in a latent space that doesn't mean anything to us, but it's able to predict N different full paths WITHOUT actually exploring them fully through the LLM before it sends the one it "thinks" is best to the LLM.

If you understand how hyper words work, you can understand the noise injection... It's like it's saying, if instead of the user saying "The quick round fox" it said "The quick brown fox" -> I could probably give a response that's more like the answer they want. It's obviously far more sophisticated in the ways it can help than just a simple typo.

Something may have pushed a hyper word for "man" to somehow become a lot more like "woman", and GRAM allows it to look at the different hyper words and say... Hmm... Maybe if I changed this one gender dimension over here on this one word, maybe the entire outcome would be dramatically better. Let's try it!

Standard models compute these "hyper words" internally but immediately decode them into human language text tokens to form a Chain of Thought. Once decoded into a rigid real word, the multidimensional nuance of the continuous vector is lost!
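A toy illustration of that loss, assuming a two-word vocabulary with made-up 2-dimensional embeddings (the `vocab` table, `decode` and `reembed` are hypothetical names, not any real model's API): two different hidden vectors decode to the same word, so once they are turned back into that word's embedding the difference between them is gone. It looks at a single position in isolation and ignores the rest of the context window.

```python
# Hypothetical 2-dimensional embeddings for a two-word vocabulary.
vocab = {
    "man":   [1.0, 0.0],
    "woman": [0.0, 1.0],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decode(hidden):
    """Pick the vocabulary word whose embedding has the largest dot product."""
    return max(vocab, key=lambda word: dot(vocab[word], hidden))

def reembed(word):
    """Turn a decoded word back into its fixed embedding."""
    return vocab[word]

hidden_a = [0.9, 0.1]   # strongly "man"
hidden_b = [0.6, 0.4]   # "man", but leaning noticeably toward "woman"

word_a = decode(hidden_a)
word_b = decode(hidden_b)
```

Both `word_a` and `word_b` come out as "man", and `reembed` maps both back to `[1.0, 0.0]`, even though `hidden_a` and `hidden_b` were different vectors.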

Hyper words are the exact thing that make LLMs able to actually be smart! They can add so much more meaning to a word than a human ever could imagine - try to put 10,000 dimensions on the word "the"... Forcing them to decode them back into our dumb, un-contextualized, rudimentary language and losing all the valuable information they have - just so we can inspect it - OBVIOUSLY makes them enormously less intelligent!

It's as if we forced your eyeballs to turn everything they saw into words before feeding it to your optic nerves, just so your optic nerves could check that you didn't see something harmful before they sent the words to your brain... instead of just sending light signals directly.


Thanks for the link to the 3b1b video. I enjoyed the entire series, and some of those he linked to as well. The linked ones explained the history of how they got there - which was new to me and really helps cement some ideas.

However, I didn't learn much. Which means as far as I can tell, my mental model of how it works wasn't far off. So yes - I was already aware one interpretation of how these things work is that LLMs turn concepts into vectors in a high dimensional space, and high level abstractions are linear summations of these vectors.

Given that model, parts of your comment don't make much sense to me. For example "it's able to predict N different full paths WITHOUT actually exploring them fully" - why do you think that's so? And "GRAM allows it to look at the different hyper words and say" - no, GRAM does not look at different hyper words, or at least no more than a non-GRAM LLM does. Only the last word (the 4k vector or whatever dimension they are using) is fed back through the machine.

Regarding "Once decoded into a rigid real word, the multidimensional nuance of the continuous vector is lost!". Yes, and no. Yes, the decoded word doesn't mean much compared to the vector it was derived from. But the machine isn't operating on just the last word. It's operating on the information in the entire context window (which could be hundreds of thousands of tokens).

There is undeniably a lot of nuance encoded in one vector, and yes you're right - it can't be represented with just one word. But it can be represented using a string of words, and generating that representation by spitting out more words is partially what an LLM is doing when it generates text. It's only partially doing that because it's randomly mutating that last token as it goes, and pulling in information into the vector from the MLP layers.

Re "LLMs ... add so much more meaning to a word than a human ever could imagine - try to put 10,000 dimensions on the word "the" ... OBVIOUSLY makes them enormously less intelligent!". No, it doesn't obviously do that. A vector is maybe 16k bytes (depending on the number of dimensions). That corresponds to around 5000 words. Humans have no trouble connecting those 5000 words into a single concept - which would presumably spell out the concept represented by the vector. Same meaning - just encoded differently. In computer science terms, we could say the 16k vector is serialised into a sequence of words.
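For what it's worth, the arithmetic behind those figures can be written out. This assumes the "4k vector" mentioned earlier means 4096 dimensions and that each value is a 32-bit float; both are assumptions, not something stated about any particular model. It is raw storage arithmetic only and says nothing about how much information the vector actually carries.

```python
dims = 4096            # assumption: "4k vector" means 4096 dimensions
bytes_per_value = 4    # assumption: each value stored as a 32-bit float

vector_bytes = dims * bytes_per_value    # raw size of one vector in bytes
words = 5000                             # word count quoted in the comment
bytes_per_word = vector_bytes / words    # implied storage budget per word
```

Under those assumptions `vector_bytes` is 16384, which matches the "16k bytes" figure, and the 5000-word estimate works out to a little over 3 bytes per word.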

So - two representations of the same thing. What humans do that LLMs can't do right now is squeeze those 5000 words into something tiny. For example, the word "LLM" is a huge concept, squeezed into 3 letters. Human knowledge and thought seem to be based on that one trick - naming abstract concepts, and then using them as building blocks for more abstract concepts. LLMs meanwhile are stuck with their fixed-size vectors. They cannot add new concepts to their vocabulary by modifying their weights. Where LLMs seem to win is their short-term memory (of the order of 200K tokens), and they are about 1 million times faster (cycle time of the order of 1 nanosecond vs 1 millisecond), which gives their reasoning very different properties from human reasoning. Sometimes this means they are (dramatically) better, and sometimes they are worse.

I don't see how GRAM on its own is going to make LLMs 3 orders of magnitude faster than they are now. That 200k token context window is hideously expensive, and the cost of maintaining it grows as O(N^2). As you observed, they can already compress a 100,000 word book into the single token encoded in the last word (although beyond 100k words that compression starts to look increasingly lossy). To get the 3 orders of magnitude speed-up, they are going to have to start taking advantage of that compression and start throwing away the part of that 200k context they have already encoded. So far, no one has deployed something that does it well.
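A rough sketch of the O(N^2) point, counting only attention score pairs and ignoring everything else a model does (MLP layers, memory bandwidth, caching tricks). The function names are made up for illustration. Under that simplification, a 1000x saving needs the context to shrink by about the square root of 1000, roughly 32x.

```python
import math

def attention_pairs(n_tokens):
    """Query-key pairs scored by full self-attention over n_tokens."""
    return n_tokens * n_tokens

def speedup(n_full, n_compressed):
    """How many times fewer pairs the compressed context needs."""
    return attention_pairs(n_full) / attention_pairs(n_compressed)

full_context = 200_000
shrink_factor = math.sqrt(1000)                   # about 31.6
compressed_context = full_context / shrink_factor # about 6,300 tokens

tenfold = speedup(full_context, 20_000)           # 10x shorter context
thousandfold = speedup(full_context, compressed_context)
```

Here `tenfold` is 100.0 (a 10x shorter context gives 100x fewer pairs) and `thousandfold` is 1000 up to floating-point rounding, which is why the speed-up hinges on throwing context away rather than on skipping the output layer alone.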



