But the probability vector is the output of the LLM, no?

antonvs · 2026-05-17T08:11:51 1779005511

> But the probability vector is the output of the LLM, no?

In some contexts yes, but that's not actually the policy. As I wrote in my other comment (quoting because I think it's worth highlighting):

> "the policy is a function that, given some context, assigns probabilities to possible next tokens."

In the same sentence, I also incorrectly referred to this as a "probability distribution", but that's not accurate: it's a function that produces a probability distribution. The policy instantiated at a specific context produces a probability distribution.

In fact, you'd be closer to the mark if you called the policy "the model", but the two terms emphasize different aspects - as I said, "policy" views it from an RL perspective. From that perspective, the policy is a function, the model is an implementation of that function.

Besides, "output of the LLM" is ambiguous. It commonly means the actual generated token(s) (or text), not the probability distribution. Depending on context, "output of the LLM" could refer to (1) logits, (2) the probability distribution, (3) a single selected token, (4) the full generated text.
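To make the distinction concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `VOCAB`, `toy_logits`, and the scoring rule are made up and stand in for a real model's forward pass. It only shows where the four candidate meanings of "output" sit relative to the policy, which is the function itself rather than any one of its results.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "."]  # hypothetical toy vocabulary


def toy_logits(context):
    """(1) Logits: stand-in for a forward pass, one raw score per vocabulary entry."""
    # Arbitrary deterministic scores derived from the context length (illustration only).
    return [float((len(context) + i) % len(VOCAB)) for i in range(len(VOCAB))]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(context):
    """The policy: a function from a context to (2) a probability distribution."""
    return softmax(toy_logits(context))


def sample_token(context, rng):
    """(3) A single selected token, drawn from the distribution the policy returns."""
    return rng.choices(VOCAB, weights=policy(context), k=1)[0]


def generate(context, steps, rng):
    """(4) The full generated text: repeated sampling, feeding each token back in."""
    tokens = list(context)
    for _ in range(steps):
        tokens.append(sample_token(tokens, rng))
    return " ".join(tokens)
```

Calling `policy(["the"])` and `policy(["the", "cat"])` gives two different distributions from the same function, which is the sense in which the policy "instantiated at a specific context" produces a distribution.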

"Policy" has no such ambiguity - it has a precise definition. That's why technical subjects rely on jargon in the first place, but it results in the exact issue we're running into here: "Jargon enables quick and precise communication among insiders, but it is usually confusing or unintelligible to outsiders."

greesil · 2026-05-17T11:38:13 1779017893

Yes, I understand one function of jargon, which can be useful to insiders in that it conveys a precise meaning. But it can be confusing to outsiders, and that is also a useful thing for insiders. In the context of LLMs, what other function can produce p(next token) if not the LLM? And you just about make my point for me with regard to jargon being confusing by misidentifying what the policy actually is (something I never would have noticed :) In any case, it's an interesting paper. Thanks for all your downvotes, everyone.

airstrike · 2026-05-17T12:20:48 1779020448

The LLM is the whole car and policy is a specific part.

antonvs · 2026-05-17T19:05:13 1779044713

> In the context of LLMs, what other function can produce p(next token) if not the LLM?

You're thinking about it from a specific implementation-oriented perspective. Policy is a well-defined theoretical concept that generalizes beyond LLMs - as we've discussed, it comes from RL. If one is discussing the use of RL techniques on LLMs, it can make sense to use well-defined RL terminology.

Here's a definition from Sutton & Barto's RL intro (https://web.stanford.edu/class/psych209/Readings/SuttonBarto...):

> "At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent’s policy and is denoted π_t, where π_t(a|s) is the probability that A_t = a if S_t = s. Reinforcement learning methods specify how the agent changes its policy as a result of its experience. The agent’s goal, roughly speaking, is to maximize the total amount of reward it receives over the long run."

If you apply this definition to an LLM, you find that the model itself becomes the implementation of a policy. But thinking about the policy purely in terms of what an LLM happens to do to implement it is not necessarily a good idea for a researcher.

As Sutton & Barto go on to write:

> "This framework is abstract and flexible and can be applied to many different problems in many different ways. For example, the time steps need not refer to fixed intervals of real time; they can refer to arbitrary successive stages of decision-making and acting. The actions can be low-level controls, such as the voltages applied to the motors of a robot arm, or high-level decisions, such as whether or not to have lunch or to go to graduate school. Similarly, the states can take a wide variety of forms. [...]"

Referring to this as a policy connects it to a much broader body of work that's highly relevant to the problem being studied.
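As a rough illustration of how general the quoted definition is, here is a small Python sketch. All names (`Policy`, `lunch_policy`, `next_token_policy`, `act`) and all probabilities are hypothetical, chosen only to mirror the π(a|s) notation: a policy maps a state to probabilities over actions, and the same sampling code works whether the actions are lunch decisions or next tokens.

```python
import random
from typing import Callable, Dict, Hashable

# A policy maps a state s to {action a: probability pi(a|s)}.
Policy = Callable[[Hashable], Dict[str, float]]


def lunch_policy(state):
    """A high-level decision policy, echoing the lunch example in the quote."""
    if state == "hungry":
        return {"have_lunch": 0.9, "keep_working": 0.1}
    return {"have_lunch": 0.2, "keep_working": 0.8}


def next_token_policy(state):
    """A toy next-token policy: the state is the tuple of tokens so far."""
    if state and state[-1] == "the":
        return {"cat": 0.7, "dog": 0.3}
    return {"the": 1.0}


def act(policy: Policy, state, rng: random.Random) -> str:
    """Sample an action from pi(.|state); works for any policy of this shape."""
    dist = policy(state)
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

Here `act` never needs to know which kind of policy it was given. In this framing an LLM would be one more implementation of `Policy`, with a neural network in place of the hand-written table.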

---

> And, you just about make my point for me with regards to jargon being confusing by misidentifying what the policy actually is (something i never would have noticed :)

It's quite the opposite. The jargon exists to make things precise, so that it becomes easier to identify when some nuance has been accidentally dropped, as in this case. It's bad faith to claim that a mistake in my attempt to simplify things for you proves your point.

> But, it can be confusing to outsiders, and that is also a useful thing for insiders.

You should be careful that you're not using anti-intellectual conspiracy theorizing to justify your refusal to try to understand the purpose of terminology you happen to be unfamiliar with.

greesil · 2026-05-18T03:29:26 1779074966

But my asking questions is in fact me trying to understand, is it not? I ask a stupid simple question with a slightly rude tone, and then I get downvoted by a bunch of pedantic insiders. Although to be fair, it appears some are trying to help.

Look dude, every field develops its own terminology. It's not a conspiracy, just an emergent property. But it always makes getting into the field, or understanding what's in the field, much harder than it needs to be.