Can someone share some intuition about the tradeoffs between Monte Carlo tree search and vanilla policy gradient reinforcement learning?
MCTS has gotten really popular since AlphaZero, but it's not clear to me how it compares to simpler reinforcement learning techniques that just have a softmax output over the possible moves the agent can make. My intuition is that MCTS is better for planning, but takes longer to train/evaluate. Is that true? Are there some games where one will work better than the other?
In vanilla policy gradient, one plays the game to the end and then bumps the probability of every action taken by the agent up (if AlphaGo won) or down (if it lost). This is very slow because there are ~150 moves in an expert game, and we do not know which moves caused the decisive win or loss - i.e. the problem of "long-term credit assignment". Also, it is actually preferable to compute the policy gradient with respect to the advantage of every action, so we encourage actions that were better than average and punish actions that were worse than average - otherwise the policy gradient estimator has high variance. A minimal sketch of that update is below.
I think about MCTS in the following way: suppose you have a perfect "simulator" for some reinforcement learning task you are trying to accomplish (e.g. a real-world robot grasping a cup). Then instead of trying to grasp the cup over and over again in the real world, you can just try/"plan" in simulation until you arrive at a motion plan that picks up the cup.
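Here is a rough sketch of that vanilla policy-gradient update with an advantage baseline. The names (`policy_net`, `value_net`, the trajectory format) are my own placeholders, not anything specific to AlphaGo:

```python
import torch.nn.functional as F

def reinforce_with_advantage(policy_net, value_net, optimizer, trajectory, gamma=1.0):
    """One vanilla policy-gradient update over a finished game.

    trajectory: list of (state, action, reward) tuples; in a Go-like game
    the reward is 0 everywhere except +1/-1 on the final move.
    """
    # Compute the (discounted) return from each step to the end of the game.
    returns, G = [], 0.0
    for _, _, r in reversed(trajectory):
        G = r + gamma * G
        returns.insert(0, G)

    loss = 0.0
    for (state, action, _), G in zip(trajectory, returns):
        logits = policy_net(state)                 # softmax output over legal moves
        log_prob = F.log_softmax(logits, dim=-1)[action]
        advantage = G - value_net(state).detach()  # "better/worse than average" baseline
        loss = loss - log_prob * advantage         # push prob up if advantage > 0, down otherwise

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that every action in the game gets the same return signal, which is exactly the credit-assignment problem described above; the value baseline only reduces the variance of the estimator.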
MCTS is exactly a "planning" module, and it works so well in Go because the simulator fidelity is perfect. AlphaGo can't model the adversary's behavior perfectly, but MCTS and the policy network complement each other because the policy reduces the search space of MCTS. As long as the best adversary is not far from the space that MCTS + policy is able to consider, AlphaGo can match or beat the adversary. Then, we train the value network to amortize the computation of the MCTS operator (via the Bellman equation). Finally, self-play is an elegant solution for keeping the adversary + policy close to each other.
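To make the "amortize the search" point concrete, here is a sketch of AlphaZero-style training targets: the policy head is regressed toward the MCTS visit distribution and the value head toward the eventual game outcome. The network/optimizer names and the batch layout are assumptions for illustration:

```python
import torch.nn.functional as F

def alphazero_style_update(net, optimizer, batch):
    """batch = (states, mcts_policies, outcomes) collected from self-play, where
    mcts_policies are the normalized root visit counts produced by search and
    outcomes are +1/-1 from the perspective of the player to move."""
    states, mcts_policies, outcomes = batch
    policy_logits, values = net(states)

    # Value head learns to predict the final result, amortizing the search.
    value_loss = F.mse_loss(values.squeeze(-1), outcomes)
    # Policy head learns to imitate the (stronger) MCTS move distribution.
    policy_loss = -(mcts_policies * F.log_softmax(policy_logits, dim=-1)).sum(dim=-1).mean()

    loss = value_loss + policy_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```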
Thanks for the response! I'm familiar with how AlphaZero works; I'm primarily curious about how performance/speed compare in situations with perfect information where it is possible to simulate perfectly (Pong/Go/chess/etc.).
I did not realize that MCTS helps with the credit assignment problem, that's really interesting!
Thank you very much for the clear explanation of MCTS vs. vanilla policy gradient.
>In vanilla policy gradient, one plays the game to the end and then bumps the probability of every action taken by the agent up (if AlphaGo won) or down (if it lost). This is very slow because there are ~150 moves in an expert game, and we do not know which moves caused the decisive win or loss - i.e. the problem of "long-term credit assignment".
>I think about MCTS in the following way: suppose you have a perfect "simulator" for some reinforcement learning task you are trying to accomplish (e.g. a real-world robot grasping a cup). Then instead of trying to grasp the cup over and over again in the real world, you can just try/"plan" in simulation until you arrive at a motion plan that picks up the cup.
The MCTS variant used by AZ offers a big advantage in the exploration/exploitation tradeoff, with a better exploration profile during training and better exploitation during competitive play.
The training phase emphasizes exploration, using a modified upper-confidence-bound-for-trees (UCT) criterion to limit the expansion of the search tree. With a uniform prior over the action space (as in standard MCTS), you will explore a large number of unlikely actions. The prior provided by the NN lets them bootstrap the value of leaf nodes _and_ efficiently sample the actions at each state. AZ has also eliminated the simulation (rollout) phase of MCTS entirely.
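For concreteness, the prior-weighted selection rule AZ uses (often called PUCT) looks roughly like this. The node structure (`visit_count`, `value_sum`, `prior`, `children`) and the `c_puct` constant are just illustrative:

```python
import math

def select_child(node, c_puct=1.5):
    """Pick the child maximizing Q + U, where the NN prior P biases exploration
    toward moves the policy network already considers plausible."""
    total_visits = sum(child.visit_count for child in node.children.values())
    best_action, best_score = None, -float("inf")
    for action, child in node.children.items():
        q = child.value_sum / child.visit_count if child.visit_count > 0 else 0.0
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        if q + u > best_score:
            best_action, best_score = action, q + u
    return best_action
```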
AZ uses different criteria for move selection during training and during competitive play. The competitive selection criterion basically replaces the UCT criterion with a standard argmax over the actions (like a normal RL policy selection). We know that tree pruning can be very effective during search _if_ you can order the nodes in a beneficial way. AZ's MCTS uses the output of the NN as the prior instead of a uniform prior, which acts like a "virtual" pruning (moves with low probability under the prior are unlikely to be explored). This tends to focus the search on strong lines of play more quickly, so the algorithm produces strong choices with less search (AZ uses about an order of magnitude fewer expansions than other state-of-the-art MCTS engines).
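The two move-selection modes at the root, after search has finished, might look something like this; treating competitive play as the temperature-zero case (argmax over root visit counts) is my reading of the paper, and the node/temperature names are placeholders:

```python
import numpy as np

def choose_move(root, temperature=1.0):
    """Training: sample proportional to visit counts (exploration).
    Competitive play: temperature -> 0, i.e. argmax over visit counts."""
    actions = list(root.children.keys())
    visits = np.array([root.children[a].visit_count for a in actions], dtype=np.float64)
    if temperature == 0.0:
        return actions[int(np.argmax(visits))]
    probs = visits ** (1.0 / temperature)
    probs /= probs.sum()
    return actions[np.random.choice(len(actions), p=probs)]
```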
MCTS has always been popular and is probably the best online/out-of-the-box MDP technique. But fundamentally it's different from RL methods. In MCTS you assume you have a perfect simulator of the MDP, which is rarely true outside of games. MCTS also doesn't really have a learning phase.
> My intuition is that MCTS is better for planning, but takes longer to train/evaluate.
There is no training phase in MCTS. Rollouts can take a while, so it's important to have a fast simulator and rollout policy (e.g. random or UCT).
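For reference, a bare-bones random rollout of the kind classic MCTS uses for leaf evaluation might look like this; the `legal_moves`/`step`/`is_terminal`/`reward` simulator interface is hypothetical:

```python
import random

def random_rollout(state, simulator, max_depth=500):
    """Play random moves from `state` to the end of the game and return the
    terminal reward; this is the fast-but-noisy leaf evaluation that
    AlphaZero replaces with its value network."""
    depth = 0
    while not simulator.is_terminal(state) and depth < max_depth:
        action = random.choice(simulator.legal_moves(state))
        state = simulator.step(state, action)
        depth += 1
    return simulator.reward(state)
```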
> Are there some games where one will work better than the other?
Sorry, I was referring specifically to AlphaZero's approach, in which the policies that guide the MCTS are trained. And yes, I'm assuming there is perfect information and the game can be simulated perfectly. Thanks for the response!
It has been a while since I read the paper, but I believe they have some results where they ran AlphaZero without MCTS, using only the policy network. My recollection is that it still performed pretty well, but it was clearly outperformed by the version with MCTS.