Can someone share some intuition about the tradeoffs between Monte Carlo tree search and vanilla policy gradient reinforcement learning?
MCTS has gotten really popular since AlphaZero, but it's not clear to me how it compares to simpler reinforcement learning techniques that just have a softmax output over the possible moves the agent can make. My intuition is that MCTS is better for planning, but takes longer to train/evaluate. Is that true? Are there some games where one will work better than the other?
In vanilla policy gradient, one plays the game to the end and then bumps the probability of every action taken by the agent up (if AlphaGo won) or down (if it lost). This is very slow because there are ~150 moves in an expert game, and we do not know which moves caused the decisive win or loss - i.e. the problem of "long-term credit assignment". Also, it is actually preferable to compute the policy gradient with respect to the advantage of every action, so we encourage actions that were better than average and punish actions that were worse than average - otherwise the policy gradient estimator has high variance. A minimal sketch of that update is below.
I think about MCTS in the following way: suppose you have a perfect "simulator" for some reinforcement learning task you are trying to accomplish (e.g. a real-world robot grasping a cup). Then instead of trying to grasp the cup over and over again in the real world, you can just try/"plan" in simulation until you arrive at a motion plan that picks up the cup.
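Here is a rough sketch of that vanilla policy-gradient update with an advantage baseline. The names (`policy_net`, `value_net`, the trajectory format) are my own placeholders, not anything specific to AlphaGo:

```python
import torch.nn.functional as F

def reinforce_with_advantage(policy_net, value_net, optimizer, trajectory, gamma=1.0):
    """One vanilla policy-gradient update over a finished game.

    trajectory: list of (state, action, reward) tuples; in a Go-like game
    the reward is 0 everywhere except +1/-1 on the final move.
    """
    # Compute the (discounted) return from each step to the end of the game.
    returns, G = [], 0.0
    for _, _, r in reversed(trajectory):
        G = r + gamma * G
        returns.insert(0, G)

    loss = 0.0
    for (state, action, _), G in zip(trajectory, returns):
        logits = policy_net(state)                 # softmax output over legal moves
        log_prob = F.log_softmax(logits, dim=-1)[action]
        advantage = G - value_net(state).detach()  # "better/worse than average" baseline
        loss = loss - log_prob * advantage         # push prob up if advantage > 0, down otherwise

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that every action in the game gets the same return signal, which is exactly the credit-assignment problem described above; the value baseline only reduces the variance of the estimator.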
MCTS is exactly a "planning" module, and it works so well in Go because the simulator fidelity is perfect. AlphaGo can't model the adversary's behavior perfectly, but MCTS and the policy network complement each other because the policy reduces the search space of MCTS. As long as the best adversary is not far from the space that MCTS + policy is able to consider, AlphaGo can match or beat the adversary. Then, we train the value network to amortize the computation of the MCTS operator (via the Bellman equation). Finally, self-play is an elegant solution for keeping the adversary + policy close to each other.
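To make the "amortize the search" point concrete, here is a sketch of AlphaZero-style training targets: the policy head is regressed toward the MCTS visit distribution and the value head toward the eventual game outcome. The network/optimizer names and the batch layout are assumptions for illustration:

```python
import torch.nn.functional as F

def alphazero_style_update(net, optimizer, batch):
    """batch = (states, mcts_policies, outcomes) collected from self-play, where
    mcts_policies are the normalized root visit counts produced by search and
    outcomes are +1/-1 from the perspective of the player to move."""
    states, mcts_policies, outcomes = batch
    policy_logits, values = net(states)

    # Value head learns to predict the final result, amortizing the search.
    value_loss = F.mse_loss(values.squeeze(-1), outcomes)
    # Policy head learns to imitate the (stronger) MCTS move distribution.
    policy_loss = -(mcts_policies * F.log_softmax(policy_logits, dim=-1)).sum(dim=-1).mean()

    loss = value_loss + policy_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```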
Thanks for the response! I'm familiar with how AlphaZero works; I'm primarily curious about how performance/speed compare in situations with perfect information where it is possible to simulate perfectly (Pong/Go/chess/etc.).
I did not realize that MCTS helps with the credit assignment problem, that's really interesting!
Thank you very much for the clear explanation of MCTS vs. vanilla policy gradient.
>In vanilla policy gradient, one plays the game to the end and then bumps the probability of every action taken by the agent up (if AlphaGo won) or down (if it lost). This is very slow because there are ~150 moves in an expert game, and we do not know which moves caused the decisive win or loss - i.e. the problem of "long-term credit assignment".
>I think about MCTS in the following way: suppose you have a perfect "simulator" for some reinforcement learning task you are trying to accomplish (e.g. a real-world robot grasping a cup). Then instead of trying to grasp the cup over and over again in the real world, you can just try/"plan" in simulation until you arrive at a motion plan that picks up the cup.
The MCTS variant used by AZ offers a big advantage in the exploration/exploitation tradeoff, with a better exploration profile during training and better exploitation during competitive play.
The training phase emphasizes exploration, using a modified upper-confidence-bound-for-trees (UCT) criterion to limit the expansion of the search tree. With a uniform prior over the action space (as in standard MCTS), you will explore a large number of unlikely actions. The prior provided by the NN lets them bootstrap the value of leaf nodes _and_ efficiently sample the actions at each state. AZ has also eliminated the simulation (rollout) phase of MCTS entirely.
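For concreteness, the prior-weighted selection rule AZ uses (often called PUCT) looks roughly like this. The node structure (`visit_count`, `value_sum`, `prior`, `children`) and the `c_puct` constant are just illustrative:

```python
import math

def select_child(node, c_puct=1.5):
    """Pick the child maximizing Q + U, where the NN prior P biases exploration
    toward moves the policy network already considers plausible."""
    total_visits = sum(child.visit_count for child in node.children.values())
    best_action, best_score = None, -float("inf")
    for action, child in node.children.items():
        q = child.value_sum / child.visit_count if child.visit_count > 0 else 0.0
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        if q + u > best_score:
            best_action, best_score = action, q + u
    return best_action
```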
AZ uses different criteria for move selection during training and during competitive play. The competitive selection criterion basically replaces the UCT criterion with a standard argmax over the actions (like a normal RL policy selection). We know that tree pruning can be very effective during search _if_ you can order the nodes in a beneficial way. AZ's MCTS uses the output of the NN as the prior instead of a uniform prior, which acts like a "virtual" pruning (moves with low probability under the prior are unlikely to be explored). This tends to focus the search on strong lines of play more quickly, so the algorithm produces strong choices with less search (AZ uses about an order of magnitude fewer expansions than other state-of-the-art MCTS engines).
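The two move-selection modes at the root, after search has finished, might look something like this; treating competitive play as the temperature-zero case (argmax over root visit counts) is my reading of the paper, and the node/temperature names are placeholders:

```python
import numpy as np

def choose_move(root, temperature=1.0):
    """Training: sample proportional to visit counts (exploration).
    Competitive play: temperature -> 0, i.e. argmax over visit counts."""
    actions = list(root.children.keys())
    visits = np.array([root.children[a].visit_count for a in actions], dtype=np.float64)
    if temperature == 0.0:
        return actions[int(np.argmax(visits))]
    probs = visits ** (1.0 / temperature)
    probs /= probs.sum()
    return actions[np.random.choice(len(actions), p=probs)]
```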
MCTS has always been popular and is probably the best online/out-of-the-box MDP technique. But fundamentally it's different from RL methods. In MCTS you assume you have a perfect simulator of the MDP, which is rarely true outside of games. MCTS also doesn't really have a learning phase.
> My intuition is that MCTS is better for planning, but takes longer to train/evaluate.
There is no training phase in MCTS. Rollouts can take a while, so it's important to have a fast simulator and rollout policy (e.g. random or UCT).
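For reference, a bare-bones random rollout of the kind classic MCTS uses for leaf evaluation might look like this; the `legal_moves`/`step`/`is_terminal`/`reward` simulator interface is hypothetical:

```python
import random

def random_rollout(state, simulator, max_depth=500):
    """Play random moves from `state` to the end of the game and return the
    terminal reward; this is the fast-but-noisy leaf evaluation that
    AlphaZero replaces with its value network."""
    depth = 0
    while not simulator.is_terminal(state) and depth < max_depth:
        action = random.choice(simulator.legal_moves(state))
        state = simulator.step(state, action)
        depth += 1
    return simulator.reward(state)
```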
> Are there some games where one will work better than the other?
Sorry, I was referring specifically to AlphaZero's approach, in which the policies that guide the MCTS are trained. And yes, I'm assuming there is perfect information and the game can be simulated perfectly. Thanks for the response!
It has been a while since I read the paper, but I believe they have some results where they ran AlphaZero without MCTS, using only the policy network. My recollection is that it still performed pretty well, but it was clearly outperformed by the version with MCTS.