monte carlo tree search

mcts   pseudo   expand   eg   improve   explore   rave   prior   nn  

monte carlo tree search

mcts history
  • introduced in 2006 by Rémi Coulom and others

  • revolutionized computer Go

  • repeat these steps until time runs out

    • select leaf

    • expand leaf and simulate result(s)

    • backpropagate result(s)

  • keys to success

    • expansion: select strong child quickly

    • simulate: predict winner accurately

mcts overview

there are many different versions of mcts; we discuss a simple one

an anytime algorithm

  • returns a meaningful result after any amount of time

  • unlike say depth-d minimax, which does not return a result until the search to depth d is complete

a (mostly) best-first algorithm

  • when expanding the search tree, it expands the most promising lines first

  • so mcts search is highly non-uniform: at any fixed level, some subtrees will be much larger than others (unlike, say, depth-d minimax, where at each level each subtree is the same size)

pseudo-code

  • while time remains:

    • select leaf

    • expand leaf and simulate result(s)

    • backpropagate result(s)

  • return best move (eg. most visits, or best winrate)
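
a minimal sketch of this loop in python (select_leaf, expand_and_simulate, and the node fields here are placeholder names, not from any particular implementation; expand_and_simulate is assumed to backpropagate its own results, as in the expansion step described later):

import time

def mcts(root, time_limit):
    # basic mcts loop: select, expand + simulate, backpropagate, until time runs out
    deadline = time.time() + time_limit
    while time.time() < deadline:
        leaf = select_leaf(root)     # select leaf
        expand_and_simulate(leaf)    # expand leaf, simulate and backpropagate result(s)
    # return best move: here, the root child with the most visits
    return max(root.children, key=lambda child: child.sims).move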

mcts expansion

  • create all children

  • for each child:

    • if child_is_win: backpropagate(child, win)

    • else: simulate, backpropagate(child, sim_result)

(some mcts implementations use other expansion methods)
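
a sketch of this expansion step (make_child, child_is_win, simulate, backpropagate, and WIN are placeholder names):

def expand_and_simulate(leaf):
    # create all children of the leaf
    leaf.children = [leaf.make_child(move) for move in leaf.moves]
    for child in leaf.children:
        if child_is_win(child):                     # immediate win: no simulation needed
            backpropagate(child, WIN)
        else:                                       # otherwise simulate and backpropagate
            backpropagate(child, simulate(child))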

mcts improvement: balance exploitation/exploration

balance exploitation/exploration
  • exploitation: best-first search, always expand best move

  • exploration: sometimes expand not-best move

  • when?

  • use UCB formula (Kocsis-Szepesvari)

  • MCTS + UCB = UCT

  • mcts simulations have randomness

  • in leaf selection, how do we account for variance in win rates? eg.

    • move j: 52 wins, 100 visits

    • move k: 529 wins, 1000 visits

    • are we sure that move k is better than move j?

  • similar to the multi-armed bandit problem

  • recall: ratio selection function f(w,v) = (w + T) / (v + 2T)

  • one approach: balance exploitation (pick child with best win rate) with exploration (pick lesser visited child)

  • Kocsis/Szepesvari UCB1 priority formula

    • pick move j that maximizes f(w_j, v_j) + c sqrt( ln V / v_j ), where V is the parent's total visit count

    • for hex and go, good performance with c small, eg. 0.001

ucb1 priority values from simple/mcts/ucb1.py
   win rate additive factor T:  5.0
ucb multiplicative constant c:  0.5

each entry below is ucb1(c, w*j, v*j, V*j), ie. the column's (w,v,V) counts scaled by the row's j

w,v,V  0,1,2  1,3,6  9,20,40  100,200,400
 j
   1   .871   .973   .681       .587
   2   .833   .894   .625       .565
   4   .718   .815   .581       .548
   8   .572   .746   .548       .536
  16   .425   .689   .523       .526
  32   .299   .643   .504       .519
  64   .205   .608   .490       .514
 128   .140   .581   .479       .510
 256   .097   .560   .471       .508
 512   .068   .544   .466       .505
1024   .048   .533   .461       .504
2048   .034   .524   .458       .503

some examples, T = 5.0

 c    w   v    V  ucb(c,w,v,V)
0.0  50 100 1000 .500
0.2 200 400 1000 .526
0.2 100 200 1000 .537
0.2  50 100 1000 .553
0.2  25  50 1000 .574
0.2   1   3 1000 .765
0.2   0   3 1000 .688
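
a sketch of a ucb1 function consistent with the two tables above (the actual simple/mcts/ucb1.py may differ in details):

from math import log, sqrt

def ucb1(c, w, v, V, T=5.0):
    # priority of a child: smoothed win rate plus exploration term
    # w = child wins, v = child visits, V = total (parent) visits
    # T = win rate additive factor, c = ucb multiplicative constant
    win_rate = (w + T) / (v + 2 * T)     # f(w,v) = (w + T) / (v + 2T)
    explore = c * sqrt(log(V) / v)       # Kocsis/Szepesvari exploration term
    return win_rate + explore

print(round(ucb1(0.2, 50, 100, 1000), 3))   # reproduces a row of the table above: .553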

from github repo /mcts/

# imports needed by the snippet below
import random
from math import log, sqrt

def traverse_and_expand(self, node: TreeNode):
    """Traverse tree, select node to simulate.

    args:    node (TreeNode)   root of tree
    returns: node (TreeNode)   move from which to simulate
    """
    while not node.is_leaf:
        node = self.best_uct(node)
    if len(node.moves) > 0 and node.sims > 0:
        node.expand_node()
        node = random.choice(node.children)
    return node

def best_uct(self, node: TreeNode) -> TreeNode:
    """Return the next move for traversal (leaf selection).

    node is root? return node
    node has unsimulated child? return child
    otw? return child with best uct score

    args:    node (TreeNode)
    returns: TreeNode   child to traverse
    """
    best_uct, best_child = None, None
    for child in node.children:
        if child.sims == 0:  # unexplored children have priority
            return child
        # calculate UCT, update if best
        mean_res = child.results / child.sims
        uct = mean_res + self.c * sqrt(log(self.root_node.sims) / child.sims)
        if best_uct is None or uct > best_uct:
            best_uct, best_child = uct, child
    return best_child
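
the snippet above does not show backpropagation; one common scheme, sketched here (not from the repo: it assumes 0/1 simulation results, a node.parent field, and that node.results counts wins for the player who moved into the node):

def backpropagate(self, node, result):
    # send a simulation result back up the tree, one visit per node on the path
    while node is not None:
        node.sims += 1
        node.results += result   # credit the result to the player who moved into node
        result = 1 - result      # flip perspective for the parent
        node = node.parent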

trace simple mcts

improving mcts

  • in his Sep 2016 talk to the British Royal Arts Society, Demis Hassabis stressed the shift from hand-crafted AI to machine-learned AI. we can see that trend here, and we will see it more in the architecture of AlphaGo.

  • problem   purely random picks (post-expansion child selection; simulation moves)

    • suggestion   simulations: biased (hand-crafted or learned) move selection

    • suggestion   post-expansion: use prior knowledge

  • problem   random simulations are noisy

    • suggestion   add exploration factor to child scores

  • problem   tree_leaf_selection_policy: strong/winning moves not found quickly

    • suggestion   fast heuristic/win-check

  • problem   tree_leaf_selection_policy: true wins/losses repeatedly searched

    • suggestion   back up true win/loss values, prune accordingly

  • problem   mcts uses tree, but (for go, hex, ttt, …) search space is dag

    • suggestion   treat search space as a dag (how ?)

  • problem   mcts is best-first, true state value is minimax

    • suggestion   add minimax behaviour to mcts (how ?)

big gains
  • simulations   add patterns (go: save-miai, hex: save-bridge)

  • simulations   collect all-moves-as-first data, use data in tree_leaf_select_policy

  • tree_leaf_select_policy   use bandit-arm statistics to bias selection towards children with low visits

  • tree_leaf_select_policy   prior knowledge: after expand, pick child based on prior knowledge (not random)

lesser gains
  • tree   use transposition table (hard), prune isomorphic states

  • tree   back up true-win/true-loss info (a sketch follows this list)

    • at node v, if move x yields true ptm-loss, prune x; if all moves loss-pruned, v is true ptm-loss, report that move from parent(v) to v is true ptm-win

    • at node v, if move x yields true ptm-win, report to parent(v) that move from parent(v) to v is true ptm-loss

  • go, hex: prior patterns + sim'n patterns + RAVE yields strong intermediate player (strength of python go program michi: about 6 kyu on 4 threads)
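
a sketch of the true-win/true-loss backup described above, assuming each node carries a proven field (WIN, LOSS, or None) from the perspective of that node's player to move (ptm), plus a parent field (both hypothetical):

WIN, LOSS = "win", "loss"

def backup_proven(node):
    # a child with proven == LOSS means its ptm (our opponent) loses there,
    # so the move into that child is a true win for node's ptm
    if any(child.proven == LOSS for child in node.children):
        node.proven = WIN
    # if every move hands the opponent a true win, node's ptm has a true loss
    elif node.children and all(child.proven == WIN for child in node.children):
        node.proven = LOSS
    # children with proven == WIN can be pruned from future leaf selection
    if node.proven is not None and node.parent is not None:
        backup_proven(node.parent)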

save-miai, save-bridge simulation patterns
  @   last simulation o-move
  !   immediate simulation x-response

go: save-miai    hex: save-bridge

     x @              @ x
     ! x             x !

mcts improvement: rave

Rapid Action Value Estimate
  • MCTS RAVE by Sylvain Gelly (MoGo team) and David Silver (then a UAlberta PhD student)

  • in uni-rand playouts, a playout can be seen as a sequence of uni-random moves, or as a uni-random split of the remaining cells into black/white subsets

  • in the latter case, you can use info from simulations when exploring similar positions

black uni-rand win probability from this hex position?
   * * *     .
  * o - o     .        a
 o * - - o     .      b c
  o - - - o     .    d e f
     * * *       .

answer
- there are (6 choose 3) = 20 3-subsets of { a b c d e f }
- of these 3-subsets, all except { d b a } { d b c } { d e c } { d e f } win
- so black win probability = 16 / 20 = .8
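
a quick check of this count (the four losing 3-subsets are the ones listed in the answer above):

from itertools import combinations

losing = [set("abd"), set("bcd"), set("cde"), set("def")]
subsets = [set(s) for s in combinations("abcdef", 3)]
wins = sum(1 for s in subsets if s not in losing)
print(len(subsets), wins, wins / len(subsets))   # 20 16 0.8
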
application: RAVE, an all-moves-as-first improvement to mcts
  • because uni-rand-playout equivalent to uni-rand-subset selection,

    • can think of playout as subset,

    • can gather data: which cells positively correlate with wins?

    • use this learned knowledge in the future selection process, anywhere on the leaf-to-root path
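
a sketch of how these all-moves-as-first (rave) statistics are typically blended into leaf selection (field names are hypothetical; the beta schedule is one common choice, in the spirit of Gelly/Silver, with equivalence parameter k):

from math import sqrt

def rave_value(child, k=1000):
    # blend the child's monte carlo win rate with its all-moves-as-first win rate;
    # rave dominates while the child has few visits, mc dominates later
    mc = child.mc_wins / child.mc_sims if child.mc_sims > 0 else 0.5
    amaf = child.rave_wins / child.rave_sims if child.rave_sims > 0 else 0.5
    beta = sqrt(k / (3 * child.mc_sims + k))
    return (1 - beta) * mc + beta * amaf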

RAVE effective?
  • when simulations have significant randomness

  • when child-selection policy not strong

  • AlphaGo has a strong child-selection policy learned by neural nets, so it does not use RAVE

mcts improvement: prior knowledge

  • after mcts expand, node's children are new to the search

  • instead of picking a random child, use a prior-knowledge (ie. obtained before the search) heuristic to select the most promising child
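
one way to do this is to seed each new child with virtual wins and visits from a heuristic, so early selection is biased rather than random; a sketch (prior_score and the field names are hypothetical):

def apply_prior(node, n_prior=10):
    # after expanding node, initialize each child with virtual results
    # proportional to a prior heuristic score in [0, 1]
    for child in node.children:
        p = prior_score(child.move, node.state)   # hand-crafted or learned heuristic
        child.sims += n_prior
        child.results += p * n_prior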

mcts improvement: neural nets

  • use neural net (policy net) and win rate for selection

  • use neural net (policy net) for prior knowledge

  • use neural net (value net) and mcts for move selection
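
a sketch of the kind of selection score this leads to: combine the observed win rate with a policy-net prior, scaled so the prior matters most while the child has few visits (a PUCT-style rule as in AlphaGo-like programs; constants and field names are illustrative):

from math import sqrt

def puct_score(child, parent_visits, c_puct=1.0):
    # q: observed win rate; u: policy-net prior, decaying as the child gains visits
    q = child.results / child.sims if child.sims > 0 else 0.0
    u = c_puct * child.prior * sqrt(parent_visits) / (1 + child.sims)
    return q + u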

go/hex programs
  • alphago   explore, prior, no rave

  • pachi   explore, prior, rave

  • fuego   no explore, prior, rave

  • mohex   explore, prior, rave