monte carlo tree search


mcts overview

an anytime algorithm

  • returns a meaningful result after any amount of time

  • unlike say depth-d minimax, which does not return a result until the search to depth d is complete

a (mostly) best-first algorithm

  • when expanding the search tree, it expands the most promising lines first

  • so mcts search is highly non-uniform: at any fixed level, some subtrees will be much larger than others (unlike say depth-d minimax, where at each level each subtree is the same size)

pseudo-code

  • while time remains:

    • select leaf

    • expand leaf

    • simulate result

    • backpropagate result

  • return best move (eg. most visits, or best win rate)

mcts detailed pseudo-code

class Node:
  def __init__(self, m, p): # move is from parent to node
    self.move, self.parent, self.children = m, p, []
    self.wins, self.visits  = 0, 0

  def expand_node(self, state):
    if not terminal(state):
      for each non-isomorphic legal move m of state:
        nc = Node(m, self) # new child node
        self.children.append(nc)

  def update(self, r):
    self.visits += 1
    if r==win:
      self.wins += 1

  def is_leaf(self):
    return len(self.children)==0

  def has_parent(self):
    return self.parent is not None

def mcts(state):
  root_node  = Node(None, None)
  while time remains:
    n, s = root_node, copy.deepcopy(state)
    while not n.is_leaf():    # select leaf
      n = tree_policy_child(n)
      s.addmove(n.move)
    n.expand_node(s)          # expand (no children added if s is terminal)
    if not n.is_leaf():       # pick one new child, apply its move
      n = tree_policy_child(n)
      s.addmove(n.move)
    while not terminal(s):    # simulate
      s = simulation_policy_child(s)
    result = evaluate(s)
    n.update(result)          # propagate: leaf, then ancestors up to root
    while n.has_parent():
      n = n.parent
      n.update(result)
  return best_move(root_node)
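
best_move is not defined above; a minimal sketch, assuming the Node class above and the most-visits rule from the overview:

def best_move(root_node):
  # pick the root child with the most visits (best win rate is the other common rule)
  return max(root_node.children, key=lambda c: c.visits).move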

pure mcts

  • tree policy:   random pick of any best child

  • use ratio wins/visits to pick best child

  • but most leaves have 0 visits, so the ratio 0/0 is undefined …

  • solution: use this formula (sketched in code after this list)

    • replace ratio wins/visits with some function that gives unvisited nodes a score of .5

    • eg. pick t = 1, set f(w,v) = ( w + t ) / ( v + 2t )

    • at even depth, pick random node in { argmax( f(w,v) ) }

    • at odd depth, pick random node in { argmin( f(w,v) ) }

  • simulation policy:   each move in playout is uniform random, over all legal moves
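
a minimal sketch of this pure-mcts tree policy, assuming the Node class from the detailed pseudo-code, with wins counted for the root ptm:

import random

def f(w, v, t=1):
  return (w + t) / (v + 2*t)      # unvisited nodes (0/0) score 0.5

def node_depth(n):
  d = 0
  while n.parent is not None:
    n, d = n.parent, d + 1
  return d

def tree_policy_child(n):
  scores = [f(c.wins, c.visits) for c in n.children]
  # root ptm: argmax at even depth, opponent: argmin at odd depth
  target = max(scores) if node_depth(n) % 2 == 0 else min(scores)
  return random.choice([c for c, s in zip(n.children, scores) if s == target])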

hex 3x3: pure mcts example

consider this state, @ to move
       a b c
     1  * @ . @
      2  * . . @
       3  . . . @
           * * *

root_node: move parent children  wins visits
           None None   []        0    0

### iteration 1
tree:   root_node
select leaf: root_node
expand(root_node)
tree after expansion:
               root 0/0
       /     /    |    \      \     \
   a3 /  b2 /  b3 |  c1 \   c2 \  c3 \
    0|0    0|0   0|0    0|0    0|0   0|0

best child = random select argmax: say b2

simulate from root + @b2, say to   * @ @
                                    * * .
                                     * . @
result: loss for root ptm @

propagate
               0|0              =>              0|1
  0|0  0|0  0|0  0|0  0|0  0|0     0|0  0|1  0|0  0|0  0|0  0|0
        ^

### iteration 2
               root 0/0
       /     /    |    \      \     \
   a3 /  b2 /  b3 |  c1 \   c2 \  c3 \
    0|0   0|1    0|0    0|0    0|0   0|0

select leaf: say c3
expand(c3): new subtree
                 c3
       /     /    |    \      \
   a3 /  b2 /  b3 |  c1 \   c2 \
    0|0   0|0    0|0    0|0    0|0

best child = say a3
terminal state: loss for root ptm (so no expand)
propagate
                 0|2
            /   / | \   \    \
          /   /   |   \   \    \
      a3/  b2/  b3|  c1\  c2\  c3\
       /    /     |     \    \    \
     0|0   0|1   0|0   0|0   0|0   0|1
                                  //|\\
                                / / | \ \
                              /  /  |  \  \
                          a3/ b2/ b3| c1\ c2\
                          0|1 0|0  0|0  0|0 0|0
same example (omit leaves with 0 visits)
initialize:     root    (has 0 visits)

after iteration 1:   (*   node with simulation)

root 01
  |
b2*01

after iteration 2:

     root 02
   /        \
b2 01      c3 01
             |
           a3*01

after iteration 3:

     root 13
   /     |     \
b2 01  c3 01  b3 11
         |      |
       a3 01  c2*11

after iteration 4:

     root 14
   /     |      \
b2 01  c3 01    b3 12
         |      /    \
       a3 01  c2 11  a3*01

after iteration 5:

     root 15
   /     |     \
b2 01  c3 01    b3 13
         |      /    \
       a3 01  c2 11  a3*02

after iteration 6:

          root 26
   /     /     \           \
b2 01  c3 01    b3 13       a3 11
         |      /    \        |
       a3 01  c2 11  a3 02  c1*11

after iteration 7:

          root 27
   /     /     \           \
b2 01  c3 01    b3 13       a3 12
         |      /    \        |    \
       a3 01  c2 11  a3 02  c1 11  b2 01
                                     |
                                   c2*01

after iteration 8:

          root 38
   /     /     \           \
b2 01  c3 01    b3 13       a3 23
         |      /    \        |    \
       a3 01  c2 11  a3 02  c1 11  b2 12
                                   /   \
                               c2 01  b3 11
                                        |
                                      c1*11

after iteration 9:

             root 49
    /     /   \              \
   /     /     \              \
b2 01  c3 01  b3 13            a3 35
      /       /    \         /   |   \
   a3 01  c2 11  a3 02  c1 11  b2 12  c2 11
                               /      /    \
                             c2 01  b3 11  c1*11
                                      |
                                    c1 11

exercises (on the above example)

  • if search ends now, what move does white make?

  • if search continues, what is likely to happen on the next iteration?

  • assume search continues until 100 iterations

    • roughly, what will the tree look like then?

    • what move would white make?

  • for this state, what is white's best move? who wins? explain

  • based on this example, how could you improve pure mcts ?

improving mcts

  • in his Sep 2016 talk to the Royal Society of Arts, Demis Hassabis stressed the shift from hand-crafted AI to machine-learned AI. that trend is visible here, and we will see more of it in the architecture of AlphaGo.

  • problem   purely random pick (post-expansion child-select; simulation moves) ?

    • suggestion   simulations: biased (hand-crafted or learned) move selection

    • suggestion   post-expansion: use prior knowledge

  • problem   random simulations are noisy

    • suggestion   add exploration factor to child scores

  • problem   tree_leaf_selection_policy: strong/winning moves not found quickly

    • suggestion   fast heuristic / win-check

  • problem   tree_leaf_selection_policy: true wins/losses repeatedly searched

    • suggestion   back up true win/loss values, prune accordingly

  • problem   mcts uses tree, but (for go, hex, ttt, …) search space is dag

    • suggestion   treat search space as a dag (how ?)

  • problem   mcts is best-first, true state value is minimax

    • suggestion   add minimax behaviour to mcts (how ?)

big gains
  • simulations   add patterns (go: save-miai, hex: save-bridge)

  • simulations   collect all-moves-as-first data, use data in tree_leaf_select_policy

  • tree_leaf_select_policy   use bandit-arm statistics to bias selection towards children with low visits

  • tree_leaf_select_policy   prior knowledge: after expand, pick child based on prior knowledge (not random)

lesser gains
  • tree   use transposition table (hard), prune isomorphic states

  • tree   back up true-win/true-loss info (a code sketch follows this list)

    • at node v, if move x yields true ptm-loss, prune x; if all moves loss-pruned, v is true ptm-loss, report that move from parent(v) to v is true ptm-win

    • at node v, if move x yields true ptm-win, report to parent(v) that move from parent(v) to v is true ptm-loss

  • go, hex: prior patterns + sim'n patterns + RAVE yields strong intermediate player (strength of python go program michi: about 6 kyu on 4 threads)
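
a hedged sketch of the true-win/true-loss backup above, assuming the Node class from the pseudo-code extended with a field proven in { None, 'win', 'loss' }, stated for the player to move at each node:

def backup_proven(node):
  # call after a node's proven value changes (eg. a terminal child is found)
  while node is not None:
    if any(c.proven == 'loss' for c in node.children):
      node.proven = 'win'       # a move to a child lost for the opponent: ptm-win
    elif node.children and all(c.proven == 'win' for c in node.children):
      node.proven = 'loss'      # every move reaches a child won for the opponent: ptm-loss
    else:
      break                     # nothing newly proven here, stop propagating
    node = node.parent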

save-miai, save-bridge simulation patterns
  @       last simulation o-move
  !  immediate simulation x-response

go: save-miai    hex: save-bridge

     x @              @ x
     ! x             x !

mcts improvement: exploration
  • mcts simulations have randomness

  • so, in leaf selection, how to account for variance in win rates ?     eg.

    • move j: 52 wins, 100 visits

    • move k: 529 wins, 1000 visits

    • are we sure that move k is better than move j ?

  • similar to the multi-armed bandit problem

  • recall: ratio selection function f(w,v) = (w + t) / (v + 2t)

  • one approach: balance exploitation (pick child with best win rate) with exploration (pick lesser visited child)

    • exploitation: best first search

    • exploration: add some breadth to the search

  • Kocsis/Szepesvari UCB1 formula

    • pick move j that maximizes f( w_j, v_j ) + c sqrt( ln V / v_j ),  where V is the parent's visit count

    • for hex, go, good performance with c small, eg. .001
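
a minimal sketch of this UCB1-style child score, using the smoothed ratio f(w,v) from the pure-mcts slide; V is the parent's visit count, and the constants t, c are parameters (the sample data below uses t = 5, c = 0.5):

import math

def ucb1_score(w, v, V, t=1, c=0.5):
  f = (w + t) / (v + 2*t)                                               # exploitation: smoothed win rate
  explore = c * math.sqrt(math.log(V) / v) if v > 0 else float('inf')   # exploration
  return f + explore

# leaf selection: pick the child with the largest score
# best = max(n.children, key=lambda c: ucb1_score(c.wins, c.visits, n.visits))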

ucb1 sample data from simple/mcts/ucb1.py
UCB1 values   f(w,v) + c sqrt( ln V / v )
win rate additive factor t = 5.0,  ucb multiplicative constant c = 0.5
columns: (w : v : V) ratios   0:1:2   1:2:6   9:20:40   50:100:200
rows: each ratio scaled by the factor in the left column

v    1    .871 .973 .681 .615
v    2    .833 .894 .625 .587
v    4    .718 .815 .581 .565
v    8    .572 .746 .548 .548
v   16    .425 .689 .523 .536
v   32    .299 .643 .504 .526
v   64    .205 .608 .490 .519
v  128    .140 .581 .479 .514
v  256    .097 .560 .471 .510
v  512    .068 .544 .466 .508
v 1024    .048 .533 .461 .505
v 2048    .034 .524 .458 .504

mcts improvement: rave

  • assume in mcts, after expand, we have picked a child, we are at leaf L

  • at L, we want to perform a random simulation

  • let S = empty cells, let ptm = black

  • want winner of random playout, black to start, from L

  • recall (from intro probability course): uniform-random selection of an element x from a set S: for each x, select with probability 1/|S|

  • consider these two processes for picking winner

  • process A

    • observe that playout is just random permutation P of S

    • construct P: repeatedly uniform-random-select a not-yet-picked element

    • eg. say S = { a b c d e }

    • uni-rand-select from S, say b, as 1st element of P

    • uni-rand-select from S - { b } , say e, as 2nd element of P

    • uni-rand-select from S - { b e } , say c, as 3rd element of P

    • uni-rand-select from S - { b e c } , say a, as 4th element of P

    • uni-rand-select from S - { b e c a } , d, as 5th element of P

    • P = ( b e c a d )

    • use P for playout, report winner

  • process B

    • uni-rand-select 3-subset R of S as black cells of playout

    • eg. uni-rand-select R from { abc abd abe acd ace ade bcd bce bde cde }

    • say R = { a c d }

    • from position L: add all R as black, all S - R as white, report winner

  • do process A and process B give same win probability for L ?

    • yes

    • perform process A

    • from permutation P, pick 3-subset of stones that will be black

    • probability that any particular 3-subset is picked this way turns out to be uniform, eg. here 1/10

    • so process A and process B give same win-rate

moral of story?
  • in uni-rand hex playouts, can think of the playout as a sequence of uni-random moves or as a uni-random subset of black/white cells (empirical check sketched below)
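
a quick empirical check of this equivalence (an illustrative sketch, not part of mcts itself): the black cells of a uniform-random permutation of S = { a b c d e } form each 3-subset with probability about 1/10:

import random
from collections import Counter
from itertools import combinations

S = list('abcde')
counts, trials = Counter(), 100000
for _ in range(trials):
  p = random.sample(S, len(S))          # process A: random playout order
  counts[frozenset(p[0::2])] += 1       # black moves 1st, 3rd, 5th
for subset in combinations(S, 3):
  print(''.join(subset), counts[frozenset(subset)] / trials)   # each is roughly 0.1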

black uni-rand win probability from this hex position?
   * * *     .
  * o - o     .        a
 o * - - o     .      b c
  o - - - o     .    d e f
     * * *       .

answer
- there are (6 choose 3) = 20 3-subsets of { a b c d e f }
- of these 3-subsets, all except { d b a } { d b c } { d e c } { d e f } win
- so black win probability = 16 / 20 = .8
application: RAVE, an AMAF improvement to mcts
  • because uni-rand-playout equivalent to uni-rand-subset selection,

    • can think of playout as subset,

    • can gather data: which cells positively correlate to wins ?

    • use this learned knowledge in future selection process anywhere on path L-to-root

RAVE details

  • rapid action value estimate, an all-moves-as-first variant

  • after each simulation, give all cells owned by winner positive-correlation win boost

  • for each child j, store this info

    • wins w_j

    • visits n_j

    • rave-wins rw_j (initially 0)

    • rave-visits rn_j (initially 0)

  • mcts-with-rave propagate: at each node j on path leaf-to-root

    • n_j += 1

    • w_j += 1 if simulation win for root ptm

    • for each child c of j whose move occurs in the extended simulation (ie. root-to-simulation-end)

      • rn_c += 1

      • rw_c += 1 if simulation win for root ptm

  • mcts-with-rave tree_policy_select:

    • pick child j maximizing   (1 - β(n_j, rn_j)) w_j / n_j   +   β(n_j, rn_j) rw_j / rn_j   +   c √( ln V / n_j )

    • β is a function that tends to 0 as the number of iterations increases, so the RAVE effect weakens as the number of playouts increases
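
a hedged sketch of this blended score; the β schedule used here, rn / (n + rn + n·rn/k), is one common choice (michi uses a schedule of this form), not the only one:

import math

def rave_score(w, n, rw, rn, V, c=0.001, k=3000):
  if n == 0 and rn == 0:
    return 0.5                              # no information yet
  beta = rn / (n + rn + n*rn/k)             # tends to 0 as n grows, so rave fades out
  mc   = w / n   if n  > 0 else 0.5         # tree (monte carlo) win rate
  amaf = rw / rn if rn > 0 else 0.5         # rave (all-moves-as-first) win rate
  explore = c * math.sqrt(math.log(V) / n) if n > 0 else 0.0
  return (1 - beta)*mc + beta*amaf + explore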

wikipedia rave example
after some number of iterations ...

root state: empty board, x to move

iterate:
  select leaf   * - xa1 - ob2 - xc2
  simulate                          - ob1 - xa2 - ob3  ! o-win
  ext-sim       * - xa1 - ob2 - xc2 - ob1 - xa2 - ob3  ! o-win

propagate:
  at leaf j = * - a1 - b2 - c2:
    n_j +=1

  at node j = * - a1 - b2
     n_j += 1
     child c = a2: rn_c +=1
     child c = c2: rn_c +=1

  at node j = * - a1
     n_j += 1
     child c = b1: rn_c +=1
     child c = b2: rn_c +=1
     child c = b3: rn_c +=1

  at node j = *
    n_j +=1
    child c = a1: rn_c +=1
    child c = a2: rn_c +=1
    child c = c2: rn_c +=1
rave effective ?
  • when simulations have significant randomness

  • when child-selection policy not strong

mcts improvement: prior knowledge

  • after mcts expand, node's children are new to the search

  • instead of picking a random child, use a prior-knowledge (ie. computed before the search) heuristic to select the most promising child (sketched below)
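
a sketch of using prior knowledge at expansion (legal_moves and heuristic_value are hypothetical helpers): seed each new child with virtual wins/visits so the first tree-policy picks follow the heuristic rather than being random:

def expand_with_prior(node, state, prior_visits=10):
  for m in legal_moves(state):                # hypothetical: legal moves of state
    child = Node(m, node)
    p = heuristic_value(state, m)             # hypothetical: estimated win prob in [0,1]
    child.wins   = p * prior_visits           # virtual wins
    child.visits = prior_visits               # virtual visits
    node.children.append(child)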

mcts improvement: neural nets

  • use neural net (policy net) and win rate for selection

  • use neural net (policy net) for prior knowledge

  • use neural net (value net) and mcts for move selection
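
a hedged sketch of policy-net-guided selection in the AlphaGo style (a PUCT-like score; constants and details vary by program): prior is the policy net's probability for the child's move, q the observed win rate:

import math

def puct_score(wins, visits, prior, parent_visits, c_puct=1.5):
  q = wins / visits if visits > 0 else 0.0
  u = c_puct * prior * math.sqrt(parent_visits) / (1 + visits)   # prior-weighted exploration
  return q + u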

go/hex programs
  • alphago   explore, prior, no rave

  • pachi   explore, prior, rave

  • fuego   no explore, prior, rave

  • mohex   explore, prior, rave