# jemdoc: addcss{rbh.css}, addcss{jacob.css}
= monte carlo tree search
~~~
{}{raw}
mcts
pseudo
simple
eg
ex
improve
explore
rave
prior
nn
~~~
# details
~~~
{}{raw}
## monte carlo tree search

~~~
~~~
{mcts overview}
there are *many* different versions of mcts, we discuss a simple one
an *anytime* algorithm
- returns a meaningful result after any amount of time
- unlike say depth-d minimax, which does not return
a result until the search to depth d is complete
a (mostly) *best-first* algorithm
- when expanding the search tree, it expands the most promising
lines first
- so mcts search is highly non-uniform:
at any fixed level, some subtrees will be must larger than others
(unlike say depth-d minimax, where each at each level
each subtree is the same size)
pseudo-code
- while time remains:
-- select leaf
-- expand leaf
-- simulate result
-- backpropagate result
- return best move (eg. most visits, or best winrate)
~~~
~~~
- [https://en.wikipedia.org/wiki/Monte_Carlo_tree_search wiki]
- [https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/ bradberry mcts intro]
- [http://pasky.or.cz/go/prace.pdf Baudis MSc]
- many mcts implementations expand *all* children of selected leaf, then
pick one (random, or using prior knowledge) for simulation
- e.g. [http://webdocs.cs.ualberta.ca/~hayward/396/mcts.pdf mcts hex example] ~
~~~
~~~
{}{raw}
## mcts detailed pseudo-code

~~~
~~~
{}{}
class Node:
def __init__(self, m, p): # move is from parent to node
self.move, self.parent, self.children = m, p, []
self.wins, self.visits = 0, 0
def expand_node(self, state):
if not terminal(state):
for each non-isomorphic legal move m of state:
nc = Node(m, self) # new child node
self.children.append(nc)
def update(self, r):
self.visits += 1
if r==win:
self.wins += 1
def is_leaf(self):
return len(self.children)==0
def has_parent(self):
return self.parent is not None
def mcts(state):
root_node = Node(None, None)
while time remains:
n, s = root_node, copy.deepcopy(state)
while not n.is_leaf(): # select leaf
n = tree_policy_child(n)
s.addmove(n.move)
n.expand_node(s) # expand
n = tree_policy_child(n)
while not terminal(s): # simulate
s = simulation_policy_child(s)
result = evaluate(s)
while n.has_parent(): # propagate
n.update(result)
n = n.parent
return best_move(tree)
~~~
~~~
{}{raw}
## simple mcts

~~~
~~~
- collect win rates for root player-to-move
- child-selection-policy: ~ random pick of any best child
- use ratio wins/visits to pick best child
- but most leaves have 0 visits ? ...
- solution: use this priority formula
-- replace ratio *wins/visits* with some function that
gives unvisited nodes a score of .5
-- eg. pick t = 1, set *{{f(w,v) = ( w + t ) / ( v + 2t )}}*
-- at even depth, pick random node in *{ argmax( f(w,v) ) }*
-- at odd depth, pick random node in *{ argmin( f(w,v) ) }*
- simulation policy: ~ each move in playout is uniform random, over all legal moves
~~~
~~~
{}{raw}
## simple mcts example

~~~
~~~
- [http://webdocs.cs.ualberta.ca/~hayward/396/mcts.pdf pdf] ~
~~~
#~~~
#{}{}
#consider this state, @ to move
#a b c
#1 * @ . @
#2 * . . @
#3 . . . @
#* * *
#
#root_node: move parent children wins visits
#None None [] 0 0
#
#### iteration 1
#tree: root_node
#select leaf: root_node
#expand(root_node)
#tree after expansion:
#root 0/0
#/ / | \ \ \
#a3 / b2 / b3 | c1 \ c2 \ c3 \
#0|0 0|0 0|0 0|0 0|0 0|0
#
#best child = random select argmax: say b2
#
#simulate from root + @b2, say to * @ @
#* * .
#* . @
#result: loss for root ptm @
#
#propagate
#0|0 => 0|1
#0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0 0|0
#^
#
#### iteration 2
#root 0/0
#/ / | \ \ \
#a3 / b2 / b3 | c1 \ c2 \ c3 \
#0|0 0|1 0|0 0|0 0|0 0|0
#
#select leaf: say c3
#expand(c3): new subtree
#c3
#/ / | \ \
#a3 / b2 / b3 | c1 \ c2 \
#0|0 0|0 0|0 0|0 0|0
#
#best child = say a3
#terminal state: loss for root ptm (so no expand)
#propagate
#0|2
#/ / | \ \ \
#/ / | \ \ \
#a3/ b2/ b3| c1\ c2\ c3\
#/ / | \ \ \
#0|0 0|1 0|0 0|0 0|0 0|1
#//|\\
#/ / | \ \
#/ / | \ \
#a3/ b2/ b3| c1\ c2\
#0|1 0|0 0|0 0|0 0|0
#~~~
#
#~~~
#{same example (omit leaves with 0 visits)}{}
#initialize: root (has 0 visits)
#
#after iteration 1: (* node with simulation)
#
#root 01
#|
#b2*01
#
#after iteration 2:
#
#root 02
#/ \
#b2 01 c3 01
#|
#a3*01
#
#after iteration 3:
#
#root 13
#/ | \
#b2 01 c3 01 b3 11
#| |
#a3 01 c2*11
#
#after iteration 4:
#
#root 14
#/ | \
#b2 01 c3 01 b3 12
#| / \
#a3 01 c2 11 a3*01
#
#after iteration 5:
#
#root 15
#/ | \
#b2 01 c3 01 b3 13
#| / \
#a3 01 c2 11 a3*02
#
#after iteration 6:
#
#root 26
#/ / \ \
#b2 01 c3 01 b3 13 a3 11
#| / \ |
#a3 01 c2 11 a3 02 c1*11
#
#after iteration 7:
#
#root 27
#/ / \ \
#b2 01 c3 01 b3 13 a3 12
#| / \ | \
#a3 01 c2 11 a3 02 c1 11 b2 01
#|
#c2*01
#
#after iteration 8:
#
#root 38
#/ / \ \
#b2 01 c3 01 b3 13 a3 23
#| / \ | \
#a3 01 c2 11 a3 02 c1 11 b2 12
#/ \
#c2 01 b3 11
#|
#c1*11
#
#after iteration 9:
#
#root 49
#/ / \ \
#/ / \ \
#b2 01 c3 01 b3 13 a3 35
#/ / \ / | \
#a3 01 c2 11 a3 02 c1 11 b2 12 c2 11
#/ / \
#c2 01 b3 11 c1*11
#|
#c1 11
#
#~~~
~~~
{}{raw}
## exercises

~~~
~~~
- if search ends now, what move does white make?
- if search continues, what is likely to happen on the next iteration?
- assume search continues until 100 iterations
-- roughly, what will the tree look like then?
-- what move would white make?
- for this state, what is white's best move? who wins? explain
- based on this example, how could you improve simple mcts ?
~~~
~~~
{}{raw}
## improving mcts

~~~
~~~
- in his Sep 2016 talk to British Royal Arts Society,
Demis Hassabis stressed the shift
from hand-crafted AI to machine-learned AI.
here, we can see this trend.
we will see this more in the architecture of AlphaGo.
- *problem* ~ purely random pick (post-expansion child-select; simulation moves) ?
-- *suggestion* ~ simulations: biased (hand-crafted or learned) move selection
-- *suggestion* ~ post-expansion: use prior knowledge
- *problem* ~ random simulations are noisy
-- *suggestion* ~ add exploration factor to child scores
- *problem* ~ tree_leaf_selection_policy: strong/winning moves not found quickly
-- *suggestion* fast heuristic/win-check
- *problem* ~ tree_leaf_selection_policy: true wins/losses repeatedly searched
-- *suggestion* ~ back up true win/loss values, prune accordingly
- *problem* ~ mcts uses tree, but (for go, hex, ttt, ...) search space is dag
-- *suggestion* ~ treat search space as a dag (how ?)
- *problem* ~ mcts is best-first, true state value is minimax
-- *suggestion* ~ add minimax behaviour to mcts (how ?)
~~~
~~~
{big gains}
- *simulations* ~ add patterns (go: save-miai, hex: save-bridge)
- *simulations* ~ collect all-moves-as-first data, use data in
*tree_leaf_select_policy*
- *tree_leaf_select_policy* ~ use bandit-arm statistics
to bias selection towards children with low visits
- *tree_leaf_select_policy* ~ prior knowledge:
after expand, pick child based on prior knowledge (not random)
~~~
~~~
{lesser gains}
- *tree* ~ use transposition table (hard), prune isomorphic states
- *tree* ~ back up true-win/true-loss info
-- at node v, if move x yields true ptm-loss, prune x;
if all moves loss-pruned, v is true ptm-loss, report that
move from parent(v) to v is true ptm-win
-- at node v, if move x yields true ptm-win,
report to parent(v) that
move from parent(v) to v is true ptm-loss
- go, hex: prior patters \+ sim'n patterns \+ RAVE yields strong intermediate player (strength of python go program michi: about 6 kyu on 4 threads)
~~~
~~~
{save-miai, save-bridge simulation patterns}{}
@ last simulation o-move
! immediate simulation x-response
go: save-miai hex: save-bridge
x @ @ x
! x x !
~~~
~~~
{}{raw}
## mcts improvement: exploration

~~~
~~~
{mcts improvement: exploration}
- mcts simulations have randomness
- so, in leaf selection, how to account for variance in win rates ? ~ ~ eg.
-- move j: 52 wins, 100 visits
-- move k: 529 wins, 1000 visits
-- are we sure that move k is better than move j ?
- similar to [https://en.wikipedia.org/wiki/Multi-armed_bandit multi-arm bandit problem]
- recall: ratio selection function {{f(w,v) = (w + t) / (v + 2t)}}
- one approach: balance *exploitation* (pick child with best win rate) with
*exploration* (pick lesser visited child)
-- exploitation: best first search
-- exploration: add some breadth to the search
- Kocsis\/Szepesvari UCB1 priority formula
-- pick move j that maximizes {{f( w_{j},v_{j} ) + c sqrt( ln V / v_{j} ) }}
-- for hex, go, good performance with c small, eg. .001
~~~
~~~
{ucb1 priority values from simple\/mcts\/ucb1.py}{}
win rate additive factor T: 5.0
ucb multiplicative constant c: 0.5
for some j, ucb1(c, w*j, v*j, V*j)
w,v,V 0,1,2 1,3,6 9,20,40 100,200,400
j
1 .871 .973 .681 .587
2 .833 .894 .625 .565
4 .718 .815 .581 .548
8 .572 .746 .548 .536
16 .425 .689 .523 .526
32 .299 .643 .504 .519
64 .205 .608 .490 .514
128 .140 .581 .479 .510
256 .097 .560 .471 .508
512 .068 .544 .466 .505
1024 .048 .533 .461 .504
2048 .034 .524 .458 .503
some examples, T = 5.0
c w v V ucb(c,w,v,V)
0.0 50 100 1000 .500
0.2 200 400 1000 .526
0.2 100 200 1000 .537
0.2 50 100 1000 .553
0.2 25 50 1000 .574
0.2 1 3 1000 .765
0.2 0 3 1000 .688
~~~
~~~
{}{raw}
## mcts improvement: rave

~~~
#~~~
#- assume in mcts, after expand, we have picked a child, we are at leaf L
#- at L, we want to perform a random simulation
#- let S = empty cells, let ptm = black
#- want winner of random playout, black to start, from L
#- recall (from intro probability course): *uniform-random selection* of an element x
#from a set S: for each x, select with probability 1/|S|
#- consider these two processes for picking winner
#- process A
#-- observe that playout is just random permutation P of S
#-- construct P: repeatedly uniform-random-select
#a not-yet-picked element
#-- eg. say {{S = { a b c d e } }}
#-- uni-rand-select from S, say b, as 1st element of P
#-- uni-rand-select from {{ S - { b } }}, say e, as 2nd element of P
#-- uni-rand-select from {{ S - { b e } }}, say c, as 3rd element of P
#-- uni-rand-select from {{ S - { b e c } }}, say a, as 4th element of P
#-- uni-rand-select from {{ S - { b e c a } }}, d, as 5th element of P
#-- P = ( b e c a d )
#-- use P for playout, report winner
#- process B
#-- uni-rand-select 3-subset R of S as black cells of playout
#-- eg. uni-rand-select R from {{ { abc abd abe acd ace ade bcd bce bde cde } }}
#-- say R = {{ { a c d } }}
#-- from position L: add all R as black, all S - R as white, report winner
#- do process A and process B give same win probability for L ?
#-- yes
#-- perform process A
#-- from permutation P, pick 3-subset of stones that will be black
#-- probability that a 3-subset is picked this way turns
#out to be uni-random, eg. here 1/10
#-- so process A and process B give same win-rate
#~~~
~~~
{Rapid Action Value Estimate}
- [http://www.cs.utexas.edu/~pstone/Courses/394Rspring13/resources/mcrave.pdf
MCTS RAVE] by Sylvain Gelly (MoGo team) and David Silver (when a UAlberta phd student)
- in uni-rand playouts, can think of playout as
sequence of uni-random moves *or* as uni-random subsets of black/white cells
- in the latter case, you can use info from simulations when
exploring similar positions
~~~
~~~
{black uni-rand win probability from this hex position?}{}
* * * .
* o - o . a
o * - - o . b c
o - - - o . d e f
* * * .
answer
- there are (6 choose 3) = 20 3-subsets of { a b c d e f }
- of these 3-subsets, all except { d b a } { d b c } { d e c } { d e f } win
- so black win probability = 16 / 20 = .8
~~~
~~~
{application: RAVE, an all-moves-as-first improvement to mcts}
- because uni-rand-playout equivalent to uni-rand-subset selection,
-- can think of playout as subset,
-- can gather data: which cells positively correlate to wins ?
-- use this learned knowledge in future selection process
anywhere on path L-to-root
~~~
#~~~
#{}{raw}
### RAVE details

#~~~
#
#~~~
#- *r*apid *a*ction *v*alue *e*stimate, an all-moves-as-first variant
#- after each simulation, give *all cells owned by winner*
#positive-correlation win boost
#- for each child j, store this info
#-- wins w_j
#-- visits v_j
#-- rave-wins rw_j (initially 0)
#-- rave-visits rv_j (initially 0)
#- mcts-with-rave propagate: at each node j on path leaf-to-root
#-- n_j += 1
#-- w_j += 1 if simulation win for root ptm
#-- for each child c in extended-simulation (ie. root-to-simulation-end)
#--- rn_c += 1
#--- rw_c += if simulation win for root ptm
#- mcts-with-rave tree_policy_select:
#-- arg_max ~ {{(1 - β(n_{j},rn_{j}))
#w_{j} / n_{j} +
#β(n_{j},rn_{j})
#rw_{j} / rn_{j} +
#c √( ln V / v_{j})}}
#-- {{β}} is a function that tends to 0 as the number of iterations increases, so the RAVE effect weakens as the number of playouts increases
#~~~
#
#~~~
#{wikipedia rave example}{}
#after some number of iterations ...
#
#root state: empty board, x to move
#
#iterate:
#select leaf * - xa1 - ob2 - xc2
#simulate - ob2 - xa2 - ob3 ! o-win
#ext-sim * - xa1 - ob2 - xc2 - ob2 - xa2 - ob3 ! o-win
#
#propagate:
#at leaf j = * - a1 - b2 - c2:
#n_j +=1
#
#at node j = * - a1 - b2
#n_j += 1
#child c = a2: rn_c +=1
#child c = c2: rn_c +=1
#
#at node j = * - a1
#n_j += 1
#child c = b1: rn_c +=1
#child c = b2: rn_c +=1
#child c = b3: rn_c +=1
#
#at node j = *
#n_j +=1
#child c = a1: rn_c +=1
#child c = a2: rn_c +=1
#child c = c2: rn_c +=1
#~~~
~~~
{RAVE effective ?}
- when simulations have significant randomness
- when child-selection policy not strong
- AlphaGo has a strong child-selection policy learned by neural nets,
so does not use RAVE
~~~
~~~
{}{raw}
## mcts improvement: prior knowledge

~~~
~~~
- after mcts expand,
node's children are new to the search
- instead of picking random child,
use a *prior knowledge* (ie. before search) heuristic
to select most promising node
~~~
~~~
{}{raw}
## mcts improvement: neural nets

~~~
~~~
- use neural net (policy net) and win rate for selection
- use neural net (policy net) for prior knowledge
- use neural net (value net) and mcts for move selection
~~~
~~~
{go/hex programs}
- alphago ~ explore, prior, *no rave*
- patchi ~ explore, prior, rave
- fuego ~ *no explore*, prior, rave
- mohex ~ explore, prior, rave
~~~