# jemdoc: addcss{rbh.css}, addcss{jacob.css}
= monte carlo tree search
~~~
{}{raw}
mcts
pseudo
pure
trace
ex
improve
explore
rave
details
prior
nn
~~~
~~~
{}{raw}
monte carlo tree search
~~~
~~~
- [https://en.wikipedia.org/wiki/Monte_Carlo_tree_search wiki]
- [https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/ bradberry mcts intro]
- [http://pasky.or.cz/go/prace.pdf Baudis MSc]
- [http://webdocs.cs.ualberta.ca/~hayward/396/mcts.pdf mcts hex example] ~
~~~
~~~
{mcts overview}
- an *anytime* algorithm
    -- returns a meaningful result after any amount of time
    -- unlike, say, depth-d minimax, which does not return a result until the search to depth d is complete
- a (mostly) *best-first* algorithm
    -- when expanding the search tree, it expands the most promising lines first
    -- so mcts search is highly non-uniform: at any fixed level, some subtrees will be much larger than others (unlike, say, depth-d minimax, where at each level each subtree is the same size)
- pseudocode
    -- while time remains:
        --- select leaf
        --- expand leaf
        --- simulate result
        --- backpropagate result
    -- return best move (eg. most visits, or best win rate)
~~~
~~~
{}{raw}
mcts detailed pseudocode
~~~
~~~
{}{}
class Node:
    def __init__(self, m, p):  # move is from parent to node
        self.move, self.parent, self.children = m, p, []
        self.wins, self.visits = 0, 0

    def expand_node(self, state):
        if not terminal(state):
            for each non-isomorphic legal move m of state:
                nc = Node(m, self)  # new child node
                self.children.append(nc)

    def update(self, r):
        self.visits += 1
        if r == win:  # result from root ptm's viewpoint
            self.wins += 1

    def is_leaf(self):
        return len(self.children) == 0

    def has_parent(self):
        return self.parent is not None

def mcts(state):
    root_node = Node(None, None)
    while time remains:
        n, s = root_node, copy.deepcopy(state)
        while not n.is_leaf():             # select leaf
            n = tree_policy_child(n)
            s.addmove(n.move)
        n.expand_node(s)                   # expand
        if not n.is_leaf():                # terminal leaves have no children
            n = tree_policy_child(n)
            s.addmove(n.move)              # play the chosen child's move
        while not terminal(s):             # simulate
            s = simulation_policy_child(s)
        result = evaluate(s)
        while True:                        # propagate, updating root too
            n.update(result)
            if not n.has_parent():
                break
            n = n.parent
    return best_move(root_node)
~~~
~~~
{}{raw}
pure mcts
~~~
~~~
- tree policy: ~ random pick of any best child
    -- use ratio wins/visits to pick best child
    -- but most leaves have 0 visits ? ...
    -- solution: replace the ratio *wins/visits* with some function that gives unvisited nodes a score of .5
    -- eg. pick t = 1, set *{{f(w,v) = ( w + t ) / ( v + 2t )}}*
    -- at even depth, pick a random node in *{{argmax( f(w,v) )}}*
    -- at odd depth, pick a random node in *{{argmin( f(w,v) )}}*
    -- see the sketch after this list
- simulation policy: ~ each move in playout is uniform random, over all legal moves
~~~
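a minimal sketch of this tree policy in python, assuming the Node class (wins, visits, children) from the detailed pseudocode above; here tree_policy_child takes an explicit maximize flag for even/odd depth, which the pseudocode leaves implicit:

~~~
{}{}
import random

T = 1  # additive factor: an unvisited node scores (0 + 1) / (0 + 2) = .5

def f(w, v, t=T):
    # smoothed win rate, replacing w/v
    return (w + t) / (v + 2 * t)

def tree_policy_child(node, maximize):
    # maximize at even depth (root ptm to move), minimize at odd depth;
    # break ties by a uniform-random pick among the best children
    scores = [f(c.wins, c.visits) for c in node.children]
    best = max(scores) if maximize else min(scores)
    return random.choice([c for c, s in zip(node.children, scores) if s == best])
~~~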
~~~
{}{raw}
hex 3x3: pure mcts example
~~~
~~~
{}{}
consider this state, @ to move (@ owns the left-right edges, * the top-bottom edges)

    a b c
  1  * @ .   @
   2  * . .   @
    3  . . .   @
        * * *

root_node:  move  parent  children  wins  visits
            None  None    []        0     0

### iteration 1

tree: root_node
select leaf: root_node
expand(root_node)

tree after expansion (wins/visits):

root 0/0
  a3 0/0   b2 0/0   b3 0/0   c1 0/0   c2 0/0   c3 0/0

best child = random pick from argmax: say b2
simulate from root + @b2, say to

  * @ @
   * * .
    * . @

result: loss for root ptm @
propagate (0/0 => 0/1 at b2 and at root):

root 0/1
  a3 0/0   b2 0/1   b3 0/0   c1 0/0   c2 0/0   c3 0/0

### iteration 2

root 0/1
  a3 0/0   b2 0/1   b3 0/0   c1 0/0   c2 0/0   c3 0/0

select leaf: say c3
expand(c3): new subtree

c3
  a3 0/0   b2 0/0   b3 0/0   c1 0/0   c2 0/0

best child = say a3
terminal state (a1-a2-a3 joins *'s edges): loss for root ptm @ (so no expand)
propagate:

root 0/2
  a3 0/0   b2 0/1   b3 0/0   c1 0/0   c2 0/0   c3 0/1
  c3's children:  a3 0/1   b2 0/0   b3 0/0   c1 0/0   c2 0/0
~~~
~~~
{same example (omit leaves with 0 visits)}{}
initialize: root (has 0 visits)

notation: w/v = wins/visits; * marks the node from which this iteration's simulation was run

after iteration 1:
root 0/1
  b2* 0/1

after iteration 2:
root 0/2
  b2 0/1
  c3 0/1
    a3* 0/1

after iteration 3:
root 1/3
  b2 0/1
  c3 0/1
    a3 0/1
  b3 1/1
    c2* 1/1

after iteration 4:
root 1/4
  b2 0/1
  c3 0/1
    a3 0/1
  b3 1/2
    c2 1/1
    a3* 0/1

after iteration 5:
root 1/5
  b2 0/1
  c3 0/1
    a3 0/1
  b3 1/3
    c2 1/1
    a3* 0/2

after iteration 6:
root 2/6
  b2 0/1
  c3 0/1
    a3 0/1
  b3 1/3
    c2 1/1
    a3 0/2
  a3 1/1
    c1* 1/1

after iteration 7:
root 2/7
  b2 0/1
  c3 0/1
    a3 0/1
  b3 1/3
    c2 1/1
    a3 0/2
  a3 1/2
    c1 1/1
    b2 0/1
      c2* 0/1

after iteration 8:
root 3/8
  b2 0/1
  c3 0/1
    a3 0/1
  b3 1/3
    c2 1/1
    a3 0/2
  a3 2/3
    c1 1/1
    b2 1/2
      c2 0/1
      b3 1/1
        c1* 1/1

after iteration 9:
root 4/9
  b2 0/1
  c3 0/1
    a3 0/1
  b3 1/3
    c2 1/1
    a3 0/2
  a3 3/5
    c1 1/1
    b2 1/2
      c2 0/1
      b3 1/1
        c1 1/1
    c2 1/1
      c1* 1/1
~~~
~~~
{above example}
- [http://webdocs.cs.ualberta.ca/~hayward/396/mcts.pdf pdf] ~
~~~
~~~
{}{raw}
exercises
~~~
~~~
- if search ends now, what move does white make?
- if search continues, what is likely to happen on the next iteration?
- assume search continues until 100 iterations
    -- roughly, what will the tree look like then?
    -- what move would white make?
- for this state, what is white's best move? who wins? explain
- based on this example, how could you improve pure mcts ?
~~~
~~~
{}{raw}
improving mcts
~~~
~~~
- in his Sep 2016 talk to the Royal Society of Arts, Demis Hassabis stressed the shift from hand-crafted AI to machine-learned AI. here, we can see this trend. we will see it again in the architecture of AlphaGo.
- *problem* ~ purely random pick (post-expansion child-select; simulation moves) ?
    -- *suggestion* ~ simulations: biased (hand-crafted or learned) move selection
    -- *suggestion* ~ post-expansion: use prior knowledge
- *problem* ~ random simulations are noisy
    -- *suggestion* ~ add an exploration factor to child scores
- *problem* ~ tree_leaf_selection_policy: strong/winning moves not found quickly
    -- *suggestion* ~ fast heuristic/win-check
- *problem* ~ tree_leaf_selection_policy: true wins/losses repeatedly searched
    -- *suggestion* ~ back up true win/loss values, prune accordingly
- *problem* ~ mcts uses a tree, but (for go, hex, ttt, ...) the search space is a dag
    -- *suggestion* ~ treat the search space as a dag (how ?)
- *problem* ~ mcts is best-first, true state value is minimax
    -- *suggestion* ~ add minimax behaviour to mcts (how ?)
~~~
~~~
{big gains}
- *simulations* ~ add patterns (go: save-miai, hex: save-bridge)
- *simulations* ~ collect all-moves-as-first data, use data in *tree_leaf_select_policy*
- *tree_leaf_select_policy* ~ use bandit-arm statistics to bias selection towards children with low visits
- *tree_leaf_select_policy* ~ prior knowledge: after expand, pick child based on prior knowledge (not random)
~~~
~~~
{lesser gains}
- *tree* ~ use transposition table (hard), prune isomorphic states
- *tree* ~ back up true-win/true-loss info (see the sketch after this list)
    -- at node v, if move x yields a true ptm-loss, prune x; if all moves are loss-pruned, v is a true ptm-loss, so report that the move from parent(v) to v is a true ptm-win
    -- at node v, if move x yields a true ptm-win, report to parent(v) that the move from parent(v) to v is a true ptm-loss
- go, hex: prior patterns \+ sim'n patterns \+ RAVE yields a strong intermediate player (strength of python go program michi: about 6 kyu on 4 threads)
~~~
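a minimal sketch of backing up true win/loss values; the .proven field (always from the viewpoint of that node's player to move) is an assumption for illustration, not from these notes:

~~~
{}{}
WIN, LOSS = 'win', 'loss'

def backup_proven(v):
    # call after some child of v becomes proven;
    # .proven is None, WIN, or LOSS for the player to move at that node
    if any(c.proven == LOSS for c in v.children):
        v.proven = WIN            # some move leaves the opponent lost
    elif all(c.proven == WIN for c in v.children):
        v.proven = LOSS           # every move lets the opponent win
    if v.proven is not None and v.parent is not None:
        backup_proven(v.parent)   # a newly proven node may prove its parent
~~~

the tree policy can then skip (prune) any child whose proven value is a win for the opponent, and stop searching below proven nodes altogether.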
~~~
{save-miai, save-bridge simulation patterns}{}
@ = last simulation o-move
! = immediate simulation x-response

go: save-miai      hex: save-bridge

    x @                @ x
    ! x                 x !
~~~
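a sketch of a simulation policy biased by such patterns; match_pattern, last_move, and legal_moves are hypothetical game-specific helpers:

~~~
{}{}
import random

def simulation_policy_move(state):
    # if the opponent's last move intrudes on a known pattern
    # (eg. a hex bridge), answer with the pattern response ...
    reply = match_pattern(state, state.last_move())
    if reply is not None:
        return reply
    # ... otherwise fall back to a uniform-random legal move
    return random.choice(state.legal_moves())
~~~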
~~~
{}{raw}
mcts improvement: exploration
~~~
~~~
{mcts improvement: exploration}
- mcts simulations have randomness
- so, in leaf selection, how to account for variance in win rates ? ~ ~ eg.
    -- move j: 52 wins, 100 visits
    -- move k: 529 wins, 1000 visits
    -- are we sure that move k is better than move j ?
- similar to the [https://en.wikipedia.org/wiki/Multi-armed_bandit multi-armed bandit problem]
- recall: ratio selection function {{f(w,v) = (w + t) / (v + 2t)}}
- one approach: balance *exploitation* (pick child with best win rate) with *exploration* (pick lesser-visited child)
    -- exploitation: best-first search
    -- exploration: add some breadth to the search
- Kocsis\/Szepesvari UCB1 formula (see the sketch after this list)
    -- pick move j that maximizes {{ f(w_j, v_j) + c sqrt( ln V / v_j ) }}, where V is the parent's total visit count
    -- for hex, go: good performance with c small, eg. .001
~~~
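a sketch of UCB1 leaf selection, again assuming the Node class from the detailed pseudocode:

~~~
{}{}
import math

def ucb1_child(node, t=1, c=0.001):
    V = node.visits  # parent's total visit count
    def score(ch):
        if ch.visits == 0:
            return float('inf')  # try each child at least once
        rate = (ch.wins + t) / (ch.visits + 2 * t)   # smoothed win rate
        return rate + c * math.sqrt(math.log(V) / ch.visits)
    return max(node.children, key=score)
~~~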
~~~
{ucb1 sample data from simple\/mcts\/ucb1.py}{}
UCB1 formula values
win rate additive factor T = 5.0
ucb multiplicative constant c = 0.5

                 w : v : V ratio
    v      0:1:2    1:3:6   9:20:40   50:100:200
    1      .871     .973    .681      .615
    2      .833     .894    .625      .587
    4      .718     .815    .581      .565
    8      .572     .746    .548      .548
   16      .425     .689    .523      .536
   32      .299     .643    .504      .526
   64      .205     .608    .490      .519
  128      .140     .581    .479      .514
  256      .097     .560    .471      .510
  512      .068     .544    .466      .508
 1024      .048     .533    .461      .505
 2048      .034     .524    .458      .504
~~~
~~~
{}{raw}
mcts improvement: rave
~~~
~~~
- assume that in mcts, after expand, we have picked a child: we are at leaf L
- at L, we want to perform a random simulation
- let S = empty cells, let ptm = black
- we want the winner of a random playout, black to start, from L
- recall (from intro probability course): *uniform-random selection* of an element x from a set S: each x is selected with probability 1/|S|
- consider these two processes for picking a winner (see the check after this list)
    -- process A
        --- observe that a playout is just a random permutation P of S
        --- construct P: repeatedly uniform-random-select a not-yet-picked element
        --- eg. say {{S = { a b c d e } }}
        --- uni-rand-select from S, say b, as 1st element of P
        --- uni-rand-select from {{ S - { b } }}, say e, as 2nd element of P
        --- uni-rand-select from {{ S - { b e } }}, say c, as 3rd element of P
        --- uni-rand-select from {{ S - { b e c } }}, say a, as 4th element of P
        --- uni-rand-select from {{ S - { b e c a } }}, d, as 5th element of P
        --- P = ( b e c a d )
        --- use P for playout, report winner
    -- process B
        --- uni-rand-select a 3-subset R of S as the black cells of the playout
        --- eg. uni-rand-select R from {{ { abc abd abe acd ace ade bcd bce bde cde } }}
        --- say R = {{ { a c d } }}
        --- from position L: add all of R as black, all of S - R as white, report winner
- do process A and process B give the same win probability for L ?
    -- yes
    -- perform process A
    -- from permutation P, the 3-subset of stones that will be black is P's 1st, 3rd, 5th elements
    -- the probability that any given 3-subset is picked this way turns out to be uniform, eg. here 1/10
    -- so process A and process B give the same win rate
~~~
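the equivalence is easy to check empirically; a small sketch (not from the notes), counting how often each 3-subset becomes black under each process:

~~~
{}{}
import random
from collections import Counter

S = list('abcde')
trials = 100000

def process_a():
    P = random.sample(S, len(S))           # random permutation = playout order
    return frozenset(P[0::2])              # black plays the 1st, 3rd, 5th moves

def process_b():
    return frozenset(random.sample(S, 3))  # uniform-random 3-subset

for proc in (process_a, process_b):
    counts = Counter(proc() for _ in range(trials))
    # each of the 10 subsets should appear with frequency near 1/10
    print(sorted((''.join(sorted(k)), round(n / trials, 3)) for k, n in counts.items()))
~~~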
~~~
{moral of story?}
- in uni-rand hex playouts, we can think of a playout as a sequence of uni-random moves *or* as uni-random subsets of black/white cells
~~~
~~~
{black uni-rand win probability from this hex position?}{}
* * * .
* o  o . a
o *   o . b c
o    o . d e f
* * * .
answer
- there are (6 choose 3) = 20 3-subsets of { a b c d e f }
- of these 3-subsets, all except { d b a } { d b c } { d e c } { d e f } win
- so black win probability = 16 / 20 = .8
~~~
~~~
{application: RAVE, an AMAF improvement to mcts}
- because a uni-rand playout is equivalent to uni-rand subset selection:
    -- we can think of a playout as a subset
    -- we can gather data: which cells positively correlate with wins ?
    -- we can use this learned knowledge in future selection, anywhere on the path from L to root
~~~
~~~
{}{raw}
RAVE details
~~~
~~~
- *r*apid *a*ction *v*alue *e*stimate, an all-moves-as-first variant
- after each simulation, give *all cells owned by the winner* a positive-correlation win boost
- for each child j, store this info
    -- wins w_j
    -- visits n_j
    -- rave wins rw_j (initially 0)
    -- rave visits rn_j (initially 0)
- mcts-with-rave propagate: at each node j on the path leaf-to-root
    -- n_j += 1
    -- w_j += 1 if simulation is a win for root ptm
    -- for each child c of j whose move appears in the extended simulation (ie. root-to-simulation-end)
        --- rn_c += 1
        --- rw_c += 1 if simulation is a win for root ptm
- mcts-with-rave tree_policy_select: pick child j maximizing (see the sketch after this list)
    -- {{ (1 - β(n_j, rn_j)) * w_j / n_j  +  β(n_j, rn_j) * rw_j / rn_j  +  c sqrt( ln V / n_j ) }}
    -- {{β}} is a function that tends to 0 as the number of iterations increases, so the RAVE effect weakens as the number of playouts increases
~~~
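a sketch of the rave selection rule; rave_wins/rave_visits correspond to rw_j/rn_j above, and the β schedule sqrt( k / (3 n_j + k) ), with hand-tuned equivalence parameter k, is one common choice (an assumption here, not from these notes):

~~~
{}{}
import math

def rave_child(node, c=0.001, k=1000):
    V = node.visits
    def score(ch):
        if ch.visits == 0:
            return float('inf')
        beta = math.sqrt(k / (3 * ch.visits + k))  # tends to 0 as visits grow
        rate = ch.wins / ch.visits
        rave = ch.rave_wins / ch.rave_visits if ch.rave_visits else 0.5
        return (1 - beta) * rate + beta * rave + c * math.sqrt(math.log(V) / ch.visits)
    return max(node.children, key=score)
~~~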
~~~
{wikipedia rave example}{}
after some number of iterations ...
root state: empty board, x to move

iterate:
  select leaf:  * - xa1 - ob2 - xc2
  simulate:       - ob1 - xa2 - ob3  =>  o win
  ext-sim:      * - xa1 - ob2 - xc2 - ob1 - xa2 - ob3  =>  o win

propagate:
  at leaf j = * - a1 - b2 - c2:
    n_j += 1
  at node j = * - a1 - b2:
    n_j += 1
    child c = a2:  rn_c += 1
    child c = c2:  rn_c += 1
  at node j = * - a1:
    n_j += 1
    child c = b1:  rn_c += 1
    child c = b2:  rn_c += 1
    child c = b3:  rn_c += 1
  at node j = *:
    n_j += 1
    child c = a1:  rn_c += 1
    child c = a2:  rn_c += 1
    child c = c2:  rn_c += 1

(no w_j or rw_c updates: the simulation was a loss for root ptm x)
~~~
~~~
{rave effective ?}
- when simulations have significant randomness
- when the child-selection policy is not strong
~~~
~~~
{}{raw}
mcts improvement: prior knowledge
~~~
~~~
- after mcts expand, a node's children are new to the search
- instead of picking a random child, use a *prior knowledge* (ie. before search) heuristic to select the most promising node (see the sketch below)
~~~
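one common way to inject prior knowledge is to initialize each new child with virtual wins and visits; a sketch, where heuristic and legal_moves are hypothetical helpers and Node is the class from the detailed pseudocode:

~~~
{}{}
def expand_with_prior(node, state, heuristic, n_prior=10):
    for m in legal_moves(state):                     # hypothetical helper
        child = Node(m, node)
        child.visits = n_prior                       # virtual visits
        child.wins = heuristic(state, m) * n_prior   # heuristic win rate in [0,1]
        node.children.append(child)
~~~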
~~~
{}{raw}
mcts improvement: neural nets
~~~
~~~
- use neural net (policy net) and win rate for selection (see the sketch below)
- use neural net (policy net) for prior knowledge
- use neural net (value net) and mcts for move selection
~~~
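a minimal sketch of AlphaGo-style selection (the PUCT rule), where each child stores a policy-net prior probability; constants and details vary by program:

~~~
{}{}
import math

def puct_child(node, c_puct=1.0):
    V = node.visits
    def score(ch):
        q = ch.wins / ch.visits if ch.visits else 0              # win-rate term
        u = c_puct * ch.prior * math.sqrt(V) / (1 + ch.visits)   # prior-guided exploration
        return q + u
    return max(node.children, key=score)
~~~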
~~~
{go/hex programs}
- alphago ~ explore, prior, *no rave*
- pachi ~ explore, prior, rave
- fuego ~ *no explore*, prior, rave
- mohex ~ explore, prior, rave
~~~