mcts overview

there are **many** different versions of mcts; we discuss a simple one

an **anytime** algorithm

returns a meaningful result after any amount of time

unlike say depth-d minimax, which does not return a result until the search to depth d is complete

a (mostly) **best-first** algorithm

when expanding the search tree, it expands the most promising lines first

so mcts search is highly non-uniform: at any fixed level, some subtrees will be much larger than others (unlike say depth-d minimax, where at each level each subtree is the same size)

pseudo-code

```
while time remains:
    select leaf
    expand leaf
    simulate result
    backpropagate result
return best move (e.g. most visits, or best win rate)
```

many mcts implementations expand **all** children of the selected leaf, then pick one (random, or using prior knowledge) for simulation

e.g. mcts hex example

```python
import copy

# terminal, win, non_isomorphic_legal_moves, tree_policy_child,
# simulation_policy_child, evaluate, best_move are game/search helpers
# assumed defined elsewhere

class Node:
    def __init__(self, m, p):   # move m is from parent p to this node
        self.move, self.parent, self.children = m, p, []
        self.wins, self.visits = 0, 0

    def expand_node(self, state):
        if not terminal(state):
            for m in non_isomorphic_legal_moves(state):
                self.children.append(Node(m, self))   # new child node

    def update(self, r):
        self.visits += 1
        if r == win:
            self.wins += 1

    def is_leaf(self):
        return len(self.children) == 0

    def has_parent(self):
        return self.parent is not None

def mcts(state):
    root_node = Node(None, None)
    while time_remains():
        n, s = root_node, copy.deepcopy(state)
        while not n.is_leaf():            # select leaf
            n = tree_policy_child(n)
            s.addmove(n.move)
        n.expand_node(s)                  # expand
        if not n.is_leaf():               # a terminal leaf has no children
            n = tree_policy_child(n)
            s.addmove(n.move)
        while not terminal(s):            # simulate
            s = simulation_policy_child(s)
        result = evaluate(s)
        while n.has_parent():             # backpropagate
            n.update(result)
            n = n.parent
    return best_move(root_node)           # root_node is the tree
```

collect win rates for root player-to-move

child-selection-policy: random pick of any best child

use ratio wins/visits to pick best child

but most leaves have 0 visits ? …

solution: use this priority formula

replace the ratio **wins/visits** with a function that gives unvisited nodes a score of .5

e.g. pick t = 1, set **f(w,v) = ( w + t ) / ( v + 2t )**

at even depth, pick a random node in **{ argmax f(w,v) }**

at odd depth, pick a random node in **{ argmin f(w,v) }**
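a minimal python sketch of this rule (the depth argument and tie-breaking style are illustrative; the pseudo-code above passes only the node):

```python
import random

def f(wins, visits, t=1):
    # smoothed win rate: an unvisited node scores t / 2t = .5
    return (wins + t) / (visits + 2 * t)

def tree_policy_child(node, depth):
    # even depth: root player to move, maximize f; odd depth: minimize f
    pick = max if depth % 2 == 0 else min
    best = pick(f(c.wins, c.visits) for c in node.children)
    # random pick among all children achieving the best score
    return random.choice([c for c in node.children
                          if f(c.wins, c.visits) == best])
```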

simulation policy: each move in playout is uniform random, over all legal moves
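a one-step sketch of this policy, assuming a legal_moves(state) helper from the game code:

```python
import random

def simulation_policy_child(state):
    # uniform random playout step: every legal move equally likely
    state.addmove(random.choice(legal_moves(state)))
    return state
```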

if search ends now, what move does white make?

if search continues, what is likely to happen on the next iteration?

assume search continues until 100 iterations

roughly, what will the tree look like then?

what move would white make?

for this state, what is white's best move? who wins? explain

based on this example, how could you improve simple mcts ?

in his Sep 2016 talk to the British Royal Arts Society, Demis Hassabis stressed the shift from hand-crafted AI to machine-learned AI. here we can see this trend; we will see it again in the architecture of AlphaGo.

**problem** purely random picks (post-expansion child-select; simulation moves)

**suggestion** simulations: biased (hand-crafted or learned) move selection

**suggestion** post-expansion: use prior knowledge

**problem** random simulations are noisy

**suggestion** add exploration factor to child scores

**problem** tree_leaf_selection_policy: strong/winning moves not found quickly

**suggestion** fast heuristic/win-check

**problem** tree_leaf_selection_policy: true wins/losses repeatedly searched

**suggestion** back up true win/loss values, prune accordingly

**problem** mcts uses a tree, but (for go, hex, ttt, …) the search space is a dag

**suggestion** treat search space as a dag (how ?)

**problem** mcts is best-first, true state value is minimax

**suggestion** add minimax behaviour to mcts (how ?)

big gains

**simulations** add patterns (go: save-miai, hex: save-bridge)

**simulations** collect all-moves-as-first data, use the data in **tree_leaf_select_policy**

**tree_leaf_select_policy** use bandit-arm statistics to bias selection towards children with low visits

**tree_leaf_select_policy** prior knowledge: after expand, pick child based on prior knowledge (not random)

lesser gains

**tree** use transposition table (hard), prune isomorphic states

**tree** back up true-win/true-loss info:

at node v, if move x yields a true ptm-loss, prune x; if all moves are loss-pruned, v is a true ptm-loss, so report to parent(v) that the move from parent(v) to v is a true ptm-win

at node v, if move x yields a true ptm-win, report to parent(v) that the move from parent(v) to v is a true ptm-loss
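a hedged sketch of this backup (the proven_win flag and recursion shape are assumptions; the Node class above has no such field):

```python
def backup_proven_loss(v):
    # v is a proven loss for its player-to-move: every move from v is
    # loss-pruned. so the move from parent(v) into v is a proven win for
    # the player who made it, and from grandparent(v)'s view the move into
    # parent(v) is a proven loss: prune parent(v) there, and recurse if
    # grandparent(v) has nothing left
    p = v.parent
    if p is None:
        return
    p.proven_win = True
    g = p.parent
    if g is not None:
        g.children = [c for c in g.children if c is not p]   # loss-prune
        if not g.children:
            backup_proven_loss(g)
```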

go, hex: prior patterns + sim'n patterns + RAVE yields a strong intermediate player (strength of python go program michi: about 6 kyu on 4 threads)

save-miai, save-bridge simulation patterns

legend: @ = last simulation o-move, ! = immediate simulation x-response

[diagrams: go save-miai and hex save-bridge patterns, showing x's forced reply ! to o's move @]

mcts improvement: exploration

mcts simulations have randomness

so, in leaf selection, how to account for variance in win rates ? eg.

move j: 52 wins, 100 visits

move k: 529 wins, 1000 visits

are we sure that move k is better than move j ?

similar to the multi-armed bandit problem

recall: ratio selection function f(w,v) = (w + t) / (v + 2t)

one approach: balance **exploitation** (pick child with best win rate) with **exploration** (pick lesser-visited child)

exploitation: best-first search

exploration: add some breadth to the search

Kocsis/Szepesvari UCB1 priority formula

pick move j that maximizes **f(w_j, v_j) + c sqrt( ln V / v_j )**, where V is the total visit count of the parent

for hex, go: good performance with c small, e.g. .001

ucb1 priority values from simple/mcts/ucb1.py

win rate additive factor T: 5.0, ucb multiplicative constant c: 0.5

ucb1(c, w·j, v·j, V·j) as the statistics (w,v,V) scale up by a factor j:

| j    | w,v,V = 0,1,2 | 1,2,6 | 9,20,40 | 100,200,400 |
|------|------|------|------|------|
| 1    | .871 | .973 | .681 | .587 |
| 2    | .833 | .894 | .625 | .565 |
| 4    | .718 | .815 | .581 | .548 |
| 8    | .572 | .746 | .548 | .536 |
| 16   | .425 | .689 | .523 | .526 |
| 32   | .299 | .643 | .504 | .519 |
| 64   | .205 | .608 | .490 | .514 |
| 128  | .140 | .581 | .479 | .510 |
| 256  | .097 | .560 | .471 | .508 |
| 512  | .068 | .544 | .466 | .505 |
| 1024 | .048 | .533 | .461 | .504 |
| 2048 | .034 | .524 | .458 | .503 |

some examples, T = 5.0:

| c   | w   | v   | V    | ucb1(c,w,v,V) |
|-----|-----|-----|------|------|
| 0.0 | 50  | 100 | 1000 | .500 |
| 0.2 | 200 | 400 | 1000 | .526 |
| 0.2 | 100 | 200 | 1000 | .537 |
| 0.2 | 50  | 100 | 1000 | .553 |
| 0.2 | 25  | 50  | 1000 | .574 |
| 0.2 | 1   | 3   | 1000 | .765 |
| 0.2 | 0   | 3   | 1000 | .688 |
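a sketch consistent with the formula and the tables above (the actual simple/mcts/ucb1.py may differ):

```python
import math

T = 5.0   # win rate additive factor, as in the tables above

def ucb1(c, w, v, V):
    # smoothed win rate plus exploration bonus
    return (w + T) / (v + 2 * T) + c * math.sqrt(math.log(V) / v)

# spot-checks against the tables:
print(round(ucb1(0.5, 0, 1, 2), 3))        # .871  (first table, j = 1)
print(round(ucb1(0.0, 50, 100, 1000), 3))  # .500
print(round(ucb1(0.2, 50, 100, 1000), 3))  # .553
```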

Rapid Action Value Estimate

MCTS RAVE, by Sylvain Gelly (MoGo team) and David Silver (then a UAlberta PhD student)

in uniform-random playouts, we can think of a playout as a sequence of uniform-random moves

**or** as uniform-random subsets of black/white cells

in the latter case, you can use info from simulations when exploring similar positions

black uni-rand win probability from this hex position?

[hex diagram: a position whose six empty cells are labeled a b c d e f]

answer:
- there are (6 choose 3) = 20 3-subsets of { a b c d e f }
- of these 3-subsets, all except { d b a } { d b c } { d e c } { d e f } win
- so black win probability = 16 / 20 = .8
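a quick check of this count, with the four losing subsets copied from the answer:

```python
from itertools import combinations

losing = [{'d', 'b', 'a'}, {'d', 'b', 'c'}, {'d', 'e', 'c'}, {'d', 'e', 'f'}]
subsets = [set(s) for s in combinations('abcdef', 3)]
wins = sum(1 for s in subsets if s not in losing)
print(wins, '/', len(subsets), '=', wins / len(subsets))   # 16 / 20 = 0.8
```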

application: RAVE, an all-moves-as-first improvement to mcts

because a uni-rand playout is equivalent to uni-rand subset selection,

we can think of a playout as a subset,

and can gather data: which cells positively correlate with wins ?

use this learned knowledge in future selection, anywhere on the path from leaf to root
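a hedged sketch of this bookkeeping (the rave_wins/rave_visits fields and the beta-weighted blend are one common RAVE variant, not necessarily the exact Gelly/Silver formula; f is the smoothed win rate from earlier):

```python
def rave_update(node, playout_moves, result):
    # all-moves-as-first: every child whose move appeared anywhere in the
    # playout updates its rave statistics, not just the move actually chosen
    for child in node.children:
        if child.move in playout_moves:
            child.rave_visits += 1
            if result == win:
                child.rave_wins += 1

def rave_score(child, k=500):
    # blend the scarce-but-exact mcts win rate with the plentiful-but-
    # approximate rave win rate; rave's weight fades as real visits grow
    beta = k / (k + child.visits)   # k is an illustrative tuning constant
    return (1 - beta) * f(child.wins, child.visits) + \
           beta * f(child.rave_wins, child.rave_visits)
```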

RAVE effective ?

when simulations have significant randomness

when child-selection policy not strong

AlphaGo has a strong child-selection policy learned by neural nets, so does not use RAVE

after mcts expand, node's children are new to the search

instead of picking a random child, use a **prior knowledge** (ie. before-search) heuristic to select the most promising node

use neural net (policy net) and win rate for selection

use neural net (policy net) for prior knowledge

use neural net (value net) and mcts for move selection
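a sketch of a puct-style score in this spirit (c_puct, the child.prior field, and the exact form are assumptions; AlphaGo's published formula differs in details):

```python
import math

def puct_score(child, parent_visits, c_puct=1.0):
    # exploitation: observed win rate from search backups
    q = child.wins / child.visits if child.visits > 0 else 0.0
    # exploration: policy-net prior, decaying as the child gets visited
    u = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return q + u
```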

go/hex programs

alphago: explore, prior, **no rave**

pachi: explore, prior, rave

fuego: **no explore**, prior, rave

mohex: explore, prior, rave