introduced in 2006 by Coulom and others
revolutionized computer Go
repeat these steps until time runs out
select leaf
expand leaf and simulate result(s)
backpropagate result(s)
keys to success
expansion: select strong child quickly
simulate: predict winner accurately
there are many different versions of mcts; we discuss a simple one
an anytime algorithm
returns a meaningful result after any amount of time
unlike say depth-d minimax, which does not return a result until the search to depth d is complete
a (mostly) best-first algorithm
when expanding the search tree, it expands the most promising lines first
so mcts search is highly non-uniform: at any fixed level, some subtrees will be much larger than others (unlike say depth-d minimax, where at each level every subtree is the same size)
pseudo-code
while time remains:
select leaf
expand leaf and simulate result(s)
backpropagate result(s)
return best move (eg. most visits, or best winrate)
create all children
for each child:
if child_is_win: backpropagate(child, win)
else: simulate, backpropagate(child, sim_result)
(some mcts implementations use other expansion methods)
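a minimal runnable python sketch of this scheme; the game interface (legal_moves, play, winner) and the Node fields are illustrative assumptions, not part of these notes:

import random
import time
from math import log, sqrt

class Node:
    def __init__(self, state, to_move, parent=None, move=None):
        self.state, self.to_move = state, to_move
        self.parent, self.move = parent, move
        self.children, self.wins, self.visits = [], 0.0, 0

def ucb1(child, parent_visits, c=0.5):
    # unvisited children are tried first; otherwise win rate + exploration bonus
    if child.visits == 0:
        return float('inf')
    return child.wins / child.visits + c * sqrt(log(parent_visits) / child.visits)

def mcts(root_state, root_to_move, legal_moves, play, winner, seconds=1.0):
    # legal_moves(state) -> iterable of moves
    # play(state, move, player) -> (new_state, next_player)
    # winner(state) -> winning player, or None if the game is not over (no draws)
    root = Node(root_state, root_to_move)
    deadline = time.time() + seconds
    while time.time() < deadline:                              # while time remains
        node = root
        while node.children:                                   # select leaf
            node = max(node.children, key=lambda ch: ucb1(ch, node.visits))
        if winner(node.state) is not None:                     # terminal leaf
            backpropagate(node, winner(node.state))
            continue
        for m in legal_moves(node.state):                      # expand leaf: create all children
            s, p = play(node.state, m, node.to_move)
            node.children.append(Node(s, p, parent=node, move=m))
        for child in node.children:                            # simulate each child once
            backpropagate(child, simulate(child, legal_moves, play, winner))
    return max(root.children, key=lambda ch: ch.visits).move   # best move: most visits

def simulate(node, legal_moves, play, winner):
    # uniform-random playout from node's state; returns the winning player
    state, player = node.state, node.to_move
    while winner(state) is None:
        state, player = play(state, random.choice(list(legal_moves(state))), player)
    return winner(state)

def backpropagate(node, winning_player):
    # a node's wins count simulations won by the player who moved INTO the node,
    # so a parent picking the child with the best win rate picks its own best move
    while node is not None:
        node.visits += 1
        if node.parent is not None and node.parent.to_move == winning_player:
            node.wins += 1
        node = node.parent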
exploitation: best-first search, always expand best move
exploration: sometimes expand not-best move
when?
use UCB formula (Kocsis-Szepesvari)
MCTS + UCB = UCT
mcts simulations have randomness
in leaf selection, how to account for variance in win rates ? eg.
move j: 52 wins, 100 visits
move k: 529 wins, 1000 visits
are we sure that move k is better than move j ?
similar to the multi-armed bandit problem
recall: ratio selection function f(w,v) = (w + t) / (v + 2t)
one approach: balance exploitation (pick child with best win rate) with exploration (pick a less-visited child)
Kocsis/Szepesvari UCB1 priority formula
pick move j that maximizes f(w_j, v_j) + c * sqrt(ln V / v_j), where w_j, v_j are move j's win and visit counts and V is the total simulation count
for hex, go, good performance with c small, eg. .001
win rate additive factor T: 5.0, ucb multiplicative constant c: 0.5
for scaling factor j, each entry is ucb1(c, w*j, v*j, V*j) for that column's base (w,v,V):

     j    (w,v,V):  0,1,2    1,2,6    9,20,40   100,200,400
     1              .871     .973     .681      .587
     2              .833     .894     .625      .565
     4              .718     .815     .581      .548
     8              .572     .746     .548      .536
    16              .425     .689     .523      .526
    32              .299     .643     .504      .519
    64              .205     .608     .490      .514
   128              .140     .581     .479      .510
   256              .097     .560     .471      .508
   512              .068     .544     .466      .505
  1024              .048     .533     .461      .504
  2048              .034     .524     .458      .503

some examples, T = 5.0:

     c      w      v      V    ucb(c,w,v,V)
   0.0     50    100   1000    .500
   0.2    200    400   1000    .526
   0.2    100    200   1000    .537
   0.2     50    100   1000    .553
   0.2     25     50   1000    .574
   0.2      1      3   1000    .765
   0.2      0      3   1000    .688
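a small python sketch of this score; it reproduces rows of both tables above (T = 5.0):

from math import log, sqrt

def ucb1(c, w, v, V, T=5.0):
    # smoothed win rate (w+T)/(v+2T) plus exploration bonus c*sqrt(ln V / v)
    return (w + T) / (v + 2 * T) + c * sqrt(log(V) / v)

print(round(ucb1(0.5, 0, 1, 2), 3))        # .871  (first table, column 0,1,2, j = 1)
print(round(ucb1(0.2, 50, 100, 1000), 3))  # .553  (second table)
print(round(ucb1(0.2, 1, 3, 1000), 3))     # .765  (second table)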
import random

def traverse_and_expand(self, node: TreeNode):
    # traverse tree, select node to simulate
    # args: node (TreeNode): root of tree
    # returns: node (TreeNode): move from which to simulate
    while not node.is_leaf:
        node = self.best_uct(node)
    if len(node.moves) > 0 and node.sims > 0:
        node.expand_node()
        node = random.choice(node.children)
    return node
from math import log, sqrt

def best_uct(self, node: TreeNode) -> TreeNode:
    # Return the next move for traversal (leaf selection).
    #   node has an unsimulated child?  return that child
    #   otherwise?                      return child with best uct score
    # args: node (TreeNode)
    # returns: TreeNode  # child to traverse
    best_uct, best_child = None, None
    for child in node.children:
        if child.sims == 0:  # unexplored children have priority
            return child
        # calculate UCT, update if best
        mean_res = child.results / child.sims
        uct = mean_res + (self.c * sqrt(log(self.root_node.sims) / child.sims))
        if best_uct is None or uct > best_uct:
            best_uct, best_child = uct, child
    return best_child
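the two methods above assume a TreeNode with fields moves, children, sims, results and an expand_node method; a minimal sketch of such a node (treating a child's legal moves as the parent's remaining moves is a simplification that fits hex or ttt, not go):

class TreeNode:
    def __init__(self, move=None, parent=None, moves=()):
        self.move = move            # move that led to this node
        self.parent = parent
        self.moves = list(moves)    # legal moves not yet expanded here
        self.children = []
        self.sims = 0               # simulations through this node
        self.results = 0.0          # summed simulation results for this node's mover

    @property
    def is_leaf(self):
        return len(self.children) == 0

    def expand_node(self):
        # create one child per remaining legal move; a child's legal moves are
        # the parent's moves minus the move just played (ok for hex / ttt)
        for m in self.moves:
            rest = [x for x in self.moves if x != m]
            self.children.append(TreeNode(move=m, parent=self, moves=rest))
        self.moves = []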
in his Sep 2016 talk to the British Royal Arts Society, Demis Hassabis stressed the shift from hand-crafted AI to machine-learned AI. here we can see this trend; we will see it more in the architecture of AlphaGo.
problem: purely random picks (post-expansion child-select; simulation moves) ?
suggestion (simulations): biased (hand-crafted or learned) move selection
suggestion (post-expansion): use prior knowledge
problem: random simulations are noisy
suggestion: add exploration factor to child scores
problem (tree_leaf_selection_policy): strong/winning moves not found quickly
suggestion: fast heuristic/win-check
problem (tree_leaf_selection_policy): true wins/losses repeatedly searched
suggestion: back up true win/loss values, prune accordingly
problem: mcts uses a tree, but (for go, hex, ttt, …) the search space is a dag
suggestion: treat the search space as a dag (how ?)
problem: mcts is best-first, but the true state value is minimax
suggestion: add minimax behaviour to mcts (how ?)
simulations: add patterns (go: save-miai, hex: save-bridge)
simulations: collect all-moves-as-first data, use data in tree_leaf_select_policy
tree_leaf_select_policy: use bandit-arm statistics to bias selection towards children with low visits
tree_leaf_select_policy: prior knowledge: after expand, pick child based on prior knowledge (not random)
tree: use transposition table (hard), prune isomorphic states
tree: back up true-win/true-loss info
at node v, if move x yields true ptm-loss, prune x; if all moves loss-pruned, v is true ptm-loss, report that move from parent(v) to v is true ptm-win
at node v, if move x yields true ptm-win, report to parent(v) that move from parent(v) to v is true ptm-loss
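a sketch of backing up true values along these lines; the proven_win / proven_loss node fields (default False, from the viewpoint of the player to move at that node) are assumptions for illustration:

def backup_proven(node):
    # call on a node after one of its children has just been proven:
    #  - some child is a proven loss (for its mover)  -> this node is a proven win
    #  - every child is a proven win (for its mover)  -> this node is a proven loss
    while node is not None and node.children:
        if any(c.proven_loss for c in node.children):
            if node.proven_win:
                break                    # already known; nothing new to report
            node.proven_win = True       # prune: no need to search other moves here
        elif all(c.proven_win for c in node.children):
            if node.proven_loss:
                break
            node.proven_loss = True
        else:
            break                        # nothing proven at this node; stop backing up
        node = node.parent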
go, hex: prior patterns + sim'n patterns + RAVE yields a strong intermediate player (strength of python go program michi: about 6 kyu on 4 threads)
go: save-miai, hex: save-bridge (pattern diagrams omitted; @ = last simulation o-move, ! = immediate simulation x-response)
MCTS RAVE by Sylvain Gelly (MoGo team) and David Silver (then a UAlberta PhD student)
in uni-rand playouts, can think of playout as sequence of uni-random moves or as uni-random subsets of black/white cells
in the latter case, you can use info from simulations when exploring similar positions
(board diagram omitted: a position with six empty cells a b c d e f; black fills three of them in a uniform-random playout)
answer
- there are (6 choose 3) = 20 3-subsets of { a b c d e f }
- of these 3-subsets, all except { d b a } { d b c } { d e c } { d e f } win
- so black win probability = 16 / 20 = .8
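a quick python check of this count:

from itertools import combinations

losing = [{'d', 'b', 'a'}, {'d', 'b', 'c'}, {'d', 'e', 'c'}, {'d', 'e', 'f'}]
subsets = [set(s) for s in combinations('abcdef', 3)]
wins = sum(1 for s in subsets if s not in losing)
print(len(subsets), wins, wins / len(subsets))   # 20 16 0.8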
because uni-rand-playout equivalent to uni-rand-subset selection,
can think of playout as subset,
can gather data: which cells positively correlate to wins ?
use this learned knowledge in the future selection process anywhere on the path from leaf to root
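a sketch of using such all-moves-as-first (AMAF) data during leaf selection, RAVE-style; the node fields and the beta schedule are illustrative assumptions:

from math import sqrt

def update_amaf(node, playout_cells_by_colour, result):
    # after a simulation, every child of node whose move was also played
    # (by the same colour) somewhere in the playout gets an AMAF update;
    # result is 1 if that colour won the simulation, else 0
    for child in node.children:
        if child.move in playout_cells_by_colour[node.to_move]:
            child.amaf_visits += 1
            child.amaf_wins += result

def rave_value(child, k=1000):
    # blend the plain win rate with the AMAF win rate; beta starts near 1
    # (trust AMAF when real visits are few) and decays toward 0 as visits grow
    q = child.wins / child.visits if child.visits else 0.0
    q_amaf = child.amaf_wins / child.amaf_visits if child.amaf_visits else 0.0
    beta = sqrt(k / (3.0 * child.visits + k))
    return (1.0 - beta) * q + beta * q_amaf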
rave helps when simulations have significant randomness
rave helps when the child-selection policy is not strong
AlphaGo has a strong child-selection policy learned by neural nets, so does not use RAVE
after mcts expand, node's children are new to the search
instead of picking a random child, use a prior-knowledge (i.e. before-search) heuristic to select the most promising node
use neural net (policy net) and win rate for selection
use neural net (policy net) for prior knowledge
use neural net (value net) and mcts for move selection
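a sketch of selection biased by a policy-net prior, along the lines of AlphaGo's PUCT rule; the field names are assumptions:

from math import sqrt

def puct_score(child, parent_visits, c_puct=1.0):
    # mean simulation value Q plus a bonus proportional to the policy-net prior P;
    # the bonus decays as the child accumulates visits, so strong priors drive
    # early exploration while observed values dominate later
    q = child.value_sum / child.visits if child.visits else 0.0
    u = c_puct * child.prior * sqrt(parent_visits) / (1 + child.visits)
    return q + u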
program    explore   prior   rave
alphago    yes       yes     no
pachi      yes       yes     yes
fuego      no        yes     yes
mohex      yes       yes     yes