alphago


computer go

icga computer games olympiad

  • history

  • 2005 Taipei   19x19 go: Hand Talk, Go Intellect, Aya

  • 2006 Turin   19x19 go: Gnu Go, Go Intellect, Indigo

    • 9x9 go: Crazystone, Aya, Go Intellect

  • 2007 Amsterdam   19x19 go: mogo, crazystone, gnu go

  • 2008 Beijing   19x19 go: many faces of go, mogo, leela

    • 11x11 hex: wolve (ualberta), mohex (ualberta), six

2015 oct   alphago v fan hui 5-0

2016 jan   alphago nature paper

2016 mar   alphago v lee sedol 4-1

2016 nov   deep zen v cho chikun   1-2

common mcts flaw: optimistic simulations

hex position

  • in games where the winner often has a narrow line of play, eg. a sequence with unique winning move(s), randomized simulations often give the wrong result   (see the toy sketch after this list)

  • eg. above hex position, white to play, who wins ?

  • mohex, rave, 17500 simulations in tree, win rate .53, move d5

  • d5 loses, all white moves lose: black wins
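
a minimal python sketch of this effect, on a toy game rather than the hex position above: the side to move is lost, but the opponent's win needs a unique reply at each turn, so uniform random simulations report a wildly optimistic win rate. the parameters n and k are illustrative only.

  import random

  # toy "narrow line of play" position: white (to move) is lost, but black's
  # win needs the single correct reply (1 of k moves) at each of the next n
  # black turns; a uniform random simulation finds each reply with
  # probability 1/k only, so the simulated white win rate is far too high.

  def random_simulation(n=4, k=10):
      for _ in range(n):
          if random.randrange(k) != 0:   # black misses the unique reply
              return 1                   # white wins this random playout
      return 0                           # black found every reply: white loses

  trials = 100000
  winrate = sum(random_simulation() for _ in range(trials)) / trials
  print("simulated white win rate", round(winrate, 3), "  true value 0")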

end of game 3 (board diagram)

alphago evolution

deep zen go v cho chikun, game 3
storkey-clark dcnn: correctly predicted next moves (move numbers shown below were predicted correctly; '-' marks moves that were not)

    1  2  3  4  5  6  -  -  9
 - 11 12  - 14 15 16  -  - 19
20 21  - 23 24  - 26 27  -  -
30 31  - 33 34 35 36  - 38  -
40 41 42  -  - 45 46 47  - 49
 - 51 52  - 54  -  - 57  -  -
60 61  - 63 64 65 66  -  - 69
70 71  - 73 74 75 76  -  -  -
 - 81 82 83  -  -  -  -  -  -
 -  -  -  -  -  -  -  -  -  -
 - 101 102 - 104 - ...

nature paper: details

  • alphago nature paper pdf

  • go is a game of perfect information, so in principle it can be solved by traversing the game tree

  • problem: depth d ~ 150   breadth b ~ 250   search space ~ b**d = 250**150 ~ 10**360

  • number atoms in observable universe ~ 10**80

  • idea: reduce search depth   truncate tree at depth d; for the subtree rooted at s, replace the true value v*(s) with an estimate v'(s)

  • idea: reduce breadth   sample actions from a policy p(a|s), a probability distribution over moves a from s   (toy search sketch after this list)

  • this sampling is what mcts does

  • big improvement in image classification using dcnns: use many layers of neurons, each arranged in overlapping tiles, to build increasingly abstract representations of the image

  • alphago: use neural nets to evaluate positions   value net

  • alphago: use neural nets to estimate p(a|s)   policy net

  • alphago network training pipeline

    • start with 29 400 000 positions from 160 000 KGS games, train supervised learning policy net

    • result: strong policy net

    • also train a fast (weaker) policy net

    • train a value net

    • combine fast policy net and value net with mcts
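
a toy sketch of the two reductions above (depth: stop at depth d and trust v'(s); breadth: sample a few moves from p(a|s)). legal_moves, play, value_net and policy_net are hypothetical stand-ins passed in as arguments; alphago's actual search is the mcts described later, not this plain negamax.

  import random

  # hypothetical helpers passed in:
  #   legal_moves(s) -> list of moves, play(s, a) -> next state,
  #   value_net(s)   -> v'(s) from the viewpoint of the player to move,
  #   policy_net(s)  -> dict move -> p(a|s)
  def truncated_sampled_search(s, depth, legal_moves, play, value_net,
                               policy_net, width=8):
      moves = legal_moves(s)
      if depth == 0 or not moves:
          return value_net(s)                     # v'(s) replaces true v*(s)
      probs = policy_net(s)
      sampled = random.choices(moves, weights=[probs[a] for a in moves],
                               k=min(width, len(moves)))  # breadth reduction
      # negamax over the sampled moves only
      return max(-truncated_sampled_search(play(s, a), depth - 1, legal_moves,
                                           play, value_net, policy_net, width)
                 for a in set(sampled))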

input features for neural nets

  • cell: player   opponent   empty   (plane-encoding sketch after this list)

  • liberties

  • capture size

  • self-atari size

  • liberties after move

  • ladder capture

  • ladder escape

  • legal and does not fill own eye
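
a minimal sketch, assuming a position is given as a dict from (row, col) to 'x' (player to move) or 'o' (opponent), of how the first feature above becomes three 19x19 binary input planes; the remaining features (liberties, capture size, ladder flags, …) each add planes in the same way.

  N = 19

  def basic_planes(stones):
      # stones: dict (row, col) -> 'x' (player to move) or 'o' (opponent)
      player   = [[0] * N for _ in range(N)]
      opponent = [[0] * N for _ in range(N)]
      empty    = [[1] * N for _ in range(N)]
      for (r, c), colour in stones.items():
          empty[r][c] = 0
          (player if colour == 'x' else opponent)[r][c] = 1
      return [player, opponent, empty]   # stack further feature planes here

  planes = basic_planes({(3, 3): 'x', (15, 15): 'o'})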

policy nets (deep, shallow) via supervised learning

  • supervised learning

  • goal here: using human data and sl, create a policy net

    • for arbitrary state s and all possible actions a, a deep convolutional neural net that estimates f = prob(a|s)

    • input: board position s

    • output: from s, for each empty cell a, probability (from the human data) that human selects action a

  • how dcnn accuracy is measured

    • a fraction of the human game records is withheld from training and used for testing   (accuracy sketch after the table below)

  • human data: 30 000 000 game positions from 160 000 KGS games

  • improvement: when input is board position + above features, function is more accurate

  • SL policy net accuracy:

    • .557 (board input)

    • .57 (board + features input)

net       response time (10**-6 s)   response accuracy
deep      3 000                      .57
shallow   2                          .24
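
a sketch of the accuracy measurement above: compare the net's most probable move against the move the human actually played, over the held-out positions. policy here is a hypothetical function from a position to a dict move -> probability.

  def top1_accuracy(policy, held_out):
      # held_out: list of (position, human_move) pairs withheld from training
      hits = 0
      for position, human_move in held_out:
          probs = policy(position)                  # move -> probability
          predicted = max(probs, key=probs.get)     # net's most probable move
          hits += (predicted == human_move)
      return hits / len(held_out)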

strengthen policy net via reinforcement learning

  • reinforcement learning

  • observation: a move that clearly leads to a loss is not likely to be played

  • so adding information about whether move leads to loss should make dcnn more accurate

  • improvement: starting from SL policy net, train a new neural net using reinforcement learning

    • for each move, self-play game to end, find result

    • move that leads to win gets reward 1

    • move that leads to loss gets reward -1

  • RL dcnn has a similar architecture to the SL dcnn (1 extra final layer)

  • weights of RL dcnn initialized to final weights of SL dcnn

  • reinforcement learning then trains the RL dcnn to find its final weights   (toy update sketch after this list)

  • RL policy net   v   SL policy net     winrate   .8
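
a toy sketch of the reinforcement-learning update, on a 3-move one-shot game rather than go: the policy is a softmax over one weight per move, the game result z is +1 or -1, and the played move's log-probability gradient is scaled by z. the toy game, seed and learning rate are illustrative assumptions.

  import numpy as np

  rng = np.random.default_rng(0)
  w, lr = np.zeros(3), 0.1
  for _ in range(2000):
      p = np.exp(w) / np.exp(w).sum()      # softmax policy over 3 moves
      a = rng.choice(3, p=p)               # self-play: sample a move
      z = 1.0 if a == 0 else -1.0          # toy game result: only move 0 wins
      grad_log = -p
      grad_log[a] += 1.0                   # gradient of log p(a) w.r.t. w
      w += lr * z * grad_log               # reinforce wins, punish losses
  print(np.exp(w) / np.exp(w).sum())       # policy now strongly prefers move 0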

from RL policy net to RL value net

  • policy net estimates probability of making a move

  • so it ranks the children of a node, but cannot compare non-siblings

  • for tree search (eg. minimax, mcts, …) need to be able to compare non-siblings

  • need, for each node, an estimate of value: what is the probability that a given move wins ?

    • for arbitrary state s, a deep convolutional neural net estimates v(s) = prob(s is winning position)

    • input: board position s

  • problem: using 30 000 000 game positions from only 160 000 games leads to overfitting

    • value net learns to mimic move sequences in 160 000 games, but plays poorly in other games

    • solution: use 30 000 000 positions, each sampled from a distinct self-play game   (sampling sketch below)

  • RL value net will be used at leaf nodes of tree search

  • monte carlo simulations (rollouts) will also be used at leaf nodes of tree search

  • one RL value net call is comparable in accuracy to monte carlo rollouts using the RL policy net, but requires about 15 000 times less computation time
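
a sketch of the overfitting fix above: take one position per self-play game rather than every position from a small set of games. selfplay_games is a hypothetical list of games, each a list of (position, outcome) pairs with outcome +/-1.

  import random

  def value_training_set(selfplay_games):
      data = []
      for game in selfplay_games:
          s, z = random.choice(game)      # a single (position, outcome) pair
          data.append((s, z))             # value net regresses v(s) toward z
      return data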

search: combine policy net, value net, mcts

  • each edge (s,a) stores action value Q(s,a), visit count N(s,a), and prior probability P(s,a) from the policy net

  • bonus u(s,a) proportional to P(s,a) / (1 + N(s,a))

  • use argmax over a of ( Q(s_t,a) + u(s_t,a) ) for child selection   (selection/backup sketch at the end of this section)

  • at leaf, expand if number of visits exceeds a threshold

  • at leaf, evaluate using simulation and using value net (mixed ratio .5 .5 is best)

  • after simulation, backup as usual, update N(s,a), update Q(s,a)

  • when time is up, pick root child with most visits as move

  • surprise

    • SL policy net works better in MCTS than stronger RL policy net, presumably because humans select a diverse beam of promising moves to explore, whereas RL optimizes for single best move

    • RL value net (derived from RL policy net) works better in alphago than similar value net derived from SL policy net

  • evaluating policy / value nets is slow, so

    • use asynchronous multi-threaded search

    • use CPUs for simulation

    • use GPUs for policy / value net calls

  • alphago specs

    • single-machine: 40 search threads, 48 CPUs, 8 GPUs

    • distributed: 40 search threads, 1202 CPUs, 176 GPUs
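
a sketch of the selection, mixed leaf evaluation and backup rules from this section, assuming each tree node has node.edges, a dict from move to an edge object with fields Q, N, P. the exploration constant and the sign bookkeeping for alternating players are assumptions/omissions of this sketch, not values from the paper.

  C = 5.0        # exploration constant: an assumed value, not from the paper
  LAM = 0.5      # .5 / .5 mixing of rollout result and value net, as above

  def select_move(node):
      # node.edges: dict move -> edge with action value Q, count N, prior P
      def score(e):
          u = C * e.P / (1 + e.N)          # bonus u(s,a) ~ P(s,a) / (1+N(s,a))
          return e.Q + u
      return max(node.edges, key=lambda m: score(node.edges[m]))

  def leaf_value(rollout_result, value_net_estimate):
      return (1 - LAM) * value_net_estimate + LAM * rollout_result

  def backup(path_edges, v):
      # update each edge on the root-to-leaf path; Q is the running mean of
      # leaf values (perspective/sign bookkeeping omitted in this sketch)
      for e in path_edges:
          e.N += 1
          e.Q += (v - e.Q) / e.N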

simulations

  • like RAVE, but more general (used also in Crazystone and Erica)

    • each move (and est. win prob) in tree is cached

    • simulations use these cached moves/probs

  • RAVE uses these probs only for children of each node on path to root

simulation policy features

  • response   ( 1 pattern )

  • save atari   ( 1 )

  • neighbour to previous move   ( 8 )

  • nakade (inside move)   ( 8192 )

  • response pattern (12-point diamond around prev move)   ( 32207 )

  • non-response pattern (8-points around cell)   ( 69338 )   (rollout softmax sketch below)
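
a sketch of a rollout policy built from such features: each legal move scores the sum of the weights of the (binary) features it matches, and the move is sampled from a softmax of the scores. feature_weights and match_features are hypothetical stand-ins for the trained weights and the pattern matcher.

  import math, random

  def rollout_move(position, legal_moves, feature_weights, match_features):
      # match_features(position, move) -> list of feature ids the move matches
      scores = [math.exp(sum(feature_weights[f]
                             for f in match_features(position, m)))
                for m in legal_moves]
      total = sum(scores)
      return random.choices(legal_moves,
                            weights=[s / total for s in scores])[0]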

nakade example

if x group has no outside liberties, where should x / o play ?

 x x x x x
 x - - - x
 x x x x x

tree policy features

  • all simulation policy features plus …

  • self-atari   ( 1 )   move allows own stone(s) to be captured

  • last move distance   ( 34 )   manhattan distance to previous 2 moves

  • non-response pattern   ( 32207 )

alphago strength

  • tournaments with different versions, also Zen Crazystone Pachi Fuego Gnugo, 5s/move

  • no handicap   alphago win rate .998

  • 4 stone handicap   alphago win rate .77 .86 .99 1   v   C Z P F

  • distributed alphago   v   single-machine alphago, C, Z, P, F   win rate   .77 1 1 1 1

  • mixed ratio (sims, value net) .5 .5 v other variants win rate .95

    • suggests sims, value net provide complementary info

  • distributed alphago   v   Fan Hui 2p   Oct 2015   5-0 in formal games (each player 1 h + 3 x 30s byoyomi) and 3-2 in informal games (3 x 30s byoyomi)

discussion

  • during the match v Fan Hui, alphago considered thousands of times fewer positions per second than DeepBlue did v Kasparov   but …

    • picked positions more wisely, using policy net and mcts

    • evaluated positions more accurately, using value net and sims

  • DeepBlue used handcrafted eval function, alphago used neural nets trained using general-purpose supervised/reinforcement learning methods

  • alphago is trained to play chinese rules, 7.5 komi

    • changing the komi or rule-set would require retraining all its neural nets

  • alphago does not use RAVE, progressive widening, dynamic komi, or an opening book

not mentioned in paper

  • alphago time management policy found by ad hoc selfplay machine learning

  • alphago uses pondering (thinks while the opponent is thinking about their move)

  • in match v Lee Sedol

    • alphago time per move close to constant

    • LS often behind on time (compared to ag)

    • ag pondering exaggerated this effect, putting LS under more time pressure

    • LS is used to playing with little time