2006 AAAI Computer Poker Competition
This competition placed a strong emphasis on statistical significance.
In sports and games involving human teams, you can know who won a game, but it is difficult to say how likely they were to win it. Thus, although competitions such as RoboCup and the DARPA Grand Challenge are useful in making robotics and artificial intelligence more fun and accessible, there is an unquantifiable "luck" factor that makes it difficult to say what would happen if you ran the same competition again.
In designing this competition, we attempted to make the match both exciting (from an outsider's perspective) and significant (from a scientific perspective). These goals conflict because of the randomness in poker: in a quick, exciting match, the outcome will almost surely be determined by who gets the best cards, i.e. "luck".
However, there is more to playing poker than just luck, and a carefully designed experiment can separate the two. The most obvious first step is "lots of hands", but this alone is not an answer. If we just had two bots play 40,000 hands against one another, we could not guarantee the same outcome if they played again, regardless of how decisively one beat the other. For instance, VexBot, a player designed at the University of Alberta, can sometimes find holes in the play of opponents. However, in its present state of design, it can learn the wrong thing, and then the match is lost. Thus, 40,000 sequential hands may have just as much randomness as 1 hand: in the latter case it is mostly the luck of the cards; in the former it is the "luck" in the learning process.
Therefore, we played a series of matches, each match containing 1000 hands, and we reset the bots after every match. From a traditional sports perspective, this is a magical thing: if your star goalie is injured in the first game, he is magically healed before the second game. If you figured out how to beat the opponent's defense during the first game, you magically forget before the second game. In the world of computers, it is easy: you simply reset the machine back to its state at the beginning of the match. Barring a hardware problem, this is sufficient.
To further reduce the luck factor, we played duplicate match pairs. In other words, after a match is played, the bots are reset, and they play the same match again with the same cards but from opposite sides: if your opponent had a straight flush in one match, then you will have it in the other. Note that duplicate match pairs can only be played if bots are reset between matches.
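
The effect of duplicate match pairs can be illustrated with a small sketch. This is a toy model, not the competition's dealer software: the function names, the noise sizes, and the 0.05 small bets/hand edge for Bot A are all hypothetical. Card luck is drawn from a seeded generator, so replaying the same seed with the seats swapped hands the same luck to the other bot, while each bot's own randomness differs between the two matches.

```python
import random
import statistics

def play_match(seed, a_seat, hands=1000, edge=0.05):
    """Toy match: return Bot A's winnings in small bets/hand (hypothetical model)."""
    cards = random.Random(seed)                  # same seed -> same cards
    choices = random.Random(f"{seed}-{a_seat}")  # bot randomness differs per match
    total = 0.0
    for _ in range(hands):
        seat0_luck = cards.gauss(0.0, 6.0)       # card luck, from seat 0's view
        a_luck = seat0_luck if a_seat == 0 else -seat0_luck
        total += a_luck + choices.gauss(0.0, 1.0) + edge
    return total / hands

def duplicate_pair(seed, hands=1000, edge=0.05):
    """Play the same cards twice with the seats swapped and average the results."""
    first = play_match(seed, 0, hands, edge)
    second = play_match(seed, 1, hands, edge)
    return (first + second) / 2

singles = [play_match(s, 0) for s in range(30)]
pairs = [duplicate_pair(s) for s in range(30)]
print("spread of single matches: ", statistics.stdev(singles))
print("spread of duplicate pairs:", statistics.stdev(pairs))
```

In this toy model the card luck cancels exactly within a pair, which is more than real poker allows, since real bots play the same cards differently. The remaining spread comes only from the bots' own randomness.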
Thus, in the end we are asking, "What is the amount (in small bets/hand) that Bot A wins from Bot B in a duplicate match pair of 2000 hands?" This is a random variable, because of the randomness of the cards, the bots' decisions, et cetera. Most scientific experiments involve the study of a random variable. In this instance, we are interested in its expected value, the quantity most often studied.
One way to think about the expected value is to imagine playing Bot A and Bot B over and over again forever. After every duplicate match pair, we take the average small bets/hand that Bot A has won over all pairs played so far. This would be very close to the expected value if we waited a very long time, and for now this can serve as our understanding of the expected value.
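
As a rough illustration of this idea, the sketch below tracks the running average over simulated duplicate match pairs. The results are drawn from a normal distribution with an assumed expected value of 0.05 and an assumed spread of 0.2 small bets/hand; both numbers and the function name are hypothetical, not competition data.

```python
import random

def running_averages(samples):
    """Return the average of the first n samples, for every n."""
    averages, total = [], 0.0
    for n, x in enumerate(samples, start=1):
        total += x
        averages.append(total / n)
    return averages

rng = random.Random(2006)
TRUE_MEAN = 0.05  # assumed expected value, small bets/hand
pair_results = [rng.gauss(TRUE_MEAN, 0.2) for _ in range(10_000)]

avgs = running_averages(pair_results)
for n in (1, 20, 1000, 10_000):
    print(n, round(avgs[n - 1], 4))
```

The early averages can be far from the assumed expected value, while the later ones settle close to it.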
Thus, if the expected value of the amount Bot A wins from Bot B is positive, Bot A will beat Bot B in a really long series, unless the bots are very close in performance.
The law of large numbers states that this average will eventually converge to the expected value, and the central limit theorem describes how quickly. How well it applies depends on how many samples you take (i.e. how many duplicate match pairs we play) and how "random" the random variable is. As a rule of thumb, after twenty samples or so, the standard deviation can be used to characterize the randomness of the random variable. In particular, over your lifetime, out of every 20 experiments you perform, in only about one will the expected value be more than two standard deviations (of the measured average) away from what you measure the average to be. Thus, an active scientist performing thousands of experiments will actually see dozens of unlikely outcomes.
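
The rule of thumb can be sketched as follows, assuming that the "standard deviation" of a measured average means the sample standard deviation divided by the square root of the number of samples (often called the standard error). The 20 values below are hypothetical duplicate-pair results in small bets/hand, not competition data, and `summarize` is an illustrative name.

```python
import math
import statistics

def summarize(pair_results):
    """Return (mean, standard error, two-standard-error interval)."""
    n = len(pair_results)
    mean = statistics.mean(pair_results)
    sd = statistics.stdev(pair_results)  # spread of individual pair results
    se = sd / math.sqrt(n)               # spread of the measured average
    return mean, se, (mean - 2 * se, mean + 2 * se)

# Hypothetical results of 20 duplicate match pairs (small bets/hand).
results = [0.12, -0.05, 0.08, 0.15, 0.02, -0.01, 0.09, 0.11, 0.04, 0.07,
           0.13, -0.03, 0.06, 0.10, 0.05, 0.01, 0.14, 0.03, 0.08, 0.00]

mean, se, (low, high) = summarize(results)
print(f"average = {mean:.3f} +/- {se:.3f} small bets/hand")
print(f"two-standard-deviation range: {low:.3f} to {high:.3f}")
```

If the whole range lies above zero, the measured win is unlikely to be luck alone, which is the sense in which a series is called significant here.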
In the two competitions, we computed the standard deviations of the results of each series. Because the bankroll competition contained 20 duplicate match pairs, the standard deviation is a good estimate of the randomness. On the other hand, because the series competition was much smaller, the standard deviation is not as good an estimate of the confidence. Thus, here we will show the actual values of each of the duplicate match pairs.
In the Bankroll Competition, there was a time limit of 7 seconds for a program to make each of its plays, and each program played 40,000 hands against each of the other programs in a round-robin fashion. The player with the highest total bankroll was declared the winner. Hyperborean, BluffBot, Monash BPP, and Teddy competed. The medalists were:
The units were "small bets/hand". To put this in perspective, always folding loses 0.75 small bets/hand. Each individual series is summarized below. The number is the amount (in terms of small bets/hand) the row player won from the column player. Green (positive) indicates a series where the row player won money. Red (negative) indicates a series where the column player won money. The number after the ± indicates the standard deviation. Note that these standard deviations are small with respect to the differences above.
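
The 0.75 figure for always folding can be checked with quick arithmetic. The sketch assumes the usual heads-up limit blind structure, which is not stated above: a small blind of half a small bet and a big blind of one small bet, with the two players alternating blinds every hand.

```python
SMALL_BLIND = 0.5  # assumed, in small bets
BIG_BLIND = 1.0    # assumed, in small bets

def always_fold_loss(small_blind=SMALL_BLIND, big_blind=BIG_BLIND):
    """Average loss per hand for a player who folds every hand.

    The player forfeits whichever blind was posted, and the blinds
    alternate, so the average is the mean of the two blinds.
    """
    return (small_blind + big_blind) / 2

print(always_fold_loss())  # 0.75
```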
Note that Hyperborean's margin of victory over BluffBot has very little to do with its performance against BluffBot directly. The majority of the difference is in how well Hyperborean and BluffBot played against Monash BPP and Teddy.
Hyperborean (University of Alberta, Canada). Record: 3 wins, 0 losses.
BluffBot (USA). Record: 2 wins, 1 loss.
GS2 (Carnegie Mellon University, USA). Record: 1 win, 2 losses.
| | Hyperborean | BluffBot | GS2 | Monash BPP |
| Hyperborean | - | Hyperborean wins | Hyperborean wins | Hyperborean wins |
| BluffBot | Hyperborean wins | - | BluffBot wins | BluffBot wins |
| GS2 | Hyperborean wins | BluffBot wins | - | GS2 wins |
| Monash BPP | Hyperborean wins | BluffBot wins | GS2 wins | - |
Copyright © 2006