Report: The 541.leela_r Benchmark (June 2016 Development Kit)

Elliot Colp, J. Nelson Amaral, Caian Benedicto, Raphael Ernani Rodrigues,
Edson Borin, Dylan Ashley, Morgan Redshaw
University of Alberta

2018-01-18

1 Executive Summary

This report presents:

Important take-away points from this report:

2 Benchmark Information

The 541.leela_r benchmark is an advanced Go playing engine featuring many of the most recent advances in computer go. It has attained ratings of 1 kyu in Internet server play.

2.1 Inputs

The input for this benchmark consists of a incomplete game of go in the Smart Game Format.1

2.2 Generating Workloads

The process of generating and running workloads is, for the most part, completely automated. Place a compiled version of the 541.leela_r benchmark in the gen-input directory and call it leela_r. Following this simply cd into the gen-input directory and run make-wklds.sh.

3 New Workload

All of the generated workloads are randomly-generated sets of go positions taken from the defunct No-Name Go Server’s archive.2

Each of the workloads generated by the system contains exactly six go positions. Moves must be culled from the end of the game to ensure that the AI plays the game to its completion. The size of the board and number of moves culled vary between inputs.

The system is configured so that all input generated will fall into one of the categories in Table 1.




Name Board Size Moves Removed



short-9 9x9 5-15
medium-9 9x9 15-30
long-9 9x9 30-45
short-13 13x13 10-30
medium-13 13x13 40-60
long-13 13x13 70-90
short-19 19x19 15-40
medium-19 19x19 90-120
long-19 19x19 170-200



Table 1: Categories of randomly generated workloads.

4 Analysis

This section presents an analysis of the workloads created for the 541.leela_r benchmark. All data was produced using the Linux perf utility and represents the mean of three runs. In this section the following questions will be addressed:

  1. How much work does the benchmark do for each workload?
  2. How does the behavior of the benchmark change between inputs?
  3. How is the benchmark’s behavior affected by different compilers and optimization levels?

4.1 Work

In order to ensure a clear interpretation the analysis of the work done by the benchmark on various workloads will focus on the results obtained with GCC 4.8.4 at optimization level -O3. The execution time was measured on machines equipped with Intel Core i7-2600 processors at 3.4 GHz with 8 GiB of memory running Ubuntu 14.04 LTS on Linux Kernel 3.16.0-71.

Figure 1 shows the mean execution time of 3 runs of 541.leela_r on a certain workload. From this figure it is clear that there is an increase in execution time when the size of the board is increased. Since increasing the board size increases the number of game states this result is as would be expected.

It can also be observed that there is a relationship between the number of moves culled from the end of the game and execution time of the benchmark.

Figure 2 displays the mean instruction count and Figure 3 gives the mean clock cycles per instruction. Both means are taken from 3 runs of the corresponding workload. The most striking thing about Figure 2 is that it appears no altogether different from Figure 1. This observation is explained by Figure 3. If the CPI is relatively consistent between all workloads then the difference in execution times for the workloads should be, for the most part, explained by their instruction count. This suggests that the instruction mix is more or less independent of the input.

PIC
Figure 1: The mean execution time from three runs for each workload.
 
PIC
Figure 2: The mean instruction count from three runs for each workload.

PIC
Figure 3: The mean clock cycles per instruction from three runs for each workload.

4.2 Workload Behavior

The analysis of the workload behavior is done using two different methodologies. The first section of the analysis is done using Intel’s top down methodology.3 The second section is done by observing changes in branch and cache behavior between workloads.

To collect data GCC 4.8.4 at optimization level -O3 is used on machines equipped with Intel Core i7-2600 processors at 3.4 GHz with 8 GiB of memory running Ubuntu 14.04 LTS on Linux Kernel 3.16.0-71. All data remains the mean of three runs.

4.2.1 Intel’s Top Down Methodology

Intel’s top down methodology consists of observing the execution of micro-ops and determining where CPU cycles are spent in the pipeline. Each cycle is then placed into one of the following categories:

Front-end Bound
Cycles spent because there are not enough micro-operations being supplied by the front end.
Back-end Bound
Cycles spent because there are not enough resources available to process pending micro-operations, including slots spent waiting for memory access.
Bad Speculation
Cycles spent because speculative work was performed and resulted in an incorrect prediction.
Retiring
Cycles spent actually carrying out micro-operations.

Using this methodology the program’s execution is broken down in Figure 4. When the various workloads are compared it is clear that there are no real differences in either the percentage of micro-ops that retire or the percentage of cycles that are spent because of bad speculation.

PIC

Figure 4: Breakdown of each workload with respect to Intel’s top down methodology.

4.2.2 Branch and Cache

By looking at the behavior of branch predictions and cache hits and misses we can gain a deeper insight into the execution of the program between workloads.

Figure 5 summarizes the percentage of instructions that are branches and exactly how many of those result in a miss. It is clear that there is minimal fluctuation between different workloads.

Figure 6 summarizes the percentage of LLC accesses and exactly how many of those result in LLC misses. Here there are two important trends. The first is that a larger board appears to increase both the number of LLC accesses and the number of overall LLC misses. The second is that increasing the number of moves removed seems to lessen the penalty of a larger board in regards to both LLC access and overall LLC misses.

PIC
Figure 5: Breakdown of branch instructions in each workload.
 
PIC
Figure 6: Breakdown of LLC accesses in each workload.

4.3 Compilers

Limiting the experiments to only one compiler can endanger the validity of the research. To compensate for this a comparison of results between GCC (version 4.8.4) and the Intel ICC compiler (version 16.0.3) has been conducted. Furthermore, due to the prominence of LLVM in compiler technology a comparison between results observed through only GCC (version 4.8.4) and results observed using the Clang frontend to LLVM (version 3.6) has also been conducted.

Due to the sheer number of factors that can be compared, only those that exhibit a considerable difference have been included in this section.

4.3.1 LLVM

Figure 7 summarizes some interesting differences gained as the result of swapping GCC out with an LLVM based compiler. The first noteworthy characteristic of the results is that the only difference between GCC and LLVM appears to occurs at the optimization level -O1. As such the remainder of the analysis of LLVM will consider only the optimization level -O1.

At this optimization level LLVM takes approximately twice as long to execute. Furthermore there are considerably more branch instructions executed overall but with a lower miss rate. Furthermore LLVM has fewer page faults per instruction.

This information appears to indicate that LLVM is performing more work on the same data then GCC does. Though this cannot be confirmed without further experiments.

Figure 7: Changes in various performance measures from GCC to LLVM.
PIC (a) relative execution time   PIC (b) branch instructions PIC (c) branch miss rate   PIC (d) page faults per instruction

4.3.2 ICC

Figure ?? summarizes some interesting differences gained as the result of swapping GCC out with the Intel ICC compiler. It is clear that the only large difference between the two in regards to the execution of this benchmark lies in dTLB load misses and dTLB store misses when the optimization level is set to -O0.

Figure 9: Changes in various performance measures from GCC to LLVM.
PIC (a) dTLB load misses   PIC (b) dTLB store misses PIC (c) Legend for all the graph Figures 7

5 Conclusion

The execution of the 541.leela_r benchmark is primarily affected by the size of the board it receives as input and is only marginally affected by the number of moves culled from the end of the game. The existing train workload is an accurate model of one case of the benchmark’s execution but does not adequately represent all viable inputs to the program. The set of new workloads presented here along with the already existing workloads provide a comprehensive exercise of the benchmark.

Appendix

Figure 11: Percentage of execution time spent on all symbols which made up more than 5% of the execution time for at least one of the workloads. This figure continues onto the next page.
(a) for long-13


Symbol Time Spent (geost)


UCTNode::uct_
select_child 4.18 (1.00)


FastState::
play_random_move 12.23 (1.00)


FastBoard::
update_board_fast 15.96 (1.00)


FastBoard::
self_atari 6.65 (1.00)


FastBoard::save_
critical_neighbours 6.65 (1.00)


FastBoard::
no_eye_fill 5.36 (1.00)


FastBoard::nbr_
criticality 4.99 (1.01)


FastBoard::get_
pattern_fast_augment 5.62 (1.00)


FastBoard::add_
pattern_moves 11.94 (1.00)


(b) for long-19


Symbol Time Spent (geost)


UCTNode::uct_
select_child 3.17 (1.00)


FastState::
play_random_move 13.12 (1.00)


FastBoard::
update_board_fast 16.78 (1.00)


FastBoard::
self_atari 6.16 (1.01)


FastBoard::save_
critical_neighbours 6.92 (1.00)


FastBoard::
no_eye_fill 4.80 (1.00)


FastBoard::nbr_
criticality 5.84 (1.01)


FastBoard::get_
pattern_fast_augment 6.09 (1.01)


FastBoard::add_
pattern_moves 12.46 (1.00)


(c) for long-9



Symbol Time Spent (geost)


UCTNode::uct_
select_child 8.53 (1.00)


FastState::
play_random_move 9.96 (1.01)


FastBoard::
update_board_fast 13.39 (1.01)


FastBoard::
self_atari 7.10 (1.01)


FastBoard::save_
critical_neighbours 6.24 (1.00)


FastBoard::
no_eye_fill 4.38 (1.01)


FastBoard::nbr_
criticality 3.38 (1.01)


FastBoard::get_
pattern_fast_augment 4.86 (1.01)


FastBoard::add_
pattern_moves 10.40 (1.01)


(d) for medium-13


Symbol Time Spent (geost)


UCTNode::uct_
select_child 4.54 (1.01)


FastState::
play_random_move 11.02 (1.02)


FastBoard::
update_board_fast 15.86 (1.00)


FastBoard::
self_atari 7.00 (1.02)


FastBoard::save_
critical_neighbours 6.61 (1.01)


FastBoard::
no_eye_fill 5.57 (1.02)


FastBoard::nbr_
criticality 3.93 (1.01)


FastBoard::get_
pattern_fast_augment 5.30 (1.01)


FastBoard::add_
pattern_moves 11.18 (1.01)


(e) for medium-19



Symbol Time Spent (geost)


UCTNode::uct_
select_child 3.48 (1.01)


FastState::
play_random_move 12.28 (1.01)


FastBoard::
update_board_fast 17.01 (1.00)


FastBoard::
self_atari 6.46 (1.00)


FastBoard::save_
critical_neighbours 6.72 (1.00)


FastBoard::
no_eye_fill 5.10 (1.00)


FastBoard::nbr_
criticality 4.87 (1.01)


FastBoard::get_
pattern_fast_augment 5.75 (1.00)


FastBoard::add_
pattern_moves 11.78 (1.01)


(f) for medium-9


Symbol Time Spent (geost)


UCTNode::uct_
select_child 8.24 (1.01)


FastState::
play_random_move 9.93 (1.02)


FastBoard::
update_board_fast 13.49 (1.01)


FastBoard::
self_atari 6.79 (1.01)


FastBoard::save_
critical_neighbours 6.06 (1.01)


FastBoard::
no_eye_fill 4.59 (1.00)


FastBoard::nbr_
criticality 3.47 (1.01)


FastBoard::get_
pattern_fast_augment 4.88 (1.01)


FastBoard::add_
pattern_moves 10.38 (1.01)


(g) for refrate



Symbol Time Spent (geost)


UCTNode::uct_
select_child 4.52 (1.01)


FastState::
play_random_move 11.61 (1.02)


FastBoard::
update_board_fast 15.97 (1.00)


FastBoard::
self_atari 6.77 (1.01)


FastBoard::save_
critical_neighbours 6.72 (1.00)


FastBoard::
no_eye_fill 5.10 (1.00)


FastBoard::nbr_
criticality 4.62 (1.01)


FastBoard::get_
pattern_fast_augment 5.46 (1.01)


FastBoard::add_
pattern_moves 11.46 (1.00)


(h) for short-13


Symbol Time Spent (geost)


UCTNode::uct_
select_child 4.48 (1.01)


FastState::
play_random_move 11.64 (1.02)


FastBoard::
update_board_fast 15.93 (1.00)


FastBoard::
self_atari 6.46 (1.01)


FastBoard::save_
critical_neighbours 6.46 (1.01)


FastBoard::
no_eye_fill 5.59 (1.01)


FastBoard::nbr_
criticality 4.53 (1.00)


FastBoard::get_
pattern_fast_augment 5.57 (1.01)


FastBoard::add_
pattern_moves 11.57 (1.00)


(i) for short-19



Symbol Time Spent (geost)


UCTNode::uct_
select_child 3.68 (1.01)


FastState::
play_random_move 11.78 (1.02)


FastBoard::
update_board_fast 16.48 (1.00)


FastBoard::
self_atari 6.22 (1.01)


FastBoard::save_
critical_neighbours 6.51 (1.00)


FastBoard::
no_eye_fill 6.07 (1.01)


FastBoard::nbr_
criticality 4.49 (1.01)


FastBoard::get_
pattern_fast_augment 5.58 (1.00)


FastBoard::add_
pattern_moves 11.25 (1.00)


(j) for short-9


Symbol Time Spent (geost)


UCTNode::uct_
select_child 7.47 (1.05)


FastState::
play_random_move 8.37 (1.02)


FastBoard::
update_board_fast 12.05 (1.03)


FastBoard::
self_atari 6.07 (1.04)


FastBoard::save_
critical_neighbours 4.81 (1.03)


FastBoard::
no_eye_fill 6.03 (1.06)


FastBoard::nbr_
criticality 2.30 (1.07)


FastBoard::get_
pattern_fast_augment 4.39 (1.04)


FastBoard::add_
pattern_moves 8.57 (1.02)


(k) for test


Symbol Time Spent (geost)


UCTNode::uct_
select_child 3.49 (1.03)


FastState::
play_random_move 10.82 (1.02)


FastBoard::
update_board_fast 16.14 (1.02)


FastBoard::
self_atari 6.18 (1.03)


FastBoard::save_
critical_neighbours 6.49 (1.02)


FastBoard::
no_eye_fill 5.28 (1.02)


FastBoard::nbr_
criticality 3.75 (1.04)


FastBoard::get_
pattern_fast_augment 5.04 (1.02)


FastBoard::add_
pattern_moves 10.96 (1.02)


(l) for train


Symbol Time Spent (geost)


UCTNode::uct_
select_child 3.33 (1.01)


FastState::
play_random_move 13.13 (1.01)


FastBoard::
update_board_fast 16.74 (1.01)


FastBoard::
self_atari 6.11 (1.01)


FastBoard::save_
critical_neighbours 6.83 (1.01)


FastBoard::
no_eye_fill 4.58 (1.01)


FastBoard::nbr_
criticality 5.54 (1.01)


FastBoard::get_
pattern_fast_augment 6.07 (1.01)


FastBoard::add_
pattern_moves 12.44 (1.00)


1http://www.red-bean.com/sgf/

2https://github.com/zenon/NNGS_SGF_Archive

3More information can be found in B.3.2 of the Intel 64 and IA-32 Architectures Optimization Reference Manual .