Report: The 519.lbm_r Benchmark (June 2016 Development Kit)

Elliot Colp, J. Nelson Amaral, Caian Benedicto, Raphael Ernani Rodrigues,
Edson Borin, Dylan Ashley, Morgan Redshaw
University of Alberta

2018-01-18

1 Executive Summary

This report presents:

Important take-away points from this report:

2 Benchmark Information

The 519.lbm_r benchmark simulates incompressible fluids in 3D using the Lattice Boltzmann Method. It emulates the most computationally important part of a larger program used in materials science to simulate the behavior of fluids with free surfaces, in particular the formation and movement of gas bubbles in metal foams. In the interest of benchmarking and to enhance portability, the program makes extensive use of macros to hide the details of data access.

2.1 Inputs

The input to this benchmark consists of an obstacle file along with a number of other command line arguments. The obstacle file is an ASCII representation of the objects that occupy the channel the fluids flow through. The command line arguments are as follows:

time step
the number of time steps to simulate
result file
the file that the output should be placed in
0,1,2
what to do with the results; 0 does nothing, 1 compares the results with the expected output, and 2 stores the results, overwriting if necessary
0,1
the type of simulation setup used; 0 uses a lid-driven cavity and 1 uses channel flow
obstacle file
the file that contains the ASCII representation of the channel the fluids flow through
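As a concrete illustration of the argument order described above, the command line can be sketched as a small Python helper. The helper and file names are hypothetical, not part of the benchmark itself.

```python
# Hypothetical helper illustrating the 519.lbm_r argument order described
# above; the file names are invented for the example.
def lbm_args(time_steps, result_file, action, sim_type, obstacle_file):
    """Build the benchmark's argument list in the documented order."""
    if action not in (0, 1, 2):
        raise ValueError("action must be 0 (nothing), 1 (compare), or 2 (store)")
    if sim_type not in (0, 1):
        raise ValueError("sim_type must be 0 (lid-driven cavity) or 1 (channel flow)")
    return [str(time_steps), result_file, str(action), str(sim_type), obstacle_file]

print(lbm_args(3000, "result.dat", 0, 1, "obstacles.of"))
# → ['3000', 'result.dat', '0', '1', 'obstacles.of']
```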

2.2 Generating Workloads

The workloads for this benchmark are pre-generated to exercise all features of the benchmark.

3 New Workload

A number of different workloads were generated based on possible inputs that the benchmark can receive. Each workload has the following properties associated with it:

object shape
the shape of the objects that exist in the channel
object density
how many objects exist in the channel that the fluid flows through
object size
the size of the objects that occupy the channel
simulation type
whether lid-driven cavity or channel flow is to be used
time step
the number of time steps to simulate

The workloads are named according to the properties listed above, in the order in which the properties appear in each workload’s name. Note that workloads whose names begin with “square” use cubic objects instead of spherical objects.

4 Analysis

This section presents an analysis of the workloads created for the 519.lbm_r benchmark. All data was produced using the Linux perf utility and represents the mean of three runs. In this section the following questions will be addressed:

  1. How much work does the benchmark do for each workload?
  2. How does the behavior of the benchmark change between inputs?
  3. How is the benchmark’s behavior affected by different compilers and optimization levels?

4.1 Work

In order to ensure a clear interpretation, the analysis of the work done by the benchmark on various workloads will focus on the results obtained with GCC 4.8.2 at optimization level -O3. The execution time was measured on machines equipped with Intel Core i7-2600 processors at 3.4 GHz with 8 GiB of memory running Ubuntu 14.04.1 LTS on Linux Kernel 3.16.0-71-generic.

Figure 1 shows the mean execution time of three runs of 519.lbm_r on each workload. It is clear that the only parameter that significantly alters the execution time of 519.lbm_r is the value of time step. Furthermore, train behaves like the workloads with a small time step value, while refrate behaves like those with a large one.

Figure 2 displays the mean instruction count and Figure 3 gives the mean clock cycles per instruction (CPI); both means are taken over three runs of the corresponding workload. Despite the similar execution times, Figure 2 shows differences in the number of instructions executed between workloads that do not seem to be explained by the value of time step. Figure 3 further emphasizes this difference. An important takeaway from both figures is that train and refrate do not deviate noticeably from the other workloads in either instruction count or CPI.
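The three quantities in Figures 1 through 3 are linked by the standard identity: execution time equals instruction count times CPI divided by clock frequency. A small sketch with invented counter values (the counts below are illustrative, not measurements from this study):

```python
# Illustrative only: invented counter values showing how execution time,
# instruction count, and CPI relate (time = instructions * CPI / frequency).
def cpi(instructions, cycles):
    return cycles / instructions

def exec_time_seconds(instructions, cycles, clock_hz):
    # instructions * CPI / frequency simplifies to cycles / frequency
    return instructions * cpi(instructions, cycles) / clock_hz

cycles = 8.5e11        # hypothetical cycle count
instructions = 1.0e12  # hypothetical retired-instruction count
clock_hz = 3.4e9       # 3.4 GHz, as on the Core i7-2600 used in this study

print(round(cpi(instructions, cycles), 3))                          # → 0.85
print(round(exec_time_seconds(instructions, cycles, clock_hz), 1))  # → 250.0
```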

Figure 1: The mean execution time from three runs for each workload.
 
Figure 2: The mean instruction count from three runs for each workload.

Figure 3: The mean clock cycles per instruction from three runs for each workload.

4.2 Coverage

This section analyzes which parts of the benchmark are exercised by each of the workloads. To this end, we determined the percentage of execution time the benchmark spends in several of the most time-consuming functions. This data was recorded on a machine equipped with an Intel Core i7-2600 processor at 3.4 GHz with 8 GiB of memory running Ubuntu 14.04.1 LTS on Linux Kernel 3.16.0-71-generic.

It is clear from both Figure 4a and Figure 4b that, for all workloads, the majority of the program’s execution time is spent in the LBM_performStreamCollideTRT function.

Figure 4: Breakdown of dominant functions for 519.lbm_r.
(a) Percentage of execution time. (b) Execution time.

4.3 Workload Behavior

The analysis of the workload behavior is done using two different methodologies. The first section of the analysis is done using Intel’s top down methodology.1 The second section is done by observing changes in branch and cache behavior between workloads.

To collect data, GCC 4.8.2 at optimization level -O3 was used on machines equipped with Intel Core i7-2600 processors at 3.4 GHz with 8 GiB of memory running Ubuntu 14.04.1 LTS on Linux Kernel 3.16.0-71-generic. All data remains the mean of three runs.

4.3.1 Intel’s Top Down Methodology

Intel’s top down methodology consists of observing the execution of micro-ops and determining where CPU cycles are spent in the pipeline. Each cycle is then placed into one of the following categories:

Front-end Bound
Cycles spent because there are not enough micro-operations being supplied by the front end.
Back-end Bound
Cycles spent because there are not enough resources available to process pending micro-operations, including slots spent waiting for memory access.
Bad Speculation
Cycles spent because speculative work was performed and resulted in an incorrect prediction.
Retiring
Cycles spent actually carrying out micro-operations.
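The four categories above can be computed as fractions of the machine’s total issue slots. The sketch below follows the general shape of the level-1 formulas in the Intel optimization manual referenced in the footnote, assuming four issue slots per cycle; the parameter names and counter values are illustrative, not measured events from this study.

```python
# Sketch of a level-1 top-down breakdown, assuming 4 issue slots per cycle.
# Parameter names and the counter values below are illustrative.
def top_down(slots, uops_issued, uops_retired, recovery_cycles, fetch_bubbles):
    frontend_bound = fetch_bubbles / slots
    # Issued-but-not-retired uops plus recovery time charge bad speculation.
    bad_speculation = (uops_issued - uops_retired + 4 * recovery_cycles) / slots
    retiring = uops_retired / slots
    # Whatever slots remain are attributed to the back end.
    backend_bound = 1.0 - frontend_bound - bad_speculation - retiring
    return frontend_bound, bad_speculation, retiring, backend_bound

cycles = 1_000_000
fractions = top_down(slots=4 * cycles,
                     uops_issued=3_100_000,
                     uops_retired=3_000_000,
                     recovery_cycles=25_000,
                     fetch_bubbles=400_000)
print([round(f, 3) for f in fractions])  # → [0.1, 0.05, 0.75, 0.1]
```

By construction the four fractions sum to one, which is why the per-workload breakdowns in Figure 6 can be stacked to 100%.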

Using this methodology, the program’s execution is broken down in Figure 6. While the top-down analysis does suggest a trend in which object density causes some change in execution behavior, the effect remains small.


Figure 6: Breakdown of each workload with respect to Intel’s top down methodology.

4.3.2 Branch and Cache

By looking at the behavior of branch predictions and cache hits and misses, we can gain deeper insight into how the program’s execution differs between workloads.

Figure 7 summarizes the percentage of instructions that are branches and how many of those branches were mispredicted. There does appear to be a small workload-dependent trend in the number of branches, but it remains minor.

Figure 8 summarizes the rate of LLC accesses and how many of those accesses resulted in LLC misses. Here there is a considerably larger trend: the number of LLC accesses and LLC misses depends on the shape, density, and size of the objects.
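The rates plotted in Figures 7 and 8 can be derived from raw event counts of the kind perf reports. A minimal sketch, with all counter values invented for the example:

```python
# Illustrative derivation of the rates in Figures 7 and 8 from raw event
# counts (all values invented for the example, not measured data).
def pct(part, whole):
    return 100.0 * part / whole

instructions  = 2_000_000_000
branches      =   100_000_000
branch_misses =     1_000_000
llc_accesses  =    40_000_000
llc_misses    =    10_000_000

print(pct(branches, instructions))   # → 5.0   (% of instructions that branch)
print(pct(branch_misses, branches))  # → 1.0   (% of branches mispredicted)
print(pct(llc_misses, llc_accesses)) # → 25.0  (% of LLC accesses that miss)
```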

Note that the train and refrate workloads do not appear to deviate from any perceivable norm in the aforementioned figures.

Figure 7: Breakdown of branch instructions in each workload.
 
Figure 8: Breakdown of LLC accesses in each workload.

4.4 Compilers

Limiting the experiments to only one compiler can endanger the validity of the research. To compensate, a comparison of results between GCC (version 4.8.2) and the Intel ICC compiler (version 16.0.3) has been conducted. Furthermore, due to the prominence of LLVM in compiler technology, a comparison between results observed with GCC (version 4.8.2) and results observed using the Clang frontend to LLVM (version 3.6.2) has also been conducted.

Due to the sheer number of factors that can be compared, only those that exhibit a considerable difference have been included in this section.

4.4.1 LLVM

Figure 9 summarizes some notable differences that result from replacing GCC with an LLVM-based compiler. The most prominent differences between code compiled with LLVM and with GCC are in the number of front-end bound execution slots and the number of bad-speculation execution slots.

Figure 9a shows that, at optimization level -O0, LLVM produces considerably more bad-speculation execution slots than GCC.

Figure 9b shows that, at all optimization levels except -O0 and -O3, LLVM produces considerably more front-end bound execution slots than GCC.

Figure 9: Changes in various performance measures from GCC to LLVM.
(a) Bad Speculation. (b) Front-end Bound.

4.4.2 ICC

Figure 11 summarizes some notable differences that result from replacing GCC with the Intel ICC compiler. The most prominent differences between code compiled with ICC and with GCC appear to be in the number of front-end bound execution slots and the number of bad-speculation execution slots.

Figure 11a shows that, at optimization level -O0, ICC produces considerably more bad-speculation execution slots than GCC.

Figure 11b shows that, at all optimization levels except -O0 and -O3, ICC produces considerably more front-end bound execution slots than GCC.

Figure 11: Changes in various performance measures from GCC to ICC.
(a) Bad Speculation. (b) Front-end Bound. (c) Legend for all the graphs in Figures 9 and 11.

5 Conclusion

Apart from execution time, the behavior of 519.lbm_r does not differ significantly between workloads. The execution time of 519.lbm_r is almost entirely explained by the number of time steps it is asked to simulate.

The additional workloads that were created serve to provide a larger set of inputs. The train and refrate workloads behave similarly to the norm observed across all workloads.

6 Limitation of the Generated Inputs

The only shapes that were used in the input were cubes and spheres. It is unknown if more unusual shapes would induce a different behavior for the benchmark.

1 More information can be found in Section B.3.2 of the Intel 64 and IA-32 Architectures Optimization Reference Manual.