Report: The 510.parest_r Benchmark (June 2016 Development Kit)

Elliot Colp, J. Nelson Amaral, Caian Benedicto, Raphael Ernani Rodrigues, Edson Borin, Erick Ochoa
University of Alberta

January 18, 2018

1 Executive Summary

This report presents:

Information about generating workloads for the benchmark.
New workloads for the benchmark, as well as an analysis of these workloads.

Important take-away points from this report:

The benchmark’s behaviour is extremely different from workload to workload.
To truly assess how the benchmark interacts with the processor, it should be run with a large collection of workloads. More workloads should be generated than the ones presented in this report.
The benchmark’s behaviour is nearly identical between the train and refrate workloads. Thus feedback-directed optimization (FDO) that is evaluated only with the train and refrate may report over-optimistic performance prediction for FDO.

2 Inputs

See benchmark description for a description of the input format.

2.1 Generating Workloads

To generate new workloads, one can modify the parameters’ values found on the file with extension .prm. The source code is well documented and can be used to find which values are accepted by the parameters. For example: it is not obvious what type of values the parameter Refinement criterion will accept. However, searching for Refinement criterion in the source code will yield a list of strings that are valid values for the Refinement criterion parameter.

3 New Workloads

This report describes a script that randomly generates valid .prm files. This script enabled an initial exploration of the input space available for 510.parest_r. This exploration revealed that some parameter values significantly change the behaviour of the benchmark’s execution. Here is a list with the new workloads’ names and the values that are different from refrate:

SD_FE_cubic: Sets Finite element in Synthetic Data to cubic.
Reduces Number of experiments to 8.
Reduces Maximal number of iterations to 2
SD_FE_quadratic: Sets Finite element in Synthetic Data to quadratic.
Reduces Maximal number of iterations to 2
TS_FE_quadratic: Sets Finite element in Tomography Slave to quadratic.
Reduces Number of experiments to 4.
Reduces Maximal number of iterations to 2.
misfit: Sets Misfit functional to ratio over ratio minus one.
noise: Sets Noise level to 0.5.
Sets Noise type to additive.

Beware that some parameters’ values produce an error while running the 510.parest_r. We have removed the parameters’ values that appear to consistently break 510.parest_r. However other values or combination of values may still result in an error.

The script and its documentation can be found in the main page for the Alberta Workloads for the SPEC CPU2017 Benchmark Suite.

4 Analysis

This section presents an analysis of the workloads for the 510.parest_r benchmark and their effects on its performance. The data presented this section was produced with the Linux perf utility. Results presented are the mean value from three runs. This section answers the following questions:

How much work does the benchmark do for each workload?
Which parts of the benchmark code are covered by each workload?
How does each workload affect the benchmark’s behaviour?
How is the benchmark’s behaviour affected by different compilers and optimization levels?

4.1 Work

The simplest metric by which we can compare the various workloads is their execution time. To this end, the benchmark was compiled with GCC 4.8.4 at optimization level -O3, and the execution time was measured for each workload on machines with Intel Core i7-2600 processors at 3.4 GHz and 8 GiB of memory running on Ubuntu 14.04 LTS on Linux Kernel 3.16.0-71-generic. Figure 1 shows the mean execution time for each of the workloads.

Figure 1 shows the mean execution time for each benchmark. The new workloads were designed to match the execution time of refrate. However, some parameters drastically change the execution time of the program, such as the parameters Noise type and Noise level.

Figure 2 shows the number of instructions executed for each benchmark. From this graph, we can conclude that those benchmarks which take the longest to run, also execute the most instructions.

Figure 3 shows the ratio of cycles per instruction for each benchmark. The low count of cycles suggests that the benchmark is performing useful work most of the time and that the GCC was able to parallelize the work.

Figure 1: The mean execution time from three runs for each workload.

Figure 2: The mean instruction count from three runs for each workload.

Figure 3: The mean clock cycles per instruction from three runs for each workload.

4.2 Coverage

The impact of the changes on the run time can be better understood with some background on finite element analysis: “The finite element method (FEM) is a numerical technique for finding approximate solutions to boundary value problems for partial differential equations. [ …] FEM subdivides a larger problem into smaller, simpler, parts, called finite elements.” ¹

This section analyzes which parts of the benchmark are exercised by each of the workloads. To this end, we determined the percentage of execution time the benchmark spends on several of the most time-consuming functions.

Figure 4 shows the functions that made up more than 5% of the execution time for at least one of the workloads. In this figure, we can see that for SD_FE_cubic and SD_FE_quadratic a large percentage of their execution time is spent on function void METomography::ForwardSolver::block_build_matrix_and_rhs_threaded.The previous function is also executed by other workloads, but not to the same extent.

Figure 4 shows that for SD_FE_quadratic, SD_FE_cubic and TS_FE_quadratic execution of the function dealii::FESystem::initialize is negligible. However the same function executes for 6.92 ± 1.00 % of the time when the misfit is used.

Figure 4: The mean (of 3 runs) percentage of execution time spent on each of the functions listed in Figure 13.

Figure 5: The mean (of 3 runs) execution time in seconds spent on each of the functions listed in Figure 13.

4.3 Workload Behaviour

In this section, we will first use the Intel “Top-Down” methodology²to see where the processor is spending its time while running the benchmark. This methodology breaks the benchmark’s total execution slots into four high-level classifications:

Front-end bound: Cycles spent because there are not enough micro-operations being supplied by the front end.
Back-end bound: Cycles spent because there are not enough resources available to process pending micro-operations, including slots spent waiting for memory access.
Bad speculation: Cycles spent because speculative work was performed due to an incorrect prediction (usually from branch misses).
Retiring: Cycles spent actually carrying out micro-operations.

The event counters used in the formulas to calculate each of these values can be inaccurate. Therefore the information should be considered with some caution. Nonetheless, they provide a broad, generally reasonable overview of the benchmark’s performance.

Figure 6 shows the distribution of execution slots spent on each of the aforementioned Top-Down categories. The workloads spend most of their time retiring micro-operations. This means that the binary is performing useful work most of the time. When the binary is not performing useful work it is waiting for resources. This means that 510.parest_r is a good benchmark for stressing processor resources.

Figure 6: The distribution of execution slots spent in each category according to Intel’s Top-Down methodology.

We now analyze cache and branching behaviour more closely. Figures 7 and 8 show, respectively, the rate of branching instructions and last-level cache (LLC) accesses from the same runs as Figure 6. As the benchmark executes a vastly different number of instructions for each workload, which makes comparison between benchmarks difficult, the values shown are rates per instruction executed.

Figure 7 shows that all workloads have a similar number of branches per instruction. It is also interesting to note that our analyses on other benchmarks show less branching per instruction than 510.parest_r. From this graph, we can infer that the main body of 510.parest_r has a 1 to 10 ratio of branches to instructions.

Figure 8 shows the last-level cache accesses and misses per workload. There is no pattern immediately visible and different workloads recorded different results.

Figure 7: The mean (of 3 runs) rate of branching instructions and branch misses for each workload. The number in each bar represents the mean branch miss rate (computed as number of branch misses divided by number of branch instructions).

Figure 8: The mean (of 3 runs) rate of last-level cache (LLC) references and misses per instruction for each workload. The number in each bar represents the mean cache miss rate (computed as number of cache misses divided by number of cache references).

4.4 Compilers

This section compares the performance of the benchmark when compiled with either GCC (version 4.8.4); the Clang frontend to LLVM (version 3.6); or the Intel ICC compiler (version 16.0.3). To this end, the benchmark was compiled with all three compilers at six optimization levels (-O0, -O1, -O2, -O3, -Ofast, -Os), resulting in 18 different executables. Each workload was then run 3 times with each executable on the Intel Core i7-2600 machines described earlier.

4.4.1 LLVM

First, we compare GCC to LLVM. There are a few notable differences in their behaviour, the most prominent of which are highlighted in Figure 9.

Figure 10a shows that the execution times are generally similar. However, for the execution time at optimization level -O1, LLVM binaries underperform those generated by GCC. Another interesting result is that the workload SD_FE_cubic most LLVM binaries took longer to execute than GCC’s binaries.

Figure 10b demonstrates that many of the spikes and dips seen in Figure 10a can be explained by the number of instructions executed. Normally, a huge spike in instructions executed means speculative execution took place and performed poorly. However, Figure 12e shows that to be the opposite case. One assumption that can explain why LLVM underperforms GCC for -O1 is that, maybe, LLVM does not perform an optimization that GCC performs at this optimization level.

Another interesting piece of data found in Figure 12e is that LLVM binaries compiled with optimization levels -O2, -O3, -Os and -Ofast do a worse job at speculating SD_FE_cubic’s execution.

Figure 10c shows the cache misses in each of the binaries. At optimization level -O1, LLVM binaries performed better as they had less cache misses. This is an interesting result, since at this optimization level the LLVM binary generated 2 to 4 times more instructions than GCC.

Figure 10d shows the number of dTLB misses. LLVM binaries for optimization levels -O3, -Os and -Ofast produced similar results for each workload.

Figure 9: Changes in various performance measures from GCC to LLVM.

(a) relative execution time

(b) instructions executed

(d) dTLB misses (load and store combined)

(e) Bad Speculation

(f) Legend for a graphs in Figure 9 and 11

4.4.2 ICC

Once again, the most prominent differences between the ICC- and GCC- generated code are highlighted in Figure 11.

Figure 12a shows a similar pattern to Figure 10a. One can no longer find that at optimization level -O1, all workloads take twice or more to execute, but one can still see that workload SD_FE_cubic takes longer to execute across most optimization levels.

Figure 12b is very similar to Figure 12a. This means that execution time is proportional to the number of instructions executed. Another interesting result is that ICC executes less instructions than ICC for most workloads at the optimization levels -O2, -O3, -Os and -Ofast

Figure 12c shows that ICC has more cache misses than GCC across optimization levels -O2, -O3, -Os and -Ofast

Finally, Figure 12d data translation lookaside buffer (dTLB) misses. As with LLVM, dTLB misses are highly depended on input.

Figure 11: Changes in various performance measures from GCC to ICC.

(a) relative execution time

(b) instructions executed

(d) dTLB misses (load and store combined)

(e) Bad Speculation

5 Conclusion

The 510.parest_r benchmark’s behaviour may depend heavily on single values. For instance, workload SD_FE_cubic spends over 40% of its time executing function void METomography::ForwardSolver::block_build_matrix_and_rhs_threaded which does not appear on the execution of the refrate nor train workloads.

The different inputs may change only one dimension of the benchmark’s execution. For example, the coverage for the noise workload looks very similar to the refrate workload but the execution time grew by more than 400%.

The LLVM version used in this analysis generates binaries that underperform when using the optimization level -O1 for this specific benchmark.

The workload SD_FE_cubic seems to be a good test for compilers since both LLVM and ICC underperformed against GCC.

Appendix

Figure 13: Percentage of execution time spent on all symbols which made up more than 5% of the execution time for at least one of the workloads. This figure continues onto the next page.

(a) for SD_FE_cubic


Symbol	Time Spent (geost)

void dealii::
SparseMatrix::vmult	7.22 (1.00)

void dealii::SparseMatrix::
precondition_SSOR	6.58 (1.01)

void dealii::
SparseILU::vmult	3.50 (1.01)

void dealii::
SolverGMRES::solve	1.22 (1.00)

void METomography::ForwardSolver::
block_build_matrix_and_rhs_threaded	46.45 (1.01)

double dealii::
Vector::operator*	1.72 (1.01)

dealii::FESystem::
initialize	0.00 (1.00)

METomography::Slave::Slave::
GlobalMatrix::assemble_matrix	0.46 (1.00)

(b) for SD_FE_quadratic


Symbol	Time Spent (geost)

void dealii::
SparseMatrix::vmult	19.46 (1.00)

void dealii::SparseMatrix::
precondition_SSOR	8.44 (1.00)

void dealii::
SparseILU::vmult	15.33 (1.00)

void dealii::
SolverGMRES::solve	4.66 (1.00)

void METomography::ForwardSolver::
block_build_matrix_and_rhs_threaded	14.64 (1.00)

double dealii::
Vector::operator*	6.59 (1.01)

dealii::FESystem::
initialize	0.01 (1.00)

METomography::Slave::Slave::
GlobalMatrix::assemble_matrix	2.01 (1.00)


Symbol	Time Spent (geost)

void dealii::
SparseMatrix::vmult	28.91 (1.00)

void dealii::SparseMatrix::
precondition_SSOR	10.58 (1.00)

void dealii::
SparseILU::vmult	36.01 (1.00)

void dealii::
SolverGMRES::solve	2.77 (1.00)

void METomography::ForwardSolver::
block_build_matrix_and_rhs_threaded	1.19 (1.00)

double dealii::
Vector::operator*	3.77 (1.00)

dealii::FESystem::
initialize	0.01 (1.00)

METomography::Slave::Slave::
GlobalMatrix::assemble_matrix	5.47 (1.00)

(d) for misfit


Symbol	Time Spent (geost)

void dealii::
SparseMatrix::vmult	23.42 (1.00)

void dealii::SparseMatrix::
precondition_SSOR	6.10 (1.00)

void dealii::
SparseILU::vmult	23.69 (1.00)

void dealii::
SolverGMRES::solve	5.63 (1.00)

void METomography::ForwardSolver::
block_build_matrix_and_rhs_threaded	0.85 (1.00)

double dealii::
Vector::operator*	8.18 (1.00)

dealii::FESystem::
initialize	6.92 (1.00)

METomography::Slave::Slave::
GlobalMatrix::assemble_matrix	1.82 (1.00)

(e) for noise


Symbol	Time Spent (geost)

void dealii::
SparseMatrix::vmult	19.44 (1.00)

void dealii::SparseMatrix::
precondition_SSOR	6.53 (1.01)

void dealii::
SparseILU::vmult	16.34 (1.00)

void dealii::
SolverGMRES::solve	4.90 (1.01)

void METomography::ForwardSolver::
block_build_matrix_and_rhs_threaded	1.81 (1.01)

double dealii::
Vector::operator*	6.90 (1.01)

dealii::FESystem::
initialize	10.83 (1.01)

METomography::Slave::Slave::
GlobalMatrix::assemble_matrix	2.40 (1.01)

(f) for refrate


Symbol	Time Spent (geost)

void dealii::
SparseMatrix::vmult	27.14 (1.00)

void dealii::SparseMatrix::
precondition_SSOR	5.45 (1.01)

void dealii::
SparseILU::vmult	26.57 (1.00)

void dealii::
SolverGMRES::solve	7.52 (1.00)

void METomography::ForwardSolver::
block_build_matrix_and_rhs_threaded	0.57 (1.01)

double dealii::
Vector::operator*	10.56 (1.00)

dealii::FESystem::
initialize	3.67 (1.01)

METomography::Slave::Slave::
GlobalMatrix::assemble_matrix	2.28 (1.00)

(g) for train


Symbol	Time Spent (geost)

void dealii::
SparseMatrix::vmult	27.82 (1.00)

void dealii::SparseMatrix::
precondition_SSOR	8.23 (1.01)

void dealii::
SparseILU::vmult	24.59 (1.00)

void dealii::
SolverGMRES::solve	7.08 (1.00)

void METomography::ForwardSolver::
block_build_matrix_and_rhs_threaded	1.42 (1.02)

double dealii::
Vector::operator*	10.04 (1.01)

dealii::FESystem::
initialize	0.01 (1.00)

METomography::Slave::Slave::
GlobalMatrix::assemble_matrix	3.22 (1.01)

¹https://en.wikipedia.org/wiki/Finite_element_method

²More information can be found in §B.3.2 of the Intel 64 and IA-32 Architectures Optimization Reference Manual