Report: The 520.omnetpp_r Benchmark (June 2016 Development Kit)

Elliot Colp, J. Nelson Amaral, Caian Benedicto, Raphael Ernani Rodrigues,
Edson Borin, Dylan Ashley, Morgan Redshaw, Erick Ochoa Lopez
University of Alberta

2018-01-18

1 Executive Summary

This report presents:

  - a description of the 520.omnetpp_r benchmark and its inputs;
  - seven new workloads that vary the simulated campus network topology;
  - an analysis of the benchmark’s behavior across workloads, compilers, and optimization levels.

Important take-away points from this report:

  - the benchmark’s behavior changes very little between inputs;
  - most execution slots are back-end bound;
  - LLVM at optimization level -O1 shows a significant performance decrease relative to GCC.

2 Benchmark Information

The 520.omnetpp_r benchmark simulates a campus network using the OMNeT++ simulation framework.

2.1 Inputs

The benchmark takes Network Description (.ned) files, which describe the structure of a simulation model in the NED language, and a configuration file as inputs. The NED language is documented on the OMNeT++ website.1

ned files
These files let users specify the simulation model as well as parameters that are fixed at compilation time.
ini file
The ini file lets the user specify the initial configuration for the simulation as well as which measurements will be recorded as output.

The data in this report was generated with the June 2016 Development Kit 93. The workloads provided in SPEC 2017 Kit 91 differ only in the amount of time the program is required to simulate.

train
Simulates the network for 0.3 seconds.
test
Simulates the network for 0.03 seconds.
refrate
Simulates the network for 3 seconds.
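These simulated-time limits are set in the configuration file. As a sketch, the train workload’s limit corresponds to an OMNeT++ option like the following (the actual omnetpp.ini contains many more settings; only this line is assumed here):

```ini
[General]
sim-time-limit = 0.3s
```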

3 Generating Workloads

It is possible to generate new workloads by modifying the ned files or the ini configuration file. For example, modifying the LargeNetwork.ned file allows one to change the simulated network topology; this is the way we have chosen to generate new inputs. Other parameters can also be changed to generate new workloads. For example, the omnetpp.ini configuration file provides a way to increase the number of clients in the network connected to buses, hubs, or switches.
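As an illustration of such a topology change, a line arrangement can be sketched in the NED language roughly as follows (the module, gate, and network names are hypothetical placeholders, not the actual identifiers in LargeNetwork.ned):

```
// Hypothetical sketch: four campus nodes connected in a line.
network LineCampus
{
    submodules:
        campus[4]: Campus;   // "Campus" is a placeholder module type
    connections:
        // connect each campus only to its immediate neighbor
        for i=0..2 {
            campus[i].port++ <--> campus[i+1].port++;
        }
}
```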

4 New Workload

Seven new inputs were generated using the method described above. These inputs are described below:

line
The campus network is organized in a line topology.
ring
The campus network is organized in a ring topology.
star
The campus network is organized in a star topology.
tree
The campus network is organized in a tree topology.
rand9
The campus network is organized as a random network with 9 edges.
rand18
The campus network is organized as a random network with 18 edges.
rand27
The campus network is organized as a random network with 27 edges.

All other parameters were left unchanged to allow a reasonable comparison between the available workloads and the new ones. The simulation time was set to 0.3 seconds to keep the workloads consistent and manageable; it also provides a point of comparison with the train workload.

5 Analysis

This section presents an analysis of the workloads created for the 520.omnetpp_r benchmark. All data was produced using the Linux perf utility and represents the mean of three runs. In this section the following questions will be addressed:

  1. How much work does the benchmark do for each workload?
  2. How does the behavior of the benchmark change between inputs?
  3. How is the benchmark’s behavior affected by different compilers and optimization levels?

5.1 Work

To keep the interpretation clear, the analysis of the work done by the benchmark on the various workloads focuses on results obtained with GCC 4.8.2 at optimization level -O3. The execution time was measured on machines equipped with Intel Core i7-2600 processors at 3.4 GHz with 8 GiB of memory running Ubuntu 14.04.1 LTS on Linux Kernel 3.16.0-70-generic.

Figure 1 shows the mean execution time of three runs of 520.omnetpp_r on each workload. There is no clear trend that explains the differences in execution time between the inputs. Note that, relative to train, the test workload simulates 1/10th as much time and refrate simulates 10 times as much.

Figure 2 displays the mean instruction count of 3 runs. This graph matches the mean execution time graph shown previously.

Figure 3 displays the mean clock cycles per instruction (CPI) of three runs. Two facts explain why all inputs show a similar mean CPI:

  1. The cycle count is proportional to the execution time, since the clock frequency is fixed.
  2. Figures 1 and 2 show that execution time and instruction count follow the same pattern across inputs, so their ratio remains roughly constant.
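The relationship can be sketched numerically: CPI is simply the ratio of cycle count to instruction count, so inputs whose cycle and instruction counts scale together end up with nearly the same CPI. The counts below are invented for illustration, not measured values.

```python
# CPI as the ratio of two perf counters (illustrative values only).
def cpi(cycles: float, instructions: float) -> float:
    """Mean clock cycles per instruction."""
    return cycles / instructions

# Two workloads whose cycle and instruction counts scale together
# yield the same CPI:
print(cpi(3.0e12, 2.0e12))  # 1.5
print(cpi(1.5e12, 1.0e12))  # 1.5
```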

PIC
Figure 1: The mean execution time from three runs for each workload.
 
PIC
Figure 2: The mean instruction count from three runs for each workload.

PIC
Figure 3: The mean clock cycles per instruction from three runs for each workload.

5.2 Coverage

PIC
Figure 4: Time spent in functions.
 
PIC
Figure 5: Percent of time spent in functions.

Figures 4 and 5 show the amount of time spent in the most time-consuming functions. Apart from scaling with the total execution time, the workloads show no visible differences in either graph.

5.3 Workload Behavior

The analysis of the workload behavior is done using two different methodologies. The first section of the analysis is done using Intel’s top down methodology.2 The second section is done by observing changes in branch and cache behavior between workloads.

Data was collected with GCC 4.8.2 at optimization level -O3 on machines equipped with Intel Core i7-2600 processors at 3.4 GHz with 8 GiB of memory running Ubuntu 14.04.1 LTS on Linux Kernel 3.16.0-70-generic. All data remains the mean of three runs.

5.3.1 Intel’s Top Down Methodology

Intel’s top down methodology consists of observing the execution of micro-ops and determining where CPU cycles are spent in the pipeline. Each cycle is then placed into one of the following categories:

Front-end Bound
Cycles spent because there are not enough micro-operations being supplied by the front end.
Back-end Bound
Cycles spent because there are not enough resources available to process pending micro-operations, including slots spent waiting for memory access.
Bad Speculation
Cycles spent because speculative work was performed and resulted in an incorrect prediction.
Retiring
Cycles spent actually carrying out micro-operations.
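Since the four categories partition all pipeline slots, a top-down breakdown always sums to 100%. A minimal sketch with invented fractions (chosen only to mirror the back-end-bound shape reported in Figure 6):

```python
# Illustrative top-down breakdown: every slot falls into exactly one
# category, so the fractions sum to 1. The numbers are invented.
breakdown = {
    "front-end bound": 0.10,
    "back-end bound":  0.55,
    "bad speculation": 0.05,
    "retiring":        0.30,
}
assert abs(sum(breakdown.values()) - 1.0) < 1e-9
print(max(breakdown, key=breakdown.get))  # back-end bound
```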

Using this methodology the program’s execution is broken down in Figure 6. The benchmark shows the same behavior across different inputs. It is interesting to note that most of the execution slots are back-end bound.

PIC

Figure 6: Breakdown of each workload with respect to Intel’s top down methodology.

5.3.2 Branch and Cache

By looking at the behavior of branch predictions and cache hits and misses we can gain a deeper insight into the execution of the program between workloads.

Figure 7 summarizes the percentage of instructions that are branches and exactly how many of those branches resulted in a miss.

Figure 8 summarizes the percentage of LLC accesses and exactly how many of those accesses resulted in LLC misses.
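The two quantities in Figures 7 and 8 are simple ratios of perf counters; a sketch with invented counts (not the benchmark’s measured values):

```python
# Ratios behind Figures 7 and 8, computed from raw perf counters.
# All counts below are invented for illustration.
def ratio(part: int, whole: int) -> float:
    return part / whole

instructions, branches, branch_misses = 4_000_000, 900_000, 27_000
llc_accesses, llc_misses = 50_000, 5_000

print(f"branch instructions: {ratio(branches, instructions):.2%}")   # 22.50%
print(f"branch miss rate:    {ratio(branch_misses, branches):.2%}")  # 3.00%
print(f"LLC miss rate:       {ratio(llc_misses, llc_accesses):.2%}") # 10.00%
```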

Figure 7 and Figure 8 further confirm that the benchmark shows little change in runtime behavior across different inputs.

PIC
Figure 7: Breakdown of branch instructions in each workload.
 
PIC
Figure 8: Breakdown of LLC accesses in each workload.

5.4 Compilers

Limiting the experiments to only one compiler can endanger the validity of the research. To compensate for this a comparison of results between GCC (version 4.8.2) and the Intel ICC compiler (version 16.0.3) has been conducted. Furthermore, due to the prominence of LLVM in compiler technology a comparison between results observed through only GCC (version 4.8.2) and results observed using the Clang frontend to LLVM (version 3.6.2) has also been conducted.

Due to the sheer number of factors that can be compared, only those that exhibit a considerable difference have been included in this section.

PIC (a) ICC to GCC’s execution time ratio. PIC (b) LLVM to GCC’s execution time ratio.

PIC (c) ICC to GCC’s branch ratio. PIC (d) LLVM to GCC’s branch ratio.

PIC (e) ICC to GCC’s cycles ratio. PIC (f) LLVM to GCC’s cycles ratio.

Figure 9: ICC and LLVM’s comparison against GCC

PIC (a) ICC to GCC’s branch misses ratio. PIC (b) LLVM to GCC’s branch misses ratio.

PIC (c) ICC to GCC’s cache misses ratio. PIC (d) LLVM to GCC’s cache misses ratio. PIC (e) Legend for all the graphs in Figure 9

Figure 10: ICC and LLVM’s comparison against GCC

5.4.1 LLVM

Figures 9b, 9d and 9f summarize some interesting differences that result from swapping GCC for LLVM. The most interesting result is seen at optimization level -O1, where Figure 9b shows a significant increase in execution time for the LLVM backend.

Figure 9d shows that for optimization level -O1, LLVM performs some code transformation that increases the number of branches by a factor of 3. Figures 10b and 10d are included to further investigate the impact of branching. These two figures show no deviation from GCC’s performance. However, since Figure 10d represents the number of memory references that could not be served by any level of the cache, it is possible that execution at optimization level -O1 made increased references to data in caches farther from the CPU. This is consistent with Figure 9f, which shows an increase in cycles.

5.4.2 ICC

There is not much variation when switching from GCC to ICC. ICC has similar performance across all inputs and across all optimization levels when compared to GCC. The same metrics that were used in the analysis of LLVM are shown for the sake of completeness.

6 Conclusion

All of 520.omnetpp_r’s analyses across inputs show similar results. There is no reason to believe that different workloads will influence the program’s behavior significantly.

The additional workloads that were created serve to provide a larger data set.

LLVM’s optimization level -O1 shows a significant decrease in performance as measured by execution time and branching.

Appendix

Figure 11: Percentage of execution time spent on all symbols which made up more than 5% of the execution time for at least one of the workloads.
(a) for line

  Symbol                                    Time Spent (geost)
  std::_Rb_tree_increment                     9.65 (1.01)
  cMessageHeap::shiftup                      16.60 (1.11)
  cIndexedFileOutputVectorManager::record     4.42 (1.03)
  cGate::deliver                              6.25 (1.02)

(b) for rand18

  Symbol                                    Time Spent (geost)
  std::_Rb_tree_increment                    10.05 (1.02)
  cMessageHeap::shiftup                      16.76 (1.12)
  cIndexedFileOutputVectorManager::record     4.29 (1.02)
  cGate::deliver                              6.46 (1.02)

(c) for rand27

  Symbol                                    Time Spent (geost)
  std::_Rb_tree_increment                    10.08 (1.04)
  cMessageHeap::shiftup                      16.73 (1.08)
  cIndexedFileOutputVectorManager::record     4.01 (1.04)
  cGate::deliver                              6.17 (1.03)

(d) for rand9

  Symbol                                    Time Spent (geost)
  std::_Rb_tree_increment                     9.66 (1.03)
  cMessageHeap::shiftup                      16.16 (1.01)
  cIndexedFileOutputVectorManager::record     4.53 (1.03)
  cGate::deliver                              6.41 (1.03)

(e) for refrate

  Symbol                                    Time Spent (geost)
  std::_Rb_tree_increment                     9.17 (1.03)
  cMessageHeap::shiftup                      16.39 (1.04)
  cIndexedFileOutputVectorManager::record     5.30 (1.04)
  cGate::deliver                              6.81 (1.02)

(f) for ring

  Symbol                                    Time Spent (geost)
  std::_Rb_tree_increment                     9.76 (1.02)
  cMessageHeap::shiftup                      17.54 (1.12)
  cIndexedFileOutputVectorManager::record     4.44 (1.02)
  cGate::deliver                              6.48 (1.03)

(g) for star

  Symbol                                    Time Spent (geost)
  std::_Rb_tree_increment                     9.33 (1.08)
  cMessageHeap::shiftup                      17.01 (1.09)
  cIndexedFileOutputVectorManager::record     4.44 (1.06)
  cGate::deliver                              6.43 (1.03)

(h) for train

  Symbol                                    Time Spent (geost)
  std::_Rb_tree_increment                     9.57 (1.04)
  cMessageHeap::shiftup                      16.04 (1.03)
  cIndexedFileOutputVectorManager::record     4.57 (1.01)
  cGate::deliver                              6.32 (1.02)

(i) for tree

  Symbol                                    Time Spent (geost)
  std::_Rb_tree_increment                     9.45 (1.02)
  cMessageHeap::shiftup                      15.77 (1.02)
  cIndexedFileOutputVectorManager::record     4.52 (1.02)
  cGate::deliver                              6.34 (1.01)

1OMNeT++ Manual: https://omnetpp.org/doc/omnetpp/manual

2More information can be found in B.3.2 of the Intel 64 and IA-32 Architectures Optimization Reference Manual.