Report: The 502.gcc_r Benchmark


January 18, 2018

1 Executive Summary

This report presents:

1. a description of the 502.gcc_r benchmark and its input requirements;
2. OneFile, a tool that transforms a multiple-compilation-unit C program into an equivalent preprocessed single-compilation-unit program;
3. thirteen new workloads for 502.gcc_r and an analysis of their effects on the benchmark's behaviour.

Important take-away points from this report:

1. the execution time of 502.gcc_r grows with the kLOC count of the source file being compiled and increases with the optimization level requested;
2. execution time is spread more uniformly across functions than in most benchmarks: no single function accounts for more than 5% of the execution time;
3. binaries generated by ICC generally execute more slowly than binaries generated by GCC, except at optimization level -Os.

2 Benchmark Information

502.gcc_r is an important benchmark in the SPEC suite and it has been present in all five suites released so far. The version of the GCC benchmark used in CPU2017 is based on the GNU C Compiler 4.5.0.

The 502.gcc_r benchmark requires a preprocessed single compilation unit input. Please see the SPEC documentation for more information about the benchmark.

3 The OneFile tool

The source code for most large and interesting C programs is organized into multiple C and include files. OneFile is a tool that transforms a multiple-compilation-unit program into an equivalent preprocessed single-compilation-unit program. While the goal is to allow any C program to be automatically transformed into a valid input for the 502.gcc_r benchmark, OneFile has limitations. The requirements, general workflow, results and limitations are described in Section 7: OneFile’s Overview.

To download OneFile please visit: https://webdocs.cs.ualberta.ca/~amaral/AlbertaWorkloadsForSPECCPU2017/scripts/502.gcc_r.scripts.tar.gz

OneFile is written in Java and was compiled with JDK version 1.8.0_65.

3.1 Usage

To use OneFile you must be in a directory that contains a folder named src. The src folder must contain the .c and .h files that will be merged into a single compilation unit. After that, enter the following command:

$ /path/to/tool/OneFile OUTFILE.c

This command invokes OneFile. The required argument OUTFILE.c is the name of the new .c file that OneFile will generate.

OneFile will print messages about its progress to stdout.

4 New Workloads

There are thirteen new workloads for 502.gcc_r. The thirteen new workloads are named:

1. lbm.c
2. img_process.c
3. mcf.c
4. gzip.c
5. bzip2.c
6. johnripper.c
7. oggenc.c
8. O0
9. O1
10. O2
11. O3
12. Os
13. g

4.1 Files and flags

502.gcc_r is a complex program with many configurable runtime options that allow users to fully specify how to compile a source file. To explore the impact of these options, there are two types of workloads: the first measures the impact of different C source files, and the second measures the impact of different command-line options.

The following sections briefly describe the design of the workloads. The full workloads can be downloaded at https://webdocs.cs.ualberta.ca/~amaral/AlbertaWorkloadsForSPECCPU2017/inputs/502.gcc_r.inputs.tar.gz .

4.1.1 Files

The first type of workload is designed to analyze the impact of different input C files on 502.gcc_r. This impact is measured by profiling a workload as it compiles a C source file 20 times with different optimization flags. Profiling a workload this way allows one to compare the impact of a specific C file across workloads, while the impact of the optimizations averages out.

Workloads of this type can be identified by their naming convention: they are named after the C file that they repeatedly compile.

Ideal workloads should have an execution time similar to the refrate workload and exercise the benchmark in a way that is typical of real users. While compiling the same C source file 20 times with different optimization options might not be a typical use of GCC for most users, this design decision allowed some of the new workloads to achieve an execution time comparable to refrate. SPEC CPU 2017 itself takes the same approach, compiling the same input C file multiple times.
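The design above can be sketched as a small driver that generates the 20 compiler invocations for one source file. The option sets and the cpugcc_r invocation form below are illustrative assumptions, not the exact option lists shipped with the workloads:

```python
# Sketch of a file-type workload driver: compile one C source file
# 20 times, varying the optimization options each time, so that the
# effect of the input file dominates while the effect of any single
# option set averages out.  The option sets are illustrative; the
# actual workloads ship their own lists of 20 option strings.

BASE_FLAGS = ["-O0", "-O1", "-O2", "-O3", "-Os"]
EXTRA_FLAGS = ["", "-g", "-funroll-loops", "-fno-strict-aliasing"]

def build_commands(source_file):
    """Return the 20 cpugcc_r invocations for one source file."""
    commands = []
    for opt in BASE_FLAGS:
        for extra in EXTRA_FLAGS:
            cmd = ["cpugcc_r", source_file, opt]
            if extra:
                cmd.append(extra)
            cmd += ["-o", source_file.replace(".c", ".s")]
            commands.append(cmd)
    return commands

cmds = build_commands("lbm.c")
print(len(cmds))  # 5 base levels x 4 variants = 20 compilations
```

Running the driver for lbm.c produces 20 invocations, all of which compile the same file.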




Compilation Unit     kLOC
lbm.c                 2.0
img_process.c         3.2
mcf.c                 3.5
gzip.c                6.6
bzip2.c               7.0
johnripper.c         18.5
oggenc.c             49.5
gcc.c               480.8

Table 1: New source files and their kLOC count




Compilation Unit     kLOC
train01.c            40.2
scilab.c             52.2
200.c                59.9
gcc-smaller.c       269.4
gcc-pp.c            366.4
ref-32              414.6

Table 2: Reference source files and their kLOC count

This strategy for creating workloads with an execution time similar to refrate was only partially successful. While the refrate workload executes in around 275 seconds, most of the workloads that explore differences in C source files execute in less than 100 seconds. This difference in execution time can be explained by the difference in line-of-code counts. Table 1 and Table 2 show the kLOC counts for the new and reference single-compilation-unit C source files.

4.1.2 Flags

The second type of workload is designed to analyze the impact of different option flags. This impact is measured by profiling a workload as it compiles different C source files with the same option flags. For example, workload O0 compiles different C source files, each with only the option -O0 explicitly specified.

Workloads of this type can also be identified by their naming convention: they are named after the optimization flag used during the execution of 502.gcc_r. E.g., O0 is the workload that compiles several C source files with only the -O0 optimization flag.

To decrease the difference in execution time between this type of workload and the refrate workload, the single-compilation-unit source files train01.c, scilab.c and 200.c were added to these workloads.
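The construction of a flag-type workload is the inverse of the file-type design: one fixed option applied across many source files. The file list and the cpugcc_r invocation form in this sketch are illustrative assumptions:

```python
# Sketch of a flag-type workload: apply one fixed option set to many
# source files, so the effect of the option dominates while the input
# files vary.  The file list and invocation form are illustrative.

SOURCES = ["lbm.c", "mcf.c", "gzip.c", "bzip2.c",
           "train01.c", "scilab.c", "200.c"]

def flag_workload(flag):
    """Return one cpugcc_r invocation per source file, all using `flag`."""
    return [["cpugcc_r", src, flag, "-o", src.replace(".c", ".s")]
            for src in SOURCES]

for cmd in flag_workload("-O0"):
    print(" ".join(cmd))
```

For example, flag_workload("-O0") yields one compilation per source file, all with only -O0 specified, mirroring the O0 workload described above.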

5 Analysis

A simple metric to compare the effect of various workloads on a benchmark is the benchmark execution time. All of the data in this section was obtained using the Linux perf utility and represents the mean value from three runs. 502.gcc_r was compiled with GCC 4.8.4 at optimization level -O3 and run on an Intel Core i7-2600 processor at 3.4 GHz with 8 GiB of memory, running Ubuntu 14.04 LTS on Linux Kernel 3.16.0-76, and the execution time was measured for each workload.

This section presents an analysis of the workloads for 502.gcc_r and their effects on its performance. The analysis addresses the following questions:

1. How much work does the benchmark do for each workload?
2. Which parts of the benchmark code are covered by each workload?
3. How does each workload affect the benchmark’s behaviour?
4. How is the benchmark’s behaviour affected by different compilers and optimization levels?

5.1 Execution Time

Figure 1: The mean execution time from three runs for each workload.

Figure 2: The mean instruction count from three runs for each workload.

Figure 3: The mean clock cycles per instruction from three runs for each workload.

Figure 1 shows the mean execution time for each of the workloads. The execution time of 502.gcc_r is lower when it runs with workloads with a lower kLOC count.

Another interesting finding is the difference between the optimization flags: increasing the optimization level increases the execution time of the workloads.

5.2 Coverage

This section analyzes which parts of the benchmark are exercised by each of the workloads by showing the percentage of execution time that the benchmark spends on several of the most time-consuming functions. This data was recorded on a machine equipped with Intel Core i7-2600 processor at 3.4 GHz with 8 GiB of memory running Ubuntu 14.04 LTS on Linux Kernel 3.16.0-76.

Figure 4: The mean amount of time spent in each function. Functions that consume less than 1% of the execution time are aggregated into “others”.

Figure 5: The mean percentage of time spent in each function. Functions that consume less than 1% of the execution time are aggregated into “others”.

Compared to other benchmarks’ analyses, the execution time of 502.gcc_r is more uniformly spread across its functions. In most benchmarks there are several functions that each account for more than 5% of the execution time; in 502.gcc_r, no function accounts for more than 5% of the execution time.

Workload O0 does not execute several functions that are found in most other workloads: in those workloads, bitmap_ior_input, color_pass, df_worklist_dataflow and sorted_array_from_bitmap_set each execute for more than 1% of the total execution time.

Another interesting finding is that workload g, which compiles binaries with debugging symbols, is the only workload that spends more than 1% of its execution time in the function canonicalized_values_star.

5.3 Workload Behaviour

The analysis of the workload behaviour is done using two different methodologies. The first section of the analysis is done using Intel’s top down methodology.1 The second section is done by observing changes in branch and cache behaviour between workloads.

To collect the data, GCC 4.8.4 at optimization level -O3 was used on machines equipped with an Intel Core i7-2600 processor at 3.4 GHz and 8 GiB of memory, running Ubuntu 14.04 LTS on Linux Kernel 3.16.0-76. All data is the mean of three runs.

5.3.1 Intel’s Top Down Methodology

Intel’s top down methodology consists of observing the execution of micro-ops and determining where CPU cycles are spent in the pipeline. Each cycle is then placed into one of the following categories:

Front-end Bound: Cycles spent because there are not enough micro-operations being supplied by the front end.

Back-end Bound: Cycles spent because there are not enough resources available to process pending micro-operations, including slots spent waiting for memory access.

Bad Speculation: Cycles spent because speculative work was performed and resulted in an incorrect prediction.

Retiring: Cycles spent actually carrying out micro-operations.
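The four categories can be computed from a handful of raw hardware counters. The sketch below follows the level-1 formulas from the Intel optimization manual cited in the footnote; the counter values are made-up numbers for illustration, and the exact event names vary by microarchitecture:

```python
# Level-1 top-down breakdown, following the formulas in the Intel 64
# and IA-32 Architectures Optimization Reference Manual (B.3.2).
# Each core can issue up to 4 micro-op slots per cycle.

SLOTS_PER_CYCLE = 4

def top_down(cycles, uops_issued, uops_retired_slots,
             recovery_cycles, fe_undelivered_slots):
    slots = SLOTS_PER_CYCLE * cycles
    frontend = fe_undelivered_slots / slots
    bad_spec = (uops_issued - uops_retired_slots
                + SLOTS_PER_CYCLE * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    backend = 1.0 - frontend - bad_spec - retiring
    return {"frontend": frontend, "bad_speculation": bad_spec,
            "retiring": retiring, "backend": backend}

# Made-up counter values for illustration only.
breakdown = top_down(cycles=1_000_000, uops_issued=2_600_000,
                     uops_retired_slots=2_400_000,
                     recovery_cycles=25_000,
                     fe_undelivered_slots=800_000)
for category, share in breakdown.items():
    print(f"{category}: {share:.1%}")
```

Because the back-end share is computed as the remainder, the four categories always sum to 100% of the issue slots, which is what allows the per-workload bars in Figure 6 to be compared directly.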

Figure 6: Breakdown of each workload with respect to Intel’s top down methodology.

Using this methodology, the program’s execution is broken down into these categories as shown in Figure 6. The top-down analysis shows that refrate spends more time waiting for back-end resources than the rest of the workloads.

5.3.2 Branch and Cache

By looking at the behaviour of branch predictions and cache hits and misses we can gain deeper insight into the execution of the program between workloads.

Figure 7: Breakdown of branch instructions in each workload.

Figure 8: Breakdown of LLC accesses in each workload.

Figure 7 summarizes the percentage of instructions that are branches and how many of those branches resulted in a miss. The number of branches and branch misses does not seem to vary among the different workloads.

Figure 8 summarizes the percentage of LLC accesses and how many of those accesses resulted in LLC misses. Workloads that compile multiple source files with different optimization flags have more LLC misses than the other workloads.
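The quantities plotted in Figures 7 and 8 reduce to simple ratios over raw perf event counts (instructions, branches, branch-misses, LLC accesses and LLC misses). The sketch below shows the arithmetic; the counts are made-up values for illustration:

```python
# Derive the ratios shown in Figures 7 and 8 from raw event counts.
# The counts used below are made-up values for illustration only.

def branch_stats(instructions, branches, branch_misses):
    """Share of instructions that are branches, and branch miss rate."""
    return {"branch_rate": branches / instructions,
            "miss_rate": branch_misses / branches}

def llc_stats(llc_accesses, llc_misses):
    """Fraction of last-level-cache accesses that missed."""
    return {"llc_miss_rate": llc_misses / llc_accesses}

b = branch_stats(instructions=1_000_000_000,
                 branches=200_000_000, branch_misses=4_000_000)
c = llc_stats(llc_accesses=50_000_000, llc_misses=10_000_000)
print(f"branches: {b['branch_rate']:.1%} of instructions, "
      f"{b['miss_rate']:.1%} mispredicted")
print(f"LLC miss rate: {c['llc_miss_rate']:.1%}")
```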

6 Compilers

This section compares the performance of the benchmark when compiled with GCC 4.8.4 and the Intel ICC compiler 16.0.3. The benchmark was compiled with each compiler at six optimization levels (-O0, -O1, -O2, -O3, -Ofast, and -Os), resulting in 12 different executables. Each workload was then run three times with each executable on the Intel Core i7-2600 machine mentioned earlier. The subsequent sections compare ICC against GCC.

Binaries of 502.gcc_r compiled with LLVM were also generated. However, only the binaries generated at -O0 ran without error; the others failed with an “Internal Compiler Error” message.

6.1 ICC

Figure 9: Changes in various performance measures from GCC to ICC: (a) relative execution time; (b) instructions; (c) relative number of bad speculations performed; (d) cache misses; (e) dTLB misses (load and store combined); (f) legend for all graphs in Figure 9.

The most prominent differences between the ICC and GCC generated binaries are highlighted in Figure 9.

Figure 9a shows that ICC-generated binaries take longer to execute than GCC-generated binaries at all but optimization level -Os.

Figure 9b shows that for optimization levels -O0, -O1, -O3 and -Ofast, ICC binaries execute more instructions than GCC binaries. For the remaining optimization levels, ICC binaries execute about the same number of instructions as GCC binaries.

Figure 9c shows that ICC binaries execute fewer instructions that are later discarded due to bad speculation. However, the binary compiled with GCC at optimization level -Os performs better than its ICC counterpart.

Figure 9d shows that at optimization levels -O0 and -O2, ICC-generated binaries have more cache misses than GCC-generated binaries.

Figure 9e shows that at optimization level -O0 the ICC-generated binary has fewer dTLB misses than GCC’s binary. For the remaining optimization levels, ICC-generated binaries perform similarly to GCC’s.

7 OneFile’s Overview

This section describes the steps that OneFile performs to generate a pre-processed single-compilation-unit workload.

1.
OneFile creates a directory named modified in your working directory.
2.
OneFile copies all .c and .h files from the src directory into the modified directory, placing all copied files at its root. To avoid file-name collisions, OneFile renames each .c file by prepending the names of the directories on the path from src to the file.2 The path to a .h file is not added to its name; therefore, if multiple .h files with the same name exist in different directories, only one of them will be copied. E.g.:
$ find src 
src/foo.h 
src/hello.c 
src/world/world.c 
 
$ OneFile out.c 
$ find modified 
foo.h 
src_hello.c 
src_world_world.c
3.
OneFile calls processIncludes, which modifies the preprocessor directives to reflect the changes in the folder structure. processIncludes makes the following changes:
(a)
Modify #include directives found in .c and .h files to point to the correct file. E.g.

#include "hello/world.h" changes to #include "world.h"

(b)
Modify #include directives found in .c and .h files by removing directives that try to include a .c file. Because OneFile will append all .c files at the end, these directives can be safely omitted. E.g.

#include "hello/world.c" is omitted from files.

(c)
Copy preprocessor directives found in all .c and .h files into a temporary file named temp.txt
(d)
Parse temp.txt and make a list of all #include directives outside of preprocessor conditionals (e.g. #if, #ifdef).
(e)
Make a file named includes.c that will hold all preprocessor directives found in temp.txt, but in a different order. processIncludes first places the list of unconditional #include directives from step 3d in includes.c. It then places the #if, #ifdef and other conditional blocks, including the conditional directives themselves. E.g.:
$ cat temp.txt 
#ifdef __HELLO_H__ 
#include "hello.h" 
#endif 
#include "world.h"

After this step

$ cat includes.c 
#include "world.h" 
#ifdef __HELLO_H__ 
#include "hello.h" 
#endif
4.
After processIncludes finishes running, OneFile tries to preprocess includes.c. This step may fail. In case of failure, OneFile prompts the user to resolve the issue. Possible reasons for failure are discussed in the next section.
5.
The file includes.c now contains all the preprocessor directives that do not have any definition dependencies.
6.
All other .c files still have #include preprocessing directives; otherwise, gcc would not be able to preprocess them correctly. OneFile now calls gcc to preprocess all .c files. We used gcc version 4.8.4 on Ubuntu 14.04 for preprocessing; however, any C compiler that can generate preprocessed output should work. The output of the preprocessor includes linemarkers.3
7.
Linemarkers are a way for compilers to tell where each line of code originated. Linemarkers follow this convention:
# linenum filename flags

Linemarkers can be used to tell what code comes from which file. In particular, we care about finding out what code in our pre-processed files came from .h files.

For example:

$ ls 
hello.c world.h 
$ cat hello.c 
#include "world.h" 
extern int bar; 
 
$ cat world.h 
extern int foo; 
 
$ gcc -E hello.c 
# 1 "hello.c" 
# 1 "<built-in>" 
# 1 "<command-line>" 
# 1 "/usr/include/stdc-predef.h" 1 3 4 
# 1 "<command-line>" 2 
# 1 "hello.c" 
# 1 "world.h" 1 
extern int foo; 
# 1 "hello.c" 2 
extern int bar;

The previous example shows a minimal preprocessed file. The output of the command

gcc -E hello.c

can be used to tell that the line

extern int foo;

comes from the file world.h because it is preceded by the linemarker

# 1 "world.h" 1

The .h files could not have been removed from the code in step 6 because these files contain important information about the semantics of the program. However, multiple .c files might include the same .h file, so proceeding to name mangling of the preprocessed files at this stage would result in multiple-definition errors. To deal with this problem, OneFile removes all of the code in the preprocessed files that comes from .h files by calling removeLineMarkers.4

After this step, the preprocessed files contain no source code from .h files. For example:

# 1 "hello.c" 
# 1 "<built-in>" 
# 1 "<command-line>" 
# 1 "hello.c" 
# 1 "hello.c" 2 
extern int bar;
8.
We now have files that nameMangler.jar can mangle. nameMangler.jar is a .jar file that encapsulates all the functionality to mangle C files.5

nameMangler.jar goes through each file and ignores extern declarations. It rewrites static functions and variables by prepending the file name to the function or variable name.

E.g.

Before name mangling, file a.c has the following content:

static int foo() { 
        return 0; 
} 
 
static int bar = 0; 
 
int baz() { 
        return 0; 
} 
 
int qux = 0;

After name mangling, file a.c has the following content:

static int a_foo() { 
        return 0; 
} 
 
static int a_bar = 0; 
 
int baz() { 
        return 0; 
} 
 
int qux = 0;

nameMangler.jar only mangles the names of symbols that are not visible to other compilation units. nameMangler.jar uses ANTLR v4.5.3 to generate a C lexer and parser.

9.
Finally, OUTFILE.c is created by first inserting includes.c and then appending all the other .c files.

This final file can be compiled by a C compiler.
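The linemarker-driven removal described in step 7 can be sketched as a small filter that tracks the originating file of each line and keeps only the lines attributed to the .c file itself. This is an illustrative reimplementation of the idea in Python, not the actual removeLineMarkers code:

```python
import re

# Illustrative sketch of the step-7 filter: drop every line whose
# originating file (per the most recent linemarker) is not the .c
# file being processed.  Not the actual removeLineMarkers code.

LINEMARKER = re.compile(r'^#\s+(\d+)\s+"([^"]*)"')

def strip_header_code(preprocessed, main_file):
    current_file = main_file
    kept = []
    for line in preprocessed.splitlines():
        m = LINEMARKER.match(line)
        if m:
            current_file = m.group(2)
            kept.append(line)          # keep the marker itself
        elif current_file == main_file:
            kept.append(line)          # code that came from the .c file
        # lines attributed to .h files (or built-ins) are dropped
    return "\n".join(kept)

example = '''# 1 "hello.c"
# 1 "world.h" 1
extern int foo;
# 1 "hello.c" 2
extern int bar;'''

print(strip_header_code(example, "hello.c"))
```

Applied to the hello.c example above, the filter drops `extern int foo;` (which came from world.h) and keeps `extern int bar;`.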

7.1 Current results and limitations

OneFile successfully created single compilation units for the 505.mcf_r and 519.lbm_r SPEC benchmarks and for the password cracker johnripper. These benchmarks have a simple source-code organization: all the .c and .h files are in the root directory, so there are no files with the same name.

There are some known limitations of this tool. One problem we encountered was different .h files with the same name. This was not a problem in the original C project we attempted to convert, since the .h files were placed in different folders; however, when OneFile placed them all at the root of the modified directory, the names collided.

Another known issue is compilations that require macro definitions. At the moment, OneFile calls gcc without any -D flags; however, certain compilations require the definition of macros to compile correctly. This has not been addressed. A user could manually preprocess the source files with the appropriate definition flags; OneFile would then be able to name-mangle all sources into one compilation unit.
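Until OneFile forwards such flags itself, a user could run the preprocessing step manually, supplying whatever -D flags the project needs, and hand OneFile the result. A sketch of building such a command follows; the macro names are hypothetical examples, not flags required by any particular workload:

```python
import shlex

# Sketch: build a gcc preprocessing command that supplies the macro
# definitions a project needs.  The macro names below are hypothetical
# examples, not flags required by any particular project.

def preprocess_command(source, defines):
    cmd = ["gcc", "-E"]
    cmd += [f"-D{name}={value}" if value is not None else f"-D{name}"
            for name, value in defines]
    cmd += [source, "-o", source.replace(".c", ".i")]
    return cmd

cmd = preprocess_command("src_hello.c",
                         [("HAVE_CONFIG_H", None), ("VERSION", '"1.0"')])
print(shlex.join(cmd))
```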

There are some cases in which a definition is placed out of order and an unsatisfied dependency causes the program to fail. OneFile warns the user when this happens, and the user is required to manually inspect the code.


1More information can be found in Section B.3.2 of the Intel 64 and IA-32 Architectures Optimization Reference Manual.

2This is an example where the execution of OneFile was cut short. You can simulate this by modifying OneFile and placing an exit command on line 22.

3For documentation on linemarkers please visit https://gcc.gnu.org/onlinedocs/cpp/Preprocessor-Output.html

4This is also why OneFile made an includes.c file that we can later prepend to our name mangled source code.

5We have also provided the source code.