Report: The 525.x264_r Benchmark (June 2016 Development Kit)

Elliot Colp, J. Nelson Amaral, Caian Benedicto, Raphael Ernani Rodrigues,
Edson Borin, Dylan Ashley, Morgan Redshaw, Erick Ochoa Lopez
University of Alberta

2018-02-15

1 Executive Summary

This report presents:

  - a description of the 525.x264_r benchmark and of the inputs it accepts;
  - a way to generate new workloads for the benchmark;
  - ten new workloads generated from freely licensed videos; and
  - an analysis of the benchmark's behavior across workloads, compilers, and optimization levels.

Important take-away points from this report:

  - the benchmark behaves very similarly on all of the workloads studied;
  - the choice of compiler and optimization level has a visible effect on cache references per instruction, branch misses, and execution time.

2 Benchmark Information

The 525.x264_r benchmark is an application for encoding video streams into H.264/MPEG-4 AVC format.

2.1 Inputs

The following supplementary information — compiled from change history and examination of the reference workload — should be helpful for the creation of workloads for 525.x264_r.

The input is a .264 video file. The resolution must be 1280x720.1

The following are the input arguments to the program; a purely illustrative example follows the list.

input-filename
The name of the input file.
yuv-name
The name of the uncompressed yuv file.
no_md5
A string that is either no_md5 or md5, the latter if an MD5 checksum is available.
output-file
The name of the output file.
resolution
The video’s resolution in <w>x<h> format.
clip-start
The start of the video clip in seconds. The converter will seek to this location before starting to output video. This will make the input to the benchmark smaller.
seek
The frame from which the benchmark should start encoding. This will make the output of the benchmark smaller.
frame-count
The number of frames to be encoded.
frame-count-2
The number of frames to be encoded in a two-pass run. If set to 0, this run is ignored.
dump-interval
The interval at which the benchmark should output frame grabs. For example, if dump-interval is 50, then a frame will be saved every 50 frames. Note that the last frame is always saved.
Every argument afterward is a dump-frame, which is a frame that will be validated when the benchmark is run to confirm that the compression gives correct output.
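
As a purely illustrative example (every file name and value below is made up; none of it is taken from the reference workload), a single workload description following the arguments above might look like:

  input-filename   notld.264
  yuv-name         notld.yuv
  no_md5           no_md5
  output-file      notld_out.264
  resolution       1280x720
  clip-start       0
  seek             100
  frame-count      500
  frame-count-2    0
  dump-interval    50
  dump-frames      150 300 500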

3 Generating Workloads

It is possible to generate new workloads by taking videos with the correct licenses2 and running them through the gen-input.py script, which generates new workloads based on the original videos. A sketch of the kind of processing involved is given below.
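
The exact steps performed by gen-input.py are not reproduced here. As a rough, hypothetical sketch of the kind of processing such a script performs (this is not the actual script), a source video can be clipped and converted to the raw format expected by the benchmark with a tool such as ffmpeg:

  # Hypothetical sketch only; gen-input.py itself is not reproduced in this report.
  # Assumes ffmpeg is installed and that clip_start/duration are chosen by hand.
  import subprocess

  def make_raw_clip(source, clip_start, duration, output="clip.yuv"):
      """Cut a clip from `source`, scale it to 1280x720 and store it as raw YUV."""
      subprocess.run([
          "ffmpeg",
          "-ss", str(clip_start),        # seek to the start of the clip (seconds)
          "-i", source,                  # original, freely licensed video
          "-t", str(duration),           # length of the clip (seconds)
          "-vf", "scale=1280:720",       # the validation tool only accepts 1280x720
          "-pix_fmt", "yuv420p",         # uncompressed 4:2:0 YUV, as consumed by x264
          output,
      ], check=True)
      return output

The remaining arguments described in Section 2.1 (frame counts, dump-interval and dump-frames) would then be chosen to control how much of the clip is encoded and which frames are validated.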

4 New Workload

Ten new inputs were generated using the description above. Two color videos and one grayscale video were used to generate the ten inputs. Each input is named after the title of its source video and annotated with gray, 2, or gray-2 as appropriate.

burma
A simple sunset scene, 3:00 minutes long.
disco
People dancing amid a light spectacle, 0:31 minutes long.
notld
A trailer for Night of the Living Dead, 2:38 minutes long.

5 Analysis

This section presents an analysis of the workloads created for the 525.x264_r benchmark.3 All data was produced using the Linux perf utility and represents the mean of three runs. In this section the following questions will be addressed:

  1. How much work does the benchmark do for each workload?
  2. How does the behavior of the benchmark change between inputs?
  3. How is the benchmark’s behavior affected by different compilers and optimization levels?

5.1 Work

In order to ensure a clear interpretation, the analysis of the work done by the benchmark on the various workloads focuses on the results obtained with GCC 4.8.4 at optimization level -O3. The execution time was measured on machines equipped with Intel Core i7-2600 processors at 3.4 GHz with 8 GiB of memory, running Ubuntu 14.04.5 LTS on Linux kernel 3.16.0-76-generic.
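
The exact measurement commands are not given in this report. As an illustration of how such numbers can be gathered, a perf invocation along the following lines collects the execution time together with the hardware counters used in the rest of this section, repeating the run three times and reporting the mean (the binary name and workload arguments are placeholders):

  perf stat -r 3 \
      -e instructions,cycles,branches,branch-misses,cache-references,cache-misses \
      -- ./x264_r <workload arguments>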

Figure 1 shows the mean execution time of 3 runs of 525.x264_r on each workload. There is no clear trend that explains the different execution times of the inputs. Comparing file size with execution time yields no correlation. Further, the differences in the videos' durations make it difficult to compare other input properties. However, workloads generated from a single source do seem to show a pattern: the execution time appears to be related to how the video is encoded. Sorting the execution times in descending order yields the following:

  1. encoded in one pass
  2. encoded in two passes
  3. grayscale in one pass
  4. grayscale in two passes

Figure 2 displays the mean instruction count of 3 runs. This graph matches the mean execution time graph shown previously.

Figure 3 displays the mean clock cycles per instruction (CPI) of 3 runs. Two facts explain why all inputs show a similar mean CPI (the relation is spelled out after the list):

  1. The number of clock cycles is correlated to the execution time.
  2. The instruction counts in Figure 2 closely track the execution times in Figure 1.
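
These two facts combine through the definition of the plotted quantity. At a (nominally) fixed clock frequency of 3.4 GHz,

  CPI = cycles / instructions
  execution time ≈ cycles / (3.4 × 10^9 cycles per second)

so when both the execution times (and hence the cycle counts) and the instruction counts vary across workloads in roughly the same proportions, their ratio, the CPI, stays roughly constant.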

Figure 1: The mean execution time from three runs for each workload.

Figure 2: The mean instruction count from three runs for each workload.

Figure 3: The mean clock cycles per instruction from three runs for each workload.

5.2 Workload Behavior

The analysis of the workload behavior is done using two different methodologies. The first part of the analysis uses Intel's top down methodology.4 The second part observes changes in branch and cache behavior between workloads.

To collect data, GCC 4.8.4 at optimization level -O3 is used on machines equipped with Intel Core i7-2600 processors at 3.4 GHz with 8 GiB of memory, running Ubuntu 14.04.5 LTS on Linux kernel 3.16.0-76-generic. All data remains the mean of three runs.

5.2.1 Intel’s Top Down Methodology

Intel’s top down methodology consists of observing the execution of micro-operations and determining where CPU cycles are spent in the pipeline. Each cycle is then placed into one of the following categories (a sketch of how the corresponding fractions can be computed follows the list):

Front-end Bound
Cycles spent because there are not enough micro-operations being supplied by the front end.
Back-end Bound
Cycles spent because there are not enough resources available to process pending micro-operations, including slots spent waiting for memory access.
Bad Speculation
Cycles spent because speculative work was performed and resulted in an incorrect prediction.
Retiring
Cycles spent actually carrying out micro-operations.
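
The Intel manual cited below gives level-1 formulas that combine a handful of raw counters into these four fractions. The sketch that follows is not the tooling used for this report; it assumes the counter values have already been collected, for example with perf, and uses the counter names from the manual for a 4-wide core such as the i7-2600:

  # Level-1 top-down breakdown for a 4-issue-wide core (e.g. Sandy Bridge).
  # Arguments correspond to CPU_CLK_UNHALTED.THREAD, UOPS_ISSUED.ANY,
  # UOPS_RETIRED.RETIRE_SLOTS, IDQ_UOPS_NOT_DELIVERED.CORE and
  # INT_MISC.RECOVERY_CYCLES. Illustrative sketch only.
  def top_down_level1(clocks, uops_issued, uops_retired_slots,
                      idq_uops_not_delivered, recovery_cycles):
      slots = 4 * clocks  # four issue slots per cycle
      frontend_bound = idq_uops_not_delivered / slots
      bad_speculation = (uops_issued - uops_retired_slots
                         + 4 * recovery_cycles) / slots
      retiring = uops_retired_slots / slots
      backend_bound = 1.0 - (frontend_bound + bad_speculation + retiring)
      return {
          "Front-end Bound": frontend_bound,
          "Bad Speculation": bad_speculation,
          "Retiring": retiring,
          "Back-end Bound": backend_bound,
      }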

Using this methodology, the program’s execution is broken down in Figure 4. The benchmark shows the same behavior across different inputs.

Figure 4: Breakdown of each workload with respect to Intel’s top down methodology.

5.2.2 Branch and Cache

By looking at the behavior of branch predictions and cache hits and misses, we can gain a deeper insight into how the execution of the program changes between workloads.

Figure 5 summarizes the percentage of instructions that are branches and how many of those branches resulted in a miss.

Figure 6 summarizes the rate of LLC accesses and how many of those accesses resulted in LLC misses.
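
The ratios behind these two figures can be formed directly from the generic perf events instructions, branches, branch-misses, cache-references and cache-misses. A minimal sketch, assuming the counter values are already available and that the access rates are expressed per instruction, is:

  # Illustrative derivation of the kind of ratios plotted in Figures 5 and 6;
  # not the plotting code used for this report.
  def branch_and_cache_ratios(instructions, branches, branch_misses,
                              cache_references, cache_misses):
      return {
          "branch rate (%)":       100.0 * branches / instructions,
          "branch miss rate (%)":  100.0 * branch_misses / branches,
          "LLC access rate (%)":   100.0 * cache_references / instructions,
          "LLC miss rate (%)":     100.0 * cache_misses / cache_references,
      }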

Figure 5 and Figure 6 further confirm that the benchmark shows little change in runtime behavior across different inputs.

Figure 5: Breakdown of branch instructions in each workload.

Figure 6: Breakdown of LLC accesses in each workload.

5.3 Compilers

Limiting the experiments to only one compiler can endanger the validity of the research. To compensate for this, a comparison of results between GCC (version 4.8.4) and the Intel ICC compiler (version 16.0.3) has been conducted. Furthermore, due to the prominence of LLVM in compiler technology, a comparison between results observed with GCC (version 4.8.4) and results observed using the Clang front end to LLVM (version 3.6) has also been conducted.

Due to the sheer number of factors that can be compared, only those that exhibit a considerable difference have been included in this section.

(a) ICC to GCC’s execution time ratio. (b) LLVM to GCC’s execution time ratio.

(c) ICC to GCC’s cache references per instruction ratio. (d) LLVM to GCC’s cache references per instruction ratio.

(e) ICC to GCC’s branch misses ratio. (f) LLVM to GCC’s branch misses ratio.

Figure 7: ICC and LLVM’s comparison against GCC.

5.3.1 ICC

Figures 7a, 7c and 7e summarize some interesting differences observed as a result of swapping GCC out for the Intel ICC compiler. The most prominent differences between code compiled with ICC and with GCC appear to be in the cache references per instruction and in the branch misses. Figure 7a shows the impact of these cache reference and branch miss metrics on the execution time of the benchmark.

Figure 7c shows that, at optimization levels -O0 and -O1, ICC has fewer cache references than GCC. At the other optimization levels, ICC has more cache references than GCC. This graph shows an inverse relationship with Figure 7a.

Figure 7e shows that, at optimization levels -O0 and -O1, ICC has more branch misses than GCC. At the other optimization levels, ICC has fewer branch misses than GCC. This graph shows a proportional relationship with Figure 7a.

5.3.2 LLVM

Figures 7b, 7d and 7f summarize some interesting differences observed as a result of swapping GCC out for LLVM. The same metrics used in the ICC compiler analysis are used to compare LLVM’s performance against GCC.

Figure 7d shows the same inverse relationship with Figure 7b as the one discussed in the previous section.

Figure 7f does not show the same relationship with Figure 7b as the one discussed in the previous section. Cache performance may therefore be more significant in predicting the execution time.

6 Conclusion

All of the analyses of 525.x264_r across inputs show similar results. There is no reason to believe that different workloads will influence the program behavior significantly.

The additional workloads that were created serve to provide a larger data set.

The variation between the analyzed compilers yields some interesting results about their performance when subjected to different optimizations.

1A tool called imageValidate was created by the SPEC CPU subcommittee for the validation of images for benchmarks. This validation tool is restricted to operating on 1280x720 images. It is still possible to create workloads for 525.x264_r with different resolutions; however, such images could not be validated using the imageValidate tool.

2The videos used in this report were taken from the following website: http://vimeo.com/groups/freehd

3The data in this report is currently under review. New consistent runs with the machines in the remaining reports are expected to be completed by May 2018.

4More information can be found in §B.3.2 of the Intel 64 and IA-32 Architectures Optimization Reference Manual.