Insights Into Benchmarks for Which the Alberta Workloads Provide No New Inputs

Elliot Colp, J. Nelson Amaral, Caian Benedicto, Raphael Ernani Rodrigues, Edson Borin,
Marcus Karpoff, Erick Ochoa
University of Alberta

January 23, 2017

1 Introduction

In the effort to create the Alberta Inputs for the SPEC CPU2017 benchmark suite we examined most of the benchmarks in the suite. For several of them we were successful in creating new inputs, and in some cases even scripts for the automated generation of inputs. However, for some benchmarks it was not possible to generate new inputs. This report presents insights into some of these benchmarks, gained from the process of attempting to generate new inputs.

2 500.perlbench_r

This benchmark is a stripped-down version of the Perl interpreter. It accepts Perl scripts as inputs.

A simple way to create new inputs for 500.perlbench_r would be to collect open-source Perl applications. However, all of the applications that we found have dependencies that require the integrated compilation of C code to work, and 500.perlbench_r was not designed to support extending Perl with C libraries. Therefore we could not leverage the Perl applications that we found to create new workloads for 500.perlbench_r. For completeness, here is a list of the Perl scripts and frameworks that we analyzed:

  1. Perl Defense Blaster, Perl Racer and Perl Arena: a series of open-source games from David Slimp.
  2. BioPerl: Toolkit for building bioinformatics solutions in Perl.
  3. Catalyst: MVC framework for creating web applications.
  4. Dancer: Web application framework for Perl.

3 503.bwaves_r

This benchmark simulates blast waves in a three-dimensional flow simulation. It takes a few simulation parameters as inputs.

An analysis of the source code revealed little divergence in the application flow, which suggests that the provided workloads already cover the benchmark well. There are few input parameters to modify, and only the grid dimensions have a noticeable effect on how the simulation runs: changing them alters the execution time but does not lead to any significant change in the behaviour of the benchmark.

4 525.x264_r

This benchmark is based on the x264 video encoder. It takes raw H.264-encoded videos as inputs and performs both decoding and encoding on the video stream.

4.1 Required Resolution

This benchmark can process videos of any resolution (and specifically allows the resolution to be provided in a workload’s control file), but it fails at the validation step if the resolution is not 1280x720.

Three programs are executed for each workload. First, the input video is decoded using the ldecod_r program. Next, the video is encoded again using the x264_r program. Finally, the output is validated using the imagevalidate_r program, which compares frames extracted from the video. The first two programs work regardless of the video’s resolution, but the imagevalidate_r program exits with an error (“Error reading input.”) if any resolution other than 1280x720 is given. As revealed by a comment in the source code for imagevalidate_r, this error occurs because the program assumes that the input video will always have a resolution of 1280x720.
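
As a rough illustration, the per-workload pipeline looks like the following. The file names are placeholders and the exact arguments are supplied by the SPEC harness from the control file; in particular, the argument order shown for the validation step is our assumption, not taken from the benchmark documentation.

    ldecod_r -i <input>.264 -o <input>.yuv                      # decode the raw H.264 stream
    x264_r --frames <frame count> -o <output>.264 <input>.yuv   # re-encode the decoded stream
    imagevalidate_r frame_<n>.yuv frame_<n>.org.tga             # compare dumped frames (argument order assumed)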

A comment in the source code indicates that the program could be updated “to allow for variable WxH.” We refrain from modifying the program because the goal is to provide additional inputs to the benchmark as published. Researchers who wish to experiment with applications in this domain may choose to make such a modification and thus allow a wider variety of inputs with varying resolutions. For example, testing 1080p video would be useful given that 1080p videos are becoming more common than 720p videos.

4.2 Additional Information for the Generation of Workloads

For this benchmark, each workload’s control file should contain a single line with the following arguments:

<input file> <output file 1> <output file 2> <resolution> <frame count> <dump interval> [<dump frame> ...]

input file
    The name of the input file (a raw .264 stream).

output file 1
    The name of the uncompressed output file (a raw YUV stream).

output file 2
    The name of the compressed output file (a raw .264 stream).

resolution
    The resolution of the input file, e.g. 1280x720.

frame count
    The number of frames in the input file.

dump interval
    Every dump interval frames, a raw YUV dump of the current frame will be saved to the working directory, always including the last frame of the video.

dump frame
    Any arguments after this point are interpreted as dump frames, which are the frame numbers to be compared for validation purposes.
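
For instance, a control file for a hypothetical 300-frame video named video.264, dumped every 50 frames, might contain the following single line (all names and values are made up for illustration):

    video.264 video_decoded.yuv video_new.264 1280x720 300 50 50 100 150 200 250 300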

Aside from the control file, a workload must provide an input video in raw .264 format and TGA versions of the desired dump frames. The TGA dump frames can be generated by first running the 525.x264_r benchmark with the --dumptga <interval> option and then, for each of the dump frames, running:

imagevalidate_r -dumpfile <framename>.yuv <framename>.yuv

The file will be overwritten by a TGA version but it will still have its original .yuv extension.
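
A small shell loop, assuming the dumped frames follow the frame_<number>.yuv naming used in the procedure below, converts and renames all of them at once:

    for f in frame_*.yuv; do
        imagevalidate_r -dumpfile "$f" "$f"   # overwrite the YUV dump with a TGA version
        mv "$f" "${f%.yuv}.org.tga"           # give it the extension expected for validation
    done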

4.3 Process to Generate Workloads

  1. Create a directory with your workload’s name. In this directory, create two directories named input and output.
  2. In the input directory, create a file named control as described in the Input Description section.
  3. Convert the desired input video from its current format to Y4M format. For example, with libav installed the conversion of an MP4 video can be done with the following command:
    avconv -i <name>.mp4 <name>.y4m

  4. Now use x264 to convert the Y4M file to raw .264 video. The reason to convert twice is that the benchmark seems to be sensitive to how the .264 file is produced: using avconv to produce the file directly can cause the benchmark to silently fail. Run this command:
    x264 -o <name>.264 <name>.y4m

  5. To produce the other required input files, do a partial run of the benchmark. Add the 525.x264_r/exe directory (which should include x264_r, ldecod_r and imagevalidate_r) to your PATH, then cd to the input directory. Run the following command:
    ldecod_r -i <name>.264 -o <name>.yuv

  6. To create the dump frames in YUV format, run:
    x264_r --dumptga <dump interval> --frames <frame count> -o /dev/null <name>.yuv

  7. For each dump frame, run these commands:
    imagevalidate_r -dumpfile frame_<number>.yuv frame_<number>.yuv
    mv frame_<number>.yuv frame_<number>.org.tga

  8. Each frame should validate with 100% similarity. Thus, for each frame, create a file in the output directory named imageValidate_frame_<number>.out with the following contents:
    AVG SSIM = 1.000000000

  9. Delete all files produced in this process except for the <name>.264 file, the frame_*.org.tga files and the control file in the input directory; and the imageValidate_frame_*.out files in the output directory.
  10. The benchmark will not run without dummy reftime and refpower files in your workload’s root directory. Create these files, and in them, simply write your workload’s name followed by a newline, a 2 and then another newline.
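
The steps above can be strung together in a small shell script. The sketch below assumes that avconv, x264 and the benchmark executables are on the PATH, and uses made-up names for the workload, the source video and the frame counts; it is an outline of the process under those assumptions, not a replacement for the generator script described next.

    #!/bin/sh
    # Sketch: build a 525.x264_r workload from an MP4 file (names and values are placeholders).
    WORKLOAD=myworkload      # workload name (hypothetical)
    SRC=video.mp4            # source video (hypothetical)
    NAME=video               # base name for the generated files
    FRAMES=300               # number of frames in the video
    INTERVAL=50              # dump interval

    mkdir -p "$WORKLOAD/input" "$WORKLOAD/output"
    cd "$WORKLOAD/input"
    # NOTE: the control file (step 2) still has to be written by hand into this directory.

    avconv -i "../../$SRC" "$NAME.y4m"                                           # step 3: convert to Y4M
    x264 -o "$NAME.264" "$NAME.y4m"                                              # step 4: produce the raw .264 input
    ldecod_r -i "$NAME.264" -o "$NAME.yuv"                                       # step 5: decode once
    x264_r --dumptga "$INTERVAL" --frames "$FRAMES" -o /dev/null "$NAME.yuv"     # step 6: dump frames

    for f in frame_*.yuv; do                                                     # steps 7-8: reference frames
        imagevalidate_r -dumpfile "$f" "$f"
        mv "$f" "${f%.yuv}.org.tga"
        printf 'AVG SSIM = 1.000000000\n' > "../output/imageValidate_${f%.yuv}.out"
    done

    rm -f "$NAME.y4m" "$NAME.yuv"                                                # step 9: drop intermediate files
    printf '%s\n2\n' "$WORKLOAD" > ../reftime                                    # step 10: dummy reftime/refpower
    printf '%s\n2\n' "$WORKLOAD" > ../refpower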

4.4 Input Generator

The Alberta Workloads provide an input generator script that does most of the work required to generate workloads for 525.x264_r: http://whatever_website_is_called/scripts/525.x264_r.scripts.tar.gz

5 541.leela_r

This benchmark is based on the Leela Go playing and analysis engine. It takes unfinished Go games as inputs and plays them to completion.

5.1 Procedure to Generate Workloads

  1. Create a directory with your workload’s name. In this directory, create two directories named input and output.
  2. Create a file with the extension “.sgf” in the input directory. The input format is described in the benchmark’s documentation page.
  3. Add a compiled version of the 541.leela_r benchmark to your PATH, then cd to the input directory. Run the following command:
    leela_r <name>.sgf > ../output/<name>.out

  4. The benchmark will not run without dummy reftime and refpower files in the workload’s root directory. Create these files, and in them, simply write your workload’s name followed by a newline, a 2 and then another newline.
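
As with 525.x264_r, these steps are easy to script. The sketch below assumes leela_r is on the PATH and uses a made-up workload name and game file:

    #!/bin/sh
    # Sketch: build a 541.leela_r workload from an SGF game (names are placeholders).
    WORKLOAD=mygame
    SGF=game.sgf

    mkdir -p "$WORKLOAD/input" "$WORKLOAD/output"
    cp "$SGF" "$WORKLOAD/input/"
    cd "$WORKLOAD/input"

    leela_r "$SGF" > "../output/${SGF%.sgf}.out"   # play the unfinished game to completion
    printf '%s\n2\n' "$WORKLOAD" > ../reftime      # dummy files required by the harness
    printf '%s\n2\n' "$WORKLOAD" > ../refpower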

5.2 Input Generator

The Alberta Workloads provide a script to produce a wide variety of inputs. This script randomly assembles workloads: http://whatever_website_is_called/scripts/541.leela_r.scripts.tar.gz

The script randomly chooses SGF games from an sgf directory (which currently includes many games from the defunct No-Name Go Server’s archive) and removes a number of moves from the end of each game so that they are incomplete.

The script can be given the number of games to include per workload, the board size (9x9, 13x13 or 19x19) and a range on the number of moves to remove.

These games are pulled from a game server. There is no guarantee that these games were played to what the benchmark considers “completion” — players are allowed to resign at any time, whereas the Leela AI may continue playing longer in some cases. Thus, even with a workload that has had no moves removed, it is possible for the benchmark to run for longer than we might expect. Still, the script is useful for producing many workloads to test, and the library of games to choose from can easily be replaced if a better set is found.

6 538.imagick_r

This benchmark is based on the ImageMagick image editing suite. Its inputs are made up of a TGA image and a list of image transformations to apply.

The suite from which this benchmark was extracted can read and write images in over 100 formats. However, the benchmark itself only accepts TGA files as inputs.
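
Other formats therefore have to be converted to TGA before they can be used. For example, the stock ImageMagick convert tool (or any similar converter) can produce a TGA input from a PNG; the file names here are illustrative:

    convert input.png input.tga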

6.1 A Preliminary Study on Code Coverage

The ImageMagick application contains several commands for image transformations. Table 1 lists the set of commands selected for coverage measurement. An interesting question is how many of these commands are executed by the workloads used for the benchmark.

adaptive-blur        adaptive-resize       adaptive-sharpen     auto-gamma
auto-level           auto-orient           black-threshold      blue-shift
blur                 brightness-contrast   charcoal             clamp
colorize             colors                convolve             cycle
deskew               despeckle             edge                 emboss
enhance              equalize              flip                 flop
frame                gamma                 gaussian-blur        implode
lat                  level                 median               modulate
monochrome           motion-blur           negate               noise
normalize            opaque                paint                posterize
radial-blur          raise                 random-threshold     repage
resample             resize                roll                 rotate
segment              selective-blur        sepia-tone           sharpen
shave                shear                 sigmoidal-contrast   sketch
solarize             sparse-color          splice               spread
swirl                threshold             tint                 transparent
transpose            transverse            trim                 unique-colors
unsharp              vignette              wave                 white-threshold

Table 1: Commands used for profiling.

We used the symbols contained in the symbol table of the 538.imagick_r binary to measure coverage. The symbol table is generated automatically during compilation and is part of the ELF format of the binary. Each symbol in the table is a label associated with an address; a symbol can be associated with a range of addresses that starts at the symbol's own address and extends up to the address of the next symbol. We used a basic-block profiler to obtain the addresses of the basic blocks executed in each run and identified which symbol is associated with each basic block. Symbols that contain at least one executed basic block are classified as used. This metric gives an idea of how many functions were used by the commands in each run.
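
This address-to-symbol mapping can be approximated with standard tools. The sketch below is ours, not part of the benchmark tooling: it assumes an unstripped binary named imagick_r, GNU awk, and that the profiler writes one hexadecimal basic-block address per line to bb_addrs.txt; it then counts the symbols that contain at least one executed block.

    # Extract text-section symbols (address and name) from the binary, sorted by address.
    nm --defined-only imagick_r | awk '$2 ~ /^[tT]$/ { print $1, $3 }' | sort > symbols.txt

    # For each executed basic-block address, find the last symbol starting at or below it
    # and mark that symbol as used (GNU awk for strtonum and length on arrays).
    awk 'NR == FNR { addr[n] = strtonum("0x" $1); name[n++] = $2; next }
         {
             a = strtonum("0x" $1)
             for (i = n - 1; i >= 0; i--)
                 if (addr[i] <= a) { used[name[i]] = 1; break }
         }
         END { printf "%d of %d symbols used\n", length(used), n }' symbols.txt bb_addrs.txt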

We profiled 72 commands available in ImageMagick and obtained coverages ranging from 11.3% to 15.3%. All of the commands executed the same 11% of symbols, which indicates that the differences between commands are concentrated in only a few functions. The combined coverage of all the commands was 24.3%.

6.2 Reference Workload

We analyzed the reference workload provided with the benchmark:

checks.tga -shear 31x14 -negate -edge 14 -resize 1800x1200 -implode 1.2
-flop -convolve 1,2,1,4,3,4,1,2,1 -edge 100 -resize 900x900 ref1.tga

By timing each command separately we discovered that, with the reference workload, 99% of the execution time was due to the -edge 100 command. To balance the execution time across commands we reduced the values of the two -edge commands from 14 to 2 and from 100 to 3, which reduced their contributions to 13% and 28% of the total execution time. Although this change leads to a distribution of execution time that depends more on the other commands, some commands are still too fast to be significant; for this workload we could identify -shear, -negate and -flop. One way to change the distribution of execution time for this benchmark would be to carefully place resize commands so as to increase the execution time of commands that do not take any parameters or whose parameters are too hard to tune.
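
A simple way to reproduce this kind of measurement is to run each transformation on its own and time it. The sketch below assumes the benchmark binary is named imagick_r and is on the PATH, and uses GNU time; the binary name and the out.tga output file are placeholders.

    # Time each transformation from the reference command line in isolation (GNU time).
    for cmd in "-shear 31x14" "-negate" "-edge 14" "-resize 1800x1200" "-implode 1.2" \
               "-flop" "-convolve 1,2,1,4,3,4,1,2,1" "-edge 100" "-resize 900x900"; do
        /usr/bin/time -f "%e s  $cmd" imagick_r checks.tga $cmd out.tga
    done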

Another question is whether repeating the same command multiple times changes the frequency of basic-block execution. Experiments with such repetitions led to no significant changes for the cases tested. Repeating a command multiple times is, however, an alternative way to increase the execution time of fast commands. Experimentation also revealed that the execution time of color, adaptive-resize, auto-orient and black-threshold scales slowly with the number of repetitions.
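
For instance, a fast command can simply be repeated on the command line to lengthen its share of the run; the example below repeats -negate twenty times and again assumes a binary named imagick_r.

    # Build a command line that applies -negate 20 times in a row.
    REPEAT=""
    for i in $(seq 1 20); do REPEAT="$REPEAT -negate"; done
    imagick_r checks.tga $REPEAT out.tga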