In the effort to create the Alberta Workloads for the SPEC CPU2017 benchmark suite, we examined most of the benchmarks in the suite and succeeded in creating new inputs, and in some cases scripts for the automated generation of inputs, for several of them. However, for some benchmarks it was not possible to generate new inputs. This report presents insights into some of these benchmarks, gained from the process of attempting to generate new inputs.
This benchmark is a stripped-down version of the Perl interpreter. It accepts Perl scripts as inputs.
A simple way to create new inputs for 500.perlbench_r would be to collect open-source Perl applications. However, all of the applications that we found have dependencies that require the integrated compilation of C code to work, and 500.perlbench_r was not designed to support extending Perl with C libraries. Therefore we could not leverage the Perl applications that we found to create new workloads for 500.perlbench_r. For completeness, here is a list of the Perl scripts and frameworks that we analyzed:
This benchmark simulates blast waves in a three-dimensional flow. It takes a few simulation parameters as inputs.
An analysis of the source code revealed little divergence in the application flow, which suggests that the provided workloads entirely cover the benchmark. There are few input parameters to modify, and only the grid dimensions have a noticeable effect: changing them leads to different execution times but does not lead to any significant change in the behaviour of the benchmark.
This benchmark is based on the x264 video encoder. It takes raw H.264-encoded videos as inputs and performs both decoding and encoding on the video stream.
This benchmark can process videos of any resolution (and specifically allows the resolution to be provided in a workload's control file), but it fails at the validation step if the resolution is not 1280x720.
Three programs are executed for each workload. First, the input video is decoded using the ldecod_r program. Next, the video is encoded again using the x264_r program. Finally, the output is validated using the imagevalidate_r program, which compares frames extracted from the video. The first two programs work regardless of the video’s resolution, but the imagevalidate_r program exits with an error (“Error reading input.”) if any resolution other than 1280x720 is given. As revealed by a comment in the source code for imagevalidate_r, this error occurs because the program assumes that the input video will always have a resolution of 1280x720.
A comment in the source code indicates that the program could be updated “to allow for variable WxH.” We refrain from modifying the program because the goal is to provide additional inputs to the benchmark as published. Researchers who wish to experiment with applications in this domain may choose to make such a modification and thus allow a wider variety of inputs with varying resolutions. For example, testing 1080p video would be useful given that 1080p videos are becoming more common than 720p videos.
For this benchmark, each workload’s control file should contain a single line with the following arguments:
<input file> <output file 1> <output file 2> <resolution> <frame count> <dump interval> [<dump frame> ...]
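For example, a control file for a hypothetical 220-frame workload might contain the line below; the file names, frame count, and dump frames are placeholders, but the resolution must be 1280x720 for validation to pass:

example.264 example_decoded.yuv example_encoded.264 1280x720 220 50 50 100 150 200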
Aside from the control file, a workload must provide an input video in raw 264 format and TGA versions of the desired dump frames. The TGA dump frames can be generated by first running the 525.x264_r benchmark with the --dumptga <interval> option and then, for each of the dump frames, running:
imagevalidate_r -dumpfile <framename>.yuv <framename>.yuv
The file will be overwritten with a TGA version, but it will keep its original .yuv extension.
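These two steps can be scripted. The sketch below is a hypothetical helper, not one of the Alberta Workloads scripts; it assumes the dumped frames are left in the working directory with a .yuv extension and simply re-runs imagevalidate_r on each of them:

import glob
import subprocess

def convert_dump_frames(pattern="frame_*.yuv", imagevalidate="./imagevalidate_r"):
    """Overwrite each dumped frame with its TGA version (the .yuv name is kept)."""
    for frame in sorted(glob.glob(pattern)):
        # Same invocation as shown above: the frame file is passed as both arguments.
        subprocess.run([imagevalidate, "-dumpfile", frame, frame], check=True)

if __name__ == "__main__":
    convert_dump_frames()

The frame-name pattern is an assumption and should be adjusted to match the names produced by --dumptga.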
The Alberta Workloads provide an input generator script that does most of the work required to generate workloads for x264: http://whatever_website_is_called/scripts/525.x264_r.scripts.tar.gz .
This benchmark is based on the Leela Go playing and analysis engine. It takes unfinished Go games as inputs and plays them to completion.
The Alberta Workloads provide a script to produce a wide variety of inputs. This script randomly assembles workloads: http://whatever_website_is_called/scripts/541.leela_r.scripts.tar.gz .
The script randomly chooses SGF games from an sgf directory (which currently includes many games from the defunct No-Name Go Server's archive [1]) and removes a number of moves from the end of each game so that they are incomplete.
The script can be given the number of games to include per workload, the board size (9x9, 13x13, or 19x19), and a range for the number of moves to remove.
These games are pulled from a game server. There is no guarantee that these games were played to what the benchmark considers “completion” — players are allowed to resign at any time, whereas the Leela AI may continue playing longer in some cases. Thus, even with a workload that has had no moves removed, it is possible for the benchmark to run for longer than we might expect. Still, the script is useful for producing many workloads to test, and the library of games to choose from can easily be replaced if a better set is found.
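The core of the workload-assembly step, removing the last few moves from an SGF record, can be sketched as follows. This is only an illustration, not the Alberta script itself; it assumes a linear SGF game record without variations, and the directory name and move-removal range are placeholders:

import random
import re
from pathlib import Path

MOVE_RE = re.compile(r";[BW]\[[a-z]{0,2}\]")  # SGF move nodes such as ;B[pd]

def truncate_sgf(sgf_text, moves_to_remove):
    """Drop the last `moves_to_remove` moves so that the game is left unfinished."""
    moves = list(MOVE_RE.finditer(sgf_text))
    if moves_to_remove <= 0 or moves_to_remove >= len(moves):
        return sgf_text
    cut = moves[len(moves) - moves_to_remove].start()
    # Keep everything before the first removed move and close the game tree.
    return sgf_text[:cut].rstrip() + ")"

def make_workload(sgf_dir="sgf", games_per_workload=10, remove_range=(5, 40)):
    """Pick random games and truncate each by a random number of moves."""
    games = random.sample(sorted(Path(sgf_dir).glob("*.sgf")), games_per_workload)
    for game in games:
        yield game.name, truncate_sgf(game.read_text(), random.randint(*remove_range))

Filtering by board size (the SZ property of the SGF header) is omitted here for brevity.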
This benchmark is based on the ImageMagick image editing suite. Its inputs are made up of a TGA image and a list of image transformations to apply.
The suite from which this benchmark was extracted can read and write images in over 100 formats. However, the benchmark itself only accepts TGA files as inputs.
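Consequently, any candidate image must first be converted to TGA before it can be used as a workload input. A minimal way to do this, assuming the Pillow library is available (it is not part of the benchmark tooling), is:

from PIL import Image

def to_tga(src, dst):
    """Convert an image in any format Pillow can read into the TGA format the benchmark expects."""
    Image.open(src).convert("RGB").save(dst, format="TGA")

to_tga("photo.jpg", "photo.tga")  # placeholder file names

Equivalently, the full ImageMagick suite itself can perform the conversion outside of the benchmark.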
The ImageMagick application contains several commands for image transformations. Table 1 lists the set of commands selected for coverage measurement. An interesting question is how many of these commands are executed by the workloads used for the benchmark.
adaptive-blur | adaptive-resize | adaptive-sharpen | auto-gamma |
auto-level | auto-orient | black-threshold | blue-shift |
blur | brightness-contrast | charcoal | clamp |
colorize | colors | convolve | cycle |
deskew | despeckle | edge | emboss |
enhance | equalize | flip | flop |
frame | gamma | gaussian-blur | implode |
lat | level | median | modulate |
monochrome | motion-blur | negate | noise |
normalize | opaque | paint | posterize |
radial-blur | raise | random-threshold | repage |
resample | resize | roll | rotate |
segment | selective-blur | sepia-tone | sharpen |
shave | shear | sigmoidal-contrast | sketch |
solarize | sparse-color | splice | spread |
swirl | threshold | tint | transparent |
transpose | transverse | trim | unique-colors |
unsharp | vignette | wave | white-threshold |
We used the symbols contained in the symbol table of the 538.imagick_r binary to measure coverage. The symbol table is generated automatically during compilation and is present as part of the ELF format of the binary. Each symbol in the table is a label associated with an address, and a symbol can be associated with a range of addresses: the range starts at the symbol's own address and extends up to the address of the next symbol. We used a basic-block profiler to obtain the addresses of the basic blocks executed in each run and identified the symbol associated with each basic block. Symbols that contain at least one executed basic block are classified as used. This metric gives an idea of how many functions were used by the commands in each run.
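A minimal sketch of this measurement is shown below. It is not our actual tooling: it assumes that symbol addresses are obtained from nm and that the profiler produces a plain list of executed basic-block addresses:

import bisect
import subprocess

def load_symbols(binary):
    """Return the (address, name) pairs of the binary's defined symbols, sorted by address."""
    out = subprocess.run(["nm", "--defined-only", binary],
                         capture_output=True, text=True, check=True).stdout
    syms = []
    for line in out.splitlines():
        parts = line.split()          # "<address> <type> <name>"
        if len(parts) == 3:
            syms.append((int(parts[0], 16), parts[2]))
    return sorted(syms)

def symbol_coverage(binary, bb_addresses):
    """Fraction of symbols whose address range contains at least one executed basic block."""
    syms = load_symbols(binary)
    starts = [addr for addr, _ in syms]
    used = set()
    for addr in bb_addresses:
        i = bisect.bisect_right(starts, addr) - 1   # symbol whose range covers this block
        if i >= 0:
            used.add(syms[i][1])
    return len(used) / len(syms)

The value returned by symbol_coverage corresponds to the coverage percentages reported below.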
We profiled 72 commands available in ImageMagick and obtained coverages ranging from 11.3% to 15.3%. All of the commands executed the same 11% of symbols, which indicates that the differences between commands are concentrated in only a few functions. The combined coverage for all of the commands was 24.3%.
We analyzed the reference workload provided with the benchmark.
By timing each command separately we discovered that, with the reference workload, 99% of the execution time was due to the -edge 100 command. We reduced the parameters of the two -edge commands from 14 to 2 and from 100 to 3 to balance the execution time across commands; this reduced the contribution of the two commands to 13% and 28% of the total execution time. Although this change led to a distribution of the execution time that depends more on the execution of the other commands, some commands remain too fast to be significant; for this workload we identified -shear, -negate, and -flop. A way to change the distribution of execution time for this benchmark would be to carefully place resize commands to increase the execution time of commands that do not take any parameters or whose parameters are too hard to tune.
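The per-command timing can be reproduced with a simple harness such as the sketch below. The binary name and the convert-style argument order are assumptions and may need to be adapted to the benchmark's actual command line; the parameter values are only illustrative:

import subprocess
import time

# One transformation per entry.
COMMANDS = [
    ["-shear", "31"],
    ["-edge", "2"],
    ["-negate"],
    ["-flop"],
    ["-edge", "3"],
]

def time_commands(binary="./imagick_r", input_tga="input.tga", repeats=1):
    """Time each transformation separately; raise `repeats` for very fast commands."""
    for cmd in COMMANDS:
        args = [binary, input_tga] + cmd * repeats + ["out.tga"]
        start = time.perf_counter()
        subprocess.run(args, check=True)
        print(" ".join(cmd), f"{time.perf_counter() - start:.2f}s")

The repeats parameter corresponds to the repetition experiments discussed next.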
Another question is whether repeating the same command multiple times changes the frequency of basic-block execution. Experimentation with such repetitions led to no significant changes in the cases tested. Repeating a command multiple times is, however, an alternative way to increase the execution time of fast commands, although experimentation also revealed that the execution time of colors, adaptive-resize, auto-orient, and black-threshold scales slowly with the number of repetitions.
[1] https://github.com/zenon/NNGS_SGF_Archive