Report: The 511.povray_r Benchmark (June 2016 Development Kit)

Elliot Colp, J. Nelson Amaral, Caian Benedicto, Raphael Ernani Rodrigues, Edson Borin
University of Alberta

January 18, 2018

1 Executive Summary

This report presents:

  - a description of the 511.povray_r benchmark and of how to generate new workloads for it (Section 2);
  - seven new workloads designed to exercise different portions of the benchmark (Section 3);
  - an analysis of the work performed, the code coverage, and the processor behaviour of each workload, as well as of the benchmark's behaviour under different compilers and optimization levels (Section 4).

Important take-away points from this report:

  - the benchmark's behaviour varies greatly depending on its input, but this variety is not reflected by the SPEC-provided workloads;
  - the train and refrate workloads perform nearly identical work, which may be problematic for assessing feedback-directed optimization;
  - the train workload runs for roughly 3% as long as refrate, well below the 10% to 25% suggested by the SPEC documentation;
  - creating further workloads is recommended, as portions of the benchmark's code are likely still unexercised.

2 Benchmark Information

The 511.povray_r benchmark is a 3D raytracer written in C++.

2.1 Inputs

The 511.povray_r benchmark takes as input a POV-Ray scene description (a pov file, plus any inc include files it requires) and an ini file of rendering parameters. Section 2.2 describes how these files are assembled into a workload.

2.2 Generating Workloads

There are a few steps to generate new workloads for the 511.povray_r benchmark:

  1. Create a directory with your workload’s name. In this directory, create two directories named input and output.
  2. Place a pov file in the input directory. If the file requires any non-standard inc files, they should also be placed in the input directory.
  3. Place an ini file in the input directory to set up rendering parameters. You can likely just copy the ini file from one of the SPEC workloads, as the only change that usually needs to be made is to Input_File_Name so that it points to your pov file (a sketch appears after this list). You may also want to change Width and Height to adjust how long the render takes.
  4. Add compiled versions of the povray_r and 511imageValidate_r executables to your PATH, then cd to the input directory and run the following commands:

    povray_r <ininame>.ini
    511imageValidate_r SPEC-benchmark.tga SPEC-benchmark.tga \
        > ../output/imageValidate_SPEC-benchmark.out
    mv SPEC-benchmark.tga SPEC-benchmark.org.tga

    (The validation step compares the rendered image against itself to produce the expected imageValidate output; the rendered image is then renamed to serve as the reference image.)

  5. The benchmark will not run without dummy reftime and refpower files in your workload's root directory. Create these files, and in them simply write your workload's name followed by a newline, a 2, and then another newline (see the sketch below).
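For concreteness, here is a minimal sketch of the ini file from step 3 and the dummy files from step 5. The scene name myscene.pov, the workload name myworkload and the parameter values are illustrative assumptions, not values taken from any SPEC workload:

    ; myworkload/input/<ininame>.ini -- rendering parameters
    Input_File_Name=myscene.pov      ; point this at your pov file
    Output_File_Name=SPEC-benchmark.tga
    Width=640                        ; larger images take longer to render
    Height=480

    # create the dummy reftime and refpower files in the workload's root
    printf 'myworkload\n2\n' > myworkload/reftime
    printf 'myworkload\n2\n' > myworkload/refpower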

3 New Workloads

For our tests, we created 7 workloads. Each workload aims to exercise different portions of the program by using different rendering methods and parameters. The new workloads can be organized into three categories, as detailed in the following subsections.

3.1 Collection Workloads (grass and pillars)

These workloads are both from the POV-Ray object collection [1] and licensed under the terms of the CC-GNU LGPL [2]. They were chosen to represent real-world uses of POV-Ray.

The grass workload renders a patch of grass at 640x480 resolution. It is notable for spending a large amount of time constructing its mesh before rendering begins, as it is made up of many smaller rectangles (blades of grass).
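For illustration, a POV-Ray mesh is an explicit list of triangles. The following is a minimal sketch (not taken from the grass scene) of the kind of element such a mesh is built from:

    // a minimal mesh: two triangles forming one rectangle (a "blade")
    mesh {
      triangle { <0, 0, 0>, <0, 1, 0>, <1, 0, 0> }
      triangle { <0, 1, 0>, <1, 1, 0>, <1, 0, 0> }
      pigment { color rgb <0.2, 0.8, 0.2> }
    }

The grass scene assembles many thousands of such triangles, which is why mesh construction dominates its start-up time.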

The pillars workload renders three pillars with various column, capital and base shapes at 320x240 resolution. It also has radiosity [3] (a shading technique) enabled with a recursion limit of 5. It is an example of rendering moderately complex geometry made up of simple primitives.
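Radiosity is switched on in a scene's global settings. A minimal sketch follows; only recursion_limit 5 is taken from the pillars workload, and the other value is illustrative:

    global_settings {
      radiosity {
        recursion_limit 5   // maximum depth of diffuse inter-reflection
        count 100           // rays shot per radiosity sample (illustrative)
      }
    }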

3.2 Lumpy Workloads (lumpy-ref and lumpy-rad)

These workloads both render a single object placed over a checkered plane and illuminated by two spotlights. The objects are made up of isosurfaces, which are special surfaces in POV-Ray that are described by a mathematical equation. As a result, these workloads tend to put more stress on the processor’s floating-point unit, as rendering isosurface geometry involves calculating an equation’s value for many different points.
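A sketch of what such a scene element looks like; the specific function here (a unit sphere perturbed by 3-D noise) is an illustrative assumption, not the one used in the lumpy scenes:

    #include "functions.inc"    // provides f_noise3d

    isosurface {
      // the surface is the set of points where the function evaluates to zero
      function { x*x + y*y + z*z - 1 + 0.4*f_noise3d(4*x, 4*y, 4*z) }
      contained_by { box { <-2, -2, -2>, <2, 2, 2> } }
      max_gradient 5            // hint for the root-finding solver
      pigment { color rgb <0.7, 0.7, 0.9> }
    }

Rendering such an object requires evaluating the function at many sample points along each ray, which is the floating-point work described above.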

The lumpy-ref workload renders a smooth, lumpy, translucent object at 320x240 resolution. Primarily, this workload tests the interaction between isosurfaces and refraction.

The lumpy-rad workload uses the same base shape and resolution as lumpy-ref, but increases the intensity of the pits on its surface. Further, refraction is disabled, while radiosity is enabled with a recursion limit of 2. Primarily, this workload tests the interaction between isosurfaces and radiosity, which is noted in the POV-Ray documentation [4] as being particularly slow.

3.3 Primitive Workloads (prim-solid, prim-trans and prim-apert)

These workloads render some geometric primitives which are built into POV-Ray (a sphere, a cylinder and a torus). They use the same plane, spotlights and radiosity settings as the lumpy-rad workload. POV-Ray calculates the shapes of built-in primitives fairly quickly, so most of the focus in these workloads is on the rendering techniques being used.

The prim-solid workload renders the primitives at 1440x1080 pixels. The objects are all solid, shiny and slightly reflective. This workload primarily tests the performance of POV-Ray’s reflections.

The prim-trans workload renders the primitives at 960x720 pixels. The objects are the same as in prim-solid, except that they are translucent and have refraction enabled. This workload is intended for comparison to the prim-solid workload to determine the effect of refraction on the benchmark’s behaviour.
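The difference between the two object styles comes down to their textures. A hedged sketch with illustrative values (not the actual SPEC scene code):

    // prim-solid style: opaque, shiny, slightly reflective
    sphere { <0, 1, 0>, 1
      pigment { color rgb <0.9, 0.3, 0.3> }
      finish { specular 0.6 reflection 0.1 }
    }

    // prim-trans style: the same object made translucent, with refraction
    sphere { <0, 1, 0>, 1
      pigment { color rgbf <0.9, 0.3, 0.3, 0.8> }   // f = filter (transparency)
      finish { specular 0.6 reflection 0.1 }
      interior { ior 1.5 }                          // index of refraction
    }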

Finally, the prim-apert workload renders the primitives at 160x120 pixels. It uses the same settings as prim-trans, except that it gives the camera an aperture of 0.1, a confidence of 0.99 and a variance of 0. As such, rays must be fired repeatedly at slightly different angles for each pixel until there is a 99% chance that the pixel has exactly the correct colour, which slows rendering significantly; this is why the workload is rendered at a much smaller resolution. This workload primarily tests the effect of camera lens aperture.
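In scene terms, these three settings live on the camera. A sketch using the values named above (the location and focal point are illustrative assumptions):

    camera {
      location <0, 2, -5>
      look_at <0, 0, 0>
      aperture 0.1          // a non-zero aperture enables focal blur
      focal_point <0, 0, 0> // objects at this distance stay sharp
      confidence 0.99       // keep sampling until 99% confident of the colour
      variance 0            // do not stop early based on sample variance
    }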

4 Analysis

This section presents an analysis of the workloads for the 511.povray_r benchmark and their effects on its performance. All of the data in this section was measured using the Linux perf utility and represents the mean value from three runs. In this section, the following questions will be answered:

  1. How much work does the benchmark do for each workload?
  2. Which parts of the benchmark code are covered by each workload?
  3. How does each workload affect the benchmark’s behaviour?
  4. How is the benchmark’s behaviour affected by different compilers and optimization levels?

4.1 Work

The simplest metric by which we can compare the various workloads is their execution time. To this end, the benchmark was compiled with GCC 4.8.2 at optimization level -O3, and the execution time was measured for each workload on machines with Intel Core i7-2600 processors at 3.4 GHz and 8 GiB of memory running Ubuntu 14.04 LTS on Linux Kernel 3.16.0-71-generic. Figure 1 shows the mean execution time for each of the workloads.
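The exact perf invocation is not given in this report; the following is a sketch of the kind of measurement used, assuming the workload layout from Section 2.2. The CPI reported in Figure 3 is simply cycles divided by instructions:

    # run the workload three times and collect cycle/instruction counts;
    # perf stat -r 3 reports the mean (and variance) over the three runs
    cd myworkload/input
    perf stat -r 3 -e cycles,instructions \
        povray_r <ininame>.ini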

As shown in Figure 1, the grass and pillars workloads both take around 100 seconds to execute. However, the grass workload actually executes far fewer instructions than the pillars workload. This is explained by its extremely high cycles per instruction (CPI) which, as shown in Figure 3, is more than double that of many other workloads.

The lumpy-rad workload takes much less time and executes far fewer instructions than the lumpy-ref workload. This suggests that, at least for isosurfaces, refraction through a smooth surface is considerably more processor-intensive than calculating radiosity for a complicated, crinkly surface.

We also see the progressively larger execution times of the prim-* workloads, as one would expect for workloads that progressively use more program features. Note that prim-trans renders at a lower resolution than prim-solid, and prim-apert in turn renders at a much lower resolution than prim-trans, so the differences in their execution times are actually less pronounced than they would be if the workloads used the same resolution.

Lastly, we see that the train and refrate workloads have extremely different execution times, with train having an extremely short execution time (~15 s) and refrate an extremely long one (~550 s). This may be problematic, as the wiki page for povray [5] indicates that the train workload should run for 10% to 25% as long as refrate, when in fact it runs for closer to 3% as long (15/550 ≈ 2.7%). Also of note is that their mean CPIs are nearly identical, suggesting that the two workloads may be performing very similar work.

Figure 1: The mean execution time from three runs for each workload.

Figure 2: The mean instruction count from three runs for each workload.

Figure 3: The mean clock cycles per instruction from three runs for each workload.

4.2 Coverage

This section analyzes which parts of the benchmark are exercised by each of the workloads. To this end, we determined the percentage of execution time the benchmark spends on several of the most time-consuming functions.

The functions covered in this section are all the functions which made up more than 5% of the execution time for at least one of the workloads. A brief explanation of each function follows:

__ieee754_pow_sse2: Computes powers of floating-point numbers.

pov::All_CSG_Intersect_Intersections: Finds the intersections of a ray with a constructive solid geometry (CSG) intersection object.

pov::All_Plane_Intersections: Finds the intersections of a ray with a plane.

pov::All_Quadric_Intersections: Finds the intersections of a ray with a quadric shape.

pov::All_Sphere_Intersections: Finds the intersections of a ray with a sphere.

pov::Check_And_Enqueue: Adds an object to a priority queue if a ray intersects its bounding box.

pov::Get_Token: Used to parse input files.

pov::Inside_Plane: Checks whether a point is inside a plane.

pov::Intersect_BBox_Tree: Intersects a ray with a bounding-box tree and determines where the intersection occurs.

pov::Noise: Produces Perlin noise.

pov::POVFPU_RunDefault: Used for isosurfaces. Executes a function that the benchmark has parsed and compiled for its internal function virtual machine.

pov::compute_lighted_texture: Determines the apparent colour of a point on a texture, taking all lighting effects into account.

pov::mesh_hash: Used for mesh primitives. Looks up a triangle normal/vertex in a given hash table, adding it if it is not already present.

pov::ot_dist_traverse: Used to update radiosity. Traverses an octree, calling a function on all nodes within a certain distance of a given point.

pov::ra_average_near: Used to update radiosity. Checks whether a node's cached value is valid, then adds its information to a weighted average built up during tree traversal.
Figure 5 shows the coverage as a stacked bar graph of execution time, and Figure 4 shows the same information as a percentage of total execution time. Evidently, the benchmark's behaviour can vary greatly depending on the workload. We can roughly categorize the workloads by the similarities in their behaviour.

The grass workload stands alone as the sole workload to make extensive use of the pov::mesh_hash function, and to a lesser extent, the pov::Check_And_Enqueue and pov::Get_Token functions. This is likely due to the workload’s extreme focus on meshes, while the other workloads make use of primitives and isosurfaces.

The pillars, prim-solid, prim-trans and prim-apert workloads spend most of their time in the pov::ot_dist_traverse and pov::ra_average_near functions, both of which are related to radiosity. The longer-running workloads tend to increase the size of the "Others" category and slightly increase the execution time of the functions near the top of the chart. Thus, the addition of refraction and/or camera aperture effects appears to result in a few more calls to many different functions rather than many more calls to a few functions. Overall, this does not completely change the program's behaviour, as the majority of execution time is still spent calculating radiosity values.

The lumpy-* workloads behave fairly similarly, though the lumpy-rad workload’s behaviour also bears some similarity to that of the prim-* workloads due to the addition of radiosity-related functions. These are the only workloads in which the pov::POVFPU_RunDefault function is significant, as they are the only workloads which make use of isosurfaces. As well, pov::Noise is called more often than in other workloads as the shapes are made up of noise-based functions.

Finally, the train and refrate workloads both exercise several functions that are rarely, if ever, called in the other workloads. Despite their extremely different runtimes, the two workloads behave nearly identically, which could be problematic for assessing feedback-directed optimizations, as the benchmark performs essentially the same work during training as during the reference run.

Given the wide variety of behaviours shown in this section, it is likely that there are other parts of the benchmark’s code which are never exercised even with this new set of workloads. Creation of further workloads is recommended.

Figure 4: The mean (of 3 runs) percentage of execution time spent on each of the functions.

Figure 5: The mean (of 3 runs) execution time in seconds spent on each of the functions.

4.3 Workload Behaviour

In this section, we will first use the Intel "Top-Down" methodology [6] to see where the processor is spending its time while running the benchmark. This methodology breaks the benchmark's total execution slots into four high-level classifications:

Front-end bound: Slots wasted because the front end does not supply enough micro-operations.

Back-end bound: Slots wasted because there are not enough resources available to process pending micro-operations, including slots spent waiting for memory accesses.

Bad speculation: Slots wasted on speculative work that was discarded because of an incorrect prediction (usually branch misses).

Retiring: Slots spent actually carrying out micro-operations.

The event counters used in the formulas that compute each of these values can be inaccurate, so the results should be interpreted with some caution. Nonetheless, they provide a broad, generally reasonable overview of the benchmark's performance.
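For reference, the level-one Top-Down formulas from the cited manual are approximately the following; the event names are those documented for a four-wide core such as the i7-2600, and this is a sketch of the method rather than the exact script used:

    SLOTS           = 4 × CPU_CLK_UNHALTED.THREAD
    Front-end bound = IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS
    Bad speculation = (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS
                       + 4 × INT_MISC.RECOVERY_CYCLES) / SLOTS
    Retiring        = UOPS_RETIRED.RETIRE_SLOTS / SLOTS
    Back-end bound  = 1 - (Front-end bound + Bad speculation + Retiring)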

Figure 6 shows the distribution of execution slots spent on each of the aforementioned Top-Down categories. Depending on the workload, the ratio of work is completely different, which confirms that different workloads can exercise the processor in very different ways.

The grass workload is heavily backend-bound, which is likely due to the fact that rendering a mesh requires constantly reading triangle data from memory.

The prim-* and pillars workloads are similarly backend-bound, but with a higher retiring bound, suggesting that more time is spent doing useful work. For the pillars and prim-apert workloads, the backend bound is slightly less than the other three bounds combined.

The lumpy-* workloads have an even lower backend bound, with a larger share of frontend and bad-speculation stalls. The lower backend bound may be explained by the fact that isosurfaces require more calculation relative to memory access, so fewer slots are spent waiting on memory. The reason for the increase in the front-end and bad-speculation bounds is less clear, though it may be due to the way that isosurface functions (which are not part of the compiled code, but are instead parsed and evaluated by the benchmark) are computed.

Lastly, unlike the other workloads, the train and refrate workloads are mostly retiring-bound. The train workload shows more variance because its runtime is significantly shorter than refrate's, but we can see that, as expected, the two workloads have fairly similar top-down profiles.

Figure 6: The distribution of execution slots spent in each category according to Intel's Top-Down methodology.

We now analyze cache and branching behaviour more closely. Figures 7 and 8 show, respectively, the rate of branch instructions and of last-level cache (LLC) accesses from the same runs as Figure 6. Because the benchmark executes a vastly different number of instructions for each workload, raw counts are difficult to compare; the values shown are therefore rates per instruction executed.

Figure 7 demonstrates that, other than grass, most of the workloads have a similar rate of branch instructions. The branch miss rate is around 2% for test, train and refrate; around 4% for grass and pillars; and around 8% for the lumpy-* workloads. This pattern matches Figure 6, in which the latter workloads have higher bad-speculation bounds.

Figure 8 reveals that most of the workloads rarely miss in the LLC. The grass workload is the one exception: it misses on 52% of its LLC references, hence its high backend bound. Meanwhile, while the pillars and prim-* workloads seldom miss in the LLC, they reference it much more frequently than the other workloads, which likely contributes to backend stalls, as each of those references has already missed the first two levels of cache. In contrast, the train, refrate and especially lumpy-ref workloads satisfy most memory accesses in the first two levels of cache, and so they have fairly low backend bounds.

Figure 7: The mean (of 3 runs) rate of branching instructions and branch misses for each workload. The number in each bar represents the mean branch miss rate (computed as the number of branch misses divided by the number of branch instructions).

Figure 8: The mean (of 3 runs) rate of last-level cache (LLC) references and misses per instruction for each workload. The number in each bar represents the mean cache miss rate (computed as the number of cache misses divided by the number of cache references).

4.4 Compilers

This section compares the performance of the benchmark when compiled with GCC (version 4.8.2), the Clang frontend to LLVM (version 3.4), and the Intel ICC compiler (version 14.0.3). To this end, the benchmark was compiled with all three compilers at six optimization levels (-O0 through -O3, -Os and -Ofast), resulting in 18 different executables. Each workload was then run 3 times with each executable on the Intel Core i7-2600 machines described earlier.
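A sketch of the resulting build matrix; the driver names g++, clang++ and icpc are assumptions, and the actual builds were performed through SPEC's own tooling:

    # enumerate the 18 compiler/optimization-level combinations measured below
    for CXX in g++ clang++ icpc; do
        for OPT in -O0 -O1 -O2 -O3 -Os -Ofast; do
            echo "build 511.povray_r with $CXX $OPT"
        done
    done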

4.4.1 LLVM

First, we compare GCC to LLVM. There are a few notable differences in their behaviour, the most prominent of which are highlighted in Figure 9.

Figure 9a shows that the execution times are generally similar. However, the lumpy-ref workload is notably slower in the LLVM version at levels -O0 and -Os, while it is around 5 times faster at levels -O1 and -O2. Similarly, though less pronounced, the prim-trans workload is slower at levels -O3 and -Ofast, but faster at level -O2.

Figure 9b demonstrates that many of the spikes and dips seen in Figure 9a can be explained by the number of instructions executed. However, the spike in the instruction count for lumpy-ref seen in Figure 9b is not reflected in its execution time.

Figure 9c reveals that at level -Ofast, LLVM misses significantly more last-level cache references on all prim-* inputs.

Figure 9d shows data translation lookaside buffer (dTLB) misses. The number of misses varies depending on the input.

Figure 9: Changes in various performance measures from GCC to LLVM: (a) relative execution time; (b) instructions executed; (c) cache misses; (d) dTLB misses (load and store combined).

4.4.2 ICC

Once again, the most prominent differences between the ICC- and GCC-generated code are highlighted in Figure 11.

Figure 11a shows a similar pattern to Figure 9a, though the spikes and dips occur at different optimization levels: lumpy-ref sees slowdowns at levels -O3 and -Ofast and a speedup at levels -O0 and -O1, while prim-trans sees slowdowns at levels -O3, -Os and -Ofast and a speedup at level -O0.

In Figure 11b, we see once again that most of the variation in execution time is explained by the number of instructions executed for each workload.

Figure 11c reveals that, like LLVM, ICC significantly reduces the number of cache misses at levels -O2 and -O3 in comparison to GCC. Thus, GCC's cache behaviour appears to be poor for this benchmark at levels -O2 and -O3 (though, as noted earlier, perhaps not as poor as it may appear).

Finally, Figure 11d shows data translation lookaside buffer (dTLB) misses. As with LLVM, dTLB misses are highly dependent on the input.

Figure 11: Changes in various performance measures from GCC to ICC: (a) relative execution time; (b) instructions executed; (c) cache misses; (d) dTLB misses (load and store combined); (e) legend for all graphs in Figures 9 and 11.

5 Conclusion

The 511.povray_r benchmark's behaviour varies greatly depending on its inputs, though this variety is not reflected by the relative sameness of the SPEC workloads. The addition of 7 new workloads revealed many portions of the code which are rarely if ever visited by train and refrate. Producing further inputs for this benchmark would be beneficial, as it would allow even more of its code to be exercised. The importance of thoroughly exploring this program's many behaviours is underlined by Section 4.4, in which we see that different compilers may cause very different performance; if we were to look only at train and refrate, these differences would not be visible.

Appendix

Figure 13: Percentage of execution time spent on all symbols which made up more than 5% of the execution time for at least one of the workloads.
(a) for grass

Symbol                                  Time Spent (geost)
pov::ra_average_near                     0.00 (1.00)
pov::ot_dist_traverse                    0.00 (1.00)
pov::mesh_hash                          57.24 (1.00)
pov::compute_lighted_texture             0.36 (1.01)
pov::POVFPU_RunDefault                   0.28 (1.03)
pov::Noise                               0.06 (1.02)
pov::Intersect_BBox_Tree                 0.11 (1.01)
pov::Inside_Plane                        0.00 (1.00)
pov::Get_Token [clone .part.4]           5.30 (1.01)
pov::Check_And_Enqueue                  16.20 (1.00)
pov::All_Sphere_Intersections            0.00 (1.00)
pov::All_Quadric_Intersections           0.00 (1.00)
pov::All_Plane_Intersections             0.00 (1.00)
pov::All_CSG_Intersect_Intersections     0.00 (1.00)
__ieee754_pow_sse2                       0.12 (1.01)

(b) for pillars

Symbol                                  Time Spent (geost)
pov::ra_average_near                    22.85 (1.00)
pov::ot_dist_traverse                   37.00 (1.00)
pov::mesh_hash                           0.00 (1.00)
pov::compute_lighted_texture             1.44 (1.02)
pov::POVFPU_RunDefault                   0.00 (1.00)
pov::Noise                               0.00 (1.00)
pov::Intersect_BBox_Tree                 9.71 (1.00)
pov::Inside_Plane                        0.00 (1.00)
pov::Get_Token [clone .part.4]           0.00 (1.00)
pov::Check_And_Enqueue                   0.10 (1.01)
pov::All_Sphere_Intersections            0.43 (1.01)
pov::All_Quadric_Intersections           0.00 (1.00)
pov::All_Plane_Intersections             0.39 (1.01)
pov::All_CSG_Intersect_Intersections     1.24 (1.02)
__ieee754_pow_sse2                       3.05 (1.01)

(c) for lumpy-rad

Symbol                                  Time Spent (geost)
pov::ra_average_near                    15.74 (1.01)
pov::ot_dist_traverse                    9.86 (1.01)
pov::mesh_hash                           0.00 (1.00)
pov::compute_lighted_texture             0.69 (1.03)
pov::POVFPU_RunDefault                  47.76 (1.01)
pov::Noise                               8.63 (1.01)
pov::Intersect_BBox_Tree                 0.50 (1.01)
pov::Inside_Plane                        0.00 (1.00)
pov::Get_Token [clone .part.4]           0.00 (1.00)
pov::Check_And_Enqueue                   0.07 (1.01)
pov::All_Sphere_Intersections            0.00 (1.00)
pov::All_Quadric_Intersections           0.00 (1.00)
pov::All_Plane_Intersections             0.17 (1.01)
pov::All_CSG_Intersect_Intersections     0.00 (1.00)
__ieee754_pow_sse2                       0.10 (1.01)

(d) for lumpy-ref

Symbol                                  Time Spent (geost)
pov::ra_average_near                     0.00 (1.00)
pov::ot_dist_traverse                    0.00 (1.00)
pov::mesh_hash                           0.00 (1.00)
pov::compute_lighted_texture             0.63 (1.02)
pov::POVFPU_RunDefault                  66.01 (1.00)
pov::Noise                              11.77 (1.00)
pov::Intersect_BBox_Tree                 0.24 (1.01)
pov::Inside_Plane                        0.00 (1.00)
pov::Get_Token [clone .part.4]           0.00 (1.00)
pov::Check_And_Enqueue                   0.12 (1.00)
pov::All_Sphere_Intersections            0.00 (1.00)
pov::All_Quadric_Intersections           0.00 (1.00)
pov::All_Plane_Intersections             0.14 (1.01)
pov::All_CSG_Intersect_Intersections     0.00 (1.00)
__ieee754_pow_sse2                       0.44 (1.01)

(e) for prim-solid

Symbol                                  Time Spent (geost)
pov::ra_average_near                    40.30 (1.00)
pov::ot_dist_traverse                   45.35 (1.00)
pov::mesh_hash                           0.00 (1.00)
pov::compute_lighted_texture             1.78 (1.02)
pov::POVFPU_RunDefault                   0.00 (1.00)
pov::Noise                               0.00 (1.00)
pov::Intersect_BBox_Tree                 0.83 (1.01)
pov::Inside_Plane                        0.00 (1.00)
pov::Get_Token [clone .part.4]           0.00 (1.00)
pov::Check_And_Enqueue                   0.35 (1.01)
pov::All_Sphere_Intersections            0.14 (1.01)
pov::All_Quadric_Intersections           0.00 (1.00)
pov::All_Plane_Intersections             0.44 (1.01)
pov::All_CSG_Intersect_Intersections     0.00 (1.00)
__ieee754_pow_sse2                       0.55 (1.01)

(f) for prim-trans

Symbol                                  Time Spent (geost)
pov::ra_average_near                    26.96 (1.01)
pov::ot_dist_traverse                   35.28 (1.04)
pov::mesh_hash                           0.00 (1.00)
pov::compute_lighted_texture             4.61 (1.02)
pov::POVFPU_RunDefault                   0.00 (1.00)
pov::Noise                               0.00 (1.00)
pov::Intersect_BBox_Tree                 2.09 (1.03)
pov::Inside_Plane                        0.00 (1.00)
pov::Get_Token [clone .part.4]           0.00 (1.00)
pov::Check_And_Enqueue                   1.09 (1.01)
pov::All_Sphere_Intersections            0.33 (1.01)
pov::All_Quadric_Intersections           0.00 (1.00)
pov::All_Plane_Intersections             1.16 (1.01)
pov::All_CSG_Intersect_Intersections     0.00 (1.00)
__ieee754_pow_sse2                       4.47 (1.03)

(g) for prim-apert

Symbol                                  Time Spent (geost)
pov::ra_average_near                    18.17 (1.01)
pov::ot_dist_traverse                   36.64 (1.02)
pov::mesh_hash                           0.00 (1.00)
pov::compute_lighted_texture             5.54 (1.02)
pov::POVFPU_RunDefault                   0.00 (1.00)
pov::Noise                               0.00 (1.00)
pov::Intersect_BBox_Tree                 2.46 (1.01)
pov::Inside_Plane                        0.01 (1.00)
pov::Get_Token [clone .part.4]           0.00 (1.00)
pov::Check_And_Enqueue                   1.30 (1.02)
pov::All_Sphere_Intersections            0.42 (1.02)
pov::All_Quadric_Intersections           0.00 (1.00)
pov::All_Plane_Intersections             1.41 (1.01)
pov::All_CSG_Intersect_Intersections     0.00 (1.00)
__ieee754_pow_sse2                       5.37 (1.01)

(h) for test

Symbol                                  Time Spent (geost)
pov::ra_average_near                     0.00 (1.00)
pov::ot_dist_traverse                    0.00 (1.00)
pov::mesh_hash                           0.00 (1.00)
pov::compute_lighted_texture             2.20 (1.09)
pov::POVFPU_RunDefault                   0.00 (1.00)
pov::Noise                               2.23 (1.32)
pov::Intersect_BBox_Tree                 9.22 (1.11)
pov::Inside_Plane                        5.53 (1.07)
pov::Get_Token [clone .part.4]           0.74 (1.34)
pov::Check_And_Enqueue                   1.20 (1.00)
pov::All_Sphere_Intersections           10.41 (1.12)
pov::All_Quadric_Intersections           5.75 (1.11)
pov::All_Plane_Intersections            15.56 (1.07)
pov::All_CSG_Intersect_Intersections     9.76 (1.08)
__ieee754_pow_sse2                       1.36 (1.13)

(i) for train

Symbol                                  Time Spent (geost)
pov::ra_average_near                     0.00 (1.00)
pov::ot_dist_traverse                    0.00 (1.00)
pov::mesh_hash                           0.00 (1.00)
pov::compute_lighted_texture             2.27 (1.06)
pov::POVFPU_RunDefault                   0.00 (1.00)
pov::Noise                               2.34 (1.06)
pov::Intersect_BBox_Tree                 8.54 (1.01)
pov::Inside_Plane                        5.69 (1.05)
pov::Get_Token [clone .part.4]           0.11 (1.04)
pov::Check_And_Enqueue                   1.18 (1.07)
pov::All_Sphere_Intersections           10.99 (1.03)
pov::All_Quadric_Intersections           5.34 (1.05)
pov::All_Plane_Intersections            16.81 (1.03)
pov::All_CSG_Intersect_Intersections     9.28 (1.02)
__ieee754_pow_sse2                       1.20 (1.06)

(j) for refrate

Symbol                                  Time Spent (geost)
pov::ra_average_near                     0.00 (1.00)
pov::ot_dist_traverse                    0.00 (1.00)
pov::mesh_hash                           0.00 (1.00)
pov::compute_lighted_texture             2.37 (1.00)
pov::POVFPU_RunDefault                   0.00 (1.00)
pov::Noise                               2.39 (1.02)
pov::Intersect_BBox_Tree                 8.05 (1.02)
pov::Inside_Plane                        5.78 (1.03)
pov::Get_Token [clone .part.4]           0.00 (1.00)
pov::Check_And_Enqueue                   1.19 (1.01)
pov::All_Sphere_Intersections           11.25 (1.01)
pov::All_Quadric_Intersections           5.44 (1.02)
pov::All_Plane_Intersections            16.66 (1.02)
pov::All_CSG_Intersect_Intersections     9.41 (1.02)
__ieee754_pow_sse2                       1.24 (1.01)

[1] http://lib.povray.org/searchcollection/index.php

[2] http://creativecommons.org/licenses/LGPL/2.1/

[3] http://wiki.povray.org/content/HowTo:Use_radiosity

[4] http://www.povray.org/documentation/view/3.6.1/106/

[6] More information can be found in §B.3.2 of the Intel 64 and IA-32 Architectures Optimization Reference Manual.