4th Workshop on Compiler-Driven Performance

October 17, 2005
Sheraton Parkway Toronto North Hotel and Convention Centre,
Richmond Hill, Ontario, Canada

Associated with CASCON 2005
(http://www.cas.ibm.com/cascon)


Final Program


Mini Session 1 - Chair: José Nelson Amaral - University of Alberta

09:00-09:30 A Probabilistic Pointer Analysis for Speculative Optimizations
Jeff Da Silva and Greg Steffan - EECG - University of Toronto
09:30-10:00 ISPRE = Isothermal Speculative Partial Redundancy Elimination
Nigel Horspool - University of Victoria
10:00-10:20 Architectural Cloning For PowerPC Processor
Edwin Chan - IBM Toronto Lab
10:20-10:40 Does Training Input Selection Matter for Feedback-Directed Optimizations?
Paul Berube, José Nelson Amaral - Computing Science - University of Alberta

10:40-11:00 Break

Mini Session 2 - Chair: Chen Ding - University of Rochester

11:00-11:20 Context Threading: A Flexible and Efficient Dispatch Technique for Virtual Machine Interpreters
Marc Berndl, Benjamin Vitale, Mathew Zaleski, and Angela Demke Brown
Computing Science - University of Toronto
11:20-11:40 Inlining Java Native Calls At Runtime
Levon Stepanian, Angela Demke Brown, Allan Kielstra, Gita Koblents, and Kevin Stoodley
Computing Science - University of Toronto, IBM Toronto Lab.
11:40-12:00 Toward Deterministic Performance in JIT Compilers
Mark Stoodley, Mike Fulton - IBM Toronto Lab.

12:00-01:00 Lunch

Mini Session 3 - Chair: Arie Tal - IBM Toronto Lab

01:00-01:30 A Safe Automatic Data Reorganization
Shimin Cui - IBM Toronto Lab.
01:30-02:00 Dynamic Shape and Data Structure Analysis in Java
Sokhom Pheng, Clark Verbrugge - McGill University
02:00-02:25 JIT Compilation Strategy Overview in J9 JVM
Marius Pirvu, Derek Inglis, and Vijay Sundaresan - IBM Toronto Lab.
02:25-02:45 Detecting Variable-Length Phases in Utility Programs
Xipeng Shen, Chen Ding, Sandhya Dwarkadas, and Michael Scott
University of Rochester

02:45-03:05 Break

Mini Session 4 - Chair: Calin Cascaval - IBM T. J. Watson Research Center

03:05-03:25 Partial Copying for First and Last Privatization of Arrays
Guangsong Zhang, Eik Charlebois, and Roch Archambault - IBM Toronto Lab.
03:25-03:45 Auto-SIMDization Challenges
Amy Wang, Peng Zhao - IBM Toronto Lab.
Peng Wu and Alex Eichenberger - IBM T.J. Watson Research Center
03:45-04:05 A Technique for Generic Iteration and Its Optimization
Stephen M. Watt - University of Western Ontario
04:05-04:25 Speeding Up Floating-Point Division with In-Lined Iterative Algorithms
Robert Enenkel, Allan Martin - IBM Toronto Lab.

A Probabilistic Pointer Analysis for Speculative Optimizations
Jeff Da Silva and Greg Steffan - EECG - University of Toronto
Presentation Slides:
[PPT] [PDF]

Pointer analysis is a critical compiler analysis used to disambiguate the indirect memory references that result from the use of pointers and pointer-based data structures in C and other similar programming languages. Since memory references often remain ambiguous--even after applying an aggressive pointer analysis--optimizations called "data speculative optimizations" have recently been proposed which can be applied despite ambiguous memory references. Currently, all data speculative optimizations require extensive profile information to drive the compiler's decision of when to speculate. Hence, we are motivated to take a fresh look at pointer analysis with speculative optimizations in mind. A conventional pointer analysis deduces for every pair of pointers, at any program point, whether a points-to relation between them (i) definitely exists, (ii) definitely does not exist, or (iii) maybe exists. Typically, a large majority of points-to relations are categorized as maybe, especially if a fast-but-inaccurate approach is used. Unfortunately, most analyses must treat the maybe case the same as the definitely case to ensure correctness. However, speculative optimizations can capitalize on the maybe case, especially if we can quantify the likelihood that two pointers alias.

We propose a Probabilistic Pointer Analysis (PPA) that statically predicts the probability of each points-to relation at every program point. We have developed an innovative PPA algorithm which is both scalable and accurate, the key to which is the use of linear transfer functions to collect points-to probability information. Building on simple control flow edge profiling and other simplifying heuristics, our analysis is one-level context, flow and field sensitive, yet can still scale to large programs, including the entire SPEC integer benchmark suite. Our preliminary results have shown that the algorithm (even without the probability information) is as accurate as the best algorithms ever developed; we are currently evaluating the accuracy of the resulting probabilities.
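
To make the idea concrete, here is a small illustrative sketch (not the authors' algorithm; the pointer names and probabilities are invented): at a control-flow merge, per-path points-to distributions can be combined linearly, weighted by profiled edge probabilities.

```python
# Illustrative sketch of probabilistic points-to merging (invented example,
# not the PPA algorithm itself): combine per-predecessor distributions at a
# control-flow join, weighting each by its profiled edge probability.

def merge_points_to(dists, edge_probs):
    """Linearly combine points-to distributions from each incoming edge."""
    merged = {}
    for dist, p in zip(dists, edge_probs):
        for target, prob in dist.items():
            merged[target] = merged.get(target, 0.0) + p * prob
    return merged

# Two predecessors: on the taken path (70%), p points to a with certainty;
# on the fall-through (30%), p may point to a or b with equal probability.
taken = {"a": 1.0}
fallthrough = {"a": 0.5, "b": 0.5}
result = merge_points_to([taken, fallthrough], [0.7, 0.3])
print(result)  # {'a': 0.85, 'b': 0.15}
```

A conventional analysis would only report that p "maybe" points to b; the quantified 15% is what a speculative optimizer can act on.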

Back to CDP05 Program
Architectural Cloning For PowerPC Processor
Edwin Chan - IBM Toronto Lab
Presentation Slides:
[PDF]

Currently the IBM XL compiler generates generic instructions to produce an executable that is compatible with multiple PowerPC platforms. One distinct drawback of the generic instructions is that they do not exploit the underlying hardware features, so the generated executable has suboptimal performance on all PowerPC platforms. This paper describes a novel method for the compiler to build a single executable that is targeted for multiple PowerPC architectures. During compilation, the compiler analyzes all the procedures to find profitable candidates. It then generates both architecture-specific instructions and generic instructions for each candidate, and inserts additional instructions in the program to dispatch the appropriate version at runtime. Thus the resulting executable can take advantage of the hardware features available on the latest PowerPC processors while remaining backward compatible with older PowerPC processors. Experimental results with the SPEC CPU2000 benchmarks demonstrate that the executable generated by this feature delivers high performance on different PowerPC processors without sacrificing backward compatibility.
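
The dispatch idea can be sketched at a high level as follows (a hypothetical illustration, not the XL compiler's mechanism; all names here are invented): the binary carries both a generic and a specialized clone of a hot routine, and a cheap runtime check selects which one runs.

```python
# Hypothetical sketch of architectural cloning: a generic version and an
# architecture-specific clone coexist in one "executable", and a one-time
# runtime check dispatches between them. Names and checks are invented.

def detect_processor():
    # Stand-in for a real CPU-identification mechanism.
    return "power5"

def dot_generic(xs, ys):
    # Portable version using only baseline operations.
    return sum(x * y for x, y in zip(xs, ys))

def dot_power5(xs, ys):
    # Imagine this clone exploiting newer instructions (e.g. fused
    # multiply-add); here it is functionally identical.
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc

# Dispatch once at startup rather than on every call.
dot = dot_power5 if detect_processor() == "power5" else dot_generic

print(dot([1.0, 2.0], [3.0, 4.0]))  # 11.0
```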

Back to CDP05 Program
Does Training Input Selection Matter for Feedback-Directed Optimizations?
Paul Berube - Computing Science - University of Alberta
Presentation Slides:
[PPT] [PDF]

Feedback-directed optimization (FDO) uses a training run that provides the compiler with a profile summarizing the run-time behavior of the program. Most studies that use FDO techniques use either a single input for both training and performance evaluation, or a single input for training and a single input for evaluation. However, the run-time behavior of a program is influenced by the data it is processing. This exploratory study addresses an important open question: How important is the selection of training data for FDO? Likely, the answer to this question is not constant across all optimizations that use profile information. How sensitive are individual compiler transformations to the selection of training data used with FDO? Does training on different inputs result in different optimization decisions at compile time? Furthermore, do these different decisions result in changes in program performance? We develop a tool called Aestimo to quantify the differences between FDO logs for inlining and if conversion from the Open Research Compiler (ORC) for SPEC CINT2000 benchmark programs trained on a large number of inputs. Aestimo also compares the performance of programs trained on different inputs, and the performance of programs compiled with and without FDO. Training on different inputs results in as much as a 6% difference in performance on a workload of inputs. Also, evaluating FDO performance on different inputs can lead to substantially different performance results. Aestimo finds differences in best-case FDO performance on different inputs for the same program of more than 13% for if conversion, and more than 20% for inlining. Finally, Aestimo reveals that the current if conversion heuristics in the ORC always result in performance degradation on the Itanium 2 processor when FDO is used.

Back to CDP05 Program
Context Threading: A Flexible and Efficient Dispatch Technique for Virtual Machine Interpreters
Marc Berndl, Benjamin Vitale, Mathew Zaleski, and Angela Demke Brown
Presenter: Mathew Zaleski - Computing Science - University of Toronto
Presentation Slides:
[PDF]

Direct-threaded interpreters use indirect branches to dispatch bytecodes, but deeply-pipelined architectures rely on branch prediction for performance. Due to the poor correlation between the virtual program's control flow and the hardware program counter, which we call the "context problem", direct threading's indirect branches are poorly predicted by the hardware, limiting performance. Our dispatch technique, context threading, improves branch prediction and performance by aligning hardware and virtual machine state.

Linear virtual instructions are dispatched with native calls and returns, aligning the hardware and virtual PC. Thus, sequential control flow is predicted by the hardware return stack. We convert virtual branching instructions to native branches, mobilizing the hardware's branch prediction resources. We evaluate the impact of context threading on both branch prediction and performance using interpreters for Java and OCaml on the Pentium and PowerPC architectures. On the Pentium IV, our technique reduces mean mispredicted branches by 95%. On the PowerPC, it reduces mean branch stall cycles by 75% for OCaml and 82% for Java. Due to reduced branch hazards, context threading reduces mean execution time by 25% for Java and by 19% and 37% for OCaml on the P4 and PPC970, respectively. We also combine context threading with a conservative inlining technique and find its performance comparable to that of selective inlining.
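
The dispatch problem is easiest to see in a toy interpreter (a sketch for illustration only, not the paper's implementation; the bytecode format is invented). In the direct-threaded style, every opcode funnels through one shared indirect dispatch point, which is exactly the branch real hardware mispredicts; context threading replaces that shared point with per-instruction native calls and branches.

```python
# Toy bytecode interpreter illustrating the dispatch problem (an invented
# sketch, not the paper's implementation). Every opcode is dispatched
# through the same indirect call site -- on real hardware, this is the
# single poorly-predicted indirect branch that context threading removes.

def op_push(vm, arg):
    vm["stack"].append(arg)

def op_add(vm, _):
    s = vm["stack"]
    s.append(s.pop() + s.pop())

HANDLERS = {0: op_push, 1: op_add}

def run(code):
    vm = {"stack": []}
    for opcode, arg in code:
        HANDLERS[opcode](vm, arg)  # the shared indirect dispatch point
    return vm["stack"][-1]

# Program: push 2; push 3; add
print(run([(0, 2), (0, 3), (1, None)]))  # 5
```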

Back to CDP05 Program
Inlining Java Native Calls At Runtime
Levon Stepanian, Angela Demke Brown, Allan Kielstra, Gita Koblents, and Kevin Stoodley
Presenter: Levon Stepanian - Computing Science - University of Toronto
Presentation Slides:
[PPT] [PDF]

We introduce a strategy for inlining native functions into Java applications using a JIT compiler. We perform further optimizations to transform inlined "callbacks" into semantically equivalent lightweight operations. We show that this strategy can substantially reduce the overhead of performing JNI calls, while preserving the key safety and portability properties of the JNI. Our work leverages the ability to store statically-generated IL alongside native binaries, to facilitate native inlining at Java callsites at JIT compilation time. Preliminary results with our prototype implementation show speedups of up to 93X when inlining and callback transformation are combined.

Back to CDP05 Program
Toward Deterministic Performance in JIT Compilers
Mark Stoodley - IBM Toronto
Presentation Slides:
[PPT] [PDF]

Real-time systems require a higher level of predictable performance than other types of systems. A Java execution environment designed for real-time systems must deliver high performance without sacrificing deterministic behaviour. Although Java is a dynamic language whose Just-In-Time compilers employ a variety of highly speculative optimizations, many of the non-deterministic aspects of high-performance Java execution can be mitigated, albeit at reduced levels of sustained performance. For example, rather than compiling code as the program executes, ahead-of-time (AOT) compilation strategies can be used to achieve compiled-code speed without a JIT compiler interrupting the program. The Real-Time Specification for Java (RTSJ) also defines several new features that programmers can use as tools to write programs that meet real-time deadlines. In this talk, I will describe several areas of the JIT compiler and JVM that must be modified to provide more deterministic performance, as well as the various AOT compiler changes mandated by the RTSJ.

Back to CDP05 Program
ISPRE = Isothermal Speculative Partial Redundancy Elimination
Nigel Horspool - University of Victoria
Presentation Slides:
[PDF]

Speculative partial redundancy elimination (SPRE) attempts to minimize the expected number of computations for each expression in a program. Unfortunately, existing algorithms which solve the SPRE problem are computationally expensive, so much so that they can consume a significant fraction of the total compilation time. The new ISPRE approach solves the problem for all expressions simultaneously, trading optimality of the results for speed. However, as preliminary results indicate, its solutions are not far from optimal in practice.
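
The profit criterion behind speculative PRE can be sketched with a trivial cost model (an illustration using invented block names and frequencies, not the ISPRE formulation): compare the expected number of evaluations of an expression before and after moving it, using profiled block frequencies.

```python
# Illustrative cost model for speculative PRE (invented example, not the
# ISPRE algorithm): the expected evaluation count of an expression is the
# sum of the profiled frequencies of the blocks that compute it.

def expected_cost(block_freqs, eval_blocks):
    """Expected number of evaluations, given per-block execution counts."""
    return sum(block_freqs[b] for b in eval_blocks)

# Profiled frequencies for a small loop (invented numbers).
freqs = {"preheader": 1, "loop_body": 1000, "cold_branch": 10}

# Computing x*y inside the loop body costs ~1000 evaluations per run...
before = expected_cost(freqs, ["loop_body"])
# ...speculatively hoisting it to the preheader costs ~1 (it may be wasted
# work if the loop never uses it -- that is the speculation).
after = expected_cost(freqs, ["preheader"])

print(before, after)  # 1000 1
```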

Back to CDP05 Program
A Safe Automatic Data Reorganization
Shimin Cui - IBM Toronto
Presentation Slides:
[PPT] [PDF]

Improving cache efficiency is critical to achieving high performance on modern machines. This presentation will describe a practical, safe, automatic optimization technique for data reorganization to increase the data cache utilization of pointer-based programs. The fully automatic transformation will be discussed, including (1) data splitting and data outlining, (2) memory allocation merging, and (3) data grouping.
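
The flavor of data splitting can be shown at the source level (a hand-written illustration with invented field names, not the compiler's transformation): a frequently accessed field is split out of a record so that scans of it touch a dense array instead of dragging cold fields through the cache.

```python
# Hand-written illustration of data splitting (invented example): separate
# the hot field of a record array from its cold payload so that a scan of
# the hot field walks contiguous memory.

# Original layout: each record interleaves a small hot key with a large
# cold payload, so scanning keys pulls payload bytes into the cache too.
records = [{"key": i, "payload": "x" * 64} for i in range(4)]

# After splitting: hot keys in one compact array, cold payloads in a
# parallel array, linked by index rather than by co-location.
keys = [r["key"] for r in records]
payloads = [r["payload"] for r in records]

# A scan that only needs keys now touches a dense array.
total = sum(keys)
print(total)  # 6
```

The "safe" part of the work is proving that no pointer arithmetic or aliasing in the program can observe the changed layout, which is what makes doing this automatically hard.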

Back to CDP05 Program
Dynamic Shape and Data Structure Analysis in Java
Clark Verbrugge - McGill University
Presentation Slides:
[PPT] [PDF]

Analysis of dynamic data structure usage is important for both program understanding and for improving the accuracy of other program analyses. Static shape analysis techniques, however, suffer from reduced accuracy in complex situations. We have designed and implemented a dynamic shape analysis system that allows one to examine and analyze how Java programs build and modify data structures. Using a complete execution trace from a profiled run of the program, we build an internal representation that mirrors the evolving runtime data structures. The resulting series of representations can then be analyzed and visualized, and we show how to use our approach to help understand how programs use data structures, the precise effect of garbage collection, and to establish limits on static techniques for shape analysis. A deep understanding of dynamic data structures is particularly important for modern, object-oriented languages that make extensive use of heap-based data structures.
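
In miniature, the core step looks like this (a simplified sketch, not the paper's system): replay a trace of heap edges into a graph and classify its shape, since "tree vs. DAG vs. cyclic" is exactly what static shape analyses try to conservatively approximate.

```python
# Simplified sketch of dynamic shape classification (invented example, not
# the paper's system): build a graph from traced heap edges and classify it.
from collections import defaultdict

def classify(edges):
    """Classify a heap graph as 'tree', 'dag', or 'cyclic'."""
    succs = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        succs[src].append(dst)
        indeg[dst] += 1
        nodes.update((src, dst))

    def has_cycle():
        # Standard three-color DFS cycle detection.
        color = {n: 0 for n in nodes}  # 0=unvisited, 1=in progress, 2=done
        def dfs(n):
            color[n] = 1
            for m in succs[n]:
                if color[m] == 1 or (color[m] == 0 and dfs(m)):
                    return True
            color[n] = 2
            return False
        return any(color[n] == 0 and dfs(n) for n in nodes)

    if has_cycle():
        return "cyclic"
    if any(d > 1 for d in indeg.values()):
        return "dag"   # sharing, but no cycles
    return "tree"

print(classify([("a", "b"), ("a", "c")]))              # tree
print(classify([("a", "b"), ("a", "c"), ("b", "c")]))  # dag
print(classify([("a", "b"), ("b", "a")]))              # cyclic
```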

Back to CDP05 Program
Detecting Variable-Length Phases in Utility Programs
Xipeng Shen, Chen Ding, Sandhya Dwarkadas, and Michael Scott
Presenter: Xipeng Shen - University of Rochester
Presentation Slides:
[PDF]

The behavior of utility programs depends strongly on the input. A compiler, for example, behaves differently when compiling different functions. Similar input dependences can be seen in interpreters, compression and encoding tools, databases, and document parsers. Because their behavior is hard to predict, these programs pose a special challenge for dynamic adaptation mechanisms such as phase-based memory management or hardware reconfiguration.

We present a two-step technique to detect phases in utility programs. Active profiling constructs a regular input to induce repeating behavior. Phase-boundary analysis then examines a basic-block trace to identify patterns that occur periodically with regular inputs and consistently (but at irregular scale) with real inputs. Once phases have been identified, we use binary rewriting to insert phase markers that will announce phase transitions at run time, with arbitrary input. The technique can handle unmodified binary code. Experiments with utility programs from the Spec95 and Spec2K benchmark suites indicate that program behavior within phases is surprisingly predictable in many (though not all) cases. This in turn suggests that dynamic adaptation, either in hardware or in software, may be applicable to a wider class of programs than previously believed.
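
The first step can be sketched in miniature (an invented toy, not the paper's analysis): under a regular input made of N identical units, a candidate phase marker is a basic block that recurs exactly once per unit.

```python
# Toy sketch of active profiling (invented trace, not the paper's analysis):
# with a regular input of n identical units, a candidate phase-marker block
# is one that appears exactly once per unit.
from collections import Counter

def periodic_blocks(trace, n_units):
    """Return blocks appearing exactly once per input unit."""
    counts = Counter(trace)
    return {b for b, c in counts.items() if c == n_units}

# Regular input of 3 identical units; block 'B2' begins each unit's work,
# while 'B3' recurs at an irregular rate and 'B1' is one-time startup.
trace = ["B1", "B2", "B3", "B3", "B2", "B3", "B3", "B2", "B3", "B3"]
print(sorted(periodic_blocks(trace, 3)))  # ['B2']
```

With real inputs the period is irregular, so the full technique must match the pattern by order of occurrence rather than by exact count, but the regular-input run is what makes the candidate set small enough to check.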

Back to CDP05 Program
A Technique for Generic Iteration and Its Optimization
Stephen M. Watt
Presentation Slides:
[PDF]

Software libraries rely increasingly on iterator objects to provide generic traversal of data structures. These iterators can be represented either as structures that save state or as programs that suspend and resume execution. This paper addresses two problems that remain in the use of iterators today. The first is that iterators represented as state-saving objects in languages such as C++ typically have logic that is much more complicated than that of the suspend/resume iterators pioneered by CLU. This paper presents a program structuring technique that allows state-saving iterators to be implemented with the same clarity as suspending iterators. The second problem is that the usual implementations of suspend/resume iterators are not suitable for inner loops in high-performance applications. This paper presents a code optimization technique that can be applied to suspend/resume iteration to produce efficient natural loops at the machine code level. Combined, these two results allow iterators for complex data structures to be easily understood and implemented efficiently.
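
The two styles being contrasted can be seen side by side in Python, where generators provide CLU-style suspend/resume directly (an illustration of the general distinction, not the paper's technique):

```python
# The two iterator styles contrasted in the paper, shown for a binary tree
# encoded as (left, value, right) tuples. An illustration of the general
# distinction, not the paper's technique.

# Suspend/resume style: the traversal logic reads naturally top to bottom.
def tree_values(node):
    if node is None:
        return
    yield from tree_values(node[0])  # left subtree
    yield node[1]                    # this node's value
    yield from tree_values(node[2])  # right subtree

# State-saving style: the same traversal with an explicit stack -- the kind
# of logic a hand-written C++ iterator must encode.
class TreeIter:
    def __init__(self, node):
        self.stack = []
        self._descend(node)
    def _descend(self, node):
        # Push the path to the leftmost unvisited node.
        while node is not None:
            self.stack.append(node)
            node = node[0]
    def __iter__(self):
        return self
    def __next__(self):
        if not self.stack:
            raise StopIteration
        node = self.stack.pop()
        self._descend(node[2])
        return node[1]

tree = ((None, 1, None), 2, (None, 3, None))
print(list(tree_values(tree)))  # [1, 2, 3]
print(list(TreeIter(tree)))     # [1, 2, 3]
```

The state-saving version is harder to write and verify, yet compiles to a plain loop; the generator version is clear but, naively implemented, costs a suspension per element in inner loops. The paper's two techniques aim to get each style the other's advantage.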

Back to CDP05 Program
Partial Copying for First and Last Privatization of Arrays
Guangsong Zhang, Eik Charlebois, and Roch Archambault - IBM Toronto
Presentation Slides:
[PPT] [PDF]

Array data flow analysis is a known method for privatizing arrays, discussed in [1], [2] and [3]. The core idea is to examine array sections to see whether they are local to a loop iteration and can therefore be privatized. To prove that it is safe to privatize an array, every use of an array element must have a dominating definition in the same loop iteration; that is, every path from the top of the loop to the use of the array element must pass through the definition site of the same array element. [1] supplies the basic algorithm for this analysis and [2] further refines it. However, the approaches taken for array copy-in and copy-out are inefficient; in particular, this drawback discourages support for first privatization. With the well-defined semantics of OpenMP [4], we find it easier to partition the different kinds of array uses, such as copy-in and copy-out, into firstprivate and lastprivate variables. Our approach focuses on calculating those sets efficiently.
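
The semantics the partial-copying scheme must implement can be simulated directly (a sketch with invented index sets, not the paper's algorithm): copy in only the elements read before being written (firstprivate), and copy out only the elements defined by the sequentially last iteration (lastprivate).

```python
# Sketch of partial copy-in/copy-out for array privatization (invented
# example, not the paper's algorithm). Each iteration gets a private copy;
# only the "read-before-write" section is copied in, and only the written
# section of the last iteration is copied out.

def run_loop(a, n_iters, reads_before_write, writes):
    for i in range(n_iters):
        private = [None] * len(a)
        for j in reads_before_write:   # firstprivate: partial copy-in
            private[j] = a[j]
        for j in writes:               # iteration's own definitions
            private[j] = i
        if i == n_iters - 1:
            for j in writes:           # lastprivate: partial copy-out
                a[j] = private[j]
    return a

a = [10, 20, 30, 40]
# Elements 0-1 are only read; elements 2-3 are written in every iteration.
print(run_loop(a, 3, reads_before_write=[0, 1], writes=[2, 3]))
# [10, 20, 2, 2] -- copy-out reflects the last iteration (i == 2)
```

Because each iteration's copy-in and copy-out touch only the computed sections rather than the whole array, iterations can run in parallel without the full-array copying cost.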

References
[1] Peng Tu. Automatic Array Privatization and Demand-Driven Symbolic Analysis. PhD thesis, University of Illinois at Urbana-Champaign, 1995.
[2] Junjie Gu and Zhiyuan Li. Efficient interprocedural array data-flow analysis for automatic program parallelization. IEEE Transactions on Software Engineering, 26(3):244–261, 2000.
[3] Manish Gupta, Sayak Mukhopadhyay, and Navin Sinha. Automatic parallelization of recursive procedures. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 1999.
[4] OpenMP Architecture Review Board. OpenMP Application Program Interface, Version 2.5, 2005. http://www.openmp.org.

Back to CDP05 Program
JIT Compilation Strategy Overview in J9 JVM
Marius Pirvu, Derek Inglis, and Vijay Sundaresan
Presentation Slides:
[PPT] [PDF]

Modern Java Virtual Machines (JVMs) employ Just-In-Time (JIT) compilers that optimize and generate native code at runtime in order to improve the performance of Java programs. Because compilation is inherently part of the application's running time, minimizing compilation overhead is a major concern in JIT compilation. Short-running programs and the start-up phase of large server or GUI-based applications are examples of scenarios where compilation time is a significant proportion of the overall execution time. To address these problems, JIT developers have a few alternatives: (1) develop cheaper optimization algorithms; (2) restrict compilation activity to a subset of methods that are deemed to be "frequently executed"; (3) attenuate the negative impact of compilation on the overall execution of the program. In this paper we will elaborate on solutions for the latter two options to reduce compilation overhead, solutions implemented in the TR JIT compiler developed for the J9 virtual machine at IBM.

Adaptive JIT compilers, like Sun's HotSpot or the Jikes RVM, start from the assumption that most of an application's time is spent in a small portion of the code. They dynamically identify the set of performance-critical routines and optimize them heavily, while the rest of the code is either interpreted or compiled with minimal optimizations. Both J9 and the previous JVM offered by IBM (codenamed Sovereign) follow this principle. However, from the point of view of compilation strategy, J9 differs from Sovereign in the following respects: (1) it provides higher optimization granularity by offering multiple optimization levels; (2) it monitors program execution continuously and may recompile methods at higher optimization levels. We will describe the compilation infrastructure implemented in the TR JIT compiler and present a comparison among different optimization strategies with respect to compilation time and application performance.
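
The counter-driven promotion scheme common to such adaptive compilers can be sketched as follows (invented thresholds and names, not J9's actual policy):

```python
# Sketch of counter-driven tiered compilation (invented thresholds, not
# J9's actual policy): methods start at a cheap optimization level and are
# recompiled at a higher level once their invocation count shows they are hot.

THRESHOLDS = {0: 10, 1: 100}  # invocations needed to leave levels 0 and 1

class Method:
    def __init__(self, name):
        self.name, self.level, self.count = name, 0, 0

    def invoke(self):
        self.count += 1
        limit = THRESHOLDS.get(self.level)
        if limit is not None and self.count >= limit:
            self.level += 1   # "recompile" at the next optimization level
            self.count = 0

m = Method("hot_loop")
for _ in range(10):
    m.invoke()
print(m.level)  # 1 -- promoted after 10 invocations
for _ in range(100):
    m.invoke()
print(m.level)  # 2 -- promoted again once clearly hot
```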

Some applications have a very flat execution profile with no clear hot spots. A case in point is WebSphere Application Server, where thousands of methods are, more or less, equally important. In these situations the JIT compiler cannot make discriminatory decisions and has to either spend a significant amount of time compiling these methods or lower the optimization level at the expense of runtime performance. To cope with this scenario we have developed a mechanism that selectively lowers the optimization level for some parts of the execution. We will describe this mechanism and show how it enabled us to significantly reduce the start-up time of WebSphere (by up to 33%) without degrading the performance of J2EE applications running on top of it.

To attenuate the negative effects of compilation we have recently implemented an asynchronous compilation mechanism: All compilations are carried out by a separate compilation thread that runs asynchronously to the Java threads. This means that a Java thread will not have to wait for the compiled version of the method, but can instead go ahead and make progress. The major benefit of the asynchronous compilation is that it increases the application's parallelism when running on multiprocessors. In addition to detailing our implementation of asynchronous compilation, we will describe other work done to minimize the impact of a long-running compilation on overall performance. We will also present experimental data showing substantial improvements in start-up of WebSphere (up to 25%) and speedup of short running applications (up to 24%).

Back to CDP05 Program
Speeding Up Floating-Point Division with In-Lined Iterative Algorithms
Robert Enenkel and Allan Martin - IBM Toronto Lab
Presentation Slides:
[PRZ] (Lotus Notes) [PPT] [PDF]

Hardware floating-point division in many modern pipelined processors is one of the most expensive instructions, and may completely occupy the floating-point unit for the duration of the instruction. Alternatively, a software algorithm for floating-point division could be used. Although a software algorithm would have longer latency than the hardware instruction, if multiple independent computations can be interleaved so as to effectively utilize the cycles during pipeline delays, the software algorithm can result in higher throughput. This approach can fill the gap between hardware division and vector subroutine libraries such as IBM MASS (Mathematical Acceleration Subsystem).

This talk describes our implementation of software floating-point division for the PowerPC family of processors in the IBM Fortran and C/C++ compilers. The algorithms are available both as user-callable built-in functions and automatically through the compiler. We first describe the division algorithms used, which are based on hardware table look-up estimates and Newton iteration, with variations for different accuracy and exception-handling requirements. We then explain how the compiler decides whether the software algorithm is likely to be profitable (and hence whether to invoke it automatically), based on a measurement of the amount of independent computation available to interleave with the software division.

Back to CDP05 Program
Auto-SIMDization Challenges
Amy Wang - IBM Toronto Lab
Presentation Slides:
[PPT] [PDF]

Automatic SIMDization is an optimization in TPO that exploits SIMD parallelism for the VMX hardware on the PPC970 by vectorizing loops. This feature was first released in XL C/C++ V7 and XL Fortran V9 for Linux on SLES9 last year. In this talk, we will present some of the challenges we encountered and the solutions being implemented. Specifically, we will address the memory-protection and multi-threading issues that arise from the memory truncation effect of the VMX load and store instructions. We will also discuss the strengths and weaknesses of our current design by looking into the effect of loop distribution on SIMDization. We will then present our attempts to determine the profitability of SIMDization and our novel solution of mixed-mode SIMDization. We will also outline some of the ongoing tuning efforts to better integrate the SIMD technology with existing optimizations. A selection of performance data will be shown to illustrate these efforts. The talk will conclude with our future plans for the technology and for future hardware.

Back to CDP05 Program