State-of-the-art multiprocessor systems pose several difficulties: (i) the user has to parallelize the existing serial code; (ii) explicitly threaded programs using a thread library are not portable; (iii) writing efficient multi-threaded programs requires intimate knowledge of the machine's architecture and micro-architecture. Thus, well-tuned parallelizing compilers are in high demand to leverage the advances of NUMA-based multiprocessors, simultaneous multi-threading processors and chip-multiprocessor systems, in response to the quest for performance from the high-performance computing community. Meanwhile, OpenMP* has emerged as the industry-standard parallel programming model. Applications can be parallelized using OpenMP with less effort and in a way that is portable across a wide range of multiprocessor systems. In this paper, we present several practical compiler optimization techniques and discuss their effect on the performance of OpenMP programs. We elaborate on the major design considerations in a high-performance OpenMP compiler and present experimental data based on the implementation of the optimizations in the Intel® C++ and Fortran compilers. Interactions of the OpenMP transformation with other sequential optimizations in the compiler are discussed. The techniques in this paper have achieved significant performance improvements on the industry-standard SPEC* OMPM2001 and SPEC* OMPL2001 benchmarks, and these performance results are presented for Intel® Pentium® and Itanium® processor-based systems.
INTRODUCTION
The current challenge of computing performance is being tackled with state-of-the-art hardware technologies: Chip-Multiprocessors (CMP) [1, 2], Simultaneous Multithreading processors [3], Hyper-Threading Technology processors [4], larger caches and sophisticated memory hierarchies. However, leveraging these hardware advances requires portable and easy-to-use programming models that allow programmers to exploit the multi-level parallelism inherent in applications using standard high-level languages, together with well-tuned high-performance compilers for efficient threaded-code generation. Approaches based on Windows threads or Pthreads are always possible but require a significant programming effort. Users have to face the complexity of managing the parallelism at the application level, manually dealing with thread creation, workload distribution, allocation of private variables and synchronization.
Furthermore, these approaches lack portability between platforms and operating systems. The OpenMP approach has emerged to deal with those issues. The principle behind the OpenMP programming model is to shift most of the complex tasks of thread management from the user to the compiler, freeing the user to concentrate on expressing parallelism through OpenMP directives. The OpenMP language specification [5, 6] for shared-memory parallel programming has a rich set of features that allows the user to write parallel programs with a modest development effort using directives. These directives are translated by the compiler into multithreaded code that will usually show increased performance on shared-memory multiprocessor systems. Performance can also be gained on recent processors that allow simultaneous execution of multiple threads (e.g. IBM* Power5 [3] or Intel processors with Hyper-Threading Technology [4]). The Intel® C++/Fortran95 compilers support the OpenMP* 2.0 language specification on Windows* and Linux* platforms on the IA-32 and Itanium® Processor Family (IPF) architectures [7, 8].
Several papers discussing OpenMP support in compilers [8, 9, 10, 11, 12, 13] show that various OpenMP implementations are possible.
Practical Compiler Techniques for OpenMP Programs
With the preprocessor approach [10, 11, 12], an OpenMP preprocessor accepts C++/Fortran95 OpenMP programs and translates them to C++/Fortran95 programs (without OpenMP directives) that are subsequently compiled with a native compiler (one that generates machine code). A more integrated approach is to have an internal OpenMP translation phase in the native compiler [8, 9, 13], which eliminates the preprocessor altogether. In most cases an auxiliary runtime library for thread management is used, with the compiler-generated code making many calls to this library. All implementations have to worry about the OpenMP translation phase adversely affecting the other optimization phases in the native compiler. As a simple example, when OpenMP is implemented through a preprocessor, the preprocessor may need to generate calls to the OpenMP runtime library and pass the addresses of variables as parameters in such calls. However, taking the addresses of variables can significantly affect the ability of the compiler to determine accurately which variables are read (or written) at various points in the program. In this paper, we study the interaction between the OpenMP translation phase and the other optimization phases in the compiler, and show the cooperation that is required among all phases to generate optimized code.
The remainder of this paper is organized as follows. Section 2 presents an overview of the Intel® C++/Fortran95 compilers. Section 3 describes several phases implemented in the Intel compiler for generating multithreaded code from OpenMP* directives. Section 4 describes the loop partitioning and scheduling schemes supported in the Intel compiler. Section 5 presents several practical compiler techniques that make the OpenMP parallelizer interact tightly with other optimizations.
In Sections 6 and 7, we discuss the quantitative effect of optimizations on small OpenMP kernels as well as on the industry-standard SPEC* OMPM2001 benchmarks [14, 15] on Intel® Pentium® and Itanium® processor-based systems. Section 8 provides experimental results on multithreading overhead using SPEC OMPM2001 and the multimedia application H.264 encoder. In Section 9, we report the industry-leading performance results of SPEC OMPL2001 on an Itanium® 2 processor-based SGI* Altix* system with 128 processors, obtained with the Intel compilers 8.0 that were tuned with the techniques presented in this paper. Section 10 discusses related work. Finally, concluding remarks can be found in Section 11.
INTEL COMPILER ARCHITECTURE
A high-level overview of the Intel® C++/Fortran95 compiler is shown in Figure 1. The compiler incorporates many well-known and advanced optimization techniques [16, 17, 18, 19]. The optimizations in the compiler can be classified into:
• Code restructuring and inter-procedural optimizations (IPOs).
• OpenMP directive-guided parallelization, automatic parallelization and automatic vectorization.
• High-level optimizations (HLOs) and scalar optimizations including memory optimizations such as loop control and data transformations, partial redundancy elimination (PRE), and partial dead store elimination (PDSE).
• Low-level machine code generation and optimizations such as register allocation, instruction scheduling and software pipelining.
The compiler intermediate representation, IL0, has been extended to express the OpenMP* directives. Implementing the OpenMP translation phase at the IL0 level allows the same implementation to be used across languages (C++/C, Fortran95) and architectures (IA-32 and IPF). The code generated by the Intel® compiler has references to a high-level library API. The library implements this API in terms of the threading functionality provided by the OS; the use of such an API thus allows the compiler's OpenMP translation phase to be independent of the underlying operating system. The compiler architecture therefore allows for one OpenMP implementation that covers differing languages (C++ and Fortran95), differing architectures (IA-32 and IPF) and differing operating systems (Windows* and Linux*).
Compiler FE support. The translation process of the compiler FE for OpenMP code is illustrated in Figure 2, where the worksharing for loop has been lowered into if and goto statements after the IL0 lowering phase. Each OpenMP* pragma has been converted into a pair of IL0 directives (a directive and its matching end directive), which helps the OpenMP parallelizer define the boundaries of the OpenMP constructs. Besides syntax and semantics checking, one of the issues the FE needs to address is finding the implicit attributes of variables that are not explicitly listed in an OpenMP clause. Based on the OpenMP specification, the FE treats a locally declared automatic variable as a private variable of the OpenMP construct that immediately encloses it lexically. The FE also finds the implicit shared variables of the parallel region based on a rule in the OpenMP specification: the default attribute is shared if the default clause is not specified.
THREADED CODE GENERATION
Pre-pass code transformation. The pre-pass transformation converts a parallel sections construct to a parallel for loop, so that the implementation of the parallel sections construct can reuse the implementation of the parallel loop construct. Essentially, a parallel loop is generated whose trip count is the number of sections. Given that the granularity of the sections can differ dramatically, the static or static-even scheduling type may not achieve the best load balance, and hence we decided to use runtime scheduling for such a loop. The scheduling decision is thus deferred until run-time, and a better load balance can be achieved based on the OMP_SCHEDULE environment variable and the OpenMP library. Figure 3 shows a parallel sections code example. The compiler transforms the parallel sections code to a parallel loop and sets the number of iterations to 3, which is equal to the number of sections in the original source code.
The OpenMP constructs are then translated to multithreaded code at the IL0 level. Given the example shown in Figure 4, there is a worksharing for loop in the routine parwork with the dynamic scheduling type. The multithreaded code generation involves three major steps:
• Step 1: Generate a runtime initialization (__kmpc_dispatch_init) routine call to set up the loop scheduling type and to pass the original loop lower-bound, upper-bound, stride and all other necessary information to the runtime system. As shown in Figure 4, the scheduling type is SCH_DYNAMIC, the chunk size is 125, the loop lower-bound is 0, the loop upper-bound is 999 and the loop stride is 1, so the loop is partitioned into 8 chunks in total.
• Step 2: Generate an enclosing while loop to dispatch loop chunks at runtime through the __kmpc_dispatch_next routine supported in the library. During the parallel execution, each thread gets a loop chunk to execute through a __kmpc_dispatch_next call. This runtime call returns ZERO when there is no chunk left to be executed. The runtime uses a first-come-first-serve policy to assign a chunk to a thread.
• Step 3: Generate the threaded loop with thread-specific loop lower-bound, upper-bound and loop control variable 'k', and the privatized stack variable 'x'.
With the MET technology [8], one threaded entry, or T-entry, is created within the parwork() routine for each parallel region. The call __kmpc_fork_call spawns a team of threads to execute the threaded code in parallel. In Figure 4, the entry instruction address (ip address) of the threaded code is _parwork_par_region(. . .). Note that, through the compiler interfaces __kmpc_dispatch_init and __kmpc_dispatch_next, the compiler passes the scheduling type, the chunk size and other necessary loop-related information such as lower-bound, upper-bound and stride to the runtime library for the dynamic, guided and runtime scheduling types. The actual loop partitioning or scheduling is done in the runtime library, so the threaded code can be kept simple by relying on the compiler-to-runtime interfaces. In the next section, we provide more details on our loop partitioning and scheduling schemes.
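The three steps can be illustrated with a small single-threaded sketch in C. It is not the Intel runtime interface: the names dispatch_state, dispatch_init, dispatch_next and run_dispatch_loop are hypothetical stand-ins for the roles played by __kmpc_dispatch_init and __kmpc_dispatch_next, and a real runtime would update the shared state atomically so that several threads can request chunks concurrently.

```c
#include <assert.h>

/* Hypothetical dispatcher state; a real runtime keeps this shared per loop. */
typedef struct {
    long next;    /* first iteration that has not been handed out yet */
    long upper;   /* inclusive loop upper bound */
    long stride;  /* loop stride (assumed positive in this sketch) */
    long chunk;   /* number of iterations per chunk */
} dispatch_state;

/* Step 1: record the loop bounds, stride and chunk size. */
void dispatch_init(dispatch_state *st, long lower, long upper,
                   long stride, long chunk)
{
    assert(stride > 0 && chunk > 0);
    st->next = lower;
    st->upper = upper;
    st->stride = stride;
    st->chunk = chunk;
}

/* Step 2: hand out the next chunk; returns 0 when no chunk is left. */
int dispatch_next(dispatch_state *st, long *lo, long *hi)
{
    if (st->next > st->upper)
        return 0;
    *lo = st->next;
    long last = *lo + (st->chunk - 1) * st->stride;
    *hi = last < st->upper ? last : st->upper;  /* trim the final chunk */
    st->next = *lo + st->chunk * st->stride;
    return 1;
}

/* Step 3: the loop body runs over chunk-specific bounds (stride 1 here).
 * Returns the number of iterations executed and stores the chunk count. */
long run_dispatch_loop(long lower, long upper, long chunk, long *chunks_out)
{
    dispatch_state st;
    long lo, hi, iterations = 0, chunks = 0;

    dispatch_init(&st, lower, upper, 1, chunk);
    while (dispatch_next(&st, &lo, &hi)) {
        for (long k = lo; k <= hi; k++)
            iterations++;               /* placeholder for the loop body */
        chunks++;
    }
    *chunks_out = chunks;
    return iterations;
}
```

With the values quoted for Figure 4 (lower-bound 0, upper-bound 999, chunk size 125), this sketch hands out 8 chunks covering 1000 iterations.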
LOOP PARTITIONING AND SCHEDULING
The Intel® compiler and OpenMP* library support all four loop scheduling types defined in the OpenMP specification. For static scheduling, the chunks are assigned to threads in a round-robin fashion.
In particular, for static scheduling without a specified chunk size, each thread gets at most one chunk. If there are enough iterations, each thread gets exactly one chunk, in order of thread id. For dynamic scheduling, the chunks are handled with the first-come-first-serve scheme, and the default chunk size is 1. Each time a thread requests work, it grabs a number of iterations equal to the chunk size, except possibly for the last chunk. For example, if the chunk size is specified as 7 with the schedule(dynamic, 7) clause and the total number of iterations is 100, then the partition will be 7, 7, 7, 7, 7, 7, . . ., 2 with a total of 15 chunks. For our guided scheduling, the basic idea is to start the execution of a loop by partitioning it into chunks of iterations whose size starts from $\omega/(2N)$ and keeps decreasing until all the iterations are scheduled. The chunks are handled with the first-come-first-serve scheme as well. The formulas that compute chunk size are:
where $N$ is the number of threads, $\omega$ denotes the number of iterations and $\pi_k$ denotes the size of the $k$'th chunk, starting from the 0th chunk. These two formulas are derived from the recurrence $\pi_k = \lfloor \beta_k / (2N) \rfloor$, where $\beta_k$ is the number of remaining unscheduled loop iterations when computing the $k$'th chunk; when $\pi_k$ gets too small, it is clipped to the chunk size $S$ specified in the schedule(guided, S) clause. The default chunk size is 1 if it is not specified in the schedule clause. Hence, for guided scheduling, the way the loop is partitioned depends on the number of threads ($N$), the number of iterations ($\omega$) and the chunk size ($S$).
For example, given a loop with $\omega = 500$, $N = 2$ and $S = 50$, the loop partition is {125, 93, 70, 50, 50, 50, 50, 12}. When $\pi_3$ is smaller than $S$, it gets clipped to $S$. In addition, if the number of remaining unscheduled iterations is smaller than $S$, we trim the upper bound of the last chunk whenever necessary. Table 1 shows different loop partitions with different $N$ and $S$ values for our guided scheduling. Our loop scheduling is slightly different from the guided self-scheduling (GSS) proposed in [20], in which the loop partition is computed with the recurrence $\pi_k = \lceil \beta_k / N \rceil$, which does not use the extra factor of two in the denominator and rounds to the ceiling instead of the floor. In comparison, for the same loop with $\omega = 500$, $N = 2$ and $S = 50$, the loop partition is {250, 125, 63, 50, 12} with GSS. We use the extra factor of 2 in the denominator to exploit slightly finer-grained parallelism, which can produce a better load distribution and balance for applications because the parallel overhead is very low on Intel architectures (see the comparison in Section 10).
Runtime scheduling is not a scheduling scheme per se. It simply selects the scheme defined by the OMP_SCHEDULE environment variable, which is set to static by default.
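The partitioning rules above can be tried out with a short C sketch. The function names and the factor and round_up parameters are hypothetical and chosen only for illustration; the clipping rule (raise any chunk smaller than S to S, then trim the last chunk) is a literal reading of the text, not the library's actual code. With factor = 1 and ceiling rounding the sketch reproduces the GSS partition quoted above; with factor = 2 and floor rounding it reproduces the first three chunks 125, 93, 70. The later chunks depend on exactly when the runtime switches to the fixed chunk size S, which the text does not fully specify, so they are not claimed to match Table 1.

```c
#include <assert.h>

/* Dynamic scheduling: fixed-size chunks, the last one holds the remainder.
 * Writes chunk sizes to out[] (capacity cap) and returns the chunk count. */
int dynamic_partition(long omega, long s, long *out, int cap)
{
    assert(omega >= 0 && s > 0 && cap > 0);
    int n = 0;
    for (long left = omega; left > 0 && n < cap; n++) {
        long size = left < s ? left : s;
        out[n] = size;
        left -= size;
    }
    return n;
}

/* Guided-style partition: each chunk is remaining / (factor * nthreads),
 * rounded up or down, clipped to at least s and trimmed to what is left.
 * factor = 1, round_up = 1 corresponds to the GSS recurrence in the text;
 * factor = 2, round_up = 0 corresponds to the recurrence with 2N. */
int guided_partition(long omega, int nthreads, long s,
                     int factor, int round_up, long *out, int cap)
{
    assert(omega >= 0 && nthreads > 0 && s > 0 && factor > 0 && cap > 0);
    long denom = (long)factor * nthreads;
    int n = 0;
    for (long left = omega; left > 0 && n < cap; n++) {
        long size = round_up ? (left + denom - 1) / denom : left / denom;
        if (size < s)
            size = s;       /* clip small chunks up to S */
        if (size > left)
            size = left;    /* trim the final chunk */
        out[n] = size;
        left -= size;
    }
    return n;
}
```

For instance, dynamic_partition(100, 7, ...) yields 15 chunks with a final chunk of 2 iterations, and guided_partition(500, 2, 50, 1, 1, ...) yields the five GSS chunks 250, 125, 63, 50, 12.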
ENABLING ADVANCED OPTIMIZATIONS
The OpenMP* implementation in the Intel® compiler strives to generate threaded code that gains speedup over the optimized serial code. It does so by integrating parallelization tightly with IPO, scalar and loop optimizations such as auto-vectorization [17], and memory optimizations [8, 18, 19], in order to achieve better cache locality, exploit dual-level parallelism and minimize the overhead of data sharing among threads. This section describes techniques for generating efficient threaded code.
Effective ordering of optimization phases
The order of optimization phases in the compiler is critical for achieving optimal performance. It is difficult to architect an effective ordering that achieves speedups over well-optimized serial code through OpenMP parallelization if significant sequential optimizations are adversely affected by the parallelization transformations. In the Intel compiler we
• fully leverage classical peephole optimizations within basic blocks, and perform inlining, OpenMP construct-aware constant propagation and memory disambiguation before parallelization and multithreaded code generation;
• perform HLOs such as loop tiling, loop unrolling, loop distribution, loop fusion, vectorization, software prefetching, scalar replacement and complex type lowering after parallelization and multithreaded code generation;
• enable advanced optimizations such as PRE, PDSE and dead code elimination (DCE) after HLO.
This ordering is obtained following some general principles:
• For most optimizations, the parallel semantics inherent in the OpenMP program are difficult to handle. Such optimizations are run after the OpenMP translation phase.
• Some optimizations (e.g. limited constant propagation) enable the OpenMP translation pass to generate better code, hence they are run before the OpenMP translation phase.
• Some collection of information (e.g. memory disambiguation information explained in Section 5.4) is best done before the OpenMP translation phase. Otherwise, the IL gets too complicated to be analyzed effectively.
The Computer Journal Vol. 48 No. 5, 2005
Reducing side-effects of privatization
Privatization is one of the key components when generating threaded code. Privatizing a local stack variable, static variable or global variable is straightforward: the compiler can simply create its clone on the stack. Some Fortran95 arrays (unknown-size, assumed-size and assumed-shape) can be allocated on the stack or on the heap. Heap allocation is sometimes preferred for large objects in the sequential case, as stack space is limited. However, in parallel programs heap allocation may cause performance slowdowns, as memory allocation routines are usually executed in a critical section guarded by a locking mechanism. Our solution takes advantage of the proper nesting structure of OpenMP directives to limit the lifetime of such allocations by judiciously allocating and freeing stack space in a LIFO (last in, first out) manner. The OpenMP* transformation pass generates stack allocation and free intrinsics, _vla_alloc(size) and _vla_free(p, size). The _vla_alloc and _vla_free intrinsics are lowered to stack adjustment instructions in the machine code generation phase. This is an efficient scheme for privatizing an object for each thread.
In the accompanying example (subroutine foo), the size of the array 'arr' is unknown at compile time, so the compiler FE creates a dope-vector, an array descriptor that holds the array shape, array size, base address, stride and array bounds information at runtime. During the privatization phase, the compiler clones the original dope-vector, creating dv_clone_arr, and substitutes the original memory reference 'arr' with dv_clone_arr_baseaddr for each thread. Here dv_clone_arr_baseaddr denotes the base address of the privatized array 'arr' and dv_clone_arr_size denotes its size. Essentially, the privatization of the array 'arr' is done by allocating memory on the stack of each thread: the intrinsics _vla_alloc and _vla_free adjust each thread's stack pointer to allocate and release the space.
subroutine foo(arr, n)
integer arr(n)
!$omp parallel private(arr, k)
do k=1, 100
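The idea of privatizing a run-time-sized array on each thread's stack can be sketched in C, using a C99 variable-length array as a stand-in for the _vla_alloc and _vla_free intrinsics (an analogy only; the intrinsics themselves are internal to the compiler). The function name privatize_on_stack is hypothetical. Built without OpenMP support, the pragma is ignored and the block simply runs on one thread.

```c
#include <assert.h>

/* Each thread gets its own copy of a run-time-sized array on its stack.
 * Returns 1 if every thread computed the expected sum over its copy. */
int privatize_on_stack(int n)
{
    assert(n > 0);
    int ok = 1;

    #pragma omp parallel reduction(&& : ok)
    {
        int priv[n];   /* VLA: space comes from this thread's own stack */
        long sum = 0;

        for (int k = 0; k < n; k++)
            priv[k] = k;
        for (int k = 0; k < n; k++)
            sum += priv[k];

        ok = ok && (sum == (long)n * (n - 1) / 2);
    }   /* priv is released here, in LIFO order with the enclosing frame */

    return ok;
}
```

Because the array lives in the block scope of the parallel region, no heap allocator (and therefore no allocator lock) is involved, which is the property the scheme in the text relies on.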
Annotating threaded IL
The OpenMP translation pass transforms the compiler IL dramatically. In sample code (a), an array 'a' in the common block 'ccc' is marked as threadprivate. The compiler generates a call to 'kmpc_threadprivate_cached' to allocate thread-local storage for each thread, with the checks necessary to guarantee that there is no duplicate allocation, based on the thread id and the address (&a) of the array 'a'. The reference to the original array 'a' is substituted with the thread-local-storage base pointer a_tpv_ptr. Since a normal static allocation has been changed to a dynamic allocation, it is difficult for optimizations that run after the threaded-code generation phase to perform alias analysis, and the absence of accurate alias analysis can disable many sequential optimizations. Our solution is for the threaded-code generation phase to propagate the original attributes (e.g. address_taken, no_pointer_aliasing) of variable 'a' to 'a_tpv_ptr', and to annotate the kmpc_threadprivate_cached call statement. The compiler then knows that kmpc_threadprivate_cached does not create aliasing between a_tpv_ptr and other user-defined arrays, provided 'a' is not already aliased with others. The annotation also tells other optimizations that there is no aliasing between 'a' and 'a_tpv_ptr', as the thread-local storage allocated by the call is disjoint from the original 'a'. Proper representation and propagation of such information is necessary to avoid disabling optimizations that happen later, such as loop distribution, loop tiling and software pipelining.
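At the source level, the property the annotation preserves (each thread's copy of a threadprivate array is disjoint from every other copy) can be observed with a small C sketch. A file-scope static array stands in for the Fortran common block of the text, and the names N, my_id and threadprivate_copies_are_disjoint are hypothetical. Built without OpenMP support, the pragmas are ignored and the code runs on a single thread.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define N 100

/* One copy of 'a' per thread once the directive is honoured. */
static int a[N];
#pragma omp threadprivate(a)

/* Thread id, or 0 when built without OpenMP support. */
static int my_id(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* Every thread fills its own copy of 'a' with its id, waits for the
 * others, then checks that nobody else overwrote its copy.
 * Returns 1 if all copies stayed disjoint. */
int threadprivate_copies_are_disjoint(void)
{
    int ok = 1;

    #pragma omp parallel reduction(&& : ok)
    {
        int id = my_id();

        for (int k = 0; k < N; k++)
            a[k] = id;

        #pragma omp barrier

        for (int k = 0; k < N; k++)
            ok = ok && (a[k] == id);
    }
    return ok;
}
```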
Leveraging memory disambiguation
Memory disambiguation is a technique for removing spurious data dependencies or pointer aliasing in programs that would otherwise limit compiler optimizations. The disambiguator retains a certain amount of high-level information about memory locations. Many of the optimizations that rely on memory disambiguation occur in the compiler back-end. Typically, after the program representation is lowered and optimizations are performed, much of the source-level information is lost and the code is transformed in ways that make it more difficult for the compiler to perform memory disambiguation.
To address this issue, the Intel® compiler maintains a link from each load/store to a high-level symbolic representation of the memory reference and to other information that is crucial for memory disambiguation. We named our memory disambiguator DISAM, which stands for DISambiguation using Abstract Memory locations [18]. As mentioned in Section 5.1, the memory disambiguation phase in the Intel® compiler is invoked before the OpenMP* translation phase. Essentially, DISAM tokens are created early in the compiler, when the high-level information is still available. Each memory reference in the IL is linked to its symbolic representation through a DISAM token. DISAM tokens are part of the memory referencing IL, and as such are automatically carried along whenever a memory reference is moved or copied. The DISAM token provides access to all the information necessary to conduct memory disambiguation. This information includes the location (LOC) set that represents the memory reference, type information, and a link to an array data dependence graph for disambiguating different elements of the same array [18].
In the simple kernel taken from a large real application, the arrays 'a', 'b', 'c' and 'd' are members of the common blocks 'ccc' and 'eee'. Optimizations that rely on DISAM information, such as loop distribution and software pipelining, are disabled if the threaded-code generation phase does not preserve the DISAM token information in the new array referencing (i.e. base + offset) expressions of the threaded code, e.g. *(P32 *)(tpv_ccc_base+0)(k). Since tpv_ccc_base and tpv_eee_base represent the base addresses of the threadprivate memory chunks of ccc and eee, which are allocated at runtime, it would be hard for the compiler to determine whether they point to distinct memory.
By preserving the DISAM token for each expression during the OpenMP translation pass, other optimizations know that there is no memory overlap among those memory references *(P32 *)(tpv_ccc_base+0)(k), *(P32 *)(tpv_ccc_base+400)(k), *(P32 *)(tpv_ccc_base+800)(k) and *(P32 *)(tpv_eee_base+0)(k) by simply querying DISAM information.
Exploiting vector- and thread-level parallelism
The Intel® Pentium® 4 processor features the streaming SIMD extensions (SSE, SSE2 and SSE3) that support floating-point operations on 4 packed single-precision and 2 packed double-precision floating-point numbers, as well as integer operations on 16 packed bytes, 8 packed words (16 bits), 4 packed dwords (32 bits) and 2 packed qwords (64 bits). The Intel compiler supports the automatic conversion of serial loops into SIMD form, a transformation that is referred to as intra-register vectorization [16, 17]. Combining intra-register vectorization with parallelization for hyper- or multi-threading enables the exploitation of dual-level parallelism, i.e. using the different forms of parallelism that are present in a code fragment to obtain high performance. As an example, consider the accompanying code for matrix-vector multiplication. In this example, parallelism appears at multiple levels. The iterations of the outermost k-loop may execute independently, as has been made explicit with an OpenMP pragma. The reduction performed in the innermost j-loop provides yet another level of parallelism. This loop can be implemented by accumulating partial sums in SIMD style, followed by code that constructs the final sum. The techniques presented in Section 5.3 ensure that the inner j-loop can be vectorized after the OpenMP translation phase, as the compiler annotates the IL of the references to arrays 'a' and 'y' so that their original address-taken attributes and expression structure are passed on to the vectorization phase. Thus, we do not lose the opportunity to vectorize the inner loop.
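A matrix-vector multiplication with this loop structure might look like the following C sketch. It is not the paper's listing: the function name matvec, the row-major layout and the input vector name x are assumptions made for illustration, while the k and j loop indices and the arrays a and y follow the description in the text. The outer loop carries the OpenMP pragma; the inner loop is a plain sum reduction that a vectorizing compiler may turn into SIMD partial sums.

```c
#include <assert.h>

/* y = a * x for an n-by-n matrix stored in row-major order. */
void matvec(int n, const float *a, const float *x, float *y)
{
    assert(n > 0);

    /* Thread-level parallelism: rows are independent of each other. */
    #pragma omp parallel for
    for (int k = 0; k < n; k++) {
        float sum = 0.0f;

        /* Candidate for intra-register vectorization: a sum reduction. */
        for (int j = 0; j < n; j++)
            sum += a[k * n + j] * x[j];

        y[k] = sum;
    }
}
```

Accumulating into a local scalar rather than directly into y[k] keeps the reduction free of stores to shared memory inside the inner loop, which tends to make the loop easier for a compiler to vectorize.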
The performance of the vectorized inner loop depends on the detection of the alignment of the array references in the loop. The detection of the alignment depends on the base address of the array, the subscripts and the starting iteration loop index. Due to the nature of the transformations in the OpenMP translation phase, it is hard, and sometimes impossible, for the compiler to make the same alignment determination as in the sequential case. If the alignment of memory references cannot be determined at compile-time, the Intel compiler has at its disposal several alignment optimizations (such as run-time loop peeling) to avoid performance penalties that are usually associated with unaligned memory accesses. These optimizations, therefore, prove very useful for OpenMP programs. Dynamic data dependence testing is used to allow the compiler to proceed with vectorization in situations where analysis has failed to prove independence statically. These advanced techniques (and others) have been discussed in detail in previous work [8, 16, 17] .
EFFECT OF EXPLOITING DUAL-LEVEL PARALLELISM
In this section, we show some performance results for the matrix-vector multiplication kernel discussed in Section 5.5 on a dual-CPU Intel® Xeon™ processor system with Hyper-Threading Technology enabled, running at 1.5 GHz with 512 MB of memory, an 8K L1-Cache and a 256K L2-Cache. Figure 5 shows speed-ups (relative to serial execution) for varying matrix sizes for vector execution (VEC), multithreaded execution using two threads (OMP2) and four threads (OMP4), and vector-multithreaded execution using two and four threads, (OMP2+VEC) and (OMP4+VEC), respectively. For the datasets that completely fit in cache, the kernel is compute bound. In these cases, intra-register vectorization alone obtains a speed-up of up to 2×. For the larger datasets, where the kernel becomes more memory bound, the improvements from intra-register vectorization alone become less evident. As we expected, the overhead associated with multithreading causes a slight slowdown for the small matrix size 32 × 32. For the larger matrices ranging from 64 × 64 to 256 × 256, the relative overhead introduced by parallelization becomes small and the observed speed-up ranges from 1.4× to 5.8×. The difference between (OMP2) and (OMP4) for matrix size 200 × 200 reveals a 1.6× performance gain due to the exploitation of extra thread-level parallelism that leverages the Hyper-Threading Technology. The best performance gains are obtained when both levels of parallelism (SIMD parallelism and thread-level parallelism due to Hyper-Threading Technology) are exploited simultaneously, yielding a speed-up of up to 5.8× with four threads (OMP4+VEC) and a speed-up of 5.1× with two threads (OMP2+VEC).
EFFECT OF OPTIMIZATIONS ON SPEC OMPM2001 PERFORMANCE
The SPEC* OMPM2001 suite consists of a set of OpenMP*-based application programs [14, 15]. The input datasets of the SPEC OMPM2001 suite (also referred to as the medium suite) are derived from state-of-the-art computations on modern medium-scale (4- to 16-way) shared-memory multiprocessor systems. This benchmark suite consists of 11 large application programs that represent the type of software used in scientific technical computing. Most industry compiler and multiprocessor system vendors use the SPEC OMP suite for performance measurement. Table 2 provides an overview of SPEC* OMPM2001. These benchmarks require a virtual address space of ∼2 GB to run. The datasets are significantly larger than those of the SPEC CPU2000 benchmarks, while still fitting in a 32-bit address space. In the next sub-sections, we study the SPEC OMPM2001 performance gain due to Hyper-Threading Technology and processor-specific optimizations.
Effect of Hyper-Threading Technology
This performance study of the SPEC* OMPM2001 benchmarks is conducted on a single-processor system with a Hyper-Threading Technology enabled Intel® Pentium® 4 processor built with 90 nm technology, running at 2.8 GHz with 2 GB of memory, an 8K L1-Cache and a 1M L2-Cache. For our performance measurement, all SPEC OMPM2001 benchmarks are compiled by the Intel® 8.0 C++/Fortran compilers under Windows* XP with the option set of our base performance run: -Qopenmp -Qipo -O3 -QxP (OMP w/ QxP). QxP enables the compiler to generate the SSE3 instructions available on the more recent Pentium 4 processors.
The normalized performance speedup of the SPEC* OMPM2001 benchmarks is shown in Figure 6, which demonstrates the performance gain attributed to Hyper-Threading Technology. The Hyper-Threading performance scaling is derived from the baseline performance of the single-thread binary (OMP 1T w/ QxP) and the two-thread execution (OMP 2T w/ QxP). As we see, Hyper-Threading Technology enabled the Intel® Pentium® 4 processor to achieve a performance improvement of 4.3 to 28.3% (OMP 2T w/ QxP) on 9 out of 11 benchmarks; the exceptions are 316.applu_m (0.0%) and 312.swim_m (−7.4%). The 312.swim_m slowdown under two-thread execution is due to 312.swim_m being a memory-bandwidth-bound application. Overall, the improvement in GEOMEAN with OMP 2T w/ QxP is 9.1% due to Hyper-Threading. Considering that Hyper-Threading Technology does not add extra hardware execution (engine) resources, the gain of 9.1% illustrates that almost all sequential optimizations also trigger in the OpenMP threaded code.
Effect of compiler optimizations
In this section, we examine the effect of compiler optimizations on the multithreaded code generated for the SPEC* OMPM2001 application programs with different optimization sets. Ideally, given a fixed number of threads or processors (we used 4 threads in our performance study), it would be more interesting to study the effect of each compiler optimization one by one to demonstrate its effectiveness in improving performance. However, given the complexity of the interactions among compiler optimizations, we decided to study the effect of only a few compiler optimization levels. In our performance study, all SPEC OMPM2001 benchmarks are compiled by the latest Intel® 8.0 C++/Fortran compilers with four sets of base options: (i) -openmp -O2 (used as the baseline); (ii) -openmp -O2 -ipo; (iii) -openmp -O3; and (iv) -openmp -O3 -ipo. The experiments were done on a 4-way 1.5 GHz Itanium 2 based system with 6 MB L3 cache. Figure 7 illustrates the performance gain with the different higher-level optimizations versus the performance measured at the default optimization level -openmp -O2.
The O2 level optimization includes many traditional optimizations such as peephole optimization, constant propagation, copy propagation, DCE, PDSE, PRE etc.; the O3 level optimization includes advanced loop transformations (loop tiling, loop fusion, loop distribution etc.), scalar replacement, software prefetching, array contraction etc.; the IPO flag enables inter-procedural (IP) optimizations such as function inlining, IP mod-ref analysis etc. Figure 7 provides the performance results at the different optimization levels. As is evident from the graph, the performance gain from OMP+O2 to OMP+O2+IPO is 3% on Geomean, which is relatively small. This is because many advanced optimizations that can exploit IP information such as mod-ref analysis are run only at O3. OMP+O3 shows a 22% performance gain over OMP+O2 and a 19% performance gain over OMP+O2+IPO. This result reveals that the HLOs are effectively enabled for the multithreaded code generated for OpenMP programs.
As we expected, the best performance is achieved with OMP+O3+IPO: Figure 7 shows a 31% performance gain over the baseline (OMP+O2). The benefit of combining the two is visible in 310.wupwise_m, whose performance is dominated by a few hot loops with unknown trip counts; at OMP+O3 the compiler has to be conservative without knowing those trip counts. In this case, OMP+O3 actually causes a 16% slowdown on 310.wupwise_m. However, adding IPO provides a 49% performance gain, because the trip counts become known through IPO constant propagation. Overall, 10 out of 11 benchmarks in the SPEC* OMPM2001 benchmark suite achieved a performance gain ranging from 7% to 98% with OMP+O3+IPO. The anomaly is 332.ammp_m, which shows a slowdown at OMP+O3+IPO. Our initial analysis shows that lock contention is increased due to a data placement issue for a lock. This needs to be investigated further.
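The trip-count effect described for 310.wupwise_m can be illustrated with a minimal C sketch. The benchmark itself is not reproduced here: the names `scale_rows` and `run` and the array size 4 are hypothetical, chosen only to show a loop whose trip count is unknown inside the callee but is a literal at its only call site.

```c
/* Hypothetical kernel: the trip count n is a parameter, so a compiler
   that sees only this function cannot know how many iterations run. */
void scale_rows(double *a, int n, double s)
{
    int i;
#pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] *= s;
}

/* The only caller passes a literal trip count. With inter-procedural
   constant propagation, a compiler may treat n as 4 inside scale_rows
   and make loop-transformation decisions with that knowledge. */
double run(void)
{
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    double sum = 0.0;
    int i;

    scale_rows(a, 4, 2.0);
    for (i = 0; i < 4; i++)
        sum += a[i];
    return sum;
}
```

Without whole-program information the loop in `scale_rows` has to be handled for arbitrary `n`; whether a particular compiler actually specializes it depends on its heuristics, so this is only a sketch of the situation, not of the Intel compiler's behavior.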
MULTITHREADING OVERHEAD
To assess the efficiency of the multithreaded code generated by a compiler for OpenMP programs, it is useful to compare the performance of the serial code with the performance of the multithreaded code running with a single thread for the same OpenMP program. The difference indicates how much multithreading overhead is introduced by the compiler and the multithreading runtime libraries. Normally, near-zero overhead is desired. In this section, we show the performance results of the SPEC* OMPM2001 benchmarks with -openmp ON and with -openmp OFF, running with a single thread, to study the efficiency of the generated multithreaded code. In our performance study, all SPEC OMPM2001 benchmarks are compiled by the latest Intel® 8.0 C++/Fortran compilers, and the experiments were done on a 32-way 1.5 GHz Itanium® 2 based system with 6 MB L3 cache. Figure 8 illustrates the serial code performance with -openmp OFF versus the single-thread performance with -openmp ON for all benchmarks in the SPEC OMPM2001 suite.
As shown in Figure 8 318.galgel_m, 324.apsi_m) out of 11 benchmarks achieving 93.15-93.9% of their serial execution. The 310.wupwise_m benchmark achieved 88.53% of its serial execution performance, mainly because less aggressive inlining was performed in order to reduce resource contention and scale better on systems with large CPU counts. The 326.gafort_m benchmark achieved 71.93% of its serial execution performance, mainly because a few more F90 temporary array copies were introduced to remove the assumed data dependency; this issue has been addressed in our new compiler release. The 332.ammp_m benchmark achieved 84.71% of its serial execution performance. The reason is that about 6 million calls to the omp_init_lock, omp_set_lock and omp_unset_lock routines are made while executing 332.ammp_m. The serial code, however, was linked with our OpenMP stub library as defined in the OpenMP standard 2.0, which means that the serial execution does not actually set and unset locks at all. Therefore, the 15.39% cost of executing 6 million calls in the single-thread execution of 332.ammp_m is considered very reasonable, and a low cost for this special application. The geomean measurement in Figure 8 shows that the single-thread execution of SPEC OMPM2001 with OMP ON achieved 94% of its serial execution performance. Note that 328.fma3d_m achieved 111.83% of its serial execution performance, which is simply due to an aggressive loop-invariant code motion being enabled when OpenMP was enabled; this optimization will be applied to serial code as well. Furthermore, we studied and presented the overhead of multithreaded execution of a multimedia application, an H.264 encoder parallelized with OpenMP and compiled with the Intel compiler, on a 4-way system with Hyper-Threading Technology [21].
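The 332.ammp_m comparison rests on the difference between real lock routines and stub routines that do nothing. The C sketch below imitates that difference under stated assumptions: the no-op definitions are hypothetical stand-ins written for this example (they are not the Intel stub library), and `locked_count` is an invented function, not code from the benchmark.

```c
#ifdef _OPENMP
#include <omp.h>   /* real lock routines supplied by the OpenMP runtime */
#else
/* Hypothetical stand-ins that mimic a stub library: the same names as
   the OpenMP lock routines, but every call returns without doing work. */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *l)    { (void)l; }
static void omp_set_lock(omp_lock_t *l)     { (void)l; }
static void omp_unset_lock(omp_lock_t *l)   { (void)l; }
static void omp_destroy_lock(omp_lock_t *l) { (void)l; }
#endif

/* Adds 1 to a shared counter n times, guarding each update with a lock.
   In an OpenMP build every iteration pays for a real set/unset pair,
   even with one thread; in a serial build the calls cost almost nothing. */
long locked_count(long n)
{
    omp_lock_t lock;
    long total = 0;
    long i;

    omp_init_lock(&lock);
#pragma omp parallel for
    for (i = 0; i < n; i++) {
        omp_set_lock(&lock);
        total += 1;
        omp_unset_lock(&lock);
    }
    omp_destroy_lock(&lock);
    return total;
}
```

Both builds return the same value; only the cost of the lock calls differs, which is the effect the single-thread measurement of 332.ammp_m exposes.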
H.264 is an emerging video coding standard proposed by the Joint Video Team. The performance results showed that the multithreaded H.264 encoders achieved a speedup ranging from 1.9× to 2.01× on 2 processors, from 3.61× to 3.99× on 4 processors with Hyper-Threading disabled, and from 3.97× to 4.69× on 4 processors with Hyper-Threading enabled, for five different input video sequences [21]. The single-thread performance of the H.264 encoder achieved 99.2% of its well-optimized serial execution performance. In Figure 9, QPHT denotes a quad-processor system with HT enabled, QP denotes quad-processor, DP denotes dual-processor, and UP denotes uniprocessor. When measuring UP performance with the H.264 encoder, only one processor with HT OFF was enabled in the system, and all created threads shared that single CPU. In this way, we can measure the performance variation as the number of threads changes. For the DP performance measurement, only two HT-disabled processors were enabled in the system. For the QP performance measurement, HT was disabled. The full system configuration is QP with HT enabled (QPHT). As shown in Figure 9, the H.264 encoder speedup remains nearly flat, varying only slightly, once the number of threads exceeds the number of physical or logical processors in each system configuration. This indicates that the overhead due to threading is minor. For the UP configuration, the execution efficiency of the threaded H.264 encoder relative to the serial code ranges from 97.1% to 99.2% as the number of threads changes from 1 to 17. In other words, the multithreaded code generated by the Intel compiler exploits parallelism effectively, and the overhead of the multithreaded code is small.
SPEC OMPL2001 PERFORMANCE RESULTS
The SPEC* OMPL2001 benchmark suite shares most of its application code base with the SPEC OMPM2001 suite, and consists of 9 application programs from SPEC OMPM2001. However, the code and the datasets are modified to achieve better scaling and to reflect the class of computation regularly performed on large-scale systems (32-way and larger). Figure 10 shows SPEC OMPL2001 performance results measured on an SGI* Altix* 3700 Bx2 (using Intel® 1600 MHz Itanium® 2 processors) with 256 KB L2 cache, 6 MB L3 cache, and 256 GB memory (32*512 MB PC2700 DIMMS per 8 core module). The performance results are measured on 64- and 128-CPU system configurations, using the Intel® 8.1 C++ and Fortran95 compilers with OpenMP* support that incorporate the techniques and solutions discussed in this paper. With the 64-CPU system configuration, SPEC OMPL2001 achieved a SPEC base ratio of 507,602. With the 128-CPU system configuration, SPEC OMPL2001 achieved a SPEC base ratio of 750,848. At the time of writing this paper (November 2004), these results were the best published SPEC* OMPL2001 results on www.spec.org. Both 315.mgrid_l and 321.equake_l are sparse matrix calculations and are known not to scale well beyond 64 processors.
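As a rough reading of these two figures, and assuming the base ratios are the single numbers 507602 and 750848, the gain from doubling the CPU count and the corresponding scaling efficiency can be worked out as follows. This is only arithmetic on the reported ratios, not an additional measurement.

```latex
\[
  \frac{R_{128}}{R_{64}} = \frac{750848}{507602} \approx 1.48,
  \qquad
  \frac{R_{128}/R_{64}}{128/64} \approx \frac{1.48}{2} \approx 0.74 .
\]
```

In other words, going from 64 to 128 CPUs raises the base ratio by roughly 48%, about three quarters of the ideal doubling, which is consistent with some applications in the suite not scaling well beyond 64 processors.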
RELATED WORK
Many researchers and compiler engineers have been studying and working on the design and implementation of parallelizing compilers and runtime libraries to support the OpenMP* programming model [8, 9, 10, 11, 12, 13] . Within the research community, almost all researchers take the source-to-source translation approach to generate multithreaded source code.
For example, the OpenMP preprocessors presented in [10, 11, 12] accept C/C++ OpenMP programs and translate them into C/C++ programs (without OpenMP directives) that are subsequently compiled with a native compiler (one that generates machine code). Therefore, in general, the preprocessor approach does not touch the phase ordering and internal optimization interaction issues we have addressed and discussed in this paper. For most commercial product compilers, the vendors use a more integrated approach, with an internal OpenMP translation phase in the native compiler [8, 9, 13] that eliminates the preprocessor. Because most commercial compiler vendors do not publish their compiler techniques in much detail, and every commercial product compiler has its own target architecture, operating system etc., it is very difficult to conduct a fair apples-to-apples comparison. Myungho Lee et al. presented the SUN* One Studio 8 (S1S8) compiler at a very high level for SunFire 6800 and Sunfire 15000 systems in [9]. J.-H. Chow et al. described the OpenMP support in the IBM* compiler [13]. However, neither paper discussed the issues we have discussed in this paper. The SPEC* OMPL2001 results reported in [21] are 158531 as the SPEC base ratio and 195619 as the SPEC peak ratio on a 64 processor SunFire 1500 system. Both results are well below the 64-processor performance of 507602 (SPEC base ratio) for SPEC OMPL2001 that we reported, in Section 9, on a 1600 MHz Itanium® 2 processor-based SGI* 3700 Bx2 system with the Intel® compilers.
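To make the source-to-source approach concrete, the following C sketch shows the general shape of what a preprocessor might emit for a simple parallel loop: the loop body is outlined into a function that takes a thread id, and the original function is reduced to a fork/join call. All names here (`loop_args`, `outlined_loop`, `fake_fork_join`, `scale`) are hypothetical, and `fake_fork_join` simply runs the "threads" one after another so the example stays self-contained; a real translator would emit calls into a thread library or its own runtime instead.

```c
#include <stddef.h>

/* Variables shared by all threads, packed into one struct. */
struct loop_args { double *a; const double *b; size_t n; };

/* Outlined body of:
       #pragma omp parallel for
       for (i = 0; i < n; i++) a[i] = 2.0 * b[i];
   Each thread handles one contiguous block of iterations. */
static void outlined_loop(int tid, int nthreads, void *p)
{
    struct loop_args *args = p;
    size_t chunk = (args->n + (size_t)nthreads - 1) / (size_t)nthreads;
    size_t lo = (size_t)tid * chunk;
    size_t hi = lo + chunk < args->n ? lo + chunk : args->n;
    size_t i;

    for (i = lo; i < hi; i++)
        args->a[i] = 2.0 * args->b[i];
}

/* Hypothetical stand-in for a runtime fork/join entry point.
   It calls the outlined function once per thread id, sequentially. */
static void fake_fork_join(void (*fn)(int, int, void *), int nthreads, void *p)
{
    int tid;
    for (tid = 0; tid < nthreads; tid++)
        fn(tid, nthreads, p);
}

/* The original function after translation: no directives remain. */
void scale(double *a, const double *b, size_t n, int nthreads)
{
    struct loop_args args = { a, b, n };
    fake_fork_join(outlined_loop, nthreads, &args);
}
```

Because the native compiler only ever sees the translated form, with the loop hidden behind a function pointer and a struct of arguments, it has less information for its own loop and inter-procedural optimizations than a compiler that performs the OpenMP translation internally.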
Furthermore, all Intel compilers are commercial product compilers available to industry and academia. Many researchers and compiler engineers have been using Intel compilers for compiling their applications and conducting performance studies. One of the most recent performance studies was carried out by Sven Karlsson [22]. Karlsson compared the Intel® compiler version 8.0 with the OdinMP research compiler version 0.284.1 by compiling the EPCC micro benchmark suite [23] using the highest available optimization level of both compilers, and reported the overheads in microseconds for the most common OpenMP constructs. As shown in Table 3, while the overheads of the synchronization primitives of OdinMP+Balder are as low as those of the Intel compiler, the overheads of OdinMP+Balder for the PARALLEL region construct and worksharing FOR loops are higher than those of the Intel compiler, even after the Balder runtime library is tuned with architecture support. Karlsson attributed this to the way the OdinMP compiler generates code and the way those parallel regions are handled by the runtime library. In contrast to the OpenMP programming model for shared-memory multiprocessor systems, MPI [24] has been used in a wide range of applications and problem sizes for parallel programming on clusters of SMPs or distributed-memory multiprocessor systems, although application engineers have to work hard. The MPI approach is a library-based method, similar to the Pthreads approach, but it requires no special support from the compiler for generating multithreaded code. Many studies have explored the MPI/OpenMP mixed mode [25], which could potentially offer an effective parallelization strategy for SMP clusters by leveraging the different characteristics of both paradigms to achieve the best performance.
From another perspective, the application domains using OpenMP have been steadily increasing. For example, the thread-safe grid RPC facility OmniRPC for cluster and global computing was developed using OpenMP. In [26], OpenMP is proposed as an easy-to-use programming environment for the multithreaded client of OmniRPC. Extending and supporting OpenMP on a cluster of SMP nodes for grid computing has been an important research topic. In the Intel compiler, we have added a simple extension to OpenMP that lets programmers specify all memory references that are sharable among SMP nodes. In addition, the compiler and runtime library are being tuned for the emerging dual-core and multi-core (CMP) Intel architectures by leveraging CMP architecture features. Details of the extension and tuning techniques will be reported in the future.
CONCLUDING REMARKS
Exploiting thread-level parallelism for multithreaded processors and multiprocessor systems adds one more dimension of difficulty to compiler development and tuning for generating and optimizing threaded code. We tackled this performance challenge in our OpenMP* design and implementation by developing technologies that produce well-defined and annotated IL to ensure that classical optimizations are enabled seamlessly for Intel® Pentium® and Itanium® processor based systems. In this paper, we studied the interaction between other optimization phases in the compiler and the OpenMP translation phase, and showed the cooperation that is required between the two to generate optimized threaded code. The main contributions of this paper are:
• A number of practical compiler techniques and solutions are proposed and discussed to ensure the generation of efficient threaded code for OpenMP programs while interacting with other compiler optimization phases. The implementation of these techniques in the Intel® compilers is discussed. Although the compiler techniques presented and discussed in this paper are in the context of the Intel compiler framework, the ideas are generally applicable to other compilers.
• The effect of compiler optimizations on OpenMP programs is studied experimentally using the two industry standard benchmark suites SPEC OMPM2001 and SPEC OMPL2001, a small kernel program, and an H.264 encoder (OMP version) on Intel® Pentium® 4 and Itanium® 2 processor based systems. In addition, this paper reported the best performance results (also published on the www.spec.org web page) of SPEC* OMPL2001, delivered by the Intel® C++/Fortran compilers on an SGI* Altix* system.
