Abstract
Parallel architectures are an instrumental tool for the execution of compute intensive applications. Current programming models support distributed memory, shared memory, and clusters of shared memory architectures. An example of the support of distributed memory programming is MPI [12] , which provides the functionality for process communication and synchronization. OpenMP [13] was introduced as an industrial standard for sharedmemory programming with directives. The directives support loop level parallelization. The OpenMP programming paradigm provides ease of programming when developing parallel applications. For applications exhibiting multiple levels of parallelism the current most common programming paradigms are hybrid approaches such as the combination of MPI and OpenMP, or the MLP [15] model developed at NASA Ames. However, there is not much experience in the parallelization of applications with multiple levels of parallelism using OpenMP only.
The lack of compilers that are able to exploit further parallelism inside a parallel region has been the main cause of this problem, which has favored the practice of combining several programming models to exploit multiple levels of parallelism on a large number of processors. The nesting of parallel constructs in OpenMP is a feature that requires attention in future releases of OpenMP compilers. Some research platforms, such as the OpenMP NanosCompiler [4] , have been developed to show the feasibility of exploiting nested parallelism in OpenMP and to serve as testbeds for new extensions in this direction. The OpenMP NanosCompiler accepts Fortran-77 code containing OpenMP directives and generates plain Fortran-77 code with calls to the NthLib thread library [10] . NthLib allows for multilevel parallel execution such that inner parallel constructs are not being serialized. The NanosCompiler programming model supports several extensions to the OpenMP standard allowing the user to control the allocation of work to the participating threads. By supporting nested OpenMP directives the NanosCompiler offers a convenient path to multilevel parallelism.
Multi-zone codes are a class of applications featuring multiple levels of parallelism. They are commonly used in large scale Computational Fluid Dynamics (CFD) applications. A single mesh is often not sufficient to describe a complex domain and multiple meshes are used to cover it. These meshes are referred to as zones which yield the name multi-zone code. It is common to solve the flow equations independently within each zone. After each iteration boundary values are exchanged between neighboring zones. Solutions within each zone can be computed independently, providing coarse grain parallelism. Fine grain loop level parallelism can be exploited within each zone. A set of benchmarks has recently been released which captures this behavior and allows the analysis and evaluation of multi-level programming paradigms. These benchmarks are multi-zone versions of the well known NAS Parallel Benchmarks [2] . The NPB Multi-Zone (NPB-MZ) are described in [16] . A serial and two hybrid parallel reference implementations of the NPB-MZ are available. We have developed a nested OpenMP version of the NPB-MZ and used the NanosCompiler to evaluate the efficiency on several hardware platforms.
The rest of the paper is structured as follows: Section 2 summarizes the NanosCompiler extensions to the OpenMP standard. Section 3 describes the implementation of the NPB-MZ. Section 4 presents timing results for the benchmark codes. Related work is discussed in Section 5 and the conclusions are presented in Section 6.
2.The NanosCompiler
OpenMP provides a fork-and-join execution model in which a program begins execution as a single process or thread. This thread executes sequentially until a PARALLEL construct is found. At this time, the thread creates a team of threads and it becomes its master thread. All threads execute the statements lexically enclosed by the parallel construct. Work-sharing constructs (DO, SECTIONS and SINGLE) are provided to divide the execution of the enclosed code region among the members of a team. All threads are independent and may synchronize at the end of each work-sharing construct or at specific points (specified by the BARRIER directive). Exclusive execution mode is also possible through the definition of CRITICAL and ORDERED regions. If a thread in a team encounters a new PARALLEL construct, it creates a new team and it becomes its master thread. OpenMP v2.0 provides the NUM_THREADS clause to restrict the number of threads that compose the team.
The NanosCompiler extension to OpenMP to support multilevel parallelization is based on the concept of thread groups. A group of threads is composed of a subset of the total number of threads available in the team to run a parallel construct. In a parallel construct, the programmer may define the number of groups and the composition of each one. When a PARALLEL construct defining groups is encountered, a new team of threads is created. The new team is composed of as many threads as the number of groups. The rest of the threads are used to support the execution of nested parallel constructs. In other words, the definition of groups establishes an allocation strategy for the inner levels of parallelism. To define groups of threads, the NanosCompiler supports the GROUPS clause extension to the PARALLEL directive.
C$OMP PARALLEL GROUPS (gspec)
Different formats for the GROUPS clause argument gspec are allowed [5] . The simplest specifies the number of groups and performs an equal partition of the total number of threads to the groups:
gspec = ngroups
The argument ngroups specifies the number of groups to be defined. This format assumes that work is well balanced among groups and therefore all of them receive the same number of threads to exploit inner levels of parallelism. At runtime, the composition of each group is determined by equally distributing the available threads among the groups. Another possible format is: gspec = ngroups, weight
In this case, the user specifies the number of groups (ngroups) and an integer vector (weight) indicating the relative weight of the computation that each group has to perform. From this information and the number of threads available in the team, the threads are allocated to the groups at runtime. The weight vector is allocated by the user and its values are computed from information available within the application itself (for instance iteration space or computational complexity).
3.The Multi-Zone Versions of the NAS Parallel Benchmarks
The purpose of the NPB-MZ is to capture the multiple levels of parallelism inherent in many full scale CFD applications. Multi-zone versions of the NAS Parallel Benchmarks LU, BT, and SP were developed by dividing the discretization mesh into a two-dimensional tiling of three-dimensional zones. Within all zones the LU, BT, and SP problems are solved to advance the time-dependent solution. The same kernel solvers are used in the multizone codes as in the single-zone codes. Exchange of boundary values takes place after each time step. A detailed discussion of the NPB-MZ can be found in [16] . Figure 1 .a shows the general structure for all benchmarks. We will refer to the multi-zone versions of the LU, BT, and SP benchmarks as LU-MZ, BT-MZ, and SP-MZ.
The Hybrid Implementations
The source code distribution of the NPB-MZ includes two different hybrid implementations, as shown in Figure  1 .b. The first hybrid implementation is based on using MPI for the coarse grained parallelization on zone-level and OpenMP for fine grained loop level parallelism within each of the zones. The MPI programming paradigm assumes a private address space for each process. Data is transferred by explicitly exchanging messages via calls to the MPI library. This model was originally designed for distributed memory architectures but is also suitable for shared memory systems. In the NPB-MZ MPI/OpenMP implementation the number of processes is defined at compile time. Each process is assigned a number of zones and spawns a number of OpenMP threads in order to achieve a balanced load. Data is communicated at the beginning of the time step loop using MPI. There is no communication during the solution of the LU, BT, and SP problems within one zone. The OpenMP parallelization is similar to the single-zone versions as described in [6] .
The second hybrid implementation that is part of the NPB-MZ is based on the MLP programming model developed by Taft [15] at NASA Ames Research Center. The MLP programming model is similar to MPI/OpenMP, using a mix of coarse grain process level parallelization and loop level OpenMP parallelization. As it is the case with MPI, a private address space is assumed for each process. The MLP approach was developed for ccNUMA architectures and explicitly takes advantage of the availability of shared memory. A shared memory arena which is accessible by all processes is required. Communication is done by reading from and writing to the shared memory arena. Libraries supporting the MLP paradigm usually provide routines for process creation, shared memory allocation, and process synchronization. Details about the process level parallelization in the MLP paradigm and corresponding library support can be found in [7] . The MLP implementation of the NPB-MZ is very similar to the MPI/OpenMP implementation. Communication is handled by copying the boundary values to and from the shared memory arena. The OpenMP parallelization is identical in both versions.
Both hybrid implementations apply a load balancing algorithm to determine the number of threads that each process spawns. A detailed description of the reference implementations, which are part of the benchmark distribution, can be found in [8] .
The Nested OpenMP Implementations
The nested OpenMP implementation is currently not part of the NPB-MZ distribution. It has been developed by the authors using the thread group extensions mentioned before. This implementation combines a coarse grained parallelization (inter-zone) and parallelization within the zones (intra-zone), but employing OpenMP on both levels. The intra-zone parallelization is identical in the hybrid and the nested OpenMP implementations. The inter-zone parallelism is implemented by creating groups of threads and by assigning one or more zones to a thread group. The whole address space is shared by default among the threads working at both levels of parallelism. Data exchange at the zone boundaries is done in parallel by reading from and writing to the original application data structures. There is no need for using any special primitives such as MPI communication routines or MLP synchronization routines. This implementation just requires the addition of less than half a dozen OpenMP directives in each application. The same function that maps zones to MPI or MLP processes is used to map zones to thread groups. The mapping function generates two vectors that indicate which group executes each zone (pzone_id) and how many threads are allocated to each group (pn_thr). Since zones are not mapped in a consecutive way and the number of zones assigned to each group may be different, a couple of statements in the parallel regions at the outer level to control the execution of zones had to be added. The number of groups (num_grps) is controlled by an environment variable. Figure 2 shows an excerpt of the parallelization for the LU-MZ benchmark. The first part shows the parallelism at the inter-zone level. The intra-zone parallelization occurs in routine ssor, which is identical to the parallelization used in the other two strategies (MPI+OpenMP and MLP). At this point the reader may want to know why there is a necessity for extending OpenMP to support thread groups. The current specification for OpenMP includes the NUM_THREADS clause which tells the runtime environment the number of threads to be used in the execution of the PARALLEL region. With this extension it is possible to implement a nested parallel strategy similar to the one described above. However, it requires that the programmer explicitly controls the allocation of threads at each level of parallelism, as shown in Figure 3 (equivalent to Figure 2 ). This implies that the vectors that control the allocation of zones to groups are visible to the thread that is going to spawn the inner level of parallelism (common block inside routine ssor).
Two problems are worth mentioning about this implementation. The first one is the lack of modularity of the approach. For example, now the programmer has coded in the application itself the fact that this routine is called from inside a parallel region; if called from a serial part of the application the behavior would not be appropriate. In addition, if more levels of parallelism were available, coding the allocation of threads would be painful using NUM_THREADS. The version employing the NanosCompiler GROUPS clause extension is more modular since the context is implicit in the OpenMP runtime support and the code is valid in all possible situations. The second problem is related to the usual implementation of nested parallelism in OpenMP. It is common practice to implement a pool of threads, so that when a thread arrives at a PARALLEL region the desired number of threads is taken from the pool. In the example depicted in Figure 3 this would be the number specified by the NUM_THREADS clause. This is the case for example in the runtime system of the IBM XL compiler [17] . However, there is no guarantee that a particular thread is always executed on the same processor, so that data locality is not necessarily exploited. The definition of thread groups establishes an allocation strategy for the inner lev- els of parallelism, so that multiple instances of the same PARALLEL region or different regions with the same GROUPS definition will always use the same thread/processor mapping. In other words, the definition of GROUPS is more static than the definition of NUM_THREADS, which we consider more dynamic.
4.Timing Results
We ran the BT-MZ, LU-MZ, and SP-MZ benchmarks of problem classes W, A, and B. The aggregate sizes for all benchmarks are:
• Class W: 64x64x8 grid points • Class A: 128x128x16 grid points • Class B: 304x208x17 grid points Our tests were executed on two hardware platforms: an SGI Origin 3000 located at the NASA Ames Research Center and one frame of an IBM Regatta p690 located at the FZ Juelich Center in Germany.
The SGI Origin 3000 is a ccNUMA architecture with 4 CPUs per node. The CPUs are of type R12K with a clock rate of 400 MHz, 2 GB of local memory per node, and 8 MB of L2 cache. The peak performance of each CPU is 0.8 Gigaflops. The MLP implementations use the SMPlib library as described in [7] . The MIPSpro 7.4 Fortran Compiler [11] is used to compile the hybrid codes and the NanosCompiler for the nested OpenMP code. The compiler options -mp -O3 and -64 are set in both cases.
The IBM Regatta frame has 32 processors of type Power4+, running at 1.7 GHz. The main memory is 64 MB and the cache hierarchy has three levels: internal L1 cache with 64 KB instruction and 32 KB data (per processor), shared L2 cache with 1.5 MB (per chip = 2 processors), and shared L3 cache with 512 MB. The IBM XL Fortran compiler with the option -qsmp=omp is used to compile the hybrid MPI/OpenMP codes. The NanosCompiler supporting the GROUPS extension is used for the nested OpenMP codes. On the IBM platform there was no library support for the MLP programming model available. The native IBM compiler supports nested parallelism. Some tests were run employing the native IBM compiler together with the NUM_THREADS clause (as shown in Figure 3 ) to achieve nested parallelism. The option -qsmp=omp:nested_par was set in this case to compile the nested OpenMP version. The option -O3 was used for all cases. In the charts we use the following notation to refer to the different versions:
• MPI+OpenMP: Hybrid version implemented with MPI and OpenMP.
• MLP: Hybrid version implemented using the MLP approach.
• NTH: Nested OpenMP implementation using the NanosCompiler and the GROUPS clause.
• IBM Nested: Nested OpenMP implementation using the native IBM compiler and the NUM_THREADS clause.
• NPxNT: Number of CPUs expressed as number of processes (NP) times number of threads (NT). For the nested OpenMP code NP refers to the number of thread groups or threads used at the outer parallel level.
The BT-MZ Benchmark
The number of zones grows with the problem size. The number of zones is 4x4 for Class W and A, and 8x8 for Class B. The sizes of the zones vary widely. The ratio of the largest to the smallest zone is approximately 20. In order to achieve a good load balance a different number of threads has to be assigned to each group in the nested OpenMP codes. The same is true for the number of threads that are spawned by the processes in the hybrid codes. Figure 4 shows results for the hybrid MPI/OpenMP version and the nested OpenMP version compiled with the IBM native compiler and runtime system on the IBM Regatta. The timings show that the current implementation of nested parallelism in the native IBM system is not achieving the scalability of the hybrid version. We suspect that the implementation of nested parallelism using a pool of threads does not exploit data locality. Timings for different allocations of threads to the outer and inner levels are shown in Figure 5 . Although the runtime environment may ensure that the outer level of parallelism always uses the same kernel thread to execute each OpenMP thread, this is not guaranteed at the inner level. At the inner level, the threads that compose each team are dynamically selected from the pool, so there is no guarantee that the same kernel threads are used in all parallel regions. Notice that the performance of the IBM Nested implementation is best when all threads are used on the inner level because in this case the same threads are always used to execute the inner Figure 6 shows the speedup achieved by the hybrid MPI/OpenMP and the NTH versions for BT-MZ class A on the IBM Regatta system. Although the performance of the NTH version is slightly worse than the performance of the hybrid MPI/OpenMP version, the behavior is the same. The performance is worse due to the current implementation of the runtime system supporting the NanosCompiler.
Figure 4: Timings for 20 iterations of BT-MZ
The MIPSpro compiler and runtime environment on the SGI Origin do not support nested parallelism. The execution times of the hybrid MPI/OpenMP and the NTH versions are shown in Figure 7 . Due to load balancing, the number of threads per process and the number of threads per group varies. We indicate the average number of threads per process or group in the timings charts. The performance of the nested OpenMP implementation is nearly identical to that of the hybrid codes. The thread groups implementation in the Nanos compiler and runtime environment guarantee the same mapping of kernel threads to OpenMP threads in all parallel regions, both at the outer and inner levels. This improves memory behavior and results in performance levels that are comparable to the hybrid versions. This demonstrates the importance of having these extensions in OpenMP and provides an efficient implementation for nested parallelism in OpenMP. Figure 8 shows the impact of different combinations of processes or groups and threads. The timings are shown for the problem Class B and 128 CPUs. The problem Class B has 64 zones. Using 64 processes or 64 thread groups did not allow the most efficient load balancing. The best load balancing was achieved using 16 processes in the hybrid codes and 16 groups in the NanosCompiler nested OpenMP code. Since the number of threads per process or group varies, we report the average number of threads per process. 1x2  2x1  1x4  2x2  4x1  1x8  2x4  4x2  8x1  1x16  2x8  4x4  8x2  16x1  1x32  2x16  4x8  8x4 To demonstrate the scalability of the different implementations on the SGI Origin 3000 the Gigaflop rate as reported by the benchmark is shown in Figure 9 .
Figure 9: Performance of BT-MZ in Gigaflops when increasing the number of processors
The Class B performance for the number of processes or thread groups that produced the best results is reported. This number was the same for all three implementations. This is not surprising since the load balancing issue is the same in all versions. The three implementations show almost identical scalability, achieving about 28 Gigaflops/s for 128 CPUs.
To illustrate the load balancing issue in this benchmark, we show timeline views of time spent in useful calculations for different numbers of thread groups in Figure 10. 
The SP-MZ Benchmark
Here the mesh is partitioned such that zones are identical in size. The number of zones grows with the problem size. The number of zones is 4x4 for Class W and A, and 8x8 for Class B. The computations are naturally load balanced on the coarse level. Timings for the different implementations and different benchmark classes on the SGI Origin are shown in Figure 11 . As before, we report the timings for the best combinations of processes or groups and threads. The hybrid implementations achieve the best performance when employing a maximum number of processes on the coarse level. The use of multiple threads per process is only advantageous when the number of CPUs exceeds the number of zones. The situation is similar for the nested OpenMP code: It is best to employ groups consisting of only 1 thread, unless the number of CPUs exceeds the number of zones. As an example we show in Figure 12 the timings for problem Class B on different process or group and thread combinations.
Figure 12: Timings for 20 iterations of SP-MZ
On the IBM Regatta is was advantageous to use multiple threads per process or group for the Class A benchmark. The timings for both MPI/OpenMP and NTH are comparable, as shown in Figure 13. 
The LU-MZ Benchmark
In this case the number of zones is 4x4 for all problem sizes. The overall mesh is partitioned such that the zones are identical in size. This makes load balancing easy. The coarse grain parallelism in the hybrid codes is limited to 16 processes due to the structure of the benchmark. Parallelism beyond that has to be obtained at the fine grained level. In the nested OpenMP code the number of thread groups is limited to 16.
The timings for the SGI Origin are shown in Figure 14 . As before, we show the combinations of processes or groups and threads that yielded the best results for the hybrid codes and the NanosCompiler nested OpenMP code, respectively. The best timings were achieved by the same combinations in the hybrid and the nested OpenMP codes. In the case of LU-MZ the nested OpenMP code does not achieve the performance of the hybrid implementations. The major difference between LU-MZ and the two previous benchmark implementations is that both, BT-MZ and SP-MZ perform one time step before timing of the actual iteration loop. This ensures efficient data placement in case of a first touch data placement policy. For LU-MZ this is not the case. While it does not effect the hybrid codes, the lack of touching the data before the start of the iteration yields to a dramatic increase in time for the first iteration in the nested OpenMP code, which had a significant impact due to the fact that we were only timing the 
Number of CPUs (NPxNT)
Gigaflops/s MPI/OpenMP NTH NTH (touch data)
Figure 15: Performance of LU-MZ in Gigaflops
The timings for LU-MZ Class A on the IBM Regatta are shown in Figure 16 . Due to the small number of CPUs on a single node, the scalability problem observed on the SGI Origin does not show. Hybrid and nested parallelism are advantageous for more than 16 CPUs.
Related Work
Most current commercial and research compilers mainly support the exploitation of a single level of parallelism and special cases of nested parallelism (e.g. double perfectly nested loops as in the SGI MIPSpro compiler [11] ). The KAI/Intel compiler offers, through a set of extensions to OpenMP, work queues and an interface for inserting application tasks before execution (WorkQueue proposal [14] ). The KAI/Intel proposal mainly targets dynamic work generation schemes (recursions and loops with unknown loop bounds). In this proposal, there is no explicit (at the user or compiler level) control over the allocation of threads so they do not support the logical clustering of threads in the multilevel structure, which we think is necessary to allow good work distribution and data locality exploitation. The IBM XL [17] Fortran compiler supports nested parallelism. The execution environment provides a pool of threads from which any parallel region can take some for parallel execution. The user has the possibility to limit the number of threads on the outer level or parallelism by using the NUM_THREADS clause in the PARALLEL directive. We have discussed the problems that may result from this approach in subsection 3.2.
There are a number of papers reporting experiences in combining multiple programming paradigms to exploit multiple levels of parallelism (e. g. [15] ). Experiences on employing multiple level of parallelism in OpenMP are reported in [1] . Implementation of nested parallelism by means of controlling the allocation of processors to tasks in a single-level parallelism environment is discussed in [3] . The authors show the improvement due to nested parallelization. The performance of code containing automatically generated nested OpenMP directives is discussed in [9] . The NanosCompiler was then used to evaluate the performance of the nested OpenMP code on two different hardware platforms. The performance was compared to corresponding hybrid implementations of the benchmarks using the MPI/OpenMP and the MLP programming paradigms. For all three benchmarks the performance of the OpenMP code was comparable to that of the hybrid implementations. On the SGI Origin the LU-MZ benchmark required touching the data before the start of the iteration in order to achieve the performance of the hybrid codes.
Conclusions and Future Work
The first conclusion of the study is that the OpenMP paradigm allowed a very rapid development of the parallel code. The second observation is that the thread groups implementation in the NanosCompiler and runtime system was crucial to obtaining good performance. The reason is that the implementation guarantees the same mapping of kernel threads to OpenMP threads in all parallel regions, both at the outer and inner levels. This improves memory access time and results in performance levels that are comparable to the hybrid versions. Another important feature of the NanosCompiler is the possibility to assign weights to the thread groups in order to achieve a well balanced work load distribution.
We plan to conduct further case studies to compare the performance of parallelization based on nested OpenMP directives with hybrid and pure message passing parallelism. We will consider other hardware platforms, larger benchmark classes, and full-scale applications.
