Abstract: This article emphasizes on load balancing issues associated with hybrid programming models for the parallelization of tiled algorithms onto SMP clusters. Although tiled algorithms usually account for relatively regular computation and communication patterns, their hybrid parallelization often suffers from intrinsic load imbalance between threads. This observation mainly reflects the fact that most existing message passing libraries generally provide limited multi-threading support, thus allowing only the master thread to perform inter-node message passing communication. In order to mitigate this effect, we propose a generic method for the application of load balancing on the coarse-grain hybrid model for the appropriate distribution of the computational load to the working threads. We adopt both a static, as well as a dynamic load balancing approach, and implement three alternative balancing variations. All implementations are experimentally evaluated against kernel benchmarks, in order to demonstrate the potential of such load balancing schemes for the extraction of maximum performance out of hybrid parallel programs.
INTRODUCTION
SMP clusters have dominated the high performance computing domain by providing a reliable, cost-effective solution to both the research and the commercial communities. An immediate consequence stemming from the emersion of this new architecture is the consideration of new parallel programming models, which might exploit the underlying infrastructure more efficiently. Currently, message passing parallelization via the MPI library has become the de-facto programming approach for the development of portable code for a variety of high performance platforms. Although message passing parallel programs are generic enough, so as to be directly adapted to multi-layered, hierarchical architectures in a straightforward manner, there is an active research interest in considering alternative parallel programming models, that could be more appropriate for such platforms.
Hybrid programming models on SMP clusters resort to both message passing and shared memory access for interand intra-node communication, respectively, thus implementing a two-level hierarchical communication pattern. Usually, MPI is employed for the inter-node communication, while a multi-threading API, such as OpenMP, is used for the intra-node processing and synchronization. There are mainly two hybrid programming variations addressed in related work, namely the fine-grain incremental hybrid parallelization, as well as the coarse-grain SPMD-like alternative. Naturally, other cluster parallelization approaches also exist, such as using a parallel programming language, like HPF (Merlin and Hey (1995) ) or UPC (El-Ghazawi et al. (2002) ), and relying on the compiler for efficiency, or even visualizing a single shared memory system image across the entire SMP cluster, implemented with the aid of a Distributed Shared Virtual Memory (DSVM) software (Morin and Puaut (1997) , Hu et al. (2000) ). Nevertheless, these techniques are not as popular as either the message passing approach, or even hybrid parallel programming, hence they will not be discussed here.
Tiled loop algorithms account for a large fraction of the computational intensive part of many existing scientific codes. A typical representative of this application model stems from the discretization of Partial Differential Equations (PDEs). We consider generic N +1-dimensional tiled loops, which are parallelized across the outermost N dimensions, so as to perform sequential execution along the innermost dimension in a pipeline fashion, interleaving computation and communication phases. These algorithms impose significant communication demands, thus rendering communication-efficient parallelization schemes critical in order to obtain high performance. Moreover, the hybrid parallelization of such algorithms is a non-trivial issue, as there is a trade-off in programming complexity and parallel efficiency.
Hybrid parallelization is a popular topic in related literature, although it has admittedly delivered controversial re- sults (Cappello and Etiemble (2000) , Drosinos and Koziris (2004) , Henty (2000) , Dong and Karniadakis (2004) , Loft et al. (2001) , Ayguadé et al. (2004) , Majumdar (2000) , Nakajima (2003) , Su et al. (2004) etc) . In practice, it is still a very open subject, as the efficient use of an SMP cluster calls for appropriate scheduling methods and load balancing techniques. Most message passing libraries provide a limited multi-threading support level (e.g. funneled), allowing only the master thread to perform message passing communication. Even under multiple thread support, the additional complexity associated with ensuring thread safety can potentially diminish the load balancing advantages of full multi-threading support. Therefore, additional load balancing must be applied, so as to equalize the per tile execution times of all threads. This effect has been theoretically spotted in related literature (Rabenseifner and Wellein (2003) , Legrand et al. (2004) , Chavarría-Miranda and Mellor-Crummey (2003) , Darte et al. (2003) ), but to our knowledge no generic load balancing technique has been proposed and, more importantly, evaluated.
In this article we propose both static and dynamic load balancing for the coarse-grain funneled hybrid parallelization of tiled loop algorithms. The computational load associated with a tile is appropriately distributed among threads, either statically, based upon the estimation and modeling of basic system parameters, or at run-time, by sampling relative computation and communication times and dynamically applying appropriate load balancing. We distinguish between two variations of static load balancing, namely both a constant and a variable scheme, depending on whether the same task distribution is applied on a global or a per process base, respectively. We emphasize on the elements of applicability and simplicity, and evaluate the efficiency of the proposed scheme against two popular kernel benchmarks, namely ADI integration and a backward discretization of the Diffusion Equation. The experimental evaluation eloquently demonstrates the merit of load balancing when following the hybrid parallelization approach, and yields to significant overall performance improvement over the dominant message passing model.
The rest of this article is organized as follows: Section 2 discusses our target algorithmic model and introduces the notation used throughout the article. Section 3 refers to a pure message passing parallelization of tiled loop algorithms, while Section 4 presents the hybrid programming alternatives. Section 5 focuses on the proposed load balancing schemes, whereas Section 6 experimentally evaluates the efficiency of these schemes against ADI and DE kernels. Last, Section 7 concludes the paper and summarizes the performance results.
ALGORITHMIC MODEL -NOTATION
Our algorithmic model concerns tiled algorithms, that can be formally described as in Alg. 1. Such algorithms can be typically obtained by applying tiling transformation to fully permutable loops. Tiling is a popular loop transformation and can be applied in order to implement coarse granularity in parallel programs. Tiling partitions the original iteration space of an algorithm into atomic units of execution (tiles). This partitioning facilitates the parallelization of the algorithm, as a tile-to-process allocation scheme can be selected to implement domain decomposition in a straightforward manner. In our case, each process assumes the execution of a sequence of tiles, successive along the longest dimension of the original iteration space.
Algorithm 1: iterative algorithm model
Formally, we assume an N + 1-dimensional algorithm with an iteration space of X 1 × · · · × X N × Z, which has been partitioned using a tiling transformation defined by function H. Z is considered the longest dimension of the iteration space, and should be brought to the innermost loop through permutation, in order to simplify the generation of efficient parallel tiled code (Wolf and Lam (1991) ). In the above code, tiles are identified by an N +1-dimensional vector −→ tile = (tile 1 , . . . , tile N +1 ). foracross implies parallel execution, as opposed to sequential execution (for). In each tile denoted by a specific instance of vector −→ tile, a process first receives data required for the computations associated with the current tile (Receive), then performs these computations (Compute) and finally transmits computed data required by its neighboring processes (Send).
Generally, tiled code is associated with a particular tileto-process distribution strategy, that enforces explicit data distribution and implicit computation distribution, according to the computer-owns rule. For homogeneous platforms and fully permutable iterative algorithms, related scientific literature (Calland et al. (1997) ) has proven the optimality of the columnwise allocation of tiles to processes, as long as sequential pipelined execution along the longest dimension is assumed. Therefore, all parallel algorithms considered in this paper implement computation distribution across the N outermost dimensions, while each process computes a sequence of tiles along the innermost N +1-th dimension.
In most practical cases, the data dependencies of the algorithm are of several orders of magnitude smaller compared to the iteration space dimensions. Consequently, only neighboring processes need to communicate, assuming reasonably coarse parallel granularities, which are common for the distributed memory architectures addressed here. According to the above, we only consider unitary process communication directions for our analysis, since all other non-unitary process dependencies can be satisfied according to indirect message passing techniques, such as the ones described in Tang and Zigman (1994) . However, in order to preserve the communication pattern of the application, we consider a weight factor d i for each process dependence direction i, implying that if iteration j = (j 1 , . . . , j i , . . . , j N +1 ) is assigned to a process p, and iteration j = (j 1 , . . . , j i + d i , . . . , j N +1 ) is assigned to a different process p , p = p , then data calculated at iteration j from p need to be sent to p , since they will be required for the computation of data at iteration j .
In the following, P 1 ×· · ·×P N and T 1 ×· · ·×T N denote the process and thread topology, respectively. P = N i=1 P i is the total number of processes, while T = N i=1 T i the total number of threads. Usually, if P mpi the total number of processes for the message passing programming model, whereas P hybrid the total number of processes and T hybrid the respective number of threads for the hybrid model, both quantities P mpi and P hybrid × T hybrid equal the total number of available processors, in order to fully exploit the computational infrastructure for the particular class of algorithms we address in this article. Nevertheless, for the sake of generality, we will only be considering processes and threads, as opposed to processors, where vector p = (p 1 , . . . , p N ), 0 ≤ p i ≤ P i − 1 identifies a specific process, while t = (t 1 , . . . , t N ), 0 ≤ t i ≤ T i − 1 refers to a particular thread. Throughout the text, we will use MPI and OpenMP notations in the proposed parallel algorithms.
MESSAGE PASSING PARALLELIZATION
The proposed pure message passing parallelization for the algorithms described above is based on the tiling transformation and is schematically depicted in Alg. 2. Each process is identified by N -dimensional vector p, while different tiles correspond to different instances of N +1-dimensional vector −→ tile. The N outermost coordinates of a tile specify its owner process p, while the innermost coordinate tile N +1 iterates over the set of tiles assigned to that process. z denotes the tile height along the sequential execution dimension, and determines the granularity of the achieved parallelism: higher values of z imply less frequent communication and coarser granularity, while lower values of z call for more frequent communication and lead to finer granularity. The investigation of the effect of granularity on the overall completion time of the algorithm and the selection of an appropriate tile height z are beyond the scope of this paper. Generally, we consider z to be a user-defined parameter, and perform measurements for various granularities, in order to experimentally determine the value of z that delivers minimal execution time. More on the effect of granularity on the parallel performance can be found in Kumar et al. (2003) , Hodzic and Shang (1998) , Andonov et al. (2003) .
Furthermore, an advanced scheduling that allows for computation-communication overlapping is adopted as follows: In each time step, a process p = (p 1 , . . . , p N ) for a non-boundary process p, then p needs to send data to process p + − → dir. Similarly, R p corresponds to the valid data reception directions of process p, implying that process p should receive data from process p − − → dir if − → dir ∈ R p . S p and R p are determined both by the data dependencies of the original algorithm, as well as by the selected process topology of the parallel implementation.
For the true overlapping of computation and communication, as theoretically implied in the above scheme by combining non-blocking communication primitives with the overlapping scheduling, the usage of advanced CPU offloading features is required, such as zero-copy and DMAdriven communication. Unfortunately, experimental evaluation over a standard TCP/IP based interconnection network, such as Ethernet, combined with the ch p4 ADI-2 device of the MPICH implementation, prohibits such advanced non-blocking communication, but nevertheless the same limitations hold for our hybrid model, and are thus not likely to affect the relative performance comparison. However, this fact does complicate our theoretical analysis, since we will assume in general distinct, non-overlapped computation and communication phases, an assumption that to some extent misrepresents the efficiency of the message passing communication primitives.
HYBRID PARALLELIZATION
The potential for hybrid parallelization is directly associated with the multi-threading support provided by the message passing library. From that perspective, there are mainly five levels of multi-threading support addressed in relevant scientific literature: 4. serialized All threads are allowed to call message passing routines, but only one at a time.
5. multiple All threads are allowed to call message passing routines, without restrictions.
Each category is a superset of all previous ones. Currently, popular non-commercial message passing libraries provide support up to the funneled or serialized thread support level, thus effectively restraining the message passing communication capabilities within multi-threaded regions, while only few proprietary libraries allow for full multithreading support (multiple thread support level). Due to this fact, most attempts for hybrid parallelization of applications, that have been proposed or implemented, are mostly restricted to the first three thread support levels.
Two major hybrid parallel programming variations are discussed in related literature, namely fine-grain and coarse-grain. The fine-grain model applies incremental multi-threading parallelization solely to specific computational code parts, and therefore requires minimum multithreading support on behalf of the message passing library, as even the masteronly thread support level is sufficient. On the other hand, the coarse-grain hybrid programming model enforces an SPMD-like programming style by spawning threads only once close to the beginning of the program, thus calling for at least a funneled thread support level, as message passing will have to be conducted within multi-threaded parallel regions. The particular thread support level provided by the message passing library leads to two further flavors of the coarse-grain hybrid model, depending on whether only the master thread is allowed to perform message passing communication, or if all threads are enabled to call message passing primitives. The application of the three hybrid models (fine-grain, coarse-grain funneled and coarse-grain multiple) in the parallelization process of tiled loop algorithms will be the subject of this Section.
Fine-grain Hybrid Parallelization
The fine-grain hybrid programming paradigm, also referred to as masteronly in related literature, is the most popular hybrid programming approach, although it raises a number of performance deficiencies. The popularity of the fine-grain model over the coarse-grain one is mainly attributed to its programming simplicity: in most cases, it is a straightforward incremental parallelization of pure message-passing code by applying block distribution work sharing constructs to computationally intensive code parts (usually loops). Because of this fact, it does not require significant restructuring of the existing message passing code, and is relatively simple to implement by submitting an application to performance profiling and further parallelizing performance critical parts with the aid of multi-threading processing. Also, fine-grain parallelization is the only feasible hybrid approach for those message passing libraries, that support only masteronly multi-threading.
However, the efficiency of the fine-grain hybrid model is directly associated with the fraction of the code that is incrementally parallelized, according to Amdahl's law: since message passing communication can be applied only outside of parallel regions, other threads are essentially sleeping when such communication occurs, resulting to poor CPU utilization and inefficient overall load balancing. Also, this paradigm suffers from the overhead of reinitializing the thread structures every time a parallel region is encountered, since threads are continually spawned and terminated. The thread management overhead can be substantial, especially in case of a poor implementation of the multi-threading library, and generally increases with the number of threads. Moreover, incremental loop parallelization is a very restrictive multi-threading parallelization approach for many real algorithms, where such loops either do not exist or cannot be directly enclosed by parallel regions.
Algorithm 3: fine-grain hybrid model /*determine group sequence from pid p */
#pragma omp parallel 9 /*calculate candidate tile to execute... */ for i ← 1 to N do
The proposed fine-grain hybrid implementation for iterative algorithms is depicted in Alg. 3. Hyperplane scheduling categorizes the tiles assigned to all threads of a specific process into groups, which can be concurrently executed. Each group contains all tiles, that can be safely executed in parallel by the specified number of threads T , without violating the data dependencies of the initial algorithm. Each group is identified by a N + 1-dimensional vector −−−→ group, where the N outermost coordinates denote the owner process p, and the innermost one iterates over the distinct time steps. G p corresponds to the set of execution time steps of process p, and depends both on the process and thread topology. Formally, vector −−−→ group and set G p are defined as follows:
For each instance of vector −−−→ group, each thread determines a candidate tile −→ tile for execution, and further evaluates an if -clause to check whether that tile is valid and should be computed at the current time step.
Figure 1: Hybrid parallel program for 3D algorithm and 6 processes × 6 threads All message passing communication is performed outside of the parallel region (lines 4-8 and 15-17), while the multi-threading parallel computation occurs in lines 9-14. Note that no explicit barrier is required for thread synchronization, as this effect is implicitly achieved by exiting the multi-threaded parallel region. Note also that only the code fraction in lines 9-14 fully exploits the underlying processing infrastructure, thus effectively limiting the parallel efficiency of the algorithm. Fig. 1 clarifies some of the notation used in the hybrid algorithms.
Coarse-grain Funneled Hybrid Parallelization
According to the coarse-grain model, threads are only spawned once and their ids are used to determine their flow of execution in the SPMD-like code. Obviously, message passing communication will be performed within the multi-threaded parallel region, hence the coarse-grain programming model requires at least a funneled thread support level. Assuming funneled multi-threading support, the master thread will have to undertake the entire message passing communication required for the inter-node data transfer, though other threads will be allowed to perform computation at the same time. We shall briefly refer to this case as coarse-grain funneled model, or simply coarse-grain model in short.
The additional promising feature of the coarse-grain approach is the potential for overlapping multi-threaded computation with message passing communication. However, due to the restriction that only the master thread is allowed to perform message passing, a naive straightforward implementation of the coarse-grain model suffers from load imbalance between the threads, if equal portions of the computational load are assigned to all threads. Therefore, additional load balancing must be applied, so that the master thread will assume a relatively smaller computational load compared to the other threads, thus equalizing the per tile execution times of all threads. Moreover, the coarse-grain model avoids the overhead of re-initializing thread structures, since threads are spawned only once, and can potentially implement more generic parallelization schemes, as opposed to its limiting fine-grain counterpart.
The pseudo-code for the coarse-grain parallelization of the tiled algorithms is depicted in Alg. 4. Note that the inter-node communication (lines 8-13 and 17-20) is conducted by the master thread, per communication direction and per owner thread, incurring additional complexity compared both to the pure message passing and the finegrain model. Also, note the bal( p, t) parameter in the computation, that corresponds to a balancing factor for thread t of process p and optionally implements load balancing between threads, as will be described in Section 5.
Coarse-grain Multiple Hybrid Parallelization
Should the message passing library provide full multithreading support, another variation of the coarse-grain hybrid parallelization scheme is feasible. In fact, in that case of a thread-safe message passing library, relatively less programming effort is required compared to the funneled paradigm, as each thread can in theory satisfy its own communication needs. Moreover, the multiple thread support level allows for a more balanced distribution of the communication load to the available threads, as opposed to the funneled model, where the master thread undertakes the entire task of inter-process communication.
However, as message passing libraries consider only processes as peer communicating entities, additional care must be taken to implement point-to-point communication between threads residing on different processes with the aid of the message passing primitives. For instance, according to the MPI standard, there is no direct way for thread t i residing on process p to address thread t j of a different process p . Message passing communication can be established only between the two processes p and p , as opposed to a finer level between threads. As a workaround to this problem, it is possible to implicitly integrate the thread id information into the MPI message tag, so that a message coming from a remote thread is indirectly matched only by the appropriate local thread.
Alg. 5 outlines the coarse-grain multiple implementation. The algorithm is similar to Alg. 4, except that the master directives have been removed, since all threads contribute to the inter-process communication. S p, t cor- responds to the set of valid transmission directions for thread t of the owner process p. In particular, if − → dir ∈ S p, t , then data computed by thread t need to be sent to process p + − → dir, assuming p denotes a non-boundary process. Similarly, if − → dir ∈ R p, t , then process p needs to receive from its neighbor process p − − → dir, in order to satisfy dependencies related to data computed by thread t of p. As all threads call message passing primitives based on the sets S p, t , R p, t , the parameter tag() emphasizes the role of the message tag in matching communicating thread pairs. Note also that we will not be considering load balancing for the coarse-grain multiple case, as this hybrid implementation primarily aims to be as simple as possible, by benefiting from the full multi-threading support provided by the message passing library. For the sake of simplicity, we shall refer to this programming model as multiple hybrid for the rest of this article.
LOAD BALANCING FOR THE HYBRID MODEL
Since most existing message passing libraries allow only the master thread to perform inter-process communication, the coarse-grain hybrid models suffer from intrinsic load imbalance and require an appropriate computation distribution scheme in order to achieve good performance. In the opposite case, the master thread will inevitably have to perform a larger fraction of work compared to the other threads, that is, if equal computational loads are assigned indiscriminately to all threads.
The hyperplane scheduling scheme enables a more efficient load balancing between threads: Since the computations of each time step are essentially independent of the communication data exchanged at that step, the former can be arbitrarily distributed among threads. Thus, it would be meaningful for the master thread to assume a smaller part of computational load, so that the total computation and the total communication associated with the owner process is evenly distributed among all threads.
In order to achieve this, we propose both a static and a dynamic (adaptive) approach for the calculation of the balancing factor(s). According to the static approach, load balancing is applied at compile time based upon a theoretical estimation of the relative communication vs computation cost of a specific algorithm on the particular underlying infrastructure. Only fundamental system characteristics (average iteration execution time, network bandwidth and latency) are included in this analysis, so as to keep the static approach applicable in practice. On the other hand, the dynamic alternative makes for a more obtrusive approach, where the relative communication vs computation cost of the algorithm is sampled at run-time, and no prior assumptions are made regarding the theoretical behaviour of the underlying system.
In this Section, we shall refer to the three proposed load balancing schemes, namely two variations of the static approach (constant and variable load balancing), as well as the dynamic, adaptive load balancing methodology.
Static Balancing
We have implemented two alternative static load balancing schemes. The first one (constant balancing) requires the calculation of a constant balancing factor, which is common for all processes, irrespective of the selected process topology. For this purpose, we consider a non-boundary process, that performs communication across all N process topology dimensions, and theoretically determine the computational fraction of the master thread, that equalizes tile execution times on a per thread basis. We then apply this constant balancing factor to all processes and quantitatively specify how much less computational load should be assigned to the master thread, so as to achieve the equalization of the total tile execution times of all threads.
The second scheme (variable balancing) requires further knowledge of the process topology, and ignores communication directions cutting the iteration space boundaries, since these do not result to actual message passing. According to the variable balancing scheme, we compute a different balancing factor for boundary and non-boundary processes, since the former are from the communication perspective less heavily burdened than the latter. Depending on the position of its owner process in the topology, each master thread is assigned a different balancing factor, as opposed to the constant balancing scheme, where the same balancing factor was applied to all master threads.
For both cases (constant and variable balancing), the balancing factor(s) can be obtained by the following lemma:
T . Let P = P 1 × · · · × P N be the process topology and T the number of threads available for the parallel execution of the hybrid funneled implementation of the respective tiled algorithm. The overall completion time of the algorithm is minimal if the master thread assumes a portion bal T of the process's computational load, where
t comp (x) The computation time required for x iterations t comm (x) The transmission time of an x-sized message z The tile height for each execution step
The proof of the lemma is presented in the Appendix. We assume the computation time t comp to be a linear function of the number of iterations, that is, we assume t comp (ax) = at comp (x). Note that if condition i ∈ S p is evaluated independently for each process, variable balancing is enforced, as each communication term is only included in the sum of (1) as long as it contributes to message passing along the particular dimension for the specific process. If the above check is omitted, (1) delivers the constant balancing factor.
The constant balancing scheme only requires knowledge of the underlying computational and network infrastructure, but also tends to overestimate the communication load for boundary processes. On the other hand, the variable balancing scheme can be applied only after selecting the process topology, as it uses that information to calculate a different balancing factor for each process. According to the variable scheme, the balancing factor is slightly smaller for non-boundary processes. However, as the active communication directions decrease for boundary processes, the balancing factor increases, so as to preserve the desirable thread load balancing. Figure 2: Variable balancing for 8 processes × 2 threads and 3D algorithm with iteration space X 1 × X 2 × Z, X 1 = 4X 2 Fig. 2 demonstrates the variable load balancing scheme for a 3D tiled algorithm, that has been mapped onto a 2D virtual process topology on a dual SMP cluster. The different balancing factors have been computed by applying (1), where functions t comp () and t comm () have been approximated as in equations (4) and (5) of Section 6, whereas N = 2, P = 8, P 1 = 4, P 2 = 2, T = 2,
Generally, a factor bal, 0 ≤ bal ≤ 1, for load balancing T threads implies that the master thread assumes bal T of the process's computational share, while each other thread is assigned a fraction of T −bal T (T −1) of that share. Consequently, larger balancing factors tantamount to the master thread assuming larger fractions of the process's computational load, with a balancing factor of 100% meaning equal distribution of that load among all existing threads (no balancing). For instance, since process (0, 1) does not have to perform communication along dimension X 2 , its master thread assumes a greater balancing factor (95% vs 87%) compared to the master thread of process (0, 0), which has to perform message passing communication along both X 1 and X 2 directions. Moreover, process (3, 1) will not be transmitting any communication data, hence no balancing is applied to its master thread.
Clearly, the most important aspect for the effectiveness of both static load balancing schemes is the accurate architectural modeling of both the underlying infrastructure, as well as its basic performance parameters, such as the sustained network bandwidth and latency, and the requirements imposed by the specific algorithm in terms of processing power. In order to preserve the simplicity and applicability of the methodology, we avoid complicated indepth software and hardware modeling, and adopt simple, yet partly crude, approximations for the system's structure and behavior. A dynamic scheme adopting an adaptive, run-time balancing approach, could potentially overcome some of the drawbacks associated with static balancing and its approximations, and is hence considered next.
Dynamic Balancing
Instead of applying thread load balancing a priori, according to a particular static modeling of application-system performance, it is possible to adaptively calculate appropriate balancing factors, by slightly restructuring the hybrid coarse-grain paradigm. Under a more obtrusive approach, each thread can time its computation and communication performance at run-time, so that load balancing is applied mainly based upon that information, and by further relying on a minimum set of assumptions. Obviously, as with most dynamic, obtrusive programming approaches, adaptive balancing incurs an additional profiling and processing overhead, mainly due to the insertion of timers, though only for a relatively short part of the code.
Suppose we apply an initial factor bal for the balancing of the T threads of a specific process. bal can be computed by using either of the static load balancing techniques mentioned above. Assuming P processes, our intention is to sample program execution for more than P T time steps, so that the execution wavefront reaches the most distant process, as we would like to time communication performance in a full parallel pipeline state. Based on the information gathered at the sampling period, a more appropriate load balancing factor bal can be applied for the rest of the algorithm, according to the following lemma:
Lemma 2. Assuming a balancing factor bal and T threads, let t m comp , t m comm be the average tile computation and communication times of the master thread, profiled at run-time. A more efficient balancing factor bal can be computed using the following expression:
The proof of Lemma 2 is presented in the Appendix. It should be noted that t m comm refers to the non-overlapped communication time of the master thread, that is, excluding any DMA-driven communication overlapped with computation, hence it can easily be profiled. In practice, the coarse-grain model depicted in Alg. 4 is unfolded into two parts, the sampling period and the main execution period. During the sampling period, the initial balancing factor bal is applied, and the master thread profiles its partial tile computation and communication times (t m comp and t m comm , respectively), that approximately constitute the total tile execution time. Based on this measurements, the new balancing factor bal is computed, and further applied for the main execution period of the program.
The adaptive balancing approach is expected to incur some additional performance penalty, mainly due to the timing requirements and re-balancing process. More importantly, it adopts two basic assumptions: first, that the communication time is invariant to the computation distribution between threads, and second that the computation time is proportional to the number of iterations. The first assumption misrepresents the potential of overlapping computation and communication, while the second fails to consider cache effects. However, these simplifications ensure the generality and applicability of the proposed dynamic balancing scheme. In fact, the dynamic scheme requires even less input to be supplied by the user concerning application and platform characteristics, as it basically extracts that information by monitoring and profiling program execution performance at run-time.
EXPERIMENTAL RESULTS
In order to test the efficiency of the proposed load balancing schemes, we compared the performance of all proposed programming models (message passing, fine-grain hybrid, coarse-grain hybrid) against two kernel benchmarks, namely Alternating Direction Implicit integration (ADI), as well as a second order backward discretization of the Diffusion Equation (DE). ADI is a stencil computation used for solving partial differential equations (Karniadakis and Kirby (2002) ). Essentially, ADI is a simple threedimensional perfectly nested loop algorithm, that imposes unitary data dependencies across all three space directions. On the other hand, DE is a similar three-dimensional algorithm, that can be derived from the following unsteady diffusion PDE with the aid of backward discretization:
Both kernels have an iteration space of X 1 × X 2 × Z, where Z is considered to be the longest algorithm dimension. Furthermore, both applications are suitable for the performance evaluation of all parallel programming models and techniques discussed here, as they are typical representatives of nested loop algorithms and comply with our target algorithmic model. More importantly, they impose communication in all three iteration space dimensions, and thus facilitate the comparison of the various programming models in terms of communication performance. In fact, in the DE kernel, a process transmits three times the communication volume along each iteration space dimension compared to ADI, therefore the relative performance of the two kernels enables scalability evaluation of the different programming models in respect to the inherent communication needs of a specific application.
We use MPI as the message passing library and OpenMP as the multi-threading API. Our experimental platform is an 8-node Pentium III dual-SMP cluster interconnected with 100 Mbps FastEthernet. Each node has two Pentium III CPUs at 800 MHz, 256 MB of RAM, 16 KB of L1 I Cache, 16 KB L1 D Cache, 256 KB of L2 cache, and runs Linux with 2.4.26 kernel. For the support of OpenMP directives, we use Intel C++ compiler v.8.1 with the following optimization flags: -O3 -mcpu=pentiumpro -openmp -static. We also used two MPI implementations, namely the popular open-source MPICH library (v.1.2.6), as well as the proprietary ChaMPIon/Pro MPI library (v.1.1.1). Both MPI libraries have been appropriately configured, in order to perform SMP intra-node communication efficiently (e.g. through SYS V shared memory). MPICH provides at most a funneled thread support level, while ChaMPIon/Pro is fully thread-safe, and will be mainly used for the performance comparison of the multiple hybrid implementation with the other hybrid alternatives. Some fine-tuning of the MPICH communication performance for our experimental platform indicated using a maximum socket buffer size of 104KB, so the respective environment variable was appropriately set to that value for all cluster nodes.
Under the MPICH library, we are able to fully exploit our cluster architecture, as we will be using 16 processes for the pure message passing experiments, and 8 processes with 2 threads per process for all hybrid programs. With the ChaMPIon/Pro library, the number of available licenses compelled us to use a maximum of 8 processes for the message passing case, and only 4 processes for the hybrid implementations, with each process spawning 2 threads. Furthermore, all experimental results are averaged over three independent executions for each case.
Regarding static thread load balancing, a simplistic approach is adopted in order to model the behavior of the underlying infrastructure, so as to approximate quantities t comp and t comm of (1). As far as t comp is concerned, we assume the computational cost involved with the calculation of x iterations to be x times the average cost required for a single iteration. On the other hand, the communication cost is considered to consist of a constant start-up latency term, as well as a term proportional to the message size, that depends upon the sustained network bandwidth on application level. Formally, we define
modeling. For instance, we ignored cache effects in the estimation of the computation cost, as we intended to avoid performing a memory access pattern analysis of the tiled application in respect to the memory hierarchy configuration of the underlying architecture. Also, a major difficulty we encountered was modeling the TCP/IP communication performance and incorporating that analysis in our static load balancing scheme. Assuming distinct, nonoverlapping computation and communication phases and relatively high sustained network bandwidth allowed us to bypass this restriction. However, this hypothesis underestimates the communication cost for short messages, since these are mostly latency-bound and sustain relatively low throughput. On the other hand, it overestimates the respective cost in the case of longer messages, where DMA transfers alleviate the CPU significantly. For our analysis, we considered t comp (1) = 288nsec, t startup = 107usec and B sustained = 100M bps. Finally, as regards adaptive balancing, we considered a duration of 2P T time steps for the sampling period. For instance, given P = 8 and T = 2, this corresponds to 32 time steps, while each process will be executing 82 to 8k tiles, depending on the tile height (z ranges from 2 to 200, while dimension Z is always equal to 16k, for all iteration spaces). Note that for some cases, the duration of the sampling period is not negligible, compared to the main execution period.
ADI Integration
Initially, we performed an overall performance comparison of all proposed programming models against the ADI kernel, by using both the MPICH and the ChaMPIon/Pro MPI implementations. It should be noted that although this comparison may be useful towards the efficient usage of SMP clusters, it cannot be generalized beyond the chosen hardware-software combination, as it largely depends upon the comparative performance of the specific programming APIs (MPICH, ChaMPIon/Pro, OpenMP support on Intel compiler), as well as how efficiently the MPI library supports multi-threaded programming.
We tried five different iteration spaces for ADI, in order to investigate the performance variation of each model for different process topologies. For each iteration space and programming model, we selected among all feasible cartesian process topologies the one that minimizes the total execution time. The selected topologies for each case are depicted in Table 1 . The variety of iteration spaces and the potential for multiple alternative process topologies eloquently demonstrates that the comparison of the pure message passing model with the hybrid approach is a non-trivial issue, even from the communication perspective. Generally, the total communication volume is reduced when assuming the hybrid approach, however this reduction is not necessarily proportional to the number of threads T , but also depends on the particular process topology. For instance, considering the iteration space 32 × 256 × 16k, we expect the process topology assumed with the hybrid implementations to be quite beneficial, as it eliminates the need for communication across the long X 2 = 256 iteration space dimension. On the other hand, for the iteration space 256 × 256 × 16k, the communication data is reduced by approximately 33%, while there are only half as many entities to perform message passing under funneled multi-threading support (16 processes in the pure message passing model vs 8 master threads in the hybrid alternatives). Thus, even theoretically, the advantages of the hybrid approach may be diminished without proper thread load balancing. iteration space message passing hybrid 16 The overall experimental results for ADI integration and the above iteration spaces are depicted in Fig. 3 (MPICH) and Fig. 5 (ChaMPIon/Pro). These results are normalized in respect to the message passing execution times, so as to allow for straightforward quantitative comparison. Furthermore, granularity measurements for various iteration spaces are depicted in Fig. 4 for the MPICH library.
Following conclusions can be drawn from the thorough investigation of the obtained performance measurements (MPICH library):
• Unoptimized hybrid parallelization (fine-grain, coarsegrain unbalanced) is often worse than the pure message passing approach. If no load balancing is applied to the hybrid model, its relative performance compared to the pure message passing model exclusively depends on the appropriateness of the selected pro- cess topology, given a specific algorithm and iteration space. As anticipated, the hybrid implementation performs better for the 32 × 256 × 16k iteration space, even without load balancing, as the assumed process topology of the hybrid implementation eliminates the largest fraction of the message passing communication required. In almost all other cases, the pure message passing model performed better than the unbalanced hybrid implementations.
• The coarse-grain hybrid model generally performs better than the fine-grain alternative for most iteration spaces. However, the coarse-grain model is not always more efficient than its fine-grain counterpart (e.g. 128 × 256 × 16k, 256 × 256 × 16k). This observation reflects the fact that the poor load balancing of the simple coarse-grain model diminishes its advantages compared to the fine-grain alternative.
• When applying constant static balancing, in some cases the balanced coarse-grain implementation is less effective than the unbalanced alternatives (e.g., 16×256×16k, 32×256×16k). This can be attributed both to inaccurate theoretical modeling of the system parameters for the calculation of the balancing factors, as well as to the inappropriateness of the constant balancing scheme for boundary processes. Generally, the constant balancing approach performs well for relatively symmetric process topologies, with few of the processes lying at the topology boundary.
• When applying variable static balancing, the coarsegrain hybrid model was able to deliver superior performance to the message passing alternative in all cases. The performance improvement lies in the range of 2-12%, also depending upon the selected process topology in each case.
• The same performance improvement is also observed for the adaptive dynamic balancing approach, thus confirming that the required information for the application of efficient load balancing can be obtained at run-time, with minimal overhead. Overall, the variable static and the adaptive dynamic balancing techniques have delivered the best parallel performance in almost all cases, and have proven to be the most reliable and efficient parallelization approaches.
The granularity results in Fig. 4 reveal that these two balancing techniques (variable, adaptive) perform better than any other implementation for almost all granularities. Some performance degradations observed at certain threshold values of the tile height z can be ascribed to the transition from the eager to the rendezvous MPICH message protocol, occurring at 128000 bytes (for instance, see Fig. 4 (c) at z = 125). Depending on the message size, MPICH resorts to three different message passing communication protocols, namely short, eager and rendezvous, each assuming a different approach in terms of intermediate buffering and/or receiver notification. Generally, there are certain trade-offs when transitioning between these protocols, that mainly correlate sender-receiver synchronization with the overhead of intermediate buffering, therefore performance implications should not be surprising.
The performance evaluation of all models with the aid of the ChaMPIon/Pro implementation mainly aimed at providing additional insight for the multiple hybrid implementation, since we simply repeated the same experiments on a smaller scale. Fig. 5 indicates that an appropriately balanced hybrid funneled implementation is usually even slightly better than the multiple hybrid implementation, confirming our intuition that appropriate load balancing could mitigate the limited multi-threading support and also avoid the overhead associated with ensuring thread safety of the message passing library. As many popular message passing libraries currently provide only limited multi-threading support, appropriately load balancing the funneled hybrid model arises as a feasible and efficient parallel programming alternative. In addition, even if the thread safety of the message passing library allows all threads to perform inter-node communication, the underlying network infrastructure might be too restrictive for the efficient utilization of this feature (e.g., one NIC to be used by multiple threads).
Diffusion Equation
We conducted similar experimental evaluation for the DE kernel, by using both the MPICH and the ChaMPIon/Pro libraries. The results are summarized in Fig. 6 for the MPICH library, and in Fig. 7 for the ChaMPIon/Pro implementation. In all cases, the execution times have been normalized to the message passing model.
We have observed similar behavior as in the ADI benchmark. Variable and adaptive balancing deliver the best results, by providing a performance improvement in the range 3-22%. The relative performance improvement in the DE kernel is significantly higher in all iteration spaces compared to ADI, reflecting the fact that hybrid programming can be particularly beneficial in algorithms with high communication to computation demands. The reader should also note that although DE imposes more communication overhead than ADI, it is still heavily computationbound. Since the advantage of hybrid programming lies in the efficient utilization of the underlying architecture for communication purposes, it could be even more suitable for the parallelization of communication-bound applications, as long as these do not saturate the available memory bandwidth. Fig. 7 further confirms that multiple hybrid parallelization is not more efficient than either the variably or the adaptively balanced coarse-grain funneled hybrid implementation, and even leads to slight performance degradation in the 64×256×16k iteration space. Also, the relative performance improvement is significantly higher for the DE kernel (more than 5% in all cases) in comparison to ADI (around 3% for the various iteration spaces), thus the same qualitative behavior observed at the MPICH experiments is exhibited also for the ChaMPIon/Pro library, though on a smaller scale.
CONCLUSIONS
This paper discusses load balancing issues regarding hybrid parallelization of tiled loop algorithms. We propose three load balancing schemes, namely two static techniques (constant, variable), as well as a dynamic (adaptive) one. Static balancing is based on application-system modeling, so as to determine a suitable task distribution between threads. On the other hand, dynamic balancing extracts similar information at run-time, by profiling the actual application performance and re-evaluating the initial load balancing strategy. We have compared both message passing and hybrid implementations, and demonstrated the performance improvements that can be attained by resorting to an appropriately balanced coarse-grain hybrid implementation.
The experimental evaluation has confirmed that for the algorithms considered here unoptimized hybrid parallelization may perform poorly, even worse than the monolithic message passing paradigm. Moreover, coarse-grain SPMDlike hybrid parallelization does not always outperform the fine-grain incremental alternative, as the poor load balancing of the former diminishes its relative advantages in respect to the latter (e.g. higher parallelization efficiency, overlapping computation with communication, no thread re-initialization overhead). However, when applying variable or adaptive load balancing, the coarse-grain funneled hybrid model was able to deliver superior performance in respect to message passing parallelization, and even exhibit similar performance to the coarse-grain multiple alternative, but without requiring additional thread-safe support from the message passing library.
Generally, the performance improvements are proportional to the intrinsic communication demands of the parallelized algorithm, and also depend on the efficiency of the process topology assumed in each case. Conclusively, efficient load balancing can effectively mitigate limited multithreading support, and delivers superior performance compared to standard message passing implementations. All models that have been proposed can easily be adopted to more generic parallelization techniques, as we have emphasized on the elements of applicability and simplicity.
A Static Load Balancing
Proof. For the sake of simplicity, and without loss of generality, we assume that all divisions result to integers, in order to avoid over-complicating our equations with ceil and f loor operators. Thus, each process will be assigned Z z tiles, each containing Xz P iterations. Furthermore, under a balancing factor bal, the master thread will assume the execution of a fraction bal T of the process's computational load, while each of the other T − 1 threads will be assigned a fraction of T −bal T (T −1) of the remaining computations. Note that under the funneled hybrid model, only the master thread is allowed to perform inter-node communication, and should therefore take care of both its own communication data, as well as the communication data of the other threads. The overall completion time of the algorithm can be approximated by multiplying the total number of execution steps with the execution time required for each tile (step) t tile . It holds 
the tile execution time of the master thread, while
the tile execution time of a non-master thread. In order to minimize the overall completion time, or equivalently the execution time for each tile (since the number of execution steps does not depend on the load distribution between threads), t Assuming that the computation time t comp is a linear function of the number of iterations, that is, t comp (ax) = at comp (x), (7), (8) and (9) can be easily combined to deliver (1).
B Dynamic Load Balancing
Proof. We assume that the initial thread load balancing is in general non-optimal in practice, that is, if t Such suboptimal load balancing between the threads can be ascribed to inaccurate theoretical system modeling, as well as to the various approximations in performance estimation. Our goal is to determine a new balancing factor bal , such that the average tile execution times become equalized for all threads, that is We assume that the time required for the computation of a tile is proportional to the number of iterations associated with that tile. Formally, we assume
Because of (11), it holds 
By combining (13), (14) and (15), (10) (11) and (12), we can easily deduce (2) 
