Abstract-In this paper, we revisit the supernode-shape selection problem, which has been widely discussed in the bibliography. In general, the selection of the supernode transformation greatly affects the parallel execution time of the transformed algorithm. Since the minimization of the overall parallel execution time via an appropriate supernode transformation is very difficult to accomplish, researchers have focused on scheduling-aware supernode transformations that maximize parallelism during the execution. In this paper, we argue that the communication volume of the transformed algorithm is an important criterion, and its minimization should be given high priority. For this reason, we define the metric of the per-process communication volume and propose a method to minimize this metric by selecting a communication-aware supernode shape. Our approach is equivalent to defining a proper Cartesian process grid with MPI_Cart_Create, which means that it can be incorporated in applications in a straightforward manner. Our experimental results illustrate that by selecting the tile shape with the proposed method, the total parallel execution time is significantly reduced due to the minimization of the communication volume, despite the fact that a few more parallel execution steps are required.
Ç

INTRODUCTION
T ILING or supernode 1 transformation has been proposed as one of the most efficient methods to map applications based on stencils onto distributed-memory architectures with significant communication latencies. Stencil computations are very frequently met in image processing and in simulation applications resulting from the discretization of partial differential equations (PDEs) using explicit finite-difference schemes [2] , [3] , [4] . Such applications are essentially DOACCROSS loop nests, i.e., n-dimensional loop nests with at least n linearly independent data dependencies. In order to execute DOACCROSS nested loops in the aforementioned parallel architectures, researchers have proposed the application of tiling transformation [5] . Tiling groups neighboring iterations into one computational unit, the tile or supernode. For the parallel execution of tiled iteration spaces, tiles are assigned to the available processes that are orchestrated to communicate before and after the computation within one tile. In this way, both the communication volume and frequency are reduced, thus enabling DOACCROSS nested loops, which suffer from high communication needs, to efficiently execute onto parallel platforms where remote memory access times are crucially larger than local access times and communication start-up latencies are an important hurdle to high performance.
Tiling for coarse-grain parallelism has attracted extensive scientific research [5] , [6] , [7] , [8] , [9] , [10] , [11] , [12] , [13] , [14] , [15] , [16] , [17] , [18] , [19] , [20] , [21] , [22] , [23] , [24] , [25] , [26] right after its presentation by Irigoin and Triolet in 1988 [1] . Tiling transformation provides flexibility concerning the number of iterations to be grouped together into a single tile (tile size), as well as the shape of the enclosing parallelogram. Since the selection of the tile size and shape greatly affects the properties of the transformed space, researchers have focused on defining criteria for an efficient tiling transformation. Ohta et al. [17] , Hodzic and Shang [11] , and Andonov et al. [27] focused on the selection of the optimal tile size based on the special characteristics of the application and the target architecture. Ramanujam and Sadayappan [5] , Boulet et al. [7] , and Xue [21] worked on the selection of a tile shape that minimizes the per-tile communication volume, i.e., their goal was to minimize the dependence vectors cutting the planes defining a tile. In this case, the optimal tile shape is formed by planes parallel to the algorithm's dependence cone. More importantly, for a given tile size, Hodzic and Shang [11] , [14] and Hö gstedt et al. [15] , [28] determined the tile shape that minimizes the parallel execution steps of the tiled space. In this case, the scheduling-aware tile shape is obtained by 1) deciding on an appropriate basic tile shape (in most cases, tile sides are again parallel to the dependence cone) and 2) properly scaling the sides of the tile, in order to minimize the maximum parallel execution path between the first and the last tile.
Determining the optimal tiling transformation, i.e., the one that minimizes parallel execution times, is very difficult, since various tiling transformations lead to different transformed (tiled) iteration spaces, memory access patterns, communication granularities, and processor idle times. In order to evaluate the impact of each of the above factors to the overall execution time, one needs to devise a highly accurate parallel execution model that takes into consideration all the above factors. However, such an execution model is almost 1 . Throughout relevant research papers, the terms supernode and tiling transformation have been used to describe the same transformation. Although we adopt the term tiling, we have used the term supernode in our title and in several sections of this manuscript as a reference to the seminal paper of Irigoin and Triolet [1] .
impossible to exist for the extremely complex modern parallel platforms. For this reason, the researchers simplify their approaches by discarding some of the above factors, paying the cost of suboptimal tiling transformations. For example, Hö gstedt et al. in [15] and [28] determine the shortest path between the first and the last tile without taking into consideration the communication overhead. Hodzic and Shang in [11] and [14] propose a tile-shape selection technique assuming constant communication times for all message sizes. This approach disregards the communication data volume with the assumption that the volume-dependent message transmission is overlapped by useful computations and, thus, hidden. However, as shown in [9] and [16] , one needs to apply special scheduling strategies (that are not considered in [11] , [14] ) and employ sophisticated communication hardware in order to hide some part-and not all-of the transmission time.
In this paper, we focus on a special but important class of problems that are frequently met in practice. We assume rectangular iteration spaces (as in [11] ) and nonnegative elements in dependence vectors. Note that a tiling transformation can be uniquely defined by determining three parameters: 1) tile size, 2) basic tile shape, and 3) scaling factors of tile sides. In our approach, we consider the tile size as an input parameter determined by the computation and communication costs of the algorithm and the hardware features of the target architecture. In addition, we consider rectangular basic tile shapes. A general parallelogram tiling transformation can only be implemented by automatic parallelizing compilers due to the complexity of the code that traverses nonrectangular tiles [10] , [12] . For the above problem class, we propose a new criterion for the selection of an efficient tile shape. This criterion emphasizes the minimization of the per-process communication volume. Note that minimizing the communication overhead is the primary goal of tiling transformation; therefore, trying to further decrease the communication data by properly selecting the tile shape seems a good idea in the first place. The problem arises in the cases where the scheduling-aware tile shape differs from the communication-aware tile shape proposed here. In this paper, we demonstrate that the criterion for communication minimization should be given the greatest priority when targeting distributed-memory architectures, since it is the one that more drastically affects the overall execution time of the parallel algorithm.
Our method takes into consideration the boundaries of the initial iteration space and the dependencies of the original algorithm and can be applied on a distributed-memory architecture for a limited (fixed) number of processes. This selection of the tile scaling factors is equivalent to determining a virtual process topology and thus can be easily incorporated in a message-passing programming environment like MPI, with a proper initialization to the parameters of the MPI_Cart_Create routine. Our experimental results indicate that the proposed communication-aware tile shape significantly reduces the overall execution time, compared to the one achieved by the scheduling-aware tile shape, although it requires a larger number of parallel execution phases.
The rest of the paper is organized as follows: Section 2 provides useful background knowledge and basic definitions concerning the program model, supernode (tiling) transformation, scheduling, and mapping techniques. Section 3 discusses in more detail work concerning criterions and methods for the selection of efficient tiling transformations, while Section 4 defines our problem and proposes a method to select tiling transformations that minimize the per-process communication volume. Section 5 experimentally tests the efficiency of the proposed approach in terms of total parallel execution times and compares it to the selections proposed by previous research. Finally, Section 6 provides the overall conclusions drawn from this paper.
PRELIMINARIES
Algorithmic Model
Our algorithmic model concerns applications that involve ðn þ 1Þ-dimensional perfectly nested loops with constant flow dependencies. The iteration space J nþ1 is rectangular; thus, it holds J nþ1 ¼ fjðj 1 ; j 2 ; . . . ; j nþ1 Þ 2 Z nþ1^l i j i u i ; i ¼ 1; . . . ; n þ 1g, where l i , u i 2 Z are the lower and upper bounds of the ith loop, respectively. The dependencies of the problem are expressed with constant ðn þ 1Þ-dimensional dependence vectorsd i , i ¼ 1; . . . ; m. We denoted ij the jth element of vectord i . In the class of problems under consideration, it holds thatd ij ! 0, i ¼ 1; . . . ; m and j ¼ 1; . . . ; n þ 1. The dependence matrix of the algorithm, denoted D, is an ðn þ 1Þ Â m matrix containing as columns the dependence vectors of the algorithm. The reader is referred to [29] for more details on the properties of data dependencies. It holds that rankðDÞ ¼ n þ 1, which means that the algorithm has n þ 1 linearly independent data dependence vectors. Note that if rankðDÞ < n þ 1, then the iteration space can be partitioned into independent subspaces and parallelized without the use of tiling [30] . We define the vectord 0 ¼ ðd
. . . ; m, which expresses the maximum dependence length per dimension. Unlike [23] and [31] , we consider incore computations, i.e., all data sets assigned to each process fit in the main memory; thus, we do not consider secondary storage access times. Overall, the algorithms have the general form of Algorithm 1, where U is an ðn þ 1Þ-dimensional matrix, and F is a linear function.
Application Example: Advection Equation
Advection is the physical process of transportation within a fluid, for example, the transportation of polluted particles in the atmosphere. Advection phenomena are very commonly studied in meteorology. The advection equation is the PDE that governs the motion of a conserved scalar as it is advected by a known velocity field (the material in which advection occurs). The advection equation for a scalar v (e.g., particle density or temperature) is expressed mathematically as
whereã is the vector field, e.g., the velocity vector of the material. In two spatial dimensions, the above equation is equivalent to
If we need to study an advection process in an X Â Y space for a time window T , we can discretize the initial domain into a uniform grid using a time step Át and space steps Áx and Áy. Then, we can discretize the above PDE using a variety of finite differencing schemes. For example, if we employ the Euler-Forward scheme [32] , the time derivative can be substituted by a fraction of differences as follows:
Át
. The physics of the problem allows us to employ upwind [32] differencing schemes for the space derivatives, which involves computations with "previous" spatial grid points. This discretization strategy favors the direct application of rectangular tiling in the sequel. Thus, in this case, we can substitute the space partial derivatives as follows:
Áx
. If we substitute the above formulas in (1), we get
where for notational convenience, we suppose a uniform computational grid in the two spatial dimensions ðÁx ¼ ÁyÞ, and a x ¼ a y ¼ a. Note that v 
The dependence matrix of the above algorithm is
andd 0 ¼ ð1; 1; 1Þ. The discretization process followed leads to nonnegative elements in the dependence matrix. Note that alternative discretization schemes can lead to longer dependencies. The reader can find additional details on the equation and the discretization process in [32] .
Supernode Transformation
In a supernode transformation, the iteration space J nþ1 is partitioned into identical ðn þ 1Þ-dimensional parallelepiped areas (tiles or supernodes) formed by n þ 1 independent families of parallel hyperplanes. Supernode transformation is defined by the ðn þ 1Þ-dimensional square matrix H. Each row vector of H is perpendicular to one family of hyperplanes forming the tiles. Dually, supernode transformation can be defined by n þ 1 linearly independent vectors, which are the sides of the supernodes. Similar to matrix H, matrix P contains the side vectors of a supernode as column vectors. It holds that P ¼ H À1 . Formally, supernode transformation is defined as follows:
where bHjc identifies the coordinates of the tile that index pointjðj 1 ; j 2 ; . . . ; j nþ1 Þ is mapped to, andj À H À1 bHjc gives the coordinates ofj within that tile relative to the tile origin. Thus, the initial ðn þ 1Þ-dimensional iteration space is transformed to a ð2n þ 2Þ-dimensional one, the space of tiles, and the space of indices within tiles. Indices within tiles have to be sequentially executed, while tiles themselves can be assigned to processes and executed in parallel according to a valid hyperplane schedule, as we will see in Section 2.4. The tiled space J S and the supernode dependence matrix D S are defined as follows:
nþ1 j0 bHj 0 c 1g, wherej 0 denotes the index points belonging to the first complete tile starting from the origin of the iteration space J nþ1 . The tiled space can be also written as
S is a distinct tile with coordinates ðj
Given an algorithm with dependence matrix D, for a tiling to be legal, it must hold that HD ! 0. This ensures that tiles are atomic and that the initial execution order is preserved [1] , [5] . In the opposite case, any execution order of tiles will result in a deadlock. In this paper, we assume that all dependence vectors contain no negative element and are smaller than the tile size. This allows us to apply rectangular tiling transformations, which are defined by the diagonal matrices H ¼ diagðh 1 ; h 2 ; . . . ; h nþ1 Þ and P ¼ diagðp 1 ; p 2 ; . . . ; p nþ1 Þ. Fig. 1a shows an example of a rectangular tiling transformation.
Finally, we assume that all dependencies are entirely contained in each supernode's area, which means that jHDj < 1 [20] or, alternatively, that the supernode dependence matrix D S contains only zeros and ones. This assumption is quite reasonable since dependence vectors for common problems are relatively small, while tile sizes may result to be orders of magnitude greater in systems with very fast processors. In this case, every tile needs to exchange data only with its nearest neighbors. The number of index points contained in a supernode expresses the respective computation cost of this supernode (tile) and is calculated by detðP Þ. Thus, we have V comp ¼ detðP Þ, and for rectangular tiling transformations,
Scheduling, Mapping, and Parallel
Execution Time
We will schedule the problems under consideration with linear scheduling techniques [33] , [34] . Central to linear scheduling is the notion of the scheduling vector Å. Intuitively, in simple cases, it suffices to calculate the inner product of a pointj 2 J nþ1 with Å to derive the parallel time step at whichj will be executed. In the general case [33] , j 2 J nþ1 scheduled according to a linear scheduling vector Å will be executed at t j ¼ b
is the alignment constant or the initial time of execution of the first point in the iteration space, and dispÅ ¼ minÅd i :d i 2 D is the displacement constant expressing the time pace of computations. Thus, a tilej S 2 J S will be executed at t j S ¼ b
All points (or tiles) that lie within each n-dimensional surface perpendicular to the scheduling vector Å can execute in parallel; thus, one can employ an n-dimensional array of processes to maximize parallelism [35] . In our approach, we will also consider the general case of an n-dimensional process grid to execute in parallel ðn þ 1Þ-dimensional iteration (or tiled) spaces.
Suppose, for notational convenience, that l i ¼ 0, i ¼ 1; . . . ; n þ 1. Then, the last point of the iteration spacẽ j last ¼ ðu 1 ; u 2 ; . . . ; u nþ1 Þ will be transformed by a rectangular tiling transformation to the last tilej
Using Å ¼ ð1; 1; . . . ; 1Þ as a scheduling vector, the last tile will be scheduled at time step t j S ¼ P nþ1 i¼1 u i p i (t 0 ¼ 0 and supposing dispÅ ¼ 1), which clearly constitutes the total number of parallel time steps for the problems under consideration. In every parallel time step, each process performs uninterrupted computation within a single tile and communicates with its n neighbors in order to exchange data. Note that even if the dependencies of the problem lead to the need for data exchange with diagonal neighbors, one can apply indirect message-passing techniques (discussed in [36] ), in order to limit the neighboring processes to the n nondiagonal ones. If t c is the time to compute one iteration, t s is the communication start-up latency, t t is the time to transmit a unit of data, and k is the mapping dimension (i.e., the dimension across which all tiles are assigned to the same process), then the total parallel execution time can be expressed by
The above equation multiplies the number of the parallel time steps ð P nþ1 i¼1 u i pi Þ with the total time of each step. This in turn is decomposed in the computation time of a tile ð Q nþ1 i¼1 p i t c Þ, the start-up time for the communication along n dimensions ðnt s Þ, and the transmission time of all messages ð P nþ1 i¼1;i6 ¼k ð Q nþ1 j¼1;j6 ¼i p j Þd 0 i t t Þ. 2gÞ and the tile depen-
We map tiles along the innermost (second in this case) dimension to the same process ðC 1 ; C 2 ; C 3 Þ. If we apply linear scheduling with vector Å ¼ ð1; 1Þ, the last tile of the algorithm ðj S last ¼ ð2; 2ÞÞ will be executed at time step t j S ¼ Åj S last ¼ 4; thus, the total number of parallel time steps will be five. Tiles executed at the same time step are shaded in the same color. According to (3), the total parallel execution time will be
RELATED WORK-"OPTIMAL" SUPERNODE SHAPES
After the proposition of tiling transformation by Irigoin and Triolet [1] , researchers started elaborating on efficient or optimal tiling transformations. Apart from the definition of legal tile shapes, i.e., tile shapes that respect the dependencies of the algorithm and allow uninterrupted execution of tiles, the seminal work of Ramanujam and Sadayappan [5] introduced the idea of tiling transformations for minimal communication. They formulated a per-tile communication function, which needs to be minimized respecting the legality criterions, in order to minimize the number of dependencies that crosscut the tile's surfaces. These crosscutting dependencies are responsible for communication between tiles. Further elaborating on this optimization criterion, Xue [21] showed that the surfaces that define the tile should be selected to be parallel to the dependence cone of the algorithm. A more general approach could be to search for a tiling transformation that minimizes the total parallel execution time of the algorithm as expressed in (3) . Note that the selection of various tiling transformations affects the number of parallel time steps and the time for computation and communication per step. Minimizing the above equation is quite an intricate task, since one cannot effectively model the factors t c , t s , and t t . t c , being the computation time of a single iteration within a tile, includes the necessary memory accesses for the computed data. The time to access data from memory is crucial for the computation process and is greatly affected by the tile size and shape. The interaction of various tiling transformations with the memory subsystem is difficult to model, since a number of factors such as the cache size and organization and hardware prefetching mechanisms need to be taken into consideration. In any case, t c cannot be considered as a constant but, on the contrary, as a proper function of the tiling parameters; thus, t c ¼ t c ðp 1 ; p 2 ; . . . ; p nþ1 Þ. Accordingly, the t s time is dependent on the implementation of the message-passing mechanisms employed (e.g., MPI library) and cannot be easily considered as constant since it includes communication tasks that are not deterministic, such as the management of unexpected messages (e.g., call to receive functions after the arrival of a message). Finally, even for t t , which is the simplest of the above parameters, since one needs to simply divide the message size with the effective bandwidth of the underlying communication network, one has to consider implementation-specific issues. For example, various MPI libraries employ different transmission mechanisms (eager, rendezvous, etc.) according to the message size.
With a goal to minimize the total parallel execution time, Hodzic and Shang [14] employ a simpler model than that of (3). The authors disregard the communication transmission part, assuming that this task can be overlapped by useful computations. They consider t c and t s as constants and thus focus on the selection of tile shapes that minimize the total number of parallel execution steps. However, as shown in [9] and [16] , one needs to apply special scheduling strategies that are not considered in [14] and employ sophisticated communication hardware in order to hide some part of the transmission time. Similarly, Högstedt et al. [15] split the communication time in start-up and transmission times. The former is added to the computation time as a constant, while the latter is again considered overlapped by computations. Again here, the authors select a tile shape that minimizes the path to reach the last tile or, in other words, to minimize the processor idle times and maximize parallelism. We call the tile shapes selected in [14] and [15] as scheduling aware. Note that both approaches ignore the total communication volume imposed by a tiling selection, by assuming that all of the transmission time is hidden underneath useful computation. However, it is not always possible to overlap communication transmission even if advanced scheduling schemes and sophisticated hardware are employed [16] . In our approach presented in the next section, we show that the minimization of the total communication volume is an important issue that should be taken into consideration, especially when commodity interconnection networks with no overlapping capabilities are concerned.
Finally, Parsa and Lotfi [23] provide a genetic algorithm that minimizes an objective function encapsulating processing, communication, and disk-access times. Although their approach targets general tiling transformations, their objective function does not represent the overall parallel execution time, while the selected transformations are not tested in real problems and platforms as far as their efficiency is concerned.
COMMUNICATION-AWARE SUPERNODE SHAPE
In this section, we propose a new criterion for the selection of an efficient tile shape, i.e., the minimization of the per-process communication volume of the algorithm. In addition, we provide a method to select a tile shape based on this criterion, called the communication-aware tile shape. Prior to this, we formally define our problem and provide the schedulingaware solution to this problem proposed by previous work.
Definition of the Problem
The input of our problem is an algorithm following the model discussed in Section 2.1. For notational convenience, we assume that l i ¼ 0, i ¼ 1; . . . ; n þ 1; thus, we consider an ðn þ 1Þ-dimensional nested loop with a rectangular iteration space ðu 1 Â u 2 Â; Á Á Á ; Âu nþ1 Þ and a dependence matrix D of nonnegative constant flow dependencies. The tile size g is also given as an input to our problem. Note that the definition of the optimal tile size is a very difficult problem. However, for a specific parallel architecture, one can conduct a series of benchmarks, taking into consideration parameters such as the cache size, the CPU power, and the communication latency and bandwidth to experimentally approximate an efficient tile size. Finally, we consider a fixed number of available processes C.
Contrary to related scientific work, we adopt a different approach for the specification of the desirable communicationaware tile shape: instead of defining a tiling transformation matrix H or P , we equivalently aim at determining an appropriate process topology C ¼ Q n i¼1 C i for the mapping of the parallel algorithm, according to the mapping scheme presented in Section 2.4. Indeed, the selection of the process topology implicitly enforces a particular tiling transformation: determining a topology C 1 Â C 2 Â Á Á Á Â C n for the parallel mapping of an algorithm with iteration space u 1 Â u 2 Â; Á Á Á ; Âu nþ1 effectively slices dimension u i to C i parts ði ¼ 1; . . . ; nÞ. This fact is equivalent to applying a rectangular tiling transformation described by the following matrix:
where g is the tile size dictated by the underlying architecture (processor speed, interconnection bandwidth, etc.) and affecting the grain of parallelism. Thus, the problem of selecting an efficient tiling transformation collapses to the definition of the terms C 1 ; C 2 ; . . . ; C n of the above expression. Moreover, proposing an efficient Cartesian process topology can lead to the direct incorporation of the optimization technique in a message-passing library like MPI, e.g., through the MPI_Cart_create library routine. Note that in the above discussion, it is assumed that all tiles along the inner dimension are mapped to the same process. Since the algorithm contains no negative dependence element, any permutation of the loop nest is legal [29] , [37] , that is, any loop can be selected to be the innermost one. 
Previous Work: Scheduling-Aware Supernode Shape
According to [14] and [15] , given C processes for the mapping of an ðn þ 1Þ-dimensional algorithm on an n-dimensional process grid, the scheduling-aware tiling transformation can be obtained as a feasible solution to the following optimization problem:
A process topology complying to (4) tries to place an equal number of processes in each dimension, minimizing in this way the required total number of parallel execution steps, but fails to consider both the algorithmic dependencies and the iteration space, in order to reduce the communication volume. The advantage of such a process topology is that it minimizes the latency of the parallel program; it ensures that the most distant process will start executing its work share at the earliest possible time step. Note that in this paper, we do not compare against [14] and [15] as a whole, since these papers propose the selection of a general tiling transformation (not necessarily rectangular) that minimizes the idle tiles of processes. On the contrary, we compare our approach presented in the next paragraph against the approach of [14] and [15] to solve the problem defined in Section 4.1.
Communication-Aware Supernode Shape
In this section, we discuss that an important criterion for the selection of an efficient tiling transformation is the communication volume imposed by the transformation. We use the notion of the per-process communication volume, i.e., the data that need to be sent by one process due to algorithmic dependences. For the problems under consideration, a process in the n-dimensional process grid needs to send n distinct messages (one per dimension). The size of each message is equal to the product of the maximum dependence across the dimension of communication by the size of the n À 1-dimensional boundary surface. Lemma 1 provides an expression for this metric.
Lemma 1.
If an ðn þ 1Þ-dimensional rectangular iteration space u 1 Â u 2 Â; Á Á Á ; Âu nþ1 is assigned to an n-dimensional process grid C 1 Â C 2 Â; Á Á Á ; ÂC n ¼ C, then the total communication volume of a nonboundary process is given by the expression: The rectangular tiling transformation that will be applied can be defined by a 3D diagonal matrix P ¼ diagðp 1 ; p 2 ; p 3 Þ. We partition the u 1 Â u 2 space into 16 tiles and appropriately adjust the tile height to conform with the restriction of the tile size. Thus, we will first determine p 1 , p 2 , and subsequently, we will set p 3 ¼ g=p 1 p 2 . We will investigate two alternative feasible tile shapes, namely,
Fig . 2 shows the projection of the tiled iteration space on the j 1 j 2 surface and its allocation to the 16 processes ðc 1 ; . . . ; c 16 Þ for the two alternative tiling transformations. Note that each process is assigned a chain of tiles along the j 3 dimension. The shaded parts of Fig. 2 represent the communication data for the two candidate tiling transformations. The total communication volume V tcomm derives from the boundary area between the processes (3u 1 u 3 þ 3u 2 u 3 for the first transformation and u 1 u 3 þ 7u 2 u 3 for the second transformation), multiplied by the maximum coordinate of the dependence matrix in the corresponding dimension, which in our case is two. Thus, we have V tcomm;1 ¼ 6u 1 u 3 þ 6u 2 u 3 and V tcomm;2 ¼ 2u 1 u 3 þ 14u 2 u 3 . This difference in communication volume is depicted on the per-process communication volume as well, derived from Lemma 1, where
4 Þ, and V pcomm;2 ¼ 2u 3 ð
If we apply linear scheduling defined by vector Å ¼ ð1; 1; 1Þ, then the tile ð4; 4; T =p 3 Þ will be scheduled last according to the first tiling transformation and will be executed at time step T 1 ¼ 8 þ T =p 3 , while the respective last tile ð8; 2; T =p 3 Þ of the second transformation will be executed at time step T 2 ¼ 10 þ T =p 3 . This implies that the first transformation is better as far as the total number of parallel execution steps is concerned, since T 1 < T 2 . However, notice that if u 1 > 2u 2 , then V tcomm;1 > V tcomm;2 , which means that the second transformation is superior in terms of total communication volume. Consequently, when it comes to executing the above problem for u 1 > 2u 2 , we need to decide between scheduling-aware and communicationaware tiling. Intuitively, in our example, one can see that the communication-aware transformation entails a moderate increase in the number of time steps if we make the reasonable assumption that T =p t > > 2. On the other hand, if we have u 1 ¼ 4u 2 , the communicationaware transformation leads to almost 27 percent less communication data. For this reason, we argue that the communication-aware transformation will lead to a significantly lower total execution time.
The following lemma provides the condition that must hold in order to minimize the communication of a nonboundary process. This is achieved by the even distribution of communication data in all process dimensions.
. . . ; n be the maximum dependence per direction, and C be the number of processes available for the parallel execution of the algorithm. If there exist C i 2 IN such that
then process topology C 1 Â Á Á Á Â C n minimizes the perprocess communication for the tiled algorithm on C processes.
Proof. According to (6) , it holds that
Each process assumes du i =C i e iterations along direction i, where 1 i n. For the sake of simplicity, we assume that du i =C i e ' u i =C i . Using (8), (5) from Lemma 1 can be written by substituting C n as follows:
Note that V comm is substantially a function of C 1 ; . . . ; C nÀ1 (formally V comm : IN nÀ1 ! IR). Let V comm be the real extension of V comm , defined by (9) for C j 2 IR, 1 j n ðV comm : IR nÀ1 ! IRÞ. For a stationary point ðC 1 ; . . . ; C nÀ1 Þ of V comm and 1 j n À 1, it holds that
Also,
Because of (10) and (11), V comm has a minimum at ðC 1 ; . . . ; C nÀ1 Þ, and as C i 2 IN, 1 i n À 1, this will be the minimum of V comm as well. Therefore, the communication data is minimal when a topology C 1 Â Á Á Á Â C n satisfying (10) is assumed. t u
Finally, Theorem 1 makes use of Lemma 2 to derive an expression for the number of processes in each dimension.
. . . , n be the maximum dependence per direction, and C be the number of processes available for the parallel execution of the algorithm. In order to minimize the per-process communication volume, the number of processes in each dimension should be set by
Proof. Note that it holds that
By combining (13) with (10), we can easily deduce (12) . t u
It should be noted that (12) does not always define a valid integer process topology: It is possible that C j = 2 IN for some value j with j ¼ 1; . . . ; n. However, when truncated to an integer, it can serve as a good starting value for an exhaustive algorithm searching for feasible process topologies in the close neighborhood of the minimum of V comm , as determined by (12) . In practice, as n þ 1 does not exceed three or four and C ranges up to a few hundreds or maybe thousands of processes, the high complexity of the heuristic algorithm does not result in high execution times. Furthermore, the monotonicity of function V comm allows immediate elimination of candidate process topologies, which lead to increased communication cost. In order to verify this claim, we measured on a PIII @ 800 MHz the execution times for the specification of a feasible communication-aware 3D process topology, given all possible 4D iteration spaces ð100 . . . 10kÞ Â ð100 . . . 10kÞ Â ð100 . . . 10kÞ Â u nþ1 , data dependencies ½ð1 . . . 3; 0; 0; dÞ; ð0; 1 . . . 3; 0; d 0 Þ; ð0; 0; 1 . . . 3; d 00 Þ and for 100 C 1k. The execution time equaled on the average 21 msec, while under no circumstances did it exceed 0.9 second. Note finally that the approach described above does not theoretically ensure the finding of the minimum communication volume in integer space. However, it greatly restricts the search space compared to an exhaustive search within all possible process topologies at the cost of possible unoptimality.
EXPERIMENTAL RESULTS
In this section, we will experimentally evaluate the parallel performance of the communication-aware tiling transformations or, equivalently, process topologies as derived from (12) . In addition, we will compare the proposed topologies against the scheduling-aware ones proposed in [14] and [15] and derived from (4). We consider the 2D and 3D advection equations and an artificial kernel with a 3D nested loop following the algorithmic model described in Section 2.1. Our experimental platform is a 16-node Linux cluster (kernel 2.6.23.1). Each node includes two quad-core Xeon chips based on Intel's Core-2 microarchitecture (E5335 @ 2 GHz). Two cores per package share a 4-MB L2 cache. The interconnection network is Gigabit Ethernet. We experimented with 100 processes running in the above cluster. We used MPICH v1.2.7 MPI implementation, configured with gcc v4.2.3, and applied the -O2 optimization flag to all programs. In order to reduce as much as possible the differences in memory access patterns induced by various tiling transformations, we applied cache blocking as described in [38] .
2D Advection Equation
Recall from Section 2.2 that the 2D advection equation problem results in a 3D iteration space, which is mapped on a 2D process grid. We experimented with various iteration spaces ðX Â Y Â T Þ and all possible process topologies for the 100 processes. We mapped the X Â Y plane on the process grid and assigned tiles along T to the same process. In all cases, the scheduling-aware process topology that derives from (4) is 10 Â 10. Fig. 3 provides information on the scalability of our implementation for an increasing number of processing cores and four different problem sizes. As expected by the nearest neighbor nature of the communication and the dimension of the problem, the algorithm scales better for larger problems. The hardware configuration of our platform enforces several cores to share the same network interface, which is a bounding factor for the scalability of our implementation. Fig. 4 presents the first comparison between the scheduling-aware and the communication-aware strategies for the selection of a process topology for X ¼ 50 K and Y ¼ 8 K. The communication-aware topology in this case is 25 Â 4. We varied T and the tile size in order to assign a different number of tiles (denoted T iles) to the same process across different runs. We observe that when the total number of tiles assigned to each process is small, then the scheduling-aware topology outperforms the communication-aware one, since as expected, for a small number of total parallel time steps, it is crucial to maximize the concurrency of processes or, in other words, to minimize the steps before the last process starts its execution. The scheduling-aware topology enforces the last process to start its execution at parallel time step 20, while the communication-aware topology enforces the last process to start its execution at parallel time step 29. Observe also that this difference in concurrency diminishes as the number of tiles increases. In this case, the reduction of the communication volume imposed by the communication-aware topology leads to smaller total parallel execution times. Note that for meaningful parallelization scenarios, we assign a number of tiles to each process significantly larger than C; thus, for the forthcoming experiments, we will assume that T iles ) 100.
In Fig. 5 , we present normalized parallel execution times decomposed into communication and computation times for six iteration spaces and all feasible process topologies. We experimented with various tile sizes per iteration space and present the best attained performance. The communicationaware topology is denoted with a dot over the corresponding bar. The first observation that can be made is that the total parallel execution time greatly differentiates between different topologies (the difference reaches even to a factor of five). These large differences are due to the extreme increases in communication times that can be imposed by an unfortunate selection of the process grid. This fact justifies our argument that the minimization of the communication volume via the selection of a proper tiling transformation should be given high priority. Observe also that the communication-aware topology outperforms the scheduling-aware topology in the first five iteration spaces, while for the sixth iteration space, both strategies lead to the selection of a 10 Â 10 topology. Note finally that the communication-aware topology leads to the lowest total parallel execution times among all topologies in five out of six iteration spaces. Table 1 provides a direct comparison between the two strategies for the iteration spaces in Fig. 5 . The communication-aware topology exhibits an improvement in performance compared to the scheduling-aware topology in five out of six iteration spaces that ranges from 2.5 percent (in iteration space 40 KÂ10 KÂ5 K) to 32.8 percent (in iteration space 200 KÂ2 KÂ5 K). In iteration space 40 K Â 20 K Â 5 K, both strategies propose the 10 Â 10 topology, thus leading to the same execution and communication times. Note that we present the maximum computation and maximum communication times, reduced over all processes and normalized to the maximum computation time þ maximum communication time under the scheduling-aware tiling transformation. The sum of these partial times is not necessarily equal to the total execution time, as we depict the worst case scenario for both the communication and the computation times (this holds for the 3D advection equation and the artificial kernel in the next sections). However, despite the relatively small differences in the computation times, which can be attributed to data locality effects, this profiling confirms that the relative advantage of the communication-aware tiling transformation can be directly attributed to the respective reduction of the communication times. The communication-aware topology takes into consideration the bounds of the iteration space and adjusts the placement of processes per dimension in order to reduce the communication volume. This, as shown from the experimental results, has a significant positive impact on the overall performance of the algorithm.
3D Advection Equation
In our second set of experiments, we applied all feasible process topologies in various iteration spaces for the 3D advection equation. The iteration space in this case is 4D ðX Â Y Â Z Â T Þ; thus, we map the plane X Â Y Â Z on a 3D process grid and assign tiles across T to the same process. The scheduling-aware process topology in this case is 5 Â 5 Â 4, 5 Â 4 Â 5, or 4 Â 5 Â 5. We present the best result from the above three. In each iteration space and process topology, we experimented with various tile sizes and present the best attained results. Fig. 7 provides information on the scalability of the 3D algorithm. In this case, the performance and scalability are rather poor due to the more intense communication needs of the 3D process grid and the consequent bottlenecks created at the shared network interfaces. Communication dominates in this application and platform, which makes our approach to alleviate the communication overheads even more relevant. Fig. 6 presents comparison results for four iteration spaces. In this case, the differences in the total parallel execution time between different topologies are even greater. The communication overhead imposed by an unfortunate selection of process topology can kill performance. Thus, it is clear that one needs a criterion to effectively select between the 36 feasible topologies for this problem. Again here, the communication-aware strategy succeeds well in this selection. Table 2 performs a comparison between the schedulingaware and the communication-aware process topology for 12 iteration spaces. For the first two iteration spaces, both strategies lead to the proposal of the same topology. The third iteration space ð800 Â 200 Â 400 Â 1 KÞ is the only one in which the scheduling-aware topology outperforms the communication-aware one by a factor of 1.7 percent. For the rest of the iteration spaces, the communication-aware topology outperforms the scheduling-aware one by a factor that ranges between 3.3 percent to 213.8 percent. It is clear that for 4D iteration spaces mapped on 3D process grids, the selection of a communication-aware process topology is even more crucial, since the communication in this case occurs in three dimensions, and thus, the relevant overhead severely affects performance.
Artificial Kernel
In this last set of experiments, we implemented an artificial kernel expressed by a 3D nested loop following the model in Section 2.1 in order to compare the two topology selection strategies when apart from the iteration space, the dependencies of the problems vary. Note that in the 2D and 3D advection problems, the dependencies in the communication dimensions were always the same. However, since the communication-aware topology takes this factor into consideration as well, we varied d Table 3 presents results (total parallel execution times and communication times) for three iteration spaces and various combinations of d
. From this table, we observe four important issues:
1. The communication-aware topology adjusts to the iteration space shape and to the dependencies of the problem. 2. In 6 out of 23 cases, both strategies selected the same (10 Â 10) topology. 3. In three cases, the scheduling-aware topology outperformed the communication-aware topology since the latter caused a slowdown ranging from 2.7 percent to 23.5 percent. 4. In 14 cases, the communication-aware topology led to lower parallel execution times, providing an improvement that ranged from 0.4 percent to 96.9 percent.
Overall Conclusions on the Experiments
The experimental results presented in the previous paragraphs lead us to the following conclusions:
-When the total number of tiles assigned to each process is "small," then the minimization of the processor idle times with the scheduling-aware process topology is important (see Fig. 4) ; thus, the scheduling criterion should be given higher priority. As a rule of thumb, one can prioritize schedulingaware tiling transformations when T iles % C. However, we claim that for the majority of real-life problems, it will hold that T iles ) C. -In clusters with commodity interconnection networks, such as the one used in our experiments, it is crucial to reduce the communication volume as much as possible. If T iles ) C, then the communicationaware process topologies were able to drastically reduce the communication times with an important positive impact on the total parallel execution times compared to the scheduling-aware process topologies. The reduction in total parallel execution time reached up to 213 percent. (See Tables 1, 2 , and 3). The greater differences were observed in 3D advection, which used a 3D process grid. In this case, the communication overhead increases, both in terms of communication volume and in terms of the number of messages (see Fig. 6 ). -The proposed communication-aware tiling is particularly efficient when the algorithm exhibits asymmetric data dependencies and/or iteration space dimensions. -The proposed communication-aware tiling exhibits very good performance even when compared to the best possible total parallel execution time achieved by any topology (see Figs. 5 and 6), since in several cases it succeeds the minimum time. However, there exist cases where alternative topologies minimize the execution time, which is reasonable since, as discussed in Sections 2 and 3, the minimization of the total parallel execution time is a problem involving numerous parameters such as CPU power, memory organization, communication bandwidth, and latency. The communication-aware strategy takes into consideration only the communication volume of the problem. However, the fact that based on this sole criterion, we were able to minimize the execution time in several cases and approach close to the minimum in several others leads us to the conclusion that the minimization of the communication volume should be given a very high priority.
CONCLUSIONS
This paper presented a novel approach for the selection of an efficient and feasible tile shape for the parallelization of stencil algorithms. We formulate a simple and applicable method for the specification of an appropriate tile shape, which minimizes the communication volume of a nonboundary process, assuming a fixed total number of processes. Compared to alternative tile shapes that aim at minimizing the processor idle times, thus maximizing parallelism, the communication-aware tile shapes proposed here exhibit significantly lower parallel execution times for real-life problems. This improvement in performance is due to the drastic reduction in the communication volume imposed by the proposed tile shapes, which take into consideration the bounds of the iteration space and the problem dependencies in order to reduce communication data. The presented technique can be easily combined with the MPI_Cart_create primitive to deliver efficient Cartesian process topologies. . For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
