Abstract
Introduction
One of the most difficult areas in the field of parallel computing is loop parallelization and efficient mapping onto different parallel architectures. In order to achieve maximum acceleration of the final program, one of the key issues that should be taken into account is minimization of the communication overhead, which considerably decelerates the system. As far as fine grain parallelism is concerned, in order to reduce the communication overhead, several methods have been proposed to group together neighboring chains iterations [10, 7] , while preserving the optimal hyperplane schedule [13, 11, 3] . As far as coarse grain parallelism is concerned, researchers are dealing with the problem of alleviating the communication overhead by applying the supernode or tiling transformation. Supernode partitioning of the iteration space was proposed by Irigoin and Triolet in [6] . They introduced the initial model of loop tiling and gave conditions for a tiling transformation to be valid. Later, Ramanujam and Sadayappan in [9] showed the equivalence between the problem of finding a set of extreme vectors for a given set of dependence vectors and the problem of finding a tiling transformation H that produce valid, deadlock-free tiles. The problem of determining the optimal shape was surveyed, and more accurate conditions were also given by others as Xue in [14] and Boulet et al. in [2] . Nevertheless, all above approaches ignore the actual iteration space boundaries. Although tile shape is of great importance to communication reduction, the objective should be the overall tiled space completion time. Hodzic and Shang [5] proposed a method to correlate optimal tile size and shape, based on overall completion time reduction. Their approach considers a straightforward time schedule, where each processor executes all tiles along a specific dimension, by interleaving computation and communication phases. All processors first receive data, then compute and finally send result data to neighbors in explicitly distinct phases, according to the hyperplane scheduling vector.
In [4] we proposed an alternative method for the problem of scheduling the tiles to single CPU nodes. Each atomic tile execution involves a communication and a computation phase and this is repeatedly done for all time planes. We are compacting this sequence of communication and computation phases, by overlapping them for the different processors. The proposed method acts like enhancing the performance of a processor's datapath with pipelining [8] , because a processor computes its tile at time step and concurrently receives data from all neighbors to use them at · ½ time step and sends data produced at ½ time step.
Since data communication involves some startup latencies, we adjust the computation grain to make room for this overhead and try to overlap with all communication, which can be done in parallel. Previous work in the field of UET-UCT scheduling of grid graphs in [1] , has shown that this schedule is optimal when the computation to communication ratio is one.
In this paper we extend the method proposed in [4] for executing tiled iteration spaces in SMP nodes. We group together neighboring tiles along a hyperplane. Hyperplanegrouped tiles are concurrently executed by the CPUs of the same SMP node. In this way, we eliminate the need for tile synchronization and communication between intranode CPUs. As far as scheduling of groups is concerned, we take advantage of the overlapping schedule of [4] in order to "hide" each group communication volume within the respective computation volume. Under the above implementation scheme, the iteration space involves the overlapped execution of communication and computation phases between successive groups of tiles. We thus avoid most of the communication overhead by allowing for actual computation to communication overlapping.
We compare our method, using blocking schedules and vertical grouping of neighboring tiles along a specific dimension. Vertically grouped tiles are assigned to the same node, and an optimal hyperplane time schedule is applied. However, this imposes additional intranode synchronization delays. All experimental results show that when the hyperplane grouping of tiles together with the overlapping schedule are applied, the overall completion time is considerably reduced, under the condition of controlling the computation to communication grain.
The rest of this paper is organized as follows: Basic terminology used throughout the paper and definitions of loop tiling are introduced in Section 2. In Section 3 we supply an algorithm for the application of the overlapping scheme (proposed in [4] ) on clusters of SMP nodes and we investigate the resulting time schedule. In Section 4 we describe the experiments executed on a cluster of SMPs using PCI-SCI Network Interface cards in order to verify our theory. Finally, in Section 5 we summarize our results and propose future work.
Models -Loop Tiling
In this paper we consider algorithms with perfectly nested FOR-loops and uniform data dependencies, as in [4] .
Throughout the paper the following notation is used: AE is the set of natural numbers and Ò is the number of nested FOR-loops of the algorithm. Â Ò Ò is the set of indices: 
Given an algorithm with dependence matrix , for a tiling to be legal, it must hold À ¼ (see [6, 9] ). In this paper, as in [4] , we assume that all dependence vectors are smaller than the tile size, thus they are entirely contained in each tile's area, which means that À ½ [15] , or, alternatively, that the tile dependence matrix Ë contains only 0's and 1's.
Application of the overlapping schedule to SMP nodes
In the rest of this paper, we shall consider that the non-overlapping and overlapping schedules, extensively described in [4] (sections 3,4), are known. In the sequel, we shall generalize the proposed overlapping schedule and apply it on a cluster of Symmetric Multiprocessors (SMP nodes).
Let us consider the following scenario: A ¾-dimensional nested loop to be executed onto a cluster of ¿ single CPU nodes. We tile the Iteration Space of the algorithm and assign each row of tiles to a CPU node. Apparently, we should select the size and shape of tiles so that the Iteration Space is partitioned into ¿ rows of tiles (since ¿ CPU's are available). Then, the tiles can be computed using either the overlapping, or the non-overlapping schedule presented in [4] . with ¾ CPU's each, then we can split each tile into two subtiles and assign each subtile to one of the CPU's of the corresponding SMP node, as indicated in Fig. 1 . Equivalently, we may tile the initial Iteration Space, selecting the size of tiles so as to get six rows of tiles. Then, we assign a row of tiles to each CPU of the SMP nodes and group together neighboring tiles assigned to the same SMP node, as in Fig. 1 . It is obvious that the tiles grouped together by this scheme cannot be simultaneously executed, unless they are split into subtiles. Thus, additional synchronization overhead is imposed due to subtile dependencies. A more efficient scheme can be obtained if we group the tiles assigned to the same SMP nodes as indicated in Fig. 2 . Then both tiles belonging to the same group can be simultaneously executed by the CPU's of an SMP node. We shall call this grouping scheme as "hyperplane grouping". On the contrary, any other grouping scheme along a specific dimension, such as the one presented in Fig. 1 , will be called "vertical grouping".
Grouping Transformation
In order to generate an appropriate time schedule, we need to group together the tiles of Â Ë that will be concurrently executed by the CPU's of the same SMP node. We further apply an additional supernode transformation to the In order a grouping transformation to be valid, it should preserve the constraint of atomicity of groups (À Ë ¼ in correspondence to À ¼ for tiling). In addition, since within a group all tiles are concurrently executed by the CPU's of an SMP node, in order to preserve data consistency, there should be no direct or indirect dependence among them.
Determining È according to the number of CPU's within a node
Consider now the general case, where we have an Ò- 
The maximum number of tiles contained inside a group is Ø´È µ Ñ, exactly equal to the number of CPU's inside each SMP node.
In order to prove that À defines a legal grouping transformation, it suffices to prove that À Ë ¼, where Ë is the dependence matrix of the Tile Space Â Ë and that any two tiles´ Ë Ë ¼ ¾ µ within the same group are independent. We have assumed (see Ü2) that the dependence matrix Ë contains only ¼'s and ½'s. Consequently, the first condition is apparently valid. In order to prove the second condition, we assume that the dependence matrix Ë is equal to the unitary matrix. Even if there is a dependence vector with more than one ½'s, it is the sum of more than one unitary dependence vectors. So it will be included in the following proof as an indirect dependence:
If tiles Ë Ë ¼ ¾ Â Ë belong to the same group then it holds that: ½ to the same CPU, as indicated in Fig. 3 by the grey arrows. The CPU's of the same SMP node will undertake two neighboring rows of tiles. Then, during the time step t=0, the CPU-¼ of the SMP node¼ computes tile´¼ ¼µ. During the time step Ø ½ , the CPU-¼ of node¼ computes tile´½ ¼µ, while the CPU-½ of the same SMP node computes tile´¼ ½µ. Similarly, during the time step Ø ¾ , the CPU-¼ computes tile´¾ ¼µ, while the CPU-½ computes tile´½ ½µ. At the same time, the data computed in tile´¼ ½µ, which are necessary for the computation of tile´¼ ¾µ, can be sent to node½. During the time step t=3, the CPU's of node0 can continue the execution as above, while the CPU's of node1 start executing the same routine with the rows of tiles´Ü ¾µ and´Ü ¿µ.
In order to construct a time schedule for this example, we group together the tiles that should be concurrently executed by the same SMP node. In particular, we perform grouping to the Tile Space Â Ë , as indicated in Fig. 3 ´¾ ¼µ Ì . In Fig. 3 , the time step, when each group will be computed, is shown, together with the time step, where each data transfer will take place.
In Fig. 4 , the corresponding Group Space is also shown. It can be easily deduced that a group ´ ½ ¾ µ ¾ Â will be executed during the time step Ø´ µ ½ · ¾ 
Thus, the dependence matrix of the Group Space can be written as:
We are searching for an appropriate linear time scheduling vector ¥ ´ ½ Ò µ such that each group ¾ Â is computed during the time step Ø ¥ . Consider the last´Ò ½µ coordinates of a group indicating which SMP node of the cluster will execute this group. Then, the groups ´ ½ Ò µ and ¼ ´ ½ · ½ ¾ Ò µ will be successively computed within the same SMP node. There is a dependence between them, as indicated by the first column of , but there is no need for a communication step between their successive computation steps, because the necessary data are already located in the local shared mem- Notice that, in [4, 12] , for the single CPU pipelined schedule, ¥ was´½ ¾ ¾µ according to the UET-UCT theory. In other words, the optimal overlapping schedule could be achieved when we had equal computation to communication times, so that all communication could be hidden (overlapped) with the computation phase. Nevertheless, in the SMP case, presented here, the labeling of coordinates of groups, that is the grouping transformation È slightly skews the space (see Fig. 3 and the resulting Group Space in Fig. 4 , the relative positions of groups´¿ ¼µ and ¿ ½µ). So the optimal overlapping schedule is achieved bý ½ ½ ½µ. So, apparently, only tiles with the same coordinate Ë ½ will be assigned to the same CPU of the same node.
Generalization: Grouping along an arbitrary dimension of Â Ë
If we want to assign the iterations along the -th dimension of Â Ë to the same CPU of an SMP node, then it can be similarly proven that the appropriate grouping matrices are
where Ñ ½ ¢ ¢ Ñ ½ ¢ Ñ ·½ ¢ ¢ Ñ Ò Ñ. As previously, the time scheduling vector is ¥ ´½ ½µ. 
In Fig. 5 we show the grouping of tiles and when each computation step and each communication step will be executed. It can be easily deduced that a group´ ½ ¾ ¿ µ ¾ Â will be executed in node´ ½ ¾ µ during the time step Ø´ µ ½ · ¾ · ¿ . Therefore, the linear time scheduling vector for this example is ¥ 1 ½ ½µ. 
Comparison
In this section we shall compare vertical grouping, which is indicated in Fig. 1 , with the proposed scheme of hyperplane grouping, which is shown in Fig. 2, 3 
. Splitting tiles in vertical scheme
As we have already mentioned, vertical grouping cannot exploit the computational power of both CPU's of our SMP's unless we split each tile into smaller subtiles and compute some of them in parallel, as shown in Fig. 6 . Let us assume that a CPU needs time « for the computation of a tile with dimensions Ü, Ý (Fig. 6a) . Consequently, it will need time « AE for the computation of a respective subtile with dimensions Ü AE , Ý (Fig. 6c) . The subtiles which are created can be computed by ¾ CPU's in AE · ½ computational steps, interleaved with AE synchronization steps, following an optimal linear time schedule´½ ½µ as in Fig. 6c . If the average time consumed for the synchronization of ¾ CPU's of an SMP node is Ø ×ÝÒ Ò , then the total time required for the computation of a pair of initial tiles is:
The time required for the computation of a pair of tiles is minimized when In [12] we applied the pipelined schedule proposed in [4] , using a cluster of single CPU nodes with PCI-SCI NICs. In this paper, in order to evaluate the proposed methods, we ran our experiments on a Linux SMP cluster with 8 identical nodes. Each node had 128M of RAM and 2 Pentium III 800 MHz CPUs. The cluster nodes were interconnected with an SCI ring, using SCI Dolphin's PCI-SCI D330 cards. SCI NICs support shared memory programming, either through PIO (Programmed-IO) messaging, or through DMA. We are using their kernel-level DMA support for messaging. Invoking kernel system calls, causes extra CPU cycles overhead. However, we can avoid extra copying from user space to kernel space (physical memory) when using DMA. We allocate user level pages, which correspond to physically contiguous pre-reserved memory regions, for DMA communications.
Our test application was the following code:
for(i=1; i<=X; i++) for(j=1; j<=Y; j++) for(k=1; k<=Z; k++)
where is an array of ¢ ¢ floats and . Without lack of generality, we select as a tile a rectangle with , and sides. The dimension is the largest one, so all tiles along the -axis are mapped onto the same processor, as proposed in [4] . Each tile has , dimensions equal to Ü and the tile's "height" along -axis equal to Þ.
There are Ü tiles along of the dimensions and and Þ tiles along of the dimension . Tile's volume is equal to Ü ¾ Þ, and since the number of available processors is initially known, the only unknown parameter is Þ.
We applied both vertical and hyperplane grouping, using both blocking and non-blocking communication primitives. For each exemplary Iteration Space and each possible tile height, we calculated the total execution time for the above schemes. In order to implement these schemes we used Linux POSIX threads with semaphores for the synchronization among the processors of an SMP node and the SISCI driver and libraries for the communication among the SMP nodes.
First of all, as far as the implementation of vertical grouping is concerned, we experimentally verified formula (4), in order to find the optimal execution time for a couple of tiles by an SMP node. Once vertical grouping was implemented and precisely approximated with a theoretical formula, we implemented both blocking and non-blocking communication schemes. As far as the blocking communication scheme is concerned, it was implemented using the pseudo-code of Table 1 . On the other hand, the non- Table 1 . Non-overlapping scheme Implementation
Thread 0 Thread 1 foreach group assigned foreach group assigned to node(i,j) do to node(i,j) do receive from node(i-1,j) receive from node(i,j-1) receive from node(i,j-1) compute tile(i,j,k,CPU0) compute tile(i,j,k,CPU1) send to node(i+1,j) send to node(i,j+1) send to node(i,j+1) semaphore post(sem s1) semaphore post(sem s2) semaphore wait(sem s2) semaphore wait(sem s1) blocking scheme was implemented using the pseudo-code of Table 2 , because during each time step, every SMP node in the plane with coordinates´ µ receives from neighboring nodes´ ½ µ and´ ½µ, computes and sends to nodes´ · ½ µ,´ · ½ µ (Fig. 7) . Since the send dma() call is not blocking, the computation of the tiles will be performed concurrently with the transferring of data among the SMP nodes. After the execution of wait dma(), it is assured that both computation and communication are already completed.
The implementation of vertical and hyperplane grouping was achieved by a proper compute tile(i,j,k,CPUx) procedure. In order to implement vertical grouping we used the pseudocode of Thread 0 Thread 1 Explanation foreach group assigned to node(i,j) do foreach group assigned to node(i,j) do trigger interrupt to node(i-1,j) Inform "previous" nodes: trigger interrupt to node(i,j-1) trigger interrupt to node(i,j-1) "I am ready to accept data" wait interrupt from node(i+1,j)
Wait until "next" nodes wait interrupt from node(i,j+1) wait interrupt from node(i,j+1) are ready to accept data send dma(node(i+1,j),data)
Initialization of DMA transfer send dma(node(i,j+1),data) send dma(node(i,j+1),data) to neighboring nodes compute tile(i,j,k,CPU0) compute tile(i,j,k,CPU1) wait dma() Wait for DMA to complete wait dma() wait dma() trigger interrupt to node(i+1,j) Inform "next" nodes: trigger interrupt to node(i,j+1) trigger interrupt to node(i,j+1) "Your data has arrived" wait interrupt from node(i-1,j)
Wait until "previous" nodes wait interrupt from node(i,j-1) wait interrupt from node(i,j-1) have finished sending data semaphore post(sem s1) semaphore post(sem s2) semaphore wait(sem s2) semaphore wait(sem s1) Implementation of a barrier Table 3 . The number of subtiles inside a tile was selected according to formula (4) . Notice that, the implementation of hyperplane grouping was much simpler, as it is shown in Table 3 . 
Figure 8. Experimental Results
The problem was solved using various values of and . For each schedule, we are interested in the overall minimum execution time achieved at an optimally selected tile height (see [4, 12, 5] ). The experimental results, shown in Figs 8-9 , illustrate that, in every case, non-blocking communication is preferable to blocking communication and hyperplane grouping is preferable to vertical grouping. The lowest minimum is clearly achieved when using hyperplane grouping, in combination with non-blocking communication, in all cases.
As far as hyperplane grouping, in combination with non-blocking communication, is concerned, according to our scheduling theory, as in Example 2, the number of time steps required for the completion of an experiment 
Conclusions -Future Work
In this paper we presented a novel approach for the time scheduling of tiled nested loops on a cluster of SMP nodes. We minimized the total execution time by overlapping the computation with communication (as in [4, 12] ). In addition, we achieved the maximum CPU's utilization with a proper grouping transformation. What remain open are an analytical computation of the parameters Ñ ½ Ñ Ò of our grouping matrix, according to the initial space shape and communication minimization criteria, and an adaption of our theory for a cluster with a fixed number of SMP nodes.
