Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association of Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
ICS '94-7/94 Manchester, U.K. © 1994 ACM 0-89791-665-4/94/0007..$3.50

its iterations and, therefore, they can be executed independently in parallel.) Many loop nests for solving differential equations using the finite difference method are DOACROSS loop nests.

    for i1 = 0 to 127
      for i2 = 0 to 127
        for i3 = 0 to 127
          for i4 = 0 to 127
            a(i1+1, i2+1, i3+1, i4+1) = 0.25 * (
                a(i1+1, i2+1, i3,   i4+1) + a(i1+1, i2,   i3+1, i4+1) +
                a(i1+1, i2+1, i3,   i4  ) + a(i1+1, i2,   i3+1, i4  ) +
                a(i1,   i2+1, i3,   i4  ) + a(i1+1, i2,   i3,   i4  ) +
                a(i1,   i2+1, i3+1, i4+1) + a(i1+1, i2+1, i3+1, i4+1) );

The tile with index $\vec{t} \in T$, denoted by $T_{\vec{t}}$, is the set of loop iterations defined by:
I
In this paper, the term "tile $\vec{t}$" is used as a short form for "the tile with index $\vec{t}$". Given tile size vector $k$, the content of a tile is solely determined by its index. Therefore, the tile and its index $\vec{t}$ are often used interchangeably in this paper.
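The relation between a tile index and the iterations it contains can be sketched as follows. This is a minimal illustration that assumes rectangular tiles aligned at the origin of the iteration space; the function names `tile_iterations` and `tile_of` and the `bounds` parameter are hypothetical and do not come from the paper.

```python
from itertools import product

def tile_iterations(t, k, bounds):
    """List the loop iterations in the tile with index t.

    Assumption: tile t covers k[j]*t[j] <= i[j] < k[j]*(t[j]+1) in every
    dimension j, clipped to the exclusive loop bound bounds[j].
    """
    ranges = [range(tj * kj, min((tj + 1) * kj, nj))
              for tj, kj, nj in zip(t, k, bounds)]
    return list(product(*ranges))

def tile_of(i, k):
    """Return the index of the tile that contains iteration i."""
    return tuple(ij // kj for ij, kj in zip(i, k))
```

Under this assumption the tile size vector alone fixes the content of every tile, which is why a tile and its index can be used interchangeably.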
can be tiled with tile size vector $k = (2, 2)$ and transformed to the following code:

where the subvector with subscript $1,m$ is composed of the first $m$ elements, and the operator mod applies to the corresponding elements of its two vector operands. In this paper, we use $\vec{x}_{p,q}$ to represent the subvector of $\vec{x}$ from the $p$-th element to the $q$-th.
To simplify the presentation, we assume that the number of tiles is a multiple of the number of processors along every dimension, i.e., $P_j$ divides $S_j$ for $1 \le j \le m$.
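The effect of the divisibility assumption can be illustrated with a small sketch. It assumes one plausible reading of the mapping, namely that a tile is assigned to the processor whose coordinates are the first m elements of the tile index taken modulo the processor array size, element by element. The names `processor_of` and `tiles_per_processor` are hypothetical.

```python
from itertools import product

def processor_of(t, P):
    """Assumed mapping: first m elements of tile index t, modulo P elementwise."""
    m = len(P)
    return tuple(t[j] % P[j] for j in range(m))

def tiles_per_processor(S, P):
    """Count how many tiles of the tile index set of size S land on each processor."""
    counts = {}
    for t in product(*(range(s) for s in S)):
        p = processor_of(t, P)
        counts[p] = counts.get(p, 0) + 1
    return counts
```

When every $P_j$ divides $S_j$, each processor receives the same number of tiles under this mapping, so the load is balanced by construction.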
The chain-based parallel code for processor $\vec{p} \in P$ is shown in

In direct message passing, a processor needs to send $2^m - 1$ messages to its neighboring processors after finishing a tile.
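The message count can be checked with a short enumeration. The sketch assumes that the neighbors of a processor in an m-dimensional array are reached by the non-zero offset vectors whose components are 0 or 1; the names `message_offsets` and `split_by_ones` are hypothetical.

```python
from itertools import product

def message_offsets(m):
    """All non-zero 0/1 offset vectors in m dimensions (assumed neighbor offsets)."""
    return [d for d in product((0, 1), repeat=m) if any(d)]

def split_by_ones(m):
    """Separate offsets with exactly one 1 from those with at least two 1s."""
    offsets = message_offsets(m)
    single = [d for d in offsets if sum(d) == 1]
    multiple = [d for d in offsets if sum(d) >= 2]
    return single, multiple
```

For m = 3 this gives 7 offsets, of which 3 have a single 1 and 4 have at least two 1s. Only the latter group is a candidate for being forwarded through an intermediate processor rather than sent directly.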
In general, a message $M$ whose subscript vector has at least two components equal to 1 will be routed and

Given the size vector $S$ of the tile index set and the size vector $P$ of the processor array, we want to calculate the speedup of the parallel code in Figure 4 on the PRAM.
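To make the quantity concrete, the following sketch estimates such a speedup by direct simulation. It is not the algorithm derived in the paper. It assumes unit execution time per tile, zero communication cost, a dependence of each tile on its predecessor along every dimension, the modulo mapping of the first m tile coordinates to processors, and lexicographic execution order on each processor. The name `pram_speedup` is hypothetical.

```python
from itertools import product

def pram_speedup(S, P):
    """Simulated speedup for tile index set size S on processor array size P."""
    m = len(P)
    finish = {}      # finish time of each tile
    proc_free = {}   # time at which each processor becomes idle
    # product() enumerates tiles in lexicographic order, so every
    # predecessor of a tile is processed before the tile itself.
    for t in product(*(range(s) for s in S)):
        p = tuple(t[j] % P[j] for j in range(m))
        start = proc_free.get(p, 0)
        for j in range(len(S)):
            if t[j] > 0:
                pred = t[:j] + (t[j] - 1,) + t[j + 1:]
                start = max(start, finish[pred])
        finish[t] = start + 1
        proc_free[p] = finish[t]
    makespan = max(finish.values())
    return len(finish) / makespan
```

For example, `pram_speedup((2, 2), (2,))` gives 4/3: four tiles complete in three time steps because the second processor has to wait for the first tile of the first processor.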
It turns out that the speedup on the PRAM as a function of $S$ and $P$ cannot be expressed by a closed-form formula in general. Rather, we need to follow an algorithm to calculate it. The derivation of the algorithm can be found in

We parallelized the DOACROSS loop nest in Figure 1

1. Speedup:
In Figures 6(a) and 7(a), we notice that the speedup on the ideal PRAM is affected by the tile size and shape.
As tile size increases, the difference between the speedups of the AP1000 and the PRAM decreases, and the speedup of the AP1000 is restricted mainly by the limited parallelism available. As tile size decreases, the speedup gap between the AP1000 and the PRAM widens due to the increased communication overhead.
In each group of tests, the best speedup for the AP1000 is marked by an arrow. The number of iterations in the tile size that gives the best performance is around $2^{11}$ to $2^{13}$.
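The iteration counts follow directly from the tile shapes. A small check, using two tile sizes that appear in this section (the helper name `iterations_in_tile` is hypothetical):

```python
from math import prod

def iterations_in_tile(shape):
    """Number of loop iterations in a rectangular tile of the given shape."""
    return prod(shape)

# Tile sizes quoted in this section.
best_2d = iterations_in_tile((8, 16, 4, 4))    # 2048 = 2**11
large_2d = iterations_in_tile((4, 8, 16, 16))  # 8192 = 2**13
```

The 8x16x4x4 tile sits at the lower end of the quoted range and the 4x8x16x16 tile at the upper end.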
The highest speedup observed on the AP1000 is 121.9, with tile size 8x16x4x4 and indirect message passing on the 2D (16x8) processor array.

The communication overhead of direct message passing is larger than that of indirect message passing, and the gap between them increases as the tile size decreases. For some large tile sizes, the communication overhead of indirect message passing is larger than that of direct message passing (see tile size 4x8x16x16 on the 2D processor array and tile size 8x16x16x32 on the 3D processor array). This is probably because the large tile size causes more data to be forwarded in indirect message passing. Overall, indirect message passing does help reduce the communication overhead in most configurations, but the reduction is limited.
There are a couple of anomalies in the speedups. We notice that, for tile sizes 8x16x32x32 and 8x16x16x16 (indirect message passing), the speedups of the AP1000 on the 2D processor array are larger than those of the PRAM by a small margin. That is why we observe the negative (though small) communication overhead in these configurations. This is probably because we do not have the exact sequential execution time $T_s$ for the program.
Another cause may be the cache effect, because a single processor running the sequential program on the large data sets tends to have more cache misses on the AP1000.
We conclude this section with the following remarks:
By combining and tuning loop tiling, chain-based scheduling and indirect message passing, compilers can generate efficient parallel code for DOACROSS loop nests on multicomputers, with high speedup and low data communication overhead.
The parallelism of the chain-based parallel programs with tiling depends on the sizes and the shapes of the tile index set and the processor array. The algorithm to calculate the speedup on the PRAM can be used to search for the best configuration.
The right tile size for low communication overhead and high speedup depends on the machine used. On the AP1000, the right tile size for DOACROSS loop nests like the one in Figure 1
