Abstract. This paper presents a solution to the open problem of finding the optimal tile size to minimise the execution time of a parallelogram-shaped iteration space on a distributed memory machine when the rise of the tiled iteration space is larger than zero. Based on a new communication cost model, which accounts for computation and communication overlap for tiled programs, the problem is formulated as a discrete non-linear optimisation problem and the closed-form optimal tile size is derived. Our experimental results show that the execution times when optimal tile sizes are used are close to the experimentally best. The proposed technique can be used for hand tuning parallel codes and in optimising compilers.
Introduction
Loop tiling is a useful compiler optimisation for achieving coarse-grain parallelism on distributed memory machines. Tiling aggregates small loop iterations into tiles [13, 21]. By executing tiles as atomic units of computation, communication takes place per tile instead of per iteration. While small tiles expose more parallelism, large tiles communicate less frequently. By adjusting the tile size, we can trade off parallelism against communication. Thus, finding the optimal tile size to minimise the execution time of a program is imperative.
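The aggregation of iterations into tiles can be illustrated with a minimal Python sketch. It assumes a rectangular iteration space for simplicity (the paper itself targets parallelogram-shaped spaces), and the names `tiled_iterations`, `count_tiles`, `tile_h` and `tile_w` are hypothetical, introduced only for this illustration.

```python
def tiled_iterations(n_i, n_j, tile_h, tile_w):
    """Yield (tile_id, i, j) for an n_i x n_j rectangular iteration space,
    visiting the iterations one tile at a time.

    Assumption: a rectangular space; the paper's space is a parallelogram.
    """
    for ti in range(0, n_i, tile_h):        # first row of each tile
        for tj in range(0, n_j, tile_w):    # first column of each tile
            tile_id = (ti // tile_h, tj // tile_w)
            # All iterations of one tile run as an atomic unit; min() clips
            # partial tiles at the boundary of the iteration space.
            for i in range(ti, min(ti + tile_h, n_i)):
                for j in range(tj, min(tj + tile_w, n_j)):
                    yield tile_id, i, j


def count_tiles(n_i, n_j, tile_h, tile_w):
    """Number of distinct tiles, i.e. the number of per-tile communication steps."""
    return len({tile for tile, _, _ in tiled_iterations(n_i, n_j, tile_h, tile_w)})
```

For an 8 x 8 space, `count_tiles(8, 8, 1, 1)` gives 64 tiles while `count_tiles(8, 8, 2, 4)` gives 8, which shows how larger tiles reduce the number of communication steps at the price of coarser units of parallel work.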
This paper is concerned with finding the optimal tile size to minimise the execution time of a double loop with a parallelogram-shaped iteration space on a distributed memory machine. We introduce our program model in Section 1.1 and give a statement of the problem in Section 1.2. Figure 1 illustrates our model, which is the most general one used in the literature when the execution time is the objective function to be minimised [1, 2, 7, 11, 14]. The execution times when optimal tile sizes are used are close to the best observed in our experiments. This also indicates that our communication cost model is accurate enough for tiling purposes. Unlike Cases (a) and (b), where distributing the tile stacks blockwise is often optimal [2, 3, 14], the optimal tiling in Case (c) often requires the tile stacks to be distributed cyclically. The tiling technique proposed in this paper is useful for hand tuning parallel codes and for restructuring codes in an optimising compiler.
Program Model
The rest of the paper is organised as follows. Section 2 introduces our computation and communication cost models. Section 3 derives the execution time formula and formulates the problem of finding the optimal tile size as a discrete non-linear optimisation problem. Section 4 finds the closed-form optimal tile size. Section 5 presents our experimental results for running a 2-D PDE program on a Fujitsu AP1000 distributed memory machine. Section 6 discusses related work. Section 7 concludes the paper.
Computation and communication cost models
In our computation model, the number of iterations in a tile is approximated by . Let be the execution time for a single iteration. The time for executing a tile is given by:
In quantifying the communication overhead, we propose to use a simple yet effective cost model designed specifically for tiled programs with the computation and communication patterns as specified in the SPMD code (Figure 2). As Figure 4 shows, different communication parameters are used to measure the communication overhead for the horizontal and vertical traversals.
Consider the execution of a stack of tiles in a processor, as illustrated in Figure 4. In some distributed memory machines, dedicated hardware is available for sending and receiving messages while the processor is processing a tile. For example, either the receive call or the send call can be overlapped with the computation of the tile. In fact, all three can be carried out simultaneously if enough hardware resources are available. To account for the potential computation and communication overlap that arises this way, the communication cost charged for the execution of a single tile is approximated by a first-degree polynomial function:
where the constant term is the startup cost and the coefficient of the message size is the cost paid for one byte of message. The values of these parameters on the Fujitsu AP1000 are shown in Figure 5. Our cost model does not directly incorporate other costs such as buffer management and system-specific overheads. Therefore, the values of the four communication parameters are simple to establish for a machine. Since tiled programs are highly regular, these additional costs are typically proportional to the number and size of messages communicated [8]. Our experimental results confirm that our model can be used effectively to predict optimal tile sizes.
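The linear cost model described above can be sketched in Python as follows. The function names and the numeric values used in the usage note are hypothetical and are not measurements from the AP1000; the same function would be called once with the parameter pair for horizontal traversals and once with the pair for vertical traversals.

```python
def tile_comm_cost(startup, per_byte, message_bytes):
    """Communication cost charged to one tile: a first-degree polynomial in
    the message size (startup cost plus per-byte cost times message size)."""
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    return startup + per_byte * message_bytes


def total_comm_cost(startup, per_byte, total_bytes, n_messages):
    """Cost of communicating total_bytes split evenly over n_messages tiles.

    Assumption: every tile sends one message of equal size.
    """
    if n_messages <= 0:
        raise ValueError("need at least one message")
    return n_messages * tile_comm_cost(startup, per_byte, total_bytes / n_messages)
```

With illustrative values `startup=100.0` and `per_byte=0.5`, sending 1000 bytes as ten messages costs `total_comm_cost(100.0, 0.5, 1000, 10) == 1500.0`, whereas a single message costs `600.0`. Under this model the per-byte term is unchanged and only the startup cost is paid more often, which is why larger tiles reduce communication overhead.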
The message size § for a tile, which is proportional to the tile height , is approximated by: § where can be approximated based on the data dependences of the program [2, 14, 22] . Thus, the two communication cost formulas can be refined to:
Non-linear optimisation problem
For a fixed ! , we use to denote the set of all possible tile sizes or tilings:
In Section 4, we discuss how to use fewer than ! processors, whenever possible, to achieve the same execution time. In the extreme case when one single processor is the best choice, then is allowed to take effectively the values in the interval
A tiling is pass-idle if a processor waits idle between the execution of two consecutive stacks of tiles, and pass-free otherwise. By definition, a tiling that yields a single-pass program is always pass-free.
Figure 6 illustrates the derivation of the execution time formulas for these two cases. The critical path in each case represents the earliest completion time for the last processor. The execution time corresponding to the critical path shown in Figure 6(c) is:
The¨(i.e., the number of passes),
The optimal tile slope (i.e., the one that minimises the execution time) is chosen to be the smallest of the slopes of all distance vectors. For a fixed (11) Sometimes, the minimum point of (8) is located on the boundary of its solution space. This is addressed in Section 4.3. In Section 4.4, we combine all these results to provide a solution to (8).
The proofs of all theorems except Theorem 4 are given in the appendix.
Solving (10)
Theorem 1 (10) has a unique minimum point, denoted
is derived next. Our experimental results show that our approximation often differs from the real minimum point only in the fractional part.
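Because the problem is discrete while a closed-form minimum point is generally real-valued, one common way to obtain an integer tile size is to evaluate the cost at the integer neighbours of the real-valued point. The Python sketch below illustrates this idea only; it is not presented as the paper's procedure. The function names are hypothetical, and `toy_cost` is a stand-in convex function, not the paper's execution time formula.

```python
import math
from itertools import product


def best_integer_tile(cost, w_real, h_real, w_max, h_max):
    """Return the cheapest integer (w, h) among the floor/ceil neighbours of
    a real-valued minimum point, clipped to [1, w_max] x [1, h_max]."""
    def neighbours(x, upper):
        cands = {math.floor(x), math.ceil(x)}
        return sorted({min(max(c, 1), upper) for c in cands})

    candidates = product(neighbours(w_real, w_max), neighbours(h_real, h_max))
    # Ties on cost are broken by the smaller (w, h) pair for determinism.
    return min(candidates, key=lambda wh: (cost(*wh), wh))


def exhaustive_best(cost, w_max, h_max):
    """Brute-force reference: try every integer tile size in the box."""
    grid = product(range(1, w_max + 1), range(1, h_max + 1))
    return min(grid, key=lambda wh: (cost(*wh), wh))


def toy_cost(w, h):
    """Hypothetical convex stand-in with its real minimum at (6.4, 3.7)."""
    return (w - 6.4) ** 2 + (h - 3.7) ** 2
```

For this toy function, `best_integer_tile(toy_cost, 6.4, 3.7, 20, 20)` and `exhaustive_best(toy_cost, 20, 20)` both return `(6, 4)`. For a real execution time model the neighbour check is only a heuristic, so comparing against the exhaustive search on small instances is a reasonable sanity check.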
The minimum point of (10) must satisfy the following two equations:
Solving (13) for yields its unique positive root:
Substituting this into (12) and simplifying, we get:
By dropping the last complex but less dominant term in (15) since¨ © , we get:
The optimal tile width is under-approximated as the root of the above equation:
The corresponding optimal tile height becomes:
Solving (11)
We rely on the following assumption to prove that (11) also has a unique minimum point.
which should always be true since
is small for practical applications. For the PDE example in Section 5 running on the AP1000, this inequality becomes
The minimum point of (11) must satisfy the following two equations:
Solving (20) for yields its unique positive solution:
Substituting this into (19) and simplifying, we get:
on the interval 
Theorem 3 Suppose that both (28) and (29) are true. Let
(31) intersects .
The minimisation of (8) 
Experiments
In this section, we present some experimental results of running a PDE program on a Fujitsu AP1000 with 128 processors. The AP1000 used consists of 128 cells, each containing a SPARC processor running at 25 MHz with 16 MB of RAM. All processors are connected by a 2-D torus network called the T-net. The T-net provides interprocessor communication using wormhole routing with a bandwidth of 25 MB/s. All I/O is performed on the host. the execution times for those tile sizes are close.
Related work
In their seminal work, Irigoin and Triolet [13] and Wolfe [21] introduce loop tiling as a compiler optimisation to improve data locality and parallelism in a program. There has also been an interesting thread of research on finding the best tile shape for minimising communication overheads once the tile size is given [4, 6, 15, 17, 22]. These research efforts are compared in [22] and are not repeated here. Some recent efforts on locality optimisations can be found in [16, 18].
Given a parallelogram-shaped iteration space to be tiled and executed in the SPMD mode on a distributed memory machine, the problem of finding the optimal tile size to minimise the execution time of the program has been solved when 
