Abstract-We present a new methodology for developing parallel distributed programs in a series of incremental steps. The methodology takes advantage of threads that are able to migrate through the network and thus are able to follow distributed data. This allows the data to be partitioned and distributed first, which guarantees that elements that are used together in a computation are collocated on the same node. Next, the loops in the code are tiled to minimize migration among nodes. After deciding on the location at which each loop is to execute, the necessary migration and remote access statements are inserted to make the code executable. This process is repeated based on feedback obtained from the execution, which may improve the overall performance by suggesting a different data distribution or a different coarseness of tiling. We illustrate the trade-offs and the performance using a well-known application with two different data distributions.
I. INTRODUCTION
Navigational Programming (NavP) offers a different approach to generating a distributed parallel program from a sequential one, in that the transformation occurs incrementally and produces an executable program at each stage. Under NavP, computations migrate using hop() statements inserted explicitly by the programmer. The cost of a hop() is essentially the cost of moving the data stored in its thread variables plus a small amount of state data. Although the state of the computation is moved on each hop, the code is not moved. The computations carry small amounts of data, such as intermediate results, as they migrate to large data structures that are stationary. The synchronization among different migrating computations is achieved by waiting on and posting of events.
The paper is organized as follows. Section II describes the contribution and related works. Section III summarizes the basic steps of our NavP-based methodology. In Section IV, we introduce a sample application that is used throughout the paper to illustrate the approach. The remainder of the paper presents the specific steps of the methodology and the experimental results.
II. CONTRIBUTION AND RELATED WORKS
Generating a parallel program from a sequential one for a distributed-memory environment is an important and difficult problem. A common approach is to decompose the original program into smaller computations and then to construct a distributed schedule for the computations [1] , [2] , [3] , [4] , [5] , [6] , [7] , [8] , [9] . The distributed scheduling task consists of several subtasks: assigning each computation to a processor, chronologically ordering the computations on each processor, and scheduling the data movement so that each computation has the necessary data when it executes. This is difficult under the classical message-passing approach, in which processes are stationary and any remote data required by a process is communicated through send/receive primitives. NavP [10] is still considered SPMD (Single Program Multiple Data) in that different processes execute the same code with different data, and differs from the traditional SPMD view [10] , [11] in that processes are able to access remote data by migrating to the target node. Hence the locus of any given computation is not stationary but follows the distribution of data as appropriate for best performance.
The NavP approach to generating a distributed parallel program can be summarized as follows. (1) the data is distributed; (2) the program is divided into computations (tiles) based on the data distribution, and each computation is assigned to a processor (again based on the data distribution); (3) the computations are scheduled in those tiles and combined into parallel threads that migrate through the network based on their data dependencies.
All three steps are extensions of existing techniques and tools. Data distribution (step 1) uses an affinity graph [12] produced by instrumenting the program and then partitioning the graph using Metis [13] . To break the program into subcomputations (step 2) we rely on the well-known techniques of tiling [14] , [15] , [16] , [17] , [18] . The main difference is that the execution provides feedback that guides the choice of the tile size. Scheduling (step 3) is performed only after each computation has been assigned a processor. It is straight forward because it follows the existing data dependencies; it is a form of dataflow scheduling [19] , [20] .
The NavP-based methodology provides two major advantages for distributed parallel programming: 1) Incremental parallelism. The sequential program evolves into the distributed program in a series of incremental steps: the data is distributed, the program follows the data, the program is divided into smaller computational units, the computational units are scheduled. At any point during this evolution there is a viable (executable) program that has the same semantics as the original sequential program. One consequence of this is that it is possible to return to any decision made during the generation of the parallel program, change the decision, and continue the generation process from that point forward. 2) Incremental performance improvement. The feedback mechanism provides performance evaluations, including speedup and load balance, that can be used to adjust the output of the preceding steps. In this way, the methodology is a closed-loop system that incrementally improves the performance of the parallel and distributed program. What enables the incremental parallelization and performance improvement is computation mobility. Our approach uses the principle of pivot-computes, which is different from the commonly used owner computes [21] . Pivot computes performs the computation on the node that contains the largest portion of the data, regardless of whether it owns it (writes it) or only reads it and then writes it on some other node. This is only possible if computations can migrate to the data. Consequently, the data layout can be decided first, the computations then follow the data distribution. This simplifies the programming task because it decouples the two main problems: data distribution/placement and code parallelization.
Another consequence of distributing data first is better performance, since migrating computation to data is frequently more efficient (only the state moves, not the code) than moving the data to computation.
III. METHODOLOGY TO GENERATE NAVP CODE
The NavP methodology consists of the following steps. 1) Data distribution: This step runs the sequential program using a small data sample and generates the initial data partitioning and distribution. (This has been presented in earlier work [12] and will not be discussed in this paper in detail.) 2) Code Transformation: This step distributes the computation. It partitions the original sequential program into smaller units, based on the given data distribution, it assigns each unit to execute on one of the nodes, and it inserts appropriate navigational statements into the code. The resulting code is still sequential but it runs in a distributed manner using migration. This is referred to as distributed sequential computing (DSC). IV. AN EXAMPLE We will describe our methodology using the code shown in Fig. 1 as an example, which is an abstraction of a method for solving a 2-dimensional heat equation originally presented in [22] . It is a variant of the alternating direction implicit (ADI) method, which makes multiple sweeps in different directions across an array representing the discretized domain. The example shows four 2D arrays involved in the computation (u, v, p, and q). It also shows the computation kernel which consists of an outer loop (line 1) that repeatedly performs a column sweep phase (lines (2)- (18)) followed by a row sweep phase (lines (19)- (35)). The row sweep phase and the column sweep phase consist of four loops each, for a total of 8 inner loops performed for each iteration of the outer loop.
The methodology discussed in this paper takes a specified data distribution as a starting point. We use the two data distributions illustrated in Fig. 2 to show the methodology and compare the respective performance. The data distributions are shown for 3 nodes, each represented by a different shade. The first was generated by our Data Distributor [12] , while the second is the well-known twisted data layout [23] . In each distribution, all four 2D arrays (u, v, p, and q) are distributed as shown in the corresponding figure.
V. CODE TRANSFORMATION AND PLACEMENT Starting from the sequential code and an initial data distribution, we first build the DSC that navigates through the network and accesses the distributed data sequentially. The simplest way of constructing a working DSC program would be to insert conditional hop statements in the sequential program before each data access. The program resulting from this simple strategy would be correct, but it could be quite inefficient due to the large number of hops. To reduce the number of hops we first partition the iteration space into smaller tiles, which will be executed locally. As discussed below, we use tiling to reduce communication and increase the opportunity for parallelism. After the tiling operation is complete, we assign a location to each tile.
A. Partitioning phase 1: Adapt to the data distribution
Tiling proceeds in two phases. The goal of the first phase, which is performed once, is to minimize communication
for (i=1; i<=N ;i++){ (6) for( j=1; j<=N ;j++){ (7) p (13) } (14) for (j=1; j<=N ;j++){ (15) for The tile size can be changed without changing the codes. Fig. 4 illustrates the first phase of tiling for a 3 × 3 twisted layout. Considering loop 4 (lines 14-18 of Fig. 1 ), if this loop is transformed to DSC simply by inserting hop statements, three hops would be required per column for a total of 3N hops ( Fig. 4(a) ). However, if the code is transformed as shown in Fig. 3 , then only 9 hops are necessary for the entire loop as can be seen in Fig. 4(b) . We define the write set and read set of a tile to be, respectively, the set of all data written by and read by the tile. The partitioning in this first phase is then based on the following principles: 1) Homogenous write sets: All the writes to memory from any given tile are to data on a single machine. 2) Isothetic cuts: The write sets of the tiles are all rectangles and the boundaries between write sets of tiles all fall on a common set of vertical and horizontal lines. 3) Minimized communication: The tiles should be as large as possible, consistent with the first two conditions. The read sets are not necessarily homogenous. Condition 2 implies that all write sets are rectangles, but the rectangles do not necessarily have the same size. This can be seen in Fig. 8(a) , which shows the result of applying phase 1 partitioning to the data distribution of Fig. 2(a) .
B. Partitioning phase 2: Increase opportunities for parallelism
The result of the first partitioning phase is a tiling with low communication cost. Splitting the tiles further will have two conflicting effects: it will increase the number of hops, but it will increase the potential for parallelism. The first effect will result in increased communication cost and hence decreased speedup, while the second effect will increase speedup.
As an example, Fig. 4(c) shows the result of splitting each tile into 4 subtiles. This transformation increases the number of hops from 9 to 36. It increases the opportunity for parallelism because, for example, the tile with write set c can begin executing as soon as the tiles with write sets a and b have completed. The tradeoffs between the increased communication cost and the increased opportunity for parallelism, and the optimal level of tile splitting that should occur, are difficult to evaluate a priori, as they depend on the tile size and on the order in which tiles are evaluated. These considerations suggest the following strategy: start with the largest possible tile; split the tiles; estimate the resulting speedup; and repeat these steps for as long as the estimated speedup continues to increase. The resulting feedback loop is discussed in Section VII. The result of this further partitioning is code that is identical to the code in Fig. 3 . The only difference is that the parameters describing the number and extent of the blocks have changed.
C. Assigning a location to each tile
Once the size of the tiles has been chosen, each tile needs to be assigned to one of the nodes. In most cases this is a straightforward procedure: all writes from each tile are to variables on the same node, and often the best strategy is to assign the tile to that node. This is consistent with the wellknown owner-computes strategy [21] .
The strategy that we actually use is to assign each tile to the node that holds the most data accessed by the tile (either as part of the read set or the write set). We call this node the pivot node, and we call the resulting strategy pivot-computes [24] . The pivot-computes strategy can result in significantly less data movement in certain situations, such as a REDUCEtype operation where a large amount of data stored on one node is summarized in a few variables stored on a second node. In this case, performing the computation on the node that holds the read set is more efficient than the node that writes the final value. For a specific example see [11] . The two strategies frequently produce the same result, as they do with the examples of this paper.
Once a tile has been assigned to a node, additional code is inserted to ensure that the execution is performed on the chosen node and all necessary data is carried there. Fig. 5 shows the result of inserting this code in the code of Fig.  3 . The additional partial row of v necessary for the tile computation is loaded into the local variable x. After the hop to the node where the computation will occur, the data carried in x is unloaded.
VI. PARALLELIZATION GENERATION
The result of the previous steps is a set of code tiles, each of which has been assigned to a node. The next step is to turn these into a parallel program. The fact that each tile has already been assigned a node makes this step relatively straightforward: all that is needed is to group the tiles into threads and insert appropriate commands for synchronization and transfer of data among threads. This step proceeds in three stages. First we build a Weighted Dependence Graph which captures the essential dependency relations among the tiles. Next we combine the tiles into threads. At this point, we can pass the Weighted Dependence Graph and thread information to the Feedback Mechanism. Based on the feedback, we may decide to go back and change some decisions made during the earlier code transformation and placement phase or to change the data distribution. If we are satisfied with the results of the feedback, we proceed to the third state, which is the generation of the parallel code.
A. Build the Weighted Dependence Graph
The Weighted Dependence Graph is a precedence graph that captures the relationships among the tiles derived using the code transformation. Each node represents a tile. All edges are directed and indicate that the tile corresponding to the origin node must be computed before the tile corresponding to the destination node.
Each node is assigned a cost, which is the relative amount of computation required by the corresponding tile. Associated with each edge is the communication cost and context switching cost associated with that edge. The context-switching cost is taken to be constant. The communication cost is zero if the two endpoints of the edge are tiles assigned to the same node. Otherwise, the cost is a function of the amount of data that needs to be moved, using a piecewise-linear communication cost model that takes into consideration packet size, latency per packet of the given size, and bandwidth [25] .
Once the graph has been constructed, we simplify it using a restricted form of transitive reduction: If an edge in the Weighted Dependence Graph only represents synchronization (i.e., carries no data) and if other edges guarantee the precedence relation represented by the edge, the edge can be removed. Fig. 6 shows the graph corresponding to loops 2, 4, 6, and 8(Loops 1, 3, 5, and 7 are used only for initialization and are omitted to simplify the presentation). The shapes of the graph nodes represent the processing node where the computation is executed. There are three different shapes because the data are distributed over three processing nodes. Each graph node is annotated L(I,J), where L is the loop number and the pair (I,J) is the write set of the tile. For example, 4(0,1) means loop4 and (0, 1) is its write set, where I=0 and J=1. A solid edge means that a whole data block has to be transferred, while a dashed edge means that the data dependence consists only of boundary data. 
B. Combine tiles into threads
To create execution threads from the reduced graph, we use a bottom up approach. Initially each tile is considered a separate thread. These are then combined into longer threads. There are many ways to combine the tiles into a thread. We apply a heuristic strategy, which combines threads inside a single loop within the same global iteration. Our heuristic threads together tiles that cannot be executed in parallel. In particular, it combines tiles connected by edges because the data dependencies require such tiles to be executed sequentially. For example, in Fig. 6, for loop2, tile (0,0), (0,1), (0,2) are grouped into a thread, tile (1,0), (1,1), (1, 2) are grouped into a thread, and tile (2,0), (2,1), (2,2) are grouped into a thread.
C. Generate the Code
Generating the code requires inserting code so that the multiple threads synchronize with each other. [i] are ready, which are computed by other threads. Synchronization code must be added to wait for data computed by other threads and to signal that data is ready to be accessed. The basic principle is to insert signal/wait primitives at each end of a solid edge and add code to transfer data for any edge that carries data (edges with nonzero weight.) Fig. 7 shows the portion of the DPC code that represents the transformed Loop 4. Builder.msgr injects a spawner. loop4 spawner.msgr creates the multiple threads for loop4 (Loop4 sweeper) and injects them at the node where the computing starts for loop 4.
Each of these threads processes a column of tiles. It executes as a doubly indexed loop (lines 9-10), where the outer loop runs through all iterations of the main loop in the original sequential program and the inner loop runs through all the tiles in the column. For each tile, it hops to the pivot node (line 11) synchronizes with other threads (line 12), computes the tile (line [13] [14] [15] [16] [17] , and synchronizes with other threads (lines [18] [19] [20] [21] . Since each hop is carrying data, we do not explicitly show load and unload data in this pseudo code.
In this example, no data transmission code is needed for solid edges because their weights are all zero. This is true in most cases because any two tiles that access the same data will, due to the pivot-computes principle, be assigned to the same node.
VII. FEEDBACK MECHANISM
The long-term goal of the feedback loop is to be able to periodically improve the performance of the actual program by adjusting its current data distribution and level of parallelism at runtime based on information gathered automatically from the instrumented program. Currently, the feedback loop uses only a simulator, which executes a discrete-event simulation of the simplified Weighted Dependence Graph. The communication costs are estimated using the communication cost model discussed earlier. The computation time of each tile is estimated (6) inject(Loop4_sweeper, J); (7) } (8) Loop4_sweeper(J): (9)for(iter=0; iter< ITERATIONS;iter--){ (10) for (I=(num_blk_I-1);I>=0;I--){ (11) hop(comp_loc[I,J]); (12) if(I==(num_blk_I-1)) wait(E(iter,loop4,J,I,p/q)) ; (13) for(j=j_start(I,J); j<=j_end(I,J); j++){ (14) for(i=i_end(I,J); i>=i_start(I,J);i--){ (15) v
} (18) if(I==0)signal(E(iter,loop6,I,J,v) ; (19) if ( • Degree of load balance: This measure captures how well the load is balanced over all physical nodes. After evaluating the parallel distributed programs, the Feedback mechanism passes its evaluations to the previous steps. These evaluations include the speedup, the idle time, the efficiency ratio, and load imbalance, and serve as guide to improve the respective transformations as shown next.
A. Case 1: Feedback to Code Transformer
When the Code Transformer receives the speedup feedback, it tries to increase parallelism by reducing the tile size. As the number of tiles increases, the communication and context switch cost increase while the parallelism increases. Both changes directly affect speedup. The process of decreasing the tile size is repeated until the speedup reaches its peak. Beyond this point the speedup decreases despite increased parallelism because the gains are outweighed by the even faster increasing communication and context switching costs. Fig. 8 shows the same partitioning of loop4 but with different number of tiles. Fig. 8 (a) has 9 tiles, Fig. 8(b) has 16, and Fig. 8 (c) has 25 . Clearly, the first case has the smallest communication cost but also the smallest opportunity for parallelism, while the third case shows the opposite. Fig. 9(a) shows the speedup for four different tile sizes. No.1 through No.3 correspond to the above three tile sizes, and No.4 corresponds to an even smaller tile size. The curve shows an improvement when tile size is increased from 9 to 16 but drops off with the next smaller tile size. Based on this, the second option, Fig. 8(b) , is the best choice. Fig. 9(b) shows the results for the same program but using the twisted data distribution. No.1 has 9 tiles, No.2 has 36 tiles, and No.3 has 81 tiles. The speedup decreases with the tile size. Hence the best choice for this distribution is the first option. 
B. Case 2: Feedback to Data distributor
When the Feedback mechanism determines that the load is highly imbalanced, it will suggest to the Data Distributor to increase the number of partitions and try a different assignment of partitions to nodes. Fig. 10 shows three versions of data partitioning and their assignment to three different nodes. (a) has 6 partitions (resulting in 2 partitions assigned to each node), (b) has 9 partitions (3 per node) and (c) has 12 partitions (4 per node). Fig. 11 shows the performance for the three cases. As the number of partitions increases, the distributor has more flexibility in assigning partitions to nodes, which increases the chances to keep the load balanced. However, more partitions result in more communication overhead. Thus when the number of partitions increases beyond its peak, the increased communication cost outweighs the benefits of the better load distribution. For the example, using 9 partitions was the best choice. 
VIII. EXPERIMENTS
In this section, we present our experimental results for the target application. The data was obtained using a network of Linux cluster and 100Mbps of Ethernet connection with a collision-free switch and using the NFS file-sharing system. Fig. 12 shows the performance when using the unstructured data distribution. Array size is the size of the arrays u, v, p and q. DPC is execution time for the parallel program, which is divided by the sequential time to get the speedup. Fig. 13 shows the performance results for the twisted data distribution, for array sizes of 5040 × 5040, 12600 × 12600, and 15120 × 15120. Both experiments show a significant speedup for all cases. 
IX. CONCLUSIONS
This paper presents a new methodology to generate parallel distributed programs for a distributed memory environment. We applied the approach to a well-known scientific application and present the experimental results to show the effectiveness of our technique.
Our methodology exploits a key advantage of Navigational Programming, namely that executing threads can "follow the data" by migrating between locations. This allows us to cleanly divide the parallelization into three phases: data distribution, partitioning the computation into tiles, and scheduling the computation. By simulating the resulting computation, we can assess whether the data distribution provides adequate load balance and whether the tiling strategy gives a good balance between parallelism and communication overhead. If the answer is determined to be no, then these phases can be rerun, using the feedback provided by the simulation. Our methodology is incremental in the sense that at any point during the feedback process we can stop and produce a working program.
One promising extension of our method is to eliminate the simulator and instead use the actual program. The feedback mechanism would then monitor the performance of the code, suggest a new data distribution and/or a new tiling strategy, and at some point cut over to the new code. This would represent incremental parallelism in the fullest sense of the phrase: a working program continually improving itself as it performs its task. The major challenges here are to efficiently gather runtime performance data and to manage the transition from the old program to the new program.
In the application discussed in this paper, the underlying data structure is a static array of a fixed size. Incremental program improvement would allow the possibility of adapting to changes in the data structure (e.g., a hierarchical data structure such as a B-tree or a self-refining multilevel grid) to improve performance.
