Many parallel applications require periodic redistribution of workloads and associated data. In a distributed memory computer, this redistribution can be difficult if limited memory is available for receiving messages. We propose a model for optimizing the exchange of messages under such circumstances, which we call the minimum phase remapping problem. We first show that the problem is NP-Complete, and then analyze several methodologies for addressing it. First, we show how the problem can be phrased as an instance of multi-commodity flow. Next, we study a continuous approximation to the problem. We show that this continuous approximation has a solution which requires at most two more phases than the optimal discrete solution, but the question of how to consistently obtain a good discrete solution from the continuous problem remains open. Finally, we devise a simple and practical approximation algorithm for the problem with a bound of 1.5 times the optimal number of phases.
INTRODUCTION
In many parallel computations, the workload needs to be periodically redistributed among the processors. On a distributed memory computer, this generally requires data structures associated with the computations to be transferred between processors. Many examples of this phenomenon occur in scientific computing. When computational work varies over time, the tasks and attendant data must be redistributed to keep the workload balanced. Examples include adaptive mesh refinement, particle simulations with short- or long-range forces, state-dependent physics models, and multi-physics or multi-phase simulations. A number of algorithms and software tools have been developed to repartition the work among processors (see, for example, [2, 5] and references therein). However, the mechanics of actually moving large amounts of data have received much less attention. When the processors have sufficient memory, the simplest way to transmit the data is quite effective. Each processor can execute the following steps.
(1) Allocate space for my incoming data
(2) Post an asynchronous receive for my incoming data
(3) Barrier
(4) Send all my outgoing data
(5) Free up space consumed by my outgoing data
(6) Wait for all my incoming data to arrive
The barrier in step (3) ensures that no messages arrive until the processor is ready to receive them, so no buffering is needed.
Unfortunately, this protocol can fail when memory is limited. It requires a processor to have sufficient memory to simultaneously hold both the outgoing and the incoming data, since incoming messages can arrive before outgoing data is freed. An alternative way to view this issue is that for a period of time the data being transferred consumes space on both the sending and receiving processors. A protocol that alleviates this problem is desirable for three reasons. First, since many scientific calculations are memory limited, reserving space for this communication operation limits the size of the calculations which can be performed. Second, the amount of memory required by this protocol is unpredictable, so setting aside a conservative amount of space is likely to be wasteful. And third, a general purpose tool for dynamic load balancing should be robust in the presence of limited memory. It was the construction of just such a tool which inspired our interest in this problem [4].
To address these problems, we propose a simple modification to the above scheme. Instead of sending all of the data at once, we will send it in phases. After each phase, processors can free up the memory of the data they have sent. That memory is now available for the next communication phase. Since each phase can be expensive, it is important to limit the total number of phases.
More formally, consider a set of $P$ processors. The amount of data that needs to be communicated between processors is a transfer request. We will assume that the request is feasible, that is, the end result of satisfying the transfer request does not violate any processor's memory constraints. We will let $T_{ij}$ denote the total volume of data which is requested to be transferred from processor $i$ to processor $j$.
We now wish to perform the requested transfer in a sequence of phases. Let $t_{ij}^{l}$ denote the volume of data transferred from processor $i$ to processor $j$ in phase $l$, and let $A_i^{l}$ be the memory available to processor $i$ at the beginning of phase $l$. We will also use $R_i^{l}$ and $S_i^{l}$ to denote the total volume of data received and sent by processor $i$ in phase $l$ (i.e., $R_i^{l} = \sum_{j=1}^{P} t_{ji}^{l}$ and $S_i^{l} = \sum_{j=1}^{P} t_{ij}^{l}$).
At each step the constraint of finite memory requires that $R_i^{l} \le A_i^{l}$ for $i = 1, 2, \ldots, P$. The available memory after each phase can be computed as $A_i^{l+1} = A_i^{l} + S_i^{l} - R_i^{l}$. Our objective is to find a schedule of transfers which obeys the memory constraint and satisfies the transfer request in a minimal number of phases. We will call this the minimum phase remapping problem. Note that there is a corresponding decision problem: can a transfer request be completed in a specified number of phases?
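The model above can be made concrete with a small checker. The following is a minimal sketch, not part of the original formulation: the function name `check_schedule` and the nested-list representation are assumptions chosen for illustration, and only direct transfers from source to final destination are modelled.

```python
from typing import List

Matrix = List[List[int]]


def check_schedule(T: Matrix, A0: List[int], phases: List[Matrix]) -> bool:
    """Return True if `phases` is a valid schedule for the transfer request T.

    T[i][j]         : total volume requested from processor i to processor j
    A0[i]           : memory available on processor i before the first phase
    phases[l][i][j] : volume sent from i to j in phase l (direct transfers only)
    """
    n = len(T)
    avail = list(A0)
    sent = [[0] * n for _ in range(n)]
    for t in phases:
        if any(t[i][j] < 0 for i in range(n) for j in range(n)):
            return False  # negative volumes are meaningless
        recv = [sum(t[j][i] for j in range(n)) for i in range(n)]  # R_i^l
        send = [sum(t[i][j] for j in range(n)) for i in range(n)]  # S_i^l
        # Memory constraint: R_i^l <= A_i^l for every processor.
        if any(recv[i] > avail[i] for i in range(n)):
            return False
        # Update rule: A_i^{l+1} = A_i^l + S_i^l - R_i^l.
        avail = [avail[i] + send[i] - recv[i] for i in range(n)]
        for i in range(n):
            for j in range(n):
                sent[i][j] += t[i][j]
    # The schedule must deliver exactly the requested volumes.
    return sent == T
```

For two processors that exchange two units each while holding one free unit apiece, a single phase violates the memory constraint, whereas two phases of one unit each pass the check.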
In §2 we show that the problem of determining whether a given transfer can be completed in a specified number of phases is NP-Complete. The remainder of the paper focuses on formulations and approximation algorithms which could be used in practice. In §3 we present a reduction of our problem to multi-commodity flow. We present a continuous relaxation of the problem in §4, and a practical approximation algorithm in §5.
Despite its practical importance, we are unaware of any previous work on efficient data transfers with limited memory. Some standard collective communication operations can be implemented in ways that limit memory usage, but the general problem we are proposing seems to be new. Cypher and Konstantinidou designed memory-efficient message passing protocols [3]. However, their work addressed the exchange of tokens as opposed to variable-sized messages, and they did not explicitly consider the effect of finite memory in the processors. Their work conceptually divides a process into communication and application processes. Communication processes receive unit-size messages and copy them to application processes. It is assumed that application processes have enough memory, and the goal is to limit the memory requirement of the communication processes.
COMPLEXITY
In this section we show that determining whether a given transfer can be completed in a specified number of phases is NP-Complete. Our proof uses a reduction from the Hamiltonian Circuit problem. Recall that a Hamiltonian Circuit is a cycle in the graph that visits each vertex exactly once. The directed Hamiltonian Circuit problem is known to be NP-Complete [6]. Given an instance of the Hamiltonian Circuit problem, the basic idea of our reduction is to construct an instance of the data transfer problem in which there is but a single unit of usable memory. This unit is a token which gets passed between processors, and possession of the token allows a processor to receive data in the next phase. In our construction, a solution to the data remapping problem occurs if and only if the token can be passed in a cycle among all the processors, which implies the existence of a Hamiltonian Circuit.
While the token is being passed in a cycle, the processors must not perform any other data transfers. But when the cycle is completed, they must be able to finish all their other communication operations. To see how this can be done, consider the Hamiltonian Circuit problem posed in the left portion of Fig. 1. From this instance, we construct the data remapping problem in the right portion of the figure. The data remapping problem contains the original graph as its core (represented in the figure with dark lines) after replacing vertices with processors and replacing edges with unit-volume data transfers. It also contains a chain of processors to the left. The bottom processor in this chain has free memory which will percolate upwards with each phase, finally allowing all the data transfers to be completed.
Given a Hamiltonian Circuit graph $G = (V, E)$, we construct a data remapping problem with $P$ as the set of processors and $T$ as the set of transfer requests as follows.
Consider what happens as the data remapping occurs. In the first phase, $c_1$ will send its data to $c_2$, moving the free memory one step up in the chain. After $|V| - 1$ phases, this free memory will have arrived at $c_{|V|}$, the top of the chain. Meanwhile, the single unit of free memory (the token) which started at $p$ will have meandered about, enabling some data to be transferred.
In phase $|V|$, processor $c_{|V|}$ has enough free memory to receive all of the data that needs to come to it from the core processors. During this phase, the token can take one more step. The messages sent to $c_{|V|}$ free up memory in the core processors. Specifically, at the completion of phase $|V|$, each core processor $p_i$ has $(\mathrm{indegree}(p_i) - 1)$ units of free memory. (One processor might also have an additional unit of free memory from the token.)
In phase $|V| + 1$, core processor $p_i$ can now receive all but one unit of the data that needs to come to it. The complete set of transfers to $p_i$ can be completed in this phase if and only if one of the data transfers to $p_i$ has previously been handled by the token. If there is a processor that was not visited by the token in phases $1$ to $|V|$, then that processor cannot receive all its data in $|V| + 1$ phases. But the only way for the token to visit all the core processors in $|V|$ phases is to complete a Hamiltonian Circuit of the core graph. Note that the token must end up where it started, at processor $p$, to enable the transfer from $d$ to occur during phase $|V| + 1$.
This argument leads to the following result.
Theorem 2.1. Determining whether an instance of the data remapping problem can complete in a specified number of phases is NP-Complete.
Proof. Given an instance of the Hamiltonian Circuit problem $G = (V, E)$, construct a data remapping problem as described above. As sketched above, the data remapping problem finishes in $|V| + 1$ phases if the core graph has a Hamiltonian Circuit.
The total amount of data that needs to be transferred is $|V|(|E| - |V|) + |E| + 1$. The first term comes from the data being sent to the chain and within the chain. The second term reflects the data being redistributed within the core, and the last term is the transfer from $d$ to $p$. This quantity equals $(|E| - |V| + 1)(|V| + 1)$. Since there are only $(|E| - |V| + 1)$ units of free memory, the transfers can complete in $(|V| + 1)$ phases only if all the free memory is used at every phase. So the transfers must proceed as discussed above.
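The equality between the two expressions for the total volume can be checked by expanding the product; this is plain algebra on the quantities already named in the proof:

```latex
\begin{align*}
(|E| - |V| + 1)(|V| + 1)
  &= |E|\,|V| + |E| - |V|^2 - |V| + |V| + 1 \\
  &= |V|\,(|E| - |V|) + |E| + 1 .
\end{align*}
```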
If the core graph does not have a Hamiltonian Circuit, then one of its processors will not have been visited by the token by the end of phase $|V|$. That unvisited processor, $p_i$, still needs to receive $\mathrm{indegree}(v_i)$ units of data, but has only $(\mathrm{indegree}(v_i) - 1)$ units of available memory, so the data transfers cannot complete in $|V| + 1$ phases.
Notice that the construction of the data remapping problem is polynomial, so we can conclude that the data remapping problem is NP-Hard. A given solution can be verified in polynomial time, so the problem is in NP.
MULTI-COMMODITY FLOW FORMULATION
In this section, we present a multi-commodity flow (MCF) formulation to determine whether a given transfer can complete in a specified number of phases [1]. Once we can solve the decision problem, the number of phases in an optimal solution can be determined using parametric search. This formulation enables use of MCF technology to optimally solve the minimum phase data remapping problem. This might be helpful for three reasons. First, some MCF problems can be solved relatively fast, despite their intractability in the general case. Second, the continuous version of the MCF problem can be solved in polynomial time and the solution can be used as a heuristic for the integer problem. Finally, MCF solvers will find an optimal solution if runtime is not an issue.
In our MCF formulation, each processor corresponds to a commodity. Let $P$ be the number of processors, and $L$ be the number of phases. We want to decide if a remapping can complete in $L$ phases. As depicted in Fig. 2, our MCF graph contains a sequence of components, one for each phase. Each component allows for the communication which occurs in the corresponding phase. The MCF graph $G = (V, E)$ has $2PL$ vertices. Each processor is represented by $2L$ vertices: two vertices (one sender and one receiver) at each phase. We will use $r_i^{l}$ and $s_i^{l}$ to denote the receiver and sender, respectively, for processor $i$ in phase $l$. A sender vertex of the first phase is the source of a commodity with volume equal to the total volume of the data originally stored by this processor. A receiver vertex in the last phase is a destination for a set of commodities which corresponds to data that will be stored by this processor after remapping is complete.
In the MCF graph, there is an edge from $r_i^{l}$ to $s_i^{l}$ for $l = 1, \ldots, L$ and $i = 1, \ldots, P$. The capacity of such an edge is equal to the total memory on the respective processor. There are also edges from each sender vertex $s_i^{l}$ to all other receiver vertices $r_j^{l}$ in the same phase to enable data exchange between any pair of processors in a phase. These edges have infinite capacities.
With this construction, all processors first receive the data in a phase, and then send their messages. This corresponds to first allocating space for the data to be received, and then sending the outgoing data. The edges from receivers to senders within a phase guarantee that there is available space to allocate memory for the incoming data before releasing the space for the data being shipped out; thus the memory constraints are guaranteed to be satisfied.
Finally, there is an edge (with infinite capacity) from each sender $s_i^{l}$ to the receiver in the next phase, $r_i^{l+1}$, for $l = 1, \ldots, L - 1$. The flow on these edges corresponds to data that is already in the memory of a processor at the beginning of a phase. The graph for $P = 5$ and $L = 2$ is depicted in Fig. 2.
Proof. We can replace a data transfer from processor $i$ to processor $j$ in phase $l$ with flow on edge $(s_i^{l}, r_j^{l})$ of equal volume. As argued above, memory constraints on the processors are satisfied if and only if the capacity constraints on the edges are satisfied in $G$. So the feasibility of one solution implies the feasibility of the other.
In this formulation the number of commodities is equal to the number of processors, and the graph has $2PL$ vertices and $P^2 L$ edges. The number of vertices and edges can be reduced for a more efficient formulation. First, we can replace the crossbar between senders and receivers in a phase $l$ with a vertex $v^{l}$, edges from all senders of phase $l$ to $v^{l}$, and edges from $v^{l}$ to all receivers of phase $l$. Second, we can merge the senders of phase $l$ with the receivers of phase $l + 1$. The graph after these reductions is depicted in Fig. 3. This improved formulation has $PL + L + P$ vertices and $(3L + 1)P$ edges.
CONTINUOUS RELAXATION
Although the multi-commodity flow formulation from §3 provides a methodology for solving instances of the minimum phase remapping problem, runtime can still be exponential in the problem size. In this section, we describe an efficient solution for an approximation to the remapping problem. In the approximation, integral constraints on the volume of data transfers are relaxed to allow continuous values. Naturally, the volume of transfer between two processors in a phase must be an integer. But integer solutions near the continuous ones can be used as heuristics. Note that the unit of data transfer is only a byte, whereas the volume of data being transferred is often on the order of megabytes. So, conversion from a continuous solution to an integer solution will often be a small perturbation, and heuristics based upon this idea may be generally effective.
However, bad cases for this heuristic exist as discussed at the end of this section.
As defined in the introduction, $T_{ij}$ denotes the total volume of data to be communicated from processor $i$ to processor $j$, and $t_{ij}^{l}$ denotes the volume of data transferred from processor $i$ to processor $j$ in phase $l$. The memory available to processor $i$ at the beginning of phase $l$ is denoted by $A_i^{l}$. We also use $R_i$ and $S_i$ to denote the total volume of data received and sent by processor $i$ during remapping.
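Let $L = \lceil T/M \rceil$ be the lower bound on the number of phases, where $T$ is the total volume of data to be moved and $M$ is the total available memory.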
We will divide each message into $L$ equal pieces, i.e., $t_{ij}^{0} = t_{ij}^{1} = \cdots = t_{ij}^{L-1} = T_{ij}/L$, and send a piece at each phase. If the memory constraints are satisfied, then the data transfers will complete in precisely $L$ phases. However, there is no guarantee that memory constraints will not be violated. As a solution to this, we will use preprocessing and postprocessing phases to enable feasibility of the phases in between.
Proof. At each phase processor $i$ will receive $R_i/L$ units of data. By the second condition, each processor has sufficient memory for the first phase. By the first condition, each processor ships out $S_i/L = R_i/L$ units of data at each phase, which frees up sufficient memory for the next phase.
Proof. In the preprocessing phase we will reorganize the data to satisfy conditions (i) and (ii) from Lemma 4.1, and define a new mapping of the data. After the new mapping is complete, a single postprocessing phase will be sufficient to get all of the data to the correct processor.
In the preprocessing step, all processors $i$ with $R_i < S_i$ will transfer some of their outgoing data to processors $j$ in which $R_j > S_j$, so that in subsequent phases $R_i = S_i$. Note that if the transfer request is feasible then $R_j - S_j \le A_j^{0}$. So this rearrangement can be completed in a single phase.
Next, as a second part of the preprocessing step, processors $i$ with $A_i < R_i/L$ will transfer some of their outgoing data to processors $j$ with $A_j > R_j/L$. To avoid disturbing the first property, sending processors will also pass equal amounts of receiving assignment. Once again, this step can be completed in one phase, since, by construction, the receiving processors have sufficient space.
Notice that the actual data being transferred is irrelevant; we are just trying to balance the numbers. So a send and a receive operation can cancel each other. This enables merging of the two steps above into one phase.
After the new transfer request $R'$ is realized, we need to correct for the transfer of receiving assignments. This is the purpose of the postprocessing phase. Under the transfer of receiving assignments, each processor is either a sender or a receiver of such assignments. So, during postprocessing, each processor will either receive or send data, but not both. Since the initial remapping is feasible, each processor has enough memory for the data to be received, so the postprocessing can be completed in one phase.
The complexity of constructing the solution for the preprocessing phase is linear in the number of processors. To see this, divide the processors into two lists: those with $R_i < S_i$ and those with $R_j > S_j$. Now step through the lists together, transferring sending responsibility from a processor in the $i$ list to one in the $j$ list. Each transfer balances $R_i$ and $S_i$ for a processor in one of the lists. The same can be applied to balance initial available memories. Notice that the preprocessing step uniquely describes the postprocessing phase, and remapping for $R'$ is straightforward.
It is worth noting that a good solution of this continuous approximation may not lead to good solutions of the true discrete problem. For instance, consider the example depicted in Fig. 4. This example consists of two groups of processors, with no communication between the groups, and there is only one unit of available memory. Available memory must be possessed by each component in turn, and this requires temporarily moving some data from one component to the other to transfer the free memory, as will be discussed in more detail in the next section. In the preprocessing step described in the proof of Lemma 4.2, this available memory will be divided between the two groups of processors, but the fractional transfers which follow give no insight into the correct way to orchestrate the data transfers for this instance. Specifically, in the continuous solution all processors are identical, so no information is gleaned about the necessity of working on components in turn.
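The linear-time sweep over the two lists used in the preprocessing step can be sketched as follows. This is an illustrative sketch only: the function name `balance_send_receive` and the triple format are assumptions, and the check that each receiving processor has room (which the text argues from feasibility) is not modelled.

```python
from typing import List, Tuple


def balance_send_receive(S: List[int], R: List[int]) -> List[Tuple[int, int, int]]:
    """Pair processors with S_i > R_i against processors with R_j > S_j.

    Returns (i, j, amount) triples: processor i hands `amount` units of its
    outgoing data, and the duty to send it, to processor j. Assumes
    sum(S) == sum(R), which holds because every unit sent is also received.
    """
    surplus = [[i, S[i] - R[i]] for i in range(len(S)) if S[i] > R[i]]
    deficit = [[j, R[j] - S[j]] for j in range(len(S)) if R[j] > S[j]]
    moves = []
    a = b = 0
    while a < len(surplus) and b < len(deficit):
        amount = min(surplus[a][1], deficit[b][1])
        moves.append((surplus[a][0], deficit[b][0], amount))
        surplus[a][1] -= amount
        deficit[b][1] -= amount
        # Each move balances at least one processor, so the number of moves
        # is at most the number of processors.
        if surplus[a][1] == 0:
            a += 1
        if deficit[b][1] == 0:
            b += 1
    return moves
```

Applying the returned moves to the send totals (subtracting from processor i, adding to processor j) leaves every processor with equal send and receive volumes.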
EFFICIENT APPROXIMATION ALGORITHMS
In this section, we describe the basics of a family of efficient algorithms that provide solutions in which the number of phases is at most 1.5 times that of an optimal solution. The algorithm is motivated by some simple observations. First, the maximum amount of data that can be transferred in a phase is equal to the total amount of free memory in the parallel machine. Let $M$ be the total available memory in the parallel machine, and let $T$ be the total volume of data to be moved. Note that $M$ doesn't change between phases.
Lemma 5.1. The minimum number of phases in a solution is at least $\lceil T/M \rceil$.
This bound can only be achieved if available memory is used to receive messages at each phase. So free memory is wasted if it resides on a processor that has no data to receive. Our algorithm works by redistributing free memory to processors that can use it. Equivalently, data is parked on a processor with free memory it can't use, which frees up memory on processors which can use it. We will only park data that needs to be transferred eventually.
Parking
Parking aims to utilize memory that would otherwise be wasted. Consider a processor that has received all its data and still has available memory. This memory cannot be utilized in subsequent phases, decreasing the total memory which is usable for communication, thus potentially increasing the number of phases. Instead, another processor can temporarily move some of its data to this processor to free up space for messages. An example is illustrated in Fig. 5. In this simple example, the top two processors want to exchange 100 units of data, but each has only one unit of available memory. A simplistic approach will require 100 phases. However, the third processor has 100 units of free memory. By parking data on this third processor (i.e., transferring free memory to another processor), the number of phases can be reduced to three. Any processor that has parking space can store parkable data from another processor, maximizing the amount of usable free memory. This parked data merely takes an extra step on the way to its final destination. Exploiting this observation will allow us to construct an approximation algorithm.
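The three-phase schedule for the example of Fig. 5 can be replayed with a short simulation. This is a sketch under stated assumptions: the processor numbering (0 and 1 for the exchanging pair, 2 for the processor with parking space) and the helper `run_phases` are hypothetical, and only the memory constraint on receivers is checked, not which data a sender actually holds.

```python
from typing import List, Tuple

Move = Tuple[int, int, int]  # (source, destination, volume)


def run_phases(avail: List[int], phases: List[List[Move]]) -> List[int]:
    """Apply phases of moves, enforcing R_i <= A_i in every phase.

    Returns the available memory after the last phase and raises ValueError
    if some processor would receive more than it has room for.
    """
    avail = list(avail)
    for number, moves in enumerate(phases, start=1):
        recv = [0] * len(avail)
        send = [0] * len(avail)
        for src, dst, vol in moves:
            send[src] += vol
            recv[dst] += vol
        for i, room in enumerate(avail):
            if recv[i] > room:
                raise ValueError(f"phase {number}: processor {i} lacks memory")
        avail = [avail[i] + send[i] - recv[i] for i in range(len(avail))]
    return avail


# Processors 0 and 1 exchange 100 units; processor 2 offers 100 units of parking.
parked_schedule = [
    [(0, 2, 100)],  # phase 1: processor 0 parks its outgoing data on processor 2
    [(1, 0, 100)],  # phase 2: processor 1 sends into the space freed on processor 0
    [(2, 1, 100)],  # phase 3: processor 2 forwards the parked data to processor 1
]
final_avail = run_phases([1, 1, 100], parked_schedule)
```

Attempting the whole exchange directly in one phase with the same starting memory raises `ValueError`, since each of the two processors has room for only one unit.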
In our algorithm, we merely store data in a parking space, and then forward it to its correct destination when the destination processor has available memory. Note that it is inconsequential which processor owns the parked data. In other words, parking spaces are indistinguishable. What potentially affects performance is which processors shunt their data to parking space.
Lemma 5.2. It is sufficient to park data at most once to get an optimal solution.
Proof. Assume there exists a solution that parks some data $D$ twice. Let $p_1$ and $p_2$ be the first and second processors on which $D$ is parked. After data is moved from $p_1$ to $p_2$, if no other processor uses the available memory at $p_1$, then there was never a need to move data to $p_2$. If another processor $p_i$ parks data on $p_1$, then we can rearrange the data movement so that $D$ stays on $p_1$ and $p_i$ parks on $p_2$, due to the indistinguishability of parking spaces.
It is worth noting that parking is not just a heuristic but a requirement in some cases. Consider the example in Fig. 5, modified so that there is no available memory in the top two processors. In this case, the transfer request is still feasible, but realizing the remapping requires parking.
An Approximation Algorithm
In this section, we describe an algorithm that obtains a solution with at most 1.5 times the optimal number of phases. The algorithm is quite generic and allows for a number of possible enhancements.
Algorithm 5.1.
A processor receives as much data as it can in each phase (i.e., if a processor has available memory at the end of a phase then this processor does not have any more data to receive).
If the transfer request cannot be completed in the next phase then park as much data as possible (i.e., park the minimum of the total parkable data and the total parking space).
Note that many details about the algorithm are unspecified: If I have more incoming data than free memory, which messages should I receive in the current phase? If several processors want to park data, but limited parking spaces are available, which should succeed? We will show below that with any answers to these questions, the resulting algorithm generates a solution with no more than 1.5 times the optimal number of phases. Intelligent answers to these questions could be used to devise algorithms with better practical (or perhaps theoretical) performance.
It is enough to park data once due to Lemma 5.2; thus parked data is moved twice, and the total volume of data moved is $2T_p + T_d = T + T_p$, where $T_p$ is the volume of data that is parked and $T_d$ is the volume transferred directly. Because each parked unit of data enables at least one direct transfer, the algorithm guarantees that $T_p \le T_d$. Thus at most half of $T$ can be transferred through parking, i.e., $T_p \le T/2$.
Proof. The algorithm makes use of all $M$ units of available memory until the amount of parkable data is less than the amount of parking space. It then completes in at most two additional phases: one in which some data is parked, and a final phase in which each processor has enough memory to receive all its messages. By Lemma 5.3 we know that the total volume of data transferred in the algorithm is at most $\lceil 3T/2 \rceil$. With $M$ units of transfer in all but the last two phases, the process can be completed in at most $\lceil \ldots \rceil$ phases.
Combined with Lemma 5.1, Theorem 5.4 shows that Algorithm 5.1 is a $3/2$-approximation algorithm for the minimum phase remapping problem. Without a tighter lower bound, this value of $3/2$ is tight, as illustrated by the example in Fig. 6.
This example consists of an odd number of processors $P$. All but one of them are organized in pairs which exchange a single unit of data. Only the unpaired processor has a single unit of available memory. The total volume of data to be moved is $T = P - 1$. The only way for a pair to exchange their data is to first park a unit elsewhere, so a total of $(P-1)/2$ units of parking are needed. Hence, the total volume of data transferred is $3(P-1)/2 = 3T/2$, and the number of phases is $3T/(2M)$, since $M = 1$.
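The phase count for this example can be reproduced with a small simulation. The sketch below is an assumption-laden illustration rather than an implementation of Algorithm 5.1: the processor numbering (pairs at indices 0 to P-2, the unpaired processor last) and both function names are hypothetical, and the schedule built is simply one parking schedule consistent with the description.

```python
from typing import List, Tuple

Move = Tuple[int, int, int]  # (source, destination, volume)


def tight_example_schedule(P: int) -> List[List[Move]]:
    """Build a parking schedule for the example: processors 0..P-2 form pairs
    that swap one unit each, and processor P-1 holds the only free unit."""
    if P < 3 or P % 2 == 0:
        raise ValueError("P must be odd and at least 3")
    spare = P - 1
    phases = []
    for a in range(0, P - 1, 2):
        b = a + 1
        phases.append([(a, spare, 1)])  # park a's unit on the spare processor
        phases.append([(b, a, 1)])      # b's unit goes into the space freed on a
        phases.append([(spare, b, 1)])  # forward the parked unit to b
    return phases


def count_valid_phases(P: int) -> int:
    """Replay the schedule with one free unit (M = 1) and return its length."""
    avail = [0] * (P - 1) + [1]
    schedule = tight_example_schedule(P)
    for moves in schedule:
        # Every phase here holds a single move, so updating memory move by
        # move is the same as updating it phase by phase.
        for src, dst, vol in moves:
            if vol > avail[dst]:
                raise ValueError("memory constraint violated")
            avail[dst] -= vol
            avail[src] += vol
    return len(schedule)
```

For P = 7 the request has T = 6 and the replay uses 9 phases, which matches 3T/(2M) with M = 1, against the lower bound of 6 phases from Lemma 5.1.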
CONCLUSION
We studied the problem of moving large amounts of data among processors under memory constraints, which is required for applications where workload and associated data are periodically redistributed among processors. The problem arises when processors do not have enough memory to allocate space for their incoming data before releasing the space for the outgoing data. In this case, the remapping operation must be decomposed into phases so that processors free up memory for the data they shipped out at the end of a phase, making it available for the incoming data in the next phase. In this paper, we studied how to complete the remapping operation in a minimum number of phases, the problem we call minimum phase remapping. We showed that the problem of determining whether a given transfer can be completed in a specified number of phases is NP-Complete. A reduction of the minimum phase remapping problem to multi-commodity flow was presented. We showed how a continuous relaxation of the problem admits a simple solution with two more phases than that of the optimal solution, but it may be difficult to get a good discrete solution from this continuous one. Finally, we devised a practical approximation algorithm with a bound of 1.5 times the optimal solution.
We are currently implementing several of these approaches for use in the Zoltan dynamic load balancing tool [4]. We will report on our empirical comparisons in due course.
