Fully-populated torus-connected networks, where every node has a processor attached, do not scale well, since load on edges increases superlinearly with network size under heavy communication, resulting in a degradation in network throughput. In a partially-populated network, processors occupy a subset of available nodes, and a routing algorithm is specified among the processors placed.
Multistage network: A multistage network with $k \times k$ switches (routing nodes) and $\log_k n$ stages serves $n$ injection points, and utilizes $n \log_k n$ routing nodes [3].
In partially-populated tori, a routing algorithm which utilizes shortest paths is specified together with the placement. An optimal placement is a placement that achieves linear load on edges using the maximum number of processors possible.
The notion of resource placement in general has been investigated by a number of researchers such as Bose et al. [5], Alverson et al. [1], and F. Pitteli and D. Smitley [11]. Our aim is to give placements and routing algorithms which will enable efficient communication between processors, and at the same time reduce the susceptibility of the network to link faults by reducing the number of messages relying upon a particular edge [3]. This is achieved by providing routing algorithms in which the number of minimal paths specified between pairs of processors in the placement is kept large, without compromising the linearity of load.
Let $E_{\max}$ denote the maximum load over all the edges for the placement $P$. Blaum et al. give the lower bound
$$E_{\max} \ge \frac{|P| - 1}{2d} \qquad (1)$$
which means that for $d = 2$, $E_{\max}$ is at least about $|P|/4$, and for $d = 3$, $E_{\max}$ is at least about $|P|/6$. If $|P|$ is constrained to be of the form $k^i$, then they also give placements of sizes $k$ for $d = 2$ and $k^2$ for $d = 3$, together with routing algorithms. These placements are optimal in the sense that the two lower bounds are actually achieved by the placements.
How do we justify that in general a maximal size placement that can achieve linear load is $O(k^{d-1})$ [...] This seems to imply that linear load is at least possible for $|P| = ck^{d-1}$. This is a faulty argument, however, as we do not know a priori that the number of edges needed to split $P$ into two equal size pieces is the same as the bisection width of the whole torus. This may push the size of an optimal placement above or below $k^{d-1}$. In this paper, we introduce the concept of bisection width with respect to a placement $P$, and use its properties to prove that in a $d$-dimensional $k$-torus, the size of an optimal placement is $\Theta(k^{d-1})$. Given a placement $P$ of maximal size, we also prove that there exists an edge separator of size $\Theta(k^{d-1})$ which splits the torus into two components with $\Theta(k^{d-1})$ processors of $P$ on each side. This gives a lower bound of the form
$$E_{\max} \ge ck^{d-1} \qquad (2)$$
for maximum load. In (2), $c$ is a constant independent of $d$. This is a tighter lower bound for the load for large $d$ than the lower bound (1).
Finally, we give optimal placements (called linear placements) achieving the lower bound (2) and corresponding routing algorithms (Ordered Dimensional Routing (ODR) and Unordered Dimensional Routing (UDR)) in tori with an arbitrary number of dimensions. Of the two routing algorithms, ODR is simpler, but UDR provides fault tolerance by allowing more routes. We also show how to extend these to more general placements in tori that we refer to as multiple linear placements.
The outline of the paper is as follows. Section 2 gives necessary definitions and the formal statement of the problem. In section 3, a lower bound on the maximum load on an edge is given, which is also a generalization of the lower bound given by [3]. This bound, along with the notion of bisection width with respect to a placement, is used to get an upper bound on the number of processors in an optimal placement. We introduce the notion of ?separator with respect to a placement in section 4, and use it to give a new lower bound on the maximum load which is independent of the dimension parameter in section 5. Finally, in sections 6-8, we define and analyze an important class of placements called linear placements, and give associated routing algorithms which achieve linear load and fault tolerance. Section 9 includes conclusions and some future considerations.
Preliminaries and Problem Definition
In this section, we start out with the problem definition, and follow it by a sequence of formal definitions and terminology that will be used in the rest of the paper.
Problem Definition
Our aim is to find placements and associated routing algorithms in the $d$-dimensional $k$-torus $T_k^d$ that have linear message load (in the number of processors in the placement) on edges under the complete exchange scenario. Specifically, we would like to devise a placement $P$, and a routing algorithm $A$ for $P$, for which $E_{\max} = c|P|$ for some constant $c$.
Definition 1 {d-Dimensional k-Torus}
The $d$-dimensional $k$-torus is a directed graph $T_k^d = (V, E)$, with vertex set
$$V = \{\vec{a} \mid \vec{a} = (a_1, a_2, \ldots, a_d),\ a_i \in \mathbb{Z}_k\},$$
where $\mathbb{Z}_k$ denotes the integers modulo $k$, and edge set
$$E = \{(\vec{a}, \vec{b}) \mid \exists j \text{ such that } a_j \equiv b_j \pm 1 \pmod{k} \text{ and } a_i = b_i \text{ for } i \neq j,\ 1 \le i \le d\}.$$
$T_k^d$ has a total of $k^d$ nodes. Each node has two neighbors in each dimension, for a total of $2d$ neighbors. Directed edges of $T_k^d$ are also referred to as links.
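The definition can be made concrete with a minimal sketch that enumerates the nodes of the torus and the neighbors of a node. The function names `torus_nodes` and `torus_neighbors` are illustrative and not part of the paper, and the neighbor count of $2d$ assumes $k \ge 3$, so that the $+1$ and $-1$ neighbors in a dimension are distinct.

```python
from itertools import product

def torus_nodes(d, k):
    """All k**d nodes of the d-dimensional k-torus, as coordinate tuples."""
    return list(product(range(k), repeat=d))

def torus_neighbors(node, k):
    """Nodes that differ from `node` by +1 or -1 (mod k) in exactly one coordinate."""
    result = []
    for j, a_j in enumerate(node):
        for step in (1, -1):
            b = list(node)
            b[j] = (a_j + step) % k
            result.append(tuple(b))
    return result

print(len(torus_nodes(2, 5)))                 # 25
print(len(set(torus_neighbors((0, 0), 5))))   # 4
```

Each pair (node, neighbor) produced this way corresponds to one directed link of the torus.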
Definition 2 {Placement}
A placement $P$ of processors in $T_k^d = (V, E)$ is a subset of $V$. We use the term node for a generic element of the vertex set of $T_k^d$. A node with a processor attached is simply called a processor.
Definition 3 {Routing Algorithm}
Let $P$ be a placement in $T_k^d$. A routing algorithm $A$ is a subset $C^A_{\vec{p} \to \vec{q}}$ of the set of all shortest paths between $\vec{p}$ and $\vec{q}$ for every pair $\vec{p}, \vec{q} \in P$ (see Figure 1).
The routing algorithm $A$ is used to deliver packets from $\vec{p}$ to $\vec{q}$: when $\vec{p}$ needs to communicate with $\vec{q}$, a shortest path in $C^A_{\vec{p} \to \vec{q}}$ is selected randomly with uniform probability.
For any link $l$, we denote the set of paths in $C^A_{\vec{p} \to \vec{q}}$ going through $l$ by $C^A_{\vec{p} \to l \to \vec{q}}$, and use the following definition of load as given in [3].
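Definition 4 {Load}
Given a placement $P$ in a $T_k^d$ along with a routing algorithm $A$, the load of an edge $l$ is defined as
$$E(l) = \sum_{\vec{p}, \vec{q} \in P} \frac{|C^A_{\vec{p} \to l \to \vec{q}}|}{|C^A_{\vec{p} \to \vec{q}}|}. \qquad (3)$$
Definition 5 {Maximum Load}
The maximum value of $E(l)$ for a network with placement $P$ and a routing algorithm $A$ is called the maximum load and denoted by $E_{\max}$. Thus $E_{\max} = \max_{l \in E} E(l)$.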
Considering the expression (3) for $E(l)$, the more paths the routing algorithm provides between any two processors, the smaller the load on any edge that is used to route messages between these processors. In addition to this, availability of a large number of choices means better fault tolerance. We shall consider algorithms which use minimal (shortest) paths. Minimal paths are associated with the notion of cyclic distance and Lee distance, which we define next.
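The effect of multiple paths on load can be illustrated with a small sketch. It assumes that the load of a link is the expected number of messages crossing it under complete exchange when each ordered pair picks one of its specified paths uniformly at random, so that a pair with $m$ specified paths contributes $1/m$ to every link on each of those paths. The function `edge_loads` and the dictionary layout of `paths` are hypothetical choices made for this example.

```python
from collections import defaultdict
from fractions import Fraction

def edge_loads(paths):
    """Load per directed link.

    `paths` maps an ordered pair (p, q) to the list of paths specified for it;
    each path is a list of nodes from p to q.
    """
    load = defaultdict(Fraction)
    for options in paths.values():
        share = Fraction(1, len(options))   # uniform choice among the paths
        for path in options:
            for link in zip(path, path[1:]):
                load[link] += share
    return load

# Two processors at nodes 0 and 2 of a ring with k = 4: each direction
# has two shortest paths, so every link used carries load 1/2.
paths = {
    (0, 2): [[0, 1, 2], [0, 3, 2]],
    (2, 0): [[2, 1, 0], [2, 3, 0]],
}
loads = edge_loads(paths)
print(max(loads.values()))  # 1/2
```

With only one path per pair in the same example, the links on that path would carry load 1 instead of 1/2.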
Definition 6 {Cyclic Distance, Lee Distance}
Given three integers $i$, $j$ and $k$, the cyclic distance between $i$ and $j$ modulo $k$ is given by $\min\{i - j \pmod{k},\ j - i \pmod{k}\}$, where the equivalence classes modulo $k$ are taken to be $0, 1, \ldots, k - 1$.
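A short sketch of both distances follows. The cyclic distance is taken directly from the definition above. For the Lee distance the sketch assumes the standard definition, namely the sum of the coordinate-wise cyclic distances between two nodes; the function names are illustrative.

```python
def cyclic_distance(i, j, k):
    """min{(i - j) mod k, (j - i) mod k}, with residues taken in 0..k-1."""
    return min((i - j) % k, (j - i) % k)

def lee_distance(a, b, k):
    """Sum of coordinate-wise cyclic distances (standard Lee distance, assumed here)."""
    return sum(cyclic_distance(x, y, k) for x, y in zip(a, b))

print(cyclic_distance(1, 6, 8))          # 3
print(lee_distance((0, 1), (7, 6), 8))   # 4
```

Under this assumption, the Lee distance between two nodes equals the number of links on a shortest path between them in the torus.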
Definition 7 {Bisection Width}
The bisection width of a graph is the minimum number of edges which must be removed in order to split the node set into two parts of equal (within one) cardinality.
Definition 8 {Bisection Width with respect to a Placement}
The bisection width with respect to a placement $P$ of $T_k^d = (V, E)$ is the minimum number of edges which must be removed from $E$ in order to split $V$ into two parts, each of which contains an equal (within one) number of processors in $P$.
We denote by $\partial_b P$ a minimal cardinality set of edges of $T_k^d$ which needs to be removed to bisect $P$. Thus the bisection width of $T_k^d$ with respect to $P$ is $|\partial_b P|$.

Another important issue is fault tolerance. Specifically, the routing algorithm should provide multiple routing paths between each pair of processors so that, if any of the links fails, the network will remain functional by routing the messages through paths which do not include the defective link. Consequently we also address the following problem: is it possible to construct optimal placements which are at the same time fault tolerant?
In the following sections, we analyze lower bounds for maximum load and study the above questions.

Figure 1: [...] Among the links, the ones on specified shortest paths between the processors are highlighted.
A General Lower Bound for Maximum Load
We start out with an important lemma which will prove to be a very useful tool in the subsequent sections. The lower bound for maximum load originally given by Blaum et 
The following lemma gives a more general form of (4).
Lemma 1 Let $P$ be a placement in a $T_k^d = (V, E)$. Also, let $S \subseteq P$ and let $\partial S$ be the set of all edges each connecting a node in $S$ with another node not in $S$. Then
$$E_{\max} \ge \frac{2|S|(|P| - |S|)}{|\partial S|}. \qquad (5)$$
Proof The total number of messages exchanged between processors in $S$ and processors in $P - S$, in either direction, is $2|S|(|P| - |S|)$ under the all-to-all personalized communication scenario. Each of these messages must go through one of the edges in $\partial S$. The average number of messages going through an edge in $\partial S$ is therefore at least $2|S|(|P| - |S|)/|\partial S|$. Since the maximum load is at least this average, the lemma follows. □
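As a quick numerical check of the bound in Lemma 1, the following sketch evaluates the right-hand side of (5) for two choices of $S$. The function name `lemma1_bound` and the sample sizes (a placement of 9 or 16 processors, a cut of 24 edges) are hypothetical values chosen only for illustration.

```python
from fractions import Fraction

def lemma1_bound(placement_size, subset_size, boundary_size):
    """Right-hand side of (5): 2|S|(|P| - |S|) / |dS|."""
    return Fraction(2 * subset_size * (placement_size - subset_size), boundary_size)

# A single processor in dimension d has 4d boundary links (2d in, 2d out),
# which gives (|P| - 1) / (2d).
d, size = 2, 9
single = lemma1_bound(size, 1, 4 * d)
print(single == Fraction(size - 1, 2 * d))   # True

# Half of the processors on each side of a cut gives |P|^2 / (2 |dS|).
half = lemma1_bound(16, 8, 24)
print(half == Fraction(16 ** 2, 2 * 24))     # True
```

Exact fractions are used so that the two special cases can be compared without rounding.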
It is easy to see that (5) reduces to (4) if the set $S$ is taken to contain only one processor, i.e., $|S| = 1$ and $|\partial S| = 4d$. The lower bound (5) is valid independent of the routing algorithm used. Another interesting form of (5) that we shall subsequently make use of is obtained when the set $S$ consists of half of the processors in $P$, i.e.,
$$E_{\max} \ge \frac{|P|^2}{2|\partial_b P|}. \qquad (6)$$
Note that in this case $\partial S$ becomes $\partial_b P$, whose size is the bisection width of $T_k^d$ with respect to placement $P$. Next we give an upper bound on the size of $\partial_b P$, which we then use to calculate the maximum number of processors an optimal placement can contain.

As an example, two copies of the complete graph $K_{2n}$ on $2n$ nodes joined by a single edge have bisection width 1. Their subgraphs with $2n$ nodes have bisection widths ranging from 1 to $\Theta(n^2)$, depending on how evenly the $2n$ nodes are distributed among the two copies of $K_{2n}$.
Maximum Placement Size
An upper bound for the maximum number of processors an optimal placement can contain can now be obtained by substituting the bound for $|\partial_b P|$ given in the corollary into the inequality (6), while at the same time ensuring that $E_{\max} = O(|P|)$, i.e. the load remains linear in the number of processors in the placement.
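The substitution can be sketched as follows. The derivation assumes the bound $|\partial_b P| \le 6dk^{d-1}$ attributed to Corollary 1, takes $|P|$ even for simplicity, and writes the linear-load requirement as $E_{\max} \le c|P|$, where $c$ is an illustrative constant not fixed by the text. The middle step is Lemma 1 applied with $|S| = |P|/2$.

```latex
\begin{align*}
c\,|P| \;\ge\; E_{\max}
  \;\ge\; \frac{2 \cdot \frac{|P|}{2} \cdot \frac{|P|}{2}}{|\partial_b P|}
  \;=\; \frac{|P|^2}{2\,|\partial_b P|}
  \;\ge\; \frac{|P|^2}{12\,d\,k^{d-1}}
\quad\Longrightarrow\quad
|P| \;\le\; 12\,c\,d\,k^{d-1}.
\end{align*}
```

Under these assumptions, for fixed $d$ a placement with linear load contains $O(k^{d-1})$ processors.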
-Separator Width with respect to a Placement
From Corollary 1, we know that the bisection width of $T_k^d$ with respect to a placement $P$ is no larger than $6dk^{d-1}$. The lower bound on maximum load that one can obtain using this result is a function of dimension $d$, however. In this section we show that given a placement $P$ on $T_k^d$ [...]
Remark
We would like to point out that linear (and multiple linear) placements themselves do not guarantee the linearity of the load on edges. Linear only refers to the fact that the coordinates of the processors in the placement satisfy a linear equation over $\mathbb{Z}_k$. We still need to construct routing algorithms which enable communication between pairs of processors in a way that yields load that is linear in $|P|$.
In the remaining sections, we will specify different routing algorithms and analyze their maximum communication load on edges. As we have mentioned earlier, the routing algorithms will use minimal (shortest) paths between processors. To deliver a message from processor $\vec{p}$ to $\vec{q}$, the value of $\vec{p}$ in each dimension is "corrected" towards the corresponding value in $\vec{q}$ by the amount and direction ($\pm$) dictated by the shortest cyclic distance between the values in that dimension. The exact way of correcting the dimensions to route the packets is specified by the routing algorithm.
We consider two classes of routing algorithms and the analysis of the load in each case both for linear and multiple linear placements: Ordered Dimensional Routing (ODR) and Unordered Dimensional Routing (UDR).
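The correction of a single dimension can be sketched as below. The helper names `correction` and `correct_dimension` are hypothetical, and the tie-breaking rule (when $k$ is even and the two directions are equally long, move in the increasing direction) is an assumption of this sketch.

```python
def correction(p_i, q_i, k):
    """Return (steps, direction) for moving p_i to q_i the short way around Z_k.

    direction is +1 or -1; ties are broken toward +1 (assumption).
    """
    forward = (q_i - p_i) % k
    backward = (p_i - q_i) % k
    if forward <= backward:
        return forward, 1
    return backward, -1

def correct_dimension(node, j, target, k):
    """Nodes visited while coordinate j of `node` is corrected to `target`."""
    steps, direction = correction(node[j], target, k)
    current = list(node)
    visited = []
    for _ in range(steps):
        current[j] = (current[j] + direction) % k
        visited.append(tuple(current))
    return visited

print(correction(0, 4, 6))                     # (2, -1)
print(correct_dimension((0, 3), 0, 4, 6))      # [(5, 3), (4, 3)]
```

The number of steps returned by `correction` is the cyclic distance between the two coordinate values, so correcting every dimension this way yields a shortest path.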
The Ordered Dimensional Routing Algorithm (ODR)
The algorithm is simple. Given a placement $P$ on $T$ [...]

Note that if $k$ is odd, $|C^{ODR}_{\vec{p} \to \vec{q}}| = 1$, i.e. there is only one path specified by the ODR algorithm for any given $\vec{p}$ and $\vec{q} \in P$. However, when $k$ is even the ODR algorithm may result in multiple paths between some pairs of processors in the placement. To aid in the analysis, we will use the following (restricted) version, which ensures the existence of only one canonical routing path between any given pair of processors regardless of the parity of $k$. Thus if there are two choices for some $p_i, q_i$ coordinate pair, the algorithm routes through $p_i + 1 \pmod{k}$, $p_i + 2 \pmod{k}$, ..., $q_i$.

The shortcoming of having only one path between a pair of processors is the lack of fault-tolerance in the network. Specifically, if an edge over which a pair of processors communicate fails, then the pair will no longer be able to exchange messages. In section 8 we look at another routing algorithm which does not suffer from this limitation.

Load Analysis for Linear Placements with ODR

We will count pairs of processors which communicate using $l$. Let $\vec{p}$ [...] (8) and (9). These conditions affect the choices of $p_s$ and $q_s$. A more accurate expression (though of the same order) can be obtained by paying closer attention to these parameters. To determine the number of different ways $p_s$ and $q_s$ may be chosen, consider the 1-dimensional $k$-subtorus (ring) on which the edge $l$ lies. Assume first that $k$ is even. Without loss of generality, also assume that the nodes in the ring are enumerated from $0$ to $k - 1$ such that $i_s = k/2 - 1$. Then the ODR algorithm will use $l$ to deliver messages from node $0$ to only node $k/2$ on this ring. Similarly, it will use edge $l$ for messages from node $1$ to node $k/2$ and from node $1$ to node $k/2 + 1$, and so on. Messages from node $k/2 - 1$ ($= i_s$) can be sent using $l$ to any node indexed $k/2$ to $k - 1$. [...] Therefore the number of solutions to equations (8) and (9) [...] (10) is no more than $tk^{s-1}$.

Similarly, the number of solutions to equations (11) [...]

We mentioned in section 7 that the ODR algorithm suffers from lack of fault-tolerance, since there is only one path between each pair of processors. In this section, we introduce Unordered Dimensional Routing (UDR), which eliminates this problem. The algorithm is as follows:

    To route a packet from $\vec{p} = (p_1, p_2, \ldots, p_d)$ to $\vec{q} = (q_1, q_2, \ldots, q_d)$, both in $P$:
    for i := 1 to d do
    begin
        Select a number j from the set {1, 2, ..., d} that has not been used before;
        Correct p_j in the direction of shortest cyclic distance
    end

As was the case in ODR, a dimension is corrected completely before another is selected. Unlike ODR, however, the order in which the dimension to be corrected next is picked is arbitrary. This algorithm thus provides multiple paths for each pair of processors and improves the fault-tolerance of the system. If $\vec{p}$ and $\vec{q}$ are two processors differing in $s$ dimensions, then there will be $s!$ different paths from $\vec{p}$ to $\vec{q}$ in UDR, i.e. $|C^{UDR}_{\vec{p} \to \vec{q}}| = s!$. Next we show that the UDR algorithm results in linear load on edges.
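The path count can be checked with a minimal sketch that enumerates UDR paths by trying every order of the differing dimensions. It rests on several assumptions that are not taken from the text: the names `walk`, `udr_paths` and `odr_path` are hypothetical, ties in a dimension are broken in the increasing direction, and ODR is modeled as correcting the dimensions in the fixed order $1, 2, \ldots, d$.

```python
from itertools import permutations

def walk(node, j, target, k):
    """Nodes visited while coordinate j is corrected the short way to `target`."""
    forward = (target - node[j]) % k
    backward = (node[j] - target) % k
    steps, direction = (forward, 1) if forward <= backward else (backward, -1)
    current, visited = list(node), []
    for _ in range(steps):
        current[j] = (current[j] + direction) % k
        visited.append(tuple(current))
    return visited

def route(p, q, k, order):
    """Path from p to q that fully corrects the dimensions in the given order."""
    path = [tuple(p)]
    for j in order:
        path.extend(walk(path[-1], j, q[j], k))
    return tuple(path)

def udr_paths(p, q, k):
    """All UDR paths: one per ordering of the dimensions in which p and q differ."""
    differing = [j for j in range(len(p)) if p[j] != q[j]]
    return {route(p, q, k, order) for order in permutations(differing)}

def odr_path(p, q, k):
    """Single path obtained with the fixed dimension order (assumed ODR order)."""
    return route(p, q, k, range(len(p)))

paths = udr_paths((0, 0, 0), (1, 2, 3), 5)
print(len(paths))                                   # 6
print(odr_path((0, 0, 0), (1, 2, 3), 5) in paths)   # True
```

Here the two processors differ in $s = 3$ dimensions, and the sketch finds $3! = 6$ distinct paths, one of which is the path produced by the fixed order.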
Load Analysis for Linear Placements with UDR
For a linear placement $P$ which uses the UDR algorithm, the load on an edge $l$ is
$$E(l) = \sum_{\vec{p}, \vec{q} \in P} \frac{|C^{UDR}_{\vec{p} \to l \to \vec{q}}|}{|C^{UDR}_{\vec{p} \to \vec{q}}|}.$$
Since there exist some pairs of processors for which $|C^{UDR}_{\vec{p} \to \vec{q}}| > 1$, we have
$$E(l) < \sum_{\vec{p} \in P,\ \vec{q} \in P} |C^{UDR}_{\vec{p} \to l \to \vec{q}}|.$$
The upper bound on the right hand side of this inequality specifies the number of messages sent between pairs of processors which could "potentially" route their messages through $l$. Such processors can also use other paths that do not include $l$, since the UDR algorithm provides multiple routing paths.
Theorem 4 [...]

Following the work of Blaum, Bruck, Pifarré, and Sanz [3, 4], we have considered communication in partially-populated torus networks in terms of placements of processors and associated routing algorithms. We have provided lower bounds for the maximum load under the all-to-all communication scenario, and found bounds on the size of an optimal placement. We have shown that arbitrary placements can be bisected by removing a set of edges of the same order as the bisection width of the torus. We then provided optimal placements of size $\Theta(k^{d-1})$ on the $d$-dimensional $k$-torus using what we call linear and multiple linear placements, and gave load analyses of each under two different routing algorithms. There are some interesting combinatorial properties of placements still to be resolved. Among these are the characterization of optimal placements in terms of restrictions to subtori and an extensive analysis of the properties of edge separators of tori relative to optimal placements.
Appendix I
In the following proposition we treat $T_k^d$ as an undirected graph, rather than a directed one. Thus each edge below translates into a pair of edges of the directed case. Consider the collection of hyperplanes $H_t$ defined as above using a transcendental number in the range 1 < < [...] Now as soon as the interval ( ?1 n ; 9 8 ?1 n ) has length larger than 2 it must contain an integer. Therefore we can construct a sequence $k_1 < k_2 < \cdots$ such that L kn 2 3 k n ? 1 :
i.e. the lengths of the middle intervals for this subsequence satisfy (17). But in row $k$ of the array $b_{k,i}$, $1 \le i \le k$, an interval of consecutive elements whose length is $2/3$ the total length necessarily contains the central element of the row. This means that if we bisect the rows by taking = 1=2, then [...] for $k \in \{k_1, k_2, \ldots\}$. Therefore writing $S_k = A_k + B_k$ with A k = b k;1 +b k;2 + +b k; 1 2 k , both $A_k, B_k > \frac{1}{8}$ for $k \in \{k_1, k_2, \ldots\}$. Since both sequences are bounded above by $c_2$, $A_k = \Theta(1)$ and $B_k = \Theta(1)$ as desired.

Now assume that S k 1 on some infinite set $k^{(1)} < k^{(2)} < \cdots$, and not necessarily for all $k$. The crucial point in the above argument is the fact that 1 n=1 ( ?1 n ; 9 8 ?1 n ) = ( ?1 1 ; 1) for appropriate n . We can extract a subsequence of $\{k^{(1)}, k^{(2)}, \ldots\}$ just as we constructed $\{k_1, k_2, \ldots\}$ from $\{1, 2, \ldots\}$. On this subsequence = [...] This provides the decomposition into subtori required for the conclusion of Theorem 1.
