Two- and three-dimensional k-tori are among the most used topologies in the designs of parallel computers. Topology of the network, oblivious and adaptive routers, processor-network interfaces, and tolerance to faults are some of the key concerns of parallel computer architects.
In recent years, hypercubes, meshes, tori, cube-connected cycles, fat-trees, shuffle-exchanges, and the wide class of multistage interconnection networks have been extensively used, among many others [7]. Although the hypercube has been a very popular architecture, 2- and 3-dimensional k-tori are two of the most used topologies in new designs of parallel computers. The renewed interest in these topologies stems both from their low degree and their scalability. Also, these networks have topologies amenable to natural two- and three-dimensional layouts and seem to adapt well to the presence of faults [2] [3].
Throughput is a widely used measure of network performance, i.e., the ability of the network to manage the maximum possible message traffic. A well-known bound on the throughput of a network is based on the bisection of the network and its message-traffic characteristics [4]. The bisection bandwidth of a network is the minimum number of links that must be cut in order to divide the network into two roughly equal halves. Multistage networks accomplish their throughput by providing a significant surplus of routing nodes in comparison to points where processors inject messages into the network. A multistage network with $k \times k$ switches (routing nodes) and $\log_k n$ stages serves $n$ injection points, has a bisection bandwidth of $2n$ (when directed links are considered), and uses $n \log_k n$ routing nodes. In this respect, multistage networks can be regarded as partially-populated networks, in the sense that only a subset of routing nodes are targets of message injection by processors.
On the other hand, networks such as tori, meshes, and hypercubes have been designed and/or built where the number of routing nodes is equal to the number of processors [5]. Hence, these networks have been used as fully-populated networks, in the sense that every routing node in the topology is subjected to message injection. Fully-populated tori and meshes exhibit a theoretical throughput which degrades as the network size increases. Note that the bisection bandwidth of a $d$-dimensional $k$-torus is $4k^{d-1}$ when directed links are considered. In a fully-populated torus, there would be $k^{2d}/2$ messages passing through the bisection, assuming that every pair of processors is communicating simultaneously. This means that there is an edge in the bisection with load at least $k^{d+1}/8$. This is in comparison to a load of $n$ in an $n$-processor multistage network. In fact, the increased load in the case of tori networks has led researchers in [1] and [10] to consider a smaller number of processors when compared to routing nodes in a 3-D mesh type network. The main question we address in this paper is: can we achieve a load in a torus similar to the one in a multistage interconnection network? Namely, can we achieve a linear load in partially-populated tori networks?
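The arithmetic behind the $k^{d+1}/8$ figure can be checked with a short Python sketch. The function name `torus_bisection_load` is hypothetical and introduced only for illustration; the two quantities it divides are the ones quoted in the text (bisection bandwidth $4k^{d-1}$ and $k^{2d}/2$ crossing messages).

```python
from fractions import Fraction

def torus_bisection_load(k, d):
    """Average load per bisection link of a fully-populated d-dimensional k-torus.

    Uses the counts quoted in the text: 4*k**(d-1) directed bisection links
    and k**(2*d)/2 messages crossing the bisection.
    """
    bisection_links = 4 * k ** (d - 1)
    crossing_messages = Fraction(k ** (2 * d), 2)
    # Some link carries at least the average, which equals k**(d+1)/8.
    return crossing_messages / bisection_links

if __name__ == "__main__":
    for k, d in [(8, 2), (16, 2), (8, 3)]:
        print(k, d, torus_bisection_load(k, d), Fraction(k ** (d + 1), 8))
```

The load grows as $k^{d+1}$ while the number of processors grows as $k^d$, which is the superlinear growth the text contrasts with multistage networks.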
Introducing slackness in fully-populated tori, i.e., reducing the number of processors, and studying optimal routing strategies for the resulting interconnections are the central subjects of this paper. The key concept that we study is the placement of the processors in a network. Namely, a placement is the subset of the nodes in the interconnection network that is attached to processors. In addition, given a placement $P$ we need to define the routing method between arbitrary pairs of processors in $P$. We assume that the routing consists of only minimal length paths. Our goal is to find $P$, a placement of processors, in a $d$-dimensional $k$-torus, such that $|P| = k^i$ with $i \le d$ being as large as possible while the maximal load on the links is linear in $|P|$ (as in the case of multistage networks). Our main contribution is the construction of optimal placements of size $k$ and $k^2$ and their corresponding routing algorithms for the cases $d = 2$ and $d = 3$, respectively.
In Section 2, we give some formal definitions and notation, introduce a simple lower bound, and present an optimal placement together with a routing algorithm for the case of 2-dimensional tori. In Section 3 we present an optimal placement and routing algorithm for 3-dimensional tori networks.
Preliminaries and the 2-Dimensional Case
In this section we formalize the processor placement problem and present an optimal solution for the 2-dimensional case as well as a lower bound. The optimal placement for the three-dimensional case is presented in the next section.
Problem Definition
We model an interconnection network of a parallel computer as a directed graph where the nodes represent switches and the directed edges represent links connecting the switches. In particular, we are interested in tori networks. Notice that the links connect each node of the torus with its neighboring 2d nodes.
We assume that a subset of the nodes in the interconnection network has processors attached to them. We call this subset a placement. Formally,
Definition 2 A placement of processors in a d-dimensional k-torus is a subset of nodes,
Given a placement P, we need to define the routing algorithm between arbitrary pairs of processors in P. In this paper we will assume that the routing occurs only through minimal length paths.
Definition 3 Let $P$ be a placement. Let $\vec{p}$ and $\vec{q}$ be a pair of processors in $P$. The routing algorithm to transmit packets from $\vec{p}$ to $\vec{q}$ will use a set of minimal length (shortest) paths, by choosing them uniformly at random from this set. Let $C_{\vec{p} \to \vec{q}}$ be the set of minimal length paths from $\vec{p}$ to $\vec{q}$ given by the routing algorithm. Let $l$ be a link in the network; then denote by $C_{\vec{p} \to l \to \vec{q}} \subseteq C_{\vec{p} \to \vec{q}}$ the set of paths from $\vec{p}$ to $\vec{q}$ through link $l$.
The main goal of this paper is to find a "good" method for placing processors in tori interconnection networks. The key concept for defining the quality of a placement is the load of a link.
Definition 4 Given a placement $P$ with a routing algorithm, the load of link $l$ is:
We denote by $C(N, P)$ the maximum value of $\mathcal{E}(l)$ for a network $N$ with a placement $P$. Namely, $C(N, P) = \max_{l \in N} \mathcal{E}(l)$.
Our goal is to find $P$, a placement of processors, in a $d$-dimensional $k$-torus $N$, such that $|P| = k^i$ with $i \le d$ as large as possible, while $C(N, P)$ is linear in $|P|$. This will provide a way to scale up the size of the network. Our main contribution is the construction of optimal placements for the cases $d = 2$ and $d = 3$. In particular, we present placements that achieve the following lower bounds: for $d = 2$, $C(N, P) \approx |P|/4$ and for $d = 3$, $C(N, P) \approx |P|/6$.
A Lower Bound
In this section, we present a lower bound for $C(N, P)$, $N$ being a $d$-dimensional $k$-torus.
Lemma 1 Let $P$ be a placement in a $d$-dimensional $k$-torus $N$ together with a routing algorithm. Then $C(N, P) \ge (|P| - 1)/(2d)$.
Notice that an optimal placement in a $d$-dimensional $k$-torus cannot contain more than $k^{d-1}$ nodes. The reason is that if the placement has $k^{d-1} + 1$ nodes, at least two of them lie on the same cycle (they have $d - 1$ identical coordinates). Then, by an argument similar to the one in Lemma 1, there is a link with load higher than the average load. Hence, in the rest of the paper we focus on constructing optimal-load placements of maximum size (i.e., $|P| = k^{d-1}$).
The Two-Dimensional Case
In this section we describe an optimal placement for the 2-dimensional $k$-torus. The placement is optimal in the sense that it achieves the lower bound for $C(N, P)$ given by Lemma 1.
The placement that we use is $P = \{(i, i) \mid 0 \le i \le k - 1\}$, so $|P| = k$. The routing algorithm associated with $P$ is very simple. The dimensions are corrected in any order, but once the message begins correcting one dimension, it must finish that dimension before correcting the other one.
Let us be more specific. Given two integers $m$ and $n$, the cyclic distance between $m$ and $n$ modulo $k$ is given by $\min\{(m - n) \bmod k,\ (n - m) \bmod k\}$.
The Lee distance between two nodes $\vec{a}$ and $\vec{b}$ in the $d$-dimensional $k$-torus is the sum of the cyclic distances between the coordinates [8]. The Lee distance represents the length of the shortest path in the $d$-dimensional $k$-torus between two nodes.
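The two distances just defined translate directly into code. The following is a minimal Python sketch; the function names `cyclic_distance` and `lee_distance` are chosen here for illustration, and nodes are assumed to be represented as tuples of integers in the range 0 to k-1.

```python
def cyclic_distance(m, n, k):
    """Cyclic distance between integers m and n modulo k."""
    return min((m - n) % k, (n - m) % k)

def lee_distance(a, b, k):
    """Lee distance between nodes a and b of a k-torus.

    a and b are equal-length tuples of coordinates; the result is the sum of
    the coordinate-wise cyclic distances, i.e. the shortest-path length.
    """
    return sum(cyclic_distance(x, y, k) for x, y in zip(a, b))

if __name__ == "__main__":
    print(cyclic_distance(1, 6, 7))
    print(lee_distance((0, 0), (3, 3), 5))
```

Python's `%` operator returns a non-negative result for a positive modulus, so no extra normalisation of negative differences is needed.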
Consider now the 2-dimensional $k$-torus. Assume that we have two processors in $P$, for instance $\vec{a} = (a, a)$ and $\vec{b} = (b, b)$, and we want to send a packet from $\vec{a}$ to $\vec{b}$. We choose a dimension at random, say dimension 0, and we correct that dimension first. This means that we move from $(a, a)$ to $(a, b)$ along a path with minimal cyclic distance. We then move from $(a, b)$ to $(b, b)$, also along a path with minimal cyclic distance. We observe that if $k$ is odd, the algorithm above gives only two minimal paths between two different processors.
If $k$ is even, we may have more than two minimal paths since, when the cyclic distance in one dimension is exactly $k/2$, we have more than one choice to correct that distance. Stating the discussion above formally, we have the following algorithm:
Algorithm 1 Consider the placement $P$ in a 2-dimensional $k$-torus. Assume that we want to send a packet from processor $\vec{a}$ to processor $\vec{b}$. Then proceed as follows:
1. Correct either dimension 0 or dimension 1 through the shortest cyclic distance.
2. Correct the dimension 0 or 1 not corrected in the previous step, also through the shortest cyclic distance.
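The two steps above can be sketched in Python to compute link loads exhaustively for small k. This is an illustrative sketch under stated assumptions, not the authors' implementation: the load of a link is taken here to be the expected number of routes through it when every ordered pair of processors exchanges one packet and each packet picks one of its allowed minimal paths uniformly at random. All function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def directions(src, dst, k):
    """Unit steps (+1 or -1) that realise the cyclic distance from src to dst."""
    fwd = (dst - src) % k
    if fwd == 0:
        return []
    if 2 * fwd < k:
        return [1]
    if 2 * fwd > k:
        return [-1]
    return [1, -1]  # distance exactly k/2: both ways are minimal

def leg_links(node, dim, target, step, k):
    """Directed links used while moving coordinate `dim` of `node` to `target`."""
    links, cur = [], tuple(node)
    while cur[dim] != target:
        nxt = list(cur)
        nxt[dim] = (cur[dim] + step) % k
        nxt = tuple(nxt)
        links.append((cur, nxt))
        cur = nxt
    return links, cur

def algorithm1_paths(a, b, k):
    """All paths Algorithm 1 may take from (a, a) to (b, b), as lists of links."""
    paths = []
    for first, second in ((0, 1), (1, 0)):          # either dimension first
        for s1, s2 in product(directions(a, b, k), repeat=2):
            l1, corner = leg_links((a, a), first, b, s1, k)
            l2, _ = leg_links(corner, second, b, s2, k)
            paths.append(l1 + l2)
    return paths

def max_link_load(k):
    """Maximum link load of the diagonal placement under the assumed load model."""
    load = defaultdict(Fraction)
    for a, b in product(range(k), repeat=2):
        if a == b:
            continue
        paths = algorithm1_paths(a, b, k)
        for path in paths:
            for link in path:
                load[link] += Fraction(1, len(paths))
    return max(load.values())

if __name__ == "__main__":
    for k in (3, 5, 7):
        print(k, max_link_load(k), Fraction(k - 1, 4))
```

Exact rational arithmetic via `Fraction` avoids floating-point noise when comparing the computed maximum with $(|P| - 1)/4$.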
We want to show next that Algorithm 1 is optimal in the sense that $C(N, P) = (|P| - 1)/4$, i.e., the lower bound on $C(N, P)$ given by Lemma 1 is achieved. Without loss of generality, consider a link $l$ in dimension 0. As stated before, $l$ is contained in a unique cycle and there is exactly one processor in the cycle to which the link belongs. Let $\vec{f}$ be such a processor. We denote by $s$ the cyclic distance between $\vec{f}$ and whichever of the two endpoints of $l$ is closer to it. The next lemma gives the value of $\mathcal{E}(l)$ for such a link.
Lemma 2 Consider placement $P$ in the 2-dimensional $k$-torus. Let $l$ be a link in, say, dimension 0, let $\vec{f} \in P$ be the processor in the cycle to which $l$ belongs, and let $l$ be at (cyclic) distance $s$ from $\vec{f}$. Then 
The Three Dimensional Case
In this section, we present an optimal routing algorithm for a particular regular placement in the 3-dimensional k-torus, which is an extension of the placement considered before. The first option to consider for the routing algorithm in the 3-dimensional case is the straightforward extension of Routing Algorithm 1. However, it turns out that this approach results in a link load that is not optimal. The key observation is that a path between two processors that goes through a third processor, say p, increases the load on the links adjacent to p beyond the optimal link load (this follows from the lower bound argument in Lemma 1). This can happen in the 3-dimensional case: any two processors that are not in the same plane determine a cube, within which six minimal paths can be used. We overcome this difficulty by presenting a routing algorithm for the shifted diagonal placement (defined below) that avoids going through processors and results in an optimal link load.
Definition 5 We call a placement P in a 3-dimensional k-
Next, we present an optimal minimal routing algorithm for the shifted diagonal placement. Essentially, the algorithm starts by correcting one of the three possible dimensions in the direction of minimal cyclic distance. After it corrects this first dimension, it corrects a second one (also in the direction of minimal cyclic distance) as long as, by doing so, it does not encounter a processor. The key property of the algorithm is that it will not encounter a processor in at least one of the remaining two dimensions. Finally, the algorithm corrects the remaining dimension. Namely, the following algorithm has the property that any pair of processors in the network can communicate without passing over another processor.
Algorithm 2
Assume that we have a 3-dimensional $k$-torus with a shifted diagonal placement, and we want to route a message from processor $\vec{a} = (a_2, a_1, a_0)$ to processor $\vec{b} = (b_2, b_1, b_0)$. Let $I$ be the set $I = \{i \in \{0, 1, 2\} : a_i \ne b_i\}$ (notice that either $|I| = 2$ or $|I| = 3$). Then proceed as follows (whenever we correct a dimension, we do it through the path of minimal cyclic distance):
1. Choose $j \in I$ and correct dimension $j$.
2. If $|I| = 2$ then correct dimension $l \in I - \{j\}$ and stop. Else, choose $i \in I - \{j\}$.
3. If, when correcting dimension $i$, the path does not pass over a processor, then correct dimension $i$, correct dimension $t \in I - \{j, i\}$, and stop. Else, correct dimension $t \in I - \{j, i\}$.
4. Correct dimension $i$ and stop.
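The steps of Algorithm 2 can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation. The definition of the shifted diagonal placement is not legible in this text, so the example uses a hypothetical stand-in (`assumed_placement`): all nodes whose coordinate 0 equals the sum of the other two modulo k. The helper names, the `order` parameter that fixes the otherwise free choices of j and i, and the forward tie-break at distance k/2 are assumptions as well. Tuples are indexed by dimension number, so `node[d]` is the coordinate in dimension d.

```python
from itertools import product

def assumed_placement(k):
    """Hypothetical stand-in for the shifted diagonal placement (k*k processors)."""
    return {((x1 + x2) % k, x1, x2) for x1, x2 in product(range(k), repeat=2)}

def leg(node, dim, target, k):
    """Nodes visited after `node` while correcting coordinate `dim` to `target`."""
    forward = (target - node[dim]) % k
    step = 1 if forward <= k - forward else -1   # ties go forward (assumption)
    visited, cur = [], list(node)
    while cur[dim] != target:
        cur[dim] = (cur[dim] + step) % k
        visited.append(tuple(cur))
    return visited

def route(a, b, k, placement, order=(0, 1, 2)):
    """Sketch of Algorithm 2: returns the list of nodes from a to b."""
    I = [d for d in order if a[d] != b[d]]
    path = [a]
    if not I:
        return path
    j, rest = I[0], I[1:]
    path += leg(path[-1], j, b[j], k)            # step 1
    if len(rest) <= 1:                           # step 2, |I| = 2
        for d in rest:
            path += leg(path[-1], d, b[d], k)
        return path
    i, t = rest
    trial = leg(path[-1], i, b[i], k)            # step 3: try dimension i
    if any(n in placement for n in trial):
        i, t = t, i                              # blocked: correct t first
        trial = leg(path[-1], i, b[i], k)
    path += trial
    path += leg(path[-1], t, b[t], k)            # last remaining dimension
    return path

if __name__ == "__main__":
    k = 5
    P = assumed_placement(k)
    print(route((0, 0, 0), (3, 1, 2), k, P))
```

For the assumed placement, changing a single coordinate of a processor always leaves the placement, so the first and last legs cannot meet another processor; only the middle leg needs the check in step 3.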
We can prove that Routing Algorithm 2 allows for any pair of processors in the network to communicate without passing over another processor, hence, resulting in an optimal link load.
Theorem 1 A message between any pair of processors in a 3-dimensional $k$-torus with shifted diagonal placement, routed using Algorithm 2, never passes through another processor.
Conclusions
case that avoids going through processors and results in an optimal link load. Further work may include extending our results to tori of higher dimensions and finding a non-minimal optimal routing algorithm with uniform load distribution over the links.
