Abstract-In this paper we describe a new family of topologies for interconnecting many identical processors to form an MIMD multiprocessor. It extends to arbitrarily many processors while keeping the number of neighbors of any one processor fixed. We show that this family behaves very well with respect to uniformity of bus load, simplicity of routing algorithms, and distance between processors.
I. INTRODUCTION
IN THIS paper we describe a new family of topologies for interconnecting many identical processors to form an MIMD multiprocessor. The purpose of the interconnection is to allow the individual processors to communicate with each other as they cooperate to perform tasks. Special topologies can be designed for particular tasks, but general purpose topologies are important for networks like Cm* [1], Micronet [2], and Arachne [3], [4]. Such topologies may be judged by the following five criteria.
1) The diameter of the network (the length of the longest path, where a path is a shortest route between two processors) should grow slowly with the number of processors.
2) The number of neighbors of any one processor should be independent of the size of the network.
3) Addresses should exist for each processor that permit efficient algorithms for routing messages from any processor to any other.
4) The traffic load on various parts of the network should be uniform.
5) Message pathways should be redundant to provide robustness in the event of component failure.
Previously reported interconnection strategies perform well according to some of these criteria, but not others. For example, the hypercube [5] performs excellently with respect to criteria 1) (logarithmic relationship), 3), 4) (completely uniform), and 5) (very high redundancy). Unfortunately, criterion 2) is violated: as the dimension rises, the number of neighbors of each processor increases. The dense snowflake [6] satisfies criteria 2) (fixed complexity), 3), and 5) (good redundancy), but it does less well with respect to criteria 1) and 4). The diameter grows as a root of the number of processors (rather than a logarithm) and traffic load varies widely. The multitree-structured network [7] satisfies criteria 1) and 2), but is very poor with respect to criterion 3): there is no general addressing and routing scheme. It is hard to evaluate its performance on criteria 4) and 5).
This paper presents a new family of interconnection strategies called the lens. We show that lenses perform well with respect to all the criteria.
II. BACKGROUND
Many other authors have considered interconnection topologies, although differing assumptions about the underlying technology often lead to solutions that are difficult to compare. A recent survey [8] distinguishes three main approaches: multistage networks, in which processors are connected to memories or to each other through a network of lines and switches; dedicated path networks, in which processors are connected to each other with (simplex or duplex) lines; and shared path networks, in which sets of processors share a communications medium such as a bus. The lens structure presented in this paper is, strictly speaking, a shared path network, although its principles may be applied to dedicated path networks, and it has striking similarities to several proposed multistage schemes.
The work described here is descended from a shared-path scheme called the Mega-Micro-Computer network [9]. In such a network, a large number of processors are connected by buses, with some small fixed number of processors (say p, where p > 1) allowed on each bus, and at most two buses attached to each processor. Any processor may send a message directly to all neighboring processors, that is, all those connected by any bus to the originating processor; messages may be relayed to more distant destinations. Local connectivity (criterion 2 above) depends only on p, not on the total number of processors.
We have investigated several such structures in detail [6], deriving exact formulas for average and worst case interprocessor distances and bus traffic, and presenting distributed algorithms for message routing. One of these structures, the dense snowflake, is a direct generalization of the Mega-Micro structure. It is constructed recursively, starting with a cluster of p processors on one bus and building a larger cluster from p smaller ones, connecting processors from the edges of subclusters with p - 1 new buses. The diameter of the snowflake grows in proportion to a root of the number of processors. We presented the star as an attempt to decrease the diameter. Its construction is similar to the snowflake's, but processors near the centers of the subclusters are used to connect them together. The diameter is improved (proportional to the logarithm of the number of processors), but the problem of uneven bus traffic is worse. For example, if a random pair of processors exchange messages, the central bus will be used with probability (p - 1)/p, whereas the outermost buses will only be used with probability about 2(p - 1)/N2 (where N is the total number of processors). Moreover, the star has only one path between any pair of processors, so any component failure splits the network into unconnected regions.
In this paper we improve on these previous results. We also consider the more general situation in which each processor connects to q or fewer buses (for some fixed q > 1).
III. DEFINITION OF THE LENS
Roughly speaking, the lens is constructed of subclusters, q -1 of which are joined by their borders to form a multilayered structure. The thickening at the center, which gives the lens its name, is intended to add extra buses at the center where they are needed to share the greater message traffic.
More precisely, assuming at most p processors per bus and q buses per processor, a level-1 cluster consists of q - 1 buses sharing each of p - 1 processors. The situation in which p = q = 3 is shown in Fig. 1(a), in which processors and buses are represented as "P" and "B," respectively. Each of the shared processors is deficient in that it is connected to only q - 1 of the q buses allowed. The buses are also deficient, since each has only p - 1 processors. To create a level-n cluster, take q - 1 level-(n - 1) clusters, add a new bus to each deficient processor, and introduce p - 1 new processors. The q - 1 subclusters are connected as follows. Each set of corresponding new buses from the subclusters is connected to one of the new processors [Fig. 1(b) and (c)]. In this way, each new processor is connected by new buses to q - 1 previously deficient processors, one from each subcluster.
The lens has a natural addressing scheme. We give a processor an address of the form "w.v". (By convention, we will use w and v to represent sequences of digits, and a and b to represent single digits. In this case w is a sequence of base p - 1 digits and v is a sequence of base q - 1 digits.) In a level-n lens, all addresses have n digits. We may define the addresses recursively according to the level of the lens. When a cluster is formed from q - 1 subclusters, each address in the ith subcluster is extended by adding the digit i to its right end. None of these new addresses ends with a dot. As mentioned above, each new processor is connected to q - 1 previously deficient processors, one from each subcluster. Those old processors have new addresses of the form "w.a," where w is an (n - 1)-digit string (their old address) and a is one digit (the new digit). The addresses of all old processors connected to a single new processor have the same "w" part, but different "a" parts. We give the new processor an address "wb.," choosing a different digit b for each new processor connected to the same group of old ones. Since each new bus is connected to p - 1 new processors, all the possible digits b will be chosen. Fig. 1 shows the addresses of processors.
In the special case p = q, the number of deficient processors (which we call the border of the cluster) is the same as the number of deficient buses (which constitute the center of the cluster). Such a lens may be completed by connecting these border processors to the center buses. If p = q, counting addresses shows that a level-n lens has $n(p-1)^n$ processors. Each processor except for the $(p-1)^n$ ones on the border was given a new bus during the construction at some level; those last ones may be connected to unique buses in the center. Thus, the number of buses equals the number of processors.
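The address count can be checked by direct enumeration. The following is a minimal sketch, not the authors' code: the function name `lens_addresses` is hypothetical, and it assumes (reading the recursive construction above) that every address has at least one digit before the dot, with base p - 1 digits before the dot and base q - 1 digits after it.

```python
from itertools import product


def lens_addresses(n, p, q):
    """Enumerate the addresses "w.v" of an uncompleted level-n lens.

    Assumption (from the recursive construction): w is nonempty, its digits
    are base p-1, and the digits of v are base q-1. Single-character digits
    are used, so this sketch only handles p, q <= 11.
    """
    addrs = []
    for dot in range(1, n + 1):  # number of digits before the dot
        for w in product(range(p - 1), repeat=dot):
            for v in product(range(q - 1), repeat=n - dot):
                addrs.append("".join(map(str, w)) + "." + "".join(map(str, v)))
    return addrs


if __name__ == "__main__":
    n, p = 3, 3
    addrs = lens_addresses(n, p, p)
    border = [a for a in addrs if a.endswith(".")]
    print(len(addrs), n * (p - 1) ** n)   # total processors vs. n(p-1)^n
    print(len(border), (p - 1) ** n)      # border processors vs. (p-1)^n
```

For p = q the two printed pairs agree, which is the counting argument used in the text.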
IV. ROUTING
In this section we assume that the lens is not completed; we extend the algorithm to the completed lens in the next section. Processors are adjacent (that is, they share a bus) if their addresses follow either of these patterns.
P1) One address is "wa.u" and the other is "w.bu".
P2) One address is "wa.u" and the other is "wb.u".
In order to find a direct neighbor of any processor, start with its address. Pattern P1 implies that the dot may be moved one position to the right or the left; the position that is passed over may be set to any digit. Pattern P2 implies that the digit before the dot may be changed to any digit. In both cases the new address names a processor adjacent to the old one.
These patterns lead to an algorithm for determining a path between arbitrary processors. For example, [...] Pattern P1 has been used for the first three steps, and pattern P2 for the last one. If we restrict ourselves to pattern P1 alone, that last step requires two:
02.220 → 0.1220 → 00.220.
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-30, NO. 12, DECEMBER 1981
Given two addresses, which we may call the source and destination, we wish to find all minimal paths between them. First, define a region of difference as the minimal set of digit positions that satisfies the following properties:
1) the positions are contiguous in the range [1..n],
2) the source and destination addresses agree in all positions outside the region of difference,
3) the dot is within the region of difference or on its borders.
In the example above, the region of difference is those positions marked here with 1:
01110.
The source and destination may agree in places within the region of difference, but they must disagree on the borders of that region (at least with respect to the position of the dot).
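The definition of the region of difference can be made concrete with a short sketch. The helper names (`parse`, `region_of_difference`, `region_mask`) are hypothetical. The destination 00.220 is taken from the text, while the source address 021.10 is an illustrative assumption chosen to give the same region, since the source address of the original example is not recoverable here. Addresses are assumed to use single-character digits.

```python
def parse(addr):
    """Split "w.v" into (list of digits, number of digits before the dot)."""
    w, v = addr.split(".")
    return [int(c) for c in w + v], len(w)


def region_of_difference(src, dst):
    """Return (first, last, k, i, j) for the region of difference.

    Positions are numbered 1..n. A region covering positions lo+1..hi has
    its borders at dot offsets lo and hi, so both dots must lie in [lo, hi].
    k is the region length; i and j are the distances of the source and
    destination dots from the left end of the region.
    """
    s_digits, s_dot = parse(src)
    d_digits, d_dot = parse(dst)
    assert len(s_digits) == len(d_digits)
    lo, hi = min(s_dot, d_dot), max(s_dot, d_dot)
    for pos, (a, b) in enumerate(zip(s_digits, d_digits), start=1):
        if a != b:  # every differing position must lie inside the region
            lo = min(lo, pos - 1)
            hi = max(hi, pos)
    return lo + 1, hi, hi - lo, s_dot - lo, d_dot - lo


def region_mask(src, dst):
    """Mark the positions of the region of difference with 1, as in the text."""
    first, last, _, _, _ = region_of_difference(src, dst)
    n = len(parse(src)[0])
    return "".join("1" if first <= pos <= last else "0" for pos in range(1, n + 1))


if __name__ == "__main__":
    print(region_mask("021.10", "00.220"))
    print(region_of_difference("021.10", "00.220"))
```

The values k, i, and j returned here are the quantities used in the path-length argument that follows.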
To move from the source to the destination, it is necessary to change all the digits in the region of difference and end up with the dot in the correct place. The algorithm has four steps as follows.
1) The dot moves from its position in the source address to one end of the region of difference; all digits are replaced with arbitrary ones.
2) The dot moves back to the correct position in the destination address; all digits are replaced with correct ones for the destination.
3) The dot continues to the other end of the region of difference; all digits are replaced with arbitrary ones.
4) The dot moves back to the correct position in the destination address; all digits are replaced with correct ones for the destination.
Any of these steps may be trivial if the source or destination address has its dot at the end of the region of difference.
Moving the dot by one position corresponds to traversing one bus on the path. Let k be the number of positions in the region of difference, and let i and j be the distance of the dot from the left end of the region of difference in the source and destination addresses, respectively. If step 1) moves the dot towards the right end of the region of difference, then the path length will be k - i for step 1), k for steps 2) and 3), and j for step 4); a total of 2k + j - i steps. Similarly, if step 1) moves the dot left first, the total distance is 2k + i - j. Therefore, a shortest path begins by moving the dot right if and only if j < i. If j = i, both directions yield minimal paths.
The maximum distance between two processors occurs when i = j; in this case, the path length is 2k. Since k ≤ n, the maximum distance between two processors is 2n.
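The length argument above reduces to a few lines of arithmetic. This is a sketch with hypothetical function names; it only evaluates the two totals derived in the text (2k + j - i and 2k + i - j) and does not construct the path itself.

```python
def p1_path_length(k, i, j):
    """Length of the shorter of the two dot itineraries using pattern P1 only.

    k: size of the region of difference
    i, j: distance of the dot from the left end of the region in the source
          and destination addresses (0 <= i, j <= k)
    """
    right_first = 2 * k + j - i   # dot first goes to the right end
    left_first = 2 * k + i - j    # dot first goes to the left end
    return min(right_first, left_first)


def first_directions(i, j):
    """Directions in which a shortest P1 path may start."""
    if j < i:
        return ["right"]
    if j > i:
        return ["left"]
    return ["left", "right"]      # tie: both itineraries have length 2k


if __name__ == "__main__":
    k = 4
    worst = max(p1_path_length(k, i, j)
                for i in range(k + 1) for j in range(k + 1))
    print(worst, 2 * k)           # the worst case is reached when i == j
```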
This routing algorithm uses pattern P1 exclusively. The lens also provides pattern P2, which is useful when the dot changes direction from left to right. This situation occurs at most once in any path, so using pattern P2 can decrease the length of some paths by one step.
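The adjacency patterns and the 2n bound can be explored by brute force on small instances. The sketch below is an illustration under stated assumptions, not the routing algorithm itself: it takes p = q (so one digit base serves both sides of the dot), represents an address as a tuple of digits plus the number of digits before the dot, and assumes at least one digit precedes the dot. The names `neighbors`, `all_addresses`, and `diameter` are hypothetical.

```python
from collections import deque
from itertools import product


def neighbors(addr, base):
    """Addresses adjacent to addr under patterns P1 and P2 (uncompleted lens)."""
    digits, dot = addr            # dot = number of digits before the dot, 1..n
    n = len(digits)
    out = set()
    if dot > 1:                   # P1: "wa.u" -> "w.bu" (dot moves left)
        for b in range(base):
            out.add((digits[:dot - 1] + (b,) + digits[dot:], dot - 1))
    if dot < n:                   # P1: "w.bu" -> "wa.u" (dot moves right)
        for a in range(base):
            out.add((digits[:dot] + (a,) + digits[dot + 1:], dot + 1))
    for b in range(base):         # P2: "wa.u" -> "wb.u"
        if b != digits[dot - 1]:
            out.add((digits[:dot - 1] + (b,) + digits[dot:], dot))
    return out


def all_addresses(n, base):
    return [(digits, dot)
            for dot in range(1, n + 1)
            for digits in product(range(base), repeat=n)]


def diameter(n, base):
    """Largest shortest-path distance, found by breadth-first search."""
    verts = all_addresses(n, base)
    worst = 0
    for s in verts:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for u in neighbors(v, base):
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        assert len(dist) == len(verts)   # the graph is connected
        worst = max(worst, max(dist.values()))
    return worst


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, diameter(n, base=2), 2 * n)   # measured diameter vs. bound 2n
```

Because pattern P2 is included, the measured diameter can fall below the P1-only bound of 2n.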
V. ROUTING IN THE COMPLETED LENS
The routing algorithm is somewhat complicated by completion. We still have the two adjacency patterns shown earlier, but now an address of the form "w." is considered to be equivalent to the address ".w". In other words, "a.w" is now adjacent to "bw." It is easiest to view an address as a circular list of digits with a dot between some two adjacent digits. (Since p = q in the completed lens, digits before and after the dot are in the same base.)
Suppose source and destination addresses have been chosen and we want to find all shortest paths. The dot positions in the source and destination divide the circle of digits into two arcs; let A be one of the arcs, and let a denote its length [see Fig. 2]. Let B be a maximal region within A in which the source and destination digits agree, and let b be its length.
One way to find a path from the source to the destination is to move the dot from its initial position to one end of B, then back around to the other end of B, and finally to its destination position [see Fig. 2(c)]. Call paths formed this way class-1 paths. The total length of these paths is n + a - 2b. As in the routing algorithm for the uncompleted lens, when the dot moves past the same position twice, the digit at that position is set to an arbitrary value the first time and to the value from the destination address the second time. Different choices of B and of the arbitrary values yield different paths within the class. As before, each traversal of a digit by the dot corresponds to a traversal of a bus in the actual lens. The path of the dot and the choice of values to which the digits are set as the dot passes determine the path through the lens.
Paths in class 2 move the dot from the source position, through A to the destination position, and then make another complete circuit of the digits; their total length is n + a, the class-1 length with b = 0. In an analogous manner, letting C be a maximal region of agreement outside of A, and letting c be its length, we get classes 3 and 4 of lengths 2n - a - 2c and 2n - a, respectively.
[...]
H(x): shift x cyclically to the right,
G(x): increase the rightmost digit of x by 1 (mod p - 1).
Each operation corresponds to a graph automorphism. Given two addresses s1 and s2, transform s1 into s2 as follows. First, repeatedly apply H to bring each digit of s1 to the right end. If the digit differs from the corresponding digit of s2 (that is, the digit at the same distance from the dot position), then make it equal by repeated applications of G. Finally, apply H as necessary to make the dot position agree with s2.
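The claim that H and G are automorphisms, and that together they can carry any address to any other, can be checked exhaustively on a small completed lens. The sketch below rests on several assumptions that are not spelled out in the text: an address is modeled as a circular tuple of n digits with a dot index in 0..n-1 (index 0 standing for both "w." and ".w"), p = q, and H is read as rotating the digits and the dot together. All function names are hypothetical.

```python
from itertools import product


def vertices(n, base):
    return [(digits, dot)
            for dot in range(n)
            for digits in product(range(base), repeat=n)]


def neighbors(v, base):
    """Adjacency in the completed lens, viewed as a circular digit list."""
    digits, dot = v
    n = len(digits)
    k = (dot - 1) % n                       # index of the digit before the dot
    out = set()
    for b in range(base):
        right = digits[:dot] + (b,) + digits[dot + 1:]
        out.add((right, (dot + 1) % n))     # P1: dot moves right over index dot
        left = digits[:k] + (b,) + digits[k + 1:]
        out.add((left, k))                  # P1: dot moves left over index k
        if b != digits[k]:
            out.add((left, dot))            # P2: digit before the dot changes
    return out


def H(v):
    """Shift cyclically to the right (digits and dot move together)."""
    digits, dot = v
    return (digits[-1:] + digits[:-1], (dot + 1) % len(digits))


def G(v, base):
    """Increase the rightmost digit by 1 modulo the digit base."""
    digits, dot = v
    return (digits[:-1] + ((digits[-1] + 1) % base,), dot)


def is_automorphism(f, n, base):
    return all({f(u) for u in neighbors(v, base)} == neighbors(f(v), base)
               for v in vertices(n, base))


def orbit(start, base):
    """All addresses reachable from start by repeated use of H and G."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in (H(v), G(v, base)):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


if __name__ == "__main__":
    n, base = 3, 2
    print(is_automorphism(H, n, base))
    print(is_automorphism(lambda v: G(v, base), n, base))
    print(len(orbit(((0, 0, 0), 0), base)), n * base ** n)
```

If the orbit of one address covers every address, the graph is vertex-transitive under this model, which is the property the loading results below rely on.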
Definitions: We say that message traffic is uniform in a network if (source, destination) pairs for messages are randomly and uniformly distributed. We say a routing strategy is regular if the path chosen for a message is always chosen at random from the set of minimal paths, with all minimal paths equally likely. The loading of a processor is the probability that it is used to relay a message between a random (source, destination) pair.
Corollary: If message traffic in a completed lens is uniform and routing is regular, then all processors have equal loading.
Proof: Given two processors S and D, there is an automorphism f that maps S onto D. Since f extends to a bijection on the set of minimal paths, and since path p goes through S iff f(p) goes through D, the result follows.
Note: The hypothesis of regular routing is stronger than necessary in the corollary above. However, the routing strategy must be sufficiently "fair" that automorphisms preserve path probabilities. The routing algorithm given above satisfies this condition.
Definition: The dual of a processor graph G is a processor graph G' formed by relabeling the graph, interchanging buses and processors. (p(G') = q(G), and q(G') = p(G)).
Theorem 2: A completed lens is isomorphic to its dual.
Proof: Let G be the completed lens. Label the buses of G as follows. Each bus is adjacent to a set of q - 1 processors labeled wi.v (i = 0 ... q - 2) and one processor labeled w.av, for some strings w and v and digit a. Assign the label w.av to this bus. Let f be the mapping from G to G' that assigns a processor to the bus whose label is the reversal of the processor's label, and vice versa. Clearly f is one-one and onto. The buses adjacent to processor wa.v are wa.v and w.iv for i = 0 ... p - 2, and the processors adjacent to bus w.av are w.av and wi.v, for i = 0 ... q - 2. Hence, if p = q, f is an automorphism interchanging processors and buses.
Corollary: For any two buses in a completed lens, there exists a graph automorphism on the lens that maps one onto the other.
Proof: Let b1 and b2 be buses, let f be the isomorphism of Theorem 2, and let g be the automorphism of Theorem 1 that maps f(b1) onto f(b2). Then $f^{-1} \circ g \circ f$ is an automorphism mapping b1 onto b2.
Corollary: If message traffic in the completed lens is uniform and routing is regular, then bus loading is uniform. (Bus loading is defined analogously to processor loading.)
VII. THE LENS AND MULTISTAGE NETWORKS
An alternate picture of the lens indicates a correspondence between shared-path and dedicated-path topologies and also suggests a striking similarity between the lens and several multistage schemes previously proposed. Fig. 3 represents the same interconnection schemes as Fig. 1, but with the processors and buses arranged differently on the page. Drawn this way, a level-k lens is formed of k rows of "exchange groups," with each row connected to the next by a perfect shuffle interconnection [10]. Each exchange group consists of p - 1 processors and q - 1 buses, connected in a complete bipartite graph. Fig. 4 shows Fig. 3(c) with each exchange group replaced by a box. The extra connections to complete the lens are shown as dotted lines. For clarity, the bottom row of boxes is repeated at the top.
The graph structure of Fig. 4 has been proposed several times as a multistage network, especially in the case where p = q = 3. It has been variously called the flip network by Batcher [11], the indirect binary n-cube by Pease [12], the Omega network by Lawrie [13] (who also mentions the possibility of other values of p and q), and the SW-Banyan by Goke and Lipovski [14]. The case p = q = 4 has certain optimality properties for producing permutations [15]. In each case the interpretation of the graph as an interconnection network has been slightly different, and hence different properties of the graph have been of interest.
To the best of our knowledge, all previous papers mentioning the graph of Fig. 4 interpret boxes in the interior of the graph as switching elements of some sort. The boxes at the top and bottom are interpreted as processors, processors and memories, or processors and buses. The results of this paper show that a fruitful interconnection topology can be achieved by interpreting all the boxes as processing elements or as shared-path subnetworks.
When the boxes are interpreted as processing elements, we will refer to the graph as the "multistage lens." In the multistage lens, the processing elements have addresses of the form u.v, where uv is a string of n base-(p - 1) digits and u is nonempty. The adjacency rule is similar to that of the multibus lens: u.av is adjacent to ub.v and au. is adjacent to b.u, where a and b are arbitrary digits. There are thus $n(p-1)^n$ processors, each with 2(p - 1) neighbors, and the diameter is $\lfloor 3n/2 \rfloor$.
Another similar topology is the Cube-Connected Cycles (CCC) [16]. This graph is formed from an n-dimensional binary cube by replacing each vertex with a cycle of n nodes. Each node in the cycle is a terminus of an edge corresponding to an edge of the cube. The CCC can also be described by an addressing scheme that highlights its similarity to the multistage lens. Each corner of the cube is indicated by an n-bit binary string; the position of a node in the cycle at that corner is indicated by a dot somewhere in the string. The CCC, like the multistage lens for p = 3, has $n2^n$ processors of the form u.v, but whereas processor ua.bv has four neighbors in the lens, $u.abv$, $u.\bar{a}bv$, $uab.v$, and $ua\bar{b}.v$, it has only three neighbors in the CCC: u.abv, uab.v, and a third address that keeps the dot in place and complements one digit. Moreover, two arc-traversals are required for each position in which a source and destination address differ: one to change the digit at one position and the other to move the dot to the next position. Adding n/2 more steps to move the dot to the destination dot position gives a diameter of approximately 5n/2.
VIII. CONCLUSION
In this paper we have presented a new interconnection strategy called the lens that performs well with respect to all the criteria listed in the introduction.
1) The diameter of the completed lens grows slowly with the number of processors. The diameter d is $\lfloor 3n/2 \rfloor$ and the number of processors N is $n(p-1)^n$, so the diameter is $d = \lfloor 3(\log N - \log n) / (2 \log(p-1)) \rfloor$, which is logarithmic in N. The uncompleted lens diameter also grows logarithmically with N; in this case d is 2n.
2) Every processor is connected to q buses, and every bus is connected to p processors, independent of the size of the network.
3) The processor addressing scheme permits efficient algorithms for routing messages from any processor to any other. Completing the lens does add complexity to the algorithm, but it is still linear in the length of addresses.
4) The traffic load is uniform for all processors and all buses.
5) Multiple minimal paths provide robustness in the event of component failure.
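To give a feel for the growth rates quoted in item 1), the sketch below tabulates the processor count N = n(p-1)^n against the two diameter figures stated above. It only evaluates the paper's formulas; the function names are hypothetical and nothing here is measured on an actual network.

```python
def lens_size(n, p):
    """Number of processors in a level-n lens with p = q."""
    return n * (p - 1) ** n


def completed_diameter(n):
    """Diameter quoted for the completed lens: floor(3n/2)."""
    return (3 * n) // 2


def uncompleted_diameter(n):
    """Diameter quoted for the uncompleted lens: 2n."""
    return 2 * n


if __name__ == "__main__":
    p = 3
    print("n  N      completed  uncompleted")
    for n in range(1, 11):
        print(f"{n:<2} {lens_size(n, p):<6} "
              f"{completed_diameter(n):<10} {uncompleted_diameter(n)}")
```

The table also shows how far apart consecutive feasible sizes are, since N can only take values of the form n(p-1)^n.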
One problem with the completed lens is that it seems to require expansion in large increments, since the number of processors must be an integer of the form $n(p-1)^n$. Structures of other sizes could be built by viewing them as lenses of the next larger size with some processors and buses missing. The characteristics of such "partial lenses" have been investigated in another paper [17].
