Abstract. Cayley graphs have been used extensively to design interconnection networks and provide a natural setting for studying point-to-point routing [1, 2, 3, 5, 6, 7, 12]. Extending these techniques to the more important problem of permutation routing on interconnection networks presents fundamental problems, because of the potentially explosive growth in both the size of the graph and the number of generating permutations (referred to as one-step permutation routes) used to define the underlying graph. This paper describes a technique for moderating that growth so that the techniques in [8] can be applied to find optimal permutation routes. In a particularly striking example, a bus interconnection architecture involving 1.0 × 10^17 permutations (nodes of the Cayley graph) is reduced to a computation on a graph with only 3,950 nodes. Further, it is shown how many of the 58,624 generators (directed edges labelled by one-step permutation routes) at each node of the graph may be eliminated as locally redundant.
Introduction
There is an extensive literature on the use of Cayley graphs to design interconnection networks [1, 2, 5, 6, 7, 12]. Cayley graphs have been used extensively to study point-to-point routing, and they have been particularly attractive for the degree-diameter problem [3, 5, 7]. Nevertheless, other patterns of routing, such as broadcast, permutation routing, and general many-to-one routing, are at least as important for parallel computing. This paper describes the use of Cayley graphs to study permutation routing. There are also other approaches to permutation routing [1, 11] that concentrate more on heuristics and an overall framework.
A Cayley graph G is a directed labelled graph associated with a group, G, and generating set S. The nodes of G are elements of G and the edges are labelled by generators in S. Two nodes, g and h, are connected by a directed edge, (g, h), and the edge is labelled by s ∈ S, if h = gs. In the case where S = S^{-1}, whenever g is connected to h by a directed edge labelled by s, h is simultaneously connected to g by a directed edge labelled by s^{-1}, and so G may be viewed as undirected. This is typically the case and will be assumed throughout this paper. Cayley graphs are vertex-symmetric: for any two nodes, there is a graph isomorphism of the graph onto itself that maps the first node to the second. In the case of Cayley graphs, this graph isomorphism also preserves the labels of the edges. In fact, this property can be easily established by embedding G into Aut(G) through the left regular action, so that the resulting image is transitive on the nodes of the graph and preserves edge labels.
A spanning tree T for G rooted at the identity e allows one to find shortest paths between arbitrary nodes. Given elements g and h of G, one uses T to find a shortest path from e to g^{-1}h. If w is the word in the edge labels along this path, then h = gw. Since G is vertex-symmetric, it follows that w also defines a shortest path in G from g to h. An important interpretation is that w is a minimal length word in the generating set S that represents the group element g^{-1}h. In particular, this allows for the solution of the minimal word problem for G relative to the generating set S.
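The spanning-tree argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the encoding of [8]: permutations are stored as tuples (`p[i]` is the image of point `i`), products follow the right-action convention h = gs, and the names `compose`, `inverse`, `spanning_tree` and `shortest_word` are hypothetical helpers introduced here.

```python
from collections import deque

def compose(g, s):
    """Product gs under the right-action convention: apply g first, then s."""
    return tuple(s[g[i]] for i in range(len(g)))

def inverse(g):
    """Inverse permutation of g."""
    inv = [0] * len(g)
    for i, gi in enumerate(g):
        inv[gi] = i
    return tuple(inv)

def spanning_tree(generators, n):
    """Breadth-first spanning tree of the Cayley graph, rooted at the identity.

    generators maps an edge label to a permutation of range(n).
    Returns a dict: node -> (parent node, edge label), with None at the root.
    """
    identity = tuple(range(n))
    parent = {identity: None}
    queue = deque([identity])
    while queue:
        g = queue.popleft()
        for label, s in generators.items():
            h = compose(g, s)
            if h not in parent:
                parent[h] = (g, label)
                queue.append(h)
    return parent

def shortest_word(parent, g, h):
    """Shortest word w in the edge labels with h = gw (a path from e to g^{-1}h)."""
    node = compose(inverse(g), h)
    word = []
    while parent[node] is not None:
        node, label = parent[node]
        word.append(label)
    return word[::-1]

# Sym(4) generated by the three adjacent transpositions (labels are arbitrary).
gens = {"a": (1, 0, 2, 3), "b": (0, 2, 1, 3), "c": (0, 1, 3, 2)}
tree = spanning_tree(gens, 4)
word = shortest_word(tree, (1, 2, 3, 0), (3, 2, 1, 0))
```

Because the tree is built breadth-first, the path from the root to any node is a shortest one, so `word` is a minimal-length word for g^{-1}h in the chosen generators.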
In [8], an efficient method of encoding Cayley graphs was developed, using standard techniques from computational group theory [13]. This method required log_2 3 bits of storage per node of the graph to store a data structure for a spanning tree for the Cayley graph. The spanning tree was derived essentially through a method of breadth-first search. During computation of the data structure, log_2 5 bits per node were temporarily required. Since this is approximately 2 bits, it was called the 2-bit method. Furthermore, it was shown how to use this data structure to find shortest paths in G between any two nodes. Applications of this method are presented in [9].
We will now show how to apply these ideas to permutation routing in an arbitrary interconnection graph I. We first define a one-step permutation route π to be a permutation of the nodes Ω of I with the property that, for each node x, either π(x) is connected to x or π(x) = x. The idea is that any one-step permutation route can be executed in parallel without any conflicts. A k-step permutation route is one that is obtained from a sequence of at most k one-step permutation routes. The transposition of the two nodes of any edge of I is a one-step permutation route. This leads to the following observation.

Theorem 1. If I is connected, then the set S of one-step permutation routes generates Sym(Ω) and any permutation of the nodes of Ω is a |Ω|^2-step permutation route.
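The definition of a one-step permutation route translates directly into a membership test. The sketch below assumes the interconnection graph is given as an adjacency dictionary and the permutation as a dictionary from nodes to nodes; the function name `is_one_step_route` and both representations are illustrative choices, not part of the paper.

```python
def is_one_step_route(pi, adjacency):
    """Return True if pi is a one-step permutation route of the graph.

    adjacency maps each node to the set of its neighbours.
    pi maps each node x to pi(x); it must be a permutation of the nodes, and
    every node must either stay fixed or move to one of its neighbours.
    """
    nodes = set(adjacency)
    if set(pi) != nodes or set(pi.values()) != nodes:
        return False  # not a permutation of the node set
    return all(pi[x] == x or pi[x] in adjacency[x] for x in nodes)

# A path graph 0 - 1 - 2.
path = {0: {1}, 1: {0, 2}, 2: {1}}
swap_edge = {0: 1, 1: 0, 2: 2}      # transposition of the two nodes of an edge
three_cycle = {0: 1, 1: 2, 2: 0}    # node 2 would have to jump to node 0
```

On the path graph, `swap_edge` passes the test and `three_cycle` does not, since nodes 2 and 0 are not adjacent.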
The basic problem we want to consider is the following: given an arbitrary permutation π of the nodes of I, express π as a sequence of k one-step permutation routes where k is minimal. This can be viewed as finding a word of minimal length in the set S of one-step permutation routes that represents π. It is precisely this problem which must be solved.
Bus Interconnection Networks
We now apply the previous ideas to bus interconnection networks. From this, it will be clear how to generalize to other examples. Specifically, let C be a set of chips and let B be a set of bus lines. Let Ω = C ∪ B and let I be a graph on Ω such that (a, b) is an edge of I if and only if a ∈ C, b ∈ B, and chip a is connected to bus line b.
The following model corresponds to the case in which each chip has only one register. In each clock cycle, for each chip a, either the data in a is interchanged with the data on some unique bus line b or else the data on chip a remains fixed. Section 5 discusses how to generalize this to multiple registers per chip. Essentially, one identifies C with the set of chip registers, rather than chips. The simpler model is analyzed in detail in section 3 to illustrate the concepts.
Let S be the set of all involutions g ∈ Sym(Ω) with the property that g is a product of disjoint transpositions of the form (a b) where a ∈ C and b ∈ B. The elements of S are precisely the one-step permutation routes of I. Let M = max(|C|, |B|) and let m = min(|C|, |B|). Then |S| is bounded above by (M + 1)^m. This is a restricted version of an interconnection network in which distinct data is both read and written to the bus in one clock cycle, thus ensuring that the communication is indeed a permutation. This may be the physical situation if the setup time to connect chips and bus lines dominates the actual transfer time. The bus line and/or the chip can hold dummy data, in which case the important transfer of data is in one direction only.
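As a check on the size of S, the generators can be counted and, for small cases, enumerated. The sketch below assumes the fully connected model in which every chip is attached to every bus line, so that a generator is a non-empty set of disjoint chip/bus pairs; `count_generators` and `enumerate_generators` are hypothetical names introduced for illustration.

```python
from itertools import combinations, permutations
from math import comb, factorial

def count_generators(num_chips, num_buses):
    """Number of non-identity products of disjoint chip/bus transpositions.

    Choosing k chips, k bus lines and one of the k! pairings between them
    gives a generator built from exactly k transpositions.
    """
    m = min(num_chips, num_buses)
    return sum(comb(num_chips, k) * comb(num_buses, k) * factorial(k)
               for k in range(1, m + 1))

def enumerate_generators(chips, buses):
    """List every generator as a frozenset of (chip, bus) pairs (small cases only)."""
    gens = []
    for k in range(1, min(len(chips), len(buses)) + 1):
        for cs in combinations(chips, k):
            for bs in permutations(buses, k):
                gens.append(frozenset(zip(cs, bs)))
    return gens

small = enumerate_generators(["c0", "c1", "c2"], ["b0", "b1"])
large = count_generators(16, 4)
bound = (max(16, 4) + 1) ** min(16, 4)
```

For 16 chips and 4 bus lines the count is 58,624, the number of generators quoted in the abstract, and it lies below the bound (M + 1)^m = 17^4 = 83,521.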
Experimental Model
For simplicity in computational experiments, we also assume that each chip is connected to each bus line. This allows us to find experimental data dependent only on parameters |C| and |B|. It is easy to see that if S is the set of one-step permutation routes for I defined in the previous section, then S generates G = Sym(C ∪ B). The technique can be easily applied to other interconnection patterns, and details of that generalization are contained in section 5.
The goal is to express each permutation route as a shortest possible sequence of one-step permutation routes. In the language of group theory, for each permutation g of Ω that setwise stabilizes B, we must find a shortest word in S that represents g. Given the Cayley graph G for G with generating set S, this is equivalent to finding a shortest path in G from the identity node 1 to the node g.
Rather than work with the Cayley graph G, we will work with a reduced labelled multigraph, defined below. This smaller graph greatly reduces the necessary computation. The reduced graph is constructed through two reductions. First, we construct a Cayley coset graph G_H from G, and second, we construct the reduced labelled multigraph from G_H.
The first reduction is obtained by noting that we are concerned only with permutations of data on the chips, and not with permutations of whatever data may initially be on the bus lines. This is best expressed by right cosets Hg ∈ G/H, where g ∈ G and H ≅ Sym(B) is the subgroup of G that fixes C pointwise. Formally, the Cayley coset graph of G/H with generating set S is a directed labelled multigraph G_H, with nodes labelled by G/H. Two nodes Hg_1 and Hg_2 are connected by an edge with label s if Hg_1 s = Hg_2. Since H fixes C pointwise, the image of g′ ∈ Hg on C depends only on the coset Hg and not on the choice of g′. So, it suffices to find a shortest word w in S representing an element of Hg. Many routing problems can be formulated in terms of Cayley coset graphs (also referred to as group action graphs in [2]). In general, Cayley coset graphs are not vertex-symmetric.
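A small experiment makes the first reduction concrete. Because H fixes C pointwise, a coset Hg can be represented by the tuple of images of the chips under g, and a generator acts on that tuple pointwise. The sketch below is an illustrative brute-force search for the fully connected model, with chips numbered 0..|C|-1 and bus lines numbered after them; the function names are hypothetical and nothing here reflects the storage scheme of [8].

```python
from collections import deque
from itertools import combinations, permutations
from math import factorial

def one_step_routes(num_chips, num_buses):
    """Yield each generator as a tuple p, where p[x] is the image of point x."""
    n = num_chips + num_buses
    for k in range(1, min(num_chips, num_buses) + 1):
        for cs in combinations(range(num_chips), k):
            for bs in permutations(range(num_chips, n), k):
                p = list(range(n))
                for a, b in zip(cs, bs):
                    p[a], p[b] = b, a   # transposition (a b), a a chip, b a bus line
                yield tuple(p)

def coset_distances(num_chips, num_buses):
    """Breadth-first search of the coset graph from the coset H.

    A coset Hg is stored as (0^g, 1^g, ...), the images of the chips.
    Returns a dict: coset -> length of a shortest word reaching it.
    """
    gens = list(one_step_routes(num_chips, num_buses))
    start = tuple(range(num_chips))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        t = queue.popleft()
        for s in gens:
            u = tuple(s[x] for x in t)
            if u not in dist:
                dist[u] = dist[t] + 1
                queue.append(u)
    return dist

def num_cosets(num_chips, num_buses):
    """|G : H| = |Ω|! / |B|! for G = Sym(C ∪ B) and H ≅ Sym(B)."""
    return factorial(num_chips + num_buses) // factorial(num_buses)

dist = coset_distances(3, 2)
```

With 3 chips and 2 bus lines the search visits all 5!/2! = 60 cosets. For 16 chips and 4 bus lines, `num_cosets` gives 20!/4!, roughly 1.0 × 10^17, which is why a direct search of the coset graph is out of reach at that size.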
The second reduction is obtained from the symmetries of the chosen bus interconnection architecture. Recall that a group automorphism is a permutation σ of G such that σ(gh) = σ(g)σ(h) and σ(g^{-1}) = (σ(g))^{-1} for all g, h ∈ G. For G corresponding to the bus interconnection architecture I, we define a symmetry to be an automorphism, σ, of G such that σ(H) = H and σ(S) = S. This definition ensures that the symmetries of G act as graph automorphisms of the Cayley graph G (with generating set S), which map paths of minimal length in G to other paths of minimal length. If (g_1, g_2) is a directed edge of G with label s ∈ S, then σ maps (g_1, g_2) to the directed edge (σ(g_1), σ(g_2)) with label σ(s) ∈ S. This and the invertibility of σ imply that σ preserves the paths of minimal length in G. Furthermore, as indicated above, we are actually interested in shortest words which represent a fixed coset Hg. But the symmetries of G also act as graph automorphisms of G_H in the same sense as for G. To see this, note that if (Hg_1, Hg_2) is an edge of G_H with edge label s ∈ S, then σ(H) = H and σ(s) ∈ S imply that (σ(Hg_1), σ(Hg_2)) = (Hσ(g_1), Hσ(g_2)) is an edge of G_H with label σ(s). Thus σ preserves paths of minimal length in G_H as well.
Conjugation provides a natural means of constructing such symmetries. Define the conjugate of g by u to be g^u = u^{-1}gu, and extend the definition to sets by X^u = {x^u : x ∈ X} and X^U = {x^u : x ∈ X, u ∈ U}. For fixed u ∈ G, the map σ_u : g ↦ g^u is an automorphism of G, and is known in group theory as an inner automorphism. In the case of our bus interconnection network, we let U ≅ Sym(C) × Sym(B), and choose as the set of symmetries {σ_u : u ∈ U}. We often refer to a symmetry σ_u simply as u. Note that H^U = H and S^U = S for this choice of U, and so the definition of symmetry is satisfied. Hence, conjugation by U preserves shortest words. From group theory, it is known that if |C ∪ B| ≠ 6, then the set U contains all possible symmetries.
With this motivation, we now introduce the reduced labelled multigraph, derived from G, whose nodes are identified with the subsets [Hx] = (Hx)^U = Hx^U ⊆ G for x ∈ G. Thus, {[Hx]}_{x ∈ G} partitions G into disjoint subsets. In the language of group theory, [Hx] can also be viewed as the union of the cosets in the orbit of Hx under the conjugation action of U on G/H.
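For the fully connected bus model, one way to decide whether two cosets lie in the same class [Hx] is to compare a relabelling-invariant signature. The sketch below rests on an assumption that is not stated in this form in the text: since conjugation by U = Sym(C) × Sym(B) merely renames chips among chips and bus lines among bus lines, a class is taken to be determined by the lengths of the cycles lying entirely in C together with the lengths of the chains of chips that end on a bus line. The function name `orbit_invariant` and the dictionary representation are illustrative.

```python
def orbit_invariant(images):
    """Signature of the class [Hg] for a coset given by its chip images.

    images maps each chip to its image under g (a chip or a bus line).
    Returns (cycle lengths, chain lengths), each sorted in decreasing order.
    A chain starts at a chip that is the image of no chip and ends on a bus line.
    """
    image_set = set(images.values())
    seen = set()
    chains = []
    for c in images:
        if c not in image_set:          # c starts a chain
            length, x = 0, c
            while x in images:          # keep going while x is still a chip
                seen.add(x)
                x = images[x]
                length += 1
            chains.append(length)
    cycles = []
    for c in images:
        if c not in seen:               # remaining chips lie on cycles inside C
            length, x = 0, c
            while x not in seen:
                seen.add(x)
                x = images[x]
                length += 1
            cycles.append(length)
    return (tuple(sorted(cycles, reverse=True)),
            tuple(sorted(chains, reverse=True)))

# Chips 0, 1, 2 and bus lines 3, 4.
a = orbit_invariant({0: 1, 1: 0, 2: 3})   # chips 0, 1 swapped; chip 2 sent to bus 3
b = orbit_invariant({0: 4, 1: 2, 2: 1})   # chips 1, 2 swapped; chip 0 sent to bus 4
```

Both cosets have the signature `((2,), (1,))`: one 2-cycle of chips and one chain of length 1, so under the stated assumption they represent the same node of the reduced graph.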
It remains to define the edges of the reduced labelled multigraph. For purposes of computation, its edge set is defined to be only a subset of the naturally induced edge set from G_H. We next show that the edges in the two views of the reduced graph really do correspond. There is an edge in the unlabelled graph if and only if there is at least one directed, labelled edge in the directed, labelled multigraph. The backward implication is easy. For the forward implication, suppose there is an edge in the unlabelled graph. So, there are
Note also that when S is closed under inverses, if there is a directed edge from [Hx] to [Hy], then there is at least one directed edge from [Hy] to [Hx].
Suppose now that g is an arbitrary element of G, and we want to construct a shortest word on S whose action on C is the same as the action of g on C. This is the same problem as finding a shortest sequence, g 1 , . . . , g k of elements of S such that g( 
Proof exists g 1 , . . . , g k ∈ S and w 0 , . . . , w k ∈ T (hence w
∈ S proves the assertion in the forward direction.
Conversely, suppose g ∈ Hg
]. This then gives the required path of length k from [H] to [Hg] in G.
3.1. Encoding of nodes of reduced graph. Next, we consider how to produce a 1-1 mapping of the nodes of G into a "small" set of integers. The encoding serves as an almost perfect hash function of G, in that the number of integers is not much larger than the number of nodes. The results in [8] show how to use such a 1-1 mapping to store a spanning tree using storage of only log_2 3 bits times the size of the "small" integer set. This is especially important for the larger graphs. In order to construct a unique encoding of [Hg], we require two auxiliary functions, r(x, y) and s(x, y). We denote by r(x, y) the number of partitions of x indistinguishable elements such that the largest set within the partition is of size no larger than y. So, for example, r(5, 2) = 3, since the set of 5 elements can be partitioned into subsets of size (2, 2, 1), (2, 1, 1, 1) and (1, 1, 1, 1, 1), for which no subset has more than 2 elements. In general,
Also, s(x, y) is the number of ways in which x + y indistinguishable elements can be partitioned, where the x + y elements are first split into two distinguished subsets, the first of size at most x and the second of size at least y. So, for example, s(1, 2) = 5 and the partitions are of the form ((1), (2)), ((1), (1, 1)), ((), (3)), ((), (2, 1)) and ((), (1, 1, 1)). In general, s(x, y) = r(x, x)r(y, y) + s(x − 1, y + 1) for x > 0 and s(0, y) = r(y, y).
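The two auxiliary functions are easy to tabulate. In the sketch below, `s` follows the recurrence stated above, while `r` uses the standard recurrence for partitions with bounded largest part, r(x, y) = r(x, y − 1) + r(x − y, y) with r(0, y) = 1 and r(x, 0) = 0 for x > 0. That recurrence for r is an assumption made here because it agrees with the definition and the worked example; it is not quoted from the text.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def r(x, y):
    """Number of partitions of x in which no part is larger than y."""
    if x == 0:
        return 1          # the empty partition
    if y == 0:
        return 0          # nothing can be built from parts of size 0
    count = r(x, y - 1)   # partitions that use no part of size y
    if x >= y:
        count += r(x - y, y)   # partitions that use at least one part of size y
    return count

@lru_cache(maxsize=None)
def s(x, y):
    """s(x, y) = r(x, x) r(y, y) + s(x - 1, y + 1) for x > 0, s(0, y) = r(y, y)."""
    if x == 0:
        return r(y, y)
    return r(x, x) * r(y, y) + s(x - 1, y + 1)
```

With these definitions `r(5, 2)` evaluates to 3 and `s(1, 2)` to 5, matching the two examples given in the text.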
We now give a unique characterization of a node of the reduced graph in terms of a sequence of invariants. The proof that this is a unique characterization is easy, but is not included here. In the restriction of g to the union of the cycles that fix B, let c i be the length of the i-th such cycle. Let b i be the number of points x ∈ C such that x g j ∈ C for all j < i and x ((x 1 , . . . , x i ) ) is a unique encoding of the partition (x 1 , . . . , x i ) among all partitions of b elements. The value of the function "t" is defined to be independent of the order of its arguments. We re-order the arguments so that x 1 ≥ · · · ≥ x i and define
for i > 0 and t(()) = 0 for i = 0. Note that t((x_1, . . . ,

In order to discover these redundant edge labels, consider the node [Hx], x = x, with outgoing edge labelled by g_0. One must eliminate those g ∈ S such that g ≠ g_0 and [Hxg] = [Hxg_0]. Equivalently, for a given x ∈ G and g_0 ∈ S, one must find those g ∈ S with g ≠ g_0 such that h(xg)^u = xg_0 for some h ∈ H and u ∈ U.
The following three heuristics are used to remove certain g ∈ S from consideration. The statement assumes some arbitrary ordering of the points C ∪ B, so that "smallest" and "<" are well-defined. The correctness of these heuristics follows from the results of section 4, which define a total ordering on S. So, if g, g_0 ∈ S, g_0 < g, and the heuristic can establish that [Hxg]
(1) First, g can be removed if there is a point a ∈ C′ moved by g that is not the smallest in the cycle of x containing a and for which g does not move the point that is first in that cycle. (There is a conjugation in U that "rotates" the points of the cycle until a is first in the cycle, while leaving x unchanged.)
(2) Second, g can be removed when there is a point of C′ moved by g that is smallest in a cycle of x, if there is an earlier cycle of points of the same length as the current cycle and g does not move the point that is smallest in that cycle. (In that case, there is a conjugation in U that interchanges the two cycles, and so leaves x unchanged.)
(3) Third, g can be removed if there are points a, a
3.3. Experimental Results. The following table describes the maximum length permutation route, the CPU time, and the number of nodes in the reduced graph for various values of the number of chips and the number of buses. The computation was carried out under AKCL 1.615 (Common LISP) on a SPARCstation-2. For each tested combination of number of chips and bus lines, three numbers are reported. The first is the number of steps for an optimal permutation route for the worst case permutation (the diameter of the reduced labelled multigraph). The second is the number of nodes required to store the internal data structure. For large numbers of nodes, this will usually dominate the space requirements. Only two bits are required to store each node [8], so storage requirements are minimal for the examples shown. The third number is the number of seconds of CPU time to generate the data structure. From this data structure, an optimal permutation route can be computed in less than a millisecond for any desired permutation.
For 12 chips with 4 bus lines and 16 chips with 4 bus lines, two special tests were made. The results were a diameter of 5 with 919 nodes (5/919) and a diameter of 6 with 3,950 nodes (6/3950), respectively. The computation was carried out on a SPARCserver 670 in 43 minutes and 10 hours of CPU time, respectively.
The range of encodings is larger than, but usually close to, the actual number of states. The range of encodings is independent of the number of bus lines. For example, at the beginning of section 3.1, we saw an example of 16 chips and 4 bus lines, in which the range [0, 5,822] was needed for 167 nodes. If the 2-bit encoding scheme is used, space must be allotted for the total range of encodings. Where the actual number of states used is much smaller, one can use hashing techniques to require space proportional to the actual number of states. Naturally, the constant of proportionality would be much higher than two bits.
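The node counts of the two special tests and the upper end of the encoding range can be reproduced by a short count. The sketch relies on an assumption that is not spelled out in the surviving text: a node [Hx] of the reduced graph is taken to be determined by the lengths of the cycles lying entirely in C together with the lengths of the chip chains that end on a bus line, with at most |B| such chains. The helper `r` is the function r(x, y) of section 3.1, computed by the standard bounded-partition recurrence; `count_reduced_nodes` and `encoding_range` are hypothetical names.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def r(x, y):
    """Number of partitions of x in which no part is larger than y."""
    if x == 0:
        return 1
    if y == 0:
        return 0
    count = r(x, y - 1)
    if x >= y:
        count += r(x - y, y)
    return count

def count_reduced_nodes(num_chips, num_buses):
    """Count signatures: j chips lie on chains, the remaining chips lie on cycles.

    The cycle lengths form any partition of num_chips - j. The chain lengths
    form a partition of j into at most num_buses parts, and such partitions
    are equinumerous with partitions of j whose parts are at most num_buses.
    """
    return sum(r(num_chips - j, num_chips - j) * r(j, num_buses)
               for j in range(num_chips + 1))

def encoding_range(num_chips):
    """s(num_chips, 0): the same sum with no limit on the number of chains."""
    return sum(r(num_chips - j, num_chips - j) * r(j, j)
               for j in range(num_chips + 1))
```

Under this assumption, `count_reduced_nodes(12, 4)` gives 919 and `count_reduced_nodes(16, 4)` gives 3,950, the node counts reported for the two special tests, and `encoding_range(16)` gives 5,822, which does not depend on the number of bus lines.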
Note that the length of the optimal permutation route for the worst case is always at least three. The reason for this is easy to see, since a one-step permutation route is required to move the data from the chips to the bus lines. A second one-step permutation route moves the data from the bus lines to other chips. A third one-step route then places some of the data back on the bus lines. In practice, one is only interested in permutation routes that leave all data on the chips, and none on the bus lines. Hence, for permutation routes of interest, with sufficiently many bus lines, there is always a permutation route of length at most two.
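The two-step route for permutations that leave all data on the chips can be written out explicitly. The sketch assumes the fully connected model with at least as many bus lines as chips, numbers the chips 0..|C|-1, and pairs chip i with bus line |C| + i; `two_step_route` and `apply_route` are illustrative names.

```python
def two_step_route(pi):
    """Two one-step routes realizing the chip permutation pi.

    pi[i] is the chip that must receive the data initially on chip i.
    Each step is a list of disjoint (chip, bus line) transpositions.
    """
    c = len(pi)
    moved = [i for i in range(c) if pi[i] != i]
    step1 = [(i, c + i) for i in moved]        # chip i puts its data on bus c + i
    step2 = [(pi[i], c + i) for i in moved]    # chip pi(i) takes it off bus c + i
    return step1, step2

def apply_route(location, step):
    """Update the location of every tracked datum after one one-step route."""
    swap = {}
    for a, b in step:
        swap[a], swap[b] = b, a
    return [swap.get(x, x) for x in location]

pi = [1, 2, 0, 3]                      # a 3-cycle on chips 0, 1, 2; chip 3 is fixed
step1, step2 = two_step_route(pi)
location = list(range(len(pi)))        # datum i starts on chip i
location = apply_route(apply_route(location, step1), step2)
```

After both steps `location` equals `pi`, so every datum sits on its destination chip. The second step is a valid one-step route because pi is a bijection, which makes the receiving chips distinct.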
Pruning Locally Redundant Generators
Although the reduced labelled multigraph was previously defined as a graph, we also identify it with the set of labels of its nodes, which are the equivalence classes [Hx]. The next theorem characterizes a set of representatives, xS_x, for the distinct equivalence classes in [Hx]S. Equivalently, this characterizes representatives for the nearest neighbors in the reduced labelled multigraph.
Thus if g, h ∈ S_x and [Hx]h = [Hx]g then g and h must both be lexically least, and so g = h.
Define the centralizer of x ∈ G in U to be Cent_U(x) = {g ∈ U : x^g = x}.
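For small examples the centralizer can be found by brute force, which may help in checking the definition. The sketch takes U = Sym(C) × Sym(B), stores permutations as tuples, and uses the conjugation convention x^u = u^{-1}xu from section 3; the function names are hypothetical, and the efficient methods of [10] are not what is shown here.

```python
from itertools import permutations

def compose(g, s):
    """Product gs under the right-action convention: apply g first, then s."""
    return tuple(s[g[i]] for i in range(len(g)))

def inverse(g):
    """Inverse permutation of g."""
    inv = [0] * len(g)
    for i, gi in enumerate(g):
        inv[gi] = i
    return tuple(inv)

def conjugate(x, u):
    """x^u = u^{-1} x u."""
    return compose(compose(inverse(u), x), u)

def direct_product(num_chips, num_buses):
    """Yield every element of Sym(C) x Sym(B), chips first and bus lines after."""
    buses = range(num_chips, num_chips + num_buses)
    for pc in permutations(range(num_chips)):
        for pb in permutations(buses):
            yield tuple(pc) + tuple(pb)

def centralizer(x, num_chips, num_buses):
    """Cent_U(x): the elements u of U with x^u = x."""
    return [u for u in direct_product(num_chips, num_buses)
            if conjugate(x, u) == x]

# Chips 0, 1, 2 and bus lines 3, 4.
chip_swap = (1, 0, 2, 3, 4)      # the transposition (0 1)
chip_bus_swap = (3, 1, 2, 0, 4)  # the transposition (0 3), a one-step route
```

Here `centralizer(chip_swap, 3, 2)` has 4 elements and `centralizer(chip_bus_swap, 3, 2)` has 2, since an element of U commuting with (0 3) must fix both chip 0 and bus line 3.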
Corollary 4. Let ρ be a total ordering on S. Let H ⊆ U, H^U = H and S^U = S.
The advantage of S′_x over S_x is that, given Cent_U(x), there are well-known techniques for enumerating the elements of S′_x [4]. For U equal to Sym(C) × Sym(B), Sym(Ω) and many other special cases of interest, computing Cent_U(x) is also efficient [10].
The heuristics of section 3.2 follow from the special case for S′_x. The order, ρ, is a variation of lexical ordering.
The use of a lexical ordering means that one can generate all elements of S ′ x in lexical order and if a permutation g ∈ X
x = ∅ and so the entire subtree can be eliminated.
Where it is feasible to compute S_x or where S′_x is not too much larger than S_x, the entire computation of a data structure for the spanning tree of the reduced labelled multigraph has a nice complexity characterization. Since |S_x| ≤ |G|, the algorithm examines at most |G|^2 directed edges in its breadth-first search. Note that enumeration of S′_x in lexical order costs O(|S′_x| n) for n = |Ω|, since on the current branch, either the entire subtree will be eliminated, or a new element of S′_x will be found before the necessity of backtracking. Computing the product xg and its encoding costs O(n). So, the total cost of the algorithm is O(|G|^2 (n + E) + |G|(A + Rn)), where A is the cost of computing the automorphism group and R is the cost of enumerating the elements of S_x or S′_x. Further, for the bus interconnection networks of section 3, O(|G|^2 n) is the dominant term.
Generalization to Other Models
The previous section was described under the assumption that each chip had only one register (and hence could hold only one piece of data at a time). Naturally, most implementations will allow multiple registers or memory cells on a single chip. Hence, shortest paths derived from the previous calculations serve only as an upper bound on the shortest path length when multiple registers are allowed.
The model can be extended to k registers per chip by including k copies of each chip. Every generator must also be replicated to include the variations in which each bus is connected to an arbitrary register. Finally, the subgroup H must be extended to include generating elements that transpose arbitrary registers from the same chip.
The previous section was described in a situation with full connections between chips and bus lines. If there are fewer connections, in general, one will need to choose a subgroup of U = Sym(C) × Sym(B) that preserves the architecture under conjugation. Thus, the subgroup U can be viewed as the automorphism group of a labelled graph, where each node of the graph is labelled either by C or by B. Usually ease of manufacturing dictates a uniform architecture, and so most real-world examples are still likely to have a fairly large automorphism group.
Finally, a few remarks about generalizations to other models besides bus interconnection networks are worthwhile. There is a common difficulty in translating routes in physical networks to permutation routes. Suppose one has nodes A, B and C with edges between A and B, and between B and C. One would like to add a transposition between A and B as a one-step permutation route. Similarly, one would like to add a transposition between B and C. Finally, one would like to move data from A to B while other data is moved from B to C, in parallel, in a single one-step permutation route. However, a naive product of transpositions would send data from A to C, which cannot be done in one step. The solution is to split each node into two nodes corresponding to two registers/ports: a sending port and a receiving port. Thus, one can move data from A to B in parallel with distinct data moving from B to C, and this can be expressed as a permutation.
Conclusion
It has been demonstrated how to find permutation routes for larger data structures than would be possible by a direct approach. The solution involves precomputing a special data structure in seconds. One can then derive shortest permutation routes achieving an arbitrary permutation from the pre-computed data structure. Finding that solution is a computation that can be carried out in less than a millisecond (and much shorter times are possible with hardware support).
In implementations, one would pre-compute the data structure for a given hardware architecture off-line. The required space is very small, although the CPU time for the pre-computation can be appreciable. Since the pre-computation is done once only, long CPU times may be acceptable. Specific permutation routes would either be computed at compile-time for an individual application, or else be computed as part of a set-up routine for a route determined at run-time. In the latter case, one would expect the computed permutation route to be used many times to justify the overhead of computing such a route at run-time.
