Abstract
Introduction
The performance of a multiprocessor system hinges critically on the intercommunication ability of its processors, which is often impeded by the bandwidth of the network interconnecting them. The full crossbar network goes a long way toward solving the bandwidth problem, but at the expense of O(N^2) crosspoints for an N-processor system. What is desirable is a network which functions much like a crossbar but with fewer crosspoints. Such networks have been investigated extensively in communication switching and are commonly referred to as strictly nonblocking networks [Ben65]. Formally, an N × N strictly nonblocking network is a directed acyclic graph with N source vertices, called inputs, and N sink vertices, called outputs, such that, given any idle input x and idle output y, the network always possesses a path from x to y which does not overlap with any of the paths which might already be established between other inputs and outputs.
A number of strictly nonblocking networks comprising fewer than O(N^2) crosspoints have been reported in the literature. The well-known 3-stage Clos network [Clo53] provides a strictly nonblocking network with O(jN^{1+2/j}) crosspoints and depth j. Improved versions of the Clos network due to Cantor [Can71] give a strictly nonblocking network with O(N log^a N) crosspoints, 2 < a ≤ 3, and O(log^a N) depth. Cantor also provided the first strictly nonblocking network with O(N log^2 N) crosspoints and O(log N) depth [Can71]. Subsequent to these efforts, Bassalygo and Pinsker [BP74, Bas81] obtained a strictly nonblocking network with O(N log N) crosspoints and O(log N) depth. Nonetheless, unlike the Clos and Cantor networks, Bassalygo and Pinsker's network relies on the existence of certain bipartite graphs called extensive graphs. This makes their network implicit, or nonconstructive.
In this paper, we explore the possibility of explicitly constructing strictly nonblocking networks with O(N log N) crosspoints, O(log N) depth and O(log N) routing time. Here, routing generically refers to establishing and abolishing paths between the idle inputs and idle outputs of a nonblocking network. We differentiate between two specific routing problems. The first routing problem deals with establishing (or abolishing) a path between any idle (or busy) pair of inputs and outputs. The second routing problem deals with establishing (or abolishing) paths between any two or more idle (or busy) pairs of inputs and outputs. We shall refer to these two problems as the single routing assignment problem and the multiple routing assignment problem, respectively.
Two recent efforts on routing problems in nonblocking networks are worth mentioning here. In [ALM90], Arora et al. reported a greedy algorithm for establishing paths between single and multiple pairs of idle inputs and outputs in Multi-Beneš networks. These networks encompass O(N log N) crosspoints and have O(log N) depth. Even though Multi-Beneš networks meet our objective of constructing efficient nonblocking networks in order-of-complexity terms, the results reported in [ALM90] have at least two drawbacks. First, Multi-Beneš networks require expanders with large expansion coefficients in order to have a small constant factor in their routing time complexity. It is stated in [ALM90] that one could generate expanders with large expansion coefficients randomly. However, this assumption makes Multi-Beneš networks nonconstructive, or implicit, in contrast to our requirement that the construction be explicit. The second drawback is that the time complexity of the greedy algorithm described in [ALM90] depends on two key design parameters, namely, the in-degree (or out-degree) d of each switch node and the loading factor L.
This relation was not spelled out exactly in [ALM90], but the authors pointed out that to reduce the routing time one needs to increase L and d. In particular, they stated that, to achieve a routing time of 100 log N, the values of d and L should be approximately 10 and 300, respectively. These parameters affect the constant in the crosspoint complexity of Multi-Beneš networks. More specifically, when d = 10 and L = 300, constructing a Multi-Beneš network requires 240,000N log N crosspoints. More recently, Lin and Pippenger [LP91] reported both deterministic and nondeterministic path selection algorithms for a strictly nonblocking network which resembles Cantor's network. Not only does their network require more than O(N log N) crosspoints, but their deterministic path selection algorithm also requires O(log^5 N) steps and the nondeterministic one requires O(log^2 N) steps.
In this paper, these results are substantially improved. We provide an explicit construction of a strictly nonblocking network with 352.8N log N − 756.18N crosspoints and 2 + log(N/5) depth by combining the recent expander construction of Alon, Galil and Milman [AGM87] with the strictly nonblocking network introduced by Bassalygo and Pinsker [BP74]. We present O(log N) bit-step parallel algorithms to solve the single routing assignment problem for both establishing and abolishing paths, and the multiple assignment problem for abolishing paths, on this new nonblocking network. Our algorithms can also be used to establish paths for multiple assignments but, at present, this requires O(k log N) bit-steps, where k is the number of requests, that is, (input, output) pairs, in the assignment.
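To put the quoted crosspoint count in perspective, the following sketch compares it with the N^2 crosspoints of a full crossbar. It is only an illustration: the function names are hypothetical, the logarithm is assumed to be base 2, and the search is restricted to powers of two. Values of N for which the closed-form expression is not positive are skipped, since the expression is not meaningful there.

```python
import math

def crossbar_crosspoints(n: int) -> int:
    # A full N x N crossbar uses N^2 crosspoints.
    return n * n

def explicit_network_crosspoints(n: int) -> float:
    # Count quoted in the text: 352.8 N log N - 756.18 N (log base 2 assumed).
    return 352.8 * n * math.log2(n) - 756.18 * n

def crossover_exponent(max_exp: int = 40):
    """Smallest e such that, for N = 2**e, the quoted count is positive
    and smaller than the crossbar count; None if no such e <= max_exp."""
    for e in range(1, max_exp + 1):
        n = 2 ** e
        explicit = explicit_network_crosspoints(n)
        if 0 < explicit < crossbar_crosspoints(n):
            return e
    return None

if __name__ == "__main__":
    e = crossover_exponent()
    print("crossover at N = 2 **", e)
```

Under these assumptions the constant 352.8 dominates for small systems, and the explicit network only overtakes the crossbar once N reaches the low thousands.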
Basic Facts
In the first part of this section, we restate some definitions and facts, mostly from [BP74], which will be used throughout the paper. In [...] We prove this lemma, as the proof provides insight into the construction of an F_{a,km} network.³
2 The reader who is sufficiently familiar with [BP74] may skip Section 2.1.
3 It is worth noting that this lemma was stated without a proof in [BP74].
Proof:
The lemma is proved by induction. The basis of the induction is an (at + t)/2-homogeneous bipartite graph with t inputs and at outputs. In this case, given r established paths, each idle input can be connected to
idle outputs since r ≤ t − 1. Therefore, the first stage is an F_{a,t} network. Now consider the F_{a,km} network in Figure 1. [...] outputs of the first (am, kam, α, β)-extensive graph. Since the number of established paths is at most r, the idle input x can therefore be connected to at least
idle outputs. Because 0 ≤ r ≤ km − 1, it follows that x can be connected to at least
idle outputs. Therefore, by Definition 4, the construction in Figure 1 is an F_{a,km} network. Lemma 2 (Bassalygo-Pinsker): Let G be a network obtained by tying two 
and H(·) is the binary entropy function, i.e., H(x) = −x log_2 x − (1 − x) log_2(1 − x).
It is easy to see that the F_{a,k^s t} network constructed this way has k^s t inputs and ak^s t outputs. Furthermore, summing the number of edges in the homogeneous graphs in stage 0 and the extensive graphs in all the remaining stages, the number of edges in the F_{a,k^s t} network is found to be
Even though this result does not lead to an explicit construction of an [...] (17)4^s, the number of edges in the F_{50/17,N} network is determined as 34 · 17 · 4^s + 1150s4^s = 67.65N log_4 N.
Explicit Construction of Extensive Graphs
Bassalygo and Pinsker obtained an N-input nonblocking network with 136N log_4 N = 68N log N edges⁴ by cascading two F_{50/17,N} networks back-to-back, where each F_{50/17,N} network consists of 1 + s = 1 + log_4(N/17) stages, and hence the two networks cascaded together have 2 + 2 log_4(N/17) stages. While this network has O(N log N) edges and O(log N) stages, it is only an existential construction, since the extensive graphs used in its construction were shown only to exist and were not explicitly constructed. In order to convert this network to an explicit form, we first obtain an explicit construction of Bassalygo and Pinsker's extensive graphs by using the following lemma.
4 The constant factor in the N log N expression was subsequently reduced to 53.4 in [Bas81] by refining the notion of extensive graphs and choosing the values of the parameters more judiciously.
The main point of this lemma is that an (m, km, α, β)-extensive graph can be constructed by using k (m, d, c)-expanders. In the context of Lemma 1, this amounts to replacing each of the (
. Recalling from Lemma 1 that α = (a − 1)/(2a), (ii) holds for any a > 1. Hence the above three conditions amount to the inequality:
, where a > 1. By solving this inequality for a, we obtain a ≥
. Hence, invoking Lemma 2 to obtain an (m, km,
. Now returning to Equation (5), we note that the factor in front of the N log N term is given by aδ/log k, or adk/log k, since the extensive graphs used in the Bassalygo-Pinsker network have degree δ = dk. Since N log N is the highest-order term in Equation (5), we seek to minimize (adk/log k) N log N with respect to a, d and k, subject to the constraint that a ≥
. An additional constraint imposed on a and k is that k^{i−1}at and k^i at, 1 ≤ i ≤ log_k(N/t), both be squares, since all explicit constructions of (m, d, c)-expanders reported in the literature that we know of have a square number of inputs. This implies that k and at must both be squares. Under these conditions, k = 4 minimizes (adk/log k) N log N, since k/log k is increasing for k ≥ 3. As for a_min, Figure 4 shows that its value increases as c decreases. While we do not know of a closed-form relation between d and c, most explicit constructions of (m, d, c)-expanders suggest that d increases with increasing c, as seen in Table 1.
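The choice k = 4 can be checked numerically. The sketch below assumes that the leading factor per F network is a·d·k / log k with logarithms taken base 2, a reading that reproduces the constant 352.8 quoted for the two cascaded networks when a = 9.8, d = 9 and k = 4. The function names and the search limit are hypothetical.

```python
import math

def k_dependent_factor(k: int) -> float:
    # The part of a*d*k / log2(k) that depends on k alone.
    return k / math.log2(k)

def best_square_k(limit: int = 20) -> int:
    """Perfect square k = r*r (2 <= r <= limit) minimizing k / log2(k)."""
    squares = [r * r for r in range(2, limit + 1)]
    return min(squares, key=k_dependent_factor)

def leading_coefficient(a: float, d: int, k: int, networks: int = 2) -> float:
    # Coefficient of N log N for `networks` F networks cascaded back-to-back.
    return networks * a * d * k / math.log2(k)

if __name__ == "__main__":
    print(best_square_k())
    print(leading_coefficient(9.8, 9, 4))
```

Because k / log k grows with k beyond k = 3, the smallest admissible square is the minimizer regardless of the search limit.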
The Parallel Routing Algorithm
In this section we present parallel algorithms to establish a path between any pair of idle inputs and idle outputs and to abolish paths between any number of pairs of busy inputs and outputs in Bassalygo-Pinsker networks. To simplify the description of our algorithms, we will combine the extensive graphs whose outputs are merged together in each stage of the F_{9.8,N} network into a single bipartite graph of degree dk, as shown in Figure 5. This simplifies the representation of the F_{9.8,N} network without altering its structure.
The single assignment routing problem for a Bassalygo-Pinsker network includes two main tasks: establishing paths and abolishing paths. First, we formalize the path-establishing problem. Let x be an idle input which requests to be connected to an idle output, say y. A free path between x and y (a path between x and y comprising unused switching vertices) will be established by traversing the left and right F_{a,N} networks separately. That is, traversals from x to the idle outputs of the left F_{a,N} network will be combined with the traversals from y to the idle outputs of the right F_{a,N} network to determine the free paths between x and y.
The path-abolishing problem (single routing assignment version) for a Bassalygo-Pinsker network amounts to dismantling an established path between a busy input and the busy output paired with it. The paths between inputs and outputs must be abolished after the transactions between them are over, because leaving the edges on these paths busy invalidates the nonblocking property of a Bassalygo-Pinsker network (or of any nonblocking network, for that matter). That is, unless the paths between the inputs and outputs that have completed their transactions are abolished, the assumption that an input is connected to at most one output at a time no longer holds, and without this assumption the definition of a nonblocking network no longer applies. We also note that abolishing paths between two or more busy inputs and their paired busy outputs takes no more time than abolishing a path between a single busy input and its paired busy output.
A. Representation of Paths
We will characterize the paths between the inputs and outputs of each F_{a,N} network in terms of sequences of vertices, where each vertex identifies an (output, input) pair (that is, a link between two consecutive stages) of the network. This can be viewed as collapsing the outputs of each stage with the inputs of the succeeding stage without altering their original ordering. We let vertex (i, j) denote the ith vertex between the (j − 1)th stage and the jth stage of the [...] For an F_{a,N} network with fixed parameters a, t, d and k, the variable S_{i,j} in each entry P[i, j] is fixed by the specific structure of the extensive graphs used to build the network, while the variable b_{i,j} is updated after a request to establish or abolish a path has been completed, so that together they reflect the current state of the F_{a,N} network.
B. Parallel Path-Establishing Algorithm
Given an idle input (I_x, 0) of the left F_{a,N} network and an idle input (I_y, 0) of the right F_{a,N} network (equivalently, an idle input-output pair of the entire network containing the left and right F_{a,N} networks), the path-establishing algorithm consists of three phases: a path-claiming phase, a pivot-selection phase and a path-tracing-back phase. Before these three phases begin, the variables μ_{i,j}, except μ_{I_x,0} and μ_{I_y,0}, in all entries of the status matrices associated with the two F_{a,N} networks should be initialized with an invalid vertex ID. Invalidating the μ_{i,j}'s of the intermediate vertices (or links) is necessary so that the predecessor information to be compiled for the current idle input and idle output pair is not tainted by the predecessor information left over from previously established paths. The μ_{i,j}'s of all inputs other than (I_x, 0) and (I_y, 0) are also invalidated to indicate that only inputs (I_x, 0) and (I_y, 0) take part in the current round of the path-claiming phase.
During the path-claiming phase, we mark all free paths between (I_x, 0) and (I_y, 0) which are vertex-disjoint from the already established paths in the F_{a,N} networks by linking the vertices along the free paths with the variables μ_{i,j}. This phase consists of 2 + log_k(N/t) steps for each F_{a,N} network (one step for each value of j). During the jth step, each vertex (i, j), 0 ≤ i ≤ aN − 1, in stage j of the F_{a,N} network whose variable μ_{i,j} contains a valid vertex ID broadcasts its ID to its dk successors specified by variable S_{i,j}. Each vertex (i, j + 1), 0 ≤ i ≤ aN − 1, in stage j + 1 keeps only one of its predecessors' IDs and stores it in variable μ_{i,j+1} if its variable b_{i,j+1} = 0 (an unoccupied vertex), and discards all the IDs from its predecessors if its variable b_{i,j+1} = 1 (an occupied vertex).
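The path-claiming phase on one F network can be sketched sequentially as follows. This is a minimal illustration rather than the implementation used in the paper: the list-of-lists layout (`succ` for S, `busy` for b, `mu` for μ), the sentinel `INVALID`, and the rule of keeping the lowest-indexed predecessor are assumptions made for the example, since the text only requires that each unoccupied vertex keep one predecessor ID. The inner loops simulate what all vertices of a stage would do in one parallel step.

```python
INVALID = -1  # stands in for the "invalid vertex ID" of the text

def claim_paths(succ, busy, start):
    """Path-claiming phase on a single layered network.

    succ[j][i] : successors (indices in stage j + 1) of vertex (i, j)
    busy[j][i] : 1 if vertex (i, j) is occupied, 0 otherwise
    start      : index of the idle input in stage 0
    Returns mu, where mu[j][i] is the kept predecessor of (i, j) or INVALID.
    """
    stages = len(busy)
    mu = [[INVALID] * len(busy[j]) for j in range(stages)]
    mu[0][start] = start  # only the requesting input holds a valid ID
    for j in range(stages - 1):          # one step per stage
        for i, pred in enumerate(mu[j]):
            if pred == INVALID:
                continue                 # vertex (i, j) is not on a free path
            for s in succ[j][i]:         # broadcast the ID to all successors
                # An occupied vertex discards every ID; an unoccupied one
                # keeps the first (lowest-indexed) predecessor it hears from.
                if busy[j + 1][s] == 0 and mu[j + 1][s] == INVALID:
                    mu[j + 1][s] = i
    return mu

if __name__ == "__main__":
    # Toy 3-stage network with 2, 3 and 3 vertices; vertex (1, 1) is occupied.
    succ = [[[0, 1], [1, 2]], [[0], [0, 1], [2]]]
    busy = [[0, 0], [0, 1, 0], [0, 0, 0]]
    print(claim_paths(succ, busy, start=0))
```

The valid entries in the last row of `mu` identify the idle outputs reachable from the chosen input over free paths.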
Figure 6 illustrates the activities that take place during the jth step, where vertex (i, j), which has received the ID from vertex (p, j − 1) in the last step, transmits its ID to all its successors. It follows that upon applying the path-claiming phase to the left F_{a,N} network, all its idle output vertices that have free paths to the chosen idle input (I_x, 0) can be determined. Likewise, upon applying the same procedure to the right F_{a,N} network, all its idle output vertices that have free paths to the chosen idle input (I_y, 0) can also be determined. We call an idle output common to both the left and right F_{a,N} networks a pivot vertex if it can be reached through a path determined by the path-claiming phase from both input (I_x, 0) and input (I_y, 0). That the Bassalygo-Pinsker network is nonblocking ensures that there exists at least one pivot vertex for any given idle input of the left F_{a,N} network and any given idle input of the right F_{a,N} network.
The pivot-selection phase of the path-establishing algorithm uses a backward traversal from the pivot vertices at the output end of the right F_{a,N} network toward its input (I_y, 0) to locate a free path. This traversal takes 1 + log_k(N/t) steps. More specifically, during step j, 0 ≤ j ≤ log_k(N/t), the pivot vertices in stage 1 + log_k(N/t) − j of the right F_{a,N} network send their IDs to their neighbors as specified by variables μ_{i,1+log_k(N/t)−j}. The vertices in stage 1 + log_k(N/t) − j − 1 that receive any IDs from stage 1 + log_k(N/t) − j retain only one of the IDs they receive, and then the same step is repeated between the vertices in stage 1 + log_k(N/t) − j − 1 and those in stage 1 + log_k(N/t) − j − 2, and so on. This phase generates a free path (linked by variables μ_{i,j}) from the input (I_y, 0) of the right F_{a,N} network to one of the pivot vertices at its output end.
Thus, once the pivot-selection phase is completed, all that remains to be done is to establish a free path by tracing it back from output vertex (I_y, 0) through the pivot vertex in the center to the input vertex (I_x, 0) of the combined network. This path-tracing-back phase takes 2 + log_k(N/t) steps on each of the left and right F_{a,N} networks. We start from the right F_{a,N} network and make the idle output (I_y, 0) the only marked vertex. In the jth step, the marked vertex (i, j) in stage j of the right F_{a,N} network sets variable b_{i,j} = 1 to indicate that vertex (i, j) is occupied and transfers its ID to its neighbor specified by variable μ_{i,j}, so that the edge between these vertices in the right F_{a,N} network is activated. The unique vertex in stage j + 1 which receives this vertex ID becomes the marked vertex in stage j + 1 for the following step. After 2 + log_k(N/t) steps, a particular pivot vertex in the center stage will be marked. The same process then proceeds in the left F_{a,N} network for another 2 + log_k(N/t) steps, starting with the chosen pivot vertex as the marked vertex. At the end of this phase, a free path between (I_x, 0) and (I_y, 0) is established and the request is served.
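The three phases can be combined into a compact sequential sketch. It deliberately simplifies the procedure described above: the pivot-selection and path-tracing-back phases are merged into a single walk from the chosen pivot back to each input along the μ pointers, and the pivot with the lowest index is chosen. The data layout, the helper names and the assumption that the two networks share one list of center-stage busy bits are hypothetical choices made for the example.

```python
INVALID = -1  # stands in for the "invalid vertex ID" of the text

def claim_paths(succ, busy, start):
    # Path-claiming phase: mu[j][i] is the kept predecessor of vertex (i, j).
    mu = [[INVALID] * len(row) for row in busy]
    mu[0][start] = start
    for j in range(len(busy) - 1):
        for i, pred in enumerate(mu[j]):
            if pred == INVALID:
                continue
            for s in succ[j][i]:
                if busy[j + 1][s] == 0 and mu[j + 1][s] == INVALID:
                    mu[j + 1][s] = i
    return mu

def trace_back(mu, busy, pivot):
    # Walk from the pivot to the input, marking every vertex as occupied.
    path, i = [], pivot
    for j in range(len(busy) - 1, -1, -1):
        busy[j][i] = 1
        path.append((i, j))
        i = mu[j][i]
    return path[::-1]  # listed from the input toward the center

def establish(succ_l, busy_l, succ_r, busy_r, x, y):
    """Connect idle input x (left) to idle input y (right) through a pivot.

    Both networks must have the same number of stages, and their last
    stages are the shared center vertices. Returns (left_path, right_path)
    as lists of (i, j) vertices, or None if no pivot exists.
    """
    mu_l = claim_paths(succ_l, busy_l, x)
    mu_r = claim_paths(succ_r, busy_r, y)
    last = len(busy_l) - 1
    pivots = [i for i in range(len(mu_l[last]))
              if mu_l[last][i] != INVALID and mu_r[last][i] != INVALID]
    if not pivots:
        return None  # cannot happen in a truly nonblocking network
    pivot = pivots[0]
    return trace_back(mu_l, busy_l, pivot), trace_back(mu_r, busy_r, pivot)

if __name__ == "__main__":
    succ = [[[0, 1], [1, 2]], [[0], [0, 1], [2]]]  # same toy wiring on both sides
    center = [0, 0, 0]                              # shared center-stage busy bits
    busy_l = [[0, 0], [0, 0, 0], center]
    busy_r = [[0, 0], [0, 0, 0], center]
    print(establish(succ, busy_l, succ, busy_r, x=0, y=1))
```

The toy wiring is far too small to be nonblocking, so a later request on it may legitimately return `None`; the guarantee of a pivot comes from the structure of the actual network, not from the routine.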
The algorithm to abolish a path is much simpler. Given a busy input (I_x, 0) of the left F_{a,N} network and a busy input (I_y, 0) of the right F_{a,N} network, abolishing the path between them takes only one phase, which consists of 2 + log_k(N/t) steps on each F_{a,N} network. Moreover, several busy paths can be abolished in parallel since they are disjoint.
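A sketch of the abolishing step, under the assumption that an established path is available as a list of (i, j) vertices and that `busy[j][i]` plays the role of b_{i,j}; both the representation and the function names are hypothetical. Because established paths are vertex-disjoint, the order in which several paths are released does not matter, which is what allows them to be released in parallel.

```python
def abolish(busy, path):
    # Release every vertex on one established path (one stage per step).
    for i, j in path:
        busy[j][i] = 0

def abolish_many(busy, paths):
    """Release several established paths; they must be vertex-disjoint."""
    seen = set()
    for path in paths:
        if not seen.isdisjoint(path):
            raise ValueError("established paths must be vertex-disjoint")
        seen.update(path)
        abolish(busy, path)

if __name__ == "__main__":
    busy = [[1, 1], [1, 1, 0], [1, 1, 0]]
    abolish_many(busy, [[(0, 0), (0, 1), (0, 2)], [(1, 0), (1, 1), (1, 2)]])
    print(busy)
```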
Realization and Performance
In this section we will discuss the implementation and performance of the routing algorithm on parallel processors with three different connection topologies.
A. Direct Realization
In this case, all phases of the routing algorithm associated with the two F_{a,N} networks presented in the previous section are mapped directly onto a parallel processor with 2N + 2aN · log_k(N/t) + aN processors that are interconnected in exactly the same way as the vertices of the Bassalygo-Pinsker network. We shall refer to this parallel processor realization as a BP-processor, since its topology is patterned after the Bassalygo-Pinsker network. The routing algorithm described in the previous section can be realized on the BP-processor using either a centralized or a distributed processing scheme.
In the centralized scheme, we assume that a master control unit initiates the various phases of the routing algorithm. Recall that the routing algorithm consists of two parts: path-establishing and path-abolishing. In the path-establishing part, upon receiving a request to connect an idle input x to an idle output y, the master control unit activates the processor associated with the idle input and the processor associated with the idle output. These two processors then simultaneously initiate the path-claiming phase. Once the path-claiming phase is completed, the processors in the center stage invoke the pivot-selection phase. After this phase is completed, the processor associated with the idle output y invokes the path-tracing-back phase. At the end of this phase, a path between idle input x and idle output y is formed.
In the distributed scheme, the request for a connection between an idle input and an idle output arrives directly at the processor associated with the idle input, and this request must be transmitted to the processor associated with the idle output. This is accomplished by having the processor associated with the idle input broadcast the destination address of the idle output to the processors associated with all the idle outputs.
The processor (associated with an idle output) whose destination address matches the broadcast address is then activated to initiate the three phases of the path-establishing process. These three phases are also carried out by the processor associated with the idle input. The rest of the realization proceeds as in the centralized scheme. It follows that all three phases of the path-establishing algorithm can be completed in 4(2 + log_k(N/t)) = O(log N) steps on the BP-processor under the centralized scheme and in 6(2 + log_k(N/t)) = O(log N) steps under the distributed scheme. Similarly, it can be shown that the path-abolishing algorithm can be realized on the same parallel processor in 2 + log_k(N/t) = O(log N) steps under the centralized scheme and in 3(2 + log_k(N/t)) = O(log N) steps under the distributed scheme.
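The processor and step counts of the direct realization are easy to tabulate. The sketch below simply evaluates the expressions stated above; the default values a = 9.8, k = 4 and t = 5 are taken from the parameters used elsewhere in the paper and are assumptions as far as this section is concerned, and the function names are hypothetical.

```python
import math

def phase_steps(n: float, k: int = 4, t: int = 5) -> float:
    # Steps in one pass over an F network: 2 + log_k(N/t).
    return 2 + math.log(n / t, k)

def bp_processor_count(n: float, a: float = 9.8, k: int = 4, t: int = 5) -> float:
    # 2N + 2aN log_k(N/t) + aN processors in the direct realization.
    return 2 * n + 2 * a * n * math.log(n / t, k) + a * n

def bp_processor_steps(n: float, k: int = 4, t: int = 5) -> dict:
    # Step counts stated for the centralized and distributed schemes.
    s = phase_steps(n, k, t)
    return {
        "establish_centralized": 4 * s,
        "establish_distributed": 6 * s,
        "abolish_centralized": s,
        "abolish_distributed": 3 * s,
    }

if __name__ == "__main__":
    n = 5 * 4 ** 3  # N = 320, so that log_4(N/5) = 3
    print(bp_processor_count(n))
    print(bp_processor_steps(n))
```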
B. Indirect Realizations
The parallel processor realization just described can be simplified by combining some of the processors and restructuring the communication links between them so as to maintain the connectivity of the original BP-processor topology. In general, this can lead to a variety of realizations with centralized routing schemes for the Bassalygo-Pinsker network. One possibility is to collapse all the processors into a single column of aN processors and restructure the communication links so that any two processors that have a direct communication link before the contraction also have one after it. Since each vertex in a stage of the BP-processor is connected to dk successors in the succeeding stage, each processor in the contracted BP-processor would have dk(2 + log_k(N/t)) communication links connecting it to the other processors. Thus, the contracted BP-processor consists of O(N) processors and a total of O(N log N) communication links. We leave out the details of the realization of the path-establishing algorithm on this contracted BP-processor and only mention that it can be realized in 5(2 + log_k(N/t)) = O(log N) steps. The path-abolishing algorithm can similarly be realized on the same processor with a time complexity of 2(2 + log_k(N/t)) = O(log N) steps.
The same algorithm can be realized on other parallel computers consisting of O(N) processors. In particular, we can realize this routing algorithm on a perfect shuffle processor using the data broadcast algorithm of Nassimi and Sahni [NS81]. Consider a perfect shuffle processor with aN processors, and suppose that processor P(i) contains an index register W(i) and a data register D(i). Nassimi and Sahni described an algorithm, called random access write (RAW), that broadcasts the contents of D(i) in processor P(i) to processor P(W(i)), 0 ≤ i ≤ aN − 1.
If two or more processors attempt to broadcast to the same processor, that is, if W(i_1) = W(i_2) = ... = W(i_r) = i, then P(i) receives its data from P(j), where j = min_{1≤k≤r} {i_k}. This algorithm takes O(log^2 N) steps to execute on a perfect shuffle processor. The various phases of the routing algorithm described in Section 3 can be broken down into a sequence of steps, each of which amounts to executing Nassimi and Sahni's data broadcast algorithm. To see this, consider the path-claiming phase of the parallel path-establishing algorithm. In the BP-processor realization, each processor within a stage broadcasts its own ID to its dk successors in the next stage. On the receiving end, each processor keeps only one of the IDs that reach it. This broadcasting of IDs between the processors in consecutive stages can be performed by iterating Nassimi and Sahni's RAW algorithm dk times, where, during each iteration, all active processors send their IDs to one of their successors. Since each iteration takes O(log^2 N) steps, a total of dk iterations are needed to complete the broadcast of the IDs for all active processors during each step of the path-claiming algorithm, and the entire algorithm encompasses O(log N) steps, the path-claiming phase can be completed in O(log^3 N) steps on a perfect shuffle processor with O(N) processors using Nassimi and Sahni's algorithm. It should be noted that this realization increases the time complexity from O(log N) to O(log^3 N) when compared to the fully contracted BP-processor, but it requires only O(N) communication links as compared to O(N log N) communication links for the fully contracted BP-processor. Table 3 summarizes the processor and time complexities of the various realizations of our routing algorithm.
The steps in both direct and indirect realizations involve broadcasting, updating vertex IDs and checking binary variables.
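The reduction of one path-claiming step to dk RAW operations can be sketched as follows. The code models only the input-output behavior of RAW (the lowest-numbered sender wins a conflict); it does not implement the O(log^2 N) perfect shuffle routine of [NS81]. The function names, the use of `None` for an inactive processor or an empty register, and the requirement that every successor list have exactly `degree` entries are assumptions made for the example.

```python
def random_access_write(D, W):
    """Behavioral model of RAW: P(i) sends D[i] to P(W[i]).

    W[i] is None for processors that do not send. When several processors
    address the same target, the lowest-numbered sender wins.
    """
    received = [None] * len(D)
    taken = [False] * len(D)
    for i in range(len(D)):              # ascending order: lowest index first
        target = W[i]
        if target is not None and not taken[target]:
            received[target] = D[i]
            taken[target] = True
    return received

def claim_step_via_raw(active, succ, busy_next, degree):
    """One path-claiming step on the contracted processor, as `degree` RAW calls.

    active[i]    : True if processor i holds a valid ID in the current stage
    succ[i]      : list of exactly `degree` successors of processor i
    busy_next[v] : 1 if vertex v of the next stage is occupied
    Returns the predecessor kept by each next-stage vertex (None if none).
    """
    n = len(succ)
    ids = list(range(n))
    mu_next = [None] * n
    for r in range(degree):              # iteration r serves the r-th successor
        W = [succ[i][r] if active[i] else None for i in range(n)]
        got = random_access_write(ids, W)
        for v in range(n):
            if got[v] is not None and mu_next[v] is None and not busy_next[v]:
                mu_next[v] = got[v]
    return mu_next

if __name__ == "__main__":
    print(random_access_write([10, 11, 12, 13], [2, 2, None, 0]))
    succ = [[1, 2], [2, 3], [0, 1], [3, 0]]
    print(claim_step_via_raw([True, True, False, False], succ, [0, 0, 1, 0], 2))
```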
In Section 3A, we stated that a switching vertex in a Bassalygo-Pinsker network is assigned a pair (i, j) as its ID (see footnote 6). For an N-input Bassalygo-Pinsker network this implies that the ID of each vertex takes up O(log N) bits, and hence broadcasting and updating IDs would require O(log N) bit-steps. Fortunately, the bit-level complexity of the steps in our algorithms can be reduced to O(1) by noting that each vertex in a Bassalygo-Pinsker network has only 2dk neighbors, and thus it suffices to use 2 log(dk) = O(1) bits to identify the neighbors of a vertex. Therefore, broadcasting a vertex ID reduces to setting a single-bit flag, and identifying the neighbor that sets a single-bit flag reduces to encoding a log(dk)-bit address, which can be done in O(log^2(dk)) = O(1) bit-steps. Recalling that the three phases of the path-establishing algorithm require 4(2 + log_4(N/5)) steps, the total bit-level time complexity of this algorithm will be ≈ 4⌈log(dk)⌉^2 (2 + log_4(N/t)) = 72 log N + 120.82 when d = 9, k = 4 and t = 5.
Concluding Remarks
While these results are rewarding, it will be worthwhile to further reduce the constant 352.8 in the crosspoint expression of the nonblocking network described in the paper. This would require new constructions of expanders with lower densities and larger expansion coefficients. Another direction for further research is to extend the path selection algorithm of this paper to handle multiple connection requests. Such requests can be handled by iteratively applying the algorithm given in this paper, but this is likely to lead to excessive routing time when the number of requests gets very large. These problems and other related questions will be dealt with in detail elsewhere.

