Abstract-The load-balanced switch architecture is a promising way to scale router capacity. We explained in [1] how it can be used to build a 100Tb/s router with no centralized scheduler, no memory operating faster than the line-rate, no packet mis-sequencing, a 100% throughput guarantee for all traffic patterns, and an optical switch fabric that simply spreads traffic evenly among linecards. This switch fabric uses optical MEMS switches that are reconfigured only when linecards are added and deleted, allowing the router to function when any subset of linecards is present and working.
I. INTRODUCTION
Our goal is to identify router architectures with predictable throughput and scalable capacity. At the same time, we would like to identify architectures in which optical technology (for example optical switches and wavelength division multiplexing) can be used inside the router to increase capacity by reducing power consumption.
In a previous paper [1] we explained how to build a 100Tb/s Internet router with a single-rack switch fabric built from essentially zero-power passive optics, but without sacrificing throughput guarantees. Compared to routers available today, this is approximately 40 times more switching capacity than can be put in a single rack, with throughput guarantees that no commercial router can match today. The key to the scalability is the use of the load-balanced switch, first described by C-S. Chang et al. in [3]. In [1] we extended the basic architecture so that it has provably 100% throughput for any traffic pattern, and doesn't mis-sequence packets. It is scalable, has no central scheduler, is amenable to optics, and can simplify the switch fabric by replacing a frequently scheduled and reconfigured switch with a single, fixed, passive mesh of WDM channels.
Unfortunately, as mentioned in [2], the number of linecards present keeps changing: linecards are added as the network grows, and removed as they fail. The load-balanced switch works by uniformly spreading packets over all linecards, and therefore needs to be aware of which linecards are present and which are not. If some linecards are missing, the switch fabric must be able to schedule the traffic uniformly over the linecards present. In [1] we described a hybrid electro-optical architecture that solves this problem, and will operate with any subset of linecards.
[2] describes an algorithm to configure the switch fabric, and proves that it will always find a valid configuration in polynomial time.
0-7803-8686-8/04/$20.00 ©2004 IEEE
Upon linecard failure we require a restoration time below 50ms in order to provide a fast recovery [4], [5], [6], [7]. However, the polynomial-time algorithm we described previously took over 50 seconds to run. A simple conversion of the software algorithm to hardware would be too slow by at least an order of magnitude because the algorithm is extremely memory-intensive. The goal of this paper is to show that a suitably modified hardware implementation can keep the reconfiguration time below 50ms. The polynomial-time algorithm requires many repetitions of two graph matching algorithms. The first finds the maximum flow in a graph, which is commonly realized using the Ford-Fulkerson [8] algorithm. The second algorithm decomposes a matrix into a minimal number of permutations, which is commonly solved using a Birkhoff-von Neumann [9], [10] decomposition.
Both the Ford-Fulkerson and the Birkhoff-von Neumann algorithms require a large number of memory accesses in order to find matches. Therefore, in order to speed up the running time, we adapt the original algorithms to minimize memory accesses. First we modify the Ford-Fulkerson algorithm to work specifically for bipartite matches. Based on the binary matrix structure specific to our problem, we can then utilize bit-manipulation schemes to reduce the time required to search for new matches.
Second, in order to decompose a matrix into permutations, the Birkhoff-von Neumann decomposition repeatedly finds a permutation using either a maximum size match or a simplified Ford-Fulkerson. By using the Slepian-Duguid algorithm instead, we find all the permutations at the same time. This reduces the number of iterations to one, and therefore the number of pre-processing steps linked with each iteration. In addition, we provide a simple mechanism to search for the matrix elements not yet assigned to a permutation.
Finally, the experimental results show that it is possible to achieve the 50ms target for our 100Tb/s router consisting of up to 640 linecards.
Here is an outline of this paper. Section II provides an overview of the algorithm used to configure the switch fabric of the 100Tb/s router. Then, Sections III and IV respectively present the details of the modified Ford-Fulkerson algorithm and the Slepian-Duguid algorithm. These sections describe how these algorithms are memory-intensive, and how bit manipulation schemes can drastically reduce the number of memory accesses. Finally, in Section V, the simulation results show how the reduction of memory accesses makes the 50ms target feasible.
II. OVERVIEW OF CONFIGURATION ALGORITHM
Although the configuration algorithm is described fully in [2] (and we assume the reader is familiar with both references [1], [2]), we give a brief reminder of the algorithm here.
As explained in [2], there are G groups; group i contains $L_i$ linecards, and the total number of linecards is:
$$N = \sum_{i=1}^{G} L_i$$
We will assume that $L_1, L_2, \ldots, L_G$ are fixed for a given linecard arrangement. Our objective is to create a schedule where linecards spread packets evenly across all other linecards. Therefore, during every frame of N time-slots each sending linecard needs to be connected exactly once to each of the N receiving linecards and vice-versa. This is the classical time-slot assignment problem, known as a Latin square when rates are equal. However, the main difference is an additional constraint which arises from the use of MEMS switches in the switch fabric architecture. Within each time-slot, the rate from each transmitting group of linecards to each receiving group of linecards is limited. Therefore, it is possible that two different linecards in a transmitting group cannot simultaneously send to two different linecards in a receiving group.
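To make the unconstrained version of this requirement concrete, here is a minimal Python sketch. The function names and the cyclic-shift construction are illustrative assumptions, not the algorithm of [2]; in particular the sketch ignores the MEMS group-rate constraint described above. It only shows what "each sender is connected exactly once to each receiver per frame" means as a Latin square.

```python
def cyclic_schedule(n):
    """Hypothetical helper: schedule[t][s] is the receiver that
    sender s is connected to during time-slot t of the frame."""
    return [[(s + t) % n for s in range(n)] for t in range(n)]


def is_latin_square(schedule):
    """Check that every time-slot is a permutation of the receivers (rows)
    and that every sender meets every receiver exactly once (columns)."""
    n = len(schedule)
    full = set(range(n))
    rows_ok = all(set(row) == full for row in schedule)
    cols_ok = all({schedule[t][s] for t in range(n)} == full
                  for s in range(n))
    return rows_ok and cols_ok


frame = cyclic_schedule(4)
print(frame[1])                # connections in time-slot 1
print(is_latin_square(frame))
```

A schedule for the actual switch fabric would additionally have to respect the per-group limits within each time-slot, which is what the three-stage construction of [2] provides.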
An algorithm for constructing the schedule was proposed in [2]. The algorithm constructs three consecutive schedules. First, it creates a schedule between sending groups and receiving groups by repeatedly solving the connection assignment problem defined in Section III. Second, the algorithm creates a schedule between sending linecards and receiving groups. Third, the algorithm creates the final schedule between sending linecards and receiving linecards. These last two steps repeatedly decompose matrices into a minimal number of permutations as defined in Section IV.
In the next two sections, we will formally define the connection assignment problem and the matrix decomposition problem, and then show how they can be efficiently solved in hardware.
III. CONNECTION ASSIGNMENT PROBLEM

A. Problem Definition
The configuration algorithm of the load-balanced switch needs to solve the following connection assignment problem. Consider 2G nodes separated into G left nodes and G right nodes. The left nodes are connected to the right nodes using a 0-1 capacity matrix C of size G x G. The rows of the capacity matrix correspond to the left nodes, and the columns to the right nodes. We want to find a 0-1 connection matrix R that is below capacity and satisfies a target number of connections per node. $RL_i$ represents the target number of connections needed for left node i, and $RR_j$ similarly represents the target number of connections needed for right node j. Table I shows an example of the connection assignment problem with G = 3. For instance, the first left node in this table needs to make two connections, as specified in the first element of RL. Similarly, the second right node needs to make one connection, as shown in the second element of RR. Therefore, the 0-1 solution matrix R has two nonzero elements in its first row, and one nonzero element in its second column.
Put mathematically, we want to solve the following problem.
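Find a 0-1 matrix $R \le C$ such that:
$$\sum_{j} R_{ij} = RL_i \ \text{for all } i, \qquad \sum_{i} R_{ij} = RR_j \ \text{for all } j$$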
Note that the solution is not necessarily unique, and that for the load-balanced switch configuration the capacity matrix will always be sufficiently large to guarantee the existence of a solution [2].
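The problem statement can be read as a simple validity check. The sketch below is a hedged illustration in Python: the function name is hypothetical, and the matrices are an assumed G = 3 instance chosen to agree with the prose description of the example (they are not copied from Table I).

```python
def is_valid_assignment(C, R, RL, RR):
    """Return True if the 0-1 matrix R is below the capacity matrix C and
    meets the row targets RL and the column targets RR."""
    G = len(C)
    below_capacity = all(R[i][j] in (0, 1) and R[i][j] <= C[i][j]
                         for i in range(G) for j in range(G))
    rows_ok = all(sum(R[i]) == RL[i] for i in range(G))
    cols_ok = all(sum(R[i][j] for i in range(G)) == RR[j]
                  for j in range(G))
    return below_capacity and rows_ok and cols_ok


# Assumed example instance (hypothetical values, G = 3).
C = [[1, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
RL = [2, 1, 1]
RR = [2, 1, 1]

R = [[1, 0, 1],
     [1, 0, 0],
     [0, 1, 0]]
print(is_valid_assignment(C, R, RL, RR))
```

Because the solution is not unique, a checker of this kind accepts any matrix that satisfies the constraints rather than one specific answer.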
B. Earlier Work
Given a capacity matrix, it was shown in […]
Our goal is to implement the algorithm in hardware, and reduce its runtime.
[…] search from the left nodes to the right nodes, not from the source to the sink. In addition, we do not allow connections from the right nodes to the left nodes.
Our modified Ford-Fulkerson algorithm can be subdivided into two separate parts. The first part uses a greedy approach to make connections between nodes. The second part uses back tracing to find the remaining connections. Let's first explain the greedy part of the algorithm.
Greedy Algorithm
For each left node, the greedy algorithm keeps adding as many temporary connections as possible to the right nodes. A connection can be added if and only if it exists in the binary capacity matrix and the target numbers of left and right connections are not exceeded. Table II shows the matrix P of temporary connections after the greedy algorithm is applied. RL' and RR' represent the remaining target number of connections to be made for the left and right nodes. Notice that after the greedy algorithm is applied, L1 is connected to R1 and R2, and L2 is connected to R1. Therefore, the target numbers of connections for L1, L2, R1, and R2 are met. The only targets not yet met are those of L3 and R3, as seen in RL' and RR'. The greedy algorithm cannot connect L3 to R3 since the only connections available in the capacity matrix from L3 are to R1 and R2.
After the greedy algorithm is applied, the remaining target connections specified by RL' and RR' are made through the back tracing algorithm. The C' matrix specifies the connections not used by the greedy algorithm (C' = C - P). Figure 2 illustrates in our example how the back tracing is done using a simplified version of the BFS algorithm.
Back Tracing Algorithm
Initially the greedy algorithm finds the connections made in the P matrix, shown by thin solid lines in Figure 2a. These edges are the temporary connections currently made. The connections in the C' matrix are shown by the dashed lines. In our example L3 has no connection to R3, but has connections to R1 and R2. This is where the back tracing algorithm starts. Either R1 or R2 can be traced back as shown in Figure 2b. This back tracing algorithm is repeated for all other nodes that do not achieve their targets.
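The two parts can be sketched in software as follows. This is a plain Python illustration under stated assumptions, not the hardware design: the function names are hypothetical, the back tracing is written as a standard BFS augmenting-path search on the bipartite graph, and the G = 3 instance is an assumed one chosen to reproduce the behaviour described in the text (L1 takes R1 and R2, L2 takes R1, and L3 is left without a connection after the greedy pass).

```python
from collections import deque


def greedy(C, RL, RR):
    """Greedy part: add every connection allowed by C while targets remain."""
    G = len(C)
    P = [[0] * G for _ in range(G)]
    rl, rr = list(RL), list(RR)
    for i in range(G):
        for j in range(G):
            if C[i][j] and rl[i] > 0 and rr[j] > 0:
                P[i][j] = 1
                rl[i] -= 1
                rr[j] -= 1
    return P, rl, rr


def back_trace(C, P, rl, rr):
    """Back tracing part: for each left node still below target, search
    breadth-first over unused edges (C' = C - P) and trace back over used
    edges until a right node with a remaining target is reached."""
    G = len(C)
    for start in range(G):
        while rl[start] > 0:
            pred_right = {}            # right j -> left node that reached it
            pred_left = {start: None}  # left i -> right node it came from
            queue, end = deque([start]), None
            while queue and end is None:
                i = queue.popleft()
                for j in range(G):
                    if C[i][j] and not P[i][j] and j not in pred_right:
                        pred_right[j] = i
                        if rr[j] > 0:
                            end = j
                            break
                        for k in range(G):  # trace back over used edges
                            if P[k][j] and k not in pred_left:
                                pred_left[k] = j
                                queue.append(k)
            if end is None:
                raise ValueError("no valid assignment exists")
            j = end
            while j is not None:       # flip the edges along the trace
                i = pred_right[j]
                P[i][j] = 1
                j = pred_left[i]
                if j is not None:
                    P[i][j] = 0
            rl[start] -= 1
            rr[end] -= 1
    return P


# Assumed example instance (hypothetical values, G = 3).
C = [[1, 1, 1], [1, 0, 1], [1, 1, 0]]
RL, RR = [2, 1, 1], [2, 1, 1]
P, rl, rr = greedy(C, RL, RR)
print(P, rl, rr)                 # state after the greedy part
R = back_trace(C, P, rl, rr)
print(R)                         # final connection matrix
```

In this instance the greedy part leaves only L3 and R3 unmet, and one trace from L3 through R1 and L1 to R3 completes the assignment.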
Implementation
Memories are used to keep track of the following elements. Throughout both the greedy and back tracing algorithms, we store the current capacity matrix C', which keeps track of the temporary connections made, and the remaining number of connections needed to be made to each node. In addition, in the back tracing algorithm, a predecessor memory is needed to remember the trace.
Let's see why the greedy algorithm is memory-intensive. In the greedy algorithm, when connections are added, the algorithm must search for the next available connection to a right node. If there are G right nodes, this could require up to G memory accesses per left node, and therefore a total of up to $G^2$ memory accesses.
We use bit manipulation schemes to reduce the number of memory accesses in the greedy algorithm. We first arrange the row of the current capacity matrix C' associated with a left node as a bitmap of size G. We similarly represent the RR' array as a bitmap of size G, where bit j is set if the corresponding $RR'_j$ is positive. Then, a logical AND between these two bitmaps gives a bitmap representation of the available connections. We can then find the next available connection by finding the first set bit in the resulting bitmap. This can be done in a single clock cycle by using a priority encoder. Therefore, by reusing the resulting bitmap, we can reduce the total number of memory accesses by a factor of up to G.
Now let's consider why the back tracing algorithm uses many memory accesses. In back tracing we need to keep track of the trace. Since we are using a BFS-based back tracing, each step of the search might require adding up to G nodes to the predecessor memory. For instance, in one search step of a left node, up to G right nodes can be considered.
In our implementation we arrange the predecessor memory as a binary matrix of size G x G. We implement this matrix by using a memory structure that allows a memory write to an entire row, and a memory read of an entire column. In a search step of the BFS algorithm, instead of writing each node individually, we write the entire set of available nodes in parallel to the entire row. Then, after a trace is done, in order to find the predecessor of a node in the trace, we use an encoder on the entire column of the memory read to find the position of the single bit set in the column. This position corresponds to the index of the predecessor. Using this bit manipulation scheme, we can reduce each search step to a single memory access, and thus reduce the total number of memory accesses by a factor of up to G. In summary, in both the greedy and back tracing algorithms, we can reduce the total number of memory accesses by a factor of up to G by using bit manipulation schemes and encoders.
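The bitmap search used in the greedy part can be mimicked in software. The following Python sketch is illustrative only: Python integers stand in for the G-bit words, `first_set_bit` stands in for the priority encoder, and the function names and example values are assumptions rather than part of the hardware design.

```python
def first_set_bit(bitmap):
    """Software stand-in for a priority encoder: index of the lowest
    set bit, or -1 if the bitmap is empty."""
    if bitmap == 0:
        return -1
    return (bitmap & -bitmap).bit_length() - 1


def greedy_row(cap_row, rr, target):
    """Connect one left node. cap_row is the bitmap of its row of C'
    (bit j set if the connection to right node j exists), rr is the list
    of remaining right-node targets RR', target is the left node's target."""
    rr_bitmap = 0
    for j, remaining in enumerate(rr):
        if remaining > 0:
            rr_bitmap |= 1 << j
    available = cap_row & rr_bitmap      # one AND gives all candidates
    chosen = []
    while target > 0 and available:
        j = first_set_bit(available)
        chosen.append(j)
        available &= available - 1       # clear that bit, reuse the bitmap
        rr[j] -= 1
        target -= 1
    return chosen


# Assumed rows of a hypothetical G = 3 capacity matrix (bit 0 = R1).
rr = [2, 1, 1]
print(greedy_row(0b111, rr, 2))   # first left node
print(greedy_row(0b101, rr, 1))   # second left node
print(greedy_row(0b011, rr, 1))   # third left node
```

Once the AND has been computed, each further connection for the same left node costs one encoder lookup on the reused bitmap instead of a fresh scan over the right nodes.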
IV. MATRIX DECOMPOSITION PROBLEM
The configuration algorithm of the load-balanced switch needs to repeatedly decompose matrices into a minimal number of permutations. In this section we'll describe the Birkhoff-von Neumann solution, explain why it is memory-intensive, and then explain why the Slepian-Duguid algorithm leads to a more efficient implementation.
A. Problem Definition
Assume that we are given a 0-1 square matrix S and a positive integer n satisfying:
$$\sum_{j} S_{ij} = n \ \text{for all } i, \qquad \sum_{i} S_{ij} = n \ \text{for all } j, \qquad S_{ij} \in \{0, 1\} \ \text{for all } i, j$$
We want to decompose S into n permutation matrices, i.e. find n permutation matrices $\{P^k\}_{1 \le k \le n}$ such that:
$$\sum_{j} P^k_{ij} = 1 \ \text{for all } i, k, \qquad \sum_{i} P^k_{ij} = 1 \ \text{for all } j, k, \qquad \sum_{k} P^k_{ij} = S_{ij} \ \text{for all } i, j$$
Note that although the decomposition is not necessarily unique, it always exists because the edge-chromatic number of a bipartite graph is equal to its maximum degree. Note that in the load-balanced switch example, the binary matrices S could be of a size up to 640 x 640, and they are typically sparse, having a maximum of 16 ones in each row and each column.
B. Earlier Work
[…] none of the earlier approaches are fast enough for us because they all require at least one occurrence of the maximum size matching algorithm.
C. The Slepian-Duguid Algorithm
Instead we use the Slepian-Duguid [16] algorithm designed for scheduling calls in a circuit switch.
First, to reduce the size of the memory, we use the sparsity of the matrices in the load-balanced switch example. In particular, the ones of the binary matrix S are represented as a list of (row,column) pairs.
Then, to reduce the number of memory accesses, we apply an algorithm based on Slepian-Duguid. This algorithm attempts to produce n permutation matrices at once, and uses the (row,column) pair list structure. The initial part of our algorithm uses a greedy scheme to assign the easily-matched elements, and the second part uses the Slepian-Duguid algorithm to reassign these elements and provide a solution.
Greedy Algorithm
The n permutations are also organized in a sparse manner, i.e. by using a (row,column) pair list structure. For clarity, we will refer to rows as inputs and to columns as outputs. Each permutation is arranged as an array of outputs. For instance, the i-th element in the array refers to the output that is matched with input i. Note that a valid permutation will not match more than one output to the same input. Since we want to find n permutations, we maintain n such arrays. We arrange these arrays into a matrix A where each row corresponds to a different permutation and therefore to a different array.
In the greedy algorithm, the matrix A is initially empty.
Then, the algorithm goes through the list of (input,output) pairs, denoted (i, o), and tries to assign each such pair to a permutation for which input i and output o are both unassigned.
This continues until no more (i, o) pairs can be assigned. […] the (2,1) pair is assigned in the 4-th step. In A4, the (2,1) pair can only be assigned to the second or third permutation since output 1 is already scheduled in the first permutation.
Let's assume that it is assigned to the second permutation. It is possible that an (i, o) pair cannot be assigned. For instance, in A8, the (3,5) pair cannot be assigned since the only permutation free for input 3 is the third permutation, and output 5 is already assigned in the third permutation.
A15 in Table III shows the final state of the permutations after the greedy algorithm. Notice that the (3,5) and (4,3) pairs are not yet assigned and need to be assigned in the Slepian-Duguid algorithm.
Slepian-Duguid Algorithm
For each (input,output) pair $(i_1, o_1)$ that is left unassigned, the algorithm works as follows:
1) Find permutations $P_{i_1}$ and $P_{o_1}$ such that input $i_1$ is not assigned in $P_{i_1}$, and output $o_1$ is not assigned in $P_{o_1}$.
2) Swap the input $i_1$ with $i_2$, where $i_2$ is an input such that $(i_2, o_1)$ was already assigned in $P_{i_1}$. Now we need to track $(i_2, o_1)$.
3) Swap the output $o_1$ with $o_2$, where $o_2$ is an output such that $(i_2, o_2)$ was already assigned in $P_{o_1}$. Now we need to track $(i_2, o_2)$.
Repeat steps 2 and 3 until we have an unassigned slot for $(i_n, o_n)$ in either of the permutations $P_{i_1}$ or $P_{o_1}$.
The $B_i$ matrices are similar to the $A_i$ matrices. […] The algorithm repeats the procedure of swapping until no permutation has multiple outputs assigned. B3 shows the final resulting matrix.
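The greedy pass and the swapping steps above can be sketched together in Python. This is an illustrative software version under stated assumptions, not the hardware implementation: the function name, the 0-indexed pairs, and the small 3 x 3 input with n = 2 are hypothetical (they are not the example of Table III), and the swap is written as moving the alternating chain of pairs between the two permutations found in step 1.

```python
def decompose(pairs, size, n):
    """Assign every (input, output) pair of a 0-1 matrix with n ones per
    row and column to one of n permutations. Returns out_of, where
    out_of[k][i] is the output matched with input i in permutation k."""
    out_of = [[None] * size for _ in range(n)]
    in_of = [[None] * size for _ in range(n)]   # inverse lookup per output

    def assign(k, i, o):
        out_of[k][i], in_of[k][o] = o, i

    def release(k, i, o):
        out_of[k][i], in_of[k][o] = None, None

    # Greedy part: take the first permutation where both ends are free.
    unassigned = []
    for i, o in pairs:
        for k in range(n):
            if out_of[k][i] is None and in_of[k][o] is None:
                assign(k, i, o)
                break
        else:
            unassigned.append((i, o))

    # Swapping part for each pair the greedy pass could not place.
    for i1, o1 in unassigned:
        a = next(k for k in range(n) if out_of[k][i1] is None)  # i1 free
        b = next(k for k in range(n) if in_of[k][o1] is None)   # o1 free
        from_a, from_b = [], []   # chain of pairs to move between a and b
        o = o1
        while True:
            i = in_of[a][o]       # pair in a that blocks output o (step 2)
            if i is None:
                break
            from_a.append((i, o))
            o = out_of[b][i]      # pair in b that blocks input i (step 3)
            if o is None:
                break
            from_b.append((i, o))
        for i, o in from_a:
            release(a, i, o)
        for i, o in from_b:
            release(b, i, o)
        for i, o in from_a:
            assign(b, i, o)
        for i, o in from_b:
            assign(a, i, o)
        assign(a, i1, o1)         # the slot in permutation a is now free
    return out_of


# Hypothetical input: a 3 x 3 matrix with two ones per row and column.
pairs = [(0, 0), (2, 2), (1, 0), (1, 2), (2, 1), (0, 1)]
print(decompose(pairs, size=3, n=2))
```

With this pair order the greedy pass leaves two pairs unassigned, and the swapping part then places them without running any maximum size matching.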
Implementation
Memories are used to keep track of which inputs and outputs are unassigned after each permutation. For each input i, we store the n permutations in a bitmap of size n. If input i is not assigned in a given permutation, we set the bit corresponding to this permutation in the bitmap. The same is done for each output. Then, in the greedy algorithm, an (input,output) pair can easily find a free permutation by taking a logical AND between the input and output bitmaps. By finding the first set bit in the resulting bitmap, the (input,output) pair can be matched to a free permutation in a single clock cycle. Therefore, by reusing the resulting bitmap, we can reduce the total number of memory accesses by a factor of up to n.
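As a rough software analogue of this bitmap scheme, the sketch below keeps one n-bit free-permutation bitmap per input and per output and finds a free permutation with a single AND followed by a first-set-bit lookup. The names, the 0-indexed pairs and the small input are assumptions made for illustration.

```python
def first_set_bit(bitmap):
    """Software stand-in for a priority encoder: index of the lowest
    set bit, or -1 if the bitmap is empty."""
    if bitmap == 0:
        return -1
    return (bitmap & -bitmap).bit_length() - 1


def greedy_assign(pairs, size, n):
    """Greedy pass using free-permutation bitmaps. Returns a dict mapping
    each (input, output) pair to a permutation index, or to None if the
    pair must be left for the swapping phase."""
    all_free = (1 << n) - 1
    in_free = [all_free] * size    # bit k set: input is free in permutation k
    out_free = [all_free] * size   # bit k set: output is free in permutation k
    assignment = {}
    for i, o in pairs:
        k = first_set_bit(in_free[i] & out_free[o])
        if k >= 0:
            in_free[i] &= ~(1 << k)
            out_free[o] &= ~(1 << k)
            assignment[(i, o)] = k
        else:
            assignment[(i, o)] = None
    return assignment


# Hypothetical input: a 3 x 3 matrix with two ones per row and column.
pairs = [(0, 0), (2, 2), (1, 0), (1, 2), (2, 1), (0, 1)]
print(greedy_assign(pairs, size=3, n=2))
```

Each pair costs one read of two bitmaps and one encoder lookup, instead of probing the n permutations one at a time.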
V. RESULTS
The algorithms have been implemented in hardware and the results are presented here.
A. Synthesis
The hardware implementations mentioned in the previous section are implemented using Verilog and synthesized using a 0.13 µm process. In this implementation for 40 groups, up to 640 linecards and a maximum of 16 linecards per group, the modified Ford-Fulkerson algorithm uses 10K gates and 24Kbits of memory, whereas the core for the Slepian-Duguid algorithm uses 25K gates and 230Kbits of memory. Based on the synthesis results, both cores ran within a 4ns clock cycle time.
B. Simulations
Because of the number of experiments we wanted to run, and the complexity of the algorithm, running our Verilog implementation was too slow. Instead, we developed a cycle-accurate C-model and verified its accuracy by comparing it with the Verilog implementation. We then used the C-model to […]
A processor is assumed to be connected to the cores, uploading and downloading the necessary information into the memories. The times to transfer the initial matrices and to obtain the final results to/from the processor are not considered in the results. We compare the simple conversion to hardware of existing algorithms with our implementation using the total number of memory accesses. We assume that memories can be accessed in a pipelined manner and that each memory access requires one clock cycle. We measure the number of hardware clock cycles needed. Figure 3 shows the time spent for the simple conversion and our implementation assuming a 4 ns clock cycle time. The graph plots the largest number of clock cycles needed in any of the tests we ran. We use a logarithmic scale so as to represent both plots on the same graph. Since the algorithms are polynomial, the plots appear logarithmic. Note that we did not attempt to pipeline the Verilog implementation and believe that we can reduce the time by an additional factor of at least two. Even without complete pipelining, our results show that our implementation meets the 50ms target over the range of linecards needed in [1].
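For orientation, the 50ms target can be restated as a clock-cycle budget under the 4 ns cycle time assumed above. The following short calculation is only arithmetic on the figures quoted in this paper; it is not an additional measurement.

```latex
\[
\frac{50\ \text{ms}}{4\ \text{ns}}
  = \frac{50 \times 10^{-3}\ \text{s}}{4 \times 10^{-9}\ \text{s}}
  = 1.25 \times 10^{7}\ \text{clock cycles}
\]
\[
\frac{50\ \text{s}}{50\ \text{ms}} = 10^{3}
\]
```

The first line is the number of clock cycles available for a complete reconfiguration. The second is the minimum speedup required relative to the software algorithm, given that its running time was reported as over 50 seconds.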
VI. CONCLUSIONS
This paper implements the configuration algorithms of the load-balanced switch introduced in [2]. The implementation meets the 50ms recovery time imposed by network operators.
Our hardware implementation relies on bitmap manipulation schemes with priority encoders to drastically reduce the memory-intensive operations. Further improvements can be achieved by pipelining, using multiport memories and by exploiting some of the parallelism in the greedy parts of the algorithms. We believe that these schemes can be generalized to accelerate hardware implementations of other graph coloring algorithms.
