Benes Switching Fabrics with O(N)-Complexity Internal Backpressure
ABSTRACT
Multistage buffered switching fabrics are the most efficient method for scaling packet switches to very large numbers of ports. The Benes network is the lowest-cost switching fabric known to yield operation free of internal blocking. Backpressure inside a switching fabric can limit the use of expensive off-chip buffer memory to just virtual-output queues in front of the input stage. This article extends the known credit-based flow control (backpressure) architectures to the Benes network. To achieve this, we had to successfully combine per-flow backpressure, multipath routing (inverse multiplexing), and cell resequencing. We present a flow merging scheme that is needed to bring the cost of backpressure down to O(N) per switching element, and for which we have proved freedom from deadlock for a wide class of multipath cell distribution algorithms. Using a cell-time-accurate simulator, we verify operation free of internal blocking, evaluate various cell distribution and resequencing methods, compare performance to that of ideal output queuing, the iSLIP crossbar scheduling algorithm, and adaptive and randomized routing, and show that the delay of well-behaved flows remains unaffected by the presence of congested traffic to oversubscribed output ports.
INTRODUCTION
Switches, and the routers that use them, are the basic building blocks for constructing high-speed networks that employ point-to-point links. As the demand for network throughput keeps climbing, switches are needed with both faster and more ports. This article concerns switch scalability when the number of ports increases. For low to modest numbers of ports (up to about 64), the crossbar is the switch topology of choice, due to its simplicity and nonblocking operation. However, its cost grows with N², where N is the number of ports, which makes it very expensive for large N. Additionally, crossbar scheduling is a hard problem, and gets harder with increasing N.
For switches with hundreds or thousands of ports, multistage switching fabric architectures are needed, whose cost growth rate is less than quadratic. Researchers have investigated such scalable fabric topologies since the days of electromechanical telephony. The banyan network features low cost, N ⋅ log N, and a rich set of paths. Although it can support full egress link utilization under uniformly destined traffic, as well as a number of other specific traffic patterns, it does suffer from internal blocking: not all feasible rates λ_{i,j} can be routed through it. The lowest-cost N × N network that is free of internal blocking is the Benes network, whose cost is N ⋅ 2 log N. The Benes network is rearrangeably nonblocking; that is, when each connection is routed through a single path, setting up new connections may require rerouting of existing connections; however, using multipath routing, this disadvantage can be eliminated. This article concerns the Benes network.
If a multistage switching fabric contains no buffer storage, there must exist a mechanism to handle the cell routing conflicts that arise in internal paths due to the routing algorithm and due to output conflicts. The former conflicts can be handled in a distributed manner (self-routing fabrics) using Batcher sorting networks [1, 2]. The latter conflicts, cells destined to the same output at the same time, must be avoided at the inputs or tolerated in the fabric. Avoidance at the inputs is equivalent to crossbar scheduling and requires global coordination; hence, it is unrealistic for large fabrics. To tolerate output conflicts in the fabric, designers have used recirculation of cells or multiple paths to each output buffer. All these mechanisms use a large number of stages and paths per stage: the switching fabric cost is O(N ⋅ log² N), and the constant in front of the actual cost is significant. In essence, these techniques spend (expensive) communication resources in order to economize on (inexpensive) storage resources, which is the wrong trade-off in modern very large-scale integration (VLSI) technology.
It is preferable for the switching fabric to contain internal buffer storage, in order to buffer conflicting cells until the conflict goes away. Such internal storage may be small enough to fit inside the switching element chips, or large enough to replace the buffer space typically found on the ingress line cards (usually hundreds of megabytes), thus requiring off-chip DRAM. In the former case, backpressure is used to prevent small buffers from overflowing; effectively, the majority of the buffered cells are pushed back onto the ingress line cards, as in the usual case of virtual output queues (VOQs) on the input side. Given that ingress lines are much fewer than intrafabric links, this architecture results in significant cost savings relative to the off-chip DRAM case for intrafabric buffers, as shown by the ATLAS I switch evaluation [3]. Several commercial chip sets use backpressure in the ingress-switch-egress connection chain [4, 5]. This article concerns the application of this advantageous internal backpressure architecture to the Benes network, the lowest-cost scalable switching fabric.
In this work we extend the backpressure architecture from single-path fabrics (like banyans) to multipath topologies, and specifically to the Benes network; we first presented this extension in [6]. In this article, we first review the requirements for the Benes fabric to operate free of internal blocking, and the operation of the backpressure protocol at per-flow granularity, as required to eliminate head-of-line blocking effects. Then we present the flow merging techniques that are needed when combining these two requirements, in order to reduce the complexity of the switching elements in the middle stages of the Benes fabric from O(N²) down to O(N). Multipath cell distribution interacts with flow merging, and they both interact with the organization and placement of buffers; we show which organization is preferable, and we refer to its deadlock-free nature. We consider and validate through simulations the pros and cons of our architecture relative to previous systems with randomized or adaptive routing schemes for the Benes fabric; we also compare with an ideal output queuing switch architecture, and with input queuing using an iSLIP scheduler. Finally, we present our conclusions.
THE BENES FABRIC
This section reviews the two foundations of our design: the Benes fabric and internal backpressure in switches.
NONBLOCKING OPERATION
The Benes network can be constructed recursively, using inverse multiplexing [7, 8], as shown in Fig. 1. The N × N Benes network consists of two N/2 × N/2 Benes subnetworks, N/2 switches of size 2 × 2 connected to the inputs of the two subnetworks, and N/2 switches of size 2 × 2 connected to the outputs of the two subnetworks.
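The recursive construction above can be turned into a short counting sketch. The function names are hypothetical, and the code assumes that N is a power of two and that the recursion bottoms out at a single 2 × 2 switch; it only counts switching elements and stages, and says nothing about buffering or backpressure.

```python
def _check_power_of_two(n: int) -> None:
    """Reject sizes the recursive 2 x 2 construction does not cover."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")


def benes_elements(n: int) -> int:
    """Number of 2 x 2 switching elements in an n x n Benes network."""
    _check_power_of_two(n)
    if n == 2:
        return 1  # assumed base case: a single 2 x 2 switch
    # n/2 input switches + n/2 output switches + two (n/2 x n/2) subnetworks
    return n + 2 * benes_elements(n // 2)


def benes_stages(n: int) -> int:
    """Number of stages a cell crosses in an n x n Benes network."""
    _check_power_of_two(n)
    if n == 2:
        return 1
    # one input stage + one output stage around the subnetworks
    return 2 + benes_stages(n // 2)


if __name__ == "__main__":
    for size in (8, 64, 1024):
        print(size, benes_stages(size), benes_elements(size), size * size)
```

The last printed column is the crossbar crosspoint count N², included only to contrast the two growth rates.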
Let λ_{i,j} denote the traffic entering the network from input i and destined to output j. In order for the N × N network to be nonblocking, the 2 × 2 switch connected to input i must distribute λ_{i,j} equally among its two outputs. The output switch that feeds output j receives λ_{i,j}/2 on each of its inputs, reconstructs λ_{i,j}, and routes it to the appropriate output. If the overall traffic is feasible, that is, if no input and no output is oversubscribed (the rates λ_{i,j} sum to at most 1 over all j for every input i, and to at most 1 over all i for every output j), then the traffic entering and leaving each N/2 × N/2 subnetwork will also be feasible. Specifically, input k of either subnetwork will be receiving
\[ \sum_{j} \frac{\lambda_{2k,j}}{2} + \sum_{j} \frac{\lambda_{2k+1,j}}{2} \]
which is ≤ 1/2 + 1/2 = 1 because of the above feasibility of the overall traffic. Symmetrically, the same holds for the load of each output of either subnetwork. Thus, it follows by recursion that the overall N × N network will also be internally nonblocking. The resulting topology, for N = 8, is also shown in Fig. 1. Traffic λ_{i,j} goes through log N stages of distribution and log N corresponding stages of reconstruction. The figure also shows that an N × N Benes network can be constructed by placing two banyan networks back to back. The two banyans are called the distribution and routing network, respectively [9], since the first distributes incoming traffic over the N links in the middle of the network (a virtual "wide" link of aggregate throughput N) and the second routes cells to the proper output link. The Benes topology can be generalized to use switching elements of valency (number of ports) higher than 2 × 2.
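The feasibility argument can be checked numerically with a small sketch. It assumes that the rate matrix is a square list of lists with rates normalized to the link capacity, that input switch k serves inputs 2k and 2k+1, and that output switch m serves outputs 2m and 2m+1; the function names are hypothetical.

```python
def is_feasible(rates, eps=1e-9):
    """True if no input (row) and no output (column) is loaded above 1."""
    n = len(rates)
    rows_ok = all(sum(row) <= 1 + eps for row in rates)
    cols_ok = all(sum(rates[i][j] for i in range(n)) <= 1 + eps
                  for j in range(n))
    return rows_ok and cols_ok


def subnetwork_rates(rates):
    """Rate matrix offered to either N/2 x N/2 subnetwork after each
    2 x 2 input switch splits every flow equally between its two outputs."""
    n = len(rates)
    half = n // 2
    sub = [[0.0] * half for _ in range(half)]
    for i in range(n):
        for j in range(n):
            # input i enters through switch i // 2; output j is fed by
            # switch j // 2; each subnetwork carries half of the flow
            sub[i // 2][j // 2] += rates[i][j] / 2
    return sub


if __name__ == "__main__":
    n = 8
    uniform = [[1.0 / n] * n for _ in range(n)]  # every input fully loaded
    print(is_feasible(uniform), is_feasible(subnetwork_rates(uniform)))
```

Applying `subnetwork_rates` repeatedly mirrors the recursion in the text: each level halves the matrix size while keeping every row and column sum at or below 1.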
Nonblocking operation as above is based on (repeated) inverse multiplexing or load distribution in a balanced manner. A "poor man's" method for load distribution is to send all packets of half the microflows through one path, and all packets of the other half through the other path (e.g., using a pseudo-random hash function of the source-destination IP address pair to decide the path). This ensures that all packets of a given microflow follow the same route, and hence arrive in order. The disadvantage of this method is that load distribution may not be balanced, especially where the number of microflows is limited. Imbalanced load distribution will result in internal blocking in the Benes fabric; thus, we do not use this method. At the other end of the spectrum is a method for exact load distribution that resembles the bit-sliced processors of the '70s. Each cell is split into two units, each of half the original cell (payload) size, and each unit is sent in one of the two directions. This method is used in several commercial chip sets, but only with splitting degrees up to 8 and carefully equalized delays through the paths [5, 10]. This method is far from scalable, due to the fixed header and per-unit processing overheads; thus, we do not use it.
To achieve balanced load distribution in the long run, even if not on a very short-term basis, while still operating at the cell level, a number of methods have been proposed: randomized [11], adaptive [8], and per-flow round-robin cell distribution [12]. In all of these methods, cells of a given microflow are routed through either path, so they may arrive out of order. For the switching fabric to preserve cell order within individual microflows, resequencers must exist at the points of path reconvergence. Resequencing is an important issue in our system, and is dealt with in later sections.
INTERNAL BACKPRESSURE PROTOCOLS
Switches with multistage buffering typically use backpressure feedback control between these stages to avoid overflow of downstream buffers and control individual flow rates when multiple flows merge into oversubscribed resources, thus enforcing quality of service (QoS) guarantees.
We assume credit-based backpressure: the upstream stage maintains a credit counter (in total or per flow), specifying how many cells it is allowed to transmit in the downstream direction before new credit is received via backpressure feedback signals. The buffer space needed is λ × RTT (in total or per-flow), where λ is the peak rate and RTT is the round-trip time.
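A minimal sketch of the credit mechanism follows, for a single flow on a single link. The class and function names are hypothetical, and the sketch makes the simplifying assumption that a credit returns to the upstream stage instantly when the downstream stage forwards a cell; in a real fabric the credit arrives after a feedback delay, which is why the buffer must cover a full round-trip time.

```python
import math
from collections import deque


def buffer_cells_needed(peak_rate: float, rtt: float) -> int:
    """Buffer space, in cells, for peak rate (cells per cell time) x RTT
    (cell times), rounded up to whole cells."""
    return math.ceil(peak_rate * rtt)


class CreditedLink:
    """One flow over one link: the upstream side holds a credit counter,
    the downstream side holds a buffer of matching size."""

    def __init__(self, buffer_cells: int):
        self.buffer_cells = buffer_cells
        self.credits = buffer_cells      # upstream credit counter
        self.downstream = deque()        # downstream buffer

    def can_send(self) -> bool:
        return self.credits > 0

    def send(self, cell) -> None:
        """Upstream transmits one cell, consuming one credit."""
        if not self.can_send():
            raise RuntimeError("no credit: transmission must wait")
        self.credits -= 1
        self.downstream.append(cell)

    def drain_one(self):
        """Downstream forwards its oldest cell and returns one credit
        (instantly, in this simplified sketch)."""
        cell = self.downstream.popleft()
        self.credits += 1
        return cell


if __name__ == "__main__":
    link = CreditedLink(buffer_cells_needed(peak_rate=1.0, rtt=2))
    link.send("c0")
    link.send("c1")
    print(link.can_send())   # upstream is blocked until a credit returns
    link.drain_one()
    print(link.can_send())
```

Because a cell is sent only when a credit is held, the downstream buffer can never hold more than `buffer_cells` cells, so it cannot overflow.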
Backpressure signals may refer to individual (micro)flows, to flow aggregates, or indiscriminately to all traffic passing through a link. Indiscriminate backpressure leads to very poor QoS, because a single oversubscribed flow may stop the service to all other flows with which it shares a link or buffer (this is analogous to head-of-line, HOL, blocking). Thus, per-flow or virtual-channel or multilane backpressure is needed. The number and definition of flows is a crucial parameter, and affects cost (amount of state and granularity of feedback information) and QoS (degree of isolation among competing flows). When individual flow granularity is excessive, one can use a "compromise" solution or appropriate flow aggregation. Compromise backpressure protocols yield good performance in the usual cases, but perform badly in some worst cases; they include wormhole virtual channels [13], a DEC proposal [14], quantum flow control (QFC), and the ATLAS I multilane backpressure [3].
This article is concerned with full-fledged per-flow backpressure, which ensures that even if all output ports but one are oversubscribed, traffic going to that one noncongested output will still enjoy delays comparable to those of an ideal output queued switch. We obtain such strong QoS guarantees at a cost no worse than O(N) per switching element, which is realistic for modern VLSI technology.
SWITCHING ELEMENT ORGANIZATION
In this section, we present flow merging schemes that reduce the O(N²) backpressure cost (per switching element) down to O(N). Next, we describe the queues and the functionality inside the distribution and routing switching elements.
The main tool used in this endeavor is the merging of flows with common destinations. When multiple flows of a same priority level follow a common path to a common destination, they can be treated as a single merged flow over the common path for purposes of buffer allocation and backpressure granularity. The reason is that cells of one flow will never need to overtake cells of another after the merge point.
FLOW GROUPS
As noted earlier, for an N × N Benes fabric, backpressure must operate at the granularity of the N² flows (per priority level) defined by all input-output pairs. In banyan fabrics, although the total number of flows is N², only N flows pass through any individual link in the fabric. In the Benes fabric, however, the traffic of every flow is distributed and sent over both "even" and "odd" subnetworks in Fig. 1; consequently, all subnetworks, no matter how small, down to the individual switching elements in the core of the fabric, are traversed by N² flows (per priority level).
In order to reduce the number of flows, we use per-output merging of the flows destined to the same output port of the fabric. Figure 2 shows the case for two flows originating from inputs 0 and 1 and destined to the same output 0; 01 → 0 denotes the merging of flows 0 → 0 and 1 → 0. This example uses 2 × 2 switching elements. Each switching element of the distribution network (left half of the Benes fabric) merges, one by one, the N flow groups entering through one of its inputs with the N flow groups entering through the other, and produces N merged flow groups; the merging factor is two to one. These switching elements also distribute the cells to both of their outputs, so the N merged flow groups appear on each of these outputs; hence, all links carry precisely N flow groups.
In the routing network (right half of the Benes fabric), cells that had been distributed to the even and odd subnetworks must be resequenced. Resequencing in output switches must be performed separately for each flow in a merged flow group. The reason is that merged flow groups carry cells that were distributed at different input switches, independent of each other, before the merge points. Hence, merged flow groups from different inputs to a same output must be split again in order for resequencing to work correctly.
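The need for separate per-flow resequencing state can be illustrated with a small sketch. The text above does not specify how cell order is recovered, so the per-flow sequence numbers used here are purely an assumption for illustration, and the class name is hypothetical. The point shown is only that each (input, output) flow keeps its own expected position, independent of the other flows in the same merged group.

```python
from collections import defaultdict


class PerFlowResequencer:
    """Releases cells of each flow in sequence order, keeping separate
    state per flow (assumed mechanism: per-flow sequence numbers)."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # flow -> next sequence number due
        self.pending = defaultdict(dict)   # flow -> {sequence number: cell}

    def arrive(self, flow, seq, cell):
        """Accept one cell; return the cells of this flow that can now be
        forwarded in order (possibly none)."""
        self.pending[flow][seq] = cell
        released = []
        while self.next_seq[flow] in self.pending[flow]:
            released.append(self.pending[flow].pop(self.next_seq[flow]))
            self.next_seq[flow] += 1
        return released


if __name__ == "__main__":
    r = PerFlowResequencer()
    # flows 0->0 and 1->0 belong to the same merged group (01->0) but are
    # resequenced independently
    print(r.arrive((0, 0), 1, "b"))   # held: cell 0 of flow 0->0 is missing
    print(r.arrive((1, 0), 0, "x"))   # flow 1->0 is not delayed by flow 0->0
    print(r.arrive((0, 0), 0, "a"))   # releases both cells of flow 0->0
```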
Splitting of flow groups and cell resequencing can be performed progressively, per stage, or cumulatively, in the very last stage of the fabric. In the latter case, we need not split flows within the routing banyan; thus, there would be N/2, …, 2, 1 flows passing through the switching elements in the log₂ N stages of the routing banyan, respectively. However, each resequencer at the output ports of the fabric would then require N resequence buffers, one for each of the N (per input) flows leading to that output, each of size O(N). There is no reason to accumulate so much complexity in the last stage of the fabric, so we prefer the former solution: progressive flow group splitting and cell resequencing.
In conclusion, per-output flow merging with per-stage resequencing is much simpler to implement and has a uniform implementation cost of O(N) per switching element across all stages of the switching fabric, so we use this architecture in the rest of the article. Lucent's ATLANTA chip set [4] also uses per-output flow merging and cell distribution, but avoids resequencing because the middle stage consists of N/P × N/P bufferless crossbars (where P is the number of port interfaces connected to each input module); thus, it does not reorder cells.
LOGICAL BUFFER ORGANIZATION
Figure 3 shows the preferred logical buffer organization of the distribution and routing switching elements, along with the active components needed. We follow the flow merging and cell resequencing architecture chosen above. The figure shows a distribution switching element at the second stage of the Benes fabric that is connected, through switching elements, with inputs 0, 1, 2, and 3; and a routing switching element at the second to last stage of the fabric that leads to outputs 0, 1, 2, and 3. The flow groups from inputs 0,1 and 2,3 to the eight fabric outputs are shown in the left (distribution) switching element, along with the flows to outputs 0,1 and 2,3 from the eight fabric inputs in the right (routing) switching element. The FIFOs shown are logical queues, containing references to cells; the actual cells do not move inside the switching element.
FREEDOM FROM DEADLOCK
The interleaving of multiple stages of cell resequencing and flow splitting combined with backpressure has the potential danger of deadlock: a resequencer may be waiting for cells from a given path, while the splitter in the previous stage may be delivering cells in the wrong queue. We have shown that for a wide and interesting class of cell distribution methods, no deadlock situation can arise. In [15, S. 3.3], we describe the potential deadlock situation and derive sufficient conditions for this situation never to occur in per-stage resequencers.
SIMULATION RESULTS
A simulation model, operating at cell time granularity, was developed in order to verify the design and evaluate its performance under various traffic patterns and for various switch sizes, and to evaluate cell distribution and resequencing methods. We simulated the switch under smooth, bursty, and hot spot traffic. Smooth traffic consisted of Bernoulli arrivals with uniformly distributed destinations. For bursty traffic, each source alternately produces a burst of cells (all with the same destination) possibly followed by an idle period of empty cells; the bursts and idle periods contain a geometrically distributed number of cells. The reported results use bursty/12 traffic, where the mean burst size is 12 cells; this is close to one of the modes of IP traffic size distribution (assuming 48-byte cell payload). Under hot spot traffic, each destination belonging to a designated set of hot spots receives (smooth or bursty) traffic at 100 percent collective load, uniformly from all sources; the rest of the destinations receive smooth or bursty traffic as above. The reported results use hot spot/4 traffic, where the four hot spots are ports 0, 1, 2, and 3. The delay reported is the average queuing delay over all cells, plus one.
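The bursty source described above can be sketched as a generator that emits one slot per cell time. This is not the simulator used for the reported results; the function names are hypothetical, destinations are assumed uniform, bursts are assumed to contain at least one cell, idle periods are assumed to possibly be empty, and the mean idle length is derived from the target load.

```python
import random


def geometric_at_least_one(mean, rng):
    """Geometric variate on 1, 2, 3, ... with the given mean (mean >= 1)."""
    p = 1.0 / mean
    k = 1
    while rng.random() >= p:
        k += 1
    return k


def geometric_from_zero(mean, rng):
    """Geometric variate on 0, 1, 2, ... with the given mean (mean >= 0)."""
    p = 1.0 / (mean + 1.0)
    k = 0
    while rng.random() >= p:
        k += 1
    return k


def bursty_source(n_ports, load, mean_burst, rng):
    """Yield, for every cell time, a destination port or None (idle slot)."""
    if not 0.0 < load <= 1.0:
        raise ValueError("load must be in (0, 1]")
    # load = mean_burst / (mean_burst + mean_idle)
    mean_idle = mean_burst * (1.0 - load) / load
    while True:
        dest = rng.randrange(n_ports)        # one destination per burst
        for _ in range(geometric_at_least_one(mean_burst, rng)):
            yield dest
        for _ in range(geometric_from_zero(mean_idle, rng)):
            yield None


if __name__ == "__main__":
    rng = random.Random(2024)
    source = bursty_source(n_ports=64, load=0.8, mean_burst=12, rng=rng)
    slots = [next(source) for _ in range(100_000)]
    print(sum(s is not None for s in slots) / len(slots))  # measured load
```

Smooth (Bernoulli) traffic corresponds to drawing a fresh destination in every busy slot instead of once per burst.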
As a means to get an indication regarding the lack of internal blocking, we also simulated the 64 × 64 fabric under the following artificial load. In each and every cell time, a randomly selected full permutation was presented to the input of the switch; that is, all inputs were continuously loaded at precisely 100 percent, while the overall load presented to the fabric was feasible, in the sense of an earlier section, during each and every cell time. After one million simulation cell times, there were virtually no cells queued at the inputs: most of the VOQs were empty, while a few others contained one or two cells each; given that the fabric never drops cells, this indicates 100 percent throughput under this special case of random feasible traffic.
CELL DISTRIBUTION METHODS
We experimented with two cell distribution methods, PerFlowRR and PerFlowIC, on a 64 × 64 Benes fabric made of 4 × 4 switching elements with buffers of size up to one or two cells depending on the cell distribution method. PerFlowRR is per-flow round-robin cell distribution, where the per-flow distribution pointers are randomly initialized. PerFlowIC (standing for per-flow imbalance count) chooses the port for forwarding the next cell as follows: among the set of ports that have received the least number of cells of this flow up to now, choose the least loaded port, that is, the port that currently has the least number of ready cells (i.e., cells that have an available downstream credit) awaiting transmission, in all flows. Both methods, in the long run, send the same number of cells in each path; PerFlowIC, though, is more flexible every time the imbalance across ports returns to 0. The results are shown in Figs. 4 and 5 for uniformly destined traffic, and in Fig. 6 for traffic in the presence of hot spots.
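The two selection rules can be sketched as follows, for one flow at one distribution switching element. The function names and argument layout are hypothetical, and breaking remaining ties by lowest port index is an assumption; the text does not say how such ties are resolved.

```python
def per_flow_rr_choose(pointer, n_ports):
    """PerFlowRR: return (chosen port, updated per-flow pointer)."""
    return pointer, (pointer + 1) % n_ports


def per_flow_ic_choose(flow_counts, ready_cells):
    """PerFlowIC: pick the output port for the next cell of a flow.

    flow_counts[p]: cells of this flow already sent through port p.
    ready_cells[p]: cells of all flows at port p that hold a downstream
                    credit and are awaiting transmission.
    """
    least = min(flow_counts)
    # only ports that have received the fewest cells of this flow qualify
    candidates = [p for p, c in enumerate(flow_counts) if c == least]
    # among those, take the least loaded port (ties: lowest index, assumed)
    return min(candidates, key=lambda p: ready_cells[p])


if __name__ == "__main__":
    counts = [0, 0, 0, 0]
    loads = [3, 1, 4, 1]
    for _ in range(6):
        port = per_flow_ic_choose(counts, loads)
        counts[port] += 1
        print(port, counts)
```

Since a cell is only ever sent to a port with the minimum per-flow count, the counts of any two ports never differ by more than one, which is what keeps the distribution balanced while still allowing a load-aware choice whenever several ports are tied.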
Under smooth (Bernoulli) traffic, the cell distribution method does make some difference: imbalance count (PerFlowIC) yields 30-60 percent lower delay than round-robin distribution (PerFlowRR). The difference is more pronounced for medium loads, and less pronounced for light or heavy loads. Under bursty traffic, though, the cell distribution method makes virtually no difference. This must be due to the large number of back-to-back cells in the same flow; in this case, PerFlowIC becomes similar to PerFlowRR not only in the long but also in the short term. Comparing the delays with and without hot spots, both shown in Fig. 6, we notice that they are almost identical: non-hot-spot traffic stays virtually unaffected by the presence of hot spots in the network, which demonstrates the excellent QoS properties of this switch.
COMPARISON WITH OQ, ISLIP, RANDOMIZED, AND ADAPTIVE
Figures 4, 5, and 6 also show the delay of the ideal output-queued (OQ) switch under each traffic load. We see that, under bursty traffic, the Benes fabric has only 20 percent to 60 percent worse delay when compared to ideal output queuing. Under smooth traffic, the Benes delay exceeds the ideal OQ delay by a factor up to 2.3 for PerFlowIC and 3.6 for PerFlowRR, the difference being less pronounced for light load and more pronounced around 80 percent load. We also performed simulations for the two cell distribution methods under bursty/32 arrivals and either uniform or hot spot/4 destinations. Compared to ideal output queuing, the average delay was 10-60 percent higher for uniform destinations, and 15-85 percent higher for hot spot/4 destinations, indicating that the fabric behaves well with increasing burst size.
We also compare the Benes fabric with per-flow backpressure and cell distribution with limited imbalance against the more traditional architectures of the Benes fabric with randomized and adaptive routing, shown in Figs. 4 and 5. These differ from our architecture in that randomized uses no backpressure, while adaptive uses single-lane (indiscriminate, not per-flow) backpressure. Randomized routing features delays comparable to PerFlowRR, but for high loads it requires an excessive number of buffers to achieve this: up to 16,000 cells per switching element under bursty traffic and up to 1800 cells per switching element under smooth traffic, while PerFlowRR uses 512 cells per switching element in all cases. With regard to adaptive routing, we show results for buffer sizes from 8 to 64 cells per input and per output link, which result in a total of 64 to 512 cells per 4 × 4 switching element. We see that adaptive operates with limited buffer space, like our architecture does; however, adaptive suffers from problems similar to HOL blocking due to indiscriminate backpressure: saturation throughput is well below 100 percent, and delay quickly deteriorates with increasing buffer sizes.
Last, we compare the performance of the Benes fabric with that of a crossbar with VOQs and the 2-SLIP crossbar scheduling algorithm [10]. We see that for loads under 70 percent, the delay for 2-SLIP is small, comparable to the delay through the Benes fabric. As the load gets higher, around 80 percent, the delay for 2-SLIP becomes more than 14 times worse than the delay through the Benes fabric.
FABRIC SIZE DEPENDENCE OF PERFORMANCE
One of the advantages of the proposed architecture is that it can scale to very large sizes. It is important for the performance of the fabric not to degrade with increasing size. We experimented with fabrics of up to 256 ports. The results [15, S. 4.3], not shown here due to space limitations, show that average cell delay remains virtually unaffected by fabric size.
ALTERNATIVE CELL RESEQUENCING METHODS
As discussed earlier, cell resequencing can be performed progressively, PerStage, or cumulatively, in the very last stage of the fabric, FinalOut. From the point of view of implementation, PerStage resequencing is simpler and less expensive than FinalOut, but the question regarding performance remains: it appears that FinalOut lets cells go faster through the routing network, and thus may lead to lower delays. In reality, things are the other way around! The results [15, S. 4.4], not shown here due to space limitations, show that although cells do indeed get a bit faster through the fabric than in the case where per-stage resequencing delays them in the routing network, when the delay of FinalOut resequencing is added, the overall delay of FinalOut is worse. We see that letting some cells get quickly through the fabric, ahead of their order, without per-stage resequencing, appears to consume such fabric resources that, overall, it harms other cells more than it benefits the early-out cells. We conclude that per-stage resequencing is strictly better than cumulative resequencing in the very last stage of the fabric, from the points of view of both implementation cost and complexity, as well as from the point of view of performance.
CONCLUSIONS
We showed how to efficiently scale packet switches to very large numbers of ports, while maintaining nonblocking operation and high QoS. This can be done using the Benes network, the lowest-cost switching fabric that is free of internal blocking. Large buffer memories are only needed at the inputs of the system, to implement virtual output queues; their number scales linearly with system size, the number of queues in each memory also scales linearly, while their throughput stays fixed. Internal backpressure is used in the Benes fabric in order to provide:
• Low-cost switching elements, since they only need on-chip buffer memory.
• Zero cell loss in the switching fabric, although buffer memories are small.
• Low system cost: since no global scheduler is needed, the fabric needs no internal speedup, and it does not need redundant paths to handle cell conflicts.
• High system performance and high quality of service, even though system cost is kept low as detailed above.
To achieve all these, we had to extend the known per-flow backpressure architecture to make it applicable to multipath routing (inverse multiplexing) and cell resequencing. To the best of our knowledge, this is the first time that this combination of architectures is studied. In order to keep the cost manageable, we used an appropriate flow merging scheme that keeps the cost of backpressure down to O(N) per switching element. We proved freedom from deadlock for a class of multipath cell distribution algorithms. Finally, using a cell-time-accurate simulator, we demonstrated system performance.
