Abstract-This paper contributes a distributed packet controller that reduces queueing to a single stage in two-stage packet switches. Software and neural network based controllers are described. Simulations under a range of traffic conditions for a 1024 × 1024 switch show that the simplest architecture has the best performance.
In an asynchronous transfer mode (ATM) switch, fixed-size packets arrive in periodic time slots at one of the input ports destined for one of the output ports. Because multiple packets can be destined for the same output, packets must queue somewhere in the switch to be sent in a later time slot (so-called output blocking). Because of speed and complexity constraints, large ATM switches are built from stages of smaller switch modules as in Fig. 1(a). Blocked packets can queue at several points in the switching network with different levels of coordination, ranging from no coordination between switch stages to various back pressure mechanisms [1]. This paper develops a distributed packet controller that extends the single-stage controller of prior work [2] and reduces queueing to a single stage. Simulated performance under a range of traffic conditions for a 1024 × 1024 switch shows that the simplest architecture has the best wait time and buffer size performance.
A. Input versus Output Switch Modules
We focus on nonblocking switch modules that buffer blocked packets at output or input ports. The output-buffered switch has maximal throughput and the shortest queues [3], but it is also more expensive, as an N × N switch must operate at N times the input port speed to ensure that every packet can reach its output. Once at the correct output, packets are simply sent, in turn, at every time slot. Shared output buffers [4], although they have no effect on the mean performance, reduce the buffer size needed to absorb queue-size variations and are used in this paper.
The input-buffered switches in this paper use separate input queues that can send any waiting packet. This avoids head-of-line blocking and enables maximal throughput [2], [3]. Unlike an output-buffered switch, a packet arbitrator (PA) must decide which packets to send in every time slot. Given a set of input-queued packets, the PA chooses the largest subset so that at most one packet is sent per input queue, and at most one packet arrives per output port.
B. Software (SW) and Neural Network (NN) Packet Arbitrators
The packet arbitration problem reduces to the cardinality graph matching problem, which has a polynomial-time solution [5]. This algorithm readily computes the largest packet subset in a SW simulation, although it is unlikely to be fast enough for large high-speed switches to make their selection within a time slot. Also, it does not apply to switches with internal blocking such as banyan networks. For these reasons, we consider an NN PA.
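The reduction to matching can be made concrete with a minimal sketch. It uses a simple augmenting-path method for bipartite matching, which is not necessarily the algorithm of [5]; the function name `arbitrate` and the `requests` representation (one set of requested outputs per input queue) are assumptions made for illustration.

```python
def arbitrate(requests):
    """Return a largest valid packet subset as a dict {input: output}.

    requests[i] is the set of outputs for which input queue i holds a
    waiting packet (hypothetical representation, not from the paper).
    """
    input_of_output = {}  # output -> input currently assigned to it

    def try_assign(i, visited):
        # Try to give input i an output, re-routing earlier inputs if needed.
        for j in sorted(requests[i]):
            if j in visited:
                continue
            visited.add(j)
            if j not in input_of_output or try_assign(input_of_output[j], visited):
                input_of_output[j] = i
                return True
        return False

    for i in range(len(requests)):
        try_assign(i, set())
    # At most one packet per input queue and at most one per output port.
    return {i: j for j, i in input_of_output.items()}


if __name__ == "__main__":
    # Inputs 0 and 2 each have two choices; input 1 only wants output 0.
    print(arbitrate([{0, 1}, {0}, {0, 2}]))
```

In the small example every input can be served, but only if input 0 gives up output 0 in favour of output 1, which is the kind of re-routing the augmenting step performs.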
An NN PA was presented in [2], and the reader is referred there for details. The neuron matrix solution is governed by the differential equations (1), where each neuron has a state variable, a constant external input, and an output given by any of the usual neuron sigmoid functions. If the external input of each neuron is set according to whether or not a packet at the corresponding input is waiting for the corresponding output, (1) approaches a stable equilibrium whose resulting set of neuron outputs always corresponds to a valid packet subset [2]. In simulation, the external inputs are set to match the queue state, and (1) is numerically integrated using a fifth-order adaptive step size algorithm [6]. [7], [8]. Circuits similar to (1) have been built with less than 1-μs decision time [9]. For signaling in an N × N switch, at every time slot, a queue sends an N-bit message to the PA with the queue state, and the PA sends back a log N-bit message with the selected packet subqueue [2]. This indicates that the input-buffered switch has a compact and fast implementation compared to the more complex output-buffered switch [4].

0090-6778/99$10.00 © 1999 IEEE
II. LARGE SWITCH ARCHITECTURES
A common way to build a large switch is to use two stages of smaller switches as in Fig. 1(a). While straightforward, this may not be optimal. Packets buffer in two stages, which, in general, doubles the delay. We consider three versions of the architecture in Fig. 1(a).
A. Two Stages of Output-Buffered Shared Memory Switches
This is the direct two-stage implementation using the highest performance switch. No PA is necessary. This is the baseline for comparison with other switches and is denoted OO.
B. One Stage of Output and One Stage of Input-Buffered Switches with Packet Arbitrator
Output-buffered shared memory switches comprise the first stage. From a second-stage switch's perspective, the output buffers of the first stage appear as input buffers. A single stage of buffering is created with stage one buffers logically acting as input buffers to simpler second-stage switches [Fig. 1(b)]. PAs located in each of the second-stage switches control one output queue in each of the first-stage switches. Each queue in the first-stage switch must now maintain subqueues so the PA can choose packets destined for any second-stage output. If the queues are maintained as linked lists in a single buffer as in [4], then the overhead for the linked lists is small. Thus, compared to the OO switch, minor modifications are required in the first stage, while the hardware in the second stage is significantly reduced. This switch is denoted OIA.
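A first-stage queue with per-destination subqueues might be sketched as follows. This is an illustration only: it uses one `collections.deque` per second-stage output rather than linked lists in a single shared memory as in [4], and the class and method names are hypothetical.

```python
from collections import deque


class FirstStageQueue:
    """One first-stage output queue, split into subqueues (illustrative)."""

    def __init__(self, n_outputs):
        # One FIFO subqueue per second-stage output port.
        self.subqueues = [deque() for _ in range(n_outputs)]

    def enqueue(self, packet, output):
        self.subqueues[output].append(packet)

    def state(self):
        # One bit per second-stage output: 1 if a packet is waiting for it.
        return [1 if q else 0 for q in self.subqueues]

    def dequeue(self, output):
        # Called when the PA selects the subqueue for this output.
        return self.subqueues[output].popleft()

    def __len__(self):
        return sum(len(q) for q in self.subqueues)
```

The list returned by `state()` is what the PA would need in each time slot in order to choose among the waiting packets.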
C. Two Stages of Output-Buffered Switches with Packet Arbitrator
In the OO switch (Section II-A), the second-stage delay depends on the order in which packets are sent from the first stage (e.g., Fig. 2). In the OIA switch (Section II-B), the PA eliminates second-stage queueing, but contention may prevent a nonempty first-stage queue from sending a packet. This section's switch sends the head-of-queue packet from such queues to be buffered in the second stage. As in Section II-A, every nonempty first-stage queue sends one packet while the PA reduces second-stage queueing. This switch shows the advantage of packet arbitration and is denoted OOA.
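The difference between the OIA and OOA sending rules in one time slot can be sketched as below. The function name, the `policy` strings, and the data layout (each first-stage queue as a list of destination outputs, oldest first, with the PA's choice passed in as a dictionary) are assumptions for illustration, not the paper's implementation.

```python
def packets_sent(queues, selected, policy):
    """Return {queue index: destination output} for one time slot.

    queues[i]  -- destinations of packets waiting in first-stage queue i,
                  oldest first (hypothetical representation)
    selected   -- {queue index: output} chosen by the packet arbitrator
    policy     -- "OIA" or "OOA"
    """
    sent = dict(selected)  # arbitrated packets are sent under both policies
    if policy == "OOA":
        for i, waiting in enumerate(queues):
            # A nonempty queue that lost arbitration still sends its
            # head-of-queue packet, to be buffered in the second stage.
            if waiting and i not in sent:
                sent[i] = waiting[0]
    return sent


if __name__ == "__main__":
    queues = [[2, 2], [2], []]   # queues 0 and 1 contend for output 2
    selected = {0: 2}            # the PA can grant output 2 only once
    print(packets_sent(queues, selected, "OIA"))
    print(packets_sent(queues, selected, "OOA"))
```

Under OIA the losing queue holds its packet in the first stage, whereas under OOA two packets reach output 2 and one of them must wait in the second stage.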
III. EXPERIMENTS WITH A 1024 INPUT SWITCH
The following experiments are designed to compare the switch designs. Absolute switch performance would require a range of switch sizes under more diverse and realistic traffic types. Instead, the designs are compared under different loads for a 1024 × 1024 switch size (i.e., two stages of 32 32 × 32 switches). This size is an order of magnitude larger than existing switches today, and the 32 × 32 neurons in each neural PA are well beyond the "toy problem" size. To lower bound the performance, we look at the first-stage buffering and wait times of the OO switch. This represents the irreducible queueing and contention at the first- to second-stage links.
The packet traffic is generated using a Bernoulli process and a burst process with different loads.1 The switches are simulated for 11 000 time slots starting from empty queues. For a given load and traffic type, the sequence of arrivals is the same for every switch. The first 5000 time slots of data are discarded so that only steady-state behavior is observed. With 32 switches in each stage, the remaining 6000 time slots contain a total of 192 000 observations. Average packet delay is used as a measure of average performance. To remove details of the switch implementation, the inherent delay (i.e., the delay of a single packet through an empty switch) is subtracted out so that only the queueing delay is measured. In every time slot, the size of every shared buffer is recorded after all arrivals, but before any departures. As a measure of required buffering (an outlier metric), the buffer size exceeded in only 10 of the time slots is computed directly from the buffer size trace data. Smaller blocking probabilities, while more realistic, require extrapolation techniques that do not illuminate the comparative results further.

1 The burst process has a mean burst length of m packets. This is modeled by having a burst start in a time slot with probability equal to the load divided by m. All packets in a burst arrive in consecutive time slots and have the same destination chosen uniformly. Bursts start immediately or as soon as all prior bursts finish, whichever is first. In this paper, m = 10.
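The two arrival processes can be sketched for a single input port as follows. Several details are assumptions: the burst start probability is read as the load divided by m, every burst is given a fixed length of exactly m packets (the text does not state the length distribution), and the function names are hypothetical.

```python
import random


def bernoulli_arrivals(load, n_outputs, n_slots, rng):
    """Destination per time slot, or None when no packet arrives."""
    return [rng.randrange(n_outputs) if rng.random() < load else None
            for _ in range(n_slots)]


def burst_arrivals(load, m, n_outputs, n_slots, rng):
    """Bursty arrivals; each burst has m packets (assumed fixed length)."""
    arrivals = []
    pending = 0     # bursts that have started but must wait for prior bursts
    remaining = 0   # packets left in the burst currently arriving
    dest = None
    for _ in range(n_slots):
        if rng.random() < load / m:      # a new burst starts in this slot
            pending += 1
        if remaining == 0 and pending > 0:
            pending -= 1
            remaining = m
            dest = rng.randrange(n_outputs)  # one destination per burst
        if remaining > 0:
            arrivals.append(dest)
            remaining -= 1
        else:
            arrivals.append(None)
    return arrivals


if __name__ == "__main__":
    rng = random.Random(0)
    print(bernoulli_arrivals(0.5, 32, 20, rng))
    print(burst_arrivals(0.5, 10, 32, 40, rng))
```

Passing the same seeded generator state for every switch design is one way to obtain the identical arrival sequence across designs that the experiment calls for.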
Since packet behavior is correlated over time scales ranging up to several hundred time slots, error bars are computed indirectly. For each metric, six subestimates are computed across successive 1000-time-slot intervals, which are then assumed independent. Average delay, being a linear combination of the subestimates, uses the standard deviation of the subestimate average. Buffer size, not being linear, uses the standard deviation of the subestimates. The buffer-size subestimates are from a smaller sample than the final estimate, so this upper bounds the error.
Fig. 3 shows results for the three architectures using the SW PA. The OO and OOA switches include the total delay/buffering of both stages. Across metrics, the difference between the OO and the lower bound is a factor of 2. For queueing delay, both arbitrated architectures approach the lower bound at high loads. For buffering, the simpler OIA architecture approaches the lower bound for low loads, while for high loads both arbitrated architectures approach a buffer size midway between the OO and lower bound. The OIA's superior performance is due to the single stage of buffering. Although the OIA has longer first-stage queues, these are the only queues and all queued packets are therefore under the PA's control.
Results using an NN PA are shown for Bernoulli arrivals in Fig. 4. The NN is guaranteed to provide valid but not necessarily optimal solutions. When applied to the OOA switch, there is little difference. The nonoptimal solutions of the neural PA are compensated for by always being able to send packets from any nonempty queue so that the number of packets sent is the same with both controllers. In the OIA switch, the nonoptimal neural solutions translate directly into longer queues and delay at high loads.
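The error-bar procedure can be sketched as a batch computation over a recorded trace. The helper names are hypothetical, the trace is treated as a flat sequence of per-slot observations, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not say which estimator is used.

```python
import math
import statistics


def subestimates(trace, metric, n_batches=6):
    """Apply `metric` to successive equal-length intervals of the trace."""
    size = len(trace) // n_batches
    return [metric(trace[k * size:(k + 1) * size]) for k in range(n_batches)]


def delay_error_bar(sub):
    # Standard deviation of the average of the subestimates, which are
    # treated as independent.
    return statistics.stdev(sub) / math.sqrt(len(sub))


def buffer_error_bar(sub):
    # Standard deviation of the subestimates themselves (an upper bound).
    return statistics.stdev(sub)


if __name__ == "__main__":
    # Synthetic stand-in for 6000 steady-state delay observations.
    trace = [v for v in (1, 2, 3, 4, 5, 6) for _ in range(1000)]
    sub = subestimates(trace, statistics.fmean)
    print(sub, delay_error_bar(sub), buffer_error_bar(sub))
```

For the buffer-size metric, `metric` would be whatever function extracts the outlier buffer size from one interval of the trace; only the mean is shown here because the delay metric is an average.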
IV. CONCLUSION
A distributed PA was developed and applied to a large high-speed packet switch. The arbitrated OIA and OOA switches of Sections II-B and II-C bring queueing delays to near-optimal and required buffer sizes to within a factor of 1.4 of optimal at high loads. This result applies equally across different arrival processes. Surprisingly, the best performance is with the less complex OIA switch. Comparing the SW and NN PAs, the NN performs similarly when applied to the OOA switch, while its performance degrades at high loads when applied to the OIA switch. Even so, the fast NN decisions could enable high-speed applications. Future work includes controlling maximum delay and fairness, and applying the approach to simpler switches with internal blocking.
