This paper describes a new architecture for a multicast ATM switch scalable from a few tens to a few thousands of input ports. The switch, called the Abacus switch, has a nonblocking switch fabric followed by small switch modules at the output ports. It has buffers at the input and output ports. Cell replication, cell routing, output contention resolution, and cell addressing are all performed in a distributed way so that the switch can be scaled up to thousands of input and output ports. A novel algorithm is proposed to resolve output port contention while achieving input buffer sharing, fairness among the input ports, and call splitting for multicasting. The channel grouping mechanism is also adopted in the switch to reduce the hardware complexity and improve the switch's throughput, while the cell sequence integrity is preserved. The switch can also handle multiple-priority traffic by routing cells according to their priority levels. The performance study of the Abacus switch in throughput, average cell delay, and cell loss rate is presented. A key ASIC chip for building the Abacus switch, called the ARC (ATM Routing and Concentration) chip, contains a two-dimensional array (32 × 32) of switch elements that are arranged in a cross-bar structure. It provides the flexibility of configuring the chip into different group sizes to accommodate different ATM switch sizes. The ARC chip has been designed and fabricated using 0.8-μm CMOS technology and tested to operate correctly at 240 MHz.
Introduction
There are several approaches to building a large-scale ATM switch. The first uses small ATM switch modules (e.g., 32 × 32) as building blocks and connects them in a multi-stage structure (e.g., a Clos-type interconnection) [1, 2, 3, 4, 5]. The problem with this approach is the performance degradation due to internal blocking between the switch modules. Although the performance can be improved by speeding up the internal links or providing more interconnection links between modules, this approach has not been shown to be capable of providing satisfactory performance for a large-scale ATM switch.
The second uses high-speed technology to switch cells at multi-Gb/s rates in a core switch [6, 7, 8, 9]. For instance, the AT&T, Fujitsu, NTT, and BNR switches switch cells at 2.5 Gb/s or 10 Gb/s. The advantage of this approach is that the buffer required in the core switch is minimized, for two reasons. First, as users' traffic is multiplexed onto a high-bandwidth link, each individual user's traffic looks more like random traffic (i.e., less bursty). Second, when cells are multiplexed and switched at high speed, the channel grouping technique is applied implicitly, and thus less memory is required for the same performance. However, demultiplexers at the output of the core switch require large buffers because high-speed cell streams are routed to lower-speed output links. Since the speed required for the demultiplexer's memory is lower than that of the core switch's memory, the need for large buffers at the demultiplexer can be justified.
Output buffering (including shared-memory output buffering) has been proven to provide the best delay and throughput performance. As the switch grows to a certain size (e.g., 256 input and output ports), memory speed may become a bottleneck, or the technology used to implement such memory may become too costly. One way to eliminate the memory speed constraint is to temporarily store some cells destined for the same output port in the input buffers. The well-known head-of-line (HOL) blocking drawback of input buffering can be reduced by speeding up the internal links' bandwidth (e.g., to 2 to 4 times that of the input line) and buffering excess cells at the output ports. The input-and-output buffering approach thus provides satisfactory performance and eliminates the memory speed limitation. Examples of input-and-output buffered ATM switches are NTT's and BNR's 160 Gb/s switches. The challenge in implementing input-and-output buffered switches is resolving output port contention among the input cells destined for the same output port (or the same module). This function is usually handled by an arbiter. The bottleneck caused by the memory speed is now shifted to the arbiter. If parallel processing and pipeline techniques can be intelligently applied to implement the arbiter, a large-scale switch will be feasible.
In [10], we proposed a recursive modular architecture to implement a large-scale ATM switch. It was then modified to support multicasting [11], where we showed that a switch designed to meet the performance requirement for unicast calls will also satisfy the performance requirement for multicast calls. Both architectures employed the generalized Knockout concept [12] with output buffers. However, both switch fabrics are lossy systems, in which cells may be discarded when the number of routing links is less than the number of incoming cells destined for the same output port (or output group).
Here, we propose a new architecture that eliminates the possibility of cells being discarded due to loss of contention in the switch fabric. The new scalable multicast ATM switch has input and output buffers. It is named the Abacus switch because its switch fabric looks like an abacus. The Abacus switch consists of a nonblocking switch fabric followed by small switch modules at the output ports. Cell replication, cell routing, output contention resolution, and cell addressing are all performed in a distributed manner in the Abacus switch so that it can be scaled up to thousands of input and output ports. The switch can be implemented with conventional, economical CMOS technology while achieving performance comparable to that of the switches in [6, 7, 8, 9]. Furthermore, the Abacus switch can be engineered in such a way that it requires small input buffers (e.g., 100 cells per input port) compared with large output buffers (e.g., a few tens of thousands of cells per output port). When implementing call admission control or buffer management, we then need to focus only on the output buffer rather than on both input and output buffers, which reduces implementation complexity significantly.
We have proposed a novel algorithm to resolve the contention of multicast cells destined for the same output port (or output group). The new algorithm also has the following features: achieving input buffer sharing, providing fairness among the input ports, and supporting call splitting for multicasting. The call splitting function allows a multicast cell to be delivered to subsets of its destined output ports in multiple cycles, thus increasing the system throughput [13]. We have also applied distributed and parallel processing techniques in the contention resolution to accommodate a large-scale switch. Several output contention resolution algorithms have been proposed, such as the recirculation algorithm [14], the three-phase algorithm [15], the ring reservation algorithm [16], and the centralized contention resolution device [17]. Most of them can only handle unicast calls (i.e., point-to-point communication) and N-to-1 selection (N is the switch size), while our algorithm can handle multicast calls, call splitting, and N-to-multiple selection when the channel grouping mechanism is applied. The channel grouping mechanism [18] is adopted in our switch to reduce the hardware complexity and improve the switch's throughput. It bundles multiple output ports and permits them to share routing links.

This paper is organized as follows. Section 2 describes the architecture and operations of the Abacus switch. Section 3 presents the novel multicast contention resolution algorithm, which plays a major role in our switch architecture. Section 4 describes the implementation of the input port controller. Section 5 shows an architecture to build a large-scale ATM switch with thousands of input and output ports. Section 6 presents the performance study of the Abacus switch in throughput, average cell delay, and cell loss rate. Section 7 briefly describes the design of the ARC (ATM Routing and Concentration) chip and its testing results [19]. Section 8 gives conclusions.
Architecture and Operations of Abacus Switch
As shown in Figure 1, the proposed Abacus switch consists of input port controllers (IPCs), a multicast grouping network (MGN), multicast translation tables (MTTs), small switch modules (SSMs), and output port controllers (OPCs). The switch performs cell replication and cell routing simultaneously. Cell replication is achieved by broadcasting incoming cells to all routing modules (RMs), which then selectively route cells to their output links. Cell routing is performed in a distributed manner by an array of switch elements (SWEs). The concept of sharing routing links (also called channel grouping [18]) is also applied to construct the MGN in order to reduce hardware complexity, where every M output ports are bundled in a group. For a switch size of N input ports and N output ports, there are K output groups (K = N/M). The MGN consists of K routing modules; each of them provides L × M routing links to its output group. L is defined as the group expansion ratio: the ratio of required routing links to the group size. Cells from the same virtual connection can be arbitrarily routed to any one of the L × M routing links, and their sequence integrity will be maintained. Based on a novel arbitration mechanism to be described in Section 3, up to L × M cells from N IPCs can be chosen in each RM. Cells that lose contention are temporarily stored in an input buffer and will retry in the next time slot. On the other hand, cells that are successfully routed through the RMs are further routed to the proper output port(s) through the SSMs.
We can engineer the group expansion ratio L in such a way that the required maximum throughput in a switch fabric can be achieved. The performance study (in Section 6) shows that the larger M is, the smaller the L required to achieve the same maximum throughput. For instance, for a group size M of 16 and input traffic with an average burst length of 15 cells, L has to be at least 1.25 to achieve a maximum throughput of 0.96. But for a group size M of 32 and the same input traffic characteristic, L can be as low as 1.125 to achieve the same throughput.
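As a quick arithmetic check of the two configurations just quoted, the following Python sketch computes the number of routing links per output group, L × M. The function name `routing_links` is hypothetical, and rounding up to a whole link is an assumption made here for non-integer products.

```python
import math


def routing_links(group_size: int, expansion_ratio: float) -> int:
    """Routing links per output group, L x M.

    Rounding up to a whole link is an assumption of this sketch.
    """
    return math.ceil(group_size * expansion_ratio)


# The two configurations quoted in the text.
print(routing_links(16, 1.25))   # M = 16, L = 1.25
print(routing_links(32, 1.125))  # M = 32, L = 1.125
```

Under these assumptions the first configuration needs 20 links per group and the second 36, so doubling the group size lowers the ratio L while the absolute link count per group still grows.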
The IPCs terminate input signals from the network, look up necessary information in a translation table, resolve contention among cells that are destined for the same output group, buffer those cells that lose contention, and attach routing information in front of cells so that they can be routed properly in the MGN. The IPC implementation is described in Section 4.
Each routing module (RM) in the MGN contains a two-dimensional array of switch elements and an address broadcaster (AB), as shown in Figure 2. The SWE routes cells from the west and north to the east and south, respectively, when it is in the cross state, or to the south and east, respectively, when it is in the toggle state. The SWE's state is determined by comparing the address bits and priority bits of the cells from the west and north. The AB generates dummy cells that carry proper output group addresses. This means that the SWE does not need to store the output group address, which simplifies the circuit complexity of the SWE significantly and results in higher VLSI integration density. The detailed operations of the SWE and the AB can be found in [10, 11, 19]. In addition to routing cells, the RMs also sort cells by priority at the output links, which facilitates the new multicast contention resolution algorithm (see Section 3 for more details). Each routing module in the MGN has N horizontal input lines and L × M vertical routing links. These routing links are shared by the cells that are destined for the same output group (i.e., the same small switch module). Each input line is connected to all routing modules so that cells from any input line can be broadcast to all K output groups.

Cells from multicast calls are first replicated and routed by the MGN to multiple SSMs. Before the copied cells are further replicated and routed by the SSMs, their routing field is updated by the multicast translation tables (MTTs) with the routing information to be used by the SSMs. Each SSM has L × M inputs and M outputs. The SSMs must have multicast capability and an output buffering structure. The latter is required to maintain the cell sequence for cells distributed among the L × M links. One example of such an SSM is Hitachi's 32 × 32 shared-buffer ATM switch [1]. The output port controller (OPC) updates each multicast cell with a new VCI/VPI and sends the cell to the network.
Figure 3 shows an example illustrating how a cell is replicated in the MGN and the SSMs. Suppose a cell arrives at input port #3 and is to be multicast to three output ports: #1, #M, and #(N−M+1). The cell is first broadcast to all K RMs in the MGN, but only RM_1 and RM_K will accept the cell. Note that only one copy of the multicast cell will appear at one of the L × M links of the RM. The copied cell at the output of RM_1 is further replicated into two copies by SSM_1. There are a total of three replicated cells at the output ports.

Figure 4 shows the routing information for a multicast ATM switch with N = 256 and M = 16, which consists of several fields: a multicast pattern (MP), a priority field (P), and a broadcast channel number (BCN). A multicast pattern is a bit map of all the output groups and is used in the MGN for routing cells to multiple output groups. Each bit indicates whether the cell is to be sent to the associated output group. For instance, if the i-th bit in the MP is set to '1', the cell is to be sent to the i-th output group. The multicast pattern MP has K bits for an MGN that has K output groups (16 in this example). For a unicast call, the multicast pattern is basically a flattened output address (i.e., a decoded output address) in which only one bit is set to '1' and all other (K − 1) bits are set to '0'. For a multicast call, more than one bit is set to '1' in the MP, corresponding to the output groups for which the cell is destined.
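The mapping from destination output ports to a multicast pattern can be sketched in a few lines of Python. This is an illustration only: the function name `multicast_pattern`, the 0-based port and group numbering, and the list-of-bits representation are assumptions of the sketch, not part of the switch design.

```python
def multicast_pattern(output_ports, n_ports=256, group_size=16):
    """Build the K-bit multicast pattern (MP) for a set of output ports.

    Ports and groups are numbered from 0 here (an assumption of this
    sketch); bit i of the result is 1 if any destination lies in group i.
    """
    k = n_ports // group_size  # number of output groups, K = N / M
    mp = [0] * k
    for port in output_ports:
        if not 0 <= port < n_ports:
            raise ValueError(f"output port {port} out of range")
        mp[port // group_size] = 1
    return mp


# Figure 3 example with N = 256, M = 16: ports #1, #M, #(N-M+1)
# become 0, 15, and 240 in 0-based numbering.
print(multicast_pattern([0, 15, 240]))
```

With these inputs only the first and last bits are set, which matches the statement that only RM_1 and RM_K accept the cell; the two destinations inside the first group are separated later by the SSM.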
A priority field (P), used to assist contention resolution, can be flexibly set to any value to achieve the desired service preference. For instance, the priority field may consist of an activity bit (A), a connection priority (C), a buffer state priority (Q), a retry priority (R), and an input port priority (S). Let us assume that the smaller the priority value, the higher the priority level. The activity bit (A) indicates the validity of the cell. It is set to '0' if the cell is valid and to '1' otherwise. The connection priority (C) indicates the priority of the virtual connection, which can be determined during call setup or service provisioning. The buffer state priority (Q) provides a sharing effect among the N input buffers by allowing the HOL cell in an almost-overflowed buffer (e.g., one exceeding a predetermined threshold) to be transmitted sooner so that the overall cell loss probability is reduced. The retry priority (R) provides a global first-come-first-served (FCFS) discipline, allowing a cell's priority level to move up by one whenever it loses contention. The retry priority (R) can initially be set to '1111' and decreased by one each time the cell loses contention. In order to achieve fairness among input ports, the priority levels of the head-of-line cells at the input ports change dynamically at each time slot. The input port priority (S) can initially be set to the input port address, with log2 N bits, and decreased by one at every time slot, thus achieving round-robin fairness.
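To make the ordering of the sub-fields concrete, here is a minimal Python sketch that packs A, C, Q, R, and S into one integer so that a plain numeric comparison reproduces "smaller value, higher priority level". The field order follows the text, but the widths of C (2 bits) and Q (1 bit) are hypothetical choices made only for illustration; R uses 4 bits as in the '1111' example, and S uses log2 N = 8 bits for N = 256.

```python
# (name, width in bits), most significant field first.
# The C and Q widths are assumptions of this sketch.
FIELD_WIDTHS = (("A", 1), ("C", 2), ("Q", 1), ("R", 4), ("S", 8))


def pack_priority(**fields):
    """Concatenate the priority sub-fields into a single integer."""
    value = 0
    for name, width in FIELD_WIDTHS:
        field = fields[name]
        if not 0 <= field < (1 << width):
            raise ValueError(f"{name}={field} does not fit in {width} bits")
        value = (value << width) | field
    return value


fresh = pack_priority(A=0, C=1, Q=0, R=0b1111, S=7)
retried = pack_priority(A=0, C=1, Q=0, R=0b1110, S=200)
idle = pack_priority(A=1, C=0, Q=0, R=0, S=0)
print(retried < fresh < idle)
```

Because A is the most significant bit, any valid cell outranks an idle one, and because R sits above S, a cell that has already lost contention outranks a fresh cell of the same connection priority regardless of input port.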
The broadcast channel number (BCN) in Figure 4 is used to find a new multicast pattern in the MTT, allowing the copied cell to be further duplicated in the SSM. The BCN is also used by the OPC to find a new VCI/VPI for each copy of the replicated cell. The BCN can either be assigned during call setup or be a combination of the input port number and the VPI/VCI value.
Multicast Contention Resolution Algorithm
Here, we describe a novel algorithm that resolves output port contention among the input ports in a fair manner. It can also perform call splitting for multicasting and thus improves the system throughput. By applying distributed and parallel processing techniques, our contention resolution algorithm is able to accommodate a large-scale switch. Output port contention resolution is often implemented by a device called an arbiter. Most proposed arbiters can only handle unicast calls (i.e., point-to-point communication) and N-to-1 selection, for example, the three-phase algorithm [15], the ring reservation algorithm [16], and the centralized contention resolution device [17].
Implementing an arbiter capable of handling call splitting and N-to-multiple selection is much more challenging in terms of timing constraints. At the beginning of the cell time slot, the arbiter receives N multicast patterns, one from each input port, and returns an acknowledgement to those input ports whose HOL cells have won contention. These cells are then allowed to be transmitted to the switch fabric. Let us consider these N multicast patterns (each with K bits in our architecture example) stacked up, so that there are K columns with N bits in each column. Each column is associated with one output group. The arbiter's job is to select up to, for example, L × M bits that are set to '1' from each column and to repeat the operation K times, all of which must be finished in one cell time slot. In other words, the arbitration's timing complexity is on the order of O(NK). The arbiter may become the system's bottleneck when N or K is large.

The arbitration scheme we propose here performs N-to-(L × M) selection in a distributed manner using the switch fabric and all input port controllers (IPCs), thus eliminating the speed constraint. Another difference between our arbitration scheme and others is that in our scheme the HOL cell is repeatedly sent to the switch fabric to compete with others until it has been successfully transmitted to all the output groups for which it is destined. Unlike other arbitration schemes, our scheme does not wait for an acknowledgement before transmitting the cell. When a cell is routed in a switch fabric without waiting for an acknowledgement, two situations are possible. It could be successfully routed to all necessary output groups, or routed to only a subset of the output groups (including the empty set). The latter case is considered a failure, and the HOL cell will retry in the next time slot. When a cell is transmitted to the switch fabric, since it is not known whether it will succeed, it must be stored in a one-cell buffer for possible retransmission.
Now the question is how the IPC knows whether or not its HOL cell has been successfully transmitted to all necessary output groups. In our implementation, the routing modules (RMs) are responsible for returning the routing results to the IPC. One possible way is to let each RM inform the IPCs of the identification (e.g., the broadcast channel number) of the cells that have been successfully routed. However, since a cell could be routed to multiple output groups (for instance, up to K output groups in a broadcast situation), one IPC may receive up to K acknowledgements from K RMs. The complexity of returning the identification of every successfully routed copy to all IPCs is too high to be practical for a large-scale switch. In the following, we introduce a scheme that significantly reduces the complexity of the acknowledgement operation.
The RM can not only route cells to the proper output groups but also, based on the cells' priority levels, choose up to L × M cells that are destined for the same output group. The HOL cell of each input port is assigned a unique priority level that is different from the others. After cells are routed through an RM, they are sorted at the output links of the RM according to their priority levels, from left to right in descending order (see Figure 2). The cell that appears at the rightmost output link has the lowest priority level among the cells that have been routed through this RM. This lowest priority information is broadcast to all IPCs. Each IPC then compares the local priority level (LP) of its HOL cell with a feedback priority, say FP_j, to determine whether the HOL cell has been routed through RM_j. Note that there are K feedback priorities, FP_1, ..., FP_K. If the feedback priority level (FP_j) is lower than or equal to the local priority level (LP), the IPC determines that its HOL cell has reached one of the output links of RM_j. Otherwise, the HOL cell must have been discarded in RM_j due to loss of contention and will be retransmitted in the next time slot. Since there are K RMs in total, there are K lines broadcast from the K RMs to all IPCs, each carrying the lowest priority information in its output group.
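The selection and feedback step can be modeled behaviorally as follows. This Python sketch replaces the SWE array with a simple sort and is not a description of the hardware; the names `route_through_rm` and `reached_output` are hypothetical, and the all-ones feedback value returned when fewer than L × M cells compete is an assumption (it stands in for an idle cell on the rightmost link).

```python
def route_through_rm(priorities, links, idle_priority=0xFFFF):
    """Behavioral model of one routing module (RM).

    priorities: dict mapping an IPC id to the unique priority value of
                its HOL cell destined for this output group.
    links:      number of routing links, L x M (at least 1).
    Returns (winning IPC ids, feedback priority value).
    """
    # Smaller value means higher priority level, so sort ascending.
    ranked = sorted(priorities.items(), key=lambda item: item[1])
    winners = [ipc for ipc, _ in ranked[:links]]
    if len(ranked) >= links:
        feedback = ranked[links - 1][1]  # cell on the rightmost link
    else:
        feedback = idle_priority         # assumed idle-cell value
    return winners, feedback


def reached_output(local_priority, feedback_priority):
    """IPC-side test: the HOL cell got through if LP <= FP (by value)."""
    return local_priority <= feedback_priority


winners, fp = route_through_rm({0: 5, 1: 2, 2: 9, 3: 7}, links=2)
print(winners, fp, reached_output(5, fp), reached_output(7, fp))
```

Each IPC needs only the single feedback value per RM to decide its own outcome, which is what removes the per-cell acknowledgements discussed above.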
The priority assigned to the HOL cells is dynamically changed according to some arbitration policy, such as random, round-robin, state-dependent, or delay-dependent [20]. The random scheme randomly chooses the HOL cells of input ports for transmission; its drawback is a large delay variation. The round-robin scheme chooses HOL cells from input ports in a round-robin fashion by dynamically changing the scanning point from the top to the bottom input port (e.g., the S field in Figure 4). The state-dependent scheme chooses the HOL cell in the longest input queue so that input queue lengths are kept nearly equal, achieving the input buffer sharing effect (e.g., the Q field in Figure 4). The delay-dependent scheme performs like a global FIFO, where the oldest HOL cell has the highest priority to be transmitted to the output (e.g., the R field in Figure 4). Since our arbitration is performed in a distributed manner by K RMs and in parallel by the IPCs, we can implement any of the above policies, or a combination of them, by assigning an appropriate priority level to the HOL cell.
At the beginning of the time slot, each IPC sends its HOL cell to the MGN. Meanwhile, the HOL cell is temporarily stored in a one-cell buffer during its transmission. After cells have traversed the RMs, the priority information, FP_1 to FP_K (the priority of the rightmost link of each RM), is fed back to every IPC. Each IPC then compares the feedback priority level FP_j, j = 1, 2, ..., K, with its local priority level, LP. Three situations can happen. First, MP_j = 1 and LP ≤ FP_j (recall that the smaller the priority value, the higher the priority level), which means the HOL cell is destined for the j-th output group and has been successfully routed through the j-th RM. The MP_j bit is then set to '0'. Second, MP_j = 1 and LP > FP_j, which means the HOL cell is destined for the j-th output group but was discarded in the j-th RM. The MP_j bit remains '1'. Third, MP_j = 0, which means the HOL cell is not destined for the j-th output group. The MP_j bit then remains '0'. After all MP_j bits (j = 1, 2, ..., K) have been updated according to one of the above three scenarios, a signal indicating whether the HOL cell should be retransmitted, resend, is asserted to '1' if one or more bits in the multicast pattern remain '1'. The resend signal is initially set to '0'. If the multicast pattern bits are all '0', meaning the HOL cell has been successfully transmitted to all necessary output groups, the resend signal is deasserted. The IPC then clears the HOL cell in the one-cell buffer and transmits the next cell in the input buffer (if any) in the next time slot.

Figure 6 gives an example of how a multicast pattern is modified. Let us assume that at the beginning of the m-th time slot, the HOL cell is destined for three output groups: #1, #3, and #K. Therefore, the multicast pattern at the m-th time slot, MP^m, has three bits set to '1'.
Let us also assume the local priority value (LP) of the HOL cell is 5 and the feedback priority values from #1, #2, #3, and #K are 7, 2, 3, and 5, respectively, as shown in Figure 6. The result of comparing LP with the FPs is '0110...00', which is then logically ANDed with MP^m to produce a new multicast pattern, '0010...00', for the next time slot (MP^(m+1)). Since only MP^(m+1)_3 is set to '1', the IPC determines that the HOL cell has been successfully routed through RMs #1 and #K but discarded in RM #3, and it will retransmit the cell in the next time slot.
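The update rule and the Figure 6 example can be reproduced with a short Python sketch. For brevity it assumes K = 4, so that groups #1, #2, #3, and #K are the only ones; the function name `update_multicast_pattern` and the list-of-bits representation are likewise assumptions of the sketch.

```python
def update_multicast_pattern(mp, lp, fps):
    """Return (next multicast pattern, resend flag) for one HOL cell.

    mp:  current multicast pattern as a list of K bits.
    lp:  local priority value of the HOL cell.
    fps: the K feedback priority values, one per routing module.
    """
    # A '1' marks a group where the cell lost contention (LP > FP_j).
    lost = [1 if lp > fp else 0 for fp in fps]
    # AND with the old pattern keeps only destinations still pending.
    next_mp = [m & l for m, l in zip(mp, lost)]
    resend = 1 if any(next_mp) else 0
    return next_mp, resend


# Figure 6 values, collapsed to K = 4: destinations #1, #3, #K.
next_mp, resend = update_multicast_pattern(
    mp=[1, 0, 1, 1], lp=5, fps=[7, 2, 3, 5]
)
print(next_mp, resend)
```

The comparison vector here is 0110 and the new pattern is 0010 with resend asserted, in agreement with the worked example: only the copy for group #3 remains to be sent.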
Implementation of Input Port Controller (IPC)
Figure 5 shows a block diagram of the IPC. For ease of explanation, let us assume the switch has 256 input ports and 256 output ports and that every 16 output ports form one group. A major difference between this IPC and traditional ones is the addition of the multicast contention resolution unit (MCRU), shown in a dashed box. It determines, by comparing the K feedback priorities with the local priority of the HOL cell, whether or not the HOL cell has been successfully routed to all necessary output groups.
Let us start from the left, where the input line from the SONET/ATM network is terminated. Cells, 16 bits wide, are written into an input buffer. The HOL cell's VCI/VPI is used to extract the necessary information from a routing table. This information includes a new VPI/VCI for unicast connections; a broadcast channel number (BCN) for multicast connections, which uniquely identifies each multicast call in the entire switch; a multicast pattern (MP) for routing cells in the MGN; and the connection priority (C). This information is then combined with a priority field to form the routing information, as shown in Figure 4.
As the cell is transmitted to the MGN through a parallel-to-serial converter (P/S), the cell is also temporarily stored in a one-cell buffer. If the cell fails to be routed successfully through the RMs, it will be retransmitted in the next cell cycle. During retransmission, it is written back to the one-cell buffer in case it fails again. The S down counter is initially loaded with the input address and decremented by one at each cell clock. The R down counter is initially set to all '1's and decreased by one every time the HOL cell fails to be transmitted successfully. When the R counter reaches zero, it remains at zero until the HOL cell has been cleared and a new cell becomes the HOL cell.
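The behavior of the two down counters can be summarized in a small Python model. It is a sketch only: the class and method names are hypothetical, the 4-bit R width follows the '1111' example given earlier, and the modulo-N wraparound of the S counter is an assumption, since the text says only that S is decremented at each cell clock.

```python
class PriorityCounters:
    """Model of the S and R down counters in one IPC."""

    def __init__(self, port_address, n_ports=256, r_bits=4):
        self.n_ports = n_ports
        self.r_max = (1 << r_bits) - 1
        self.s = port_address  # S: loaded with the input port address
        self.r = self.r_max    # R: starts at all ones

    def cell_clock(self):
        """Decrement S once per cell clock (wraparound is assumed)."""
        self.s = (self.s - 1) % self.n_ports

    def contention_lost(self):
        """Decrement R on a failed transmission, saturating at zero."""
        self.r = max(self.r - 1, 0)

    def new_hol_cell(self):
        """Reload R when a new cell becomes the HOL cell."""
        self.r = self.r_max


counters = PriorityCounters(port_address=0)
counters.cell_clock()
for _ in range(20):
    counters.contention_lost()
print(counters.s, counters.r)
```

Saturating R at zero means that a cell which has lost contention more often than the counter can record simply stays at the highest retry priority until it is cleared.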
The K feedback priority signals, FP_1 to FP_K, are converted to 16-bit-wide signals by the serial-to-parallel converters (S/P) and latched in the 16-bit registers. They are simultaneously compared with the HOL cell's local priority (LP) by K comparators. Recall that the larger the priority value is, the lower the priority level is. If the value of FP_j is larger than or equal to the local priority value (LP), the j-th comparator's output is asserted low, which then resets the MP_j bit to zero regardless of its previous value ('0' or '1'). After the resetting operation, if any one of the MP_j bits is still '1', indicating that the HOL cell did not get through at least one RM in the current cycle, the resend signal is asserted high and the HOL cell is retransmitted in the next cell cycle with the modified multicast pattern.
As shown in Figure 5, there are K sets of S/P, FP register, and comparator. As the switch size increases, the number of output groups, K, also increases. To reduce hardware complexity, the comparison of the local priority value, LP, with the K feedback priority values can be time-division multiplexed, in which case only one set of this hardware is required.
An Architecture for a Large-scale Abacus Switch
The Abacus switch employs several techniques to accommodate a large-scale size (e.g., 1,024 × 1,024). For instance, the cross-bar structure of the SWE array permits short interconnections between SWEs. Input-output buffering allows lower-speed memory chips at the input and output ports. Distributed and parallel processing techniques are used to implement the multicast contention resolution. However, the timing requirement of routing cells and resolving contention also needs to be taken into consideration when building a large-scale switch.
Due to the timing alignment requirement for the signals of the vertical routing links and the horizontal lines in the RM, incoming cells and dummy cells from the address broadcasters (ABs) are skewed properly before they are sent to the SWE array. To implement the proposed multicast contention resolution algorithm, the time it takes to route cells through an RM and to feed back the lowest priority information from the RM to all IPCs must be less than one cell slot time. If the time is greater than one cell slot time, two situations can happen. First, if the HOL cell is held up in the one-cell buffer longer than a cell slot time, the system throughput will be degraded. On the other hand, if the cell next to the HOL cell is allowed to be transmitted before the HOL cell has been successfully transmitted to its output(s), a cell out-of-sequence problem may result. Although this can be resolved with a resequencing circuit at the output port, the complexity may be too high to be practical.
Here we ensure that the feedback priorities are returned to all IPCs within one cell time slot. Since each SWE in an RM introduces a 1-bit delay as the signal passes through it in either direction, the sum of the maximum number of SWEs between the IPC and the rightmost link of the RM and the number of bits in the MP field and the P field should be less than the number of bits in a cell. In other words, (N + L × M − 1) + (N/M + log2 N + 8) should be less than the cell length in bits, where 8 is chosen based on the partial routing information attached to the 53-byte cell in Figure 4 (i.e., M, A, C, Q, and R bits). For example, if we choose M = 16, L = 1.25, and a cell size of 64 bytes within the switch fabric, the inequality becomes N + 20 − 1 + N/16 + log2 N + 8 < 512. The maximum value of N is then 448, which is not large enough for a large-scale switch.

Figure 7 shows a proposed architecture to implement a large-scale ATM switch. In order to reduce the time spent on passing cells through the RM and returning the lowest priority information to the IPCs, the number of SWEs in the RM cannot be too large. For instance, if we partition the N inputs into K_1 groups, each with n inputs (i.e., N = n × K_1), the MGN's size is reduced from N to n. In other words, one big MGN is divided into K_1 smaller MGNs, each with n input lines, as shown in Figure 7. Recall that each output group of M output ports requires L × M routing links to achieve an acceptable throughput. Now there are K_1 MGNs, each with L × M routing links for each output group. Therefore, we need to further concentrate K_1 × (L × M) lines to L × M outputs by using concentration modules (CMs) at the second stage, where N/M CMs are required. The structure and implementation of the CM and the RM are identical, although the functions they perform differ slightly.
Since cells that pass through the RMs to the CM always have correct output group addresses, we just need to perform concentration by using the priority field in the routing information.
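The single-stage timing bound stated earlier in this section, (N + L × M − 1) + (N/M + log2 N + 8) < cell length in bits, can be checked numerically. The Python sketch below uses hypothetical function names and treats N/M and log2 N as real-valued quantities, which is an assumption about how the bound is evaluated.

```python
import math


def feedback_delay_bits(n, m, l):
    """Bit times until the feedback priority is available (single stage).

    Implements (N + L*M - 1) + (N/M + log2 N + 8) with real-valued
    N/M and log2 N, an assumption of this sketch.
    """
    return (n + l * m - 1) + (n / m + math.log2(n) + 8)


def max_inputs(m, l, cell_bits):
    """Largest N whose feedback delay is strictly below one cell time."""
    n = 1
    # The delay grows monotonically with N, so a linear scan suffices.
    while feedback_delay_bits(n + 1, m, l) < cell_bits:
        n += 1
    return n


# M = 16, L = 1.25, and a 64-byte (512-bit) internal cell.
print(max_inputs(16, 1.25, 512))
```

Under these assumptions the scan stops at N = 448, the value quoted above, which motivates partitioning the inputs into smaller MGNs.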
Note that the feedback lines carrying the lowest priority are returned from the second-stage CMs' output links instead of the first-stage RMs' outputs. Since there are K_2 (= N/M) output ports, K_2 feedback lines are needed. Now, the maximum delay between the cell entering the switch fabric and the availability of the feedback priority is n + K_1 (L × M
Performance Analysis of Abacus Switch
In this section, the performance analysis of the Abacus switch is presented. Both simulation and analytical results are shown for comparison. Simulation results are obtained with a 95% confidence interval not greater than 10% for the cell loss probability or 5% for the maximum throughput and average cell delay. In our analysis, we consider an ON-OFF source model in which the arrival process to an input port alternates between ON (active) and OFF (idle) periods. During the ON period, a traffic source sends a cell in every time slot; during the OFF period, it sends none. The durations of both the ON and OFF periods are assumed to be geometrically distributed.
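An ON-OFF source of this kind is straightforward to simulate. The following Python sketch is an illustration rather than the simulator used for the results reported here: the function name `on_off_source` is hypothetical, each period is assumed to last at least one slot, and the source is assumed to start in the OFF state.

```python
import random


def on_off_source(mean_on, mean_off, slots, seed=None):
    """Generate a 0/1 cell arrival sequence from an ON-OFF source.

    ON and OFF durations are geometric with the given means (in slots,
    each at least 1). Starting in the OFF state is an assumption.
    """
    rng = random.Random(seed)
    p_end_on = 1.0 / mean_on    # probability an ON period ends
    p_end_off = 1.0 / mean_off  # probability an OFF period ends
    on = False
    cells = []
    for _ in range(slots):
        cells.append(1 if on else 0)
        # Decide whether the current period ends after this slot.
        if on:
            if rng.random() < p_end_on:
                on = False
        elif rng.random() < p_end_off:
            on = True
    return cells


# Average burst length 15 and average idle time 3.75 slots give an
# expected offered load of 15 / (15 + 3.75) = 0.8.
cells = on_off_source(mean_on=15, mean_off=3.75, slots=200_000, seed=1)
load = sum(cells) / len(cells)
print(round(load, 2))
```

The long-run offered load is mean_on / (mean_on + mean_off), so the burst length and the load can be set independently by choosing the two means.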
Maximum Throughput
In this section, we analyze the maximum throughput of the Abacus switch. The maximum throughput of an ATM switch employing input queueing is defined by the maximum utilization at the output port. An input-buffered ATM switch has the so-called HOL blocking problem, which degrades the switch's throughput. However, the throughput can be improved by speeding up the switch fabric's operation rate or by increasing the number of routing links with an expansion ratio $L$. Several other factors also affect the maximum throughput. For instance, the larger the switch size ($N$), burstiness ($\beta$), or input buffer size ($B_i$) is, the smaller the maximum throughput ($\rho_{max}$) will be. However, the larger the group expansion ratio ($L$) or group size ($M$) is, the larger the maximum throughput will be. Figure 8 shows that the maximum throughput increases monotonically with the group size. For $M = 1$, the switch becomes an input-buffered switch, and its maximum throughput $\rho_{max}$ is 0.586 for uniform random traffic ($\beta = 1$) and 0.5 for completely bursty traffic ($\beta \to \infty$). For $M = N$, the switch becomes a completely shared memory switch such as Hitachi's switch [1]. Although it can achieve 100% throughput, it is impractical to implement a large-scale switch using such an architecture. Therefore, choosing $M$ between 1 and $N$ is a compromise between throughput and implementation complexity.
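The 0.586 figure for $M = 1$ under uniform random traffic can be illustrated with a small Monte Carlo sketch of the classical saturated head-of-line model. This is not the authors' simulator. It assumes that every input always has a HOL cell, that destinations are uniform, and that one contender per output is chosen at random in each slot. The function name and arguments are hypothetical.

```python
import random


def hol_saturation_throughput(n_ports, n_slots, seed=0):
    """Estimate the saturated throughput of a FIFO input-queued switch."""
    rng = random.Random(seed)
    # Destination of the head-of-line cell at each (always backlogged) input.
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    served = 0
    for _ in range(n_slots):
        # Group inputs by the output their HOL cell requests.
        contenders = {}
        for inp, out in enumerate(hol):
            contenders.setdefault(out, []).append(inp)
        # Each requested output serves exactly one contender per slot.
        for inputs in contenders.values():
            winner = rng.choice(inputs)
            hol[winner] = rng.randrange(n_ports)  # next cell moves to HOL
            served += 1
    return served / (n_ports * n_slots)


for n in (2, 8, 32):
    print(n, round(hol_saturation_throughput(n, 5000, seed=1), 3))
```

For two ports the estimate settles near 0.75, and it decreases toward roughly 0.59 as the number of ports grows, consistent with the limiting value quoted in the text for large $N$.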
Average Delay
A cell may experience two kinds of delay while traversing the Abacus switch: input buffer delay and output buffer delay. To evaluate the delay at the output port, we assume that a small switch module in the Abacus switch is a shared-buffer switch, where all $M$ output ports share a physical memory. Figure 11 shows simulation results of the input and output buffers' average delay versus the input offered load $\rho_i$. Note that the input buffer's average delay is much smaller than the output buffer's average delay when the traffic load is below the saturated throughput. For example, for an input offered load $\rho_i$ of 0.8 and an average burst length of 15, the output buffer's average delay $T_o$ is 58.8 cell times, but the input buffer's average delay $T_i$ is only 0.1 cell time. Figure 11 also shows that the impact of the burstiness of the input traffic on the input buffer's average delay is very small when the traffic load is below the maximum throughput. Figure 12 shows simulation results of the input buffer's average delay versus the expanded throughput $\rho_j$ for both unicast and multicast traffic. Here we assume that the number of replicated cells is geometrically distributed with an average of $c$. The expanded throughput $\rho_j$ is measured at the inputs of the SSM and normalized to each output port. Note that multicast traffic has a lower delay than unicast traffic because a multicast cell can be sent to multiple destinations in one time slot, while a unicast cell can be sent to only one destination in a time slot. For example, assume that an input port $i$ has 10 unicast cells and another input port $j$ has a multicast cell with a fanout of 10. Input port $i$ will take at least 10 time slots to transmit the 10 unicast cells, while input port $j$ can possibly transmit the multicast cell in one time slot.
Cell Loss Probability
As suggested in [23], there can be two buffer control schemes for an input-output-buffered switch: the queue loss (QL) scheme and the back pressure (BP) scheme. In the QL scheme, cell loss can occur at both input and output buffers. All the simulation results shown in previous sections are based on the QL scheme.
In the BP scheme, by means of backward throttling, the number of cells actually switched to each output group is limited not only by the number of routing links ($L \times M$) but also by the current storage capability of the corresponding output buffer. For example, if the free buffer space in the corresponding output buffer is less than $L \times M$, only as many cells as fit in the free space are transmitted, and all other HOL cells destined for that output group remain in their respective input buffers. The Abacus switch can easily implement the back pressure scheme by forcing the address broadcaster (AB) in Figure 2 to send dummy cells with the highest priority level, which automatically blocks the input cells from using those routing links. Furthermore, the number of blocked links can be dynamically assigned based on the output buffer's congestion situation.
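The admission limit imposed by back pressure can be sketched as a simple per-slot calculation. The helper below is illustrative only and its names are hypothetical. It assumes that the number of links blocked by dummy cells equals the shortfall between the routing links and the free output buffer space, which is one plausible assignment policy rather than the one the hardware necessarily uses.

```python
def admit_under_backpressure(hol_cells, routing_links, free_buffer_cells):
    """Return (admitted, held_back, blocked_links) for one output group in one slot.

    hol_cells:         HOL cells currently destined for the output group
    routing_links:     L x M routing links available to the group
    free_buffer_cells: free space (in cells) in the group's output buffer
    """
    # Assumption: one dummy cell per routing link that the buffer cannot absorb.
    blocked_links = max(0, routing_links - free_buffer_cells)
    usable_links = routing_links - blocked_links
    admitted = min(hol_cells, usable_links)
    held_back = hol_cells - admitted  # these cells stay in their input buffers
    return admitted, held_back, blocked_links


# 30 contending HOL cells, 20 routing links, room for only 5 cells.
print(admit_under_backpressure(30, 20, 5))  # (5, 25, 15)
```

When the output buffer has at least $L \times M$ free cells, no links are blocked and the scheme behaves like the QL scheme for that slot.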
Here, we only consider the QL scheme (cell loss at both input and output buffers). In the Abacus switch, cell loss can occur at the input and output buffers, but not in the MGN. Figure 13 shows input buffer overflow probabilities for different average burst lengths, $\beta$. For uniform random traffic, an input buffer with a capacity of a few cells is sufficient to keep the buffer overflow probability below $10^{-6}$. As the average burst length increases, so does the cell loss probability. For an average burst length of 15, the required input buffer size can be a few tens of cells for a buffer overflow probability of $10^{-6}$. By extrapolating the simulation result, the input buffer size is about 100 cells for a cell loss rate of $10^{-10}$. Figure 14 shows output buffer overflow probabilities for different average burst lengths. Here, $B_o$ is the normalized buffer size for each output. We notice that the required output buffer size is much larger than the input buffer size for the same cell loss probability.
The $x_0$ signal is broadcast to all SWEs to initialize each SWE to a cross state, where the west input passes to the east and the north input passes to the south. The $x_1$ signal specifies the address bit(s) used for routing cells, while the $x_2$ signal specifies the priority field. Other $x$ output signals propagate along with cells to the adjacent chips on the east or south side. The m[0:1] signals are used to configure the chip into four different group sizes, as shown in Table 1: (1) 8 groups, each with 4 output links, (2) 4 groups, each with 8 output links, (3) 2 groups, each with 16 output links, and (4) 1 group with 32 output links. The m[2] signal is used to configure the chip for either unicast or multicast application. For the unicast case, m[2] is set to 0, while for the multicast case, m[2] is set to 1.
32x4 SWE Array
As shown in Fig. 16, the SWEs are arranged in a cross-bar structure, where signals only communicate between adjacent SWEs, easing the synchronization problem. ATM cells propagate through the SWE array like a wave moving diagonally toward the bottom right corner. The $x_1$ and $x_2$ signals are applied from the top left of the SWE array, and each SWE distributes the $x_1$ and $x_2$ signals to its east and south neighbors. This requires the signals arriving at each SWE to have the same phase. The $x_1$ and $x_2$ signals are passed to the neighboring SWEs (east and south) after a delay of one clock cycle, as are the data signals ($w$ and $n$). The $x_0$ signal is broadcast to all SWEs (not shown in Fig. 16) to precharge an internal node in the SWE in every cell cycle. The $x_{1e}$ output signal is used to identify the address bit position of the cells in the first SWE array of the next adjacent chip.
The timing diagram of the SWE input signals and its two possible states are shown in Fig. 17. Two bit-aligned cells, one from the west and one from the north, are applied to the SWE along with the $dx_1$ and $dx_2$ signals, which determine the address and priority fields of the input cells. The SWE has two states: cross and toggle. Initially, the SWE is set to the cross state by the $dx_0$ signal, i.e., cells from the north side are routed to the south side, and cells from the west side are routed to the east side. When the address of the cell from the west ($dw_a$) matches the address of the cell from the north ($dn_a$), and the west's priority level ($dw_p$) is higher than the north's ($dn_p$), the SWE is toggled. The cell from the west side is then routed to the south side, and the cell from the north is routed to the east. Otherwise, the SWE remains in the cross state.
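The cross/toggle decision described above can be modeled behaviorally as follows. This is a cell-level sketch, not a description of the bit-serial circuit. The `Cell` type and `swe_route` function are hypothetical names, and the sketch assumes that a numerically smaller priority value denotes a higher priority level.

```python
from typing import NamedTuple, Tuple


class Cell(NamedTuple):
    address: int   # output (group) address field
    priority: int  # assumption: smaller value = higher priority level


def swe_route(west: Cell, north: Cell) -> Tuple[Cell, Cell, str]:
    """Return (east_output, south_output, state) for one switch element."""
    # Toggle only if the addresses match AND the west cell has higher priority.
    if west.address == north.address and west.priority < north.priority:
        return north, west, "toggle"  # west -> south, north -> east
    return west, north, "cross"       # west -> east, north -> south


# West cell wins the contention and is routed south.
east, south, state = swe_route(Cell(3, 0b1000100), Cell(3, 0b1111111))
print(state, south)
```

Because a tie in priority leaves the SWE in the cross state, the cell arriving from the north keeps its routing link in that case.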
Testing Results
The 32x32 ARC chip has been designed and fabricated using 0.8-µm CMOS technology with a die size of 6.6 mm x 6.6 mm. Note that this chip is pad limited. The chip has been tested successfully up to 240 MHz by using a high-speed oscilloscope, a timing analyzer, and a pattern generator capable of generating signals up to 1 GHz. The chip's characteristics are summarized in Table 2. Its photograph is shown in Fig. 18. Figure 19 shows a testing result, where $x_{2s}$, $s_0$, $s_1$, and $s_2$ are shown from top to bottom, respectively. $x_{2s}$ specifies the range of the priority field of the cells at the output, which is chosen to be 7 bits in this test. Since $x_{2s}$ is taken from the bottom left of the SWE from which $s_0$ comes out, it is aligned with $s_0$. $s_1$ is delayed by one clock cycle with respect to $x_{2s}$. Similarly, $s_2$ is delayed by two clock cycles with respect to $x_{2s}$. In this test, cells are applied to the west inputs, while the north inputs are tied to VDD. It is observed that the south outputs come out in sorted priority order. The priority of $s_0$ is 1000100, which is the highest priority among all inputs. The priority of $s_1$ is 1000101, which is the second highest. The priority of $s_2$ is 1000110, which is the third highest. Note that the cell length used here is kept short so that one cell cycle fits in the viewing window of the oscilloscope.
Conclusion
We have described a new architecture to implement a multicast ATM switch scalable from a few tens to a few thousands of input ports. The switch, called the Abacus switch, consists of a nonblocking switch fabric followed by small switch modules at the output ports and has buffers at the input and output ports. The switch employs a novel algorithm to resolve the contention of multicast cells destined for the same output port (group). The algorithm also provides the capability of sharing input buffers, effectively achieving fairness among the input ports, and performing call splitting for multicasting. The channel grouping mechanism is adopted in our switch to reduce the hardware complexity and improve the switch's throughput, while the cell sequence integrity is preserved. The switch can also handle multiple-priority traffic by routing cells according to their priority levels.
Cell replication, cell routing, output contention resolution, and cell addressing are all performed in a distributed manner in the Abacus switch so that it can be scaled up to thousands of input and output ports. Cell replication is achieved by broadcasting incoming cells to multiple routing modules, each of which consists of a two-dimensional array of switch elements (SWEs). The regular structure permits us to implement a high-density VLSI chip and to relax the synchronization requirements for data and clock signals. A key ASIC chip for building the Abacus switch, called the ARC (ATM Routing and Concentration) chip, contains a two-dimensional array (32x32) of switch elements that are arranged in a cross-bar structure. The chip has been designed and fabricated using 0.8-µm CMOS technology and tested to operate correctly at 240 MHz.
The performance of the Abacus switch under bursty traffic was presented. By engineering the expansion ratio (number of routing links/group size), the head-of-line blocking probability can be lowered arbitrarily so that the throughput of the input-output-buffered switch approaches that of the output-buffered switch. For a given expansion ratio, as the group size increases, the maximum throughput also increases. The results show that when the traffic load is below the maximum throughput, the input buffer's average delay is much smaller than the output buffer's average delay, e.g., by one order of magnitude, and the impact of the burstiness of input traffic on the input buffer's average delay is very small. It was also shown that multicasting has better throughput and smaller input-buffer delay than unicasting under a uniform traffic distribution.
H. Jonathan Chao (S'82-M'85-SM'95) received the B.S.E.E. and M.S.E.E. degrees from
Jin-Soo Park (S'96) received the B.S. degree from Seoul National University, Seoul, Korea
