Abstract-This paper focuses on designing a large $N \times N$ high-performance broad-band ATM switch. Despite advances in architectural designs, practical switch dimensions continue to be severely limited by both the technological and physical constraints of packaging. Here, we focus on augmentation in a "single-switch" design: we provide ways to construct arbitrarily large switches out of modest-size components and retain overall delay/throughput performance. We propose a growable switch architecture based on several key principles: 1) the knockout principle exploits the statistical behavior of cell arrivals, and thereby reduces the interconnect complexity; 2) output queueing yields the best possible delay/throughput performance; 3) distributed control routes (multicast) cells through the interconnect fabric without internal path conflicts; and 4) simple basic building blocks facilitate scalability. Other attractive features of the proposed architecture include: 1) intrinsic broadcast and multicast capabilities; 2) built-in priority sorting functionality; and 3) the guarantee of first-in, first-out cell sequence. To achieve a cell loss probability of $10^{-14}$, only basic building modules of maximum size $32 \times 16$ are required, and no crossover interconnects exist between modules in a three-dimensional configuration.
I. INTRODUCTION

SEVERAL researchers have investigated the problems in constructing large-scale switch architectures [1]-[3]. As discussed in [1], various switches can be designed on paper to large dimensions, but the technological and physical constraints (e.g., chip, multichip module, board sizes, and speed of interconnects) often impose a practical limit on their maximum size. If we want a larger switching system, then two or more smaller switches have to be interconnected; however, the resulting complete architecture might no longer be the same as each smaller one. Simple interconnection of these smaller switches creates several stages of queueing delay, and results in severe performance degradation if congestion occurs at intermediate modules. For instance, the Clos connection pattern shown in [4] has only 46% throughput.
In building a large scale ATM "single-switch" architecture, there are several important considerations: 1) the use of simple basic building blocks and modular construction allows the expandability of a switch; 2) the self-routing property eliminates the complicated central processor in path hunting; 3) the complexity of interconnection wiring between stages time. This is because the technological limitation, the memory access speed, cannot be changed with services.
In a multimedia environment, many different traffic types are statistically multiplexed, and ATM cells can have different priority levels. To achieve priority sorting functionality, an additional sorting network can be used in input queueing or shared queueing switches [8]. However, the priority requirement may conflict with basic architectural designs in output queueing switches. For example, in the knockout switch [6], the design of the $N$-input/$L$-output concentrator cannot guarantee that cells with higher priority reach their output ports. In the concentration section of the proposed switch, we develop a design that has priority sorting capability along with the knockout principle. The knockout principle is that, in an $N \times N$ switch configuration, the probability of having more than $L$ contending cells for the same output port in a time slot is negligibly small for all values of $N$.
Since there are no internal buffers inside the switch fabric, the switch preserves cell sequence order, and is not vulnerable to bursty traffic. Moreover, with the deployment of multicast services, there are frequently transient situations in which the traffic load offered to an output port temporarily exceeds one. We observe that an output queueing discipline with the knockout principle is superior in providing very low cell loss probability even when these traffic conditions persist. We discuss this in detail in Section VI.
In Section II, we describe the switch architecture and the basic building components. This switch provides prioritized services, and it is an internally nonblocking, internally unbuffered multicast (PINIUM) switch. The structural augmentation of the proposed switch is presented in Section III. Section IV discusses the performance analysis of the PINIUM Switch under random multicast traffic and bursty traffic conditions. Section V gives the complexity evaluation. Results and discussions are presented in Section VI. The conclusions are provided in Section VII.
II. THE PINIUM SWITCH ARCHITECTURE
The basic architecture of the PINIUM switch is shown in Fig. 1. The architecture consists of a distribution section and a concentration section. The distribution section is made up of a stack of multicast radix-$r$ trees. These tree networks provide the routing and multicasting functions for the switch. The concentration section is made up of a row of $N$-to-$L$ priority concentrating sorters. These priority concentrating sorters provide the hardware priority sorting and concentration functions. We employ the knockout principle in the concentration section to exploit the statistical behavior of cell arrivals, and thereby reduce the interconnect complexity. With this design in an $N \times N$ switch, we can assume that there are at most $L$ simultaneous arrivals to a given output port. Some researchers expand the above principle to a group of output ports in the so-called generalized knockout principle [1], [3]. There are reductions in hardware with the generalized principle, and the switch will perform well under well-behaved traffic such as uniform Bernoulli traffic. However, in the case of bursty traffic, the buffer size increases drastically in the output shared memory blocks, as shown in [10]. It is possible to find many traffic patterns such that the switches in [1]-[3] cannot deliver more than one cell to an output port in one time slot, even when there is a small number of cells destined for the same output port. The proposed PINIUM switch does not have this problem.
In Fig. 1, every input port has a multicast tree plane and every output port has an $N$-to-$L$ priority concentrating sorter. Therefore, these input and output switching modules are completely partitioned. This partitioned switch fabric provides a flexible distributed architecture which is the key to simplifying the operation and maintenance of the whole switching system. The modularity implies less stringent synchronization requirements and makes higher speed implementation possible.
A. The Distribution Section
In the distribution section, the radix-$r$ tree is responsible for carrying out the routing and multicasting functions. The multicast copies are self-generated and self-routed to the destinations in $\log_r N$ stages. The value of $r$ is determined by the fan-in and fan-out driving capacity of current technology. Unlike other nonblocking copy network designs, the radix-$r$ tree structure will not suffer from an overflow problem, thus providing fair and instant services to all arriving multicast cells.
The multicasting functionality comes naturally in the tree networks. Every input port has a separate tree plane; therefore, there is no conflict with other incoming multicast cells in generating multiple copies. We require the tree planes to perform several basic functions, namely point-to-point routing, point-to-multipoint multicasting, group distribution, and broadcasting. In [11], several techniques are discussed for achieving point-to-multipoint services in radix trees and banyan networks in a single pass. In the following, we briefly discuss the two techniques which are suitable for use on radix-$r$ tree networks. They are the cell filter approach and the vertex isolation addressing (VIA) scheme.
The cell filter approach is a straightforward method to broadcast a cell to all output ports. At each outlet of the multicast tree, a cell filter is needed to determine whether a cell is destined for that particular output port. The configuration is shown in Fig. 2(a). This approach requires a modification to the distribution section of the PINIUM switch: a broadcast tree is used instead of a multicast tree. Since we need only 1 bit to represent one output address, there are $N$ extra bits in the cell header. From [11], this method has the minimum overhead in routing multicast cells in a single pass. However, we need an extra $N^2$ cell filters in the cell filter version of the PINIUM switch.
Another method is the VIA scheme. The cell format is shown in Fig. 2(b); each bit represents the activity of a link of a node in the multicast tree. Thus, the number of bits in the VIA cell header is equal to the number of links within a tree. VIA retains the self-routing property, provides multicast services, and does not require extra cell filters. The overhead is slightly higher than that of the cell filter approach. In a radix-$r$ tree, a multicast element is a 1-to-$r$ multicast unit. The number of extra bits required is $r(N-1)/(r-1)$. This overhead is asymptotically close to that of the cell filter approach when $r$ approaches $N$. For example, when $N = 256$ and $r = 4$, we need a 1.8 speed-up factor in a VIA multicast tree compared to the 1.6 speed-up factor required in the cell filter approach.
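The two header overheads can be checked with a short calculation. The sketch below is only illustrative: it assumes a standard 53-byte (424-bit) ATM cell, one header bit per output port for the cell filter approach, and one header bit per tree link for VIA. The function names, and the choice of 256 outputs with radix 4, are our own assumptions; they were picked because they reproduce the 1.6 and 1.8 speed-up factors quoted in the text.

```python
CELL_BITS = 53 * 8  # assumed standard ATM cell: 53 bytes = 424 bits


def cell_filter_extra_bits(n_outputs: int) -> int:
    """Cell filter approach: one header bit per output port."""
    return n_outputs


def via_extra_bits(n_outputs: int, radix: int) -> int:
    """VIA: one header bit per link of a full radix-`radix` tree.

    A full tree with n_outputs leaves has (n_outputs - 1) / (radix - 1)
    internal nodes, and every internal node drives `radix` links.
    """
    internal_nodes, remainder = divmod(n_outputs - 1, radix - 1)
    if remainder:
        raise ValueError("no full tree of this radix has that many leaves")
    return radix * internal_nodes


def speed_up(extra_bits: int) -> float:
    """Line-rate expansion needed to carry the extra header bits."""
    return (CELL_BITS + extra_bits) / CELL_BITS


if __name__ == "__main__":
    print(via_extra_bits(256, 4), round(speed_up(via_extra_bits(256, 4)), 2))
    print(cell_filter_extra_bits(256), round(speed_up(256), 2))
```

When the radix equals the number of outputs, the tree degenerates to a single node and the VIA header has the same length as the cell filter bitmap.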
The main objectives of implementing radix-$r$ trees in the distribution section are to relax the synchronization problem and to provide a regular augmentation technique for high-speed VLSI realization. Indeed, the hardware in this section can be substantially simplified by employing optical implementations with the cell filter approach. Candidate technologies include an optical backplane with smart pixel arrays [9], a miniature cylindrical lens system, or fiber splitters.
B. The Concentration Section
In the concentration section of the PINIUM switch, each output port has a separate switching plane. Each of these switching planes performs two functions: concentration and priority sorting. A trivial approach is to use an $N \times N$ sorter for each output port. A sorted cell sequence is obtained according to the cell priority. Then we can choose the higher priority cells to route to an output port controller. If a bitonic sorter [7] is used, the number of stages and the number of switching elements in a sorter are on the order of $\log_2^2 N$ and $N \log_2^2 N$, respectively. Therefore, we note that the design of a priority concentrating sorter is crucial in determining the hardware complexity. Moreover, it is well known that the bitonic sorter limits how much a switch can be expanded. In the following, we describe a pruning technique to build large concentrating sorters with the knockout principle.
The number of switch inputs $N$ is much larger than the knockout parameter $L$. Therefore, it is not reasonable to build a large sorter for each output port. The knockout principle also implies that many switching elements in a full sorter are actually not used. We can simply decompose a priority sorter based on the knockout parameter $L$. We can partition a $2L \times 2L$ sorter into two basic components: two $L \times L$ sorters and one $2L \times 2L$ merger. Note that the losers as well as the unrelated switching elements in each merger can be removed. We call the resulting structure a $2L$-to-$L$ priority concentrating merger. Fig. 3(a) shows a four-input bitonic sorter [7]. Fig. 3(b) shows an 8-to-4 priority concentrating merger using the modified odd-even merge [14] or a modified bitonic merging technique to accept two monotonically sorted cell lists.
In an $N$-to-$L$ priority concentrating sorter, the first column of operation is done by $N/L$ sorters of size $L \times L$ to obtain $N/L$ sorted sequences. A $2L$-to-$L$ priority concentrating merger is then used to merge every two sorted sequences into one sorted sequence based on the priority of the cells. A cell successfully arriving at an output port controller will pass through $\log_2(N/L)$ stages of $2L$-to-$L$ priority concentrating mergers. Thus, we only require two basic components to construct an $N$-to-$L$ priority concentrating sorter, which are the $L \times L$ sorter and the $2L$-to-$L$ priority concentrating merger. In this method of decomposition, an $N$-to-$L$ priority concentrating sorter preserves all of the desired minimum global information, the higher priority cells, throughout the whole switching plane. Because the two basic building components are small and constructible with current technology, we can easily facilitate the concept of modularity and scalability. Fig. 4 shows an example of the construction of a priority concentrating sorter with five incoming cells of different priorities. Note that no matter how we arrange the five cells at the inputs of the concentrating sorter, only the cell with the lowest priority will be dropped.
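The decomposition can be illustrated with a small behavioral model. The sketch below is not the comparator network itself: it uses Python's built-in sort in place of the hardware sorters and mergers, and it assumes that a larger integer means a higher priority and that `None` marks an idle input. The function names and the 16-input, 4-output example are hypothetical choices made for illustration.

```python
from typing import List, Optional

Cell = Optional[int]  # priority value (larger = higher, assumed); None = idle


def _sort_desc(cells: List[Cell]) -> List[Cell]:
    """Order cells by decreasing priority, idle slots last."""
    active = sorted((c for c in cells if c is not None), reverse=True)
    return active + [None] * (len(cells) - len(active))


def merge_and_concentrate(a: List[Cell], b: List[Cell], L: int) -> List[Cell]:
    """Model of one merger: combine two sorted lists, keep the top L cells."""
    return _sort_desc(a + b)[:L]


def concentrating_sorter(inputs: List[Cell], L: int) -> List[Cell]:
    """Model of a concentrating sorter built from sorters and mergers."""
    groups, remainder = divmod(len(inputs), L)
    if remainder or groups < 1 or groups & (groups - 1):
        raise ValueError("input count must be L times a power of two")
    # First column: one small sorter per group of L inputs.
    lists = [_sort_desc(inputs[i:i + L]) for i in range(0, len(inputs), L)]
    # Remaining columns: pairwise mergers that halve the number of lists.
    while len(lists) > 1:
        lists = [merge_and_concentrate(lists[i], lists[i + 1], L)
                 for i in range(0, len(lists), 2)]
    return lists[0]


if __name__ == "__main__":
    cells = [None, 1, None, 9, None, 3, None, 5,
             None, None, 7, None, None, None, None, None]
    print(concentrating_sorter(cells, 4))
```

Because every merger keeps the best cells of its two inputs, the placement of the five cells does not change which four survive.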
III. EXPANSION AND PACKAGING OF THE PINIUM SWITCH
The switching planes of the PINIUM switch can be incrementally augmented to the desired dimensions. The maximum value of $r$ of a radix-$r$ tree is determined by the fan-in and fan-out driving capacity of the current technology. We assume that the largest such value is used. As shown in Fig. 5, we only need to add one extra 1-to-$r$ multicast unit in front of $r$ $N$-output trees to form an $rN$-output tree plane. The incremental expansion of the priority concentrating sorter is also simple. We only need to append one stand-alone $2L$-to-$L$ concentrating merger to two $N$-to-$L$ sorters in order to obtain a $2N$-to-$L$ concentrating sorter. Such an example is also illustrated in Fig. 4.
The simple augmentation methods of the multicast radix-$r$ trees and the concentrating sorters imply that the PINIUM switch can be built up in a similar fashion. If $r = 2$, as shown in Fig. 6(b), a $2N$-input PINIUM switch can be built from four $N$-input PINIUM modules. A new column of 1-to-2 elements is added in the front to construct $2N$-output multicast binary trees, and a new row of $2L$-to-$L$ priority concentrating mergers is added at the back to set up $2N$-to-$L$ priority concentrating sorters. The resulting larger switching system is still a regular PINIUM switch which preserves all of the properties previously discussed. The expansion displays some similarities to a single-stage crossbar switch, as shown in Fig. 6(a), but no external row (or column) arbiter is required for the PINIUM switch.
In addition, the architecture of the PINIUM switch lends itself to a natural three-dimensional packaging technique in which the distribution planes are orthogonal to the concentration planes, as shown in Fig. 1. Every input port has a separate multicast tree plane, and every output port has a separate priority concentrating sorter. More importantly, the $L \times L$ sorters and $2L$-to-$L$ concentrating mergers are small and easily constructible. This structure provides a very promising and realizable large-scale architecture. In the three-dimensional configuration, there are no cross wirings between modules. All of the interconnection wires are arranged in parallel between the basic building modules. Hence, there is no delay discrepancy in transmitting a signal between building blocks, which relaxes the synchronization problems. These advantages will allow the user to augment the size of a PINIUM switch easily.
IV. PERFORMANCE OF THE PINIUM SWITCH
A. Loss Performance
In this section, we investigate the general performance of the PINIUM architecture under unicast and multicast traffic, and under random and bursty traffic. We ignore the fact that cells may have different priorities. We define the input offered load as the average number of cell arrivals to an input port in a time slot. Let $\rho$ be the input offered load at each input port, and $0 < \rho \le 1$. If $P_k$ is the probability that $k$ cells are destined for the same output port, and $P_{\text{loss}}$ is the average cell loss probability, then we have the following general expression:
$$P_{\text{loss}} = \frac{1}{\rho} \sum_{k=L+1}^{N} (k - L) P_k. \tag{1}$$
Random Traffic: We assume that the incoming traffic follows a Bernoulli distribution, that every cell goes to any output port with equal probability, and that there is infinite waiting room in the output port controllers. In a unicast traffic environment, we then have
$$P_k = \binom{N}{k} \left(\frac{\rho}{N}\right)^{k} \left(1 - \frac{\rho}{N}\right)^{N-k}, \quad k = 0, 1, \ldots, N. \tag{2}$$
In the worst performance scenario, $N \to \infty$, and (1) becomes [6]
$$P_{\text{loss}} = \left(1 - \frac{L}{\rho}\right) \left(1 - \sum_{k=0}^{L} \frac{\rho^{k} e^{-\rho}}{k!}\right) + \frac{\rho^{L} e^{-\rho}}{L!}. \tag{3}$$
For the case of maximum input loading in a unicast environment, , and if , then is about 10 ( is about 10 ). In a multicast traffic environment, we assume that all cells destined for the same "tagged" output port go to an imaginary output queue in the same time slot before going through the knockout process. Let be the effective offered load to the tagged output port. With the multicast tree structure of the PINIUM switch, there is no overflow while all of the desired copies are generated.
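As a numerical illustration of the knockout loss analysis, the sketch below evaluates the cell loss probability as the expected number of cells in excess of the knockout parameter, divided by the offered load. It covers binomial arrivals for a finite switch and the Poisson limit for a very large switch. The names `load`, `L`, and `N` (offered load per port, knockout parameter, switch size) and the function names are our own labels; the formula follows the knockout analysis of [6], and the code is an assumption-laden sketch rather than the authors' evaluation procedure.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P[Binomial(n, p) = k], computed in the log domain (requires 0 < p < 1)."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def knockout_loss(load: float, L: int, N: int) -> float:
    """Loss for a finite switch: each of N inputs targets the port w.p. load/N."""
    p = load / N
    lost = sum((k - L) * binomial_pmf(k, N, p) for k in range(L + 1, N + 1))
    return lost / load


def knockout_loss_limit(load: float, L: int, terms: int = 200) -> float:
    """Loss as N grows without bound: Poisson arrivals with mean `load`."""
    lost = 0.0
    for k in range(L + 1, L + 1 + terms):
        log_pmf = k * math.log(load) - load - math.lgamma(k + 1)
        lost += (k - L) * math.exp(log_pmf)
    return lost / load


if __name__ == "__main__":
    for L in (8, 12, 16):
        print(L, knockout_loss(1.0, L, 1024), knockout_loss_limit(1.0, L))
```

Summing only positive terms avoids the cancellation that the equivalent closed form suffers when the loss is very small.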
Suppose every input port has a cell requesting multicast copies, . If 's are independent and identically distributed, and all of the output ports are uniformly loaded, then it is easy to show, using generating functions, that the number of cells arriving at the tagged output follows a binomial distribution (see Appendix). Moreover, the function is only dependent on the value of , the average number of copies requested per input port. Therefore, we obtain (4) for the tagged imaginary output queue. Hence, when the input loading is high, the value of can be larger than one after deploying multicast services. For system performance, we can replace the by in (2); we obtain the following for (1):
When , we obtain an equation which is similar to (3) (6) As an example, if we assume that every input cell requests a number of copies which follows a truncated geometric distribution with parameter , and is the probability of requesting number of copies per cell, then we have for
Therefore, the average number of copies requested in this case is
As goes to infinity, for the worst performance scenario, then we obtain . Using this result with (4) and (6), we are able to evaluate the performance of the PINIUM switch under this multicast traffic pattern. Bursty Traffic: The internal bufferless design of the PINIUM architecture is an important factor which provides the system with a memoryless property. There is no impact on the switch performance under bursty traffic. To investigate the performance of the switch under bursty-type traffic, we only need to find the mean offered load as well as the mean multicast copies requested. For instance, suppose the arrival process at each input port is an identical two-state Markov modulated Bernoulli process (MMBP). The source is characterized by a transition matrix and a rate matrix where and are the transition probabilities between the two states, and is the probability that the source at state has a cell arrival at an input port in a time slot. The mean input offered load is (9) We can then use this parameter in (4) and (6) to calculate the cell loss probability. This conclusion can be verified by using the matrix-analytic technique as shown in [12] .
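The mean offered load of a two-state MMBP source follows from the stationary distribution of its modulating chain. The sketch below is a minimal illustration with hypothetical parameter names: `p12` and `p21` are the per-slot probabilities of switching from state 1 to state 2 and back, and `rate1` and `rate2` are the per-slot cell arrival probabilities in each state. These names and the example values are assumptions and do not reproduce the notation of (9).

```python
def mmbp_mean_load(p12: float, p21: float, rate1: float, rate2: float) -> float:
    """Mean cell arrivals per slot of a two-state Markov modulated Bernoulli source."""
    if p12 + p21 <= 0.0:
        raise ValueError("at least one transition probability must be positive")
    # Stationary distribution of the two-state chain.
    pi1 = p21 / (p12 + p21)
    pi2 = p12 / (p12 + p21)
    # Weight each state's arrival probability by the time spent in that state.
    return pi1 * rate1 + pi2 * rate2


if __name__ == "__main__":
    # Hypothetical source: long quiet periods, shorter busy periods.
    print(mmbp_mean_load(p12=0.1, p21=0.3, rate1=0.2, rate2=0.9))
```

Only this mean enters the loss calculation, which is why burstiness does not change the loss inside the bufferless fabric.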
B. Delay Performance
After going through the concentration section of the PINIUM switch, the cell arrival process of the output port controller follows a truncated binomial distribution. Let denote the number of cells destined for a tagged output port controller; then we have (10) In this discrete-time model, we assume that both the arrivals and departures occur on the slot boundaries simultaneously. Every cell requires one time slot for processing; therefore, a cell arriving at an empty queue cannot be delivered instantaneously. In an infinite buffer system, if denotes the steady-state probability of the queue length, we have the following equations:
(11) . We can then obtain the average cell delay in the output port controller by using Little's formula:
$$\text{expected delay} = \frac{\text{expected queue length}}{\text{offered load}}. \tag{12}$$
In Fig. 7, we plot the expected cell delay against the offered load at the output port controller with .
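As a cross-check on the delay analysis, the mean delay can also be estimated from a standard closed form for a discrete-time single-server queue with independent batch arrivals in each slot. The sketch below assumes that result, namely mean wait = E[A(A-1)] / (2 E[A] (1 - E[A])), and adds one slot of service time. It is not a reconstruction of the steady-state equations in the text, and the names `N`, `L`, and `load` (switch size, knockout parameter, offered load at the port) are our own labels.

```python
import math
from typing import List


def truncated_binomial(N: int, L: int, load: float) -> List[float]:
    """P[A = k] for k = 0..L, where A = min(Binomial(N, load/N), L)."""
    p = load / N
    pmf = [math.comb(N, k) * p**k * (1 - p)**(N - k) for k in range(L)]
    # All probability mass at or above L collapses onto the value L.
    pmf.append(max(0.0, 1.0 - sum(pmf)))
    return pmf


def mean_delay(N: int, L: int, load: float) -> float:
    """Mean slots spent in the output queue, including one slot of service."""
    pmf = truncated_binomial(N, L, load)
    mean = sum(k * a for k, a in enumerate(pmf))
    second_factorial = sum(k * (k - 1) * a for k, a in enumerate(pmf))
    if not 0.0 < mean < 1.0:
        raise ValueError("carried load must lie strictly between 0 and 1")
    waiting = second_factorial / (2.0 * mean * (1.0 - mean))
    return waiting + 1.0


if __name__ == "__main__":
    for load in (0.1, 0.5, 0.9):
        print(load, mean_delay(1024, 16, load))
```

Without truncation the waiting term reduces to ((N-1)/N) times load / (2 (1 - load)), the familiar output queueing result.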
V. COMPLEXITY EVALUATION
A. Conventional Complexity Measures
The PINIUM switch has two different sections, a stack of multicast tree planes and a row of priority concentrating sorters. In calculating the number of stages, a radix-$r$ tree with $N$ output ports needs $\log_r N$ stages. A priority concentrating sorter also consists of two parts: the $L \times L$ sorters and the $2L$-to-$L$ priority concentrating mergers. We assume that both of them are constructed from Batcher's bitonic method; then an $L \times L$ sorter will have $\frac{1}{2}\log_2 L(\log_2 L + 1)$ stages, and a $2L$-to-$L$ concentrating merger will have $\log_2 L + 1$ stages. In an $N$-to-$L$ priority concentrating sorter, we obtain $N/L$ monotonically sorted sequences using a column of $L \times L$ sorters. Then another $\log_2(N/L)$ stages of $2L$-to-$L$ concentrating mergers are used to obtain the ultimate $L$ cells. Suppose that $S(N, L)$ is the total number of stages in the PINIUM switch with $N$ inputs and knockout parameter $L$; we have
$$S(N, L) = \log_r N + \tfrac{1}{2}\log_2 L(\log_2 L + 1) + (\log_2 L + 1)\log_2\frac{N}{L}. \tag{13}$$
For the knockout switch, the number of stages varies. From the example given in [6], an eight-input/four-output concentrator will require eight stages, and an eight-input/two-output concentrator will require five stages. Therefore, the PINIUM switch is expected to perform as well as the knockout switch in terms of latency. Moreover, $L$ is small when compared to $N$; the number of stages of the PINIUM switch is on the order of $\log_2 L \log_2 N$. Therefore, the PINIUM switch has shorter latency than those of the sort-banyan type of networks.
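The stage count can be sketched as a small calculation. The code below assumes bitonic construction, so that a sorter on L inputs has log2(L)(log2(L)+1)/2 stages, and it assumes that each pruned merger has log2(L)+1 stages. The number of tree levels is passed in explicitly because a tree may mix radices. All names are hypothetical; the example of 1024 ports with L = 16 and four tree levels is chosen because it reproduces the 44 stages quoted in Section VI.

```python
def _log2_exact(x: int) -> int:
    """Base-2 logarithm of a power of two."""
    if x < 1 or x & (x - 1):
        raise ValueError(f"{x} is not a power of two")
    return x.bit_length() - 1


def sorter_stages(L: int) -> int:
    """Stages in a bitonic sorter on L inputs."""
    m = _log2_exact(L)
    return m * (m + 1) // 2


def merger_stages(L: int) -> int:
    """Stages in a merger taking two sorted lists of L cells (assumed)."""
    return _log2_exact(L) + 1


def pinium_stages(N: int, L: int, tree_levels: int) -> int:
    """Tree levels, plus one sorter column, plus the columns of mergers."""
    if N % L:
        raise ValueError("N must be a multiple of L")
    merge_levels = _log2_exact(N // L)
    return tree_levels + sorter_stages(L) + merge_levels * merger_stages(L)


if __name__ == "__main__":
    # Three radix-8 levels plus one added 1-to-2 level reach 1024 outputs.
    print(pinium_stages(N=1024, L=16, tree_levels=4))
```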
In calculating the number of switching elements, suppose we adopt the VIA scheme to perform multicasting, and we further assume that each multicast switching element has the same complexity as each $2 \times 2$ switching element in the concentration section. There are $(N-1)/(r-1)$ multicast radix-$r$ elements in a tree plane. In a concentrating plane, we let $M_L$ be the number of switching elements in a $2L$-to-$L$ priority concentrating merger, which is
$$M_L = L + \frac{L}{2}\log_2 L, \tag{14}$$
and each $L \times L$ sorter has $\frac{L}{4}\log_2 L(\log_2 L + 1)$ switching elements. Suppose that $E(N, L)$ is the total number of switching elements in the PINIUM switch with $N$ inputs and knockout parameter $L$; we have
$$E(N, L) = N\,\frac{N-1}{r-1} + N\left[\frac{N}{4}\log_2 L(\log_2 L + 1) + \left(\frac{N}{L} - 1\right)M_L\right]. \tag{15}$$
From this expression, we notice that the number of switching elements is on the order of $N^2 \log_2^2 L$, whereas in the knockout switch, neglecting the cell filters and delay elements, there are on the order of $N^2 L$ switching elements. Thus, we note that the PINIUM switch incurs only a short latency for all incoming ATM cells. Moreover, it provides extra features such as the priority sorting function and multicast services. There is a drawback in terms of the number of switching elements, but with existing high-density VLSI technology, the weighting of this factor is actually reduced when compared to other physical constraints such as the limitation of the chip, MCM, or board size (see the following section). Hence, with the decomposition of the $N$-to-$L$ priority concentrating sorters, the PINIUM switch provides the advantages of fewer stages, fewer switching elements, and easier expansion.
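The element counts can be tallied in the same way. The sketch below assumes a bitonic sorter with (L/4) log2(L) (log2(L)+1) comparison elements and a pruned merger with L + (L/2) log2(L) elements. The merger formula is an inference of ours: it matches the 48 elements quoted later for a 32-to-16 merger but is not stated explicitly in the recoverable text. The function names are hypothetical.

```python
def _log2_exact(x: int) -> int:
    """Base-2 logarithm of a power of two."""
    if x < 1 or x & (x - 1):
        raise ValueError(f"{x} is not a power of two")
    return x.bit_length() - 1


def sorter_elements(L: int) -> int:
    """Comparison elements in a bitonic sorter on L inputs."""
    m = _log2_exact(L)
    return L * m * (m + 1) // 4


def merger_elements(L: int) -> int:
    """Elements in a pruned merger: L first-stage comparators, then a
    half-size bitonic merge on the L winners (inferred formula)."""
    return L + L * _log2_exact(L) // 2


def tree_elements(N: int, radix: int) -> int:
    """1-to-radix elements in a full tree with N leaves."""
    count, remainder = divmod(N - 1, radix - 1)
    if remainder:
        raise ValueError("no full tree of this radix has N leaves")
    return count


def concentrating_sorter_elements(N: int, L: int) -> int:
    """One concentrating plane: N/L sorters and N/L - 1 mergers."""
    groups = N // L
    return groups * sorter_elements(L) + (groups - 1) * merger_elements(L)


def pinium_elements(N: int, L: int, radix: int) -> int:
    """N tree planes plus N concentrating planes."""
    return N * (tree_elements(N, radix) + concentrating_sorter_elements(N, L))


if __name__ == "__main__":
    print(sorter_elements(16), merger_elements(16), tree_elements(512, 8))
    print(pinium_elements(512, 16, 8))
```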
B. Core Chip Complexity
Since there is no interconnect wiring problem in constructing a large-scale PINIUM switch and the synchronization requirement between modules is greatly relaxed, the pin-limited chip count is the most important measure when we implement the design. In the following, we calculate the chip counts based on the constraints of both the pin-limited count and the transistor-limited count [20].
The main property of the PINIUM switch is the separate module designs of the distribution and concentration sections. It is natural to implement the two sections in two different chips. The largest such chips are referred to as the core chips. For the distribution section, we investigate packaging using the cell filter approach because it is easier to determine the number of transistors used in each 1-to-$r$ broadcast element. From the pin-limited consideration, we can construct a tree with 512 output ports. We can use a three-level radix-8 broadcast tree. Each broadcast element is composed of two levels of inverters, and the first one has to be resized to maintain a good stage ratio [18, p. 229]. We need 73 1-to-8 elements (18 transistors/element) for a tree; thus, in total, there are 1314 transistors. Moreover, if we employ a straightforward implementation of a cell filter using 512 latches, each latch can be implemented with seven transistors using dynamic logic [18, p. 332], and we need 3584 transistors for a cell filter. In total, we require 1 835 008 transistors to construct all 512 cell filters. We then package these roughly 2 million transistors in a pin grid array of about 550 pins. The cell filters dominate the area of this chip. In practice, we can optimize the design of every cell filter because there is only one active clock time in a given cell to examine the address of the routing header. The transistor count can be substantially reduced with careful clocking while using several latches. As a simple comparison, if we use the VIA scheme and we estimate that each multicast element might take 2500 transistors (a very generous estimate), then there are about 0.2 million transistors in total.
For the concentration section, only sorting elements are required. With dynamic logic [15], we require 113 transistors for each comparison element to run at a speed of 155 Mbit/s. To achieve a cell loss rate of $10^{-14}$, we use $L = 16$, and then we need to construct $16 \times 16$ sorters (80 comparison elements/sorter) and $32 \times 16$ concentrating mergers (48 comparison elements/merger). If there are 512 input ports, then we need 32 sorters (289 280 transistors) and 31 concentrating mergers (168 144 transistors). There are about 0.5 million transistors in total, including the clock and control components. From the above discussions, the transistor count does not carry much weight in the implementation, but the pin count is the only constraint in constructing a large-scale PINIUM architecture. After designing these two chips for the basic $512 \times 512$ PINIUM switch, we only need two types of small chips, the 1-to-$r$ elements and the $32 \times 16$ concentrating mergers, to augment the switch. We can also deploy the G-VIA scheme in [11] for the multicast services in the augmented PINIUM switch. In Table I, we show the packaging requirements for several realizable sizes of the core chips.
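The transistor budgets of the two core chips follow from simple arithmetic on the per-element figures quoted above. The sketch below only re-derives those totals; the constant and function names are hypothetical, and the per-element transistor counts are taken from the text as given.

```python
TRANSISTORS_PER_BROADCAST_ELEMENT = 18   # 1-to-8 element, from the text
TRANSISTORS_PER_LATCH = 7                # dynamic-logic latch, from the text
TRANSISTORS_PER_COMPARATOR = 113         # comparison element, from the text


def distribution_chip(outputs: int = 512, radix: int = 8) -> dict:
    """Transistor tally for one broadcast tree plane with cell filters."""
    elements = (outputs - 1) // (radix - 1)           # 1 + 8 + 64 for 512 outputs
    one_filter = outputs * TRANSISTORS_PER_LATCH      # one latch per address bit
    return {
        "tree": elements * TRANSISTORS_PER_BROADCAST_ELEMENT,
        "one_filter": one_filter,
        "all_filters": outputs * one_filter,          # one filter per outlet
    }


def concentration_chip(inputs: int = 512, L: int = 16,
                       sorter_elems: int = 80, merger_elems: int = 48) -> dict:
    """Transistor tally for one concentrating plane."""
    sorters = inputs // L
    mergers = sorters - 1                             # binary tree of mergers
    sorter_t = sorters * sorter_elems * TRANSISTORS_PER_COMPARATOR
    merger_t = mergers * merger_elems * TRANSISTORS_PER_COMPARATOR
    return {"sorters": sorter_t, "mergers": merger_t, "total": sorter_t + merger_t}


if __name__ == "__main__":
    print(distribution_chip())
    print(concentration_chip())
```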
VI. RESULTS AND DISCUSSIONS
The performance of the PINIUM switch in random traffic condition is bounded by (6) when . We investigate the switch performance when approaches infinity. The incoming traffic expands after deploying multicast services, and in Table II , we tabulate the effective offered load at an output port when an input cell is destined for different . Under the effect of this traffic expansion, we plot a graph in Fig. 8 when . Fig. 8 depicts the cell loss probability versus the value of knockout parameter with different values of . In order to achieve low cell loss probability such as 10 or 10 , we have to appropriately choose the output size of the priority concentrating sorters. Table III shows the minimum value of to achieve 10 average cell loss. We observe that if the effective output offered load is consistently larger than one, then the output queueing discipline with the knockout principle is an excellent approach to handle the incoming traffic with ease. For example, in Fig. 8, with and , the switch can sustain the cell loss probability at 10 even when we have traffic expansion . The plot indicates the superiority of the PINIUM switch in handling consistent high surges of incoming multicast traffic. This observation is important because the sudden onrush of multicast traffic may obviously deteriorate the performance of other switch designs. In Table IV , we want to find the maximum acceptable offered load at an output port in the PINIUM switch for some values of and some criteria of cell loss probability. The values of are calculated using a bisection search method with tolerance . The tolerance is defined as where is the cell loss probability due to the maximum acceptable offered load at output port and is the desired cell loss probability. Thus, if , from Table  IV , the transient acceptable offered load at an output port can be as high as two with cell loss probability being kept around 10 .
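The bisection search for the maximum acceptable offered load can be sketched as follows. The code assumes the large-switch (Poisson) form of the knockout loss, a loss that increases with load, and a relative tolerance |P - P_target| / P_target; all three are our assumptions about the procedure, and the function names and default tolerance are hypothetical.

```python
import math


def loss_probability(load: float, L: int, terms: int = 200) -> float:
    """Knockout loss for Poisson arrivals with mean `load` and parameter L."""
    lost = 0.0
    for k in range(L + 1, L + 1 + terms):
        log_pmf = k * math.log(load) - load - math.lgamma(k + 1)
        lost += (k - L) * math.exp(log_pmf)
    return lost / load


def max_acceptable_load(L: int, target: float, tol: float = 1e-3,
                        max_iter: int = 200) -> float:
    """Largest load whose loss stays at the target, found by bisection."""
    lo, hi = 1e-6, float(L)
    if not loss_probability(lo, L) < target < loss_probability(hi, L):
        raise ValueError("target loss is not bracketed by the search interval")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        p = loss_probability(mid, L)
        if abs(p - target) / target <= tol:   # relative tolerance (assumed)
            return mid
        if p > target:
            hi = mid      # too much loss: search lower loads
        else:
            lo = mid      # loss below target: search higher loads
    raise RuntimeError("bisection did not converge")


if __name__ == "__main__":
    for L in (8, 12, 16):
        print(L, max_acceptable_load(L, 1e-6))
```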
We have neglected the impact of the traffic expansion on the output port controller buffers. To achieve steady-state condition and accommodate multicast traffic, we should keep the average input offered load from exceeding . In this case, the offered load at an output port can be kept below one, and different values of can be chosen to combat the traffic surge problem. Similarly, the effect of burstiness of the input traffic streams has essentially no impact on the switch because of the bufferless design inside the PINIUM switching fabric. However, it has a substantial effect on the buffer size in each output port controller, and we should engineer the buffer size carefully.
For the different priority class traffic in the PINIUM switch, we have done extensive simulations to verify the priority class performance. We set the buffer size to 1024 cells to simulate the case of infinite waiting buffers in an output port controller. Furthermore, we assume , and that the incoming random traffic of three different priority classes follows a truncated geometric distribution with parameter . Suppose is the probability that an incoming cell is of priority where . Then we have for We consider a unicast traffic environment; the input offered load has to be smaller than one to achieve a stable system. Each simulation has run through 10 million time slots. We notice that there are no cell losses in all priority classes when the offered load goes up to 0.9. However, the average cell delay increases with traffic intensity. In Fig. 9(a) , we plot the average cell delay against different offered loads from 0.1 to 0.9 with . In the simulation, we assume that there is an extra time slot used to propagate a cell through the switching fabric. Therefore, in the delay analysis of the output port controller, we add an extra time unit to (12) for and . The analysis result is also plotted in Fig. 9(a) ; the curve is close to and bounds above the average aggregate delay in simulation. This is because we have infinite buffers in the delay analysis. Also as expected, the highest priority cells suffer the shortest delay in the switch, whereas the lowest priority cells suffer the longest delay.
If we fix the value of , we notice that the buffer size in an output port controller is the only crucial factor which can affect the switch performance. Thus, we further simulate the priority class cell loss probability against different buffer sizes in an output port controller; a graph is shown in Fig. 9 (b) with and . In this simulation, all traffic classes do not have any cell losses when the buffer size is 51 or larger. Moreover, both priority classes 1 and 2 have much better performance, as expected. We increase the traffic intensity to , and simulate the performance of the switch under different parameters and different buffer sizes in output port controllers. In Fig. 10 , we observe that with
and with approximately 96 cell buffers in each output port controller, the switch lost no cells during the simulation, even at high traffic intensity. A small amount of buffer causes flat performance as compared to the analytic result.
Moreover, we investigate the priority class performance with different values of in Fig. 11 at high traffic intensity with 64 cell buffers. We notice that there is no cell loss at all for the highest priority class, even when with . Different mixes of priority classes can provide different qualities of service. When is smaller, the chance that a high-priority cell gets lost is smaller accordingly; hence, the services for the cells belonging to higher priority classes can also be improved. For implementation, suppose we put 64 cell buffers in an output port controller in a $1024 \times 1024$ PINIUM switch; then we need 3.4 kbytes of memory in each output port controller, and only 3.4 Mbytes of memory for the whole switch. Memory chips of this size are inexpensive. In this design, the memory is located only in the output port controllers, thus further simplifying memory management and reducing cost. Therefore, we can adjust to assure that the overall PINIUM switch performance and the throughput are close to 100% with a practical amount of buffers located in each output port controller.
Since the value of is small, e.g., 8 or 16, the output port controllers can be designed or programmed easily for functions such as traffic and priority management. There exist VLSI designs for these functions, for instance, the output sequencer chip [19] . The fault-tolerance issue of the PINIUM switch can be addressed at building block or switching plane levels. We can test and replace the defective tree planes and concentrating sorters easily. The PINIUM switch architecture introduces minimum interference to other in-service users while changing the faulty components. Fault tolerance can be accomplished by providing a spare switching tree plane and a spare concentrating sorting plane, not a duplication of the entire switch.
From the evaluation of hardware complexity, we notice that the number of switching elements is not an important measure when we want to implement a large-scale switch. The packaging of the core chips is shown in Table I. If the core chips are designed for , we can easily augment them to 1024 × 1024 by employing multichip module technology [16]. The number of the core distribution and concentration chips is 2048 each, and the number of the small distribution and concentration units is 1024 each. If each input port is running at 155 Mbit/s, then the aggregate throughput can reach 160 Gbit/s with only 44 stages of switching elements.
From the commercial point of view, we can employ hierarchical multiplexing and demultiplexing to reduce the total chip count without changing the design concept. Suppose the core chips are constructed for ; we can easily quadruple the transistor count in every core chip (see Table I) with internal multiplexing and demultiplexing units. Each core chip can then serve four input and output ports with the link speed running at 622 Mbit/s, without changing the pin count. We can then multiplex 1024 input ports at 155 Mbit/s using TDM multiplexers and demultiplexers. The construction is shown in Fig. 12. In this case, the number of high-speed distribution chips, concentration chips, TDM multiplexers, and demultiplexers is 256 each. The resulting aggregate throughput is also 160 Gbit/s.
Up to now, we have not applied the common trick of increasing the aggregate switch capacity by serial/parallel conversion. With bit-to-byte conversion, we can put eight bit planes together in a switch fabric with 1024 ports, with each port of each plane operating at about 155 Mbit/s. The resulting modular switching system has an aggregate throughput of 1.3 Tbit/s with the current state-of-the-art VLSI technology.
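The three capacity figures quoted in this section follow from simple multiplication, as the sketch below illustrates. It uses the nominal rates given in the text (155 and 622 Mbit/s) and reads the multiplexed case as 256 links at 622 Mbit/s; that reading, and the function name, are assumptions made for illustration.

```python
# Aggregate throughput for the three configurations discussed in the text.
# Rates are the nominal values quoted in the text, in Mbit/s.

def aggregate_gbps(links, rate_mbps, bit_planes=1):
    """Aggregate throughput in Gbit/s for `links` links at `rate_mbps` each,
    replicated over `bit_planes` parallel switching planes."""
    return links * rate_mbps * bit_planes / 1000.0


# 1024 ports at 155 Mbit/s on a single plane.
base = aggregate_gbps(1024, 155)

# 4:1 TDM multiplexing: 256 links at 622 Mbit/s (assumed reading of the text).
muxed = aggregate_gbps(256, 622)

# Eight bit planes in parallel, each plane carrying 1024 ports at 155 Mbit/s.
parallel = aggregate_gbps(1024, 155, bit_planes=8)

print("single plane:", base, "Gbit/s")
print("4:1 multiplexed:", muxed, "Gbit/s")
print("eight bit planes:", parallel, "Gbit/s")
```

The first two cases both come to roughly 160 Gbit/s, and the eight-plane case to roughly 1.3 Tbit/s, consistent with the rounded values in the text.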
VII. CONCLUSIONS
In this paper, we propose a large-scale ATM switch architecture that is growable to large dimensions and has the best possible delay/throughput performance. The switch architecture consists of a stack of radix-trees and a bundle of modified Batcher's sorting networks. The basic building blocks are small, and are used to construct the large-scale ATM switches. This modular architecture provides multicast services and priority sorting functions. Physically, it can be realized as an array of three-dimensional parallel processors. The input and output switching modules are completely partitioned, and this partitioned switch fabric provides a flexible distributed architecture, which is the key to simplifying the operation and maintenance of the whole switching system. The modularity implies less stringent synchronization requirements, and makes higher speed implementation possible. The proposed modular switch is intended to meet the needs of broad-band exchanges of all sizes. We estimate that a switch with terabit capacity can be built using current VLSI technologies.
APPENDIX
EFFECTIVE OFFERED LOAD
With multicast services, we show that the effective offered load to a queue is related to the input offered load by , where is the average number of copies requested by an incoming cell.
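The relation can be checked exactly on a small example. The sketch below is a minimal illustration under assumptions that are ours, not necessarily the paper's: in each slot an input carries a cell with probability `rho`, the cell requests `m` copies according to a given probability mass function, and the `m` destinations are distinct outputs chosen uniformly at random. All names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def hit_probability(n_outputs, copies):
    """Fraction of equally likely destination sets of size `copies`
    (distinct outputs out of `n_outputs`) that include output 0."""
    dest_sets = list(combinations(range(n_outputs), copies))
    hits = sum(1 for s in dest_sets if 0 in s)
    return Fraction(hits, len(dest_sets))


def effective_load(n_ports, rho, copy_pmf):
    """Mean number of cells arriving per slot at one output queue.

    `copy_pmf` maps a copy count m to its probability. Each of the
    `n_ports` inputs independently sends at most one copy to the queue."""
    p_one_input = sum(rho * prob * hit_probability(n_ports, m)
                      for m, prob in copy_pmf.items())
    return n_ports * p_one_input


rho = Fraction(1, 5)                                   # input offered load
copy_pmf = {1: Fraction(1, 2), 2: Fraction(1, 4), 4: Fraction(1, 4)}
mean_copies = sum(m * p for m, p in copy_pmf.items())  # average copies per cell

load = effective_load(8, rho, copy_pmf)
print("effective load:", load)
print("rho * mean copies:", rho * mean_copies)
```

With these example values the two printed quantities coincide, which is the product form of the relation stated above under the assumed traffic model.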
As shown in Fig. 13, there are input streams to a queue. Each stream has an offered load . The parameter is the traffic intensity offered by unicast traffic, and is the variable denoting the number of multicast copies requested. Let be the total number of arrivals to the queue, and let be the arrival due to input ; their corresponding probability mass functions are and . Therefore, we have . Since are independent, is the convolution of all 's. Then the generating function of is (16) where is the generating function of . We assume that is also i.i.d., and follows a probability mass function . Now, we calculate (16) based on conditional probability; the 's are also independent for all . The resulting generating function is that of a binomial distribution. The parameter of the distribution depends on , which is the effective offered load to the queue.
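One way to write out the argument is sketched below. The notation is assumed for illustration and may differ from the paper's: $N$ inputs, unicast traffic intensity $\rho$ per input, copy count $M$ with probability mass function $f_M(m)$, $A_i \in \{0,1\}$ the arrival at the queue due to input $i$, and $A = \sum_{i=1}^{N} A_i$ the total number of arrivals. We further assume that the $M$ copies of a cell go to distinct outputs chosen uniformly at random, so a given output is selected with probability $m/N$ when $M = m$.

```latex
% Conditioning on the number of copies requested by the cell at input i:
\Pr\{A_i = 1\}
  = \sum_{m} f_M(m)\,\rho\,\frac{m}{N}
  = \frac{\rho\,E[M]}{N}

% Generating function of a single (Bernoulli) arrival:
G_{A_i}(z) = 1 - \frac{\rho\,E[M]}{N} + z\,\frac{\rho\,E[M]}{N}

% Independence of the A_i turns the convolution into a product:
G_A(z) = \prod_{i=1}^{N} G_{A_i}(z)
       = \left(1 - \frac{\rho'}{N} + z\,\frac{\rho'}{N}\right)^{N},
\qquad \rho' = \rho\,E[M]
```

Under these assumptions $G_A(z)$ is the generating function of a binomial distribution with parameters $N$ and $\rho'/N$, whose mean $\rho'$ plays the role of the effective offered load.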
