Abstract-A new packet switch architecture using two sets of time-division multiplexed buses is proposed. The horizontal buses collect packets from the input links, while the vertical buses distribute the packets to the output links. The two sets of buses are connected by a set of switching elements which coordinate the connections between the horizontal buses and the vertical buses so that each vertical bus is connected to only one horizontal bus at a time. The switch has the advantages of: 1) adding input and output links without increasing the bus and I/O adaptor speed; 2) being internally unbuffered; 3) having a very simple control circuit; and 4) having 100% throughput under uniform traffic. A combined analytical-simulation method is used to obtain the packet delay and packet loss probability. Numerical results show that for satisfactory performance, the buses need to run about 30% faster than the input line rate. With this speedup, even at a utilization factor of 0.9, each input adaptor requires only 31 buffers for a packet loss rate of $10^{-6}$. The output queue behaves essentially as an M/D/1 queue.
I. INTRODUCTION
THE development of communication networks has reached a point where the switching system, rather than the transmission system, has become the bottleneck for the growing volume and variety of traffic. In Hong Kong, as an example, a large quantity of dark fibers have been laid, but good quality video and image communication is still a rarity because currently available switching facilities cannot accommodate them economically. Many fast packet switches have been proposed in recent years, and they can be classified into three broad types: shared-memory based [1]-[4], shared-medium based [1], [2], [5]-[12], and space-division based [1], [2], [13]-[18].
The shared-medium based switch has among its advocates IBM's PARIS switch designers and NEC's ATOM switch designers. The PARIS switch [5]-[7], [11] is designed for private networks. With the use of automatic network routing, the architecture of the switch can be kept very simple. Variable-size packets can be accommodated, and a very efficient round-robin exhaustive bus-access policy is adopted. On such a single broadcasting medium, multicasting and broadcasting functions can easily be implemented. The ATOM switch [8], [12] uses the bit-slice organization to alleviate the limitation of the bus speed. For still larger switches, a multistage organization was proposed. Store and forward of packets, however, is needed at every stage. An alternative to the multistage organization is to use multiple shared media. Nojima et al. [9] have developed a switch in which several shared buses are connected in matrix form with memory located at each crosspoint of the buses. Packets contending for access to the same bus are stored in the crosspoint memories connected to this bus. Arbiters scan the crosspoint memories and remove packets from them.
Paper approved by G. P. O'Reilly, the Editor for Communications Switching of the IEEE Communications Society. Manuscript received July 7, 1993; revised November 1, 1994 and October 13, 1995. This paper was presented in part at IEEE INFOCOM'92, Florence, Italy, 1992.
Y.-W. Leung is with the Department of Computing, Hong Kong Polytechnic University, Kowloon, Hong Kong.
T.-S. Yum is with the Department of Information Engineering, Chinese University of Hong Kong, Shatin, Hong Kong (e-mail: yum@ie.cuhk.hk).
Publisher Item Identifier S 0090-6778(97)05178-7.
In this paper, we study a new switch architecture using multiple shared buses. This switch has the following advantages: 1) input and output links can be added without increasing the bus and I/O adaptor speed; 2) it is internally unbuffered; 3) it has a very simple control circuit; and 4) it has 100% throughput under uniform traffic. We derive the expected delay and the packet loss probability of this switching system under various bus transfer rates.
II. THE TDM-BASED MULTIBUS PACKET SWITCH
The multibus packet switch is designed for switching fixed size packets. The packet size can be set to 53 bytes for ATM switching.
A. Architecture
Fig. 1 shows the architecture of an $N \times N$ multibus packet switch. Packets enter the switch through the input links. Each input link is operated synchronously, with time being divided into link slots, where each link slot can accommodate one packet. Each input link and each output link are connected to the switch through an input adaptor and an output adaptor, respectively. Fig. 2 shows the internal structure of an input and an output adaptor. The input adaptor receives packets from the input link, performs a serial-to-parallel conversion, and queues the packets in a set of buffers. The output adaptor performs two functions. First, it picks out the packets destined for this particular adaptor and puts them in the output buffer. Second, it performs a parallel-to-serial conversion of the packets for onward transmission. [...] higher data transfer rate and, hence, a higher switch throughput at the expense of a higher implementation cost. Based on the current technology, a bus of width 64 bits operating at 100 MHz can provide a bus transfer rate of 6.4 Gbits/s [7].
The horizontal buses are connected to the vertical buses in a bus matrix, with a total of $M^2$ switching elements at the crosspoints of the $M$ vertical and $M$ horizontal buses. The switching element placed at the crosspoint of HB$_i$ and VB$_j$ is identified as SE$_{i,j}$. Fig. 3(a) shows the schematic of a switching element. It connects the horizontal input bus to either the horizontal output bus or the vertical bus. Fig. 3(b) shows the circuit realization of the switching element, using relays (their number is set by the bus width), inverters, and one shift register. The relay is a three-terminal element with one input, one output, and one control line. It connects the input line to the output line whenever there is a "1" on the control line. For prototyping, the set of relays is available as off-the-shelf IC chips (e.g., Motorola's SN54LS). For actual implementation, ASIC chips with multiple switching elements per chip can be used. Since the circuitry in each switching element is very simple, the number of switching elements per chip depends only on the number of available pins per chip. For example, if the bus size is 32 bits and a chip consists of four switching elements, the chip must have 256 pins for inputs/outputs. The shift register in SE$_{i,j}$ stores a bit pattern which determines when to connect the horizontal input bus to the vertical bus. When a clock pulse arrives, the last bit is shifted out to the relays. If this bit is "1," the horizontal input bus is connected to the horizontal output bus; otherwise, the horizontal input bus is connected to the vertical bus. The connection patterns of the switching elements are chosen such that each vertical bus is connected to only one horizontal bus at a time. Note that the clock rate is equal to the packet rate on the bus (e.g., if the bus is operated at 6 Gbits/s and the packet size is 53 bytes, the clock rate is 14.2 MHz).
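The clock-rate figure quoted above is the bus bit rate divided by the packet length in bits. A minimal Python sketch of that arithmetic follows; the function name is illustrative and not part of the paper.

```python
def packet_clock_rate_hz(bus_rate_bps: float, packet_bytes: int = 53) -> float:
    """Packets per second carried by one bus (one shift-register clock pulse per packet)."""
    return bus_rate_bps / (packet_bytes * 8)

# Values taken from the text: a 6 Gbits/s bus carrying 53-byte packets.
rate = packet_clock_rate_hz(6e9)
print(f"{rate / 1e6:.2f} MHz")  # 14.15 MHz, which the text rounds to 14.2 MHz
```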
B. Operation
The transmission of packets on a bus is divided into cycles of equal duration. Each cycle is subdivided into $M$ subcycles of equal duration (Fig. 4). In the $j$th subcycle, group $i$ input adaptors are connected to vertical bus VB$_{f(i,j)}$, where the mapping $f(i,j)$ is given by (1) (see also footnote 1).
Thus, in the $j$th subcycle, packets from group $i$ input adaptors are switched to group $f(i,j)$ output adaptors. Hence, only the switching elements SE$_{i,f(i,j)}$ ($i = 1, 2, \ldots, M$) connect the horizontal buses HB$_i$ to the vertical buses VB$_{f(i,j)}$, while all of the other switching elements connect the horizontal input buses to the horizontal output buses. Fig. 4 shows an example of this transmission arrangement. The arrangement ensures that in each subcycle, there is a unique one-to-one connection from every group of input adaptors to every group of output adaptors. This means that the $M$ groups of input adaptors can simultaneously transmit packets to the $M$ groups of output adaptors through the bus matrix.
To resolve the bus contention among the input adaptors in each group, each subcycle is further divided into bus slots, where each bus slot can accommodate one packet and is dedicated to one input adaptor. Each adaptor can, therefore, transmit one packet in each subcycle. Note that when the bus transfer rate is fixed, a larger number of inputs increases the cycle duration.
1 Note that if we had labeled the vertical buses as $0, 1, 2, \ldots, M-1$ instead of $1, 2, \ldots, M$, (1) would simplify to $f(i,j) = (i+j) \bmod M$. But doing so would complicate the subsequent discussion.
Global timing is used to ensure that all transmissions are properly synchronized. This requires that all of the input adaptors and switching elements are triggered by a common clock.
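The one-to-one property of the subcycle schedule can be checked with a short sketch. It uses the zero-based form $f(i,j) = (i+j) \bmod M$ mentioned in footnote 1, not the paper's own labeling in (1). Indexing groups and subcycles from 0 is an additional assumption made here for convenience, and all function names are illustrative.

```python
def vertical_bus(i: int, j: int, M: int) -> int:
    """Zero-based form from footnote 1: group i uses VB (i + j) mod M in subcycle j."""
    return (i + j) % M

def schedule(M: int) -> list[list[int]]:
    """schedule(M)[j][i] is the vertical bus used by input group i in subcycle j."""
    return [[vertical_bus(i, j, M) for i in range(M)] for j in range(M)]

def is_conflict_free(M: int) -> bool:
    table = schedule(M)
    all_buses = set(range(M))
    # Within a subcycle, every vertical bus is used by exactly one input group.
    per_subcycle = all(set(row) == all_buses for row in table)
    # Over a full cycle, every input group reaches every vertical bus once.
    per_group = all({table[j][i] for j in range(M)} == all_buses for i in range(M))
    return per_subcycle and per_group

for j, row in enumerate(schedule(4)):
    print(f"subcycle {j}: " + ", ".join(f"group {i} -> VB {vb}" for i, vb in enumerate(row)))
print("conflict-free for M = 8:", is_conflict_free(8))
```

Because each row of the table is a cyclic shift of the bus labels, no two groups share a vertical bus in the same subcycle, which is why the bus matrix needs no internal buffering.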
C. Speedup Factor
We define the speedup factor SF of the switch as the ratio of the sum of the data rates of all of the vertical buses to the sum of the data rates of all of the input links. Since there are $M$ vertical buses and $N$ input links, SF can be written as
$$\mathrm{SF} = \frac{M \times \text{data rate of a bus}}{N \times \text{data rate of an input link}}.$$
In the next section, we will analyze the performance of the switch with SF as a parameter. Note that when SF is made larger, the buses can serve the input adaptors at a higher rate, which yields a smaller input queueing delay. When SF $= M$, input queueing is not required, but the implementation cost is high. Our results will indicate that a small SF (say, SF $= 1.3$) can already give satisfactory performance in the sense of very little queueing at the input adaptor. Note also that, with $N$ input links and $M$ groups, the cycle length and the link-slot size are $N$ and $N \cdot \mathrm{SF}/M$ bus-slot durations, respectively.
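The relations in this subsection can be collected into a small calculator. The sketch below is illustrative only: the function names are not from the paper, and the example rates are hypothetical values chosen so that SF comes out at 1.3 for the $N = 1024$, $M = 8$ switch considered later in the paper.

```python
def speedup_factor(bus_rate_bps: float, link_rate_bps: float, M: int, N: int) -> float:
    """SF = (M * bus rate) / (N * input-link rate), per the definition above."""
    return (M * bus_rate_bps) / (N * link_rate_bps)

def cycle_in_bus_slots(N: int) -> int:
    """One cycle gives every one of the N input adaptors M bus slots, one per subcycle group."""
    return N

def link_slot_in_bus_slots(sf: float, M: int, N: int) -> float:
    """Duration of one link slot, measured in bus slots."""
    return N * sf / M

# Hypothetical rates (not from the paper): 100 Mbits/s links, 16.64 Gbits/s buses.
M, N = 8, 1024
sf = speedup_factor(bus_rate_bps=16.64e9, link_rate_bps=100e6, M=M, N=N)
print(f"SF = {sf:.2f}")
print(f"link slots per cycle = {cycle_in_bus_slots(N) / link_slot_in_bus_slots(sf, M, N):.2f}")
```

The number of link slots per cycle equals $M/\mathrm{SF}$, so at SF $= M$ each logical queue is visited once per link slot, which is why no input queueing is needed in that case.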
III. PERFORMANCE ANALYSIS

A. Queueing Model and Decoupling of Queues
In each link slot, a packet arrives on an input link with a fixed probability, the link utilization. Given the probability that an incoming packet from a particular input link is destined for a particular output link, the probability of a packet arrival from that input link to a given group of output links is given by (2). The packets in an input adaptor are logically organized into $M$ queues such that queue $k$ contains all packets destined for group $k$ links. For convenience, each such queue is identified by its input adaptor and its destination group. Each input adaptor and each output adaptor has a fixed buffer size. The $M$ queues in an input adaptor share its buffers by the complete sharing strategy [19]. Packets queued at the output adaptor are transmitted in a first-in, first-out order.
Without loss of generality, let us consider the delay of the packets departing from the group 1 output links. As group 1 output adaptors only get packets from VB$_1$, we shall model VB$_1$ as a bus server. For convenience, we shall call the subsystem up to the bus server the input queueing system, and the subsystem after it the output queueing system. As seen from Fig. 5, one input queue per input adaptor feeds packets to the bus server. The output queueing system consists of one queue for each output adaptor in group 1.
The switching elements connecting the horizontal buses and the vertical buses are operated in such a way that each input queue has a fixed dedicated bus slot for transmitting a packet in every cycle. All input queues are therefore independent. Recall the bus-slot and cycle durations from Section II-C. Each queue is served once per cycle, with a service time of one bus slot. Analysis of the input queues is given in the next subsection.
The arrival process to the output queueing system is the superposition of the departure processes of all of the input queues in the input queueing system. To characterize this arrival process, we must first characterize the departure process of each of the input queues. The bus server visits an input queue once every cycle, and removes one packet from the queue when the queue is not empty. As far as the characterization of the departure process is concerned, the service time in the input queue can be considered as equal to one cycle. For the first input queue, packet departure occurs at time epochs that are integer multiples of the cycle duration. Therefore, the durations of busy and idle periods are both integer multiples of the cycle duration. The other input queues are served by the bus server in the same manner as the first, except that each has its own fixed time lag. The probability mass functions of the idle period and busy period durations for an input queue are derived in Section III-D. In general, the departure epochs of an input queue occur at its time lag plus integer multiples of the cycle duration. The departure process of each of the input queues is characterized by these time epochs and the distributions of the idle and busy periods. Fig. 6 shows an example of the departure process from the input queueing system. Packets departing from an input queue join a given output queue with a fixed probability. So the arrival to that output queue due to that input queue is just the bifurcation of the departure process from the input queue. The composite arrival process to an output queue is the superposition of the bifurcated departure processes from all of the input queues. With such a complicated arrival process, we have to resort to computer simulation to obtain the queueing delay. As only a single server queue needs to be simulated, very accurate delay and packet loss results can be obtained.
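The service discipline described in this subsection, in which each logical input queue is visited once per cycle and loses at most one packet per visit, can be illustrated with a small simulation. This is a sketch under stated assumptions and not the authors' simulator, which targets the output queue. It assumes uniform traffic, so the per-slot arrival probability to one logical queue is the link utilization divided by $M$. It also assumes that a packet joins the queue when its link slot ends and that the queue's bus slot sits at a fixed `offset` within the cycle; all names are illustrative.

```python
import random

def simulate_input_queue(p, sf, M, cycles=100_000, offset=0.5, seed=1):
    """Simulate one logical input queue; time is measured in cycles.

    p      : per-link-slot arrival probability to this queue (Bernoulli, assumed)
    sf, M  : speedup factor and number of groups; a link slot lasts sf / M cycles
    offset : position of this queue's dedicated bus slot within the cycle (assumed)
    """
    rng = random.Random(seed)
    slot = sf / M                 # link-slot length in cycles
    queue = arrivals = served = max_queue = 0
    k = 1                         # index of the next link-slot boundary
    for n in range(cycles):
        visit = n + offset        # the bus server reaches the queue once per cycle
        # Assumption: a packet joins the queue when its link slot ends.
        while k * slot <= visit:
            if rng.random() < p:
                queue += 1
                arrivals += 1
                max_queue = max(max_queue, queue)
            k += 1
        if queue > 0:             # at most one packet leaves per visit
            queue -= 1
            served += 1
    return {"arrivals": arrivals, "served": served,
            "left": queue, "max_queue": max_queue}

M, sf, rho = 8, 1.3, 0.9          # uniform traffic: p = rho / M per logical queue
print(simulate_input_queue(p=rho / M, sf=sf, M=M))
```

Setting `sf` equal to `M` makes the link slot as long as the cycle, so the queue never holds more than one packet, in line with the remark in Section II-C that input queueing is then unnecessary.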
B. Expected Delay in the Input Queue
We assume that the packet loss probability at the input queue with a finite buffer size is very small, and we approximate the expected delay with a finite buffer by the delay with an infinite buffer. As all queues are similar, we choose to analyze a particular one. We use the standard imbedded Markov chain analysis. The input queue is served once every cycle for one bus slot. The imbedded points are chosen at the time instants at which the bus server has just visited the input queue. Consider the link slots that fall between two consecutive imbedded points and the number of customers in the queue at each imbedded point. The queue lengths at consecutive imbedded points are related by (3), which involves a random variable denoting the number of arrivals in those link slots (i.e., in one cycle). Taking the $z$-transform, we have (4). In steady state, the generating function of the number of customers in the input queue can be obtained from (4) as (5). The expected number of customers at the imbedded points is given by (6). Consider the arrival of a tagged packet. Since the arrival of packets is a Bernoulli process, each link slot is equally likely to contain an arrival. Then, the arrival time of the tagged packet is equally probable in any of the link slots of a cycle. Conditioning on the link slot in which the tagged packet arrives, the expected number of packets arriving from the last imbedded point until the arrival of the tagged packet is given by (7). The number of packets in the queue averaged over a cycle is given by (8).
Given the probability of packet arrival to an input queue, Little's formula yields the expected delay (9). Consider the following numerical example, in which the inputs are divided into groups and the link utilization is 0.9. If the packet size is 53 bytes and the bus is operated at 1.25 Gbits/s, then the bus-slot duration is about 0.34 $\mu$s. If the input link rate is 100 Mbits/s, the delay from (9) is 25 $\mu$s.
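The slot durations behind this numerical example follow directly from the packet size and the two line rates. The sketch below covers only that arithmetic. It does not reproduce the delay from (9), since the number of inputs and groups used in the example is not recoverable from the text. The helper name is illustrative.

```python
PACKET_BITS = 53 * 8  # 53-byte packets, as in the example

def slot_duration_s(rate_bps: float) -> float:
    """Time needed to carry one packet at the given rate."""
    return PACKET_BITS / rate_bps

bus_slot = slot_duration_s(1.25e9)   # bus operated at 1.25 Gbits/s
link_slot = slot_duration_s(100e6)   # input link at 100 Mbits/s

print(f"bus slot:  {bus_slot * 1e6:.4f} us")
print(f"link slot: {link_slot * 1e6:.2f} us")
print(f"bus slots per link slot: {link_slot / bus_slot:.1f}")
```

With these rates one link slot spans 12.5 bus slots, so the bus can serve several input adaptors in the time one packet arrives on a link.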
C. Packet Loss Probability at the Input Adaptor
The logical queues in each input adaptor share the buffers by the complete sharing strategy, which has the best blocking performance [19]. When all of the buffers are occupied, incoming packets are lost. The buffer size must be chosen such that the probability of packet loss is very small. In this section, we derive an approximate expression for the packet loss probability as a function of the input buffer size. This expression is an upper bound on the loss probability. First, we derive the probability mass function of the number of packets in the input queue with infinite buffers. The probability that there are $n$ packets in an input queue at the imbedded points is given by the coefficient of $z^n$ in the power series expansion of the generating function (5). Consider a tagged packet that arrives at a particular link slot after an imbedded point. The probability that this tagged packet sees a given number of packets in the queue is given by (10).
The probability that there are a given number of packets in an input queue with infinite buffer is the probability that the tagged packet sees that many packets in the input queue (11). When the buffer size is finite and the complete sharing strategy is employed, the input queues in an input adaptor are not independent. However, for a well-designed fast packet switch, the buffer size can be chosen such that the probability of packet loss is very small. In this case, the input queues can be regarded as independent. The probability of packet loss in an input adaptor is then given by (12).
D. Probability Mass Functions of the Idle and Busy Periods
In this section, we derive the probability mass functions of the lengths of the idle and busy periods of an input queue. These functions characterize the departure processes from the input queues, and are used to generate the arrival process to the output queue in the simulation experiments. Let the durations of the idle and busy periods be measured in units of cycles. Then, the probability mass function of the idle period is the probability that no packet arrives for a run of consecutive cycles and a packet arrives in the cycle that ends the run (13).
The probability generating function of the busy-period duration can be found using the method in [20]; it is given by (14), with an auxiliary term defined in (15). From (14) and (15), the probability mass function of the busy-period duration is obtained as (16), with coefficients defined in (17) that can be evaluated recursively.
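The idle-period description above has a geometric form, which the following sketch makes concrete. It is an illustration under assumptions and not a reconstruction of (13). It assumes a whole number of link slots per cycle (in general $M/\mathrm{SF}$ need not be an integer) and Bernoulli arrivals with probability `p` per link slot. Whether a run of `k` empty cycles is counted as an idle period of `k` or `k + 1` cycles is a convention that cannot be recovered from the text, so the function is indexed by the number of empty cycles.

```python
def idle_run_pmf(p: float, slots_per_cycle: int, k_max: int) -> list[float]:
    """P[k empty cycles, then a cycle with at least one arrival], for k = 0..k_max.

    Assumptions (not from the paper): an integer number of link slots per cycle
    and independent Bernoulli(p) arrivals in each link slot.
    """
    a0 = (1.0 - p) ** slots_per_cycle        # probability of no arrival in one cycle
    return [a0 ** k * (1.0 - a0) for k in range(k_max + 1)]

# Illustrative values: utilization 0.9 spread over 8 groups, 6 link slots per cycle.
pmf = idle_run_pmf(p=0.9 / 8, slots_per_cycle=6, k_max=200)
mean_empty_cycles = sum(k * q for k, q in enumerate(pmf))
print(f"total probability = {sum(pmf):.6f}")
print(f"mean number of empty cycles = {mean_empty_cycles:.4f}")
```

Samples drawn from such a distribution, together with busy-period samples, are the kind of input a simulation of the output queue would need to regenerate each input queue's departure process.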
IV. NUMERICAL RESULTS AND DISCUSSION
Consider a $1024 \times 1024$ switch ($N = 1024$) with inputs divided into eight ($M = 8$) equal groups. There are 128 links per group, and a total of 64 switching elements. (If a single chip can contain four switching elements, then the switch fabric requires only 16 chips.) Let the packet transmission time in any input link be normalized to one time unit. Fig. 7 shows the average queueing delay in the input adaptor. As SF increases, the queueing delay becomes smaller because the input queues are served at a faster rate. At 30% speedup, very small delay is obtained even at high utilization. Fig. 8 shows the packet loss probability at the input adaptor for various buffer sizes at a utilization of 0.9. When SF $= 1$, the required buffer size to achieve a packet loss probability of $10^{-6}$ is found to be about 150. However, with SF $= 1.3$, the required buffer size is reduced to only 31. In the input queue, only packets to a certain destination can be served at a certain time. As all packet destinations are assumed to be independent and uniformly distributed, this extra "randomness" makes the speeding up of the bus rate necessary for satisfactory performance. Fig. 9 shows the average queueing delay in the output adaptor. Here, on the contrary, a larger SF gives a larger delay at the output adaptor. The difference, however, is only apparent at very high utilization, where it amounts to one time unit. Moreover, all SF $\geq 1.3$ cases give almost identical delay characteristics. This phenomenon can be explained as follows. When SF $\gg 1$, there is essentially no queueing at the input adaptor. All packets to a certain output link will immediately appear at the output queue. The input process is, therefore, a superposition of Bernoulli processes. With this many inputs, that process should be indistinguishable from a Poisson process. Thus, the output queue is just a simple M/D/1 queue. In fact, the M/D/1 delay characteristics coincide with the SF $= 8$ curve in Fig. 9.
What is interesting to note is that for SF $= 1.0$, the delay at the output queue is smaller than that of the M/D/1 queue, and for SF $\geq 1.3$, the delay is essentially that of the M/D/1 queue. Fig. 10 shows the packet loss probability versus the buffer size in the output adaptor when SF $= 1.3$. As can be seen, one target packet loss probability can be achieved with a buffer size of 30, and a more stringent one with a buffer size of 43.
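For reference, the M/D/1 mean queueing delay used as the comparison curve has a closed form: with utilization $\rho$ and service time normalized to one time unit, the mean wait in queue is $\rho / (2(1 - \rho))$, the Pollaczek-Khinchine result for deterministic service. The sketch below evaluates this standard formula. It is not taken from the paper's figures, and the function name is illustrative.

```python
def md1_mean_wait(rho: float, service_time: float = 1.0) -> float:
    """Mean waiting time in queue for an M/D/1 queue (Pollaczek-Khinchine formula)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must satisfy 0 <= rho < 1")
    return rho * service_time / (2.0 * (1.0 - rho))

for rho in (0.5, 0.8, 0.9):
    print(f"rho = {rho:.1f}: mean wait = {md1_mean_wait(rho):.2f} time units")
```

At a utilization of 0.9 the formula gives 4.5 time units, which indicates the scale of output queueing delay to expect when the arrival stream is close to Poisson.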
V. CONCLUSIONS
There are various approaches to the design of fast packet switches. Using a high-speed bus, packet switching can be done very simply by individual stations through the filtering of unwanted packets. The bus width and bus speed, however, limit the total throughput of the switch. To bypass this limitation, we have designed a TDM-based multibus switch. In addition to modular growth, it preserves the advantages of being internally unbuffered, having a simple control circuit, and achieving 100% throughput under uniform traffic. We have analyzed the performance of the switch in terms of the speedup factor, and have found that for satisfactory performance, the buses need to run about 30% faster than the individual input line rate.
Since the bus bandwidth is allocated to the input adaptors in a fixed cyclic order, the performance of the switch might not be satisfactory under highly asymmetric traffic conditions. We are currently investigating dynamic bandwidth allocation policies for use in such conditions.
Yiu-Wing Leung (M'92-SM'96)
Tak-Shing Yum (S'76-M'78-SM'86) worked at Bell Telephone Laboratories in the U.S. for two and a half years and taught at National Chiao Tung University, Hsinchu, Taiwan, R.O.C., for two years before joining the Chinese University of Hong Kong in 1982. He has published original research on packet switched networks with contributions in routing algorithms, buffer management, deadlock detection algorithms, message resequencing analysis, and multiaccess protocols. In recent years, he branched out to work on the design and analysis of cellular networks, lightwave networks, and video distribution networks.
