Abstract-With the projected growth in demand for
I. INTRODUCTION
The growing acceptance of the integrated services digital network (ISDN) promises increased bandwidth and new telecommunications services at reasonable cost [1] . User demand for bandwidth and services is forecast to rise rapidly following the widespread adoption of ISDN access standards. In addition, when research into multiservice metropolitan area networks (MAN's) [2] and private networks reaches the market, we may expect an acceleration in demand, possibly fueled by an increasing availability of video services. To meet this demand for increased bandwidth and for an expanding diversity of services in the backbone network, evolution towards the broadband ISDN, with the consequent requirement for new switching techniques, will become increasingly desirable [3] . The enhancements offered by a suitable switching mechanism should include: increased flexibility, high traffic capacity, enhanced bandwidth efficiency for "bursty" services, inherent rate adaption, and the service independent support of multiservice traffic.
A recent study of switching techniques appropriate to a multiservice backbone network [4] concludes in favor of fast packet switching, a statistical switching mechanism also known as asynchronous time division [5] , [6] . Two problems require attention before a statistical switching mechanism may be employed in a multiservice backbone network, whether public or private. First, a fast packet switch of high maximum traffic capacity must be designed and implemented, in current technology, at an acceptable cost. Second, for multiservice operation, a mechanism is required to support real-time traffic across a fast packet switch at a level of service at least commensurate with that offered by the ISDN. Thus, for example, voice traffic requires a blocked calls lost service with a guaranteed maximum delay performance on each voice connection throughout the entire duration of a call.
This paper presents a simulation study of the multiservice traffic performance of a fast packet switch based upon a nonbuffered, multistage interconnection network. The design of the switch is first discussed followed by a simulation of its throughput at saturation. Then follows an investigation of the performance of the switch for multiservice traffic, in which a simple hardware mechanism is employed to offer a guaranteed maximum delay performance for real-time services such as voice. The results suggest, for example, that fast packet switches may be realized with a total switch capacity of up to 150 Gbits/s, constructed from identical switching elements, in current CMOS gate array technology operating at 50 MHz. Furthermore, if the reserved service traffic load is limited on each input port to a maximum of 80 percent of switch port saturation, then a maximum switch delay may be guaranteed of the order of 100 µs for 99 percent of all reserved service packets, while the switch is continuously loaded to saturation with multiservice traffic.
II. DESIGN OF A FAST PACKET SWITCH
Fast packet switching is a connection oriented packet switching mechanism which achieves high throughput and low delay by reducing the processing required per packet to an absolute minimum [7]-[10] and then implementing it in hardware. Routing is performed at call setup and a virtual circuit is allocated which is fixed for the duration of the call. All flow control and error recovery protocol functions are performed on an end-to-end basis. The packet length across any virtual circuit is constant and small, and the packet format is very simple: a packet header containing a priority field and a label (to identify the virtual circuit) of fixed length, e.g., 16 bits, followed by the information component, typically in the region of 4-64 octets.
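As an illustration of how compact this format is, the sketch below models such a packet. The 16-bit label and the 4-64 octet information component come from the description above; the 4-bit priority field width is simply an assumed value for the example, not the paper's layout.

```python
from dataclasses import dataclass

@dataclass
class FastPacket:
    """Illustrative model of the fast packet format (not the exact bit layout)."""
    priority: int     # small header priority field; 4 bits assumed here
    label: int        # virtual circuit identifier, fixed 16-bit length
    payload: bytes    # information component, typically 4-64 octets

    def total_bits(self, priority_bits: int = 4, label_bits: int = 16) -> int:
        return priority_bits + label_bits + 8 * len(self.payload)

pkt = FastPacket(priority=1, label=0x2A7, payload=bytes(32))
print(pkt.total_bits())   # 276 bits for a 32-octet information component
```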
Two fundamental components are required to construct a fast packet switch: switching and buffering. This results in three possible classes of fast packet switch design: input buffered, in which the buffering precedes the switching using a nonbuffered switch fabric [13] ; output buffered, in which the buffering follows the switching also using a nonbuffered switch fabric [11] , [14] ; and the buffered switch fabric where buffering occurs internally within the switch fabric [6] , [15] - [18] , [33] , [34] . The decision to investigate the design of a fast packet switch based upon a nonbuffered switch fabric was taken on the basis that a nonbuffered switching element is much simpler to implement than is its buffered counterpart. This implies the possibility of implementation in gate array technology offering greater flexibility in the cost, performance, and other design parameters than that available from a dedicated VLSI solution. Furthermore, a simple design permits switching elements of greater degree to be fabricated leading to a reduction in the number of interconnections required to form a given size switch fabric compared to that of a buffered design. The long term goal of an all optical implementation of the switch fabric, or at least the switch fabric data paths, also motivated the selection of a nonbuffered switch fabric.
Pure input buffering has a performance which is approximately 58 percent that of pure output buffering [19], all other factors being equal. However, pure output buffering requires an order of magnitude more hardware and switch fabric interconnections than does an input buffered solution [14]. The fast packet switch to be described is thus based upon a pure input buffered switch fabric, but to improve performance, to facilitate maintenance, and to accommodate real-time traffic, a two-plane structure has been adopted which permits a limited amount of output buffering to be implemented if desired. The design may be extended to more than two switch planes in parallel but results suggest that this is unlikely to be necessary unless extremes of performance or reliability are required.
A. The Switch
The basic structure of the fast packet switch is given in Fig.  1 . An incoming packet arrives in a first-in first-out (FIFO) queue. When free, the respective input port controller extracts the label from the packet at the head of the queue and uses it to reference a connection table. Each input port controller operates asynchronously, at the packet level, and independently of all other controllers. From the table it receives two components, an outgoing label and a tag. The outgoing label is used to replace the incoming label within the packet. The tag specifies the required destination output port of the switch and is attached to the front of the packet. The input port controller then initiates a setup attempt by launching the packet into the switch fabric, tag first and in bit serial form. There are two possible outcomes, either the packet will be successful and reach the desired output buffer, or it will fail. A setup attempt may fail either because it is blocked by other traffic within the switch fabric or because the requested output port is busy serving another packet. If the setup attempt fails, the switch fabric will assert a collision signal which is returned to the input port controller, along a reverse path, typically within a few bit times of emission of the packet tag. On receiving the collision signal, the input port controller removes the setup attempt from the switch fabric and waits for a delay typically equivalent to 10 percent of the length of a packet. This is the retry delay and at the end of this period the input port controller begins a fresh attempt to transmit the packet. It continues to do so until it is successful or until it exceeds a limit designed to detect fault conditions.
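The behavior of the input port controller on a setup attempt can be summarized by the following sketch. It is a minimal model under stated assumptions, not the hardware design: try_setup() stands in for launching the tagged packet into the switch fabric and observing the collision signal, and delays are measured in packet lengths.

```python
import random

class SetupLimitExceeded(Exception):
    """Raised when the attempt limit used to detect fault conditions is hit."""

def transmit(try_setup, retry_delay=0.1, attempt_limit=100):
    """Basic retry loop of the input port controller (minimal sketch).
    try_setup() returns True if the packet reaches its output buffer and
    False if the collision signal is returned.  Times are in packet lengths.
    Returns the total time from first attempt to the end of emission."""
    elapsed = 0.0
    for _ in range(attempt_limit):
        if try_setup():                # launch the packet, tag first, bit serial
            return elapsed + 1.0       # success: the whole packet is emitted
        elapsed += retry_delay         # collision: back off ~10% of a packet
    raise SetupLimitExceeded("setup attempt limit exceeded")

# Toy use: an output port that happens to be free with probability 0.4.
print(transmit(lambda: random.random() < 0.4))
```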
A slightly more complex algorithm that offers an improvement in performance at high loads does not repeatedly attempt to transmit the same packet but on the failure of a setup attempt searches through the input queue and attempts to transmit the second packet. If that attempt fails the third packet on the queue is attempted and so on cyclically through the queue until a successful transmission is achieved. This overcomes the so called "head of line" blocking problem [13] , [19] but care has to be taken not to get packets on the same virtual circuit out of sequence. This algorithm will be referred to as input queue bypass [15] .
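A sketch of one service pass of this bypass discipline is given below, again against a hypothetical try_setup interface; the sequence check for packets on the same virtual circuit is noted in the docstring but omitted for brevity.

```python
from collections import deque

def input_queue_bypass(queue: deque, try_setup):
    """One service pass of the input queue bypass discipline (sketch).
    Starting at the head of the queue, each packet is offered to the fabric
    in turn; the first one to succeed is removed and returned.  If every
    candidate is blocked the controller waits a retry delay and scans again.
    A real implementation must also skip any packet whose virtual circuit
    already has an earlier packet waiting, to preserve packet sequence."""
    for index, packet in enumerate(queue):
        if try_setup(packet):          # attempt setup for this packet
            del queue[index]           # success: take it out of the queue
            return packet
    return None                        # all blocked this pass; retry later
```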
A simple model of the operation of the fast packet switch may be drawn by analogy with the operation of a well-known local area network: Ethernet. Ethernet may be considered as a fast packet switch which distributes the switching function across the local area using a single shared medium switch fabric. The fast packet switch described above merely confines the switching within a box so that a multipath medium of much higher bandwidth may be implemented. The input port controller of the fast packet switch corresponds to the media access controller of Ethernet and in both cases the controller throws a packet at the switch fabric and if it is unsuccessful the switch fabric informs it immediately. The difference between the two lies in the fact that in Ethernet a collision destroys both colliding packets, therefore an exponential random backoff algorithm is required. In the fast packet switch, however, collisions are nondestructive in the sense that one of the colliding packets always survives, so a simple retransmission algorithm is sufficient.
B. The Switch Fabric
1) The Routing Fabric: A fast packet switch requires a highly parallel structure for the switch fabric both in the number of switching elements and in the number of interconnection paths between switching elements. Also, control of the switch fabric must be distributed, with each active switching element operating independently of all others upon control information at the head of each packet. This suggests the use of a self-routing, multistage interconnection network. Such a network consists of many identical and independent switching elements, organized in stages, with the interconnection pattern of links between stages so arranged that each switching element may be controlled by the relevant digit from within a tag prefixed to the head of each packet. The tag simply contains the required destination port number of the switch. For switching elements of degree d, each digit within the tag contains log_2(d) bits; the first digit controls the first stage of switches, the second digit controls the second stage, and so on. Multistage interconnection networks that display this self-routing property belong to the class of banyan networks [20] and have been called delta networks [21], and although many examples of such networks are discussed within the literature [22], they have been proven topologically equivalent [23]. An example of a single plane 64 × 64 delta network constructed from switching elements of degree 8 is given in Fig. 2. In general, a delta network of size N requires log_d(N) stages with N/d switching elements per stage. Each interconnection link in the delta network consists of two paths, a forward path to carry the data and a reverse path, set up in parallel with the forward path, to carry the collision signal.
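The tag construction is simple enough to show directly. The sketch below splits a destination port number into per-stage digits for a delta network built from degree-d elements; the function name and interface are purely illustrative.

```python
import math

def routing_tag(dest_port: int, degree: int, stages: int) -> list[int]:
    """Per-stage digits of a self-routing tag, most significant digit first.
    Each digit occupies log_2(degree) bits; stage i of the fabric routes on
    digit i and ignores the rest of the tag."""
    bits_per_digit = int(math.log2(degree))
    return [
        (dest_port >> (bits_per_digit * (stages - 1 - stage))) & (degree - 1)
        for stage in range(stages)
    ]

# 64 x 64 delta network of degree-8 elements: log_8(64) = 2 stages, 3-bit digits.
print(routing_tag(43, degree=8, stages=2))   # -> [5, 3]: output 5, then output 3
```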
While the majority of research interest has been expended upon delta networks constructed from 2 × 2 switching elements, our previous investigations [24] suggested that it might be possible to implement nonblocking switching elements of up to degree 16 in gate array technology. The use of switching elements of degree greater than 2 raises the problem that delta networks are only defined in sizes that are an integer power of the degree of the switching element. This would result in large increments between valid sizes of network. The proposed solution is to replicate the interconnection links between stages, which permits networks to be built to any size that is an integer power of 2, from switching elements of any degree that is an integer power of 2 [25], [26]. We now have the possibility of multiple paths existing between the same pair of input and output ports. This increases the performance and fault tolerance of the switch but requires an algorithm to select between equivalent paths. Fortunately, as there is no buffering within the switch fabric, each incident packet may be routed independently without the risk of out-of-sequence errors between packets traveling on the same virtual circuit. Two algorithms have been investigated: searching and flooding. In the searching mechanism, the input port controller attempts to transmit across each of the equivalent paths in sequence until it meets with success. In the flooding method, the incoming packet is broadcast simultaneously over all paths that lead to the destination such that the destination selects one of the incident copies and all others collide and are removed immediately.
2) The Distribution Fabric: The above switch fabric performs well for traffic which has a random destination distribution but its performance can be markedly impaired for incident traffic with a worst case distribution of destinations. For some applications this may not be significant; however, for high performance switches, and in order to handle traffic sources which have an average bandwidth in excess of about 10 percent of the switch port bandwidth, extra stages of switching must be introduced to distribute the incident traffic across the routing fabric. This has been termed the distribution fabric, and to distribute the incoming traffic across an entire s stage delta network requires s - 1 distribution stages and results in a Benes topology [28]. Fig. 4 illustrates a 64 × 64 Benes network of switching elements of degree 8. Clearly, we have now introduced a large number of equivalent paths into the switch fabric and again for each incident packet we are free to select any free path independently. The simplest method of achieving this is to implement the distribution stages of the switch fabric with switching elements that select any free output at random.
C. The Two-Plane Switch Structure
It is common practice in the design of a telecommunications switch to duplicate or even replicate the switch fabric and control hardware for reliability and ease of maintenance. If this is achieved in a load sharing manner the performance of the switch is also enhanced. The general structure of a two-plane switch is shown in Fig. 5 and may be extended to form a multiplane switch of any arbitrary number of planes. It consists of two identical switch planes, each switch plane being a complete delta network with or without a distribution fabric. The two switch planes are connected in parallel to form a load sharing arrangement [26] , [27] . Once again we are introducing multiple paths and at the input port controller we may use either the searching or the flooding algorithm to select a path. Considering the output port controller: a simple implementation is only able to handle a single packet at a time and thus rejects setup attempts arriving across the free plane while it is busy serving a packet. A more complex output port controller is capable of handling two packets arriving at the same time and buffering them in a first-in first-out manner in the output buffer. Thus, a measure of output buffering may be provided at the cost of a more complex output port controller.
III. A SIMULATION STUDY OF SWITCH PERFORMANCE AT SATURATION
The above design of fast packet switch features a number of design parameters the effect of which, on switch performance, needs to be investigated. The simplest way to quantify the performance of a particular switch implementation is to specify the normalized average throughput of the switch when saturated with traffic with a uniform random destination distribution. A simulation model has thus been developed to investigate the throughput at saturation of the switch with respect to the design parameters summarized in Table I.
A. The Simulation Model
In order to reduce the amount of computer time required by the simulation model to reasonable proportions, the setup of a packet has been modeled as an instantaneous event. In reality, a packet will set up on a stage-by-stage basis, thus a packet which fails setup could itself cause blocking during its setup attempt. The effect of this simplification is to overestimate the throughput at saturation, and the results of a more detailed simulation model show that the error introduced by this assumption is in general no more than about 2 percent.
In the model used to determine the throughput at saturation of the switch fabric each packet source supplies a new packet immediately upon completion of transmission of the previous packet and all output ports act as a perfect sink. Packet destinations follow a uniform random distribution and all packets are of the same length. No limit is placed upon the number of setup attempts allowed. The simulation was initialized with random time relationships between all packets and run to attain stability before measurements commenced. Simulations were run for a total of 200 000 packets minimum which yielded results with a standard deviation of about 0.8 percent of the mean for the smaller network sizes to about 0.2 percent for the larger networks. The results are normalized and presented as the throughput per port at saturation which represents the average utilization of an output port at saturation.
The total traffic capacity of a fast packet switch is thus the product of the normalized throughput per port at saturation, the size of the switch, and the system clock.
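As a worked illustration of this product (with assumed, purely illustrative numbers rather than figures from the appendices), the following sketch reproduces the order of magnitude of the 150 Gbits/s capacity quoted in the introduction, taking the per-port bit rate to equal the system clock since the data paths are bit serial.

```python
def total_capacity_bps(throughput_per_port: float, ports: int, clock_hz: float) -> float:
    """Total traffic capacity = normalized throughput per port at saturation
    x number of switch ports x per-port bit rate (the system clock)."""
    return throughput_per_port * ports * clock_hz

# Assumed example values: a 4096-port switch, a 50 MHz clock, and a normalized
# per-port saturation throughput of 0.75.
print(total_capacity_bps(0.75, 4096, 50e6) / 1e9, "Gbits/s")   # ~153.6
```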
B. The Crossbar Switch Fabric
First, we consider the operation of the crossbar switch fabric as it gives the ideal performance for a nonbuffered switch against which other interconnection networks may be compared. In the crossbar switch, blocking proceeds solely from the probability of multiple sources attempting to transmit to the same destination at the same time. The upper two curves of Fig. 6 show the difference between the simulator output and the analysis [21] under the assumptions of synchronous operation and blocked packets discarded. (The switch size (N × N) is expressed as log_2(N) and the curves are discrete, points being connected purely for visual convenience.) The next curve shows the effect of resubmitting blocked packets under the assumption of synchronous switch operation and its asymptote agrees with the analytical results of [19]. This is followed by a set of curves assuming asynchronous arrival of packets, with asynchronous switch operation and blocked packets retried, at different values of retry delay, expressed as a percentage of the packet length (i.e., the emission delay of a packet).
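For the synchronous, blocked-packets-discarded case, the standard analysis (which the text attributes to [21]) has a closed form that is easy to evaluate as a cross-check; the snippet below is offered in that spirit rather than as the paper's own derivation.

```python
def crossbar_sync_throughput(offered_load: float, n_ports: int) -> float:
    """Normalized throughput of a synchronous N x N crossbar with uniformly
    random destinations and blocked packets discarded: an output is idle in a
    slot only if every input misses it, giving 1 - (1 - p/N)^N."""
    return 1.0 - (1.0 - offered_load / n_ports) ** n_ports

for n in (2, 8, 64, 1024):
    print(n, round(crossbar_sync_throughput(1.0, n), 3))
# Tends to 1 - 1/e ~ 0.632 as N grows; with blocked packets resubmitted and
# pure input queueing the asymptote drops to about 0.586 [19].
```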
Whilst discussing the performance of the crossbar switch fabric it is interesting to introduce a simulation study of the delay performance for slotted traffic which has been analyzed in [19]. Fig. 7 shows how input queue by-pass and the use of a two-plane output buffered crossbar switch fabric improve the average delay performance of the pure input buffered switch. For the case of a two-plane crossbar switch fabric with output buffering and input queue by-pass (the deluxe model), a performance very close to that of the pure output buffered switch may be achieved but at a much reduced cost in terms of hardware and interconnections within the switch fabric. The detailed results of the simulation model for the throughput at saturation of crossbar switch fabrics under the various design parameters are given in Appendix I.
C. The Delta Network

Fig. 8 gives the maximum throughput performance of a single plane pure input buffered delta network constructed from switching elements of degree 2, 4, 8, and 16 using a flooding algorithm and a retry delay of 10 percent. The corresponding curve for the crossbar switch is included for comparison. The perturbations in the curves are due to the number of equivalent paths through the network, with the minima indicating the pure delta network. Curves are also presented of the analysis [21] and simulation results for the 2 × 2 delta network, under the assumptions of synchronous operation and blocked packets discarded, demonstrating an agreement which renders the curves virtually coincident. Comparison to the simulation results of [32] also reveals a close agreement. In Fig. 9, the improvement in throughput obtained with a two-plane, pure input buffered delta network is shown using a routing algorithm which floods between planes but searches within a plane, commencing with a random selection from all equivalent paths within a plane. (The hardware for this hybrid mechanism is easier to implement, is more flexible, and its performance differs only marginally from that of the pure flooding case.) An investigation of multiplane, pure input buffered delta networks with more than two planes shows that, in the case of switching elements of degree 8 and 16, little is gained in increased throughput as the asymptote of crossbar network performance is approached rapidly. Further, for a two-plane, pure input buffered network, the pure searching algorithm yields a performance that is only slightly inferior to that of a flooding mechanism (no more than 2 percent with 8 × 8 switching elements). The detailed results of the simulation model for the throughput at saturation of delta networks with respect to the various design parameters are tabulated in Appendix II.
D. The Distribution Fabric
The performance of the Benes network as a switch fabric for a fast packet switch has been reported in [29] and for the purposes of this discussion we state the obvious that the introduction of a distribution stage into the switch fabric does not degrade its throughput performance, but rather, enhances it to approach the performance of the equivalent crossbar switch fabric. The results reported for the delta network routing fabric may thus be taken as a lower bound on performance when considering a switch fabric with distribution stages and the results for the crossbar switch fabric taken as an upper bound. Appendix III gives the throughput at saturation of the single plane pure Benes network for comparison.
IV. MULTISERVICE INTEGRATION OVER A FAST PACKET SWITCH
From the results presented of switch performance at saturation, it may be seen that switches of very high total traffic capacity may be constructed from LSI switching elements operating at conventional speeds. We now consider how to integrate multiple services, (voice, video, image, text, data, etc.) onto the structure.
A. Multiservice Traffic Requirements
We argue that all communications services may be classified into two fundamental categories according to the delay requirement they present to the network, and for lack of better terminology we will refer to them as reserved and unreserved services. A reserved service imposes an inflexible requirement for low delay and low variance of delay, whereas unreserved services are much more flexible in the range of delay that can be tolerated. The majority of reserved services derive from information based upon a physical property that changes rapidly with time, e.g., voice and video, and often contain a high degree of redundancy, thus permitting an appreciable packet loss before any noticeable deterioration in quality is perceived. There are some reserved services, however, that are highly sensitive to error, e.g., process control, in which the delay constraint proceeds from the requirement for a high priority service, yet such services are generally of low bandwidth. Unreserved services include the bulk of data transfer, interactive, and transaction services at various priorities.
B. Extensions to the Switch
In order to support the two basic services, reserved service traffic must be given priority at all input and output ports. At the input ports, the single input queue at every port of Fig. 1 is replaced by two queues, one for reserved service packets and one for unreserved service packets. A priority field is also added to the tag to distinguish the two classes of packet. The input port controller is modified so as to transmit unreserved service packets only when the reserved service packet queue is empty, and to postpone repeated setup attempts of an unsuccessful unreserved service packet on the arrival of a reserved service packet. The transmission of a successful unreserved service packet is not interrupted by the arrival of a reserved service packet. Reserved service priority at the output port is ensured by a simple mechanism implemented in hardware in each of the output port controllers. If there is competition between packets from different input ports for access to an output port, this mechanism ensures that reserved service packets are given priority.
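A minimal sketch of this priority discipline at an input port is given below. The two-queue structure follows the description above, while the try_setup interface and class shape are assumptions made for the example.

```python
from collections import deque

class InputPortController:
    """Sketch of the two-queue input port extension.  Reserved service packets
    always take priority: an unreserved packet is attempted only when the
    reserved queue is empty, and a failed unreserved setup is not retried
    while a reserved packet is waiting.  Once an unreserved packet has
    achieved setup its transmission is not interrupted.  `try_setup` stands
    in for the switch fabric interface."""

    def __init__(self, try_setup):
        self.reserved = deque()
        self.unreserved = deque()
        self.try_setup = try_setup

    def next_attempt(self):
        """Select and attempt one packet; called again after each retry delay."""
        if self.reserved:                    # reserved traffic served first
            queue = self.reserved
        elif self.unreserved:
            queue = self.unreserved
        else:
            return None                      # both queues empty
        if self.try_setup(queue[0]):         # launch into the switch fabric
            return queue.popleft()           # success: transmission proceeds
        return None                          # blocked: priorities re-evaluated
                                             # before the next attempt

ipc = InputPortController(try_setup=lambda pkt: True)
ipc.unreserved.append("data packet")
ipc.reserved.append("voice packet")
print(ipc.next_attempt())    # -> "voice packet": reserved always goes first
```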
C. Simulation Traffic Models
Two models of unreserved service traffic have been used, saturation and Poisson. In the saturation model, unreserved service traffic is generated to keep each input port continuously busy while in the Poisson model, unreserved service packets are generated according to a Poisson arrival process. Both models generate traffic with a uniform random destination distribution. Three models of reserved service traffic were investigated: Poisson, talkspurt voice, and TDM voice. In the Poisson model, reserved service packets are generated according to a Poisson arrival process with a uniform random distribution of packet destinations. In the talkspurt voice case, a superposition of individual voice sources has been modeled, on every input port of the switch, in which the on-off characteristics of speech have been used for bandwidth compression, (i.e., packet voice with silence detection.) Each voice source is assumed to exhibit two states, active, and silent, representing the talkspurts and pauses present in conversational speech [36] . In the active state each voice source generates packets at a regular rate representing 32 kbits/s voice coding, 256 bit packets with a further 32 bits overhead, and a 20 MHz system clock. No packets are generated in the silent state. The two states are modeled by an exponential distribution with means of 1.2 and 1.8 s, respectively [37] , and each voice source transmits packets to a single destination which is selected at random during initialization. The TDM voice model is simply a talkspurt model with silent periods of zero duration to represent packet voice without silence detection. A random phase relationship is assumed between all voice sources.
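The talkspurt model is compact enough to state directly. The sketch below generates packet emission times for a single voice source using the parameters quoted above (32 kbits/s coding, 256-bit packets, exponential active and silent periods with means of 1.2 and 1.8 s); it is a modeling sketch, not the simulator's code.

```python
import random

def talkspurt_packet_times(duration_s: float, seed: int = 0):
    """Packet generation times for one talkspurt voice source.  While active,
    one 256-bit packet of 32 kbits/s voice is emitted every 256/32000 = 8 ms;
    no packets are generated while silent.  TDM voice is the same model with
    the silent period forced to zero."""
    rng = random.Random(seed)
    times, t, gap = [], 0.0, 256 / 32_000              # 8 ms between packets
    while t < duration_s:
        active_end = t + rng.expovariate(1 / 1.2)      # talkspurt, mean 1.2 s
        while t < min(active_end, duration_s):
            times.append(t)                            # regular packet emission
            t += gap
        t = max(t, active_end + rng.expovariate(1 / 1.8))  # pause, mean 1.8 s
    return times

print(len(talkspurt_packet_times(60.0)))   # roughly 60 * 0.4 / 0.008 ~ 3000 packets
```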
D. Multiservice Switch Performance
The simulation results for a 64 × 64 fast packet switch constructed from 8 × 8 switching elements using a two-plane, pure input buffered delta network are now presented for various combinations of the multiservice traffic models. Investigations suggest that the major characteristics of the results are general to all sizes of fast packet switch constructed from switching elements of any degree according to any permutation of the design parameters discussed above. A good approximation to the throughput and delay performance for other sizes and designs of fast packet switch may be obtained by scaling the measurements presented for this example in proportion to the throughput at saturation of the desired switch fabric.
The measurement of delay selected for the performance of the reserved service is that of the 99th percentile of the delay distribution [38] . It is assumed that packet voice traffic may withstand a 1 percent random packet loss, for small packet sizes [39] , [12] , without perceptible loss of quality. Hence, our measure of guaranteed maximum delay is the delay within which 99 percent of all reserved service packets arrive at their destination. The consequence is that the accuracy of the maximum delay measurements is much lower than that of throughput as we are examining the tail of the delay distribution.
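In practice this measure is simply an empirical percentile of the simulated delay samples, as in the small sketch below (illustrative code, not the measurement harness used for the figures).

```python
def guaranteed_max_delay(delays: list[float], fraction: float = 0.99) -> float:
    """Delay within which the given fraction of reserved service packets
    arrive: the empirical 99th percentile of the delay samples.  Because it
    sits in the tail of the distribution, many samples are needed for a
    stable estimate."""
    ordered = sorted(delays)
    index = min(len(ordered) - 1, int(fraction * len(ordered)))
    return ordered[index]

print(guaranteed_max_delay([1.0, 1.2, 1.5, 2.0] * 25 + [40.0]))   # -> 2.0
```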
Delay is normalized to the packet length and all measurements are taken with a retry delay of 10 percent of the packet length. Applied load and throughput per port are also normalized and reflect the average utilization of input and output ports, respectively. Fig. 10 gives the basic result for a switch with a Poisson reserved service traffic source and a saturated unreserved service traffic source on each of the switch input ports. As the reserved service traffic load is increased, the maximum unreserved service traffic load that the switch is able to sustain falls, so as to maintain the total load on the switch reasonably constant at saturation. The reserved service throughput response, in the absence of any unreserved service traffic, is identical to that in the presence of unreserved service sources. Fig. 11 gives the corresponding maximum delay curves for reserved service traffic with and without the presence of saturated unreserved service traffic. The maximum delay for reserved service traffic in the presence of saturated unreserved service traffic is approximately 50 percent greater than in the absence of unreserved service traffic. This difference is due to the probability of an incident reserved service packet finding the input node already busy serving an unreserved service packet that has achieved setup. Further, the throughput and maximum delay performance of reserved service traffic is not adversely affected by a nonuniform distribution of packet destinations for unreserved service traffic. Investigations also suggest that it is possible to operate a fast packet switch with input and output ports running at widely different mean traffic loads, as might be the case, for example, between ports connected to interswitch trunks and those connected to local area networks.
1) Poisson Reserved Service Traffic: In Figs. 12 and 13, a Poisson reserved service traffic source is multiplexed with a Poisson unreserved service source at every input port of the switch. Fig. 12 shows the throughput performance of unreserved service traffic for several reserved service traffic loads. Fig. 13 shows the corresponding average delay for unreserved service traffic. Both curves saturate at a level that reflects the remaining switch bandwidth available after serving the requirements of reserved service traffic. The reserved service throughput characteristic in this case is identical to that observed with a saturated unreserved service traffic source while the maximum reserved service delay is reduced in proportion to the amount that the total load on the switch falls below saturation.
To give a comparative impression of switch performance Fig. 14 shows the maximum delay performance of various designs of fast packet switch for Poisson traffic. Once again it may be seen that the performance of the pure output buffered switch [14] is only slightly greater than that of the highest performance two-plane delta design. This in turn is of slightly greater performance than a two-plane Batcher-banyan [13] as the latter is synchronous at the packet level and therefore cannot take advantage of input queue by-pass.
2) Talkspurt Voice: For the above 64 × 64 switch with Poisson traffic sources, the queue lengths at the input ports were observed to be short and to stabilize rapidly for traffic loads below about 0.45. This figure represents a load of 80 percent of saturation and is a valid conservative estimate for the upper bound of the applied reserved service traffic load for stable operation of all sizes and designs of fast packet switch. The maximum mean reserved service traffic load for any switch port may therefore be fixed at 80 percent of the saturation load for that switch. The maximum delay performance of the talkspurt and TDM voice source models, up to the above value of maximum mean reserved service traffic load, is now compared to the result for the Poisson reserved service traffic model.
The maximum delay performance, in the absence of unreserved service traffic, is given by Fig. 15 and in the presence of saturated unreserved service traffic by Fig. 16 . (For the talkspurt voice model an applied load of 0.45 corresponds to 625 voice sources per switch port, and to 250 voice sources per switch port for the TDM voice model.) It is evident that within the region of stable operation there is no significant difference in the guaranteed maximum delay across the switch for Poisson, talkspurt and TDM voice sources, either in the presence or absence of saturated unreserved service traffic. Furthermore, an observation of the interarrival times of packets generated by the talkspurt model on a single input port reveals a very close approximation to the exponential distribution [40] . Thus, the superposition of a large number of talkspurt voice sources may be modeled by a Poisson arrival process, with reasonable accuracy, for applied loads below about 80 percent of saturation [41] , [42] .
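These source counts follow directly from the traffic model parameters given in Section IV-C, as the short check below shows (it assumes the port bit rate equals the 20 MHz system clock, since the data paths are bit serial).

```python
# Cross-check of the voice source counts quoted above.
packet_bits = 256 + 32                 # information component plus overhead
packets_per_s = 32_000 / 256           # 32 kbits/s coding -> 125 packets/s when active
port_bit_rate = 20e6                   # 20 MHz system clock, bit serial port
activity = 1.2 / (1.2 + 1.8)           # talkspurt activity factor = 0.4

tdm_load = packet_bits * packets_per_s / port_bit_rate         # 0.0018 per source
print(round(250 * tdm_load, 3))             # 0.45 -> 250 TDM voice sources per port
print(round(625 * tdm_load * activity, 3))  # 0.45 -> 625 talkspurt sources per port
```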
3) Packet Length: Finally, we consider the effect of variable unreserved service packet length upon performance. In the results presented so far, we have assumed constant packet length and normalized all results to become independent of the absolute value. Now we assume that all packets consist of a header and an information component and we normalize results to the value of the information component. First, we consider the case in which reserved service packets and unreserved service packets are of different but constant length. The length of the unreserved service packet is expressed in terms of the reserved service packet information field, and all packets have a header of one eighth of the length of the reserved service packet information field. The throughput results are presented in Fig. 17 where it may be seen that the reserved service throughput performance is not unduly affected by the unreserved service packet length. However, the unreserved service throughput at saturation, for large unreserved service packet lengths, is lower than that for small packets, showing that the advantage of low packet overhead is rapidly outweighed by the superior multiplexing capability of small packet sizes. An examination of the case in which the length of the information component of all unreserved service packets is given by an exponential distribution reveals similar results, with a reduction in unreserved service throughput performance of between 10 and 20 percent due to the variability in packet length. An investigation of the case in which all packet lengths follow a uniform random distribution of ±10 percent about a mean value reveals no drop in performance when compared to that of constant length. Thus, the switch is insensitive to the variation in packet length that might be introduced by a line code employing "bit-stuffing." The effect of the unreserved service packet length upon reserved service packet delay performance is given in Fig. 18. As expected, a variable length unreserved service packet exerts a greater detrimental influence than one of constant length, and the shorter the mean packet length the less the reserved service packet delay performance is affected. Hence, conventional sizes of data packet must clearly be broken down into short packets for multiplexing with real-time traffic but this may not be necessary for a "data-only" environment.
V. IMPLEMENTATION
An experimental implementation of the fast packet switch has been completed in low cost 3µm HCMOS gate arrays [30], [31]. A 4 × 4 crossbar switching element and an experimental input port controller, with a standard 8-bit microprocessor bus interface, have been fabricated and demonstrated to operate as expected at a clock rate of 8 MHz. The throughput at saturation and delay performance of the switching element have been measured and agree with the simulation results to within 1 percent. The switching element required a total of 378 gates and the input port controller 292 gates, which allows an estimate of the gate complexity of fully implemented parts to be made for crossbar switching elements of various sizes, Table II. It is reasonable to expect an implementation in 2µm CMOS to achieve speeds of around 50 MHz without great difficulty and beyond this we observe that only the data path within the switching element is required to operate at high speed. The majority of the logic in the switching element handles packet setup and if a small increase in overhead is permitted in the packet setup time then this logic can operate at a slower speed than that within the data path. (In the current design, the data path passes through no more than two gates and a flip-flop.) We may thus consider implementation in BICMOS, ECL, and even GaAs at speeds approaching 500 MHz and beyond without exceeding the power budget, Table III. For even higher speed operation the switching and data paths within the switching element may be implemented optically with the control logic in ECL or GaAs to form an electro-optic switching element [43]. Switching times down to a few nanoseconds might thus become feasible on switch ports handling several Gbits/s to yield a total switch capacity measured in Tbits/s.
VI. CONCLUSIONS
The design of a fast packet switch based on a nonbuffered interconnection network has been reported and simulation results of its throughput performance at saturation discussed. The design is modular and will operate at any speed, with any device technology, including integrated optics. Maximum switch size is limited only by implementation considerations for the technology and operating speed selected. This design of fast packet switch uses fewer active elements than the equivalent crossbar switch, whilst offering a similar performance at saturation, for all sizes of switch greater than 16 × 16.
An extension to the design of the switch has been proposed in order to support multiservice traffic. Simulation results indicate that with a reserved service traffic loading of up to 80 percent of switch port saturation, the upper bound on delay for 99 percent of all incident reserved service packets is in the region of 20 packet lengths. Further, unreserved service traffic may be multiplexed with reserved service traffic, at every input port of the switch, so as to operate the switch continuously at saturation, without affecting the bounded delay performance of the reserved service. These results hold for voice traffic modeled as Poisson sources, talkspurt voice sources and TDM voice sources which yield a very similar maximum delay performance. The reserved service throughput and delay performance also appears insensitive to the arrival distribution and to the destination distribution of unreserved service traffic.
For delay-sensitive, reserved service performance, the packet length for unreserved service traffic should be short and constant. No performance impairment is introduced by a ±10 percent variation in packet length. For a single service implementation, moderately insensitive to delay, variable length packets of any reasonable maximum length may be supported.
An experimental implementation of the fast packet switch in 3µm HCMOS gate arrays has demonstrated that the switch can be implemented at low cost in conventional gate array technology and that the performance of a 4 × 4 switching element agrees closely with that predicted by the simulation model.
Work is currently in progress on the problem of supporting multicast operation across the switch, for both reserved and unreserved traffic, with a similar throughput and delay performance to that of unicast traffic. Initial results suggest that this may be achieved with the same philosophy of simple implementation in gate array technology. The much more interesting problem of how to organize, manage, control, and interface to a network of such fast packet switches is also under consideration.
Finally, by way of summary, we observe that the Cambridge fast packet switch is but: "One small chip for MAN'S…
