Design of an ATM switch and implementation of output scheduler by Fang, Jun
Design of an ATM Switch and
Implementation of Output Scheduler
Jun Fang, B.Sc.
Thesis submitted for the degree of
Master of Engineering Science
in the Department of Electrical and Electronic Engineering



















1.2 The advantage of ATM
1.3 ATM switching system
1.3.1 Switch matrix
1.3.2 Port controller











1.4 A time scheduling ATM switch
1.4.1 Structure
1.4.2 Input port controller and header processor
1.4.3 Scheduler
1.5 Description of a time scheduling algorithm





Chapter 2 The Design of ATM switch
2.1 Overview of the ATM switch
2.1.1 Design objective
2.1.2 Switching process
2.1.3 Interface definition
2.2 Switch matrix
2.3 Design of the input port controller
2.3.1 Overview of the input port controller
2.3.2 The function of the header processor
2.3.3 The operation of the buffers
2.3.3.1 Temporary buffers























2.3.3.3 Operation of the main buffers
2.4 Design of the output scheduler
2.4.1 The structure of the output scheduler
2.4.2 Operation of the output scheduler




Chapter 3 Investigating Basic
Functional Blocks of the Output Scheduler
3.1 Elementary scheduler
3.1.1 Structure
3.1.2 Operation of the comparison unit
3.1.3 Modifications for priority
3.1.4 Operation of the schedule register
3.1.5 Input address generation


















3.1.6.1 Generating the schedule
3.1.6.2 Updating the output status
3.1.6.3 Interfacing with subsequent units
3.2 Input status register
3.2.1 The structure of the input status register
3.2.2 Operation of the input status register















Physical Design of the Output Scheduler
4.1 The design methodology
4.2 Techniques for high performance digital design
4.2.1 Design specification
4.2.2 Design requirements
4.2.3 Design techniques for high speed
4.2.3.1 Floorplanning
4.2.3.2 Clock distribution and skew
4.2.3.3 Critical path analysis and optimisation
4.2.3.4 General techniques to decrease delay
4.2.4 Design techniques for low power dissipation
4.2.4.1 Logic family selection
4.2.4.2 Reducing the effective capacitance
4.2.5 Techniques for reducing the area
4.2.6 I/O system design
4.2.7 Power distribution
4.3 The simulation result of output scheduler
4.3.1 Delay and power dissipation
4.3.1.1 Simulation environment
4.3.1.2 Selection of the stimuli
4.3.1.3 Simulation result























5.1 Speed-up two
5.2 A possible way to improve the speed
5.3 Challenges on packaging
5.4




ATM (Asynchronous Transfer Mode) is regarded as the solution for the next generation
telecommunication network. ATM switches are the critical parts of an ATM network. In this
project, an experimental 16 x 16 input buffered ATM switch is developed, which employs a
time scheduling algorithm developed by Sarkies and Main. The ATM switch discussed in this
thesis contains four basic functional blocks: switch matrix, output scheduler, input port
controller and output port controller. The output scheduler is the key part of the project and
is designed to the chip level. The other parts of the switching system that interface with the
output scheduler are designed to the architecture level. The objective of this research project is
not only to design a high-speed output scheduler that can support a switch matrix working at
10 Gb/s/channel, but also to design an output scheduler that can provide a variable priority
threshold and multicast to improve the performance.
The output scheduler described in this thesis is designed with TSMC 0.25 µm CMOS
technology. The entire chip contains over 600,000 transistors. The simulation results show
that the delay of the critical path is 20.8 ns, which is well within the design requirement of 40 ns.
The estimated power dissipation of the circuits is 0.785 W with a 2.5 V power supply at 100 °C.
The circuit area is 4.9 mm² and the chip area is about 12 mm².
We demonstrate that the output scheduler can coordinate with the other parts of the ATM switch to
provide high quality service.
Declaration
This work contains no material which has been accepted for the award of any other degree or
diploma in any university or other tertiary institution and, to the best of my knowledge and
belief, contains no material previously published or written by another person, except where
due reference has been made in the text.
I give consent to this copy of my thesis made, when deposited in the University Library, being





I would like to express my appreciation to the people who contributed to the
completion of this research. First of all, I would like to thank my supervisor, Dr. Kenneth
Sarkies. He not only gave me invaluable guidance and continued support in the course of my
research, but also gave me many comments on the structure, contents and grammar of my
thesis. Secondly, I wish to extend my sincere gratitude to Mr Kiet N. To, who patiently
taught me the usage of the CAD tools used in VLSI design. Last but not least, my thanks go to
Mr Andrew Beaumont-Smith, Mr Said Al-Sarawi, Dr. Alireza Moini and Mr. Michael Liebelt
for many discussions at various stages of the design.
List of Figures
1 A general model for ATM switch
2 Architecture of crosspoint switch developed by Lowe
3 Architecture of crosspoint switch developed by Savara and Turudic
4 Structure of an input buffered ATM switch
5 Diagram to illustrate the time scheduling algorithm
6 Flow chart of operation of switching process
7 High-level architecture of an input buffered ATM switch
8 Architecture of crosspoint switch used in this ATM switch
9 Diagram of input port controller
10 Structure of write controller
11 Operation of the write controller
12 Structure of output scheduler
13 Detailed structure of output scheduler
14 Flow chart of operation of output scheduler
15 Diagram of the elementary scheduler
16 A column of elementary scheduler
17 Circuit of comparison unit
18 Diagram of input status register
19 Diagram of output status register
20 Diagram of clock generator






















22 Floorplan of elementary scheduler
23 Structure of buffer distribution
24 The critical path of output scheduler
25 Circuit of a NAND gate
26 Avoid extensive bus sharing
27 Placement of pads
28 Transient current wave form
29 Simulation of critical path
30 Architecture of an input buffered ATM switch with speed-up two
31 Diagram for an elementary scheduler group












1. Truth table of schedule generation
2. Truth table of output status updating
3. Truth table of generating fairness signal







1.1 Trends in the Development of Telecommunication Network
Nowadays, computers have infiltrated all walks of life, such as the home, banks,
manufacturing industries and so on. Although stand-alone computers are still widely
used, in more and more instances they are networked. As a result, both computer and
telecommunication networks are developing rapidly.
Recently, the telecommunication network has acquired two new characteristics: one
is the need to support services of different characteristics, for example digitised
video and image; the other is the need to support all of these services on a single
network [1]. Hence, a new telecommunication network standard is being established,
namely B-ISDN (Broadband Integrated Services Digital Network). A great deal of
research work has taken place to find a solution for the B-ISDN. Eventually, ATM
(Asynchronous Transfer Mode) was agreed upon as the target transfer mode for
implementing a B-ISDN [2].
1.2 The Advantage of ATM
ATM is a type of packet switching system that operates at high speed. In an ATM
network, the information is transferred asynchronously in the form of cells, a type of
small, fixed length packet. Cells have a length of 53 octets and are comprised of two
parts: one is the header that contains the routing information, the other is the
information field that is the cell payload. At the sending end the information is
organised into cells; at the receiving end the information in the cells is recombined.
Such a technique provides great flexibility, so that the ATM network is
very suitable for new high-bit-rate services. Real-time video is a typical example of
this kind of service, which is known as variable bit-rate (VBR) service. It is
intuitively obvious that the bit rate generated by a video picture of static scenery is
different from that of a racing fox. When the scene switches from the
scenery to the fox, there is a large burst of information to be transferred. ATM
allows this variable rate of information generation to be transferred effectively across
the network. Moreover, another inherent advantage of the ATM network is the multiplexing
of cells. Cells from different services can be transferred onto one link. That means
the network operator need only provide one connection to the customers and all
services can be provided over this link. Also, as described previously, a significant
characteristic of modern telecommunication networks is the support of multimedia
services. ATM networks use a standard size of cell for all media, which means that
switching of the cell streams can be performed at a very high rate, and this simplifies
the design of the ATM switch considerably. Clearly, ATM offers great ease of
integration of sources. It was for this reason that ATM was selected as the transfer
mode for the new generation of high-speed telecommunication network.
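The segmentation and recombination of information into cells described above can be sketched as follows. This is only an illustrative sketch: the 53-octet cell length comes from the text, while the split into a 5-octet header and a 48-octet information field is the standard ATM layout rather than something stated here, and the all-zero header, the zero padding and the function names `segment` and `reassemble` are assumptions made for the example.

```python
CELL_OCTETS = 53       # total cell length stated in the text
HEADER_OCTETS = 5      # standard ATM header length (not stated in the text)
PAYLOAD_OCTETS = CELL_OCTETS - HEADER_OCTETS


def segment(data: bytes) -> list[bytes]:
    """Split a byte stream into fixed-length cells (sending end)."""
    cells = []
    for start in range(0, len(data), PAYLOAD_OCTETS):
        # Pad the last chunk with zeros so every cell has the same length.
        chunk = data[start:start + PAYLOAD_OCTETS].ljust(PAYLOAD_OCTETS, b"\x00")
        header = bytes(HEADER_OCTETS)  # hypothetical all-zero header
        cells.append(header + chunk)
    return cells


def reassemble(cells: list[bytes], length: int) -> bytes:
    """Recombine the information fields of the cells (receiving end)."""
    return b"".join(cell[HEADER_OCTETS:] for cell in cells)[:length]
```

Because every cell has the same length regardless of the service that produced it, the switch can treat all cells identically, which is the property the paragraph above relies on.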
1.3 ATM Switching system
Generally, an ATM switching system is comprised of three elements: a switch matrix,
an input port controller (IPC) for each input port and an output port controller (OPC)
for each output port [3]. A diagram that shows these three modules is sketched in
figure 1.
Figure 1: A general model for ATM switch
1.3.1 Switch Matrix
Clearly, the switch matrix is the core of this switching system, in which the data path
is developed between any input-output pair. Hence, the matrix will significantly affect
the capability of the whole switch. The increasing traffic and multiple services in the
modern telecommunication network raise the requirement for high-capability
switches. High-performance switches should present such characteristics as very
high speed, versatility of switching mode (selective and broadcast), ease of control,
small loss, small delay, good signal integrity with little noise and so forth.
Furthermore, for the ATM network the capability of handling asynchronous data is




In the past few years, much work has been done to implement high-speed
crosspoint switches that can support an ATM switch. We shall describe two different
architectures used in gigabit crosspoint switches.
In 1997, Lowe reported a 10Gb/s/channel crosspoint switch [4], which employs a
multiplexer/decoder type of architecture. The architecture of this crosspoint switch is
shown in figure 2. As shown in the diagram, the switch consists of 16 selector slices.
In each selector, there is a 4-bit register for the input address, a 16:1 multiplexer and
an output buffer. The 4:16 decoder selects the desired output among the 16 selector
slices according to the output address. In each input port there are sixteen input data
buffers and input address buffers. They are used to drive the cells and addresses to
all the selector slices.
When a cell arrives at the switch, its input address and output address are sent to the
input address buffers and decoder respectively. According to the output address the
decoder selects one of the 16 outputs, namely one of the 16 selector slices. Note that the
decoded output address is latched by a load pulse. That means that only when the
load pulse is asserted can the output address enter the selector slice. On the other
hand, the input address is buffered with low impedance amplifiers and driven to all
the slices. The cell is also broadcast to all the slices. Referring to the detailed
structure of each slice, we note that the input address is stored in a register and it will
not be sent to the multiplexer until it receives a clock signal from the decoder. When































asserted bit will activate the register to send the input address to the multiplexer.
According to this address, the multiplexer connects one of the cells to the buffer.
Finally, the buffers drive the cell out of the switch.
This broadcast-and-select architecture needs a relatively simple structure, so it can be
implemented in one chip. This characteristic is significant as it improves the speed of
the switch and decreases the cost of production. However, this architecture also has
some disadvantages. The switch can set up only one data path at a time, so it
needs 16 consecutive load pulses to fully reconfigure the whole switch. That means
that in one time slot the switch must be programmed 16 times. This increases the
complexity of external control and the complexity of interactions with the data flow.
Savara and Turudic developed another architecture for a crosspoint switch in
1995 [5] (see figure 3). Like a normal switch, it has input buffers, output buffers and a
crosspoint switch matrix. There are only sixteen 16:1 multiplexers in the switch
matrix and each multiplexer corresponds to one output port. A characteristic of this
switch is the utilisation of configuration latches and sixteen 4-bit shift registers to
deal with the input addresses.
The incoming cells are stored in the input buffers, and the 4-bit input addresses are
sent to the address registers. Each address register corresponds to one multiplexer.
The addresses are shifted into the registers serially. When all the sixteen 4-bit binary
numbers are stored in the shift registers, the switch control centre sends a signal to





























multiplexer in parallel. With the input addresses, the multiplexer will select one of
the 16 inputs and connect it to the output. Thus, 16 nonblocking data paths are
established at the same time in the switch matrix.
Clearly this architecture is more efficient than the previous one, as the cells that
asynchronously arrive at the 16 input ports within one time slot are transferred
through the switch simultaneously. The switch is configured only once per time slot,
which makes its control easy. Hence, this architecture presents a significant
advantage when used in an ATM network.
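The once-per-time-slot configuration of this second architecture can be illustrated with a small behavioural model. It is a sketch only, not the circuit of [5]: the class name `CrosspointSwitch`, its methods and the most-significant-bit-first shift order are assumptions made for the example.

```python
N = 16  # number of input and output ports


class CrosspointSwitch:
    """Behavioural model: one 4-bit shift register and latch per output multiplexer."""

    def __init__(self):
        self.shift_regs = [0] * N  # addresses being shifted in serially
        self.latched = [0] * N     # configuration latches feeding the multiplexers

    def shift_in(self, output: int, address: int) -> None:
        """Shift a 4-bit input address into the register of one output, MSB first."""
        for bit_pos in (3, 2, 1, 0):
            bit = (address >> bit_pos) & 1
            self.shift_regs[output] = ((self.shift_regs[output] << 1) | bit) & 0xF

    def load(self) -> None:
        """Single configuration signal: all 16 addresses reach the multiplexers together."""
        self.latched = list(self.shift_regs)

    def switch(self, cells: list) -> list:
        """Each 16:1 multiplexer connects its selected input to its output."""
        return [cells[self.latched[out]] for out in range(N)]
```

In this model `load` is called once per time slot, whereas a model of the first architecture would need sixteen separate load pulses to set up the same sixteen data paths.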
1.3.2 Port Controller
In order to avoid excessive cell loss in the case of internal collisions, buffers have to
be provided in the switching system. Basically, there are two possibilities for the
buffer location:
• located in the output port controllers
• located in the input port controllers.
Different buffer locations result in different performance.
For the output buffered ATM switch, if the switch matrix cannot work fast enough,
contention may occur. In this case several cells are requesting the same output port
simultaneously. In order to achieve a collision free switch, a speed-up factor of N
must be reached for an N x N switch matrix. That is, it must be possible for each






needed. This characteristic makes the output buffered ATM switch undesirable from
a performance viewpoint.
The input buffered switch does not suffer from the speed limitations of the output
buffered switch, but it has its own problems as well. When first-in-first-out buffers
are used in the input port controller, a collision occurs when two or more head-of-
the-line cells compete for the same output simultaneously. If so, one cell passes and
the others are blocked. All cells in the blocked queues will be blocked, even if they
are requesting other, possibly unused, outputs. Consequently the throughput of the
input buffered switch is comparatively low. In order to overcome this disadvantage,
some controller module or scheduler must be employed to manage the input queues
intelligently. Thus, the input buffered switch is more practical but harder to control.
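The head-of-the-line blocking described above can be reproduced with a few lines of code. The sketch below serves one time slot of first-in-first-out input queues; resolving contention in favour of the lowest-numbered input is an assumption made for the example, as is the function name `fifo_slot`.

```python
from collections import deque


def fifo_slot(queues: list) -> dict:
    """Serve one time slot; each queue holds the requested output of each waiting cell.

    Returns a mapping {input port: output port} for the cells that pass.
    """
    served = {}
    taken = set()  # outputs already claimed in this time slot
    for inp, queue in enumerate(queues):
        # Only the head-of-the-line cell of each queue may compete.
        if queue and queue[0] not in taken:
            taken.add(queue[0])
            served[inp] = queue.popleft()
    return served


# Inputs 0 and 1 both have a head-of-the-line cell for output 2.
queues = [deque([2, 5]), deque([2, 7])]
served = fifo_slot(queues)
# Input 1 is blocked, so its second cell cannot reach output 7 although it is idle.
```

Here output 7 stays unused in the slot even though a cell for it is waiting, which is the throughput loss a scheduler is meant to remove.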
1.3.3 Multi-stage Switching
For a large switching system, a single stage switch cannot provide enough inputs
and outputs, so a multi-stage network is used. A multi-stage network is built of
several stages which are interconnected by internal links in such a way that any
output can be reached from any input. According to the number of paths which are
available for a cell to reach a destination output from a given input, these networks
can be subdivided into two groups: single-path and multiple-path networks. For the
single-path network, due to the fact that only one path exists from an input to an
output, routing is very simple. The disadvantage is that when an internal link is used




from a given input, internal blocking can be reduced or avoided. However, the
possibility exists for cell streams to arrive at the outputs with cells placed out of
sequence.
1.4 A Time Scheduling ATM Switch
In this thesis, we will discuss an input buffered ATM switch using a time scheduling
algorithm.
1.4.1 Structure
As stated above, for a high-performance input buffered ATM switch, a controller or
scheduler should be employed to manage the queuing. As shown in figure 4, such
an ATM switch consists of these four parts:
1 The switch matrix is the core of a switch. The data path between the input and
output is set up in it;
2 The scheduler selects a suitable time slot in which both the input and output are
available to send the cells so that no conflict can occur within the switch
matrix;
3 The input port controller is the interface between the switch matrix and the
scheduler. The controller should coordinate with the scheduler to manage the cell





















Figure 4 Structure of an input buffered ATM switch
1.5.1 Basic Algorithm
A block diagram is shown in figure 5 to illustrate this algorithm. The basic idea of
this algorithm is that an input status array and an output status array are maintained
and updated in the input port controller and output port controller respectively.
These indicate the usage of the input ports and output ports in successive time slots.
Specifically, a status array records which time slots have been scheduled to send a
cell for this port and which slots are available for a new cell. Referring to figure 5,
here we take 7 time slots as an example. Both the input status array and the output
status array are maintained in the form of a binary number array. In these arrays, 1
represents a time slot that has been scheduled out, while 0 represents a time slot that
is available for scheduling. When a cell is to be switched from an input to an output,
the corresponding input and output port controllers will send the status arrays to the
scheduler. The scheduler compares these two status arrays so as to find the first time
slot in which both the input port and the output port are available. Finally, the scheduler sends
the scheduling results to the input port controller and output port controller,
respectively.






Figure 5 Diagram to illustrate the time scheduling algorithm







We can use a flow chart (figure 6) to describe this algorithm:
1 The IPC receives a cell from the network and stores it temporarily; at the same
time, the IPC carries out header processing to generate the scheduling request.
2 The scheduler makes a schedule and generates input addresses for the switch matrix
on the basis of the request and the input and output status arrays.
3 The schedule is sent to the relevant IPC and OPC to update the input and output
status arrays. The cell is arranged in the input buffer to depart at its scheduled time.
4 The IPC sends each cell to the switch matrix when its scheduled time of
transmission arrives.
Figure 6 Flow chart of operation of switching process
To make this a bit clearer, let us study the two arrays shown in the diagram. The
input status array is "0010011" and the output status array is "0000101". We assume
that the least-significant bits of these arrays represent the first time slot. When this
input-output pair is requested by a cell, these two arrays are sent to the comparator in
the scheduler, which compares them time slot by time
slot until a schedule is found. For the first bit, both of the arrays are 1, which means
neither of them is available for a schedule. Then the scheduler compares the second
bit; for this bit the input port has been allocated, so no schedule can be made. For the
3rd time slot, although the input port is available the output will be busy, so there is
still no schedule. For the 4th one, both ports are available, so a schedule is made.
Since the purpose of the comparator is to find the first time slot available for
switching, the comparator stops as soon as a schedule is found.
The comparator generates a schedule result and sends the updated input status array
to the input port controller. As shown in the diagram, the schedule result is an array of
binary numbers, in which the binary 1 represents the schedule. Also, the 4th bit of the
input status array is updated to 1. In addition, the comparator sends an updated
output status array to the output port controller. In a similar way to the input status
array, the 4th bit of the output status array is changed to 1, which means that this time
slot has been scheduled out. This new output status array will be stored in the output
port controller. After this time slot, both the input and output status arrays are shifted
by one bit so that the second time slot becomes the first and so on. The bit
corresponding to the 1st time slot is used to guide the input port controller to send the
cell to the switch matrix. The bit corresponding to the 16th time slot is fed a 0, which
means a new time slot is available.
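The comparison and update steps described above can be written as a short sketch. It models only the behaviour stated in the text (find the first time slot free in both arrays, mark it in both, shift by one bit per time slot); holding each status array in an integer with the least-significant bit as the first time slot, and the function names `schedule` and `advance`, are choices made for this example and not the hardware implementation.

```python
def schedule(input_status: int, output_status: int, slots: int):
    """Find the first time slot in which both ports are free.

    Returns (slot index counted from 0, updated input status, updated output status).
    The slot index is None when no common free slot exists.
    """
    busy = input_status | output_status  # a slot is unusable if either port is busy
    for slot in range(slots):
        if not (busy >> slot) & 1:
            mask = 1 << slot  # the schedule result: a single asserted bit
            return slot, input_status | mask, output_status | mask
    return None, input_status, output_status


def advance(status: int, slots: int) -> int:
    """Shift a status array by one time slot; a 0 enters at the last slot."""
    return (status >> 1) & ((1 << slots) - 1)


# The example from the text: 7 time slots, least-significant bit first.
slot, new_input, new_output = schedule(0b0010011, 0b0000101, slots=7)
# slot is 3, i.e. the 4th time slot, and bit 4 of both arrays is now set.
```

The bit shifted out by `advance` corresponds to the first time slot, which in the text tells the input port controller whether to send a cell to the switch matrix in that slot.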
This scheduling algorithm can effectively avoid conflict among cells that may
otherwise use the same output port in the same time slot. This can decrease the loss
rate and improve the delay performance significantly. Sarkies and Main [6] showed
status length is sufficiently large. Therefore, this time scheduling algorithm can
ensure high performance of the switch.
1.5.2 Enhancement
The algorithm can be enhanced by adding priority and multicasting.
1.5.2.1 Priority
Basically, there are two types of priority algorithm, namely delay priority and discard
priority. In the first case, the cell with a low priority may suffer more delay, because
the network always serves the high-priority cell first. Discard priority means that a
cell with lower priority will be more likely to be discarded when compared with a
high-priority cell. Usually, the discard priority is very simple and is easily
implemented. The priority used by this algorithm is discard priority.
Specifically, for this algorithm a priority threshold is associated with each time slot.
The high priority cell can use any time slot for scheduling, while the low priority cell
can only be scheduled within the time slots that occur before the threshold. If a cell
with a low priority cannot make a schedule within the time slots permitted for it, the
cell will be discarded.
Clearly, this approach is very simple to implement, but the penalty for it is less
flexibility. In different ATM networks the threshold for the priority may need to be
set to a different value. The
scheduler should therefore be designed to be able to support a variable threshold. In other
words, the threshold value can be set to any point and this value is decided by the
priority level.
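A minimal sketch of the discard priority rule with a variable threshold is given below. It assumes that the threshold is expressed as the number of earliest time slots a low priority cell may use, that status arrays are integers with the least-significant bit as the first time slot, and that a result of `None` means no schedule could be made; the function name and parameters are illustrative only.

```python
def schedule_with_priority(input_status: int, output_status: int,
                           high_priority: bool, threshold: int, slots: int = 16):
    """Return the first usable time slot (counted from 0), or None if there is none.

    A high priority cell may use any of the `slots` time slots; a low priority
    cell may only use the time slots before `threshold`.
    """
    limit = slots if high_priority else threshold
    busy = input_status | output_status
    for slot in range(limit):
        if not (busy >> slot) & 1:
            return slot
    return None  # a low priority cell that reaches this point is discarded
```

With the first four time slots busy and a threshold of 4, a low priority cell is discarded while a high priority cell is scheduled into the fifth slot; raising the threshold to 8 lets the low priority cell use that slot as well.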
1.5.2.2 Multicasting
The algorithm also supports multicasting. Multicasting means a cell from one input
can be switched through a number of outputs simultaneously. In order to support
multicast, the switch matrix should broadcast a cell to all the 16 outputs. Then input
addresses are sent to their corresponding output ports, which select and connect one
of the inputs to the output. Clearly, if a number of selector slices select the same
input, the cell from this input will be multicast. We will discuss the crosspoint
switch architecture that supports multicast in a later section.
1.6 Summary
In this chapter, we introduced the basic structure of an ATM switch. The switch
matrix is the core of the ATM switching system. Therefore, we reviewed the
crosspoint switch architecture. Then, we discussed in particular the structure of an
input buffered ATM switch with a time scheduling algorithm. Finally, we discussed the time
scheduling algorithm employed by this ATM switch.
In the next chapter, we take up these basic concepts and carry out the design of an






















Figure 7 High-level architecture of an input buffered ATM switch
2.1.3 Interface Definition
The simulation results [6] of the time scheduling algorithm used in this switch show
that the switch can achieve adequate throughput when 16 time slots are used for
scheduling. Therefore, the scheduler discussed in this thesis is designed for
scheduling within the 16 subsequent time slots.
Before we can discuss the detailed design of each functional block, we should be
clear in what format each functional block interfaces with the others, because it will
affect the structure of the basic functional blocks.
• The Format of the Output Address
As stated above, to make a schedule for a cell, the scheduler needs a copy of the
output address of this cell. What form should we use to describe this output
address? Recall that one of the design objectives of this ATM switch is to support
multicast. Multicast means a cell from an input may be switched to a number of
outputs simultaneously. Therefore, we need an array of binary numbers that can select
any number of outputs. To that end, for an ATM switch with 16 output ports,
we need a 16-bit binary number to describe the output address. In this binary array,
we use 1 to represent a selected output port and 0 to represent a port not selected. In
order to minimise the pin count for the scheduler, this 16-bit array is shifted into the
scheduler in series. Therefore, the scheduler should have 16 inputs for the output
addresses, each of the addresses coming from a different input port controller.
• The Format of the Priority Information
As we have discussed in the first chapter, for such a scheduler with 16 time slots for
scheduling, we need a 16-bit array to describe the threshold value for priority. Note
that we can also use a four-bit binary number to describe such a 16-bit array, and
then the 4-bit binary number can be decoded into the 16-bit array. As will be shown
in a later chapter, the scheduler should shift the information in and out at a very
high speed, so from the point of view of power saving it is advantageous to use the 4-bit
addresses instead of 16-bit. Also, for this scheduler we assume that a cell to be
multicast to different outputs has identical priority for all the outputs,
because it would be very difficult to distinguish different priorities for different
outputs of a multicast cell. Therefore, only one bit of priority information is required
for any cell. The scheduler therefore has 16 inputs for priority information, and each
one is connected to one input port controller.
• The Format of the Input Address
We have mentioned that the scheduler should generate the input addresses for each
selector slice in the switch matrix to set up the data path. As with the output address,
a question may be raised regarding the format of this input address. Since in one time
slot one output port can only handle one cell, we should use an array that can
select one input from sixteen inputs. In order to minimise the hardware of the
scheduler, we use a 16-bit array with one asserted bit to select an input. The input
address is shifted out of the scheduler to the switch matrix in series.
• The Format of the Schedule
As we know, the scheduler should return a schedule for each input port controller as
the result of request processing. This array of schedules informs the input port
controller in which time slot the cell is to be scheduled. For the multicast case, a cell
may be scheduled into a number of time slots to be switched out through different
outputs. If this is the case, we have to use a 16-bit array to describe which time slots
the cell is scheduled to. Therefore, the scheduler has 16 outputs for shifting the 16-
bit schedule array back to the input port controllers.
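The array formats defined above can be illustrated with a few helper functions. This is a sketch under stated assumptions: the bit order on the serial lines (least-significant bit first) is not specified in the text and is chosen here only for the example, and all function names are hypothetical.

```python
N = 16  # number of ports and of scheduling time slots


def output_address(outputs) -> int:
    """16-bit output address: a 1 for every selected output port (multicast allowed)."""
    word = 0
    for port in outputs:
        if not 0 <= port < N:
            raise ValueError(f"output port {port} out of range")
        word |= 1 << port
    return word


def input_address(inp: int) -> int:
    """16-bit input address with exactly one asserted bit."""
    if not 0 <= inp < N:
        raise ValueError(f"input port {inp} out of range")
    return 1 << inp


def serialise(word: int) -> list:
    """Bits in the order they are shifted over one serial line (LSB first, assumed)."""
    return [(word >> i) & 1 for i in range(N)]


def deserialise(bits) -> int:
    """Rebuild the 16-bit array from the serial bit sequence."""
    word = 0
    for i, bit in enumerate(bits):
        word |= (bit & 1) << i
    return word
```

Sending each 16-bit array over one serial line, as the text describes, keeps the scheduler at 16 pins per kind of information instead of 16 x 16.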
2.2 Switch Matrix
As described above, the core of the ATM switch is a switch matrix, in which the data
paths are developed between the input and output ports at the request of the cell. In
the first chapter, we reviewed some possible architectures. In this section, we
will discuss the crosspoint switch architecture that meets the requirement of the
particular scheduler discussed in this thesis. According to the above discussion, we























Figure 8 shows a crosspoint switch architecture, which employs a broadcast and
select architecture. This crosspoint switch consists of 16 cell inputs, 16 cell outputs
and 16 address inputs. The switch is comprised of 16 selector slices, each of which
corresponds to an output. The incoming cells are driven by the buffers and
broadcast to all the selector slices. Each selector slice receives an input address
that informs the selector to select one of the cell inputs and connect it to the output.
The structure of the selector slice is shown at the bottom of figure 8. The selector
includes four parts: a 16-bit shift register, a configuration latch, a 16:1 multiplexer
and a buffer. The input address is shifted into the shift register in series. When all the
addresses are ready, the configuration latch is turned on, and the input address is
loaded into the multiplexer in parallel. The asserted bit in the address will select one
of the cell inputs and send it to the buffers. The buffers drive cells to the output port
controller.
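The selector slice described above can be modelled behaviourally as follows. The sketch covers the shift register, the configuration latch and the multiplexer driven by a one-asserted-bit address; the class and function names, and the choice to shift the most significant bit first, are assumptions made for the example rather than details of the actual circuit.

```python
N = 16  # cell inputs per selector slice


class SelectorSlice:
    """One output of the broadcast-and-select crosspoint switch."""

    def __init__(self):
        self.shift_reg = 0  # 16-bit shift register receiving the input address
        self.latch = 0      # configuration latch feeding the multiplexer

    def shift_in(self, bit: int) -> None:
        """Shift one address bit into the register."""
        self.shift_reg = ((self.shift_reg << 1) | (bit & 1)) & 0xFFFF

    def load(self) -> None:
        """Turn on the configuration latch: the address reaches the multiplexer."""
        self.latch = self.shift_reg

    def select(self, cells: list):
        """Connect the input whose address bit is asserted; None if no bit is set."""
        for i in range(N):
            if (self.latch >> i) & 1:
                return cells[i]
        return None


def load_input_address(slice_: SelectorSlice, inp: int) -> None:
    """Shift in a 16-bit address with one asserted bit, most significant bit first."""
    word = 1 << inp
    for pos in range(N - 1, -1, -1):
        slice_.shift_in((word >> pos) & 1)
```

Because every slice sees all 16 broadcast cells, loading the same input address into several slices multicasts that input's cell to all of their outputs.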
On the one hand, the broadcast and select architecture ensures that the switch can
support multicasting, and the simple structure makes very-high-speed switching
possible. On the other hand, each selector slice receives a particular input address for
its corresponding output port simultaneously, which means that the switch need only be
set up once per time slot. This simplifies the external control significantly. Moreover,
we employ a 16-bit shift register to receive the 16-bit input address generated by
the output scheduler, to ensure that the switch matrix can interface with the output
scheduler directly.
Currently, the GaAs crosspoint switch described in the literature can offer the
capability of up to 16 input and output ports at 10Gb/s each [4], so the design of
the scheduler and input port controller should be able to support such a very-high-
speed switch matrix.
2.3 Design of the Input Port Controller
In the ATM switching system, the input port controllers, which act as the interface
between the switch matrix and the output scheduler, play a significant role in the data
flow control. Now let us look at how an input port controller works. The input port
controllers should coordinate with the output scheduler and switching fabric to
provide not only the basic cell control capability but also advanced functions
such as multicast and a variable threshold for priorities.
2.3.1 Overview of the Input Port Controller
As we have discussed, the input port controller is the interface between the network
and the switch. On the one hand, the controller processes the cell header and deduces the
necessary routing information for switching. On the other hand, the input port
controller coordinates with the scheduler to manage the cell flow. To realise these
two functions, the input port controller consists of two basic functional blocks: the




















































Figure 9 Diagram of input port controller
2.3.2The Function of Header Processor
When a cell arrives at an input port controller, it first goes through a header
processor. Recall that each ATM cell has a header that contains the
properties and the routing information of the cell. The header processor analyses the
header of the cell to determine the type of the cell. If an idle cell, which is
inserted by the physical layer and contains no user information, is detected, this cell is
discarded immediately. For a user cell, however, the header processor generates the
necessary information for switching the cell. The routing information is updated and
the cell is passed to the buffers.
As we have discussed in the previous section, for this ATM switch the information
required to switch a cell is the output address, which indicates to which outputs the cell
is to be switched, and the priority information, which describes how many scheduling
resources can be used by the cell. Recall that, because of the need for multicast, the
output address should be in the form of a 16-bit binary array. Each bit
of this array corresponds to an output, and an asserted bit represents an output to which
the cell is to be switched. The priority information should be a 4-bit binary number,
which will be decoded to a 16-bit array in the scheduler. The header processor sends
both to the output scheduler.
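As a minimal sketch of the request described above, the following Python fragment packs a list of destination outputs into the 16-bit output address and checks that the priority code fits in 4 bits. The function names and the choice of the least significant bit for output #1 are assumptions made for illustration; they are not taken from the design itself.

```python
def make_output_address(outputs):
    """Pack destination outputs (numbered 1..16) into a 16-bit address.

    Assumption: output #1 maps to the least significant bit.
    """
    address = 0
    for port in outputs:
        if not 1 <= port <= 16:
            raise ValueError("output port must be in 1..16")
        address |= 1 << (port - 1)  # assert the bit of each requested output
    return address


def make_request(outputs, priority):
    """Return the (output address, priority code) pair sent to the scheduler."""
    if not 0 <= priority <= 15:
        raise ValueError("priority must fit in 4 bits")
    return make_output_address(outputs), priority


if __name__ == "__main__":
    # A multicast cell destined for outputs #1 and #16.
    address, priority = make_request([1, 16], 15)
    print(format(address, "016b"), format(priority, "04b"))
```

A unicast cell is simply the special case in which one bit of the address is asserted.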
2.3.3 The Operation of the Buffers
The buffers in the input port controller not only store the cells but also control the
stored in the #1 buffer, the cell stored in the #1 buffer is forwarded into the #2
buffer. Similarly, at the beginning of the 3rd time slot, the cells in the #1 and #2
buffers are forwarded into the #2 and #3 buffers respectively and a new incoming
cell is stored in the #1 buffer. At the end of the 3rd time slot, the schedule is
received, and the cell in the #3 buffer is written into the main buffer under the
guidance of the schedule result.
2.3.3.2 The Structure of the Main Buffers
The block diagram of the main buffers is shown in figure 9. The main buffer consists
of two basic blocks: one is the memory, the other is the read-write controller.
The memory comprises 16 memory units. Each unit can store one cell. The
capacity of the main buffer is determined by the scheduling capability of the output
scheduler. Recall that this output scheduler is designed to be able to make a schedule
within the subsequent 16 time slots. As we will see later, storing a cell into the main
buffer depends on the schedule for it. If no schedule is made for a cell, this cell
will not be written into the main buffer at all. A cell that fails to enter the main
buffer is discarded. Therefore, the number of memory units in the main
buffer is identical to the maximum number of cells that can be scheduled. In
addition to the 16 memory units, there is a temporary buffer in the main buffers, which
receives the cell from the cell buses and sends it to the switch matrix after one time
slot. The necessity of this temporary buffer will become evident shortly.
Associated with the memory, there are two shift registers, which are the write
controller and read controller respectively (see figure 9). Each bit of the shift registers
is connected to a memory unit so as to control the write and read processes.
2.3.3.3 Operation of the Main buffer
In the last subsection, we mentioned that the operation of the main buffer relies on
the schedule results. Therefore, we can associate the position of the memory unit
with the time slot. Specifically, each memory unit corresponds to a time slot. Thus,
we can identify the memory address with the time slot.
We shall now discuss the operation of the main buffer. Similar to other memory
devices, it includes two processes: read and write.
Reading Process
As we know, the cells are scheduled into the subsequent 16 time slots. When a cell is
sent out, the subsequent cell becomes the leading cell in the new time slot.
Therefore, when a cell is saved in a memory unit, the time slot to which this memory
unit corresponds will change with time. For example, at one moment, the #1
memory corresponds to the first time slot and the #2 memory corresponds to the second
time slot (refer to figure 9). In the next time slot, the time slot that the #1 memory
corresponds to becomes the current time slot and the cell in the #1 memory is sent out.
At the same time, the #2 memory corresponds to the 1st time slot and all the other
groups likewise move one time slot closer. In the time slot after that, the cell in the
#2 memory will be sent out and the #3 memory will correspond to the 1st time slot.
Therefore, we need a pointer that indicates which memory corresponds to the
current time slot and activates it to send the cell out.
In order to select the cell to be sent out, we need a read controller. This
controller is a circular 16-bit shift register with the input of the first bit connected to
the output of the last bit (see figure 9). Thus, the state of this shift register can be rotated
around the register. Each bit of this 16-bit shift register is connected to the "read
enable" input of a memory unit.
The controller is initialised with a state such as
"0000_0000_0000_0001". The shift register is designed to shift its state in a
counterclockwise direction, namely, the state in each bit is moved to the bit on its
left and the left-most bit is shifted into the right-most bit. The shift register is driven
by a signal that is asserted at the beginning of each time slot. Clearly, the state 1 will
be moved around by one bit per time slot. We can regard the state 1 in the shift register
as a pointer.
As stated above, each memory unit corresponds to one of the 16 subsequent time slots,
so the memory unit that has just sent its cell out should correspond to the 16th time
slot, a new time slot for scheduling. Therefore, we conclude that the pointer is
always pointing to the memory unit that corresponds to the 16th time slot.
Since the pointer is always moved counterclockwise, the memory unit on the left of
the pointed one should correspond to the 1st time slot. These properties are very
important for the operation of the write controller. We will come back to this point later.
Once the properties of this pointer are clear, it is easy to
understand the operation of the reading process. The pointer is moved around the
register by one bit per time slot. The memory unit pointed to by the pointer sends its
cell to temporary buffer #4 over the cell buses (refer to figure 9). After one time slot,
the cell in buffer #4 is forwarded into the switch matrix, and the read process is
finished. We will explain the reason for using this buffer #4 in a later section.
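The behaviour of the read controller can be illustrated with a short Python sketch of a circular 16-bit shift register holding a single 1. It assumes that bit 0 of the integer stands for memory unit #1 and that "left" means towards the more significant bit; the helper names are hypothetical.

```python
WIDTH = 16


def rotate_left(state, width=WIDTH):
    """Circular shift by one bit: every bit moves left, the left-most bit wraps around."""
    mask = (1 << width) - 1
    return ((state << 1) | (state >> (width - 1))) & mask


def read_enables(pointer, width=WIDTH):
    """Read-enable bits as a list, index 0 for memory unit #1."""
    return [(pointer >> i) & 1 for i in range(width)]


pointer = 0x0001  # initial state "0000_0000_0000_0001"
for slot in range(3):
    unit = read_enables(pointer).index(1) + 1
    print(f"time slot {slot}: memory unit #{unit} is read")
    pointer = rotate_left(pointer)  # driven once at the start of each time slot
```

After 16 rotations the pointer returns to its initial position, so every memory unit is read exactly once per 16 time slots.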
Writing Process
The write process saves the cell held in the #3 temporary buffer into the main
buffers. The memory in which the cell is written is selected by the write controller.
As shown in figure 9, the write controller is also a 16-bit shift register with the input
of the first bit connected to the output of the last bit. The shift register receives the
schedule in series and sends it out in parallel. Each bit of this shift register is
connected to the "write enable" input of a memory unit. This shift register is
somewhat different from a common serial-in-parallel-out register. We note that the
schedule input is directed to all the 16 bits of the register. That means any bit of the
register can be selected as the entry-point for shifting in the schedule array, but only

part of this shift register is shown at the bottom of figure 10. Note that there is a 2:1
multiplexer between each pair of flip-flops. The control signal of this multiplexer comes
from the pointer. When the pointer is 1, the input of the flip-flop is connected to the
schedule, so the flip-flop pointed to by the pointer becomes the entry-bit and the
schedule shifts into the shift register. At the same time, the multiplexer blocks the
signal from the preceding flip-flop. On the other hand, those flip-flops that are not
pointed to by the pointer receive the signal from their preceding stage, so their
operation is identical to that of an ordinary shift register. The reason for using such a
structure will become evident shortly.
Note that the output of the #3 temporary buffer is connected to the input of all the 16
memory units in the main buffer. That means the cell can be written into any number of
enabled memory units simultaneously. This is due to the need for multicast, in which
the cell from one input may be scheduled to a number of different time slots. Since
we associate the memory position with the time slot, it makes good sense to save the
cell into the memory units that correspond to the time slots scheduled for the cell. For
example, if a multicast cell is scheduled into both the 1st and 2nd time slots, we
should save the cell into the memory units that currently correspond to the 1st and 2nd
time slots. Clearly, due to this association, we can use the schedule array to
select the memory units and write the cell into them. Specifically, the asserted bits in
the schedule array represent schedules. The asserted bits turn on the "write
enable" of the corresponding memory units and the cell is written in.
If we use the schedule array as the write address of the main buffers, we have to
time slot. The time slot to which each memory unit corresponds is always changing,
under the control of the pointer. Therefore, a write controller is necessary to
select the entry-point of the schedule.
The write controller is a shift register with a variable shift-in bit. Here we assume that
the schedule array is always shifted into the register counterclockwise. For example,
the schedule result in the #2 bit is shifted to the #3 bit; the schedule result in the #3 bit
is shifted to the #4 bit; the schedule result in the #16 bit is shifted to the #1 bit. In
addition, we assume that the most-significant bit of the schedule array is shifted into the
register first, and the least-significant bit is shifted in last.
On the basis of the above assumptions, we note that no matter from which bit the
schedule array is shifted in, the final result follows this rule: the schedule
result that corresponds to the first time slot (least-significant bit) is always stored in
the entry-bit register; the schedule result that corresponds to the 2nd time slot is
stored in the register to the left of the entry-bit; the schedule result corresponding to
the 16th time slot (most-significant bit) is stored in the register to the right of the
entry-bit. An example is shown in figure 10. The diagram shows the entry-point,
shifting direction and the final result. For this example the entry-point is the #3
register. When all the bits are shifted in, the bit corresponding to the first time slot is
stored in the #3 register. The bit corresponding to the 2nd time slot is to the left of the
entry-point, in the #4 register, and the bit corresponding to the 16th time slot is in the right

Figure 11 Operation of the write controller
We have concluded that the time slot to which a memory unit corresponds is decided
by the pointer in the read controller. The memory unit referenced by the pointer
always corresponds to the 16th time slot and the one on its left corresponds to the first
time slot. Therefore, we can simply use the read controller to select the entry-point of the
schedule in the write controller. A diagram is shown at the top of figure 10. We note
that each output of the read controller is connected to a tristate buffer. The tristate
buffer that is turned on provides a path for the schedule to the write controller,
and the register connected to this buffer becomes the entry-point. We note that the
#N bit of the read controller is connected to the #(N+1) bit of the write controller. In
other words, when the pointer of the read controller is pointing to the #N bit, the
#(N+1) bit in the write controller is selected as the entry-point for the schedule.
The writing operation is very straightforward. The pointer in the read controller turns
on a tristate buffer. The schedule is shifted into the write controller from the selected
entry-point. At the end of the time slot, all the schedule bits are ready in the write
controller and they are loaded into the memories. The cell from the #3 temporary
buffer is written into the selected memory units. The write process is then complete.
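The write controller can be modelled in a few lines of Python to check the rule stated above: whichever register is the entry-point, the bit for the first time slot ends up in the entry register and the later slots follow to its left. This is only a behavioural sketch; the list index 0 stands for register #1, and the function names are hypothetical.

```python
def entry_from_read_pointer(read_index, width=16):
    """Bit #N of the read controller selects bit #(N+1) of the write controller."""
    return (read_index + 1) % width


def shift_in_schedule(schedule, entry, width=16):
    """Shift a schedule array serially into the write controller.

    schedule: integer whose bit 0 is the first time slot.
    entry:    0-based index of the entry-point register.
    Returns the register contents as a list, index 0 for register #1.
    """
    registers = [0] * width
    for k in reversed(range(width)):  # most significant bit goes in first
        bit = (schedule >> k) & 1
        # The entry register takes the schedule input; every other register
        # takes the value of its preceding stage (ordinary shift behaviour).
        registers = [
            bit if i == entry else registers[(i - 1) % width]
            for i in range(width)
        ]
    return registers


# Read pointer on register #2 (index 1), so the entry-point is register #3.
entry = entry_from_read_pointer(1)
registers = shift_in_schedule(0b11, entry)  # scheduled into the 1st and 2nd slots
print([i + 1 for i, bit in enumerate(registers) if bit])
```

With the entry-point at register #3, a cell scheduled into the 1st and 2nd time slots enables registers #3 and #4, which agrees with the example in the text.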
2.4 The Design of the Output Scheduler
As discussed in the first chapter, the most advantageous features of this ATM switch

Figure 12 Structure of output scheduler
2.4.1 The Structure of the Output Scheduler
A functional block diagram of the output scheduler is shown in figure 12. The
example output scheduler shown in the diagram has 32 input ports and 32 output
ports. The inputs comprise 16 output addresses and 16 priority information codes. All
of this input information comes from the input port controllers. Of the output ports, 16
are connected to the 16 input port controllers and send the schedules to the
controllers. The other 16 output ports are connected to the switch matrix. Each of
these ports passes to a switch output the input address to which that switch output
is to be connected.
Basically, this output scheduler is composed of four main functional blocks: the
elementary scheduler, the input status register, the output status register and the
clock generator. As shown in the diagram, the scheduler consists of a 16x16 array of
elementary schedulers. Each row of elementary schedulers corresponds to an
input, and each column of elementary schedulers corresponds to an output. Each
elementary scheduler thus corresponds to an input-output pair. The function of the
elementary scheduler is to compare the input status array with the output status array.
To minimise the amount of information to be exchanged between the output
scheduler and the outside, we simply maintain the input status arrays and output
status arrays within the output scheduler. The input status arrays and output status
arrays are stored in the input status registers and output status registers respectively
(see figure 12). In addition to storing and updating the input status array, the input
Moreover, a 4:16 decoder and a 16-bit address register are also integrated into the
input status register, although they have a separate function. Another part of this
output scheduler is a clock generator. This generator receives the clock signal from
outside, buffers it and distributes it to every clocked circuit on the chip. It also
generates the assertion pulses that are needed both by the status registers and the
elementary schedulers.
2.4.2 Operation of the Output Scheduler
A more detailed diagram of this output scheduler is given in figure 13, which shows
the data flow among the functional blocks. The aim of this section is to illustrate how
each functional block interfaces and coordinates with the others. The detailed design
within each functional block will be discussed in the next section.
2.4.2.1 Data Flow of the Output Scheduler
In this section we will discuss how the output scheduler works. Before we step into
the detailed discussion of its operation, a flow chart (see figure 14) can help us to
understand the basic function of the scheduler.
The operation of the output scheduler can be divided into three steps, and each step
takes one time slot. The first step is to import the request from the input status
register; the second step is to compare the status arrays; the last step is to send the

output addresses and priority information, comparing the status arrays and sending
the input addresses and schedules out are all carried out in a pipelined manner.
Figure 14 Flow chart of operation of output scheduler
[Flow chart boxes, in order of operation:
Shift the input requests containing the destination output address into the output
scheduler.
Input and output status arrays and scheduling requests are sent to the elementary
schedulers to make schedules. When two or more input requests demand the same
output, the scheduler at that output makes several scheduling decisions.
Scheduling results are sent back to update the input and output status registers.]
Importing the Request
As mentioned above, the function of the output scheduler is to compare the input and
output status arrays so as to find the first time slot available for both the requested
input and output. The status arrays are compared within the elementary scheduler
that corresponds to the requested input-output pair, so we need some information to
find the particular elementary scheduler. Moreover, any schedule is made on the
basis of the priority information, so some circuits must be provided in the output
scheduler to handle this priority information.
Recall that each row of elementary schedulers corresponds to an input and each
elementary scheduler within this row corresponds to an output. Therefore, we can
use an output address to select particular elementary schedulers and activate them to
compare the status arrays. At the beginning of this chapter, we stated that in
order to support multicast (in which an input can select a number of outputs
simultaneously), a 16-bit array is used to describe the output address. Referring to
figure 13, in each row of elementary schedulers there is a 16-bit shift register
that receives the output address from the input port controller in series and sends it to
each elementary scheduler in this row in parallel. Each bit of this shift register is
connected to the "enable" input of an elementary scheduler. An asserted bit in the
address activates the corresponding elementary scheduler to perform the
comparison of the status arrays.
Let us study an example. If a cell requests to be switched to the outputs labelled #1
and #16, the output address is
"1000_0000_0000_0001". It takes one time slot to shift it into the shift register. At
the beginning of the next time slot, the address is sent out in parallel. Each bit of this
array is sent to the corresponding elementary scheduler. In this example, the
elementary schedulers corresponding to the first and last outputs receive a logical 1,
so they are turned on and will compare the input and output status arrays. In contrast,
the other elementary schedulers, which receive 0, are turned off. In other words, when
the status arrays flow through them they perform no operation and keep the status
arrays intact.
Referring to figure 13, we note that we have provided a decoder with each row of
elementary schedulers. This decoder is used to turn the 4-bit binary number
that describes the priority information into a 16-bit array. As defined at the beginning
of the chapter, we use a 4-bit array to describe the priority information, which is
shifted into the output scheduler in series. This priority information describes the
threshold value of the time slot that a cell can use. As shown in a later section, the
elementary scheduler needs this information as a 16-bit array. Hence, we should
turn this 4-bit binary number into a 16-bit array. This is done by the 4:16 decoder.
As we have discussed, in this output scheduler we assume that all copies of a multicast
cell possess identical priority information. Therefore, we can simply send the
decoded priority information for a multicast cell to all the 16 elementary schedulers
in a row (see figure 13).
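The import step for one row can be sketched as follows. The sketch assumes a conventional one-hot 4:16 decode in which code c asserts bit c, so that the threshold is time slot c + 1; the source does not spell out the mapping for every code, so this mapping and the function names are assumptions made for illustration.

```python
def decode_priority(code):
    """4:16 decoder (assumed one-hot): code c asserts bit c of the 16-bit array."""
    if not 0 <= code <= 15:
        raise ValueError("priority code must fit in 4 bits")
    return 1 << code


def row_inputs(output_address, priority_code):
    """Signals delivered to the 16 elementary schedulers of one row.

    Returns a list of (enable, decoded_priority) pairs, index 0 for output #1.
    Every scheduler in the row receives the same decoded priority array.
    """
    decoded = decode_priority(priority_code)
    return [((output_address >> col) & 1, decoded) for col in range(16)]


row = row_inputs(0x8001, 15)  # multicast to outputs #1 and #16
enabled_outputs = [col + 1 for col, (enable, _) in enumerate(row) if enable]
print(enabled_outputs, format(row[0][1], "016b"))
```

Only the schedulers whose enable bit is 1 take part in the comparison; the others pass the status arrays through unchanged.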
Comparing Status Arrays
When both the output addresses and the priority information are available, the output
scheduler is ready for the next step, which is request processing. This is the
essential step in the operation of the output scheduler. We discuss its basic
operation first, then we will analyse its drawbacks and improve it.
At the beginning of the next time slot, both the output address and the decoded
priority information are sent to the elementary schedulers. Simultaneously, each input
status register loads its status array into all the elementary schedulers in its
corresponding row. Each output status register sends its output status array to one of
the elementary schedulers in its column (refer to figure 13). The output status array
goes through all the elementary schedulers in this column one by one and
is compared with the input status array in the activated schedulers.
Note that the operation of each column of elementary schedulers is identical, so
let us study the operation of one column, which is sufficient to mirror the operation
of the whole output scheduler. For simplicity, we assume that the output status array
is loaded into the elementary scheduler in the first row. We will have a more general
discussion on this topic in a later section.
For example, if the first elementary scheduler in a column is activated, the output
and input status arrays are compared in it. If a schedule is made, the elementary
scheduler updates the output status array and sends an updated copy of the
output status array to its subsequent elementary scheduler, the one in the second row.
In figure 13 the updated output status array is labelled x1. Simultaneously, the
elementary scheduler generates a schedule array (labelled e in the diagram) in which
1 represents a schedule and 0 means no schedule. This schedule array is combined
with the schedule arrays generated by the other elementary schedulers in this row
to produce a final schedule array for the input status register.
The updated output status array x1 is sent to the elementary scheduler in the second
row. If this elementary scheduler is also turned on, the output status array will be
compared with the input status array that is stored there. Assume that in this
comparison no schedule is made, so the output status array is kept intact and goes
to the next elementary scheduler. At the same time, the elementary scheduler outputs
a schedule array. Since no schedule is made, it is just an array of zeros.
Subsequently, assume that no other elementary schedulers in this column are turned
on; then the output status array flows through each of them and returns to the
output status register with a value that is identical to the value of x1. Since no
schedule can be made in a disabled elementary scheduler, the schedule outputs of
these elementary schedulers are all arrays of zeros.
At the end of this time slot, the input status registers fetch the value from the e-bus.
The value on the e-bus is the logical OR of all the schedule results
corresponding to the same time slot and different outputs. We will explain later why a
logical OR should be used. The input status registers use the scheduling results to
modify the input status arrays stored in them. On the other hand, each output status
register stores the new output status array.
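The rippling of the output status array down one column can be sketched in Python as below. The sketch leaves out the priority information and always enters the column at the first row, as in the simplified discussion above; status arrays are held as integers whose bit 0 is the first time slot, and the function name is hypothetical.

```python
def ripple_column(output_status, rows, slots=16):
    """Pass one output status array through a column of elementary schedulers.

    rows: list of (enabled, input_status) pairs, first row first.
    Returns (schedule arrays per row, final output status array).
    """
    schedules = []
    for enabled, input_status in rows:
        schedule = 0
        if enabled:
            for slot in range(slots):
                mask = 1 << slot
                busy = (input_status | output_status) & mask
                if not busy:
                    schedule = mask          # first slot free for both ports
                    output_status |= mask    # mark the slot as taken
                    break
        schedules.append(schedule)
    return schedules, output_status


# Row 1 makes a schedule, row 2 is enabled but its input is busy in every
# slot, and the remaining rows are disabled.
rows = [(True, 0x0000), (True, 0xFFFF)] + [(False, 0x0000)] * 14
schedules, x1 = ripple_column(0x000F, rows)
print(hex(schedules[0]), hex(schedules[1]), hex(x1))
```

Because the second row makes no schedule and the remaining rows are disabled, the array returned to the output status register equals the value produced by the first row.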
Exporting the Scheduling Results
As we know, the input port controller needs the schedule array to place the cell into
the buffer, and the switch matrix needs the input address to switch the cell. We have
discussed the generation of the schedule result. We will discuss the generation of the
input address in a later section. The third step is to send them out. At the end of the
3rd time slot, the output scheduler has processed a request.
2.4.2.2 Structure Analysis
In the last section, we discussed how the output scheduler works. In this section, to
gain insight into the operation of the switch, we will discuss why it is appropriate for
the scheduler to operate in this way.
One question may be why the input status array is sent to all the elementary
schedulers in a row, while the output status array is fed to one scheduler and
ripples through the whole column.
As the name "output scheduler" implies, all the schedules made in this scheduler
correspond to the output ports. In other words, what is scheduled is the operation of
the output ports. One output port can only handle one cell in one time slot, so the
output scheduler must schedule only one cell into each time slot. In the output

immediately. A 1 in the output status array represents a time slot that has been
scheduled. Therefore, this 1 gives the subsequent elementary schedulers no
chance to make further schedules into that time slot. Thus, this structure ensures that
only one cell is scheduled into a time slot; in other words, no conflict can occur on the
output port.
On the other hand, consider the input status arrays. As we know, this switch should
provide the multicast function, so it is not surprising for one input port to send a cell
to several output ports simultaneously. In the section discussing the operation of the
switch matrix, we have seen that in the crosspoint switch the inputs just broadcast
the cells to all the outputs, and each output selects one of them according to the
input address that comes from the output scheduler. This architecture implies that,
from the point of view of an input port of the switch matrix, there is no difference
between switching the cell to one output or to sixteen outputs. Therefore, we just
send the input status array to all the elementary schedulers in a row simultaneously.
Each elementary scheduler in this row generates an array of schedule results.
Therefore, we use the logical OR of all the corresponding bits of the 16 arrays as the
final schedule result and send it to the input status register. This explains why the
final schedule result should be the logical OR of all the individual results.
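The bitwise OR across a row is small enough to show directly. In this sketch each schedule array is an integer whose bit 0 is the first time slot, and `final_schedule` is a hypothetical helper name.

```python
from functools import reduce
from operator import or_


def final_schedule(row_schedules):
    """OR together the schedule arrays produced by the schedulers of one row."""
    return reduce(or_, row_schedules, 0)


# A multicast cell scheduled into the 1st slot for one output and into the
# 2nd slot for another; the other 14 schedulers in the row made no schedule.
row = [0x0001, 0x0002] + [0x0000] * 14
print(format(final_schedule(row), "016b"))
```

The result marks every time slot in which the input is busy with some copy of the cell, which is what the input status register and the write controller of the main buffer need.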
2.4.2.3 Defeating Unfairness
Now that we have understood the necessity of rippling the output status array through the
column of elementary schedulers, another question arises: since
each elementary scheduler can only receive the updated output status array from the
one above it, the upper elementary scheduler has a greater chance to make schedules as
it receives the output status array earlier. In other words, there is unfairness between
the elementary schedulers in a column. In order to solve this problem, we have to change
the entry scheduler of the output status array frequently and regularly so that each
elementary scheduler in a column has, on average, an equal opportunity to receive the
best service as well as the worst service.
Towards that end, the scheduler is designed to be able to load the status array into
any one of the elementary schedulers in a column and, consequently, the updated
output status array can be returned from any elementary scheduler (see figure 13).
Clearly, in one time slot only one of them should be turned on to receive and return
the status array. Therefore, we need a signal to select it. We employ a 16-bit shift
register to act as a pointer in the signal generator. We have discussed the operation of
a pointer in the previous section. Each bit of this register is connected to a row of
elementary schedulers. When the pointer is pointing to a row, the row below the
pointed row is the entry scheduler of the output status arrays for this time
slot. Thus, it receives the output status array from the output status register directly.
The other ones receive the status array from the scheduler above them. When the
output status array has gone through all the 16 elementary schedulers in a column, it is
returned to the output status register through the scheduler in the row pointed to. The
entry elementary scheduler is changed in each time slot, so that after 16 time slots,

In this chapter, we discussed the operation of the ATM switch at the highest level.
Four basic functional blocks were introduced and their coordination with each other was
demonstrated. Then, we discussed the structure and operation of the input port
controller. From the discussion of the input port controller, we understand how the
controller generates the request and manages the cell flow according to the schedule
results. Subsequently, we discussed the architectural design of the output scheduler,
from which we understand the basic scheduling process. With a good understanding

Investigating Basic Functional Blocks
of the Output Scheduler
Now that we have discussed the architecture of this output scheduler and studied
how each module coordinates with the others, we are ready to have a close look at the
operation of each functional block.
3.1 Elementary Scheduler
3.1.1 Structure
First of all, let us study the interface of an elementary scheduler with the outside
world. The block diagram of an elementary scheduler is shown in figure 15. Since
the basic function of an elementary scheduler is to compare the input status array and
output status array, each elementary scheduler has two 16-bit inputs for the input and
output status arrays respectively. As we have mentioned, the priority information also
takes part in the comparison, so a 16-bit input for the priority information that is
decoded from a 4-bit binary number is needed. The elementary scheduler also needs
an "enable" input, which is controlled by the output address. In addition, in order to
defeat unfairness, each elementary scheduler is designed to be able to receive the

Figure 15 Diagram of the elementary scheduler
input are necessary. Each elementary scheduler has a schedule array output and two
output status array outputs, one of which is connected to the subsequent elementary
scheduler while the other is connected to the output status register. Moreover, each
elementary scheduler has a one-bit input and a one-bit output for the input address.
Secondly, let us consider the internal structure of an elementary scheduler. The
elementary scheduler consists of two main parts: the comparators and the schedule
registers. As shown in figure 15, there are 16 comparison units and each unit
corresponds to one time slot. For each unit, there are four inputs: input status, output
status, priority information and the interfacing signal. Note that associated with each
comparison unit there is a blocking circuit, which generates the interfacing signal to its
subsequent unit. As we know, the scheduler is only concerned with the first time slot
in which both the input and output ports are available. The blocking circuit is employed
to keep other schedules from being made once the first one is found. For the first
unit, the interfacing signal comes from the output address, which determines whether
to enable this elementary scheduler. In addition, each scheduler outputs a schedule
array and an output status array. The schedule result has two branches of outputs, as
shown in figure 15. One of them is sent back to the input status register. The other is
fed into a 16-bit schedule register, whose function will be discussed later. Moreover,
as shown in the diagram, there is a separate one-bit register, which is used for the
generation of input addresses.
3.1.2 Operation of the Comparison Units
Whether an elementary scheduler will operate or not is decided by the "enable"
signal, which is connected to one bit of the output address. If the output to which the
elementary scheduler corresponds is not requested, a logical 0 is sent to it. Then
this signal is rippled down to each comparison unit through the blocking circuits and
disables them. Alternatively, a logical 1 activates this elementary scheduler. At
the beginning of each time slot, the input status array and the priority information are
sent to each elementary scheduler, while the output status array will not be
available until it ripples to the elementary scheduler. When all three sets of
inputs appear at an elementary scheduler, the scheduler is activated. What we are
interested in is the first time slot that is available for both input and output, so the
comparison operation begins from the unit corresponding to the first time slot and
passes along to the one corresponding to the last time slot. In other words, no
decision can be reached by a comparison unit until it receives the interfacing signal
from its preceding unit.
Now let us study a simple scheduling example without consideration of priority
information. Assume that an input status array, 0000_0000_0111_1101, and an
output status array, 1111_0000_0000_1111, appear on the inputs of an activated
elementary scheduler. Here we regard the least significant bit as the first time slot.
The first bit of each array is 1, which means that in this time slot both input and
output are busy. The first unit's blocking circuit therefore sends a logical 0 to the
following unit to inform it that no schedule has yet been found. With this signal, the
second comparison unit is activated and begins to compare the second bit of the arrays.
However, the output is busy in this time slot, so no schedule is made and the unit passes the
responsibility to its next unit. Thus, the comparison is carried out one by one until a
schedule is found. For these two arrays, in the 8th time slot both arrays are 0, which
means both input and output are available. The 8th comparison unit generates a
logical 1 as the schedule result and updates the 8th bit of the output status array to
1. Simultaneously, its blocking circuit sends a 1 to its subsequent unit and this signal
will be passed through all its subsequent units. This signal prevents the subsequent
units from making any schedules, although both ports are available in the 9th to 12th
time slots.
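The example above can be checked with a short behavioural sketch. It is only an illustration written in Python, not the circuit itself: the function name `first_fit` and the integer encoding of the arrays (bit 0 standing for the first time slot) are assumptions made here. The loop plays the role of the ripple through the comparison units, and returning from the loop plays the role of the blocking signal.

```python
def first_fit(input_status: int, output_status: int, slots: int = 16):
    """Return (slot, updated_output_status); slot is 1-based, None if no slot fits.

    Hypothetical helper: bit 0 of each integer stands for the first time slot.
    """
    for k in range(slots):                 # ripple from the first slot to the last
        i = (input_status >> k) & 1        # 1 = input busy in this slot
        x = (output_status >> k) & 1       # 1 = output busy in this slot
        if i == 0 and x == 0:              # both ports free: schedule here
            # returning here models the blocking signal that disables later units
            return k + 1, output_status | (1 << k)
    return None, output_status


slot, new_x = first_fit(0b0000_0000_0111_1101, 0b1111_0000_0000_1111)
print(slot, format(new_x, "016b"))         # prints: 8 1111000010001111
```

The updated output status has its 8th bit set, while the free slots 9 to 12 are left untouched, as described in the text.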
3.1.3 Modifications for Priority
From the last example, the basic operation of the comparator is clear. Now
let us look at how the priority information affects scheduling. In the last example, we
did not consider the effect of the priority information. We can regard this situation
as one in which each cell has high priority, in which case the cell can be scheduled
into any one of the 16 time slots. In this case, the priority information should be
"1000_0000_0000_0000". As assumed above, the most significant bit corresponds to
the 16th time slot, so this priority information means the cell cannot use the time
slots after the 16th. Thus all 16 time slots are available for it.
A cell with a low priority can only use some of the time slots, as decided by
the threshold value set by the priority information. Let us again consider the two arrays
from the last example to study how the priority information affects the scheduling
results. For convenience, we write them here again: the input status array is
0000_0000_0111_1101 and the output status array is 1111_0000_0000_1111. First,
consider the priority information "0000_1000_0000_0000". This array means the
cell can use the first 12 time slots. As discussed above, the cell would be
scheduled into the 8th time slot, so this cell obtains a schedule within the limit of
resources. The priority information has no effect on the schedule result. On
the other hand, if the elementary scheduler receives priority information such as
"0000_0000_0100_0000", the result will be different. This priority
information limits the time slots available to the cell to the first seven, while there is
no schedule for it until the 8th time slot. Thus, even though plenty of
schedule resources are available from the 8th to 12th time slots, the cell is still going
to be discarded because its low priority prevents it from using them.
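The effect of the priority threshold can be illustrated with the following behavioural sketch in Python. The function name `schedule_with_priority` and the integer encoding (bit 0 is the first time slot) are assumptions made for this illustration. The priority array is read as described above: a 1 in bit k means that the cell may use the first k time slots only.

```python
def schedule_with_priority(input_status: int, output_status: int,
                           priority: int, slots: int = 16):
    """Return the 1-based slot that is scheduled, or None if the cell is discarded.

    Hypothetical helper; a set bit in `priority` marks the last usable slot.
    """
    for k in range(slots):
        busy = ((input_status >> k) & 1) or ((output_status >> k) & 1)
        if not busy:
            return k + 1                   # first slot free for both ports
        if (priority >> k) & 1:            # threshold reached without a schedule
            return None                    # later slots are off-limits
    return None


I = 0b0000_0000_0111_1101                  # input status array from the example
X = 0b1111_0000_0000_1111                  # output status array from the example
print(schedule_with_priority(I, X, 0b1000_0000_0000_0000))  # prints: 8
print(schedule_with_priority(I, X, 0b0000_1000_0000_0000))  # prints: 8
print(schedule_with_priority(I, X, 0b0000_0000_0100_0000))  # prints: None
```

With the threshold at the 7th slot the cell is discarded, even though slots 8 to 12 are free for both ports.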
3.1.4 Operation of the Schedule Register
As mentioned above, the output scheduler should generate an input address for each
output of the switch matrix to set up a data path for the cell. In order to generate the
input address, we need a copy of the schedules. Towards that end, one copy of the
schedule results is sent to the input status register, which helps to update the
input status arrays stored in both the input status register and the input port controller.
The other copy of the schedule array is directed to a sixteen-bit schedule register.
Each bit of the register corresponds to a time slot. Since each bit of this register
obtains the schedule state from a comparison unit, each register unit
should correspond to the same time slot as that comparison unit.




elementary scheduler. As we have mentioned, each elementary scheduler
corresponds to one input-output pair. When a schedule is made in an elementary
scheduler, it means that a cell is scheduled to be switched between this input-output
pair. Indeed, the states in this schedule register indicate in which time slots this
input-output pair will switch a cell.
Clearly, the schedule array is the collection of the schedule results
that come from the comparison units. We should use the schedule array to update the
states in the schedule register. Towards that end, an OR gate is used. What is written
into the schedule register is the logical OR of the schedule array and the previous state in
the schedule register. Since the new state always relies on its previous state, all the
states in the schedule register should be set to 0 when the scheduler is powered up,
which means no schedule has been made. When a new schedule, a logical 1, appears
in the schedule array, it will modify the state in the corresponding bit of the register.
If no schedule comes in, the states in the schedule register are kept intact.
Furthermore, as stated above, the shift register is adopted to indicate the scheduling
history of the following 16 time slots, so the binary array stored in the register should
be shifted one bit per time slot so that each state in the register corresponds to the
proper time slot. Hence, the shift register is shifted one bit per time slot. The last bit,
which corresponds to the 16th time slot, is loaded with a 0, and the state stored in the
first bit of the register is loaded into another shift register, as shown in figure 15.
This register is also a parallel-in-series-out shift register, whose function is to shift
the input address out of the output scheduler.
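The OR-update and the per-slot shift of the schedule register can be sketched as follows. This is a behavioural illustration in Python only; the function names and the integer encoding (bit 0 is the first time slot) are assumptions made here, not part of the design.

```python
def update_schedule_register(reg: int, schedule_array: int) -> int:
    """New state = logical OR of the incoming schedule array and the old state."""
    return reg | schedule_array


def shift_schedule_register(reg: int):
    """Advance one time slot.

    Returns (new_reg, first_bit): the first bit is handed to the separate
    one-bit register, and the 16th bit is refilled with 0 by the right shift.
    """
    first_bit = reg & 1
    return reg >> 1, first_bit


reg = 0                                    # all zeros at power-up
reg = update_schedule_register(reg, 1 << 7)   # a schedule for the 8th time slot
outputs = []
for _ in range(8):                         # eight time slots pass
    reg, bit = shift_schedule_register(reg)
    outputs.append(bit)
print(outputs)                             # prints: [0, 0, 0, 0, 0, 0, 0, 1]
```

The schedule bit leaves the register exactly when its time slot becomes the current one.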
3.1.5 Input Address Generation
As mentioned above, the purpose of keeping a copy of the schedule array in the
schedule register is to generate the input address to be passed to the switch at the
correct time when a cell is to be switched. In this section, we will discuss how to
generate the input address from the schedule arrays. To solve this problem, we
should put the elementary scheduler into a column of elementary schedulers, which
is sketched in figure 16. Each elementary scheduler in this column includes a shift
register unit and they are connected together into a 16-bit shift register. This shift
register receives the schedule results from the first bit of the schedule register in each
elementary scheduler in parallel and shifts them out to the switch matrix. What is
loaded into this shift register is just the input address for the cell to be switched in
the current time slot. This address is sent to the corresponding output port of the
switch matrix to select the input.
Recall that each column of elementary schedulers corresponds to an output, and each
row of elementary schedulers corresponds to an input. Also, in the elementary
scheduler each comparison unit corresponds to a time slot. Now let us consider the
16 comparison units from different rows, which correspond to the same time slot,
say the first time slot. They are highlighted in figure 16 by a box. In this module
there are sixteen single-bit shift registers, each of which indicates whether there is a
schedule made for its corresponding input port. Thus, if we associate this array of schedules
together, it just indicates which input port is scheduled to send the cell to this output














Figure 16 A column of elementary scheduler
schedule made for the 9th input port, so consequently the input port controller
corresponding to the 9th input port would send the cell to the switch matrix at the
beginning of the next time slot. The array of schedules provides this
information to the output port of the switch matrix, which then uses it to select the
appropriate cell from the input ports. The array of schedules effectively provides an
input address to the output port of the switch matrix.
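As a hedged illustration of how the column of first-slot schedule bits acts as an input address, the following Python sketch decodes such a column into an input port number. The function name `input_address` and the list representation are assumptions made here; the check that at most one bit is set reflects the fact that an output can receive a cell from only one input per time slot.

```python
def input_address(column_bits):
    """Decode a column of 16 first-slot schedule bits into a 1-based input port.

    Hypothetical helper: column_bits[r] is the bit held by the elementary
    scheduler in row r (input port r + 1). Returns None if no cell is scheduled.
    """
    scheduled = [row + 1 for row, bit in enumerate(column_bits) if bit]
    if len(scheduled) > 1:
        raise ValueError("more than one input scheduled for the same output")
    return scheduled[0] if scheduled else None


column = [0] * 16
column[8] = 1                              # schedule made for the 9th input port
print(input_address(column))               # prints: 9
```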
It is worth explaining why a cell and its corresponding input address
always arrive at the switch matrix in the same time slot. As mentioned above, the
input address is generated according to the schedule. When a schedule is generated it
is written into the bit of the schedule register that corresponds to the scheduled time
slot. Recall that the other copy of the schedule result is sent to the input port controller,
which helps to write the cell into the memory unit that corresponds to the scheduled
time slot. After each time slot, the state in the schedule register is shifted by one bit
so that each bit corresponds to its preceding time slot. Similarly, the pointer in the
main buffers shifts one bit after each time slot and all the cells stored in the main buffers
correspond to their preceding time slot. When the scheduled time slot becomes the
current time slot, the schedule result stored in the schedule register is loaded into the
shift register, where it acts as the input address. At the same time, the pointer in the
main buffers moves to the memory unit that stores the corresponding cell of this
schedule and the cell is written into a temporary buffer.
As it takes one time slot to shift the input address in series into the switch matrix, the
cells from the main buffers have to be stored in the temporary buffers #4 for one time
of the switch matrix and they are loaded into the multiplexer. Simultaneously, the
cell is sent to the switch matrix. Therefore, a cell and its corresponding input address
always arrive at the switch matrix in the same time slot.
3.1.6 Circuit Design of the Comparison Unit
In order to understand the operation of the comparison unit, we have to step down to a
lower level, namely the circuit level, to see how the status arrays are compared.
The circuit of the comparison unit and its blocking circuit are shown in figure 17.
From the point of view of the inputs i (input status) and x (output status), there are
three paths and each path takes charge of a particular function. Specifically, path 1
compares the input and output status and generates the schedule result; path 2
compares the input and output status and updates the output status array
conditionally; path 3 takes charge of generating the interface signal to the subsequent
unit.
3.1.6.1 Generating the Schedule
The path labelled 1 generates the schedule. The operation can be readily























The purpose of the first equation is to compare i and x. Specifically, when both the
input status i and output status x are 0, which means both input and output port are
available, the result e (schedule candidate) will be 1. That means this time slot is
suitable for making a schedule. However, it is only a candidate; whether it can become a
real schedule depends on whether this comparison unit is activated. As we
discussed above, for the first comparison unit the enable signal comes from the
output address, and for the other units the enable signal comes from the interface
signal of the upper unit. We employ a logical AND between e and the inverted up (up being
the interface signal from the upper unit), and the result is the real schedule result,
e_tmp. For example, when up is 1, which means this comparison unit is disabled, e_tmp
will definitely be 0, no matter what the signal e is. When up is 0, e_tmp follows the value
of e. Therefore, if up is 0, a schedule candidate (e=1) will become a schedule. Note
that the signal e_tmp has two branches: one is connected to the schedule register and
the other drives an NMOS transistor. For the first branch, we have described the
reason why we should store this schedule result. Now let us study the other
branch.
In the previous section we explained that what the input status register receives
is the result of a logical OR of sixteen schedule results that correspond to the same
time slot but different output ports. This transistor is used here to realise the wired
OR. Note that the "gate" of this N-transistor is connected to the schedule signal; the
"source" is connected to Vdd; the "drain", E_out, is connected to a bus. This bus
connects the input of the input status register with 16 schedule result outputs
corresponding to the same time slot and different output ports.




the transistor and pull it up to 1; on the other hand, if e_tmp is a 0, the transistor
will be turned off and output a high impedance that does not affect the state of the
bus. In order to realise a logical OR of these 16 schedule results, we should
discharge the bus to logical 0 at the beginning of each time slot. If one or more
of the schedule results connected to this bus is 1, they will pull the bus up and the input of
the input status register will be a 1. Alternatively, when none of these 16 schedule
results is 1, all of them will output high impedance and the bus will keep the state
0. Thus the input status register will receive a 0.
We can also use a truth table to explain the operation:
Where "Z" represents high impedance; a "don't care" entry marks an input whose value does not matter.
Table 1 Truth table of schedule generation
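The schedule-generation path and the wired OR can be sketched behaviourally as follows. This is an illustration in Python under stated assumptions, not the circuit: the logic e = i NOR x and e_tmp = e AND up' is inferred from the description in this section, and the function names, the string "Z" for high impedance and the list of drivers are introduced here only for the sketch.

```python
from itertools import product


def schedule_path(i: int, x: int, up: int):
    """Return (e, e_tmp, e_out) for one comparison unit.

    Assumed logic, inferred from the text: e = i NOR x, e_tmp = e AND (NOT up).
    e_out is 1 when the pull-up transistor conducts, otherwise "Z".
    """
    e = int(not (i or x))                  # schedule candidate
    e_tmp = e & (1 - up)                   # up = 1 disables this unit
    e_out = 1 if e_tmp else "Z"            # transistor off means high impedance
    return e, e_tmp, e_out


def wired_or(drivers):
    """Bus is discharged to 0 each time slot; any driver at 1 pulls it up."""
    bus = 0
    for d in drivers:
        if d == 1:
            bus = 1
    return bus


print(" i x up | e e_tmp E_out")
for i, x, up in product((0, 1), repeat=3):
    e, e_tmp, e_out = schedule_path(i, x, up)
    print(f" {i} {x}  {up} | {e}   {e_tmp}     {e_out}")
```

For example, `wired_or(["Z"] * 15 + [1])` gives 1 and `wired_or(["Z"] * 16)` gives 0, matching the bus behaviour described above.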
3.1.6.2 Updating the Output Status



































output status, if a schedule is made. The operation of this path can be described with
the following equation:
x_tmp = (i' AND up') OR x
where i' represents the inverted i, up' means the inverted up and x_tmp represents the
new output status. This equation tells us that when the signal up is a
logical 1, in other words, this comparison unit is not requested to work, the result of
(i' AND up') will certainly be 0. Consequently, x_tmp will follow the previous
value of the output status x. In contrast, if this elementary scheduler is activated, namely
up=0, then the last equation can be rewritten like this:
x_tmp = i' OR x
The reason is that when up=0, the result of (i' AND up') is always decided by
the value of i'. When i is equal to 1, this means the input port has already been scheduled
for another cell, so no schedule can be made and x_tmp will follow the previous x. If this
time slot for this input has not been scheduled, the signal i is 0. In this
case, x_tmp is equal to 1: either the output was already busy (x=1), or a schedule is made
and the output status is updated to 1.





Where a "don't care" entry marks an input whose value does not matter.
Table 2 Truth table of output status updating.
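The output-status update equation given above, x_tmp = (i' AND up') OR x, can be enumerated with a few lines of Python. The function name `update_output_status` is an assumption made for this sketch; the logic is taken directly from the equation.

```python
from itertools import product


def update_output_status(i: int, x: int, up: int) -> int:
    """x_tmp = (i' AND up') OR x, with i' and up' the inverted i and up."""
    return ((1 - i) & (1 - up)) | x


print(" up i x | x_tmp")
for up, i, x in product((0, 1), repeat=3):
    print(f"  {up} {i} {x} |   {update_output_status(i, x, up)}")
```

When up is 1 the result always equals x, and when up is 0 the result reduces to i' OR x, as stated in the text.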
Having obtained the updated output status, x_tmp, we next discuss how this
signal is distributed. Recall that in order to defeat unfairness, the output status arrays
are loaded into a different elementary scheduler in each time slot, and consequently
the updated output status arrays are returned to the output status register from
different elementary schedulers in each time slot. The distribution circuit is
employed here to realise these functions.
As shown in figure 17, the distribution circuit is a combination of a 1:2
demultiplexer and a 2:1 multiplexer. The signal x_tmp has two branches of output:
one is Xout and the other is t. Clearly, we need a demultiplexer here to decide to which
branch the signal should be directed. On the other hand, as we see in the circuit































them we need a multiplexer here. Both the multiplexer and the demultiplexer are
controlled by the signal labelled fairness, because all of these operations are to defeat
unfairness.
When a row of elementary schedulers receives a logical 1 from the signal fairness (see
figure 17), that means the next row of elementary schedulers will act as the
entry-scheduler for the output status array. In this case, the output Xout should be
connected to the output status from the output status register. The other source of
Xout, from x_tmp, should be latched. The output status array that has gone through all the
elementary schedulers in a column goes back to the output status register from the
entry-point. Hence, the signal x_tmp is connected to the output that leads to the
output status register. From figure 17 we find that when the signal fairness is a
logical 1, it turns on the pass transistor between f and Xout and the pass
transistor between x_tmp and t.




Where Z represents high impedance.







On the other hand, in the common case the elementary scheduler does not act as the
entry-scheduler, so it just receives the output status array from its upper elementary
scheduler and sends it to the next one. It is not aware of the existence of the output
status register. In this case, the signal fairness is a logical 0 and it establishes a
path between x_tmp and Xout (see figure 17). At the same time, this signal
latches the signal from the output status register and outputs high impedance to the
output status, so that this elementary scheduler neither modifies the states of the
output status array nor is affected by the status array. This completes the description
of how the output status array takes part in the comparison and how it is distributed.
3.1.6.3 Interfacing with Subsequent Unit
Finally, let us analyse the blocking circuit that interfaces with the comparison unit
corresponding to the next time slot. The blocking circuit takes charge of informing
its subsequent unit whether it is requested to compare the status arrays. If the
interface signal down is a 0, its subsequent unit is enabled to compare the status
arrays. In contrast, if down is a 1, the subsequent units will be disabled. There are
three factors that may affect the value of the interface signal, namely up, p and the
comparison result of i and x. Therefore, the blocking circuit needs these signals as
inputs (see figure 17).
The operation of the blocking circuit can be described with the following equation:
down = (i NOR x) OR p OR up'
where up' represents the inverted up.
This equation shows the contribution of these three factors clearly. If this
elementary scheduler is disabled, namely up=0, then signal up' will be 1. No matter
what the comparison result of i and x is, signal down will certainly be 1. Then the
subsequent unit will be disabled. If the signal up is a 1, the interface signal down is
decided by the other factors. If the signal p is a 1, which means the priority
information prevents the cell from using the subsequent time slot, the signal down
will also be 1. Thus, the subsequent unit will be disabled. If both p and up' are 0, the
signal down is decided by the comparison result of i and x. When both of them are 0,
this time slot will be scheduled. Since a schedule has been found, it is not necessary
for the subsequent unit to do anything. Therefore, signal down passes a 1 to the
subsequent unit to disable it. On the other hand, if no schedule is achieved in this
comparison unit, namely (i NOR x) is equal to 0, then the signal down will be 0. It
informs its subsequent unit to go on comparing the input and output status arrays.
The functionality of this part of the circuit can also be described by a truth table.
Where a "don't care" entry marks an input whose value does not matter.
Table 4 Truth table of generating the interface signal.
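The blocking-circuit equation as printed above, down = (i NOR x) OR p OR up', can be enumerated in Python. This is only a sketch: the function name `blocking_signal` is an assumption, and the argument `up_inv` stands for up' exactly as it appears in the equation, so the sketch makes no claim about the polarity used elsewhere in the circuit.

```python
from itertools import product


def blocking_signal(i: int, x: int, p: int, up_inv: int) -> int:
    """down = (i NOR x) OR p OR up', taken term by term from the equation."""
    nor = int(not (i or x))                # 1 when a schedule is found here
    return nor | p | up_inv


print(" up' p i x | down")
for up_inv, p, i, x in product((0, 1), repeat=4):
    print(f"  {up_inv}  {p} {i} {x} |  {blocking_signal(i, x, p, up_inv)}")
```

The only rows with down = 0 are those where up' and p are both 0 and at least one of i and x is 1, that is, the unit is active, the priority threshold is not reached and no schedule was found.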
3.2 Input Status Register
In the output scheduler, the input status arrays are maintained and updated in the
input status register. We have discussed the basic function of the input status register
in the previous section. In this section the structure and operation of the input status
register will be discussed in more detail.
3.2.1 The Structure of Input Status Register
Figure 18 shows the diagram of an input status register. The input status register has
16 inputs which receive the schedule array from the elementary schedulers. There are






























































The input status register is comprised of 16 input status register units. Each unit
corresponds to a fixed time slot. Recall that in the main buffer, each memory unit
corresponds to a floating time slot. Here, each input status register unit corresponds to a
particular time slot. For example, the 1st unit corresponds to the first time slot, the 3rd
unit corresponds to the third time slot and so forth. As shown in the diagram, each
unit receives a schedule bit and outputs an input status bit.
The diagram of an input status register unit is shown at the bottom of figure 18.
Basically, each input status register unit contains two parts: one is the schedule status
shift register, the other is the input status register. The schedule status registers of
all units are connected in series into a 16-bit shift register. The input status register is
comprised of an OR gate and a register. The OR gate is used to change the schedule
status into the input status, as explained below. The register is employed to store and
shift the input status array.
3.2.2 Operation of the Input Status Register
First of all, let us look at the operation of the schedule status shift register. This
16-bit shift register stores the 16-bit schedule array in parallel at the beginning of a time
slot. This 16-bit wide array is shifted out to the input port controller in series. This
schedule array is used to place the cells into the buffers in the input port controller as
explained in a previous section.
Secondly, we will concentrate on the discussion of the input status register. Refer to the
From the diagram, we note that the registers in the units are connected one by one.
They comprise a 16-bit shift register. The shift register forwards its state one bit
per time slot, namely the state in the 16th unit is loaded into the 15th unit, the state in
the 3rd unit is loaded into the 2nd unit and so on. The reason for this operation is
straightforward. As mentioned above, each unit of the input status register
corresponds to a fixed time slot. After each time slot, each bit of the input status array
should correspond to a new time slot. Therefore, in order to keep each bit of the
input status array corresponding to the correct time slot, the bits should be moved
into the register units that correspond to the appropriate time slot. In this input status
register, the top unit is designed to correspond to the first time slot, and the bottom
one corresponds to the 16th time slot. After a time slot, each bit of the input status
array should correspond to its preceding time slot. Therefore, at the beginning of a
time slot, each bit of the input status array is shifted into the upper unit, and the 16th
bit is fed a 0 that represents a new time slot to be scheduled (see figure 18).
Note that besides the state loaded into the register, there are another two branches for
the input status. One of them is sent out of the input status register. This copy of the
input status is sent to the elementary schedulers and will be compared with the
output status.
The other copy is sent to an OR gate. Recall that the input status register receives the
schedule array, but maintains and sends out the input status array. Therefore, we
should change the schedule status into the input status before it is stored in the
previous state of the input status is a logical 0 and a schedule is made for this time
slot, namely a logical 1 is received, then the logical OR of these two states is 1. This
state is written into the register, which updates the previous state in the register. If
the previous status is a logical 0 and no schedule is made for this time slot, namely
a logical 0 is received, the input status will keep its previous value, logical 0. If the
previous value of the input status is logical 1, which means this time slot has already been
scheduled, it is impossible to make a further schedule. Therefore the schedule
should be 0 and the input status is still 1.
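The two operations of the input status register, the OR-update with the incoming schedule array and the shift at each new time slot, can be sketched in Python as follows. The function names and the integer encoding (bit 0 is the first time slot) are assumptions made for this illustration.

```python
def apply_schedule(input_status: int, schedule_array: int) -> int:
    """Input status = previous input status OR received schedule bits."""
    return input_status | schedule_array


def advance_time_slot(input_status: int) -> int:
    """Shift every bit to its preceding time slot; the 16th bit is fed a 0."""
    return input_status >> 1


status = 0b0000_0000_0111_1101             # input status array from section 3.1.2
status = apply_schedule(status, 1 << 7)    # a schedule made for the 8th time slot
print(format(status, "016b"))              # prints: 0000000011111101
status = advance_time_slot(status)         # one time slot later
print(format(status, "016b"))              # prints: 0000000001111110
```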
3.3 Output Status Register
As the name implies, the output status register is used to maintain and drive the output
status array. A diagram of the output status register with the detailed structure of the
register unit is shown in figure 19. Indeed, the output status register is a simplified
version of the input status register. The output status register need not shift anything
out, so there is no shift register for shifting the schedule array out. Moreover, the
output status register receives, stores and maintains the output status array itself,
so the output status is written into the register directly. Except for
the above differences, the operation of the output status register is identical to that of
the input status register.
3.4 Clock Generator


































Figure 20 Diagram of clock generator
Basically, it is comprised of two identical 16-bit shift registers. Both of them are set
to the state "0000_0000_0000_0001" when the chip is powered on. Moreover,
the first bit and the last bit of each register are connected so that its state can
be rotated round and round. Indeed, both of them operate like the pointer used in the
read controller of the input port controller.
The shift register #1 is used to generate the assertion pulses that are necessary in the
operation of the input status register and elementary scheduler. For example, the
input status register needs an assertion signal at the beginning of each time slot so
that the input status array is forwarded one step. This signal can be realised using the
shift register #1. This register is driven by the clock signal, which asserts 16 times
per time slot (we will define the clock frequency in a later section). At the
beginning of a time slot, the first bit of the shift register #1 is set to 1; all other bits
are set to 0. The output of the first bit of the shift register #1 will be the assertion
signal that we need. Using this register, we can obtain any assertion pulse that we
need.
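The rotating one-hot register can be sketched in Python as a 16-bit ring counter. This is an illustrative model only: the function names are assumptions, and the rotation direction (towards higher bit positions) is chosen arbitrarily here because the text does not fix it.

```python
WIDTH = 16


def rotate(state: int, width: int = WIDTH) -> int:
    """Rotate a one-hot state by one position; the last bit wraps to the first."""
    msb = (state >> (width - 1)) & 1
    return ((state << 1) & ((1 << width) - 1)) | msb


def first_bit_trace(clocks: int):
    """Value of the first bit of shift register #1 on each clock cycle."""
    state = 0b0000_0000_0000_0001          # power-on state
    trace = []
    for _ in range(clocks):
        trace.append(state & 1)
        state = rotate(state)
    return trace


trace = first_bit_trace(32)                # two time slots of 16 clocks each
print([n for n, bit in enumerate(trace) if bit])   # prints: [0, 16]
```

The first bit is asserted once every 16 clock cycles, that is, once at the beginning of each time slot.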
A timing diagram of the signal generator is shown in figure 21. The shift register #1
is driven by the clock signal, so the set bit shifts once every clock cycle. We use the
output of the shift register #1 to control the operation of the whole scheduler. In
addition, the signal A1 (1st bit of shift register #1) is used to drive the shift register #2,
which generates a signal for defeating unfairness. As shown in the diagram, A1 is
asserted at the beginning of each time slot. When A1 is asserted, it drives the shift







Signals shown in the timing diagram:
A1, A2, A3: 1st, 2nd and 3rd bits of shift register #1
B1, B2, B3: 1st, 2nd and 3rd bits of shift register #2
Figure 21 Timing diagram of signal generator.
Chapter Four
Physical Design of the Output Scheduler
In the previous chapters, we discussed the operation of the output scheduler in detail.
As we mentioned in the first chapter, this output scheduler is designed to the circuit
level, so in this chapter we will discuss some VLSI (Very Large Scale Integration)
circuit design issues.
We will discuss three topics in this chapter. First of all, the design methodology is
illustrated, which shows the design flow of this output scheduler. Secondly, we will
discuss some techniques used in VLSI design to achieve high speed,
comparatively low power dissipation and reasonably compact area. Then, we will
give the simulation results of this output scheduler.
4.1 The Design Methodology
The design of this output scheduler follows a top-down methodology, which is
widely used in modern ASIC (Application Specific Integrated Circuit) design [8]
[9]. Specifically, it includes the following steps:
High-level design specification and partition: According to the role of the
output scheduler to be played in an ATM switching system and the time





above, the output scheduler is divided into 4 basic functional blocks, namely
elementary scheduler, input status register, output status register and clock
generator.
Develop and validate functional models: In this design, the models of the
output scheduler are developed with VHDL (Very high speed integrated circuit
Hardware Description Language). We use VHDL to describe the behaviour of
each basic functional block and combine them together according to the
architecture defined in the last step. The models are simulated to make sure that
they can realise the functions that we expect. Basically, modelling has three
functions: firstly, it can verify the correctness of the design specification at a high
level of abstraction; secondly, the models can be used as the source for synthesis;
thirdly, the stimuli that will be used for functional verification can be generated
from the models.
Schematic design and simulation: In order to achieve the best performance, the
schematic of the scheduler is designed manually, instead of being synthesised
from the VHDL code. The schematic is simulated at the switch level to ensure
the correctness of its function. The schematic design cannot provide such
information as the RC delay due to the contacts and wires, so we cannot obtain
an accurate timing result from it. Therefore, switch-level simulation is carried out,
which ignores the non-linear effects of the transistor. It only shows the functional
validity. We can use the stimuli generated from the VHDL models to simulate the
circuit automatically.
Layout design and simulation: The layout is also drawn manually to achieve
high speed, low power and small area. The layout is designed on the basis of the





included all the delay information, we use Hspice to conduct full-analog
simulation, from which we can obtain accurate delay and power dissipation
estimates. As Hspice considers the non-linear effects in the operation of the
transistor, it is very accurate but expensive in calculation time. Therefore, it is
not feasible to simulate the whole chip or a very large-scale circuit. Usually,
Hspice is used to analyse and simulate the critical path of the circuit.
Functional Verification: In order to guarantee that the circuit can realise the
functions we expect, we use the switch-level simulator again to verify the
function of the circuits. Similarly, we can use the stimuli generated from the
VHDL code to verify the function.
Layout Verification: Layout verification includes two steps: LVS (Layout Versus
Schematic) and DRC (Design Rule Check). Since the layout is designed on the
basis of the schematic, the two should have identical netlists. Therefore,
we should compare the layout and the schematic. As we know, the final
fabrication of the circuit on the wafer relies on the layout of the circuit. For each
technology, there are some specific electronic rules for the circuit layout. If some
rules are violated, it may cause malfunctions of the circuits. Hence, a DRC
of the overall circuit is necessary.
Compared with the bottom-up design methodology, the most significant advantage
of top-down design is earlier error detection. As stated above, the design begins with
the high-level abstract modelling, from which we can detect errors before any
detailed design is begun. After we make sure that the models are correct, we step to




fabrication, the probability of error will be very low. Earlier error detection results in
the reduction of the development costs and increases the chance of first-pass success.
In this section, we introduced the methodology of ASIC design. With a general
understanding of the ASIC design flow, we will discuss some specific design topics in
the next section.
4.2 Techniques for High Performance Digital Design
In the second chapter, we discussed the operation of the output scheduler in the
domain of functional description. Referring to the design flow introduced in the last
section, we have finished the schematic design. In the next few sections, we will
discuss some issues on layout design, because the layout will decide the final
performance of the chip.
4.2.1 Design Specification
Before v/e start to discuss the particular design issues, we should specily some
parameters
The scheduler is driven by a clock whose frequency is 400 MHz. In other words, the
period of each clock cycle is 2.5 ns. This value follows from the design
objective. Recall that the output scheduler should be designed to support a
10 Gb/s/Channel switch matrix. Each ATM cell has 53 octets, namely 424
bits, so at 10 Gb/s it takes 42.4 ns to
switch an ATM cell. In order to support such a high-speed switch matrix, the
scheduler has to finish one scheduling process within 42.4 ns. We have defined the
time for switching one cell as one time slot. Therefore, for convenience we specify
each time slot to be 40 ns. As defined in the last chapter, the 16-bit output address,
schedule array and input address should be shifted into or out of the scheduler
serially within one time slot. Clearly, each time slot should include 16 clock cycles.
Therefore, the period of each clock cycle should be 2.5 ns. In order to simplify the
design, the scheduler is driven by a single-phase clock signal.
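The timing budget above can be checked with a few lines of arithmetic. The following is a minimal sketch; the constant and function names are illustrative and not part of the design itself.

```python
# Timing budget of the output scheduler, using the figures quoted in the text.
ATM_CELL_OCTETS = 53      # octets per ATM cell
LINE_RATE_BPS = 10e9      # 10 Gb/s per channel
SLOT_NS = 40.0            # time slot chosen for convenience
BITS_PER_ARRAY = 16       # bits shifted serially per time slot


def cell_time_ns(octets=ATM_CELL_OCTETS, rate_bps=LINE_RATE_BPS):
    """Time needed to switch one cell at the given line rate, in ns."""
    return octets * 8 / rate_bps * 1e9


def clock_period_ns(slot_ns=SLOT_NS, bits=BITS_PER_ARRAY):
    """Clock period needed to shift `bits` bits within one time slot."""
    return slot_ns / bits


def clock_frequency_mhz(period_ns):
    """Clock frequency in MHz for a period given in ns."""
    return 1e3 / period_ns


period = clock_period_ns()
print(f"cell time    : {cell_time_ns():.1f} ns")
print(f"clock period : {period} ns")
print(f"clock freq   : {clock_frequency_mhz(period)} MHz")
```

Choosing a 40 ns slot rather than the full 42.4 ns keeps the slot an exact multiple of the clock period and leaves a small margin.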
The target process technology of the output scheduler is the 0.25 µm TSMC CMOS
technology with 5 metal layers and single poly. The power supply is 2.5 V.
4.2.2 Design Requirements
Basically, the quality of a VLSI chip is decided by three factors, namely speed,
power dissipation and area. As in many other engineering fields, it is impossible
to achieve the best performance on every factor, because some factors always
conflict with others. Therefore, an optimal design is a compromise between those
factors.
For example, suppose we are designing the CPU for a PC. For a PC, we need not be
too concerned about low power and we can install a fan to cool down the processor.
Moreover, we have sufficient room for a processor. Therefore, we can trade
power dissipation and area for speed. On the other hand, if we are designing a
CPU for a portable computer whose power supply comes from a battery, we have
to minimise the power dissipation of the processor so as to increase the lifetime of
the battery. In addition, portability requirements dictate that the processor be
designed as compactly as possible. Hence, the power dissipation and area are the key
factors to be satisfied.
As mentioned above, the objective of this design is to support a switch matrix that
works at a speed of 10 Gb/s/Channel. Consequently, the scheduler should finish a
scheduling process within 40 ns. In each scheduling process, the output status array
should ripple through all the 16 elementary schedulers in one column. This is a real
challenge. Therefore, speed is the key factor to be considered in the design
process. On the premise of a satisfactory speed, we can try to minimise the power
dissipation and area. All the design decisions are based on this basic
principle.
4.2.3 Design for High Speed
As stated above, the output scheduler should finish one scheduling process within
40 ns and the requests and the scheduling results have to be shifted into or out of the
scheduler within one time slot. All of this requires that the circuit be designed
with much care. In this section, we will discuss some techniques used in the layout
design of the scheduler.
4.2.3.1 Floorplanning
Floorplanning is the exercise of arranging blocks of layout within a chip so as to
minimise the area and/or maximise the speed. Many detailed design decisions are
closely related to the high-level topology of the functional blocks, so we discuss the
floorplan of the scheduler first. A diagram of the scheduler floorplan is shown in
figure 22. In the centre of the scheduler is the clock generator. There is one input
status register in each row of elementary schedulers and the register lies in the
middle of this row. The shift register for output addresses and the decoder for
priority information are integrated into the input status register. There is one output
status register in each column of elementary schedulers and the registers are placed
in the middle of each column. As shown in the diagram, the input status registers and
output status registers form a cross that divides the 16x16 elementary scheduler
array into four parts. Each part consists of an 8x8 array of elementary schedulers.
Placing the input status registers and the output status registers in the middle
of the row or column where they lie makes the structure of the output scheduler more
symmetric. Since the input status register and output status register exchange
information with the elementary schedulers frequently, this symmetric structure
helps to reduce the RC delay from the wiring. It is also advantageous for power
conservation (we will come back to this point later).
The reason for putting the clock generator in the centre of the output scheduler is to
simplify the clock distribution. As stated above, the clock generator takes charge of
































Figure 22 Floorplan of elementary scheduler
functional block needs some assertion pulses or clock signals and all of these signals
come from the clock generator. Clearly, when we put the clock generator in the
centre of the chip, the assertion pulses and clock signals can reach each functional
block with the minimum distance and best symmetry. This also helps to minimise
the skew. We will discuss this topic in the next section.
4.2.3.2 Clock Distribution and Skew
In the output scheduler, both the input and output status registers are sequential
circuits. They consist of a large number of registers and latches. All of them are
driven by the clock signal or an assertion pulse. This huge fan-out acts as a large
capacitive load on the clock signals and assertion pulses. The load is further
increased by the capacitance of the wire itself, which is distributed from the clock
generator to each corner of the chip. Therefore, we need sufficient buffers to amplify
the signals and drive them into the functional blocks.
A closely related problem is skew. Some wires for clock distribution may reach a
length of centimetres. Such long clock wires introduce a substantial series resistance,
even if we use a metal layer. A clock line thus behaves as a distributed RC line. As
the delay of an RC line is a function of its length, the flip-flops connected to the same
clock signal may observe different transition times due to their different distances
from the driver. This effect is clock skew [9]. Skew can severely affect the
performance of a sequential circuit. In an actual design, it is very difficult to make
the skew zero. The most important issue is to limit the skew to a level that the circuit
can tolerate. As the output scheduler works at a very high speed, it is very
sensitive to skew. Therefore, we have to try to minimise the skew.
A practical way to solve the skew problem is to route the clock signal carefully and
use a hierarchical clock-buffering scheme. Figure 23 shows the structure of the buffer
tree that is used in the output scheduler. Note that the clock signal is driven by the
buffers stage by stage from the signal source to the final circuit. Specifically, the signal
is amplified to drive the buffers in each column. The buffers in each column drive
the signal to the buffers in each elementary scheduler. The buffers in each
elementary scheduler drive the signal to each flip-flop. Clearly, this approach does
not result in zero skew, but it decreases the skew substantially. The reason is that
the intermediate buffers isolate the local clock nets from upstream load impedances
and amplify the clock signals degraded by the RC network. Therefore, the skew is
decreased and the signal slope is kept steep.
4.2.3.3 Critical Path Analysis and Optimisation
In a circuit, there are a large number of paths and each path has a characteristic
delay. When we talk about the delay of a circuit, what we refer to is the
maximum delay over all these paths. The path with the maximum delay is called the
critical path. The critical path determines the speed of the whole circuit.
Therefore, we should analyse the circuit to find the critical path and optimise it to




Figure 23 Structure of buffer distribution
Identifying the Critical Path
In the last chapter, we noted that in the scheduling process the output status
array ripples through the 16 elementary schedulers in a column and is compared with the
16 input status arrays. The input status arrays and the priority information are sent to
each elementary scheduler at the beginning of each time slot, while the output status
array goes through the elementary schedulers one by one. This implies that the input
status arrays and the priority information will not add any delay, as they are always
waiting for the output status array. Thus, we need only analyse the flow of the
output status array to look for the critical path. After the status arrays are compared
in an elementary scheduler, the elementary scheduler generates a schedule array and
an updated output status array. The generated schedule array is logically ORed with
the other schedule arrays that are generated by the elementary schedulers in the same
row to produce a final schedule array, and then the operation in this path is finished.
However, the updated output status array will be compared with other input status
arrays in the subsequent elementary schedulers of this column. Clearly, the critical
path will be one of the paths in the flow of the output status array.
A column of elementary schedulers is shown in figure 24. We assume that the
comparison begins from the top elementary scheduler and finishes at the bottom
elementary scheduler. As we have discussed, in each elementary scheduler the input
and output status arrays are compared from the #1 unit to the #16 unit. In the
column, the output status array is compared with the input status arrays from the top












that the status arrays appear on the #1 unit of the top elementary scheduler to the
moment that the #16 unit in the bottom elementary scheduler outputs the output
status.
Now let us study the delay quantitatively. Each comparison unit receives the output
status from the unit corresponding to the same time slot in the upper elementary
scheduler. The unit compares the received output status array with the waiting input
status array, then sends the updated output status array to the subsequent unit. We
define this delay as t1. Each unit also receives the interface signal from the unit in the
same elementary scheduler corresponding to the preceding time slot. The unit
modifies this interface signal and passes it to its subsequent unit. This delay is
defined as t2. Referring to figure 24, we note that there are a large number of paths
from the #1 unit in the top elementary scheduler to the #16 unit in the bottom
elementary scheduler. With a careful analysis, we find that the delay is identical for
all paths. The total delay is
$\text{Delay} = 15 \times t_1 + 15 \times t_2$
For example, in Path 1 the delay is 15 × t1 + 15 × t2; in Path 2 the delay is
2 × t2 + 15 × t1 + 13 × t2; in Path 3 the delay is 15 × t2 + 15 × t1. All of these
paths result in the same delay. Therefore, we can conclude that any path between the
#1 unit in the top elementary scheduler and the #16 unit in the bottom elementary
scheduler can be the critical path. The final delay is decided by the delay from each

























Optimising the Critical Path
Basically, we have two ways to optimise the critical path: one is to use lookahead
among the comparison units in the elementary scheduler so as to minimise the delay;
the other is by careful layout design. In this section, we focus on the layout design
techniques, so we leave the first way to be discussed in a later section.
According to the analysis in the last section, the delay of the critical path is
determined by the delay from updating the output status (t1) and the delay from the
interface circuit (t2). As the total delay is 15 times (t1 + t2), any small reduction in t1
or t2 gives a significant reduction in the total delay.
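The claim that every path has the same delay can be checked numerically. The sketch below models the column as a 16 x 16 grid of comparison units, where a step to the same unit in the next elementary scheduler costs t1 and a step to the next unit in the same scheduler costs t2. The grid model and the picosecond values are assumptions made for illustration; they are not measured values from the design.

```python
def path_delay_bounds(t1, t2, n_schedulers=16, n_units=16):
    """Return (min, max) delay over all paths from unit #1 of the top
    elementary scheduler to unit #16 of the bottom one."""
    best = [[0] * n_units for _ in range(n_schedulers)]
    worst = [[0] * n_units for _ in range(n_schedulers)]
    for row in range(n_schedulers):
        for unit in range(n_units):
            if row == 0 and unit == 0:
                continue  # starting point, zero accumulated delay
            lo, hi = [], []
            if row > 0:   # arrive from the same unit in the scheduler above
                lo.append(best[row - 1][unit] + t1)
                hi.append(worst[row - 1][unit] + t1)
            if unit > 0:  # arrive from the preceding unit in this scheduler
                lo.append(best[row][unit - 1] + t2)
                hi.append(worst[row][unit - 1] + t2)
            best[row][unit] = min(lo)
            worst[row][unit] = max(hi)
    return best[-1][-1], worst[-1][-1]


# Hypothetical unit delays in picoseconds (integers keep the sums exact).
T1_PS, T2_PS = 700, 600
fastest, slowest = path_delay_bounds(T1_PS, T2_PS)
print(fastest, slowest, 15 * (T1_PS + T2_PS))
```

Because the shortest and longest paths come out equal, shaving a few picoseconds off t1 or t2 is multiplied by 15 in the total, which is why the layout effort concentrates on these two delays.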
Design Priority
Recall that in each comparison unit, there are three paths: interfacing with the
subsequent unit, updating the output status and generating the schedule result. The
former two paths are in the critical path, so we should give them a higher priority in
the layout design to minimise their delay and let the path that generates the
schedule array suffer more delay. This idea is mainly reflected in the placement
of the layout and the selection of the layers used for interconnection.
For example, suppose we want to connect two gates that are in the critical path, but there are
some other gates physically between them and the layout of these gates is not "metal
transparent". As we know, the metal layer has a very small resistance, so it is the
most suitable layer for connection. However, as in this case the metal cannot go
through the circuit between the two gates, we cannot use it. The poly layer can do
that, but compared with the metal layer the poly layer has a very large resistance and
consequently it will cause a large RC delay. On such occasions, we should move the
gates between the two gates to somewhere else and place the two gates in the
critical path as close as possible to each other so that they can be connected with the
metal layer. This may result in more delay for the gates that were displaced.
Distinguishing the fast gate and slow gate
Usually, for a logic gate with several inputs, the signals do not arrive at the
inputs simultaneously. Some signals arrive earlier and some arrive later. We can
take a two-input NAND gate as an example, which is shown in figure 25. Note that
there are two n-transistors connected in series. Assume that both inputs are
initialised to 0. If input in0 receives a logical 1 first, it will turn on the n-transistor
connected to it. At this moment, in1 is still 0, so no path to GND is developed
and out is still 1. At some later moment, in1 receives a 1, and its connected
n-transistor is turned on. Since both n-transistors are now turned on, out is discharged
to 0. In this case, the current has to go through two n-transistors to discharge
out when in1 receives a 1. If the input in1 receives the 1 first, it turns on its
connected n-transistor. Note that as soon as this transistor is turned on, it will
discharge the point a to 0. Thus, when the input in0 receives a 1 and turns on the
second n-transistor, the current only goes through one transistor to discharge the out
node. This is certainly faster than the first case. Therefore, we define in0 as the fast














The signals in the critical path usually arrive later than the other signals in the same
logic gate. Therefore, we always connect the signals in the critical path to the fast
gates. Although the improvement from each gate is very small, it still contributes to
reducing the delay.
4.2.3.4 General Techniques to decrease delay
Some other techniques are widely used in the layout design of the output scheduler.
Carefully sizing the transistors
As stated above, this output scheduler is designed manually. A significant advantage
of custom design is that the size of each transistor can be selected flexibly according
to its fan-out. Thus, we can use large transistors to drive large loads, so as to achieve
high speed.
Stage Ratio
In some cases, the buffers have to drive a very large capacitance, such as a bus or an
off-chip capacitive load. A small buffer takes a long time to charge it, so a large
buffer is needed. In order to achieve the best speed, we can use a chain of inverters
where each successive inverter is made larger than the previous one until the last
inverter in the chain drives the large load. The ratio by which each stage is increased
in size is called the stage ratio. It has been shown that when the stage ratio is in the
PMOS transistor, its voltage rises from 0 to Vdd, and a certain amount of energy is
drawn from the power supply. During the high-to-low transition, the capacitor is
discharged, and the stored energy is dissipated in the NMOS transistor. We can
compare this with dynamic logic. The power consumption in a dynamic network is
solely determined by the signal-value probabilities, not by the transition
probabilities. In other words, in dynamic logic we always have to precharge the
gates whether there is a transition or not. Consequently, dynamic logic
consumes more power than static CMOS [9].
4.2.4.2 Reducing the Effective Capacitance
The dynamic power consumption of static CMOS can be expressed with the following
equation
$P = C V^2 f$
where C represents the load capacitance, V represents the voltage of the power supply
and f is the frequency at which the gate is switched. For a particular technology, the
voltage of the power supply is fixed, and decreasing the voltage may affect the noise
margin and the circuit speed. With the advance of technology, smaller
propagation delays are becoming achievable. Consequently the switching frequency f
is increasing. In order to reach very high speed, we cannot reduce the frequency f to
lower the power consumption. Therefore, the most effective way to reduce power
dissipation is by reducing the capacitance.
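To make the dependence on capacitance concrete, the relation P = C V^2 f can be evaluated directly. This is a minimal sketch; the 1 pF load is a hypothetical value chosen for illustration, while the 2.5 V supply and 400 MHz clock are the figures given in the design specification.

```python
def dynamic_power_w(c_farads, vdd_volts, f_hz):
    """Dynamic power of a static CMOS node: P = C * V^2 * f."""
    return c_farads * vdd_volts ** 2 * f_hz


VDD = 2.5          # supply voltage from the design specification, in volts
F_CLK = 400e6      # clock frequency from the design specification, in Hz
C_LOAD = 1e-12     # hypothetical 1 pF switched capacitance

p_full = dynamic_power_w(C_LOAD, VDD, F_CLK)
p_half = dynamic_power_w(C_LOAD / 2, VDD, F_CLK)
print(f"1.0 pF load: {p_full * 1e3:.3f} mW")
print(f"0.5 pF load: {p_half * 1e3:.3f} mW")
```

With V and f fixed by the technology and the speed target, power scales linearly with the switched capacitance, so halving the load halves the dynamic power.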
Carefully sizing the transistors
In the previous section we discussed how the transistor size affects the
switching speed. Since each switching operation of combinational static CMOS
is actually charging or discharging a capacitor, it is clear that the smaller the
capacitor charged, the less power is consumed. To that end, each gate in the
output scheduler is carefully sized to be as small as possible. When all the transistors
are designed to be the minimum size, the power dissipation is minimised. However,
that will reduce the speed of the circuit. Therefore, when a gate is required to drive a
large capacitance, its transistors must be sized up.
Carefully sizing the wires
As we know, in CMOS technology, delay is basically caused by charging or
discharging capacitance. In addition to the capacitance that comes from transistors, long
wires are another major source of capacitance. We should carefully size the width of
long wires or wires connected to a large load. We seek to make the wire as
narrow as possible. However, if the wire is connected to a large capacitive load or the
wire itself is very long, we need a large buffer to drive it to achieve high speed. A
large buffer implies a large current. Metal has a limit on current density
(usually 0.4 mA/µm to 1.0 mA/µm). If the current density of a current-carrying
conductor exceeds the threshold value, the conductor atoms will migrate along the
electron flow (electromigration), and the conductor may eventually break like a fuse. If we simply select
a very wide metal as the wire, it will ...
power. Therefore, we should estimate the current needed for charging the capacitor
and decide the width of the metal accordingly.
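The width estimate described above amounts to dividing the expected current by the permitted current density. The sketch below uses the 0.4 mA/µm and 1.0 mA/µm limits quoted in the text; the 10 mA wire current is a hypothetical value, and the function name is illustrative.

```python
def min_wire_width_um(current_ma, j_max_ma_per_um):
    """Smallest metal width (in µm) that keeps the current density
    at or below the electromigration limit."""
    if j_max_ma_per_um <= 0:
        raise ValueError("current density limit must be positive")
    return current_ma / j_max_ma_per_um


WIRE_CURRENT_MA = 10.0   # hypothetical current carried by the wire

for limit in (0.4, 1.0):  # limits quoted in the text, in mA/µm
    width = min_wire_width_um(WIRE_CURRENT_MA, limit)
    print(f"limit {limit} mA/µm -> width >= {width:.1f} µm")
```

The result is a lower bound on the width; anything wider only adds capacitance, which is why the text recommends sizing the wire to the estimated current rather than simply making it very wide.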
Avoid the extensive sharing of the bus
Another approach to reducing the physical capacitance is to avoid the extensive
sharing of the bus. In order to illustrate this point, let us analyse an example. Recall
that in each row of elementary schedulers we use discharged buses to realise the
logical OR of the sixteen schedule arrays from the elementary schedulers, and the final
result is sent to the input status register. Clearly, this bus has to be as long as the
width of the output scheduler so as to connect all sixteen elementary schedulers.
No matter which elementary scheduler makes a schedule, it has to charge the whole
bus so that the input status register can sense a 1. The diagram is shown at the top of
figure 26. Charging such a long bus not only takes time, but also wastes power.
In order to avoid this drawback, we can divide the bus into two parts, each of
which is connected to an OR gate (refer to the diagram at the bottom of figure 26).
Each bus is still discharged to act as the logical OR of the schedule arrays from its
connected elementary schedulers. In effect, the OR gate realises the logical
OR of the results of the two buses, each of which is the logical OR of 8 schedule arrays. Clearly, the
divided buses have an identical function to that of a single bus, but each buffer
only has to drive half the load of the undivided bus. Therefore, this approach











Figure 26 Avoid extensive bus sharing
4.2.5 Techniques for reducing the area
Usually, the area is the last factor to be considered in high performance circuit
design and we often trade area for high speed and low power dissipation, but
we can also reduce the area by careful layout. Clearly, optimally sizing the transistors
is an effective way to achieve a small area. In addition to that, some other techniques
are used.
One approach is to carefully design the wires. In modern CMOS technology
multiple metal layers are employed so that the wires can be routed above the circuits,
which helps to reduce the area. In a very complex or very compact design, sometimes
the available metal layers are not enough to route all the wires above the circuit; in
this case, we have to place the wires in the spare places, which takes more area.
In the layout design of the output scheduler, we need many wires to connect the
subcells. For example, we need wires to connect the elementary schedulers in a
column for the output status array; we need wires between the elementary schedulers
and the input status register for the priority information, input status array and schedule
array. In order to minimise the area, we aim to place these wires above the circuit
layout. To that end, the widths of the wires are calculated according to their
current load, the space between wires is minimised and the placement and direction
of the wires are well organised. All of these techniques ensure that most of the wires
in the output scheduler are routed above the circuits.
Another approach is to divide the large buffer into a number of smaller parallel
buffers ...
capacitive load. Referring to figure 22, we note that all of the layout is organised into
many slices. The width of a slice may not be sufficient to place a very large buffer,
so we divide the large buffer into a number of smaller buffers whose size is suitable
for the height of the slice. Thus, no matter how large the buffers are, they can always
be placed into the required dimension. Clearly, dividing large buffers into smaller
ones cannot reduce their absolute area, but it makes the subcells of the layout very
regular. Cells with regular dimensions and similar sizes help to minimise the
area.
4.2.6 I/O System Design
In this subsection, we will discuss another factor that significantly affects integrated
circuit performance, the I/O system.
Pads are the interfaces between the chip and the outside world. In the output
scheduler four types of pads are used, namely input pads, output pads, power pads
and pad ring pads. The input and output pads are employed to exchange signals with
other chips; the power pads are used to provide the power supply for the circuits; the
pad ring pads are used to supply the power for the input and output pads.
Figure 27 shows the topology of the pads for the output scheduler. As we have
mentioned above, there are 32 inputs and 32 outputs. Therefore, we need 32 input
pads and 32 output pads. On the top of the chip, there are 16 output pads for input










































Figure 27 Placement of Pads
evenly on the right and left sides as well (see figure 27). In addition to those pads, an
input pad is used for the clock input, which is placed at the bottom of the chip. There
are eight power pads for VDD and eight power pads for GND. Moreover, we note
that at each corner and in the centre of the top of the chip there are five pad ring
pads in total. Each pad ring pad includes both a VDD pad and a GND pad.
Note that between the circuits and the pads, there are two pad rings. These are
two metal rings that provide the power for all the input and output pads. One of the
pad rings is connected to VDD, and the other is connected to GND. All the input and
output pads are powered from the pad rings. The pad rings are connected to the pad
ring pads. The reason for using a separate power supply for the pads is to reduce the
noise. As we know, the output pads are required to drive large capacitors, and
provide high current drive for short periods. This may cause power bounce. If we used the
same power supply as the circuits, these currents might flow through the internal circuitry,
causing power and ground bounce. Moreover, note that one pad ring pad is put in the
centre of the top of the chip. The output pads need a high current capability to drive
the off-chip capacitance. Typically, every pad ring pad used in this output scheduler
can only provide enough power for eight output pads, so one pad ring pad is put
there. On the left and right sides, there are only eight output pads and sixteen input
pads that need only drive a small capacitance and sink little power, so the pad ring
pads at the corners are enough to provide them with power.
Note that we use eight pads for VDD and eight pads for GND. The number of the












































































































































































































































































































[Figure: transient current waveform; horizontal axis: time (linear), 0 ns to 70 ns]
column of elementary schedulers. Note that the maximum current is about 125 mA.
The overall circuit includes seventeen such units, so the overall maximum current
drawn from VDD is about 2.125 A. The maximum current that each power pad used
here can provide is about 0.5 A. Thus, we need only five power pads to provide
sufficient power for the chip.
However, we note that such a large current is drawn from the power supply within a
very short period. The change of voltage caused by the inductance of the bond wire
is
$\Delta V = L \, \frac{di}{dt}$
where di/dt is the rate of current change with respect to time. A rapid change of
current results in a large voltage change, possibly enough to cause the state of the
circuits to be changed. In order to avoid that, we have to provide enough power pads
to share the large current so that in each pad the di/dt is reduced and the voltage
change is limited to a tolerable range.
The number of pads can be calculated according to the transient current waveform in
figure 28. The maximum di/dt for the simulated part of the circuit is about 2.3 × 10^8
A/s and a typical value for the inductance of a bond wire is about 1 nH. Thus, the voltage
change can be estimated from the above equation as 0.23 V. That means that if we use one
pad to provide the power to this part of the circuit, there is a maximum voltage
swing of 0.23 V. Since we are using a 2.5 V power supply and complementary logic
with a gate threshold around 1.25 V, the permissible maximum ground and power
bounce is 0.5 V. Hence we can use a single pad to provide power for up to twice the
amount of circuit used in this example. The overall circuit contains
seventeen such parts, so we use eight power pads to provide the power to the total
circuit and so ensure that the ground and power bounce is no more than 0.5 V. That is the
reason why we use eight VDD pads and eight GND pads.
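Both pad-count estimates can be reproduced with the numbers quoted above. This is a minimal sketch that assumes the total current and its rate of change divide evenly among the pads; the function names are illustrative.

```python
import math


def pads_for_current(total_current_a, per_pad_a):
    """Pads needed so that no pad exceeds its current rating."""
    return math.ceil(total_current_a / per_pad_a)


def pads_for_bounce(n_parts, didt_per_part, l_bond_h, max_bounce_v):
    """Pads needed so that L * di/dt per pad stays within max_bounce_v,
    assuming di/dt is shared equally among the pads."""
    single_pad_bounce_v = n_parts * l_bond_h * didt_per_part
    return math.ceil(single_pad_bounce_v / max_bounce_v)


N_PARTS = 17          # simulated parts in the overall circuit
I_PEAK_A = 0.125      # peak current of one part
PAD_LIMIT_A = 0.5     # current one power pad can provide
DIDT = 2.3e8          # maximum di/dt of one part, in A/s
L_BOND = 1e-9         # typical bond wire inductance, in H
MAX_BOUNCE_V = 0.5    # permissible power or ground bounce

print(pads_for_current(N_PARTS * I_PEAK_A, PAD_LIMIT_A))
print(pads_for_bounce(N_PARTS, DIDT, L_BOND, MAX_BOUNCE_V))
```

The bounce constraint is the tighter of the two, which is consistent with the choice of eight pads per supply rail rather than the five that the current rating alone would require.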
4.2.7 Power distribution
Finally, we will discuss power distribution in the output scheduler. In the design of
the power distribution system of the output scheduler, two main factors are
considered: one is the IR voltage drop and the other is noise.
As shown above, the output scheduler requires a high instantaneous power handling
capability at the beginning of each time slot, so IR voltage drops along the power
lines should be considered. The IR voltage drops degrade the circuit's noise margin
and make the circuit less reliable. This becomes worse when the supply voltage is
scaled down, because the magnitude of the voltage drop that can be tolerated is
even smaller. An effective way to reduce the IR voltage drop is to reduce the
resistance of the power line. In the design of the output scheduler, we use the topmost
and thickest metal level (Metal 5), which has a smaller sheet resistance than
the other metal layers, to distribute the power. In addition, since this layer is used
solely for power distribution, we can make it wide enough to reduce the resistance to
an acceptable level.
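The effect of widening a power line can be estimated from the standard sheet-resistance relation R = Rs × (length / width). The sketch below is illustrative only: the sheet resistance, line length and widths are hypothetical values, not figures from the TSMC process or from this layout; only the 125 mA peak current comes from the text.

```python
def line_resistance_ohm(sheet_res_ohm_sq, length_um, width_um):
    """Resistance of a rectangular wire: Rs times the number of squares."""
    return sheet_res_ohm_sq * length_um / width_um


def ir_drop_v(current_a, sheet_res_ohm_sq, length_um, width_um):
    """Voltage drop along the wire for a given current."""
    return current_a * line_resistance_ohm(sheet_res_ohm_sq, length_um, width_um)


RS_TOP_METAL = 0.03   # hypothetical sheet resistance, ohm per square
LENGTH_UM = 2000.0    # hypothetical power line length
I_PEAK_A = 0.125      # peak current of one column, from the text

for width_um in (20.0, 40.0):  # hypothetical widths
    drop = ir_drop_v(I_PEAK_A, RS_TOP_METAL, LENGTH_UM, width_um)
    print(f"width {width_um:.0f} µm -> IR drop {drop * 1e3:.1f} mV")
```

Doubling the width halves the number of squares and therefore the drop, which is the reason for reserving a whole metal layer for power.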
Another factor that may affect the performance of the circuit is the noise on the
power supply. We have illustrated that in order to reduce the noise on the power supply
we use power pads ...
power supply for the pads so that no power or ground bounce caused by the pads can
flow into the internal circuits.
As we have discussed, each input status register shifts the schedule out of the
chip at a frequency of 400 MHz, which is power hungry. In addition, each input
status register drives the output address, priority information and input status
array into all the elementary schedulers in a row, so it contains a number of large
buffers. Therefore, the input status register will consume much more power than an
elementary scheduler. Moreover, most of the power is drawn within a very short
period, such as at the beginning of each time slot, so large power and ground bounce
may happen. In order to reduce that, we provide sufficient capacitance between the
power line and the substrate. When a power or ground bounce occurs, the connected
bypass capacitor will be charged or discharged, which reduces the magnitude of the
power and ground bounce. The capacitors are placed close to the large buffers so that
charging and discharging may be carried out effectively. Thus, the noise on the
power supply is reduced significantly.
In this section, we mentioned that in order to improve the performance, many design
parameters are selected on the basis of estimation. All the subcells of the circuits
were simulated with Hspice to verify the correctness of the estimation. If an estimated
parameter is not good enough, it is modified according to the simulation results.
4.3 The Simulation Result of the Output Scheduler
In the last few sections, we have addressed the techniques used to improve the
performance of the chip. In this section, we give the simulation results for the circuit.
4.3.1 Delay and Power Dissipation
As we have discussed above, because the speed of a circuit is decided by its critical
path, we need only simulate the delay of the critical path. According to the
previous explanation, the delay of the critical path runs from the
moment that the output status register sends the output status array to the moment
that the updated output status array appears on the input of the output status register.
4.3.1.1 Simulation Environment
The delay of this critical path is simulated with Hspice. As we have mentioned,
Hspice is suitable for accurate simulation of small circuits. We cannot simulate this
critical path in the environment of the whole output scheduler. Therefore, only a
column of elementary schedulers is simulated. Fortunately, this is enough to reflect
the delay of the output scheduler. The entire power dissipation of the output
scheduler can be estimated from the simulated part of the circuit.
Another point that should be mentioned is the selection of simulation parameters.
• Simulation model: As we know, the circuit will be fabricated on a wafer. The
properties of the wafer may vary ...
we use a different model to simulate it. Basically, this consists of using models
such as the typical model, fast NMOS and fast PMOS, fast NMOS and slow PMOS,
slow NMOS and fast PMOS, and finally slow NMOS and slow PMOS. Since
we are simulating a static digital circuit that is not very sensitive to the
environment, the simulation is simply carried out with the typical model.
• Temperature: Usually, we have three choices for the temperature, namely 25°C,
75°C or 100°C, of which 25°C is the best case, 75°C is typical and 100°C is the
worst case. The scheduler is simulated at 100°C.
• Processing Technology: The technology for fabricating this scheduler is the TSMC
0.25 µm technology.
4.3.1.2 Selection of the Stimuli
Recall that each elementary scheduler needs the following signals to work: the input
status array, priority information, an enable signal and the output status array. Since
the purpose of this simulation is to find the maximum delay of the critical path,
the stimuli must be selected carefully so as to produce the worst-case delay of the
scheduling.
The input status array and output status array are both set to
"0000 0000 0000 0000". This value ensures that the elementary schedulers have
sufficient scheduling resources. The priority information is set to
"0000 0000 0000 0000" and all the 16 elementary schedulers in the column are









The delay of the critical path can be deduced from figure 29. The figure shows
two voltage curves: the enable signal (dotted line) and the 16th bit of the updated output
status array (solid line), which marks the end of the critical path. As shown in the figure,
from 0 ns to 40 ns the enable signal is 1, which means that in the first time slot all the
elementary schedulers are enabled. The solid line shows that the 16th bit of the updated
output status array turns to 1 after 20.8 ns. That means the output scheduler can finish
one scheduling pass in roughly half of the time available to support
10 Gb/s/channel. This ensures that the scheduler can potentially support a switch
matrix with even higher speed.
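The margin quoted above can be checked with a few lines of Python. This is only an arithmetic sketch: the 40 ns time slot and the 20.8 ns delay are the values quoted in the text, while the variable names are illustrative and not part of the design.

```python
# Timing margin of the output scheduler, using the values quoted in the text.
TIME_SLOT_NS = 40.0       # duration of one time slot (the scheduling budget)
CRITICAL_PATH_NS = 20.8   # simulated worst-case delay of the critical path

utilisation = CRITICAL_PATH_NS / TIME_SLOT_NS   # fraction of the slot consumed
slack_ns = TIME_SLOT_NS - CRITICAL_PATH_NS      # time left over in each slot

print(f"utilisation = {utilisation:.2f}, slack = {slack_ns:.1f} ns")
```

A utilisation close to one half is what allows the scheduler to be paired with a faster switch matrix without redesign.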
As stated above, the critical path delay consists of two parts: one comes from the
blocking circuits in the elementary scheduler, the other is the delay between
elementary schedulers. From figure 29, we can find how much each delay is. At 40 ns,
the enable signal turns to 0, which turns off the elementary scheduler. Since the
scheduler is turned off, the last bit of the output status array should keep its previous
value 0, instead of being updated. Therefore, the 16th bit of the output status array
turns to 0 after the enable signal turns to 0. The enable signal goes through all
the blocking circuits in an elementary scheduler and finally reaches the 16th unit. The
figure shows that the delay between response of the 16th bit of the output status array



























































































































































































blocking circuit. The time that the output status array takes to ripple through all
16 elementary schedulers should then be 13.0 ns.
The overall power dissipation of the output scheduler can be estimated with the
following equation:
P = I × V
where I is the average current of the output scheduler and V is the voltage of the
power supply. The current is 314 mA and V is equal to 2.5 V. Therefore, the estimated
power dissipation is 0.785 W. The estimated power dissipation of the pads is about
0.1 W. This power dissipation is reasonably small, and it should not cause any
trouble in packaging.
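The estimate can be reproduced with the short sketch below. The current, voltage and pad power are the figures quoted above; adding the core and pad power into a single total is our own simplifying assumption, and the variable names are illustrative.

```python
# Power estimate P = I * V, using the figures quoted in the text.
AVERAGE_CURRENT_A = 0.314   # average supply current of the scheduler (314 mA)
SUPPLY_VOLTAGE_V = 2.5      # power supply voltage
PAD_POWER_W = 0.1           # estimated power dissipation of the pads

core_power_w = AVERAGE_CURRENT_A * SUPPLY_VOLTAGE_V
# Assumption: core and pad contributions simply add up to the chip total.
total_power_w = core_power_w + PAD_POWER_W

print(f"core = {core_power_w:.3f} W, total = {total_power_w:.3f} W")
```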
4.3.2 Size and Area
The whole output scheduler consists of about 600,000 transistors. The dimension of
the circuit, not including the pads, is 2.81 mm x 1.75 mm, so the area of the circuit
is 4.9 mm². Given the size of the circuit, this is a very compact design. The area of
the chip including the pads can be calculated from figure 26. According to current
technology, the minimum centre-to-centre distance between pads is about 150 µm.
Vertically, there are 27 pads and horizontally there are 20 pads. Thus, its




In this section, we will discuss three approaches that can further improve the
performance of this ATM switch.
5.1 Speed-up Two
At the architectural level, Sarkies and Main [6] showed that the switch performance
is improved if the switch has an internal speed-up of two. With a speed-up of two,
two cells can be scheduled into each time slot and two cells are switched out
simultaneously through the switch. This makes the switch work more like an ideal
output buffered switch with all its advantages.
The diagram of an ATM switch with a speed-up of two is shown in figure 30. Note
that each input port controller (IPC) sends two cells to the switch matrix at the same
time. Therefore, a 16 x 16 switch matrix should be used for an 8 x 8 ATM switch.
As in the switch without speed-up two, each input port controller sends an
output address and priority information to the output scheduler. However, since two
cells can be scheduled in one time slot, the scheduler needs two schedules to describe
the scheduling results for each input port controller (see figure 30). In addition, the
output scheduler generates eight pairs of output addresses for the switch matrix.
Figure 30 ATM switch with a speed-up of two (8 input ports with IPCs, 8 output ports with OPCs, priority information)










In order to support the speed-up, the internal structure of the output scheduler should
also be changed accordingly. Since speed-up two permits two cells to be scheduled
into one time slot, we need two input status arrays (A and B) and two output status
arrays (A and B) to describe the states of the input and output ports. Clearly, for each
time slot there are four possible combinations of scheduling status, namely AA, AB,
BA and BB. Therefore, we need four elementary schedulers for each input-output pair to
make a schedule for it. We define these four as an elementary scheduler group. A
diagram of an elementary scheduler group and its corresponding input and output
status registers is shown in figure 31. The comparison unit and the schedule register
in each elementary scheduler are identical to those described in the previous chapters.
Each elementary scheduler in a group still receives the input status array directly
from the input status register. Each elementary scheduler also receives the output
status array from its upper elementary scheduler. The input status register collects a
logical OR of the schedule arrays from each elementary scheduler.
The difference is that each elementary scheduler employs a signal to interface with
the other elementary schedulers in the group (see figure 31). This signal is used to prevent
one cell from being scheduled into the same time slot multiple times. For example, if
the input status and output status are all 00, both elementary schedulers AB and AA
may make a schedule. If this happens, the cell is scheduled twice into the same time
slot. This wastes scheduling resources and may cause a problem in switching.
Therefore, we need some interface signals to make sure that only one of the four
elementary schedulers is enabled in each time slot. In other words, the scheduling

































schedule is found in an elementary scheduler, it will disable all its subsequent
elementary schedulers. If not, it will inform its subsequent elementary scheduler to
go on scheduling. The sequence is arbitrarily selected as AA, BA, AB and BB. The
enable signal of elementary scheduler AA is connected to the group enable. That
means that if this group is selected, the elementary scheduler AA will be enabled and the
status in input status register A and output status register A will be compared within
elementary scheduler AA. If no schedule is made in AA, BA is enabled, then AB is
turned on and so on. If a schedule is made in AA, all the subsequent elementary
schedulers will remain disabled.
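The chained enable inside one elementary scheduler group can be illustrated with a small behavioural sketch in Python. It is not the circuit itself: it looks at a single time slot only, and it assumes that in a name such as "BA" the first letter selects the input status register and the second the output status register, and that a status bit of 0 means the slot is free. The function name and the dictionary representation are hypothetical.

```python
from typing import Dict, Optional

# Enable order within a group, as stated in the text.
GROUP_ORDER = ("AA", "BA", "AB", "BB")


def schedule_group(input_status: Dict[str, int],
                   output_status: Dict[str, int],
                   group_enable: bool = True) -> Optional[str]:
    """Return the elementary scheduler that makes the schedule, or None.

    Each dictionary maps a register name ("A" or "B") to the status bit of the
    time slot under consideration: 0 = free, 1 = already reserved.
    Assumption: the first letter of a scheduler name selects the input status
    register, the second letter the output status register.
    """
    if not group_enable:
        return None                      # AA is never enabled, so nothing runs
    for name in GROUP_ORDER:
        in_reg, out_reg = name[0], name[1]
        if input_status[in_reg] == 0 and output_status[out_reg] == 0:
            input_status[in_reg] = 1     # reserve the slot on the input side
            output_status[out_reg] = 1   # reserve the slot on the output side
            return name                  # all subsequent schedulers stay disabled
        # No schedule made here: the enable passes to the next scheduler.
    return None
```

With all four status bits free, the first call returns "AA" and a second call on the same registers returns "BB", so at most two cells are placed into the slot and no cell is scheduled twice.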
Besides the addition of the interface signals, another difference from the scheduler
without speed-up two is that only the first row of elementary schedulers in each
group can receive the output status array directly from the output status register,
which is used to defeat unfairness.
Moreover, due to speed-up two, an output scheduler with an array of 16 x 16
elementary schedulers, 16 input status registers and 16 output status registers
becomes an output scheduler that has only 8 x 8 elementary scheduler groups, with
eight input and output status register pairs. Therefore, such an output scheduler can
only support an 8 x 8 ATM switch.
5.2 A Possible Way to Improve the Speed
Although the current circuit is fast enough to satisfy the design objective, the speed can
still be improved in the elementary scheduler. From the discussion in chapter two, we note that each
comparison unit only interfaces with its immediately subsequent unit and the
interface signal ripples through all the 16 units one by one. In other words, each unit
only passes its scheduling result to its immediately subsequent unit. Referring to
figure lJ, we note that each blocking circuit results in two gate-delays. That means
the overall delay from the blocking circuits is thirty gate-delays.
If we use some lookahead among the comparison units, the delay caused by the
blocking circuits can be reduced. In this case, each comparison unit should interface
with all its subsequent units, instead of only its immediately subsequent one. In other
words, each unit informs all its subsequent units of the comparison result. A diagram
that illustrates this approach is shown in figure 32. Note that the first comparison
unit sends the interface signal to all of the subsequent 15 units, unit2 sends the
interface signal to its subsequent 14 units and so on. Part of the circuit with
lookahead is shown at the bottom of the figure. The figure shows the circuits of four
comparison units. Note that the circuit of unit1 is identical to that without
lookahead. The difference is that its interface signal (down1) is sent to all its fifteen
subsequent units. Similarly, unit2 sends its interface signal (down2) to all its
fourteen subsequent units. Now let us study how each unit interfaces with its preceding units.
Since unit2 should only interface with one preceding unit, unit1, its interface signal,
up2, is connected to the signal down1 directly. Unit3 should receive the interface
signal from both unit1 and unit2. Therefore, its interface signal up3 receives the
logical OR of down1 and down2. For the same reason, unit16 should use a 15-input
















Figure 32 Structure and circuit of elementary scheduler with lookahead
shown in figure 32, we can estimate the delay for this approach. The critical path
should begin at the enable signal and end at x16. Note that this path includes seven
gates. According to our previous discussion, the blocking circuits without lookahead
introduce about 30 gate-delays. Clearly, lookahead can reduce the delay from
the blocking circuits significantly.
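The two blocking schemes can be compared with a behavioural sketch in Python. It rests on several assumptions of ours: each unit is reduced to a single "match" flag, the first matching unit is the one granted, in the lookahead version the down signal of a unit is taken to be its own comparison result, and the ripple delay is counted as two gate-delays for each of the 15 hand-overs between 16 units. All function names are hypothetical.

```python
from itertools import product


def grants_ripple(match):
    """Ripple blocking: each unit hands one block signal to its successor."""
    grants, blocked = [], False
    for m in match:
        grants.append(m and not blocked)
        blocked = blocked or m            # the signal passes one more stage
    return grants


def grants_lookahead(match):
    """Lookahead: unit k ORs the down signals of all preceding units directly."""
    down = list(match)                    # assumption: down_i is unit i's own result
    grants = []
    for k, m in enumerate(match):
        up_k = any(down[:k])              # wide OR; the first unit has no inputs
        grants.append(m and not up_k)
    return grants


def ripple_gate_delays(units, delays_per_stage=2):
    """Gate-delays of the ripple chain, counting one stage per hand-over."""
    return delays_per_stage * (units - 1)


# Exhaustive check on 8 units that both schemes grant the same unit.
all_equal = all(grants_ripple(p) == grants_lookahead(p)
                for p in product([False, True], repeat=8))
print(all_equal, ripple_gate_delays(16))   # prints: True 30
```

The sketch shows only that the two schemes are functionally equivalent; the gain from lookahead lies in replacing the 30 gate-delay chain with the seven-gate path mentioned above, at the cost of wide OR gates.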
Nevertheless, the price we pay for the high speed is greater area and design effort. With
lookahead, each unit should interface with several units, so each unit needs an additional
OR gate to collect the interface signals. This takes more room. Since the output
scheduler includes 256 elementary schedulers, any increase in each elementary
scheduler's area will affect the whole circuit very much. Moreover, we note that each
unit's circuit will be somewhat different, which implies that each unit should be
designed separately. In addition, lookahead results in a more complicated layout
design for the comparison unit.
In the design of this output scheduler, we have comfortably satisfied the
speed requirement, so we did not employ lookahead. However, if the speed
requirement is very strict, it is worthwhile to trade area, power dissipation and other
factors for speed.
5.3 Challenges on packaging
Packaging is as important as, and often even more critical than, transistors in
determining the overall performance of a system. As we have mentioned above, the
output scheduler works at a very high frequency. In order to support such a high
frequency, the packaging technology should be selected carefully.
In traditional packaging technologies, the circuit is fabricated on a silicon die and the
die is put into a chip carrier; the chip carrier is then placed on a printed circuit board
and interfaces with other chips [10]. The large parasitic inductance of the bonding
wires and the transmission lines will cause noise on the signals. The reason can
be illustrated by the following equation:
dV = L di/dt
The current used to charge and discharge the bond wires and transmission lines may
change at high frequency, so di/dt can be substantially high. Due to the large
inductance, L, of the bonding wires and transmission lines, the signal
voltage is disturbed, and this is a source of noise. In addition, the
capacitance of the bond wires and transmission lines is quite large. The output pads
take more time to charge or discharge such a large capacitance when exchanging information
with other chips. This increases the delay of the chip and limits the highest
frequency that the pads can reach.
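To give a feeling for the scaling in dV = L di/dt, the sketch below evaluates it for two edge rates. The inductance, current swing and edge times are hypothetical values chosen only for illustration; they are not measurements or parameters of this design.

```python
def inductive_noise_v(inductance_h, delta_i_a, delta_t_s):
    """Voltage induced across an inductance by a current change: dV = L * di/dt."""
    return inductance_h * delta_i_a / delta_t_s


# Hypothetical values, chosen only to show the scaling (not from the thesis).
BOND_WIRE_L_H = 5e-9   # 5 nH of bond-wire inductance
DELTA_I_A = 20e-3      # 20 mA current swing on one pad

noise_slow_v = inductive_noise_v(BOND_WIRE_L_H, DELTA_I_A, 2e-9)    # 2 ns edge
noise_fast_v = inductive_noise_v(BOND_WIRE_L_H, DELTA_I_A, 0.5e-9)  # 0.5 ns edge

print(f"slow edge: {noise_slow_v * 1e3:.0f} mV, fast edge: {noise_fast_v * 1e3:.0f} mV")
```

For a fixed inductance, making the edge four times faster makes the induced noise four times larger, which is why reducing L becomes more important as the operating frequency rises.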
In order to reduce the noise on the signal and improve the speed, an effective way is
to minimise the inductance and capacitance of the bonding wires and transmission
lines. Multichip Module (MCM) technology is a solution. In an MCM, a number of
chips are placed on one substrate, which provides electrical connections among the
dice with smaller inductance and capacitance than those provided by traditional
single-chip carriers and PCBs.
There are a number of MCM alternatives. The one that is appropriate for
packaging this ATM switch is the silicon-on-silicon hybrid [10]. A silicon substrate is
used as an interconnection medium to hold multiple chips. Thin film
interconnections are fabricated on a wafer, and separately processed dice are
mounted on this silicon substrate. A significant advantage is that chips fabricated in
different technologies (CMOS, bipolar or GaAs) can be placed on the same hybrid
package. The silicon substrate can also potentially contain active devices that serve
as chip-to-chip drivers, and bus and I/O multiplexers.
The ATM switch discussed in this thesis may contain chips fabricated with different
technologies. For example, the scheduler is designed in CMOS and the switch
matrix is most likely fabricated in GaAs. This is one reason that this packaging
technology is suitable for this ATM switch.
Here we only present a basic idea for the selection of packaging technology. Many
topics should be researched further to make each chip communicate with
other chips reliably at high frequency.
5.4 Summary
In this chapter we discussed three topics: speed-up two, lookahead, and packaging.
We discussed speed-up two and the corresponding modification of the
scheduler's structure to support it. Then, we discussed a possible way to
reduce the delay of the blocking circuits with lookahead. Finally,
packaging is a critical part that will affect the performance of the chip. Since the
output scheduler works at a very high speed, a high quality package is necessary to
ensure the high performance of the chip. A possible Multi-chip Module technology









In this thesis, we discussed the design of an input-buffered high-performance ATM
switching system, which employs a time scheduling algorithm developed by Sarkies
and Main [6]. This research has achieved the following design objectives:
The architecture of an input-buffered ATM switch is designed. The switch includes four
major parts: the input port controller, switch matrix, scheduler and output port
controller. The scheduler is a key part of this project, so it is designed to the
circuit level. The input port controller and switch matrix that interface with the
scheduler are designed to the architecture level. We demonstrated that each part of




demonstrate that the input port controller can not only realise such basic
functions as generating the scheduling request and processing the scheduling
result, but also offer some advanced functions such as variable priority threshold
and multicasting, which is an enhancement to the algorithm.
According to the time scheduling algorithm and the design requirement of this
ATM switch, the architecture of the output scheduler is designed. Subsequently,
we design the VHDL models, schematics and layout of the scheduler.
The simulation results of the layout show that the maximum delay of a




objective, 40 ns. The whole output scheduler includes 600,000 transistors. The
estimated power dissipation of the circuitry is about 0.785 W and the power
dissipation for the pads is about 0.1 W. Given such a big circuit that works at very
high speed, this power dissipation is reasonably low. The area of the
circuit excluding the pads is 4.9 mm². The overall area including the
pads is about 12 mm².
Finally, we discussed speed-up two, which improves the performance of the
algorithm, and analysed the corresponding modification of the structure;
we discussed a circuit with lookahead that improves the speed
of the output scheduler; and we discussed a possible way to package the circuit.
The design satisfies all the objectives of the project, and the circuit of the output
scheduler can potentially support an even higher speed switch matrix.
Reference:
1. F. Halsall, "Data Communication, Computer Networks and Open Systems",
Addison-Wesley, 1992.
2. L. G. Cuthbert, "ATM: The Broadband Telecommunication Solution", Institute of
Electrical Engineering, 1990.
3. R. Handel, M. N. Huber, "Integrated Broadband Networks", Addison-Wesley, 1991.
4. K. S. Lowe, "A GaAs HBT 16x16 bit 10-Gb/s/Channel Crosspoint Switch",
IEEE Journal of Solid-State Circuits, Vol. 32, No. 8, August 1997.
5. R. Savara, A. Turudic, "A 2.5 Gb/s 16 x 16 Bit Crosspoint Switch with Fast
Programming", IEEE GaAs IC Symposium, 1995, pp. 47-48.
6. J. Main and K. Sarkies, "Cell Scheduling Using Status Arrays in Input Buffered
ATM Switches", IEEE BSS'95, Poznan, April 19-21, 1995, pp. 333-339.
7. N. McKeown, M. Izzard, "The Tiny Tera: A Packet Switch Core", IEEE Micro,
pp. 26-33, January 1996.
8. N. H. E. Weste, K. Eshraghian, "Principles of CMOS VLSI Design: A Systems
Perspective", Addison-Wesley, 1993.
9. J. M. Rabaey, "Digital Integrated Circuits: A Design Perspective", Prentice
Hall, 1996.
10. H. B. Bakoglu, "Circuits, Interconnections, and Packaging for VLSI",
Addison-Wesley, 1990.
