The Design of a single chip 8x8 ATM switch in 0.5 micrometers CMOS VLSI by Rughoonundon, Rudi
Rochester Institute of Technology 
RIT Scholar Works 
Theses 
11-1-1996 
Recommended Citation 
Rughoonundon, Rudi, "The Design of a single chip 8x8 ATM switch in 0.5 micrometers CMOS VLSI" 
(1996). Thesis. Rochester Institute of Technology. Accessed from 





Partial Fulfillment of the





Committee Chairman - Dr. Roy Czernikowski, Department Head
Committee Member - Dr. Tony Chang
Committee Member - Professor George Brown
DEPARTMENT OF COMPUTER ENGINEERING
COLLEGE OF ENGINEERING
ROCHESTER INSTITUTE OF TECHNOLOGY
ROCHESTER, NEW YORK
NOVEMBER 1996
THESIS RELEASE PERMISSION FORM
ROCHESTER INSTITUTE OF TECHNOLOGY
COLLEGE OF ENGINEERING
Title of Thesis: The Design of A Single Chip 8x8 ATM Switch In 0.5 µm CMOS VLSI
I, Rudi Rughoonundon, hereby grant permission to the Wallace Memorial Library of RIT




This thesis illustrates the design of a single chip Asynchronous Transfer Mode (ATM)
protocol switch using Very Large Scale Integration (VLSI). The ATM protocol is the data
communications protocol used in the implementation of the Broadband Integrated
Services Digital Network (B-ISDN). A number of switch architectures are first studied
and a new architecture is developed based on optimizing performance and practicality of
implementation in VLSI. A fully interconnected switch architecture is implemented by
permanently connecting every input port to all the output ports. An output buffering
scheme is used to handle cells that cannot be routed right away. This new architecture is
called the High Performance (HiPer) Switch Architecture. The performance of the
architecture is simulated using a C++ model. Simulation results for a randomly distributed
traffic pattern with a 90% probability of cells arriving in a time slot produce a Cell Loss
Ratio of 1,Ox I O~l( with output buffers that can hold 64 cells. The device is then modeled in
VHDL to verify its functionality. Finally, the layout of an 8x8 switch is produced using a
0.5 µm CMOS VLSI process, and simulations of that circuit show that a peak throughput
of 200 Mbps per output port can be achieved.
Table of Contents
List of Figures
List of Tables
Glossary
Chapter 1 - The ATM Protocol
1.1 - Introduction
1.2 - User Network Interface and Network Node Interface
1.3 - The ATM Cell
1.4 - The ATM Reference Model
1.4.1 - The ATM Adaptation Layer
1.4.2 - The ATM Layer
1.4.3 - The Physical Layer
1.4.4 - Traffic Classes and ATM Adaptation Layer Types
1.4.5 - Cell Delineation
1.5 - Traffic Management
1.5.1 - The Traffic Classes
1.5.2 - The Quality of Service
1.5.3 - Types of Traffic Management
1.6 - Network Architectures
1.7 - Local Area Network Emulation (LANE)
Chapter 2 - Architecture of ATM Switches
2.1 - Switches and Routers
2.2 - The ATM Switch Fabric
2.3 - Performance Evaluation of an ATM Switch Architecture
2.4 - Classification of ATM Switches
2.5 - Shared Resource Switches
2.5.1 - Shared Buffer Architecture
2.5.2 - Shared Medium Architecture
2.6 - Space Division Switches
2.6.1 - Crossbar Switches
2.6.2 - Multiple-Stage Interconnection Networks
2.6.3 - Multiple-Plane Switches
2.6.4 - Fully Interconnected Topology
2.7 - Buffering Schemes
2.7.1 - Input Buffering
2.7.2 - Output Buffering
2.7.3 - Shared Buffering
2.8 - The Design of an ATM Switch
2.9 - Architecture of the HiPer ATM Switch
Chapter 3 - Performance Analysis Using C++
3.1 - The Object Oriented Paradigm
3.1.1 - What is an Object?
3.1.2 - Polymorphism
3.1.3 - Abstract Classes and Inheritance
3.2 - The Object Oriented Model of the HiPer ATM Switch
3.2.1 - Uniform Distribution Random Number Generator Class
3.2.2 - NetworkManager Class
3.2.3 - Switch Class
3.2.4 - ATMCell Class
3.2.5 - First In First Out Queue Class
3.2.6 - Historical Data Collection Class
3.2.7 - Integer Class
3.3 - Implementation of the Object Oriented Model of the HiPer ATM Switch
3.4 - The Significance of Data Collected During Simulations
3.5 - Analysis of C++ Simulation Results
3.5.1 - The Cell Loss Ratio
3.5.2 - Output Buffer Utilization
3.5.3 - Frequency distribution of Output Buffer lengths
3.5.4 - Cell Delay
Chapter 4 - Functional Verification Using a VHDL Model
4.1 - Hardware Description Languages
4.1.1 - The VHSIC Hardware Description Language
4.1.2 - Behavioral Modeling
4.1.3 - Structural Modeling
4.1.4 - Dataflow Modeling
4.2 - The Functional Description of the HiPer ATM Switch
4.2.1 - The Write Synchronization Signals
4.2.2 - The Input Stage
4.2.3 - The Output Buffers
4.2.4 - The Cell Buffers
4.2.5 - The Reset State
4.2.6 - The ATM Cells
4.2.7 - Hardware Complexity
4.3 - Description of the VHDL Model
4.3.1 - The Input Buffer And The Sync Buffer
4.3.2 - The Switching Table
4.3.3 - The Switching Tree
4.3.4 - The Write Select Register, Read Select Register And Bit Select Register
4.3.5 - The Fully Interconnected Network
4.3.6 - The Output Buffer
4.3.7 - The Cell Buffer
4.3.8 - The Counters
4.3.9 - The ATM Switch
4.4 - Simulation of the VHDL model
4.4.1 - The Testbench
4.4.2 - Cell Switching Validation
4.4.3 - Cell Multicasting Validation
4.4.4 - High Traffic and Congestion Handling Validation
4.5 - VHDL Simulation Results
Chapter 5 - Circuit Level Simulations
5.1 - Circuit Level Issues
5.1.1 - Sources of parasitic components
5.1.2 - Clock and Data Skew
5.2 - The Chip Floorplan
5.3 - Standard Cells
5.3.1 - The Dynamic Flip-Flop
5.3.2 - The Static Flip Flop
5.3.3 - The Dynamic Random Access Memory Cell
5.3.4 - The Full Adder & the Single Bit Counter
5.3.5 - The Dynamic Decoder
5.3.6 - The Static Multiplexer
5.3.7 - Generic Logic Gates
5.4 - System Components
5.4.1 - The Input Cell Buffer and the Input Sync Buffer
5.4.2 - The Switching Table
5.4.3 - The Switching Tree
5.4.4 - The Write Select Registers and Read Select Registers
5.4.5 - The Bit Select Register
5.4.6 - The Write State Machine
5.4.7 - The Read State Machine
5.5 - The VLSI Layout
5.5.1 - The Input Stage
5.5.2 - The Interconnection Network
5.5.3 - The Read/Write Logic
5.5.4 - The Output Buffer
5.5.5 - The Chip Layout
5.6 - Results from the circuit level simulations
Chapter 6 - Conclusions
6.1 - Summary
6.2 - Improvements And Future Directions
6.2.1 - The Implementation of Cell Loss Priority
6.2.2 - Reducing the size of the Interconnection Network
6.2.3 - Implementing large switch fabrics with multiple chips




A.4 - NetworkManager.h
A.5 - NetworkManager.cc
A.6 - Switch.h




A.11 - Queue.cc
Appendix B - VHDL Model Source
B.1 - Switch_Pack.vhdl
B.2 - Testbench.vhdl
B.3 - Input_Port.vhdl
B.4 - Output_Port.vhdl
B.5 - Input_Cell_Buffer.vhdl
B.6 - Input_Sync_Buffer.vhdl
B.7 - Switching_Table.vhdl
B.8 - Switching_Tree.vhdl








B.17 - Input 0 test file
B.18 - Input 6 test file
B.19 - Input 7 test file




List of Figures
Figure 1.1: ATM vs. STM transmission modes
Figure 1.2: Difference between the Network Node Interface (NNI) and User Network Interface (UNI)
Figure 1.3: ATM cell at the Network Node Interface (NNI)
Figure 1.4: ATM cell at the User Network Interface (UNI)
Figure 1.5: Train crossing a highway analogy to demonstrate advantage of small cell sizes
Figure 1.6: Relationship between Virtual Channels (VC) and Virtual Paths (VP)
Figure 1.7: The layers of the ATM Reference Model
Figure 1.8: Framing of data through the layers of the ATM Reference Model
Figure 1.9: State machine used for cell delineation
Figure 1.10: Common configurations of contemporary networks
Figure 2.1: Structure of an ATM switch
Figure 2.2: Shared Medium/Time Division Multiplexed switch architecture
Figure 2.3: Topology of the Crossbar switch
Figure 2.4: Architecture of a Multiple Stage Interconnection Network
Figure 2.5: Simplified structure of the Starlite switch architecture
Figure 2.6: Architecture of Multiple-Plane Switches
Figure 2.7: Illustration of a Fully Interconnected architecture
Figure 2.8: The Input Buffering Scheme
Figure 2.9: The Output Buffering Scheme
Figure 2.10: The Shared Buffer Scheme
Figure 2.11: Architecture of the HiPer ATM Switch
Figure 3.1: Object Diagram of the C++ Model of the HiPer ATM Switch
Figure 3.2: Plot of Cell Loss Ratio against Output Buffer size
Figure 3.3: Output Buffer utilization for different arrival rates
Figure 3.4: Frequency distribution of Output Buffer lengths
Figure 3.5: Cell delay encountered by 90% of cells in the system
Figure 4.1: Mechanism used in the HiPer ATM Switch to carry cell information in a single bit asynchronous signal (Write Sync Signals)
Figure 4.2: Handshaking mechanism between Input Modules and Switch Fabric
Figure 4.3: Illustration of the relationship between the Addressing information in the Cell Buffer and the Lookup Sync signal produced by the Sync Buffer
Figure 4.4: Relationship between signals in the Input Stage of the HiPer ATM Switch
Figure 4.5: Staggering of the Sync signals to produce the Write Sync signals in the Sync Buffer
Figure 4.6: High level architecture of Output Buffers
Figure 4.7: Structure of a Circular Serial Shift Register with a reset signal
Figure 4.8: State Diagram of the Write Select State Machines in the Cell Buffer
Figure 4.9: State Diagram of the Read Select State Machine in the Cell Buffer
Figure 4.10: Timing diagram of cells into and out of switch
Figure 4.11: Functions of write sync signals in Output Buffers and Cell Buffers
Figure 4.12: Congestion simulation and FIFO characteristics
Figure 5.1: Planned floorplan of chip
Figure 5.2: Transistor level schematic of a Positive Transition Triggered Dynamic Flip-Flop
Figure 5.3: Transistor level schematic of Negative Transition Triggered Dynamic Flip-Flop
Figure 5.4: Negative Transition Level Triggered Dynamic Flip-Flop with Synchronous Reset and Set signals
Figure 5.5: The single phase clock Negative Transition Triggered Static Flip-Flop
Figure 5.6: The single phase clock Negative Transition Triggered Static Flip-Flop
Figure 5.7: Dynamic Memory Cell
Figure 5.8: Standard cell for Full Adder
Figure 5.9: 2 Bit Dynamic Decoder
Figure 5.10: Transistor level schematic of 4 to 1 Static Multiplexer
Figure 5.11: Switching Table implemented as a Read Only Memory
Figure 5.12: Gate level design of Write State Machine 1 and Write State Machine 2
Figure 5.13: Gate level design of the Read State Machine
Figure 5.14: Floorplan of the input stage
Figure 5.15: The Read/Write Logic for the Cell Buffer
Figure 5.16: Floorplan of the Output Buffer
Figure 5.17: The floorplan of the HiPer ATM Switch chip
Figure 5.18: Switching of the sync signals in the input stage
Figure 5.19: Read Select Register and memory read select lines
Figure 5.20: The Write Select Register and the memory write select lines
List of Tables
Table 3.1: Cell Loss Ratio for different Output Buffer sizes and different arrival probabilities
Table 3.2: Utilization of Output Buffers for different arrival probabilities
Table 3.3: Cell delay encountered by 90% of cells in the system
Table 5.1: State transition table for Write State Machine 1
Table 5.2: State transition table for Write State Machine 2
Table 5.3: State transition table for the Read State Machine
Glossary
AAL - Asynchronous Transfer Mode Adaptation Layer
Layer of the Asynchronous Transfer Mode Reference Model responsible for interfacing
other high level protocols to the ATM protocol.
ABR - Available Bit Rate
Data traffic which uses whatever bandwidth is available for transmission but specifies a
minimum traffic requirement.
ATD - Asynchronous Time Division Multiplexing
Statistical multiplexing scheme for transmission of data. In the ATM protocol this also
stands for real Asynchronous transmission as opposed to the use of an intermediary
protocol such as SONET.
ATM - Asynchronous Transfer Mode Protocol
A high speed cell relay protocol selected to implement the Broadband Integrated Services
Digital Network.
B-ISDN - Broadband Integrated Services Digital Network
Protocol created in the late 80's to overcome some of the shortcomings of the Integrated
Services Digital Network. B-ISDN has the advantages of more scalability, of a more
simple protocol, of higher bandwidth and of simple implementation. Asynchronous
Transfer Mode is the scheme used to transport data over such a network.
CBR - Constant Bit Rate
Data traffic characterized by a constant rate of flow of bits.
CLP - Cell Loss Priority
Field of an Asynchronous Transfer Mode cell used to identify the priority of that cell.
CRC - Cyclic Redundancy Check
Bit pattern generated using a fixed polynomial which can be used to detect and correct
errors in a stream of bits.
CS - Convergence Sublayer
Sublayer of the Asynchronous Transfer Mode Adaptation Layer responsible for adapting
the flow of data from a higher level protocol to the Segmentation and Reassembly
Sublayer.
CSMA/CD - Carrier Sense Multiple Access with Collision Detect
This is a networking protocol used in Ethernet type networks. Any node on the network
can transmit if it sees that the medium carrying data is available. Any node transmitting
data has to monitor the medium to ensure that another node did not start transmitting
simultaneously with it. If a collision is detected, all nodes will usually back off and try
transmitting again after a random amount of time.
CS-PDU - Convergence Sublayer Protocol Data Unit
Protocol Data Unit produced by the Convergence Sublayer of the Asynchronous Transfer
Mode Adaptation Layer.
DQDB - Distributed Queue Dual Bus
Protocol dedicated to the implementation of Metropolitan Area Networks. The protocol
uses two unidirectional buses working in opposite directions across which a number of
queues are distributed to move data from one bus to the other.
GFC - Generic Flow Control
Field of an Asynchronous Transfer Mode cell where flow control information is stored.
This field may be used to implement negative feedback between a user's Data Terminal
Equipment and the User Network Interface.
HEC - Header Error Control
Field in the header of an Asynchronous Transfer Mode cell where the Cyclic Redundancy
Check code of the header is located.
ISDN - Integrated Services Digital Network
Protocol created in the late 80's to overcome some of the shortcomings of the Integrated
Services Digital Network. B-ISDN has the advantages of more scalability, of a more
simple protocol, of higher bandwidth and of simple implementation. Asynchronous
Transfer Mode is the scheme used to transport data over such a network.
ITU - International Telecommunication Union
International organization in charge of allocating worldwide data transmission channels
and responsible for specifying and standardizing data communication protocols. Formerly
known as CCITT.
LAN - Local Area Networks
Computer networks spanning within very small regions. Usually small is considered to be
within the confines of a building or even a single floor in a building.
MPEG - Motion Picture Experts Group
Organization in charge of standardizing protocols for real time transmission of video data
and for creating video data compression and decompression schemes.
NNI - Network Node Interface
Interface that implements a protocol to connect two ATM nodes (usually ATM switches)
inside of an ATM network.
OO - Object Oriented
Software engineering paradigm whereby entities make extensive use of encapsulation and
modeling to produce systems that more closely resemble what they represent.
OSI - Open Systems Interconnect
Data communication protocol using a variable size frame relay scheme and a fairly
complex multi layer protocol stack to implement data transmission and reception.
PDU - Protocol Data Unit
Data unit or entity carrying a piece of information or a stream of data which is based on a
given data communication protocol. For example an ATM cell or SONET frame.
PMD - Physical Medium Dependent
Sublayer of the Physical Layer of the Asynchronous Transfer Mode Reference Model
which specifies the physical aspect of the data transmission.
PT - Payload Type
Field of an Asynchronous Transfer Mode cell used to identify the type of data carried by
that cell.
QoS - Quality of Service
Level of service expected from a communication channel set up in the ATM protocol.
SAR - Segmentation and Reassembly
Sublayer of the Asynchronous Transfer Mode Adaptation Layer responsible for breaking a
data stream into smaller sized fragments to be converted into Asynchronous Transfer
Mode cells.
SAR-PDU - Segmentation and Reassembly Protocol Data Unit
Protocol Data Unit produced by the Segmentation and Reassembly Sublayer of the
Asynchronous Transfer Mode Adaptation Layer.
SONET - Synchronous Optical Network
Frame relay type protocol currently used in the transmission of data over fiber optic
networks. SONET uses Synchronous Transfer Mode and Time Division Multiplexing
schemes to transmit data.
SS7 - Common Channel Signaling System No. 7
Data communication control and management protocol which is implemented on a
separate channel from the one on which the data is being transported. This is also called
Sideband Signaling.
STM - Synchronous Transfer Mode
The conventional means of transmitting data using Time Division Multiplexing and
allocating fixed time slots to data channels.
TC - Transmission Convergence
Sublayer of the Physical Layer of the Asynchronous Transfer Mode Reference Model
responsible for fitting a stream of Asynchronous Transfer Mode cells into the data stream
that will actually carry the cells across a network.
TCP/IP - Transmission Control Protocol/Internet Protocol
TCP defines a data communication protocol at the software layer inside a host whereas IP
applies to the hardware and network wide level.
TDM - Time Division Multiplexing
Multiplexing data from different channels onto a common link by assigning fixed time
dependent positions for data from each channel.
UBR - Unspecified Bit Rate
Data traffic which uses whatever bandwidth is available for transmission and which does
not have to specify its requirements.
UNI - User Network Interface
Interface that implements a protocol to connect a host system to a node inside of the ATM
network.
U-PDU - User Protocol Data Unit
Protocol Data Unit produced by the user.
VBR-RT - Real Time Variable Bit Rate
Data traffic characterized by a variable rate of production of bits but which requires
transmission within specified periods of time.
VBR-NRT - Non-Real Time Variable Bit Rate
Data traffic characterized by a variable rate of production of bits and which does not
require transmission within specified time intervals.
VCI - Virtual Channel Identifier
A Virtual Channel is a point to point link carrying Asynchronous Transfer Mode cells. The
Virtual Channel Identifier is the address of a Virtual Channel in the header of an
Asynchronous Transfer Mode cell.
VHDL - Very High Speed Integrated Circuits Hardware Description Language
Language used to model complex digital systems such as Integrated Circuits.
VPI - Virtual Path Identifier
A Virtual Path represents a group of Virtual Channels in the Asynchronous Transfer
Mode. The Virtual Path Identifier is the address used to select a Virtual Path.
WAN - Wide Area Networks
Data transmission networks spanning over large areas. Such an area could be a city or an
entire country.
Chapter 1 - The ATM Protocol
1.1 - Introduction
The Asynchronous Transfer Mode (ATM) protocol was adopted by the International
Telecommunication Union (ITU) to implement the Broadband-Integrated Services Digital
Network (B-ISDN) [12]. The ATM protocol is a high speed, packet based, connection
oriented protocol that can also support connectionless data traffic. The protocol is
asynchronous in the sense that data packets (called cells) are inserted into the data
stream only when a source has data to send.
Figure 1.1: ATM vs. STM transmission modes (panel (b): the ATM protocol; legend: empty data unit, valid data unit)
Figure 1.1(a) demonstrates how the conventional Synchronous Transfer Mode (STM)
using Time Division Multiplexing (TDM) functions [2, 21]. Data packets from different
streams each occupy a fixed time dependent slot in the data stream. A packet from a given
channel can only be inserted in the slot allotted to that channel. If no data is available on a
channel, the time slot allocated to that channel goes unused, leading to wasted bandwidth.
The ATM protocol remedies this situation by inserting cells from a channel into the data
stream when the channel has data to transmit as shown in Figure 1.1(b). In practice ATM
cells are carried across a network by using another protocol such as the Synchronous
Optical Network (SONET) protocol. This solution was adopted to facilitate the transition
from conventional networks to eventual ATM networks using Asynchronous Time
Division Multiplexing (ATD).
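The difference between the two transmission modes can be illustrated with a small sketch. It is a simplified model rather than part of the thesis: the function names stm_schedule and atm_schedule are hypothetical, all cells are assumed to be queued before the first time slot, and the ATM case serves the channels in round-robin order, which is only one possible insertion policy.

```python
from collections import deque


def stm_schedule(queues, n_slots):
    """Fixed TDM: slot t belongs to channel t mod N and stays empty if that channel is idle."""
    queues = [deque(q) for q in queues]
    stream = []
    for t in range(n_slots):
        owner = queues[t % len(queues)]
        stream.append(owner.popleft() if owner else None)
    return stream


def atm_schedule(queues, n_slots):
    """ATM style: each slot carries a cell from the next channel that has data to send."""
    queues = [deque(q) for q in queues]
    stream = []
    start = 0  # channel examined first in the next slot (round robin, an assumption)
    for _ in range(n_slots):
        cell = None
        for k in range(len(queues)):
            idx = (start + k) % len(queues)
            if queues[idx]:
                cell = queues[idx].popleft()
                start = (idx + 1) % len(queues)
                break
        stream.append(cell)
    return stream


# Channel A has three cells queued, channel B is idle, channel C has one cell.
channels = [["A1", "A2", "A3"], [], ["C1"]]
print(stm_schedule(channels, 7))
print(atm_schedule(channels, 7))
```

In this example the STM stream needs seven slots to deliver the four cells because the slots owned by idle channels go unused, whereas the ATM stream delivers them in the first four slots.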
The ATM protocol is also designed with fiber optic technology in mind. One of the
important features of fiber optic data transmission is its very low bit error rate (BER), on
the order of less than 10^-9 [7, 18]. The protocol is therefore specified with very few error
recovery mechanisms. This leads to the additional benefit that the protocol can be
implemented at very high speeds without being limited by high hardware and/or software
processing overhead.
The aim of B-ISDN is to provide a network that can accommodate multiple different types
of digital data traffic within the same framework. The emergence of new markets in
multimedia applications has also been a major factor pushing for the implementation of
such a network and of the ATM protocol. Multimedia applications entail different types of
traffic from different types of sources running at different speeds on the same network.
For example a video conferencing system would have to accommodate live video and
audio data accompanied possibly by images and data files. Also pay-per-view services in
the consumer home will require video on demand with the customer requesting
programming through the TV receiver. This type of service will require multicasting of
data, live video feed and data flowing back through the network to the video supplier.
ATM will provide a unified networking scheme that spans both Wide Area Networks
(WAN) and Local Area Networks (LAN). The situation today is one of Local Area
Networks using one type of protocol such as TCP/IP or Token Ring and Wide Area
Networks using another protocol such as SONET or X.25. In this framework, data has to
hop over several different networks to reach a destination. The need to convert the data
from one protocol to another leads to high transmission delays. Similarly, the error rates in
such a scheme are limited by the worst error rate that will be encountered in the network.
The ATM protocol will offer the advantages of a unified data format and a unified
transmission scheme making it possible to achieve very high speeds with very low
overheads.
1.2 - User Network Interface and Network Node Interface
The User Network Interface (UNI) and the Network Node Interface (NNI) must first be
defined before continuing this discussion. Two types of interfaces that connect nodes in an

















Figure 1.2: Difference between the Network Node Interface (NNI) and User Network
Interface (UNI) (legend: NNI - Network Node Interface; UNI - User Network Interface; Data Terminal Equipment; ATM Switches)
The UNI is the interface that connects the Data Terminal Equipment (DTE) of a user to
the core of the network, which is most likely made up of ATM switches as shown in Figure
1.2. The Private-UNI is the interface used in private networks and the Public-UNI is used
in public networks. Note that the interface between a private ATM network and a public
ATM network is also a Public-UNI. The UNI has to link two entities of different nature,
for example a file server to an ATM switch. The file server might be using a different
protocol and its data therefore has to be converted by the UNI to ATM cells.
Furthermore, at the UNI, the function of Generic Flow Control has to be implemented to
provide a feedback mechanism by which the amount of cells being injected into the
network can be controlled. The UNI also works the other way as it allows cells to be
passed from the network back to the host equipment.
The NNI is the interface that connects ATM nodes/switches inside the network. An NNI
protocol is used to implement this interface. The Private-NNI is used in private networks
whereas the Public-NNI is used in public networks. This is once again shown in Figure
1.2. Note that the NNI only needs to interface to other ATM protocol type nodes. The
NNI being inside the network does not have to provide facilities for traffic management if
the UNIs feeding data into the network are well behaved. Unfortunately this might not be
the case in practice and traffic management has to be performed at this interface too.
1.3 - The ATM Cell
The ATM cell is the basic unit of information transmitted in an ATM network [5, 7]. The
cell is a 53 byte long data packet with 5 bytes of header information and 48 bytes of
payload. Two different ATM cells are defined at the User Network Interface (UNI) and
Network Node Interface (NNI). The UNI is defined to be the first node connecting a user
















Figure 1.3: ATM cell at the Network Node Interface (NNI)
The ATM cell at the NNI is shown in Figure 1.3 and it comprises 6 fields as follows [7,
18].
The VPI field is 12 bits wide and stands for the Virtual Path Identifier. The Virtual
Path is considered to be a semi-permanent link carrying a number of Virtual Channels.
A Virtual Path is assumed to be set up by the network administrators and the user when
they enter into a contractual agreement. The Virtual Path will usually remain active
for time periods measured in days rather than fractions of time.
The VCI field is 16 bits wide and stands for the Virtual Channel Identifier. A Virtual
Channel is an individual data stream within a specific Virtual Path. Channels are
considered to be more dynamic in nature as they can be created and destroyed as the
need arises for them. Each channel can carry a fixed bandwidth of data and additional
channels can be created within the same Virtual Path if more bandwidth is required
from that Path.
The PT field is 3 bits wide and stands for Payload Type. It indicates the type of data
being carried by the cell. This field is mainly used to discriminate between cells
carrying data or control information.
The CLP field is 1 bit long and it indicates the Cell Loss Priority of the cell. Setting
this bit to 1 gives it a lower priority and that cell will be dropped if it is contending for
buffer space or bandwidth with a cell that has its CLP set to 0. An example of the
application of this bit is in the transmission of live uncompressed video data. The video
signal comprises sync signals that cannot be lost and therefore cells carrying sync
information have the CLP bit set to 0. On the other hand, small amounts of the actual
image data (the TV lines) can be lost without serious consequences, as the data can
be recovered by extrapolation from whatever data is available. These cells therefore
have their CLP bit set to 1 and as long as the required Quality of Service (QoS) is met
the transmission will work fine.
The HEC field is 8 bits wide and it is the Cyclic Redundancy Check code for the
header of the ATM cell. The polynomial used for the CRC code generation is
x^8 + x^2 + x + 1. This scheme allows a simple mechanism to detect and correct single bit
errors (which are the most likely to happen when fiber optic links are used) and to
detect multiple bit errors. The CRC code is only generated for the header as the
protocol is only concerned about misrouting data. This is mainly due to the fact that
the cost of misdelivering data can be very high in terms of wasted bandwidth and
monetary cost for the users who are transmitting and receiving the data. The
correctness of the payload on the other hand has to be handled by higher layers in the
protocol.
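The header fields listed above can be sketched in a few lines of code. The sketch assumes that the fields are packed most significant bit first in the order in which they are listed (VPI, VCI, PT, CLP, HEC) and that the HEC is the plain remainder of the first four header bytes divided by x^8 + x^2 + x + 1; any further processing applied to the HEC byte by the standard is not modeled. All function names are hypothetical.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, MSB first, initial value 0; 0x07 encodes x^8 + x^2 + x + 1."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc


def pack_nni_header(vpi: int, vci: int, pt: int, clp: int) -> bytes:
    """Pack VPI(12) | VCI(16) | PT(3) | CLP(1) and append the HEC byte."""
    if not (0 <= vpi < 1 << 12 and 0 <= vci < 1 << 16 and 0 <= pt < 8 and clp in (0, 1)):
        raise ValueError("field out of range")
    word = (vpi << 20) | (vci << 4) | (pt << 1) | clp
    first4 = word.to_bytes(4, "big")
    return first4 + bytes([crc8(first4)])


def unpack_nni_header(header: bytes) -> dict:
    """Split the first four header bytes back into their fields."""
    word = int.from_bytes(header[:4], "big")
    return {"vpi": word >> 20, "vci": (word >> 4) & 0xFFFF,
            "pt": (word >> 1) & 0x7, "clp": word & 0x1}


def syndrome(header: bytes) -> int:
    """Zero when the HEC byte matches the first four header bytes."""
    return crc8(header[:4]) ^ header[4]


# Syndrome produced by flipping each of the 40 header bits (bit 0 = MSB of byte 0).
SYNDROME_TO_BIT = {}
for _bit in range(40):
    _err = bytearray(5)
    _err[_bit // 8] = 0x80 >> (_bit % 8)
    SYNDROME_TO_BIT[syndrome(bytes(_err))] = _bit


def check_and_correct(header: bytes) -> bytes:
    """Return the header unchanged if valid, or with a single flipped bit repaired."""
    s = syndrome(header)
    if s == 0:
        return header
    if s not in SYNDROME_TO_BIT:
        # The syndrome does not match any single bit flip.
        raise ValueError("multiple bit error detected")
    fixed = bytearray(header)
    bit = SYNDROME_TO_BIT[s]
    fixed[bit // 8] ^= 0x80 >> (bit % 8)
    return bytes(fixed)
```

Because the 40 single bit error patterns all produce distinct non-zero syndromes, a table lookup is enough to locate and repair one flipped bit, which is the property the HEC description relies on.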

















Figure 1.4: ATM cell at the User network Interface (UNI)
At the UNI the VPI field is reduced to 8 bits to create an additional field as shown in
Figure 1.4. The new field in the cell header is 4 bits wide and is used to carry Generic
Flow Control (GFC) information. This information is expected to be used to control the
flow of data between the user's data terminal and the first node leading into the network
[18]. Although the exact functionality of that field has not been defined, it is expected that
the field will be used to prevent the user's equipment from overwhelming the UNI during
periods of high traffic activity.
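The difference between the two header layouts can be made concrete with a short decoding sketch. It assumes that the first 32 bits of the header are laid out most significant bit first, with the GFC field occupying the top four bits of what is the VPI field at the NNI; the function name decode_header_word is hypothetical.

```python
def decode_header_word(word: int, at_uni: bool) -> dict:
    """Decode the first 32 header bits (everything except the HEC byte)."""
    fields = {
        "vci": (word >> 4) & 0xFFFF,  # 16 bit Virtual Channel Identifier
        "pt": (word >> 1) & 0x7,      # 3 bit Payload Type
        "clp": word & 0x1,            # 1 bit Cell Loss Priority
    }
    if at_uni:
        fields["gfc"] = (word >> 28) & 0xF   # 4 bit Generic Flow Control
        fields["vpi"] = (word >> 20) & 0xFF  # VPI reduced to 8 bits
    else:
        fields["vpi"] = (word >> 20) & 0xFFF  # full 12 bit VPI
    return fields


word = 0xA5C12345
print(decode_header_word(word, at_uni=True))
print(decode_header_word(word, at_uni=False))
```

The same 32 bit word therefore yields a 12 bit VPI at the NNI and a 4 bit GFC followed by an 8 bit VPI at the UNI, while the VCI, PT and CLP fields are unaffected.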
The size of the ATM cell was dictated by several constraints that the protocol had to fulfill
and by the requirements of the end users of the protocol (telephone companies, Internet
access providers and so on).
The need to minimize transmission delay in telephone networks required very small
cell sizes. That is due to the fact that voice data sampling and transmission occurs at
low rates of 64 Kbps. To avoid unnecessary transmission delay, the packet has to be
made small so that it can be filled, transmitted and emptied quickly. A 32 byte packet
size was expected to be acceptable for this type of application.
The need to maximize efficiency in the transmission of data requires fairly large
packets in which a lot of data can be packed and sent across a network. The delay
incurred in the transmission process is less likely to be a problem in most data
applications and therefore delay is not an issue. Ideally data transmission applications
could use at least 64 bytes of payload.
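The two competing requirements can be quantified with a short calculation. The sketch assumes that 64 Kbps means 64,000 bits per second and that every packet carries the 5 byte ATM header described earlier; the function names are hypothetical. It computes the time a voice source needs to fill a payload and the fraction of each packet that is payload.

```python
HEADER_BYTES = 5          # ATM cell header size
VOICE_RATE_BPS = 64_000   # assumed meaning of "64 Kbps"


def fill_delay_ms(payload_bytes: int, rate_bps: int = VOICE_RATE_BPS) -> float:
    """Time in milliseconds for a source at rate_bps to fill one payload."""
    return payload_bytes * 8 / rate_bps * 1000


def payload_efficiency(payload_bytes: int, header_bytes: int = HEADER_BYTES) -> float:
    """Fraction of the transmitted bytes that carry payload."""
    return payload_bytes / (payload_bytes + header_bytes)


for size in (32, 48, 64):
    print(f"{size} byte payload: fill delay {fill_delay_ms(size):.1f} ms, "
          f"efficiency {payload_efficiency(size):.1%}")
```

Under these assumptions a 32 byte payload takes 4 ms to fill and a 64 byte payload takes 8 ms, while the payload efficiency rises from about 86% to about 93%. A 48 byte payload lies between the two on both measures.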
The size of 48 bytes for the payload was selected to please all parties concerned. The
small size of the cell also minimizes delay and delay variance. A good analogy to
demonstrate that fact is the example of railroad tracks crossing a highway, as
described by Goralski [7] and illustrated in the following.
Figure 1.5: Train crossing a highway analogy to demonstrate advantage of small cell
sizes ((a) Large packets; (b) Small packets)
Figure 1.5(a) shows a train crossing a highway. The train tracks represent a data channel
carrying large data packets. The trains are contending for a common link with another
data channel carrying smaller packets, which is represented by the highway with cars on it.
Notice that all the cars are stalled and have to be buffered until the entire train has crossed
the highway. This leads to a high delay for the cars, which arrive at their destination later
than expected. If the delay is large enough in a data transmission system, the receiver
might think that the data was lost and discontinue the connection.
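The stall described above can be estimated with a one line formula: a packet that arrives just as another packet starts crossing the shared link waits for the full transmission time of that packet. The sketch below is an illustration only. The link rate of 155.52 Mbps and the 1500 byte packet size are assumed values chosen for the example, and the function name blocking_delay_us is hypothetical.

```python
def blocking_delay_us(packet_bytes: int, link_rate_bps: float) -> float:
    """Worst case wait, in microseconds, behind one packet already on the link."""
    return packet_bytes * 8 / link_rate_bps * 1e6


LINK_RATE_BPS = 155.52e6  # assumed link rate, for illustration only

behind_large_packet = blocking_delay_us(1500, LINK_RATE_BPS)  # the "train"
behind_atm_cell = blocking_delay_us(53, LINK_RATE_BPS)        # one 53 byte cell
print(f"wait behind a 1500 byte packet: {behind_large_packet:.2f} us")
print(f"wait behind a 53 byte cell:     {behind_atm_cell:.2f} us")
```

The waiting time scales linearly with the size of the packet occupying the link, so shrinking the largest unit on the link shrinks both the worst case delay and the spread between the best and worst cases.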
Figure 1.5(b) on the other hand shows the same situation but now the trains on the track
are a lot smaller. Notice that some cars still have to stop for the train but now the cars
wait for a shorter amount of time and fewer cars have to wait for a train to go by.
Translated into a communication channel, this means that large data packets that used to
be transmitted without any delay will now have a small negligible delay (as in the case of
the trains). On the other hand small packets that used to have large delays and large delay
variations will now have much smaller delays and delay variations (as in the case of the
cars). This is a significant improvement from the original situation and it is more than





Figure 1.6: Relationship between Virtual Channels (VC) and Virtual Paths (VP)
The relationship between Virtual Channels (VC) and Virtual Paths (VP) is further
illustrated in Figure 1.6. Each Virtual Path is made up of a set of Virtual Channels [5, 18].
This allows switching of a group of virtual channels simultaneously with a minimum of
hardware overhead. Also, once a Virtual Path is allocated, channels can be created and
destroyed within it to accommodate bandwidth requirements without having to
renegotiate network connections. Once again the need for high speed, high throughput and
low latency dictated the selection of that scheme.
1.4 - The ATM Reference Model
The ATM protocol is defined as a set of layers similar to the Open Systems Interconnect
(OSI) protocol Reference Model [5, 7, 18]. However, it is not possible to correlate the
OSI reference model directly to the ATM model. The layers of the ATM Reference Model
are explained next.
Figure 1.7: The layers of the ATM Reference Model (Control Plane and User Plane)
1.4.1 - The ATM Adaptation Layer
The ATM Adaptation Layer (AAL) is responsible for interfacing higher layer protocols to
the lower layers of the model as shown in Figure 1.7 [5, 7, 18]. Examples of higher layer
protocols could be a digital phone line or a video server. The AAL is further split into two
sub-layers called the Convergence sub-layer (CS) and the Segmentation And Re-assembly
(SAR) sub-layer. A block of data coming from a higher level protocol is first passed to the
Convergence sub-layer. The CS is then responsible for framing the data to record where it
came from and to make it fit into the ATM data stream (by padding the stream with extra
bits). The Segmentation and Re-assembly sub-layer receives data from the CS and it is
responsible for splitting the data stream into equal size segments to be placed inside an
ATM cell. In addition to transmission of user data from the user plane the AAL is also
responsible for transmitting control and management information from the control plane
into the data stream.
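The two AAL steps described above can be sketched in Python. This is only an illustration under stated assumptions: the 2-byte length header added by `cs_frame` is a hypothetical framing, not the CS-PDU format of any real AAL type, and the function names are invented for this example. Only the 48-byte payload size comes from the cell format.

```python
CELL_PAYLOAD = 48  # bytes of payload carried by one ATM cell


def cs_frame(user_pdu: bytes) -> bytes:
    """Hypothetical Convergence sub-layer framing.

    Prepends a 2-byte length field and pads with zero bytes so that the
    result is a whole number of 48-byte segments.
    """
    if len(user_pdu) > 0xFFFF:
        raise ValueError("user PDU too long for the 2-byte length field")
    framed = len(user_pdu).to_bytes(2, "big") + user_pdu
    padding = (-len(framed)) % CELL_PAYLOAD
    return framed + bytes(padding)


def segment(cs_pdu: bytes) -> list:
    """Segmentation: split the framed block into equal 48-byte segments."""
    return [cs_pdu[i:i + CELL_PAYLOAD]
            for i in range(0, len(cs_pdu), CELL_PAYLOAD)]


def reassemble(segments: list) -> bytes:
    """Re-assembly at the receiver: join segments and strip the padding."""
    cs_pdu = b"".join(segments)
    length = int.from_bytes(cs_pdu[:2], "big")
    return cs_pdu[2:2 + length]
```

A 100-byte block, for instance, becomes 102 bytes after framing, is padded to 144 bytes and leaves the SAR step as three segments.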
1.4.2 - The ATM Layer
The ATM layer receives data from the AAL during transmission as illustrated in Figure
1.7 [18]. The ATM layer adds the cell header to the segments that it receives from the
AAL before sending it to the physical layer. The ATM layer is also responsible for
management of the data stream. It ensures that the transmission meets the requirements
promised during the call setup phase. The ATM layer inside the network also provides the
services of Virtual Path and Virtual Channel translation for switching purposes. At the end
nodes of the network this layer is also responsible for multiplexing and de-multiplexing the
cell stream.
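As an illustration of the header that the ATM layer adds to each segment, the following Python sketch packs and unpacks the first four bytes of a cell header. It assumes the User-Network Interface field layout (GFC 4 bits, VPI 8 bits, VCI 16 bits, payload type 3 bits, CLP 1 bit). The function names are hypothetical and the fifth header byte (the HEC) is deliberately left out.

```python
# Field names and widths, most significant field first (UNI layout assumed).
FIELDS = (("gfc", 4), ("vpi", 8), ("vci", 16), ("pt", 3), ("clp", 1))


def pack_header(**values: int) -> bytes:
    """Pack the header fields into the first four header bytes."""
    word = 0
    for name, bits in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
        word = (word << bits) | value
    return word.to_bytes(4, "big")


def unpack_header(header: bytes) -> dict:
    """Recover the header fields from the first four header bytes."""
    word = int.from_bytes(header[:4], "big")
    fields = {}
    for name, bits in reversed(FIELDS):
        fields[name] = word & ((1 << bits) - 1)
        word >>= bits
    return fields
```

A switch performing Virtual Path or Virtual Channel translation would unpack the header, replace the `vpi` and `vci` values, and pack it again before the cell leaves.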
1.4.3 - The Physical Layer
The Physical layer is subdivided into the Transmission Convergence (TC) Sublayer and the
Physical Medium Dependent (PMD) Sublayer as shown in Figure 1.7 [15]. The TC
Sublayer is responsible for adapting the data rate being transmitted to the data rate at
which the ATM channels are running. When data is not available for transmission the TC
Sublayer produces idle cells to fill in the empty slots in the data stream. The PMD
Sublayer handles issues such as bit timing, synchronization and encoding on the physical
medium. The PMD also changes depending on the type of physical medium carrying the
data. For example, the specifications differ depending on whether fiber optical links,
twisted pair copper or coaxial cable is used. The PMD will differ depending on the




Figure 1.8: Framing of data through the layers of the ATM Reference Model (from the
User Protocol Data Unit to the ATM cell in a SONET frame)
The function of the different ATM Reference Model layers is further illustrated in Figure
1.8. This example describes the processing of a variable bit rate type of traffic (see later
for classes of traffic) corresponding to an AAL4 level of support [5]. The User Protocol
Data Unit (U-PDU) is produced by a higher level protocol such as MPEG-2 for video
transmission. The Convergence Sublayer frames the U-PDU within a header and trailer to
create a Convergence Sublayer Protocol Data Unit (CS-PDU). The CS-PDU is then
passed on to the Segmentation and Reassembly sub-layer where it is broken into smaller size
packets. The packets are then framed with a header and trailer to form the Segmentation
and Reassembly Protocol Data Unit (SAR-PDU). Each SAR-PDU is then passed on to the
ATM layer which adds the ATM header to it to form an ATM cell. The ATM cells are
then sent to the Physical layer where they are framed appropriately to be transmitted by
the SONET protocol. The Transmission Convergence Sublayer is responsible for
producing unassigned cells to fill up a SONET frame when enough cells are not available
for filling an entire frame. Note that different classes of traffic use different framing
structures in the AAL (the CS-PDU and SAR-PDU will not be the same) and different
physical transport protocols will be implemented differently in the Physical layer.
1.4.4 - Traffic Classes and ATM Adaptation Layer Types
The CS handles different types of data traffic which are formally specified in the protocol
[17]. Four classes of traffic are specified in the protocol to handle most transmission
formats. Each class of traffic requires a different AAL to handle it and that is defined by
the AAL type. Note that AAL type and Traffic Class are used interchangeably in the
literature.
Class A traffic defines a constant bit rate data flow on a connection-oriented service.
Voice and uncompressed video data are examples of this category. The ATM Forum
AAL1 type traffic is similar to this class.
Class B defines a variable bit rate traffic on a connection oriented service. The
difference between Class A and B is that Class B type traffic has to be delivered within
a given time period. That is, a maximum delay requirement has to be met. Compressed
video is an example of this type of traffic. The ATM Forum AAL2 traffic fits into this
class.
Class C traffic defines a connection oriented service with variable bit rate and no
maximum delay requirements. File transmissions fit into this class of traffic. The
AAL3/4 type traffic (a combination of AAL3 and AAL4) from the ATM Forum and
the AAL5 type fit into this class.
Class D traffic defines a connectionless service. This is the basis of LAN emulation and
datagram type data transmission. Once again AAL3/4 or AAL5 support this type of
traffic.
The traffic classes and AAL types described above have flexible applications and they do
not have to be used in the fixed framework described above [17].
1.4.5 - Cell Delineation
An important function of the PMD is cell delineation which allows the network interface
to determine the boundaries separating cells. Two techniques can be used for that purpose
[5, 7, 14]. When the cells are transmitted within a frame such as SONET the frame
provides pointers within the data stream that point to the beginning of individual cells. The
PMD can then locate these pointers and use them to delineate cells in the data stream. The
second technique makes use of the HEC byte in the cell header to find cell boundaries.
This procedure can be used to separate individual cells both in SONET frames and in







Figure 1.9: State machine used for cell delineation
Figure 1.9 shows a simple state machine that is used to find the boundary of cells in a data
stream. The state machine starts off in the Hunt state where it is looking for a cell
boundary. Each new bit that arrives is used to produce a CRC code and that is compared
to the byte following that bit. If a match is found it means that a cell header has been
found and the system goes into the Pre-Sync state. In the Pre-Sync state the system uses
the same technique as in the Hunt State to locate cells. If δ cells are found at the expected
positions while the system is in the Pre-Sync state the system goes into the Sync state. On
the other hand if α cells with mismatched CRC code are found the system goes back to
the Hunt state. In the Sync state cells are delineated by counting every 53 bytes. If β cells
are found in the Sync state with the wrong CRC code in the header the system goes back
to the Hunt state to try to find the right boundary again. Small values of α, β and δ have
been demonstrated to yield excellent cell delineation characteristics.
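The state machine can be sketched in Python as follows. The sketch rests on several assumptions that are not taken from this text: the header check is the HEC defined in ITU-T I.432 (CRC-8 with generator x^8 + x^2 + x + 1 over the first four header bytes, XORed with 0x55), the three thresholds are named `alpha`, `beta` and `delta`, the counts in the Pre-Sync state are cumulative while the count in the Sync state is of consecutive bad headers, and the default threshold values are purely illustrative.

```python
from enum import Enum


def hec(header4: bytes) -> int:
    """CRC-8 (polynomial 0x07) of the first four header bytes, XOR 0x55."""
    crc = 0
    for byte in header4:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ 0x07) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc ^ 0x55


class State(Enum):
    HUNT = "hunt"
    PRESYNC = "presync"
    SYNC = "sync"


class Delineator:
    """Hunt / Pre-Sync / Sync state machine driven by header check results."""

    def __init__(self, alpha: int = 1, beta: int = 7, delta: int = 6):
        self.alpha, self.beta, self.delta = alpha, beta, delta
        self.state = State.HUNT
        self.good = 0  # matching headers seen in Pre-Sync
        self.bad = 0   # mismatching headers seen in the current state

    def step(self, hec_ok: bool) -> State:
        """Feed the result of one header check and return the new state."""
        if self.state is State.HUNT:
            if hec_ok:  # candidate boundary found
                self.state, self.good, self.bad = State.PRESYNC, 0, 0
        elif self.state is State.PRESYNC:
            if hec_ok:
                self.good += 1
                if self.good >= self.delta:
                    self.state, self.bad = State.SYNC, 0
            else:
                self.bad += 1
                if self.bad >= self.alpha:
                    self.state = State.HUNT
        else:  # State.SYNC
            if hec_ok:
                self.bad = 0
            else:
                self.bad += 1
                if self.bad >= self.beta:
                    self.state = State.HUNT
        return self.state
```

In the Hunt state `step` would be called once per candidate position in the stream, whereas in the other two states it would be called once every 53 bytes with `hec(cell[:4]) == cell[4]` as its argument.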
1.5 - Traffic Management
ATM is a connection based protocol which means that a virtual circuit has to be set up
before data can be transmitted [7]. A node transmitting data must first establish a
connection with each switch in the network that leads to its destination. The node
accomplishes this task by sending out signaling cells that carry the type of traffic that it
needs and the Quality of Service (QoS) it expects. A switch can then accept or reject the
connection based on whether it can provide this QoS or not. If the connection is rejected
the node could try for a lower QoS or a different type of traffic. This exchange of
information and the negotiation that it entails is called a traffic contract [17]. Once the
connection is set up the transmission of data can proceed.
1.5.1 - The Traffic Classes
The traffic contract specifies the class of traffic that will be produced. Note that ABR and
UBR are best-effort types of traffic contract where no guarantees are involved for the
network or the user. This type of service is expected to be sufficient for most
contemporary applications. Different classes are defined to support different types of
traffic [17].
A Constant Bit Rate (CBR) service means that data will be produced and must be
delivered at a constant rate. Cells that can be lost in this type of traffic have the CLP
set to 1.
A Real Time Variable Bit Rate (VBR-RT) service means that the amount of data
transmitted will vary in time but the data has to be transmitted within a bounded delay
and a fixed Cell Delay Variation is expected from the network. In this type of
transmission the network reserves the right to lose cells in the event of congestion.
A Non-Real Time Variable Bit Rate (VBR-NRT) service means that the amount of
data transmitted will vary in time and the user specifies the Cell Transfer Delay
expected from the network. No Cell Delay Variation is specified or expected in this
category.
Available Bit Rate (ABR) service implies that data is transmitted at a minimum cell
rate but cells will be dropped to make place for higher priority services. Also the user
is expected to control the amount of traffic it produces in case the network cannot
handle it.
Unspecified Bit Rate (UBR) service is used by applications that want to use the
remaining bandwidth on a link. Once again cells in this service are lost during
congestion but the application is not required to reduce its rate of transmission.
1.5.2 - The Quality of Service
When a QoS is specified the bandwidth that is used to implement it has to be reserved and
cannot be used for any other purposes. This defeats the purpose of the asynchronous
nature of the protocol. Therefore a traffic contract can be set up without a specified QoS.
This prevents bandwidth from being wasted when a source is not making use of its
allocated resources. When it is specified, the Quality of Service (QoS) will include at least
some of the following attributes [17].
The maximum rate at which users transmit data is called the Peak Cell Rate.
The Sustained Cell Rate is the average amount of traffic which the user will transmit
over a period of time.
The Cell Loss Ratio is the ratio of the number of cells that are dropped in the network
due to congestion or error to the total number of cells transmitted.
The Cell Transfer Delay is the delay that a cell encounters during transmission.
Components of the delay are the propagation delay through the network and delays in
processing by switches in the network.
The Cell Delay Variation is the variance of the Cell Transfer Delay in the network.
The maximum size of a data burst from the user that the network can sustain without
affecting performance is called the Burst Tolerance.
The Minimum Cell Rate is the minimum transmission rate required by the user.
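Three of these attributes can be estimated from a simple measurement, as the Python sketch below suggests. It assumes that the number of cells sent is known and that a delay sample is available for every cell received. The loss figure is normalized by the number of cells sent, and the delay variation is computed as a plain variance, following the definition given above. The function name and the dictionary keys are hypothetical.

```python
from statistics import mean, pvariance


def qos_summary(cells_sent: int, delays: list) -> dict:
    """Estimate loss and delay attributes from one measurement run.

    `delays` holds the transfer delay of every cell that arrived, in any
    consistent time unit; cells that never arrived count as lost.
    """
    if cells_sent <= 0 or not delays:
        raise ValueError("need at least one cell sent and one cell received")
    received = len(delays)
    return {
        "cell_loss_ratio": (cells_sent - received) / cells_sent,
        "cell_transfer_delay": mean(delays),
        "cell_delay_variation": pvariance(delays),
    }
```

Other measures of delay variation, such as a peak-to-peak range, could be substituted without changing the structure of the calculation.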
During connection setup a metasignalling sequence is used during which the user tells the
network what type of service it will need. The network then responds positively if it
determines that it can handle the traffic and refuses the connection otherwise. The user can
then proceed with transmission, try to request a connection of lower QoS or wait until
bandwidth becomes available for its transmission.
1.5.3 - Types of Traffic Management
The performance of ATM networks will determine the success of the protocol and
therefore traffic management has been studied in depth. Traffic
management can take place at different layers of the protocol and even at different
locations in the network as discussed below [17].
Admission control is the simplest form of traffic management. In this scenario a new
channel is allocated if and only if the network determines that it will be able to handle
the additional traffic. However implementation of such a technique requires the precise
knowledge of the behavior of the source and that data is most of the time not available
beforehand. Also the computations involved in determining bandwidth availability are
complex and depend on the behavior of all the other channels in the network (which
changes continuously). These issues make admission control impractical.
Once a channel is allocated for a certain type of traffic and a traffic contract has been
exchanged the source might break the rules and violate the agreement made during the
call setup. This would usually come in the form of producing more cells than agreed
upon or producing longer bursts. The network therefore has to monitor and be able to
take actions against these rogue sources. The action taken could depend on the state
of the network at the time of the violation. A channel could be dropped if traffic is
very high and it is significantly affecting performance of the network or in the case of
light traffic the excess cells could just be let through or dropped.
The flow of cells into the network can also be controlled by letting cells into the
network at a constant rate no matter at what rate the cells are being produced by the
source. This is also called the Generic Cell Rate Algorithm or Leaky Bucket Algorithm. The
principle is that a fixed size buffer is allocated for a given source. Cells are removed
from that buffer at a constant rate and when the buffer overflows the cells are lost.
This has the effect of somewhat controlling an irregular flow of cells.
A priority based approach can also be used to maintain the required Quality of Service
(QoS). This requires maintaining separate buffers for each QoS provided. The buffers
are then emptied so as to maintain the predetermined QoS and cells transmitted under
a lower QoS traffic contract run the risk of being lost during congestion.
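The buffer-based leaky bucket described above can be sketched as a small slotted simulation in Python. The sketch assumes that time is divided into cell slots, that arriving cells are accepted before the buffer is drained in each slot, and that one cell is released per slot by default. The function and parameter names are hypothetical, and the sketch models only the buffering behavior described in the text, not the formal conformance test used for policing.

```python
def leaky_bucket(arrivals: list, bucket_size: int, drain_per_slot: int = 1):
    """Simulate a fixed-size buffer drained at a constant rate.

    arrivals[t] is the number of cells the source produces in slot t.
    Returns (cells released in each slot, total cells dropped).
    """
    level = 0      # cells currently held in the bucket
    dropped = 0
    released = []
    for count in arrivals:
        accepted = min(count, bucket_size - level)
        dropped += count - accepted      # overflow is lost
        level += accepted
        out = min(drain_per_slot, level)  # constant-rate output
        level -= out
        released.append(out)
    return released, dropped
```

With a burst of five cells in the first slot and a bucket of three cells, two cells are lost and the remaining three leave the bucket one per slot, which shows how an irregular flow is smoothed at the cost of some loss.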






Figure 1.10: Common configurations of contemporary networks
Figure 1.10(a) shows the configuration of Ethernet type networks. All the data terminals
share a common line and control of the network is distributed. In this type of network all
the transmitters contend for the common line and the transmitters try to drive the line
whenever they think it is free. The transmitters have to continuously monitor their own
transmission to ensure that another node is not trying to use the line at the same time. A
protocol must be established to determine which transmitter gets the line during conflict.
An example of such a protocol is IEEE 802.3 CSMA/CD. The advantage of this type of
network is that failure of a node on the network does not affect the rest of the terminals.
On the other hand, part of the network will go down if one section of the link fails. The
complicated process involved in transmitting data in such a network prevents it from being
very fast and prevents high efficiency of data transfer.
Figure 1.10(b) shows the Token Ring type of network where the data terminals are
connected through point to point links. The terminals are connected to each other in a
ring. Once again, control of the network is distributed in this scheme. This protocol works
by passing a token around the network. The node that has the token is allowed to transmit
data while the rest of the nodes have to wait for their turn. The disadvantage of this type
of architecture is that it is not very fault tolerant. If one of the links fails, or one of the
nodes fails to transmit the token, the entire network is brought down. Furthermore the time
taken to pass the token around the network can add up to significant transmission delays.
The IEEE 802.5 Token Ring and FDDI are examples of such protocols.
The ATM network uses the configuration shown in Figure 1.10(c). In this network each
node is connected to a central device called a switch through its own point to point link.
The advantage of this architecture is that failure of a node or a link still allows the majority
of the network to function properly. The centralized control scheme of the architecture is
its biggest weakness as failure of the switching structure will bring down the entire
network. Also, the performance of the network is mainly determined by the effectiveness
of the switch. The main subject of this thesis is ATM switches and the next chapter will
take an in depth look at these devices.
1.7 - Local Area Network Emulation (LANE)
LAN emulation [17] encompasses the subject of implementing conventional
connectionless LANs such as Ethernet and Token Ring over ATM backbones. A large
amount of literature has been written on that subject but it is beyond the scope of this
project. It is left to the reader to investigate the subject in more detail if they are
interested to do so.
Chapter 2 - Architecture of ATM Switches
2.1 - Switches and Routers
The ATM protocol has been designed with universal service in mind. Universal in this
context does not only mean integrated services but also universal at the LAN, MAN and
WAN levels. However the new protocol is expected to be implemented at the WAN level
followed by the MAN and finally the LAN level. In this respect the following discussion of
ATM switches will have to be placed in the perspective of Wide Area Networks.
The ATM switch is the basic piece of hardware at the heart of ATM networks. ATM is a
connection oriented protocol which is different from the connectionless environment in
wide use in data networks nowadays. A discussion of ATM switches must therefore start
by contrasting switches to routers [7]. Connectionless packets in this context will be
referred to as datagrams.
In a connectionless network the router is responsible for directing datagrams from a
source to its destination. Routers have to be aware of the status of the entire network to
be able to make the routing decision. Routers also maintain routing tables which determine
what channel a datagram must be routed through to be able to make it to its final
destination. The overhead in maintaining such a table in a large and complex network is a
major drawback of connectionless data communication. The advantage of this scheme
however is that it provides multiple paths for data and therefore can maintain performance
levels even with failure in individual links or nodes in the network.
ATM switches are routers in the connection oriented domain. A switch only needs to
maintain data on the output channel to which an incoming cell must be sent. This is
usually determined by looking at the Virtual Path Identifier and the Virtual Channel
Identifier fields of the cell. This is a direct result of the fact that a connection oriented
protocol assumes the existence of a predetermined path through which data will be routed.
The ATM switch does not have to be aware of the status of the entire network to make
routing decisions, which leads to low implementation overhead, simple routing algorithms
and high speed routing.
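The per-connection state that a switch keeps can be pictured as a small table, as in the Python sketch below. It is a minimal illustration, assuming that entries are keyed on the input port together with the VPI and VCI of the incoming cell, and that each entry holds the output port and the translated identifiers. The class and method names are hypothetical and do not describe the design presented in this thesis.

```python
from typing import Optional, Tuple

Route = Tuple[int, int, int]  # (output port, outgoing VPI, outgoing VCI)


class ConnectionTable:
    """Connection table filled in at call setup and consulted per cell."""

    def __init__(self) -> None:
        self._routes = {}

    def add_connection(self, in_port: int, vpi: int, vci: int,
                       route: Route) -> None:
        """Install an entry when a connection is accepted."""
        self._routes[(in_port, vpi, vci)] = route

    def remove_connection(self, in_port: int, vpi: int, vci: int) -> None:
        """Delete the entry when the connection is torn down."""
        self._routes.pop((in_port, vpi, vci), None)

    def route(self, in_port: int, vpi: int, vci: int) -> Optional[Route]:
        """Return the route for a cell, or None if no connection exists."""
        return self._routes.get((in_port, vpi, vci))
```

Each lookup touches only local state, which is the contrast with a router that has to reason about the whole network for every datagram.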
2.2 - The ATM Switch Fabric
The distinction must be made between the ATM switch and the ATM switch fabric. The
ATM switch is a device that implements the physical layer and ATM layer of the ATM
Reference Model. On the other hand the switch fabric only implements the ATM layer of
the Reference Model [5].
















Figure 2.1: Structure of an ATM switch (CMU: Management & Control Unit)
The ATM switch is made up of different components that interact together to function as
a switching device. This is illustrated in Figure 2.1 [5].
The Input Modules are responsible for implementing the Physical layer of the ATM
Reference Model on the side of the switch where cells are incoming. Typically the
Input Module is responsible for synchronizing to the bit stream, delineating cells in the
input stream, checking for errors and implementing error recovery, and feeding
individual cells to the switch fabric or the Management and Control Unit. The
implementation of this module is very dependent on the Physical medium protocol
used on the input side of the switch.
The Switch Fabric is the device that implements the ATM layer of the Reference
Model. It is responsible for switching cells from an input port to the appropriate
output port. It is also responsible for implementing features such as cell multicasting,
virtual path and virtual channel translation, and prioritizing the flow of traffic from its
inputs to its outputs. The switch fabric feeds its outputs to the Output Modules. Note
that the switch fabric produces valid cells to be fed to the Output Modules and
therefore the switch fabric has to implement CRC generation for the new cell headers.
This function however is not required if the cell header is not modified (i.e. no Virtual
Channel and/or Virtual Path identifier translation is performed).
The Output Modules implement the Physical layer on the outgoing side of an ATM
switch. They receive cells from the switch fabric or the Management and Control Unit
and encode them into a bit stream which in turn is framed for transport over a physical
link. The specific functions of this module are very much dependent on the Physical
Medium Protocol being used on the output side.
The Management & Control Unit as its name implies implements the Management and
Control Planes of the ATM Reference Model. Cells can be sent to this unit either
through dedicated connections or by using some of the bandwidth available on the
regular input channels. A typical function of the Management & Control Unit would
be traffic/congestion management and call setup and disconnect features. Cells out of
this unit can flow through dedicated links or can share bandwidth on the regular data
links. Note that ISDN uses SS7 which would imply the use of dedicated channels for
control and management cell flow as shown in Figure 2.1.
2.3 - Performance Evaluation of an ATM Switch Architecture
The performance of an ATM switch architecture is evaluated according to the following
criteria.
The number of cells that are lost by the switch during periods of congestion determines
the Cell Loss Ratio (CLR) of the switch (as discussed in Chapter 1). The lower the
CLR, the better the switch architecture is considered to be. On the other hand,
different types of traffic will produce different CLRs in different architectures.
Therefore, the CLR of different architectures can only be compared for a given traffic
pattern. In the following discussion a random traffic pattern will be assumed to allow
for uniform comparison between architectures.
The complexity of the hardware and/or software required to implement a switch also
determines the performance of a switch. An overly complex architecture can result in a
switch that is impractical or that cannot be implemented to operate at the high speeds
required by the protocol. Also, high complexity might result in a design that cannot be
implemented in VLSI, resulting in an impractical switch. The ideal ATM
switch will therefore have a low complexity. Unfortunately no standard metric exists
for measuring complexity. Therefore the O(n) metric will be used taking into account
either the amount of interconnection required or the amount of hardware required to
implement an architecture.
The next item of interest, when considering a new architecture, is the amount of buffer
space required to produce an acceptable level of performance (for example a given
Cell Loss Ratio). Most switch architectures can be made to produce incredible results
using large amounts of buffer space. Unfortunately infinite buffer space is not a
practical solution for producing better performance. A switch is considered to be
better as the amount of buffer space required to produce a given performance level is
decreased.
The number of incoming cells that a switch can handle at a time determines the Cell
Loss Ratio of that switch. Some architectures have to drop some of the cells that show
up at their input ports so as to implement the switching properly. This type of feature
usually leads to higher (worse) CLR levels. Looking at how the architecture deals with
congestion of this nature is a good indication of what level of CLR a switch will
produce.
The number of cells that a switch can place on its output ports determines the level of
throughput of that switch. The overall performance of the switch is improved if the
switch can produce a higher throughput on its output ports.
The ATM protocol provides for the ability to multicast cells in the network. A
practical ATM switch must therefore be able to implement such a feature. The ease
with which such a function is implemented in the switch determines its practicality.
The method of routing cells in the ATM switch is also an issue as it often determines
the complexity of the hardware required to implement the switch. Some switch
architectures are self routing in that the cells are routed as they flow inside the switch.
This scheme is the simpler solution to implement. The other category of routing
technique is to use a lookup table to correlate the VCI and VPI in a cell to an output
channel. In addition to lookup tables, this scheme requires that a cell be stored, while
the channel to which it must be routed is looked up. This results in additional hardware
complexity.
The delay experienced by a cell when routed through a switch is also an indication of
the performance of that switch. This is especially of concern in the ATM protocol
which specifies fixed delay requirements for some types of traffic. The delay
introduced by an individual switch increases as the amount of congestion on a channel
goes up. Also, as the amount of buffering space goes up, the delay experienced by
cells is longer as cells might be stored for a longer period before being routed. Taking
this into account, the minimum delay experienced by a cell is the metric used to
evaluate this facet of performance. The minimum delay is the delay which an incoming
cell will encounter if no congestion is happening in the switch, and if the cell is routed
as fast as possible to its output channel. The lower that minimum delay, the better the
performance of the switch.
Internal Blocking is yet another issue that affects switch performance. Cell blocking
happens when more than one cell contends for a link inside the switch fabric leading to
one cell making it through and the other being either buffered or discarded.
Output Blocking is the last feature that affects performance and it is due to more than
one cell contending for an output port. Only one of the cells can make it through,
whereas the other cells may be buffered or discarded depending on the switch fabric
implementation. Obviously, the performance takes a hit if the architecture has to
discard cells during output blocking.
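How often output blocking arises under the random traffic pattern assumed in this discussion can be estimated with a short calculation. The Python sketch below assumes a simple traffic model that is not spelled out in the text: in every cell cycle each of the N input ports independently carries a cell with probability `load`, and each cell picks its output port uniformly at random. Under that model the number of cells aimed at one output in a cycle follows a binomial distribution with parameters N and `load`/N. The function names are hypothetical.

```python
from math import comb


def cells_per_output_pmf(n_ports: int, load: float) -> list:
    """Probability that exactly k cells target one given output, k = 0..N."""
    q = load / n_ports  # chance that a given input sends to this output
    return [comb(n_ports, k) * q ** k * (1 - q) ** (n_ports - k)
            for k in range(n_ports + 1)]


def output_contention_probability(n_ports: int, load: float) -> float:
    """Probability that two or more cells contend for the same output."""
    pmf = cells_per_output_pmf(n_ports, load)
    return 1.0 - pmf[0] - pmf[1]
```

For an 8x8 switch with every input busy, this model gives a contention probability of about 0.26 per output per cell cycle, which indicates why the way an architecture buffers or discards the losing cells matters so much.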
A predefined metric does not exist to measure overall performance of a switch. The
performance issue is further complicated by the fact that different types of traffic result in
different performances for the same switch. Each one of the above factors is therefore
presented in the following discussion and the reader is left to decide which architecture is
better suited for a given application.
2.4 - Classification of ATM Switches
ATM switches are classified according to their architecture and based on their routing
properties [12, 16]. The architecture of a switch usually determines its performance and
complexity and therefore each category of switch must be looked at both in terms of its





An additional factor that determines performance is the buffering scheme used in a switch






Practical switches usually use a combination of architecture and buffering scheme to
produce switches with different performances. In the course of the following discussion it
will become very clear that different applications require different types of performance
and therefore different switch architectures are used to implement them [12].
2.5 - Shared Resource Switches
In Shared Resource Switches [12], the switching fabric is made up of a single resource
which all cells use to be switched to an output port. Use of that shared resource can be
done simultaneously by all incoming cells or each input can get a turn at using the resource
in sequence. Shared Resource Switches come in two flavors.
Shared Buffer Switches
Shared Medium Switches/Time Division Switches
2.5.1 - Shared Buffer Architecture
The Shared Buffer Switch is based on a buffering scheme rather than a switching scheme
and it is therefore treated in the section on buffering later on in this chapter.














Figure 2.2: Shared Medium/Time Division Multiplexed switch architecture
Shared Medium Switches address the fact that all cells that show up at the inputs of a
switch cannot be routed simultaneously [2, 12]. The shared medium in this type of
architecture is usually a high speed bus running at a higher clock speed than the input and
output ports. The internal circuitry of the switch fabric therefore runs faster than the
external ports of the switch. This allows the internal hardware to be able to process each
cell at the inputs serially within the time that it takes for the next wave of cells to show up
at the inputs. This concept is shown in Figure 2.2. This architecture is also called Time
Division because the process described is the conventional Time Division Multiplexing
approach where each input port is given a time slot on the bus.
The advantage of such a scheme is that it is simple to implement for switches whose ports
run at low speeds and for switches that have a small number of input ports. The hardware
complexity of this type of switch is O(NV). The speed at which the internal hardware has
to run is directly proportional to the number of input ports (more input ports mean that the
switch has to run faster - N) and to the speed of the input ports (higher external speed
translates into higher internal speed to handle the same amount of data within the same
amount of time - V). The disadvantage of this scheme is obviously that it is not easy to
scale up.
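The scaling problem can be made concrete with a small calculation, sketched here in Python. The port count, port rate and bus width used in the example are illustrative values chosen for this sketch, not figures taken from a particular switch, and the function name is hypothetical.

```python
def bus_clock_mhz(n_ports: int, port_rate_mbps: float,
                  bus_width_bits: int = 1) -> float:
    """Clock rate a shared bus needs to serve every input once per cell time.

    The bus must carry n_ports * port_rate_mbps in total; moving
    bus_width_bits in parallel on each clock divides the required clock.
    """
    if n_ports <= 0 or bus_width_bits <= 0:
        raise ValueError("port count and bus width must be positive")
    return n_ports * port_rate_mbps / bus_width_bits
```

With 8 ports at 155.52 Mb/s, a bit-serial bus would need to run at about 1244 MHz, while a 16-bit wide bus would need about 78 MHz. Doubling either the port count or the port rate doubles the requirement, which is the N and V dependence described above.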
Internal blocking does not occur in such a switch. Every input gets a chance to make it to
an output port. However, output blocking does take place as more than one cell cannot be
squeezed out in one cell cycle. The architecture can produce a cell on every output within
each cell cycle, if it has to, and therefore has very high throughput.
A low Cell Loss Ratio for such a switch can be maintained with a large buffer, but
utilization of buffer space can be fairly low depending on the buffering scheme (more on
that later). Finally, the delay incurred by a cell will be of the order of at least 1 cell cycle
for the cell switched in the last time slot.
Cell multicasting is a simple feature to implement in this architecture. Every output has
access to every cell on the bus. Therefore, an output port just has to be set up to grab all
cells being multicast to it to implement the function.
The ATOM switch by Suzuki et al. [3, 19] is an example of a Shared Medium architecture.
The shared medium is a bus that routes cells from every input port to every output port.
Each input port is given access to the bus in turn to place a cell on it. Every output
port, on the other hand, monitors the bus at all times and decodes the address of each cell
as it goes by to see whether the cell is destined for it. Cells that are destined for a given
output port are then grabbed and sent out through that port.
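The bus discipline just described can be sketched in a few lines of C++. This is a minimal
behavioural model and not the ATOM implementation: the Cell structure, the use of a
destination bit mask, and the function name are all assumptions made for illustration.
Each input owns one time slot per cell cycle, and every output applies its address filter
to every slot.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical cell: bit j of destMask is set when output j must receive it.
struct Cell {
    bool valid;              // false means the input has nothing to send
    std::uint32_t destMask;
    int payload;
};

// One cell cycle on a time-division bus. inputs[i] is the cell waiting at
// input i. Inputs are given the bus one slot at a time, in order; every
// output watches every slot and copies the cells addressed to it.
std::vector<std::vector<Cell>> runBusCycle(const std::vector<Cell>& inputs,
                                           int numOutputs) {
    assert(numOutputs > 0 && numOutputs <= 32);
    std::vector<std::vector<Cell>> grabbed(numOutputs);
    for (const Cell& slot : inputs) {              // one bus time slot per input
        if (!slot.valid) continue;                 // idle slot
        for (int out = 0; out < numOutputs; ++out) {
            if (slot.destMask & (1u << out))       // address filter at output `out`
                grabbed[out].push_back(slot);
        }
    }
    return grabbed;
}
```

An output that grabs more than one cell in a cycle is experiencing output blocking and
needs a buffer. A cell with several bits set in its mask is copied by several outputs,
which is how multicasting falls out of this architecture.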
As shown in Figure 2.2 the serial bit stream entering the ATOM switch is first converted
to a parallel stream to allow higher transfer rates at a given clock speed. This takes care of
one of the disadvantages of this type of architecture: the internal switch fabric speed does
not have to be increased as much. However, the number of input ports of the
ATOM switch still has to be kept low within individual switches.
The creators of the switch therefore came up with an alternative scheme to make their
architecture scalable. They connect several small ATOM switches into a parallel network
to produce larger switches [3]. Unfortunately, printed circuit boards that can handle very
high speed data are expensive and are very often prey to a whole array of electromagnetic
noise which makes this scheme somewhat impractical unless it is implemented inside a
single IC package.
2.6 - Space Division Switches
In Space Division Switches the resources available to switch cells are divided so as to
allow cells to be switched simultaneously through separate paths [12]. This scheme
employs a divide and conquer solution to solve the problem of concurrently switching
cells. Space Division type switches come in several different forms [2, 12].
Crossbar Switches
Multiple Stage Interconnection Networks
Fully Interconnected Switches
Multiple-Plane Switches









[Figure 2.3 shows a grid of crosspoints with outputs Out0, Out1, Out2, ..., OutN. Legend:
connected node (switch on), disconnected node (switch off).]
Figure 2.3: Topology of the Crossbar switch
The crossbar switch is the simplest space division switch and probably one of the oldest
switch types around. As can be seen in Figure 2.3 the crossbar switch connects every input
to every output via switches that can be turned on and off. To switch an incoming cell to
the appropriate output port, the right switch is turned on and the data can flow straight
from the input to the output.
The hardware required to implement such a switch can run at the same speed as the input
and output ports. Unfortunately this architecture is not very scalable because it has a
hardware complexity of O(N^2) (where N is the number of ports in the switch) as the
number of internal switching nodes required is given by the product of the number of input
ports and the number of output ports.
The regular nature of crossbar switches and the simplicity of their individual internal nodes
(the internal switches) make them excellent candidates for VLSI design [2, 12, 15].
Furthermore, advances in technology are making it more practical to design fairly large
devices of this type.
A centralized control mechanism must also be provided to determine which input will get
connected to which output in the switch. The crossbar switch is therefore not inherently
self routing. At best a separate control mechanism can be provided for each individual
output port which distributes the task more evenly and helps practical implementations.
The delay in switching a cell through this type of circuit can be as low as a fraction of a
cell cycle. Note that the delay is not insignificant. A certain amount of time is still required
to determine which switch to turn on, and to actually turn on that switch. On the other
hand, some form of buffering will be required when output blocking occurs. No internal
blocking occurs when using this architecture.
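The control problem described above can be sketched as a per-output arbiter. The code
below is a simplified model under stated assumptions: the function names are hypothetical,
and the fixed lowest-index-first priority is chosen only to keep the example short (a real
design would more likely rotate the priority).

```cpp
#include <cassert>
#include <vector>

// request[i] is the output wanted by the cell at input i, or -1 for no cell.
// Returns grant[j] = input connected to output j this cycle, or -1 if none.
// Lowest-index-first priority is an illustrative arbitration rule only.
std::vector<int> setCrosspoints(const std::vector<int>& request, int numOutputs) {
    std::vector<int> grant(numOutputs, -1);
    for (int in = 0; in < static_cast<int>(request.size()); ++in) {
        int out = request[in];
        if (out < 0) continue;                    // no cell at this input
        assert(out < numOutputs);
        if (grant[out] == -1) grant[out] = in;    // crosspoint (in, out) turned on
        // otherwise output blocking: the cell has to wait in a buffer
    }
    return grant;
}

// Number of crosspoints in a crossbar with the given port counts.
long crosspointCount(long numInputs, long numOutputs) {
    return numInputs * numOutputs;
}
```

Two cells asking for different outputs never interfere, which is the absence of internal
blocking noted above. Two cells asking for the same output cannot both be granted, which
is output blocking.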
Multicasting is fairly simple to implement as multiple output ports can switch to the same
input port to produce the desired result. As will be seen in the next section, the simplest
crossbar switch (2x2) is the basic building block of Banyan Networks.
The Bus Matrix Switch by Nojima [14] is an example of a crossbar switch. This design
handles the problem of output contention by inserting a buffer at every intersection of the
switching matrix. A filter placed at the input of each buffer determines whether a cell is
destined to that output port or not. A round robin arbitration mechanism then determines
which buffer gets to place a cell on an output port. This scheme eliminates the need for a
centralized control mechanism.
The O(N^2) complexity of the architecture still limits the size of a practical Bus Matrix
Switch. A Multiple Stage Interconnection architecture with a number of such switches is
therefore proposed as a solution for larger switch matrices.














Figure 2.4: Architecture of a Multiple Stage Interconnection Network
Multiple-Stage Interconnection Networks were developed to produce an efficient
architecture for Central Office switching applications that need to handle a large number
of ports [17]. A multistage network is a regular arrangement of small switch fabrics which
are interconnected according to a predefined scheme to produce a larger switching entity.
Figure 2.4 shows a multistage interconnection network.
Note that each stage in the network is composed of 2x2 crossbar switches. Every time a
cell needs to be routed through the switch each 2x2 node is set up to either pass data
straight through or swap the links, causing the cells to be eventually delivered to the
right output.
The arrangement shown in Figure 2.4 is called a Banyan network and its most appealing
feature is that it has a hardware complexity of O(N log2(N)) [2]. Another advantage of this
architecture is that it can be designed to be self-routing leading to even less hardware
overhead.
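Self-routing can be illustrated with a small model in which each 2x2 node looks at one bit
of the destination address. The sketch below uses an omega (shuffle-exchange) network as a
stand-in for the Banyan class; this wiring is an assumption and may differ from the
arrangement drawn in Figure 2.4. The function and variable names are hypothetical. The
model also reports when two cells want the same internal link.

```cpp
#include <cassert>
#include <vector>

// dest[i] is the output requested by the cell on input line i, or -1 if idle.
// Returns true when every cell gets through, false when two cells contend for
// the same internal link. On success arrivedAt[i] is the line where the cell
// from input i ends up (entries for idle inputs are meaningless).
bool routeOmega(const std::vector<int>& dest, int stages, std::vector<int>& arrivedAt) {
    const int n = 1 << stages;
    assert(static_cast<int>(dest.size()) == n);
    std::vector<int> pos(n);
    for (int i = 0; i < n; ++i) pos[i] = i;
    for (int s = 0; s < stages; ++s) {
        std::vector<bool> linkBusy(n, false);
        for (int i = 0; i < n; ++i) {
            if (dest[i] < 0) continue;
            // perfect shuffle: rotate the line number left by one bit
            int shuffled = ((pos[i] << 1) | (pos[i] >> (stages - 1))) & (n - 1);
            int bit = (dest[i] >> (stages - 1 - s)) & 1;  // routing bit, MSB first
            int next = (shuffled & ~1) | bit;             // 0 = upper, 1 = lower output
            if (linkBusy[next]) return false;             // internal blocking
            linkBusy[next] = true;
            pos[i] = next;
        }
    }
    arrivedAt = pos;
    return true;
}

// Number of 2x2 nodes: N/2 per stage and log2(N) stages.
int elementCount(int stages) { return (1 << stages) / 2 * stages; }
```

No central controller is involved: each node only needs the one address bit that belongs
to its stage. In this model an 8 port network uses 12 nodes, against 64 crosspoints for an
8x8 crossbar. Cells from inputs 0 and 4 addressed to outputs 0 and 1 collide in the first
stage even though their destinations differ, which is the internal blocking discussed next.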
The modular and regular nature of this switch type makes it very suitable for VLSI
implementation. On the other hand, the number of interconnections crisscrossing each
other in large networks can become an implementation problem. This is in fact one of the
main reasons why the size of actual switches has been limited to only a small number of
ports.
Interconnection networks are prey to both internal and external blocking. Internal blocking
can be remedied in this case by providing multiple redundant paths from the input to the
output. Internal blocking can also be prevented by providing buffering inside each 2x2
element to hold cells that are contending for a common link. Another solution to internal
blocking is to provide a sorting network in front of the Banyan network to sort the cells in
such a way that they do not have to contend for the same link inside the switch [9]. This
solution is however not 100% effective. Unfortunately, all these solutions also lead to
additional hardware complexity.
Output blocking can be remedied by the use of buffering at the outputs [17]. Another
solution used to remedy both internal and output blocking is to re-circulate cells within the
switch [2, 9]. The idea is that if a cell cannot be routed through the switch the first time
around it can be sent through the switch to another port and then fed back into the switch
during the next cell cycle. This eliminates the need for internal buffers which can be costly
in terms of hardware. This solution is not very effective during periods of high congestion
and could actually degrade the performance of the switch significantly.
Internal blocking leads to the fact that interconnection networks may not be able to
provide the maximum throughput possible. This is due to the fact that some cells destined
to an output port may be stuck somewhere inside the network leading to under-utilization
of that port.
The delay through such a network can also be fairly large as cells have to travel through
each stage of the network before getting to the output port. Note that as the number of
inputs increases the number of stages in the network has to go up too and therefore the
delay incurred by the cells gets larger.
Multicasting can be implemented fairly simply in such an architecture. This can be done by
allowing the 2x2 nodes to implement a single-input-to-both-outputs mode which would
effectively be multicasting inside of the individual nodes. However, multicasting in such a
device can lead to an increase in internal blocking, which once again leads to degradation












Figure 2.5: Simplified structure of the Starlite switch architecture
The Starlite [9] switch is a good example of a Multiple Stage Interconnection Network
design. A conceptual view of the switch fabric is shown in Figure 2.5. Starlite uses a
Batcher sorting network followed by a Banyan network to eliminate internal blocking and
provide self routing characteristics. The problem of output blocking is handled by
providing a trap network in between the Batcher and Banyan stages of the switch.
The function of the trap network is to detect cells that have the same output address. One
of the cells is then allowed to go through the Banyan network whereas the other cells that
would have caused contention at the output are re-circulated through the sorting network.
Buffering in the feedback network allows the switch to handle cells that cannot be
re-circulated right away. An aging mechanism with the help of the sorting network also
allows for cells to be switched in order of arrival even when they are re-circulated.
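The sort-then-trap idea can be sketched in software as follows. This is a minimal
behavioural illustration, not the Starlite hardware: the Cell structure, its age field and
the function name are assumptions, and a library sort stands in for the Batcher network.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Cell {
    int dest;   // requested output port
    int age;    // cell cycles already spent re-circulating (hypothetical field)
};

// Sort, then trap: the first cell of each run of equal destinations goes on
// to the Banyan stage; the others are sent back to be re-circulated.
void sortAndTrap(std::vector<Cell> cells,
                 std::vector<Cell>& toBanyan, std::vector<Cell>& recirculated) {
    std::stable_sort(cells.begin(), cells.end(), [](const Cell& a, const Cell& b) {
        if (a.dest != b.dest) return a.dest < b.dest;
        return a.age > b.age;                  // the oldest cell wins the output
    });
    toBanyan.clear();
    recirculated.clear();
    for (std::size_t i = 0; i < cells.size(); ++i) {
        bool duplicate = i > 0 && cells[i].dest == cells[i - 1].dest;
        if (duplicate) {
            ++cells[i].age;                    // it has waited one more cycle
            recirculated.push_back(cells[i]);
        } else {
            toBanyan.push_back(cells[i]);
        }
    }
}
```

Because older cells sort ahead of younger ones addressed to the same output, cells leave
in order of arrival even after several trips through the feedback path.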
2.6.3 - Multiple-Plane Switches
Figure 2.6: Architecture of Multiple-Plane Switches
Multiple plane switches [17] try to remedy the problem of internal blocking by providing
multiple paths for a cell to make it to an output. This architecture usually uses one of the
other Space Division type configurations and then uses independent parallel planes of
switches to route cells. If two cells are detected that will lead to internal blocking in the
switch fabric each cell is routed through a different plane. This type of switch is shown in
Figure 2.6.
The hardware complexity of such a scheme is fairly high. Coverage of all internal blocking
can only be achieved when the number of planes matches the number of inputs. However,
the overall hardware complexity can be lower than that of other switch types if the
individual planes have very low complexity, as in a Banyan scheme.
A tradeoff can be achieved whereby the number of planes is reduced leading to less
coverage but still providing adequate performance characteristics in practical situations.




Figure 2.7: Illustration of a Fully Interconnected architecture
The fully interconnected switch architecture [2] represents a class of switches where every
input is permanently connected to every output in the switch as shown in Figure 2.7.
Another term used to describe this type of switch is the disjoint path topology meaning
separate paths for data. The principal advantage of this architecture is the elimination of
internal blocking inside the switch fabric. The issue of output blocking is still present in
such an architecture and it is remedied by the usual solution of output buffering.
The throughput of such an architecture is theoretically very high as every input can be
routed to an output. In practice however, the hardware complexity of such a switch is of
the order of O(N^2) [2] as every additional input requires N additional links, one to each
output. The hardware complexity has limited the size of switches of this nature.
The delay of a cell through such a network is minimal as in the ideal case a cell can enter
the switch and leave it instantaneously without requiring any switching. This architecture
therefore provides better delay properties than a crossbar switch.
Multicasting is another non-issue in this architecture as every input gets to every output.
The disadvantage of this switching scheme is the fact that the performance of the switch is
mainly determined by the buffering scheme in use. As a matter of fact the buffering scheme
potentially has to be able to handle cells arriving on every input line simultaneously,
giving the buffering scheme a hardware complexity of O(NV). Cells are lost
if the buffer is not equipped to handle all the traffic.
The Knockout Switch by Yeh [23] is a fully interconnected architecture which was
implemented in practice. The problems mentioned previously are handled in that
implementation by providing a multiplexing scheme that gradually reduces all the data
destined to an output to a smaller number of cell streams. The creators of the switch
demonstrated that performance degradation in a situation with uniform traffic is not as bad
as expected as long as an intelligent reduction of the number of incoming cells is
implemented [2, 23].
For example, reducing a large number of input streams to only 8 simultaneous cell streams
will lead to a cell loss ratio of only 10^-6 for uniform traffic [2]. Non-uniform traffic will
require the reduction to be less significant to achieve similar performance. The output
buffer in this case still needs to be able to handle 8 cell streams simultaneously.
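The order of magnitude quoted above can be checked with a simple probability model. The
sketch below assumes uniform traffic in which every input independently carries a cell
with probability equal to the offered load, addressed to any given output with probability
1/N. The number of cells arriving for one output in a cell cycle is then binomial, and any
cells beyond the number accepted are lost. The function name, and the load of 0.9 used in
the usage note, are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>

// Fraction of cells lost at one output when at most `accepted` of the cells
// arriving for it in a cell cycle are kept and the rest are discarded.
double knockoutLoss(int numInputs, int accepted, double load) {
    assert(numInputs > 1 && accepted >= 0 && load > 0.0 && load <= 1.0);
    const double p = load / numInputs;     // P(an input sends a cell to this output)
    double expectedLost = 0.0;
    for (int k = accepted + 1; k <= numInputs; ++k) {
        // log of C(N, k) p^k (1 - p)^(N - k); lgamma avoids overflow in C(N, k)
        double logProb = std::lgamma(numInputs + 1.0) - std::lgamma(k + 1.0)
                       - std::lgamma(numInputs - k + 1.0)
                       + k * std::log(p) + (numInputs - k) * std::log1p(-p);
        expectedLost += (k - accepted) * std::exp(logProb);
    }
    return expectedLost / load;            // lost cells per offered cell
}
```

With 64 inputs, 8 accepted cell streams and an assumed load of 0.9, this model gives a loss
below 10^-6, in line with the figure quoted above. Accepting all 64 streams gives zero loss
in the reduction stage, leaving the output buffer as the only place where cells can be lost.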
2.7 - Buffering Schemes
Practical switches cannot be implemented without some form of buffering. The issue of
buffering determines both cell loss ratio performance and implementation complexity [12,
17]. Furthermore, the utilization of a given buffering scheme is fundamental in determining
the effectiveness of an architecture. Utilization has an added importance in practical
switches as the amount of memory available for buffering is finite and very often determines
the cost of a switch. High utilization and low complexity are therefore what has to be
traded off when implementing a practical switch.









Figure 2.8: The Input Buffering Scheme
Input buffering is the simplest of all buffering schemes. It consists of buffering each cell
one at a time as they are obtained from the data channels as shown in Figure 2.8. A first in
first out scheme is then used to feed data to the switching fabric. When using such a
buffering scheme, the cells are held in the buffer until they can be routed through the
switch. This is an effective way of preventing internal or output blocking, as a simple
algorithm can be used to determine which cells will lead to problems and then hold them
back in the buffers.
Furthermore, as only one cell needs to be written into the buffer at a time the speed of the
hardware is the same as that of the input channels. This buffering scheme obviously has
very low hardware complexity and therefore can be implemented easily and cheaply.
The principal disadvantage of the Input Buffering scheme is the Head Of Line (HOL)
problem [2, 11, 17]. This problem is a direct result of the fact that this scheme remedies
blocking in the switch by holding cells back and relies on a First In First Out scheme. A
cell that is being held at the head of the buffer (because it might cause blocking inside the
switch) also holds back any cell behind it, even one that could have been routed without
problems. This effectively limits the throughput of such a scheme to around 60%.
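The throughput limit can be reproduced with a short simulation. The sketch below is a
simplified model under stated assumptions: every input queue is always full (saturation),
destinations are uniformly random, and each output picks one of the cells contending for it
at random. The function name and parameters are hypothetical.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Saturation throughput of FIFO input buffering, as a fraction of the output
// capacity. Per cycle each output serves one of the head cells contending for
// it; the losers stay at the head and block whatever is queued behind them.
double holThroughput(int numPorts, int cycles, unsigned seed) {
    assert(numPorts > 0 && cycles > 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> pickOutput(0, numPorts - 1);
    std::vector<int> head(numPorts);           // destination of each head cell
    for (int& h : head) h = pickOutput(rng);
    long served = 0;
    for (int c = 0; c < cycles; ++c) {
        std::vector<std::vector<int>> contenders(numPorts);
        for (int in = 0; in < numPorts; ++in) contenders[head[in]].push_back(in);
        for (int out = 0; out < numPorts; ++out) {
            if (contenders[out].empty()) continue;
            std::uniform_int_distribution<int> pickWinner(
                0, static_cast<int>(contenders[out].size()) - 1);
            int winner = contenders[out][pickWinner(rng)];
            head[winner] = pickOutput(rng);    // next cell in that queue moves up
            ++served;
        }
    }
    return static_cast<double>(served) / (static_cast<double>(numPorts) * cycles);
}
```

Calling holThroughput(16, 20000, 1) should return a value close to 0.6, even though the
switch fabric itself is never the bottleneck in this model. The whole loss comes from
cells stuck behind a blocked head cell.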
The solution to this problem is to use a first in and multiple out scheme to let cells out of
the buffer as demonstrated by Karol et al. [11]. This means that each cell enters the buffer
in order but cells can come out of the buffer from multiple locations. The Head Of Line
problem can thus be somewhat remedied. The performance of this scheme
can be increased to 90% by providing output from up to three locations at the head of the
buffer. Note that this scheme leads to a somewhat higher implementation complexity
but it is still fairly simple to implement such a buffer. On the other hand, the Head Of Line
problem can never be eliminated completely as the implementation of such a scheme would be
impractical.
The utilization of buffer space in this scheme is low as congestion on one input line can
lead to overflow of one set of buffers whereas all the other buffers have space waiting to
be filled. The Head Of Line problem further accentuates the low utilization factor as a cell
stuck at the head of the line causes the buffer to overflow much sooner than it should have
leading to fast degradation in Cell Loss Ratio characteristics.
2.7.2 - Output Buffering
Output buffering is the reverse of input buffering in that cells are buffered at the output














Figure 2.9: The Output Buffering Scheme
Output buffering is illustrated in Figure 2.9. Although the output buffering scheme seems
to be similar to input buffering some very subtle differences exist between the two. The
main difference is the absence of the Head Of Line problem in output buffering. Cells that
are buffered at the output and cannot be transmitted right away are not holding back other
cells that could have been switched during that time.
In the absence of the HOL effect, the throughput of an output buffered scheme is very
high reaching a maximum of 1 cell per cycle [2, 11]. Furthermore, additional hardware is
not required to solve throughput problems.
The main drawback of output buffering is the fact that its buffer utilization is fairly low.
An output port which has high congestion might see its buffer overflow whereas all the
other ports in the switch have plenty of free space. This problem however is not an issue if
enough buffering is provided to take care of any type of traffic. The following section on
Shared Buffering schemes presents what is considered to be a subset of output buffering
and that solves the problem of low utilization.
A second drawback of output buffering is that the buffer has to be able to handle multiple
inputs as at any one time more than one cell may be destined to it. This requires the
design of more complex hardware either to handle more cells at a time or to prevent more
than one cell from going to an output at the same time. The latter solution is less attractive
as most techniques used to implement it can very quickly degrade to produce very high
Cell Loss Ratios. Multi-ported memory schemes are fairly common in other applications
and they can be adapted to fit the requirements of this buffering scheme.
2.7.3 - Shared Buffering
The Shared Buffering scheme as its name indicates uses a centralized memory to store all
incoming cells [2, 6, 8, 17]. The cells destined to each port are then selected from the
central memory and sent to the appropriate output. This scheme produces the most






Figure 2.10: The Shared Buffer Scheme
The Shared Buffering concept is illustrated in Figure 2.10. This architecture maximizes
utilization as all the input channels share the same buffer space. The amount of buffer
space taken up by each channel can grow and shrink as congestion on the channel
changes. This is possible because in a practical situation all channels do not become
congested at once, and therefore the space that was used for one channel can be
dynamically reallocated to another channel as the need arises.
The advantage represented by dynamic allocation of buffer space also leads to the
disadvantage of shared buffering schemes. When an output is facing a heavy burst of
traffic, it might take up more output buffers than anticipated by the designers of the
switch. This leads to starvation of buffer space for the other output ports. This can, in
turn, lead to worse Cell Loss Ratio parameters for the switch as a whole while providing
better performance for a few output ports.
The other disadvantage of this buffering scheme is that the hardware complexity of such a
switch is very high. The buffer has to be able to accommodate simultaneous data incoming
from every input port and outgoing to every output port. Hardware complexity can be as
high as O(NV) and the buffer has to be multi-ported [2, 17].
Furthermore each cell in the buffer must be kept track of individually by using pointers and
separate queues must be maintained for each port to keep track of the order of arrival and
departure of cells. This extra overhead makes the implementation of such a switch more
complicated than most other schemes that we have looked at so far.
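The bookkeeping described above (a common pool of cell slots, pointers to individual cells,
and a separate queue per port) can be sketched as follows. This is an illustrative data
structure only: the class and member names are hypothetical, cells are reduced to integers,
and the concurrent multi-port access needed in hardware is not modelled.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Shared cell memory: one pool of slots, a free list, and one FIFO of slot
// indices per output port.
class SharedBuffer {
public:
    SharedBuffer(int numSlots, int numOutputs)
        : cells_(numSlots), queues_(numOutputs) {
        for (int s = 0; s < numSlots; ++s) freeSlots_.push_back(s);
    }

    // Stores a cell for `output`; returns false (cell lost) when the pool is full.
    bool enqueue(int output, int cell) {
        if (freeSlots_.empty()) return false;
        int slot = freeSlots_.back();
        freeSlots_.pop_back();
        cells_[slot] = cell;
        queues_[output].push_back(slot);   // records arrival order for this port
        return true;
    }

    // Removes the oldest cell queued for `output`; returns false when none waits.
    bool dequeue(int output, int& cell) {
        if (queues_[output].empty()) return false;
        int slot = queues_[output].front();
        queues_[output].pop_front();
        cell = cells_[slot];
        freeSlots_.push_back(slot);        // the slot is now free for any port
        return true;
    }

    int freeCount() const { return static_cast<int>(freeSlots_.size()); }

private:
    std::vector<int> cells_;
    std::vector<int> freeSlots_;
    std::vector<std::deque<int>> queues_;
};
```

The same structure shows both sides of dynamic allocation. Any port can use a slot freed
by any other port, but a burst to one port can also consume the entire pool and cause
cells for the other ports to be dropped.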
2.8 - The Design of an ATM Switch
The first step in designing an ATM chip is to target the market that the chip will be used
in. As can be seen from the earlier discussion, each switch fabric architecture produces
different performance at different complexity points. On the other hand, an attempt to
maximize a single given performance parameter in an ATM switch only leads to producing
very poor characteristics in other areas.
The type of traffic that the switch will encounter, when used in practice, is a good starting
point for determining the type of architecture that will be required. For example, if bursty
traffic with long bursts and long delays between bursts is expected (such as file transfer), a
centralized memory scheme would yield the best CLR value. However, this might come at
a high price in terms of hardware complexity. If the designer is dealing with low
congestion, short bursts and a high number of ports, then an input buffering scheme with a
crossbar fabric might be practical.
Ultimately the performance of an ATM switch comes down to two factors: the CLR
performance parameter for a given type of traffic and the hardware complexity (which
ultimately leads to the cost) to implement the switch.
A switch can be designed to accommodate all traffic types well while showing some
degradation in performance during periods of stress. This might be acceptable as
economies of scale in producing such a device might make it cost effective. This in turn
will increase its appeal to a much larger market. This is usually the course that has been
taken in current development projects as specific applications where ATM is the
technology of choice have not been clearly identified yet. A general purpose switch must
therefore attempt to optimize as many performance parameters as possible to produce the
best price/performance tradeoff.
2.9 - Architecture of the HiPer ATM Switch
The HiPer ATM switch is a general purpose switch designed to provide optimal
performance for all traffic types while maintaining low hardware complexity. The name
HiPer comes from High Performance, which describes the goal of
this design. HiPer is based on a fully interconnected architecture such as the
Knockout Switch without the disadvantage of requiring cells to be discarded at random.
Output buffering is used to store cells that cannot be switched immediately as output
buffering is simple and it is a fairly effective buffering scheme.
The principal aim of this thesis is to develop an architecture and implement it in VLSI.
Therefore, a low hardware complexity is used as a starting point for the design. A close
second design priority is good performance under generic traffic conditions. This is due to
the fact that a new architecture has no value if it does not perform well. As will be seen











Figure 2.11: Architecture of the HiPer ATM Switch
The HiPer ATM Switch is a fully interconnected output buffered architecture as illustrated
in Figure 2.11. The switch fabric uses a permanent link to connect every input port to
every output buffer as demonstrated by the Interconnection network in Figure 2.11. Each
cell arrives at the switch and is held in the Input Cell Buffers until the complete destination
address (the Virtual Path and Virtual Channel Identifiers) is available. The address
obtained from the incoming cell is looked up in a Switching Table to produce a bit pattern
that drives a Switching Tree. Once the address has been obtained the cell is allowed to
leave the Input Cell Buffer through the Switching Tree onto the Interconnection Network
and then into the Output Cell Buffer.
Note that each output of every Switching Tree in the switch is connected to the
appropriate Output Cell Buffer by a permanent link. Looking at Figure 2.11, the Switching
Tree connected to input port 0 (I0) has a connection to every Output Cell Buffer in the
switch (follow the arrows). Also, notice that each Output Cell Buffer receives a separate
input from each Switching Tree in the system.
The Output Cell Buffers are designed to handle multiple incoming cells simultaneously and
therefore no output contention occurs in the switch fabric. Cell loss in this architecture can
only happen when the output buffers fill up and overflow. The output buffers are also
designed to pass cells out as they come in if no other cells are waiting in the buffer. Note
that the architecture is not self routing and therefore a small delay of a fraction of a cell
cycle is encountered to account for the time that an address lookup and the cell switching
takes to be completed.
The function of cell multicasting is provided by the Switching Trees which are
implemented with modified de-multiplexers. As the details of the implementation of cell
multicasting were not available when this work was started, a scheme was devised to
implement it in the switch. This scheme determines whether a cell is to be multicast and
where it is to be multicast to from the Virtual Channel Identifier (VCI) and the Virtual
Path Identifier (VPI) fields of a cell.
When a VPI/VCI combination is matched in the Switching Table it produces a bit pattern
that can turn on any combination of outputs in the Switching Trees. Therefore an
incoming cell with the right VPI/VCI combination can be sent to multiple output ports if
the routing table has the right bit pattern in it. This provides an elegant way of
implementing cell multicasting as only the entry in the routing tables must be changed to
enable multicast of cells and then turn that feature off.
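The lookup can be sketched in software as a table that maps a VPI/VCI pair to an output
bit pattern. This is an illustration of the idea only and not the hardware Switching Table:
the class name, the field widths and the use of an associative container are assumptions.
Bit j of the pattern enables output j of an 8 output Switching Tree.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical switching table: (VPI, VCI) -> 8 bit output pattern.
class SwitchingTable {
public:
    void setRoute(std::uint16_t vpi, std::uint16_t vci, std::uint8_t outputMask) {
        table_[std::make_pair(vpi, vci)] = outputMask;
    }

    // Returns the output ports the cell is copied to; empty if no entry matches.
    std::vector<int> lookup(std::uint16_t vpi, std::uint16_t vci) const {
        std::vector<int> outputs;
        auto it = table_.find(std::make_pair(vpi, vci));
        if (it == table_.end()) return outputs;
        for (int port = 0; port < 8; ++port) {
            if (it->second & (1u << port)) outputs.push_back(port);
        }
        return outputs;
    }

private:
    std::map<std::pair<std::uint16_t, std::uint16_t>, std::uint8_t> table_;
};
```

A unicast connection is an entry with a single bit set. Rewriting the same entry with
several bits set turns it into a multicast connection, and restoring the single bit turns
multicasting off again, with no change to the data path.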
Finally priority control is not implemented in the HiPer ATM switch as it would add to the
hardware complexity of the VLSI implementation without adding any more substance to
this project. However, a fairly simple scheme is described to implement priority control in
Chapter 6.
The first thing that comes to mind on the subject of fully interconnected architectures is
the high hardware complexity of such a scheme. As mentioned earlier the complexity of
such a scheme is of the order of O(N^2) for the interconnection network that is required in
the implementation. This complexity estimation is a very valid issue and an excellent
argument against such an architecture in a multiple-chip implementation running on a
printed circuit board. The amount of electromagnetic noise generated by such a scheme
and the number of printed circuit layers required to implement the crisscrossing
interconnections make it very difficult to implement at the Printed Circuit Board level.
However, at the VLSI level within a single chip, the cost of the O(N^2) complexity can be
reduced. The only issue in the VLSI implementation is the capacitance and crosstalk between
metal layers. This problem can be reduced significantly by following some strict design rules.
Furthermore, using large buffers to drive the capacitance on the lines only adds additional
delays to the transmission of the signals without significantly increasing hardware
complexity. The availability of multiple layers of metal in VLSI also makes for a very
effective way of further reducing complexity. As will be shown in the next few chapters
(particularly Chapter 5), the interconnecting links and the associated driving buffers are a
small component of the total area of such a chip. As a matter of fact, the size of the chip
will be shown to be mostly determined by the size of the output buffers.
The next issue in such an architecture is the output buffers themselves. The scheme
described requires a multi-ported memory system. The buffer must support multiple writes
and a read operation concurrently. The key to making this design work was to develop
such a memory system using a fairly simple algorithm. This algorithm, which makes use of
a three-state state machine, is described in Chapters 4 and 5.
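The required behaviour (several writes and one read per cell cycle, with loss only on
overflow) can be captured in a small behavioural model. The sketch below is not the state
machine of Chapters 4 and 5: the class name and its interface are hypothetical, cells are
reduced to integers, and capacity is assumed to count the cells held over to the next cycle.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Behavioural model of one output cell buffer. In a cell cycle it accepts up
// to one cell from every input and sends at most one cell out.
class OutputCellBuffer {
public:
    explicit OutputCellBuffer(int capacity) : capacity_(capacity), lost_(0) {}

    // arrivals: cells delivered by the interconnection network this cycle.
    // Returns true and sets `sent` when a cell leaves on the output link.
    bool cycle(const std::vector<int>& arrivals, int& sent) {
        for (int cell : arrivals) queue_.push_back(cell);   // concurrent writes
        bool hasOutput = !queue_.empty();
        if (hasOutput) {                                    // single read
            sent = queue_.front();    // a cell arriving at an empty buffer
            queue_.pop_front();       // passes straight through
        }
        while (static_cast<int>(queue_.size()) > capacity_) {
            queue_.pop_back();                              // overflow: cell lost
            ++lost_;
        }
        return hasOutput;
    }

    int lost() const { return lost_; }
    int occupancy() const { return static_cast<int>(queue_.size()); }

private:
    int capacity_;
    int lost_;
    std::deque<int> queue_;
};
```

In this model a buffer of 8 cells absorbs one cycle in which all 8 inputs send to the same
output without losing anything, and cells are only dropped when such bursts persist.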
The design of the switch is kept to a small scale because designing a larger switch does
not prove anything as far as the thesis is concerned. Therefore an 8x8 port switch
architecture with 8 output cell buffers, each of which can hold 8 cells, is designed and
simulated. The hardware is designed to be simple to scale to much larger sizes without
taking a hit in complexity.
The principal theme of this thesis is architecture. The whole point of this work is to
demonstrate the process by which an ATM switch architecture is developed and then
implemented. Before setting out to design the new switch, the performance of the selected
architecture must be studied. Obviously, if performance is low a market might not exist for
the device, and there is no point in even setting out to design it. Chapter 3 will look at an
effective way of carrying out performance analysis using the C++ object oriented
programming language.
The high complexity of ASIC chips makes it impractical to design hardware directly using
digital gates. The task would be tremendous and maybe even impossible to accomplish.
An intermediate but crucial step towards producing a new chip is to design a VHDL
model that describes the inner workings of the hardware. The VHDL model of the HiPer
switch will be the subject of Chapter 4. The circuit level design and the actual VLSI layout
are looked at in Chapter 5. Chapter 6 will recap the work performed in this project and
suggest paths to the future.
Chapter 3 - Performance Analysis Using C++
3.1 - The Object Oriented Paradigm
The Object Oriented Paradigm is one of the major categories of software engineering
today. The concept of Object Oriented Programming (OOP) is a major breakthrough in
the way that software is perceived and implemented. OOP was initially identified to be
very suitable for simulation type applications and this chapter will demonstrate that
nothing is more true.
3.1.1 - What is an Object ?
OOP uses the concept of objects to represent entities in a program. Objects are
encapsulated data structures that have associated with them a set of methods. Objects
communicate with each other using messages which are passed to each other using the
methods that the object understands. An object and its associated methods are declared
and implemented in a class. A class is a data type whereas an object is an instantiation of a
particular class.
To illustrate this concept more clearly an object can be defined to represent a car. The car
object will encapsulate other objects such as the engine, the exhaust system, the electrical
system and so on. A user of the Car object communicates with it by passing it messages
that the Car object understands.
For example some of the methods that the Car object would understand would be Start
Engine, Turn Left, Turn Right and Increase Speed to name a few. Note that the Car object
is made up of other objects (Engine, Exhaust, Electrical and so on) but an object
interacting with Car does not see these objects as they are encapsulated within the Car and
they are hidden by it. Objects exchanging messages with Car may be aware of the
existence of these additional objects but they cannot access them unless Car gives them a
method to do so.
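The Car example can be written down directly in C++. The sketch below is illustrative
only: the class and method names are hypothetical and the behaviour is reduced to the
minimum needed to show encapsulation.

```cpp
#include <cassert>

class Engine {
public:
    Engine() : running_(false) {}
    void start() { running_ = true; }
    bool running() const { return running_; }
private:
    bool running_;
};

class Car {
public:
    Car() : speed_(0) {}
    // Messages that a Car understands.
    void startEngine() { engine_.start(); }
    void increaseSpeed(int delta) {
        if (engine_.running()) speed_ += delta;   // no effect until the engine runs
    }
    int speed() const { return speed_; }
    // The only view of the engine that Car chooses to offer to other objects.
    bool engineRunning() const { return engine_.running(); }
private:
    Engine engine_;   // encapsulated: users of Car cannot reach it directly
    int speed_;
};
```

A user of Car can start the engine and change the speed, but has no way of calling
methods on the Engine object itself because it is a private member of Car.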
3.1.2 - Polymorphism
Also it is very likely that the Car object will not implement the method Increase Altitude
(cars do not fly!!). On the other hand, an Airplane object would respond to that method.
Both the Airplane object and the Car object will however understand the methods Increase
Speed and Turn Left. This is a simple illustration of the concept of polymorphism where
the same method is understood by multiple different objects. Also, the same method might
be implemented differently in each class.
Another example of polymorphism would be a group of Transportation objects comprised
of Car objects, Train objects, Airplane objects and Boat objects. All these objects would
respond to the same methods such as Turn Left and Slow Down. The actual algorithm
used to implement these methods in each class would probably differ, but the methods
would still be the same. All the classes could have been derived from a common ancestor.
In this case the common ancestor would be the Transportation class which is an Abstract Class.
3.1.3 - Abstract Classes and Inheritance
An Abstract Class is a class that cannot be instantiated on its own. Rather, the Abstract
Class is used to derive new classes whose objects inherit the methods and characteristics
of their ancestor. This technique comes in handy when a number of classes share the same
characteristics, but these characteristics are not enough to make up an object on
their own, as in the previous example.
A Display Speed method could be implemented in the Transportation class. The Car class,
Train class, Airplane class and so on could then be derived from Transportation,
which would lead them to inherit the method Display Speed from the abstract class. Each
class could also override the implementation of Display Speed if the method did not meet
its particular requirements.
The advantage of designing software using this technique is that the methods of a new
class do not all have to be created from scratch. Rather, a class can inherit
most of its methods from another class and then implement locally only the methods that
are specific to it.
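A minimal C++ sketch of the Transportation example is given below. The method names, the returned strings and the speed units are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Abstract class: it declares a pure virtual method, so it cannot be instantiated.
class Transportation {
public:
    virtual ~Transportation() = default;
    // Every derived class must supply its own algorithm for turning left.
    virtual std::string turnLeft() const = 0;
    // Default implementation inherited by derived classes; it may be overridden.
    virtual std::string displaySpeed() const {
        return std::to_string(speed_) + " km/h";
    }
    void setSpeed(int s) { assert(s >= 0); speed_ = s; }
protected:
    int speed_ = 0;
};

class Car : public Transportation {
public:
    std::string turnLeft() const override { return "steer wheels left"; }
    // Car inherits displaySpeed() unchanged.
};

class Boat : public Transportation {
public:
    std::string turnLeft() const override { return "turn rudder left"; }
    // Boat overrides the inherited method because it reports its speed in knots.
    std::string displaySpeed() const override {
        return std::to_string(speed_) + " knots";
    }
};
```

A container of pointers to Transportation can hold Car and Boat objects side by side, and the same turnLeft() message is then answered by a different algorithm in each class.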
The preceding analogies also demonstrate how well suited OOP is to simulations. Objects
can be used to represent real entities, from something as simple as a queue buffer to
something as complex as a spacecraft.
3.2 - The Object Oriented Model of the HiPer ATM Switch
The objective of producing an Object Oriented model of the HiPer ATM Switch is to
measure the performance of the design in terms of Cell Loss Ratio, Output Buffer
utilization and Cell delay.
The Object Oriented model of the HiPer ATM switch is itself made up of a number of







Historical Data Collection Class
The classes to which each object belongs are described in the next sub-sections. The next
sub-sections also demonstrate the power of the Object Oriented paradigm. Some of the
classes used in this project were originally developed by GNU. These classes were used as
is, without modifying them, and without concern as to how they were implemented. This
is the concept of encapsulation and reuse. The C++ code for the classes that were
developed for this project is included in Appendix A of this report. A Makefile is also
included to demonstrate the file dependencies. The source of the classes that are part of
the GNU C++ distribution is not included as it is beyond the scope of this
project.
3.2.1 - Uniform Distribution Random Number Generator Class
The Uniform Distribution RNG class is included as part of the GNU distribution of C++.
It uses the bit stream output from a Linear Congruential Generator to produce a stream of
random bits from an initial seed. The Linear Congruential Generator is based on the
algorithm described by Knuth in "Art of Computer Programming, Volume II". The
Uniform Distribution object then uses that bit stream to produce a uniformly distributed
set of random numbers between two predefined limits. Note that the distribution
approaches uniformity for larger sets of random numbers. The algorithm also ensures that
the period of the generated sequence is very large by 'intelligently' randomizing bits in
the seed to produce better random behavior patterns.
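The principle can be sketched as follows. This is not the GNU implementation: the class names SimpleLCG and SimpleUniform are hypothetical, and the multiplier and increment are a widely published pair of constants chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Minimal linear congruential generator: x(n+1) = (a * x(n) + c) mod 2^32.
// The constants are an assumption for illustration, not the values used by
// the GNU class.
class SimpleLCG {
public:
    explicit SimpleLCG(std::uint32_t seed) : state_(seed) {}
    std::uint32_t next() {
        state_ = 1664525u * state_ + 1013904223u;   // wraps modulo 2^32
        return state_;
    }
private:
    std::uint32_t state_;
};

// Maps the raw generator output onto the integers low..high (inclusive).
class SimpleUniform {
public:
    SimpleUniform(int low, int high, std::uint32_t seed)
        : low_(low), span_(static_cast<std::uint32_t>(high - low + 1)), gen_(seed) {
        assert(low <= high);
    }
    int draw() {
        // Scale by the high-order bits, which are better distributed in an
        // LCG than the low-order bits; the result always lies in 0..span_-1.
        std::uint64_t r = gen_.next();
        return low_ + static_cast<int>((r * span_) >> 32);
    }
private:
    int low_;
    std::uint32_t span_;
    SimpleLCG gen_;
};
```

Two generators constructed with the same seed produce the same sequence, which is what makes a simulation run repeatable for a given seed.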
3.2.2 - Network Manager Class
The Network Manager class is one of the top level objects in the model. As its name
implies, it is responsible for modeling an ATM network by producing cells to be switched
and receiving cells that have been switched. The Network Manager is also responsible for
keeping track of statistics of cell production, cell loss and cell delay. This class makes use
of the Uniform Random Number generation class to produce a uniform stream of ATM
cells at each input port. It also uses the Historical Data Collection class to compile
statistics of cell delay.
3.2.3 - Switch Class
The Switch class is the other top level object in this model and it represents the core of the
ATM switch in the model. The Switch object receives a data structure, which holds cells
for each of its input ports, from the Network Manager. It switches the cells to the right
output buffer and at the same time it also removes one cell from each output buffer and
places them in a data structure destined for the Network Manager object. The Switch class
uses the FIFO Queue class to model the output buffers. It also uses the Historical Data
Collection class to keep track of the length of each output buffer after every cell cycle.
3.2.4 - ATM Cell Class
The ATM Cell class defines an ATM cell in the model. It holds the destination to which a
cell needs to be switched and the time at which the cell was produced. Each cell has a field
that indicates whether it is a valid cell or an invalid one. An invalid cell can be used to
represent unassigned cells in an ATM bit stream or to represent absence of a real cell at an
input or output port. The cell class also includes a field which holds an identifier that is
unique to every cell produced during a simulation. This particular feature is not of much
use in this model but it can be used for debugging purposes. The unique identifier is left in
the class for possible future applications of the model.
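A cell record with these fields might look like the sketch below. The field and function names are hypothetical, and a plain long stands in for the arbitrary-precision time type used by the model.

```cpp
#include <cassert>

// Sketch of a cell record with the four fields described above.
struct Cell {
    bool valid = false;     // false models an unassigned or absent cell
    int destination = 0;    // output port the cell must be switched to
    long created = 0;       // cell cycle in which the cell was produced
    long id = 0;            // unique per cell, useful for debugging

    // Delay in cell cycles, computed by the cell itself from the current time.
    long delay(long now) const {
        assert(now >= created);
        return now - created;
    }
};

// Builds a valid cell and assigns it the next unique identifier.
inline Cell makeCell(int destination, long now, long& nextId) {
    Cell c;
    c.valid = true;
    c.destination = destination;
    c.created = now;
    c.id = nextId++;
    return c;
}
```

A default-constructed Cell is invalid, so it can serve directly as the placeholder for an empty input or output port.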
3.2.5 - First In First Out Queue Class
This class implements a traditional queue of Cell objects, where the first cell that enters the
queue is the first cell to leave it. This data structure is implemented as an array of cell
objects. The object maintains two pointers that point at the head and the tail of the queue.
An additional variable is maintained to keep track of the size of the queue. The class also
provides methods to obtain the state of the queue such as whether it is empty or full and
the size of the queue. This class can only handle Cell class objects, and it is therefore
somewhat limited in its applications.
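The following sketch shows one way such a queue can be built. It is not the code from Appendix A: the names are hypothetical, and the wrap-around (circular) use of the array is an assumption, since the text only states that an array, two pointers and a size variable are used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Cell { int destination = 0; long created = 0; bool valid = false; };

// Fixed-capacity FIFO of Cell objects stored in an array, with head and tail
// indices and a separate element count.
class CellFifo {
public:
    explicit CellFifo(std::size_t capacity) : slots_(capacity) {
        assert(capacity > 0);
    }
    bool empty() const { return count_ == 0; }
    bool full() const { return count_ == slots_.size(); }
    std::size_t size() const { return count_; }

    // Returns false (the cell is lost) when the queue is already full.
    bool push(const Cell& c) {
        if (full()) return false;
        slots_[tail_] = c;
        tail_ = (tail_ + 1) % slots_.size();   // wrap around the array
        ++count_;
        return true;
    }
    // Removes the oldest cell; returns an invalid cell when the queue is empty.
    Cell pop() {
        if (empty()) return Cell{};
        Cell c = slots_[head_];
        head_ = (head_ + 1) % slots_.size();
        --count_;
        return c;
    }
private:
    std::vector<Cell> slots_;
    std::size_t head_ = 0, tail_ = 0, count_ = 0;
};
```

Keeping a separate count makes the empty and full states easy to tell apart even though head and tail are equal in both cases.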
3.2.6 - Historical Data Collection Class
The Historical Data Collection class is once again provided as part of the standard GNU
C++ distribution. This class takes a set of floating point data and produces statistical
information about the sample. For example this class has methods to produce the mean,
variance, standard deviation and confidence interval of the data set. It also allows the data
set to be displayed in the form of a bin split for bar graph displays and in the form of a
cumulative distribution.
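The kind of interface involved can be illustrated with a small stand-in class. SampleStats is a hypothetical name and this is not the GNU class; it covers only the mean, variance and standard deviation, and the use of the unbiased (n - 1) variance is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a sample-statistics class.
class SampleStats {
public:
    void add(double x) { data_.push_back(x); }
    std::size_t samples() const { return data_.size(); }
    double mean() const {
        assert(!data_.empty());
        double sum = 0.0;
        for (double x : data_) sum += x;
        return sum / data_.size();
    }
    // Unbiased sample variance (divides by n - 1).
    double variance() const {
        assert(data_.size() > 1);
        const double m = mean();
        double ss = 0.0;
        for (double x : data_) ss += (x - m) * (x - m);
        return ss / (data_.size() - 1);
    }
    double stdDev() const { return std::sqrt(variance()); }
private:
    std::vector<double> data_;
};
```

In the model, one such object would receive a buffer length after every cell cycle and another would receive the delay of every cell leaving the switch.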
3.2.7 - Integer Class
The C++ programming language provides the standard integer and long integer data types
to represent 16- and 32-bit numbers. Unfortunately these standard data types do not have
a large enough range to cover the span of time that is needed to be simulated in this
project. The GNU distribution of C++ provides an Integer class which implements very
large integers. This class is used extensively wherever the range of the natural data types is
insufficient. This class supports all standard methods that a regular integer data type
would support such as addition, division and exponentiation.
3.3 - Implementation of the Object Oriented Model of the HiPer ATM Switch
The switching fabric of the device is non-blocking and it will not affect the Cell Loss Ratio
of the design. However, the buffers used to hold cells that cannot be routed right away can
become full, leading to cells being lost. The Object Oriented Model therefore concentrates
on modeling these buffers.
The Cell Delay is affected by the switching mechanism but that delay is always fixed and it
does not need to be simulated. A fixed delay can be added to the delays produced in
simulation to yield more precise values. As the delay to switch the cells using this design is
of the order of a fraction of a cell cycle, it is considered to be negligible and it is therefore
ignored in this particular model.
The switching mechanism has to be implemented in the model to ensure that the cell
arrival at the output buffers is simulated properly. The cell arrival is considered to follow a
Bernoulli process at the input of the switch. The output to which the cells are addressed is
distributed uniformly with equal probability to be routed to any output. Once these cells
are switched, they arrive at the output buffers with a Binomial/Poisson distribution.
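The Binomial form of this arrival process can be made concrete. If each of n inputs holds a cell with probability p and chooses its output uniformly, a given input targets a given output with probability p/n, so the number of cells reaching that output in one cell cycle follows a Binomial(n, p/n) distribution. The function below is a sketch of that calculation; its name is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Probability that exactly k of the n inputs send a cell to one given output
// in a cell cycle, when each input holds a cell with probability p and picks
// its output uniformly.
inline double arrivalsAtOutput(int k, int n, double p) {
    assert(n > 0 && k >= 0 && k <= n && p >= 0.0 && p <= 1.0);
    const double q = p / n;       // chance that a given input targets this output
    double coeff = 1.0;           // binomial coefficient C(n, k)
    for (int i = 1; i <= k; ++i) coeff = coeff * (n - k + i) / i;
    return coeff * std::pow(q, k) * std::pow(1.0 - q, n - k);
}
```

For n = 8 the mean of this distribution is p cells per cycle, which is below the one cell per cycle that each buffer can release; cells are lost only when several arrivals coincide often enough to fill the buffer.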
Two approaches are possible in implementing the simulation of the Object Oriented
Model. A model that only simulates the buffers and provides a Binomial/Poisson
distribution of cells to be fed to the buffers requires that the arrival probability and mean
arrival rate be specified for the model. The other alternative is to provide a uniform stream
of cells arriving with a given probability and then switch the cells to the output buffers
inside the model. This solution was chosen as it only requires the specification of one
variable (the arrival probability) and the implications of that variable are easier to
understand.
Another issue is the representation of the granularity of time. A simulation can be based on
real time so as to capture the finest granularity of events, or it can be based on time periods
during which a number of events happen, providing a coarser granularity. In this case the
latter approach is used, as at this level of modeling we are only interested in what happens
to entire cells rather than to the stream of bits.
Simulation time therefore progresses in units of cell cycle time. A unit cell cycle time is
measured as being the amount of time between one cell and the next cell following it. This

























Figure 3.1: Object Diagram of the C++ Model of the HiPer ATM Switch
The Object Diagram of the ATM switch model is shown in Figure 3.1. The Network
Manager object is the object in charge of producing cells for the switch and receiving cells
from the switch. A Time object which belongs to the Integer class is used to keep track of
simulation time and is used to initialize and manipulate Cell objects. The Network
Manager first produces a random stream of cells by using two sets of Uniform Random
Number Generators.
A set of 8 Arrival RNG objects, one for each input port, is used to determine
whether a cell has arrived at the input of the switch. The number produced by each Arrival
RNG object is set to vary between 1 and 100. The number produced by the Arrival RNG
is compared to the specified arrival probability. If that number is lower than the arrival
probability, a Cell object arrives at that input of the switch. If the number is higher than the
specified arrival probability, an invalid Cell object is placed at that particular input port.
The next step consists of producing the destination address of the Cell objects that have
arrived. This is determined by the Network Manager using another set of 8 independent
Uniform Random Number Generator objects shown as Destination RNG objects. These
objects are set to produce values in the range between 1 and 8. The output port to which a
Cell object is addressed is the number produced by the corresponding Destination RNG
object.
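The two generation steps can be sketched as one function. std::mt19937 is used here in place of the GNU generators purely for brevity, the names are hypothetical, and treating a draw equal to the threshold as an arrival is an assumption, since the text covers only the lower and higher cases.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

struct Cell { bool valid = false; int destination = 0; long created = 0; };

// Produces one cell slot per input port for the current cell cycle.
// arrivalPercent is the arrival probability expressed on the 1..100 scale.
inline std::vector<Cell> generateCells(std::vector<std::mt19937>& arrivalRng,
                                       std::vector<std::mt19937>& destRng,
                                       int arrivalPercent, long now) {
    assert(!arrivalRng.empty() && arrivalRng.size() == destRng.size());
    std::uniform_int_distribution<int> arrival(1, 100);
    std::uniform_int_distribution<int> destination(1, static_cast<int>(destRng.size()));
    std::vector<Cell> inputs(arrivalRng.size());    // all slots start invalid
    for (std::size_t port = 0; port < inputs.size(); ++port) {
        // Assumption: a draw equal to the threshold counts as an arrival.
        if (arrival(arrivalRng[port]) <= arrivalPercent) {
            inputs[port].valid = true;
            inputs[port].destination = destination(destRng[port]);
            inputs[port].created = now;
        }
    }
    return inputs;
}
```

Each port owns its own pair of generators, so the arrival and destination streams of different ports remain independent of each other.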
The Cell objects are then placed at the input of the Switch object by the Network Manager
object. The Switch object then switches all the valid cells and each cell is placed in the
right output buffer represented by the FIFO Cell Buffer objects. Each output port has a
separate FIFO Cell Buffer object associated with it. During each cycle one Cell object is
removed from each FIFO Cell Buffer and every Cell object destined to that buffer is
placed in it if space is available. The Cell objects that come out of the Cell Buffers are
passed back to the Network Manager. The Switch also keeps track of cell objects that
cannot fit into the Cell Buffers to produce Cell Loss Ratio statistics.
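One cell cycle of the Switch object can be sketched as below. std::deque stands in for the FIFO Cell Buffer class and the names are hypothetical. The text does not state whether a buffer releases its cell before or after the new cells are stored; the sketch assumes that the release happens first.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

struct Cell { bool valid = false; int destination = 0; long created = 0; };

// Hypothetical output-buffered switch; ports are numbered from 1 as in the text.
class SwitchModel {
public:
    SwitchModel(std::size_t ports, std::size_t depth)
        : buffers_(ports), depth_(depth) { assert(ports > 0 && depth > 0); }

    // One cell cycle: release the oldest cell of every buffer (assumed to
    // happen first), then store each arriving cell if space is available.
    std::vector<Cell> cycle(const std::vector<Cell>& inputs) {
        std::vector<Cell> outputs(buffers_.size());     // invalid by default
        for (std::size_t port = 0; port < buffers_.size(); ++port) {
            if (!buffers_[port].empty()) {
                outputs[port] = buffers_[port].front();
                buffers_[port].pop_front();
            }
        }
        for (const Cell& c : inputs) {
            if (!c.valid) continue;                     // nothing arrived on this input
            assert(c.destination >= 1 &&
                   static_cast<std::size_t>(c.destination) <= buffers_.size());
            std::deque<Cell>& q = buffers_[c.destination - 1];
            if (q.size() < depth_) q.push_back(c);
            else ++lost_;                               // buffer full: the cell is dropped
        }
        return outputs;
    }
    long lost() const { return lost_; }
    std::size_t length(std::size_t port) const { return buffers_[port - 1].size(); }
private:
    std::vector<std::deque<Cell>> buffers_;
    std::size_t depth_;
    long lost_ = 0;
};
```

The lost() counter divided by the number of cells produced gives the Cell Loss Ratio, and length() supplies the values recorded after every cycle.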
Two Historical Data Collection objects are present in the system. The first object collects
data about the size of the buffer queues at each iteration of the cycle time and it is shown
as the Buffer Length Historical object. This object receives all its data from the Switch
object. This is due to the fact that the Switch object encapsulates the FIFO Cell Buffers
and therefore the Switch is the only object that can communicate with the Cell Buffers to
get their lengths.
The second Historical Data Collection object collects the delay which was encountered by
cells that successfully made it to the outputs and it is represented by the Cell Delay
Historical object. Data is placed in this object by the Network Manager object which gets
this information from cells that are coming out of the switch. Note that the Network
Manager does not calculate the cell delay itself but rather requests the switching delay
from each Cell object before discarding them.
A number of redundant mechanisms are in place to ensure that simulations proceed
properly. The number of Cell objects in the system is tracked by counting the
number produced and comparing that to the number that was successfully switched and
the number that was lost. Also, cumulative statistics from the Historical Data Collection
objects are compared to statistics collected outside of the objects by the main system to
ensure redundancy.
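The cell count check amounts to a conservation rule, sketched below with hypothetical names. The buffered counter, for cells still waiting in an output buffer when the check is made, is an assumption added so that the check can also be run in the middle of a simulation.

```cpp
#include <cassert>

// Hypothetical counters for the conservation check described above.
struct CellAccounting {
    long produced = 0;   // cells created by the Network Manager
    long switched = 0;   // cells received back from the switch
    long lost = 0;       // cells dropped because a buffer was full
    long buffered = 0;   // cells still waiting in the output buffers (assumption)

    // Every produced cell must have been switched, lost, or still be buffered.
    bool consistent() const {
        assert(produced >= 0 && switched >= 0 && lost >= 0 && buffered >= 0);
        return produced == switched + lost + buffered;
    }
};
```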
3.4 - The Significance of Data Collected During Simulations
The objective of the simulation is to demonstrate the performance of the switch when
faced with a stream of randomly distributed cells. As mentioned earlier one of the factors
that determine the performance of the HiPer switch is the size of the output buffers in the
switch. The size of the buffers affects the cell loss ratio, burst size handling capability of
the switch, the hardware complexity of the implementation and the delay experienced by
cells while they are switched.
The size of the output buffers determines how many cells the switch can store when it
cannot route the cells right away, as would happen during output blocking. The
larger the output buffers, the smaller the number of cells that are lost and the better the
performance of the switch. The size of the buffers also determines the length of burst that
the switch can handle. A larger instantaneous burst of cell traffic can be handled by a
deeper output buffer in the switch.
Unfortunately the size of the buffers cannot be infinite due to practical reasons. Larger
buffers lead to more space being occupied by memory on the chip and they require more
complex logic to be implemented. Also, as the buffers become longer the delay
experienced by a cell to be switched increases. This in turn leads to an increase in
transmission delay. Many applications such as real-time video and voice transmission
require delays to be kept minimal and therefore limit the size of the buffers.
The size of the buffer therefore determines opposing performance parameters and a middle
ground tradeoff point has to be found for the practical switch. The next parameter that
determines switch performance is the traffic pattern at the input ports. The larger the
probability that a cell will arrive at an input, the greater the chance that the buffers will
overflow, leading to a higher Cell Loss Ratio. In the same vein, longer or more
frequent bursts of traffic will lead to the buffers overflowing faster.
An additional parameter of interest in designing a switch is buffer utilization. A switch
cannot be designed to be inefficient, for economic reasons. An inefficient switch would be
one which has much more buffer space than it would really need in a practical situation. A
higher utilization factor produces a more cost effective switch. The lower the
utilization, the more wasteful the switch is in terms of resources. Note that some
advocates will argue that too much buffer space is better than not enough. However, too
much buffer space could lead to a switch that is so expensive that nobody would buy it.
Also, a switch architecture must not be so complex to implement that
it will never become reality. Once again the optimum level of utilization has to be reached
to achieve the right price/performance point.
The objective of the simulations is to look at performance in terms of varying the depth of
the output buffers and varying the arrival probability of cells. The depth of the output
buffers is simulated for sizes of 8, 16, 32, 64 and 128 cells. The computational resources
to simulate deeper buffer sizes are very large both in terms of simulation length (of the
order of weeks) and memory requirements. Larger buffer sizes are therefore not
simulated. Furthermore, simulations for buffer depths larger than 64 (e.g. 128 at 93%
arrival rate) were performed without producing any cell loss. This suggested that any
more simulations would not yield any useful data.
The arrival probability of cells, within a given time slot, is varied among 87%, 90%, 93%,
95% and 98%. Note that this is a fairly small range of simulation but once again
simulations of lower arrival rates are very time consuming and highly computationally
intensive without adding any value to the project.
The data collected represents a very fine granularity of the simulation. That is, the
frequency distribution of every possible output buffer length or cell delay is collected. Also
note that six data sets are produced for each combination of arrival probability and buffer
size. Each data set uses a different prime number seed for the random number generators.
This is done to provide a more accurate picture of the behavior of the model. The results
used here are an average of the different data sets collected.
3.5 - Analysis of C++ Simulation Results
The following sub-sections present and analyze the results obtained during this stage of
the project. The following three data items are collected during the simulations.
Number of output buffers in use after every cell cycle
Delay in switching a cell from an input port to its destination port
Cell loss ratio for a given buffer size and arrival probability
3.5.1 - The Cell Loss Ratio
                               Arrival Probability
Buffer Size    98%       95%       93%            90%            87%
8              -3.15     -3.45     -3.68          -4.04          -4.44
16             -4.05     -4.75     -5.28          -6.16          -7.11
32             -5.21     -6.85     -8.11          -10.18         -12.39
64             -6.96     -10.81    -13.69         [illegible]    [illegible]
128            -10.11    -18.07    -23.37         -31.33         -39.38
256            -15.59    -33.61    [illegible]    [illegible]    [illegible]
Table 3.1: Cell Loss Ratio for different Output Buffer sizes and different arrival
probabilities
Table 3.1 shows the natural log of the Cell Loss Ratio in terms of varying buffer size and
arrival probability. As mentioned earlier, the data could not be collected for large buffer
sizes and low arrival probabilities. The data for these cases is estimated by linear
extrapolation, and it is shown as the darker regions of the table. Note that the CLR
performance index gets better as the Ln(CLR) value becomes more negative. As expected,

















Figure 3.2: Plot of Cell Loss Ratio against Output Buffer size (one curve for each arrival
probability: 98%, 95%, 93%, 90% and 87%)
Figure 3.2 shows a plot of the CLR data. This chart displays several pieces of information
that come in handy for the switch designer. The best size of the Output Buffers can be
determined from the chart based on the expected arrival probability of cells. For example,
the plot indicates that the CLR for buffer sizes above 128 cells is so small that it can be
considered to be negligible. Output Buffers of 128 cells per output could therefore be used
in the device if cell loss needed to be practically eliminated. Similarly, Output Buffer
sizes of 32 and 64 would be adequate if the switch needs to be small in size and CLR is
less of an issue.
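This kind of lookup can be sketched in code. The values below are copied from the 98% column of Table 3.1 as printed; the function names and the idea of searching for the smallest size that meets a target are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Natural log of the Cell Loss Ratio computed from simulation counters.
inline double lnCellLossRatio(long lost, long produced) {
    assert(lost > 0 && produced >= lost);
    return std::log(static_cast<double>(lost) / static_cast<double>(produced));
}

// ln(CLR) for the 98% arrival probability column of Table 3.1.
const int kBufferSizes[] = {8, 16, 32, 64, 128, 256};
const double kLnClr98[] = {-3.15, -4.05, -5.21, -6.96, -10.11, -15.59};

// Returns the smallest tabulated buffer size whose ln(CLR) is at or below
// the target, or -1 when no tabulated size meets it.
inline int smallestBufferFor(double targetLnClr) {
    const std::size_t n = sizeof(kBufferSizes) / sizeof(kBufferSizes[0]);
    for (std::size_t i = 0; i < n; ++i) {
        if (kLnClr98[i] <= targetLnClr) return kBufferSizes[i];
    }
    return -1;
}
```

For example, a target of one lost cell per thousand corresponds to ln(CLR) of about -6.91, which at a 98% arrival probability is first met by the 64 cell buffer.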
3.5.2 - Output Buffer Utilization
Buffer utilization is usually studied using the mean length of the buffers. The mean
utilization of the system is convenient to use as it can be studied using both mathematical
theory and actual simulations. This project, however, uses simulations exclusively to
determine utilization levels. The worst case utilization of the buffers is therefore analyzed
instead. In this instance, the worst case is defined to be the amount of buffer space needed
to buffer 90% of the cells during a simulation. The mean behavior of a system can often
differ greatly from its typical behavior. This method is therefore expected to produce
results that model the real situation better.
                          Arrival Probability
Buffer Size    98%       95%       93%       90%       87%
8              87.50%    87.50%    87.50%    75.00%    75.00%
16             87.50%    75.00%    68.75%    56.25%    43.75%
32             81.25%    56.25%    40.63%    31.25%    21.88%
64             64.06%    29.69%    20.31%    15.63%    10.94%
128            36.72%    14.84%    10.16%    7.81%     [illegible]
Table 3.2: Utilization of Output Buffers for different arrival probabilities
Table 3.2 displays the ratio of the number of buffers used to store 90% of the cells to the
total number of Cell Buffers in the Output Buffer. This information is displayed for both
varying Output Buffer sizes and varying arrival probabilities.
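The calculation can be sketched as follows, under the assumption that the 90% figure is read off the frequency distribution of buffer lengths recorded during a simulation. The function names and the histogram representation are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// histogram[k] = number of observations in which the buffer length was k.
// Returns the smallest length L such that at least `fraction` of the
// observations have a length of L or less.
inline std::size_t percentileLength(const std::vector<long>& histogram, double fraction) {
    assert(!histogram.empty() && fraction > 0.0 && fraction <= 1.0);
    long total = 0;
    for (long n : histogram) total += n;
    assert(total > 0);
    long running = 0;
    for (std::size_t len = 0; len < histogram.size(); ++len) {
        running += histogram[len];
        if (static_cast<double>(running) >= fraction * static_cast<double>(total)) return len;
    }
    return histogram.size() - 1;
}

// Utilization as a ratio: percentile length over total buffer size.
inline double utilization(std::size_t percentileLen, std::size_t bufferSize) {
    assert(bufferSize > 0);
    return static_cast<double>(percentileLen) / static_cast<double>(bufferSize);
}
```

As a check on the arithmetic, utilization(41, 64) evaluates to 0.640625, which is consistent with the 64.06% entry for a 64 cell buffer at a 98% arrival probability.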
Figure 3.3: Output Buffer utilization for different arrival rates (one curve for each arrival
probability: 98%, 95%, 93%, 90% and 87%)
Figure 3.3 is a plot of the data in Table 3.2. As the size of the Output Buffers increases,
the utilization degrades very quickly. The tradeoff between CLR and utilization is
illustrated by contrasting this curve with the CLR curve in Figure 3.2.
Increasing the number of buffers leads to a decrease in the Cell Loss Ratio. However,
increasing the number of buffers also leads to a decrease in the utilization of the buffers.
The designer must therefore consider these two opposing features when putting together a
switch.
3.5.3 - Frequency distribution of Output Buffer lengths
The utilization of the Output Buffers degrades quickly as the size of the Output Buffers is
increased. This can be explained by looking at the frequency distribution of the Output
Buffer lengths.
Figure 3.4: Frequency distribution of Output Buffer lengths for a 32 Cell Output Buffer
(horizontal axis: Output Buffer length; one curve for each arrival probability: 98%, 95%,
93%, 90% and 87%)
Figure 3.4 is a plot of the frequency distribution of the Output Buffer lengths for the
different simulations of a 32 cell deep Output Buffer. This Output Buffer size produces
results that are consistent with the behavior seen for other Output Buffer sizes.
The 98% arrival probability rate produces an almost flat distribution. This means that all
the buffers are used with somewhat equal frequency during the simulations. This in turn
explains why the utilization of the Output Buffers is consistently higher for the 98% arrival
rate. Also, the high arrival rate of cells causes the buffers to be filled up more consistently
during the simulations.
At the other extreme, the arrival rate of 87% leads to a large peak at low Output Buffer
lengths. This is explained by the fact that cells do not arrive as consistently as in the
previous case and therefore the buffers tend to be empty more often. This also
explains why utilization degrades a lot faster for the 87% arrival rate. The larger the buffer
size, the more often the buffers will be empty under this traffic condition.
3.5.4 - Cell Delay
The following table and figure are based on the cell delay encountered by 90% of the cells
in the simulations. This approach is used to produce results that are closer to reality as
compared to using a mean cell delay analysis.
                          Arrival Probability
Buffer Size    98%    95%    93%    90%            87%
8              7      6      6      6              5
16             13     12     11     9              7
32             25     18     13     9              7
64             41     19     13     [illegible]    [illegible]
128            47     19     13     9              7
Table 3.3: Cell delay encountered by 90% of cells in the system
Figure 3.5: Cell delay encountered by 90% of cells in the system (one curve for each
arrival probability: 98%, 95%, 93%, 90% and 87%)
Table 3.3 and Figure 3.5 display the maximum cell delay encountered by 90% of the cells
during the simulations. The plot in Figure 3.5 shows that the cell delay approaches a
steady state value as the number of buffers gets larger. This is a well known result for
stochastic processes of this nature and it can also be deduced intuitively.
No cell loss occurs when large cell buffers are present. Under these circumstances, the cell
delay is determined by the number of cells arriving into the system rather than the size of
the cell buffers. The factor that determines the cell delay is therefore the arrival probability
and the number of input ports. As the number of input ports is constant in our simulations
the cell delay is dependent solely on the arrival probability for large buffer sizes.
For small buffer sizes and large arrival probabilities the buffers tend to be full all the time.
Every cell that gets buffered will be placed at the last position of the queue. The cell then
waits for every cell ahead of it to be routed before its turn arrives. Therefore the majority
of cells will be delayed by approximately the size of the buffer.
This type of reasoning can be extended to explain any other trends seen in Figure 3.5.
Chapter 4 - Functional Verification Using a VHDL Model
4.1 - Hardware Description Languages
The object oriented modeling technique described in the last chapter allows for
straightforward and fast simulation of the performance of an ATM switch. The software
model does not, however, give an insight into the details of the implementation of the
device. The C++ model demonstrates how the device performs, but it does not show that
the device can be implemented in practice. The C++ programming language could be used
to construct a model that more closely resembles the digital system that needs to be
implemented. This exercise would, however, be extremely tedious and needlessly time
consuming.
4.1.1 - The VHSIC Hardware Description Language
The alternative to modeling a digital system with C++ is to use a Hardware Description
Language such as VHDL. VHDL was created by the U.S. federal government to describe
the design of high speed and high density integrated circuits. The language is based on the
programming language Ada, which was also created by the federal government for the
development of software applications.
VHDL allows a digital system to be represented using conventional programming
language constructs and digital logic elements in a flexible environment. Furthermore, a
VHDL model can be directly synthesized into a silicon chip layout or it can be used as a
blueprint for the layout of the device.
The VHDL language allows bit level simulation of a system rather than object level
simulation as offered by C++. At the bit level, the designer deals with the smallest
representation of a data element in a digital system. At the bit level the designer also deals
with smaller segments of time as compared to the larger cell cycle time in the C++ model.
An event in a VHDL model happens every time a new bit comes into the device rather
than every time a cell comes into the device. This chapter will therefore deal with bit
timing and refer to cell cycle time when necessary.
VHDL however offers more than just simple modeling of a system. The language allows
fairly complex timing requirements between components of a system to be specified,
leading to more accurate models. The data obtained from a previous silicon chip using the
same manufacturing process can be used to implement this concept. Also, once a chip is
manufactured its actual timing parameters can be back annotated to the hardware
description model, leading to a much more accurate model.
The VHDL language also allows for the creation of signals that can be used to drive the
inputs of a device just as in a real circuit. This additional feature allows for the creation of
testbenches within which a model can be simulated. As a matter of fact an entire test suite
to stress every feature of a design can be developed using VHDL. An entire chip may be
designed and tested before being fabricated. Another advantage provided by this feature is
the fact that problems with the device can be identified, reproduced and corrected faster in
the model than in the actual chip.
The use of a hardware description of a design allows for very fast turnaround times as far
as chip design goes. The amount of time needed to fabricate a design in silicon and test it
is of the order of months. On the other hand, the VHDL model of a chip can be modified
and tested completely in a matter of days. This difference in turnaround time explains why
hardware modeling is such a crucial step in the design of modern digital systems.
Furthermore, modeling the hardware using a language is even faster than schematic
capture of a design and often a lot more intuitive to the human eye for large designs.
Simulation of a VHDL model is also much faster than the simulation of a netlist obtained
from a schematic, although the results of the VHDL simulation are often not as accurate as
those obtained by using a netlist. The lack of accuracy of the VHDL model is mainly due to
the fact that many circuit level parameters cannot be estimated accurately before the actual
layout is produced. The amount of time saved in implementing a design, however, makes
up for the rough approximation obtained using hardware description languages.
The language allows modeling of a digital system using different levels of abstraction.
Furthermore the language allows different levels of abstraction to be mixed in the
description of a device allowing for models to more closely match what the designer




4.1.2 - Behavioral Modeling
Behavioral modeling attempts to model the behavior of a device without using the logic
blocks that actually make up the device. This is the highest level of abstraction available in
the language as it makes extensive use of programming language constructs which are a
lot easier to visualize for the average designer.
For example the standard IF statement can be used to produce the output of an inverter
from a given input signal. The behavioral model would read as 'IF input is true THEN
output is false ELSE output is true'.
4.1.3 - Structural Modeling
Structural modeling is closer to the hardware level than Behavioral modeling. At this level
of abstraction, a device is described by an entity that itself models the device. In this case
the actual implementation of the device is encapsulated inside of the model and all the
designer sees are the inputs and the outputs of the model.
Using an inverter as an example once again, the structural description of the device could
be 'output = INVERTER (input)'. Note that the actual INVERTER could be a behavioral
model but the designer does not care about that. This is an extension of the Object
Oriented concept of encapsulation into the VHDL language.
4.1.4 - Dataflow Modeling
Dataflow modeling uses the concept of signal flow to model hardware. Each signal in a
Dataflow model can be combined with other signals or manipulated to produce other
signals. These signals flow through different operations to produce a final result signal.
This level of abstraction is the closest to the way that hardware is implemented. The use of
digital gate models (as in Structural modeling) is eliminated at this level, and replaced by
Boolean, arithmetic and other data manipulation operators.
The inverter used in the two previous examples would be represented by 'output = NOT
input'




changes without the need for an explicit digital gate or programming statement needing to
be called to perform the operation.
4.2 - The Functional Description of the HiPer ATM Switch
The architecture presented in this paper depends heavily on its memory system to perform
well. The memory system has to be able to process up to a maximum of 8 parallel cell
streams within the time it takes for a single cell to arrive. The main challenge of this
project therefore was to develop such a memory system and design it in such a way that
the architecture would be scalable to larger sizes.
As pointed out earlier, one of the ways of implementing such a device would be to speed
up the internal clock of the system so as to be able to process all 8 input streams
simultaneously in a time division multiplexed fashion. Unfortunately the main drawback of
this solution is that it limits the size of the switch and the clock rate at which the switch
can run externally. This is due to the fact that the hardware complexity increases in direct
proportion to both clock rate and number of input ports. Another solution must therefore
be developed to meet the criterion of flexible scaling.
The first step in implementing the device is to identify the bottleneck in the system which
limits concurrency. Starting at the time that cells enter the switch, the address lookup and
switching of individual cells can be performed in parallel. At the output of the switch only
one cell leaves each Output Buffer at a time, with each cell coming from a separate buffer
and going to a separate output port, so this operation can be performed concurrently too.
The bottleneck therefore exists only at the input of the Output Buffers, and the rest of the
system can indeed run in parallel without taking a hit in hardware complexity.
The solution that is chosen for this project is an innovative approach in this type of design.
A single Output Buffer in the system will potentially have to handle cells coming from
every input port of the switch. However before all the cells can be written into the Output
Buffer the read and write select lines of the physical memory have to be set up properly
and every cell has to be routed through a dedicated path to the right Cell Buffer.
Furthermore each cell has to be placed in consecutive empty locations in the buffer as a
FIFO type of Output Buffer is required, therefore adding more complexity to the
hardware.
4.2.1 - The Write Synchronization Signals
The problem that faces the designer is that not enough clock cycles are available to
process 8 cells in the time that it takes one cell to arrive. Fortunately the cells themselves
are not needed to set up the Output Buffers. The buffers only need to know how many
cells are coming their way, and they need to know the input ports from which the cells are
arriving so that the right paths can be set up.
This is the solution adopted in the design of the HiPer ATM switch. Each valid cell that
arrives at the switch is buffered while its addressing information is extracted and looked
up in the Switching Tables. A synchronization signal called the write sync signal is sent
ahead of the cell to the Output Buffers after the ports to which the cell is destined have
been determined. The actual cells are held back in the input buffers to give all the write
sync signals enough time to arrive at the output buffers and to be processed completely.
The write sync signals also have to carry the information about the input port from which
they are coming. This is achieved by rearranging the write sync signals so as to follow
each other in consecutive clock cycles based on the input port from which they have been
produced. For example, a valid cell at input port 0 produces a write sync signal at time t,
followed by a write sync signal at time t+2 for a valid cell at input port 2, and so on. The
write sync signals now carry the information that cells are arriving and the position of the
signals in time determines which input they came from.
The next problem is that all the write sync signals still cannot be processed in parallel.
However the hardware now has plenty of clock cycles to manipulate the write sync signals
one at a time and set up the memory. This is exploited by processing the write sync signals
serially. All the write sync signals are combined into one by feeding them into an OR logic
gate. This signal is then combined through an AND logic gate with the system clock to
produce a shorter pulse.
This extra step is only required in this system to separate consecutive write sync signals
but it does not have to be performed if the write sync signals are spaced longer apart in
time. As a matter of fact, each write sync signal can be spaced in time by as long as
needed to process one of them completely. In this case, each write sync signal lasts a
single clock cycle but that can be easily extended if more time is required in a larger sized
switch or if the AND logic gate needs to be eliminated.
The write sync signals represent the principal mechanism by which cell switching is
effectively carried out in the HiPer ATM switch and this process is illustrated in more
detail next.

























Figure 4.1: Mechanism used in the HiPer ATM Switch to carry cell information in a
single bit asynchronous signal (Write Sync Signals).
As shown in Figure 4.1, valid cells arrive at input ports 0, 2, 6 and 7. Once the addressing
information is available the write sync signals are produced at times t, t+2, t+6 and t+7 as
shown in the next column. This is followed by the combination of the separate write sync
signals into 1 signal and the conversion of the serial signal into a series of pulses as shown
in the next two signals. The system clock is also shown as a rough reference guide.
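The sequence in Figure 4.1 can be traced with a small Python sketch. This is only an illustration and not the VHDL of Appendix B: the function names are hypothetical, slot k of each list stands for clock cycle t+k, and the clock is assumed to be high during the first half of each cycle.

```python
def write_sync_timeline(valid_ports, num_ports=8):
    """Return the per-port write sync waveforms and their OR-combination.

    A valid cell on input port p pulses only in slot p (time t+p).
    """
    per_port = {
        p: [1 if (k == p and p in valid_ports) else 0 for k in range(num_ports)]
        for p in range(num_ports)
    }
    # OR gate: the combined line is high whenever any port's line is high.
    combined = [
        int(any(per_port[p][k] for p in range(num_ports)))
        for k in range(num_ports)
    ]
    return per_port, combined


def gate_with_clock(combined):
    """AND the combined signal with the clock, two samples per cycle.

    Assumption: the clock is 1 in the first half-cycle and 0 in the second,
    so consecutive write syncs become separate half-cycle pulses.
    """
    pulses = []
    for level in combined:
        for clk in (1, 0):
            pulses.append(level & clk)
    return pulses


per_port, combined = write_sync_timeline({0, 2, 6, 7})
print(combined)  # [1, 0, 1, 0, 0, 0, 1, 1]
print(gate_with_clock(combined))
```

In the combined signal the pulses for ports 6 and 7 touch each other; after gating with the clock they are separated by a low half-cycle, which is the purpose of the AND gate.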
4.2.2 - The Input Stage
The input stage of the system can now be described and its role in producing the
appropriate write sync signal can be clarified. The Input Modules of the ATM Switch are
responsible for performing the functions of phase alignment of the cells, bit phase
alignment with the system clock, cell delineation and error checking and correction. Note
that this project attempts to design a switch fabric and that therefore none of these












Figure 4.2: Handshaking mechanism between Input Modules and Switch Fabric
The interface between the Input Module and the Switch fabric is defined as a simple
asynchronous system with each new cell arrival being indicated by a pulse on a separate
synchronization line. This is shown in Figure 4.2. This pulse is expected to last one clock
cycle and is expected to be produced with the first bit of every cell being sent to the
switch. This pulse is referred to as the sync signal and it effectively delineates the
boundary between successive cells in a continuous stream of bits.
A single sync signal indicating the first bit of all the cell streams may be split into separate
signals for each input port or a separate signal can be produced by the Input Modules for
each separate input port. The latter solution is used once again to simplify the design.
Note that the ATM protocol requires that an unassigned or invalid cell be produced if no
cells are present at an input port. Therefore the sync signal must always be present even
when a valid cell is not being transmitted. Furthermore the format of an unassigned cell is
required to have a VPI and VCI combination of 0 meaning that no addressing information
is available. This is exploited later on in the system during address lookup to discard a
write sync signal belonging to an invalid cell. The sync signal is used both to perform
addressing information lookup in the Switching Tables and to generate the write sync
signals as explained next.
Both the cell and the sync signal accompanying it are first placed in a serial register. The
serial register holding the cell is called the Input Buffer and the serial register holding the
pulse is called the Sync Buffer. A section of the Input Buffer serial register is fed to the
Switching Table to obtain the required bit pattern to route the cells properly. The bit
pattern thus produced is latched into a permanent register when the correct address shows
up. This is done by clocking the latches which drive the Switching Table with the content
of the Sync Buffer. The latching clock signal is called the lookup sync signal and it is
produced from one of the latches in the serial register.






































Figure 4.3: Illustration of the relationship between the Addressing information in
the Cell Buffer and the Lookup Sync signal produced by the Sync Buffer
This scheme is illustrated in Figure 4.3. For example, suppose the first two bits of the Input
Buffer are fed to the Switching Tables. Also suppose that only the last two bits of the VCI
field in the header are used for addressing the cells. These two bits will occupy bit
positions 27 and 28 in the bit stream, assuming that the first bit of the cell stream is bit 1.
Therefore the full addressing information will be present when bit 28 shows up, which is
when latch 28 in the Sync Buffer is 1. Therefore tapping latch 28 of the Sync Buffer

























Figure 4.4: Relationship between signals in the Input Stage of the HiPer ATM
Switch
The destination address of each cell is looked up in a Switching Table to produce a bit
pattern that will drive the Switching Trees. This is implemented using a fast decoder which
drives a Read Only Memory in which bit patterns corresponding to each address
combination have been hardcoded. Note that this is performed in parallel for each input
port so as to minimize hardware complexity. This implies that a separate Switching Table
is present for each input port. This process is also illustrated in Figure 4.4.
The write sync signals for each input port are produced after the bit patterns have been
produced to drive the Switching Trees. This is done by once again tapping the content of
the Sync Buffer serial register. To implement the time staggering of the write sync signal
























































Figure 4.5: Staggering of the Sync signals to produce the Write Sync signals in the
Sync Buffer.
The process of tapping consecutive latches in the sync register is demonstrated in Figure
4.5. For example, if the tap for input port 0 is on bit 30 of Sync Buffer 0 then the next Sync
Buffer has bit 31 tapped and so on up to Sync Buffer 7 which has a tap on bit 37. Note
that if a longer period is required between write sync signals the taps could be placed
further apart, for example on bits 30, 33, 36 and so on if a delay of 3 clock cycles is
required.
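The tap arithmetic in this example reduces to a single expression. The following Python sketch is only an illustration; the function name and its parameters are hypothetical, and the base tap of 30 is the value used in the example above rather than a figure taken from the final design.

```python
def write_sync_taps(base_tap=30, spacing=1, num_ports=8):
    """Sync Buffer bit tapped for each input port.

    Port p is tapped `spacing` bits later than port p-1, so its write
    sync signal appears `spacing` clock cycles later.
    """
    return [base_tap + spacing * port for port in range(num_ports)]


print(write_sync_taps())           # [30, 31, 32, 33, 34, 35, 36, 37]
print(write_sync_taps(spacing=3))  # [30, 33, 36, 39, 42, 45, 48, 51]
```

Note that a wider spacing also requires a longer Sync Buffer, since the last tap moves from bit 37 to bit 51 in the second case.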
The write sync signals are then switched through the Write Sync Switching Trees on their
way to their assigned Output Buffer. Note that an invalid or unassigned cell will produce a
bit pattern such that all the branches of the Write Sync Switching Tree will be turned off.
On the other hand valid cells will enable one or more branches in the Switching Trees.

















Figure 4.6: High level architecture of Output Buffers.
Each output port has a dedicated Output Buffer associated with it. Each Output Buffer is
made up of a number of elements which facilitate the implementation of the multiple input
single output FIFO memory structure as shown in Figure 4.6. The main elements of the
Output Buffer are as follows.
The Cell Buffers are 424 bits wide memory banks which are used to hold single ATM
cells.
The Write Select Register points to the next empty Cell Buffer.
The Read Select Register points to the next Cell Buffer which will be read from.
The Write State Machine and Read State Machine control access to the Cell Buffer
memory.
Miscellaneous elements such as latches, multiplexers and counters are used to
implement the multiple concurrent write and single read operations.
A global counter called the Write Sync Counter is used to time the arrival of each write
sync signal. The Write Sync Counter is initialized at reset in such a way that the counter
has a value of 0 when a write sync signal from input port 0 reaches the Output Buffer, a
value of 1 for the next write sync signal and so on. When the Cell Buffer receives a write
sync signal, it reads the value of the counter to determine which input port the signal came
from.
A local counter is also maintained in each Output Buffer. The counter keeps track of the
number of cells in the Output Buffer and it is referred to as the Output Cell Counter. The
output of this counter is decoded to produce the signals of output buffer empty (when the
counter is at 0) and output buffer full (when the counter is at 8). Every time a Cell Buffer
is assigned to be written to, the Output Cell Counter counts up. Conversely, every time a
cell is read out of the Output Buffer the Output Cell Counter is decreased. This counter is
4 bits wide to count up to 8 cells.
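The behavior of the Output Cell Counter described above can be sketched in Python as follows. The class and method names are hypothetical, and the sketch ignores clocking details; it only shows the counting rules and the decoding of the empty and full signals.

```python
class OutputCellCounter:
    """Counts the cells held in one Output Buffer (0 to 8)."""

    CAPACITY = 8  # number of Cell Buffers per Output Buffer

    def __init__(self):
        self.count = 0  # value forced at reset

    @property
    def empty(self):
        return self.count == 0

    @property
    def full(self):
        return self.count == self.CAPACITY

    def cell_assigned(self):
        """A Cell Buffer has been assigned to be written to."""
        if not self.full:
            self.count += 1

    def cell_read(self):
        """A cell has been read out of the Output Buffer."""
        if not self.empty:
            self.count -= 1


counter = OutputCellCounter()
for _ in range(8):
    counter.cell_assigned()
print(counter.full)                                # True
print(OutputCellCounter.CAPACITY.bit_length())     # 4
```

The last line shows why 4 bits are needed: the value 8 itself must be representable, and a 3 bit counter only reaches 7.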





















Figure 4.7: Structure of a Circular Serial Shift Register with a reset signal
The three registers described next are all circular serial shift registers with parallel output
ports. The registers have no input port and the latches in the register are forced into a
known state by the reset signal. For example the register shown in Figure 4.7 has Q0
forced to 1 and the rest of the latches forced to zero during reset. The single bit is then
shifted from one latch to the next on rising edges of the clock until it reaches the last latch
in the register at which point it is fed back into the first latch. The bit therefore loops
around the register in a circular fashion.
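A minimal behavioral sketch of such a register is given below in Python. The class name, the `enable` argument and the `pointer` helper are assumptions introduced for illustration; they are not taken from the VHDL model.

```python
class CircularShiftRegister:
    """One-hot circular shift register with reset and parallel outputs."""

    def __init__(self, length, reset_position=0):
        self.length = length
        self.reset_position = reset_position  # latch forced to 1 at reset
        self.reset()

    def reset(self):
        self.bits = [0] * self.length
        self.bits[self.reset_position] = 1

    def clock(self, enable=True):
        """Rising clock edge: the last latch feeds back into the first."""
        if enable:
            self.bits = [self.bits[-1]] + self.bits[:-1]

    def pointer(self):
        """Index of the latch currently holding the single 1."""
        return self.bits.index(1)


reg = CircularShiftRegister(8)
for _ in range(3):
    reg.clock()
print(reg.pointer())  # 3
for _ in range(5):
    reg.clock()
print(reg.pointer())  # 0, the bit has looped around
```

The `enable` argument stands in for conditions such as "Output Buffer not full" or "Output Buffer not empty" that hold a pointer in place.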
The first register is called the Write Select Register. Each latch of the register points to a
state machine that determines the write state of a Cell Buffer. The pointer in the Write
Select Register moves to the next state machine after the current state machine has been
assigned to be written to. When the Output Buffer is full, the Write Select Register pointer
remains at the same state machine until the buffer is not full anymore. This register is 8 bits
long to match the number of Cell Buffers.
The second register is called the Read Select Register, and it performs the opposite
function of the Write Select Register. The latches in this register each point to the Read
State Machine that determines the read state of a Cell Buffer. The bit in this register
moves in the same direction as the Write Select Register thus implementing a First In First
Out scheme. This serial register is clocked every time a new cell cycle starts. The register
however only shifts when the Output Buffer is not empty. This register is 8 bits long to
point to each Cell Buffer.
When the Output Buffer is empty, the Read Select Register should point to the same
location as that pointed to by the Write Select Register and both pointers must stay there
until new cells start to arrive. When the Output Buffer is empty, an unassigned cell is
produced and fed to the output ports rather than the content of the Output Buffer.
The Bit Select Register is associated with all the Output Buffers in the system. This
register is 424 bits long and each bit in the register acts as a pointer to individual rows of
bits in the Cell Buffers. The combination of the Bit Select Register and the Write Select
Register or the Read Select Register in a Cell Buffer allows any individual bit in the
Output Buffer to be accessed. Once again, a single bit travels around this serial register.
The first bit in this register is also used to indicate the end of a cell cycle and the start of
the next cycle. This register is of the size of a single ATM cell which is 53 bytes (424 bits).
4.2.4 - The Cell Buffers
Each Cell Buffer is made up of a Write State Machine, a Read State Machine, an eight
input Multiplexer and 424 Dynamic Random Access Memory elements (the size of an
ATM cell is 424 bits). The memory bits within a Cell Buffer share the same write select
line, read select line, data in line and data out line. Each individual bit in the Cell Buffer is
accessed by enabling a bit select line. Each bit select line is selected by the Bit Select








Figure 4.8: State Diagram of the Write Select State Machines in the Cell Buffer
The write select line is driven by the Write Select State Machine. This state machine is
actually made up of two separate components as shown in Figure 4.8. The purpose of this
system is to control the write select line of the Cell Buffer. Both state machines are
initially in the Idle state in which nothing is being written to the Cell Buffer. The first state
machine is clocked using the system clock. When the write select signal is asserted and the
Cell Buffer is not full and a write sync signal is received, that state machine goes into the
Ready state. At that point the write select signal pointing to this Cell Buffer is de-asserted
as the Write Select Register is shifted to point to the next empty Cell Buffer. This has the
effect of enabling the Idle to Write arc in the second state machine which is clocked by the
least significant bit of the Bit Select Register.
The least significant bit of the Bit Select Register is asserted when a new cell cycle is
starting. That causes the second state machine to go into the Write state which allows data
to be written to that Cell Buffer. The first state machine then sees that the Cell Buffer is
now being written to, and it goes back into the idle state during the next transition of the
system clock. Finally, the cycle ends when the least significant bit of the Bit Select
Register is asserted a second time, which now indicates that the cell cycle has ended, and
the second state machine goes back to the Idle state.
This technique effectively implements a latching mechanism where the state machine
remembers that it must write a complete cell to the Cell Buffer, although the Write Select
register is not pointing to that Cell Buffer anymore.
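The interaction of the two components can be sketched in Python as two small state variables with separate clock events. This is a simplified reading of the description above, not the state diagram of Figure 4.8 itself; the class name, method names and the `write_select_line` property are hypothetical.

```python
class WriteSelectStateMachine:
    """Two cooperating state machines that drive the write select line."""

    def __init__(self):
        self.setup = "IDLE"  # first machine: IDLE or READY
        self.write = "IDLE"  # second machine: IDLE or WRITE

    def system_clock(self, write_select, buffer_full, write_sync):
        """Event for the first machine, clocked by the system clock."""
        if self.setup == "IDLE":
            if write_select and not buffer_full and write_sync:
                self.setup = "READY"
        elif self.write == "WRITE":
            # The Cell Buffer is now being written, so the request is served.
            self.setup = "IDLE"

    def bit_select_0(self):
        """Event for the second machine, clocked by bit 0 of Bit Select."""
        if self.write == "IDLE":
            if self.setup == "READY":
                self.write = "WRITE"
        else:
            self.write = "IDLE"  # the cell cycle has ended

    @property
    def write_select_line(self):
        return self.write == "WRITE"


sm = WriteSelectStateMachine()
sm.system_clock(write_select=True, buffer_full=False, write_sync=True)
sm.bit_select_0()              # a new cell cycle starts
print(sm.write_select_line)    # True
sm.system_clock(write_select=False, buffer_full=False, write_sync=False)
sm.bit_select_0()              # the cell cycle ends
print(sm.write_select_line)    # False
```

The second machine holds the write select line for a full cell cycle even though `write_select` was only asserted briefly, which is the latching effect described above.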
Each Cell Buffer has a multiplexer associated with it. The function of this device is to
select one of the 8 input lines from which cells could be coming in. When the write select
line of the Cell Buffer is enabled, the content of the global Write Sync Counter is latched
into a register. The Write Sync Counter keeps track of the timing of the write sync
signals. The content of the register then drives the multiplexer to select the right signal to
route into the data in line of the Cell Buffer. For example, if a cell is incoming from input
port 5, the Write Sync Counter will be at the value of 5 and this will get latched when the
write select line is asserted. The latched value will then drive the multiplexer to select line





Figure 4.9: State Diagram of the Read Select State Machine in the Cell Buffer
The Read Select State machine is a two state system. The state machine is initialized to be
in the Idle state. The clock for this state machine is the least significant bit of the Bit Select
Register. The system goes into the read state when the Read Select Register is pointing to
it and the Output Buffer is not empty. The system then goes back to the Idle state the next
time that the first bit of the Bit Select register is enabled. This effectively allows the Read
Select Register to point to that Cell Buffer through the Read Select State machine
signifying to the Cell Buffer that it is going to be read from. Once the state machine goes
into the Read state, the Read Select Register shifts to point to the next Cell Buffer in the
Output Buffer. When the Output Buffer is empty the Read Select Register remains at the
same state machine until the Output Buffer fills up again.
As explained earlier the first bit of the Bit Select Register indicates the start of a new cell
cycle. The need for external counters to keep track of the start and end of a cell cycle is,
therefore, eliminated by using the bit select register to provide the clock for the state
machines.
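The Read Select State Machine is simple enough to sketch in a few lines of Python. The names below are hypothetical, and the only clock event modeled is the assertion of the first bit of the Bit Select Register.

```python
class ReadSelectStateMachine:
    """Two state machine (IDLE, READ) clocked by bit 0 of Bit Select."""

    def __init__(self):
        self.state = "IDLE"  # state forced at reset

    def bit_select_0(self, read_select, buffer_empty):
        """Start of a cell cycle: decide whether this Cell Buffer is read."""
        if self.state == "IDLE" and read_select and not buffer_empty:
            self.state = "READ"
        else:
            # Either the read cycle has ended or there is nothing to read.
            self.state = "IDLE"


rsm = ReadSelectStateMachine()
rsm.bit_select_0(read_select=True, buffer_empty=False)
print(rsm.state)  # READ
rsm.bit_select_0(read_select=False, buffer_empty=False)
print(rsm.state)  # IDLE
```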
4.2.5 - The Reset State
The HiPer ATM switch has a reset input signal whose function is to initialize the system to
a known state. During a reset operation all major buffers are cleared of data, all pointers,
registers, and counters are initialized and all state machines are forced into known states.
The reset operation must happen synchronously with the system clock and must last at
least two clock cycles to allow proper cycling of state machines. Furthermore the first cell
must arrive during the first clock cycle following the deassertion of the reset signal to
allow the Write Sync Counters to work properly.
During reset, both the Write Select Register and Read Select Register are set to point to
the same Cell Buffer. The Bit Select Register is set up in such a way that the first bit will
be selected when the first bit of the first cell arriving into the switch shows up at the Cell
Buffer. For example, if it will take 40 clock cycles for the first bit to show up at the Cell
Buffer then bit 385 (424-40+1) is set in the Bit Select register at reset. Thus, after 40
clock cycles bit 1 of the Cell Buffer becomes selected on schedule for the first bit of the
first cell.
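The reset position follows from modular arithmetic on the 424 bit register. The Python sketch below checks the example; the function names are hypothetical and bit positions are 1-based to match the text.

```python
CELL_BITS = 424  # length of the Bit Select Register


def bit_select_reset_position(latency_cycles, cell_bits=CELL_BITS):
    """Bit to set at reset so that bit 1 is selected after the latency."""
    return (cell_bits - latency_cycles) % cell_bits + 1


def position_after(start, cycles, cell_bits=CELL_BITS):
    """Position of the circulating bit after a number of clock cycles."""
    return (start - 1 + cycles) % cell_bits + 1


start = bit_select_reset_position(40)
print(start)                      # 385
print(position_after(start, 40))  # 1
```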
All the state machines in the system are also forced to the idle state and the counters are
forced to zero during assertion of the reset signal. The latches driving all the Switching
Trees are forced to zero to prevent false write sync signals from being passed to the
Cell Buffers. The latches driving the multiplexers connected to the data in lines of
individual Cell Buffers are also forced to zero as a precautionary measure.
Finally, all the Write State Machines and Read State Machines are forced into the idle
state to prevent any write or read select lines from being asserted before the first write
sync signals reach the Cell Buffers.
4.2.6 - The ATM Cells
During the time that the write sync signals are being dispersed around the system the
actual ATM cells are still sitting in the Input Buffers waiting to be switched. As the
architecture is fully interconnected, the cells themselves are never switched. Rather, every
cell is sent to every Output Buffer. If all the Cell Buffers have been initialized properly,
they will pick the correct cells from the 8 input ports that they all have access to. The cells
leave the Input Buffers in time to arrive at the Cell Buffers when the first bit of the
memory system is selected by the Bit Select Register. As explained earlier, this
synchronization mechanism is implemented by adjusting the size of the Input Buffer and
the position of the select bit in the Bit Select register during the reset phase.
The Output Cell Counter increments every time a new sync pulse is received by the system
and as long as the Cell buffer is not full. The counter must also be decreased once every
cell cycle to account for the cell that is being read out. The counter is not decremented if
the buffer is empty. The clock for this counter is produced by using the write sync signals
and the least significant bit of the Bit Select Register. The write sync signal is used to
increment the counter whereas the bit select (0) signal will decrement it.
4.2.7 - Hardware Complexity
In effect the process of cell switching has been reduced to that of switching the write sync
signals leading to a more flexible system. Looking at the devices used in the system,
scaling the size of the switch up to a larger number of inputs only requires more
interconnection lines, larger counters and larger multiplexers, which all come at a fairly
low cost as far as hardware complexity goes.
The system is also implemented without requiring the clock period to be shortened and
therefore it is practical to scale the system clock as far up as the VLSI process will allow.
This design is in fact limited by the VLSI technology used rather than the clock being used
or the hardware complexity of the implementation. This will be studied further in Chapter
5.
Note that the system as a whole uses both the concepts of synchronous design, as
exemplified by the use of state machines, and the concepts of asynchronous design, as
exemplified by the write sync signals. This leads to a superior solution as compared to a
system that is designed exclusively based on the synchronous or asynchronous paradigm.
4.3 - Description of the VHDL Model
The VHDL model of the HiPer ATM switch is modeled very closely to the actual
hardware being designed. This is done so as to facilitate the implementation of the VLSI
layout of the system. The VHDL model will serve as a blueprint for the final layout of the
system.
The listing of the VHDL code used to produce the model is given in Appendix B. The
next subsections will briefly describe what each entity does in the system and how the
model is organized.
4.3.1 - The Input Buffer And The Sync Buffer
The Input Buffer and Sync Buffer entities are implemented in VHDL in the form of a bit-
vector of fixed length. At every rising edge of the clock the content of the bit-vector is
shifted by one bit. A bit can also be input into the vector and the value of the last bit of the
bit-vector can be read out at every rising clock edge. This effectively produces the
behavioral model of a shift register.
The Input Buffers have fixed input and output ports. The output ports from the Input
Buffer entity give access to the cell coming out of the serial register and access to the
address field of the cell which is a sub-range of the bit vector.
The Sync Buffers have a fixed input port and a fixed output port for the lookup sync
signal. However, the output for the write sync signal changes depending on the Sync
Buffer that is generating that signal. This is implemented by using a generic value which is
passed to the entity when it is created. The generic holds the port number with which the
Sync Buffer is associated and therefore allows the write sync signal to be produced
dynamically from the appropriate position in the bit vector.
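The role of the generic can be illustrated with a Python stand-in for the Sync Buffer entity. This is a sketch under stated assumptions: the class name, the register length of 40 and the two tap constants are hypothetical, with the tap values borrowed from the examples in section 4.2.2 (latch 28 for the lookup sync signal and latch 30 for the write sync signal of port 0).

```python
class SyncBuffer:
    """Serial shift register holding the sync pulse of one input port."""

    LOOKUP_TAP = 28      # 1-based latch tapped for the lookup sync signal
    WRITE_BASE_TAP = 30  # 1-based latch tapped for port 0's write sync

    def __init__(self, port, length=40):
        self.port = port  # plays the role of the VHDL generic
        self.bits = [0] * length

    def clock(self, sync_in):
        """Rising clock edge: shift by one latch and take in a new bit."""
        self.bits = [sync_in] + self.bits[:-1]

    @property
    def lookup_sync(self):
        return self.bits[self.LOOKUP_TAP - 1]

    @property
    def write_sync(self):
        # The tap position depends on the port number passed at creation.
        return self.bits[self.WRITE_BASE_TAP + self.port - 1]


buf = SyncBuffer(port=2)
buf.clock(1)                 # sync pulse arrives with bit 1 of the cell
for _ in range(27):
    buf.clock(0)
print(buf.lookup_sync)       # 1, the pulse is in latch 28
for _ in range(4):
    buf.clock(0)
print(buf.write_sync)        # 1, the pulse is in latch 32 for port 2
```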
4.3.2 - The Switching Table
Four bits are required to implement all the switching possibilities offered by an 8 port
switch. The 16 combinations available are enough to produce all 8 possible outputs
separately and the remaining combinations are used for multicasting. The VPI/VCI
combination 0 is reserved by the ATM protocol for invalid cells and therefore only 15
addresses are actually available.
The switching table is implemented as a behavioral model based on a case statement. The
CASE statement produces a bit pattern which is 8 bits wide to independently drive the 8
branches of the Sync Switching Trees. The case statement switches based on the 4 bits of
input that it receives from the Input Buffer.
This simple model implements the ROM used as a Switching Table with its associated
addressing logic which is implemented as a decoder.
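A table of this kind can be sketched in Python as a 16 entry lookup. The actual bit patterns are hardcoded in the VHDL of Appendix B and are not reproduced here, so every entry below other than address 0 is a hypothetical assignment chosen only to illustrate the structure: addresses 1 to 8 select a single output and addresses 9 to 15 select multicast groups.

```python
# Address 0 is the unassigned cell: no branch is enabled.
SWITCHING_TABLE = {0: 0b00000000}

# Hypothetical unicast entries: address a enables output port a-1 only.
SWITCHING_TABLE.update({a: 1 << (a - 1) for a in range(1, 9)})

# Hypothetical multicast entries for the remaining seven addresses.
SWITCHING_TABLE.update({
    9: 0b00000011,
    10: 0b00001100,
    11: 0b00110000,
    12: 0b11000000,
    13: 0b00001111,
    14: 0b11110000,
    15: 0b11111111,
})


def lookup(address):
    """Return the 8 bit branch pattern for a 4 bit address."""
    return SWITCHING_TABLE[address & 0xF]


print(format(lookup(3), "08b"))   # 00000100
print(format(lookup(0), "08b"))   # 00000000
```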
4.3.3 - The Switching Tree
The Switching Tree is implemented as a Dataflow model using a simple conditional
assignment statement. The entity receives an input signal with a bit pattern which is 8 bits
wide that determines which of its 8 branches are active.
Each output branch of the Switching Tree is implemented as a signal which is assigned the
input signal if the bit corresponding to it in the bit pattern is 1. Otherwise a 0 is assigned to
that branch.
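The conditional assignment can be mirrored in Python with one expression per branch. The function name is hypothetical, and bit b of the pattern is assumed to control branch b.

```python
def switching_tree(signal, pattern, branches=8):
    """Copy `signal` onto every branch whose pattern bit is 1, else 0."""
    return [signal if (pattern >> b) & 1 else 0 for b in range(branches)]


print(switching_tree(1, 0b00000101))  # [1, 0, 1, 0, 0, 0, 0, 0]
print(switching_tree(1, 0b00000000))  # [0, 0, 0, 0, 0, 0, 0, 0]
```

The second call corresponds to an unassigned cell, whose all-zero pattern turns every branch off.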
4.3.4 - The Write Select Register, Read Select Register And Bit Select Register
These entities are all circular serial shift registers with reset input signals. They are all
implemented using behavioral modeling and using bit-vectors of the appropriate length.
Once again the data in the registers are shifted on the rising edge of the system clock. In
this case however the output from the last bit of the register is fed back into its input to
implement the circular nature of the device.
All three registers have no input port per se. Rather, they are forced to a predefined value
during the reset phase of the system. The Read and Write Select Registers both have their
first bit set at reset, whereas, the Bit Select Register has a bit set depending on the length
of the Input Buffer as explained earlier.
All the bits in these registers are output as signals to be used as pointers.
4.3.5 - The Fully Interconnected Network
This entity is another Dataflow model responsible for implementing the fully
interconnected network. Note that links have to be implemented to fully interconnect both
the cells and the write sync signals to the Output Buffers.
The inputs and outputs of this entity are in the form of bit vectors. The model reorganizes
the bits by shuffling the bits in the input bit vector around and placing the result in the
output bit vector.
For example with 64 bits incoming for the sync signal and 64 bits outgoing (8 input ports
to 8 output ports) bits 1, 9, 17 and 25 in the input bit vector are placed in positions 1, 2, 3,
and 4 and so on in the output bit vector. This can be modeled easily in VHDL by using the
GENERATE statement as shown in the code in Appendix B.
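The shuffle in this example is a transpose of an 8 by 8 arrangement of bits. A Python sketch follows, assuming the input vector holds the 8 branches of Switching Tree 0 first, then those of Switching Tree 1, and so on; the function name is hypothetical and lists are 0-based, so position 1 in the text is index 0 here.

```python
def interconnect(in_bits, ports=8):
    """Regroup bits from 'per input port' order to 'per output port' order.

    Output group o collects branch o of every Switching Tree.
    """
    return [in_bits[i * ports + o] for o in range(ports) for i in range(ports)]


labels = list(range(1, 65))      # label each input bit with its position
shuffled = interconnect(labels)
print(shuffled[:4])              # [1, 9, 17, 25]
```

Applying the same shuffle twice restores the original order, as expected for a transpose.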
4.3.6 - The Output Buffer
The Output Buffer is implemented as a Structural model. Instances of the separate Cell
Buffers are generated in this entity. Also the Write Select Register and the Read Select
Register are instantiated. The Output Cell Counter is also instantiated based on the generic
Counter entity.
This model also produces several signals used by the individual Cell Buffers such as the
output buffer empty signal and output buffer full signal.
The output buffer receives 8 cells, 8 write sync signals, the system clock, the output of the
bit select register, the output of the global counter and the reset signal as its inputs. This
entity produces 8 cells and 8 sync signals as output.
4.3.7 - The Cell Buffer
The Cell Buffer is the heart of the entire system. This model is a mix of behavioral and
Dataflow modeling. The reason for mixing the modeling paradigms is to simplify the
design without affecting the effectiveness of the model.
The Write State Machine and Read State Machines are implemented using the behavioral
description for state machines which is based on the use of the CASE statement. These
two state machines in turn drive the write select line and the read select line.
The multiplexer is implemented using a Dataflow model much in the same way as the
Switching Tree is implemented. In this case, however, the inverse operation occurs as
multiple inputs are being selected for a single output. This time a conditional signal
assignment statement is used which assigns the data in signal with one of the cell streams
based on the content of the value latched during the setup phase.
The Random Access Memory is implemented as a bit-vector variable. The different
control signals such as the write select, read select and bit select lines are tested using IF
statements to determine what operation to perform and in which location to operate on.
When the write select line is enabled, a bit in the data in signal is stored in the memory.
Also, when the read select line is asserted, one bit of data is read out of the bit-vector
representing the memory, and placed on the data out signal. The memory location in
which the bit is actually written to or read from is determined by looking for the bit which
is set during that clock cycle in the bit select register.
Note that this closely reflects the implementation of such a system in VLSI. For example,
using dynamic memory cells usually means that data is first written into a cell during the
first half of the clock cycle, and then data can be read out of the cell during the second half
of that same clock cycle. This is the same principle used here as data is first written in and
then read out from the bit-vector memory.
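The write-then-read ordering can be illustrated with a short Python sketch of the memory. The class name is hypothetical, and `bit_select` is passed as the index of the bit that is set in the Bit Select Register rather than as a one-hot vector.

```python
class CellMemory:
    """424 bit memory of one Cell Buffer, modeled as a list of bits."""

    def __init__(self, bits=424):
        self.cells = [0] * bits

    def clock(self, bit_select, write_select, read_select, data_in):
        """One clock cycle: the write is tested first, then the read."""
        data_out = 0
        if write_select:
            self.cells[bit_select] = data_in
        if read_select:
            data_out = self.cells[bit_select]
        return data_out


mem = CellMemory()
# Writing and reading the same location in one cycle returns the new bit.
print(mem.clock(bit_select=5, write_select=True, read_select=True, data_in=1))  # 1
```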
The Cell Buffer has 8 cell streams, 8 write sync signals, the reset signal, the output of the
Bit Select Register, the pointers corresponding to it from the Read Select and Write Select
Registers, the output buffer full signal, the output buffer empty signal and the clock signal
as inputs. This entity produces a serial cell stream and a write sync signal as outputs.
4.3.8 - The Counters
Two counters are required in the system to implement the Write Sync Counter and the
Output Cell Counter. A single counter entity is designed in this model. This single entity
can then be instantiated into different forms to implement the different counters.
The counters are implemented as behavioral models using an integer to keep track of the
count. The integer is then converted to a bit vector before being passed out of the entity.
The size of the counter can be changed using a generic value that determines the number
of bits in the counter. The counter can also be reset to a value which can also be specified
at instantiation.
The direction of count of the counter can be changed dynamically during the simulation as
required for the Output Cell Counter. This feature is implemented using a single bit signal
which determines whether to count up or down during the next clock cycle. The Write
Sync Counter always counts in the same direction and therefore this feature can be
disabled by never changing the direction of count for that instance of the counter entity.
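A minimal Python sketch of such a counter is given below. It is not the VHDL entity itself; the class and parameter names are hypothetical, and the wrap-around on overflow is an assumption, as the text does not state how the counter behaves at its limits.

```python
class UpDownCounter:
    """Behavioral counter with a configurable width and reset value."""

    def __init__(self, num_bits, reset_value=0):
        self.num_bits = num_bits        # plays the role of the VHDL generic
        self.reset_value = reset_value  # value loaded on reset
        self.count = reset_value        # the count is kept as an integer

    def reset(self):
        self.count = self.reset_value

    def clock(self, count_up=True):
        """Advance one clock cycle in the direction given by count_up."""
        step = 1 if count_up else -1
        # Assumption: the count wraps around at the limits of the bit width.
        self.count = (self.count + step) % (2 ** self.num_bits)

    def as_bits(self):
        """Convert the integer count to a bit vector, most significant bit first."""
        return [(self.count >> i) & 1 for i in reversed(range(self.num_bits))]
```

An instance that never passes `count_up=False` behaves like the Write Sync Counter, while the Output Cell Counter would change direction from cycle to cycle.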
4.3.9 - The ATM Switch
The ATM switch entity is the overall model that encloses all the other entities described so
far. The model is strictly structural in nature. It instantiates all the other elements of the
system and connects them together using signals to implement the switch fabric. The
model is made of the following instances.
• 8 Input Buffers corresponding to the 8 cell input streams
• 8 Sync Buffers corresponding to the 8 sync signals
• 8 Switching Tables to allow for parallel address lookup for each incoming cell
• 16 Switching Trees to accommodate the switching of 8 write sync signals and 8 cell
signals. Note that the 8 Switching Trees used for the cell signals are only used to
provide cell duplication onto the fully interconnected network. The bit patterns fed
into these Switching Trees are all 1's which means that all the branches are enabled.
• 8 Output Buffers, one for each output port.
• The Bit Select Register which is used by all Cell Buffers as all cells are required to
arrive together and be processed concurrently in the system.
• The Fully Interconnected Network which links all the Switching Trees to all the
Output Buffers.
The ATM Switch entity has 8 inputs for cells, 8 inputs for the sync signal, the system reset
input and the system clock input. The entity has 8 outputs for cells and 8 outputs for the
sync signals.
4.4 - Simulation of the VHDL model
The main purpose of producing a VHDL model of the HiPer ATM switch before its actual
implementation is to show that the device is feasible and that the design is functionally
correct. The previous sections described the details of the implementation so as to show
that it is feasible. The following sections will simulate the model to show that the system
actually works in practice.
The first requirement in simulating a system is to determine what needs to be simulated
and how to go about producing the simulation. The functional simulation will have to
exercise the different mechanisms in the systems to make sure that they are working
properly. Furthermore, the results of the simulation will have to be consistent with what is
expected from theoretical analysis. Ideally, the device will have to be simulated through
every possible combination of input so as to exercise as much of the model as possible and
so as to increase the confidence in its functionality.
The model of the ATM switch is first simulated one entity at a time. This is part of the
development process of the model rather than of the validation described here, and it is
therefore not elaborated on any further. The next
step is to start putting these components together and simulating them while ensuring that
they interact with each other properly. Finally the whole system is put together and the
ultimate test is to see if ATM cells can actually be routed through the switch as expected.
4.4.1 - The Testbench
A testbench is therefore created to simulate a network into which the ATM switch model
can be inserted. The testbench comprises components that simulate Input
Modules and Output Modules.
The Input Modules read data from ASCII files and convert that data into ATM cells.
These cells are then fed into the switch according to the interfacing protocol required by
the switch design. The address to which cells are routed and the source file of data can be
specified when the Input Modules are instantiated to provide more flexibility.
On the output side of the switch, a set of Output Modules are attached to the output ports.
Each Output Module is responsible for monitoring the outgoing flow of cells and
retrieving the data contained within the cells and writing them out to ASCII files. The
name of the output files can also be specified at instantiation to make the testbench more
flexible.
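The data path through the testbench can be illustrated with the following Python sketch. It is not the testbench itself: the function names are hypothetical, a cell is reduced to an (address, payload) pair, and the final cell is simply left short, since the handling of a partial last cell is not described here. The 48 byte payload is the standard ATM payload size.

```python
PAYLOAD_BYTES = 48  # payload carried by one ATM cell

def file_to_cells(data, address):
    """Input Module sketch: split file contents into (address, payload) cells."""
    return [(address, data[i:i + PAYLOAD_BYTES])
            for i in range(0, len(data), PAYLOAD_BYTES)]

def cells_to_file(cells):
    """Output Module sketch: recover the file contents from the received cells."""
    return b"".join(payload for _address, payload in cells)
```

If the switch delivers every cell in order, `cells_to_file(file_to_cells(data, address))` reproduces `data` exactly, which is the property checked when the input and output files are compared.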
The choice of ASCII files for the input and output format of data enables easy viewing of
the data produced by the switch and fed to it. A more advanced set up could have been
designed to be self checking. Although that would have been more effective, the amount
of time required to implement such a setup was not available.
As the design is very symmetrical in nature only small sections of the chip need to be
tested to determine proper functionality. This feature is exploited by producing one set of
tests that will be run simultaneously through different input and output ports of the switch.
4.4.2 - Cell Switching Validation
The main activity performed by the device is cell switching, and therefore, the main goal of
the tests is to determine if the cells are switched properly. This feature is tested by feeding
in a file through one input port destined to a single other output port. The output port
receives data from only one input port, which prevents any contention at
the output. The input and output files were visually inspected for differences, and the UNIX
diff command was run on the input and output files, resulting in no differences.
This simulation verifies that the core machinery driving the system is working properly.
For example the Output Buffering mechanism and the cell switching algorithms are both
tested under this scenario. Although this test does not push the
system to its limits (for example by causing buffers to overflow), it demonstrates
that the basic mechanism is working properly.
4.4.3 - Cell Multicasting Validation
The next feature that is tested is that of cell multicasting. As mentioned earlier the ability
to multicast is one of the attractive features of the ATM protocol. The basis of cell
multicasting is that one cell can be transmitted to several different output ports
simultaneously. The HiPer ATM switch claims to be able to implement that function and
this is tested by feeding a file into an input with an address that would cause it to be
multicast to two output ports. The two files produced by the switch are also
successfully compared to the original file.
Exercising this feature also helps to check the proper functionality of the Switching Trees,
the Switching Tables and the Interconnection Network. In addition, the buffering
mechanism is now stressed a little bit more as more than one Output Buffer is active
simultaneously. As a matter of fact, each Output Buffer is designed to be independent of
the others and this test ascertains that this is indeed the case.
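The interaction between a Switching Table and the two kinds of Switching Trees during a multicast can be sketched as follows. This is a simplified Python illustration, not the VHDL model: the function names and the table contents are hypothetical, and the Switching Table is assumed to return the set of output ports for an address.

```python
NUM_PORTS = 8

def switching_tree(signal_bit, enable_mask):
    """Copy signal_bit onto every branch whose enable bit is set."""
    return [signal_bit & enable for enable in enable_mask]

def route_cell(address, table):
    """Return (write sync branches, cell branches) for one incoming cell."""
    outputs = table.get(address, ())
    # The write sync tree is enabled only towards the selected output ports.
    sync_mask = [1 if port in outputs else 0 for port in range(NUM_PORTS)]
    # The cell tree is fed all 1's, so the cell is duplicated to every port.
    cell_mask = [1] * NUM_PORTS
    return switching_tree(1, sync_mask), switching_tree(1, cell_mask)
```

With a hypothetical table entry `{0b0101: (2, 5)}`, the cell reaches all eight Output Buffers but only Output Buffers 2 and 5 receive a write sync pulse, so only those two store the cell.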
4.4.4 - High Traffic and Congestion Handling Validation
The next test consists of stressing the switch to its limits to ensure that the control
mechanism does not break down on rare events or events that only happen under specific
conditions. This is carried out by switching three different cell input streams to the same
output port. The three input ports selected for this test are ports 0, 6 and 7. Input ports 0
and 7 are selected to ensure that the system will work properly at the limit of the
synchronization mechanisms. Port 0 tests that the synchronization works properly for the
first port and port 7 for the last port. It is assumed that if the two extremes work properly,
then the ports in between should function as well.
Input port 6 is used to demonstrate that the switch will work properly when two write
sync signals follow each other right away. The switch initially failed this test,
which led to the discovery of a bug in the design. The bug has since been fixed and
the switch now passes this test too.
This test also pushes the buffers to their limits as they are forced to overflow under the
pressure of all the incoming traffic. The output file from that test is looked at next. The
three input files and the output file are found at the end of Appendix B.
The three input ports all produce separate different data streams ensuring that the data
from each cell can be identified in the output file. The output file starts out being a mixture
of the three input streams. The first 48 bytes in the file arrive from input port 0 followed
by 48 bytes from input port 6 and then 48 bytes from input port 7. This indicates that the
data is being queued up properly in the Output Buffers and that they are indeed being
switched properly.
This pattern goes on for a few more packets after which all the packets start to arrive from
input port 0 only. This is the point of high congestion where the Output Buffers have
overflowed and only one cell can enter the buffer at a time as one cell is being removed.
As the write sync signal for input port 0 arrives at the Output Buffer first, it gets the first
pick at a Cell Buffer. After that, the Output Buffer becomes full again and the next two
write sync signals are ignored leading to the cells being lost.
This final test provides enough confidence in the proper functionality of the switch and
therefore concludes the study of the VHDL model of the HiPer ATM switch.
4.5 - VHDL Simulation Results
The VHDL model is simulated using the QuickHDL simulator by Mentor Graphics
Corporation. This section uses simulation outputs to demonstrate some of the major
concepts that have been discussed in this chapter. The simulation displayed here involves a
combination of all the validation plans listed earlier. Cell switching is demonstrated by
cells arriving into port 1 and leaving through port 1. Cell multicasting is demonstrated by
cells arriving into port 5 and leaving through ports 2 and 5. Finally, heavy traffic is
demonstrated by cells entering ports 0, 6 and 7 and destined to port 0. The rest of the
ports carry unassigned cells (address 0). The address fields
of the cells are located between the first long pulse in the cell stream (this pulse is 4 clock
cycles wide) and the first data pulse. For example, notice the address 0101 in the
serial_in(5) signal in Figure 4.10. Also looking at serial_in(3), the address field is 0 as this
port carries only unassigned cells.










Figure 4.10: Timing diagram of cells into and out of switch
The relationship between the incoming cells, outgoing cells, and sync signals is shown in
Figure 4.10. The signals labeled serial_in are the cells entering the switch. The signal
below each serial_in signal is the sync_in signal. This corresponds to the sync signal of
that ATM cell as produced by the Input Modules and as mentioned earlier this signal is
fundamental for the correct operation of the switch. Notice that cell transmission starts
right after the reset signal has been de-asserted as required by the architectural
specifications (Section 4.2.5).
The ATM cells are seen coming out in the form of the serial_out signals accompanied by
their sync signals, which are labeled sync_out. The sync signals are generated by the HiPer
switch this time. Also, the delay due to transmission of the cell through the switch fabric
can be seen, and it measures approximately 40.5 clock cycles. Finally, notice that a sync
signal is produced even when no cells are being transmitted by a port. This is a
requirement of the ATM protocol as idle cells must be inserted in the flow when no data is
to be transmitted. The idle cells in this case are unassigned cells with the VPI/VCI value 0.
Figure 4.11: Functions of write sync signals in Output Buffers and Cell Buffers
The processing of write sync signals is demonstrated next in Figure 4.11, which uses a
snapshot of events in Output Buffer 0. The timing diagram shows how the write sync
signals are combined together and modified to produce other signals in the Output Buffers
and Cell Buffers. The separate sync signals are first staggered in time in the Input Sync
Buffers to produce the write sync signals as explained earlier (Section 4.2.1). These
signals are denoted as write_sync_in in Figure 4.11. As cells are coming into output port 0 from
input ports 0, 6 and 7, the combined write_sync signal is first high for one clock cycle, then low
for 5 clock cycles and then finally high for two more clock cycles as shown. In the Output
Buffer, this signal is combined with the clock to produce the write_clk signal and
part of the output_counter_clk signal. The original write_sync signal is also combined with
the write_select signal inside a Cell Buffer to produce the input_port_latch_clk signal.
The input_port_latch_clk signal clocks the latch that holds the port number from which
the current cell is coming in. The value latched by this register is the value present in the
Input Counter as shown. As this Cell Buffer is receiving cells from port 0, the Input Port
Latch latches a value of 0.
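The shape of the combined write sync signal follows directly from the staggering of the individual sync signals. The Python sketch below assumes that the sync pulse of port p is delayed by p clock cycles and lasts one clock cycle, which is consistent with the waveform described above; the function names are hypothetical.

```python
NUM_PORTS = 8

def combined_write_sync(active_ports):
    """OR together the staggered one-cycle sync pulses of the active ports."""
    return [1 if cycle in active_ports else 0 for cycle in range(NUM_PORTS)]

def latched_ports(write_sync):
    """Input Counter value (the clock cycle index) at each write sync pulse."""
    return [cycle for cycle, level in enumerate(write_sync) if level]
```

For input ports 0, 6 and 7, `combined_write_sync({0, 6, 7})` is high for one cycle, low for five and high for two, and the Input Counter values seen at the pulses are 0, 6 and 7, which are the port numbers latched by successive Cell Buffers.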
The write_select_clk signal has the function of clocking the Write Select Register when
space is available in the Output Buffer to store cells. As shown in the timing diagram, the
register shifts by 1 bit every time it encounters a rising edge on write_select_clk. This
register, therefore, keeps track of the next available Cell Buffer in the Output Buffer. The
bits in the Write Select Register drive the write lines of each Cell Buffer. As shown
in the diagram, the write_select line is de-asserted in Cell Buffer 0 as soon as it is assigned
an input port to buffer.
The output_counter_clk signal clocks the Output Counter up when the count_direction
signal is high. This signal is part of the Output Buffer as each buffer has an individual counter.
Notice that the value of the counter increases every time it receives a rising edge. Also
note that during the assertion of the least significant bit of the Bit Select Register (the start
of a new cell cycle) the count_direction signal goes low and a pulse on the output_counter_clk
signal causes the Output Counter to count down. This effectively implements a counter
that counts the total number of Cell Buffers that are full for the next cell cycle.
The least significant bit of the Bit Select Register is used as the clock for the Read Select
Register in the Output Buffer. This is shown in Figure 4.11. When bit_select(0) is asserted
and the Cell Buffer is not empty the Read Select Register shifts by 1 bit to point to the
next Cell Buffer. This register thus points to the next Cell Buffer from which data will be
read. Notice that the local read_select signal in Cell Buffer 0 is de-asserted on the rising
edge of bit_select(0).
The write_state_1, write_state_2 and read_state signals are the states of the three state
machines present in this system. The write_state_1 signal first changes state from Idle to
Ready when the first write_sync signal hits. The write_state_2 state machine then goes
into the Writing state when the bit_select(0) signal is asserted, after which
write_state_2 goes back to Idle. Similarly the Read Select State machine is represented by
the read_state signal and it goes from Idle to Reading when bit_select(0) is
asserted.
The signal cell_in(0) represents the actual cell arriving into the Cell Buffer after all the
sync signals preceding it have been processed completely. Looking at Figure 4.11, the cell
coming out of the Cell Buffer shows up half a clock cycle later as would be the case in the
VLSI implementation where dynamic Random Access Memory is read during the second
half of the clock cycle.



































































































































































Figure 4.12: Congestion simulation and FIFO characteristics
Figure 4.12 shows how the Cell Buffers are written to and read from to implement a
multi-ported First In First Out data structure. The write_sync signal at the top of the
timing diagram indicates whether cells are being received or not during that cell cycle. The
boundary of cell cycles is indicated by the bit_select(0) signal which is also located at the
top of the timing diagram. The output counter register keeps track of the length of the
queue. Notice that it initially increases by 2 every cell cycle as 3 cells are arriving and 1
cell is leaving simultaneously. Once the buffer becomes full, the output counter becomes
steady at 7 meaning that some cells are now lost due to overflow of the buffer. The value
of the counter is always 7 and never 8 as 1 cell is always leaving the Output Buffer when
the buffer is not empty. Finally the value of the Output Counter starts to decrease by 1 at a
constant rate until reaching 0 as the traffic has died off and all remaining cells in the
Output Buffer are being transmitted.
The Read State Machine and Write State Machines of the Cell Buffers are also displayed
in Figure 4.12. The first 3 Cell Buffers are written to during the first cell cycle as seen in
the timing diagram. The value latched in each input_port_latch register indicates that Cell
Buffer 0 is holding a cell from port 0, Cell Buffer 1 is holding a cell from port 6 (110 in binary)
and Cell Buffer 2 is holding a cell from port 7. At the same time, the Read State Machine
is in the Read state for Cell Buffer 0 meaning that the first cell is being read out from
there. The other Read State Machines are still idle as only one cell can be read out from an
Output Buffer at a time.
During the next cell cycle, Cell Buffers 3, 4 and 5 now latch input ports 0, 6 and 7 as
expected whereas the Read State machine for Cell Buffer 1 is now active. This process
repeats itself in the third cell cycle.
At the end of the third cell cycle, the buffer is now full and cells must now be turned
down. Therefore in the fourth cell cycle the Write State Machines for Cell Buffers 1 and 2
are the only ones which are active and the cell from port 7 is lost. In the next cycle, things
get worse as now only 1 cell from port 0 can be buffered in Cell Buffer 3, whereas the
cells from ports 6 and 7 are dropped. This state of congestion then continues until finally
cells stop arriving on the input ports and the cells are gradually read out of the buffer,
until it is empty in the last displayed cycle.
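The congestion behavior described above can be reproduced with a small cell-cycle level model, sketched here in Python. This is a simplified illustration and not the VHDL model: the function name and the trace format are hypothetical, the free space is assumed to be evaluated at the start of each cell cycle, and cells from lower numbered ports are assumed to be accepted first because their write sync signals arrive first.

```python
def simulate_output_buffer(arrivals_per_cycle, num_cell_buffers=8):
    """Cell-cycle level sketch of one Output Buffer under congestion."""
    held_port = [None] * num_cell_buffers  # input port latched by each Cell Buffer
    write_ptr = read_ptr = count = 0
    trace = []
    for arrivals in arrivals_per_cycle:
        free = num_cell_buffers - count     # space seen at the start of the cycle
        ordered = sorted(arrivals)          # lower port numbers sync first
        accepted, dropped = ordered[:free], ordered[free:]
        written = []
        for port in accepted:
            held_port[write_ptr] = port
            written.append(write_ptr)
            write_ptr = (write_ptr + 1) % num_cell_buffers
        count += len(accepted)
        sent = None
        if count > 0:                       # one cell leaves per cell cycle
            sent = held_port[read_ptr]
            held_port[read_ptr] = None
            read_ptr = (read_ptr + 1) % num_cell_buffers
            count -= 1
        trace.append({"written": written, "dropped": dropped,
                      "sent": sent, "count": count})
    return trace
```

Driving the model with cells from ports 0, 6 and 7 in every cell cycle gives counter values of 2, 4, 6, 7 and 7, with Cell Buffers 1 and 2 written in the fourth cycle and only Cell Buffer 3 written in the fifth, in agreement with Figure 4.12.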
Chapter 5 - Circuit Level Simulations
The last two chapters looked at two different levels of abstraction of the HiPer ATM
Switch. Chapter 3 dealt with cell level simulations whereas Chapter 4 covered bit level
simulations. This chapter will describe circuit level simulations of the circuit.
The objective of this chapter is to present the circuit used to implement the HiPer ATM
Switch. Note that this chapter will combine both gate level and transistor level simulations
as digital gates are intrinsically made up of transistors.
The success of this new architecture is based primarily on the use of high speed
components in its design and implementation. This goal is achieved by extensively
researching high speed circuit design techniques to implement the individual components
that make up the switch, while maintaining practical limits on component sizes and
transistor count.
The circuit technology selected for the project is the Metal Oxide Semiconductor Field
Effect Transistor (MOSFET), which offers the possibility for very small feature sizes. The
Complementary MOS (CMOS) logic gate design technique is used in the
implementation to take advantage of its high speed characteristics. The main feature of
CMOS design is that it uses two different (or complementary) types of MOSFET
transistors to implement logic gates. A process with a minimum feature size of 0.5 µm is
selected for this implementation.
Both dynamic and static logic gate design techniques are used to implement the system.
The tradeoff between speed and transistor count is the main issue faced by circuit level
designers. Increasing the number of transistors can often lead to faster circuits. If the gate
needs to be replicated several thousand times, however, the high speed might come at the cost of a
larger die. The size of a silicon chip must also be minimized to ensure good yield during
fabrication and to minimize the amount of power consumed by the device.
5.1 - Circuit Level Issues
VHDL modeling offers a sheltered environment within which to simulate a circuit as many
assumptions are made in the digital domain which might not hold true in practice. The
transistor level circuit introduces a number of variables which must be taken into account
to produce a successful design.
The typical behavioral VHDL model does not take into account the delay in the
transmission of digital signals from one gate to the next. This is a direct result of the fact
that devices in a circuit have parasitic components that are of a complex nature, and that
cannot be predicted before implementing a circuit. These parasitic components often result
in a signal taking a finite amount of time in traveling from the output of a gate to the input
of the next gate. When modeling a system, such a delay is often discounted as being
insignificant and it is therefore ignored. Unfortunately, in practice, instances where the
parasitic components are of a significant nature tend to be the rule rather than the
exception. For example, long connections between two gates or a large number of
transistor gates connected to the same output all lead to parasitic components which
produce significant delays in the actual circuit.
5.1.1 - Sources of parasitic components
The sources of parasitic components in a circuit are as follows.
• The internal capacitance of an interconnection line is the capacitance
formed between the interconnection line and the substrate on which the silicon chip
is being built. An increase in the area of an interconnection line leads to an increase in
its internal capacitance and consequently leads to an increase in the amount of time
taken to route a signal through that link. On the other hand, the further a layer is
from the substrate, the lower its internal capacitance.
• The cross coupling capacitance is due to capacitance between different layers of the
circuit that are either next to each other or above each other in the silicon chip. The
amount of capacitance produced by this effect is directly proportional to the area of
each layer that is exposed to the other, and inversely proportional to the distance
between the layers.
• Internal resistance is another component of circuit parasitics. The internal resistance of
a layer is directly proportional to the number of geometrical squares that make up that
layer. The actual size of the squares does not matter in this case as any square
structure of a given material has the same fixed resistance in a silicon chip. As the
number of squares arranged in series increases, the resistance of the layer increases
proportionally, leading to increased delays in transmission time through that layer.
• Contact cuts between layers at different levels of a silicon chip also have an intrinsic
resistance. Increasing the number of contact cuts, however, leads to more resistors of
the same value in parallel and therefore results in lower contact resistance and higher
transmission speeds.
• Digital gates are made up of silicon transistors. Transistors by their very nature have
internal capacitances on their nodes. The gate of a transistor, which usually makes up
the input of a digital gate, has a capacitance due to the substrate and the field under the
gate. The drains of transistors, which usually make up the outputs of digital gates, have
intrinsic capacitances due to the field at the junction of the node and the substrate.
These nodes also have internal resistances associated with them which cause the RC
constant of the node to go up, leading to a loss in performance.
• The environment in which a device operates is also a major source of parasitic
fluctuations in the behavior of a circuit. For example, an increase in the temperature of
a transistor leads to a decrease in the current flowing through the channel of that
transistor. This, however, is not taken into consideration as the device to be designed
is expected to work under ideal conditions.
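The proportionalities listed above can be written as first order estimates, sketched here in Python. The function names are hypothetical, no process values are implied, and the capacitance formula is the ideal parallel-plate approximation, which ignores fringing fields.

```python
def wire_resistance(sheet_resistance_ohms, length, width):
    """Sheet resistance times the number of squares (length / width) in series."""
    return sheet_resistance_ohms * (length / width)

def contact_resistance(single_contact_ohms, num_contacts):
    """Identical contact cuts in parallel divide the contact resistance."""
    return single_contact_ohms / num_contacts

def plate_capacitance(permittivity, area, separation):
    """Parallel-plate estimate: proportional to area, inverse to separation."""
    return permittivity * area / separation

def rc_delay(resistance, capacitance):
    """First order RC time constant of a node."""
    return resistance * capacitance
```

Scaling the length and width of a wire by the same factor leaves the number of squares, and therefore the resistance, unchanged, which is why only the number of squares matters and not their size.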
5.1.2 - Clock and Data Skew
Clock skewing occurs when different clocks in a system go out of phase and start to
trigger events out of synchronization with each other. This is a direct result of the fact that
the delay encountered by each clock is different, and they might reach different functional
units at different times. Clock skew is a major issue in high speed VLSI devices. Circuits
that require clocks with multiple phases are the most affected by this issue. [22, 24]
The simplest solution to the clock skew problem is to eliminate it altogether by using a
single clock signal with the same phase over the entire circuit. This is the solution adopted
in this project and the implementation of such a system requires the use of single clock
phase devices such as TSPC-1 [24].
The clock and the data can also go out of phase with each other resulting in the wrong
data being latched in a system. This problem is more difficult to resolve, but observing
some strict timing requirements such as setup and hold times can get rid of the problem.
These requirements, however, become more difficult to meet when the speed of the clock
increases as less time is available to ensure that the data and the clock signal are in sync.
This problem is prominent in long registers where the clock might overtake the data and
trigger the latches before the data gets to the latch. This problem is solved by feeding the
clock signal in the opposite direction from where the data is being fed in. [22]
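The benefit of feeding the clock against the direction of the data can be illustrated with an idealized Python sketch. It assumes the extreme case in which the clock edge reaches the stages of a register strictly one after another, so that a stage may already see the updated output of its neighbor; the function and argument names are hypothetical.

```python
def shift_with_skewed_clock(stages, data_in, clock_from_input_side):
    """Shift a register once when the clock edge arrives stage by stage."""
    regs = list(stages)
    indices = range(len(regs))
    # The clock edge visits the stages in the direction it is fed in.
    order = indices if clock_from_input_side else reversed(indices)
    for i in order:
        regs[i] = data_in if i == 0 else regs[i - 1]
    return regs
```

Starting from `[1, 0, 0, 0]` with a 0 at the input, clocking from the output side gives the expected `[0, 1, 0, 0]`, whereas clocking from the input side lets the new value race through every stage and the stored 1 is lost.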
5.2 - The Chip Floorplan
The floorplan of a chip is the first step in laying out a circuit in silicon. This process
consists of roughly estimating the size of major components of the system and arranging
them as they would be laid out in the final chip. This process is very speculative in nature
as it is fairly difficult to estimate the exact size of the different components. The initial
floorplan, however, provides an excellent starting point from which the chip can be
fragmented into smaller modules which can be designed separately.
Producing a floorplan also allows the designer to plan in advance how signals will be
routed around the chip. More importantly this process gives the designer a rough idea of
the type and amount of buffering that will be needed in the chip, and it also allows the































Figure 5.1: Planned floorplan of chip
The floorplan of the HiPer ATM Switch is shown in Figure 5.1. The floorplan indicates
that the chip will be rectangular in shape which is not an ideal configuration. This setup
does, however, help to make routing simpler and minimizes the amount of buffering and gating
required. It also allows all the input ports and all the output ports of the chip to be placed
on opposite sides of the device. Therefore, routing signals from the I/O pads to the
internal system can be done more efficiently.
The next step in the design process consists of producing standard cells that will be
assembled together to produce the larger components of the system.
5.3 - Standard Cells
Standard cells are layouts of basic devices that can be assembled together to produce
larger components. The use of Standard Cells allows the designer to take the level of
abstraction of IC design one level above that of layout by hand. For example, a library of
Standard Cells can be used to synthesize a design from its Hardware Description
Language model. In situations where high speed or low power is required, it is very
often more effective to lay out a circuit by hand. However, Standard Cells can still come in
handy so that the designer only has to worry about interconnecting components efficiently
rather than producing components that work.
Standard cells have fixed physical and functional characteristics. For example the
capacitance of every input and output port of the device is known exactly. The rise and fall
times of the outputs are also known precisely. The physical positions of the input ports
and output ports are also known to the
designer. Finally the functionality of a Standard Cell is guaranteed within its specified
operating parameters. These characteristics make Standard Cells a valuable addition to the
tools used by the Integrated Circuits designer.
The need for very high speed devices and low transistor count is the driving force behind
many of the Standard Cells in this project. This explains why dynamic devices are used in
many instances where speed is critical. A Standard Cell Library for the process used in this
project was not available, and therefore a set of Standard Cells was designed and
simulated by the author and they are presented next.
5.3.1 - The Dynamic Flip-Flop
The Flip-Flop is a storage element which can store one bit of information while the device
is powered up. Dynamic Flip-Flops (DFF) are used when very high speeds and small sizes
are required [24]. The Flip-Flop is also the basic component used to implement state
machines and to put together different types of registers.
The need to minimize skew prompted the use of a single phase clock in the system. The
DFFs are therefore designed so that they can run with a single clock signal. Devices of this
type are called True Single Phase Clock Dynamic Flip-Flops (TSPC DFF). The design















Figure 5.2: Transistor level schematic of a Positive Transition Triggered Dynamic
Flip-Flop
The transistor level schematic of a Positive Transition Triggered DFF is shown in Figure 5.2.
When the clock φ is low, transistors Q1, Q2 and Q3 form an inverter which pre-charges
the gate of transistor Q5 to the inverse level of the input D. Transistor Q4 is also turned
on, causing the gates of transistors Q7 and Q9 to be pre-charged high. Transistor Q8 is
turned off at that time and the output of the Flip-Flop is therefore isolated from its input.
When the clock φ goes high, transistors Q7, Q8 and Q9 now form an inverter whereas
transistors Q1, Q2 and Q3 now isolate the input D of the Flip-Flop from the pre-charge
stage. Transistor Q6 is now turned on and if the original input D was low it allows
transistor Q6 to now pull the gates of transistors Q7 and Q9 low. This is then inverted to
produce the output /Q and an additional static inverter produces the output Q. If the
original input D was high, the gates of Q7 and Q9 would remain high as the gate of Q5
would be low, leading to a low output at /Q and a high output at Q.
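The two phase operation of this Flip-Flop can be summarized by a purely logical model, sketched below in Python. It ignores charge leakage and all analog effects, and the class and attribute names are assumptions; `node_a` stands for the gate of Q5 and `node_b` for the gates of Q7 and Q9.

```python
class PositiveEdgeTspcDff:
    """Logical sketch of the pre-charge and evaluate phases of the DFF."""

    def __init__(self):
        self.node_a = 1   # gate of Q5: inverse of D, sampled while the clock is low
        self.node_b = 1   # gates of Q7 and Q9: pre-charged high while the clock is low
        self.q = 0
        self.clock = 0

    def step(self, clock, d):
        """Apply one clock level and input value, and return the output Q."""
        if clock == 0:
            self.node_a = 1 - d   # input inverter follows D
            self.node_b = 1       # pre-charge; the output stage holds its value
        elif self.clock == 0:     # rising edge: evaluate
            if self.node_a == 1:  # D was low, so node_b is pulled low
                self.node_b = 0
            self.q = self.node_b  # /Q is the inverse of node_b and Q inverts /Q
        self.clock = clock
        return self.q
```

While the clock stays high the input is isolated, so changes on D have no effect until the next low phase, and the value captured at the rising edge is held on Q.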










Figure 5.3: Transistor level schematic of Negative Transition Triggered Dynamic
Flip-Flop
The transistor level schematic of a Negative Transition Triggered DFF is shown in Figure
5.3. This device latches a signal when the clock goes from high to low as explained next.
When the clock φ is high, transistors Q1, Q2 and Q3 invert the input D whereas transistors
Q7, Q8 and Q9 isolate the outputs Q and /Q from the input stage. This time, however, the
pre-charge stage formed by Q4, Q5 and Q6 pre-charges the gates of Q7 and Q9 low which
turns off transistor Q9. When the clock φ goes low, transistors Q1, Q2 and Q3 now isolate
the input D from the pre-charge stage, whereas transistors Q7, Q8 and Q9 now invert the
signal level at the gates of transistors Q7 and Q9. At this point, if the original input D was
high, transistors Q4 and Q5 are turned on. This pulls the gates of transistors Q7 and Q9
high leading to a low output on /Q and a high output Q. If the original input D was low,
transistor Q5 would be turned off and the gates of transistors Q7 and Q9 would remain
low leading to a high output on /Q and a low output on Q.
The last standard cell described in this section illustrates how the features of synchronous
set and reset signals are implemented in the Dynamic Flip-Flop. The Negative Transition
Triggered DFF is used as an example to illustrate this concept.
Figure 5.4: Negative Transition Level Triggered Dynamic Flip-Flop with
Synchronous Reset and Set signals
The Negative Transition Triggered Dynamic Flip-Flop with Synchronous Set and Reset
signals is shown in Figure 5.4. The device operates in exactly the same way as a regular
Negative Transition Triggered Dynamic Flip-Flop under normal operation. However,
when either the R or the S signal is asserted, a predefined value can be clocked into the
Flip-Flop. When S is asserted during the high phase of the clock φ, transistor QS is turned
on, and that forces the gate of transistor Q5 to be pre-charged low no matter what the
input D is. This low signal is then latched into the device when the clock goes low, leading
to a high on the output. On the other hand, when signal R is asserted, transistor QR is
turned on. This forces the gates of transistors Q7 and Q9 to be pulled low during the low
phase of the clock no matter what was latched in during the high phase of the clock. This
then forces the output of the Flip-Flop to be low at the end of the cycle.
Note that the S signal must be asserted at least during the high phase of the clock whereas
the R signal must be asserted at least during the low phase of the clock. Also, if both the R
and the S signals are asserted, it is the R signal that dominates, and the
Flip-Flop output is forced low. The opposite process is implemented in the Positive
Transition Triggered DFF.
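The priority between the set and reset signals described above can be checked with a short behavioural model. This is a Python sketch under idealized assumptions: the function name `dff_neg_sr` and the internal names are illustrative, and each node is treated as a Boolean value.

```python
def dff_neg_sr(d, s=False, r=False):
    """Value captured on Q by the falling clock edge (behavioural sketch).

    s is assumed valid during the high phase, r during the low phase.
    """
    # High phase: the input stage drives the gate of Q5 to the inverse
    # of D, unless QS forces that gate low.
    gate_q5 = False if s else (not d)
    # Low phase: Q4 and Q5 pull the gates of Q7 and Q9 high when the gate
    # of Q5 is low; otherwise the node keeps its pre-charged low level.
    node = not gate_q5
    # QR pulls the same node low during the low phase, so R overrides S.
    if r:
        node = False
    not_q = not node   # output inverter formed by Q7-Q9
    return not not_q   # static inverter producing Q
```

With neither signal asserted the function returns D, with S alone it returns a high, and with R asserted it returns a low whether or not S is also asserted.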
A final note on the TSPC DFF created by Yuan and Svensson is that the original device
has a digital glitch when the input D and the output /Q have the same value [10]. The
glitch can be minimized by sizing the transistors in the latch properly.
5.3.2 - The Static Flip Flop
Flip-Flops can also be designed using static elements. In this case, however, the speed of
operation of the devices is lower and most static flip-flop designs require a two phase
clock to operate properly. On the other hand, this type of device can hold data reliably for
much longer periods of time without the charge sharing problems that arise in
dynamic devices [10]. A variation of the conventional static flip-flop design is used in this
system. The conventional design is modified to eliminate the need for a two phase clock.







Figure 5.5: The single phase clock Negative Transition Triggered Static Flip-Flop.
The conventional static flip-flop uses a transmission gate to implement the transparent and
hold modes required in a memory element. The transmission gate requires two complementary
signals to work properly, hence the need for a two phase clock with opposite
phases. This conventional design is modified as shown in Figure 5.5 to use pass
transistors instead of transmission gates. The functionality of the device remains exactly
the same as the conventional static flip-flop except for the fact that signals inside the
flip-flop do not rise and fall completely to the rails. This is a minor issue as the signal
levels are still good enough to represent the required digital values.
Figure 5.5 shows a Negative Transition Triggered Static Flip-Flop. When the clock is high
the NMOS transistors are enabled, whereas the PMOS transistors are disabled. In that
state the master stage of the flip-flop is sampling the input whereas the slave stage is
holding on to a previous value. When the clock goes low, the NMOS transistors are
turned off and the PMOS transistors turn on. In that state, the master is holding on to
the last data it sampled and is passing that data to the slave stage ahead of it. When the
clock goes back high, the master samples its input again whereas the slave latches on to






Figure 5.6: The single phase clock Positive Transition Triggered Static Flip-Flop.
A Positive Transition Triggered Static Flip-Flop is shown in Figure 5.6. The operation of
this device is very similar to its Negative Transition Triggered counterpart. In this case a
low level of the clock causes the master stage to sample the input and a high clock level
isolates the master from the input and causes the slave to sample the output of the master.
As a final note, the Static Flip-Flop uses asynchronous reset and set signals. This is
implemented by using a pull-up transistor to pull the input of the slave high when the
device needs to be cleared. Conversely, a pull-down transistor is used to pull the input of
the slave stage low so as to get a high on the output.
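The master and slave behaviour of the Positive Transition Triggered device can be sketched as a pair of level sensitive latches. The Python model below is illustrative only. The class name `MasterSlaveFF` is an assumption, and the asynchronous set and reset inputs are left out.

```python
class MasterSlaveFF:
    """Behavioural sketch of a positive transition master-slave flip-flop."""

    def __init__(self):
        self.master = False
        self.slave = False

    def apply(self, clk, d):
        """Apply one clock level together with the current input value."""
        if not clk:
            # Clock low: the master samples D while the slave holds.
            self.master = d
        else:
            # Clock high: the master is isolated from D and the slave
            # samples the output of the master.
            self.slave = self.master
        return self.slave
```

Changing D while the clock is high has no effect on the output, because the master only follows D during the low level of the clock.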
5.3.3 - The Dynamic Random Access Memory Cell
The Dynamic Random Access Memory (DRAM) cell is used to implement the output
buffers in the HiPer ATM switch. Dynamic memory has the disadvantage of being slower
than static memory, but the dynamic cell is a lot smaller than its static counterpart in terms of
transistor count. As the output buffers dominate the transistor count of the layout, a
small memory cell is required to minimize transistor count and overall size. The
DRAM cell is therefore the obvious choice for this application.














Figure 5.7: Dynamic Memory Cell
Figure 5.7 shows the transistor level schematic of a single memory cell. This design is
based on the 3T Dynamic Random Access Memory Cell [1]. The capacitance on the gate
of transistor Q6 is used to store one bit of data, and the complement of that value is stored
on transistor Q10. Data is written to the cell by turning on the S and W signals, allowing
the Din signal to either charge or discharge the gate of transistor Q6 (vice versa for Q10).
Note that unlike conventional DRAM cells this cell has a separate bit select line S. This
allows an individual memory cell to be accessed for a read or write operation in the
buffer. This is different from conventional memory, where an entire column is usually
selected at a time and a separate bit select line is not required.
The D line, which is the data out line of the cell, is pre-charged to a high level in the first
half of the clock cycle through transistor Q7. During the second half of the clock cycle
transistor Q3 is turned off and data can be read out of the cell by asserting the R and S
signals. If transistor Q6 has charge stored on its gate, it pulls the D line to ground,
causing it to discharge. On the other hand, if a low signal was stored in the cell, transistor
Q6 is turned off and the D line maintains its charge. The opposite process happens on
transistor Q10.
The outputs D and /D are fed to a sense amplifier which compares the two input signals.
The sense amplifier then produces a high or a low output depending on which of the D
or /D lines is higher.
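The pre-charge and differential read sequence described above can be illustrated with a short model. This is a Python sketch, not the circuit itself. The function names are assumptions, and so is the polarity of the sense amplifier output, which the text does not specify.

```python
def read_lines(stored_bit):
    """Levels on the D and /D lines after a read of one memory cell."""
    # First half of the clock cycle: both data lines are pre-charged high.
    d_line, d_bar_line = 1, 1
    # Second half, with R and S asserted: the storage transistor whose
    # gate holds charge discharges the line connected to it.
    if stored_bit:
        d_line = 0        # Q6 conducts and discharges the D line
    else:
        d_bar_line = 0    # Q10 holds the complement and discharges /D
    return d_line, d_bar_line


def sense_amplifier(d_line, d_bar_line):
    """Compare the two lines (assumed polarity: a discharged D reads as 1)."""
    return 1 if d_bar_line > d_line else 0
```

Exactly one of the two lines is discharged on every read, so the comparison never has to resolve two equal levels.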
5.3.4 - The Full Adder & the Single Bit Counter
A full adder standard cell is produced to implement the counters. This technique is used as
the Output Counter needs to be able to count up and down. The input counter, on the
other hand, only counts up but a separate cell is not designed for it. The additional time to
design a cell just for the input counter is not deemed worthwhile as the full adder provides
enough performance for this application. The standard cell for the full adder is obtained
from Weste and Eshraghian [22].
Figure 5.8: Standard cell for Full Adder
Figure 5.8 shows the circuit used for the full adder. This circuit is static in nature, and it
relies on the fact that the sum of a full adder can be expressed in terms of the carry out. The logic
that produces the Carry Out signal is therefore reused to produce the sum term.
This standard cell feeds its S output (the sum term) to a Flip-Flop to implement a 1 bit
counter. The C+ output (the carry out) is fed to the next significant bit in the counter. As
the Write Sync Counter operates at a high frequency, dynamic flip-flops are used to design
it. The Output Counter runs at a lower frequency and it is implemented using Static
Flip-Flop Standard Cells.
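The reuse of the carry logic and the construction of a counter from one bit slices can be sketched as follows. This is an illustrative Python model. The sum expression S = ABC + (A + B + C)·/C+ is the usual form of this reuse and is assumed here. The text does not say how counting up or down is selected, so the sketch assumes that +1 or -1 (all ones) is added to the stored value.

```python
def full_adder(a, b, c_in):
    """One bit full adder in which the sum term reuses the carry out."""
    c_out = (a & b) | (c_in & (a | b))
    s = (a & b & c_in) | ((a | b | c_in) & (1 - c_out))
    return s, c_out


def count_step(bits, up=True):
    """Advance a counter by one step; bits are listed LSB first."""
    n = len(bits)
    # Assumed operand: +1 to count up, all ones (-1) to count down.
    operand = ([1] + [0] * (n - 1)) if up else ([1] * n)
    carry = 0
    result = []
    for a, b in zip(bits, operand):
        # The sum goes to the flip-flop of this bit and the carry out
        # goes to the next significant bit.
        s, carry = full_adder(a, b, carry)
        result.append(s)
    return result
```

For example, `count_step([1, 1, 0])` moves a 3 bit counter from 3 to 4, and `count_step([0, 0, 1], up=False)` moves it back to 3.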
5.3.5 - The Dynamic Decoder
The need to minimize the size of the device while providing high speed of operation
prompted the use of dynamic logic in the design of the decoder. The main purpose of the
decoders is to enable selection of individual rows in the Switching Tables based on the
VCI field of ATM cells.





Figure 5.9: 2 Bit Dynamic Decoder
The transistor level schematic of a 2 Bit Dynamic Decoder is shown in Figure 5.9. This
design differs from regular static CMOS decoders in that only NMOS type transistors are
used to implement the branches of the decoder. Furthermore, the regular arrangement of
the transistors allows this design to be extremely compact. The output of each
branch of the decoder is pulled up high by a weak PMOS transistor. Inverters are used to
clean up the output of the decoder and produce positive logic outputs (asserted signal is
high).
When one of the branches of the decoder is turned on, it pulls the input of the inverter
connected to it low. The PMOS transistor at the output of that branch is too weak
to keep the signal high, and this causes the output of the inverter on that branch to go high.
All other branches in the decoder are disabled at that time. The weak PMOS pull-up
transistors can maintain these branches high and, consequently, the outputs of the inverters
connected to these branches are low.
This setup produces responses that are slower than those of a static device, but the savings
in transistor count make up for the speed disadvantage. Furthermore, as the device is
not used in a situation where very high speed is required, this disadvantage can be
ignored. The inverters at the end of each branch also serve as buffers to drive the row
select lines of the Switching Table. The inverters therefore serve two purposes: buffering
and logic inversion.
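The behaviour of the decoder branches can be modelled in a few lines. The Python sketch below is for illustration. The function name `dynamic_decode` is an assumption, and so is the idea that each branch is a series chain of NMOS transistors driven by true or complemented address bits.

```python
def dynamic_decode(address_bits):
    """Return the active high row select outputs (address given MSB first)."""
    n = len(address_bits)
    outputs = []
    for row in range(2 ** n):
        row_bits = [(row >> (n - 1 - i)) & 1 for i in range(n)]
        # Assumed series NMOS chain: the branch conducts only when every
        # address bit matches the pattern wired into this branch.
        conducts = all(a == r for a, r in zip(address_bits, row_bits))
        # A conducting branch overpowers the weak PMOS pull-up; an idle
        # branch is held high by it.
        branch_node = 0 if conducts else 1
        # The inverter restores positive logic and buffers the row line.
        outputs.append(1 - branch_node)
    return outputs
```

For a 2 bit address exactly one of the four outputs is high; for example, `dynamic_decode((1, 0))` selects row 2.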
5.3.6 - The Static Multiplexer
The Static Multiplexer is used in the Write logic to select one of the input
ports coming into a Cell Buffer. The regular arrangement of transistors in this device
allows for a very compact circuit layout. The small size of this device is important as the
number of multiplexers required is equal to the number of Cell Buffers in the system.
Figure 5.10: Transistor level schematic of 4 to 1 Static Multiplexer
The transistor level schematic of a 4 to 1 Static Multiplexer is shown in Figure 5.10. The
transistors are arranged to form a series of two transmission gates per input. Each
transmission gate has a unique combination of select signals connected to it. Therefore,
only one transmission gate can be enabled at a time. This feature allows all the outputs of
the transmission gates to be tied together to produce a wired-or configuration. Only one
of the inputs will be allowed to pass through to the output of this device based on which
branch is selected.
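The selection mechanism of the multiplexer can be expressed as a short model. This is a Python sketch for illustration. The function name `mux4` is an assumption, as is the particular assignment of the select signals S1 and S0 to the two transmission gates of each input.

```python
def mux4(inputs, s1, s0):
    """4 to 1 multiplexer built from two series transmission gates per input."""
    passed = []
    for index, value in enumerate(inputs):
        # Each path has its own combination of true and complemented
        # select signals on its two transmission gates.
        first_gate_on = (s1 == ((index >> 1) & 1))
        second_gate_on = (s0 == (index & 1))
        if first_gate_on and second_gate_on:
            passed.append(value)
    # Only one path can conduct, so the outputs can safely be tied together.
    assert len(passed) == 1
    return passed[0]
```

The internal assertion documents the property that makes the tied outputs safe: for any value of the select lines exactly one input path is enabled.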
5.3.7 - Generic Logic Gates
Generic gates comprise such devices as AND gates and OR gates. These devices are
implemented as Static CMOS gates. Static CMOS logic requires a set of PMOS and
NMOS transistors arranged in a complementary configuration to produce the logic
function. All the generic gates in this system are designed with minimum sized transistors.
Additional buffering using buffer chains is added to the gate when more driving power is
required. Most of these gates are small in terms of transistor count. Therefore, the hit in
transistor count is fairly small as compared to dynamic logic.
5.4 - System Components
Once Standard Cells have been designed and characterized they can be assembled together
to produce larger components as described next.
5.4.1 - The Input Cell Buffer and the Input Sync Buffer
The Input Sync Buffer and the Input Cell Buffer are both serial registers which can be
assembled from a series of Dynamic Flip-Flops chained together. Each DFF in the register
presents a load of four transistor gates to the clock signal, and skewing problems
can therefore become significant.
Fortunately clock skewing is not an issue in this design as the DFFs are single phase
devices. On the other hand the skewing between the clock and data being latched in the
register can be the source of significant problems. This is remedied by feeding the clock
signal from the opposite side where the data is being input into the register. This is a well
known technique to eliminate the problem of data and clock skewing.
5.4.2 - The Switching Table
The Switching Table is made up of a Dynamic Address Decoder and a Read Only
Memory. The ROM holds bit patterns corresponding to the address in the VCI field of
ATM cells. The bit patterns in the ROM are used to drive the branches of the Switching
Trees which are in turn responsible for switching the sync signals in the device.
The VCI field of an ATM cell which is being buffered in the Input Cell Buffer is used to
drive the inputs of the Dynamic Address Decoder. This turns one of the branches of the
decoder on which in turn enables a single row of the ROM. The output of the ROM is

















Figure 5.11: Switching Table implemented as a Read Only Memory
The Read Only Memory is designed in a regular structure using only NMOS transistors as
shown in Figure 5.11. The bit value produced in a row of bits can be changed easily by
moving the contact cut connecting the transistors to the power lines from Ground to Vcc
or vice versa. This allows for a compact and flexible Switching Table layout.
5.4.3 - The Switching Tree
The Switching Tree is responsible for switching the write sync signals from the input
ports to the appropriate output ports. Cell multicasting requires that a
Switching Tree be able to switch one cell to several different output ports. A simple
demultiplexer circuit therefore cannot be used to implement this device.
The Switching Table produces a bit pattern that indicates the individual output ports to
which a cell is addressed. This bit pattern is latched into a temporary register until the
write sync signals have been generated and switched properly. Each bit in the pattern
obtained from the Switching Table turns a CMOS transmission gate (TG) on or off. The
inputs of the transmission gates are driven by the write sync signals from the Input Sync
Buffers whereas the outputs of the TGs are fed to buffer chains that drive the
interconnection network. This technique allows individual bits from the Switching Table
to control separate TGs. A single write sync signal can therefore be switched to multiple
output ports, which implements multicasting.
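The gating of the write sync signal by the bit pattern can be sketched in a few lines of Python. The names `SWITCHING_TABLE`, `port_mask` and `switching_tree` are illustrative assumptions. The single table entry shown, address '1011' mapping to output ports 1, 4, 5 and 7, is the case used in the circuit level simulation of Section 5.6; the rest of the table is not reproduced here.

```python
NUM_PORTS = 8

# Only this entry is taken from the text; the table layout is an assumption.
SWITCHING_TABLE = {0b1011: (1, 4, 5, 7)}


def port_mask(address):
    """Bit pattern read from the Switching Table: one flag per output port."""
    ports = SWITCHING_TABLE[address]
    return [port in ports for port in range(NUM_PORTS)]


def switching_tree(write_sync, mask):
    """Each latched bit enables one transmission gate on the write sync signal."""
    return [write_sync and enabled for enabled in mask]
```

Because every output has its own gate, a single asserted write sync signal reaches all four ports at once, which a one-of-eight demultiplexer could not do.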
5.4.4 - The Write Select Registers and Read Select Registers
The Write Select Register and Read Select Registers are very similar in nature and they
are therefore treated together. Both registers are implemented using Static Flip-Flops as
they only get clocked a few times during a cell cycle. One of the flip-flop set lines has to
be tied to the reset signal to allow the registers to be initialized properly.
5.4.5 - The Bit Select Register
The Bit Select Register is a high speed device operating at the same speed as the system
clock. It is therefore implemented with the Positive Triggered Dynamic Flip-Flop device.
The least significant bit of this register is used to synchronize other events in the system.
This bit is fed to a buffer chain to be boosted before being distributed to the rest of the
system.
5.4.6 - The Write State Machine
The Write State Machine is implemented using two separate smaller state machines as
described in Section 4.2.4. The first state machine uses the system clock as a clock and the
second state machine uses the least significant bit of the Bit Select Register as a clock.
The state transition table for the first state machine is as follows.
State_1  Write  Buffer_full  Write_Select  Write_Sync  State_1+
   0       0         0            0            0          0
   0       0         0            0            1          0
   0       0         0            1            0          0
   0       0         0            1            1          1
  ...     ...       ...          ...          ...        ...
Table 5.1: State transition table for Write State Machine 1
The function of the Write State Machine 1 is to remember that a cell buffer has been
selected to be written to in the next cell cycle. This state machine goes back to its original
state when it sees that the cell is being written to by monitoring the state of State Machine
2.









Table 5.2: State transition table for Write State Machine 2
The second Write State Machine is the one which actually drives the Write line of the
memory cell. Write State Machine 2 asserts the Write line of the memory only after Write
State Machine 1 has changed to State 1. This state machine then goes back to its original












Figure 5.12: Gate level design of Write State Machine 1 and Write State Machine 2
The two state machines are implemented using a Negative Transition Triggered Static
Flip-Flop for Write State Machine 1 and a Positive Transition Triggered Static Flip-Flop
for Write State Machine 2. The gate level schematic of the two state machines is shown in
Figure 5.12. Static flip-flops are used instead of dynamic devices because of the low
frequency at which these state machines are operated.
Simulations of minimum size dynamic latches indicate that they can only hold data for a
maximum of 2 µs. A dynamic device is not appropriate in this instance as an ATM cell
cycle lasts 2.12 µs at 200 Mbps.
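The cell cycle figure follows directly from the size of an ATM cell, which is 53 bytes. The short Python calculation below reproduces it and compares it with the simulated hold time; the variable names are illustrative.

```python
CELL_BYTES = 53            # size of an ATM cell
LINE_RATE_BPS = 200e6      # serial line rate quoted in the text
HOLD_TIME_S = 2e-6         # simulated hold time of a minimum size dynamic latch

cell_bits = CELL_BYTES * 8                   # 424 bits per cell
cell_cycle_s = cell_bits / LINE_RATE_BPS     # about 2.12 microseconds
dynamic_latch_ok = HOLD_TIME_S >= cell_cycle_s
```

Since the cell cycle is longer than the hold time, `dynamic_latch_ok` evaluates to `False`, which is the reason static flip-flops are used for state that must survive a whole cell cycle.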
5.4.7 - The Read State Machine
The read state machine is implemented according to the following state transition table.

















Table 5.3: State transition table for the Read State Machine
The clock for this state machine is the least significant bit of the Bit Select Register, which
signifies the start of a new cell cycle. The rising edge of the bit_select (0) signal is used to
clock events. The state machine is in State 0 until the read select signal for that cell is
asserted. The state machine then goes into State 1 if the Output Buffer is not empty and
goes back to idle on the next clock transition. The combination of State 1 and read_select
signal 1 and buffer empty signal 0 should never happen if the logic is working properly.
This state is however forced to State 0 as a precautionary measure.




Figure 5.13: Gate level design of the Read State Machine
This state machine is implemented using a single Static Flip-Flop and a static three input
AND gate as the device operates at a fairly low frequency. This is shown in Figure 5.13.
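The next state logic of the Read State Machine can be written as a single three input AND, in line with the description above. In the Python sketch below the function name is an assumption, and the choice of the complemented state as the third AND input is inferred from the stated behaviour rather than taken from the schematic.

```python
def read_next_state(state, read_select, buffer_empty):
    """Next state of the Read State Machine on a rising edge of bit_select(0)."""
    # Three input AND: the machine enters State 1 only from State 0, when
    # the cell is selected for reading and the Output Buffer is not empty.
    return (not state) and read_select and (not buffer_empty)
```

From State 1 the function always returns State 0, which covers both the normal return to idle and the combination that should never happen.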
5.5 - The VLSI Layout
The system is separated into components according to the main floorplan presented earlier
(Section 5.2). Each major component will now be described using floorplans of the layout.






















Figure 5.14: Floorplan of the input stage
Figure 5.14 shows the floorplan of the input stage of the system. Cells and sync signals
flow into the two corresponding buffers. The Input Cell Buffer and Input Sync Buffer are
made up of a series of Positive Transition Dynamic Flip-Flops, as described previously.
These two registers are clocked with the system clock. The first four bits in the Input Cell
Buffer are placed on a short Address Bus which takes the data to the Address Decoder.
This device decodes the address and enables one of the rows of the Switching Table. As
all the devices are dynamic in nature the process is very fast and it takes a fraction of a
clock cycle to decode the address, enable the row and produce a bit pattern from the
switching table.
When the lookup sync pulse is asserted, the bit pattern from the switching table is frozen in
the Latch. As the latch operates at very low frequencies and has to hold data for a long
time, a Static Latch cell is used to implement it. A Negative Transition Triggered latch is
used for this purpose so that the high part of the lookup sync pulse is used to set up the
device. Once the bit pattern is latched, the next write sync signal that is generated by this
input stage will be switched to the right output ports.
Large driver stages are also present in the input stage to provide enough power to drive
the large load presented by the Interconnection Network. Both the write sync signal and
the serial cell signal are fed to the interconnection bus through the large drivers. These
drivers are conventional buffer chains made up of inverters of
progressively larger size connected in series. Each inverter is 3 times the
size of the one before it in this implementation.
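The sizing of such a buffer chain can be illustrated with a short calculation. In the Python sketch below the function name and the `load_ratio` parameter (the load expressed as a multiple of the input capacitance of a minimum size inverter) are hypothetical; only the stage ratio of 3 comes from the text. Signal polarity, which depends on whether the number of stages is odd or even, is not handled.

```python
def buffer_chain(load_ratio, stage_ratio=3):
    """Relative inverter sizes for a chain driving a load of `load_ratio`."""
    sizes = [1]   # the chain starts with a minimum size inverter
    # Keep adding stages, each stage_ratio times larger than the previous
    # one, until the next stage would be at least as large as the load.
    while sizes[-1] * stage_ratio < load_ratio:
        sizes.append(sizes[-1] * stage_ratio)
    return sizes
```

For a hypothetical load 100 times larger than a minimum inverter input, `buffer_chain(100)` gives five stages with relative sizes 1, 3, 9, 27 and 81.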
5.5.2 - The Interconnection Network
The interconnection network implements the bus that carries the serial cell signals and the
write sync signals from the input stage to the output stages. It is a fairly simple layout
using three layers of metal. The bus is made up of Metal 2 lines that run parallel to the
input stages. Metal 2 is used for its fairly low capacitance and so as to allow direct
connections to the inputs and outputs of the bus using a single Via. The spacing between
lines of the bus is larger than the minimum required so as to reduce coupling capacitance
between the lines.
Input lines into the bus are implemented using Metal 3 layers. These lines all come from
the Bus Drivers in the Input Stages. The outputs from the bus to the Output Buffers are
implemented using Metal 1 lines. This configuration allows for the use of a single Via to
connect any input to the bus and any output to the bus. Metal 3 layers can be connected to
Metal 2 using a single Via and Metal 2 can be connected to Metal 1 using a single Via as
well. The overall effect is that a three dimensional stacked structure is implemented where
input and output lines into the bus overlap each other without any problems.
The actual bus is generated using a Perl script which produces a file in CIF (Caltech
Intermediate Form). This process automates the layout of the
Interconnection Network and saves a large amount of time as compared to hand layout of
the bus. The only requirement for this process to work properly is that the inputs and
outputs of the bus must be located at regular intervals so that the program
knows where to generate the necessary lines.
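The idea of generating the bus from a script can be illustrated with a short sketch. The code below is not the Perl script used in this work; it is a hypothetical Python equivalent that emits one CIF box per bus line at a regular pitch. The function name, its parameters and the layer name `CMS` (assumed here to denote Metal 2) are assumptions, and all dimensions are in CIF units.

```python
def bus_cif(num_lines, pitch, width, length, layer="CMS", symbol=1):
    """Return CIF text describing `num_lines` parallel horizontal bus lines."""
    lines = [f"DS {symbol} 1 1;", f"L {layer};"]
    for i in range(num_lines):
        # A CIF box is given as: B length width x_center y_center;
        y_center = i * pitch
        lines.append(f"B {length} {width} {length // 2} {y_center};")
    # Close the symbol definition, instantiate it once and end the file.
    lines += ["DF;", f"C {symbol};", "E"]
    return "\n".join(lines)
```

Because every line is placed at a multiple of the pitch, the taps into and out of the bus have to sit on the same regular grid, which is the requirement stated above.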





















Figure 5.15: The Read/Write Logic for the Cell Buffer
Figure 5.15 shows the floorplan of the Read/Write logic block associated with each Cell
Buffer. This block includes the Read State Machine, the two Write State Machines and the
multiplexer and latch used to select the input port. The cells coming in from the input
ports are fed directly into the multiplexer block represented by Mux.
The state machines and latches require a number of external signals such as the system
clock, the global counter value and the bit select 0 signal to function properly. These
signals are fed to each R/W logic block through a bus which is shown as the Input block in
the floorplan.
The WSM1 block corresponds to the Write State Machine 1 in the system. All the logic
required to implement the state machine is implemented as one standard cell shown as
WSM1 Logic. The output from that cell is then fed to the WSM1 block which is a
Negative Transition Triggered Static Flip-Flop standard cell. The output from the WSM1
Flip-Flop is then fed to the WSM2 Logic block which is a 2 input AND gate. The output
of that standard cell is fed to the WSM2 block, which is a Positive Transition Triggered
Static Flip-Flop. Static flip-flops are used to implement these state machines as they
operate at very low frequencies.
The RSM Logic Block is a 3 input AND gate which feeds the RSM block. This block is
another Positive Transition Triggered Static Flip-Flop. The three latches that drive the
multiplexer select lines are shown as the Latch 0, Latch 1 and Latch 2 blocks. Each latch
is a Negative Transition Triggered Static Flip-Flop.
The output of each block is fed into a set of bus drivers. The signals from the bus drivers
are then tapped and connected to the memory array. This is shown as the Output Buffers
block.
























Figure 5.16: Floorplan of the Output Buffer
Figure 5.16 shows the floorplan of the Output Buffers. Each output buffer is made up of 8
R/W Logic blocks, a Write Select Register, a Read Select Register, an Output Counter
and a Write Sync Counter. The R/W Logic blocks correspond to the standard cell
described in the previous section. Each Cell Buffer must have a R/W logic block
associated with it, and therefore this Output Buffer has 8 R/W Logic blocks.
The WSR and RSR blocks are the Write Select Register and Read Select Register blocks.
These two registers are made up of Positive Transition Triggered Static Flip-Flops. The
outputs of both registers are placed on the Local Bus block, which is responsible for
distributing data around the Output Buffer.
The OC block represents the output counter of this Output Buffer. The OC block is a
self-contained standard cell which receives the sync signals from the system and generates its
own clock. This cell also includes the logic to produce the buffer full and buffer empty
signals.
The WSC block is the Write Sync Counter standard cell. The architecture of the system
specifies that the entire switch only has one Write Sync Counter. However, implementing
the switch in that way would mean that the output of the Write Sync Counter would have
to be routed to every Output Buffer in the system. This would require a large amount of
buffering and the delay introduced by this process could become significant. The actual
implementation of the chip, therefore, uses a local Write Sync Counter in each Output
Buffer.
Notice that the Sync and Cell inputs from the Interconnection Network enter the standard
cell from the side where the OC block is located. This is done on purpose so that the
Output Counter is the first block in the system to see the sync signals. The reason behind
this implementation detail is that the OC block also holds the logic that implements the
ORing of the sync signals to form the combined write sync signal. Therefore, placing the
OR gate as close as possible to where the write sync signals enter the Output Buffer
allows the combined write sync signal to be produced faster. The signal can then also be
sent to the rest of the logic blocks with a smaller delay than if the OC block were at the
other end of the Output Buffer.







OB0  OB1  OB2  OB3  OB4  OB5  OB6  OB7
        Interconnection Network
IS0  IS1  IS2  IS3  IS4  IS5  IS6  IS7
Figure 5.17: The floorplan of the HiPer ATM Switch chip
Figure 5.17 shows the floorplan of the final chip. The first row of standard cells contains the
input stages connected to each input port. These blocks are labeled IS. The write sync
signals and the ATM cells are fed from the input stages into the Interconnection Network
block. Each Output Buffer, represented here as an OB standard cell, taps the
interconnection network to grab the cells that it requires. The select lines (read and
write) and the data lines from the Output Buffers are then connected to the Random
Access Memory (RAM) through Metal 3 lines. Metal 3 is not used in the implementation
of the Output Buffer Cells. This allows direct connections to be made between each
Read/Write logic element and the RAM using Metal 3 lines.
The Random Access Memory is split into two sections that hold 64x212 bits each. The Bit
Select Register is placed in the middle of the two sections. The bits are selected starting at
the bottom of the left section of the RAM. The bit select pointer then moves up the left
section until it reaches the last bit in that section. The bit is then sent over to the right to
enable the next section. The bit select pointer then travels down the right side of the buffer
until it reaches the bottom, at which point a cell cycle is complete. The bit coming out of
the register is then fed back into the left side and it starts to go up again. This effectively
implements a circular register where all the connections between flip-flops are of equal
length.
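The folded arrangement of the Bit Select Register can be checked with a small model. The Python sketch below maps each position of the circular register to a section and a row, using the 212 bits per section quoted above. The function name and the row numbering (row 0 at the bottom) are assumptions.

```python
ROWS = 212   # bits per RAM section, from the 64x212 organization


def bit_position(index, rows=ROWS):
    """Return (section, row) of flip-flop `index` in the circular register."""
    if index < rows:
        return ("left", index)               # pointer travels up the left side
    return ("right", 2 * rows - 1 - index)   # then down the right side
```

Every flip-flop is either one row away from its successor in the same section or directly across from it in the other section, including the wrap around from the bottom of the right section to the bottom of the left section, so no long return wire is needed.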
This design also allows the bit select (0) signal to be produced close to the Output
Buffers. This, in turn, reduces the amount of buffering required to get the bit select (0)
signal to all the state machines. The capacitance of the data lines of the RAM is now also
smaller. The RAM, therefore, can operate faster as discharging and charging the data lines
now takes less time.
The total number of devices in the chip is of the order of 300,000 transistors. Out of that
number, 270,000 transistors are used to implement the Output Buffers. The dimensions of
the layout are approximately 4000 x 3000 µm².
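These figures can be cross-checked against the memory organization given earlier. The Python calculation below uses only numbers quoted in the text; the variable names are illustrative, and the per-bit figure is an upper bound since part of the 270,000 transistors belongs to the Read/Write logic rather than to the memory cells.

```python
SECTIONS = 2
CELL_BUFFERS = 64         # 64x212 bits per section
BITS_PER_SECTION = 212
OUTPUT_BUFFER_TRANSISTORS = 270_000
TOTAL_TRANSISTORS = 300_000

ram_bits = SECTIONS * CELL_BUFFERS * BITS_PER_SECTION          # 27,136 bits
bits_per_cell_buffer = ram_bits // CELL_BUFFERS                 # 424 bits, one ATM cell
transistors_per_bit = OUTPUT_BUFFER_TRANSISTORS / ram_bits      # just under 10
buffer_fraction = OUTPUT_BUFFER_TRANSISTORS / TOTAL_TRANSISTORS # 0.9
```

Each of the 64 cell buffers holds 424 bits, which is exactly one 53 byte ATM cell, and about 90% of the devices on the chip are spent on the Output Buffers.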
5.6 - Results from the circuit level simulations
The input stage is simulated first. A cell is fed to that stage with an address that will lead
to multicasting. This test verifies that the sync signals are switched to the right output
buffers.







































































Figure 5.18: Switching of the sync signals in the input stage
A bit pattern of '1011' is used as the destination address of the cell. This address causes a
cell to be multicast to 4 output ports. This can be verified by looking at the Lookup
Table in the VHDL model (see Appendix B Section B.7). Figure 5.18 shows the results of
this test.
The first signal at the bottom of Figure 5.18 is the cell coming into the input port of the
switch. The system clock signal is seen above it, followed by the complement of the
system reset signal and the input sync signal. Notice that the first bit of the cell arrives
right after the reset signal is de-asserted. Furthermore, the sync signal is only asserted
during the first bit of the cell. This is expected from the handshaking protocol specified
between the Input Modules and the Switch Fabric (Chapter 4 Section 4.2.2).
The system reset signal is displayed above the sync signal and it is followed by the 8 write
sync signals, which are produced by the Switching Trees (these are labeled sync_out_0 to
sync_out_7). As expected, write sync signals 1, 4, 5 and 7 are asserted.
The actual cell comes out of the Input Cell Buffer shortly after the sync signal has been
switched. This is also shown in Figure 5.18 as the top 8 signals in the timing diagram
(labeled cell_out_0 to cell_out_7). The delay between the cells showing up on the
interconnection network and the production of the switched write sync signals is the
amount of time required by the Output Buffers to process all the write sync signals.


















Figure 5.19: Read Select Register and memory read select lines
Figure 5.19 displays the read mechanism in the HiPer ATM switch. This simulation is run
on a small scale Output Buffer that only holds 8 bits. The first signal at the bottom of the
timing diagram is the write sync signal. The next signal above it is the system clock after it
has been delayed to arrive at the same time as the other signals in the Output Buffer. The
third signal from the bottom of the timing diagram is the least significant bit of the Bit
Select Register. Every time a pulse is seen on this signal, a new cell cycle starts.
The Read Select Register is shown in Figure 5.19 as the 8 signals above the bit select (0)
signal. Each signal is one bit of the Read Select Register. Notice that the bit shifts from
one position to the next at the start of every cell cycle.
Also shown in this timing diagram are the read select lines into the memory. These lines
are driven directly by the Read State Machine. Notice that the first line is asserted at the
rising edge of the bit select (0) signal. Shortly afterwards, the Read Select Register shifts
to point to the next cell buffer. The system is set up so that the Read Select Register will
not shift at the same time as the Read State Machine is switching states. This ensures that
problems with metastability do not happen in the system.
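The rotation of the Read Select Register described above can be sketched in C++. This is only an illustrative model, not the circuit itself: the name SelectRegister, the shift direction (towards higher bit positions) and the wrap-around from position 7 back to position 0 are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of an 8-bit one-hot select register.
// Exactly one bit is set; its position names the selected cell buffer.
struct SelectRegister {
    std::uint8_t bits = 0x01;  // assumed reset value: cell buffer 0 selected

    // Move the single set bit to the next position (assumed direction),
    // wrapping from position 7 back to position 0.
    void shift() {
        bits = static_cast<std::uint8_t>((bits << 1) | (bits >> 7));
    }

    // Index of the cell buffer that is currently selected.
    int selected() const {
        assert(bits != 0 && (bits & (bits - 1)) == 0);  // one-hot invariant
        int index = 0;
        while (((bits >> index) & 1u) == 0) {
            ++index;
        }
        return index;
    }
};
```

In this sketch shift() would be called once at the start of every cell cycle, so that eight cell cycles bring the register back to its starting position.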





















































Figure 5.20: The Write Select Register and the memory write select lines
Figure 5.20 shows the write mechanism in the Output Buffers. The signals in this timing
diagram are organized in the same way as in the previous figure. The first signal at the
bottom of the diagram is the write sync signal followed by the delayed system clock, the
bit select (0) signal, the Write Select Register and finally at the top of the diagram are the
write select lines into memory.
Looking at Figure 5.20, it can be seen that the Write Select Register shifts every time that
the write sync signal is asserted. However, the write select lines into memory do not get
asserted until the start of a new cell cycle. This demonstrates the memory effect of the two
Write State Machines, which remember that they have been selected to be written to,
although the Write Select Register is not pointing at them anymore.
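The memory effect of the two Write State Machines can be illustrated with a small C++ model. The state names follow the write_state_1_type (Idle, Ready) and write_state_2_type (Idle, Writing) declarations of the VHDL package in Appendix B; the transition rules, the structure name and the method names are assumptions made for this sketch and are not taken from the actual circuit.

```cpp
#include <cassert>

enum class WriteState1 { Idle, Ready };    // remembers a pending selection
enum class WriteState2 { Idle, Writing };  // drives the write select line

// Hypothetical write control for one cell buffer of an Output Buffer.
struct CellBufferWriteControl {
    WriteState1 state1 = WriteState1::Idle;
    WriteState2 state2 = WriteState2::Idle;

    // A write sync arrived while the Write Select Register pointed here.
    // The selection is remembered even after the register shifts away.
    void on_write_sync() { state1 = WriteState1::Ready; }

    // Start of a new cell cycle: a remembered selection becomes an
    // active write (assumed rule), otherwise the buffer is not written.
    void on_cell_cycle_start() {
        state2 = (state1 == WriteState1::Ready) ? WriteState2::Writing
                                                : WriteState2::Idle;
        state1 = WriteState1::Idle;
    }

    // Value of the write select line into memory for this cell buffer.
    bool write_select_line() const { return state2 == WriteState2::Writing; }
};
```

Under these assumptions the write select line stays low between the write sync and the next cell cycle boundary, which is the behavior visible in the timing diagram.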
Chapter 6 - Conclusions
6.1 - Summary
A new ATM switch architecture is designed and implemented in the course of this project.
A top-down approach is used to define and implement the device. The new architecture is
a fully interconnected switch fabric with output buffers to handle overflow of cells. This
architecture improves on similar switches such as the Knockout architecture where cells
are dropped at random so as not to overwhelm the buffers. The output buffers are
implemented using a multi-ported FIFO memory which allows multiple cells to be written
into the memory at the same time as a cell is being read out. The logic to implement such a
memory scheme is described and implemented in the project. The complexity of the
memory logic is small enough that it will lend itself well to scaling.
An Object Oriented Model of the architecture is first created and simulated to measure the
performance of the device. The results obtained during the simulation match what would
be expected from such an architecture.
This is then followed by a Hardware Description Model of the device using the VHDL
language. The VHDL model is used to verify that the logic can be implemented as
described. The model is also used as a blueprint for the circuit of the system and its layout.
A transistor level model of the system is implemented to confirm that the logic will operate
at the expected speed of 200 Mbps. Finally the layout of the system is produced using a
0.5 µm CMOS VLSI process.
The implementation of the system indicates that the O(N^2) complexity of a fully
interconnected architecture is not an issue at the VLSI level. The cell buffer takes up the
majority of the space of the chip layout. This, therefore, suggests that at the VLSI level
the hardware complexity of the memory system dominates over the complexity of the
interconnection network.
6.2 - Improvements And Future Directions
This implementation does not cover all the requirements for an ATM switch. The main
objective was to implement a switching fabric and that was indeed accomplished. On the
other hand, extra features such as cell priority level and VPI/VCI translation are not
available in the architecture.
The following sections will look at how these features could be implemented and how the
switch fabric could be improved.
6.2.1 - The Implementation of Cell Loss Priority
One bit in the ATM cell header is reserved to carry priority level information about the
cell. The bit is called the CLP bit for Cell Loss Priority (see Section 2). The ATM protocol
specification suggests that this bit could be used to assign a priority level to the cell but it
does not require that a priority scheme be implemented. Indeed, the present architecture of
the HiPer ATM switch does not implement this feature.
The design of the HiPer ATM switch has an inherent priority scheme implemented in its
architecture. This is due to the nature of the handshaking inside the switch fabric as
implemented with the write sync signals. The port which produces the first write sync
signal is the one that will have the highest chance of finding an empty buffer. Therefore,
that port has the highest priority of all the inputs. This process can be extended to show
that the port whose sync signal follows the first one has the second highest priority level
and so on until the port that produces the last sync signal has the lowest priority of all.
This mechanism can be put to use by assigning the channels that have the highest priority
to the first input port and the channel with the lowest priority to the last input port. This,
however, does not help the situation when a channel carries both high priority and low
priority cells in the same stream of data. This type of behavior can only be handled by
implementing the CLP bit feature.
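The inherent priority scheme can be sketched as follows. The function below is a simplified model, assuming that the write sync signals reach an Output Buffer in input port order and that each accepted cell consumes one free cell buffer; the function name and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the input ports whose cells are accepted by one Output Buffer.
// wants_output[p] is true when input port p has a cell for this output.
// Ports are served in index order, so lower-numbered ports find an empty
// buffer first and effectively have the higher priority.
std::vector<int> accepted_ports(const std::vector<bool>& wants_output,
                                int free_buffers) {
    assert(free_buffers >= 0);
    std::vector<int> accepted;
    for (std::size_t port = 0; port < wants_output.size(); ++port) {
        if (!wants_output[port]) {
            continue;            // no cell from this port
        }
        if (free_buffers == 0) {
            break;               // cells from the remaining ports are lost
        }
        accepted.push_back(static_cast<int>(port));
        --free_buffers;
    }
    return accepted;
}
```

With two free cell buffers and cells arriving from ports 0, 2 and 3, only ports 0 and 2 are served in this model, which is why a high priority channel would be assigned to the first input port.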
The CLP bit can be combined with the address field so that when the CLP bit is asserted
the bit pattern returned by the Switching Table will cause the cell to be dropped rather
than routed to a buffer.
This method is not very efficient as during periods of low congestion cells will still be lost
due to the fact that they have the CLP bit set. The solution to this problem is to enable the
use of the CLP bit when the size of the buffer reaches a certain threshold set by the
designer. This could be done by decoding the value of the Output Counter and sending a
signal back when that value exceeds a preset level. The signal could then be ANDed with
the CLP bit to turn the priority feature on and off.
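The thresholding idea can be written down as a minimal sketch. The function names and the use of plain integers for the Output Counter value and the preset level are assumptions made for the example; only the comparison against a preset level and the AND with the CLP bit come from the description above.

```cpp
#include <cassert>

// True when the decoded Output Counter value exceeds the preset level.
bool threshold_exceeded(int output_counter, int preset_level) {
    assert(output_counter >= 0 && preset_level >= 0);
    return output_counter > preset_level;
}

// The threshold signal is ANDed with the CLP bit, so a cell is dropped
// for priority reasons only while the buffer is filling up.
bool drop_for_priority(bool clp_bit, int output_counter, int preset_level) {
    return clp_bit && threshold_exceeded(output_counter, preset_level);
}
```

For example, with a preset level of 4, a cell with the CLP bit set is still routed while the counter reads 2, and is dropped once the counter reads 5. A cell with the CLP bit cleared is never dropped by this mechanism.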
6.2.2 - Reducing the size of the Interconnection Network
The size of the interconnection network can be reduced to an O(N) complexity level. The
Output Buffers only make use of the aggregate write sync signal and they do not require
the individual signals on separate signal lines to function properly. Therefore, the write
sync signals could be combined together in the input stage itself and then sent over to the
Output Buffers. In this case only N write sync lines would be required for N output ports.
The same principle can be applied to the Cell lines, as the same cell is propagated to every
Output Port. The cell signal does not need to be split into 8 separate identical signals for
each Output Buffer. This would reduce the complexity of the Cell signal lines to O(N) too.
The disadvantage of this method is that the amount of capacitance that has to be driven on
each signal line will be very large requiring large drivers. However, the size of the drivers
might become less of an issue as the number of ports is increased, and the interconnection
network area starts to become significant.
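The saving in signal lines can be made concrete with a short calculation. The sketch below assumes one write sync line and one cell line per input/output pair in the fully interconnected case, and one shared line of each kind per port in the reduced case; the type and function names are hypothetical.

```cpp
#include <cassert>

// Number of signal lines in the interconnection network.
struct LineCount {
    long write_sync_lines;
    long cell_lines;
};

// Fully interconnected fabric: every input drives its own line to
// every output, giving O(N^2) lines of each kind.
LineCount fully_interconnected_lines(long ports) {
    assert(ports > 0);
    return LineCount{ports * ports, ports * ports};
}

// Reduced fabric: write syncs are combined in the input stage and each
// cell line is shared by all Output Buffers, giving O(N) lines.
LineCount reduced_lines(long ports) {
    assert(ports > 0);
    return LineCount{ports, ports};
}
```

For the 8x8 switch this gives 64 write sync lines and 64 cell lines in the fully interconnected case against 8 of each in the reduced case, at the cost of the larger drivers discussed above.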
6.2.3 - Implementing large switch fabrics with multiple chips
The size of an individual single chip HiPer ATM switch can be increased to produce fairly
large switch fabrics. However, a threshold is always reached where scaling the single chip
is not feasible anymore. The switch fabric is, therefore, designed so that scaling of the
switch fabric beyond the realm of a single chip can be implemented easily. This type of
switch fabric is exemplified by the Multiple-Stage Interconnection Networks such as
Batcher Banyan switches (Chapter 2, Section 2.6.2).
The input and output ports of the switch are symmetrical in nature. On the input side cells
come in accompanied by sync signals, and on the output side of the switch, the cells come
out accompanied by sync signals. This allows the output port of a HiPer ATM switch to
be connected directly to the input port of another similar switch to form large switch
fabrics. The implementation of Multiple-Stage Interconnection Networks is therefore
simplified, as a designer does not need to figure out how to interface the switches with
each other using glue logic.
Appendix A - The C++ Model
A.1 - Makefile
SWITCH=-O3
main: main.cc main.h Switch.o Queue.o NetworkManager.o Cell.o
	g++ $(SWITCH) Queue.o NetworkManager.o Switch.o Cell.o main.cc -o main -lg++ -lm
Switch.o: Switch.h Switch.cc Queue.o Cell.o
	g++ $(SWITCH) -c Switch.cc
NetworkManager.o: NetworkManager.h NetworkManager.cc Cell.o
	g++ $(SWITCH) -c NetworkManager.cc
Queue.o: Queue.h Queue.cc Cell.o
	g++ $(SWITCH) -c Queue.cc
Cell.o: Cell.h Cell.cc









// AUTHOR: Rudi Rughoonundon
// PURPOSE: Top level of Object Oriented Model of Hiper ATM Switch
//
//////////////////////////////////////////////////////////////////////
// prevent multiple inclusion of this file
#ifndef
#define









// define a fixed number of ports
#define NUM_OF_PORTS 8
// define the position of each argument in the command line to allow








//////////////////////////////////////////////////////////////////////
//
// NAME: main
// AUTHOR: Rudi Rughoonundon












void create_new_data_file (char **) ;
void update_data_file (char **, Integer *, Integer *, Integer *,
SampleHistogram *, SampleHistogram *) ;
//////////////////////////////////////////////////////////////////////
// Main function
//////////////////////////////////////////////////////////////////////





// Declare and initialize all top level objects
//////////////////////////////////////////////////////////////////////
Integer max_time (0) ; // maximum simulation time
Integer flush_time (0) ; // additional time to flush buffers
// current simulation time
// number of cells sent
// number of cells received
// number of cells lost
// simulation output file
// temporary output file
char name [50] ; // file name
ostrstream file_name (name, 50) ; // file name stream
Cell input_cells [NUM_OF_PORTS] ; // input port to switch
Cell output_cells [NUM_OF_PORTS] ; // output port from switch
SampleHistogram buffer_length (1, atol (argv [FIFO_SIZE]) + 1, 1) ;
SampleHistogram cell_delay (1, atol (argv [FIFO_SIZE]) + 1, 1) ;
NetworkManager network_manager (atoi (argv [ARRIVAL_PROBABILITY]),
atol (argv [RNG_SEED]), input_cells, output_cells, &cells_received,
&cells_sent, &time, &cell_delay) ;
Switch atm_switch (atoi (argv [FIFO_SIZE]), input_cells, output_cells,
&cells_lost, &buffer_length) ;
//////////////////////////////////////////////////////////////////////
// check format of command line
// exit if command line arguments are not given
//////////////////////////////////////////////////////////////////////
if (argc < 6)
{
cout << "Main: Not enough command line arguments\nUsage: main





//////////////////////////////////////////////////////////////////////
// create file name of progress file









//////////////////////////////////////////////////////////////////////
// reset histogram objects to be empty
//////////////////////////////////////////////////////////////////////
buffer_length.reset () ;
cell_delay.reset () ;
//////////////////////////////////////////////////////////////////////
// run for specified time
// send cells to switch, switch cells and receive cells from switch
//////////////////////////////////////////////////////////////////////
max_time = atol (argv [SIMULATION_TIME]) ;
for (time = 0 ; time < max_time ; time++)
{
if ((time % 500000) == 0)
{
output_file.open (name, ios::trunc | ios::out) ;




















output_file.close () ;
}
network_manager.send_cells () ;
atm_switch.switch_cells () ;
network_manager.receive_cells () ;
}
//////////////////////////////////////////////////////////////////////
// switch remaining cells and process them accordingly
//////////////////////////////////////////////////////////////////////
flush_time = max_time + atol (argv [FIFO_SIZE]) ;
for (; time < flush_time ; time++)
{
atm_switch.switch_cells () ;
network_manager.receive_cells () ;
}
//////////////////////////////////////////////////////////////////////
// check whether a data file exists for the output
// if no such file exists create a new one and set it up
// to receive new output
//////////////////////////////////////////////////////////////////////
input_file.open (argv [OUTPUT_FILE_NAME], ios::in) ;




input_file.close () ;
//////////////////////////////////////////////////////////////////////
// update the output file with data collected from current simulation
//////////////////////////////////////////////////////////////////////
update_data_file (argv, &cells_sent, &cells_received, &cells_lost,
&buffer_length, &cell_delay) ;
//////////////////////////////////////////////////////////////////////
// destroy all objects that were created for simulation
// input_cells and output_cells are automatic arrays and are released
// when main returns, so delete [] must not be applied to them
//////////////////////////////////////////////////////////////////////
}
//////////////////////////////////////////////////////////////////////
//
// NAME: update_data_file
// PURPOSE: dump out all the data from the simulation into an output













int length = atoi (argv [FIFO_SIZE]) ;
Integer cumul_length (0) ;
Integer cumul_delay (0) ;
ifstream input_file ;
ofstream output_file ;
char line [256] ;
char command [50] ;
ostrstream command_line (command, 50) ;
output_file.open ("main.tmp", ios::trunc | ios::out) ;
input_file.open (argv [OUTPUT_FILE_NAME], ios::in) ;
if (!output_file)
{










































input_file.getline (line, 255, '\n') ;
output_file << line << ' ' << atoi (argv [ARRIVAL_PROBABILITY]) << '\n' ;
input_file.getline (line, 255, '\n') ;
















for (index = 0 ; index <= length ; index++)
{
input_file.getline (line, 255, '\n') ;
cumul_length























for (index = 0 ; index <= length ; index++)
{
input_file.getline (line, 255, '\n') ;
cumul_delay

























for (index = 0 ; index <= length ; index++)
{
input_file.getline (line, 255, '\n') ;

















for (index = 0 ; index <= length ; index++)
{
input_file.getline (line, 255, '\n') ;
cumul_delay









input_file.close () ;
output_file.close () ;






// line is an automatic array, so no delete [] is required





// PURPOSE: create a new output file with the initial rows and columns








int length = atoi (argv [FIFO_SIZE]) ;
new_data_file.open (argv [OUTPUT_FILE_NAME], ios::out | ios::trunc) ;
if (!new_data_file)
{














































































for (index = 0 ; index












// AUTHOR: Rudi Rughoonundon
// PURPOSE: generates cells according to the given probability of
// arrival and probability distribution and feeds them to the
// switch.
// also receives the cells as they come out of the switch.
//









































// PURPOSE: destroys the objects used for random number generation
//////////////////////////////////////////////////////////////////////




// PURPOSE: produces cells to be fed to the input ports of the ATM
// switch. cells are produced based on a uniform arrival
// distribution and based on the specified arrival
// probability. the destination of each cell is also
// produced at random based on a uniform distribution.
// this method also keeps independent track of the number of
// cells sent and returns the value to the main system.
//
//////////////////////////////////////////////////////////////////////




// PURPOSE: receive all the cells from the output ports, get the
// delay encountered by the cell in the switch and
// update the cell delay histogram. also keep track
// of cells received to confirm data in histogram
//
//////////////////////////////////////////////////////////////////////







// AUTHOR: Rudi Rughoonundon
// PURPOSE: generates cells according to the given probability of
// arrival and probability distribution and feeds them to the
// switch.





//////////////////////////////////////////////////////////////////////
//
// NAME: networkmanager
// PURPOSE: create and initialize the objects used in this simulation.
//
//////////////////////////////////////////////////////////////////////











global_id.Integer (0) ;
gen = new ACG (seed, 98) ;
arrival_rnd = new Uniform [NUM_OF_PORTS] (0, 100, gen) ;
destination_rnd = new Uniform [NUM_OF_PORTS] (0, NUM_OF_PORTS, gen) ;
input_cells = input ;
output_cells = output ;
cells_received = received ;




time = sim_time ;
}
//////////////////////////////////////////////////////////////////////
//
// NAME: ~networkmanager
// PURPOSE: destroys the objects used for random number generation
//
//////////////////////////////////////////////////////////////////////
NetworkManager::~NetworkManager ()
{
delete [] arrival_rnd ;
delete [] destination_rnd ;
delete gen ;
}
//////////////////////////////////////////////////////////////////////
//
// NAME: send_cells
// PURPOSE: produces cells to be fed to the input ports of the ATM
// switch. cells are produced based on a uniform arrival
// distribution and based on the specified arrival
// probability. the destination of each cell is also
// produced at random based on a uniform distribution.
// this method also keeps independent track of the number of
// cells sent and returns the value to the main system.
//
//////////////////////////////////////////////////////////////////////







// produce a cell for each input port independently
// produce a random number and compare it to the specified
// arrival probability to determine cell arrival
// if cell arrived create it and place it on an input else
// place a blank cell on the input
for (port = 0 ; port < NUM_OF_PORTS ; port++)
{
arrival = arrival_rnd [port] () ;
if (arrival < arrival_probability)
{
destination = destination_rnd [port] () ;
dest = static_cast<int> (destination) ;











// PURPOSE: receive all the cells from the output ports, get the
// delay encountered by the cell in the switch and
// update the cell delay histogram. also keep track
// of cells received to confirm data in histogram
//
//////////////////////////////////////////////////////////////////////
void NetworkManager::receive_cells ()
{
int port ;
Integer delay (0) ;
// process each output port separately
for (port = 0 ; port < NUM_OF_PORTS ; port++)
{
if (output_cells [port].is_valid ())
{
output_cells [port].delay (time, &delay) ;
(*cell_delay) += delay.as_double () ;






// AUTHOR: Rudi Rughoonundon
// PURPOSE: implements the routing and queuing functions of the ATM
// switch
//

























// number of cells lost
// buffer length histogram
//////////////////////////////////////////////////////////////////////
//
// NAME: switch
// PURPOSE: constructor initializes all internal objects and creates the
// queues
//
//////////////////////////////////////////////////////////////////////
Switch (int length, Cell *input_cells, Cell *output_cells, Integer
*cell_lost, SampleHistogram *buf_length) ;
//////////////////////////////////////////////////////////////////////
//
// NAME: ~switch







// PURPOSE: switches the arriving cells and places them in the right
// queue.
// removes one cell from each queue that is not empty and
// places them on the output
//
//////////////////////////////////////////////////////////////////////




//////////////////////////////////////////////////////////////////////
//
// NAME: Switch
// AUTHOR: Rudi Rughoonundon
// PURPOSE: implements the routing and queuing functions of
// the ATM switch
//






// PURPOSE: constructor initializes all internal objects and
// creates the queues
//
//////////////////////////////////////////////////////////////////////
Switch::Switch (int length, Cell *input, Cell *output, Integer *lost,
SampleHistogram *buf_length)
{
output_buffers = new Queue [NUM_OF_PORTS] (length) ;
input_cells = input ;
output_cells = output ;
cells_lost = lost ;





// PURPOSE: destructor destroys the queue objects
//
//////////////////////////////////////////////////////////////////////
Switch::~Switch ()
{





// PURPOSE: switches the arriving cells and places them in
// the right queue
// removes one cell from each queue that is not empty
// and places the cells on the output
//////////////////////////////////////////////////////////////////////




// repeat for each port
for (port = 0 ; port < NUM_OF_PORTS ; port++)
{
//////////////////////////////////////////////////////////////////////
// if a cell arrived on this input port get its destination address
// and place it in the right output queue if the queue is not full
//
//////////////////////////////////////////////////////////////////////
if (input_cells [port].is_valid ())
{
dest = input_cells [port].get_destination () ;






output_buffers [dest].enq (input_cells [port]) ;
}
input_cells [port].invalid () ;
}
//////////////////////////////////////////////////////////////////////
// get the size of the buffer before removing cells from it
//
//////////////////////////////////////////////////////////////////////
*buffer_length += static_cast<double> (output_buffers [port].length ()) ;
//////////////////////////////////////////////////////////////////////
// if the output queue is not empty remove one cell and send it to
// the output port
//
//////////////////////////////////////////////////////////////////////
if (output_buffers [port].empty ())
output_cells [port].invalid () ;
else






// AUTHOR: Rudi Rughoonundon
// PURPOSE: Cell defines the class of a single ATM cell. The object
// holds the time at which it was created, a unique id
// to identify it, its destination and whether it is an










Integer arrival_time ; // cell creation time
Integer identity ; // unique ID of cell
int destination ; // destination address of cell
bool valid ; // determines whether cell is valid
public:
//////////////////////////////////////////////////////////////////////
//
// NAME: Cell




//////////////////////////////////////////////////////////////////////
//
// NAME: ~Cell
// PURPOSE: destructor returns all memory allocated to object
//





// PURPOSE: Set the fields of a Cell object with the given values
//
//////////////////////////////////////////////////////////////////////






// PURPOSE: displays the content of a single cell
//
//////////////////////////////////////////////////////////////////////




// PURPOSE: returns the delay encountered by the cell in the switch.
// this method requires the current time to be passed to it.
//
//////////////////////////////////////////////////////////////////////




// PURPOSE: This is an overloaded = operator which will effectively
// copy one cell object to another.
//
//////////////////////////////////////////////////////////////////////




// PURPOSE: returns whether the cell is valid or not
//
//////////////////////////////////////////////////////////////////////




// PURPOSE: Forces a cell to be invalid or unassigned
//
//////////////////////////////////////////////////////////////////////




// PURPOSE: returns the destination address of the cell
//////////////////////////////////////////////////////////////////////




//////////////////////////////////////////////////////////////////////
//
// NAME: cell
// AUTHOR: Rudi Rughoonundon
// PURPOSE: Cell defines the class of a single ATM cell. The object
// holds the time at which it was created, a unique id to
// identify it, its destination












arrival_time.Integer (0) ;
identity.Integer (0) ;
destination = 0 ;
valid = false ;
}
//////////////////////////////////////////////////////////////////////
//
// NAME: ~Cell






//////////////////////////////////////////////////////////////////////
//
// NAME: set
// PURPOSE: Set the fields of a Cell object with the given values
//
//////////////////////////////////////////////////////////////////////






destination = dest ;






// PURPOSE: returns the delay encountered by the cell in the switch.
// this method requires the current time to be passed to it.
//
//////////////////////////////////////////////////////////////////////









// PURPOSE: This is an overloaded = operator which will effectively
// copy one cell object to another.
//
//////////////////////////////////////////////////////////////////////
Cell Cell::operator = (const Cell &cell)
{
arrival_time = cell.arrival_time ;
identity = cell.identity ;
destination = cell.destination ;
valid = cell.valid ;
return *this ;
}
//////////////////////////////////////////////////////////////////////
//
// NAME: is_valid
// PURPOSE: returns whether the cell is valid or not
//
//////////////////////////////////////////////////////////////////////




//////////////////////////////////////////////////////////////////////
//
// NAME: invalid
// PURPOSE: Forces a cell to be invalid or unassigned
//
//////////////////////////////////////////////////////////////////////
void Cell::invalid ()
{






// PURPOSE: returns the destination address of the cell
//
//////////////////////////////////////////////////////////////////////







// PURPOSE: displays the content of a single cell
//
//////////////////////////////////////////////////////////////////////
























//////////////////////////////////////////////////////////////////////
//
// NAME: Queue
// AUTHOR: Rudi Rughoonundon
// PURPOSE: implements a first in first out queue data structure
//










queue ; // pointer to an array of cells
int head ; // head of the queue
int tail ; // tail of the queue
int cur_size ; // current size of the queue





// PURPOSE: constructor initializes all internal objects








// PURPOSE: destructor removes all the space allocated to the queue
//////////////////////////////////////////////////////////////////////




// PURPOSE: adds a new cell to the queue
//
//////////////////////////////////////////////////////////////////////




// PURPOSE: removes a cell from the queue and returns it
//
//////////////////////////////////////////////////////////////////////




// PURPOSE: returns the length of the queue
//
//////////////////////////////////////////////////////////////////////




// PURPOSE: returns whether the queue is full or not
//
//////////////////////////////////////////////////////////////////////




// PURPOSE: returns whether the queue is empty or not
//
//////////////////////////////////////////////////////////////////////
bool empty () ;




// PURPOSE: displays all the cells in the queue
//
//////////////////////////////////////////////////////////////////////







// AUTHOR: Rudi Rughoonundon








// PURPOSE: constructor initializes all internal objects
// and creates the queues
//
//////////////////////////////////////////////////////////////////////
Queue::Queue (int size)
{
int index ;
queue = new Cell [size] ;
head = 0 ;
tail = 0 ;
max_size = size ;
cur_size = 0 ;
// invalidate all the cells by initializing them
for (index = 0 ; index < max_size ; index++)
{






// PURPOSE: destructor removes all the space allocated to the queue
//
//////////////////////////////////////////////////////////////////////
Queue::~Queue ()
{





// PURPOSE: adds a new cell to the queue
//
//////////////////////////////////////////////////////////////////////
void Queue::enq (const Cell &cell)
{
if (cur_size == max_size)
{





queue [head] = cell ;
cur_size++ ;
head = (head +1) % max_size ;
}
//////////////////////////////////////////////////////////////////////
//
// NAME: deq
// PURPOSE: removes a cell from the queue and returns it
//
//////////////////////////////////////////////////////////////////////
void Queue::deq (Cell &cell)
{
if (cur_size == 0)
{





cell = queue [tail] ;
cur_size-- ;
tail = (tail + 1) % max_size ;
}
//////////////////////////////////////////////////////////////////////
//
// NAME: length
// PURPOSE: returns the length of the queue
//
//////////////////////////////////////////////////////////////////////








// PURPOSE: returns whether the queue is full or not
//
//////////////////////////////////////////////////////////////////////
bool Queue::full ()
{





// PURPOSE: returns whether the queue is empty or not
//
//////////////////////////////////////////////////////////////////////
bool Queue::empty ()
{





// PURPOSE: displays all the cells in the queue
//
//////////////////////////////////////////////////////////////////////
void Queue::display ()
{
int index ;
















for (index = 0 ; index < max_size ; index++)
{




Appendix B - VHDL Model Source
B.1 - Switch_Pack.vhdl
-- NAME: Switch Package
-- AUTHOR: Rudi Rughoonundon
-- DATE: November 1 1996
-- PURPOSE: Defines data types, constants and functions
-- that are common to all the entities in the
-- ATM switch model.
-- Package definitions
PACKAGE switch_pack IS
-- Constants
CONSTANT delay : TIME := 100 ps ;
CONSTANT input_cell_buffer_length : INTEGER := 40 ;
CONSTANT input_sync_buffer_length : INTEGER := 40 ;
CONSTANT output_buffer_size : INTEGER := 8 ;
CONSTANT number_of_outputs : INTEGER := 8 ;
-- Subtypes
SUBTYPE nibble_type IS BIT_VECTOR (0 TO 3) ;
SUBTYPE byte_type IS BIT_VECTOR (0 TO 7) ;
SUBTYPE bit_map_type IS BIT_VECTOR (0 TO number_of_outputs-1) ;
SUBTYPE cell_type IS BIT_VECTOR (0 TO 423) ;
SUBTYPE cell_body_type IS BIT_VECTOR (0 TO 383) ;
SUBTYPE cell_header_type IS BIT_VECTOR (0 TO 39) ;
SUBTYPE rw_select_type IS BIT_VECTOR (0 TO output_buffer_size-1) ;
SUBTYPE input_counter_type IS BIT_VECTOR (0 TO 2) ;
SUBTYPE output_counter_type IS BIT_VECTOR (0 TO 3) ;
SUBTYPE cell_buffer_type IS BIT_VECTOR (0 TO input_cell_buffer_length-1) ;
SUBTYPE sync_buffer_type IS BIT_VECTOR (0 TO input_sync_buffer_length-1) ;
SUBTYPE delayed_cell_type IS byte_type ;
SUBTYPE bit_select_type IS cell_type ;
SUBTYPE read_select_type IS rw_select_type ;
SUBTYPE write_select_type IS rw_select_type ;
SUBTYPE ram_type IS cell_type ;
-- Types
TYPE destination_type IS ARRAY (0 TO 7) OF nibble_type ;
TYPE routing_bit_map_type IS ARRAY (0 TO 7) OF bit_map_type ;
TYPE routed_cell_type IS ARRAY (0 TO 7) OF byte_type ;
TYPE routed_sync_type IS ARRAY (0 TO 7) OF byte_type ;
TYPE ascii_file IS FILE OF CHARACTER ;
TYPE write_state_1_type IS (Idle, Ready) ;
TYPE write_state_2_type IS (Idle, Writing) ;
TYPE read_state_type IS (Idle, Reading) ;
CONSTANT unassigned_cell : cell_type := (OTHERS => '0') ;
-- Function declaration
-- NAME: chartobyte
-- PURPOSE: converts a character to a byte
FUNCTION chartobyte (data : CHARACTER) RETURN BIT_VECTOR ;
-- NAME: bytetochar
-- PURPOSE: converts a byte to a character
FUNCTION bytetochar (byte : byte_type) RETURN CHARACTER ;
-- NAME: inttobitvector
-- PURPOSE: converts an integer to a bit vector
FUNCTION inttobitvector (data : INTEGER ;
size : INTEGER) RETURN BIT_VECTOR ;
-- NAME: wired_and
-- PURPOSE: implements a wired and resolution function
FUNCTION wired_and (drivers : BIT_VECTOR) RETURN BIT ;
FUNCTION wired_or (drivers : BIT_VECTOR) RETURN BIT ;
END switch_pack ;
-- Package body
PACKAGE BODY switch_pack IS
-- NAME: bytetochar
-- PURPOSE: converts a byte to a character
FUNCTION bytetochar (byte : byte_type) RETURN CHARACTER IS
VARIABLE value : INTEGER := 0 ; -- value of character to be returned
BEGIN
FOR index IN byte'RANGE LOOP
IF byte (index) = '1' THEN
-- readjust for ordering of bits in the byte
-- and sum up the values of the bits that are
-- set
value := value + (2 ** (abs (index - 7))) ;
END IF ;
END LOOP ;
-- return character corresponding to that integer
RETURN CHARACTER'VAL (value) ;
END bytetochar ;
-- NAME: chartobyte
-- PURPOSE: converts a character to a byte
FUNCTION chartobyte (data : CHARACTER) RETURN BIT_VECTOR IS
VARIABLE byte : BIT_VECTOR (0 TO 7) ; -- byte to be returned
VARIABLE ascii_data : INTEGER := 0 ; -- ascii value of character
VARIABLE value : INTEGER := 0 ; -- dummy variable
BEGIN
-- get the integer value of the character
ascii_data := CHARACTER'POS (data) ; -- convert character to integer
-- generate the vector representing that integer
-- by repeatedly dividing the integer by 2 and using the
-- remainder to determine the value of the bit
FOR index IN byte'RANGE LOOP
value := ascii_data rem 2 ;
ascii_data := ascii_data / 2 ;













-- PURPOSE: converts an integer to a bit vector
FUNCTION inttobitvector (data : INTEGER ;
size : INTEGER) RETURN BIT_VECTOR IS
VARIABLE vector : BIT_VECTOR (0 TO size-1) ; -- byte to be returned
VARIABLE value : INTEGER := 0 ; -- dummy variable
VARIABLE new_data : INTEGER := 0 ; -- copy of data
BEGIN
-- copy original data
new_data := data ;
-- generate the vector representing that integer
-- by repeatedly dividing the integer by 2 and using the
-- remainder to determine the value of the bit
FOR index IN vector'RANGE LOOP
value := new_data rem 2 ;
new_data := new_data / 2 ;
















-- PURPOSE: implements a wired and resolution function
FUNCTION wired_and (drivers : BIT_VECTOR) RETURN BIT IS
BEGIN
-- check each signal driver and if one of them is 0 return
-- a 0
FOR index IN drivers'RANGE LOOP













-- PURPOSE: implements a wired or resolution function
FUNCTION wired_or (drivers : BIT_VECTOR) RETURN BIT IS
BEGIN
-- check each signal driver and if one of them is 1 return
-- a 1
FOR index IN drivers'RANGE LOOP

















-- DATE: November 1 1996
-- PURPOSE: This is the environment within which the switch
-- model is simulated
LIBRARY switch_pack ;

















































PORT (serial_in : IN byte_type ;
sync_in : IN byte_type ;
serial_out : OUT byte_type ;
sync_out : OUT byte_type ;
clk : IN BIT ;
reset : IN BIT) ;
END COMPONENT ;
BEGIN
-- Produce input signals using input port entities
-- Three inputs to one output . . .
-- Ports 0, 6 and 7 to port 0
ip0 : input_port GENERIC MAP (file_name => "in0", destination_address
=> "0001") PORT MAP (serial_in (0), sync_in (0), clk, reset) ;
ip6 : input_port GENERIC MAP (file_name => "in6", destination_address
=> "0001") PORT MAP (serial_in (6), sync_in (6), clk, reset) ;
ip7 : input_port GENERIC MAP (file_name => "in7", destination_address
=> "0001") PORT MAP (serial_in (7), sync_in (7), clk, reset) ;
-- One input to one output . . .
-- Port 1 to port 1
ip1 : input_port GENERIC MAP (file_name => "in1", destination_address
=> "0010") PORT MAP (serial_in (1), sync_in (1), clk, reset) ;
-- One input to two outputs . . .
-- Port 5 to ports 2 and 5
ip5 : input_port GENERIC MAP (file_name => "in5", destination_address
=> "1010") PORT MAP (serial_in (5), sync_in (5), clk, reset) ;
-- Idle inputs . . .
-- Ports 2, 3 and 4 are driven from the file "empty" with destination "0000"
ip2 : input_port GENERIC MAP (file_name => "empty",
destination_address => "0000") PORT MAP (serial_in (2), sync_in (2),
clk, reset) ;
ip3 : input_port GENERIC MAP (file_name => "empty",
destination_address => "0000") PORT MAP (serial_in (3), sync_in (3),
clk, reset) ;
ip4 : input_port GENERIC MAP (file_name => "empty",
destination_address => "0000") PORT MAP (serial_in (4), sync_in (4),
clk, reset) ;
-- capture output signals using output port entities
-- Output from port 0
-- mixture of input ports 0, 6 and 7
op0 : output_port GENERIC MAP (file_name => "out0") PORT MAP
(serial_out (0), sync_out (0), clk, reset) ;
-- Output from port 1
-- input port 1
op1 : output_port GENERIC MAP (file_name => "out1") PORT MAP
(serial_out (1), sync_out (1), clk, reset) ;
-- Output from port 2
-- input port 5
op2 : output_port GENERIC MAP (file_name => "out2") PORT MAP
(serial_out (2), sync_out (2), clk, reset) ;
-- Output from port 5
-- input port 5
op5 : output_port GENERIC MAP (file_name => "out5") PORT MAP
(serial_out (5), sync_out (5), clk, reset) ;
-- the switch
sw : switch PORT MAP (serial_in, sync_in, serial_out, sync_out, clk,
reset) ;
-- produce a clock signal with period 5 ns and freq 200 MHz
clk <= NOT clk AFTER 2500 ps ;








-- NAME: Input Port Driver
-- AUTHOR: Rudi Rughoonundon
-- DATE: November 1 1996
-- PURPOSE: Implements a device that reads data from an
-- ascii file and drives the input ports of the





USE switch_pack.switch_pack.ALL ;
-- Entity declaration
ENTITY input_port IS
GENERIC (file_name : STRING ; -- name of input file
-- destination of cells from this input port













END input_port ;
-- Architecture declaration
ARCHITECTURE behavior OF input_port IS
-- Header of cell generated after the destination of the
-- cells is known
















VARIABLE line_buffer : LINE ;
VARIABLE next_line : LINE ;
VARIABLE data_count : INTEGER := 0 ;
VARIABLE cell : cell_type ;
VARIABLE data : CHARACTER ;
VARIABLE true : BOOLEAN := TRUE ;
FILE input_file : text IS IN file_name ;
BEGIN
-- Wait until reset is deasserted
WAIT UNTIL reset = '0' ;
-- Append cell header to actual cell
cell (cell_header'RANGE) := cell_header (cell_header'RANGE) ;
data_count := 5 ;
WHILE (NOT ENDFILE (input_file)) LOOP
-- Read one line at a time and append an End of Line to the
-- input
READLINE (input_file, line_buffer) ;
WRITE (line_buffer, LF) ;
-- fill up cells with content of line until line is empty
-- or cell is full
WHILE (line_buffer'LENGTH > 0) LOOP
WHILE ((data_count < 53) AND (line_buffer'LENGTH > 0)) LOOP
-- If line is empty but cell is not full get another line
IF (line_buffer'LENGTH > 0) THEN
READ (line_buffer, data) ;
cell ((data_count*8) TO ((data_count*8)+7)) := chartobyte (data) ;




-- Finish sending out the last cell and pad it with 0s
WHILE (data_count < 53) AND (ENDFILE (input_file)) LOOP
cell ((data_count*8) TO ((data_count*8)+7)) := "00000000" ;
data_count := data_count + 1 ;
END LOOP ;
-- Send out the current cell one bit at a time and also
-- produce the sync signal
IF (data_count = 53) THEN
-- Reset the cell to empty so that it is filled up again
-- next time around the loop
data_count := 5 ;
FOR bit_count IN cell'RANGE LOOP
-- Sync signal is asserted for first bit only













-- send out one bit from cell after every clock cycle
serial_out <= cell (bit_count) ;
WAIT UNTIL
clk'








-- Send out idle/unassigned cells as all data from file
-- has been transmitted
cell (cell'RANGE) := (OTHERS => '0') ;
WHILE (true) LOOP
FOR bit_count IN cell'RANGE LOOP
-- Sync signal is asserted for first bit only













-- send out one bit from cell after every clock cycle
serial_out <= cell (bit_count) ;








-- NAME: Output Port Driver
-- AUTHOR: Rudi Rughoonundon
-- DATE: November 1 1996
-- PURPOSE: This device reads cells coming out of the output
-- ports of the ATM switch and converts the data
-- to ascii text and stores it in a file
LIBRARY std ;
USE std.textio.ALL ;
LIBRARY switch_pack ;









IN BIT ; -- cell in
IN BIT ; -- sync in
IN BIT ; -- system clock
IN BIT) ; -- system reset
END output_port ;
-- Architecture declaration
ARCHITECTURE behavior OF output_port IS
SIGNAL receive : BOOLEAN := False ; -- start receiving flag
BEGIN







_ 1 1 i1') ) ;
receive <= True ;
WAIT ;
END PROCESS ;
-- once the first sync has been detected this process does










FILE output_file : TEXT IS OUT file_name ;
BEGIN
-- start filling up a cell when the sync signal is asserted
-- fill one bit from the port on every clock cycle
FOR index IN cell'RANGE LOOP
WAIT UNTIL (receive AND clk'EVENT AND (clk = '1')) ;
cell (index) := serial_in ;
END LOOP ;
-- when an entire cell has been received translate the data
-- bytes into characters and write them out to a file
-- skip over header of cell as it does not hold any data
L1 : FOR index IN 5 TO 52 LOOP
-- convert one byte at a time
byte := cell ((index * 8) TO ((index * 8) + 7)) ;
data := bytetochar (byte) ;
-- if byte is 0 then this is cell padding and transmission
-- must be over
IF (byte = "00000000") THEN
EXIT L1 ;
-- if data is line feed then write out all the data
-- received since the last line feed to a file
ELSIF ((data = LF) AND (line_buffer'LENGTH > 0)) THEN
WRITELINE (output_file, line_buffer) ;
-- else keep filling up a line data structure with the bytes
ELSE






B.5 - Input_Cell Buffer.vhdl
-- NAME: Input Cell Buffer
-- AUTHOR: Rudi Rughoonundon
-- DATE: November 1 1996
-- PURPOSE: Serial register to hold input cell
LIBRARY switch_pack ;
USE switch_pack.switch_pack.ALL ;
-- Entity declaration
ENTITY input_cell_buffer IS






IN BIT ; -- system clock
IN BIT ; -- cell in
OUT BIT ; -- cell out
OUT nibble_type ; -- destination of cell
IN BIT) ; -- system reset
END input_cell_buffer ;
-- Architecture declaration
ARCHITECTURE behavior OF input_cell_buffer IS
BEGIN
PROCESS (clk, reset)
VARIABLE rgstr : cell_buffer_type ; -- register
BEGIN





:= (OTHERS => '0') ;
-- On new synchronous event shift register left by 1 bit and
-- input one new bit into least significant bit of register
ELSIF (clk'EVENT AND (clk = trigger)) THEN
rgstr (rgstr'RANGE) := cell_in & rgstr (0 TO rgstr'HIGH-1) ;
END IF ;
-- drive new output from register
cell_out <= rgstr (rgstr'HIGH) AFTER delay ;
-- drive new destination field from register







-- NAME: Input Sync Buffer
-- AUTHOR: Rudi Rughoonundon
-- DATE: November 1 1996
-- PURPOSE: Serial register to hold input sync signal
LIBRARY switch_pack ;
USE switch_pack.switch_pack.ALL ;
-- Entity declaration
ENTITY input_sync_buffer IS
GENERIC (port_number : INTEGER ;

























VARIABLE rgstr : sync_buffer_type ; -- register
BEGIN




rgstr := (OTHERS => '0') ;
-- On new synchronous event shift register left by 1 bit and
-- input one new bit into least significant bit of register
ELSIF clk'EVENT AND (clk = trigger) THEN
rgstr (rgstr'RANGE) := sync_in & rgstr (0 TO rgstr'HIGH-1) ;
END IF ;
-- produce lookup sync signal
lookup_sync_out <= rgstr (27) AFTER delay ;
-- produce write sync signal and implement write sync
-- staggering based on input port






-- DATE: November 1 1996
-- PURPOSE: The switching table is used to look up a
-- bit pattern corresponding to an address
-- in a cell, that bit pattern is latched






PORT (clk : IN BIT ; -- system clock
lookup : IN BIT ; -- lookup sync
destination : IN nibble_type ; -- cell address
routing_bit_map : OUT byte_type ; -- bit pattern
reset : IN BIT) ; -- system reset
END routing_table ;
-- Architecture Declaration
ARCHITECTURE behavior OF routing_table IS
SIGNAL bit_map : bit_map_type ;




-- Shorten the length of the lookup sync pulse
local_clk <= clk AND lookup ;












-- on clock signal latch the switching bit pattern
ELSIF (local_clk'EVENT AND (local_clk = '1')) THEN
routing_bit_map



























































<= "11111111" AFTER delay r
WHEN OTHERS => bit_map







-- DATE: November 1 1996
-- PURPOSE: Switches the incoming signal to a number

















ARCHITECTURE behavior OF routing_tree IS
BEGIN
-- for each branch if corresponding bit is 1 switch
-- the signal through branch else pull branch low
tree_branches : FOR index IN signal_out'RANGE GENERATE









-- DATE: November 1 1996
-- PURPOSE: This is a fancy name to represent the
-- interconnection network from the switching tree









IN routed_cell_type ; -- cells in
IN routed_sync_type ; -- write syncs in
OUT routed_cell_type ; -- cells out
OUT routed_sync_type) ; -- write syncs out
END cell_aligner ;
-- Architecture declaration
ARCHITECTURE behavior OF cell_aligner IS
BEGIN
-- rearrange the incoming bits into a different order to
-- produce the outgoing bits
PROCESS (cell_in, write_sync_in)
BEGIN
FOR port_index IN cell_in'RANGE LOOP
FOR cell_index IN cell_in'RANGE LOOP
cell_out (port_index) (cell_index) <= cell_in
(cell_index) (port_index) ;







-- NAME: Bit select register
-- AUTHOR: Rudi Rughoonundon
-- DATE: November 1 1996
-- PURPOSE: Circular shift register to select one bit at
-- a time in output buffers.
LIBRARY switch_pack ;
USE switch_pack.switch_pack.ALL ;
-- Entity declaration
ENTITY bit_select_register IS
GENERIC (trigger : BIT := '0' ; -- trigger level




IN BIT ; -- system clock
OUT bit_select_type ; -- register
IN BIT) ; -- system reset
END bit_select_register ;
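The bit select, write select and read select registers in this appendix are all circular shift registers that hold a single 1 and rotate it by one position per clock. The following is a minimal sketch of that behavior with a self-check, assuming a 4-bit register, a rising-edge clock and a reset that sets bit 0. The entity name ring_demo and the signal name ring are hypothetical, and the width is chosen only to keep the example short.

```vhdl
-- Illustrative sketch only; ring_demo and its signal names are hypothetical.
ENTITY ring_demo IS
END ring_demo ;

ARCHITECTURE behavior OF ring_demo IS
  SIGNAL clk   : BIT := '0' ;
  SIGNAL reset : BIT := '1' ;
  SIGNAL ring  : BIT_VECTOR (0 TO 3) ;
BEGIN
  -- one-hot circular shift register with asynchronous reset
  shift : PROCESS (clk, reset)
    VARIABLE rgstr : BIT_VECTOR (0 TO 3) ;
  BEGIN
    IF reset = '1' THEN
      -- reset sets exactly one bit so that a single 1 circulates
      rgstr := (0 => '1', OTHERS => '0') ;
    ELSIF clk'EVENT AND clk = '1' THEN
      -- last bit wraps around into position 0
      rgstr := rgstr (rgstr'HIGH) & rgstr (0 TO rgstr'HIGH - 1) ;
    END IF ;
    ring <= rgstr ;
  END PROCESS ;

  -- finite stimulus: release reset, then apply four clock cycles
  stimulus : PROCESS
  BEGIN
    WAIT FOR 2 ns ;
    reset <= '0' ;
    WAIT FOR 2 ns ;
    ASSERT ring = "1000" REPORT "reset should select bit 0" SEVERITY ERROR ;
    clk <= '1' ; WAIT FOR 5 ns ;
    ASSERT ring = "0100" REPORT "first edge should select bit 1" SEVERITY ERROR ;
    clk <= '0' ; WAIT FOR 5 ns ;
    FOR cycle IN 1 TO 3 LOOP
      clk <= '1' ; WAIT FOR 5 ns ;
      clk <= '0' ; WAIT FOR 5 ns ;
    END LOOP ;
    ASSERT ring = "1000" REPORT "four edges should wrap back to bit 0" SEVERITY ERROR ;
    WAIT ;
  END PROCESS ;
END behavior ;
```

Because exactly one bit is set at any time, the register output can be used directly as a select line per bit or per cell without a decoder.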
-- Architecture declaration




VARIABLE rgstr : bit_select_type ; -- register
BEGIN
-- asynchronous reset of register
-- Note that this reset sets one bit in the register and
-- clears all the other bits. This is done to maintain




rgstr (rgstr'RANGE) := (OTHERS => '0') ;
rgstr (reset_bit) := '1' ;
-- On new synchronous event shift register left by 1 bit and
-- input most significant bit back into least significant
-- bit
ELSIF (clk'EVENT AND (clk = trigger)) THEN
rgstr (rgstr'RANGE) := rgstr (rgstr'HIGH) & rgstr (0 TO
rgstr'HIGH-1) ;
END IF ;
-- drive actual register with content of local register




-- NAME: Write Select Register
-- AUTHOR: Rudi Rughoonundon
-- DATE: November 1 1996
-- PURPOSE: Circular shift register to select next cell
-- which will be written to.
LIBRARY switch_pack ;









IN BIT ; -- system clock
OUT write_select_type ; -- register
IN BIT) ; -- system reset
END write_select_register ;
-- Architecture declaration
ARCHITECTURE behavior OF write_select_register IS
BEGIN
PROCESS (clk, reset)
VARIABLE rgstr : write_select_type ;
BEGIN
-- asynchronous reset forces one bit to 1 and other bits to 0
IF (reset = '1') THEN
rgstr (rgstr'RANGE) := (0 => '1', OTHERS => '0') ;
-- On new synchronous event shift register left by 1 bit and
-- input most significant bit back into least significant
-- bit
ELSIF (clk'EVENT AND (clk = trigger)) THEN
rgstr (rgstr'RANGE) := rgstr (rgstr'HIGH) & rgstr (0 TO
rgstr'HIGH-1) ;
END IF ;
-- drive actual register with value of internal register
write_select





-- NAME: Read Select Register
-- AUTHOR: Rudi Rughoonundon
-- DATE: November 1 1996
-- PURPOSE: Circular serial register to select cell









IN BIT ; -- system clock
OUT read_select_type ; -- register
IN BIT) ; -- system reset
END read_select_register ;
-- Architecture Declaration
ARCHITECTURE behavior OF read_select_register IS
BEGIN
PROCESS (clk, reset)
VARIABLE rgstr : read_select_type ; -- register
BEGIN




rgstr (rgstr'RANGE) := (0 => '1', OTHERS => '0') ;
-- On new synchronous event shift register left by 1 bit and
-- input most significant bit back into least significant
-- bit
ELSIF (clk'EVENT AND (clk = trigger)) THEN
rgstr (rgstr'RANGE) := rgstr (rgstr'HIGH) & rgstr (0 TO
rgstr'HIGH-1) ;
END IF ;
-- drive actual register with value of internal register






-- DATE: November 1 1996
-- PURPOSE: Generic counter entity to implement any size
-- of counter
LIBRARY switch_pack ;
USE switch_pack.switch_pack.ALL ;
-- Entity declaration
ENTITY counter IS
GENERIC (size : INTEGER ; -- size of counter in bits






IN BIT ; -- system clock
IN BIT ; -- system reset
OUT BIT_VECTOR ; -- counter value
IN BIT) ; -- count direction
-- Architecture declaration
ARCHITECTURE behavior OF counter IS
BEGIN
PROCESS (reset, clk)
VARIABLE value : INTEGER := 0 ; -- counter value
BEGIN




value := 0 ;
-- Count up if up signal is asserted
ELSIF (clk'EVENT AND (clk = trigger) AND (up = '1')) THEN
value := (value + 1) mod (2 ** size) ;
-- Count down if up signal is deasserted
ELSIF (clk'EVENT AND (clk = trigger) AND (up = '0')) THEN




-- drive the actual counter from the internal counter









-- DATE: November 1 1996
-- PURPOSE: Holds one cell in output buffer
LIBRARY switch_pack ;
USE switch_pack.switch_pack.ALL ;
-- Entity declaration
ENTITY cell IS





















OUT wired_or BIT ;































-- produce the clock for the read state machine
-- and the write state machine
bit_select_low <= bit_select (bit_select'LOW) AFTER delay ;
-- produce a clock signal to latch a value into the
-- input port latch








input_port_latch <= (OTHERS => '0') ;
-- on clock the register latches the value of the input
-- counter
ELSIF (input_port_latch_clk'EVENT AND (input_port_latch_clk = '1'))
THEN
input_port_latch <= counter ;
END IF ;
END PROCESS ;
-- This process is the state machine that sets the read latch
-- This state machine uses bit_select (0) as a clock
PROCESS (bit_select_low, reset)
BEGIN




read_state <= Idle ;
ELSIF (bit_select_low'EVENT AND (bit_select_low = '1')) THEN
CASE read_state IS
-- When system is idle goto Reading when buffer is not empty
-- and this cell is selected to be read from
WHEN Idle =>
IF ( (read_select





-- When reading go back to idle state when a complete cell









-- This process is the first state machine that determines
-- whether the cell will be written to
PROCESS (clk, reset)
BEGIN




write_state_1 <= Idle ;
ELSIF (clk'EVENT AND (clk = '1')) THEN
CASE write_state_1 IS
-- prepare the second write state machine to go into the
-- writing state when the buffer is not full and this cell
-- is selected to be written to and the write sync is
-- asserted
WHEN Idle =>
IF ((buffer_full = '0') AND (write_select = '1') AND
(write_sync = '1')) THEN
write_state_1 <= Ready ;
END IF ;











-- This process is the state machine that allows data write








write_state_2 <= Idle ;
ELSIF (bit_select_low'EVENT AND (bit_select_low = '1')) THEN
CASE write_state_2 IS
-- when the first state machine is ready this state
-- machine starts writing when the first clock hits
WHEN Idle =>
IF (write_state_1 = Ready) THEN
write_state_2 <= Writing ;
END IF ;







-- this models a multiplexer which drives the data in line
-- into the cell buffer based on the value latched in the
-- input_port_latch
data_in <= cell_in (0) WHEN (input_port_latch = "000") ELSE
cell_in (1) WHEN (input_port_latch = "001") ELSE
cell_in (2) WHEN (input_port_latch = "010") ELSE
cell_in (3) WHEN (input_port_latch = "011") ELSE
cell_in (4) WHEN (input_port_latch = "100") ELSE
cell_in (5) WHEN (input_port_latch = "101") ELSE
cell_in (6) WHEN (input_port_latch = "110") ELSE
cell_in (7) ;
-- this is the actual cell buffer where the data is stored
PROCESS
VARIABLE ram : ram_type ; -- ram memory
BEGIN
-- fancy way of waiting for the next bit select to be
-- asserted before reading from or writing to that




WAIT UNTIL (bit_select (index) = '1') AND clk'EVENT AND (clk =
'0') ;
-- produce the sync out signal when bit_select (0) is
-- asserted
sync_out <= bit_select (bit_select'LOW) ;
-- write data if write latch is asserted
IF (write_state_2 = Writing) THEN
ram (index) := data_in ;
END IF ;
-- read data if the read_latch is asserted
















-- NAME: Output Cell Buffer
-- AUTHOR: Rudi Rughoonundon
-- DATE: November 1 1996
-- PURPOSE: Implements the output buffer for one output port
-- as a fifo data structure
LIBRARY switch_pack ;
USE switch_pack.switch_pack.ALL ;
-- Entity declaration









END output_buffer ;
IN byte_type ;
IN byte_type ;








































































OUT wired_or BIT ;





































-- assert buffer_empty signal if counter is all 0s
buffer_empty <= NOT wired_or (output_counter) AFTER delay ;
-- assert buffer_full signal if counter is all 1s
buffer_full <= output_counter (output_counter'LOW) WHEN (clk'EVENT
AND (clk = '1')) ELSE
buffer_full AFTER delay ;
-- produce the write_select_clk signal to clock the
-- write_select register
write_sync <= wired_or (write_sync_in) AFTER delay ;
write_select_clk <= wired_or (write_sync_in) AND (NOT buffer_full)
AND (NOT clk) AFTER delay ;
-- produce the read_select_clk signal to clock the
-- read_select register





read_select_clk <= bit_select_low AND (NOT buffer_empty) AND (NOT
clk) AFTER delay ;
END PROCESS ;
-- The write select register
wsr : write_select_register GENERIC MAP (trigger => '0') PORT MAP
(write_select_clk, write_select, reset) ;
-- The read select register
rsr : read_select_register GENERIC MAP (trigger => '0') PORT MAP
(read_select_clk, read_select, reset) ;
-- Generate each cell buffer




cb : cell PORT MAP (cell_in, input_counter, buffer_full,
buffer_empty, read_select (index) , write_select (index) , write_sync,
bit_select, cell_out, sync_out, clk, reset) ;
END GENERATE buf ;
-- generate a clock for the output counter
output_counter_clk <= (write_select_clk OR read_select_clk) AFTER
delay ;
-- change count direction so that count up during write















oc : counter GENERIC MAP (size => output_counter'LENGTH, trigger =>






-- DATE: November 1 1996
-- PURPOSE: Top level of Switch model
LIBRARY switch_pack ;









IN byte_type ; -- cells in
IN byte_type ; -- cell syncs in
OUT byte_type ; -- cells out
OUT byte_type ; -- cell syncs out
IN BIT ; -- system clock




































GENERIC (size : INTEGER ;











GENERIC (trigger : BIT := '0' ;




















reset : IN BIT)
END COMPONENT ;


























































OUT wired_or BIT ;






OUT routed_sync_type)
-- Instantiate 8 of each component to implement 8 data
-- paths
sw : FOR index IN 0 TO 7 GENERATE
-- Input Cell Buffer
icb : input_cell_buffer GENERIC MAP (trigger => '1') PORT MAP (clk,
serial_in (index), delayed_cell (index), destination (index), reset) ;
-- Input Sync Buffer
isb : input_sync_buffer GENERIC MAP (trigger => '1', port_number =>
index) PORT MAP (clk, sync_in (index), lookup_sync (index), write_sync
(index), reset) ;
-- Routing Table
rt : routing_table PORT MAP (clk, lookup_sync (index), destination
(index), routing_bit_map (index), reset) ;
-- Cell Routing Tree
crt : routing_tree PORT MAP (delayed_cell (index), routed_cell
(index), routing_bit_map (index)) ;
-- crt : routing_tree PORT MAP (delayed_cell (index), routed_cell
-- (index), cell_bit_map) ;
-- Write Sync Routing Tree
wsrt : routing_tree PORT MAP (write_sync (index), routed_write_sync
(index), routing_bit_map (index)) ;
-- Output buffer
ob : output_buffer PORT MAP (aligned_cell (index),
aligned_write_sync (index), bit_select, input_counter, serial_out
(index), sync_out (index), delayed_clk, reset) ;
END GENERATE sw ;
-- Instantiate 1 of each of these components
-- Fully interconnected network
ca : cell_aligner PORT MAP (routed_cell, routed_write_sync,
aligned_cell, aligned_write_sync) ;





ic : counter GENERIC MAP (size => input_counter'LENGTH, trigger =>
'1') PORT MAP (clk, reset, input_counter, count_up) ;
-- Bit select register
bsr : bit_select_register GENERIC MAP (trigger => '1', reset_bit =>
383) PORT MAP (clk, bit_select, reset) ;
-- Delay the clock to prevent data and clock skew after
-- cells go through the interconnection network
delayed_clk <= clk AFTER delay ;
END behavior ;
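The cell_aligner instantiated as ca in the top level performs a pure transposition: bit j of the routed vector from input port i becomes bit i of the vector delivered to output port j, which is what makes the network fully interconnected. The following is a minimal sketch of that transposition with a self-check. The package aligner_demo_pack, the entity aligner_demo and the type definitions in it are assumptions made so that the example stands alone; the actual byte_type and routed_cell_type are defined in switch_pack and may differ.

```vhdl
-- Illustrative sketch only; the types below are assumed stand-ins for
-- the switch_pack definitions, and aligner_demo is a hypothetical name.
PACKAGE aligner_demo_pack IS
  SUBTYPE byte_type IS BIT_VECTOR (0 TO 7) ;
  TYPE routed_cell_type IS ARRAY (0 TO 7) OF byte_type ;
END aligner_demo_pack ;

USE work.aligner_demo_pack.ALL ;

ENTITY aligner_demo IS
END aligner_demo ;

ARCHITECTURE behavior OF aligner_demo IS
  SIGNAL cell_in  : routed_cell_type := (OTHERS => (OTHERS => '0')) ;
  SIGNAL cell_out : routed_cell_type ;
BEGIN
  -- transpose the 8x8 bit matrix: rows are input ports, columns are
  -- output ports on the way in, and the reverse on the way out
  transpose : PROCESS (cell_in)
  BEGIN
    FOR port_index IN cell_in'RANGE LOOP
      FOR cell_index IN cell_in'RANGE LOOP
        cell_out (port_index) (cell_index) <= cell_in (cell_index) (port_index) ;
      END LOOP ;
    END LOOP ;
  END PROCESS ;

  check : PROCESS
  BEGIN
    -- input port 2 routes a 1 towards output port 5
    cell_in (2) (5) <= '1' ;
    WAIT FOR 1 ns ;
    ASSERT cell_out (5) (2) = '1'
      REPORT "output port 5 should see the bit from input port 2" SEVERITY ERROR ;
    ASSERT cell_out (2) (5) = '0'
      REPORT "untouched position should stay low" SEVERITY ERROR ;
    WAIT ;
  END PROCESS ;
END behavior ;
```

Each output buffer therefore receives one line from every input port, and the input_port_latch multiplexer in the cell entity selects which of those eight lines is written.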






aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ































[10] A Huang and R. RogeNMOSer, "Speed Optimization of Edge Triggered CMOS
Circuits For Gigahertz Single Phase
Clocks,"
in IEEE Journal of Solid-State
Circuits, Vol. 31, No. 3, March 1996
[11] M. J. Karol, M. G. Kuchyj and S. P. Morgan, "Input Versus Output Queueing On
A Space Division Packet
Switch,"
in IEEE Transactions on Communications, Vol.
COM-35, No. 12, pp. 1347-1356, December 1987
[12] T. Koinuma and N. Miyaho, "ATM in B-ISDN Communication Systems and VLSI
Realization,"
in IEEE Journal ofSolid-State Circuits, VOL 30, No. 4, April 1995
[13] W S. Marcus, "A CMOS Batcher and Banyan Chip Set for B-ISDN Packet
Switching,"
in IEEE Journal of Solid-State Circuits, Vol. 25, No. 6, December
1990
[14] S. Nojima et al., "Integrated services packet network using bus matrix
switch,"
in
IEEE Journal of Selected Areas in Communication, Vol. SAC-5, pp. 1284-1292,
October 1987
[15] S. K. Rao and M. Hatamian, "The ATM Physical
Layer,"
in Computer
Communication Review, pp. 73-81,
[16] H. J. Shin, D. A. Hodges, "A 250 Mbps CMOS Crosspoint
Switch,"
in IEEE
Journal ofSolid-State Circuits, Vol. 24, No. 2, pp. 478-486, April 1989
[17] R. J. Simcoe and T. Pei, "Perspectives On ATM Switch Architecture And The
Influence of Traffic Pattern Assumptions on Switch
Design,"
in ACM SIGCOMM
Computer Communication Review, pp. 93-105, April 1995
[18] K. Siu and Raj Jain, A Brief Overview ofATM: Protocol Layers, "LAN Emulation
and Traffic
Management,"
in ACM SIGCOMM Computer Communication Review,
pp. 6-20, April 1995
205
[19] A. Stevens, Teach YourselfC++. MIS Press, New York, New York, 1995
[20] H. Suzuki, H. Nagano, T. Suzuki, T. Takeuchi and S. Iwasaki, "Output-Buffer
Switch Architecture For Asynchronous Transfer
Mode,"
in IEEE ICC '89, 4.1,
pp.99-103, 1989
[21] R Vickers and M. Wernik, "The Role Of SDH and ATM In Broadband Access
Networks,"
in Globecom'91, 8A.1, pp. 212-216, 1991
[22] N. H. E. Weste and K. Eshraghian, Principles Of CMOS VLSI Design.
Addison-
Wesley Pubhshing Company, ReadingMassachusetts, 1 993
[23] Y. S. Yen, M. G. Hluchyj and A. S. Acampora, "The Knockout Switch: A Simple
Modular Architecture For High Performance Packet
Switching,"
in IEEE Journal of
SelectedAreas in Communication, Vol Sac-5, pp. 1274-1283, October 1987
[24] J. Yuan and C. Svensson, "High Speed CMOS Circuit
Technique,"
in IEEE Journal
ofSolid-State Circuits, Vol. 24, No. 1, pp. 62-70, February 1989
206
