Modeling, analysis and design of the input controller for ATM switches by Wu, Dongmei




Modeling, Analysis and Design of the
Input Controller for ATM Switches
By
© Dongmei Wu, B.Sc.
A thesis submitted to the
School of Graduate Studies
in partial fulfillment of the
requirements for the degree of
Master of Engineering
Faculty of Engineering and Applied Science
Memorial University of Newfoundland
August 2001
St. John's Newfoundland Canada
Abstract
In broadband communication networks, commonly used traffic rates are of the
order of gigabits per second, or even terabits per second. The nodes of the networks, also
known as switches or routers, are among the primary technology barriers that hinder the
deployment of high-speed networks, while modern optical fibre technology allows the
transmission media to meet the application requirements. Within the switch itself, routing
table lookup is the worst bottleneck. Among the proposed Multistage Interconnection
Network (MIN) architectures for ATM (Asynchronous Transfer Mode) switch fabrics, the
Balanced Gamma (BG) network has been shown to be reliable, fault-tolerant, efficient,
scalable and superior in performance when compared with other MINs of similar
hardware complexity. In this thesis, we provide the modeling, analysis and design of the
input controller (IC) for ATM switches using the BG network.
The IC temporarily stores the incoming cells in input buffers, performs routing
table lookup, and forwards them to the switch fabric that delivers cells to outgoing lines.
A cache-based IC architecture improves the efficiency by locally storing the frequently
used forwarding information. We realize this by high-speed cache attempts
followed by slower routing table lookups, if necessary.
We have developed a simulator to evaluate different schemes to construct the IC.
The simulator can generate traffic following the uniform random traffic
(URT) and bursty traffic models. Simulation results show that the IC system works well
and that system performance can be improved, as cache hits occur most of the time.
Encouraged by the good performance shown, we have developed a hardware
implementation of the proposed IC system using the Very High Speed Hardware
Description Language (VHDL). This is simulated and synthesized using design tools
supplied by Model Technology and Synopsys.
Acknowledgement
First of all, I am thankful to my family, especially my parents, my husband and my
daughter, for their continuous support and great sacrifices that were the major factors in
making this work a reality.
I would especially like to express my sincere thanks to my supervisor, Dr.
Venkatesan, for his academic guidance and financial support. Without his supervision I
could not have finished this work. I also do not forget the support and help of Dr. Howard
Heys and Dr. Paul Gillard.
I would like to thank the School of Graduate Studies at the Memorial University of
Newfoundland for the financial support it provided during my Master's Program.
Also, Mr. Nolan White, systems administrator in the Department of Computer
Science, deserves special thanks for his true cooperation in fixing the problems related to
the Synopsys CAD tools.
Thanks also to Ms. Moya Crocker for her help during my study at the Faculty of
Engineering, Memorial University of Newfoundland.
Finally, I would like to thank all my friends and colleagues who sincerely helped
during my Master's study. I specially thank Y. E. Sayed, P. Mehrotra, Cheng Li, Lu
Xiao, J. Deepakumara, Qiyao Yu, Wei Song, Ji Xie, and K. Kannan.
Table of Contents
Abstract
Acknowledgement
Table of Contents
List of Figures
List of Tables
List of Abbreviations
Chapter 1 INTRODUCTION
1.1 Background on Broadband Communication Networks
1.2 ATM Switching
1.3 Motivation for This Thesis
1.4 Thesis Organization
Chapter 2 INPUT CONTROLLER FOR ATM SWITCHES
2.1 Introduction
2.2 Input Controller Functions
2.3 Input Controller Architectures
2.3.1 Input Buffer
2.3.2 Routing Table Memory
2.4 Input Controller in Commercial Switching Products
2.5 Summary
Chapter 3 BALANCED GAMMA NETWORK-BASED ATM SWITCHES
3.1 Introduction
3.2 Topology
3.3 Routing Algorithm
3.4 Properties
3.4.1 Hardware Complexity
3.4.2 Fault Tolerance Properties
3.4.3 Reliability Analysis
3.5 Summary
Chapter 4 INPUT CONTROLLER ARCHITECTURE DESIGN FOR BG ATM NETWORKS
4.1 Introduction
4.2 System Description
4.3 Input Buffer Module
4.3.1 Input Buffer Operations
4.3.2 Input Buffer Structure
4.4 Cache Memory Module
4.4.1 Cache Structure
4.4.2 Cache Operation
4.5 Routing Table Module
4.5.1 Routing Table Structure
4.5.2 Routing Table Operation
4.6 Arbiter Logic Module
4.7 Summary
Chapter 5 PERFORMANCE ANALYSIS OF THE INPUT CONTROLLER
5.1 Introduction
5.2 Traffic Modeling
5.2.1 Traffic Patterns in Broadband Communication Networks
5.3 Simulator Software
5.4 Simulation Results and Analysis
5.4.1 Input Buffer Number
5.4.2 Cache Performance
5.5 Summary
Chapter 6 HARDWARE IMPLEMENTATION OF THE IC
6.1 Introduction
6.2 Hardware Implementation Methodology
6.3 Input Controller Architecture
6.4 System Components
6.4.1 Input Buffer
6.4.2 IC Center
6.5 Simulation and Synthesis
6.6 Summary
Chapter 7 CONCLUSION
7.1 Contributions in this Thesis
7.2 Recommendations for Future Work
7.3 Summary
REFERENCES
APPENDICES
A Module Structure of the IC System
B Behavioral Simulation Result of the IC System
List of Figures
Figure 1.1 Basic Idea of Broadband Communication Networks
Figure 1.2 Generic Model of Switch
Figure 1.3 The Role of ATM Switches in an Internetwork
Figure 2.1 Cell Structure at the UNI/NNI
Figure 2.2 Cell Header Structure at UNI
Figure 2.3 ATM VP/VC Switching
Figure 2.4 Various Buffering Strategies
Figure 2.5 Cache Organizations
Figure 2.6 The Three Portions of an Address in a Cache
Figure 3.1 8 x 8 BG Network
Figure 3.2 SE Output Link Notation
Figure 3.3 Routing in an 8 x 8 BG Network
Figure 3.4 Crosspoint Complexity in a SE
Figure 3.5 Dynamic Rerouting in an 8 x 8 BG Network in Case of Link Faults
Figure 3.6 Dynamic Rerouting in an 8 x 8 BG Network in Case of SE Faults
Figure 4.1 Input Controller Architecture
Figure 5.1 The ON/OFF Source Model
Figure 5.2 Number of Input Buffers under URT Traffic for Different Sizes of Switches
Figure 5.3 Number of Input Buffers under Bursty Traffic with Mean Burst Length = 15
Figure 5.4 Cache Hit Rates under Various Traffic Types and N
Figure 5.5 Cache Hit Rates under Various VP/VC Number and N
Figure 6.1 Design Flow Recommended by CMC
Figure 6.2 UART and the Input Controller (IC)
Figure 6.3 Input Controller (IC) System Block
Figure 6.4 IC Centre Architecture
Figure 6.5 Interface of the IB
Figure 6.6 Lookup Cache Architecture
Figure 6.7 Interface of the C_REGFILE_0
Figure 6.8 14-Bit Comparator
Figure 6.9 Comparison Portion in the Lookup Cache
Figure 6.10 Interface of the C_REGFILE_1
Figure 6.11 Interface of the CACHE_MEM
Figure 6.12 Tri-State Symbol and Truth Table
Figure 6.13 Tri-State Bus
Figure 6.14 Routing Table (RT) Architecture
Figure 6.15 Interface of the RT_REGFILE
Figure 6.16 Comparison Portion in the RT Architecture
Figure 6.17 Interface of the RT_MEM
Figure 6.18 Interface of the Arbiter
Figure 6.19 Pseudo Random Number Generator (PRG) Structure
Figure 6.20 A Part of the Simulation Report
Figure 7.1 Improved Pseudo Random Number Generator (PRG)
List of Tables
Table 4.1 Input Buffer Organization
Table 4.2 Cache Organization
Table 4.3 Cache Block Access Numbers Updating Process
Table 4.4 Routing Table (RT) Organization
Table 5.1 Number of Input Buffers and Output Buffers under Different URT Traffic Loads
Table 5.2 Number of Input Buffers and Output Buffers under Different URT Traffic Loads
Table 5.3 Number of Input Buffers and Output Buffers under Different URT Traffic Loads
Table 5.4 Number of Input Buffers and Output Buffers under Different URT Traffic Loads
Table 5.5 Number of Input Buffers under URT for Different Sizes of Switches
Table 5.6 Number of Input Buffers and Output Buffers under Bursty Traffic with Mean Burst Length = 15 under Different Traffic Load, for Switches with N = 8
Table 5.7 Number of Input Buffers and Output Buffers under Bursty Traffic with Mean Burst Length = 15 under Different Traffic Load, for Switches with N = 16
Table 5.8 Number of Input Buffers and Output Buffers under Bursty Traffic with Mean Burst Length = 15 under Different Traffic Load, for Switches with N = 32
Table 5.9 Number of Input Buffers and Output Buffers under Bursty Traffic with Mean Burst Length = 15 under Different Traffic Load, for Switches with N = 32
Table 5.10 Number of Input Buffers under Bursty Traffic for Different Sizes of Switches
Table 6.1 Pattern-Generator Outputs
Table 6.2 Summary of Area and Timing Report for the Components and the System
List of Abbreviations
ABR: Available Bit Rate
ALFSR: Autonomous Linear Feedback Shift Register
ASIC: Application-Specific Integrated Circuit
ASSP: Application Specific Standard Products
ATM: Asynchronous Transfer Mode
BG: Balanced Gamma
B-ISDN: Broadband Integrated Services Digital Network
BR: Broadcast Reliability
CAM: Content Addressable Memory
CBR: Constant Bit Rate
CLP: Cell Loss Priority
CMC: Canadian Microelectronic Corporation
FIFO: First-In-First-Out
FPGA: Field Programmable Gate Array
FT: Fault Tolerance
GFC: Generic Flow Control
HC: Hardware Complexity
HEC: Header Error Control
HOL: Head Of Line
IB: Input Buffer
IC: Input Controller
IP: Internet Protocol
IPv4: IP Standard version 4
IPv6: IP Standard version 6
ITU-T: International Telecommunication Union - Telecommunication Standardization Sector
LAN: Local Area Network
LANE: LAN Emulation
LFSR: Linear Feedback Shift Register
LIFO: Last In First Out
LRU: Least-Recently Used
MIN: Multistage Interconnection Network
MSB: Most Significant Bit
MTTF: Mean Time To Failure
NR: Network Reliability
OC: Output Controller
OSI: Open Systems Interconnection
PRG: Pseudo Random-number Generator
PT: Payload Type
QoS: Quality of Service
RAM: Random Access Memory
RT: Routing Table
RTL: Register Transfer Level
SE: Switch Element
SF: Switch Fabric
SIN: Single-stage Interconnection Network
SONET: Synchronous Optical NETwork
TGD: Truncated Geometric Distribution
TP: Terminal Path
TR: Terminal Reliability
UART: Universal Asynchronous Receiver Transmitter
UBR: Unspecified Bit Rate
UNI: User Network Interface
URT: Uniform Random Traffic
VBR: Variable Bit Rate
VLSI: Very Large Scale Integration
VPI: Virtual Path Identifier
VCI: Virtual Channel Identifier
VHDL: Very-high-speed Hardware Description Language
WAN: Wide Area Network
Chapter 1
INTRODUCTION
1.1 Background on Broadband Communication Networks
Modern telecommunication networks are evolving at a fast pace. In today's
broadband communication networks, many newly emerging applications have diverse
service features and require huge bandwidths. These include conventional telephony
services, computer data transfer services, applications for audio and video
communication, audio and video broadcast, games and interactive multimedia
applications, etc. The word "broadband" here refers to the wide bandwidths these
services require.
Figure 1.1 Basic Idea of Broadband Communication Networks
Figure 1.1 depicts the basic idea of broadband communication networks. For
decades, great efforts have been made to develop broadband communication networks
that are flexible enough to accommodate continual changes in service mixes, easy to
install and maintain, and efficient in resource utilization, while providing user-friendly
access. ATM (Asynchronous Transfer Mode) has been identified by the ITU-T
(International Telecommunication Union - Telecommunication Standardization Sector)
as the most suitable protocol to integrate all services over a unified network. ATM is a
standard for cell relay, where information such as voice, data or video is converted into
small cells of fixed size. ATM technology combines the benefits of circuit switching and
packet switching. Combining these two technologies brings the advantage of guaranteed
capacity and constant transmission delay (circuit switching) together with the efficiency and
flexibility for increasing data traffic (packet switching). One of the main benefits of ATM
is that it offers assured quality of transmission with service-level agreements. ATM
provides QoS (Quality of Service) to every user, congestion control, delivery classes, and
dynamic bandwidth allocation capabilities. Currently, the most popular protocol for
broadband communication networks is IP (Internet Protocol). IP was invented to be the
network layer (layer 3 in the OSI model) protocol running on top of Ethernet and
Token Ring LANs (Local Area Networks). IP is an unreliable protocol, offering services
on a best-effort basis, and discarding messages when necessary without attempting
retransmission and without warning the transmitting program [1].
1.2 ATM Switching
In this thesis, among the many aspects of the ATM and IP protocols, we discuss only
those regarding ATM switching. This is based on the following observations.
Irrespective of the choice of the protocol, ATM or IP, the nodes of the networks, or
switches, have essentially the same functional and performance requirements. Besides,
ATM will play an important role in future IP core networks. In the WAN (Wide Area
Network), ATM is expected to carry IP traffic. IP gives us specific attributes, including
efficient multiplexing, internetworking and multicasting. But, as mentioned above, ATM
supports QoS while IP does not. ATM is providing the transitional solution for IP's
limitations in quality and legacy service adaptation. IP core platforms are utilizing ATM
backplanes, and IP QoS schemes are mimicking the capabilities of ATM as IP continues
to evolve as a future core network protocol.
Research on ATM switching has been pursued worldwide for several years and
many ATM switching architectures have been proposed. Throughout this thesis, we will
divide the switching architecture, or the switch, into three functional blocks: input
controller (IC), switch fabric, and output controller (OC). We discuss only cell
forwarding operations, namely, operations related to the transfer of cells from the inputs
to the outputs of the switch. Thus, other functionalities relevant to the set-up and
tear-down of the virtual connections are not discussed here. Figure 1.2 illustrates this generic
switch model. As shown in Figure 1.3, these switches are geographically distributed over
a communication network. A user-to-user connection typically goes through several
nodes (switches), and thus each cell belonging to such a connection experiences many
hops.
Figure 1.2 Generic Model of Switch (IC: input controller; OC: output controller)
The switch fabric (SF) forms the core of a switch. The SF is primarily responsible
for transferring cells between the other functional blocks, routing, signaling and
managing cells.
The output controller (OC) prepares the ATM cell streams for physical transmission
by removing and processing the internal tags, managing output cell buffers, etc.
The input controller (IC) performs some important functions for each ATM cell:
• temporarily buffers the incoming cell streams
• performs table lookups to translate the cell VPI/VCI (virtual path
identifier/virtual channel identifier) values
• determines the destination output port
• rewrites the ATM cell headers
• generates the internal routing tag for use only within the switch
The following chapters are devoted to exploring the input controller in
considerable detail.
1.3 Motivation for This Thesis
The standards for ATM are relatively mature, but the actual hardware designs and
implementations are still in progress. Tremendous research has been conducted to develop
more efficient and better-performing ATM switch architectures.
In very high-speed networks, the technology deployed in the nodes of
the network, or switches, greatly influences the speed of the network, and the routing
table lookup is the worst bottleneck in a switch. There have been many attempts to
speed up routing table lookup techniques. Our work here
is an effort in this direction.
We propose a hardware implementation of a routing table lookup scheme. At very
high line rates, this process must be executed as rapidly as possible, which is why we
consider the design of an ASIC (application-specific integrated circuit) dedicated to this
process. Hardware can greatly speed up the process compared with a software
implementation. Besides, a cache-based architecture for the input controller (IC) is proposed
here. This is realized by high-speed cache attempts followed by slower table memory
lookups. We expect to gain better system performance by caching the frequently
used forwarding information.
Our input controller solution in this thesis would be applicable to any switch or
router that handles fixed-size packets, not just to ATM switches. For routers that
have to deal with long, variable-length packets, the packets are chopped into small
equal-sized chunks, switched using an ATM switch-like router, and then reassembled to recover
the original packets. Such router designs are commonplace now.
1.4 Thesis Organization
The remainder of the thesis is organized as follows.
In Chapter 2, we describe the input controller functions in ATM switches and
investigate the various categories of input controller architectures in terms of the input
buffering strategies, routing table hierarchy design, etc. To clarify the different schemes,
we also show some examples of input controllers in commercial switching products. We
end this chapter with a detailed commercial example.
In Chapter 3, we give a clear picture of the Balanced Gamma (BG) network-based
ATM switch. This is necessary before we start discussing our design because our IC is
designed for BG ATM networks and we will take advantage of some of the characteristics
of the BG network. In this chapter, we include the topology, the routing algorithm
employed and various properties of this network.
In Chapters 4 and 5, we discuss our IC design in detail. We divide the whole system
into several modules and analyze the functions and structures of the individual modules.
To verify our model, we build a software simulator. The traffic patterns used in the
simulation are uniform random traffic (URT) and bursty traffic. After examining the
simulation results, we confirm that the system modeling is correct and thus we can
prepare for the hardware design process.
In Chapter 6, we present the IC hardware design. The design is carried out using
Synopsys CAD tools supported by the Canadian Microelectronic Corporation (CMC). The
design is described at the Register Transfer Level (RTL) using the Very-high-speed
Hardware Description Language (VHDL). We adopt the 0.18 µm CMOS technology,
also supported by CMC, to achieve a high-speed design.
Chapter 7 concludes our work.
Chapter 2
INPUT CONTROLLER FOR ATM SWITCHES
2.1 Introduction
In this chapter, we present a survey of several input controller schemes for ATM
switches. First, we explain the main input controller functions that are of interest in our
work. Then, we discuss various input controller architectures commonly used for
switches in broadband communication networks. We mainly focus on two issues. One is
the input buffering strategy. The other is the memory hierarchy for the routing table.
Finally, we show some input controller examples in commercial switching products.
2.2 Input Controller Functions
Before we talk about input controller functions, it is useful to examine the ATM
cell and cell header structure, as well as ATM VPI/VCI switching.
ATM is a cell-based switching technology. The cell consists of a 5-octet (i.e.
5-byte) header and a 48-octet information field as shown in Figure 2.1 [2]. Some
conventions used are [2]:
• bits within an octet are sent in decreasing order, starting with bit 7;
• octets are sent in increasing order, starting with octet 0;
• for all fields, the first bit sent is the most significant bit (MSB).
Figure 2.1 Cell Structure at the UNI/NNI (53-octet cell: a 5-octet header followed by a 48-octet information field; bits numbered 7 to 0)
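The structure of the header is shown in Figure 2.2 [2]. The fields contained in the
header and their encoding are described below.
Bit    7   6   5   4   3   2   1   0    Byte
     +---------------+---------------+
     |      GFC      |      VPI      |   0
     +---------------+---------------+
     |      VPI      |      VCI      |   1
     +---------------+---------------+
     |              VCI              |   2
     +---------------+-----------+---+
     |      VCI      |    PT     |CLP|   3
     +---------------+-----------+---+
     |              HEC              |   4
     +-------------------------------+
Figure 2.2 Cell Header Structure at UNI
CLP: Cell Loss Priority
GFC: Generic Flow Control
PT: Payload Type
HEC: Header Error Control
VPI: Virtual Path Identifier
VCI: Virtual Channel Identifier [2]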
The 24-bit routing field consists of 8 bits for the virtual path identifier (VPI) and 16
bits for the virtual channel identifier (VCI). ATM is connection-oriented, and the header
values, VPI/VCIs, are assigned to each section of a connection for the complete duration of
the connection. The VPI and VCI are unique for cells belonging to the same virtual
connection on a shared transmission medium.
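The field layout described above can be made concrete with a short sketch that splits a 5-octet UNI header into its fields and assembles the 24-bit routing field. The names `UniHeader`, `parse_uni_header` and `routing_field` are hypothetical and introduced only for illustration; the sketch assumes the bit positions of the UNI header layout (GFC 4 bits, VPI 8 bits, VCI 16 bits, PT 3 bits, CLP 1 bit, HEC 8 bits, most significant bit first).

```python
from typing import NamedTuple


class UniHeader(NamedTuple):
    gfc: int  # Generic Flow Control, 4 bits
    vpi: int  # Virtual Path Identifier, 8 bits
    vci: int  # Virtual Channel Identifier, 16 bits
    pt: int   # Payload Type, 3 bits
    clp: int  # Cell Loss Priority, 1 bit
    hec: int  # Header Error Control, 8 bits


def parse_uni_header(header: bytes) -> UniHeader:
    """Split a 5-octet UNI cell header into its fields (MSB first)."""
    if len(header) != 5:
        raise ValueError("an ATM cell header is exactly 5 octets")
    b0, b1, b2, b3, b4 = header
    return UniHeader(
        gfc=b0 >> 4,
        vpi=((b0 & 0x0F) << 4) | (b1 >> 4),
        vci=((b1 & 0x0F) << 12) | (b2 << 4) | (b3 >> 4),
        pt=(b3 >> 1) & 0x07,
        clp=b3 & 0x01,
        hec=b4,
    )


def routing_field(h: UniHeader) -> int:
    """Return the 24-bit routing field: the 8-bit VPI followed by the 16-bit VCI."""
    return (h.vpi << 16) | h.vci
```

For instance, the header octets `01 23 45 61 AB` (hexadecimal, an arbitrary example) decode to VPI 0x12 and VCI 0x3456, so the routing field is 0x123456. The HEC is only extracted here, not checked.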
There are basically two types of ATM switching used in an ATM network (see
Figure 2.3) [3]:
Figure 2.3 ATM VP/VC Switching
1) the virtual path switch
This switch routes cells based only on the VPI value within the routing field of the
cell header. This means low switching overhead and high efficiency. This type of
switch is known as an ATM cross-connect and is frequently used as a network
trunk-switching device [3].
2) the virtual channel switch
This switch routes cells based on the value of the whole of the routing field within
the cell header, that is, using the VPI/VCI combination. This type of switch has a
higher overhead than a cross-connect type switch [3].
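The difference between the two switch types can be sketched as a difference in lookup key. The tables below are hypothetical and their contents are made up for illustration; the sketch also assumes that a virtual path switch passes the VCI through unchanged, which the text above does not state explicitly.

```python
# Hypothetical translation tables for a single input port (values are made up).
# Virtual path switch: keyed on the VPI only.
vp_table = {5: (2, 9)}             # in VPI -> (output port, out VPI)
# Virtual channel switch: keyed on the whole routing field.
vc_table = {(5, 100): (3, 7, 42)}  # (in VPI, in VCI) -> (output port, out VPI, out VCI)


def vp_switch(vpi, vci):
    """Route on the VPI alone; the VCI is carried through unchanged (assumption)."""
    out_port, out_vpi = vp_table[vpi]
    return out_port, out_vpi, vci


def vc_switch(vpi, vci):
    """Route on the VPI/VCI combination; both values may be translated."""
    return vc_table[(vpi, vci)]
```

With these tables, `vp_switch(5, 100)` and `vp_switch(5, 200)` both leave on port 2 with VPI 9, whereas the virtual channel switch needs one table entry per VPI/VCI pair, which is one way to see its higher overhead.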
The input controller is located in front of the switch fabric. Figure 1.2 in Chapter 1
depicts the input controller location in the generic ATM switch. Input buffers are the
interface of a switch with the incoming signals. When reaching a switch, the incoming
signal is first terminated and the ATM cell streams are extracted. This involves signal
conversion and overhead processing, cell delineation and rate decoupling [4]. After that,
for each ATM cell, several important processes are carried out. These processes are
the main focus of our work and are explored in great depth in this thesis. They
include:
• temporarily buffering the incoming cell streams
• routing table lookup to translate the cell VPI/VCI values
• determining the destination output port
• rewriting the ATM cell headers
• generating the internal routing tag for use only within the switch
2.3 Input Controller Architectures
Because the switch design is not part of the ATM standards and research in this
field is still continuing, vendors utilize a wide variety of techniques to build their
switches. A lot of research has been carried out to explore the different switch design
alternatives. In this section, we review two issues related to performing the input
controller functions in the IC: the input buffering strategy and the memory hierarchy of
the routing table (RT).
2.3.1 Input Buffer
2.3.1.1 Blocking in Multistage Interconnection Networks
In multistage interconnection networks (MINs), the packet inputs are unpredictable.
When the paths of two cells addressed to two different output lines conflict before the
last stage, internal blocking occurs. Only one of the two cells contending for a link
can be passed to the next stage. A fabric is said to be internally blocking if a set of N cells
addressed to N different outputs can cause conflicts within the fabric. When cells wait for
a delayed cell at the head of the queue to go through, even if their own destination output
ports are free, head-of-line (HOL) blocking occurs. This also reduces the overall
throughput.
2.3.1.2 Buffering Strategies
To reduce the above cell loss and improve the throughput, packet switches need
buffers to hold packets as they wait for access to the switch fabric or the output trunk.
There are several basic approaches to the placement of buffers. These basic
approaches are illustrated in Figure 2.4 [4][5]:
Figure 2.4 Various Buffering Strategies (panels: Input Queuing, Internal Queuing, Output Queuing, Shared Memory)
Input Queuing
In a purely input-queued switch, packets are buffered at the input and released
when they win access to both the switch fabric and the output trunk. This approach
suffers from head-of-line (HOL) blocking. When two cells arrive at the same time
and are destined to the same output, one of them must wait in the input buffers,
preventing the cells behind it from being admitted. Thus capacity is wasted [4].
Several methods have been proposed to tackle the HOL blocking problem, but they
all exhibit complex design. Increasing the internal speed of the space division fabric and
changing the first-in-first-out (FIFO) discipline are two examples of such methods [4].
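HOL blocking in a FIFO input-queued switch can be illustrated with a minimal sketch of one time slot. The function `one_slot` and its tie-break rule (the lowest-numbered input wins a contended output) are assumptions made for this illustration and are not taken from the thesis.

```python
from collections import deque


def one_slot(input_queues):
    """Serve one time slot of a FIFO input-queued switch.

    Each queue holds the destination output ports of its waiting cells.
    Only the head cell of each queue may contend, and each output accepts
    one cell per slot (lowest input index wins, an arbitrary tie-break).
    Returns the (input, output) pairs served in this slot.
    """
    served = []
    busy_outputs = set()
    for i, queue in enumerate(input_queues):
        if queue and queue[0] not in busy_outputs:
            busy_outputs.add(queue[0])
            served.append((i, queue.popleft()))
    return served


# Both head cells want output 0; input 1 also holds a cell for idle output 2.
queues = [deque([0, 1]), deque([0, 2])]
first = one_slot(queues)
```

In the first slot only input 0 is served. The cell for output 2 waits behind the blocked head cell of input 1 although output 2 is idle, which is the wasted capacity described above.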
Output Queuing
A pure output-queued switch buffers data only at its outputs. This approach is
optimal in terms of throughput and delay because the output queues do not suffer from
HOL blocking. However, in the worst case, cells at all N input ports may be destined to
the same output port. If the switch has no input buffers and we want to avoid cell loss, the
switch fabric must deliver N cells to a single output, and the output queue must store N
cells in the time it takes for one cell to arrive at an input. This makes the switch fabric
and queues more expensive than with input-queued switching [5]. Besides, in both cases,
the throughput and scalability are limited.
Internal Queuing
One kind of switch fabric is called a space division switch fabric, in which there is
more than one path between each input and output port. These paths work in parallel so
that many cells can be sent at the same time [28]. Buffers can be placed within the
switching elements in a space division fabric. Again, HOL blocking might occur within
the switching elements, and this significantly reduces throughput, especially in the case
of small buffers or larger networks. Internal buffers also introduce random delays within
the switch fabric, causing undesirable cell delay variation [4]. Cells getting out of
sequence is another problem in this kind of multi-path switch.
Shared Memory
In a shared-memory switch, input and output ports share a common memory. Cells
are stored in the common memory as they arrive, and the cell header is extracted and
routed to the output port. When the output port scheduler schedules a cell for
transmission on the trunk, it removes the cell from the shared memory. Since the switch
fabric only switches headers, it is easier to build. However, an N x N switch must read
and write N cells in one cell arrival time. Because memory bandwidth is usually a highly
constrained resource, this restricts the size of the switch [5].
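The memory bandwidth constraint can be quantified with a rough calculation. The sketch below assumes that one memory access moves a whole 53-octet cell and, in the usage note, a line rate of 155.52 Mb/s per port with framing overhead ignored; both are illustrative assumptions, not figures from the thesis.

```python
CELL_BITS = 53 * 8  # one ATM cell: 5-octet header plus 48-octet information field


def max_memory_cycle_ns(n_ports: int, line_rate_bps: float) -> float:
    """Longest memory cycle (ns per cell access) that keeps up with N ports.

    In one cell arrival time the shared memory must absorb N writes and
    N reads, i.e. 2N accesses (assuming one access moves a whole cell).
    """
    cell_time_ns = CELL_BITS / line_rate_bps * 1e9
    return cell_time_ns / (2 * n_ports)
```

Under these assumptions a cell arrives roughly every 2.73 microseconds, so `max_memory_cycle_ns(32, 155.52e6)` is about 42.6 ns. Doubling the number of ports halves the available cycle time, which is why memory bandwidth limits the switch size.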
Studies show that various combinations of the above, rather than any one of the pure queuing
strategies, lead to better compromises [6]. We also need to note that buffering
cannot totally eliminate cell loss, but it can reduce it to acceptable levels. Adding buffers
is not a penalty-free solution for minimizing cell loss, for two reasons. One is that
buffering requires adding extra hardware to the system in the form of memory elements
and control circuits. The other is that cells stored within the buffers suffer from
delay. In fact, the larger the buffer size, the greater the delay, which imposes another
limit on the buffer size used. For instance, when we try to improve the level of cell loss,
which is normally achieved by increasing buffer sizes, we end up experiencing poor
levels of cell delay, and vice versa [7].
2.3.2 Routing Table Memory
The routing table (RT) memory is a very important memory system in an ATM
switch. In practice, the existing memory systems can be classified into two categories: the
non-cache scheme and the cache-based scheme. In this section, we describe
each choice.
2.3.2.1 Non-cache Scheme
This category can be subdivided into centralized RT memory and distributed RT
memory. In the centralized RT memory system, a single RT receives lookup requests
from all input ports of the switch. An arbitration mechanism must be introduced to
coordinate the requests from different input ports. A centralized RT can be a performance
bottleneck if it is overloaded by processing demands. Distributed RT memory can remove
the bottleneck but, since every input port has an RT, huge amounts of memory may be
needed. The hardware complexity may increase since coordination may be required.
2.3.2.2 Cache-based Scheme
This category usually has two main levels of memory hierarchy: a small cache
memory and a large RT memory. These two levels of memory may be further subdivided
into multiple levels.
2.3.2.2.1 Motivation for Using Cache
One of the most important memory properties that can be exploited is locality of
reference: information that has been used recently tends to be reused. That is, not every
item of data or code is accessed with the same probability. Two different types of
locality have been observed. Temporal locality states that recently accessed items are
likely to be accessed in the near future. Spatial locality says that items whose addresses
are near one another tend to be referenced close together in time [8].
Cache is used in the RT memory system because of the strong
temporal locality, not spatial locality, displayed by data forwarding in a
communication network. This will be explained in Section 4.4.1. The input controller
performs the lookup at line speed, or wire speed, that is, at the rate of incoming cells on
the input lines. To keep up with this speed, given a large routing table, it is useful to
cache recently used routing entries.
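The benefit of temporal locality can be sketched with a small cache placed in front of a routing table. The class `LookupCache`, the least-recently-used replacement policy, the table contents and the arrival pattern are all assumptions chosen for illustration; they are not the cache design developed later in this thesis.

```python
from collections import OrderedDict


class LookupCache:
    """Small LRU cache in front of a (slower) routing table dictionary."""

    def __init__(self, routing_table, capacity):
        self.routing_table = routing_table
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, key):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = self.routing_table[key]       # slow path: full table lookup
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return value


# Hypothetical table: (VPI, VCI) -> output port, and a bursty arrival pattern.
table = {(1, v): v % 4 for v in range(100)}
cache = LookupCache(table, capacity=2)
for key in [(1, 7)] * 5 + [(1, 8)] * 5 + [(1, 7)] * 2:
    cache.lookup(key)
```

With this bursty pattern only the first cell of each connection goes to the routing table: the run ends with 10 hits and 2 misses, even though the cache holds just two entries.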
2.3.2.2.2 Cache Attributes
Before we consider Ihe application of a cache structure in the ATM switch design.
it isnecessarytoexploitcache:lltributes briefly.
Cache is the name generally given to the first level of the memory hierarchy
encountered once the address leaves the CPU [8]. Since the principle of locality applies at
many levels, and taking advantage of locality to improve performance is so popular, the
term 'cache' is now applied whenever buffering is employed to reuse commonly
occurring items.
A block is the minimum unit of information that can be present in the cache (a hit in
the cache). Miss rate is the fraction of accesses that are not in the cache. Cache
organization can be classified into three categories based on where a cache block is
placed [8]:
Direct mapped cache
If each block has only one place it can appear in the cache, the cache is said to be
direct mapped. The mapping is usually
(Block address) MOD (Number of blocks in cache)
Fully associative cache
If a block can be placed anywhere in the cache, the cache is said to be fully
associative.
Set associative cache
If a block can be placed in a restricted set of places in the cache, the cache is said to
be set associative. A set is a group of blocks in the cache. A block is first mapped onto a
set, and then the block can be placed anywhere within that set. The set is usually chosen
by bit selection, that is,
(Block address) MOD (Number of sets in cache)
If there are n blocks in a set, the cache placement is called n-way set associative.
Direct mapped is simply one-way set associative, and a fully associative cache with
m blocks could be called m-way set associative; alternatively, direct mapped can be
thought of as having m sets and fully associative as having one set.
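The three placement rules can be sketched in a few lines of Python. This is only an illustration: the function name `candidate_frames` is hypothetical, and the assumption that set s occupies the consecutive frames s*ways to s*ways + ways - 1 is a numbering convention chosen here, not something stated in the text.

```python
def candidate_frames(block_address: int, num_frames: int, ways: int) -> list[int]:
    """Return the cache frames in which a memory block may be placed.

    ways == 1           -> direct mapped (one candidate frame)
    ways == num_frames  -> fully associative (every frame is a candidate)
    otherwise           -> n-way set associative
    Assumption: set s occupies frames s*ways .. s*ways + ways - 1.
    """
    if num_frames % ways != 0:
        raise ValueError("number of frames must be a multiple of the associativity")
    num_sets = num_frames // ways
    set_index = block_address % num_sets  # (Block address) MOD (Number of sets)
    return [set_index * ways + way for way in range(ways)]


# Block 12 in a cache with eight block frames:
print(candidate_frames(12, 8, 1))  # direct mapped: frame 12 mod 8
print(candidate_frames(12, 8, 2))  # two-way set associative: set 12 mod 4
print(candidate_frames(12, 8, 8))  # fully associative: any frame
```

With one way the rule reduces to (Block address) MOD (Number of blocks in cache), and with a single set every frame is a candidate, which mirrors the remark that direct mapped and fully associative are the two extremes of set associativity.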
Figure 2.5 [8] depicts the different cache organizations. Caches can be further
subclassified into multi-level caches, that is, main cache, secondary cache, even tertiary
cache. Secondary cache is widely used nowadays in computing systems.
Figure 2.5 Cache Organizations
(This example cache has eight block frames and memory has 32 blocks.)
Fully associative: block 12 can go anywhere.
Direct mapped: block 12 can go only into block 4 (12 mod 8).
Set associative: block 12 can go anywhere in set 0 (12 mod 4).
Caches have an address tag on each block frame that gives the block address. The
tag of every cache block that might contain the desired information is checked to see if it
matches the block address. All possible tags are searched in parallel because speed is
critical. A valid bit is added to the tag to say whether or not this entry contains a valid
address. If the bit is not set, there cannot be a match on this address. The address
also includes the index and the block offset. The block offset field selects the desired data from
the block, the index field selects the set, and the tag field is compared against the block
address for a hit [8]. Figure 2.6 [8] shows the cache block structure.
| Block address        | Block  |
| Tag      | Index     | offset |
Figure 2.6 Cache Block Structure
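As a small illustration of how the three fields are extracted, the following Python sketch splits an address into tag, index and block offset. The function name and the field widths used in the example (2 offset bits, 3 index bits) are hypothetical values chosen only for the demonstration.

```python
def split_address(address: int, offset_bits: int, index_bits: int) -> tuple[int, int, int]:
    """Split an address into (tag, index, block_offset).

    The lowest offset_bits select the data within the block, the next
    index_bits select the set, and the remaining high-order bits form the tag.
    """
    block_offset = address & ((1 << offset_bits) - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, block_offset


# Hypothetical 9-bit address 1101 011 01 (tag | index | offset).
tag, index, offset = split_address(0b110101101, offset_bits=2, index_bits=3)
print(bin(tag), bin(index), bin(offset))
```

In a fully associative cache there is only one set, so the index field has zero width and the whole block address acts as the tag.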
When a miss occurs, the cache controller must select a block to be replaced with the
desired data. Several algorithms are used to select the victim block:
Random
To spread allocation uniformly, candidate blocks are randomly selected. Some
systems generate pseudorandom block numbers to get reproducible behavior, which is
particularly useful when debugging hardware [8].
Least-recently used (LRU)
To reduce the chance of throwing out information that will be needed soon,
accesses to blocks are recorded. The block replaced is the one that has been unused for
the longest time. LRU makes use of a corollary of locality: if recently used blocks are
likely to be used again, then the best candidate for disposal is the least-recently used
block.
Some other replacement algorithms are: FIFO, which replaces the oldest block;
LIFO, which replaces the newest block; and the ideal (optimal) algorithm, which replaces the block
that will not be used for the longest time [9]. Considering feasibility and the
additional control bits required, some of the above algorithms are not suitable in practice.
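The LRU policy can be illustrated with a short software model. This is a minimal sketch, not a hardware design: the class name `LRUCache`, the `fetch` callback standing in for the slower next-level lookup, and the hit/miss counters are all assumptions made for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache model with least-recently-used replacement."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def lookup(self, tag, fetch):
        """Return the data for tag; on a miss call fetch(tag) and cache the result."""
        if tag in self.entries:
            self.entries.move_to_end(tag)  # record the access: now most recently used
            self.hits += 1
            return self.entries[tag]
        self.misses += 1
        value = fetch(tag)
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used block
        self.entries[tag] = value
        return value


cache = LRUCache(capacity=2)
for tag in [1, 2, 1, 3, 1, 2]:
    cache.lookup(tag, fetch=lambda t: f"data-{t}")
print(cache.hits, cache.misses, list(cache.entries))
```

The ordered dictionary plays the role of the recorded access history; a hardware cache would keep a few control bits per block instead, which is the cost referred to above.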
Besides the various schemes for the input controller, there also exists a trade-off
between software implementation and hardware implementation. While some people
prefer the flexibility of software, hardware has the advantage when speed is the main concern.
2.4 Input Controller in Commercial Switching Products
One of the most attractive aspects of a technology is how it is
used. Looking back over the communications products of the last three decades,
cost/performance is probably the most important of the three attributes - the others being
standards and time to market - that determine product success.
Today's vendors of switching equipment include some traditional
telecommunication equipment makers, such as Nortel Networks, Lucent Technologies,
Fujitsu, etc., some data/networking companies, such as Cisco, Juniper Networks, etc., and
others like IBM, 3Com, PMC-Sierra, MOSAID, etc. Some small companies also have
leading edge solutions for switching products.
We now review some commercial switching products.
Cabletron Systems has introduced a new product called Smart Switch Router,
which is based on YAGO Systems, Inc. router development. There is a 16 Gbps
non-blocking switching fabric in the backplane, and the full routing table holds a maximum of
250,000 routing entries. Each port can switch 50,000 MAC addresses [10].
BBN Corporation is part of GTE Internetworking, a unit of GTE Corporation. The
research group of BBN has published a design for "A Fifty Gigabit Per Second IP
Router" called MGR. MGR is built around a high-speed custom-designed 15-port switch.
Forwarding engine cards, which hold a complete set of routing information, are separate from
the line cards. All the line cards are able to translate the link-layer header to an abstract
link-layer header containing the information required for IP forwarding. The forwarding engine
comprises three levels of caches: the most used part of the code fits into the first level
8 KB instruction cache (Icache), the on-chip 94 KB secondary cache is filled with 12000
cached route entries, and a third level 16 MB off-chip cache is reserved for the complete
forwarding table. This memory is divided into two 8 MB memory banks. Route
updates are made by copying a new table to one bank and then disabling the previously
used memory bank. The switch is input-queued. All inputs have a separate FIFO for each
output. The scheduling guarantees that there is no HOL blocking and the switch operates
with full throughput [10].
The Cisco 12000 series switch is optimized for performing routing and packet
forwarding functions to transport IP datagrams across a network. This product is based on
a high speed distributed routing architecture. The packet forwarding functions are
performed by each of the line cards. A copy of the forwarding tables computed by the
GRP (Gigabit Route Processor) is distributed to each of the line cards in the system. Each
line card performs an independent lookup of the destination address for each datagram
received, using its local copy of the forwarding table. The routing table has up to 1 million
route entries [33].
The next example is the SiberCore Technologies SiberCAM Ultra-2M device [11],
a product of the third generation of packet forwarding engines. Packet forwarding
solutions have progressed through three generations already. In the first generation,
lookups were largely done in software. Most systems today, including the commercial
examples discussed above, are second generation products which migrate lookup and
forwarding functions to hardware solutions, especially ASICs with an on-chip
engine and SRAM. The third generation forwarding engines are tending towards
application specific standard products (ASSPs) based on high-performance, high-capacity
ternary CAM technology. CAM is a type of specialty memory device that is accessed in a
fundamentally different way than RAMs. While a CAM is written and read exactly as a
RAM, it may also be searched. In a search, all data stored in the memory is
simultaneously compared to the search key in a single operation. The result of the search
is the physical address at which the matched entry is found. This operation inherently
accomplishes a table look-up in hardware, and requires that dedicated comparison
circuitry be co-located with every bit of storage. The CAM-based forwarding engine
uses the search key as input to search the route look-up table and determine the appropriate address for
the data. This address, in turn, is output from the CAM to a context memory to determine the
appropriate routing, switching or policy to be applied for that particular packet. The table
update and maintenance operations are performed out-of-band, that is, they can occur at
the same time as look-up operations, without either operation being slowed down [11].
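The search operation of a ternary CAM can be mimicked in software as follows. This is only a behavioural sketch under stated assumptions: the class `TernaryCAM`, its `write` and `search` methods, and the rule that the lowest matching address wins are hypothetical choices for illustration and are not taken from the SiberCAM documentation. A real device compares all entries in parallel, whereas the loop below visits them one by one.

```python
class TernaryCAM:
    """Behavioural model of a ternary CAM: each entry is a (value, care_mask) pair."""

    def __init__(self):
        self.entries = []  # the list position plays the role of the physical address

    def write(self, value: int, care_mask: int) -> int:
        """Store an entry; bits cleared in care_mask are "don't care". Returns its address."""
        self.entries.append((value & care_mask, care_mask))
        return len(self.entries) - 1

    def search(self, key: int):
        """Return the address of the first matching entry, or None if nothing matches."""
        for address, (value, care_mask) in enumerate(self.entries):
            if (key & care_mask) == value:
                return address
        return None


cam = TernaryCAM()
context_memory = {}  # address -> action, standing in for the context memory

addr = cam.write(0b10101100, 0b11111111)   # exact 8-bit match
context_memory[addr] = "output port 3"
addr = cam.write(0b10100000, 0b11110000)   # match on the upper four bits only
context_memory[addr] = "output port 1"

hit = cam.search(0b10100001)
print(hit, context_memory.get(hit))
```

Placing the more specific entry at the lower address lets the first match act as the longest match, which is one common way of organising such a table.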
2.5 Summary
Many proposals have been made for ATM switch architecture design. Among these,
various input controller schemes appear. These choices, including the examples we
discussed in this chapter, have pros and cons in terms of hardware complexity, material
cost, management, etc. Some of them can no longer meet the speed requirements of
broadband applications in the gigabit range. Researchers are still working hard to
find other solutions for high performance ATM switching. Our design also works in
this direction. In the later chapters, we will discuss our input controller architecture in
detail.
Chapter 3
BALANCED GAMMA NETWORK-BASED ATM
SWITCHES
3.1 Introduction
Before we move on to our input controller architecture design, we discuss the
Balanced Gamma (BG) network-based ATM switch, which is one of the applications our
input controller is designed for. As we mentioned in Chapter 1, our design of input
controllers is applicable not only to the BG ATM switch, but also to any switch or router that
handles fixed size packets.
Many multistage interconnection networks (MINs) for ATM switch architectures
have been reported in the literature, but most of these architectures are not efficient due to
their high cost/performance ratios, insufficient performance, or lack of scalability. The
Balanced Gamma network has been recognized to be reliable, fault-tolerant, efficient and
scalable, and to have much better performance than other MINs with the same hardware
complexity [7].
The BG network is a MIN based on 4 x 4 switch elements (SEs). It is called
"balanced" because the BG network is developed from the Gamma network, which is based
on 3 x 3 SEs. The Gamma network is unbalanced in the sense that, among the three
output links of each SE, one of the nonstraight links may be considered as an alternative
to the other nonstraight link and a rerouting scheme may be evolved to exploit this
inherent redundancy, but the straight link becomes a critical element without an
alternative. The BG network adds a 4th link to each SE, making it 4 x 4. Hence each
output link can have an alternative and the network is balanced.
In the following sections, many aspects of the BG network, including the
topology/structure, routing algorithm, hardware complexity, fault tolerance, reliability,
and performance analysis, are summarized. In Section 3.2, the topology/structure is
introduced. In Section 3.3, the routing algorithm is discussed. Then, in Section 3.4,
the BG network properties are shown, including the hardware complexity (HC), fault
tolerance (FT), and reliability analysis. Finally, based on some simulation results, the BG
network's performance is analyzed.
3.2 Topology
The BG network is a MIN based on 4 x 4 SEs. For an N x N BG network, where N is
the size of the network, there are $\log_2 N + 1$ stages. The first stage (Stage 0) consists of 1 x
4 crossbar SEs, i.e., each SE has one input link and four output links. Each of the
following $\log_2 N - 1$ stages is based on 4 x 4 crossbar SEs, each of which has four input
links and four output links. The last stage is a special buffer stage.
In each of the first $\log_2 N$ stages, there are N SEs numbered from 0 to N - 1. The $i$th
SE in the $j$th stage is connected to four SEs in the $(j+1)$th stage. The four SEs are [14]:
$i$, $(i + 2^j) \bmod N$, $(i - 2^j) \bmod N$, and $(i + 2^{j+1}) \bmod N$.
The pseudo code showing the connection scheme can be found in [17].
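The connection rule can be written out as a short Python sketch. This is not the pseudo code of [17]; the function names are hypothetical, and the rule assumed here is that SE i in Stage j is linked to SEs i, (i + 2^j) mod N, (i - 2^j) mod N and (i + 2^(j+1)) mod N in Stage j + 1.

```python
def num_stages(n_ports: int) -> int:
    """Number of stages of an N x N BG network, including the buffer stage."""
    n = n_ports.bit_length() - 1
    if n_ports < 2 or (1 << n) != n_ports:
        raise ValueError("N must be a power of two")
    return n + 1  # log2(N) + 1


def successors(i: int, j: int, n_ports: int) -> list[int]:
    """SEs in Stage j + 1 that SE i in Stage j is connected to (assumed rule)."""
    step = 2 ** j
    return [
        i % n_ports,
        (i + step) % n_ports,
        (i - step) % n_ports,
        (i + 2 * step) % n_ports,
    ]


N = 8
print(num_stages(N))
for stage in range(num_stages(N) - 1):
    print(stage, [successors(i, stage, N) for i in range(N)])
```

For N = 8 the last switching stage (Stage 2) yields only two distinct successors per SE, because i + 4 and i - 4 coincide modulo 8 and i + 8 wraps back to i; each pair of links then leads to the same output buffer.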
The buffer stage is a special stage. It is used to collect the outgoing cells from the
switch fabric and feed them to the respective destination output port. From the structure,
there are four input links for each output buffer. The output buffer is designed to accept
up to 4 cells in each switch cycle.
Figure 3.1 [7] depicts an 8 x 8 BG network, where N = 8, with a total of $\log_2 8 + 1 = 4$
stages including the buffer stage.
Figure 3.1 An 8 x 8 BG Network
3.3 Routing Algorithm
One of the attractive characteristics of the BG network is its simple routing algorithm,
called the Reversed Distance Tag Routing algorithm. It is explained in detail in the
following.
Each cell from source S to destination D is assigned a routing tag which
represents D, i.e., the output port number, in binary: $d_{n-1} d_{n-2} \ldots d_0$, where $n = \log_2 N$.
SEs interpret the tag in reverse order, i.e., SEs in Stage 0 switch a cell based on bit
$d_0$, SEs in Stage 1 switch a cell based on bit $d_1$, and so on, until the SEs in Stage $n-1$, which
switch on bit $d_{n-1}$.
For each of the SEs, the four output links can be divided into two groups. The upper
two links are used for switching a cell with routing tag bit 0; the lower two links are used
for switching a cell with tag bit 1. In either the upper or the lower group, the first output
link is considered normal. When the normal link is faulty or busy directing a cell, then the
second one - the alternative link - will be chosen. Figure 3.2 [14] shows this rule.
[Figure 3.2: output links of an SE, showing the normal and alternative links]
When more than one cell with the same tag bit enters an SE, up to 2 such cells can
be routed without any cell loss. If three or four cells with the same tag bit enter an SE, then
two will be routed and the rest are discarded.
Figure 3.3 shows an example of routing in an 8 x 8 BG network.
A cell with routing tag 110 comes into the network from input port #3. Stage 0
switches it based on '0' and chooses one of the two upper links; when it reaches SE #2,
Stage 1, it is switched based on the second '1' in 110 and routed to one of the two lower
links; when it reaches SE #2, Stage 2, it is switched based on the first '1' in 110. Thus it
gets to output port #6.
Figure 3.3 Routing in an 8 x 8 BG Network
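The two rules used in this example, reading the tag from its least significant bit and routing at most two cells per tag bit in an SE, can be sketched as follows. The function names are hypothetical, and the choice of which cells are dropped when more than two contend (here the later ones in the input list) is an assumption made only for the illustration.

```python
def tag_bit(destination: int, stage: int) -> int:
    """Bit of the routing tag examined by the SEs in the given stage (d_stage)."""
    return (destination >> stage) & 1


def bits_per_stage(destination: int, n_ports: int) -> list[int]:
    """Tag bits in the order in which Stage 0, Stage 1, ... consume them."""
    n = n_ports.bit_length() - 1  # n = log2(N) for N a power of two
    return [tag_bit(destination, stage) for stage in range(n)]


def resolve_contention(tag_bits: list[int]) -> tuple[list[int], list[int]]:
    """Given the tag bits of the cells entering one SE, return (routed, discarded).

    Each group of output links (tag bit 0 or tag bit 1) offers a normal and an
    alternative link, so at most two cells per tag bit are routed.
    Assumption: cells are served in list order.
    """
    used = {0: 0, 1: 0}
    routed, discarded = [], []
    for cell_id, bit in enumerate(tag_bits):
        if used[bit] < 2:
            used[bit] += 1
            routed.append(cell_id)
        else:
            discarded.append(cell_id)
    return routed, discarded


print(bits_per_stage(0b110, 8))           # bits used by Stage 0, Stage 1, Stage 2
print(resolve_contention([1, 1, 1, 0]))   # three cells compete for the tag bit 1 links
```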
Here we inspect the switching mechanism that contributes to low cell loss and high
performance in BG networks, the backpressure strategy. The backpressure (BP) strategy
is a technique in which, by means of suitable backward signaling, the number of packets
actually switched to each downstream queue is limited to the current storage capability of
the queue, and collisions are avoided in the middle stages of a MIN; in this case
all the other HOL packets remain stored in their respective upstream queues [3]. The BG
network is self-routing, that is, the cells have prepended routing tags that tell the switch
fabric which output links the cells should be sent along. According to the BG network
switching mechanism, the self-routing tag specifies the required output port of the switch.
It is the output port number in reverse bit order. The bits are used, one per stage, by the
switching elements. In BG networks, a switching cycle is composed of two periods, a
reservation period (the routing period) and a relaying period. In the reservation period,
each input controller sends the self-routing tag representing the cell at the HOL of the
buffer that belongs to that input controller. The routing unit in each switching element
(SE) uses the arriving self-routing tag to set up a routing pattern for the main cell to be
relayed. This process continues until the tag reaches the output controller. The output port
acknowledges the arriving self-routing tag. A tag is positively acknowledged if 1) it
successfully reaches its output destination, and 2) there is room for the cell represented
by the self-routing tag in the buffer of its destination. The acknowledgment signal is returned
to the input controller by the switching mechanism. In the relaying period, the input
controller passes the body of the cell that was positively acknowledged during the
reservation period [7]. We will discuss this in Chapter 4.
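The acknowledgment rule of the reservation period can be expressed as a small Python sketch. It is a simplification under stated assumptions: the function name `reservation_period`, the data structures, and the order in which competing tags claim buffer space (ascending input number) are hypothetical and not specified in the text.

```python
def reservation_period(hol_destinations: dict, free_slots: dict, reached: set) -> dict:
    """Decide which HOL cells are positively acknowledged in one switching cycle.

    hol_destinations: input port -> destination output port of its HOL cell
    free_slots:       output port -> free entries in its output buffer
    reached:          input ports whose self-routing tag reached the output stage
    A tag is acknowledged only if it reached its destination AND the
    destination buffer still has room for the cell.
    """
    remaining = dict(free_slots)
    acks = {}
    for input_port, destination in sorted(hol_destinations.items()):
        ok = input_port in reached and remaining.get(destination, 0) > 0
        if ok:
            remaining[destination] -= 1  # reserve one buffer entry for this cell
        acks[input_port] = ok
    return acks


acks = reservation_period(
    hol_destinations={0: 5, 1: 5, 2: 3},
    free_slots={5: 1, 3: 2},
    reached={0, 1, 2},
)
print(acks)  # inputs with a positive acknowledgment relay their cell body next
```

Inputs that receive a negative acknowledgment keep their cell at the head of the input buffer and retry in a later cycle, which is how backpressure avoids losing cells inside the fabric.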
3.4 Properties
3.4.1 Hardware Complexity
Increasing the speed is the main purpose of research on ATM switch fabrics.
Therefore, minimum hardware complexity (HC) is desired in the hardware
implementation.
The HC of a network is the sum of the HCs of the SEs of the MIN. The HC of an SE
depends on the total number of connections between the input ports and the output ports.
In a BG network, there are four connections in an SE in the first stage and there are
4 x 4 connections in an SE in the intermediate stages. Figure 3.4 [14] shows these
connections.
[Figure 3.4 panels: connections in the first stage; connections in the internal stages]
Figure 3.4 Crosspoint Complexity in an SE
So the HC for a BG network is [14]:
$HC_{BG} = 1 \times 4 \times N + 4 \times 4 \times N \times (\log_2 N - 1)$
In the above HC calculation, the complexity of the output stage is ignored, because this
HC is given as the HC for the switch fabric, and the complexity of the buffer stage, that is,
the output stage, is usually not counted. Here, connections are the main concern for the
HC; in practice, some other factors are also taken into account. These factors may include
the number of I/O pins, the interconnection lines, and environmental parameters.
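As a worked example, the crosspoint count HC = 1 x 4 x N + 4 x 4 x N x (log2 N - 1) can be evaluated directly. The function name below is hypothetical; the formula is the one quoted from [14], with N SEs of four crosspoints in Stage 0 and N SEs of sixteen crosspoints in each of the remaining log2 N - 1 switching stages.

```python
def hardware_complexity_bg(n_ports: int) -> int:
    """Crosspoint count of an N x N BG switch fabric (buffer stage excluded)."""
    n = n_ports.bit_length() - 1
    if n_ports < 2 or (1 << n) != n_ports:
        raise ValueError("N must be a power of two")
    first_stage = 1 * 4 * n_ports               # N SEs with 1 x 4 crosspoints
    other_stages = 4 * 4 * n_ports * (n - 1)    # N SEs with 4 x 4 crosspoints per stage
    return first_stage + other_stages


for size in (8, 64, 1024):
    print(size, hardware_complexity_bg(size))
```

For N = 8 this gives 32 + 256 = 288 crosspoints, and the count grows as N log2 N rather than N squared.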
3.4.2 Fault Tolerance Properties
A fault-tolerant MIN is a network that is able to route packets from input ports to
the requested output ports, in at least some cases, even when some of its network
components (SEs, links) are faulty.
There are two main network components in BG networks: the SEs and the
links. The fault tolerance performance is discussed for these two components.
When a link fault occurs, the fault is reported to the SEs to
which the link is connected. The SE will route cells through the alternative link. If both the
normal and alternative links for a switching bit from an SE are faulty, the SE informs the SEs
in Stage (j-1) to which it is connected to modify their routing tables.
Figure 3.5 shows the dynamic rerouting in an 8 x 8 BG network in case of link
faults. For SE #1 in Stage 1, the first link used to route cells with tag bit 0 is broken. So
SE #1 will route cells with tag bit 0 to the alternative link. In SE #0 in Stage 1, both
links routing cells with tag bit 0 are broken. Then SE #0 notifies the four SEs to which it
is connected in Stage 0 to change their routing tables and reroute cells.
Figure 3.5 Dynamic Rerouting in an 8 x 8 BG Network in Case of Link Faults
We will now consider SE faults; these are similar to link faults with some minor
differences in terms of manifestation and correction. When an SE fault occurs, the fault
is reported to the four SEs connected to it in the previous stage so that they change their
routing tables.
The SEs in the stage preceding the fault will route cells through other SEs in the
same stage as the faulty SE. However, there exist some SE pairs such that the failure of both
members leads to loss of the full access property, that is, the property that there exists at
least one path between any input-output pair in case of faults. These pairs are called critical pairs.
Each SE, $SE_{i,j}$, forms a critical pair with exactly two other SEs, $SE_{(i+2^j) \bmod N,\,j}$ and
$SE_{(i-2^j) \bmod N,\,j}$, for $0 < j < n-1$.
A BG network can tolerate up to N/2 SE faults in each stage,
excluding the first and output stages, and it can tolerate up to $N/2 \times (\log_2 N - 2)$ faulty SEs in the
whole network. Figure 3.6 shows the dynamic rerouting in an 8 x 8 BG network in case
of SE faults. SE #1 in Stage 2 is faulty. This SE will notify the SEs to which it
connects in the previous stage, here SE #1, #3, #5 and #7. These SEs will change their
routing tables. For instance, if SE #1 in Stage 1 wants to send out a cell with tag bit 0, it
will route it to the alternative link that is connected to SE #5 in Stage 2, instead of the
normal output link, which is connected to SE #1 in Stage 2. Another case: SE #0 and #4 in
Stage 2 form a critical pair. If they are both faulty, then the network loses the full access
property. For instance, if SE #0 in Stage 1 wants to route a cell with tag bit 0, it cannot
choose either of the two links since both SE #0 and #4 are broken. So the cell cannot be
routed.
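The SE numbers in this example can be checked with a short Python sketch. It rests on two assumptions that are not taken verbatim from the text: SE i in Stage j links to SEs i, (i + 2^j) mod N, (i - 2^j) mod N and (i + 2^(j+1)) mod N in the next stage, and the critical partners of SE i in Stage j are the SEs at distance 2^j from it. The function names are hypothetical.

```python
def successors(i: int, j: int, n_ports: int) -> list[int]:
    """SEs in Stage j + 1 reachable from SE i in Stage j (assumed connection rule)."""
    step = 2 ** j
    return [i % n_ports, (i + step) % n_ports, (i - step) % n_ports, (i + 2 * step) % n_ports]


def predecessors(target: int, stage: int, n_ports: int) -> list[int]:
    """SEs in Stage stage - 1 that must be notified when SE target in Stage stage fails."""
    return sorted(i for i in range(n_ports) if target in successors(i, stage - 1, n_ports))


def critical_partners(i: int, j: int, n_ports: int) -> list[int]:
    """SEs in Stage j assumed to form a critical pair with SE i (distance 2**j)."""
    step = 2 ** j
    return sorted({(i + step) % n_ports, (i - step) % n_ports} - {i})


print(predecessors(1, 2, 8))        # SEs notified when SE #1 in Stage 2 fails
print(critical_partners(0, 2, 8))   # partner of SE #0 in Stage 2
print(critical_partners(0, 1, 8))   # an SE in a middle stage has two partners
```

Under these assumptions the sketch reproduces the SEs named in the example: SE #1 in Stage 2 is fed by SE #1, #3, #5 and #7, and SE #0 in Stage 2 is paired with SE #4.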
The grey line in Fig. 3.5 shows an example of rerouting for the routing path from SE
#3 (Stage 0), SE #2 (Stage 1), SE #2 (Stage 2), to SE #6 (Output Stage).
So it can be concluded that the BG network is a single fault-tolerant and robust MIN.
(A network is called single fault-tolerant if it can function as specified by its FT criterion
despite any single fault conforming to its fault model. If a network can tolerate any set of
i faults, then it is said to be i-fault tolerant. A network that can tolerate some instances of
i faults is said to be robust although it is not i-fault tolerant.)
Figure 3.6 Dynamic Rerouting in an 8 x 8 BG Network in Case of SE Faults
3.4.3 Reliability Analysis
High reliability is important for real-time communication systems. There are three
main measures that are used to evaluate the reliability performance of MINs. These are
terminal reliability, broadcast reliability and network reliability [14].
Terminal reliability (TR) is the probability that there exists at least one fault-free
path from a particular input port to a particular output port [14]. TR is always associated
with a terminal path, that is, a one-to-one connection between an input port (the source) and
an output port (the destination) [7].
Broadcast reliability (BR) is the probability that at least one fault-free path exists
from a particular input port to all the output ports [14]. BR is always associated with a
broadcast path, that is, a connection from one source to all destinations in the network [7].
Network reliability (NR) is the probability of maintaining full access capability
throughout the network. The BG network loses the full access property only when one or
more critical pairs fail [14].
Mean time to failure (MTTF) specifies the quality of a system and is the expected
time that a system will operate before the first failure occurs. The failure rates derived from
the MTTF of the BG network are used to evaluate the above three metrics. The results
obtained demonstrate that the BG network is robust and reliable. The detailed reliability
analysis is presented in [7].
3.5 Summary
Most aspects of the BG network have been briefly discussed: topology/structure, routing
algorithm, hardware complexity, fault tolerance, and reliability. The BG network is a
MIN based on 4 x 4 SEs. It adopts a distributed Reverse Distance Tag routing algorithm.
It is a single fault-tolerant MIN and is robust under multiple faults. It exhibits good
reliability. In summary, the BG network is an excellent candidate for switches in
broadband communication networks.
Chapter 4
INPUT CONTROLLER ARCHITECTURE DESIGN
FOR BG ATM NETWORKS
4.1 Introduction
In Chapter 2, we reviewed various popular schemes of input controllers for
switches. In this chapter, we discuss in detail our input controller (IC) design for BG
networks. We first give readers an overall picture of the entire IC architecture. Next, we
illustrate several key portions of the IC, namely, the input buffers, the routing table (RT)
memory, the cache, and the arbiter. For each we describe the structure, the
algorithm, the operation, and the reasons why we chose such a structure. At
the end of this chapter, we summarize the IC architecture design.
4.2 System Description
As we discussed in the previous chapter, the IC processes the look-up operation and
maps the VPI/VCI in the header of the incoming cell into the corresponding output
VPI/VCI. At this stage of the switch an internal routing tag is added to each cell.
Buffering of cells may also be provided by the IC to prevent cell collisions.
A simplified outline of the IC design for an N x N ATM switch is shown in Figure
4.1, which illustrates the data processing path for a stream of cells entering the IC on the
left and exiting the IC on the right.
[Figure 4.1: simplified outline of the IC design]
As we know, ATM is a cell-based technology. Data is segmented into 48-byte
payloads. A 5-byte header is attached to this payload to form a cell. The header has fields
that determine routing, flow control, error control, and other functions. The field that is of
interest here is the VPI/VCI (Virtual Path Identifier/Virtual Channel Identifier). These
VPI/VCI values are assigned to each section of a connection for the complete duration of
the connection. In the IC architecture, there is a central master routing table (RT) and
satellite caches of modest size holding recently used routes. Arbiter logic is used to
coordinate the requests from each input port cache to the RT. Each input port keeps an input
buffer. When a cell arrives at the IC, it is temporarily stored in the corresponding input
buffer. The header information is passed to the cache to look up all the information
necessary to forward the cell towards its destination. If a route is not in the satellite cache,
the cache requests the relevant route from the central table, the RT. The IC then replaces the
VPI/VCI and other routing information in the cell header, and forwards the cell to the
switch fabric that delivers cells to outgoing lines. In the following sections, we describe
each portion of the IC in detail.
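The lookup path just described, cache first and central RT only on a miss, can be sketched in Python as follows. This is an illustrative model rather than the actual design: the class `InputController`, the cell representation as a dictionary, the default cache size and the use of LRU replacement are all assumptions made for the example, and the arbiter is reduced to a counter of RT requests.

```python
from collections import OrderedDict


class InputController:
    """Model of one input port: a small satellite cache in front of the central RT."""

    def __init__(self, routing_table: dict, cache_size: int = 32):
        # routing_table maps incoming VPI/VCI -> (outgoing VPI/VCI, output port)
        self.routing_table = routing_table
        self.cache_size = cache_size
        self.cache = OrderedDict()
        self.rt_requests = 0  # requests that would have to pass through the arbiter

    def lookup(self, vpi_vci):
        if vpi_vci in self.cache:              # fast path: route found in the cache
            self.cache.move_to_end(vpi_vci)
            return self.cache[vpi_vci]
        self.rt_requests += 1                  # slow path: ask the central RT
        route = self.routing_table[vpi_vci]
        if len(self.cache) >= self.cache_size:
            self.cache.popitem(last=False)     # assumed LRU replacement
        self.cache[vpi_vci] = route
        return route

    def forward(self, cell: dict) -> dict:
        """Rewrite the header and attach the internal routing tag."""
        out_vpi_vci, output_port = self.lookup(cell["vpi_vci"])
        return {"vpi_vci": out_vpi_vci, "routing_tag": output_port, "payload": cell["payload"]}


rt = {(1, 100): ((2, 200), 6)}
ic = InputController(rt)
for _ in range(3):
    cell_out = ic.forward({"vpi_vci": (1, 100), "payload": b"ab"})
print(cell_out, ic.rt_requests)
```

Because all cells of a connection carry the same VPI/VCI, only the first cell of the burst reaches the central RT in this model; the rest are served from the cache.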
4.3 Input Buffer Module
In Chapter 1, we mentioned that there are several buffer strategies for ATM
switches. It has been verified that pure output buffering is clearly the best solution, but it
would require a prohibitively complex switch fabric. A pure input buffered strategy is not
acceptable, as performance will be throttled due to the HOL blocking problem. Instead,
hybrids of these buffering strategies lead to better performance. This
conclusion can be drawn from several studies reported in [6]. In BG ATM networks, we
use an input-output buffering strategy. We gain the capability of the input buffers to combat
internal and output blocking. We also gain the reduction of HOL blocking enjoyed by
the output buffering strategy [7]. Detailed information about the input-output buffering
for BG ATM switches can be found in [1]. The design of the output buffer portion of this
input-output buffering is also described there. In this section, we discuss the design and
the operation of the input buffer portion of the input-output buffering.
4.3.1 Input Buffer Operations
In the N-port switch that we describe here, each of the input ports has an input
buffer. It is a first-in-first-out (FIFO) queue. The input buffer interfaces to the
transmission system. When a cell arrives, the corresponding input buffer first checks if
the buffer is full. If the buffer is full of cells, then the incoming cell is rejected; in a
well-designed system, this occurs very rarely, for example, once in ten million times.
Otherwise, the cell is injected into the buffer. Meanwhile, a flag is prepended to the cell.
When this flag is set, the cell has gone through the routing table (RT) lookup; when the
flag is cleared to 0, the RT lookup is either in service or still waiting to be served.
The HOL cell is ready to be sent to the cache related to this input port to look up
routing information. When the lookup is done, along with the updated VPI/VCI value, a
self-routing tag is added to the cell. The self-routing tag values are also stored in the RT with the
updated VPI/VCIs. Meanwhile, the flag is set to indicate that the HOL cell has completed
the routing information lookup.
The HOL cell with the updated header is ready to be picked up by the switch fabric.
The input buffer waits for the acknowledgement from the switch fabric to send out the
HOL cell. Once the HOL cell is sent out, the cell next to the HOL cell in the FIFO buffer
moves to the HOL position and is ready for lookup in the next cycle.
4.3.2 Input Buffer Structure
The input buffers for an N x N switch are N FIFO queues, each with a fixed length
of 10 entries. This number has been determined based on our software simulation results
that are discussed in the next chapter. Buffering requires adding extra hardware to the
system in the form of memory elements and control circuits. The larger the buffer size,
the bigger the delay. In the next chapter, we will explain the simulation results for buffer
length. Each entry consists of a 1-bit flag, a 10-bit self-routing tag, and a 40-bit cell that
includes a 24-bit VPI/VCI and a 16-bit payload. Here, to simplify the hardware complexity,
we use a 2-byte payload instead of the standard 48-byte payload; that is, we use a 40-bit
cell instead of a 53-byte cell. However, this simplification does not affect the
performance characteristics we attempt to evaluate in our simulation. The important
information, the VPI/VCI, in a cell still remains as 24 bits. The input buffer structure is
shown in Table 4.1.
Table 4.1 Input Buffer Structure
Entry field:  Valid bit | Self-routing tag | VPI/VCI | Cell payload
Width:        1 bit     | 10 bits          | 24 bits | 16 bits
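The entry layout and the FIFO behaviour described in Section 4.3.1 can be modelled together in a few lines of Python. This is a sketch under stated assumptions: a 24-bit VPI/VCI is assumed, so that the VPI/VCI and the 16-bit payload together make up the 40-bit cell; the field order follows the table; and the names `pack_entry`, `unpack_entry` and `InputBuffer` are hypothetical.

```python
from collections import deque

FLAG_BITS, TAG_BITS, VPI_VCI_BITS, PAYLOAD_BITS = 1, 10, 24, 16  # 51 bits per entry


def pack_entry(flag: int, routing_tag: int, vpi_vci: int, payload: int) -> int:
    """Pack the four fields into one integer, flag in the most significant position."""
    word = flag
    word = (word << TAG_BITS) | routing_tag
    word = (word << VPI_VCI_BITS) | vpi_vci
    word = (word << PAYLOAD_BITS) | payload
    return word


def unpack_entry(word: int) -> tuple[int, int, int, int]:
    """Inverse of pack_entry: return (flag, routing_tag, vpi_vci, payload)."""
    payload = word & ((1 << PAYLOAD_BITS) - 1)
    word >>= PAYLOAD_BITS
    vpi_vci = word & ((1 << VPI_VCI_BITS) - 1)
    word >>= VPI_VCI_BITS
    routing_tag = word & ((1 << TAG_BITS) - 1)
    flag = word >> TAG_BITS
    return flag, routing_tag, vpi_vci, payload


class InputBuffer:
    """FIFO input buffer with a fixed number of entries (10 in the text)."""

    def __init__(self, depth: int = 10):
        self.depth = depth
        self.queue = deque()

    def enqueue(self, vpi_vci: int, payload: int) -> bool:
        """Store an arriving cell with its flag cleared; reject it if the buffer is full."""
        if len(self.queue) >= self.depth:
            return False
        self.queue.append(pack_entry(0, 0, vpi_vci, payload))
        return True

    def complete_lookup(self, new_vpi_vci: int, routing_tag: int) -> None:
        """Update the HOL cell after the lookup and set its flag."""
        _, _, _, payload = unpack_entry(self.queue[0])
        self.queue[0] = pack_entry(1, routing_tag, new_vpi_vci, payload)

    def dequeue(self) -> int:
        """Send out the HOL cell once the switch fabric has acknowledged it."""
        return self.queue.popleft()


buf = InputBuffer()
buf.enqueue(vpi_vci=0x000164, payload=0x1234)
buf.complete_lookup(new_vpi_vci=0x0002C8, routing_tag=6)
print(unpack_entry(buf.dequeue()))
```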
4.4 Cache Memory Module
Cache memory is the key component in our IC. We want an efficient cache
structure optimized for routing. Based on our study of the ATM/Internet traffic attributes
and our simulation experiments (reported in the next chapter), we build our cache as a
one-level 32-block fully associative cache. Each input port has such a cache, so in total
there are N caches for an N x N BG switch.
4.4.1 Cache Structure
To build up our cache, we need to address several issues:
One-level or multi-level cache?
Shared or distributed cache?
Cache organization: direct mapped cache? Fully associative cache? Or set
associative cache?
The number of cache entries and the mapping scheme.
We only consider a one-level cache rather than a multi-level cache memory, which
leads to more searching time and higher hardware complexity. We choose distributed
caches rather than a commonly shared cache for the whole switch. This is based on the
observation that a commonly shared cache would cause a severe bottleneck, since
the cache is a frequently accessed component. Also, the entries in a cache are naturally
associated with connections, which correspond directly to input ports.
To decide the cache organization, we have carried out much study. Firstly, how
well the cache works depends on the localization of the data. We have introduced the
concepts of spatial locality and temporal locality in Chapter 2. The study of Internet
traffic streams indicates that they tend to exhibit a low degree of spatial locality,
especially when they are occurring on as large and diverse a medium as an Internet
router. Once a switch is determined to be on the virtual path or virtual channel (that is, on
the route of a session), several hundreds or thousands of cells belonging to that
connection will pass through this switch for the duration of the connection. However, if
one particular VPI/VCI field is currently accessed, there is no greater probability for the
adjacent VPI/VCI to be accessed in the near future than any other VPI/VCI, that is,
spatial locality is not present. However, there is a degree of temporal locality. This
observation tells us that the cache entry size should be small. Since all of the blocks in an
entry are referenced by the same tag value, they rely on spatial locality to be useful; low
spatial locality means we should give up a large entry size. Here, we consider only one
block in a cache entry. A cache is valuable provided it achieves at least a modest hit rate.
To compare different organizations, we built a software simulator to evaluate
the performance of various cache organizations. We tested a fully associative cache, a
direct mapped cache, a two-way set associative cache alone and combined with a victim
cache, and a four-way set associative cache alone and combined with a victim cache. The victim
cache contains only blocks that are discarded from a cache because of a miss (the "victims")
and is checked on a miss to see if it has the desired data before going to the next
lower-level memory. If the data is found there, the victim block and the cache block are
swapped. The simulation
results (given in Chapter 5) show that a fully associative cache gains much better
performance in terms of cache hit rate, hardware complexity and time consumption.
Besides, a fully associative cache has the most flexibility in mapping cache blocks.
Detailed performance analysis can be found in Chapter 5 of this thesis. Caches are costly,
so caches should be small yet efficient. We have tested different numbers of entries, such as
16, 32, 64, 128, 256, etc., and compared the resulting cache hit rates. Finally, we decided
that each cache should have 32 entries.
Based on our study and analysis, we finally conclude that, for an N x N switch,
there are N caches, one for each input port. They are all fully associative with 32 entries
each. Each entry has only one block. The fields in a cache entry are described as follows:
valid_bit (1 bit)
It indicates whether an entry in the cache is valid or not in these pre-setup connections.
tag (24 bits)
It is a cell's incoming VPI/VCI and serves as a tag to search for cache entries.
VPIVCI_o (24 bits)
It indicates a cell's outgoing VPI/VCI related to the tag in the same entry.
output_num (10 bits)
In the performance analysis of switch fabrics with port number from 8 to 1024,
output_num is 10 bits. In the hardware implementation of an 8 x 8 switch fabric,
output_num is 3 bits.
access_num (5 bits)
It serves as an index of the access number of the given entry. Since there are 32
entries in a cache, 5 bits are needed.
So, one cache has 8 bytes x 32 = 256 bytes. In an 8 x 8 switch fabric, the total cache size
is 8 x 256 bytes. Table 4.1 depicts the cache organization in the IC.
Table 4.1 Cache Organization
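The size arithmetic above can be checked with a short sketch. It only adds up the field widths listed for a cache entry (using the 10-bit output_num of the performance analysis) and rounds up to whole bytes; the constant and function names are illustrative and do not come from the thesis simulator.

```cpp
#include <cstddef>

// Field widths in bits, as listed in the cache entry description.
constexpr std::size_t kValidBits = 1;
constexpr std::size_t kTagBits = 24;
constexpr std::size_t kVpiVciOutBits = 24;
constexpr std::size_t kOutputNumBits = 10;  // 3 bits in the 8 x 8 hardware version
constexpr std::size_t kAccessNumBits = 5;

constexpr std::size_t kEntriesPerCache = 32;

// Total number of bits in one cache entry.
constexpr std::size_t entryBits() {
    return kValidBits + kTagBits + kVpiVciOutBits + kOutputNumBits + kAccessNumBits;
}

// Entry size rounded up to whole bytes.
constexpr std::size_t entryBytes() { return (entryBits() + 7) / 8; }

// Size of one per-port cache.
constexpr std::size_t cacheBytes() { return entryBytes() * kEntriesPerCache; }

// Total cache size for a switch with `ports` input ports (one cache per port).
constexpr std::size_t totalCacheBytes(std::size_t ports) { return ports * cacheBytes(); }

static_assert(entryBits() == 64, "1 + 24 + 24 + 10 + 5 bits fill exactly 8 bytes");
```

With these widths an entry is exactly 64 bits, so a 32-entry cache takes 256 bytes and an 8 x 8 switch needs 2048 bytes of cache in total.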
4.4.2 Cache Operation
The caches are initialized at the network pre-setup stage. All valid bits are cleared
to 0. During the switching cycles, the incoming VPI/VCIs from the input buffers are used
as tags to match entries. All the n caches of the n input ports work in parallel. If there is a
cache hit, then the data in the matched entry can be accessed. The updated VPI/VCI, as
well as the output port number, is returned to the related input buffer. If there is a cache
miss instead, we start the main RT lookup operation. The RT lookup operation will be
explained in the next section. The data returned from the RT, namely the updated VPI/VCI and the
output port number, are sent to the input buffer. Meanwhile, the data returned from the RT
replaces one of the cache blocks.
With respect to the above mentioned "replace", among the various replacement
algorithms reviewed in Chapter 2, what we use in the IC design is the least-recently used
(LRU) algorithm. In LRU, to reduce the chance of throwing out information that will be
needed soon, accesses to blocks are recorded. The block replaced is the one that has been
unused for the longest time. It can be seen here that LRU makes use of a corollary of
locality: if recently used blocks are likely to be used again, then the best candidate for
disposal is the least-recently used block. Obviously this is consistent with the Internet
traffic attribute, the temporal locality. The core of the LRU algorithm is to record the cache
block access numbers. In a practical hardware implementation, the access number must
be finite and reasonably small. A dynamically calculated value is assigned to the access
number. Let us first assume the access number of the block currently accessed to be k.
Next we set the access number of this block to 0 and increment the access numbers of all
other blocks that currently have an access number less than k. Those blocks whose
access numbers are equal to or greater than k remain unchanged. In this way, we keep the access
numbers always less than 32, that is, the number of blocks in the cache. This process is shown in Table
4.2.
Initial      After the nth      After the (n+1)th    After the (n+2)th
             cache access       cache access         cache access
...          ...                ...                  ...
k-2          k-1                k-1                  k-1
k-1          k                  k                    k
k            0                  0                    1
k+1          k+1                k+1                  k+1
k+2          k+2                k+2                  k+2
...          ...                ...                  ...
30           30                 30                   30
31           31                 31                   31

Initial: the 32 cache blocks with their access numbers listed in order.
The nth cache access: the cache block with access number k is accessed. After the
operation, k changes to 0, the access numbers 0 to k-1 each increase by 1, and the access
numbers k+1 to 31 do not change.
The (n+1)th cache access: once again, that cache block is accessed. All the access
numbers stay the same.
The (n+2)th cache access: the cache block with access number 1 is accessed.
Table 4.2 Cache Block Access Number Updating Process
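The access-number rule can be sketched in a few lines of C++. This is a minimal illustration, not the thesis code: the names touch and initialAccessNumbers are hypothetical, and the sketch assumes the 32 blocks start with distinct access numbers 0 to 31, which is the state the table takes as its starting point.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kBlocks = 32;  // blocks per cache, as chosen in the text

// Assumed starting state: block i holds access number i (all numbers distinct).
std::array<int, kBlocks> initialAccessNumbers() {
    std::array<int, kBlocks> a{};
    for (std::size_t i = 0; i < kBlocks; ++i) a[i] = static_cast<int>(i);
    return a;
}

// Apply the update rule when the block at index `hit` is accessed:
// blocks with an access number below k move up by one, the accessed
// block drops to 0, and blocks at or above k keep their numbers.
void touch(std::array<int, kBlocks>& accessNum, std::size_t hit) {
    const int k = accessNum[hit];
    for (std::size_t i = 0; i < kBlocks; ++i) {
        if (i != hit && accessNum[i] < k) ++accessNum[i];
    }
    accessNum[hit] = 0;
}
```

Because only numbers below k are incremented and the accessed block takes the freed value 0, the numbers stay a permutation of 0 to 31 under this assumption, so 5 bits per entry suffice and the block holding 31 is always the least recently used one.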
4.5 Routing Table Module
The RT data structure contains all the information necessary to forward cells
toward their destination. There is a single RT in each ATM switch. When the RT receives
a lookup request, it processes the lookup operation and returns the result.
4.5.1 Routing Table Structure
Our analysis shows that complicated RT structures, such as multi-level memory or
sorted memory, are not required for this primary storage. Thus, memory access times and
processing time are minimized when the RT structure is as simple as possible, that is, the
RT has a low depth in its storage structure. Instead, an efficient cache is the main concern
with respect to the performance gain.
To determine the RT size, let us do some calculation first:
For a 32 x 32 ATM switch, if all traffic is voice traffic (bandwidth 64 kbps) coming
at the line rate of 2.5 Gbps, and a traffic load of 0.7, then the number of VPI/VCs is
$$\frac{2.5\ \text{Gbps} \times 32 \times 0.7}{64\ \text{kbps}} = 875{,}000 \ \text{(VPI/VCs)}$$
Another example: For an 8 x 8 ATM switch, if video traffic with bandwidth 1 Mbps
is coming at the line rate of 2.5 Gbps, and a traffic load of 0.8, then the total VPI/VCs set up is
$$\frac{2.5\ \text{Gbps} \times 8 \times 0.8}{1\ \text{Mbps}} = 16{,}000 \ \text{(VPI/VCs)}$$
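Both examples follow the same formula: aggregate offered bit rate divided by the bandwidth of one connection. A minimal sketch of that calculation follows; the function name maxConnections is hypothetical and all rates are expressed in bits per second.

```cpp
#include <cmath>

// Number of connections that can be active at once on a switch:
// (line rate x number of ports x traffic load) / per-connection bandwidth.
// All rates are in bits per second; the result is rounded to the nearest integer.
long long maxConnections(double lineRateBps, int ports, double load, double connectionBps) {
    return std::llround(lineRateBps * ports * load / connectionBps);
}
```

For instance, maxConnections(2.5e9, 32, 0.7, 64e3) gives 875,000 for the voice example and maxConnections(2.5e9, 8, 0.8, 1e6) gives 16,000 for the video example.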
From the above two examples, we can see that, theoretically, for the 24-bit index
VPI/VCI, there should be 2^24 = 16,777,216 entries in the RT; however, the VPI/VCs set up
at a given time for a particular switch are very few in practice. The 2^24 VPI/VCs
denote the traffic all over the world. We also learn from the above examples that the VPI/VC
number is defined by the traffic type, line rate, traffic load, as well as the switch fabric
size N. Another point we should note is that the speed of memory degenerates as the size
increases, placing another limit on the maximum memory size. In other words, the switch
capacity limits the traffic volume passing through. Therefore, keeping a routing table of
2^24 entries is impractical and prohibitive in terms of memory size. We need only consider
those VPI/VCs that are set up for a particular switch. Based on a study of the industry
experience, for example, the Lucent Cajun A500 ATM Switch and the Cisco LightStream 1010
Multiservice ATM Switch can each cope with 32,000 VCs [30] [31]; the Nortel
Networks Centillion 100 ATM LAN Switch can support more than 10,000 VCs [32].
We have decided to support a maximum of 20,000 VPI/VCs at any given time. So, the
designed maximum number of RT entries is 20,000.
Obviously, RT memory has almost the same fields as cache memory, except that it has no
'access_num' field, which indicates the cache block access numbers. The fields of an entry
in the RT are described as follows:
valid_bit (1 bit)
It indicates whether an entry in the RT is valid or not in these pre-setup connections.
VPIVCI_i (24 bits)
It indicates a cell's incoming VPI/VCI value.
VPIVCI_o (24 bits)
It indicates a cell's outgoing VPI/VCI value related to the incoming VPI/VCI in the
same entry.
output_num (10 bits)
In the performance analysis of switch fabrics with port number from 8 to 1024,
output_num is 10 bits. In the hardware implementation of an 8 x 8 switch fabric,
output_num is 3 bits.
Thus, the RT is estimated to occupy around 160k bytes (20,000 entries of 8 bytes each).
Table 4.3 depicts the organization of the RT.
4.5.2 Routing Table Operation
The RT is initialized at the network pre-setup stage. The data is updated and valid
bits are set for new connections and cleared for previous connections.
During switching cycles, if the RT receives a look-up request, the field 'VPIVCI_i' of
all the RT entries with valid bit '1' is compared with the incoming VPI/VCI. The RT
finds the matched entry and returns the result to the relevant cache that issued the request.
If no matched entry is found, then the request is discarded.
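The lookup just described can be sketched as a linear scan over the valid entries. The field names follow the RT entry description; the type name RtEntry and the function rtLookup are hypothetical, and a sequential loop is only an assumption made for clarity (the text does not say whether the hardware compares entries in parallel).

```cpp
#include <vector>

// One routing table entry, with the fields listed in Section 4.5.1.
struct RtEntry {
    bool valid_bit;   // entry belongs to a connection that is currently set up
    long VPIVCI_i;    // incoming VPI/VCI
    long VPIVCI_o;    // outgoing (updated) VPI/VCI
    int output_num;   // output port number of the switch fabric
};

// Compare the incoming VPI/VCI against every valid entry. On a match the
// outputs are filled in and true is returned; otherwise the outputs are left
// untouched and false is returned, which corresponds to a discarded request.
bool rtLookup(const std::vector<RtEntry>& rt, long incoming,
              long* vpivciOut, int* outputNum) {
    for (const RtEntry& e : rt) {
        if (e.valid_bit && e.VPIVCI_i == incoming) {
            *vpivciOut = e.VPIVCI_o;
            *outputNum = e.output_num;
            return true;
        }
    }
    return false;
}
```

An entry whose valid bit is cleared never matches, even if its stored VPI/VCI equals the incoming one.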
4.6 Arbiter Logic Module
One concern with our IC architecture is the coordination of a set of lookup requests
arriving at the same time. Thus, an arbitration logic, or arbiter, is needed to determine which one
of the requests from the caches should be granted access to the RT. There are several
possible arbitration mechanisms, such as simple round robin, a specified port assigned,
and others. After studying the traffic attributes and switch architecture, we have chosen the
Random Port Number Select algorithm for reasons of fairness and simple hardware
implementation. The arbitration algorithm is described using an example of an 8 x 8
switch as follows:
1) Since there are 8 caches, there can be a maximum of 8 simultaneous cache
misses. So the maximum number of RT lookup requests is 8. C0 - C7 denote the eight
caches, and R0 - R7 denote the eight requests.
2) Now, let us assume that in a given cycle, C0, C2, and C6 are the only three
caches that have cache misses and issue RT lookups. The requests are R0, R1,
and R2, respectively.
3) Generate a random number between 0 and 7. Suppose the random number is
110 (binary) = 6 (decimal).
4) Then, 6 mod 3 = 0. So, select R0, which corresponds to C0, to grant in this cycle.
5) In the next cycle, following the same rule, the arbiter selects another request to
grant from those issued by C2 and C6, and so on.
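The selection step in the example above reduces to a modulo operation over the list of pending requests. The sketch below illustrates that step only; the function name grant is hypothetical, and the random number is passed in by the caller so that the example stays deterministic.

```cpp
#include <vector>

// `missingCaches` lists, in ascending order, the indices of the caches that
// issued an RT lookup in this cycle (C0, C2, C6 in the example). The request
// at position (randomNumber mod number-of-requests) is granted.
// Returns the index of the granted cache, or -1 if there is no request.
int grant(const std::vector<int>& missingCaches, unsigned randomNumber) {
    if (missingCaches.empty()) return -1;
    return missingCaches[randomNumber % missingCaches.size()];
}
```

With requests from C0, C2 and C6 and random number 6, grant({0, 2, 6}, 6) returns 0, matching step 4. Note that a 3-bit random number reduced modulo 3 is not perfectly uniform: positions 0 and 1 are each chosen with probability 3/8 and position 2 with probability 2/8.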
4.7 Summary
The IC architecture is a cache-based memory hierarchy. The key function of the IC
is to process the RT lookup operation. The design proposed in this chapter combines
many beneficial considerations in terms of switch fabric architecture, broadband network
traffic characteristics, memory attributes, speed, hardware complexity, etc.
In the end, we make a rough comparison of the IC with one commercial product
discussed in Chapter 2, the BBN MGR:
1) The MGR switch is input-buffered. All inputs have a separate FIFO for each
output. A scheduler is dedicated to guarantee that there is no HOL blocking and the
switch operates with full throughput [10]. The BG switch utilizes input-output buffering
with mainly output buffering. In the IC for the BG network, every input port has a small
FIFO buffer. HOL blocking is not a major concern because of the largely output
buffering strategy.
2) The MGR is pipelined. So a switch allocator is dedicated to form a switch
pattern. The allocator has to choose the way in which N x N possible input-output
pairings are to be connected to serve all connections effectively. A limitation of this is
that it cannot make one-to-many transfers. Thus, for multicast switches, multicast packets
are copied to each line card separately. The IC of this thesis is a non-pipelined structure.
So no allocator or scheduler hardware is needed. This largely simplifies the hardware.
Multicast can be realized flexibly, either by duplicating cells inside switch elements
in the switch fabric, or by inserting a copy network before the switch fabric.
3) The MGR is implemented using a RISC processor and features three levels of
cache, holding the code, the cached route entries, and the complete forwarding table,
respectively. This structure is similar to the 2-level cache-memory hierarchy of the IC,
which follows an ASIC design flow. But no cache performance result of the MGR is reported.
So we cannot continue the comparison.
In the next chapter, we will show some simulation results to assess the IC design.
Chapter 5
PERFORMANCE ANALYSIS OF THE INPUT CONTROLLER
5.1 Introduction
In this chapter, we discuss the performance of our Input Controller (IC), the design
of which was described in Chapter 4. The parameters used to measure the performance
are cell loss ratio, cache hit rate, cache miss rate, input buffer requirements, etc. We use
both uniform and non-uniform traffic patterns to study the IC performance. Among these
patterns, we choose URT and bursty traffic loads. The organization of this chapter is as
follows: we first provide a survey of traffic patterns in broadband communication
networks. We mainly focus on the URT and bursty traffic loads. Next, we describe our
simulator software built for the performance analysis. The simulator also includes the
traffic load generator module. Then come the simulation results and analysis under the
URT and bursty traffic loads. We finally conclude with the results of the performance
analysis for the IC.
5.2 Traffic Modeling
5.2.1 Traffic Patterns in Broadband Communication Networks
In communication systems, throughput is a key parameter in performance
analysis. It is defined as [14]:
$$\text{throughput} = \frac{\text{total number of delivered cells}}{\text{total number of inputted cells}} \quad \text{(in a given period of time)}$$
Since an ATM network is a unified network integrating various services, such as video,
voice, data, and other payloads, the performance analysis should be carried out under different
traffic types. Studying the traffic patterns used in the switch fabric of broadband packet
switch architectures could give a better understanding of the performance of the IC
system. It may not be possible to simulate realistic traffic loads exactly. However,
certain traffic patterns may help in understanding performance under realistic traffic
patterns.
The traffic patterns expected in broadband packet switch architectures can be
classified into two categories, uniform traffic patterns and non-uniform traffic patterns
[14]. We present several popular traffic patterns under these two categories.
5.2.1.1 Uniform Traffic Patterns
The two main uniform traffic patterns that are being studied are permutation traffic
and uniform random traffic.
Permutation Traffic
The permutation traffic pattern refers to the case where the output ports requested
by packets arriving at the switch in the same time slot are distinct from one another. It is
referred to as permutation traffic because, at full load, the list of requested destinations
ordered by input lines forms a permutation of the set {0, 1, ..., N-1}.
The amount of internal blocking in the MIN is distinctly visible under the permutation
traffic pattern. In the case of permutation traffic, at full load, only one packet is destined
to each destination. So packets will be lost only due to internal blocking in the network
and not due to output contention at a particular destination.
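A full-load permutation pattern for one time slot can be produced by shuffling the list of output ports, as in the sketch below. The function name permutationTraffic is hypothetical, and the use of the standard Mersenne Twister generator is an assumption of the example rather than a detail taken from the thesis simulator.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Returns dest, where dest[i] is the output port requested by the cell on
// input line i. Every input carries a cell (full load) and every output port
// is requested exactly once, so no output contention can occur.
std::vector<int> permutationTraffic(int n, std::mt19937& rng) {
    std::vector<int> dest(n);
    std::iota(dest.begin(), dest.end(), 0);       // 0, 1, ..., n-1
    std::shuffle(dest.begin(), dest.end(), rng);  // random permutation
    return dest;
}
```

Any cell loss observed under such a pattern can then be attributed to internal blocking alone.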
Uniform Random Traffic
The uniform random traffic (URT) pattern refers to the case where each output port
has equal probability of being requested by packets arriving at the input ports. As
destination addresses can be repeated, it is referred to as uniform random traffic.
The performance of MINs used in multiprocessor interconnections and in the switch
fabric of broadband packet switch architectures is usually studied under uniform random
traffic. It is a much simpler traffic pattern to analyze than the non-uniform
traffic types, yet a much more realistic traffic pattern than the permutation
traffic pattern described above.
5.2.1.2 Non-uniform Traffic Patterns
Much research has been carried out and is still ongoing to determine the traffic
types closer to the real traffic. Although it is quite difficult to model, simulate and
analyze the exact traffic expected in broadband communication networks, researchers
have come up with certain tractable non-uniform traffic types, and these patterns are more
realistic than simple permutation traffic and uniform random traffic.
The most widely studied non-uniform traffic types are hotspot traffic, community of
interest traffic and bursty traffic.
Hotspot Traffic
Hotspot traffic is defined as the traffic type in which one or more nodes of the
switch fabric or of the destinations receive a given percentage of the incoming packets.
The rest of the incoming traffic is of the type uniform random. These nodes or
destinations are referred to as hotspots. Hotspot traffic is also referred to as output-
concentration traffic.
Community of Interest Traffic
The traffic pattern of community of interest traffic is described in such a way that
a certain percentage of the traffic arriving at a certain input(s) is always directed to
certain output(s). The remaining traffic originating at these inputs is of the uniform
random traffic type. Uniform random traffic type is expected at all other inputs.
In this traffic type, certain output request patterns cause degradation of throughput
primarily due to contention on internal links as opposed to output conflicts. This occurs
when the number of community of interest input-output pairs increases. Due to this,
more path conflicts occur in the intermediate stages of the MIN. The throughput
degradation is more pronounced in the case of large sized networks.
Bursty Traffic
Bursty traffic is a traffic type in which the inputs of the switch receive sudden
bursts of packets. A popular method to approximate bursty traffic is as follows: the traffic
source on each input port alternates between active and idle periods with two independent
geometric distributions of predefined average burst lengths. The number of continuous
cycles for which the active period has cells is called the burst length. The idle period is also referred to as a
burst gap as it occurs between two active periods that have bursty traffic. Cells arriving
at an input port within the same burst are always directed to the same output port and are
separated by fixed or random spacing. It is assumed that the active periods have a
probability p and thus the idle periods a probability of 1-p. The burst length for an input
port is not constant because the burstiness is assumed to be caused by the different
payloads that are to be integrated in broadband communications.
Here are some attributes of bursty traffic. The gap between bursts under full load
bursty traffic is assumed to be zero. It is also assumed that the distribution of burst
lengths is the same for all bursts arriving on any given input port, and burst lengths and
gaps between bursts are drawn independently from geometric distributions with mean L
packets/burst and L packet slots/idle period. The output port requested by a burst is
assumed to be uniformly distributed over all the output ports, independent of all other
bursts entering the switch.
5.3 Simulator Software
To evaluate the performance of the designed IC, we have developed simulator
software using the C++ language. The simulator has two main capabilities, traffic
generation and IC function simulation. The goal of the simulation is twofold. One is to
ascertain that the designed architecture has good performance; the other is to gain a clear
understanding of the relationship between the various modules in the IC structure, which
facilitates hardware implementation. We now discuss the simulation in detail.
5.3.1 Traffic Generation
Among the traffic patterns mentioned in Section 5.2, we select URT as the
representative of the uniform traffic patterns and bursty traffic as the representative of the
non-uniform traffic patterns to evaluate the IC performance. We select URT because it is
simple, feasible to model and more realistic than the permutation traffic pattern. It can be a
good starting point to study the effect of traffic patterns. As for the bursty traffic, since the
expected traffic loads in broadband communication networks may mix voice, video and
data, etc., the bursty traffic is flexible enough to accommodate most of the existing traffic
sources with reasonable accuracy.
It is obvious that the variability of the bit rate and the connection mode (connection-
oriented or connectionless) are important characteristics of an end-to-end path. Besides,
timing information is another important factor that needs to be preserved from source to
destination. Different traffic types have various requirements for the above three factors. ATM
has the ability to guarantee class of service to different applications, that is, to map
different information into cells. This means that with decent planning, users with very
different application needs (voice, video, several kinds of computer data) can use the
same network links and be satisfied with the service, and thus save money. This is why
ATM is referred to as the 'unification technology'. According to the ATM Forum, which is a
commercial association of ATM equipment manufacturers and researchers, and the
Internet Engineering Task Force or IETF, which is the Internet's standardization body,
the ATM traffic can be subdivided into four classes in terms of the three criteria [15]
[19]:
(1) Constant Bit Rate (CBR) sources.
This type consists mainly of voice and some video sources, e.g. telephony, voice
mail, and some telemedicine applications. CBR sources require a sustained amount of
bandwidth, low latency, and a low cell delay jitter.
(2) Variable Bit Rate (VBR) sources.
This type is mainly the result of multimedia applications. It models applications that
generate traffic in bursts, rather than in a smooth stream. The peak bit rate and the
allowed 'burstiness' of the traffic will have been prearranged between the network
provider and the user. VBR applications can tolerate higher delays and higher delay
variations than can CBR sources.
(3) Available Bit Rate (ABR) sources.
These mainly include computer related data sources, such as file transfer, electronic
mail, terminal emulation, or LAN emulation (LANE) over ATM. These sources do
not require any specific bandwidth and have widely varying latency requirements and
cell delay jitter. Instead of defining in advance the bandwidth that a customer can use,
the system provides flow control from the network to limit the information flow from
the customer's terminal to a rate that the network can accept. The bandwidth available
to the customer is therefore subject to change as the network congestion changes.
This allows the network provider to exploit otherwise unused bandwidth on the
network.
(4) Unspecified Bit Rate (UBR) sources.
A UBR source neither specifies nor receives a bandwidth, delay, or loss guarantee.
This corresponds to 'best effort' traffic. Currently, the Internet provides only this
type of traffic, and does not facilitate quality guarantees.
Traffic modeling is the first step of traffic generation. It should be noted that by the
term 'model' for a traffic source we shall be referring to an algorithm giving the
generation time X_i of the ith cell, for i = 1, 2, ...; almost always, the X_i's will be taken as
random variables [16].
For URT generation, the only parameter we need to know is the traffic load rate.
This traffic load rate determines whether or not there is a cell at an input line in a given
cycle.
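One way to realize this in code is a Bernoulli trial per input line and cycle, followed by a uniform choice of output port when a cell is present. The sketch below is an illustration under that assumption; the function name urtCycle and the use of -1 to mark an idle input are hypothetical and not taken from the thesis simulator.

```cpp
#include <random>
#include <vector>

// Generates one switching cycle of uniform random traffic for an n x n switch.
// With probability `load` an input line carries a cell, and that cell requests
// an output port chosen uniformly from 0..n-1. An idle input is marked by -1.
std::vector<int> urtCycle(int n, double load, std::mt19937& rng) {
    std::bernoulli_distribution hasCell(load);
    std::uniform_int_distribution<int> pickOutput(0, n - 1);
    std::vector<int> dest(n, -1);
    for (int i = 0; i < n; ++i) {
        if (hasCell(rng)) dest[i] = pickOutput(rng);
    }
    return dest;
}
```

Over many cycles the fraction of occupied input slots approaches the traffic load rate, and several inputs may request the same output in one cycle, which is what distinguishes URT from permutation traffic.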
Many models exist to describe bursty traffic. The most general model of an ATM
traffic source would be the one that takes into account the entire history of cell generation
so as to determine the time for the next cell to be generated. Such a model would be very
complicated to describe, as well as very hard to fit to actual sources; it would also be
analytically intractable. The model we use is the popular ON/OFF model.
According to the ON/OFF model [17] [18], during the lifetime of a virtual
connection, the cell stream from a single ATM source is modeled as a succession of
active and silence periods. Cell generation occurs, or the source transmits cells, during
the active periods, i.e. the ON periods, as opposed to the OFF (silence/idle) periods, which
involve no cell generation. Different active and/or silence periods are assumed
independent of each other. The cells generated during the same ON period form a burst.
In almost all of the related work, each of these periods is taken either as an exponential or
a geometric random variable, depending upon the choice of the time axis as either
continuous or slotted. More general distributions can also be assumed. The length of the
active period is denoted by L_on, while that of the silence period is denoted by L_off. During
the active period, one cell is generated every cycle time T, with the first cell emerging T
after the beginning of this period. A scenario of cell generation pertaining to an ON/OFF
traffic source is depicted in Figure 5.1 [16].
Figure 5.1 The ON/OFF Source Model
In our work, the lengths of ON and OFF periods are independently evaluated from
geometric distributions given by [17] [7]:
1- = I ... r::=;: -11.
ill
i2J
Where: L_on is the ON period length in cells,
L_off is the OFF period length,
0 ≤ R < 1 is the random number generated,
0 < p < 1 is the inverse of the average period length in cells, and
0 < ρ < 1 is the network traffic load.
It is obvious from the above equations that L_on and L_off are equal to or greater than 0. But the
lengths should be equal to or greater than 1, that is, L_on ≥ 1 and L_off ≥ 1. In the
experimenfs. we found that Ihe ON-OFF model could be strictly followed only if L"., ~ 3.
l.otr ~ J. Also in order to regulate the generated traffic as close to lhe assigned traffic load
as possible. the above equalions should be modified as follows:
t- =.a+ r::=:: -11.
L.,jJ= i 1+1...0. {~l1.
i3i
Therefore, the bursty model used here is a variation of the conventional ON-OFF model.
The burst length for an input port is not constant because the burstiness is assumed
to be caused by the various payloads that are to be integrated in broadband
communications. The cells arriving at each input port in a burst are destined to the same
output port. The output is selected randomly.
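For orientation, a conventional slotted ON/OFF source can be sketched with geometric period lengths. This is the textbook variant and not the modified equations used in this work: the type OnOffSource, its members, and the choice of mean OFF length L(1 - ρ)/ρ (so that the long-run load equals ρ for mean burst length L) are all assumptions of the sketch, and it allows an OFF period of zero slots.

```cpp
#include <random>

// Conventional ON/OFF source with geometrically distributed period lengths.
// Hypothetical sketch: requires meanBurst > 1 and 0 < load < 1.
struct OnOffSource {
    double meanBurst;  // mean ON period length in cells
    double load;       // target long-run traffic load

    // ON length: at least 1 cell, with mean meanBurst.
    // std::geometric_distribution(p) yields k >= 0 with mean (1 - p) / p.
    int nextOn(std::mt19937& rng) const {
        std::geometric_distribution<int> g(1.0 / meanBurst);
        return 1 + g(rng);
    }

    // OFF length: at least 0 slots, with mean meanBurst * (1 - load) / load,
    // so that E[ON] / (E[ON] + E[OFF]) equals the target load.
    int nextOff(std::mt19937& rng) const {
        const double meanOff = meanBurst * (1.0 - load) / load;
        std::geometric_distribution<int> g(1.0 / (1.0 + meanOff));
        return g(rng);
    }
};
```

A generator built on this sketch would alternate nextOn and nextOff for each input port and pick one random output port per burst, so that all cells of a burst share the same destination.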
Despite its simplicity, the ON/OFF model can be used for traffic modeling of
broadband communication networks. There is a general belief that, with the exception of
CBR sources, all other traffic sources exhibit this ON/OFF behavior. For CBR sources,
there is no idle period [18]. Also, given its analytical tractability, the ON/OFF source
model is by far the most popular in performance evaluation work [16].
5.3.2 Input Controller Simulation
We now proceed to outline the IC simulation software.
The main data structures used are:
ATM cell structure
struct cell_struct {
    long VPIVCI;      // the 24-bit VPI/VCI in the ATM header
    int routingTag;   // (log N)-bit self-routing tag, the reverse of the
                      // output port number of the switch fabric
};
Cache structure
struct cache_struct {
    bool valid_bit;   // 1-bit valid bit
    long tag;         // 24-bit tag
    long VPIVCI_o;    // 24-bit updated VPI/VCI
    int output_num;   // 10-bit output port number
    int access_num;   // 5-bit record of cache accessing times
};
Routing table structure
struct memory_struct {
    bool valid_bit;   // 1-bit valid bit
    long VPIVCI_i;    // 24-bit incoming VPI/VCI
    long VPIVCI_o;    // 24-bit updated VPI/VCI
    int output_num;   // 10-bit output port number
};
Since this simulator is developed in modular fashion, the IC functions are simulated
in several main procedures:
• void cache_initial (cache_struct *cache)
// This procedure performs the cache initialization at the IC system start. The
valid bit and cache access number are reset.
• void RT_initial (long *header_in)
// This procedure initializes the RT at the IC system start. The values of the
pairs of the incoming VPI/VCI and the updated VPI/VCI are assigned to the RT
entries, as well as the output port numbers. All this routing information is
assumed to be given by the higher network-level routing mechanism. The valid
bit is set.
• bool lookup (long address_y, const int portNum, cache_struct *Pcache, long
*VPIVCI_o, int *output_num)
// This procedure implements the cache lookup while the IC system is running.
The incoming VPI/VCI is used as the tag to look up in the cache. On a cache hit,
that is, when the tag matches the VPI/VCI in a certain cache block and the valid bit of
the corresponding cache block is set, the updated VPI/VCI and the output port
number in that cache block are returned. Meanwhile, the cache access numbers of
all the cache blocks are updated following the given rule. A cache miss occurs when
either the tag does not match any VPI/VCI in the cache, or the corresponding
valid bit is not set even though the tag matches the VPI/VCI. When a cache miss
happens, the IC system first uses the LRU algorithm to find the cache block number
to be replaced later. Then, the incoming VPI/VCI is used as the tag to process a lookup
in the RT. The updated VPI/VCI and the output port number that are found are returned. At
the same time, this found content is written back to the cache block whose block
number was calculated previously using the LRU algorithm. The cache access numbers of
all the cache blocks are updated following the given rule.
• bool memory_op (long address_y, long *VPIVCI_o, int *output_num)
// This procedure implements the RT lookup function while the IC system is running.
When an IC cache miss happens, the system switches to the RT lookup procedure. The
incoming VPI/VCI is used as the tag to check through all the RT entries. If one
entry matches, the updated VPI/VCI value and the output port number in the
corresponding RT block are returned. Otherwise, an invalid VPI/VCI message is
output.
• int LRU (cache_struct *Lcache)
// This procedure implements the LRU algorithm when a cache miss happens
while the IC system is running. This is done by searching for the least-recently used
cache block number based on the records of the cache access numbers. The
corresponding cache block number is returned.
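The victim search performed by the LRU procedure can be sketched as follows. This is a hypothetical illustration rather than the simulator's code: the type CacheEntry mirrors the cache fields listed earlier, the function name lruVictim is invented, and preferring an invalid (empty) block over any valid one is an assumption of the sketch. Under the access-number rule of Chapter 4, the least-recently used block is the valid block with the largest access number.

```cpp
#include <cstddef>
#include <vector>

// Cache entry with the fields described for the IC cache.
struct CacheEntry {
    bool valid_bit;
    long tag;
    long VPIVCI_o;
    int output_num;
    int access_num;
};

// Returns the index of the block to replace on a cache miss.
// Assumption: an invalid block is reused first; otherwise the block with
// the largest access number, i.e. the least recently used one, is chosen.
// The cache is expected to be non-empty.
std::size_t lruVictim(const std::vector<CacheEntry>& cache) {
    std::size_t victim = 0;
    for (std::size_t i = 0; i < cache.size(); ++i) {
        if (!cache[i].valid_bit) return i;
        if (cache[i].access_num > cache[victim].access_num) victim = i;
    }
    return victim;
}
```

After the replacement, the new block would be treated as just accessed, so its access number is updated by the same rule as on a hit.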
To evaluate the IC performance, some important parameters are necessary to model
the traffic and simulate the IC using the developed simulator software. They are:
Switch fabric size N:
The simulator can handle switch fabric size N from 8 to 1024.
VC/VP numbers:
The simulator can handle a maximum of 20,000 VC/VPs set up at any
given time.
Network traffic load ρ:
The simulator can handle network traffic load ρ from 0.1, 0.2, ..., to 0.9.
Switching cycle number.
Traffic type:
Currently the traffic types that the IC simulator uses are URT and bursty traffic.
Mean burst length, if bursty traffic:
This parameter is expected to be a positive integer.
5.4 Simulation Results and Analysis
In this section, we analyze the performance of the input buffer and the cache-
memory hierarchy described in Chapter 4 based on the simulation results. The simulation
of the designed input buffer is done in conjunction with the BG switch fabric simulator,
which has been developed at Memorial University of Newfoundland to evaluate the
performance of the BG switch fabric. The criteria are the input buffer number, the output
buffer number, and the cell loss rate, under different traffic loads. The simulation of the
IC is done with the simulator described in Section 5.3. The main evaluation criterion is
the cache hit rate.
5.4.1 Input Buffer Number
According to the description in Chapter 4, the input buffer number is expected to be
as small as possible, so that the hardware cost can be eased while high system
performance can still be gained.
Tables 5.1 - 5.4 depict the number of input buffers and output buffers tested under
100,000 switching cycles, under different URT traffic loads for a switch fabric with N
ranging from 8 to 64. The results imply that, for BG networks, only small numbers of
input buffers would be sufficient to achieve high system performance if traffic can be
assumed to be URT. Even though the number of input buffers increases a little as the
traffic load and/or the size N grow, the number of input buffers is still very small.
In the above simulations, we use the following procedure to find the proper number of
input buffers while keeping the cell loss rate as small as possible, ideally zero. We
explain it with the example of a 0.5 URT traffic load for an 8 x 8 switch. First, we
choose 5000 input buffers, 5000 output buffers and 1000 cycles. We then run the simulation
five times to find out the output buffer numbers needed from the simulation reports. Next,
still using 1000 cycles, we run the simulation five times with the output buffer number
selected in the above step. By so doing, we can determine the input buffer numbers needed.
Finally, we run the simulation at 100,000 cycles, with the input buffer number and output
buffer number selected in the above two steps, and check the cell loss ratio. We apply
proper adjustments to both the input buffer number and the output buffer number, if
necessary, and repeat the simulations until the cell loss ratio is as small as required.
We repeat this procedure for 0.6, 0.7, 0.8, and 0.9 URT loads for an 8 x 8 switch fabric.
Then we can determine the input buffer number needed for an 8 x 8 switch fabric under URT
traffic. We determine the input buffer numbers needed for 16 x 16, 32 x 32, and 64 x 64
switch fabrics under URT traffic in the same way. Table 5.5 and Figure 5.2 summarize the
number of input buffers under URT for different sizes of switches.
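
The three-step sizing procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the simulator interface `run(n_in, n_out, cycles)`, which is taken to return the buffers actually used and the cell loss ratio, is hypothetical, and the "+1" adjustment rule in step 3 is an assumption, since the text only says that proper adjustments are applied.

```python
from typing import Callable, Tuple

# Hypothetical simulator interface (an assumption, not the thesis simulator):
# run(n_in, n_out, cycles) -> (input buffers used, output buffers used, cell loss ratio)
SimFn = Callable[[int, int, int], Tuple[int, int, float]]


def size_buffers(run: SimFn, big: int = 5000, short: int = 1000,
                 long: int = 100_000, runs: int = 5,
                 target_loss: float = 0.0, max_rounds: int = 100) -> Tuple[int, int]:
    """Return (input buffers, output buffers) that meet the loss target."""
    # Step 1: oversized buffers, short runs; record the output buffers needed.
    n_out = max(run(big, big, short)[1] for _ in range(runs))
    # Step 2: keep the selected output buffers; record the input buffers needed.
    n_in = max(run(big, n_out, short)[0] for _ in range(runs))
    # Step 3: long run with the selected sizes; adjust until the loss is acceptable.
    for _ in range(max_rounds):
        _, _, loss = run(n_in, n_out, long)
        if loss <= target_loss:
            return n_in, n_out
        n_in += 1   # assumed adjustment rule: grow both buffers by one
        n_out += 1
    raise RuntimeError("no buffer sizes met the loss target")
```

The same function would be called once per traffic load and switch size to fill one row of a table such as Table 5.1.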
Using the simulation method introduced for Tables 5.1 - 5.4, we measure the number of
input buffers and output buffers tested under 100,000 switching cycles, under different
bursty traffic loads, with mean burst length of 15, for switches with N ranging from 8 to
64. Tables 5.6 - 5.9 show the corresponding results. Table 5.10 and Figure 5.3 summarize
the results. The simulation results imply that only a small number of input buffers is
needed for BG switches such that a small cell loss ratio can be obtained when the network
traffic loads are 0.8 or lower. Traffic loads of 0.9 and above require inordinately high
numbers of input buffers, as HOL blocking dominates in these cases. Therefore, under
bursty traffic conditions, the load must be kept at 0.8 or lower so that reasonably sized
input buffers can be employed.

Load    No. of Input Buffer    No. of Output Buffer
0.5     1                      9
0.6     2                      9
0.7     2                      14
0.8     3                      [illegible]
0.9     3                      41

Table 5.1 Number of Input Buffers and Output Buffers under Different URT Traffic Loads
for Switches with N = 8

Load    No. of Input Buffer    No. of Output Buffer
0.5     2                      9
0.6     2                      14
0.7     3                      [illegible]
0.8     3                      26
0.9     6                      47

Table 5.2 Number of Input Buffers and Output Buffers under Different URT Traffic Loads
for Switches with N = 16

Load    No. of Input Buffer    No. of Output Buffer
0.5     2                      11
0.6     2                      21
0.7     3                      20
0.8     4                      50
0.9     7                      92

Table 5.3 Number of Input Buffers and Output Buffers under Different URT Traffic Loads
for Switches with N = 32

Load    No. of Input Buffer    No. of Output Buffer
0.5     2                      18
0.6     3                      22
0.7     4                      30
0.8     5                      42
0.9     9                      100

Table 5.4 Number of Input Buffers and Output Buffers under Different URT Traffic Loads
for Switches with N = 64

Switch Size \ Load    0.5    0.6    0.7    0.8    0.9
8 x 8                 1      2      2      3      3
16 x 16               2      2      3      3      6
32 x 32               2      2      3      4      7
64 x 64               2      3      4      5      9

Table 5.5 Number of Input Buffers under URT for Different Sizes of Switches

Figure 5.2 Number of Input Buffers under URT Traffic for Different Sizes of Switches

Load    No. of Input Buffer    No. of Output Buffer
0.5     13                     89
0.6     11                     13[illegible digit]
0.7     12                     131
0.8     14                     211
0.9     1502                   1309

Table 5.6 Number of Input Buffers and Output Buffers under Bursty Traffic with Mean Burst
Length = 15 under Different Traffic Loads for Switches with N = 8

Load    No. of Input Buffer    No. of Output Buffer
0.5     14                     94
0.6     12                     111
0.7     15                     171
0.8     27                     298
0.9     1534                   2690

Table 5.7 Number of Input Buffers and Output Buffers under Bursty Traffic with Mean Burst
Length = 15 under Different Traffic Loads for Switches with N = 16

Load    No. of Input Buffer    No. of Output Buffer
0.5     13                     89
0.6     16                     119
0.7     15                     141
0.8     29                     292
0.9     1703                   40[illegible]

Table 5.8 Number of Input Buffers and Output Buffers under Bursty Traffic with Mean Burst
Length = 15 under Different Traffic Loads for Switches with N = 32

Load    No. of Input Buffer    No. of Output Buffer
0.5     14                     94
0.6     13                     139
0.7     16                     147
0.8     32                     317
0.9     1908                   4500

Table 5.9 Number of Input Buffers and Output Buffers under Bursty Traffic with Mean Burst
Length = 15 under Different Traffic Loads for Switches with N = 64

Switch Size \ Load    0.5    0.6    0.7    0.8    0.9
8 x 8                 13     11     12     14     1502
16 x 16               14     12     15     27     1534
32 x 32               13     16     15     29     1703
64 x 64               14     13     16     32     1908

Table 5.10 Number of Input Buffers under Bursty Traffic for Different Sizes of Switches
with Mean Burst Length = 15

Figure 5.3 Number of Input Buffers under Bursty Traffic with Mean Burst Length = 15
for Different Sizes of Switches
5.4.2 Cache Performance
As discussed in Chapter 4, we aim at building high performance caches. The higher the
cache performance is, the better the IC system performance is. Cache hit rate is the major
criterion to evaluate the cache performance. Figure 5.4 depicts the cache hit rate
comparisons under various traffic types and switch fabric sizes N, given a traffic load of
0.8 and 9000 VPI/VCIs.
For example, for a 32 x 32 switch fabric, when bursty traffic comes, the cache hit rate is
22.22% if the mean burst length is 1, 82.53% if the mean burst length is 5, and 96.22% if
the mean burst length is 20. When URT traffic comes, the cache hit rate is 22.57%.
Theoretically, URT is a special case of bursty traffic with mean burst length 1. But in
traffic modeling, we have modified the bursty traffic model to strictly follow the ON-OFF
model. Therefore, we can see a small difference between the hit rate of 22.22% for bursty
traffic with mean burst length 1 and the hit rate of 22.57% for URT. From Figure 5.4, we
can conclude that, for a given switch under bursty traffic, the longer the burst length
is, the higher the cache hit rate is. This is logical since all the cells in a certain
burst have the same output VPI/VCI.
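
The relation between burst length and hit rate can be illustrated with a small Python sketch. It is not the thesis simulator: it assumes a single input port with a 32-block fully associative LRU cache, geometrically distributed burst lengths, and VPI/VCIs drawn uniformly at random for each new burst. The function name `hit_rate` and all parameters are hypothetical, and the sketch is not expected to reproduce the exact percentages quoted above.

```python
import random
from collections import OrderedDict


def hit_rate(mean_burst: float, n_vpivci: int, blocks: int = 32,
             cells: int = 20_000, seed: int = 1) -> float:
    """Fraction of cells whose VPI/VCI is found in an LRU cache of `blocks` entries."""
    rng = random.Random(seed)
    cache: OrderedDict = OrderedDict()   # keys ordered from least to most recently used
    hits = 0
    current = rng.randrange(n_vpivci)
    for _ in range(cells):
        # Assumed burst model: a new burst (new VPI/VCI) starts with probability 1/mean_burst.
        if rng.random() < 1.0 / mean_burst:
            current = rng.randrange(n_vpivci)
        if current in cache:
            hits += 1
            cache.move_to_end(current)       # mark as most recently used
        else:
            if len(cache) >= blocks:
                cache.popitem(last=False)    # evict the least recently used block
            cache[current] = True
    return hits / cells
```

For instance, calling `hit_rate(1, 281)`, `hit_rate(5, 281)` and `hit_rate(20, 281)` (281 being roughly 9000 VPI/VCIs spread over 32 ports, an assumed split) gives increasing hit rates, because every cell after the first in a burst is a guaranteed hit.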

Figure 5.5 Cache Hit Rates under Various VPI/VCI Number and N
Figure 5.5 depicts the cache hit rate comparisons under various VPI/VCI numbers and switch
fabric sizes, given bursty traffic with mean burst length 5 and traffic load 0.8. The
curves in this figure imply that, for a given switch fabric under certain bursty traffic,
if the number of VPI/VCIs increases, the cache hit rate decreases. Nevertheless, the cache
hit rates are kept very high at more than 80%, for all sizes N and for as many as 9000
VPI/VCIs, even for burst length 5.

From Figure 5.4 and Figure 5.5, it can also be concluded that the cache performs better in
larger switches because, for the same number of total VPI/VCIs, larger N means fewer
VPI/VCIs at each port, and temporal locality is more strongly obeyed.

According to the studies on Fix West (a commercial US core router, a major inter-exchange
point in the Internet) core routers, the hit rate for the 12000-entry centralized cache
under realistic traffic can reach 95% [19], and in the industry, a hit rate of 50% is
still an optimistic rate, e.g., in the Ascend GRF 400 Multi-gigabit IP Switch [20]. From
our results, under bursty traffic with burst length = 10, which is close to realistic
traffic, a hit rate of 95% can be achieved. The cache hit rate cannot reach 100% due to
cache compulsory misses, that is, system cold start misses or first reference misses.
These results show that the designed cache architecture works well. The Input Controller
performance should be improved when cache hits occur most of the time. It is also
interesting to note that the cache performance improves with increase in burstiness. This
is reasonable because temporal locality is followed better in the bursty case. Therefore,
the IC design proposed here works especially well if the anticipated traffic is bursty.
5.5 Summary
Evaluation of performance is very time-consuming because it requires huge data processing
to simulate the realistic situation. The numerical results in this chapter have
demonstrated the efficient performance of the IC. The simulation results show that, with
very small numbers of input buffers, the BG networks can still achieve low cell loss
rates. The simulation also shows that the designed cache-based IC architecture works
efficiently and the system performance can be improved when the cache performance is good.
The simulator not only helps us figure out the details of the interactions between
different pieces of the design, but also works as a tool to decide the cache and the
memory size. Only with a working simulation can a full hardware design be undertaken.
Chapter 6
HARDWARE IMPLEMENTATION OF THE IC
6.1 Introduction
In this chapter we introduce the hardware implementation of the input controller (IC) for
BG ATM networks. We first describe the hardware implementation methodology recommended by
Canadian Microelectronics Corporation (CMC). Then we illustrate the architectural design
of the IC in detail. Following the design flow, we describe the overall picture of the IC
architecture and then the structure of each module in the design hierarchy. Finally, we
discuss the simulation and synthesis for the individual modules and for the whole IC
system.
6.2 Hardware Implementation Methodology
In broadband communication hardware designs, achieving high speeds is among the primary
missions. So, the design method of application-specific integrated circuits (ASIC) is
applied to the design process. A chip designed for a particular product or application is
called an ASIC. An ASIC has many advantages that make it a good fit in our design: it
reduces the total component and manufacturing cost of a product by reducing chip count,
physical size, and power consumption, and it offers higher performance, etc. [21]. Field
programmable gate arrays (FPGAs) are another integrated circuit technology which features
lower quantity production costs and faster turnaround design time. However, an FPGA cannot
meet the speed requirements in some broadband communication applications.
The ultimate goal of our work is to allow a seamless transition from algorithm design and
architecture specification to the final IC implementation. Our approach to achieve this
goal is to follow the design flow recommended by Canadian Microelectronics Corporation
(CMC) for deep submicron (DSM) technologies. As depicted in Figure 6.1 [21], using
Synopsys VSS (VHDL System Simulator), the first simulation is performed to verify that the
function of the very-high-speed integrated circuits hardware description language (VHDL)
behavioural register transfer level (RTL) model matches the specified requirements. If the
requirements are met, then the VHDL RTL model can be synthesized to produce a VHDL
gate-level circuit description. If the requirements are not met, then it will be necessary
to modify the RTL model, and/or relax the specifications, and then repeat the simulation.
The modification loop must be repeated as many times as necessary to acquire a
satisfactory simulation.

VHDL (RTL) -> RTL Simulation -> Synthesis -> Gate-Level Simulation

Figure 6.1 Design Flow Recommended by CMC
The IC system is described and modeled using VHDL at RTL level. VHDL is well established
in the hardware design community and is known as a strong specification language.
Initially targeted to circuit-level synthesis, tools and methodologies have evolved to
synthesize designs from the behavioural level [23]. Simulation and synthesis tools,
VHDLAN and DESIGN_ANALYZER, supplied by Synopsys, are powerful tools used in the
implementation. The VHDL simulator in V-System, supplied by Model Technology, is another
helpful tool we use.

In the IC implementation, we put great effort into modeling of the system and individual
partitions. This is based on the observation that, although highly significant, synthesis
has played a secondary role to the description language definition. VHDL has very powerful
constructs for simulating the hardware; however, in part because of its low-level modeling
features, automatic refinement of models has proven to be elusive at the very high levels
of abstraction that are needed for the analysis of cost/performance trade-offs [23].
Additionally, it is reported in [24] that, in the ATM switch design that was performed,
almost all of the errors recorded occurred during behavioral modeling of the ASIC.

In the following sections in this chapter, we show how we use this state-of-the-art design
language and these tools in the IC implementation.
6.3 Input Controller Architecture
As noted before, one of the most important aims of ATM switch design is to find a fast
routing table lookup technique for the high-speed communication networks to increase
speed, capacity and overall performance. So, we choose hardware implementation for the IC
design instead of software implementation. Obviously, hardware implementation has the
advantage in terms of speed compared with software implementation.

We choose a non-pipelined architecture for the IC. The principle of using a pipelined
architecture is that the whole system can be nearly evenly divided into several small
stages, each of which completes in one clock cycle. In the IC, the most time-consuming
operations are the cache lookup and the RT lookup. Other operations consume much less
time. If we choose a pipelined architecture for the IC, the divided stages will not be
balanced in terms of timing. The small stages will spend the same working clock cycle as
the "heavy stages". Therefore, the whole system's processing time might not improve.
Besides, the system controller will be complicated. Rather than risking this mistake, we
choose a non-pipelined architecture.
The design also features a modular style. It is recommended in the Synopsys documentation
that large designs not be imported directly to the synthesis tool. Instead, a hierarchical
bottom-up approach should be followed. This is because importing large designs can crash
the synthesis tool and in some cases may result in an unoptimized design [7]. Accordingly,
we partition the architecture of the IC into modules (see Appendix A). Each module is
designed, tested, simulated, and synthesized individually. These modules are then used as
building blocks in forming the final IC design. Besides, the modular style has many other
merits. First, the debugging process is simpler when building high-level modules from
low-level ones. Second, it simplifies hardware, thus reducing its cost and increasing its
speed. Third, the modules can be reused in future designs [7]. Thus, a top-down approach
is used in design, but a bottom-up approach in simulation and synthesis.
Figure 6.3 shows the IC system diagram. Before continuing this discussion, we present some
background information. In Figure 6.3, the cell streams come into the IC system from the
left side. In commercial networks, serial data arriving at fiber-optic transmission lines
is converted from light signals to electrical signals using readily available equipment.
Another kind of equipment, the Universal Asynchronous Receiver-Transmitter (UART), is used
to convert the serial data stream into a parallel data stream on a bus of a certain bit
width, which then feeds the data into the corresponding input port in the IC. These
processes are shown in Figure 6.2.

Here, as discussed in Chapter 4, for simplicity, we assume a 40-bit cell. In practice, a
cell contains 53 bytes. The UART is usually not designed to send out 53 bytes at a time,
but 8/16/32/64 bits in parallel, with some Start, End, and Parity check bits. Thus, the
input port side needs to deal with these control bits and collect all 53 bytes in a cell.
Figure 6.3 shows the IC hardware implementation for an 8 x 8 ATM switch. It is composed of
two groups of modules. One is the input buffer module group. There are eight identical
Input Buffer (IB) modules, which receive incoming cells, temporarily store cells, and send
out updated cells to the switch fabric; the other is the input controller center (IC
Center), which performs the cell header lookup function. The IC Center can be further
divided into several sub-modules. The IC Center architecture is shown in Figure 6.4. The
main sub-modules in the IC Center are a set of Lookups, the routing table (RT), and the
Arbiter. The Lookups are cache memories performing routing information lookup for the
incoming VPI/VCI. The RT performs the RT lookup for the VPI/VCI whose Lookup request is
granted. The Arbiter coordinates the RT lookup requests from multiple Lookups. The input
signals are the system clock, system preset, and cell streams to eight input ports. The
output signals are updated cell streams to the switch fabric.

In the next section, we will present each component in detail.

Figure 6.3 Input Controller (IC) System Block

Figure 6.4 IC Centre Architecture
6.4 System Components
In this section, we present each module in the IC system.
6.4.1 Input Buffer
The Input Buffer (IB) receives the cells, sends the VPI/VCIs to be looked up, packs the
cells again with the returned VPI/VCIs, and waits for the acknowledgements from the switch
fabric before processing the next cell. It is a FIFO queue with length 10. The interface
of the IB is shown in Figure 6.5. An updated cell in the IB is composed of 44 bits,
including a 1-bit flag, a 3-bit self-routing tag, a 24-bit output VPI/VCI, and a 16-bit
payload.

[Figure 6.5: interface of the Input Buffer. Legible labels: cell input; updated cell
output (payload + VPI/VCI + self-routing tag + flag); 25-bit header to the IPC (VPI/VCI +
flag); clk; preset.]
6.4.2 IC Center
Figure 6.4 illustrates the IC Center architecture.
6.4.2.1 Lookup
The Lookup maps the cache blocks to find the routing information and sends it back to the
IB. If a cache miss occurs, the Lookup requests the routing information from the RT, loads
the returned content from the RT and sends it back to the IB. Figure 6.6 depicts the
Lookup cache architecture.
The Lookup is composed of several components. The main components are:
c_regfile_0
It holds incoming VPI/VCIs for the mapping process. It also has valid bits to combine with
the VPI/VCI in the mapping. The replacement on a cache miss is under the control signals
'n_req' and 'grant'. The written back block is indexed by the signal 'hitBlockNum'.
Figure 6.7 depicts the interface of the c_regfile_0. The 1-bit flag indicates whether or
not the signal 'vpivcU' is looked up.
24-bit comparator
This is the core component in the mapping process. The incoming 24-bit VPI/VCI is compared
with the stored VPI/VCIs. Figure 6.8 shows the wired-up 24-bit comparator. The application
of the 24-bit comparator, the comparison portion in the IC, can be seen in Figure 6.9.

Figure 6.8 24-bit Comparator

Figure 6.9 Comparison Portion in the Lookup Cache
c_comp_decision
This component decides whether there is a cache miss or hit. If the valid bit is set and
the 24-bit comparator sends '1', then the Lookup cache hits at that block, and
c_comp_decision outputs that 5-bit block number; if the AND of the valid bit and the
result of the 24-bit comparator is '0' for every block, then a cache miss occurs and a
1-bit cache miss signal is sent out. The interface of the c_comp_decision can be seen in
Figure 6.9.
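
The hit/miss decision can be modeled in a few lines of Python. This is a behavioural sketch only: the hardware compares all blocks in parallel, whereas the loop below visits them one by one, and the function name `comp_decision` and its list-based arguments are hypothetical.

```python
from typing import List, Optional, Tuple


def comp_decision(valid: List[bool], stored: List[int],
                  vpivci: int) -> Tuple[bool, Optional[int]]:
    """Return (miss, hit_block) for one 24-bit VPI/VCI lookup."""
    key = vpivci & 0xFFFFFF  # only the 24-bit VPI/VCI takes part in the comparison
    for block, (v, tag) in enumerate(zip(valid, stored)):
        # A block hits only if its valid bit AND its comparator output are both 1.
        if v and tag == key:
            return False, block
    return True, None  # no block matched: raise the cache miss signal
```

Note that a stored value that happens to equal the incoming VPI/VCI does not produce a hit while its valid bit is cleared, which is what makes the compulsory misses after reset possible.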
c_regfile_1
This component is used in the LRU algorithm. It records and updates the access numbers of
the blocks in the Lookup cache. It updates the access numbers based on the hit block
number if the miss signal is cleared to 0; it calculates the replace block number if the
miss signal is set. Figure 6.10 depicts the interface of the c_regfile_1.
cache_MEM
This component is the cache memory housing the 24-bit updated VPI/VCIs and 3-bit output
port numbers. Since there are 32 cache blocks, a 5-bit address is applied. In the cache
memory, (24 + 3) = 27 bits occupy 4 bytes. So, in total there are 32 x 4 = 128 bytes,
which results in a 7-bit address for the cache_MEM. So, a 2-bit shifter component is added
on the path before the hit block number or the replace block number is loaded as the
address into the cache_MEM. Under the control signals, if the cache hits, the memory
content indexed by the signal 'addr' is read out; if the cache misses, the content of
signal 'matched_data' is loaded into the cache memory where signal 'addr' points to.
Figure 6.11 depicts the interface of the cache_MEM.
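
The address arithmetic can be checked with a short Python sketch. It assumes that each entry is padded to a power-of-two number of bytes so that a block number becomes a byte address by a left shift; the helper name `entry_layout` is hypothetical.

```python
from typing import Tuple


def entry_layout(entries: int, bits_per_entry: int) -> Tuple[int, int, int, int, int]:
    """Return (bytes per entry, shift, total bytes, index bits, address bits)."""
    bytes_per_entry = 1
    while bytes_per_entry * 8 < bits_per_entry:
        bytes_per_entry *= 2                    # pad to a power of two (assumption)
    shift = bytes_per_entry.bit_length() - 1    # log2 of the entry size in bytes
    total_bytes = entries * bytes_per_entry
    index_bits = (entries - 1).bit_length()     # bits needed for the block number
    addr_bits = (total_bytes - 1).bit_length()  # bits needed for the byte address
    return bytes_per_entry, shift, total_bytes, index_bits, addr_bits


def byte_address(block: int, shift: int) -> int:
    """Byte address of a block: the block number shifted left by `shift` bits."""
    return block << shift
```

For the 32-block cache with 27-bit entries, `entry_layout(32, 27)` gives 4 bytes per entry, a shift of 2, 128 bytes in total, a 5-bit block number and a 7-bit address. The same helper applied to a 50-entry table gives 200 bytes, a 6-bit entry number and an 8-bit address.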
6.4.2.2 Routing Table
The requesting VPI/VCIs from eight Lookups are tied together and connected to the RT
through a tri-state bus. In this structure, only one VPI/VCI at a given cycle can be gated
onto the bus and forwarded to the RT for the lookup process. Figure 6.12 gives the
tri-state symbol and the truth table [25]. Figure 6.13 depicts the tri-state bus we use in
the design. Figure 6.14 depicts the RT architecture. In the hardware implementation, we
simplify the RT entry number to 50 in order to complete the synthesis step within a
reasonable time.

Figure 6.13 Tri-state Bus
The RT is partitioned into several components. The main components are:
RT_regfile
It holds incoming VPI/VCIs for the mapping process. It also has valid bits to combine with
the VPI/VCI in the mapping. Figure 6.15 depicts the interface of the RT_regfile.

[Figure 6.15: interface of the RT_regfile. Legible signals: VPI/VCI, rt_read_vpivci(0) to
rt_read_vpivci(49), rt_v_bit(0) to rt_v_bit(49), preset.]
24-bit comparator
This core component is exactly the same as the one in the IC Center. Refer to the circuit
diagram of the 24-bit comparator in Figure 6.8. Figure 6.16 shows the comparison portion
in the RT architecture.

The decision component determines which RT entry is mapped. If the valid bit coming from
the RT_regfile is set and the result of the 24_bit_comp is '1', then that entry is chosen.
The entry number is output. Figure 6.16 illustrates the component interface.
RT_MEM
This component is the RT memory housing the 24-bit updated VPI/VCIs and 3-bit output port
numbers. Since there are 50 RT entries, a 6-bit address line is applied. In the RT memory,
(24 + 3) = 27 bits occupy 4 bytes. So, in total there are 50 x 4 = 200 bytes, which
results in an 8-bit address for the RT_MEM. So, a 2-bit shifter component is added on the
path before the entry number is loaded as the address into the RT_MEM. The memory content
indexed by the signal 'rt_addr' is read out. Figure 6.17 depicts the RT_MEM interface.
6.4.2.3 Arbiter
The job of the Arbiter is to coordinate the Lookup cache miss requests and to grant only
one request at a given clock cycle to the RT memory following a specified rule. Figure
6.18 shows the interface of the Arbiter.

[Inputs: req<7:0>, rand<2:0>, clk, preset. Output: grant<7:0>.]

Figure 6.18 Interface of the Arbiter
At any given cycle, the Arbiter checks the eight 'req' signals from the Lookup caches. If
no 'req' signal is set, then none of the eight 'grant' signals is issued. If only one
'req' is set, then the Arbiter sets the relevant grant signal. If more than one 'req' is
set, then the Arbiter reads the signal 'rand' and uses this number to decide which one to
grant. At the next cycle, the Arbiter first resets the 'grant' that was set in the last
cycle to 0 and then continues the arbitration corresponding to the current cycle. The
'preset' signal is used to invalidate the internal variables and signals in the Arbiter at
the IC reset stage.
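
A minimal Python model of one arbitration cycle is sketched below. It takes the modulo rule from the simulation example in Section 6.5 (random number MOD number of requests) and adds one assumption that the text does not state: the result selects the k-th pending request in ascending cache order. The function name `arbitrate` is hypothetical.

```python
from typing import List


def arbitrate(req: List[bool], rand: int) -> List[bool]:
    """Return the grant vector for one cycle (at most one grant is set)."""
    pending = [i for i, r in enumerate(req) if r]  # caches requesting the RT
    grant = [False] * len(req)
    if pending:
        # With a single request the modulo is always 0, so that request wins.
        k = rand % len(pending)
        grant[pending[k]] = True   # assumed mapping: k-th pending cache
    return grant
```

With all eight requests pending, `arbitrate([True] * 8, 1)` grants cache #1. With five pending requests that include cache #0 and a random number of 5, the result 5 MOD 5 = 0 grants cache #0. If `rand` only takes the values 1 to 7, index 0 is never selected while all eight requests are pending.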
A pseudo random number generator (PRG) serves in the arbitration as the pseudo random
number provider, using a linear feedback shift register (LFSR). The use of an LFSR to
generate pseudo random numbers is well documented in the general literature. An LFSR is a
shift register that, when clocked, advances the signal through the register from one bit
to the next most-significant bit. Some of the outputs are combined in exclusive-OR
configuration to form a feedback mechanism. An LFSR can be formed by performing
exclusive-OR on the outputs of two or more of the flip-flops together and feeding those
outputs back into the input of one of the flip-flops [26]. Provided a suitable feedback
connection is used, an LFSR produces a pattern count equal to 2^n - 1, where n is the
number of register elements in the LFSR [27]. In the Arbiter, the maximum number of
requests is 8, so n = 3. The generated pattern count is 2^3 - 1 = 7. Figure 6.19 [26]
shows the PRG, that is, a 3-bit LFSR. Table 6.1 lists the patterns produced by the LFSR in
Figure 6.19, assuming that a pattern of 111 is used as a seed.

Figure 6.19 Pseudo Random Number Generator (PRG) Structure

Clock Pulse    FF1_out    FF2_out    FF3_out    Comments
1              1          1          1          Seed value
2              0          1          1
3              0          0          1
4              1          0          0
5              0          1          0
6              1          0          1
7              1          1          0
8              1          1          1          Starts repeat

Table 6.1 Pattern Generator Outputs
LFSRs make extremely good pseudo random generators. The seed value can be anything except
all 0s, which would cause the LFSR to produce all 0 patterns. The only signal necessary to
generate the pattern is the clock and hence, it is self-generating. So, an LFSR is also
called an autonomous linear feedback shift register (ALFSR) [27].
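
A 3-bit LFSR of this kind can be modeled in Python as below. The feedback taps are not stated in the text; the choice used here (FF1 receives FF2 XOR FF3, with the data shifting from FF1 towards FF3) is inferred from the pattern sequence in Table 6.1 and should be read as an assumption about Figure 6.19. The function name `lfsr3` is hypothetical.

```python
from typing import List, Tuple

State = Tuple[int, int, int]  # (FF1_out, FF2_out, FF3_out)


def lfsr3(seed: State = (1, 1, 1), steps: int = 8) -> List[State]:
    """Return the register contents at each of `steps` clock pulses."""
    ff1, ff2, ff3 = seed
    states = []
    for _ in range(steps):
        states.append((ff1, ff2, ff3))
        # Shift FF1 -> FF2 -> FF3; feed FF2 XOR FF3 back into FF1 (inferred taps).
        ff1, ff2, ff3 = ff2 ^ ff3, ff1, ff2
    return states
```

Starting from the seed 111, the model visits the seven non-zero patterns and returns to 111 at the eighth clock pulse. An all-zero seed stays at 000, which is why it must be avoided.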
6.5 Simulation and Synthesis
Verification of the IC implementation has been carried out on different levels of the
design hierarchy. Testbenches have been developed to check the functionality of individual
components and the whole IC system. The powerful and convenient tools we use in simulation
are VHDLAN by Synopsys, and V-System/Windows by Model Technology. Simulation waveforms are
shown and explain the design correctness. The simulation results are included in Appendix
B in this thesis.
The simulation results demonstrate the correct functionality of the IC system designed
here. Figure 6.20 shows a part of the simulation results. It can be seen that the cache
hit operation takes three clock cycles. The cache miss operation takes at least one cycle
more than the hit operation. This best case happens if that cache is granted access to the
RT at once. If more cache misses occur in a given cycle, waiting is needed. Seven cycles
more than the best case cache miss will be required in the case that all eight caches
miss. In this case, it will take eleven cycles to complete the cache miss operation. Cache
compulsory miss is one of the worst cases. The figure shows the compulsory miss case. At
the beginning of the system run, at cycle #1, all eight caches miss and issue the RT
lookup requests simultaneously (see (1) in Figure 6.20). In this example, the arbiter
finds that the random number is 1 at that cycle and there are 8 RT requests. 1 MOD 8 = 1.
So cache #1 is granted access to the RT (see (2) in Figure 6.20). Let us take another
sample point in this example. At cycle #4, the random number is 5 and there are 5 RT
requests. 5 MOD 5 = 0. Cache #0 is granted access to the RT (see (3) in Figure 6.20). The
simulation results verify our objective of cache usage: in Figure 6.20, lookup clock
cycles are saved when cache hits occur, which would be the case most of the time. Thus,
the IC performance should be improved.
The Design Compiler is the core synthesis engine of the Synopsys synthesis product family.
It has two user interfaces, the graphical user interface (GUI) and the command line
interface. We use the GUI, which is named Design Analyzer. The synthesis process analyzes
our RTL files, the abstract descriptions of the circuits described using VHDL, and
produces a gate level net-list that would be ported to the target library. The synthesis
is based on the 0.18 µm CMOS technology. We synthesize the IC system from the bottom to
the top. A report is obtained from the tools for each synthesized component and the
integrated system. In Table 6.2, we summarize the synthesis report of the main components
and the integrated system. We include two main parameters in the table, the area and the
timing. A 2-input AND gate requires a cell area of 70 units. The IC system is comprised of
the ipc_center and the input buffers. The total cell area for these two modules, as can be
seen from Table 6.2, would be (0.5M x 8 input buffer modules) + (16.7M ipc_center) = 20.7M
units of cell area, which would be approximately equivalent to 0.3M gates. Thus, to build
the IC system, 0.3M gates are expected.
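
The gate-count estimate is simple arithmetic and can be reproduced as follows. The figures are the rounded values quoted in the text; the variable names are only illustrative.

```python
AND2_AREA = 70              # cell-area units of a 2-input AND gate (from the text)
INPUT_BUFFER_AREA = 0.5e6   # one input buffer module, rounded as in the text
IPC_CENTER_AREA = 16.7e6    # ipc_center, as quoted in the text

# Eight input buffer modules plus the IC Center.
total_area = 8 * INPUT_BUFFER_AREA + IPC_CENTER_AREA

# Express the area as a number of equivalent 2-input AND gates.
gate_equivalents = total_area / AND2_AREA

print(f"total cell area: {total_area / 1e6:.1f}M units")
print(f"gate equivalents: {gate_equivalents / 1e6:.2f}M gates")
```

The division by the area of a 2-input AND gate is what turns cell-area units into an approximate gate count.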
Timin£!losl Timing (ns)
Component Cell Area Component Cell Area.
Delay Slack Delay Slack
"""
7' D.31 cache....mem 82292D D.81
1101.5 2.19 " ....mem '008' 17.86
:l:OT....SUUCt 112.5 0.54 "_coIDP....deci ,>10 0.75
nO'-OT.... 2-l. 91. 1.38 ~57.5 17.(16
leftshifUl 262.5 0.26 c_re£!file_O -l.87952.5 ...U2
leftshift_l
'"
0.26 d_ff 367.5 -l.9.5
tri_5tale_buff 3010 1.19 arbiter 37165 25.83
bit_comp_.2-l. 1 3850 2.16 pc, 1225 -l.7.76
c....comp....deci 5967.5 '.84 inpucbuffer 496282.5 10.54
c_regfile_l 5(»402.5 0.83 ;p< 1948957.5 ....1.12
req....modi 542.5 '.7 lpc....cenler 16700000 4.12
" ....regfile -10890' 0.66
Table 6.2 Summary of Area and Timing Report for the Component and the System
The timing analysis looks at the structure of the circuit and measures the delays along
the datapath. In the timing report, one important item is the slack. The amount of slack
(time before the next clock edge) is the required delay minus the actual delay. The slacks
corresponding to most of the components reported in Table 6.2 were shown as MET, i.e.
having positive values. So, these designs have met the timing constraints. We select the
clock speed at 20 MHz for the synthesis. But the c_regfile_0, the ipc and the ipc_center
report VIOLATED. They have negative slacks. Hence, the circuit cannot run at the clock
speed selected for the synthesis. However, the reported discrepancy is small and the
worst-case assumptions made by the timing model are normally too strict for the target
library. The synthesis also used a low-effort algorithm, chosen since the design is large
and time-consuming to synthesize even at this setting. This may also affect the timing and
cause longer delay in the circuit. In the real hardware implementation, the delay in the
circuit is usually much smaller than the synthesized delay value. Therefore we expect this
circuit to function correctly, but if a failure is detected during testing, this timing
issue would be worth re-addressing. Note that the c_regfile_0, the ipc and the ipc_center
have the same negative slack. The c_regfile_0 is one of the subcomponents of the system.
So, the negative slack of the system is caused by the slack of the c_regfile_0, and we
might be able to improve the design of the c_regfile_0 to improve timing. Helping to
improve the circuit performance is the purpose of our timing analysis.
The above synthesis report is for our simplified IC system. In practice, to design a
realistic 8 x 8 ATM switch, the cell size is 53 bytes and the RT memory has 10,000
entries. The above timing will hardly change since it is still the 24-bit VPI/VCI which is
retrieved to process the lookup, not the whole 53-byte cell, and the extended RT will not
affect the delay on the datapath. But the number of gates will definitely change. The
input buffer will enlarge by (53 bytes x 8 + 3 bits) / (40 bits + 3 bits), that is,
approximately 10 times (see Table 4.1 for the input buffer structure). The eight input
buffer modules will have 40M units of cell area. The RT would have 10,000 entries, which
is 200 times the simplified 50 entries. So, the total cell area of the RT will be 200
times the cell area of the simplified RT: 200 x 16.7M = 3340M. Hence, the whole IC system
will be 3340M + 40M = 3380M units of cell area, which is approximately equivalent to 48M
gates.
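
The scaling estimate for the realistic switch can be followed step by step in the Python sketch below. It uses only numbers quoted in the text (53-byte cells, 10,000 RT entries, 0.5M and 16.7M units of cell area, 70 units per gate); the variable names are illustrative, and rounding the buffer ratio to 10 follows the text.

```python
# Input buffer growth: (53 bytes x 8 + 3 bits) versus (40 bits + 3 bits) per cell slot.
real_slot_bits = 53 * 8 + 3
simple_slot_bits = 40 + 3
buffer_scale = round(real_slot_bits / simple_slot_bits)   # 427 / 43 is about 9.93

# Routing table growth: 10,000 entries versus the simplified 50 entries.
rt_scale = 10_000 // 50

# Scaled areas, in cell-area units.
buffers_area = 8 * 0.5e6 * buffer_scale
rt_area = rt_scale * 16.7e6
total_area = buffers_area + rt_area

# Equivalent number of 2-input AND gates (70 units each).
gates = total_area / 70

print(buffer_scale, rt_scale)
print(f"{total_area / 1e6:.0f}M units, about {gates / 1e6:.0f}M gates")
```

Nearly all of the estimated area comes from the routing table term, so the estimate is most sensitive to the assumed number of RT entries.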
The schematics generated by Design Compiler are not provided here for lack of
space.
6.6 Summary
Based on the IC design presented in Chapter 4, and encouraged by the performance analysis
results reported in Chapter 5, we carry out the hardware implementation of the input
controller (IC) for BG ATM networks. Following the design flow recommended by CMC, we
design, test, simulate and synthesize each component and then assemble the whole system.
The results of simulation and synthesis demonstrate that the designed IC performs
correctly. Besides, the IC is easy to implement, practical and cost efficient.

One problem we should mention here concerns the pseudo random number generator (PRG)
design. In Section 6.4.2.3, we describe that the PRG may have seven possible outputs: 1,
2, 3, 4, 5, 6, and 7. Note that there is no 0. Thus, if all eight caches miss in a cycle,
though it is not very probable, there is no chance for cache #0 to be granted access to
the RT in that cycle. This affects the performance to some degree. In Chapter 7, we will
briefly discuss a solution to this problem.
Chapter 7
CONCLUSION
7.1 Contributions in this Thesis
The main contributions of this work can be summarized as follows:
• Modifying the traffic models and using these models in the performance
analysis of the IC system for BG ATM networks.
To design the IC system, traffic models are needed for performance analysis before
implementation. URT and bursty traffic are the two traffic types we choose. The former
is the most commonly used traffic model when analyzing the performance of switches.
The latter is flexible enough to model most of the existing traffic sources. We follow the
same modeling approach set up for BG networks, but modify it and apply it to the IC system.
• Demonstrating the outstanding performance of the cache structure in the IC
After studying various cache structures, we finally decide that, for an N x N
switch, there are N caches, one for each input port. They are fully associative with 32
blocks each. A fully associative cache has the most flexibility in mapping cache blocks. It
has a higher hit ratio than other organizations. The possible concern for this kind of cache
is that, since the whole block address is used as the tag to match the entries, it increases the
hardware complexity. Thus, a fully associative cache has been implemented only in
moderate size. A 32-block cache is a reasonable choice. The designed cache-memory IC
system yields a hit rate well in excess of 80% under most bursty traffic and achieves
above 95% under bursty traffic with a mean burst length of 15. This compares favorably
with the results reported for the commercial devices.
• Developing a practical and high-performing IC architecture.
A cache-based IC architecture is motivated in this thesis. The IC performs routing
table lookup, rewrites the processed ATM cells and forwards them to the switch fabric
that delivers cells to outgoing lines. The cache holds the frequently used forwarding
information. The IC looks up all the information necessary to forward a cell toward its
destination from a high-speed cache. If a cache miss occurs, a slower routing table lookup
follows the cache operation. The key issue of this architecture is to find an appropriate
cache-memory hierarchy in terms of cache hit rate, speed, cost, etc. Based on our design
experience, the overall IC performance improves when cache hits occur most of
the time.
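The cache-then-RT lookup path described above can be sketched in software as follows. This is a minimal illustration, not the thesis's hardware design: the class name, the LRU replacement policy, the toy routing table and the arrival pattern are all assumptions made for the example.

```python
from collections import OrderedDict


class LookupCache:
    """Fully associative cache: any VPI/VCI may occupy any of the blocks."""

    def __init__(self, blocks=32):
        self.blocks = blocks
        self.entries = OrderedDict()   # vpi_vci -> forwarding info, oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpi_vci, routing_table):
        if vpi_vci in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpi_vci)   # mark as most recently used
            return self.entries[vpi_vci]
        self.misses += 1
        info = routing_table[vpi_vci]           # slower RT lookup on a miss
        if len(self.entries) >= self.blocks:
            self.entries.popitem(last=False)    # evict least recently used (assumed policy)
        self.entries[vpi_vci] = info
        return info


# Toy routing table (hypothetical contents) and a bursty arrival pattern:
# 15 consecutive cells per connection, with connection 3 returning later.
routing_table = {vc: ("port", vc % 8) for vc in range(50)}
cache = LookupCache(blocks=32)
for vc in [3, 17, 42, 3]:
    for _ in range(15):
        cache.lookup(vc, routing_table)

hit_rate = cache.hits / (cache.hits + cache.misses)
print(f"hits={cache.hits} misses={cache.misses} hit rate={hit_rate:.3f}")
```

Only the first cell of each new burst goes to the routing table, which is why longer bursts push the hit rate up. The toy pattern is far simpler than the bursty traffic model used in the simulator, so its numbers should not be compared with the reported results.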
• Carrying out software simulation and further hardware implementation.
The use of C++ based programming enables the efficient modeling of switch
hardware components in order to perform cost/performance trade-offs, and the successive
refinement of interfaces and behavior. Then, following a smooth design flow, the
hardware components are built to a detailed level that contains all necessary information
about timing and structure in order to support functional verification capabilities. This
combination of software analysis and hardware implementation ensures that the IC design
operates clearly, correctly and efficiently.
7.2 Recommendations for Future Work
This section summarizes several pieces left unfinished because of limited time:
Routing Table (RT) refresh issue
One of the most difficult decisions in ATM switch design is to determine how to
handle the routing table memory refresh. The designed RT works fine if no dynamic
changes occur. However, in a dynamic routing algorithm, we need to update the routing
table periodically or frequently (whenever we receive a routing update message). In this
case, we need to refresh the RT. A simplistic, but wasteful, solution would be to keep two
copies, and alternately use one and refresh the other: one is hot, the other is standby.
This will need more complex controller logic. How to keep the lookup caches and RT
memory consistent is a key topic. In practice, if one VP/VC is torn down and is made
invalid in the RT, even though the same VP/VC cannot be marked invalid simultaneously
in the lookup caches, it seldom causes trouble, because incoming cells are much less likely
to be assigned that VPI/VCI by the network operator in the immediate future. Cache
coherence is a well-studied issue in parallel processing, and some of the innovative ideas
from this domain may be considered for adaptation to this problem.
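The hot/standby scheme mentioned above can be sketched as a double-buffered table. This is a hypothetical software illustration only: the class and method names are invented for the example, and the cache invalidation step at the end is one possible consistency measure, not something the designed IC implements.

```python
class DoubleBufferedRT:
    """Two routing-table copies: lookups read the hot one, updates go to standby."""

    def __init__(self, initial):
        self.copies = [dict(initial), dict(initial)]
        self.hot = 0                        # index of the copy serving lookups

    def lookup(self, vpi_vci):
        return self.copies[self.hot].get(vpi_vci)   # None means invalid or absent

    def refresh(self, updates, removals=()):
        standby = 1 - self.hot
        # Rebuild the standby copy from the hot one, then apply the changes.
        table = dict(self.copies[self.hot])
        table.update(updates)
        for vpi_vci in removals:
            table.pop(vpi_vci, None)
        self.copies[standby] = table
        self.hot = standby                  # swap roles in a single step
        return set(updates) | set(removals)   # entries a lookup cache may hold stale


rt = DoubleBufferedRT({0x000101: 2, 0x000102: 5})
cache = {0x000101: 2}                       # stand-in for one lookup cache
stale = rt.refresh(updates={0x000103: 7}, removals=[0x000101])
for key in stale:
    cache.pop(key, None)                    # drop possibly stale cache entries
```

Lookups never observe a half-updated table because the role swap is a single assignment; the cost is the doubled memory that the text calls wasteful.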
Pseudo Random Number Generator (PRG) improvement
In the PRG designed in Chapter 6, a problem exists. If eight cache misses happen
simultaneously, since there is no 0 in the PRG outputs, cache #0 cannot be issued a grant
to access the RT. This happens rarely, but it would still affect the system performance.
However, this is not a major problem since in the subsequent cycle there will be fewer
cache misses and thus cache #0 will get a chance. One solution is to use four D flip-flops
instead of three D flip-flops in the design of the PRG, since 2^4 - 1 = 15. Thus, the outputs
of the PRG will be 1, 2, 3, ..., 14, 15. This makes it possible for cache #0 to be granted
access if all eight caches miss, since 8 MOD 8 = 0. In this way, the PRG is improved
because every cache has a chance to be granted access to the RT when all caches miss.
Although this improved PRG is still unfair (as there is only a 1/15 chance for cache #0 to
access the RT versus 2/15 for each of the others), this bias is insignificant because the event of
all eight caches missing in the same cycle is rare. Figure 7.1 shows the improved PRG.
Figure 7.1 Improved Pseudo Random Number Generator (PRG)
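The effect of moving from three to four flip-flops can be checked with a small linear feedback shift register model. This is a sketch under assumptions: the feedback taps below are common maximal-length choices and are not taken from the thesis's circuit, and the function name is illustrative.

```python
from collections import Counter


def lfsr_outputs(nbits, taps, seed=1):
    """Return one full cycle of states of a Fibonacci LFSR.

    taps are 0-based bit positions XORed together to form the feedback bit.
    """
    mask = (1 << nbits) - 1
    state = seed
    seen = []
    while True:
        seen.append(state)
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        state = ((state << 1) | feedback) & mask
        if state == seed:
            return seen


three_ff = lfsr_outputs(3, taps=(2, 1))   # assumed taps, polynomial x^3 + x^2 + 1
four_ff = lfsr_outputs(4, taps=(3, 2))    # assumed taps, polynomial x^4 + x^3 + 1

print(sorted(three_ff))                           # seven states, 0 never appears
grants = Counter(value % 8 for value in four_ff)  # map 1..15 onto caches 0..7
print(sorted(grants.items()))
```

With four flip-flops the only state that maps to cache #0 is 8, while every other cache is reached by two states, which gives the 1/15 versus 2/15 bias noted in the text.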
Next-to-HOL Cell Lookup Improvement
As mentioned in Chapters 4 and 6, the system always looks up the HOL cell in the
input buffer (IB), which is a FIFO queue. Only after the HOL cell lookup is completed
and the HOL cell is picked up by the switch fabric can the next-to-HOL cell start its lookup.
This strategy works well in most situations except when all eight caches miss in a cycle.
In the latter situation, the cell waiting behind the HOL cell may have to wait for several
cycles to be served, seven cycles in the worst case. A possible improvement to
address this issue is to add logic to check the next-to-HOL cell and look it up after the
HOL cell has completed its lookup. Thus, the waiting time for the lookup is saved and
the system performance is improved. Lookup for more cells than the first two cells in the
IB may not be practical because the hardware complexity increases greatly.
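The seven-cycle worst case can be illustrated with a toy arbitration model. This is a sketch under assumptions: it supposes that the RT serves one cache per cycle and that the grant is chosen at random among the caches still waiting; the function name is hypothetical.

```python
import random


def serve_misses(missing_caches, rng):
    """Grant one pending cache per cycle; return the cycles each cache waited."""
    pending = list(missing_caches)
    waited = {}
    cycle = 0
    while pending:
        granted = rng.choice(pending)   # random arbitration among pending misses
        pending.remove(granted)
        waited[granted] = cycle         # cycles spent waiting before the grant
        cycle += 1
    return waited


waits = serve_misses(range(8), random.Random(0))
print(max(waits.values()))              # worst-case wait when all eight caches miss
```

Whichever cache is granted last has waited seven cycles, and the cell queued behind its HOL cell inherits that delay because it cannot begin its own lookup earlier.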
More traffic loads should be tested when investigating the relationships
between the number of input/output buffers and the traffic load
In the system performance analysis, when we investigated the relationships between
the number of input/output buffers and the traffic load, traffic loads of 0.5, 0.6, 0.7, 0.8,
and 0.9 were tested and the results showed a clear trend in the performance. However,
we missed some points within this range. In particular, we should have
investigated loads between 0.8 and 0.9, say, 0.82, 0.84, 0.86 and 0.88.
This future work may help us find more information about the jump in the number of
input/output buffers for a load of 0.9.
7.3 Summary
This chapter summarizes the contributions of this thesis and outlines directions for
future work.
APPENDICES
A Module Structure of the IC System
Entities Structure:
[Figure: entity structure diagram of the IC system]
The shadowed components are present in multiple copies.
Packages:
B Behavioral Simulation Result of the IC System
[Figure: behavioral simulation waveforms of the IC system]




