A Strategy for Integrating IP with ATM by Guru Parulkar et al.
aItPm: a Strategy for Integrating IP with ATM
Guru Parulkar, Douglas C. Schmidt, and Jonathan S. Turner
guru@cs.wustl.edu, schmidt@cs.wustl.edu, and jst@cs.wustl.edu
Department of Computer Science, Washington University
St. Louis, MO 63130, USA
TEL: (314) 935-6160, FAX: (314) 935-7302
Abstract
This paper describes research on new methods and architectures that
enable the synergistic combination of IP and ATM technologies. We
have designed a highly scalable gigabit IP router based on an ATM
core and a set of tightly coupled general-purpose processors. This
aItPm (pronounced “IP on ATM” or, if you prefer, “ip-attem”)
architecture provides flexibility in congestion control, routing,
resource management, and packet scheduling.
The aItPm architecture is designed to allow experimentation
with, and fine tuning of, the protocols and algorithms that are
expected to form the core of the next generation IP in the context of a
gigabit environment. The underlying multi-CPU embedded system
will ensure that there are enough CPU and memory cycles to
perform all IP packet processing at gigabit rates. We believe that the
aItPm architecture will not only lead to a scalable high-performance
gigabit IP router technology, but will also demonstrate that IP and
ATM technologies can be mutually supportive.
1 Introduction
The Internet protocol suite provides the foundation for the cur-
rent data communications infrastructure in the United States and
much of the rest of the world. The IP protocols have proven to
be very ﬂexible and have been deployed widely over the past two
decades. As technology makes it possible to communicate at gigabit
speeds, it is essential to create scalable, high-performance routers
that implement IP protocols. In the past ten years, Asynchronous
Transfer Mode (ATM) technology has emerged as a key component
of next generation networks. ATM offers unprecedented scalabil-
ity and cost/performance, as well as the ability to reserve network
resources for real-time oriented trafﬁc and support for multipoint
communication.
Although IP and ATM often have been viewed as competitors,
we believe their complementary strengths and limitations form a
natural alliance that combines the best aspects of both technologies.
For instance, one limitation of ATM networks has been the relatively
large gap between the speed of the network data paths and the
control operations needed to conﬁgure those data paths to meet
changing user needs. IP’s greatest strength, on the other hand, is
its inherent flexibility and its capacity to adapt rapidly to changing
conditions. These complementary strengths and limitations make
it natural to combine IP with ATM to obtain the best that each has to
offer.
(Footnote: This work was supported in part by Ascom-Timeplex, Bay Networks, BNR, NEC,
NTT, Southwestern Bell, and Tektronix.)
This paper describes our research on new methods and archi-
tectures for achieving the synergistic combination of IP and ATM
technologies. We have designed a highly scalable gigabit IP router
based on an ATM core. This aItPm router integrates the following
core architecture components:
• A gigabit ATM switching fabric that is highly scalable in terms
of the number of ports and provides optimal hardware support
for multicasting [19, 20];
• A multi-CPU embedded system that includes a string of ATM
Port Interconnect Controllers (APICs) [4, 5, 6] and allows
flexible and high-performance IP packet processing in software;
• A distributed software system capable of forwarding IP
packets at gigabit data rates on the ATM substrate and configuring
that substrate dynamically to provide efficient handling of IP
packet streams.
The paper is organized as follows: Section 2 outlines the
hardware and software architecture of the aItPm router; Section 3
describes how packet processing is carried out; Section 4 describes
how various other Internet protocols (such as ICMP, IGMP, and IP
version 6) are supported by an aItPm router; Section 5 compares the
aItPm approach with related work; and Section 6 presents concluding
remarks.
2 Architecture of the Gigabit aItPm Router
2.1 System Overview
An overview of the aItPm router architecture is shown in Figure 1.
Each router is designed using ATM switch and host interface
components. These components form a substrate that links a set of IP
Processing Elements (IPPEs). IPPEs handle the IP packet processing and
directly control the ATM substrate. The IPPEs are general-purpose
processors implementing flexible routing and queuing strategies
that are central to high-performance IP networks.
Each router has a number of high-speed ports (1.2 Gb/s) equipped
with routing cards that implement the required IP functionality. The
main data path of each routing card passes through a sequence of
ATM Port Interconnect (APIC) chips. The APIC chip is an extensible,
high-performance network interface chip designed to interface
directly to the main memory bus of a high-performance computing
system. The APIC supports zero-copy semantics, so that no copying
is required to deliver data from the network to an application.
The gigabit aItPm router uses a chain of APICs to support a set of
embedded microprocessors that perform IP processing.
Figure 1: Gigabit aItPm Router
Each router may also have line cards that support interfaces
to workstations, servers and other IP routers (to reach machines on
conventional LANs and on the Internet). A line card has 12 interfaces
at 150 Mb/s and one at 600 Mb/s. These streams are multiplexed
together at the cell level and sent through the ATM backplane. Line
cards interfacing directly to shared access LANs (such as Ethernet
and FDDI) are also possible.
The basic operation of the router is illustrated in Figure 2. This
figure shows an IP datagram being sent from the workstation labeled
A across an ATM permanent virtual circuit to the IP processing
element labeled B. At B, an IP routing table lookup is done to
determine the router’s output port where the packet should be sent.
The packet is then forwarded to that output port (C). At C, the
packet is queued for transmission on the outgoing link and then
sent on its way.
The scenario in Figure 2 routes the packet using two passes
through the ATM switch. The second pass is required since the
line cards have no dedicated IP processing capability. While such
functionality could be included in the line cards, the added cost may
not be justified by the performance improvement. In particular, the
ATM backplane imposes only a small latency for the extra pass
(on the order of 10 µs).
In some cases, only one pass through the ATM switch is required.
For example, one pass is possible when a packet arrives at an input
port connected directly to a routing card, and it is destined for an
output port connected to another routing card. The second pass
may be eliminated in other cases as well. For example, in the
scenario described above, the routing table lookup done at B may
determine that the outgoing link connected to B’s routing card is
the best choice. In this case, the packet is queued for transmission
at B, and the extra pass through the switch is skipped.
The aItPm router architecture permits the IP layer to dynamically
configure ATM virtual circuits to optimize the handling of user
information flows. To achieve this, we introduce a “cut-through”
mechanism to allow more efficient handling of bursts consisting of
many packets. For example, suppose an application starts transmitting
a file and the resulting stream of packets arrives at the router
in Figure 2 on the external link connected to router card B. The
IPPE handling the packet stream responds by queueing the packets
temporarily while it determines the best outgoing link. It then sends
control cells to the embedded ATM switch, instructing it to configure
a new virtual circuit on the selected outgoing link. Packets received
for subsequent transmission are queued on the newly configured
virtual circuit. After the buffer for this flow has been completely
flushed, the IPPE may instruct the APIC to forward subsequent cells from
this flow directly along the virtual circuit, without passing through
the IPPE. This approach allows the IP layer to maintain full control
over individual packet flows and modify its decisions as necessary,
while amortizing the cost of the more complex decision-making
processes over many IP packets. The design of the cut-through
mechanisms is discussed further in Section 3.
The aItPm architecture also enables the IP layer to configure
multicast virtual circuits by allowing IP multicast to be implemented
directly at the hardware level. Consequently, IP multicast
applications such as the MBONE can be supported in a highly scalable
fashion. This makes it possible for a very large number of multicast
applications to operate on the network at the same time.
2.2 Hardware Architecture
An overview of the aItPm system architecture was described above.
This section discusses the various hardware subsystems and
components that are used to implement the aItPm architecture. One central
component is the ATM Port Interconnect Chip (APIC). The APIC has
been designed as a flexible, extensible, and high-performance ATM
host interface chip [4, 5, 6]. It provides direct support for
segmentation and reassembly, efficient data transfer across a host processor’s
memory bus, support for zero-copy to and from an application’s
address space, and pacing of cell flows for individual virtual circuits.
The APIC is designed with two bidirectional ATM interfaces. This
design allows multiple APICs to be chained together conveniently,
as in the router card application described above.
A block diagram of the APIC is shown in Figure 3. The inputs at
the top left are parallel interfaces that transfer ATM cells according
to the Utopia [2] interface standard. This standard specifies how
to connect SONET transmission devices to ATM switches and host
interface circuits. The output interfaces at the top right are similar.
The bus interface at the bottom of the figure is 64 bits wide. It is
designed to be directly compatible with the Sun Mbus
specification [18], but may be adapted to other memory buses with auxiliary
logic.
Figure 2: Example Showing Data Flow
Figure 3: ATM Port Interconnect Chip (APIC). Blocks shown: IPS (two), TxFmt, Cell Store Subsystem, VCXT Subsystem, Dispatcher Subsystem, OPS (two), RxFmt, Bus Interface Subsystem.
Cells arriving from either of the Utopia inputs are placed in the
central cell store (its capacity is 256 cells) via the VCXT subsystem.
This subsystem performs a table lookup to determine how to
process a cell. Cells may be forwarded directly to one of the outgoing
Utopia interfaces or they may be directed to the external memory
interface (EMI). A single cell may be directed to both. The output
framers (OF0 and OF1) and the RX-CRC block schedule the
transmission of cells to either of the two Utopia outputs and to the EMI,
respectively. The pacer is responsible for cell pacing for all active
connections to ensure that cells are transmitted at the appropriate
rate from the local external memory to the Utopia outputs. For cells
directed to the local external memory, the EMI also provides address
information and batches multiple cells together to achieve efficient
transfer across the bus. The receive CRC block (RX-CRC) computes
the AAL5 CRC as cells pass through to the EMI. The transmit CRC
block (TX-CRC) computes the outgoing CRC as cells pass to the cell
store.
The interfaces between the various inputs and outputs within
the central cell store are completely asynchronous. This allows a
wide range of link speeds to be accommodated in a straightforward
fashion. The APIC permits individual Utopia ports to be configured
for either 16 bit or 32 bit operation, with completely independent
clocks. This enables the data paths flowing through the APIC chain to
and from the core switch to operate at a higher rate than the external
links. Therefore, queuing may be managed primarily within the
IPPEs. In addition, the VCXT and dispatcher subsystems may cache
information relating to specific packet flows. Thus, packets may be
forwarded directly along the main data path without processing by
the IPPE.
The second key component of the gigabit aItPm router is the
ATM switch at its core, as illustrated in Figure 4. The system
comprises a multistage switching network that implements dynamic
routing of cells to evenly balance the load from all inputs and
outputs over the entire network [19, 20]. The network supports
gigabit operation by striping cells across four parallel planes (each
cell is divided and transferred in parallel through all four planes
simultaneously, minimizing latency). The network also supports
an elementary ‘copy-by-two’ operation. Together with a novel cell
recycling scheme, this permits an incoming cell to be copied F
times with log2 F passes through the switch. This architecture
is the only nonblocking multicast switch architecture known that
achieves optimal scaling with respect to interconnection network
complexity, routing memory and virtual circuit modification. In
large configurations it achieves order-of-magnitude cost
improvements over competing multicast switch architectures. See [17] for
further details. The architecture is implemented using a set of three
custom integrated circuits. One circuit implements the Switch
Elements making up the switching network. The other two circuits
implement the Port Processors that interface to the external links
and perform cell routing (using virtual circuit identifiers) and cell
buffering. The Input and Output Port Processor chips, like the APIC,
implement the Utopia interface. This makes it possible to directly
connect an APIC to a switch port processor.
Figure 4: Gigabit ATM Switch Architecture
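The logarithmic pass count for multicast copying can be illustrated with a small calculation. The sketch below is only a simplified model, assuming that each pass through the switch can at most double the number of copies of a cell; the function name passes_for_fanout is hypothetical and the model ignores all details of the actual recycling scheme described in [17].

```python
def passes_for_fanout(fanout: int) -> int:
    """Passes needed to reach `fanout` copies if each pass at most doubles the count.

    Simplified model of copy-by-two with recycling; not the switch's actual logic.
    """
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    copies, passes = 1, 0
    while copies < fanout:
        copies *= 2   # one copy-by-two pass over every existing copy
        passes += 1
    return passes


if __name__ == "__main__":
    for f in (1, 2, 8, 9, 1024):
        print(f, passes_for_fanout(f))
```

For a fanout that is a power of two the result equals log2 F exactly; for other values the model rounds up to the next whole pass.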
The gigabit aItPm router uses the ATM switch to configure itself
by sending control cells over its ATM links. Cells with a special
VPI/VCI combination are interpreted as switch control cells for those
ports that are enabled to receive them. The contents of their
payloads determine the functions to perform, as well as the switch ports
where they are to be performed. When a control cell is received
from a link, it is forwarded to the port it operates upon. The target
port processor then carries out the required operation. Most
commonly, the requested operation is to read or write an entry in the
port’s virtual circuit routing table. In addition, this same mechanism
may be used to configure certain hardware options or to access cell
counters (these are maintained on both a per-link and per-virtual-circuit
basis) or other status registers. In typical ATM switch applications
there would be a single processor managing the switch resources.
However, there is no intrinsic reason why these resources cannot be
managed in a distributed fashion by a collection of IPPEs that set up
and modify virtual circuits quasi-independently. The IPPEs inform
each other of their resource usage and respond cooperatively when
conflicts arise. Therefore, it is possible to distribute control in such
a way that non-conflict-producing decisions are very fast, allowing
virtual circuits to be established in under 100 microseconds. For a
ten megabyte information burst (transferred at 1.2 Gb/s),
the virtual circuit could be established before the first 0.2% of the
burst has been received. Thus, the IPPE would explicitly process
only a small part of the entire burst.
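The 0.2% figure can be checked with a short calculation. The sketch below assumes a decimal megabyte (10^6 bytes) and takes the 100 microsecond setup time as an upper bound; the function and constant names are illustrative only.

```python
LINK_RATE_BPS = 1.2e9   # external link rate from the text
BURST_BYTES = 10e6      # ten megabytes, assuming 1 MB = 10**6 bytes
VC_SETUP_S = 100e-6     # upper bound on virtual circuit setup time


def fraction_received_during_setup(burst_bytes, link_rate_bps, setup_s):
    """Fraction of the burst that arrives before the virtual circuit is ready."""
    burst_duration_s = burst_bytes * 8 / link_rate_bps
    return setup_s / burst_duration_s


if __name__ == "__main__":
    frac = fraction_received_during_setup(BURST_BYTES, LINK_RATE_BPS, VC_SETUP_S)
    print(f"{frac:.2%} of the burst arrives during setup")
```

Under these assumptions the burst lasts about 67 ms, so roughly 0.15% of it arrives during setup, consistent with the bound of 0.2% quoted above.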
2.3 Software Run-time Environment
The common path of IP packet forwarding is relatively simple and
may be implemented in less than one hundred RISC instructions [16].
However, operating system (OS) related overheads of packet
processing (such as data movement, interrupt processing, and context
switching) are significant and may limit the overall packet
throughput of a system. Therefore, we have selected a software run-time
environment for the aItPm IPPEs that minimizes OS related overheads.
We are using a general-purpose Unix operating system that is tuned
for the IPPE environment. Since an IPPE is not a general-purpose
workstation, it is possible to “disable” the following unnecessary
OS capabilities that represent sources of performance overhead:
• Demand paging is disabled by allowing the kernel code to be
locked in physical memory. This ensures that there are
no page faults during packet processing. In fact, the IP packet
processing code may reside in the CPU cache and thus save
memory and bus accesses during packet processing.
• The interrupt-driven receive interface may be replaced with
a polled interface to eliminate any interrupt processing
overhead. Note that the APIC also supports a polled interface that
allows the APIC to signal packet reception and transmission
by modifying status bits in a data structure that is shared
between the APIC and kernel memory.
Also note that the APIC allows zero-copy packet processing,
so there is no performance cost due to data copying.
• All daemons and other system processes that are not needed
on a given IPPE are disabled on it. Since the IPPEs do not have
any local disk, they boot from a boot ROM using a remote
machine as the boot server.
With these modifications, the Unix environment can be nearly
as efficient as a custom embedded software system. One major
advantage of the Unix environment is that the software development
and runtime environments are essentially the same. This greatly
facilitates development, debugging, and testing of various software
modules. Also, existing Unix implementations of protocols are
easily ported to IPPEs.
Figure 5: APIC/IPPE Data Transfer. Labels in the figure: Incoming ATM cells, APIC, VCXT Table, VRAM, VRAM Serial Interface, Buffer Pool, Global Free Buffer Queue, Receive Buffer Queue for VCI = 23, Head of Queue, CPU, Cache, Memory Bus. Numbered steps:
1. VCXT match in APIC.
2. Free buffer obtained from the Free Buffer Queue.
3. The data from the ATM cells is copied over the serial interface of the VRAM into the buffer.
4. The buffer pointer is added to the tail of the Receive Buffer Queue for this VC, once the whole packet has been received.
3 Packet Forwarding in an aItPm Router
Before discussing how packets are forwarded in the aItPm router,
we describe how the APIC moves packets between the ATM data path
and the IPPE. IP packets passing along the chain of APICs are carried
on several ATM virtual circuits. Different virtual circuits are used
to prevent interleaving of cells belonging to separate packet flows
and provide traffic isolation. Each APIC is configured with a list of
VCs that are the responsibility of its attached IPPE. When a cell is
received on one of these VCs, it is forwarded to the attached IPPE.
Otherwise, it is passed along the APIC chain.
An APIC maintains a Connection State Block (CSB) for each of
its VCs. Among other things, the CSB stores a packet chain pointer
that points to a queue of packets received. It also contains a write
pointer that points to a buffer indicating where to store the next cell
within the IPPE’s VRAM. Memory transfers are carried out in batches
of cells to maximize VRAM throughput. The APIC also computes
the AAL5 CRC as the cells belonging to a packet are being written
into the IPPE’s memory. If the CRC does not match, the packet
is dropped silently. Otherwise, the APIC enqueues the packet in
the receive packet queue. The APIC also adds the CSB to a linked
list of CSBs (if it is not already on the list) to indicate that the VC
needs attention from the CPU. Therefore, the APIC does not need to
interrupt the CPU for every packet.
The APIC driver executing on the IPPE processor stays in a tight
loop checking for additions to the linked list of CSBs containing
valid data. As soon as the APIC driver finds a new CSB, it forwards
it and the associated packet(s) to the IP packet processing routine.
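The receive path just described (per-VC CSBs, an attention list, and a polling driver) can be sketched in software. The following is a minimal model, not the APIC's actual interface: the class and function names (ConnectionStateBlock, ApicModel, poll_once) are hypothetical, the attention list is modeled with a deque rather than a hardware-maintained linked list, and the CRC check is reduced to a boolean flag.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ConnectionStateBlock:
    """Per-VC state; only the fields needed for this sketch."""
    vci: int
    rx_packets: deque = field(default_factory=deque)  # receive packet queue
    on_attention_list: bool = False


class ApicModel:
    """Software stand-in for the APIC side of the shared data structures."""

    def __init__(self):
        self.csbs = {}
        self.attention = deque()  # CSBs that need attention from the CPU

    def receive_packet(self, vci, packet, crc_ok=True):
        if not crc_ok:
            return  # a packet with a bad AAL5 CRC is dropped silently
        csb = self.csbs.setdefault(vci, ConnectionStateBlock(vci))
        csb.rx_packets.append(packet)
        if not csb.on_attention_list:  # add the CSB only once
            csb.on_attention_list = True
            self.attention.append(csb)


def poll_once(apic, handle_packet):
    """One iteration of the driver's polling loop; returns packets handled."""
    handled = 0
    while apic.attention:
        csb = apic.attention.popleft()
        csb.on_attention_list = False
        while csb.rx_packets:
            handle_packet(csb.vci, csb.rx_packets.popleft())
            handled += 1
    return handled


if __name__ == "__main__":
    apic = ApicModel()
    apic.receive_packet(23, b"first")
    apic.receive_packet(23, b"second")
    apic.receive_packet(54, b"corrupt", crc_ok=False)
    print(poll_once(apic, lambda vci, pkt: print(vci, pkt)))
```

A real driver would call poll_once in a tight loop; because a CSB appears on the attention list at most once, several packets on the same VC cost a single list operation.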
Packets being forwarded by the IPPE back into the ATM cell
stream are handled similarly. Each active outgoing VC has an entry
on a linked list of CSBs. The CSB contains a pointer to a linked list
of packet descriptors that constitute the transmit packet queue for
that VC. Once the IPPE enqueues a packet in this queue, it gives a
grant signal to the APIC, causing it to DMA the packet (or packets)
from the IPPE’s memory, segment it into cells and insert the cells
into the ATM cell ﬂow.
For long messages or bursts, we propose to implement IP packet
forwarding by using the ATM layer to provide a ‘fast-path’ that is
used for most packets in a burst. The fast path is established by
software in the IPPEs at the start of the burst. For short messages
(in particular, messages consisting of a single packet), all packet
processing is handled by the IPPE software. In this section, we
describe the processing of packets for both short messages and
longer bursts.
3.1 IP Packet Processing for Short Messages
In Section 3.2, we describe a technique for processing IP packet
flows containing many packets, which allows most of the data to
be forwarded directly at the ATM layer without explicit software
processing. However, a router often receives short messages that
do not benefit from the use of such techniques. In addition, a router
may receive packets that for some other reason require explicit
software processing at every router.
Each ATM link connecting different routers has one (or possibly
more than one) virtual circuit dedicated to carrying packets that
require explicit software processing. An APIC at the receiving end
of such a link forwards such packets into the memory of its IPPE.
In this case, the required IP processing is carried out on the IPPE.
This processing includes checking the validity of the packet header,
making a routing decision, updating the appropriate header fields
(e.g., the time-to-live and header checksum), and enqueueing the
packet for forwarding to the proper output port. The bulk of this
processing is identical to that performed by conventional IP routers
and can be optimized to about one hundred instructions per packet
in the common case, as described in [16].
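The header handling named above (validity check, time-to-live update, checksum update) can be sketched for IPv4 as follows. This is an illustrative sketch of conventional per-hop processing, not the IPPE code: the function names are hypothetical, the routing decision and queueing are omitted, and the checksum is recomputed in full rather than updated incrementally as an optimized router would do.

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of 16-bit words, complemented (the IPv4 header checksum)."""
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def forward_header(header: bytes):
    """Validate an IPv4 header and return it with TTL decremented, or None to drop."""
    if len(header) < 20 or header[0] >> 4 != 4:
        return None                      # too short or not IPv4
    ihl = (header[0] & 0x0F) * 4         # header length in bytes
    if ihl < 20 or len(header) < ihl:
        return None
    hdr = bytearray(header[:ihl])
    if ipv4_checksum(bytes(hdr)) != 0:   # a valid header sums to zero
        return None
    if hdr[8] <= 1:
        return None                      # TTL expired (a real router would send ICMP)
    hdr[8] -= 1                          # decrement time-to-live
    hdr[10:12] = b"\x00\x00"             # clear, then recompute the checksum
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return bytes(hdr)


if __name__ == "__main__":
    raw = bytearray(struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 0, 0, 64, 17, 0,
                                bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])))
    struct.pack_into("!H", raw, 10, ipv4_checksum(bytes(raw)))
    out = forward_header(bytes(raw))
    print(out[8], hex(ipv4_checksum(out)))
```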
Each input IPPE maintains a dedicated virtual circuit to each of
the output IPPEs. When queueing a packet for a particular output
port, it selects the virtual circuit corresponding to an IPPE at the
proper output port. The output IPPE buffers packets received from
different input IPPEs and schedules them for transmission on a single
outgoing virtual circuit. The scheduling algorithm performed by
the output IPPE is designed to ensure that each packet flow receives
the appropriate quality of service.
3.2 IP Packet Processing for Longer Bursts
To allow fast-path processing for longer bursts of IP packets,
we pre-configure a set of permanent virtual circuits (PVCs) joining
IPPEs in adjacent routers. That is, we have permanent virtual circuits
crossing a single link, joining the IPPEs on the output side of a router
to the IPPEs on the input side of adjacent routers. Each PVC may be
in one of two states: active or inactive. At the time of initialization
or when a PVC is not being used, it is in an inactive state. The APICs
at the receiving end have all PVCs in their internal VC tables, and
keep track of the state of each PVC.
If a packet is received on an inactive PVC (meaning that the
upstream router has decided to use this PVC), the virtual circuit
switches from inactive to active. The APIC sends the packet to the
IPPE for processing as shown in Figure 6 for VC = 54. The IPPE does
four things:
• The IPPE makes a routing decision to select the output port.
This decision may be made based on dynamic information
on the status and current loading of the various output ports,
as well as static routing information.
• Next, the IPPE exchanges control messages with the IPPE at the
selected output port to get an unused PVC to forward packets to
the next router. Upon receiving the proper response, it sends a
control cell to the ATM switch, configuring it to forward cells
received at the input to the proper output, with the proper
virtual circuit identifier. Note that this control interaction
is purely local to the router and involves no long-latency
interactions.
• The input IPPE, in cooperation with its APIC, then forwards
packets it has received during the time that has passed back
into the main data flow. When all such packets have been
forwarded, the APIC begins forwarding cells on that virtual
circuit directly, without diverting them through the IPPE. This
operation of flushing packets that have accumulated at the
input IPPE requires close coordination between the IPPE and
the APIC, but poses no fundamental difficulty. (The bandwidth
available through the APIC chain is roughly twice as large
as the bandwidth of the external link, guaranteeing that the
accumulated packets can be flushed rapidly.)
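The steps above amount to a small state machine on each incoming PVC. The sketch below is one possible way to model it, under stated assumptions: the paper defines only the active and inactive states, so the split of the active state into SETUP and CUT_THROUGH is introduced here for illustration, and the names InputPvc, select_output_port, and on_switch_configured are hypothetical. The control-message exchange with the output IPPE and the switch is reduced to a single callback.

```python
from collections import deque
from enum import Enum


class PvcState(Enum):
    INACTIVE = "inactive"        # no flow is using the PVC
    SETUP = "setup"              # active; IPPE buffers while the switch is configured
    CUT_THROUGH = "cut-through"  # active; cells bypass the IPPE


class InputPvc:
    """Model of one incoming PVC as seen by the input IPPE and its APIC."""

    def __init__(self, select_output_port):
        self.state = PvcState.INACTIVE
        self.select_output_port = select_output_port  # routing decision (assumed callable)
        self.output_port = None
        self.buffered = deque()   # packets held in IPPE memory during setup
        self.sent = []            # (output_port, packet) pairs, for inspection

    def on_packet(self, packet):
        if self.state is PvcState.CUT_THROUGH:
            self.sent.append((self.output_port, packet))  # forwarded without the IPPE
            return
        if self.state is PvcState.INACTIVE:
            # First packet of a burst: make the routing decision and start setup.
            self.output_port = self.select_output_port(packet)
            self.state = PvcState.SETUP
        self.buffered.append(packet)

    def on_switch_configured(self):
        """Called once the local control exchange has completed."""
        if self.state is not PvcState.SETUP:
            return
        while self.buffered:  # flush accumulated packets back into the data flow
            self.sent.append((self.output_port, self.buffered.popleft()))
        self.state = PvcState.CUT_THROUGH

    def reclaim(self):
        """Return the PVC to the inactive state when the burst is over."""
        self.state = PvcState.INACTIVE
        self.output_port = None
        self.buffered.clear()


if __name__ == "__main__":
    pvc = InputPvc(select_output_port=lambda pkt: 2)
    pvc.on_packet("p1")
    pvc.on_packet("p2")
    pvc.on_switch_configured()
    pvc.on_packet("p3")
    print(pvc.state, pvc.sent)
```

Keeping the flush and the switch to cut-through in one step preserves packet order within the flow, which is the property the close coordination between IPPE and APIC is meant to guarantee.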
When the burst of packets that established a given connection
is completed, the connection may be torn down. This may be done
explicitly, through flow maintenance messages (where available),
or implicitly. One simple implicit mechanism involves monitoring
usage of PVCs on the output side of the router, and allowing the
output IPPE to reclaim any PVC that has not been used recently. This
would require that the output IPPE inform the input IPPE using the
PVC, so the input IPPE can set its incoming PVC to the inactive state
(meaning that packets received subsequently will be processed by
the IPPE). The output IPPE would also need to send a control cell on
the outgoing PVC to force the IPPE at the next router to reset the PVC to
the inactive state. This may be accomplished by defining a special
resource management cell for this purpose. Reclaiming unused
PVCs may be done as a continual background process. Likewise, it
may be done on-demand only when an arriving packet requests use
of a link in which all PVCs are already in use.
When selecting output ports to receive a given stream of
packets, we try to select the port that is best able to accommodate the
added traffic. In some cases, however, an output link will become
overloaded, causing cells to accumulate in the buffer at the APIC on
the output side. When this happens, the APIC will start diverting
packets to the IPPE. The larger memory capacity at the IPPE allows
it to absorb fairly long-lasting overloads. However, it is important
that the IPPE schedule the use of the overloaded resource to provide
fair treatment of all the competing traffic streams. Since the
overload could also lead to congestion in the ATM core, the output IPPE
should send control cells to the IPPEs that are sending it packets,
causing them to start buffering packets on the input side and forwarding them
on at a reduced rate. The input IPPE may also reconfigure packet
streams away from the congested link if there are other acceptable
choices available. This requires some coordination with other IPPEs
to prevent control oscillations between different links.
One aspect of cut-through handling of IP packets is that the
time-to-live field (hop count in IPv6) is processed only in packets
that pass through IPPEs. While this violates a strict interpretation of
IP protocol processing, we believe it is not a serious violation. The
purpose of the time-to-live field is to detect routing loops. Thus, if
we process the first packet of a burst in each router on the path, we
can still detect routing loops and flush the entire burst. We simply
interpret the time-to-live field of the first packet in the burst as
applying to the whole burst. A similar argument justifies selective
processing of other fields that would normally be processed at every
hop.
3.3 Congestion Avoidance and Control
As indicated in the previous section, we seek to avoid congestion
as much as possible by routing arriving bursts of packets to output
links that are best able to accommodate them. The loading on
the various output links can be obtained by polling hardware cell
counters in the output port processors of the ATM core. The counters
are read using control cells, which are time-stamped when data is
read to allow accurate determination of the load during short time
intervals.
If information describing the data rate for an IP packet stream
is available (through a reservation protocol like RSVP, for instance),
this can be used to optimize the output port selection process. In
particular, if there are several good choices available (that is, links
that can all accommodate the added traffic), it is best to select the
busiest link. This policy minimizes bandwidth fragmentation and
improves the performance for later-arriving bursts. In the absence
of such information, the best choice is the least busy link.
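The selection policy in this paragraph can be written down compactly. The sketch below is an illustration under several assumptions that are not in the text: links are described by a (capacity, load) pair in the same units, "busiest" and "least busy" are measured by utilization (load divided by capacity), and when a rate is known but no link can accommodate it the policy falls back to the least busy link. The function name select_output_link is hypothetical.

```python
def select_output_link(links, burst_rate=None):
    """Pick an output link for a new burst.

    links: dict mapping link name -> (capacity, current_load), same units.
    burst_rate: expected rate of the burst, or None if unknown.
    """
    if not links:
        raise ValueError("no candidate links")

    def utilization(name):
        capacity, load = links[name]
        return load / capacity

    if burst_rate is not None:
        # Links with enough spare capacity for the announced rate.
        fits = [n for n, (cap, load) in links.items() if cap - load >= burst_rate]
        if fits:
            # Busiest link that still fits, to limit bandwidth fragmentation.
            return max(fits, key=utilization)
    # Rate unknown (or nothing fits, an assumed fallback): least busy link.
    return min(links, key=utilization)


if __name__ == "__main__":
    links = {"A": (1200, 300), "B": (1200, 900), "C": (1200, 1150)}  # Mb/s
    print(select_output_link(links, burst_rate=200))  # busiest link that fits
    print(select_output_link(links))                  # least busy link
```

With the example loads, a 200 Mb/s burst goes to link B (link C is busier but has only 50 Mb/s free), while a burst of unknown rate goes to link A.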
When the load on an output link exceeds its capacity, packets
may accumulate in the outgoing IPPE. Careful scheduling is required
to ensure that each packet stream receives acceptable performance.
A number of scheduling algorithms have been proposed over the
past few years [9, 8, 10, 11, 12, 13, 14, 22]. Two of them have
received most attention in the Internet community. One was designed
by a collaboration between MIT and Xerox PARC and is generally
referred to as the CSZ algorithm [3]; the other was designed at LBL,
and is named Class-Based-Queueing (CBQ) [9].
Figure 6: Cut-through Routing. Annotations in the figure: When the forwarding flag (F) in the VCXT is 1, the flow goes directly to the switch. When the forwarding flag (F) is 0, the flow undergoes the usual IPPE processing. For cut-through flows, the routing is done by the Input Port Processor. The IP flow carried in VCI 23 is destined for link 2. The IP flow carried in VCI 15 is destined for link 3. The example VCXT lists VC 23 with F = 1 and VC 54 with F = 0.
The CSZ algorithm combines three simple building blocks: weighted
fair queueing (WFQ), priority queueing, and FIFO. In order to
provide guaranteed service to selected traffic streams, WFQ is used at
the top level of CSZ scheduling to provide traffic isolation.
Priority queueing is used to separate predictive real time services from
best-effort traffic, as well as to separate different classes of
predictive services. Each predictive class contains multiple real time
data flows. Within a class, FIFO queueing is used to take the most
advantage of statistical multiplexing. The principle of WFQ is also
used to build a sharing tree, which is orthogonal to the scheduling
architecture, to enforce link-sharing.
In CBQ the basic building blocks are priority queueing,
round-robin scheduling, and a novel “borrowing hierarchy” for
link-sharing control. CBQ uses priority at the top level of the scheduling
control; thus, it does not provide guaranteed real time services.
Classes within each priority level are served in a round-robin
fashion. In addition to this scheduling architecture, there is a separate
“borrowing hierarchy” that includes all traffic classes and the
bandwidth allocated to each class. Whenever a packet is forwarded,
the bandwidth usage of the corresponding class is adjusted. Upon
becoming resource overdrawn, a class may either borrow more
bandwidth from its parent class (if the parent has any left), or be
handled in a predefined manner (such as being suspended for a
while, or packets being dropped). This borrowing hierarchy may
provide adequate link-sharing control and has an extremely efficient
implementation.
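The borrowing idea can be illustrated with a toy model. The sketch below is a much-simplified illustration and not the CBQ algorithm as specified in [9]: it assumes each class holds a fixed byte budget for the current interval, ignores replenishment over time, priorities, and round-robin service, and lets an overdrawn class charge a whole packet to the nearest ancestor with enough budget left. The class name TrafficClass and its fields are hypothetical.

```python
class TrafficClass:
    """A node in a toy borrowing hierarchy with a per-interval byte budget."""

    def __init__(self, name, allocation, parent=None):
        self.name = name
        self.remaining = allocation  # bytes still available in this interval
        self.parent = parent

    def charge(self, nbytes):
        """Account for a forwarded packet.

        Returns True if this class or an ancestor could cover the packet,
        False if the class is overdrawn and nothing can be borrowed.
        """
        node = self
        while node is not None:
            if node.remaining >= nbytes:
                node.remaining -= nbytes
                return True
            node = node.parent  # try to borrow from the parent class
        return False


if __name__ == "__main__":
    link = TrafficClass("link", 1000)
    video = TrafficClass("video", 500, parent=link)
    print(video.charge(400))  # covered by the class's own allocation
    print(video.charge(300))  # borrowed from the parent
    print(video.charge(800))  # overdrawn: suspend the class or drop the packet
```

When charge returns False, the caller would apply the predefined overdraw handling mentioned in the text, such as suspending the class for a while or dropping the packet.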
The aItPm router will provide a testbed for evaluating these and
other packet scheduling algorithms in a gigabit environment.
3.4 IP Multicast Forwarding
To support IP multicast forwarding, a router must be able to take a
single incoming packet and send it out multiple outgoing ports. In
conventional, bus-based router architectures, this is an expensive
operation. Typically, a CPU must make multiple copies of the packet,
which incurs a large number of bus and memory cycles. It also
incurs undesirable delay: the last copy of a packet is delayed by the
time it takes to make all the other copies. The aItPm router avoids
the cycle-cost and delay of CPU-based packet copying by exploiting
the cell-replication capability of the aItPm ATM backplane to achieve
high-performance IP multicast forwarding.
In particular, when the first of a stream of IP multicast packets
is received at an IPPE, a multicast route lookup is performed. This
yields an ATM VCI that was previously configured using the IGMP and
multicast routing protocols and locally bound to the given multicast
address. The IPPE then modifies the virtual circuit table in the ATM
switch’s input port processor to forward cells to the proper
multicast VCI. After that, the IPPE hands off to the APIC as before,
allowing the remainder of the burst to be processed at the cell
level.
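The first-packet path described above can be sketched as a toy model. The class names `InputPortProcessor` and `MulticastIPPE`, the dictionary-based tables, and the example group address and VCI numbers are all hypothetical; the real virtual circuit table lives in switch hardware.

```python
from typing import Dict, Optional


class InputPortProcessor:
    """Toy per-port virtual circuit table: incoming VCI -> outgoing VCI."""

    def __init__(self) -> None:
        self.vc_table: Dict[int, int] = {}

    def switch_cell(self, in_vci: int) -> Optional[int]:
        # None means there is no cut-through entry yet, so the packet
        # has to be handed to the IPPE for software processing.
        return self.vc_table.get(in_vci)


class MulticastIPPE:
    def __init__(self, port: InputPortProcessor,
                 group_to_vci: Dict[str, int]) -> None:
        self.port = port
        # Assumed to be filled in ahead of time by IGMP and the
        # multicast routing protocols.
        self.group_to_vci = group_to_vci

    def first_packet(self, in_vci: int, group: str) -> Optional[int]:
        """Handle the first packet of a burst addressed to `group`."""
        mcast_vci = self.group_to_vci.get(group)
        if mcast_vci is None:
            return None  # no multicast VC is bound to this group
        # Rebind the input VC so later cells are replicated by the switch.
        self.port.vc_table[in_vci] = mcast_vci
        return mcast_vci


port = InputPortProcessor()
ippe = MulticastIPPE(port, {"224.2.0.1": 77})
ippe.first_packet(in_vci=5, group="224.2.0.1")
print(port.switch_cell(5))  # later cells on VCI 5 now follow multicast VCI 77
```

Only the first packet touches the lookup table in software; every later cell of the burst is resolved by the table entry alone.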
3.5 Performance Issues
We estimate that the combination of the APIC and IPPE will achieve
sustained packet processing rates of between 100 and 200 thousand
packets per second. This is based on the assumption that the IPPE’s
processor executes instructions at a sustained rate of 40 MIPS while
processing packets and that between 200 and 400 instructions are
sufficient for each packet. This is fast enough to accommodate a
fully loaded 1.2 Gb/s link with average packet sizes of between 750
and 1500 bytes. This implies that traffic loads consisting mostly of
small packets can be handled with reasonable efficiency and that
IPPEs can accommodate the traffic flows they are required to process
during the start of a burst or a congestion build-up.
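The arithmetic behind these estimates can be checked with a short calculation. The function names are illustrative; the inputs (40 MIPS, 200 to 400 instructions per packet, a 1.2 Gb/s link) are the figures assumed in the text.

```python
def packets_per_second(mips: float, instructions_per_packet: int) -> float:
    """Packet rate a processor sustains at a given instruction budget."""
    return mips * 1e6 / instructions_per_packet


def min_avg_packet_bytes(link_bps: float, pps: float) -> float:
    """Smallest average packet size at which `pps` keeps a link fully loaded."""
    return link_bps / pps / 8


slow = packets_per_second(40, 400)   # 400 instructions per packet
fast = packets_per_second(40, 200)   # 200 instructions per packet
print(slow, fast)
print(min_avg_packet_bytes(1.2e9, fast), min_avg_packet_bytes(1.2e9, slow))
```

The two packet rates are the 100 and 200 thousand packets per second quoted above, and the corresponding average packet sizes are 750 and 1500 bytes.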
For long bursts, it’s important to understand the control delay
incurred at the start of a burst while packets are waiting for the
local virtual circuit to be established. We estimate that this
control delay will be less than 100 µs per router. Therefore, an
end-to-end path through ten routers will involve about 1 ms of
control delay. For wide-area network applications, this added delay
is not a significant penalty. Notice, though, that as a burst
progresses through the network, each router adds some additional
delay, meaning that the first portion of the burst becomes
“compressed” as it moves through the network. Thus, the last router
in a ten-router path will process about ten times as many packets in
software as does the first router in the path. This suggests that
establishment of the local paths is clearly beneficial only for
bursts that use the virtual circuit for a time duration of at least
about 10 ms. This need not be continuous. As long as the virtual
circuit is used for at least 10 ms before being reallocated to some
other traffic stream, we benefit from establishment of the local
virtual circuit. This implies that we should hold virtual circuits
for as long as possible, once they have been allocated to a
particular traffic stream. So long as there are sufficient virtual
circuits available, this does not present any difficulty.
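A rough model makes the accumulation of control delay concrete. The sketch below assumes the 100 µs upper bound at every router and, as a simplification that the text does not state explicitly, that the number of packets a router handles in software is proportional to the control delay accumulated up to that hop. The function names and the 100,000 packets per second arrival rate are illustrative.

```python
from typing import List


def cumulative_control_delay_us(hops: int,
                                per_router_us: float = 100.0) -> List[float]:
    """Control delay accumulated after each router on the path."""
    return [per_router_us * k for k in range(1, hops + 1)]


def packets_in_software(rate_pps: float, delay_us: float) -> float:
    """Packets arriving during `delay_us` (simplified model, see lead-in)."""
    return rate_pps * delay_us / 1e6


delays = cumulative_control_delay_us(10)
print(delays[-1] / 1000.0)                    # end-to-end control delay in ms
print(packets_in_software(100_000, delays[0]),
      packets_in_software(100_000, delays[-1]))
```

Under these assumptions the tenth router sees ten times the accumulated delay of the first, and so ten times as many software-processed packets.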
4 Other Protocols
To operate correctly, an aItPm router must implement several
protocols besides IP. These include an ATM-PPP protocol, the Internet
Control Message Protocol (ICMP), the Internet routing protocol(s),
telnet, a flow setup protocol such as RSVP, and others. In addition,
next-generation routers must support the IP version 6 protocol. The
following paragraphs summarize operation of these protocols on
aItPm.
ATM PPP Protocol. The preceding description implies the existence
of a point-to-point protocol for carrying IP packets between aItPm
routers. This protocol defines the use of AAL5 for carrying IP
packets and defines the “reset” control cell used to force a PVC into
the inactive state. It also defines the procedures for establishing
these PVCs in the first place.
Internet Control Message Protocol. ICMP is used to send error and
control messages from a router back to the originator of the packet
whose processing led to the ICMP message. In the case of an aItPm
router, the ICMP messages are generated by either an input or output
IPPE (the input IPPE is the one that received the original packet and
the output IPPE is the one to transmit the packet). If a message is
generated at an input IPPE, it may be sent on the same link and does
not need to go through the switch. Some ICMP messages may be
generated at an output IPPE by packet scheduling and congestion
detection algorithms. These messages are sent from the output IPPE to
the input IPPE through the switch so that the ICMP message may be
returned to the source of the packet.
Unicast and multicast routing protocols. One of the IPPEs, called
the “route server IPPE” (RS-IPPE), is responsible for running
Internet routing protocol(s) that maintain an up-to-date routing
table for the entire router. Routing updates received at all IPPEs
are sent to the RS-IPPE. In response to these updates, the RS-IPPE
recomputes the routing table and uses a pre-established multicast
virtual circuit to broadcast a copy of the table (or only
modifications) to other IPPEs within the aItPm. Thus, each IPPE
independently makes the routing decision for an incoming IP packet.
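The division of labor between the RS-IPPE and the other IPPEs can be sketched as follows. This is an illustration under stated assumptions, not the router's software: the class names are hypothetical, the routing computation itself is reduced to storing a next hop per prefix, lookups are by exact prefix rather than longest match, and a plain loop stands in for the pre-established multicast virtual circuit.

```python
from typing import Dict, List, Optional

RouteTable = Dict[str, str]  # destination prefix -> next hop (illustrative)


class ForwardingIPPE:
    """Holds a local copy of the table and routes without asking the server."""

    def __init__(self) -> None:
        self.table: RouteTable = {}

    def apply(self, changes: Dict[str, Optional[str]]) -> None:
        for prefix, next_hop in changes.items():
            if next_hop is None:
                self.table.pop(prefix, None)   # route withdrawn
            else:
                self.table[prefix] = next_hop

    def route(self, prefix: str) -> Optional[str]:
        return self.table.get(prefix)


class RouteServerIPPE:
    def __init__(self, ippes: List[ForwardingIPPE]) -> None:
        self.table: RouteTable = {}
        self.ippes = ippes

    def on_routing_update(self, prefix: str, next_hop: Optional[str]) -> bool:
        """Apply an update and push only the modification to every IPPE."""
        if self.table.get(prefix) == next_hop:
            return False                       # nothing changed, nothing sent
        if next_hop is None:
            del self.table[prefix]
        else:
            self.table[prefix] = next_hop
        for ippe in self.ippes:                # stand-in for the multicast VC
            ippe.apply({prefix: next_hop})
        return True


a, b = ForwardingIPPE(), ForwardingIPPE()
server = RouteServerIPPE([a, b])
server.on_routing_update("10.0.0.0/8", "port3")
print(a.route("10.0.0.0/8"), b.route("10.0.0.0/8"))
```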
Another IPPE, called the “multicast server IPPE” (MS-IPPE), is
responsible for maintaining multicast group information within the
aItPm. Efficient multicast support is essential for higher-level
services such as Multicast Backbone (MBONE), which provides
one-to-many and many-to-many network delivery services for
applications such as video-conferencing and network audio. Creation
of, and modifications to, multicast groups in an aItPm router happen
in response to IGMP messages and multicast routing updates. The
MS-IPPE is also responsible for creating and maintaining multicast
VCs within the aItPm for active IP multicast groups. Once these
multicast VCs are set up, routing of an input multicast IP packet is
done independently by an IPPE without having to go through the
MS-IPPE.
Telnet. The Internet model allows each port of the router (and even
each IPPE in the aItPm) to have a unique IP address. However, to hide
the internal complexity of the router and to conserve IP addresses,
we allow only one IP address per router. This means that IPPEs are
not addressable individually. Therefore, we plan to have one of the
IPPEs per aItPm, called the control IPPE (C-IPPE), run the telnet and
other daemons. Thus, telnet’ing to an aItPm router involves
connecting to its C-IPPE. Of course, the C-IPPE has VC connections to
all other IPPEs within the aItPm router. Thus, a remote user with
appropriate access rights may access and control any of these IPPEs.
Flow Setup and Reservation Protocols. These functions are handled
similarly to the routing protocols. All RSVP messages are sent to a
designated IPPE (possibly the same as the RS-IPPE). This IPPE
performs the admission control function on flow setup requests, and
if a request may be admitted, the IPPE makes an appropriate
reservation and informs the concerned IPPEs on the input and output
links.
An alternate arrangement could be to process RSVP messages at the
appropriate output IPPEs. However, this could lead to synchronization
problems in the case of a multicast flow if some output links can
accommodate the flow and some cannot.
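One way to picture why a single designated IPPE sidesteps the synchronization problem is an all-or-nothing check across every output link of a multicast flow. The sketch below is an assumption about how such a check might look, not the paper's admission control algorithm; `AdmissionController`, the link names, and the simple peak-rate test are hypothetical.

```python
from typing import Dict, Iterable


class AdmissionController:
    """Runs on the designated IPPE and tracks free bandwidth per output link."""

    def __init__(self, capacity_bps: Dict[str, float]) -> None:
        self.free_bps = dict(capacity_bps)

    def admit(self, output_links: Iterable[str], rate_bps: float) -> bool:
        """Reserve `rate_bps` on every link, or on none of them."""
        links = set(output_links)
        # Check all links first so a partial reservation is never made.
        if any(self.free_bps.get(link, 0.0) < rate_bps for link in links):
            return False
        for link in links:
            self.free_bps[link] -= rate_bps
        return True


ac = AdmissionController({"out1": 100e6, "out2": 10e6})
print(ac.admit(["out1", "out2"], 20e6))  # out2 cannot carry the flow
print(ac.admit(["out1"], 20e6))          # a unicast flow on out1 fits
```

Because one process sees the state of all links, a multicast request that fits on only some of them is rejected without leaving stale reservations behind.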
IP version 6. IPv6 is designed as the next-generation Internet
protocol. Support for IPv6 in aItPm is straightforward since IPv6
packet formats have been designed to simplify packet processing and
help with QoS guarantees, as explained in the following paragraphs:
• The presence of a Flow Label in the IPv6 packet header simplifies
  per-packet processing. For example, flow labels may be used to do
  hash table lookup for packet routing. Likewise, they may be mapped
  directly onto VCs to allow hardware-based cut-through routing and
  provide Quality of Service (QoS) support at the IP level.
• IPv6 eliminates computing the header length and comparing it with
  a minimum. In IPv4, it is necessary to check each datagram header
  to see if the header length was set to a value greater than or
  equal to the minimum.
• IPv6 also eliminates the header checksum. It assumes that each
  link-level protocol (for example AAL5) will provide hop-by-hop
  error detection using a CRC or something comparable.
• IPv6 eliminates fragmentation at a router. If a router cannot
  forward an IP datagram because the outgoing interface supports MTU
  sizes less than the packet size, the router does not fragment the
  packet. Instead, the router drops the packet and sends an ICMP
  error message back to the source.
• IPv6 eliminates IP options processing for the common case. If the
  destination address does not match any local address, then IP
  option headers do not have to be examined (except for the unusual
  hop-by-hop options header).
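The flow-label lookup mentioned in the first item can be sketched with an ordinary hash table. The class `FlowTable` and its methods are hypothetical, and keying on the pair (source address, flow label) is an assumption based on flow labels being chosen by the source; a label of zero is treated as "no flow", as in the IPv6 specification.

```python
from typing import Dict, Optional, Tuple

FlowKey = Tuple[str, int]  # (source address, flow label)


class FlowTable:
    """Maps labeled IPv6 flows directly onto ATM virtual circuits."""

    def __init__(self) -> None:
        self._vci_by_flow: Dict[FlowKey, int] = {}

    def bind(self, src: str, flow_label: int, vci: int) -> None:
        self._vci_by_flow[(src, flow_label)] = vci

    def lookup(self, src: str, flow_label: int) -> Optional[int]:
        """Return the VCI for cut-through, or None to fall back to routing."""
        if flow_label == 0:
            return None  # unlabeled packet: use the normal route lookup
        return self._vci_by_flow.get((src, flow_label))


flows = FlowTable()
flows.bind("2001:db8::1", 0x1234, vci=42)
print(flows.lookup("2001:db8::1", 0x1234))  # bound flow maps to its VC
print(flows.lookup("2001:db8::1", 0))       # unlabeled traffic is not mapped
```

A single dictionary probe replaces the longest-prefix match for packets that belong to a known flow, which is the saving the flow label is meant to provide.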
5 Related Work
IP routers have been produced commercially for many years now. The
classical router architecture consists of a single general-purpose
processor with multiple hardware interfaces to point-to-point links
or shared access subnetworks. Over the last decade, commercial router
vendors have migrated to architectures in which increasing amounts of
processing are placed on the interface cards and a high-bandwidth
interconnect (usually a high-speed bus or crossbar) provides
connectivity among the different interface cards. A notable example
of this style of architecture is a recent product by NetStar [15].
The NetStar architecture comprises a central crossbar with serial
interfaces operating at about 1 Gb/s per port together with interface
cards containing two programmable processors and custom hardware for
buffering and selected IP functions.
The aItPm router differs from this type of architecture in two
fundamental ways. First, because it is based on a scalable switch
fabric with optimal cost/performance, rather than a crossbar, the
aItPm architecture scales up economically to configurations with
thousands of high-speed ports. This allows large networks to be
constructed far more economically than is possible by composing many
small switches. Networks constructed from small switches require
substantially more interface cards, which are a major cost component.
The construction of large networks from large switches also leads to
better performance, since it minimizes the number of hops required.
Most of the commercial architectures have no way to support multicast
at the hardware level. Therefore, the software bears the entire load,
which significantly restricts the amount of multicast traffic that
may be supported.
A second fundamental difference between the aItPm architecture and
conventional routers is its ability to use the ATM core in a dynamic
fashion to allow the vast majority of IP packets to be routed
directly in hardware, without the requirement for software processing
at every hop. While the aItPm architecture permits all processing to
be done in software, the potential for using cut-through packet
handling to optimize the normal case (while software processing is
triggered for exceptions) raises the possibility of getting
substantially higher data throughputs for a given amount of software
processing capacity.
Another less basic, but still important, distinction between our
approach and conventional routing architectures is that we use
general-purpose components in the aItPm. We expect these components
to become commodity ATM parts over the next several years. The APIC
is a general-purpose host interface chip, and the extensions required
for the aItPm router require only that it be able to identify packet
boundaries, using AAL5. Commercial router vendors are generally
moving toward embedding portions of the actual IP processing in
custom hardware. We feel this approach will limit their ability to
keep pace with future protocol enhancements and technology advances.
(A good example of the kind of custom hardware solution we feel is
inappropriate is described in [21], which embeds some of the IP
protocol processing in custom integrated circuits.)
Our approach also differs significantly from simply implementing an
IP overlay network on top of permanent or semi-permanent virtual
circuits provided by an underlying ATM network. The fundamental
difference is again that in the aItPm router, the IP layer can
directly establish virtual circuits on the fly for individual data
bursts (without end-to-end processing). This enables it to exploit
the hardware switching advantages of ATM to dramatically reduce the
amount of software processing that is required. Note that this can be
accomplished without the performance penalty and loss of flexibility
associated with end-to-end virtual circuit setup. Moreover, the close
physical integration of the IP processing with the ATM switching
leads to significant implementation economies. The compatibility of
the gigabit switch port processors and the APIC, resulting from their
common use of the Utopia interface standard, makes this physical
integration particularly beneficial.
A similar gateway architecture, with an ATM fabric at the core for
high performance and scalability, was also proposed in [17]. However,
[17] argued for all per-packet processing to be implemented in
hardware, which is neither economical, flexible, nor necessary,
especially if the software per-packet processing can be made
efficient enough to support the necessary data rates.
Finally, the aItPm router differs from commercial efforts in that it
provides a flexible testbed for experimentation with the latest
versions of the IP protocols at gigabit rates. As such, it is an
invaluable source of real experimental results and a tool for
protocol researchers that may be tailored to suit evolving needs.
6 Concluding Remarks
This paper describes a strategy for integrating IP with ATM to
achieve scalable, gigabit internets. We seek to go well beyond the
conventional approach of implementing IP over ATM in a strictly
layered fashion. We believe that by allowing the IP processing layer
to directly control and manipulate an underlying ATM switch core, IP
can directly benefit from the hardware processing efficiencies of ATM
switching technology, or looking at it from the other perspective,
ATM can enjoy the inherent flexibility and adaptability that are
among IP’s greatest strengths.
Our work uses ATM technology to build scalable high-performance
gigabit IP routers and tightly couples the IP and ATM layers to
obtain maximum advantage from each. We plan to use the proposed aItPm
router to implement a variety of IP protocols and control algorithms,
including IP version 4, the proposed IP version 6, and various packet
scheduling and congestion control algorithms to support both best
effort and continuous media traffic. The software implementations
will allow us to experiment with and fine tune the protocols and
algorithms that form the core of the next generation IP in the
context of a gigabit environment. The underlying multi-CPU embedded
system will ensure that there are enough CPU and memory cycles to
perform all IP packet processing at gigabit rates.
We believe that the aItPm architecture will not only lead to a
scalable high-performance gigabit IP router technology, but will also
demonstrate that IP and ATM technologies can be mutually supportive.
In addition, the architectural approach developed here, in which
powerful general-purpose processors are closely coupled to a
high-speed and scalable switching system, offers possibilities that
go beyond IP routing. The integration of IP with ATM may be viewed as
just a first example of a new form of ‘integrated layer processing’
that blurs the boundaries between different network layers and
between the network and application processing layers. While in the
past, the conventional wisdom has been to keep these layers strictly
separated, there has been a growing appreciation of the potential
advantages that may be obtained by a disciplined and carefully
structured blurring of the boundaries in the workstations and servers
that use the network. We expect that similar advantages can be
obtained by carrying this process into the network core, as well.
Acknowledgments
The authors want to thank Lixia Zhang and Steve Deering of Xerox
PARC for technical discussions leading up to this paper, particularly
with respect to new Internet packet scheduling, congestion control,
and multicast mechanisms. Thanks also to Hari Adiseshu of Washington
University for assistance in clarifying certain details of IP version
6 and helping with the figures.
References
[1] Asthana, A., C. Delph, H. Jagadish, and P. Krzyzanowski.
“Towards a Gigabit IP Router,” Journal of High Speed Networks,
Vol. 1, No. 4, pp. 281-288, 1992.
[2] ATM Forum. “UTOPIA, An ATM PHY Interface Specification,”
Level 1, Version 2.01, March 21, 1994.
[3] Clark, D., S. Shenker, and L. Zhang. “Supporting Real-Time
Applications in an Integrated Services Packet Network:
Architecture and Mechanism,” Proceedings of SIGCOMM ’92,
September 1992.
[4] Dittia, Zubin, Jerome R. Cox, Jr., and Guru Parulkar. “Design
of the APIC: A High Performance ATM Host-Network Interface Chip,”
Proceedings of IEEE INFOCOM ’95.
[5] Dittia, Zubin, Jerome R. Cox, Jr., and Guru Parulkar. “Catching
Up With the Networks: Host I/O at Gigabit Rates,” Technical
Report WUCS-94-11, Department of Computer Science, Washington
University in St. Louis, 1994.
[6] Dittia, Zubin, Jerome R. Cox, Jr., and Guru Parulkar. “Using an
ATM Interconnect as a High Performance I/O Backplane,”
Proceedings of the Hot Interconnects Symposium, August 1994.
[7] Dittia, Zubin, Andy Fingerhut, and Jonathan Turner. “A Gigabit
Local ATM Testbed for Multimedia Applications: System
Architecture Document for Gigabit Switching Technology,” Applied
Research Lab, Working Note 94-11.
[8] Ferrari, D., and D. Verma. “A Scheme for Real-Time Channel
Establishment in Wide-Area Networks,” IEEE JSAC, Vol. 8, No. 4,
pp. 368-379, April 1990.
[9] Floyd, Sally. “Talk given at Maryland High Speed Workshop,”
March 1992.
[10] Golestani, S. J. “A Stop and Go Queueing Framework for
Congestion Management,” Proceedings of SIGCOMM ’90, pp. 8-18,
1990.
[11] Golestani, S. J. “Duration-Limited Statistical Multiplexing of
Delay Sensitive Traffic in Packet Networks,” Proceedings of
INFOCOM ’91, 1991.
[12] Hyman, J., and A. Lazar. “MARS: The Magnet II Real-Time
Scheduling Algorithm,” Proceedings of SIGCOMM ’91, pp. 285-293,
1991.
[13] Hyman, J., A. Lazar, and G. Pacifici. “Real-Time Scheduling
with Quality of Service Constraints,” IEEE JSAC, Vol. 9, No. 9,
pp. 1052-1063, September 1991.
[14] Kalmanek, C., H. Kanakia, and S. Keshav. “Rate Controlled
Servers for Very High-Speed Networks,” Proceedings of
GlobeCom ’90, pp. 300.3.1-300.3.9, 1990.
[15] NetStar, Inc. GigaRouter System Description. Publication
SPD000001, July 1994, Revision 2.
[16] Partridge, C. “Gigabit Networking,” Addison Wesley, 1993.
[17] Parulkar, Guru. “The Next Generation of Internetworking,”
ACM SIGCOMM Computer Communication Review, January 1990.
[18] “SPARC MBUS Interface Specification: Revision 1.2,” SPARC
International, April 1991.
[19] Turner, Jonathan S. “Progress Toward Optimal Nonblocking
Multipoint Virtual Circuit Switching Networks,” Proceedings of
the Thirty-First Annual Allerton Conference on Communication,
Control, and Computing, September 1993, pp. 760-769.
[20] Turner, Jonathan S. “An Optimal Nonblocking Multicast Virtual
Circuit Switch,” Proceedings of Infocom, June 1994, pp. 298-305.
[21] Tantawy, Ahmed, Odysseas Koufopavlou, Martina Zitterbart, and
Joseph Abler. “On the Design of a Multigigabit IP Router,”
Journal of High Speed Networks, Vol. 3, No. 3, 1994.
[22] Verma, D., H. Zhang, and D. Ferrari. “Delay Jitter Control for
Real-Time Communication in a Packet Switching Network,”
Proceedings of TriCom ’91, pp. 35-43, 1991.