A General Method for Finding the Most Economical Distributed Router Architecture by Kencl, Lukas & Radunovic, Bozidar
General Method for Finding the Most Economical
Distributed Router Architecture
Lukas Kencl
IBM Research, Zurich Research Laboratory
Saumerstrasse 4, CH-8803 Ruschlikon, Switzerland
lke@zurich.ibm.com
Bozidar Radunovic
Department of Communication Systems (DSC)
Ecole Polytechnique Federale de Lausanne (EPFL)
Lausanne, Switzerland
bozidar.radunovic@ep.ch
Keywords: Router architecture, distributed systems,
packet processing, queuing, cost optimization.
Abstract
In this work we present a novel method to determine
the optimal parameters of a router architecture when
certain router performance constraints are given. The
total financial expense, or cost, is the optimality criterion.
We introduce a general, essentially distributed,
router architecture model, consisting of locally or re-
motely located forwarding engines or processing units
gathered around a switch of variable speed. Given the
following constraints: number of inputs, maximum line
interface bandwidth, and maximum packet delay in a
router, the presented method finds the optimal amount
and distribution of processing power among the various
available processing units and the optimal parameters
for the switching element. The optimization employs an
estimated market-based cost function per element and
finds the most economical system solution.
The results show that the optimal solutions gather
around two extreme points of the solution space, dis-
tinguishable by the distribution of the processing power
mass and corresponding switch speed. We discuss when,
depending on the customer input, one or the other solu-
tion is appropriate.
1 INTRODUCTION
Latest developments in transmission technologies have
led to an enormous increase in the potential amount of
data transported over the links of a complex network
like the Internet. Such rapid evolution places significant
strain on the interconnecting equipment, primarily
routers, to scale with the pace of the transmission speed
increase. Recent works [1], [2] have provided a basis for
the new generation of interconnecting devices by presenting
the first gigabit and terabit router architectures.
These works have built on new developments in the areas
of switch architectures [3], [4] and fast lookup algorithms
[5], [6].
In order to eliminate the packet processing bottle-
neck, multiple processing units, known as forwarding
engines (FE), or, more sophisticated, network processors
(NP), are typically deployed in contemporary routers. A
router system thus consists of multiple processing units
gathered around a switch element and packet processing
within such a system is essentially distributed. Various
architectures have emerged in the industry with respect
to the exact locations and capacities of the individual
router elements. With the ever-increasing number of
processing units and the total processing power present
in the system, more emphasis is being put on the most
economical use of these resources.
In principle, packet processing can either be carried
out directly at router inputs, using local forwarding en-
gines (LFE), or at remote master forwarding engines
(MFE), reachable through the switch element. Both of
these paradigms may be combined in one system. Given
the performance demands, a single MFE may not be
sufficient and thus MFEs are often grouped into pools
of parallel MFEs. The critical question is: what are the
best capacities, locations and schemes of cooperation of
all the elements (switch, LFEs, MFEs) in order to satisfy
given system performance demands, and what is the
most economical alternative to satisfy those demands?
[Figure 1: Distributed router architecture (LC: line card,
LFE: local forwarding engine, MFE: master forwarding
engine, CP: control point).]
[Figure 2: Parallel router architecture (LC: line card,
MFE: master forwarding engine, CP: control point).]
Two of the possible router architectures belonging to
this space (albeit without considering the switch ele-
ment), fully distributed and parallel, were examined and
compared in [7]. In the fully distributed case (Fig. 1),
each line card (LC) has a dedicated LFE attached. When
a packet arrives at an LC, its LFE searches for the ap-
propriate route. If the route is found, the packet is im-
mediately forwarded through the switch to the output
LC. If an LFE is not able to determine the route (e.g.
contains only a part of the routing table), or does not
have enough processing power to handle all the arriving
packets, it sends the packet header to the MFE, which
contains a copy of the entire routing table and therefore
is able to find the appropriate route. In general, as the
MFE stores the entire routing table and should be able
to assist all LCs, it is considerably more powerful and
expensive than an LFE.
In the case of a parallel router architecture (Fig. 2),
the router contains a pool of several high-performance
MFEs that handle the router's entire workload. Any
MFE can take on a new request as soon as it has pro-
cessed the previous one. As long as the switch can handle
the additional traffic, the total system processing power
is considerably higher than in the case of a fully dis-
tributed system, but also the total system cost rises ac-
cordingly.
The content of the routing table is managed by the
router control point (CP), which often resides in the
same hardware unit as the MFE. The CP uploads the
table to the FEs. As the CP is a processor dedicated to
the control plane rather than to the data plane within
the router, it is not considered in the optimization, neither
in [7], nor in this paper.
In [7], a simple framework for assessing the cost vs.
performance ratio was presented. The cost and performance
differences between a fully distributed and a
parallel architecture, as well as the influence of various
system parameters on the ratio, were studied. The optimizations
were carried out by constraining the maximal
packet-processing time and the maximal FE processing
power while minimizing the total cost of the system.
The results of the optimization in [7] in the case of
the distributed architecture indicate that as the cost ratio
between MFE and LFE increases, it is more efficient
to use the fully distributed architecture rather than a
centralized one without LFEs. Similar behavior occurs
when the fraction of packets an LFE unsuccessfully pro-
cesses decreases. The parallel architecture is more ex-
pensive than the distributed one for the same workload,
but it is scalable and thus able to handle a much higher
workload.
In this paper, a general model of an essentially dis-
tributed router architecture is optimized, spanning a
large set of hybrid architectures. Our model (see Fig.
3) contains a fixed number of LFEs (one per router input)
of variable processing power (including null processing
power, meaning that the LFEs are absent), a variable
number of MFEs with variable processing power,
and a switch of variable port speed. This work extends
the model presented in [7] by a switch element and an
LFE queuing and queue overload model. Furthermore,
we present a hybrid, more general router architecture
model, encompassing a large space of possible, in essence
distributed, router architectures, including the central-
ized, fully distributed and parallel cases presented in [7].
The objective of the method is to serve as a general
means for optimizing a router architecture with a given
set of constraints. The constraints, which can be in-
terpreted as customer input, are maximum line inter-
face bandwidth, number of router LCs, and maximum
packet processing delay within the system. To carry
out the optimization on a realistic model, the system
model contains further constraints, which can be inter-
preted as technological limits, such as the maximum
processing power of the FEs or the maximum switch
port speed. The full set of constraints defines a space
of feasible solutions over which the optimization is car-
ried out. The optimization cost function is an aggre-
gate of estimated market-based costs of the individual
elements. The cost function takes the technological pa-
rameters of each element (FE processing power, number
of switch ports, and switch port speed) as input. The
optimization output consists of variables describing the
optimal architecture: the processing power of the LFEs
and MFEs, number of MFEs, switch port speed, and
the distribution of packet processing among LFEs and
MFEs.
A second objective of this work is to reach some gen-
eral conclusions about the most economical router archi-
tecture for a given set of constraints.
The paper is organized as follows: Section 2 presents
the router architecture model, and, in Section 3, the cost
optimization problem is described in detail. In Section
4, results of the most interesting optimizations are dis-
cussed; in Section 5, some further possible extensions to
the model are discussed, and, finally, Section 6 contains
some concluding remarks.
2 GENERAL DISTRIBUTED
ROUTER ARCHITECTURE
2.1 Router Model
We consider a router having k LCs, with an LFE
attached at every LC (see Fig. 3). A switch element
interconnects all the LFEs and the pool of m parallel
MFEs. All the possible sequences the processing of a
packet may take (at an LFE, at an MFE, or at both processing
units) are accounted for. A fraction $r \in [0, 1]$ of
the incoming traffic is processed locally at the LFEs; the
fraction $1 - r$ is diverted directly to the MFEs without
being enqueued at the LFEs (see Fig. 4). Thus, if $r = 0$,
LFEs are not used at all and the router consists only of
a pool of parallel MFEs and a switch. When $r > 0$ the
LFE may not be able to handle all the traffic destined
for it locally, for various reasons, such as being overloaded
(see Section 2.2). Such traffic is sent to the MFEs
as well, but only after passing through the LFE. Thus,
if $r = 1$, all the traffic is enqueued at the LFEs, yet a
fraction that the LFEs will not be able to process will
still subsequently be sent to the MFEs.
[Figure 3: LFEs, switch and MFEs within the general
distributed router architecture model.]
[Figure 4: Model of the general distributed router architecture
with multiple LFEs and MFEs.]
Arriving traffic is modeled as a Poisson process, which
simplifies the analysis. Some form of bursty traffic is
typically observed in networks [8]. Bursty traffic would
require a different, worst-case analysis with respect to
the system capacity. Recent works [9] have suggested
that the aggregate arrival traffic on an uncongested Internet
link does tend to Poisson and therefore our assumption
may not be far from reality. The mean arrival
rate at each input LC is $\lambda_i$ packets per second (pps).
The total router load is thus $\lambda = k \lambda_i$. In the analysis
we consider all the links to be fully loaded. In reality,
workloads on different LCs are generally not uniform
and may vary significantly over time. This implies
that LFEs with a higher workload would forward more
packets to the MFEs for processing than LFEs with a
smaller workload, and one can imagine a feasible problem
solution where for some periods of time, individual
LFEs would be overloaded. We have experimented with
nonuniform workload distributions on different LCs and
the LFE overload, but the optimization results did not
differ significantly from the uniform model. Nonuniform
LC workloads are therefore not included in the model.
The LFE model, as presented in Section 2.2, is applicable
for the overload modeling; however, the optimization
never finds an overloaded LFE solution to be the optimal
one. With respect to the optimization, parameters
$k$ and $\lambda_i$ are a part of the optimization input, whereas
values of $r$ and $m$ are a part of the output.
2.2 LFE and MFE Model
The processing power of an MFE is $\mu_{MFE}$ pps, and
that of an LFE is $\mu_{LFE}$ pps. We use $\mu_{max}$ as a bound
on the maximum number of packets an FE can handle
per second. The possible scenarios of packet-processing
distribution among LFEs and MFEs are depicted in Fig.
4. The fraction of traffic arriving at an LFE is
$\lambda_{LFE} = r \lambda_i$. A fraction $\lambda_p = (1 - r) \lambda_i$ is pre-scheduled
directly for processing at the MFEs.
Regarding the packets sent for resolution through the
switch to the MFEs, we assume that it is only the packet
control information, i.e. the packet header, that travels
through the switch (as in [10]). The packet payload is
assumed to be buffered until a resolution of the packet
processing task arrives from the MFE pool, again, trav-
eling through the switch. Thus, in terms of number of
packets, the amount traveling through the switch is the
same, yet in terms of bits, only a fraction of the packet
size makes the trip to the MFEs. In this work, the pro-
cessing overhead and the memory size requirements for
the packet header detachment, the payload buffering,
and the packet reassembly are not considered.
Furthermore, we assume that a single LFE workload,
$\lambda_{LFE}$, can be greater than its processing power $\mu_{LFE}$.
The LFE queue size $n$ is introduced as a parameter to
model the LFE overload. When the LFE queue is full, a
packet cannot be processed by the LFE and is forwarded
to the MFE pool. Note that the overload traffic does not
have a Poisson distribution because the probability that
an LFE queue is full depends on the LFE load, but, for
the sake of simplicity, we approximate it with a Poisson
distribution as follows: the LFE queue is an M/M/1/n
queue. Thus, the probability of a packet arriving at a
full LFE queue is (see [11]):
$$ b'(\lambda_{LFE}) = P_n = \frac{1 - \rho}{1 - \rho^{n+1}} \, \rho^n, \qquad \rho = \frac{\lambda_{LFE}}{\mu_{LFE}}. \tag{1} $$
Thus, we assume the fraction of traffic
$\lambda_o = b' \lambda_{LFE} = r b' \lambda_i$
to be sent to the MFE pool due to LFE overload.
Furthermore, as in [7], even if a packet is being processed
by an LFE, with a fixed probability $b$ the LFE
will not be able to find the packet next hop, and the
packet is likewise forwarded to the MFE pool. Such
packets account for route table misses, for example when
the LFE acts only as a cache, storing a fraction of the
routing table. We denote such a fraction of traffic as
$\lambda_{rf} = r b (1 - b') \lambda_i$.
Finally, the fraction of traffic that actually does get
resolved at the LFE and is forwarded directly to the
outgoing switch port is
$\lambda_q = \lambda_i - (\lambda_p + \lambda_o + \lambda_{rf}) = r (1 - b)(1 - b') \lambda_i$.
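The traffic split defined above can be sketched numerically. The following Python fragment is only an illustration of Eq. (1) and the four flows $\lambda_p$, $\lambda_o$, $\lambda_{rf}$ and $\lambda_q$; the function names and the sample parameter values are assumptions made for this example and do not come from the paper.

```python
def lfe_blocking_probability(rho: float, n: int) -> float:
    """P_n of an M/M/1/n queue, Eq. (1); rho = lambda_LFE / mu_LFE."""
    if rho == 1.0:
        return 1.0 / (n + 1)  # limit of Eq. (1) as rho -> 1
    return (1.0 - rho) * rho**n / (1.0 - rho**(n + 1))


def traffic_split(lam_i: float, r: float, b: float, mu_lfe: float, n: int) -> dict:
    """Split the per-input load lam_i (pps) into the four flows of Section 2.2."""
    lam_lfe = r * lam_i
    # With no traffic enqueued at the LFE there is nothing to block.
    b_prime = lfe_blocking_probability(lam_lfe / mu_lfe, n) if lam_lfe > 0 else 0.0
    return {
        "p": (1 - r) * lam_i,                      # pre-scheduled to the MFEs
        "o": r * b_prime * lam_i,                  # LFE queue overload
        "rf": r * b * (1 - b_prime) * lam_i,       # LFE resolution failure
        "q": r * (1 - b) * (1 - b_prime) * lam_i,  # resolved at the LFE
    }


# Illustrative values only: 1 Mpps per input, all traffic enqueued locally.
flows = traffic_split(lam_i=1e6, r=1.0, b=0.2, mu_lfe=2e6, n=100)
print(flows)
```

By construction the four flows add up to $\lambda_i$, which gives a quick consistency check on the definitions.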
In Fig. 4 we observe that there are three possibilities
for a packet to be queued. Either a packet is queued
at an LFE and waits for time $T_1$, it is forwarded to the
MFE and waits for $T_2$, or it is queued at both the LFE
and the MFE owing to the LFE resolution failure.
Note that given the various paths the packet processing
in the router can take, packets belonging to a particular
flow may be reordered, which is highly undesirable
[2]. In the interest of simplicity, we do not consider the
additional processing overhead required to prevent re-
ordering in this paper.
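LFE processing time. The average number of packets
in a processor, the average workload arriving at the
LFE queue, and the average LFE response time are
$$ N(\lambda_{LFE}) = \rho \, \frac{n \rho^{n+1} - (n+1) \rho^{n} + 1}{\rho^{n+2} - \rho^{n+1} - \rho + 1}, \tag{2} $$
$$ \lambda_a = \lambda_{LFE} (1 - P_n) = \lambda_{LFE} \, \frac{1 - \rho^{n}}{1 - \rho^{n+1}}, \tag{3} $$
$$ W(\lambda_{LFE}) = \frac{N(\lambda_{LFE})}{\lambda_a} \tag{4} $$
$$ = \frac{(1 - \rho^{n+1}) \, (n \rho^{n+1} - (n+1) \rho^{n} + 1)}{\mu_{LFE} \, (1 - \rho^{n}) \, (\rho^{n+2} - \rho^{n+1} - \rho + 1)}. \tag{5} $$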
Observing the behavior of a saturated LFE, we see from
Eq. (5) that for higher n, the waiting time is longer.
Therefore, a router with long LFE queues would be pe-
nalized with respect to the packet delay time compared
to an equivalent router with smaller queues. On the
other hand, a router with extremely small LFE queues
(e.g. $n = 1$) would have frequent queue overflows even on
the nonsaturated LFEs, which would again penalize its
performance. The queue length $n$ thus has to be chosen
carefully in order to achieve optimal performance. Ideally,
$n$ should be included in the optimization of the system
parameters. Experimentally, though, we have found
that changes in $n$ do not have a very significant influence
on the optimization in comparison to other factors, especially
as the optimization never finds an overloaded LFE
to be the optimal solution. Thus, to simplify the analysis,
we use fixed $n$ values. Note, however, that in reality a
queue of larger size would be necessary to handle bursty
traffic, which we do not account for in our model.
Time $T_1$ is simply the average response time $W(\lambda_{LFE})$.
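As a minimal sketch, Eq. (5) can be evaluated directly to see the effect of the queue length $n$ on $T_1$. The function name and the numbers below are assumptions chosen for illustration; the special case $\rho = 1$ uses the limit of Eq. (5), and values of $\rho$ very close to 1 may still be numerically inaccurate in this naive form.

```python
def lfe_response_time(lam_lfe: float, mu_lfe: float, n: int) -> float:
    """Average LFE response time W(lambda_LFE) of an M/M/1/n queue, Eq. (5)."""
    rho = lam_lfe / mu_lfe
    if rho == 1.0:
        # Limit of Eq. (5) for rho -> 1: N = n/2 and lambda_a = lam * n/(n+1).
        return (n + 1) / (2.0 * mu_lfe)
    num = (1 - rho**(n + 1)) * (n * rho**(n + 1) - (n + 1) * rho**n + 1)
    den = mu_lfe * (1 - rho**n) * (rho**(n + 2) - rho**(n + 1) - rho + 1)
    return num / den


# Lightly loaded LFE with a long queue: close to the M/M/1 value 1/(mu - lambda).
print(lfe_response_time(1e6, 2e6, 100))

# Saturated LFE (rho = 1.5): a longer queue gives a longer response time.
print(lfe_response_time(1.5e6, 1e6, 10), lfe_response_time(1.5e6, 1e6, 100))
```

For $n = 1$ the expression reduces to $1/\mu_{LFE}$, since an admitted packet never waits behind another one.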
MFE pool processing time. The MFE pool queue
is, as in [7], a simple infinite M/M/m queue, with the
input workload representing the sum of nonprocessed
packets from the LFEs, together with the pre-scheduled
packets. As the part of workload sent for resolution to
the MFE represents a sum of Poisson processes, the sum
is a Poisson process as well. This workload and the
corresponding M/M/m queue waiting time are on average
[11]:
$$ \lambda_{sum} = k (\lambda_p + \lambda_o + \lambda_{rf}), \tag{6} $$
$$ T_2 = \frac{1}{\mu_{MFE}} + \frac{P_Q \, \rho}{\lambda_{sum} (1 - \rho)}, \tag{7} $$
where
$$ \rho = \frac{\lambda_{sum}}{m \mu_{MFE}}, \tag{8} $$
$$ P_0 = \frac{1}{\sum_{j=0}^{m-1} \frac{(m\rho)^j}{j!} + \frac{(m\rho)^m}{m! \, (1 - \rho)}}, \qquad P_Q = \frac{P_0 \, (m\rho)^m}{m! \, (1 - \rho)}. \tag{9} $$
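Equations (6) to (9) can be checked with a short numerical sketch. The helper below is an illustrative assumption (its name and the sample inputs are not from the paper); it evaluates $T_2$ for a stable M/M/m pool and is only meant for moderate $m$, since it uses plain factorials.

```python
from math import factorial


def mfe_pool_time(lam_sum: float, mu_mfe: float, m: int) -> float:
    """Mean time T_2 in an M/M/m MFE pool, Eqs. (7)-(9)."""
    if lam_sum == 0:
        return 1.0 / mu_mfe  # no queueing, service time only
    rho = lam_sum / (m * mu_mfe)  # Eq. (8)
    if rho >= 1.0:
        raise ValueError("MFE pool is overloaded (rho >= 1)")
    a = m * rho
    tail = a**m / (factorial(m) * (1.0 - rho))
    p0 = 1.0 / (sum(a**j / factorial(j) for j in range(m)) + tail)
    p_q = p0 * tail  # probability that an arriving packet has to queue
    return 1.0 / mu_mfe + p_q * rho / (lam_sum * (1.0 - rho))  # Eq. (7)


# With m = 1 the result matches the M/M/1 value 1/(mu - lambda).
print(mfe_pool_time(1.0, 2.0, 1))

# Illustrative pool: 8 Mpps offered to two MFEs of 6 Mpps each.
print(mfe_pool_time(8e6, 6e6, 2))
```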
2.3 Switch Model
General input/output switch. The switch is characterized
by two parameters: the switch port speed $s$
and the number of input ports $k$. As $k$ is an input to the
optimization, $s$ is the only parameter optimized by the
method. The switch port speed is expressed in terms of
transmission time per fixed-size switch cell, that is,
in seconds per cell. The parameter $s$ is constrained from
below by the technological limit $s_{min}$, $s > s_{min}$, which
means that a fixed-size switch cell cannot be transmitted
at a switch port in less than $s_{min}$ seconds. To avoid
confusion with the intensity indicators in packets per second,
the intensity indicators in switch cells per second
are denoted with an asterisk, e.g. $\lambda^*$ instead of $\lambda$.
We include the switch in our model by introducing
the switch delay. In order to model the switch delay
time, we consider a formula for an input/output-queued
switch derived in [12] and [13]. In the interest of simplicity,
we assume that the switch has infinitely large output
queues, and that the number of inputs is large (i.e.
greater than 16, [12]). In the case of a lower number of
inputs, the performance of the switch is actually better
than the formula depicts (as described in [14]), owing
to lower contention, but in such a case the formula can
still be used as a rough upper bound. Note that the
head-of-line congestion at the input port considered in
[12] and [13] has been eliminated in the latest switch ar-
chitectures; however, in this work, we conform to this
model in order to obtain a simple analytical formula for
computing the switch delay.
A cell arriving at an input port first waits at the input
queue, then at the head of the input queue because of
head-of-line congestion, and, finally, at the switch output
port (see Fig. 3). We denote $W_i(\lambda^*_x)$ as the average
waiting time until the head of the input line is reached.
The term $\lambda^*_x$ denotes the intensity of the input traffic (in
switch cells per second). This is an M/G/1 queue with
service time $W_b(\lambda^*_x)$ equal to the time a cell spends waiting
at the head of the input queue owing to head-of-line
congestion. We denote $D_{out}(\lambda^*_y)$ as the average delay
from the instant a cell appears at the head of its input
queue until the instant it begins transmission at the output
port. As shown in [12], this is an M/D/1 queuing
system, as we assume the switching matrix speed to be
constant and the input traffic to be Poisson. The term
$\lambda^*_y$ is the intensity of the aggregated Poisson flow arriving
at the output port (in switch cells per second).
[Figure 5: Traffic flows within the switch; $\lambda^*_1$ represents
the amount of traffic leaving the local port (equal to the
amount arriving at the local port), and $\lambda^*_2$ represents the
amount of traffic arriving at the master port (equal to
the amount that leaves the master port).]
The switch traffic consists of $k$ equally loaded local
ports and one additional port, the master port, used
for transferring the packet headers to and from the
MFE pool (see Fig. 5). As we consider infinite output
queue sizes, waiting time due to head-of-line congestion
is $W_b(\lambda^*_x) = 0$ because a cell can be queued at the output
the moment it arrives at the head of an input queue. It
follows from [12, Eq. (2.9)] that
$$ W_i(\lambda^*_x) = \frac{\lambda^*_x s^2}{2 (1 - \lambda^*_x s)}, $$
where $\lambda^*_x$ denotes the intensity of the input traffic (in
switch cells per second) and $s$ denotes the switching matrix
processing time. From the analysis of an M/D/1
queue we have
$$ D_{out}(\lambda^*_y) = \frac{\lambda^*_y s^2}{2 (1 - \lambda^*_y s)}, $$
where $\lambda^*_y$ is the intensity of the aggregated Poisson flow
arriving at the output port (in switch cells per second).
In order to exploit the above switch model in conjunction
with the FE models, we need to transform the load
variables to a common unit: packets per second. As Internet
packets are variably sized, we need to use some
approximations to establish a relationship to the fixed-size
switch cells. Two kinds of packets travel through the
switch: entire packets, and the packet headers traveling
between the LFEs and the MFE pool (see Fig. 5). Let
$\bar{S}_P$ denote the average size of a packet in the router incoming
traffic, and $\bar{S}_H$ the average size of a header message
traveling in the switch. We assume that a packet header
corresponds in size to a single switch cell. We denote
$c$ as the ratio between the packet header cell size and
the average packet size, $c = \bar{S}_H / \bar{S}_P$. Thus an average
packet accounts for $1/c$ switch cells. Measurements and
analysis have shown that the Internet traffic distribution
is highly nonuniform. Empirical values in recent studies
show $\bar{S}_P \doteq 300$ B [15]. A typical switch cell size is 64
B, with the payload part being equal to 60 B. Thus, a
40 to 44 B packet header usually fits into a single such cell,
$\bar{S}_H = 64$ B and, consequently, $c \doteq 0.2$ is a good typical
value.
The transfer of packet headers among the processing
engines may be time consuming, especially when the
switch traffic is already high. In the interest of simplicity,
we assume that the two kinds of traffic traveling within
the switch, entire packets and packet headers, are not
distinguished in any way by the switch element. Thus,
the packet header communication overhead traffic saturates
the switch further. Note that with the latest switch
designs, full decoupling of the two kinds of traffic may
be achievable by using different switch priorities or separate
switch planes. The following paragraphs describe
the introduction of the overhead into the model.
A fraction of traffic is sent to the MFE pool for
processing. With reference to Fig. 5, we have
$\lambda^*_1 = \lambda_i / c + (\lambda_p + \lambda_o + \lambda_{rf})$ and
$\lambda^*_2 = k (\lambda_p + \lambda_o + \lambda_{rf})$, where $\lambda^*_1$
is the total switch outgoing traffic at an LFE port (comprising
both packet headers and entire packets) and $\lambda^*_2$ is
the switch outgoing traffic at the master port (comprising
packet headers only). The values of $\lambda^*_1$ and $\lambda^*_2$ are
expressed in switch cells per second, whereas $\lambda_i$, $\lambda_p$, $\lambda_o$,
and $\lambda_{rf}$ are expressed in packets per second.
From the above discussion, we see that the waiting
time for a packet header traveling to the MFEs is
$W_i(\lambda^*_1) + D_{out}(\lambda^*_2)$ and that the waiting time for the
way back to the output port is $W_i(\lambda^*_2) + D_{out}(\lambda^*_1)$, hence
the total switch delay for a packet header trip is
$$ D_H(\lambda_i) = \frac{\lambda^*_1 s^2}{1 - \lambda^*_1 s} + \frac{\lambda^*_2 s^2}{1 - \lambda^*_2 s}. \tag{10} $$
An entire packet traveling through the switch after its
destination output port has been found by the next-hop
resolution process experiences a per-cell delay of $W_i(\lambda^*_1)$
at the switch input and a per-cell delay of $D_{out}(\lambda^*_1)$ at
the output port, and occupies on average $1/c$ switch cells.
Thus the mean switch delay for the entire packet is
$$ D_P(\lambda_i) = \frac{1}{c} \left( W_i(\lambda^*_1) + D_{out}(\lambda^*_1) \right) = \frac{\lambda^*_1 s^2}{c \, (1 - \lambda^*_1 s)}. \tag{11} $$
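The switch delay terms can be put together in a few lines of code. This is a sketch under the assumptions of the switch model above (infinite output queues, identical treatment of headers and entire packets); the function names and the sample loads are hypothetical.

```python
def queue_delay(lam_cells: float, s: float) -> float:
    """Common form of W_i and D_out: lam * s^2 / (2 * (1 - lam * s))."""
    if lam_cells * s >= 1.0:
        raise ValueError("switch port is saturated (lam * s >= 1)")
    return lam_cells * s * s / (2.0 * (1.0 - lam_cells * s))


def switch_loads(lam_i: float, lam_to_mfe: float, k: int, c: float):
    """Cell rates at a local port and at the master port.

    lam_to_mfe stands for lam_p + lam_o + lam_rf (pps) of one input.
    """
    lam1 = lam_i / c + lam_to_mfe
    lam2 = k * lam_to_mfe
    return lam1, lam2


def header_delay(lam1: float, lam2: float, s: float) -> float:
    """D_H, Eq. (10): round trip of a packet header through the switch."""
    return 2.0 * queue_delay(lam1, s) + 2.0 * queue_delay(lam2, s)


def packet_delay(lam1: float, s: float, c: float) -> float:
    """D_P, Eq. (11): an entire packet occupies 1/c cells on average."""
    return 2.0 * queue_delay(lam1, s) / c


# Illustrative values: 8 inputs at 1 Mpps, 20% of packets sent to the MFEs.
lam1, lam2 = switch_loads(lam_i=1e6, lam_to_mfe=2e5, k=8, c=0.2)
print(header_delay(lam1, lam2, s=1e-9), packet_delay(lam1, s=1e-9, c=0.2))
```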
2.4 Time a Packet Spends within the
System
The mean time $T$ a packet spends in the system is
$$ T = \frac{\lambda_i - \lambda_p - \lambda_o}{\lambda_i} T_1 + \frac{\lambda_p + \lambda_o + \lambda_{rf}}{\lambda_i} T_2 + \frac{\lambda_p + \lambda_o + \lambda_{rf}}{\lambda_i} D_H(\lambda_i) + D_P(\lambda_i), \tag{12} $$
where the first term represents the fraction of packets
processed by the LFEs, the second the fraction of packets
processed by the MFEs, the third the switch delay for
a remotely processed packet header, and the fourth the
switch delay for the entire packet traveling through the
switch.
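Equation (12) is a weighted sum of the four delay components. A minimal sketch, assuming the component delays have already been computed elsewhere, is given below; the function name and the sample delay values are made up for illustration.

```python
def mean_system_time(lam_i, lam_p, lam_o, lam_rf, t1, t2, d_h, d_p):
    """Mean time T a packet spends in the router, Eq. (12).

    lam_* are per-input rates in pps; t1, t2, d_h, d_p are delays in seconds.
    """
    at_lfe = (lam_i - lam_p - lam_o) / lam_i   # share of packets queued at an LFE
    to_mfe = (lam_p + lam_o + lam_rf) / lam_i  # share of packets sent to the MFEs
    return at_lfe * t1 + to_mfe * (t2 + d_h) + d_p


# r = 1, b = 0, no overload: only the LFE and the packet switch delay count.
print(mean_system_time(1e6, 0.0, 0.0, 0.0, 2e-7, 3e-7, 1e-8, 2e-8))

# r = 0: every packet visits the MFE pool and pays the header round trip.
print(mean_system_time(1e6, 1e6, 0.0, 0.0, 2e-7, 3e-7, 1e-8, 2e-8))
```

Packets that fail resolution at the LFE appear in both weights, which reflects that they are queued at the LFE and again at the MFE pool.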
3 COST OPTIMIZATION
Forwarding Engine Cost. To simplify the problem,
the cost associated with an FE is assumed to
be a linear function of processing power of the form
$\mathrm{cost}_{MFE} \doteq c_1 \mu_{MFE}$ and $\mathrm{cost}_{LFE} \doteq c_2 \mu_{LFE}$ (both in US dollars),
where $\mu_{LFE}$ and $\mu_{MFE}$ are expressed in packets per second.
We denote $a$ as the ratio of the two coefficients,
$a = c_1 / c_2$.
Eective switch throughput. Recall that s is the
switch speed, and k the number of input ports. In order
to establish a general measure of the switch performance,
we dene the saturation throughput of the switch to be


s
= k=s (in switch cells per second). The notion comes
from the fact that the switch delay,
D(

i
) =


i
s
2
(1  

i
s)
; (13)
tends to innity as the per-input switch load 

i
(in
switch cells per second) approaches 1=s (see [12]).
We denote as the effective switch throughput $\lambda^*_e$
the fraction of the saturation throughput $\lambda^*_s$ for which the
delay remains within reasonable bounds. For Internet
applications, we assume this fraction to be 0.9, that is,
$\lambda^*_e = 0.9 \, \lambda^*_s$. A common switch performance
metric used by the industry is the switch effective
throughput $\lambda'_e$ measured in bps. The relationship
between $\lambda^*_e$ and $\lambda'_e$ (with $\bar{S}_H$ in bytes) is:
$$ \lambda'_e = 8 \, \bar{S}_H \, \lambda^*_e. \tag{14} $$
Switch cost function. To establish a switch cost
function $\mathrm{cost}_S(s, k)$ dependent on the switch performance,
we establish a relationship between $\lambda'_e$ in bps
and the switch cost in US dollars. We assume that within
certain limits of $k \le k_0$, there is a linear dependency of the
switch cost on the switch effective throughput $\lambda'_e$, which
can be characterized as $\mathrm{cost}_S(s, k) \doteq c_3 \lambda'_e(s, k)$ (in US dollars),
where $k \le k_0$. The limit $k_0$ denotes the limit for single-stage
switches. For $k > k_0$, we assume more aggressive
growth of the switch cost with $\lambda'_e$, because we assume
that such a switch can only be built using a multistage
architecture. A typical empirical value of $k_0$, which is
also used in this work, is $k_0 = 64$. From the fixed point
of $k_0$, we assume a cost function dependency of linear-logarithmic
form, approximately
$\mathrm{cost}_S(s, k) \doteq c_3 \lambda'_e(s, k) \log(\lambda'_e(s, k))$, where $k > k_0$. In the interest of
a smooth transition and reasonable values, we normalize
the function as follows:
$$ \mathrm{cost}_S(s, k) \doteq c_3 \, \lambda'_e(s, k) \, \log\left( e + \frac{\lambda'_e(s, k) - \lambda'_e(s, k_0)}{\lambda'_e(s_{min}, k_0)} \right), \tag{15} $$
where $k > k_0$.
[Figure 6: Total system cost. Two panels for a = 10, with the
system cost on a linear and on a logarithmic scale, plotted
over $\lambda_i$ [10^6 p/s] and $k$.]
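The piecewise switch cost can be sketched as follows. The code assumes a natural logarithm in Eq. (15) (so that the two branches meet at $k = k_0$), takes the effective fraction 0.9, $\bar{S}_H = 64$ B, $k_0 = 64$ and the values of $c_3$ and $s_{min}$ quoted elsewhere in the paper as defaults, and uses function names that are purely illustrative.

```python
import math


def effective_throughput_bps(s: float, k: int,
                             s_h_bytes: float = 64.0,
                             fraction: float = 0.9) -> float:
    """lambda'_e in bps: 8 * S_H * (fraction * k / s), following Eq. (14)."""
    return 8.0 * s_h_bytes * fraction * k / s


def switch_cost(s: float, k: int, c3: float = 1e-8,
                k0: int = 64, s_min: float = 1e-9) -> float:
    """Switch cost in US dollars: linear up to k0, Eq. (15) above it."""
    thr = effective_throughput_bps(s, k)
    if k <= k0:
        return c3 * thr
    excess = thr - effective_throughput_bps(s, k0)
    norm = effective_throughput_bps(s_min, k0)
    return c3 * thr * math.log(math.e + excess / norm)  # natural log assumed


# At k = k0 both branches agree; beyond k0 the cost grows faster than linearly.
print(switch_cost(1e-9, 64), switch_cost(1e-9, 128))
```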
Total System Cost. The total system cost formula
is
$$ \mathrm{cost} = m \, c_1 \mu_{MFE} + k \, c_2 \mu_{LFE} + \mathrm{cost}_S(s, k). \tag{16} $$
The linear cost functions for the FEs and the switch are
of course simplified from reality. Some form of exponential
growth of cost with capacity should rather be
expected.
Optimization Problem. The cost is optimized over
the tunable system parameters $(r, \mu_{MFE}, \mu_{LFE}, s, m)$.
Given the maximum allowed mean packet processing
time $T_{max}$, we derive the following optimization problem:

Optimization function:  $m \, c_1 \mu_{MFE} + k \, c_2 \mu_{LFE} + \mathrm{cost}_S(s, k)$
Parameters:             $\{r, \mu_{MFE}, \mu_{LFE}, s, m\}$
Constraints:            $0 \le r \le 1$
                        $0 \le \mu_{MFE} < \mu_{max}$
                        $0 \le \mu_{LFE} < \mu_{max}$
                        $s_{min} < s$
                        $T < T_{max}$
                        $1 \le m$

Note that $m$ is an integer, whereas all the other optimized
parameters are rational numbers.
4 NUMERICAL RESULTS
4.1 Optimization
Numerical results have been obtained using the Matlab
Optimization Toolbox environment. First, the value
of $m$ is increased until there exists a feasible solution,
and then the constrained nonlinear optimization function
fmincon is used to find the optimum. The system
variables in our optimizations have been set to the following
values: $T_{max} = 10^{-6}$ s, $c_2 = 6 \cdot 10^{-5}$, $c_3 = 10^{-8}$,
$\mu_{max} = 6 \cdot 10^{6}$ pps, $s_{min} = 10^{-9}$ s (meaning a fixed-size
switch cell would have to be transmitted at a switch port
within 1 ns), $n = 100$. The values have been selected to
approximate the limited market information available.
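The outer loop over $m$ described above can be sketched independently of the solver. The fragment below is a simplified stand-in, not the Matlab procedure itself: the feasibility test is passed in as a callback, and the example uses only a crude capacity check (the MFE pool must be able to absorb the offered load) in place of the full constraint set. The names `smallest_feasible_m` and `m_limit` are assumptions.

```python
from typing import Callable, Optional


def smallest_feasible_m(is_feasible: Callable[[int], bool],
                        m_limit: int = 1024) -> Optional[int]:
    """Increase m from 1 until is_feasible(m) holds; None if no m <= m_limit works."""
    for m in range(1, m_limit + 1):
        if is_feasible(m):
            return m
    return None


# Stand-in feasibility test: total MFE capacity m * mu_max must exceed the load.
lam_sum = 8e6  # pps offered to the MFE pool (illustrative)
mu_max = 6e6   # pps, the FE processing bound used in the paper
m_start = smallest_feasible_m(lambda m: lam_sum < m * mu_max)
print(m_start)
```

In a fuller version the callback would check all constraints of the optimization problem, and the continuous parameters would then be optimized for the returned $m$.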
The ratio $a$ of the costs of equally powerful MFEs and
LFEs is varied in our simulations over the values
$a \in \{2, 10, 100\}$ and influences several variables. Variable
$c_1$ is dependent on $c_2$ and $a$, $c_1 = a c_2$. The value of $b$
reflects $a$ in the following manner: we assume that $a$ can
be interpreted as a difference in memory size available to
the processing engines. Thus $a$ determines a fraction of
the memory available at an LFE, and $b$, the fraction of
nonresolvable packets at an LFE, reflects the cache miss
rate at an LFE. A sample dependency between a cache
hit-rate and cache size can be found for example in [16].
Based on that, we use the following pairs of $(a, b)$: (2,
0.1), (10, 0.2), (100, 0.9).
[Figure 7: Fraction of traffic enqueued at LFEs, $r$. Panels for
a = 2, a = 10 and a = 100, plotted over $\lambda_i$ [10^6 p/s] and $k$.]
[Figure 8: Total processing power of the FEs,
$k \mu_{LFE} + m \mu_{MFE}$ [Mp/s]. Panels for a = 2, a = 10 and
a = 100, plotted over $\lambda_i$ [10^6 p/s] and $k$.]
[Figure 9: Total cost of the FEs. Panels for a = 2, a = 10 and
a = 100, plotted over $\lambda_i$ [10^6 p/s] and $k$.]
Figures 6 - 18 show plots of the output variables over
a spectrum of the number of inputs $k$ and link speeds
$\lambda_i$. The number of inputs $k$ grows geometrically with a
coefficient of 2, $k \in \{8, 16, 32, \ldots, 1024\}$. The values on
the maximum interface bandwidth $\lambda_i$ axis grow geometrically
with a coefficient of $\sqrt{2}$, thus including link speeds
approximately corresponding to the capacities of 10 Mb
to 1 Gb Ethernet and OC-48 to OC-768 links, in Mpps,
that is, $\lambda_i \in \{0.025, 0.035, 0.050, \ldots, 6.40, 9.05\}$ Mpps.
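The two geometric grids can be reproduced in a few lines. The number of $\lambda_i$ points (18) is not stated in the text; it is inferred here from the listed end points 0.025 and 9.05 Mpps and should be read as an assumption.

```python
# Number of inputs: geometric with ratio 2, from 8 to 1024.
k_values = [8 * 2**j for j in range(8)]

# Link load in Mpps: geometric with ratio sqrt(2), starting at 0.025.
# 18 points are assumed so that the last value is about 9.05 Mpps.
lam_i_values = [0.025 * 2 ** (j / 2) for j in range(18)]

print(k_values)
print([round(x, 3) for x in lam_i_values])
```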
4.2 Total System Cost
Figure 6 depicts the system cost on a linear and a logarithmic
plot for $a = 10$. The cost grows exponentially
along both the $k$ and $\lambda_i$ dimensions, with a steeper increase
in the high $\lambda_i$ segment, which is due to the increasing
influence of the switch cost on the total cost.
4.3 Distribution of Resources: LFEs,
MFEs and Switch Capacity
Figure 7 shows how the optimization selects between
the two extreme points of the solution space. For a
smaller part of the problem space, r attains the value
of 0, meaning no LFEs are needed. When r is equal
to 1, the results indicate that it is cheaper to add and
fully stress the LFEs to be able to handle the load. It is
clearly visible that a certain boundary divides the prob-
lem space into two regions. Typically, for systems with
a small number of inputs and small link loads, deploy-
ing LFEs is too expensive and r = 0. This somewhat
a=2
0.01
0.1
1
10
λi [106 p/s]10
100
1000
k
101
103
105
λs [106 c/s]
a=10
0.01
0.1
1
10
λi [106 p/s]10
100
1000
k
101
103
105
λs [106 c/s]
a=100
0.01
0.1
1
10
λi [106 p/s]10
100
1000
k
101
103
105
λs [106 c/s]
Figure 10: Switch saturation throughput.
a=2
0.01
0.1
1
10
λi [106 p/s]10
100
1000
k
100
102
104
106
Switch cost
a=10
0.01
0.1
1
10
λi [106 p/s]10
100
1000
k
100
102
104
106
Switch cost
a=100
0.01
0.1
1
10
λi [106 p/s]10
100
1000
k
100
102
104
106
Switch cost
Figure 11: Switch cost.
counter-intuitive result comes from the fact that the
processing capacity needed for one M/M/1 queue serving
the aggregate load is smaller than the processing capacity
needed for k M/M/1 queues, each serving 1/k of the
aggregate load, if both systems need to conform to the
same time constraint. As the relationship is nonlinear,
with the increase in the total amount of processing
capacity required, at a certain boundary it becomes more
economical to use the multiple, less powerful processors,
and thus, for systems with a higher number of ports or
higher link loads, LFEs are the more economical way to
fulfill the performance requirements and r = 1. The exact
curve of the shift differs for various a. Two further
observations can be made. For a = 2 and a = 100, r starts
to decrease from 1 towards 0 for very high $\lambda_i$.
Furthermore, for a = 100 and very high k and $\lambda_i$,
r is undefined, meaning that it is not feasible to build a
router with the constraints given. Observing the
optimization results of other parameters leads to a better
understanding of these phenomena.
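The pooling argument above can be checked numerically. The sketch below assumes that the time constraint is a bound on the mean M/M/1 sojourn time, 1 / (mu - lambda), which may differ from the delay criterion used in the optimization. The function names and the example values are hypothetical.

```python
def required_rate(load_pps, max_delay_s):
    """Smallest M/M/1 service rate (packets/s) whose mean sojourn time
    1 / (mu - lambda) does not exceed max_delay_s."""
    return load_pps + 1.0 / max_delay_s


def pooled_capacity(k, per_link_load_pps, max_delay_s):
    """Capacity of one shared server handling the aggregate load of k links."""
    return required_rate(k * per_link_load_pps, max_delay_s)


def split_capacity(k, per_link_load_pps, max_delay_s):
    """Total capacity of k separate servers, each handling one link's load."""
    return k * required_rate(per_link_load_pps, max_delay_s)


# Hypothetical example values, not taken from the paper.
k = 16
load = 0.1e6     # packets/s per link
delay = 10e-6    # bound on the mean delay, in seconds

pooled = pooled_capacity(k, load, delay)
split = split_capacity(k, load, delay)
# The gap between the two designs is (k - 1) / delay under these assumptions.
print(pooled, split, split - pooled)
```

Under these assumptions the k separate queues need (k - 1) / T more total capacity than the single shared queue, independent of the load, so the penalty for distributing the processing weighs most heavily on small, lightly loaded systems.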
Figures 8 and 9 show on a logarithmic scale the optimal
amount of total processing power within the system
and its cost. Figures 10 and 11 show on a logarithmic
scale the optimal switch, characterized by the saturation
throughput, and its cost. Figures 8 to 11 explain the
differences in the shape of the boundary between r = 0 and
r = 1 in Fig. 7. We observe a trade-off between increasing
the switch throughput and deploying LFEs. The
influence of the switch cost, however, differs when the
factor a changes. The lower a is, the higher the influence
of the switch cost on the boundary shape, because when
a and b are low, the total system cost remains lower and
the switch cost influences the trade-off between r = 0
and r = 1, described above. However, when a is large, b
becomes large as well, and a large fraction of the packets
processed at the LFEs have to travel to the MFEs
anyway, thus the introduction of LFEs is beneficial only
for very high link speeds. The MFEs and the system in
total are then very expensive, and the switch cost is no
longer a factor in the trade-off.
4.4 Distribution of Processing
Capacity: LFEs and MFEs
Figure 12 shows the optimal amount of processing
power at each individual LFE. For a = 2 and a = 100,
the LFE processing power $\mu_{LFE}$ reaches the upper limit
$\mu_{max}$ in the very high $\lambda_i$ region. As the LFE is
overloaded at that point, it does not make sense to enqueue
additional packets at the LFEs because they only incur
additional delay. Thus it is more efficient to send
Figure 12: Optimal processing power per individual LFE. (Panels: a = 2, a = 10, a = 100; axes: $\lambda_i$ [$10^6$ p/s], k, $\mu_{LFE}$ [$10^6$ p/s].)
Figure 13: Total processing power of the LFEs (note that the graph was adjusted for the reader's convenience in order to be able to depict the values equal to 0, which would normally tend to negative infinity on a logarithmic graph). (Panels: a = 2, a = 10, a = 100; axes: $\lambda_i$ [$10^6$ p/s], k, $k\mu_{LFE}$ [Mp/s].)
Figure 14: Total processing power of the MFEs. (Panels: a = 2, a = 10, a = 100; axes: $\lambda_i$ [$10^6$ p/s], k, $m\mu_{MFE}$ [Mp/s].)
the appropriate fraction of the packet headers directly
to the MFEs, resulting in a decrease of r. The boundary
where the LFE processing power limit is reached differs
for various a. This phenomenon is again dependent on
the particular pairing of (a, b), which, as described in
Section 4.3, influences the boundary where it becomes
advantageous to use LFEs, and, in particular, on the
fraction of the LFE resolution failures b, which increases
the required LFE processing power.
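The decrease of r at a saturated LFE can be illustrated with a deliberately simplified calculation. The sketch below treats the LFE as a plain M/M/1 queue capped at a rate limit and assumes that a fixed share of the delay bound is available to it. This is not the nonlinear LFE overload model of the paper, and the function name, the limit of 6 Mp/s and the delay share are hypothetical.

```python
def max_lfe_fraction(link_load_pps, mu_max_pps, lfe_delay_budget_s):
    """Largest fraction r of a link's packets that an LFE capped at
    mu_max_pps can take while keeping the M/M/1 mean sojourn time
    1 / (mu_max - r * lambda) within lfe_delay_budget_s."""
    usable_rate = mu_max_pps - 1.0 / lfe_delay_budget_s
    if usable_rate <= 0:
        # Even an idle LFE at its rate limit would violate the delay budget.
        return 0.0
    return min(1.0, usable_rate / link_load_pps)


MU_MAX = 6e6    # hypothetical LFE rate limit, packets/s
BUDGET = 1e-6   # hypothetical share of the delay bound spent at the LFE, s

# r stays at 1 for moderate loads and drops once the capped LFE saturates.
for load in (1e6, 4e6, 6.4e6, 9.05e6):
    print(load, round(max_lfe_fraction(load, MU_MAX, BUDGET), 3))
```

In this simplified picture r stays at 1 until the link load exceeds the usable LFE rate and then falls in inverse proportion to the load, which is qualitatively the behavior described for the very high link-speed region.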
Figures 13 and 14 show the total amount of processing
capacity of the LFEs and the MFEs, respectively. Figure
15 shows how the total amount of processing power is
partitioned among the LFEs and MFEs. In most of the
problem space, the major part of the optimal system
processing power rests with the LFEs. In the segment of
routers with high-speed links, for a = 2 and a = 100, the
LFE processing power limit $\mu_{max}$ is reached and thus the
bulk of the processing capacity begins to shift towards
the MFEs. At the same time, as described in Section
4.3, in the segment of devices with few, low-speed links,
Figure 15: Fraction of LFE processing power out of total system processing power. (Panels: a = 2, a = 10, a = 100; axes: $\lambda_i$ [$10^6$ p/s], k, $k\mu_{LFE} / (k\mu_{LFE} + m\mu_{MFE})$.)
Figure 16: Switch port transmission speed. (Panels: a = 2, a = 10, a = 100; axes: $\lambda_i$ [$10^6$ p/s], k, switch port transmission speed [$10^{-6}$ s/cell].)
Figure 17: Traffic at the switch master port: the header passing overhead within the switch. (Panels: a = 2, a = 10, a = 100; axes: $\lambda_i$ [$10^6$ p/s], k, $\lambda^*_2$ [Gb/s].)
LFEs are not used at all and thus their share of the total
capacity is equal to 0.
4.5 Switch Speed
Figures 16, 17 and 18 depict the optimal switch port
transmission speed and the overhead of the header-passing
traffic compared to the total switch traffic. We
see why there is no feasible solution for a = 100. Given
the dependence of b on a, a large fraction of traffic
(b = 0.9) fails to be resolved at the LFEs and still travels
through the switch to the MFE pool. Thus, LFEs
cannot be used to a great extent to decrease the packet
delay in a router, and the switch, the master port in
particular, is placed under increased demand to compensate
for the delay. For the very high-speed, many-input case, the
switch is not able to cope with the demand and reaches
the limit of its port transmission speed. Therefore a
feasible solution for this region does not exist. Note that a
similar phenomenon is only narrowly avoided in the case
of a = 2, when the LFEs are already saturated as well.
To summarize the optimization results, we observe
that for each a and b, a similar trend of dividing the
problem space according to the optimum solution exists.
However, the boundary line and the appropriate switch
Figure 18: Fraction of the header passing overhead out of the total switch traffic. (Panels: a = 2, a = 10, a = 100; axes: $\lambda_i$ [$10^6$ p/s], k, $\lambda^*_2 / (\lambda^*_2 + k\lambda^*_i)$.)
element parameters vary significantly for different a and
b, which suggests that they are important factors in any
router design. Note that the results are, to a large extent,
also dependent on c, the ratio of the average packet size
to the packet header size. If, for example, the average
packet size were to decrease, the influence of the
packet-header-passing overhead traffic on the system would gain
a much higher significance, and vice versa. Furthermore,
with a more realistic cost function, less capable elements
would be favored and thus a shift in the division
boundaries could be expected.
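The dependence on c can be made concrete with a rough calculation. The sketch below assumes that every packet crosses the switch once as c header-sized units and that a given fraction of packets additionally sends one header-sized unit to the MFEs. Both the traffic accounting and the name `header_overhead_fraction` are assumptions for illustration and are not taken from the switch model of the paper.

```python
def header_overhead_fraction(c, forwarded_fraction):
    """Share of switch traffic that is header passing, when each packet of
    c header-sized units crosses the switch once and forwarded_fraction of
    the packets additionally sends one header-sized unit to the MFEs."""
    header_units = forwarded_fraction   # per packet, in header sizes
    payload_units = c                   # per packet, in header sizes
    return header_units / (header_units + payload_units)


# Hypothetical values of c; 0.9 mirrors the b = 0.9 case discussed above.
for c in (20, 10, 5, 2):
    print(c, round(header_overhead_fraction(c, forwarded_fraction=0.9), 3))
```

Under this accounting the overhead share grows as c shrinks, in line with the remark that smaller average packets make the header-passing traffic more significant.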
5 FUTURE WORK
There are many possible improvements to the model
and to the way the system behavior is studied. Such
improvements comprise:
– decoupling the packet-header passing from the switching
of entire packets within the switch element, as
discussed in Section 2.2;
– more realistic traffic modeling, such as self-similar
traffic [8] or packet trains [17];
– rather than being linear, the FE cost function could
be more realistic, perhaps tied to a more complex FE
architecture;
– including payload buffering and packet reassembly in
the processing model;
– evaluating load balancing on a per-flow basis and
addressing packet reordering prevention (as discussed in
Section 2.2), and
– modeling specific packet processing tasks, such as
lookup, classification, flow control, or scheduling in more
detail.
We have focused here on the worst-case performance:
all router links are fully loaded. However, a useful anal-
ysis could concentrate on the system behavior within a
certain load range.
6 CONCLUSION
This work presents a model for analyzing a general
distributed router architecture and determining its most
economical variant, given a set of performance
requirements. We bring new insight as to how the individual
elements influence the most economical architecture for
a given set of constraints. The introduction of a nonlinear
LFE overload model allows us to easily combine both
LFEs and MFEs within one scalable hybrid system. Using
a simple switch model, we have demonstrated that
the introduction of a switching delay can significantly
influence the optimization results. The introduction of the
two elements enables us to enhance the scalability limits
of the fully distributed router architecture, as they are
reached only for the most extreme points of the spectrum
studied. For low-end systems, we have demonstrated
that the LFE alternative is too costly. However,
for most systems, deploying LFEs is advantageous because
the switch bottleneck and the high MFE cost are
avoided. We show that the cost ratio of the equally powerful
LFEs and MFEs and the corresponding fraction of
load the LFE is able to handle are decisive factors in
determining the location of the boundary between the two
optimal solutions as well as in selecting the appropriate
switch element.
References
[1] C. Partridge, et al. 1998. "A 50-Gb/s IP
Router". IEEE/ACM Transactions on Networking,
6(3), June: 237–248.
[2] V. P. Kumar, T. V. Lakshman, D. Stiliadis. 1998.
"Beyond Best Effort: Router Architecture for the
Differentiated Services of Tomorrow's Internet".
IEEE Communications Magazine, May: 152–164.
[3] A. Engbersen, C. Minkenberg. 2000. "A Combined
Input and Output Queued Packet-Switched System
based on a Prizma Switch-on-a-Chip Technology".
IEEE Communications Magazine, Vol. 38, No. 12,
December: 70–77.
[4] N. McKeown, et al. 1996. "Achieving 100%
throughput in an input-queued switch". In
Proceedings of the 1996 IEEE INFOCOM (San
Francisco, CA, March). 296–302.
[5] A. Brodnik, et al. 1997. "Small forwarding tables
for fast route lookups". In Proceedings of the 1997
ACM SIGCOMM (Cannes, France, September). 3–14.
[6] M. Waldvogel, et al. 1997. "Scalable high speed IP
route lookups". In Proceedings of the 1997 ACM
SIGCOMM (Cannes, France, September). 25–37.
[7] H. Chan, H. Alnuweiri, V. Leung. 1998. "A
Framework for Optimizing the Cost and Performance of
Next-Generation IP Routers". IEEE Journal on
Selected Areas in Communications, 17(6), June: 1013–1029.
[8] W. Willinger, M. Taqqu, R. Sherman, D. Wilson.
1997. "Self Similarity through High Variability:
Statistical Analysis of Ethernet LAN Traffic at the
Source Level". IEEE/ACM Transactions on
Networking, (5): 71–96.
[9] J. Cao, W. S. Cleveland, D. Lin, and D. X. Sun.
2001. "On the Nonstationarity of Internet
Traffic". In Proceedings of the 2001 ACM
SIGMETRICS. 29:102–112.
[10] Juniper Networks, Inc. 2001. "M160
Internet Backbone Router Datasheet".
http://www.juniper.net, August.
[11] L. Kleinrock. 1975. Queueing Systems. John Wiley
& Sons.
[12] I. Iliadis, W. Denzel. 1993. "Analysis of Packet
Switches with Input and Output Queueing". IEEE
Trans. on Communications, 41(5), May: 731–740.
[13] I. Iliadis, W. Denzel. 1992. "Performance of a
Packet Switch with Input and Output Queueing
under Unbalanced Traffic". In Proceedings of the 1992
INFOCOM (May). 743–752.
[14] M. Karol, M. Hluchyj, S. Morgan. 1987. "Input
Versus Output Queueing on a Space-Division Packet
Switch". IEEE Trans. on Communications, 35(12),
December: 1347–1356.
[15] K. Thompson, G. Miller, R. Wilder. 1997.
"Wide-Area Internet Traffic Patterns and Characteristics".
IEEE Network, Nov./Dec.: 10–23.
[16] N. McKeown, B. Prabhakar. 1999. "High
Performance Switches and Routers: Theory and Practice".
Tutorial M2 of the 1999 ACM SIGCOMM
(Cambridge, MA, August).
[17] R. Jain, S. Routhier. 1986. "Packet Trains –
Measurements and a New Model for Computer Network
Traffic". IEEE Journal on Selected Areas in
Communications, 4(6), September: 986–995.
