The suitability of network processors for multi-field packet classification by Sassen, Tyrell
  
 
 
 
 
 
 
 
The copyright of this thesis vests in the author. No 
quotation from it or information derived from it is to be 
published without full acknowledgement of the source. 
The thesis is to be used for private study or non-
commercial research purposes only. 
 
Published by the University of Cape Town (UCT) in terms 
of the non-exclusive license granted to UCT by the author. 
 
The Suitability of Network Processors 
For Multi-field Packet Classification 
Prepared by: 
Tyrell Sassen 
Supervised by: 
Neco Ventura 
Department of Electrical Engineering 
University of Cape Town 
2006 
This dissertation is submitted to the University of Cape Town 
in fulfilment of the academic requirements 
for the Degree of Master of Science in Engineering 
20 January 2006 
Declaration 
I declare that this thesis is my own work. Where collaboration with other people has taken 
place, or material generated by other researchers is included, the parties and/or material 
are indicated in the acknowledgements or references as appropriate. 
This work is being submitted for the Master of Science Degree in Electrical Engineering at 
the University of Cape Town. It has not been submitted to any other university for any 
other degree or examination. 
20)01 12DD~ 
7 7 Date 
I to 
• 
• 
• at 
and 
11 
to is an 
ments 
on 
worst-case Orf",Af'O current 
use 
","'A"A" to 
1.1 
1 
to 
net-
stream must 
correct 
or 
an 
a user 
2 
'-'U,,(}V'.'U'J'" amount of 
account 
allow 
3 
over 
resources 
a 
to 
to 
to 
it becOm(~s nec-
to 
on 
traffic. 
simple and therefore 
Along with the 
trend is 
for ARPANET 
[:17], 
to the TCP/IP protocol 
, the most of these being the 
(DiffServ) [6] 
though the router, 
edge 
due to 
to store state 
keeps to 
while keeping 
to provide 
its 
of 
core 
for packet-switched networks, the other 
core access 
it is not uncommon to see links in the 
core 
network, can be seen right the to the access ago users 
56 
the point where ISP companies like 
lines as 
[G7]. In access 
as speeds 
along with 
to 
over 
Mbps over fibre cables to the home 
of users, 
led to the 
extra traffic. 
more elements, to deal processing of 
While IP forwarding can be done at relatively high speeds, 
quired to 
due to 
as a relatively new element for 
IlIl 
biggest 
IS 
a firmware 
still being 
The 1 1 
IS 
it becomes to 
This is especially beneficial 
a 
4 
QoS 
power. In the 
re-
and 
the 
the 
little more 
QoS arena, as standards are 
network would 
Figure 1.1: An example network showing various QoS requirements 
be necessary in a typical network environment. 
• Computer A wishes to download a file from Server B. Server B is situated in the 
company's private network, but is still accessible to the outside world. The firewall 
that lies between Server B and the public network protects the server from malicious 
attacks by restricting the IP addresses that are allowed to connect to Server B. The 
firewall also restricts the TCP ports to which a client can connect. 
• The user of Telephone A wishes to have a conversation with the user of Telephone 
B. These telephones use VoIP, since they are communicating over an IP network. 
The network provides QoS capabilities in order to ensure that the telephone call 
receives an acceptable level of service. Link A uses a QoS mechanism to guarantee 
a minimum amount of bandwidth and to keep delay to a minimum for VoIP calls 
between the two sub-networks. 
• The user of Computer B needs to connect to Server A in his remote office. The user 
connects to a VPN that allows him to be connected to the office's internal network 
securely, regardless of his physical location. The VPN has been configured using a 
QoS scheme that ensures it behaves like a leased line with regard to bandwidth and 
other QoS parameters. 
are 
stream can a 
if it way, it 
can 
a 
use 
processors to of 10]. 
1.2 
processors. a 
it not 
it IS must 
on 
7 
1.3 
was on 
processors. 
1 
is 
to is 
gones 
to 
an 
processors 
9 
some 
study. 
to 
2.1 
2.2 
are 
• 
• 
+--------.--- 32 bits ---------> 
(a) 
2.1: 
11 
Address 
Destination 
bits ---------> 
(b) 
01234567 
Precedence 
• is not zero. It 
was 
it not 
stream uses 
states 
2.3 
a 
may 
coarse manner. 
streams must 
streams must 
are not 
a stream 
structure a 
stream 
met. 
3.1 
an 
a 
to measure 
a stream 
stream that 
papers. 
1 
Figure 2.4: A typical DS-enabled WAN 
Although all of the functional blocks mentioned earlier are required in order to provide 
DS-enabled services, they need not be implemented to the same degree on every node of 
the network. Figure 2.4 shows the layout of a typical DS-enabled network. This network 
consists of two smaller DS-enabled networks that are connected over a single link. 
Each DS-enabled network, or DS-domain, contains a number of DS-enabled nodes. These 
nodes are known as DS interior nodes, as they operate on the interior of the DS-domain. 
The DS boundary nodes A and B connect the two domains together, while the ingress and 
egress nodes connect the sending and receiving hosts to the DS-enabled core network. The 
dashed lines represent the routing path taken by traffic travelling between the two nodes. 
In Figure 2.4 there are two hosts that wish to communicate using some 
application with QoS requirements. Since the hosts are in different DS-domains, not 
only must each domain provide a constant level of service, but the network should also 
provide bounds on the end-to-end QoS of the connection. 
As traffic enters a domain through the DS ingress node, the traffic stream must be 
conditioned to ensure that it fits the traffic profile to which it belongs. The ingress node is 
responsible for identifying the service-level requirements of the traffic entering the domain 
by performing some form of multi-field packet classification. This classifier may use 
application-layer header fields or, more likely, network- and transport-layer headers. 
Classification allows traffic to be marked and the temporal properties of the stream to be 
examined and conditioned; this may include shaping or dropping packets in the stream. The ingress 
a 234 567 
DSCP 
must a can rest 
2. 2 
a 
map to a same 
1 
an 
correct for 
stream exceeds 
can 
enters a new 
to a 
2 
once it 
ment 
must 
uncommon 
nature 
two extremes. 
Figure 2.6: A comparison of different technologies, showing the trade-off between flexibility 
and performance. [Chart: x-axis: Performance; y-axis: Flexibility.] 
with regard to their performance and flexibility in his Master's thesis [49], and the results 
are shown in Figure 2.6. 
Although there are a number of different NPs currently on the market, they all employ 
similar mechanisms. These are discussed below in order to give an understanding of how 
NPs differ from traditional network processing architectures. 
2.4.1 Hardware Mechanisms 
NPs have a number of different mechanisms that enable them to process packets at line 
speeds. The implementations of these mechanisms differ from processor to processor, but 
all rely on the same basic principles. This section discusses some of these processing 
mechanisms and gives details about how they are implemented in the IXP2400 [22]. 
Packet-switched networks offer a high level of data parallelism because packets 
can be considered on an individual basis. There is little to no data dependency between 
sequential packets, as each packet contains a header with all the information needed to 
process it¹. This allows the task of processing the packets to be performed in parallel, 
and network processors have hardware to take advantage of this. This is usually done by 
means of multiple Programmable Processing Engines (PPEs). These engines are essentially 
RISC processors, with added instructions, that execute in parallel. PPEs are the 
main data-processing units of the NP and, although they are present in most NPs, different 
NP architectures have these engines in different numbers and layouts. Architectures 
like the Lexra NetVortex [13] have up to 16 MIPS cores that operate in parallel, while 
the Cisco PXF [62] arranges its 32 processors in 4 pipelines. The NetVortex assigns 
a single packet to each PPE as it becomes available, and the PPE then processes the packet 
to completion. In contrast, the PXF assigns packets to its four pipelines, and 
packets are passed between different PPEs until they reach the end of the pipe. 
¹ This is not always true, as packet processing done at the application layer may introduce 
inter-packet dependencies. This is also the case when packet fragmentation has taken place. 
Figure 2.7: The layout of PPEs in different NPs: (a) Lexra NetVortex, (b) Cisco PXF and 
(c) Intel IXP2400 
The IXP2400 provides a balance between these two layouts; it provides 8 PPEs (known 
as microengines) that can be arranged in either a pipelined or a parallel manner. The 
microengines implement what are known as next-neighbour registers; these allow data to 
be passed to the next microengine in the chain to allow pipelined processing. Figure 2.7 
contrasts these three layouts and shows how each is implemented. 
NPs also provide another level of parallelism by implementing multi-threading on the PPEs. 
Due to the vast difference between the clock cycle time of the PPEs and the access time of 
on-board memory, the PPEs will often have to wait for multiple clock cycles before data 
is returned from (or written to) memory. NP manufacturers have added functionality in 
a 
to other 
them to connect 
algorithmic 
to 
2.8 
2.4.2 
2.5 
some 
Media Switch 
Fabric 
2 A 
XScale. • Performance 
Core Monitor 
return 
routers 
new. It 
to 
an 
few 
streams 
are 
more 
it 
[31, 
it IS 
2.1. 
IS 
routers 
some sources 
to 
for a 
to see <>!C,,,.C)UHA 
2 
most cases 
as 
current routers. 
IS common 
seen 
J' 
onto 
2.5.1 
can to 
• 
to reason 
• 
, routers do not 
• 
2.5.2 
tel' 
2.5.3 
to 
to 
structures 
nr',(",n,(\c<£u, to meet 
to 
structure in 
to 
on 
use use it 
two most common ones 
once a 
worst-case 
to 
store a 
a 
be converted into first. 
is 
LTYr011v"~ as well as traditional 
fields. 
is to 
structure 
could map 
a 
can 
re-
Tables 
1 
Phase 0 Phase 1 Phase 2 Phase 3 
a are as 
tree. 
manner 
a 
more 
measure 
1. 
2. 
remove any 
more 
is not vital 
same 
but 
as it 
more cut 
can to 
root 
a way of storage 
(256" x, 4) 
tree the 2 
aver-
accesses. 
vasan, A 
was 
(b) 
was 
to 
vector now 
to a 
is not a it can a 
are 
not 
worst-case 
a is 
as 
means more 
as a on 
next 
3.1 
to 
3.2 
more 
accesses to 
t 
3.3 
2.5 the different 
ferences among these algorithms is that 
of 
structure was 
algorithms. One 
algorithm 
the 
different 
properties. To compare various algorithms against one another, a standard rule-set must 
multi-field classification 
be 
present in real-world databases. 
that 
[GG, 65]. research uses the '-JHA,OO.JJ 
. It is described 
The implementation 
can use. 
algorithms 
network processor 
is a lack of 
creators were 
can 
a way that contain the structure 
into creating a 
proposed by 
A . 
the amount resources 
in the is the Intel IXP2400. that 
The of NP are below, further details can be found 
3. 
The NP 8 at can used to process Attached to 
engines are a memory technologies. These are listed, along with 
m can support a of is split over 
two independent channels. allows both MB banks to be 
1 evaluation board was 
processor 8 of 
runs MontaVista 
This processor is initialising the 
39 
but 
This is a card 
of DRAM. It contains 
[44]. The board 
processor. 
no role in the classification 
3.5 
IS 
was 
3.6 
it 
number, 
memory on 
necessary to test 
removes any 
reason, 
rule-sets will be identified number rather than 
to 
it is 
the 
router, 
2: 
Classifier 
3.1: 
a 
router 
Marker 
structure a 
to 
resources errICI(';nt can 
to 
amounts 
is that it must 
comes more 
detail in 
to 
is 
nH"'H£'r sev-
are 
3.2: A a 
A a 
structures 
IS more 
structures 
to 
can 
can 
is a 
are 
two or more 
com-
router 
ma 
3. 
process 
is is 
to 
1 
struc-
ture [31]. 
as 
to use 100% 
now 
accesses must 
com-
called MEs in the IXP2400 
to 
are 
is 
vector 
vector 
vector 
an 
3. 2 
tree a 
it 
a 
consume more 
are 
3.7: rule used the algorithm 
R1 
R2 
R3 
R4 
R5 
R6 
R7 
3. 3 
vector now 
can 
is 
a llH:H~H IS 
.1 
.2 
worst-case 
to 
all stream 
to test 
4.1: [Diagram: packet layout showing an Ethernet Header (14 bytes), IP Header (20 bytes), TCP Header (20 bytes) and Ethernet Trailer.] 
worst-case 
router 
to 
common use. 
uses 3 
MEs to 
~""'~""'"' queue 
C IS 
not the meta-data of the 
Figure 4.3: A screenshot of the IXP2400 simulator. 
processor of the IXP2400. This stage imports the previously generated rule-set and creates 
the data structures needed by the classification algorithm. These structures are then stored 
in SRAM, where they can be accessed by the classification MEs. Since the framework 
uses queues to pass packets between the different MEs, it is possible to have multiple 
MEs processing packets simultaneously. The implementation uses one queue for receiving 
packets and one for transmitting them. The classification stage, using 1 to 5 MEs operating 
in parallel, is shown graphically in Figure 4.2. 
4.4 Simulation Environment 
The structural model does not run on the actual NP hardware, but rather on a cycle-accurate 
simulator. This simulator is part of the IXP SDK and simulates both the MEs 
and peripherals on a clock-cycle by clock-cycle basis. The development environment also 
contains a packet generator that can be used to generate packets on the sending and 
receiving interfaces of the NP. The simulator runs exactly the same code as the hardware 
was 
are next 
5.1 
5.2 
• 
• 
to 
to 
is an 
1. 
one is most 
resource is 
It 
is 
Figure 5.1: The memory space requirement for the data structures of different packet 
classification algorithms. [Bar chart: x-axis: Ruleset Size (32 to 16384); y-axis: storage size, logarithmic scale from 10 to 100000; series: Indirect-HiCuts, Direct-HiCuts, BV, Tuple, Linear.] 
The storage metrics can be gathered from the preprocessing stage of the algorithms. When 
using a physical NP, the packet classification data-structures are stored in SRAM after 
they have been calculated by the XScale processor. In the evaluation framework, the data 
structures are written to a script file which is executed during the startup of the simulator. 
This script copies the data into the virtual SRAM of the processor, allowing it to be 
accessed by the classification MEs during execution. Details regarding this process can be 
found in Appendix C. 
The IXP2400 NP has a shared memory architecture, so increasing the number of 
classification MEs has no effect on the storage requirements of the algorithm. 
The memory requirements of the algorithms as a function of the rule-set 
size are shown in Figure 5.1. 
Analysis 
The simplest algorithm, the linear search, has storage requirements that can be calculated 
exactly. The only data structure for the algorithm is a list of all rules, requiring 
memory directly proportional to N, the number of rules in the set. The linear search 
algorithm has the smallest storage requirements of any algorithm in this study. 
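The linear search can be sketched in a few lines. This is a minimal illustration rather than the microengine code used in the study: the `Rule` layout (one inclusive range per header field), the field order and the first-match-wins priority are all assumptions made for the example.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

Range = Tuple[int, int]  # inclusive (low, high) bounds for one header field


class Rule(NamedTuple):
    """Hypothetical rule layout: one range per classified header field."""
    fields: Tuple[Range, ...]
    action: str


def linear_classify(rules: Sequence[Rule], header: Sequence[int]) -> Optional[Rule]:
    """Return the first rule whose every range contains the matching header field."""
    for rule in rules:  # rules are assumed to be stored in priority order
        if all(lo <= value <= hi for (lo, hi), value in zip(rule.fields, header)):
            return rule
    return None  # no rule matched


# Assumed field order: source IP, destination IP, source port, destination port, protocol
ANY_IP, ANY_PORT = (0, 2**32 - 1), (0, 65535)
rules = [
    Rule((ANY_IP, (0x0A000001, 0x0A000001), ANY_PORT, (80, 80), (6, 6)), "allow-web"),
    Rule((ANY_IP, ANY_IP, ANY_PORT, ANY_PORT, (0, 255)), "default-drop"),
]

match = linear_classify(rules, (0xC0A80005, 0x0A000001, 40000, 80, 6))
```

The only stored structure is the rule list itself, which is why the memory grows in direct proportion to N, while a lookup may have to visit every entry.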
Figure 5.2: The storage requirement percentages for the two data-structures of the BV 
algorithm. [Chart: x-axis: Ruleset Size (32 to 16384); y-axis: Percentage Memory, 0% to 100%; series: Ranges, Bit-Vectors.] 
The Bit-Vector (BV) and Tuple-Space Search algorithms have similar storage requirements. 
This is to be expected, as both algorithms function in almost the same manner. These 
algorithms require two separate data-structures for each dimension of the classifier: a list 
of ranges and a bit-vector. The list of ranges is based on the input rule-set and is therefore 
identical for both algorithms. The BV algorithm has higher storage requirements in all 
cases: its bit-vectors hold one bit per rule, while those of the tuple-space algorithm hold 
one bit per tuple, and the number of tuples is always equal to or less than the number of rules. 
There is a worst-case storage complexity of $N^2$ for the BV algorithm and $NT$ for the 
tuple-space algorithm, $T$ being the number of tuples. 
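The two worst-case bounds can be compared with a small calculation. The sketch below assumes one bit-vector per range, at most 2N + 1 ranges per dimension and 5 dimensions, as stated elsewhere in this chapter; the function name and the tuple count `T = 64` are hypothetical and chosen only to make the arithmetic concrete.

```python
def bitvector_storage_bits(num_rules: int, bits_per_vector: int, dimensions: int = 5) -> int:
    """Worst-case bit-vector storage: one vector per range, in every dimension."""
    max_ranges = 2 * num_rules + 1  # upper bound on the ranges in one dimension
    return dimensions * max_ranges * bits_per_vector


N = 1024  # number of rules
T = 64    # hypothetical number of tuples for the same rule-set (T <= N)

bv_bits = bitvector_storage_bits(N, bits_per_vector=N)     # grows as N * N
tuple_bits = bitvector_storage_bits(N, bits_per_vector=T)  # grows as N * T

print(f"BV: {bv_bits} bits, Tuple-Space: {tuple_bits} bits")
```

Under these assumptions the two bounds differ by a factor of N / T, and doubling N roughly quadruples the BV figure.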
The majority of the storage space for the two algorithms is used by the bit-vectors, and this 
becomes more pronounced as the size of the rule-set increases. Figures 5.2 and 5.3 show these 
ratios for the different sized rule-sets for the two algorithms. The list of ranges 
grows linearly, while the bit-vectors grow quadratically in both cases. 
Figure 5.3: The storage requirement percentages for the two data-structures of the 
Tuple-Space algorithm. [Chart: x-axis: Ruleset Size (32 to 16384); y-axis: Percentage Memory, 0% to 100%; series: Ranges, Bit-Vectors.] 
The HiCuts algorithm differs from the other two algorithms in that its storage requirements 
are not proportional to the number of rules. The two largest data-structures in the 
experimental results are those for the 8192-rule and 2048-rule sets. The data-structure for the 
4096-rule set, which lies between the other two in number of rules, is substantially smaller. 
The reason for this inconsistency is that the size of the HiCuts 
data-structures is less dependent on the number of rules and more dependent on the layout 
of the rules within the address space. If the rule-set has a large number of rules clustered 
in small areas, the algorithm will have a larger number of rules in each leaf node and a 
shallower tree¹. This will decrease the storage requirements for the rule-set. If the rules 
are more evenly distributed throughout the address space, each node will contain more 
children and it is more likely that rules will be duplicated in adjacent nodes. This will 
increase the storage requirements for the rule-set. It is difficult to determine the size of 
a HiCuts tree without first creating it; however, a worst-case bound of $N^d$, $d$ being the 
number of dimensions, is given in the literature [16]. 
¹ See Section 2.5.3. 
The Indirect and Direct HiCuts algorithms differ quite dramatically with regard to their 
storage requirements, and this becomes more apparent as the size of the rule-set increases. For 
the 16384-rule set, the Direct-HiCuts version requires almost 4 times as much memory as 
the indirect version. 
5.3 Search Speed 
The second metric is the search speed of the algorithm. In order for an algorithm to be 
useful, it needs to be able to classify packets at realistic network line-speeds. The IXP2400 
a 
at 
.1 
to execute an 
accesses as an 
to execute a 
it 
use 
r("~",,,,"'rt at 
Analysis 
Figure 5.4: The packet throughput for the linear search algorithm. [Chart: x-axis: Ruleset Size (32 to 16384); y-axis: Transmission Rate (Mbps), 0 to 600; series: 1 ME, 2 ME, 3 ME, 4 ME, 5 ME.] 
The results of the linear search are given for all rule-sets, using a number of MEs, in Figure 
5.4. For this algorithm, the performance decreases as the number of rules increases. On 
average, $\frac{N}{2}$ memory accesses are required per packet. 
Notice also that while the performance does increase as the number of MEs is increased, 
it does not do so linearly. This is the case for all the algorithms in 
this study and is therefore discussed separately in Section 5.3.4. 
5.3.2 Bit-Vector (BV) and Tuple-Space Algorithms 
The search speed of the BV and Tuple algorithms depends on two steps. The first 
step is finding the matching range. This needs to be done for all 5 dimensions of the 
classifier, with a theoretical maximum of 2N + 1 ranges per dimension. The ranges are 
stored in sorted lists, and it is therefore possible to use a binary search to find 
the enclosing range of a packet with $O(\log N)$ memory accesses. The bit-vectors for the BV 
and Tuple algorithms need to have as many bits as there are rules and tuples respectively. 
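The range lookup can be sketched with a standard binary search. In the sketch below each dimension holds a sorted list of range start values and one bit-vector per range (bit i set means rule i covers that range). The intersection step, which ANDs the vectors and takes the lowest set bit, follows the usual bit-vector scheme and is included as an assumption; the dictionary layout and the two-rule, two-dimension data are hypothetical.

```python
from bisect import bisect_right
from functools import reduce
from typing import Dict, List, Sequence


def find_range(starts: Sequence[int], value: int) -> int:
    """Binary search for the range containing value (largest start <= value)."""
    # Assumes starts[0] is the minimum possible field value, so the result is >= 0.
    return bisect_right(starts, value) - 1


def classify(dimensions: List[Dict[str, list]], header: Sequence[int]) -> int:
    """Return the index of the first matching rule, or -1 if no rule matches."""
    vectors = [
        dim["vectors"][find_range(dim["starts"], value)]
        for dim, value in zip(dimensions, header)
    ]
    matches = reduce(lambda a, b: a & b, vectors)  # rules matching in every dimension
    if matches == 0:
        return -1
    return (matches & -matches).bit_length() - 1  # position of the lowest set bit


# Rule 0: field0 in [10, 20], field1 in [0, 100]
# Rule 1: field0 in [0, 255], field1 in [50, 60]
dimensions = [
    {"starts": [0, 10, 21], "vectors": [0b10, 0b11, 0b10]},
    {"starts": [0, 50, 61, 101], "vectors": [0b01, 0b11, 0b01, 0b00]},
]

best = classify(dimensions, (15, 55))
```

Each `find_range` call inspects a logarithmic number of list entries, and the two rules here produce 3 and 4 ranges per dimension, within the 2N + 1 bound.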
Table 5.1 shows the number of tuples in each of the rule-sets used in this research. This 
agrees with the results obtained by the tuple-space algorithm authors: the rate at which 
Figure 5.5: The packet throughput for the BV algorithm for multiple MEs. [Chart: x-axis: Ruleset Size (32 to 16384); y-axis: Transmission Rate (Mbps), 0 to 3000; series: 1 ME, 2 ME, 3 ME, 4 ME, 5 ME.] 
Figure 5.6: The packet throughput for the Tuple-Space algorithm for multiple MEs. [Chart: x-axis: Ruleset Size (32 to 16384); y-axis: Transmission Rate (Mbps), 0 to 3000; series: 1 ME, 2 ME, 3 ME, 4 ME, 5 ME.] 
Figure 5.7: The packet throughput for the Direct-HiCuts algorithm. [Chart: x-axis: Ruleset Size (32 to 16384); y-axis: Transmission Rate (Mbps), 0 to 5000; series: 1 ME, 2 ME, 3 ME, 4 ME, 5 ME.] 
Figure 5.8: The packet throughput for the Indirect-HiCuts algorithm. [Chart: x-axis: Ruleset Size (32 to 16384); y-axis: Transmission Rate (Mbps), 0 to 5000; series: 1 ME, 2 ME, 3 ME, 4 ME, 5 ME.] 
² The width of the SRAM is 32 bits. 
³ This assumes each tuple is assigned to a unique hash key and therefore no conflicts occur. 
5.3.3 HiCuts Algorithm 
The performance metrics for the HiCuts algorithm are more difficult to analyse than those of 
the other algorithms in this study. Figures 5.7 and 5.8 show the search-speed metrics obtained 
for both versions of this algorithm. While the performance of the other algorithms 
decreased as the size of the rule-set increased, this is not the case for the HiCuts algorithm. 
The highest packet throughput for this algorithm is seen for the 64-rule set and the lowest 
for the 4096-rule set. To understand why this is the case, the underlying structure of the 
rule-set decision-trees needs to be examined. 
Analysis 
Like the other algorithms in this research, the classification performance of the HiCuts 
algorithm is dependent on the number of memory accesses per packet. The memory 
accesses for this algorithm are split between two processes: the tree traversal and the rule 
comparison. It requires 1 memory access to traverse each node, so the maximum 
number of memory accesses for the traversal is equal to the maximum depth of the tree. A more 
realistic metric is given by examining the average depth of the tree. This is determined by taking 
the total depth of all leaf nodes and dividing it by the total number of leaf nodes. The 
results of this are shown in Table 5.2. The average depth gives an indication of the 
average number of memory accesses required to reach a leaf node. The algorithm must then 
compare each rule in the leaf node with the current packet headers. The maximum number 
of rules in a leaf node is determined by the binth parameter, which in turn is determined 
by the structure of the rule-set [48]. The average number of rules in a node is considerably 
less than the maximum (Table 5.2). While there are a few nodes with approximately binth 
rules, the majority contain very few. Each rule comparison requires 2 memory accesses, 
with an average of $\frac{N_L}{2}$ rules compared per packet, $N_L$ being the number of rules 
in the particular leaf node. The overall classification time for the HiCuts algorithm would 
therefore be proportional to $d_A + \frac{N_A}{2}$, where $d_A$ is the average depth of the tree 
and $N_A$ is the average number of rules per leaf node. This is shown in Figure 5.9. 
The point is illustrated further in Figure 5.10. The rule-set in (a) consists of 5 rules and 
its tree has a maximum depth of 3. In order to match rule R2, a total of 7 memory accesses is 
required: 3 for the node traversal and 2 for each of the two rules compared. Tree (b) has 
double the rules of Tree (a) but is only 2 levels deep. This case requires 1 memory access to 
reach the first list of rules and another 4 to match rule R2, a total of 5 memory accesses. 
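The two access counts can be reproduced with a one-line cost function. This is only a worked restatement of the arithmetic in the paragraph: the function name is hypothetical, and the figure of two rules compared in each tree is inferred from the totals quoted in the text rather than read from Figure 5.10 itself.

```python
def hicuts_accesses(nodes_traversed: int, rules_compared: int, accesses_per_rule: int = 2) -> int:
    """Memory accesses for one lookup: one per tree node plus two per rule compared."""
    return nodes_traversed + accesses_per_rule * rules_compared


# Tree (a): 3 nodes traversed, then 2 rules compared before R2 matches.
tree_a = hicuts_accesses(nodes_traversed=3, rules_compared=2)

# Tree (b): 1 node traversed, then 2 rules compared before R2 matches.
tree_b = hicuts_accesses(nodes_traversed=1, rules_compared=2)

print(tree_a, tree_b)
```

The shallower tree wins even though its rule-set is twice as large, because the saving in traversal outweighs the unchanged rule-comparison cost.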
| No. of Rules | Max. Depth | Avg. Depth | binth | Max. Rules | Avg. Rules | 
| 32           | 9          | 3.8        | 4     | 4          | 2.3        | 
| 64           | 7          | 2.6        | 4     | 4          | 2.4        | 
| 128          | 10         | 2.8        | 4     | 4          | 2.6        | 
| 256          | 9          | 3.1        | 8     | 8          | 3.5        | 
| 512          | 11         | 3.8        | 8     | 8          | 2.4        | 
| 1024         | 9          | 2.9        | 16    | 16         | 3.3        | 
| 2048         | 9          | 3.5        | 32    | 32         | 3.7        | 
| 4096         | 3          | 1.6        | 256   | 250        | 8.8        | 
| 8192         | 8          | 2.6        | 64    | 64         | 6.1        | 
| 16384        | 4          | 1.1        | 512   | 512        | 9.6        | 
Table 5.2: The average depth and number of rules in the rule-sets. 
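The estimate of average depth plus half the average number of rules per leaf can be evaluated directly from the Avg. Depth and Avg. Rules columns of Table 5.2. The sketch below is illustrative only; the dictionary and function names are hypothetical, and the values are copied from the table.

```python
# (average depth d_A, average rules per leaf N_A) for each rule-set, from Table 5.2
TABLE_5_2 = {
    32: (3.8, 2.3), 64: (2.6, 2.4), 128: (2.8, 2.6), 256: (3.1, 3.5),
    512: (3.8, 2.4), 1024: (2.9, 3.3), 2048: (3.5, 3.7), 4096: (1.6, 8.8),
    8192: (2.6, 6.1), 16384: (1.1, 9.6),
}


def access_estimate(avg_depth: float, avg_rules: float) -> float:
    """Estimated cost of a lookup: d_A + N_A / 2."""
    return avg_depth + avg_rules / 2


estimates = {size: access_estimate(d, n) for size, (d, n) in TABLE_5_2.items()}

# Rule-sets ordered from the highest estimate (slowest) to the lowest (fastest)
ranking = sorted(estimates, key=estimates.get, reverse=True)
print(ranking)
```

The resulting order runs from the 4096-rule set (highest estimate) to the 64-rule set (lowest), which agrees with the lowest and highest throughputs reported for HiCuts and with the order of the rule-sets along the x-axis of Figure 5.9.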
Figure 5.9: The performance of the HiCuts algorithm with respect to the average tree 
depth and number of rules per node. [Chart: x-axis: Ruleset Size, ordered 4096, 16384, 8192, 2048, 512, 32, 256, 1024, 128, 64; left y-axis: Transmission Rate (Mbps), 0 to 2500; right y-axis: number of memory accesses, 0 to 7; series: 1 ME, dA + NA / 2.] 
Figure 5.10: A HiCuts tree showing the number of memory accesses required for (a) an 
example 5-rule set and (b) an example 10-rule set. 
5.3.4 Performance Increase from Using Multiple MEs 
This section addresses the issue of using multiple MEs for classification. All the 
algorithms discussed above show a performance increase as more MEs are added to the system; 
however, as expected, this increase is not linear. This is due to the two bottlenecks in the 
system: processing power and memory utilisation. 
During classification there will be a number of threads accessing the memory 
simultaneously. The SRAM controller has a number of queues in which memory requests are placed 
until they can be executed. The NP was designed in this way to hide memory latency 
by allowing other threads to execute while some are waiting on memory operations. 
This has the advantage of allowing multiple packets to be processed in parallel, thereby 
increasing the memory throughput. 
Memory accesses can, however, only be performed at a finite rate, and therefore as the number 
of MEs increases, so does the memory utilisation. Once this reaches 100%, no more memory 
accesses can be queued and increasing the number of MEs will have no effect on the packet 
processing speed. Figure 5.11 shows this graphically for the HiCuts algorithm using the 
8192-rule set. Related to this is the fact that increasing the number of MEs once the 
memory has been totally utilised can have a negative effect on the transmission rate. This 
can be seen in Figure 5.11 when increasing the number of MEs from 4 to 5. This effect 
is caused by the SRAM command queue back-pressure mechanism [63]. 
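The saturation behaviour can be illustrated with a deliberately simple model in which the aggregate rate grows with the number of MEs until the memory limit is reached. This is an idealised sketch, not a model taken from the study: the function names and both rate constants are hypothetical, and the model ignores both the gradual approach to saturation and the slight drop caused by back-pressure.

```python
def throughput(num_mes: int, per_me_rate: float, memory_limit: float) -> float:
    """Aggregate rate: scales with the number of MEs until memory is saturated."""
    return min(num_mes * per_me_rate, memory_limit)


def memory_utilisation(num_mes: int, per_me_rate: float, memory_limit: float) -> float:
    """Fraction of the memory's capacity in use, capped at 1.0 (100%)."""
    return min(1.0, num_mes * per_me_rate / memory_limit)


PER_ME_RATE = 170.0   # hypothetical rate one ME could sustain on its own (Mbps)
MEMORY_LIMIT = 600.0  # hypothetical rate at which the SRAM saturates (Mbps)

rates = [throughput(k, PER_ME_RATE, MEMORY_LIMIT) for k in range(1, 6)]
print(rates)
```

With these made-up constants the fourth ME is the last one that adds any throughput, which mirrors the qualitative pattern described for Figure 5.11.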
Also worth noting in this section are the results for the 64- and 128-rule sets. For 4 and 
5 MEs, the classification performance is no longer bounded by memory, but rather by 
the other parts of the architecture, namely the receiving and transmitting MEs and the 
physical network ports of the processor. 
Figure 5.11: The increased performance and memory utilisation as the number of MEs is 
increased. [Chart: x-axis: Number of MEs; left y-axis: Transmission Rate (Mbps), 0 to 700; right y-axis: Memory Utilisation (%), 0 to 100; series: Rate, Memory Utilisation.] 
For all the algorithms, the maximum transmission rate was reached using 4 MEs. At this 
stage, memory utilisation is close to 100% and adding further MEs did not increase 
the performance. For the 64- and 128-rule sets, the transmission rate is not limited by 
the memory utilisation, but rather by the maximum limitations of the hardware. The 
maximum rates for all the algorithms are summarised in Figure 5.12. 
Figure 5.12: The maximum transmission rate for all the algorithms using 4 MEs. [Plot: 
transmission rate against rule-set size (32 to 16384) for Indirect-HiCuts, Direct-HiCuts, 
BV, Tuple and Linear.] 
5.4 Algorithm Latency 
Another important metric, which is related to the search speed, is the latency introduced 
by the packet classification algorithm. Real-time applications, like VoIP and video 
conferencing, are particularly affected by this. In this research, latency is defined as the 
time taken, from beginning to end, to classify a packet. 
The packet throughput rate of an algorithm measures the total number of packets classified 
per second. This metric can be increased by processing more packets in parallel. Latency 
is measured on a per-packet basis and can therefore only be improved by modifying the 
classification algorithm. The results shown in Figure 5.13 are the average 
latencies for the different algorithms. They were obtained by adding code to the packet 
classification MEs that recorded the cycle count immediately before and immediately after 
the algorithm executed. The difference between these two values is used to calculate the 
latency of the algorithms. More information regarding this can be found in Appendix C. 
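The conversion from the two recorded cycle counts to a latency can be sketched as follows. 
This is an illustration only and not the code from Appendix C: the function names, the 
600 MHz clock frequency and the 32-bit counter width are assumptions made for the example. 

```python
def elapsed_cycles(start: int, end: int, counter_bits: int = 32) -> int:
    """Cycles between two readings of a free-running counter.

    The modulo handles a single wrap-around of the counter
    (the counter width is an assumption).
    """
    return (end - start) % (1 << counter_bits)


def latency_us(start: int, end: int, clock_hz: float, counter_bits: int = 32) -> float:
    """Latency in microseconds for the given clock frequency."""
    return elapsed_cycles(start, end, counter_bits) / clock_hz * 1e6


# Hypothetical readings: 60 000 cycles at an assumed 600 MHz clock.
print(latency_us(1_000, 61_000, 600e6))

# A reading taken just before the counter wraps is still handled.
print(elapsed_cycles(2**32 - 100, 200))
```

Averaging the per-packet values obtained in this way over a test run gives a mean latency 
of the kind plotted against rule-set size. 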
Figure 5.13: The latency of the different packet classification algorithms as a function of 
the rule-set size. [Plot: latency against rule-set size (32 to 16384) for Direct, Indirect, 
BV and Tuple.] 
Analysis 
The lower the latency, the more packets can be classified per second and therefore 
the higher the packet throughput of the algorithm. For a fixed number of packets processed 
in parallel, the packet throughput is therefore inversely proportional to the latency. This 
can be seen in Figure 5.14. 
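The inverse relationship follows from a simple steady-state argument: if a fixed number of 
packets is in flight and each takes the same time to classify, the throughput is that 
number divided by the latency. The sketch below illustrates this; the function name and the 
figures used are hypothetical and are not measurements from this research. 

```python
def throughput_pps(packets_in_flight: int, latency_us: float) -> float:
    """Steady-state packets per second for a fixed level of parallelism."""
    return packets_in_flight / (latency_us * 1e-6)


# Hypothetical: 32 packets in flight, compared at two latencies.
slow = throughput_pps(32, 20.0)
fast = throughput_pps(32, 10.0)
print(slow, fast, fast / slow)  # halving the latency doubles the throughput
```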
The highest latency for all rule-sets and algorithms is 251 µs. This number is small when 
compared with the 150 ms delay that is acceptable for VoIP traffic. This architecture is 
therefore suitable as a classification system for real-time applications. The algorithm that 
performed best in terms of latency was the HiCuts algorithm. 
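To put the two figures quoted above on the same scale, the worst-case classification latency 
can be expressed as a fraction of the VoIP delay budget. The helper below is a hypothetical 
illustration; only the values 251 µs and 150 ms come from the text. 

```python
def budget_fraction(latency_us: float, budget_ms: float) -> float:
    """Share of a delay budget consumed by a given latency."""
    return (latency_us * 1e-6) / (budget_ms * 1e-3)


fraction = budget_fraction(251.0, 150.0)
print(f"{fraction:.2%}")  # roughly 0.17% of the budget
```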
Figure 5.14: The latency and packet throughput of the HiCuts algorithm using 4 MEs. [Plot: 
throughput (left axis) and latency (right axis) against rule-set size (32 to 16384).] 
5.5 Summary 
In comparing the different packet classification algorithms, a number of observations 
can be made: 
Firstly, the algorithm that performs best with regard to classification speed is the Direct-
HiCuts algorithm. This is true for all the rule-sets used in this research, save one: the 
4096-rule set. The low performance for this particular rule-set is not a product of the 
number of rules, but rather of the structure of the rules. This makes it difficult to gauge 
the performance of this algorithm for a particular rule-set without first performing the 
preprocessing stage of the algorithm. 
In the case of the 4096-rule set, where there are a large number of clustered rules, the 
algorithm that performed best was the tuple-space algorithm. This algorithm had the 
second highest overall performance (after the two HiCuts variations) and outperformed 
the BV algorithm in all cases except the two smallest rule-sets. For these two rule-sets, 
the reduction in the size of the bit-vector from using tuples was not large enough to offset 
the added complexity of the algorithm. Because the number of tuples grows more slowly than 
the number of rules, the benefits of the tuple-space algorithm should become more 
pronounced as the rule-set size becomes larger. 
terms 
creases, 
rate 
t\veen 
same 
most 
processors as 
IS 
even more. 
·1 
current 
• I)elcmne more 
to a 
worst-case 
• 
• 
• 
on com-
• 
can 
• 
• 
overcome 
6.2 
are 
• It 
it is the ones 
• 
• 
• a 
it is to 
was current 
• 
account 
a 
to 
account is 
2 
" 
ease 
more on 
anet 
1 
[61 
[l1j 
anee. 
[151 on 
A 
to 
A 
2002. 
van A , pages 
A 
[56] 
2 
A 
are 
at 
.1 
a 
one 
to 
router. 2 
trace 
test 
.2 
to convert 
a 
to 
to create 
no 
.1 
Hash 
Unit 
N 
1: 
is a 
SRAM 
• Controller 0 
.2 
to 
can arrows 
to 
no 
are 
is 
store 
structures 
a 
to once 
a manner 
a 2 
lS 
.5 
.6 
a 
) 
to 
accesses to 
are 
to 
can be 
on a 
one to 
.1 
was 
~VJlHI.'J.l,"Al to be 
.2 
were 
were 
to 
source 
uvvu.vu. to 
can 
were 
