Network Processors and Next Generation Networks: Design, Applications, and Perspectives by VITUCCI, FABIO
UNIVERSITÀ DI PISA
Scuola di Dottorato in Ingegneria “Leonardo da Vinci”
Doctoral Program in
Information Engineering
Doctoral Thesis
Network Processors and
Next Generation Networks:
Design, Applications, and Perspectives
Author:
Fabio Vitucci
Advisors:
Prof. Franco Russo
Prof. Stefano Giordano
Year 2008
Index
List of figures
List of tables
Introduction
Chapter 1: Network Processors
1.1 Comparison among Network Processor Platforms
1.1.1 Multi-chip Pipeline (Agere)
1.1.2 Augmented RISC Processor (Alchemy)
1.1.3 Embedded Processor Plus Coprocessors (AMCC)
1.1.4 Pipeline of Homogeneous Processors (Cisco)
1.1.5 Configurable Instruction Set (Cognigine)
1.1.6 Pipeline of Heterogeneous Processors (EZchip)
1.1.7 Extensive and Diverse Processors (IBM)
1.1.8 Flexible RISC Plus Coprocessors (Motorola)
1.2 Intel IXP2XXX Network Processors
1.2.1 General Structure
1.2.2 The Intel XScale
1.2.3 Microengines
1.2.4 Memories
1.2.5 Media Switch Fabric
1.2.6 SHaC
1.2.7 Intel IXA Portability Framework
1.2.8 IXA SDK
1.2.9 Developer's Workbench
1.2.10 ENP-2611
Chapter 2: REFINE: the REconfigurable packet FIltering on NP
2.1 The main idea
2.2 Related Works
2.3 Multidimensional Multibit Trie
2.4 Application Design
2.5 Reconfigurability
2.6 Implementation
2.7 Performance Evaluation
2.8 Final considerations
Chapter 3: Amber Sched: a resource scheduler for NPs
3.1 The main idea
3.2 Related Works
3.3 IPv4 Forwarder Evolution
3.4 Model
3.4.1 Predictors
3.4.2 Scheduler Design
3.5 Simulation
3.6 Implementation and Results
3.7 Final considerations
Chapter 4: A cooperative NP/PC architecture for measurements
4.1 Motivations
4.2 The Basic Idea and Issues
4.3 The Implementation Design
4.4 Network Processor side
4.4.1 Microengines Application Scheme
4.4.2 XScale Application Scheme
4.5 PC side
4.5.1 Kernel space: the compatibility abstraction layer
4.5.2 User Space: the user interface
4.6 Timestamping
4.6.1 Time Budget
4.6.2 The Accuracy of Timestamp
4.7 Experimental Results
4.8 Final considerations
Chapter 5: BRUNO: a high performance traffic generator
5.1 The main idea
5.2 Related Works
5.2.1 BRUTE
5.2.2 Traffic generators on the IXP2400 Network Processor
5.3 BRUNO
5.3.1 BRUTE and BRUNO
5.3.2 Design of BRUNO
5.4 Components of BRUNO
5.4.1 Load Balancer
5.4.2 Traffic Generators
5.4.3 Transmitter
5.5 Communication between BRUTE and NP
5.5.1 Synchronization
Chapter 6: Smart data structures for NPs
6.1 The main idea
6.2 Background on Bloom Filters
6.3 The new upper bound of Overflow Probability
6.4 MultiLayer Hashed CBF (ML-HCBF)
6.4.1 The algorithm
6.4.2 Properties
6.4.3 Operational Complexity
6.4.4 Simulation results
6.5 Huffman Spectral Bloom Filters
6.5.1 Theoretical basis
6.5.2 The algorithm
6.5.3 Size
6.5.4 Lookup
6.5.5 Insertion/Deletion
6.6 MultiLayer Compressed CBF
6.6.1 Complexity and properties
6.6.2 Size
6.7 Comparative Analysis
6.8 Blooming Tree
6.8.1 The algorithm
6.8.2 Properties of Blooming Tree
6.8.3 Memory Optimization
6.8.4 Measurements
Conclusions
Bibliography
Acknowledgments
List of Figures
1.1 Architecture of NP Agere.
1.2 Internal structure of FPP unit.
1.3 Internal structure of RSP.
1.4 The Alchemy chip.
1.5 AMCC nP7510.
1.6 A possible configuration of Cisco PXF.
1.7 Standard path of a packet in a PRE.
1.8 Internal structure of Cognigine network processor.
1.9 The scheme of NP-1 chip.
1.10 Internal architecture of IBM network processor.
1.11 The EPC chip in the IBM NP.
1.12 Architecture of C-Port.
1.13 Internal architecture of a Channel Processor.
1.14 Scheme of the IXP2400.
1.15 Compilation process.
2.1 The beginning of multidimensional multibit trie.
2.2 A simple example of trie backtracking.
2.3 Classifier architecture at microengine level.
2.4 The experimental test-bed.
2.5 Transfer delay.
2.6 Time evolution of transfer delay values.
3.1 IPv4 Forwarder
3.2 IPv4 Forwarder with AMBER Resource Scheduler
3.3 Theoretical completion time, emulated values and number of packets per second (dotted line) as a function of active threads number.
4.1 Conceptual scheme of the monitoring system.
4.2 Functional scheme of the entire NP-side application.
4.3 Batch frame and packet digest specification.
4.4 The virtual interfaces moni.
4.5 Hardware packet receiving chain.
4.6 Histogram of measured Tproc. Inequality (4.1) is satisfied for 4 threads (Tproc < 816 cc) and 8 threads (Tproc < 1632 cc).
4.7 Packets rawly saved to trace file.
4.8 Packets captured from the mouse flow.
5.1 Architecture of BRUTE.
5.2 Mapping BRUTE in BRUNO.
5.3 Architecture of BRUNO.
5.4 Structure of packet request.
5.5 Structure for a flow.
5.6 Address Translation.
5.7 DRAM window circular buffer.
6.1 A Bloom Filter.
6.2 A Counting Bloom Filter.
6.3 Bounds comparison. P'_b is always tighter than P_b.
6.4 Process of finding a current counter for ML-HCBF.
6.5 A Huffman tree for the CBF bin counters.
6.6 Example of fast lookup through popcount.
6.7 An example of HSBF.
6.8 ML-CCBF example. The resulting Huffman code for ϕ is 1110.
6.9 Size comparison among ML-CCBF, CBF and m × Entropy.
6.10 An example of a Naive Blooming Tree with b = 1.
6.11 An example of an Optimized Blooming Tree with b = 1.
6.12 Size comparison for NBT, OBT, dl-CBF and CBF with n = 2048.
List of Tables
1.1 Units and functionalities of Agere system.
1.2 Processors and functionalities of RSP unit.
1.3 Processors of NP-1.
1.4 Co-processors of IBM NP.
1.5 Properties of IXP2400 memories.
3.1 Simulated predictions (clock cycles)
3.2 Simulated completion time (clock cycles) and microengines utilization.
3.3 Measured delay for IPv4 Forwarder with and without AMBER Sched
6.1 Data Structures Comparison. Size is expressed in KBytes
6.2 Number of Clock Cycles for Operations in the IXP2800
6.3 Performance Algorithms Comparison
6.4 Number of Clock Cycles for Operations in the IXP2350
6.5 Performance Algorithms Comparison

Introduction
Network Processors (NPs) are hardware platforms conceived as appealing
solutions for packet processing devices in networking applications.
Nowadays, a plethora of solutions exists, with no agreement on a common
architecture: each vendor has proposed its own solution and no official
standard exists yet.
The common features of all proposals are a hierarchy of processors (a
general-purpose processor plus several units specialized for packet
processing), a series of memory devices with different sizes and
latencies, and low-level programmability. The target is a platform for
networking applications with low time to market and high time in market,
thanks to high flexibility and to a programmability that is simpler than
that of ASICs, for example.
About ten years after the "birth" of network processors, this research
takes stock of their development and usage. Many authoritative opinions
suggest that NPs have been made obsolete by multicore or manycore systems,
which provide general-purpose environments and some specialized cores.
The main reasons for these negative opinions are the difficult
programmability of NPs, which often requires knowledge of proprietary
microcode, and their severe architectural limits, such as small memories
and a minimal instruction store.
Our research shows that Network Processors can be appealing for different
applications in the networking area, and that many interesting solutions
can be obtained which deliver very high performance, outperforming current
solutions. However, the issues of difficult programming and considerable
architectural limits do exist, and they can be alleviated only by
providing a nearly comprehensive programming environment and a proper
design in terms of processing and memory resources. More efficient
solutions can surely be provided, but the experience with network
processors has left an important legacy in the development of packet
processing engines.
In this work, we have built several networking devices based on the NP
platform, in order to understand:
• the complexity of programming;
• the ﬂexibility of design;
• the complexity of tasks that can be implemented;
• the maximum depth of packet processing;
• the performance of such devices;
• the real usefulness of NPs in network devices.
All these features have been accurately analyzed and are illustrated in
this thesis. Many remarkable results have been obtained, which confirm
that Network Processors are appealing solutions for network devices.
Moreover, the research on NPs has led us to analyze and solve more general
issues, related for instance to multiprocessor systems or to processors
with little available memory. In particular, the latter issue led us to
design several data structures for set representation and membership
query, which are based on randomized techniques and allow for large memory
savings.
The ﬁrst chapter presents a comparison among the diﬀerent architectures
of network processors: the main features and the principal diﬀerences among
them are illustrated. Then a more detailed description of the Intel IXP2XXX
family is given, which includes the network processor we used in our research
activities.
The second chapter shows the overall process of design and realization
of a multidimensional packet classiﬁer on the IXP2400 NP. All the phases
are analyzed, from the choice of the algorithm to the implementation, up to
functional optimization and measurements.
In the third chapter a processing scheduling scheme for network processors
is proposed. It addresses the problem that the growth of processing power
cannot keep pace with the growth of link capacity. A link capacity
scheduler is therefore no longer sufficient to ensure efficient service
differentiation to end users; a proper allocation of computing power for
packet processing must be adopted as well.
In the fourth chapter we present a new traﬃc monitoring device, based on
a cooperative PC/NP architecture. It outperforms the previous solutions in
terms of packet capturing power and takes care of timestamp accuracy.
The fifth chapter describes the overall design of a high performance
traffic generator. This work also presents a cooperative PC/NP
architecture: the PC generates packet lengths and departure times
according to user settings and traffic models, and the NP microengines
take care of the actual packet generation and transmission.
In the last chapter, all the algorithms and the structures we have designed
for data representation are presented. They are based on randomized tech-
niques and show remarkable results. As said above, the hardware targets are
network processors, but the ideas and the algorithms can be exported to many
other platforms.

Chapter 1
Network Processors
As said in the introduction, no official standard for network processors
exists yet, and each vendor has proposed its own solution. The common
features of all proposals are a hierarchy of processors, a series of
memory devices with different sizes and latencies, and low-level
programmability. The target is a platform for networking applications
with low time to market and high time in market, thanks to high
flexibility and easy programmability.
In this chapter, we ﬁrst present a comparison among the available network
processor platforms and then a detailed description of the hardware reference
of our activity, the Intel IXP2XXX family.
1.1 Comparison among Network Processor Platforms
1.1.1 Multi-chip Pipeline (Agere)
Agere Systems Incorporated (i.e. the microelectronics branch of Lucent
Technologies) offers an NP family called Payload Plus [1]. It has three
interesting features: a multichip architecture, a programmable classifier,
and a flexible management of input data.
Architecture
The Agere system is composed of three different units. Fig. 1.1 shows the
interconnections among the different chips and the data flow through the
resulting pipeline.
Figure 1.1: Architecture of NP Agere.
The Fast Pattern Processor (FPP) and the Routing Switch Processor (RSP)
form the basic pipeline for fast data path processing. The ingress packets
are forwarded to the FPP, which sends them, along with a set of
instructions, to the RSP. The packets are then forwarded toward the
switching fabric.
A third chip, the Agere System Interface (ASI), is a co-processor that
introduces new functionalities to improve general performance. The ASI
gathers statistical information on packets, which is then used for traffic
management. Moreover, the ASI provides a connection toward a distinct
processor (not shown in the figure), which is used to manage the overall
system and the exception packets.
The system offers other connections: for instance, the configuration bus
also connects the ingress hardware interfaces in order to coordinate the
data flow toward the FPP. Fig. 1.2 shows the internal architecture of the
FPP.
Processors and functional units
Each chip in the Agere system contains several processors and provides
different functionalities. Tab. 1.1 shows the features of each component.
The Functional Bus Interface (FBI) implements an interesting form of
Remote Procedure Call (RPC), which allows the FPP to call functions that
are external to the FPP unit. This way it is possible to extend the FPP
functionalities by adding ASIC hardware.
Figure 1.2: Internal structure of FPP unit.
Table 1.1: Units and functionalities of Agere system.
Unit                          Functionality
Pattern processing engine     Pattern matching
Queue engine                  Manage packet queuing
Checksum/CRC engine           Compute checksum or CRC
ALU                           Classical operations
Input interface and framer    Divide ingress packets into 64-byte blocks
Data buffer controller        Check access to external data buffer
Configuration bus interface   Connect to external configuration bus
Functional bus interface      Connect to external functional bus
Output interface              Connect to external RSP chip
Figure 1.3: Internal structure of RSP.
Moreover, the FPP contains the interfaces for each external connection.
For instance, the pattern processing engine can interface to an external
control memory by means of a program memory and a queue engine. To handle
packets, the FPP has an ingress interface, a framer that divides packets
into 64-byte blocks (see Tab. 1.1), and an output interface toward the
RSP.
The FPP also contains an external interface for the configuration bus. The
central part is a functional bus, to which all the processors can connect.
There is also an external interface for the functional bus, which is used
by the ASI to check the processing.
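The framing step performed by the FPP can be illustrated with a short sketch. The Python code below is only an illustration and not Agere code: the function names are hypothetical, and leaving the last block shorter than the others (instead of padding it) is an assumption made for the example. The 64-byte block size is the one listed in Tab. 1.1.

```python
BLOCK_SIZE = 64  # bytes; block size listed in Tab. 1.1


def split_into_blocks(packet: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split a packet into consecutive blocks; the last one may be shorter."""
    return [packet[i:i + block_size] for i in range(0, len(packet), block_size)]


def reassemble(blocks: list[bytes]) -> bytes:
    """Rebuild the original packet from its blocks."""
    return b"".join(blocks)


# A 150-byte dummy packet is framed into two full blocks and one partial block.
packet = bytes(range(150))
blocks = split_into_blocks(packet)
```

Working on fixed-size blocks lets the downstream engines handle packets of any length with buffers of a single size, at the cost of a reassembly step before transmission.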
The RSP unit, whose internal structure is shown in Fig. 1.3, has a set of
processors and functionalities listed in Tab. 1.2. The stream editor, the
traffic manager and the traffic shaper are built with Very Long
Instruction Word (VLIW) processors.
Table 1.2: Processors and functionalities of RSP unit.
Unit                          Functionality
Stream editor engine          Packet modification
Traffic manager engine        Regulate traffic and hold statistics
Traffic shaper engine         Check QoS parameters
Input interface               Receive packets from FPP
Packet assembler              Store arriving packets
Queue manager logic           Interface to external traffic scheduler
Output interface              External connection for output packets
Configuration bus interface   Connect to external configuration bus
Memory
In the Agere architecture, both external and internal memories are
provided. The FPP divides packets into blocks and stores them in an
external data buffer (by means of an interface on the chip). It uses the
internal memory for packets in the processing stage, while the external
memory is used to store programs and instructions. The RSP stores packets
in an external SDRAM and uses a Synchronous Static RAM (SSRAM) for high
priority queues.
Programming support
Ease of programming is an appealing feature of the Agere chip. The FPP is
a pipelined multithreaded processor and provides 64 independent contexts.
However, the parallelism is hidden from the programmer, who can therefore
use high-level languages. Agere also offers a specific language for
classification, the Functional Programming Language (FPL), and a scripting
language, the Agere Scripting Language (ASL). Moreover, Agere offers
substantial support for traffic management: the logic of the RSP allows
using multiple queues, applying external scheduling rules and handling
traffic shaping.
1.1.2 Augmented RISC Processor (Alchemy)
Alchemy Semiconductor Inc. (acquired by Advanced Micro Devices) offers
different versions of network processors with different speeds [2]. These
solutions are based on a RISC processor enriched with instructions
specialized for packet processing.
Figure 1.4: The Alchemy chip.
Architecture
This architecture is characterized by an embedded RISC processor along
with a series of co-processors. The core, a MIPS32 CPU, uses a five-stage
pipeline, pipelined register file access and zero-penalty branching to
improve performance. Many instructions have been added to the set, such as
a "multiply and accumulate" to aid in CRC or checksum computation. Other
added instructions are those for memory prefetch, for conditional move
operations, and for counting leading 0s and 1s. Fig. 1.4 shows the
internal organization of the Alchemy chip.
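To see why an accumulate-style instruction helps with checksums, recall that the Internet checksum used by IPv4, TCP and UDP is a one's complement sum of 16-bit words, i.e. a long chain of additions into a single accumulator. The sketch below shows that accumulation loop in Python rather than in MIPS assembly; the function name is illustrative and the code does not model the Alchemy instruction set.

```python
def internet_checksum(data: bytes) -> int:
    """One's complement sum of 16-bit big-endian words, then complemented."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    acc = 0
    for i in range(0, len(data), 2):
        acc += (data[i] << 8) | data[i + 1]  # accumulate one 16-bit word
    while acc >> 16:
        acc = (acc & 0xFFFF) + (acc >> 16)  # fold the carries back in
    return ~acc & 0xFFFF


# Sample IPv4 header with the checksum field (bytes 10-11) set to zero.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
checksum = internet_checksum(header)
```

Recomputing the checksum over a header that already carries the correct value yields zero, which is how a receiver verifies it.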
Processors and functional units
As shown in Fig. 1.4, the embedded RISC processor can access a number of
I/O controllers and functional units. The chip also contains a Real Time
Clock (RTC) unit.
Memory
On the chip there are two caches of 26KB, one for instructions and one for
data, and connections for external SDRAM and SSRAM. The bus which connects
the SSRAM also provides access to a Flash memory, a ROM and a PCMCIA
unit.
Programming support
Given that the Alchemy chip uses a MIPS processor, it can be programmed
in C.
1.1.3 Embedded Processor Plus Coprocessors (AMCC)
Applied Micro Circuits Corporation (AMCC) offers a series of NPs with
different performance levels [3]. The AMCC architecture exploits
parallelism efficiently in order to obtain high data rates.
Architecture
The version nP7510 includes 6 embedded processors (called nP cores), which
work in parallel (e.g., a packet transform engine, a policy engine, a metering
engine) and other functional units which provide external interfaces. An ex-
ternal co-processor handles address lookups based on a Longest Preﬁx Match
algorithm. Fig. 1.5 shows the scheme of AMCC chip.
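As a reminder of what a Longest Prefix Match lookup computes, the following minimal Python sketch selects, among all the prefixes that contain a destination address, the one with the longest mask. It uses a plain linear scan over a hypothetical routing table; this is an assumption made for readability and says nothing about how the external co-processor is actually implemented.

```python
import ipaddress


def longest_prefix_match(table: dict, address: str):
    """Return the next hop of the most specific prefix containing address."""
    addr = ipaddress.ip_address(address)
    best = None  # (network, next_hop) of the best match found so far
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return None if best is None else best[1]


# Hypothetical routing table: prefix -> next hop label.
routes = {
    "0.0.0.0/0": "default",
    "10.0.0.0/8": "port1",
    "10.1.0.0/16": "port2",
}
```

With this table, an address inside 10.1.0.0/16 is resolved to "port2" even though it also matches the /8 prefix and the default route.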
Processors and functional units
Each processor provides hardware threads with zero-overhead context
switching. This way, the nP7510 can process several packets or cells
simultaneously. The programming model of AMCC hides the parallelism from
the programmer, who can write code as for a single processor. Moreover,
each packet or cell is processed by a single thread, which avoids the need
to partition the code and to implement complex load-balancing algorithms.
The Packet Transform Engine, which is optimized for packets or cells,
operates on frames in parallel with the nP cores; several operations can
be performed in a single instruction: inserting or deleting data,
computing and appending the CRC, or changing values in the packet header.
The Special Purpose Engine makes it possible to eliminate mutexes or other
software mechanisms for synchronizing thread access to shared resources.
Figure 1.5: AMCC nP7510.
The Policy Engine is dedicated to search and classification operations.
Many lookups (up to 512 with compound keys) can be made simultaneously
with a fixed latency. A key feature of the Policy Engine is the
"Network-Aware CASE Statement": the use of multiple concurrent
classifications eliminates nested "if-then-else" instructions, thus
reducing code size and improving performance.
The metering engine enables the collection of information for remote
monitoring through SNMP, while the Statistic Engine enables the automated
collection of statistics based on the RMON protocol.
The nP7510 has been designed to support a speed of 10 Gbps. It can be
interfaced with the traffic management chipset nPX5710/20. The
configuration can be doubled in order to handle full-duplex traffic at
10 Gbps. The nPX5710/20 also contains a virtual Segmentation And
Reassembly (SAR) unit.
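The idea behind the "Network-Aware CASE Statement" of the Policy Engine can be approximated in software by replacing nested conditionals with a single lookup on a compound key. The Python sketch below is only an analogy: the rule table, the field names and the first-match policy are assumptions made for illustration, and the sequential loop stands in for what the hardware evaluates concurrently.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    """One classification rule; None in a field acts as a wildcard."""
    protocol: Optional[int]
    dst_port: Optional[int]
    action: str


# Hypothetical rule table, most specific rules first.
RULES = [
    Rule(6, 80, "http"),          # TCP, port 80
    Rule(6, None, "tcp_other"),   # any other TCP traffic
    Rule(17, 53, "dns"),          # UDP, port 53
    Rule(None, None, "default"),  # everything else
]


def classify(protocol: int, dst_port: int, rules=RULES) -> str:
    """Return the action of the first rule matching the compound key."""
    for rule in rules:
        if rule.protocol in (None, protocol) and rule.dst_port in (None, dst_port):
            return rule.action
    return "drop"  # reached only if the table has no catch-all rule
```

The per-packet code reduces to one call to `classify` followed by a dispatch on the returned action, instead of a tree of "if-then-else" tests on each header field.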
Memory
As in many NPs, the AMCC chip offers both external and internal memories.
Moreover, a controller manages the two types of memory and hides this
dual nature from the processor. An external TCAM is used for packet
classification.
Figure 1.6: A possible configuration of Cisco PXF.
Programming support
These processors can be programmed in C or C++; AMCC provides a com-
piler, an assembler and a debugger.
1.1.4 Pipeline of Homogeneous Processors (Cisco)
The Parallel eXpress Forwarding (PXF) network processor has been designed
by Cisco to be used in Cisco routers [4].
Architecture
The PXF adopts a parallel architecture that can be conﬁgured in order to
create a series of pipelines. A single chip contains 16 embedded processors
that can be put to work on 4 parallel pipelines. Figure 1.6 shows a possible
organization of processors.
Processors and functional units
The PXF architecture features a separation between control plane and
forwarding plane. A route processor takes care of routing protocols,
network configuration, error handling, and packets destined to the router
itself.
The forwarding plane, instead, is controlled by the PXF technology. In the
PXF, each processor is optimized for high-speed packet processing and is
completely independent of the others; these units are called Express Micro
Controllers (XMCs) and contain a complex dual execution unit, provided
with several specific instructions for efficient packet processing.
Moreover, the XMCs can access different resources on the chip, such as
register files and timers. They also have shared access to an external
memory in order to store state information, such as routing tables and
packet queues. Finally, some micro-controllers guarantee that processing
results can be passed among subsequent XMCs on the same pipeline.
Figure 1.7: Standard path of a packet in a PRE.
Figure 1.7 illustrates the path of a packet through this architecture. In
this configuration, 2 PXF network processors are used for each Performance
Routing Engine (PRE), thus obtaining 4 pipelines of 8 processors. Whenever
a packet reaches a PRE from the ingress interface, it enters the ASIC
backplane interface and is buffered in the input packet memory. The header
is extracted and sent to the PXF for packet classification, header
modification and, if needed, data modification. The processing also
includes the selection of the port on which the packet is forwarded.
By means of simple routine algorithms, the PXF instructs the ASIC
backplane interface to store the packet in its packet-buffer memory, in
one of the possible queues associated with the corresponding output
queues. Then, the scheduling function of the PXF processes this queue in
order to determine which packet is to be forwarded next. After this
decision, the PXF instructs the ASIC backplane interface to copy this
packet to the hardware queue associated with the corresponding egress
interface.
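The store-then-schedule sequence described above can be sketched as follows. The Python code is a simplified model and not Cisco code: the class and method names are hypothetical, one software queue per egress port is assumed, and round-robin is used only as a placeholder, since the scheduling discipline of the PXF is not stated here.

```python
from collections import deque


class OutputQueues:
    """Per-port packet queues served by a round-robin scheduler."""

    def __init__(self, ports):
        self.queues = {port: deque() for port in ports}
        self._order = list(ports)
        self._next = 0  # index of the port to be examined next

    def enqueue(self, port, packet):
        """Store a classified packet in the queue of its egress port."""
        self.queues[port].append(packet)

    def schedule(self):
        """Return (port, packet) for the next packet to send, or None."""
        for _ in range(len(self._order)):
            port = self._order[self._next]
            self._next = (self._next + 1) % len(self._order)
            if self.queues[port]:
                return port, self.queues[port].popleft()
        return None  # all queues are empty


q = OutputQueues(["a", "b"])
q.enqueue("a", "p1")
q.enqueue("a", "p2")
q.enqueue("b", "p3")
```

Each call to `schedule` corresponds to the decision step after which the selected packet would be copied to the hardware queue of its egress interface.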
Memory
There is an independent memory for each processor and one for each column
of processors, in order to optimize accesses.
Programming support
This network processor was built for internal use, not as a
general-purpose product, and therefore it uses proprietary software.
Microcode and Cisco IOS are combined to provide processing functions. The
association of these functions to the processor pipeline is very flexible
and can be updated when new functions become available.
1.1.5 Conﬁgurable Instruction Set (Cognigine)
The network processor of Cognigine Corporation is an example of
reconfigurable logic: the adopted processor does not have a fixed
instruction set.
Architecture
This architecture allows for using up to 16 processors, which can be
interconnected to form a pipeline. Each processor is called
Reconfigurable Communication Unit (RCU) and has a connector that links it
to the Routing Switch Fabric (RSF), which provides arbitration and
scheduling of communications. The RCUs are connected in a hierarchical
manner: a crossbar is used to connect a group of 4 RCUs and another one to
connect groups of RCUs. This solution allows the architecture to scale to
a large number of RCUs. The RSF allows a transaction to be split in order
to hide latencies; an RCU accesses it through memory mapping.
Each RCU contains 4 execution units which can be dynamically reconfigured.
Each unit uses an instruction set called Variable Instruction Set
Communications (VISC). As in a standard processor, a VISC instruction
performs a simple operation, but the details of the operation are not
determined a priori. In fact, the chip contains a dictionary which defines
the interpretation of each instruction: the size of the operands, how they
can be employed, the basic operation and the predicate. The dictionary is
in turn configurable: elements can be added or dynamically changed. This
way, the programmer can define a personal instruction set, insert the
interpretation of these instructions and develop a program based on them.
For instance, a programmer could define an instruction set optimized for
particular processing tasks or specific protocols.
Figure 1.8: Internal structure of Cognigine network processor.
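The dictionary mechanism can be pictured as a table that maps each opcode to its current interpretation. The Python sketch below is a toy model under explicit assumptions: the class names, the three fields of a dictionary entry (operand width, operation, predicate) and the execution rule are all hypothetical and are chosen only to mirror the description above, not the actual VISC encoding.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass
class DictionaryEntry:
    """Interpretation of one opcode: operand width, operation, predicate."""
    operand_bits: int
    operation: Callable[[int, int], int]
    predicate: Callable[[int, int], bool]


class ConfigurableUnit:
    """Execution unit whose instruction set is defined by a dictionary."""

    def __init__(self):
        self.dictionary = {}

    def define(self, opcode: int, entry: DictionaryEntry) -> None:
        """Add an instruction or change the meaning of an existing one."""
        self.dictionary[opcode] = entry

    def execute(self, opcode: int, a: int, b: int):
        """Run an opcode; return None when its predicate is false."""
        entry = self.dictionary[opcode]
        mask = (1 << entry.operand_bits) - 1
        a &= mask
        b &= mask
        if not entry.predicate(a, b):
            return None
        return entry.operation(a, b) & mask


unit = ConfigurableUnit()
# Opcode 0 is first defined as an unconditional 8-bit addition.
unit.define(0, DictionaryEntry(8, operator.add, lambda a, b: True))
```

Redefining opcode 0 at run time, for instance as a 16-bit XOR, changes what the same program word does without changing the program itself, which is the essence of a configurable instruction set.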
VISC instructions are decoded during the ﬁrst stage of the pipeline. Each
RCU provides a ﬁve-stage pipeline and hardware support for 4 threads.
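The dictionary mechanism can be pictured with a small C sketch. It is only
an analogy: the entry layout (visc_entry), the function names and the two
sample operations are hypothetical and do not reflect the actual Cognigine
encoding; the predicate and operand-routing fields are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dictionary entry: how one opcode is to be interpreted. */
typedef uint32_t (*visc_op)(uint32_t a, uint32_t b);

typedef struct {
    visc_op  op;     /* basic operation */
    unsigned width;  /* operand size in bits (1..32) */
} visc_entry;

#define DICT_SIZE 4
static visc_entry dict[DICT_SIZE];

/* Two sample basic operations (illustrative only). */
uint32_t op_add(uint32_t a, uint32_t b) { return a + b; }
uint32_t op_xor(uint32_t a, uint32_t b) { return a ^ b; }

/* Define, or redefine at run time, the meaning of an opcode. */
void visc_define(unsigned opcode, visc_op op, unsigned width)
{
    assert(opcode < DICT_SIZE && op != NULL && width >= 1 && width <= 32);
    dict[opcode].op = op;
    dict[opcode].width = width;
}

/* Execute an opcode by looking up its current interpretation. */
uint32_t visc_exec(unsigned opcode, uint32_t a, uint32_t b)
{
    assert(opcode < DICT_SIZE && dict[opcode].op != NULL);
    unsigned w = dict[opcode].width;
    uint32_t mask = (w == 32) ? 0xFFFFFFFFu : ((1u << w) - 1u);
    return dict[opcode].op(a & mask, b & mask) & mask;
}
```

The same opcode number can thus behave as an 8-bit addition at one moment
and as a 16-bit XOR after the dictionary entry has been rewritten, which is
the essence of a configurable instruction set.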
The interconnections among processors are also configurable. Each RCU has
four 64-bit data buses and four 32-bit address buses, which allow RCUs to
be connected in a pipeline.
Memory
RCUs access different types of memory, such as the internal SSRAM or the
Double Data Rate SDRAM (DDR-SDRAM). The dictionary for VISC instructions is
stored in a separate memory. The memories form a hierarchy in which the
fastest ones are the internal registers and the scratchpad memory, followed
by the instruction cache and the memory dedicated to data, while the
slowest one is the external memory, which is designed to store packets.
Programming support
In order to maximize parallelism, the RCUs provide hardware support for
multithreading. Moreover, there are connections to external buses, such as
the PCI bus. Finally, along with a C compiler and an assembler, Cognigine
offers support for a classification language.
1.1.6 Pipeline of heterogeneous processors (EZchip)
EZchip Corporation produces the NP-1 network processor [5]. This
architecture shows how heterogeneous processors, each dedicated to specific
functions, can work together in a pipeline. The NP-1 has been designed for
an ambitious target: processing of layers 2-7 at 10 Gbps.
The chip also contains a very fast SRAM, which is used for storing packets
and lookup tables. There is an interface to an external DRAM (external
SRAMs and CAMs are not necessary). The chip also includes an interface to
an external processor for management and control functions (this interface
is not shown in the figure).
Moreover, EZchip claims to use patented algorithms which exploit embedded
memory for searches in external memories, in order to sustain a line rate
of 10 Gbps. These algorithms and the associated data structures support
searches on strings of variable length. Further details are not publicly
available.
Architecture
The chip is built around Task Optimized Processors (TOPs). Each TOP has its
own instruction set and connections, which are specific to the functions it
must provide. Figure 1.9 illustrates the chip architecture. The NP-1
contains 4 types of processors, which are described in tab. 1.3.
Figure 1.9: The scheme of NP-1 chip.
Table 1.3: Processors of NP-1.
Processor type   Optimized for
TOPparse         Header field extraction and classification
TOPsearch        Table lookup
TOPresolve       Queue management and forwarding
TOPmodify        Header and payload modification
1.1.7 Extensive and Diverse Processors (IBM)
IBM produces a family of network processors called PowerNP [6]. This
solution is very complex and comprises a wide range of processors,
co-processors and functional units.
Figure 1.10: Internal architecture of IBM network processor.
Architecture
This network processor contains programmable processors and several
co-processors which handle searches, frame forwarding, filtering and frame
modification. The architecture is composed of a set of central embedded
processors, along with many supporting units. Fig. 1.10 shows the overall
architecture, while fig. 1.11 illustrates in detail the area called
Embedded Processor Complex (EPC).
In addition to the embedded PowerPC processor, the EPC contains 16
programmable processors, which are called picoengines. Each picoengine is
multithreaded, which further improves performance. In order to speed up
processing, frames are pre-processed before being passed to the protocol
processors in the EPC.
The ingress physical MAC multiplexor takes frames arriving from the physical
interface, checks the CRC and passes the frames to the ingress data store.
The first part of each frame, which contains the headers up to layer 4, is
passed to the protocol processors, while the remaining part is stored in
memory. Once a frame has been processed, the ingress switch interface
forwards it toward the proper output processor through the switching
fabric.
Figure 1.11: The EPC chip in the IBM NP.
The hardware outside the EPC also takes care of frame output. The egress
switch interface receives data from the switching fabric and stores them in
the egress data store. The egress physical MAC multiplexor handles frame
transmission, by extracting frames from the egress data store and sending
them to the physical interface.
In addition to the picoengines, the IBM chip contains several co-processors
specialized for particular functions. Some examples are presented in
tab. 1.4.
Table 1.4: Co-processors of IBM NP.
Co-processor   Function
Data Store     Frame buffer DMA
Checksum       Compute and check header checksums
Enqueue        Forward frames arriving from switch or target queues
Interface      Provide access to internal registers and memory
String Copy    Transfer big amounts of data at high speed
Counter        Update counters used in protocol processing
Policy         Handle traffic
Semaphore      Coordinate and synchronize threads
Memory
The PowerNP provides access to an external DDR-SDRAM and includes several
internal memories, with an arbiter which coordinates accesses to them. The
internal SRAM provides fast access, which allows packets to be stored
temporarily while they are processed. Moreover, programmable processors
have a dedicated instruction memory; for instance, each picoengine has
128 KB of private memory dedicated to this purpose.
Programming support
In addition to standard programming tools (such as compilers, assemblers,
etc.), IBM provides a software package for simulation and debugging. This
package is available for several operating systems, such as Solaris, Linux
and Windows. The co-processor in charge of traffic management works at wire
speed, so the IBM chip is able to analyze each packet in order to verify
that traffic complies with predetermined parameters.
1.1.8 Flexible RISC Plus Coprocessors (Motorola)
The Motorola Corporation brands its network processors C-Port. Models C-5,
C-5e and C-3 represent a tradeoﬀ between performance and power consump-
tion.
Architecture
The Motorola chip is very appealing; it is an example of internal processors
which can be configured to work in parallel or in a pipeline. The
possibility of selecting a configuration model for each processor gives the
C-Port high flexibility. Fig. 1.12 shows how C-Ports can connect multiple
physical interfaces to a switching fabric.
Figure 1.12: Architecture of C-Port.
Each network processor includes 16 processor blocks, which are called
Channel Processors (CPs) and are in charge of packet processing. Each CP
can be configured in different ways. The most direct approach is the
dedicated configuration, which establishes a one-to-one relation between a
CP and a physical interface. In this configuration the Channel Processor
must manage both input and output; it is therefore suitable for interfaces
at medium or low speed (100Base-T Ethernet or OC-3), for which the
processor has enough power. In order to handle higher speeds, the Channel
Processors can be organized in parallel clusters. In this way, whenever a
packet arrives, any idle CP in the cluster can handle it. The number of CPs
in each cluster can be modified, thus the designer can select the proper
size according to the interface speed and the amount of processing
required. Figure 1.12 shows the architecture of the C-Port C-5 chip, where
the CPs are configured in clusters.
The diagram illustrates the 16 Channel Processors (CP-0 ... CP-15)
configured in parallel clusters of 4 CPs each. In addition to the CPs, the
Motorola chip contains several other co-processors. The Executive Processor
provides configuration and management services for the overall network
processor; it communicates with a possible host PC via the PCI bus or
through serial lines. The Fabric Processor provides a fast connection
between the internal buses and an external switching fabric. The lookup
unit speeds up searches in lookup tables. The buffer management and queue
management units handle and check packet buffers and queues, respectively.
Figure 1.13: Internal architecture of a Channel Processor.
However, the name Channel Processor is misleading: a CP does not contain a
single processor, but is a complex structure with a RISC processor and
several functional units which help to handle packets at high speed. Fig.
1.13 shows the CP components and their interconnections.
As we can see, the CP has a parallel structure for the ingress and egress
sides. The Serial Data Processor (SDP) is programmable: on the ingress side
it takes care of checksum or CRC verification, decoding and header
analysis, while on the egress side it is used for frame modification,
checksum or CRC computation, encoding and framing. The RISC processor deals
with classification, traffic management and traffic shaping.
Programming support
The C-Port network processor can be programmed in C or C++. Motorola
provides a compiler, a simulator, APIs and libraries to be used for managing
physical interfaces, lookup tables, buffers, and queues.
1.2 Intel IXP2XXX Network Processors
In this section, the architecture and the functionalities of the Intel
IXP2XXX Network Processors are presented. The characters XXX stand for the
digits which identify a particular model. We will refer to the overall
family; the differences among models concern the number of processing units
or the availability of specific functions (for instance, units which
support encryption algorithms). We therefore explain the main features of
the IXP2XXX family, its advanced functions, programming languages, and
development environment. Finally, we describe the Radisys ENP-2611 card
that we have used, which contains the Intel chip.
1.2.1 General Structure
Fig. 1.14 shows a scheme of the IXP2400, in which functional units and
connections are presented. For the specific features and figures we give,
we often refer to the IXP2400.
The network processor contains 9 programmable processors: an Intel XScale
and 8 units called microengines, which are divided into 2 clusters of 4
microengines (ME 0:0 ... ME 1:3). The general purpose XScale processor is
an ARM V5STE compliant RISC (Reduced Instruction Set Computer), while the
microengines are RISCs optimized for packet processing.
The scheme in fig. 1.14 makes clear the use of memories with different sizes
and features (e.g., SRAM, DRAM, Scratchpad), as well as the availability of
shared functional units with specific purposes (e.g., the MSF or the hash
unit). In the following, all these features will be analyzed.
1.2.2 The Intel XScale
The Intel XScale processor installed on the network processors of the Intel
IXP2XXX family is compliant with the ARMv5STE architecture, as defined by
ARM Limited. The "T" indicates support for Thumb instructions, i.e. specific
instructions which allow switching from the 32-bit mode to the 16-bit one,
and vice versa. This capability is useful for reducing memory usage.
The "E" instead indicates support for advanced Digital Signal Processing
instructions. The processor uses an advanced internal pipeline, which
improves its capability of hiding memory latencies. Floating-point
operations are not supported.
Figure 1.14: Scheme of the IXP2400.
Regarding programming, the XScale processor supports real-time operating
systems for embedded systems, such as VxWorks or Linux. Therefore, it can
take advantage of the C/C++ compilers available in these environments. In
addition, several development tools can be used, such as IDEs (Integrated
Development Environments) and debuggers.
In the IXP2400 NP, the XScale runs at 600 MHz, while in the IXP2350 it runs
at 1.2 GHz.
1.2.3 Microengines
Microengines have a specific instruction set for packet processing. There
are 50 different instructions, including the ALU (Arithmetic Logic Unit)
operations, which work on bits, bytes and longwords and can include a shift
or a rotation in the same operation. Divisions and floating-point
operations are not supported.
The microengines of the IXP2400 run at 600 MHz, while those of the IXP2350
run at 900 MHz or 1.2 GHz.
The memory which stores the code to be executed by a microengine is the
instruction store, which can contain up to 4K 40-bit instructions. The code
is loaded onto the microengines by the XScale processor in the startup
phase. Once a microengine runs, instructions are executed in a 6-stage
pipeline, which completes one instruction per clock cycle when the pipeline
is full. Clearly, whenever jumps or context swaps happen, the pipeline must
be flushed and then filled again with instructions, thus requiring more
clock cycles.
Threads
Each microengine supports 8 threads with hardware support for context
switching. This kind of context switch is called "zero-overhead", because a
microengine holds a separate set of registers for each thread; thus, when a
context switch occurs, no register copy is required and the overhead is due
only to emptying the pipeline (i.e., very few clock cycles). Microengines
can be configured to use 8 threads, or only 4 threads. In the latter case,
only the threads with even index are activated and each of them has a
larger number of registers.
All the threads execute the same code, which is read from the internal
memory of the microengine, starting from the first instruction. However, it
is possible to differentiate the operations of each thread by using
conditional instructions:
if (ctx==1) {
. . .
}
else if (ctx==2) {
. . .
}
Each thread runs and then releases control to allow the other ones to run.
The scheduling is not preemptive: as long as a thread runs and does not
release control, the other threads cannot execute their code. The context
switch is invoked by means of specific instructions (ctx_arb) and is
typically used as a mechanism for hiding the access latency of resources.
For instance, whenever an external memory must be read, the thread releases
control while the memory access is in progress.
The non-preemptive approach reduces the issues related to critical sections,
i.e. parts of code in which resources shared among threads are used and
modified. If two threads accessed the same register at the same time, the
data in the register could become inconsistent; the non-preemptive model
helps to avoid this.
However, non-preemptive scheduling does not solve the issue of critical
sections for threads which simultaneously access the same resource and
belong to different microengines. Synchronization techniques are therefore
needed.
Thread execution in each microengine is handled by a thread arbiter, i.e.
a scheduler which selects the thread to run by using a round-robin policy
among the active threads.
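As a rough illustration of the round-robin policy, the selection step of
such an arbiter can be sketched in C. The function name next_context and
the bit-mask representation of the ready threads are assumptions made for
this sketch; they do not describe the actual hardware logic.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CTX 8  /* contexts (threads) per microengine */

/* Pick the next context to run: scan round-robin starting from the context
 * after `current`, considering only contexts whose bit is set in
 * `ready_mask`. Returns -1 when no context is ready. */
int next_context(int current, uint8_t ready_mask)
{
    assert(current >= 0 && current < NUM_CTX);
    for (int step = 1; step <= NUM_CTX; step++) {
        int ctx = (current + step) % NUM_CTX;
        if (ready_mask & (1u << ctx))
            return ctx;
    }
    return -1;
}
```

With a mask of 0x55 only the even contexts are eligible, which mirrors the
4-thread configuration described above; a thread that is the only ready one
is simply selected again after a full scan.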
Registers
There are four types of registers for each microengine:
• general purpose;
• SRAM transfer;
• DRAM transfer;
• next-neighbor.
As said above, each context has a private set of registers, therefore each
register bank is divided equally among the threads. In addition, there are
Control and Status Registers (CSRs), which allow for various operations or
for configuring the behavior of the microengine.
General Purpose Registers (GPRs) - Each microengine has 256 32-bit general
purpose registers, which are arranged in two banks of 128 registers (called
bank A and bank B). Any instruction which takes two GPRs as operands
requires that they belong to different banks.
Registers can be accessed in a thread-local manner (i.e., each thread
accesses its own 32 GPRs), in an absolute manner, or in a global manner
(i.e., registers are accessed by all the threads as global variables). In
the code, the names of GPRs can follow some rules [7].
Transfer Registers - SRAM transfer registers (256 per microengine) are
32-bit registers which are used for writing to and reading from the SRAM or
the other memories and functional units of the network processor, such as
the Scratchpad memory, the SHaC unit, the Media Switch Fabric, and the PCI
(Peripheral Component Interconnect) interface.
DRAM transfer registers are used for writing to and reading from the DRAM,
and can replace SRAM transfer registers only for reading.
Transfer registers are the main mechanism for performing asynchronous
operations on the memories: a thread writes to a transfer register the data
to be written to memory, or reads from a transfer register the data which
have just been read from memory.
The bank of transfer registers is divided into two parts, one for writing
and the other one for reading. This prevents an improper use of transfer
registers (for instance, as GPRs).
More precisely, a transfer register name typically denotes a pair of
registers: writing to the name means writing to the "write" part, while
reading it means accessing the "read" part.
These registers can also be accessed in a thread-local or global manner.
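The read/write split of a transfer register can be modeled in C as
follows. This is only a host-side model under our own naming (xfer_reg,
xfer_put, xfer_get and the two mem_* helpers are hypothetical); it shows
why a value written by a thread cannot be read back through the same name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One transfer register name backed by two physical registers. */
typedef struct {
    uint32_t write_side;  /* filled by the thread, drained by the memory unit */
    uint32_t read_side;   /* filled by the memory unit, read by the thread */
} xfer_reg;

/* Thread side: a write goes to the write part, a read comes from the read part. */
void xfer_put(xfer_reg *r, uint32_t v) { assert(r != NULL); r->write_side = v; }
uint32_t xfer_get(const xfer_reg *r) { assert(r != NULL); return r->read_side; }

/* Memory-unit side: models the completion of an asynchronous operation. */
void mem_write_done(const xfer_reg *r, uint32_t *mem)
{
    assert(r != NULL && mem != NULL);
    *mem = r->write_side;
}

void mem_read_done(xfer_reg *r, const uint32_t *mem)
{
    assert(r != NULL && mem != NULL);
    r->read_side = *mem;
}
```

In this model a value stored with xfer_put is never returned by xfer_get,
which is why transfer registers cannot be misused as general purpose
registers.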
Next-Neighbor Registers - Each microengine has 128 32-bit registers called
next-neighbor registers. They can be used in two ways: as additional
general purpose registers, or as "microengine communication" registers. In
the first case, the next-neighbor registers can be used, for instance, when
the standard general purpose registers are exhausted. In the second case,
they make the data which have just been written by the current microengine
available to the microengine with the next index. In this way, the first
microengine can communicate with the second one, the second one with the
third one and so on.
The communication can occur through a simple write to the registers or by
setting up on them a data structure called ring, which is a FIFO queue
accessed by means of two CSRs, NN_PUT and NN_GET.
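A minimal software model of such a ring is sketched below. The put and get
indexes play the role of NN_PUT and NN_GET; the structure name nn_ring, the
function names and the full/empty convention (one slot kept free) are
assumptions of this sketch and may differ from the hardware behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NN_REGS 128  /* next-neighbor registers per microengine */

typedef struct {
    uint32_t reg[NN_REGS];
    unsigned put;  /* producer index (role of NN_PUT) */
    unsigned get;  /* consumer index (role of NN_GET) */
} nn_ring;

int nn_ring_empty(const nn_ring *r) { return r->put == r->get; }
int nn_ring_full(const nn_ring *r)  { return (r->put + 1) % NN_REGS == r->get; }

/* Producer microengine: returns 0 on success, -1 if the ring is full. */
int nn_ring_put(nn_ring *r, uint32_t v)
{
    assert(r != NULL);
    if (nn_ring_full(r))
        return -1;
    r->reg[r->put] = v;
    r->put = (r->put + 1) % NN_REGS;
    return 0;
}

/* Consumer (next) microengine: returns 0 on success, -1 if the ring is empty. */
int nn_ring_get(nn_ring *r, uint32_t *out)
{
    assert(r != NULL && out != NULL);
    if (nn_ring_empty(r))
        return -1;
    *out = r->reg[r->get];
    r->get = (r->get + 1) % NN_REGS;
    return 0;
}
```

Since one slot is kept free to distinguish a full ring from an empty one,
this model holds at most 127 longwords at a time.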
Signaling
Each microengine has 15 numbered signals available. They are useful for the
execution of asynchronous operations which involve memories and functional
units. For instance, whenever a thread requests a read from SRAM, the end
of the operation can be communicated through a signal to the thread which
requested it. Once the signal from the SRAM has been received, the data are
guaranteed to be available.
Some functional units, for instance the DRAM, require a pair of signals.
Specific instructions allow a thread to perform a context switch and wait
for the arrival of one or more signals. This is how memory latency hiding
is obtained.
Finally, signals can be used as a synchronization mechanism among threads,
in order to solve potential conflicts on the same resources.
Local Memory
The local memory of a microengine consists of 640 longwords (i.e., 32-bit
words) which can be accessed very fast, with a maximum latency of 3 clock
cycles. Moreover, this delay has to be taken into account only when a
specific CSR is written to select the location to work on; if we use
consecutive locations of local memory, we do not need to set the CSR again
and wait 3 more cycles.
The access occurs through the special registers *l$index0 and *l$index1,
which refer to two different locations in local memory. These registers,
which are replicated for each thread, can be incremented or decremented
(e.g., *l$index++) or used with an offset (e.g., *l$index[4] indicates the
fourth longword after the one indicated by *l$index[0]).
Content-Addressable Memory and CRC
The Content Addressable Memory (CAM) is a special memory which is addressed
by content. Each microengine has a CAM with 16 entries. Specific
instructions (CAM_LOOKUP) search the memory for a given content. If the
content is found, the position of the matching entry is returned;
otherwise the least recently used (LRU) entry is returned.
The CAM is very useful for implementing small caches or for handling arrays
of queues.
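The lookup behavior can be sketched in C as a small software model. The
names (cam16, cam_lookup, cam_write) and the explicit LRU order list are
our own modeling choices; in particular, treating a hit as a "use" that
refreshes the entry is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define CAM_ENTRIES 16

typedef struct {
    uint32_t tag[CAM_ENTRIES];
    uint8_t  valid[CAM_ENTRIES];
    uint8_t  order[CAM_ENTRIES];  /* order[0] is the least recently used entry */
} cam16;

typedef struct { int hit; int entry; } cam_result;

void cam_clear(cam16 *c)
{
    for (int i = 0; i < CAM_ENTRIES; i++) {
        c->tag[i] = 0;
        c->valid[i] = 0;
        c->order[i] = (uint8_t)i;
    }
}

/* Move `entry` to the most-recently-used end of the order list. */
static void cam_touch(cam16 *c, int entry)
{
    int pos = 0;
    while (c->order[pos] != entry)
        pos++;
    for (; pos < CAM_ENTRIES - 1; pos++)
        c->order[pos] = c->order[pos + 1];
    c->order[CAM_ENTRIES - 1] = (uint8_t)entry;
}

/* Hit: report the matching entry. Miss: report the LRU entry, which the
 * caller would typically overwrite with cam_write(). */
cam_result cam_lookup(cam16 *c, uint32_t tag)
{
    for (int i = 0; i < CAM_ENTRIES; i++) {
        if (c->valid[i] && c->tag[i] == tag) {
            cam_touch(c, i);
            return (cam_result){ 1, i };
        }
    }
    return (cam_result){ 0, c->order[0] };
}

void cam_write(cam16 *c, int entry, uint32_t tag)
{
    assert(entry >= 0 && entry < CAM_ENTRIES);
    c->tag[entry] = tag;
    c->valid[entry] = 1;
    cam_touch(c, entry);
}
```

A small cache follows directly from this pattern: on a miss, the returned
LRU entry is the victim to be replaced, and its index can also select the
slot of a companion data array.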
Finally, a CRC (Cyclic Redundancy Check) can be computed through dedicated
registers.
1.2.4 Memories
IXP2XXX network processors can access 4 different types of memory: local
memory, scratchpad, SRAM, and DRAM. The local memory can be accessed only
by the microengine that contains it, while the other memories are shared.
Tab. 1.5 shows their different characteristics and the tradeoffs in terms
of size, latency and minimum accessible unit (logical width).
Table 1.5: Properties of IXP2400 memories.
Memory         Logical width (bytes)   Size (bytes)   Latency (clock cycles)
Local Memory   4                       2560           ~3
Scratchpad     4                       16k            ~60
SRAM           4                       128M           ~90
DRAM           8                       1G             ~120
Each type of memory allows for special operations. Local memory has already
been discussed. The scratchpad is an on-chip SRAM memory, which is
contained in the SHaC block. It supports atomic operations on data, such as
increment, decrement and test&set. An atomic operation is an operation that
cannot be divided. For instance, incrementing a variable requires reading,
incrementing and writing it. If the overall operation is atomic, the three
steps cannot be separated by other accesses to the same variable. In this
way, the collision issues in the use of shared resources are solved.
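The idea of an indivisible read-modify-write can be illustrated on a host
machine with C11 atomics. This is only an analogy: the functions below use
the standard <stdatomic.h> library, not the scratchpad instructions of the
network processor, and their names are chosen for this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Atomic increment: read, add and write happen as one indivisible step.
 * Returns the value held before the increment. */
uint32_t counter_incr(_Atomic uint32_t *counter)
{
    return atomic_fetch_add(counter, 1u);
}

/* Test-and-set used as a lock: returns 1 if the caller acquired it,
 * 0 if it was already taken. */
int try_lock(atomic_flag *lock)
{
    return !atomic_flag_test_and_set(lock);
}

void unlock(atomic_flag *lock)
{
    atomic_flag_clear(lock);
}
```

With a plain, non-atomic increment two threads could both read the same old
value and one update would be lost; the atomic version rules this out
without any explicit lock.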
Moreover, the scratchpad supports the creation and management of FIFO
queues (which are called rings, because they use a part of memory as if it
were circular). These "ScratchRings" are often used for communication among
microengines through simple and fast operations (the scratchpad is the
fastest memory shared among microengines).
The SRAM is an external memory which supports the same operations as the
scratchpad; in addition, it allows for creating and handling FIFO queues by
means of element pointers, therefore with no need to transfer the elements
themselves.
The DRAM is the largest and slowest memory. It provides a direct path from
and toward the Media Switch Fabric, with no need to transfer data through
the microengines.
The logical width has to be taken into account in the programming phase,
because no mechanism hides it from the programmer. For example, to access
two consecutive longwords in SRAM, the second one must be addressed with an
offset of 4 with respect to the first one.
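The address arithmetic implied by the logical width can be written down as
a short C helper. The function name unit_addr and the alignment check are
assumptions of this sketch; the widths are those listed in tab. 1.5.

```c
#include <assert.h>
#include <stdint.h>

enum { SRAM_WIDTH = 4, DRAM_WIDTH = 8 };  /* logical widths in bytes */

/* Byte address of the n-th unit after `base` in a memory with the given
 * logical width. The base address is assumed to be aligned to the width. */
uint32_t unit_addr(uint32_t base, uint32_t n, uint32_t width)
{
    assert(width != 0 && base % width == 0);
    return base + n * width;
}
```

For instance, the longword following SRAM address 0x100 is at 0x104, while
the next accessible DRAM unit after 0x100 is at 0x108.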
Finally, it is useful to know how asynchronous commands issued by different
threads to the shared memories are managed. Each memory interface has a
queue of commands to be executed, which are served sequentially. A thread
can have several commands pending on different queues or on the same
queue.
1.2.5 Media Switch Fabric
The Media Switch Fabric (MSF) unit is the interface designed for data
transfer from and toward the network processors of the IXP2XXX family.
Packet reception and transmission on the network processors is a complex
process of reassembly and segmentation of small parts of packets called
mpackets.
Through the MSF, the programmer has an interface for transmission and
reception which is universal and independent of the packet format.
The mpacket size is defined by the reception buffer (RBUF) and by the
transmission buffer (TBUF), which can be configured for 64, 128 or 256
bytes through specific CSRs of the MSF.
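The segmentation arithmetic can be made concrete with a short C sketch,
which computes how many mpackets a packet is split into and how many bytes
each one carries. The function names are hypothetical and the sketch
ignores any per-mpacket status information handled by the MSF.

```c
#include <assert.h>
#include <stddef.h>

/* Number of mpackets needed for a packet of `len` bytes when the buffer
 * element size is `elem` bytes (64, 128 or 256). */
size_t mpacket_count(size_t len, size_t elem)
{
    assert(elem == 64 || elem == 128 || elem == 256);
    return (len + elem - 1) / elem;  /* ceiling division */
}

/* Bytes carried by the mpacket with index `i` (0-based): all mpackets are
 * full except possibly the last one. */
size_t mpacket_len(size_t len, size_t elem, size_t i)
{
    size_t n = mpacket_count(len, elem);
    assert(i < n);
    return (i + 1 < n) ? elem : len - i * elem;
}
```

A 1500-byte packet with 64-byte elements is thus received as 24 mpackets,
the last of which carries only 28 bytes; with 256-byte elements the same
packet needs 6 mpackets.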
1.2.6 SHaC
SHaC (Scratchpad, Hash and CAP) is the multifunction unit which contains the
scratchpad memory, a unit for the generation of hash codes and the CAP
(Control Status Register Access Proxy) unit. The hash unit is capable of
computing hash codes of 48, 64 and 128 bits from keys of the same size.
Moreover, a single request can carry up to 3 keys to be processed. The
algorithm to be used can be configured through the CAP.
The CAP unit provides the interface to many chip-wide CSRs. In addition, it
supports inter-thread and inter-microengine signaling and the management of
the interrupts to be sent to the XScale processor.
Another function of the CAP is the handling of the register reflector,
which is a mechanism used by a thread in a microengine in order to write to
the SRAM transfer registers of any other thread in any microengine.
Finally, the SHaC also contains the logic for interfacing the peripherals
of the XScale processor, such as memories and external timers.
1.2.7 Intel IXA Portability Framework
Intel Internet Exchange Architecture (IXA) aims at providing hardware which
is programmable via software, together with open APIs. In practice, it is
the hardware and software architecture of the Intel network processor
family.
The IXA Portability Framework is the modular architecture which is based on
building blocks and allows the code written for an IXP2XXX NP to be reused
on any NP of the same family. The software structure is therefore based on
modular code for the XScale and the microengines, supported by a Hardware
Abstraction Layer with standard APIs.
The flexibility is guaranteed by the full programmability of the two
architecture layers and by the different low-level functions which are
provided in hardware. Moreover, it is possible to select the multithreaded
programming model (parallel, pipeline or hybrid) according to the needs.
In addition, hardware which is expressly designed for the IXA architecture
helps to solve the issues of memory latency, which arise when the line rate
grows.
Microblocks and Core Components
The modular structure of the software, which enables code portability, is
based on two types of building blocks. They are called core components at
the XScale level and microblocks at the microengine level.
Each building block represents a packet processing function, e.g. NAT,
forwarding, Ethernet bridging, etc. Programmers can use these elements,
build new ones or combine them to create an application.
Some blocks are called driver-blocks and take care of the operations which
depend most on the underlying hardware architecture, such as reception,
transmission or queue handling. They are optimized for their purposes,
therefore it is not advisable to modify them.
XScale/microengines interactions
The network processors of the IXP2XXX family present two hierarchical
layers:
• an upper layer, with the XScale processor (programmable in C), which
hosts an embedded operating system and deals with the control plane and
the management of the overall NP;
• a lower layer, which takes care of the fast data path and is composed of
the microengines (programmable in microcode assembly), which provide a
small instruction set optimized for packet processing.
The core components act as intermediaries between these two layers. They are
modules which provide the interface between the processor and all the other
units of the NP, define symbols and handle the exceptions of the fast data
path.
Symbols are useful for the definition of resources. Indeed, some modules
take care of resource management in an integrated manner, i.e. each use of
any part of memory requires an explicit request to a module called resource
manager. While the microcode is loaded onto the microengines, the resource
manager allocates the required resources and sets the proper parameters for
the application to work (this phase is called symbol patching).
Each block of code written for a microengine can also be handled by a core
component at the XScale level. Therefore there is a core component for the
reception code, another one for the transmission code, and so on.
1.2.8 IXA SDK
In addition to the IXA Portability Framework, the Intel IXA SDK (Software
Development Kit) provides several tools to develop applications for IXP2XXX
NPs. These tools include a compiler for a microengine-oriented C-like
language, an assembler for the microengine assembly language [8] and an
Integrated Development Environment (IDE) called Developer's Workbench.
Assembly for microengines
Instructions in the microengine assembly language [9, 10] take this
general form:
opcode [param1, param2, ...], opt1, opt2, ...
Here opcode is the name of the instruction, param1, param2, etc. are the
parameters to be passed, and opt1, opt2, etc. are optional tokens. These
options serve to change the behavior of the instruction or to add
optimizations. For instance, a common option is ctx_swap[signal]: in
instructions which access memory, it triggers a context switch and makes the
thread wait for a signal from the memory controller, which indicates that
the data have been read or written.
Other common options allow for code optimization, by reducing the penalty
in case of jumps. One such option is defer[n], which tells the assembler to
execute the next n instructions in the pipeline before the jump or context
switch takes effect.
The microengine assembly language provides conditional and unconditional
jumps, like any other programming language. The points in the code to jump
to are indicated through labels followed by the character #. For instance
label1#
. . .
. . .
br[label1#]
In the instruction set, the opcode typically names the hardware unit to be
used. For example, if two registers have to be summed, the instruction is:
alu[z,x,+,y]
because the ALU (Arithmetic Logic Unit) is used for arithmetic and logical
operations.
Instead, if a read from SRAM is required, the following instruction is used:
sram[read,x,position,0,1],ctx_swap[sig_sram]
Figure 1.15: Compilation process.
Constructs
The assembler provides some user-friendly constructs, which replicate the
basic constructs of the most widespread programming languages. Thus, if
endif, while, and repeat until can be used. In this way, code is more
readable and less error-prone.
Moreover, it is possible to create subroutines to be called, but they are
not commonly used because there is no stack. Macros are preferred instead,
i.e. code which is expanded in place at each occurrence.
Macros, along with conditional compilation and other functions, are made
possible by a preprocessor, very similar to the preprocessor of the C
language, which is a very useful tool for programming in assembly.
The overall compilation process is shown in fig. 1.15: we start from the .uc
file and arrive at the .list file, which contains the actual code to be
executed by a microengine.
Virtual registers
The assembler allows the available registers to be handled through names,
although a register name does not always point to the same location in the
register banks. These are the virtual registers, which allow different
scopes to be defined for registers.
For instance, let us suppose that a debug macro uses a couple of registers
which are never used in the remaining part of the code. It would be
wasteful to statically allocate two locations in the banks for these two
registers. Therefore, the keywords .begin and .end are used to define a
scope for the registers: outside this scope, the registers do not exist and
the corresponding physical locations can be reused.
The dynamic mapping of registers onto physical locations is not limited to
scope definitions. In fact, if the number of declared registers grows so
much that they cannot all be statically allocated, some physical locations
used by a certain register (with a scope still active) are reused and then
assigned again to the original register when it needs them. This mechanism
can be dangerous if applied to transfer registers which are currently in
use for memory accesses. For this case, the keyword volatile guarantees the
static allocation of registers.
Microengine-C
The microengine-C language allows microengines to be programmed with the
ease and the typical features of the C language, i.e. type checking, memory
pointers and functions. Since a memory stack cannot be used, functions
defined in microengine-C cannot be called recursively and function pointers
cannot be used.
The syntax of microengine-C is compliant with ANSI C, except for these
limitations concerning functions. The supported integer types are signed and
unsigned and range from char (8 bits) to long long (64 bits). Moreover,
structs and enum types are supported.
Depending on the compiler optimizations, functions can be compiled inline
(i.e., expanded as if they were macros) or as subroutines.
Given the different types of memory and registers, variable declarations
must be accompanied by indications of where they are allocated.
Moreover, some specific operations offered by the NP, such as atomic
operations, have no counterpart in ANSI C. Therefore "intrinsics" are used:
constructs introduced expressly for this purpose, which look like functions
but actually correspond to well-known assembly sequences.
These differences from standard C make microengine-C less easy and less
intuitive to use. In addition, the compilation process passes through an
intermediate assembly version and does not produce well-optimized code. For
these reasons, assembly is often preferred to microengine-C.
1.2.9 Developer's Workbench
The Integrated Development Environment provided with the IXA SDK is the
Developer's Workbench. This tool allows assembly or microengine-C code to
be written and debugged in a visual environment under Microsoft Windows.
Moreover, it is possible to debug the code in hardware, by connecting the
Network Processor to the PC that runs the IXA SDK.
Finally, an accurate simulator of the Network Processor (cycle-based rather
than event-driven) is provided as part of the IDE. It precisely recreates
system behavior and is an excellent tool for testing application prototypes
without porting the code to the hardware, and for accurate measurements of,
for example, the latencies of single processing stages in the network
processor.
Scripting
The simulator of the Developer's Workbench (called Transactor) supports a
C-like scripting language. It provides several commands for accurately
observing application behavior. Instructions are available to add a watch on
memories and registers, with the capability to execute specific sequences of
instructions when certain values change. For example, each time a register
changes its value, the content can be written to a file. The values of
registers or memory locations can be initialized or modified, and the RBUF
and TBUF buffers can be observed, as well as the CSRs of any block of the
network processor.
The definition and use of functions is supported, as well as the classical
constructs of programming languages, such as if(), while(), etc.
Finally, there are further commands for simulation control, e.g. model
reset, simulation stop or restart, and so on.
1.2.10 ENP-2611
The laboratories hosting this research are equipped with Radisys ENP-2611
cards, which integrate the Intel IXP2400 Network Processor. These
low-to-mid-range cards support nominal line rates of 2.5 Gbps
and have 3 multimode optical Gigabit Ethernet ports. A further 10/100 Mbps
Ethernet port is available to handle control plane traffic or for debugging
services.
These cards are mounted in a PC through a PCI bus compliant with the PCI 2.2
specification at 32 or 64 bits. The use of the PCI bus makes it possible to
connect several ENP-2611 cards in order to build a single network node.
These cards provide 8 Mbytes of SRAM and 256 Mbytes of DRAM.
Development on ENP-2611 cards is based on the IXA SDK, but Radisys has
introduced its own add-on to the Intel SDK, called ENP-SDK.

Chapter 2
REFINE: the REconfigurable
packet FIltering on NP
This chapter illustrates the multi-step process that leads to the
implementation of a reconfigurable multidimensional packet filter on the
Intel IXP2400 NP. The multidimensional multibit trie is chosen as the most
suitable algorithm and is modified to exploit the specific features of the
Network Processor. The different tasks are mapped onto the NP computational
resources and an optimized implementation is developed and then validated
experimentally.
2.1 The main idea
NPs appear to be the most promising solution for building high-performance
network devices that perform demanding tasks such as packet classification,
resource scheduling and traffic policing. All such mechanisms are necessary
to provide a given level of service, especially in environments whose main
target is QoS provision, e.g. DS networks. Indeed, interest in modern
Internet applications is constantly growing and a significant number of
applications impose strict demands on network performance. Hence, Internet
routers, which to date provide best-effort service only, are now required to
provide service differentiation. The first step in providing this
differentiation is the capability of distinguishing packets belonging to
different flows.
The process of categorizing packets into ﬂows is called packet classiﬁcation.
Formally, given a set of rules deﬁning the packet attributes, classiﬁcation is the
process of identifying the rule to which a packet conforms. As both the number
and the complexity of the applications requiring classiﬁcation are increasing,
such a task is becoming more and more critical. Indeed, packet filtering
requires a considerable number of operations to be performed in real time
(such as analyzing packet header fields according to different rule
specifications) while maintaining high speeds with almost no impact on
traffic dynamics. Keeping traffic dynamics unaltered is a key issue for
modelling the behavior of devices and for using teletraffic engineering
techniques in network resource provisioning.
In this scenario, in order to achieve flexibility and high performance, the
most promising solution is the adoption of Network Processors. The main
target of this chapter is to propose a rigorous methodology for the design
and implementation of a packet filter on an NP platform, which takes into
account both the functional specifications of the component itself and the
specific features of the candidate hardware for its implementation. The
benefits of all the processing capabilities and operational units available
on the NP have been carefully analyzed, through an analytical process that
is useful for designing and realizing any network device on NPs. High
performance and dynamic reconfigurability are the main goals to be
addressed.
The first step of this activity is the selection of a suitable
classification algorithm to be implemented in the embedded system, by means
of a comparison among many algorithms presented by different researchers.
Then, the selected algorithm (namely the multidimensional multibit trie) is
modified and refined to capitalize on the peculiar functional properties of
our NP. Afterwards, a proper application design is planned, to map the
different tasks onto the available computational units and to extensively
exploit the hardware resources (i.e. processors and memories). The final
step is to develop an optimized implementation of the classification
functionalities on top of the IXP2400, with subsequent experimental
validation. Multitasking and multithreaded programming are carefully
examined.
2.2 Related Works
Several papers have addressed the packet classification problem. An
exhaustive survey of proposed algorithms is given by Gupta et al. [11]. The
best existing classification schemes described in the literature (Recursive
Flow Classification [12], Hierarchical Intelligent Cuttings [13], Aggregated
Bit Vector [14]) require large amounts of memory even for medium-size
classifiers, precluding their use in core routers.
Therefore, ternary CAMs are considered by many designers to be the only
solution for realizing packet filters in core routers. For example, Nourani and
Faezipour [15] propose an efficient TCAM-based architecture for multimatch
search, to be used in network intrusion detection systems and load
balancers. More recently, Baboescu et al. [16] provide an alternative to
CAMs via an Extended Grid-of-Tries with Path Compression algorithm, whose
worst-case speed scales well with database size while using a minimal amount
of memory.
Concerning the implementation of a classification engine on top of Network
Processors, Kounavis et al. [17] analyze several databases of classification
rules and derive their statistical properties, in order to suggest a
classification architecture that can be implemented efficiently on NPs.
Rashti et al. [18] present a multidimensional packet classifier engine for
NP-based firewalls, based on a hierarchical trie algorithm and on a memory
usage optimization technique. Srinivasan and Feng [19] study the performance
of two different
design mappings of the Bit Vector algorithm on the Intel IXP1200 Network
Processor. Hsu et al. [20] propose a bit compression algorithm, also for Intel
IXP1200, which requires few memory accesses and allows high performance.
Further, Chiuen and Pradan [21] study the application of caching to Network
Processors, in order to improve performance by means of exploiting temporal
locality in packet streams and avoiding unnecessary repetition of processing
operations.
Moreover, it is recognized [22] that runtime reconfiguration is an appealing
characteristic of software for NPs. Several applications, e.g. dynamically
extensible services, network resource management and adaptive load
balancers, need an extensible flow classifier that can be updated at
runtime. Lee and Coulson [23] adopt a runtime component-based approach in
which fine-grained components on the NP can be dynamically (un)loaded and
(dis)connected in a principled manner. This introduces a certain amount of
delay (up to 60 ms) that becomes unacceptable at high rates. In this work, a
different approach is presented, in which the update of filtering rules is
performed through communication between the XScale and the microengines,
with no need to connect and disconnect components. The control processor
warns the microengines through predefined messages and then updates the
classification data structure; in this way inconsistent data are never used
and the update time is greatly reduced.
2.3 Multidimensional Multibit Trie
Given a set of rules deﬁning the packets attributes, packet classiﬁcation is
deﬁned as the process of ﬁnding the best match among rules and packet ﬁelds.
Rule matching schemes may be an exact match or a preﬁx/range match on
multiple ﬁelds. Applications use policies based on layer2-layer4 header ﬁelds.
Since multidimensional classiﬁcation is a complex problem, many researchers
have explored and proposed a wide variety of algorithms [11].
To choose a classification algorithm that exploits the capabilities of our
Radisys® ENP-2611 board, several of these algorithms have been compared.
Classic performance metrics have been analyzed, such as search speed, memory
requirements, scalability with classifier table size and flexibility in rule
specification. In addition, all the possible bottlenecks of NPs have to be
taken into account. In particular, the usual limits of the NP architecture,
i.e. memory consumption and the size of the instruction store, have to be
considered [24]. In other words, the key challenge is to design a packet
classification algorithm that requires both little memory space and low
access overhead: this would provide proper scaling with respect to
high-bandwidth networks and large databases of classification rules.
The analysis of the comparison results identified an algorithm based on a
multidimensional multibit trie structure as the most appropriate for the
IXP2400 [24]. Its appealing features are great flexibility, high search
speed and a low number of memory accesses; the latter in particular is
fundamental for the performance of any algorithm implemented on NPs, due to
the high latencies of external memories. The classifier processes the five
main header fields, as typically adopted in the literature [17] and in real
classifiers: the IP Destination and Source Address, the Destination and
Source Port, and the Layer 4 Protocol Type. Hence, we have a 5-dimensional
classifier, which uses a 5-stage hierarchical search trie (see fig. 2.1). In
each stage, the classification algorithm processes a single packet header
field; the analysis of each field is performed in several steps (named
strides), each checking a specific number of bits. The choice of the stride
lengths is based on a theoretical method proposed in [25] and on the
analysis of rule distributions in real classifiers [17]. In the worst case,
such an algorithm has a search time of the order of $O(W/K)$, a fixed number
of memory accesses ($L$) and a storage complexity of the order of
$O(2^{K-1} N W / K)$. Here $N$ denotes the number of entries in the
classifier, $W$ the maximum prefix length (in bits) of a field, $L$ the
total number of data structure levels and $K$ the average stride length.
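To make the notion of strides concrete, the following C sketch splits one 32-bit header field into consecutive strides and returns the value checked at each step. The function name and the stride lengths used in the example (four strides of 8 bits) are illustrative assumptions, not the lengths selected in this work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a 32-bit header field into consecutive strides, most significant
 * bits first. strides[i] is the number of bits checked at step i; the
 * lengths must sum to at most 32. out[i] receives the value of stride i,
 * i.e. the index of the child node followed at that step.
 * Hypothetical helper, for illustration only. */
static size_t split_into_strides(uint32_t field, const unsigned *strides,
                                 size_t n_strides, uint32_t *out)
{
    unsigned consumed = 0;
    for (size_t i = 0; i < n_strides; i++) {
        unsigned s = strides[i];
        assert(s >= 1 && s <= 32 && consumed + s <= 32);
        unsigned shift = 32 - consumed - s;
        uint32_t mask = (s == 32) ? 0xFFFFFFFFu : ((1u << s) - 1u);
        out[i] = (field >> shift) & mask;
        consumed += s;
    }
    return n_strides;   /* one node visit per stride */
}
```

With W = 32 and K = 8, the address 121.132.10.1 (0x79840A01) is split into the stride values 121, 132, 10 and 1, so the field is resolved in W/K = 4 steps.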
Figure 2.1: The beginning of the multidimensional multibit trie.
The main issue of the original multidimensional multibit trie is memory
consumption. Indeed, to handle rules with non-specified parts (e.g.
121.132.*.*) and to have a single memory access in each stride, $2^s$ nodes
per level are necessary (where $s$ is the stride length), though only a few
of them correspond to actually existing rules and the remaining nodes are
virtual. In order to decrease memory requirements, in our version the
asterisk-state is defined whenever there is a rule with non-specified parts.
It includes all the states hidden by the wildcard, with no need for node
explosion [19]. However, the introduction of the virtual asterisk-state
increases algorithm complexity, due to possible wrong paths and trie
backtracking, and makes the number of memory accesses variable (see a simple
example in fig. 2.2). This drawback is largely compensated by the advantage
of placing the classifier table in SRAM, a memory device with lower access
delay than DRAM, with the overall effect of a significant reduction of the
packet processing time. To support these choices, a validation by simulation
has been performed by means of a purpose-built simulator written in C. A
large saving of memory has been obtained, which justifies the modifications
[24].
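The effect of the asterisk-state on the search can be illustrated with a small C sketch. It is a minimal model under stated assumptions, not the implementation of this work: a single 32-bit field, four 8-bit strides, children kept in a linked list, and exact children tried before the asterisk-state. All names (`insert_rule`, `classify`, the node layout) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 64
#define DEPTH     4       /* four 8-bit strides over one 32-bit field */
#define NO_NODE   (-1)
#define NO_MATCH  (-1)
#define STAR      (-1)    /* key of the asterisk-state */

struct node {
    int key;       /* stride value 0..255, or STAR */
    int child;     /* first node of the next level, or NO_NODE */
    int sibling;   /* next node at the same level, or NO_NODE */
    int rule;      /* rule id at the last level, else NO_MATCH */
};

static struct node pool[MAX_NODES];
static int n_nodes = 0;
static int root = NO_NODE;

static int new_node(int key)
{
    assert(n_nodes < MAX_NODES);
    pool[n_nodes] = (struct node){ key, NO_NODE, NO_NODE, NO_MATCH };
    return n_nodes++;
}

/* Insert a rule; pattern[i] is a stride value or STAR for '*'. */
static void insert_rule(const int pattern[DEPTH], int rule)
{
    int *link = &root;
    for (int level = 0; level < DEPTH; level++) {
        int n = *link;
        while (n != NO_NODE && pool[n].key != pattern[level])
            n = pool[n].sibling;
        if (n == NO_NODE) {            /* no such child yet: prepend it */
            n = new_node(pattern[level]);
            pool[n].sibling = *link;
            *link = n;
        }
        if (level == DEPTH - 1)
            pool[n].rule = rule;
        link = &pool[n].child;
    }
}

/* Depth-first search: the exact child is tried first; if that path fails,
 * the search backtracks and follows the asterisk-state of the same level.
 * *visits counts the nodes that were followed. */
static int lookup(int first, const uint8_t *strides, int level, int *visits)
{
    int star = NO_NODE;
    for (int n = first; n != NO_NODE; n = pool[n].sibling) {
        if (pool[n].key == STAR) { star = n; continue; }
        if (pool[n].key != strides[level]) continue;
        (*visits)++;
        int r = (level == DEPTH - 1)
                    ? pool[n].rule
                    : lookup(pool[n].child, strides, level + 1, visits);
        if (r != NO_MATCH) return r;
    }
    if (star == NO_NODE) return NO_MATCH;
    (*visits)++;
    return (level == DEPTH - 1)
               ? pool[star].rule
               : lookup(pool[star].child, strides, level + 1, visits);
}

static int classify(uint32_t field, int *visits)
{
    uint8_t s[DEPTH] = { (uint8_t)(field >> 24), (uint8_t)(field >> 16),
                         (uint8_t)(field >> 8),  (uint8_t)field };
    *visits = 0;
    return lookup(root, s, 0, visits);
}
```

With the two rules 121.132.*.* and 121.132.10.1 installed, the address 121.132.10.1 is resolved by following 4 nodes, whereas 121.132.10.2 first follows the exact path through 10, fails at the last stride, and backtracks to the asterisk-state, for 5 followed nodes in total. This is the variable number of accesses mentioned above.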
2.4 Application Design
During application design, the first major concern was the optimal
distribution of functions between the two different levels of processors.
Offline operations such as classifier table creation or update can be
assigned to an elementary piece of software at XScale level. According to
IXA SDK terminology, such an element is called a core component and is
implemented as a kernel module for MontaVista Linux.
Figure 2.2: A simple example of trie backtracking.
On the other hand, packet filtering is a task to be included in the series
of per-packet operations of a node. In the IXP2XXX NP context, this means it
must be implemented at microengine level. Moreover, it can be distributed
over one or more microengines. In any case, the ordered thread model is
compulsory. In this design model [20], each thread is in charge of the whole
processing of a packet and, through the use of signalling, the order of
threads and packets is preserved.
The packet processing scheme (executed by all threads) can be described with
the self-explanatory pseudo-code shown in algorithm 1:
Algorithm 1 The packet processing scheme
  read_packet_header(ip_hdr)
  extract_5-tuple(ip_hdr)
  result = classifier(5-tuple)
  set_TOS(result)
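As a host-side illustration of the first, second and last steps of algorithm 1, the following C sketch extracts the 5-tuple from an IPv4 packet and writes a classification result into the TOS byte. It is only a sketch under stated assumptions: the packet is a plain byte buffer starting at the IPv4 header, the transport protocol is TCP or UDP, and the names `extract_5_tuple`, `set_tos` and `struct five_tuple` are hypothetical. The header checksum is recomputed in full because it covers the TOS byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t rd32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Extract the 5-tuple from an IPv4 packet carrying TCP or UDP.
 * Returns 0 on success, -1 if the buffer is too short or not IPv4. */
static int extract_5_tuple(const uint8_t *pkt, size_t len,
                           struct five_tuple *t)
{
    if (len < 20 || (pkt[0] >> 4) != 4) return -1;
    size_t ihl = (size_t)(pkt[0] & 0x0F) * 4;   /* header length, bytes */
    if (ihl < 20 || len < ihl + 4) return -1;
    t->proto    = pkt[9];
    t->src_ip   = rd32(pkt + 12);
    t->dst_ip   = rd32(pkt + 16);
    t->src_port = rd16(pkt + ihl);       /* same offsets for TCP and UDP */
    t->dst_port = rd16(pkt + ihl + 2);
    return 0;
}

/* One's-complement checksum over the IPv4 header. Computed over a header
 * whose checksum field is already correct, it yields 0. */
static uint16_t ipv4_checksum(const uint8_t *hdr, size_t ihl)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < ihl; i += 2)
        sum += rd16(hdr + i);
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Write the classification result into the TOS byte and refresh the
 * header checksum. */
static void set_tos(uint8_t *pkt, uint8_t result)
{
    size_t ihl = (size_t)(pkt[0] & 0x0F) * 4;
    assert(ihl >= 20);
    pkt[1] = result;
    pkt[10] = pkt[11] = 0;
    uint16_t c = ipv4_checksum(pkt, ihl);
    pkt[10] = (uint8_t)(c >> 8);
    pkt[11] = (uint8_t)(c & 0xFF);
}
```

The five fields add up to 32 + 32 + 16 + 16 + 8 = 104 bits, which is the key size used in the rest of this section.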
A common way to reduce the processing load of real classifiers is flow
caching. In real traffic, in fact, most packets belong to a limited set of
flows, each characterized by the same 5-tuple. It is therefore useful to
introduce a flow cache, where a new entry is added when the first packet of
a new flow enters the system. In this case, the classifier performs a lookup
in the classifier table and stores the result in the flow cache. For each
subsequent packet belonging to a known flow, instead, the classification
result is already in the cache and the number of memory accesses is reduced.
Since the number of flows can be very high, a hash table is an efficient way
to implement such a cache. The 5-tuple processed by the classifier is
composed of 104 bits of data, which become the input to our hash function.
As said in section 1.2, the IXP2400 includes a hash unit (inside the SHaC
component) that is shared by all processors. It provides 48-, 64- and
128-bit hash functions from keys of the same size. Each microengine also
includes a 16- or 32-bit CRC unit, which introduces a minimal delay (less
than 10 clock cycles per computation) since it is local to the microengine.
Obviously CRC is not a perfect hash function for most applications, yet it
is good enough for the requirements of our module. Therefore, the CRC unit
of the NP is used for hashing operations, which minimizes the latency and
limits the hash-table size. As for memory utilization, 104 bits are needed
to store each flow 5-tuple, while the classification result consists of 8
bits. The minimum accessible memory unit for the IXP2400 is a 32-bit
longword (LW). Therefore the 5-tuple and the classification result are
aligned to 128 bits (4 LW), and the remaining 16 bits are used for a
next-element pointer. This pointer is needed because a linked list is used
for each entry in order to resolve collisions in the hash table.
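The hashing step and the 128-bit entry layout described above can be sketched in C as follows. Two assumptions are made purely for illustration: a software CRC-32 (IEEE 802.3 polynomial) stands in for the hardware CRC unit, whose configuration is not specified here, and the order of the fields inside the four longwords is an arbitrary choice. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
 * Software stand-in for the microengine CRC unit. */
static uint32_t crc32_sw(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hash of a flow: CRC over the 13 key bytes (104 bits), truncated to h
 * bits so that it indexes a table of 2^h entries. */
static uint32_t flow_hash(const uint8_t key[13], unsigned h)
{
    assert(h >= 1 && h <= 32);
    uint32_t crc = crc32_sw(key, 13);
    return (h == 32) ? crc : (crc & ((1u << h) - 1u));
}

/* Pack one cache element into 4 longwords (128 bits):
 * 104-bit 5-tuple + 8-bit result + 16-bit next-element pointer.
 * The field order is an assumption made for this sketch. */
static void pack_entry(uint32_t lw[4], uint32_t src_ip, uint32_t dst_ip,
                       uint16_t src_port, uint16_t dst_port, uint8_t proto,
                       uint8_t result, uint16_t next)
{
    lw[0] = src_ip;
    lw[1] = dst_ip;
    lw[2] = ((uint32_t)src_port << 16) | dst_port;
    lw[3] = ((uint32_t)proto << 24) | ((uint32_t)result << 16) | next;
}
```

Packing the element into exactly four longwords means that a whole cache element can be fetched with longword-aligned accesses and no padding is wasted.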
By using 16 bits for the pointer, we limit the maximum number ($N_{max}$) of
flows that can be cached at the same time to $2^{16}$, a trade-off value
between a good exploitation of caching and a low probability of collisions
in the hash table. Moreover, an $h$-bit hash requires $2^h$ entries, and
each entry is 32 bits long (i.e. 1 LW). Therefore the maximum overall memory
utilization for the flow cache is $2^h + 4 N_{max}$ longwords. Hence, for
example, for $h = 15$ and $N_{max} = 2^{16}$ our cache requires at most 1.2
Mbytes, which means it can be placed in SRAM, the external memory with the
lowest latency.
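The memory bound quoted above can be checked with a few lines of C. The helper names are hypothetical; the formula is the one given in the text, with one longword per hash-table entry and four longwords per cached flow.

```c
#include <assert.h>
#include <stdint.h>

#define LW_BYTES 4u   /* one longword = 32 bits */

/* Worst-case flow-cache footprint in longwords: 2^h hash-table entries of
 * 1 LW each, plus n_max cached flows of 4 LW each. */
static uint64_t cache_longwords(unsigned h, uint64_t n_max)
{
    assert(h < 32);
    return ((uint64_t)1 << h) + 4u * n_max;
}

/* Same footprint expressed in bytes. */
static uint64_t cache_bytes(unsigned h, uint64_t n_max)
{
    return cache_longwords(h, n_max) * LW_BYTES;
}
```

For h = 15 and a maximum of 2^16 flows this gives 32768 + 262144 = 294912 longwords, i.e. 1179648 bytes or about 1.2 Mbytes, well within the 8 Mbytes of SRAM available on the ENP-2611.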
The introduction of a hash table also requires some simple memory allocation
routines that evict old flows and accept new ones in the cache. In order to
keep the complexity of these functions low, the flow data locations are kept
in a simple FIFO queue. Thus a cached flow is deleted when $N_{max}$ new
flows enter the system [21].
The pseudo-code for packet processing then becomes as shown in algorithm 2:
Algorithm 2 The new packet processing scheme.
  read_packet_header(ip_hdr)
  extract_5-tuple(ip_hdr)
  h = hash(5-tuple)
  if (there_is_a_match_in_hash_table) then
    result = rule[h, 5-tuple]
  else
    result = classifier(5-tuple)
    add_flow_to_cache(h, 5-tuple, result)
  end if
  set_TOS(result)
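The cached lookup of algorithm 2, together with the linked-list collision handling and the FIFO replacement described above, can be sketched in C as follows. This is a minimal single-threaded model, not the microengine code: the table sizes are deliberately tiny, an FNV-1a hash stands in for the CRC unit, and `classifier()` is a dummy slow path that only counts how often the full lookup is invoked. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define H_BITS  4
#define N_MAX   4          /* tiny on purpose; the text uses 2^16 */
#define NIL     0xFFFFu
#define KEY_LEN 13         /* 104-bit 5-tuple */

struct entry {
    uint8_t  key[KEY_LEN];
    uint8_t  result;
    uint16_t next;         /* next element in the same bucket, or NIL */
    uint16_t bucket;       /* owning bucket, needed for eviction */
    uint8_t  used;
};

static uint16_t bucket_head[1u << H_BITS];
static struct entry pool[N_MAX];
static unsigned fifo_pos;          /* next slot to recycle (FIFO order) */
static unsigned slow_path_calls;   /* number of full classifier lookups */

static void cache_init(void)
{
    for (unsigned i = 0; i < (1u << H_BITS); i++) bucket_head[i] = NIL;
    memset(pool, 0, sizeof pool);
    fifo_pos = 0;
    slow_path_calls = 0;
}

/* Stand-in for the CRC unit: FNV-1a truncated to H_BITS. */
static uint16_t hash_key(const uint8_t *key)
{
    uint32_t x = 2166136261u;
    for (int i = 0; i < KEY_LEN; i++) { x ^= key[i]; x *= 16777619u; }
    return (uint16_t)(x & ((1u << H_BITS) - 1u));
}

/* Dummy slow path: pretend the rule id is the last key byte. */
static uint8_t classifier(const uint8_t *key)
{
    slow_path_calls++;
    return key[KEY_LEN - 1];
}

/* Remove an element from the linked list of its bucket. */
static void unlink_entry(uint16_t idx)
{
    uint16_t *link = &bucket_head[pool[idx].bucket];
    while (*link != NIL && *link != idx) link = &pool[*link].next;
    if (*link == idx) *link = pool[idx].next;
    pool[idx].used = 0;
}

static uint8_t classify_cached(const uint8_t *key)
{
    assert(key != NULL);
    uint16_t h = hash_key(key);
    for (uint16_t i = bucket_head[h]; i != NIL; i = pool[i].next)
        if (memcmp(pool[i].key, key, KEY_LEN) == 0)
            return pool[i].result;               /* cache hit */

    uint8_t result = classifier(key);            /* miss: full lookup */
    uint16_t slot = (uint16_t)fifo_pos;
    fifo_pos = (fifo_pos + 1) % N_MAX;
    if (pool[slot].used) unlink_entry(slot);     /* evict the oldest flow */
    memcpy(pool[slot].key, key, KEY_LEN);
    pool[slot].result = result;
    pool[slot].bucket = h;
    pool[slot].next = bucket_head[h];
    pool[slot].used = 1;
    bucket_head[h] = slot;
    return result;
}
```

Recycling slots in FIFO order keeps the replacement routine trivial: no per-flow timestamps or usage counters are needed, at the cost of occasionally evicting a flow that is still active.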
Finally, in such a scheme where different threads work on the same data
structure, contention becomes an important issue to be addressed.
Software-controlled cooperative context switches reduce thread
synchronization issues, but they are not enough in case of multiple
simultaneous operations on the cache table, such as lookups, additions or
removals of cache elements. A simple solution is to use mutexes (short for
mutual exclusion), which are Boolean variables (1 or 0) accessed by atomic
(i.e. non-interruptible) operations.
2.5 Reconﬁgurability
Reconfigurability is one of the most appealing features of network
processors. The ability to change filtering rules "on the fly" is a
fundamental requirement for most applications based on packet filtering,
such as dynamic proxies and adaptive load balancers.
After a flow is cached, microengines use the information stored in the cache
table to classify packets belonging to this flow. Meanwhile, if a non-cached
packet must be classified, the microengines access the classifier table.
This table is created by an offline component at XScale level. Therefore, in
order to support reconfiguration at runtime, some communication between the
microengines and the XScale must be set up. Before the classifier table can
be updated, a message must be sent from the XScale to the microengines, and
the packet processing code must then block any request to access the
classifier table during the update. The cache is very useful in this task,
since packets belonging to cached flows can still be processed while only
new flows are slightly affected (the packets of new flows are queued to be
processed according to the updated
table). At the end of the table update, all cached flows can be wiped out by
a "garbage-collection" routine.
The update command can be issued through the Class_table module, which
produces the following message exchange:
• Class_table communicates with the classiﬁer core component;
• the core component sends a message to Update microblock through ring
6;
• the Update microblock sets a mutex on classiﬁer table, blocking all
accesses to it;
• once there is no thread using classiﬁer table, the Update microblock
sends a message to the core component through ring 7;
• the XScale can now re-build the table;
• once the rebuild is done, it sends a message through ring 6;
• then the Update microblock can release the mutex and threads can
access the table again.
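The message exchange listed above can be modelled as a small single-threaded simulation in C. This is only a sketch of the handshake logic: the two rings are reduced to one-slot mailboxes, the table rebuild is reduced to a version counter, and the message names and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum msg { MSG_NONE, MSG_UPDATE_REQUEST, MSG_TABLE_IDLE, MSG_UPDATE_DONE };

/* One-slot mailboxes standing in for rings 6 and 7. */
static enum msg ring6 = MSG_NONE;   /* XScale -> Update microblock */
static enum msg ring7 = MSG_NONE;   /* Update microblock -> XScale */

static bool table_mutex = false;    /* set: classifier table is blocked */
static int  readers = 0;            /* threads currently using the table */
static int  table_version = 0;
static bool draining = false;       /* waiting for the last reader */

static void ring_put(enum msg *ring, enum msg m)
{
    assert(*ring == MSG_NONE);
    *ring = m;
}

static enum msg ring_get(enum msg *ring)
{
    enum msg m = *ring;
    *ring = MSG_NONE;
    return m;
}

/* Classifier thread side: no new table access while the mutex is set. */
static bool table_enter(void)
{
    if (table_mutex) return false;
    readers++;
    return true;
}

static void table_leave(void)
{
    assert(readers > 0);
    readers--;
}

/* The core component asks for an update through ring 6. */
static void request_update(void)
{
    ring_put(&ring6, MSG_UPDATE_REQUEST);
}

/* One scheduling step of the Update microblock. */
static void update_microblock_step(void)
{
    enum msg m = ring_get(&ring6);
    if (m == MSG_UPDATE_REQUEST) {
        table_mutex = true;          /* block all new accesses */
        draining = true;
    } else if (m == MSG_UPDATE_DONE) {
        table_mutex = false;         /* threads may use the table again */
    }
    if (draining && readers == 0) {  /* no thread is using the table */
        ring_put(&ring7, MSG_TABLE_IDLE);
        draining = false;
    }
}

/* One scheduling step of the core component on the XScale. */
static void core_component_step(void)
{
    if (ring_get(&ring7) == MSG_TABLE_IDLE) {
        table_version++;             /* rebuild the table */
        ring_put(&ring6, MSG_UPDATE_DONE);
    }
}
```

The key property of the handshake is that the rebuild never starts while a thread is still inside the table: the core component acts only after the Update microblock has reported on ring 7 that the table is idle.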
Message exchanges between core components and microblocks are handled by the
Resource Manager Framework [22] and are interrupt-driven. In order to
resolve contention between the Classifier and Update microblocks and to
avoid deadlocks, memory routines release the mutexes they own whenever a
Classifier thread asks for the same mutex.
2.6 Implementation
The packet filter is designed to be integrated into a general application
scheme which is also used for a Layer 2 Static Forwarder provided by Radisys
with ENP-2611 boards. The classification module is placed in a single
microengine (as depicted in fig. 2.3), but it can be spread over several
microengines with few modifications. A single microengine is used because it
gave better experimental performance than the solution with more
microengines. This result depends on the strict ordering imposed across
microengines. In fact, in case of many independent and concurrent accesses
to SRAM, as happens in this application, strict ordering entails long waits
even for fast threads. Hence, using fewer microengines, and consequently
fewer threads, is the choice that ensures the best performance.
Figure 2.3: Classiﬁer architecture at microengine level.
In the ﬁgure, the circles represent rings, which are on-chip circular FIFO
queues. The external rectangles represent processors. The microengines are
labelled with the name "uE" followed by a hexadecimal index. The internal
rectangles represent the pieces of code that implement speciﬁc functions. In
a microengine these pieces are named microblocks, at XScale level they are
driven by core components.
The white microengines contain the driver blocks, provided directly by
Intel, which are strictly hardware-dependent and deal with low-level
functionalities. In particular, Packet_RX retrieves packets from the
interface and assembles metadata to be put in Ring 1, while Packet_TX reads
metadata from Rings 2 to 5 and sends the corresponding packets to the Media
Switch Fabric.
The gray microengines contain the user-written microblocks, which are in
charge of packet processing and represent the actual target of developers.
Our classifier is composed of microengine 0x01 and of a core component
communicating with a user-space application for table creation and update.
Microengine 0x01 includes two microblocks: Classifier (in charge of packet
classification) and Update (which deals with table updates). These
microblocks correspond to different threads: 7 threads run the Classifier
microblock, while Update is assigned to thread 8. If the classifier is
spread over several microengines, only one of them must include the Update
microblock.
The functionalities of our module are described in detail in the following.
The XScale processor and the microengines perform distinct operations: the
former builds the decision-making data structure from a rule list; the
latter process packet headers and classify them.
A PHP interface is provided for rule insertion. It generates a file that is
passed to the Class_table module, which calculates the number of nodes for
each level and the SRAM addresses of the levels. Then, it builds the data
structure by using the pre-calculated addresses. The microengines retrieve
the proper fields (IP addresses, L4 ports and Protocol field) from packet
headers and compute the corresponding CRC. The result is used to address the
hash table and search the cache for a match. If an entry exists, the result
is already stored and can be used. Otherwise the classifier looks for the
matching rule using the trie data structure in SRAM and a new entry is
inserted in the cache table. Finally, the TOS field is modified according to
the result.
Referring to the well-known reader/writer paradigm, each thread starts as a
reader but becomes a writer if it does not find the flow it is looking for.
Whenever a thread needs to check for a flow in the cache, it first checks
the mutex of the corresponding entry in the hash table. If it is set, the
thread has to wait; otherwise it sets the mutex and enters the linked list.
Once the search is finished, the thread releases the mutex by clearing it.
The mutex of an entry is placed in its most significant bit (msb). Mutex
variables that are not related to hash table entries are stored in
Scratchpad Memory (if the application uses more than 1 microengine) or in
Local Memory (otherwise), to reduce read latency.
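The per-entry mutex stored in the most significant bit can be sketched with C11 atomics, which here stand in for the atomic memory operations of the NP. The function names are hypothetical and the payload carried in the remaining 31 bits is an assumption made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCK_BIT 0x80000000u   /* msb of the 32-bit hash-table entry */

/* Try to take the per-entry mutex. The atomic OR returns the previous
 * value: if the msb was already set, another thread owns the entry. */
static bool entry_trylock(_Atomic uint32_t *entry)
{
    uint32_t old = atomic_fetch_or(entry, LOCK_BIT);
    return (old & LOCK_BIT) == 0;
}

/* Release the mutex by clearing the msb; the other bits are untouched. */
static void entry_unlock(_Atomic uint32_t *entry)
{
    uint32_t old = atomic_fetch_and(entry, ~LOCK_BIT);
    assert(old & LOCK_BIT);    /* the entry must have been locked */
    (void)old;
}

/* Remaining 31 bits of the entry (for instance, the index of the first
 * element of the linked list). */
static uint32_t entry_payload(_Atomic uint32_t *entry)
{
    return atomic_load(entry) & ~LOCK_BIT;
}
```

A thread that fails `entry_trylock` simply retries later. Because the test and the set happen in one atomic operation, two threads can never both believe they own the same entry.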
In the earliest stages, one of the main issues of our classifier was the
high number of SRAM accesses. Therefore, a very compact data structure has
been created, to consolidate adjacent memory accesses. This is obtained
through the pre-calculation of SRAM addresses performed by the XScale, which
avoids empty memory spaces.
Moreover, advanced programming techniques, such as instruction filling, have
been analyzed. The IXP2400 provides a limited queue for SRAM access
requests: if a thread needs a memory access but the queue is full, it keeps
re-issuing the request, hence stalling the whole microengine. To avoid
stalling, a first solution is instruction filling. It consists of first
executing all the instructions which do not depend on the memory access,
including those that follow the memory access instruction in the code flow.
This way, memory reads are delayed, which prevents the queue of SRAM access
requests from overflowing. To date, all these optimizations are performed
manually, as the code optimization options of compilers are still not
sufficiently mature.
Figure 2.4: The experimental test-bed.
2.7 Performance Evaluation
This section illustrates a series of measurements that test the operation
and performance of our classifier. The test-bed is shown in fig. 2.4. The
RadiSys ENP-2611 board is hosted by a PC running Linux; they are connected
through a serial cable for board configuration and by Ethernet for all
subsequent communications.
To generate and analyze IP traffic, the Spirent AdTech AX/4000 is used,
connected to the board through multimode optical fiber cables and to a
Graphical User Interface (GUI) through Ethernet. The AdTech AX/4000 is able
to saturate a full-duplex gigabit link.
The rule tables for experimental validation have been generated using
ClassBench [24], a benchmarking tool for packet classification algorithms.
The ClassBench Generator produces synthetic filter sets that accurately
model the characteristics of real filter sets (e.g. Access Control Lists,
Firewalls, IP Chains). Moreover, the tool suite includes a Trace Generator,
which creates sequences of packet headers to test the algorithm with respect
to a given filter set. Each packet header is characterized by a different
5-tuple and it identifies a specific traffic flow. These sequences are
translated to be used as inputs to the AX/4000.
Figure 2.5: Transfer delay.
The system proposed in this work is able to classify 3 Mpps (the maximum
number of packets on a full-duplex gigabit link) with no packet loss,
independently of the statistical properties of the offered load. This
outcome approaches the performance limit of the IXP2400 and is remarkable in
comparison with the other available solutions. For instance, a
classification function implemented on a 33 MHz FPGA and five 1 Mb SRAMs
allows up to 1 Mpps [25]; a heuristic hardware-based algorithm for packet
classification on network processors [26] shows a simulated classification
throughput of 2.56 Gbps, obtained by setting in the PALAC simulator an SRAM
access cost of 10 ns instead of the roughly 200 ns that we measured.
Moreover, our overall system can be ported with slight modifications to
other network processors of the Intel family and scales well with processing
power; by adopting for example the IXP2800, which has greater global
processing power, the system can easily reach higher performance (up to 10
Gbps).
Fig. 2.5 shows the average transfer delay experienced by packets in the
overall application for different numbers of flows. The number of flows that
can be cached at the same time is fixed to 500. For traffic generation we
used a random space burst distribution, with a fixed burst size of 1250
packets per burst and a variable number of bursts per second.
Figure 2.6: Time evolution of transfer delay values.
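The 3 Mpps figure quoted above is consistent with the standard Ethernet framing overhead, as the following C sketch shows. It assumes minimum-size 64-byte frames, each preceded by 8 bytes of preamble and start-of-frame delimiter and followed by a 12-byte interframe gap; the helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame overhead on the wire that is not part of the frame itself. */
#define PREAMBLE_SFD_BYTES   8u
#define INTERFRAME_GAP_BYTES 12u

/* Packets per second that fit on one direction of an Ethernet link for a
 * given frame size (bytes, FCS included). */
static uint64_t max_pps(uint64_t link_bps, unsigned frame_bytes)
{
    assert(frame_bytes >= 64);   /* minimum Ethernet frame */
    uint64_t bits_on_wire =
        8u * (uint64_t)(frame_bytes + PREAMBLE_SFD_BYTES +
                        INTERFRAME_GAP_BYTES);
    return link_bps / bits_on_wire;
}
```

A 1 Gbps link carries at most 1488095 minimum-size frames per second in each direction, so a full-duplex link offers about 2.98 Mpps in total, which is the roughly 3 Mpps load used in these experiments.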
As shown by the figure, if the traffic is composed of 400 or 600 flows
(close to the cache table capacity), the average packet processing time is
always less than 20 µs. When the number of flows increases significantly,
the benefits of the cache table are reduced: the cache is constantly updated
by the arrival of new (non-cached) flows and the classifier table is often
consulted. Hence the processing time increases. However, even with 1000
flows (twice the number of flows that can be cached) and with a traffic load
of 3 Mpps, these values are less than 150 µs.
Fig. 2.6 illustrates the time evolution of the average transfer delay
experienced by packets in different working conditions of the classifier.
For these measurements, a constant bit rate flow distribution and an overall
traffic load of 3 Mpps are set. The cache table can contain 500 flows.
At the beginning, 400 packet flows pass through the classification device.
All the information of such flows is cached and thus the average processing
time is very low (about 8 µs). The classifier table is used only in the
startup phase and does not affect the transfer delay values.
At 12.2 seconds, 200 new traffic flows are added to the previous ones (the
global traffic load holds steady). Now there are 600 traffic flows with a periodic
distribution: this is the worst case for our application, because the processing
of periodic, simultaneous flows nullifies any possibility of exploiting
time locality through the cache (recall that it can hold the information of at
most 500 flows at the same time). In fact, packets of different
flows continuously arrive at the classification device with no time locality,
hence entries are persistently added to and deleted from the cache table,
and the classifier table is consulted often. Despite this hard situation,
an increase in average processing time of only 1 µs is observed. At 25 seconds,
a rule table update command is invoked. The entire update process starts,
including the message exchange between XScale and microengines and the
update of the classifier table. In this period the processing times increase up to
20 µs, but very few packets are affected. In fact the overall process lasts
less than 0.6 seconds, and in this period the packets of cached flows continue
to be processed. Only the packets of non-cached flows are delayed while waiting for
the update of the classifier table. After the update, the average transfer delay
reverts to 9 µs.
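The loss of time locality described above can be illustrated with a minimal sketch. It assumes a least-recently-used (LRU) replacement policy and strictly round-robin packet arrivals; neither is stated here for the actual cache table, and the function name hit_ratio is hypothetical. The sketch only counts cache hits and does not model processing times.

```python
from collections import OrderedDict


def hit_ratio(num_flows, cache_size, num_packets):
    """Fraction of packets served from an LRU flow cache.

    Packets arrive in strict round-robin order over num_flows flows,
    mimicking periodic flows with no time locality (an assumption).
    """
    cache = OrderedDict()  # flow id -> cached classification result
    hits = 0
    for i in range(num_packets):
        flow = i % num_flows
        if flow in cache:
            hits += 1
            cache.move_to_end(flow)        # mark as most recently used
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict the least recently used entry
            cache[flow] = True             # stands for a classifier-table lookup
    return hits / num_packets


if __name__ == "__main__":
    for flows in (400, 600):
        print(flows, hit_ratio(flows, cache_size=500, num_packets=60_000))
```

Under these assumptions, 400 flows fit in the 500-entry cache and miss only during warm-up, whereas with 600 periodic flows each entry is evicted before its flow comes round again.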
The results obtained for the reconfigurability of our device are remarkable:
a rule table update is obtained practically for free in terms of packet loss
and delay. As for memory consumption, the measurements confirm the
a-priori analysis and the efficiency of the modifications performed: the rule
table can be stored in SRAM provided it does not exceed 50000 rules. As
we have already said, this bound is not given by intrinsic restrictions of our
module, but only by the sizes of the memories on the RadiSys card. Classifier
performance, in terms of supported throughput and processing time, is fully
independent of the number of rules. Therefore, the classifier scales fully with
respect to this parameter.
2.8 Final considerations
In this chapter the design and implementation of a multidimensional
packet filter on the Intel IXP2400 Network Processor have been
illustrated. The first phase of this research focused on the selection
of a classification algorithm that would achieve an optimal trade-off between
its performance and its integration in the specific hardware platform. The
algorithm, namely the multidimensional multibit trie, has been modified to
exploit the characteristics of the Radisys board. Then, the classification
functionalities have been implemented: the XScale processor builds the
decision-making data structure according to a rule list, while the microengines
process the packet headers and find the matching rules. A cache table has also
been introduced, to improve classifier performance by reducing the number
of memory accesses. Particular attention has been paid to system reconfigurability,
and a communication mechanism has been created to allow the XScale to update the
classifier data structure and the microengines to use consistent data. The problems
of contention and synchronization, typical issues in multiprocessor
environments, have been addressed by means of programming devices such as
mutexes, in addition to the software-controlled cooperative context switches
provided by the IXP2400. In the programming of the microengines, multithreading,
stalling and filling have been investigated in detail, in order to obtain the
best performance. Finally, we have tested the operation and performance of our
classifier. Good values of transfer delay are obtained and a throughput of 3
Mpps is supported. The rule table update is obtained practically
for free: it lasts less than 0.6 seconds and produces only a small increase in
processing time, with no packet loss.
Chapter 3
Amber Sched: a resource
scheduler for NPs
The growth of the Internet in recent years has been driven by increasing
requirements in terms of capacity, security and reliability. Moreover,
improvements in multimedia applications call for mechanisms and architectures
that provide Quality of Service (QoS) and differentiated services. Technology
development has shown that the evolution of processing power cannot
keep pace with the growth of link capacity. Therefore a link capacity scheduler is no
longer sufficient to assure efficient service differentiation to end-users: a
proper allocation of computing power for packet processing must also be adopted.
In this chapter a processing scheduling scheme for Intel IXP2XXX Network
Processors is proposed. A model of the architecture is defined and an ad-hoc
simulator is developed to help in understanding the system and redesigning
the application. Finally, experimental results show the performance
of the proposed algorithm.
3.1 The main idea
The high level of QoS and the service differentiation required by emerging
Internet applications, along with the large number of customers (from
residential to enterprise) brought by new broadband access technologies such as DSL
and x-PON, lead towards the building of next generation routers with great
flexibility and processing capabilities, high-speed interfaces and large switch
capacity. Technology development in recent years has shown that the
evolution of processing power cannot keep pace with the growth of link capacity
[26]. Therefore a link capacity scheduler alone is no longer sufficient to
assure an efficient and reliable service differentiation to end-users: a proper
allocation of computing power must also be adopted.
In this scenario one of the most promising paradigms is the
adoption of Network Processors (NPs) [27]. The inherent multiprocessor and
multithreaded architecture of an NP allows the definition of different paths
for packet flows inside the chip: this feature can be exploited to allocate
processing resources according to flow needs. In this chapter we present an
analytical approach to the problem of resource scheduling on NPs for given
packet processing applications.
3.2 Related Works
The scientific community has shown wide interest in resource scheduling for
packet processing applications. In [28] the authors show
that conventional link-scheduling schemes do not apply to resource
scheduling, because they work with only one resource (the link capacity) and
rely on unlimited buffer and processing capacity. These assumptions do
not hold on architectures with many output ports and a limited number
of processors per card. The authors identify the thread as the smallest
computing unit to be scheduled and propose a thread scheduler for the IXP1200 NP.
However, this choice increases the overhead of context switches as well as the
overall complexity. While our approach shares the focus on threads with [28],
we look for a simpler and faster scheme.
A fair composite scheduler of link bandwidth and CPU (CBCS) can limit
the delay of an active node, as it combines the two operations in one [29].
However, CBCS takes a decision on each packet before the processing stage, that
is, before knowing which output port the packet will be forwarded to. As a
consequence, such a composite scheduler does not fully exploit the available
links. Nevertheless, [29] highlights the importance of a feedback scheme
(similar to the one we propose) that helps in making predictions and decisions.
In order to define the resource to be scheduled, T. Wolf et al. show in
[30] that the processing stage delay in a programmable router generally
depends on the kind of running application and on packet size. The
authors propose an algorithm that predicts the processing delay of a packet as
a function of payload length and of the specific application. However, in packet
forwarding applications the processing delay does not depend on packet size,
and the proposed estimate reduces to a constant. [30] inspired our work in the
definition of a more precise and complex prediction algorithm for processing
delay.
Estimates of NP execution time are given in [31] and [32]. The former
paper focuses on worst-case analysis to give safe bounds for the performance
stability of real-time applications; the latter provides an offline, static prediction
based on an accurate study of the code, which is typically impractical.
Our analysis of the processing resource is based on classical studies on
multithreaded systems, such as [33], and focuses on network applications to
predict processing delays. The problem of resource allocation on multi-processor
systems can also be generalized to a multi-server scheme. In [34] a formal
definition of Multi-Server Fair Queueing (MSFQ) as an approximation of Generalized
Processor Sharing (GPS) for multiple servers is provided. This seminal work
inspired several other schemes [35] for packet scheduling on
multiple processors and multiple links. However, like GPS in the single-server
case, MSFQ can be seen as an ideal system and as a target for real schedulers.
3.3 IPv4 Forwarder Evolution
This work is based on the IPv4 Forwarder application provided by Intel as
example code. It is distributed among microengines (µEs) as shown in fig. 3.1,
while the XScale core (not shown) handles exception packets and
management/control functionalities. We will use processor as a synonym
for microengine, since our focus is on the processing stage of the fast data
path, in which the XScale is not involved.
Microengines are equally divided into two clusters, thus the notation µE
p:q refers to the q-th microengine of cluster p. The first microengine
from the left in fig. 3.1 (µE 0:0) receives packets from the Media Switch Fabric
(MSF) and puts the pointers to received packets in a ring structure. This
kind of structure is used by all processing units to communicate, and it is
allocated in a small, fast memory called Scratchpad (the only shared
memory that is embedded in the NP [9]). Three microengines process the
packets according to a look-up table (created by the XScale core) and hand the
processed packets to the Packet Queue-Manager microengine. Here all packets
are queued according to their output port, waiting for a command from the
Link Scheduler microengine. Finally the Queue-Manager dequeues packets
and delivers them to µE 1:1, which is in charge of packet transmission.
The study of this application has focused on the processing stage
(µE 1:3, 0:2, 0:1 in fig. 3.1), where each thread deals with only one packet.
The use of a single ring between the receiver microengine and the processing stage
is not suitable for packet delay differentiation, since there is no way to decide
the path each packet will follow through the processing stage. Moreover, the
temporal scheduling on the rings for the processing stage forces all the threads to
complete at the same time: this prevents packet misordering, but it also
removes any opportunity to differentiate traffic delays.
Figure 3.1: IPv4 Forwarder
The application has then been modified as shown in fig. 3.2. A new
microengine has been introduced for the resource scheduling functionalities. As
shown in the next section, this microengine chooses the proper processing
microengine for each received packet according to the flow it belongs to. The choice
is supported by feedback information written into the fast Scratchpad memory by
the FB microblock.
As a side effect, the new application does not keep the read/write order
described above, which introduces the new problem of packet misordering. However,
the amount of out-of-order packets is very small (as will be shown in the
last section) and represents the price for the differentiation of class delays.
3.4 Model
As remarked above, the thread has been chosen as the resource to be
scheduled. We have therefore studied it, in particular its processing delay which, in
packet forwarding applications, does not depend on packet length but only on
processor load.
Figure 3.2: IPv4 Forwarder with AMBER Resource Scheduler
We will base our analysis of the processing stage of our modified
application (fig. 3.2) on the following remarks:
1. Each thread handles only one packet at a time.
2. The code executed for each packet is the same.
3. Threads do not follow any order when reading/writing to input/output
rings.
4. There are no exceptional packets to be processed.
5. Each thread can be in execution, idle, or waiting for required resources
(the processor is handed to other threads in a non-preemptive fashion).
Property 4 corresponds to a Common Case Execution Time (CCET) [32]
approach. All the mentioned properties (except for the third) are satisfied by
most applications for different NPs [10]. Therefore the model we describe can
be applied to a large number of applications with minor modifications.
We will refer to the processing time of a packet (and then of its related
thread) inside the processing stage of the forwarding application as comple-
tion time. We will call active a thread that has a packet to be processed
and we will refer to the condition of voluntary processor release and wait for
a resource as block. Thus we will hereafter call block instant the moment
when a block begins and a context-switch occurs.
During a block, a thread must wait both for the required resource to
be ready ($Tb^{Res}$) and for the other threads to hand the processor back to
it ($Tb^{Int}$), resulting in a blocking time $Tb = \max(Tb^{Res}, Tb^{Int})$. Thus the
completion time $Tc_k$ of thread k belonging to a microengine with n active
threads out of a total number of $N_T$ threads is given by:
\[
Tc_k(n) = \sum_{i=1}^{N_b} \left[ \Delta T_{k,i} + \Delta T_{ctx} + \max\left( Tb^{Int}_{k,i},\; Tb^{Res}_{k,i} \right) \right] \tag{3.1}
\]
\[
Tb^{Int}_{k,i} = \sum_{j=1,\, j \neq k}^{n} \Delta T_{j,i} + \sum_{j=n}^{N_T} \Delta T_r
\]
where $i = 1..N_b$ is the index of the block instants of a thread, $\Delta T_{k,i}$ is
the atomic execution time of thread k before block instant i, $\Delta T_{ctx}$ is the
context-switch penalty (a few processor cycles) and $\Delta T_r$ is
the atomic execution time of an inactive thread, that is, a thread which reads
the input ring and finds no packets there. Finally, $Tb^{Res}_{k,i}$ is the i-th resource
blocking time of thread k.
Simulations and [10] suggest that the blocking time for SRAM or DRAM
access has an average behaviour expressed by:
\[
E[Tb^{Res}] = \max\left( K \cdot N \cdot L,\; Tb_{min} \right)
\]
where K is a constant depending on the memory type, N is the number of
threads accessing that specific memory, L is the number of words read or written
and $Tb_{min}$ is a constant threshold representing the minimum blocking time
(i.e. the time the controller needs before taking control of the resource).
The max() operator in (3.1) can be replaced with one of its terms (either
$Tb^{Int}_{k,i}$ or $Tb^{Res}_{k,i}$) according to the load of the microengine (that is,
the number of active threads). Thus two different regimes can be identified, and in both
cases a linear expression (with different parameters) for $Tc(n) = E[Tc_k(n)]$
holds:
\[
Tc(n) = a \cdot n + b \tag{3.2}
\]
To confirm the previous results, the forwarding microengine code has been
emulated with the Developer Workbench [36] and different completion time
traces have been extracted. The result is shown in fig. 3.3 together with the
theoretical curve obtained through (3.2). There is only a small difference for
the value of Tc(1); in any case, when n = 1 the system is heavily underloaded,
a situation in which a resource scheduler is of little help.
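The model of equation (3.1) can be sketched numerically as follows. This is only an illustration under simplifying assumptions that are not in the text: every execution burst of every active thread lasts the same number of cycles, the number of inactive threads is taken as N_T - n (one reading of the second sum in the interleaving term), and all numeric values are hypothetical.

```python
def expected_resource_block(k_mem, n_threads, words, tb_min):
    """Average resource blocking time: E[Tb_Res] = max(K * N * L, Tb_min)."""
    return max(k_mem * n_threads * words, tb_min)


def completion_time(n_active, n_total, n_blocks, dt_exec, dt_ctx, dt_idle, tb_res):
    """Equation (3.1) when every burst of every active thread lasts dt_exec cycles.

    The interleaving term counts the bursts of the other n_active - 1 active
    threads plus the ring polls (dt_idle) of the inactive threads, assumed
    here to be n_total - n_active.
    """
    tb_int = (n_active - 1) * dt_exec + (n_total - n_active) * dt_idle
    return n_blocks * (dt_exec + dt_ctx + max(tb_int, tb_res))


if __name__ == "__main__":
    # Hypothetical cycle counts, chosen only to show the shape of the curve.
    for n in range(1, 9):
        tb_res = expected_resource_block(k_mem=10, n_threads=n, words=2, tb_min=60)
        print(n, completion_time(n, 8, 4, 50, 2, 5, tb_res))
```

When the interleaving term dominates, each additional active thread adds the same number of cycles, which is the linear behaviour summarised by (3.2).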
3.4.1 Predictors
Figure 3.3: Theoretical completion time, emulated values and number of
packets per second (dotted line) as a function of the number of active threads.
To account for the fluctuation in the number of active threads due to the
departure and arrival of packets, we can smooth eq. 3.1 and 3.2 with a running
average (we call it predictor A), or we can also exploit the measured processing time
of the last packet (predictor B). In this way we propose two prediction schemes for
processing time that require only a small amount of additional complexity
and a feedback mechanism that reads the number of active threads or the last
processing time. These schemes are compared with a predictor that uses only
the previous processing time (predictor C) and with the plain constant predictor (D)
proposed in [30]. The four prediction schemes are reported below:
\[
\begin{aligned}
\hat{T}_A(j) &= \alpha\,\hat{T}_A(j-1) + (1-\alpha)\,Tc(n) \\
\hat{T}_B(j) &= (\alpha-\beta)\,\hat{T}_B(j-1) + \beta\,X(j-1) + (1-\alpha)\,Tc(n) \\
\hat{T}_C(j) &= \alpha\,\hat{T}_C(j-1) + (1-\alpha)\,X(j-1) \\
\hat{T}_D(j) &= \text{const}
\end{aligned}
\]
where $\hat{T}_K(j)$ is the prediction for the j-th packet in scheme K, $X(j-1)$ is the
measured completion time of the previous packet, $Tc(n)$ is the result of (3.1)
and α and β are constants that account for the memory of the predictors.
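The four update rules can be written directly as small functions. The sketch below is illustrative only: the function names, the coefficients a and b of the linear model and the sample trace are hypothetical, and only the update formulas themselves come from the text.

```python
def predictor_a(prev, tc_n, alpha):
    """Running average of the model value Tc(n)."""
    return alpha * prev + (1 - alpha) * tc_n


def predictor_b(prev, x_prev, tc_n, alpha, beta):
    """Model value Tc(n) combined with the last measured completion time."""
    return (alpha - beta) * prev + beta * x_prev + (1 - alpha) * tc_n


def predictor_c(prev, x_prev, alpha):
    """Running average of measured completion times only."""
    return alpha * prev + (1 - alpha) * x_prev


def predictor_d(const):
    """Constant prediction."""
    return const


if __name__ == "__main__":
    alpha, beta = 0.9, 0.3
    a, b = 300.0, 900.0  # hypothetical coefficients of the linear model
    # Hypothetical trace of (active threads, measured completion time in cycles).
    trace = [(5, 2400.0), (6, 2750.0), (7, 3010.0)]
    est_a = est_b = est_c = x_prev = trace[0][1]
    for n, x in trace[1:]:
        tc_n = a * n + b
        est_a = predictor_a(est_a, tc_n, alpha)
        est_b = predictor_b(est_b, x_prev, tc_n, alpha, beta)
        est_c = predictor_c(est_c, x_prev, alpha)
        x_prev = x
        print(n, round(est_a), round(est_b), round(est_c))
```

Note that in each scheme the coefficients sum to one, so a constant input is reproduced unchanged.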
Predictors A and B require an estimate of the parameters a and b of the linear
approximation (3.2) of (3.1). The estimate can be obtained with
an online estimator such as the one proposed in [30], which can be activated at
startup for a training phase and again at regular intervals. By keeping a small
number of additional variables related to the last K packets, the estimator
can use the measured completion times Xi of the packets, together with the
corresponding numbers of active threads ni, to compute the regression
coefficients a and b as follows:
\[
a = \frac{\sum_K X_i n_i - \sum_K X_i \sum_K n_i / K}{\sum_K n_i^2 - \sum_K n_i \sum_K n_i / K} \tag{3.3}
\]
\[
b = \frac{\sum_K X_i - a \sum_K n_i}{K} \tag{3.4}
\]
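As an illustration, the regression over the last K samples can be sketched as an ordinary least-squares fit of X = a·n + b. The function name fit_linear and the sample data are hypothetical; the intercept is computed in the standard least-squares form, (ΣX − a·Σn)/K.

```python
def fit_linear(samples):
    """Least-squares fit of X = a * n + b over (n_i, X_i) pairs.

    samples holds the last K observations: number of active threads n_i
    and measured completion time X_i.
    """
    k = len(samples)
    sum_n = sum(n for n, _ in samples)
    sum_x = sum(x for _, x in samples)
    sum_nx = sum(n * x for n, x in samples)
    sum_nn = sum(n * n for n, _ in samples)
    denom = sum_nn - sum_n * sum_n / k
    if denom == 0:
        raise ValueError("need at least two distinct thread counts")
    a = (sum_nx - sum_x * sum_n / k) / denom
    b = (sum_x - a * sum_n) / k  # mean residual, i.e. the intercept
    return a, b


if __name__ == "__main__":
    # Hypothetical samples lying exactly on X = 3 * n + 5.
    print(fit_linear([(1, 8.0), (2, 11.0), (4, 17.0)]))
```

Only four running sums and the sample count need to be stored, which matches the small number of additional variables mentioned above.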
The comparison among predictors has been made by means of different
real completion time traces of more than 10000 packets, with various values
of α and β = 0.3. The average prediction errors (e), in clock cycles, are shown
in tab. 3.1. As for the constant of predictor D, we used a simple average of the first
100 processing delays. Predictor B provides the best estimation of completion
time on all tests. Predictor A shows a good approximation as well, always
better than C. Finally, the results for predictor D show the penalty induced by a
lack of information on the status of the processing units.
As stated above, out-of-order packets are a major issue when prediction is used
for scheduling purposes. An accurate estimation of completion times is aimed
precisely at reducing packet misordering. We define the misordering penalty as
the additional delay experienced by a packet because of mistakes made by the
scheduler in ordering packets. To quantify this delay, let the random
variable $X_{B_j,C_k}$ denote the event that the scheduler misorders packet j of
flow B and packet k of flow C. Thus $P(X_{B_j,C_k} = 1)$ is the probability that
the scheduler makes a mistake in ordering the packets of these two flows. An
analysis of the average misordering delay as seen by flow B (δb) for
estimation-based scheduling techniques is provided in [30]. In essence, δb accounts for
the time spent by the server in wrongly servicing traffic from flow C before
processing packets from flow B. In our analysis, a simple bound for δb is used:
\[
\delta_b \le \delta_{max} = 2\,P(X_{B_j,C_k} = 1)\,V_{max} \tag{3.5}
\]
where Vmax is the maximum delay dispersion among flows.
The values of δmax for the different predictors are reported in tab. 3.1.
Again predictor B, as it exploits all the information we have, gives the best
results (lowest values of δmax) together with predictor C, while predictor A
gives a higher misordering delay.
3.4.2 Scheduler Design
The AMBER Sched provides service differentiation according to the delay
sensitivity of traffic classes. In particular, it assigns the fastest path inside
the processing stage to packets belonging to delay sensitive (DS) flows, while
ensuring that packets belonging to non delay sensitive flows are still forwarded.
It also tries to minimize the number of out-of-order packets.
Table 3.1: Simulated predictions (clock cycles)
        Predictor A     Predictor B     Predictor C     Predictor D
α       e      δmax     e      δmax     e      δmax     e      δmax
0.75    4.3    450      0.35   200      6      182      185    401
0.8     4      446      0.16   196      6      184      185    401
0.85    0.9    440      4      195      6      184      185    401
0.9     2.2    444      1.03   200      10     194      185    401
0.92    0.7    442      1.2    208      12     202      185    401
To a first approximation, there is no difference in completion time
among the threads of the same microengine. Therefore, to obtain different
completion times according to classes, it is necessary to use microengines with
different workloads, i.e. different numbers of active threads, as suggested by
eq. 3.1. Since the number of microengines assigned to the processing stage
is limited and traffic is hard to predict, it is not possible to statically assign
microengines to specific classes (otherwise a non-work-conserving scheduling
scheme is obtained).
The rate of processed packets in a microengine (n/Tc(n), dotted line in
fig. 3.3) is a concave increasing function of n, and it increases slowly when n
is high. Hence, if a microengine has a high number of active threads, it is not
possible to assign a significant further amount of traffic to it. We can therefore
draw two different conclusions, depending on the amount of traffic to be
processed. If that amount is large, the number of active threads on all
microengines of the processing stage will be high as well, and some processors will
quickly become saturated. In case of saturation, all the remaining traffic must
be moved to the other microengines, which will then no longer have any advantage in
terms of completion time. Hence, in this condition, the best solution is to
distribute all the traffic equally among the microengines and give up any differentiation
of completion time among flows.
On the other hand, if traffic is not too large, it is possible to keep
some microengines with more active threads than others and thus achieve
significant differences in completion times. To this aim we need to define a
mechanism to protect DS flows and ensure them the least completion time
(processing delay). We assign a weight to packets according to their class
and we use this weight as an additional parameter in solving the scheduling
problem. Intuitively, the higher the weight of a class, the faster the class is
served. In fact, assuming that the sum of weights for a microengine is a constant Wtot,
if all packets belong to the same flow (i.e. they have the same weight wi) the
number of active threads is Wtot/wi and thus Tc ∝ (1/wi + b).
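The dependence on the weight can be made explicit by substituting the thread count into the linear model (3.2), with Wtot denoting the constant sum of weights of the microengine:

\[
n = \frac{W_{tot}}{w_i} \quad\Longrightarrow\quad Tc = a\,\frac{W_{tot}}{w_i} + b
\]

Under this model, doubling the weight of a class halves the load-dependent part of its completion time, while the constant term b is unaffected.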
Moreover, to provide service differentiation even with a large amount of
traffic, we must keep the microengines at different numbers of active
threads irrespective of traffic load. To this end, we limit the number
of packets simultaneously served by a microengine by setting a threshold
Wmax on the sum of weights of any processor: a packet is scheduled
to a microengine only if the sum of weights does not exceed Wmax.
The pseudo-code for the scheduler functions is shown in algorithm 3. We
define Φ(t) as the set of backlogged flows at time t and Ω as the set of
microengines. The scheduler keeps a list of the finish times (Fti) of the last
scheduled packet of every flow and always schedules the backlogged flow with
the least Fti. Moreover it predicts the completion time (tˆck) of a packet on every
k ∈ Ω. Of course predictions change only when the number of active threads
changes (i.e. when a packet leaves or enters a microengine). We therefore compute
tˆck only after a packet has been scheduled and after a packet has left the
processing stage. To guarantee delay differentiation and a low misordering
ratio, we define the set E of eligible microengines as the subset of Ω that can
receive packets (i.e. the sum of weights is less than Wmax) and provide a
predicted completion time that avoids misordering (i.e. tˆck > Ftj).
The dequeue function (see algorithm 4) is called whenever the set E
contains at least one element (this condition is satisfied for t > tdeq). It computes
E and behaves differently according to the class of the packet it schedules: it
hands packets belonging to DS flows to the microengine with the least estimated
completion time, while other packets are sent to the path with the least total
weight, which ensures that packets belonging to non-DS flows are still forwarded.
Algorithm 3 Pseudo-code for the scheduler
At packet departure and after scheduling:
j = argmin_{i ∈ Φ(t)} Fti
for all k ∈ Ω do
    if Wk < Wmax then
        tˆck = TˆX where X ∈ {A, B, C, D}
        tdeq = min(Ftj − tˆck)
    end if
end for
Algorithm 4 Pseudo-code for the dequeue function
E = {k ∈ Ω : t + tˆck > Ftj, Wk < Wmax}
if j ∈ DS then
    m = argmin_{k ∈ E} tˆck
else
    m = argmin_{k ∈ E} Wk
end if
send first packet from flow j to µE m
Ftj = t + tˆcm
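The two algorithms can be combined into a small executable sketch. It is not the microcode implementation: the class and function names are hypothetical, the predictor is replaced by the plain linear model (3.2) evaluated with the new packet admitted, and the packet is sent to the selected microengine m.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Flow:
    weight: int
    delay_sensitive: bool
    finish_time: float = 0.0  # Ft_j of the last scheduled packet
    queue: deque = field(default_factory=deque)


@dataclass
class Microengine:
    active_threads: int = 0
    weight_sum: int = 0  # W_k


def predict(engine, a, b):
    """Stand-in predictor: linear model (3.2) with the new packet admitted."""
    return a * (engine.active_threads + 1) + b


def dequeue(t, flows, engines, w_max, a, b):
    """One scheduling decision; returns (packet, engine index) or None."""
    backlogged = [f for f in flows if f.queue]
    if not backlogged:
        return None
    flow = min(backlogged, key=lambda f: f.finish_time)  # least Ft
    predicted = [predict(e, a, b) for e in engines]
    eligible = [k for k, e in enumerate(engines)
                if e.weight_sum < w_max and t + predicted[k] > flow.finish_time]
    if not eligible:
        return None  # no eligible microengine yet
    if flow.delay_sensitive:
        m = min(eligible, key=lambda k: predicted[k])           # fastest path
    else:
        m = min(eligible, key=lambda k: engines[k].weight_sum)  # least weight
    engines[m].active_threads += 1
    engines[m].weight_sum += flow.weight
    flow.finish_time = t + predicted[m]
    return flow.queue.popleft(), m


def depart(engine, flow):
    """Bookkeeping when a packet of `flow` leaves `engine`."""
    engine.active_threads -= 1
    engine.weight_sum -= flow.weight
```

A caller would invoke dequeue after every packet departure and after every scheduling decision, which are the two events at which the predictions can change.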
Table 3.2: Simulated completion time (clock cycles) and microengine utilization.
          µE0    µE1    µE2    Tc      wi
DS        41%    62%    21%    2485    20
non-DS    59%    38%    79%    3513    10
3.5 Simulation
An ad-hoc simulator (written in Matlab) has provided preliminary
results from the mathematical model of delays. These results have suggested
optimal weights for the classes and complementary solutions to enhance system
performance. In particular we have found that adding a further weight to the
busiest microengines can help reduce the amount of out-of-order packets.
In a further simulation step, the cycle-accurate IXP2XXX simulator of the
Developer Workbench [36] has been used to obtain more accurate results on
completion time (tab. 3.2). We tested AMBER Sched with 2 Gbps of traffic
divided equally into DS and non-DS flows. The weights have been set as
shown in tab. 3.2 and we adopted predictor B. The results in tab. 3.2 show a
clear difference in completion time between the classes. DS traffic has the least
delay in the processing stage, with a gain of more than 1000 clock cycles over the
other class. This differentiation is also explained by the significant difference
between the numbers of active threads seen by DS and non-DS traffic
(respectively n = 5 and n = 7.3, as shown by tab. 3.2 and fig. 3.3).
Tab. 3.2 also shows the microengine utilization by the two classes of traffic.
It is worth noting that the paths mainly used by DS traffic carry a smaller
amount of other traffic. This confirms the reliability of using weights for flow
separation.
Table 3.3: Measured delay for IPv4 Forwarder with and without AMBER Sched
                   IPv4 Fwd    AMBER (DS)    AMBER (non-DS)
mean delay (µs)    14.21       12.1          13.2
min delay (µs)     9.7         9.6           10.2
max delay (µs)     104.3       15.4          17.5
pmisord            0           5 · 10^-5     4 · 10^-5
wi                 -           30            10
3.6 Implementation and Results
The final phase of this work deals with the software implementation of
AMBER Sched and its experimental evaluation on a Radisys ENP2611 board,
equipped with an IXP2400 NP. The software has been written in microcode
assembly [9, 10] according to the common directives for microblock deployment,
so as to make it compatible with existing applications.
In our experimental tests, we generate and measure traffic with a Spirent
AX4000 performance tester. Since the delay of the forwarding application does not
depend on packet size [30], the application has been tested at different packet
rates and with various parameter configurations.
Tab. 3.3 shows the values of end-to-end delay for the IPv4 Forwarder
application and for the application with AMBER Sched after a proper tuning of
its parameters (i.e. optimal weights wi suggested by simulation and predictor
B with α = 0.9 and β = 0.3). As expected, DS-marked packets
experience a smaller delay than non-DS ones. The maximum delay for both classes is
significantly lower than with the original application. The price for the
differentiation is a certain amount of out-of-order packets. However, the measured
misordering ratio (pmisord) is acceptable for most applications [37].
Moreover, in spite of the overall processing added for scheduling feedback
and decisions, the mean processing time of a packet is lower than in the
original application. This allows the IPv4 forwarder with AMBER
Sched to process heavy traffic loads (e.g. 3 Mpkt/s) with no packet loss.
3.7 Final considerations
This chapter has described an analytical approach to processing scheduling
in multithreaded and multiprocessor architectures suitable for programmable
routers. The focus is on the Intel IXP2400 Network Processor, but the work
can be generalized to different architectures with minor changes. The
proposed scheme achieves delay differentiation inside the processing stage of a
reference forwarding application. The Developer Workbench simulator and
experimental results show that AMBER Sched can accelerate delay-sensitive
traffic.

Chapter 4
A cooperative NP/PC
architecture for
measurements
The extensive availability of cost eﬀective commodity PC hardware pushed
the development of ﬂexible and versatile traﬃc monitoring software such as
protocol analyzers, protocol dissectors, traﬃc sniﬀers, traﬃc characterizers
and IDSs (Intrusion Detection Systems). The largest part of these pieces
of software is based on the well known libpcap API, which in the last few
years has become a de facto standard for PC based packet capturing. Many
improvements have been applied to this library but it still suﬀers from several
performance ﬂaws that are due not to the software itself but rather to the
underlying hardware bottlenecks.
In this chapter we present a new traﬃc monitoring device, implemented
by an Intel IXP2400 Network Processor PCI-X card connected to a gigabit
ethernet LAN hosting a cluster of common personal computers running any
libpcap based application. This architecture outperforms the previous solu-
tions in terms of packet capturing power and timestamp accuracy.
4.1 Motivations
In the last few years, the availability of flexible, easy to use and easy to
customize network monitoring software has made the PC a suitable
platform for network monitoring and testing. Applications such as tcpdump
[38], wireshark [39], ntop [40] etc. prove to be very effective and flexible for
a large variety of monitoring tasks. Most of these pieces of software are based
on the well known libpcap API [38], which in the last few years has become a
de facto standard for PC based packet capturing. Many improvements have
been applied to this library [41] [42] but it still suffers from performance flaws;
these flaws are not caused by the software itself but by underlying hardware
bottlenecks [43] [44].
All the applications we mentioned are often used together with high-end
PCs to capture, analyze and characterize traﬃc from high-speed links; in all
these cases, their main weakness is evident: low performance. Traﬃc traces
produced by such a combination of hardware and software suﬀer from two
types of uncertainty:
1. Packet timestamps: to sustain a high packet rate, the PC must drive
interface cards by polling, and this results in poor timestamp accuracy;
2. Packet loss: packets can be lost either if the packet rate is
too high for the host CPU to allocate/release memory for packets,
or if the system bus cannot keep pace with the incoming data.
Moreover, often only off-line computing can be performed on packets, since
no extra CPU power is left for on-line analysis (all the CPU time is used for
capturing) [43] [44].
This poor performance is mainly due to the lack of packet computing
capabilities on the network interface cards which commonly equip commodity
PCs: these interface cards are incapable of timestamping the arrival
of a packet (avoiding interrupt latency), of filtering out unwanted packets
(avoiding memory allocation/release for unwanted packets), or of feeding the
host PC with only a fragment of the packet instead of the entire one (avoiding
system bus saturation).
The research described in this chapter proposes an architecture that combines
the flexibility of general purpose PCs (equipped with libpcap based
applications) with the power of Network Processors of the Intel IXP2XXX family.
The target is a powerful system capable of capturing packets on Gigabit
Ethernet links with good timestamp accuracy. Other research efforts recognize
the effectiveness of using Network Processor based devices to assist
commodity hardware for monitoring purposes. Xinidis et al. [45] propose an active
splitter based on the Intel IXP1200 for filtering traffic directed to the sensors of
a Network Intrusion Detection or Prevention System (NIDS/NIPS). In their
scheme, the NP applies Early Filtering techniques and then forwards traffic
to different sensors, according to Locality Buffers or hash load balancing. In
[46], Wolf et al. propose a distributed architecture, called Distributed
Online Measurement Environment (DOME), of passive measurement nodes
equipped with the Intel IXP2400 NP. Their work includes header anonymization
schemes and is compared with Endace DAG 4.3 cards. Both these
systems achieve, for small packets (64 bytes), a rate of around
500 Mbit/s, while our solution is able to manage up to 1 Gbit/s. Moreover,
our system carefully addresses the issue of accurate packet timestamping.
Compared to hardware solutions (e.g. DAG cards), our traffic monitor
provides more functionalities and greater flexibility. For instance,
the packet classifier we implemented supports 50000 rules, while a DAG card
with an integrated 7-rule filter costs more than twice as much.
4.2 The Basic Idea and Issues
Figure 4.1 depicts a scenario where the two ﬂow directions of a Gigabit Eth-
ernet optical ﬁber are both split into two optical signals: the ﬁrst signal is
scattered to an output ﬁber while the second passes through the splitter.
Hence, there are two output ﬁbers, one for each direction. This is the best
available way to copy network traﬃc though some others are possible (e.g.
conﬁguring port mirroring on layer 2 network devices).
The output ﬁbers of the splitter are connected to two of the three optical
interfaces (see section 4.3) of a Radisys ENP2611 Network Processor card,
while a cluster of PC-based Linux boxes is connected to the third interface
via a gigabit ethernet switch. PCs and NP are also connected via a stan-
dard 100BaseT Ethernet LAN (the control interface on the NP) supporting a
standard TCP/IP connection used to issue conﬁguration commands from user
interfaces; therefore every PC on the LAN can issue conﬁguration commands
to the NP via a client/server application (the server resides on the NP, while
each PC runs an instance of the client).
Referring to this scenario, the basic idea behind the proposed architecture
is to have the NP board perform the following operations at wire speed
(1 Gbps, upstream or downstream only):
1. Packet timestamping: recording the arrival time of each packet in
the standard UTC format;
2. Packet classiﬁcation and ﬁltering: selecting only those packets use-
ful for the user and assigning each packet a unique ﬂow identiﬁer based
on a rule set;
3. Header stripping: keeping only the necessary information (e.g. the first
n bytes);
Figure 4.1: Conceptual scheme of the monitoring system.
4. Batch frame crafting: collecting data in batch frames, each contain-
ing information of several packets;
5. Sending batch frames to commodity PCs belonging to the cluster:
using the third optical port of the NP board.
On the PC side, the batch frame is received, dissected, and delivered
towards any monitoring application.
The main advantages of this architecture are:
1. Timestamping accuracy, in that it is performed by the NP card
without the interrupt latency typical of a PC;
2. Heavy CPU offload, as unwanted packets are dropped at the NP
level and are not delivered to any PC, and since a pre-classification is
performed on packets, bringing even more CPU offload (for example in
flow identification).
At this stage, the main issue of this architecture would be the incompat-
ibility between the proposed batch frame and all the available libpcap-based
applications. Next sections describe the implementation design of the entire
architecture made up of an NP-side timestamping and classiﬁcation appli-
cation and a PC-side kernel space abstraction layer which guarantees the
compatibility with any libpcap-based application.
4.3 The Implementation Design
As mentioned in chapter 1, the IXP2400 NP is hosted by a RadiSys ENP-2611
board, equipped with 8 MB of (very fast) SRAM and 256 MB of (slower)
DRAM. This board provides three Gigabit Ethernet optical interfaces and one
Fast Ethernet interface for remote control. Moreover, it supports the MontaVista
Linux operating system running on the XScale CPU. The board is plugged
into a PCI-X slot of a host PC; this PCI connector is currently used only for
power supply (no data communication takes place through it).
The main goal of the entire application is to accurately timestamp packets
and to classify and extract a conﬁgurable portion of them within our IXP2400
NP. The application is made up of an NP-side module and a PC-side module.
A complete description of the whole application and its components will be
elaborated upon in the next sections.
Figure 4.2: Functional scheme of the entire NP-side application.
4.4 Network Processor side
The NP side of our traﬃc monitor application reﬂects the IXP processor hi-
erarchy: microengines are in charge of packet timestamping and classiﬁcation
and batch frame crafting, while the XScale deals with timestamp calibration,
classiﬁcation table setup and update and parameter reconﬁguration. The
entire NP-side application is depicted in ﬁg. 4.2.
4.4.1 Microengines Application Scheme
The whole application can be summarized as follows. The RX microengine
(0x00) retrieves packets from the interface and puts them in Ring 1. For each
packet, the arrival time (actually the arrival time of the first mpacket, cf.
section 4.6), the entire length and the first n bytes of the packet are recorded.
The second microengine (0x01) classifies the packets it receives from Ring
1 (by assigning them a flow identifier (flowID) between 0 and 2^16 − 1 or
by simply dropping them) and sends them to the next microengine (0x10),
which copies all data buffers (each containing flowID, length, timestamp and
the first n bytes of a packet) together to create a batch frame (the batch frame
format is depicted in fig. 4.3). Finally, the batch frame is passed to the TX
microengine (0x02) to be sent to one of the PCs belonging to the cluster.
The batch frame header has the source address set as the MAC address
Figure 4.3: Batch frame and packet digest speciﬁcation.
of the outgoing interface, the destination address set as the MAC address
of the corresponding cluster PC and the type field set to an unused value
(0x9000). As depicted in fig. 4.3, the payload is built with a variable number
of packet digests, each made up of all the information recorded for one packet.
The length of the fragment can differ among flowIDs.
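To make the batch frame layout more concrete, the following Python sketch packs and parses such a frame. The type value 0x9000 and the digest contents (flowID, length, timestamp, first n bytes) are taken from the text; the field order, the 16-bit width of the length fields, the explicit captured-length field and the function names are assumptions made for this example only, since the exact layout is specified in fig. 4.3.

```python
import struct

ETHERTYPE_BATCH = 0x9000  # type field value stated in the text

# Assumed layouts (network byte order, no padding):
#   Ethernet header: destination MAC, source MAC, type
#   digest header:   flowID (16 bit), packet length (16 bit),
#                    captured fragment length (16 bit), timestamp (64 bit)
ETH_HDR = struct.Struct("!6s6sH")
DIGEST_HDR = struct.Struct("!HHHQ")


def build_batch_frame(src_mac, dst_mac, digests):
    """digests: iterable of (flow_id, pkt_len, timestamp, fragment_bytes)."""
    payload = b"".join(
        DIGEST_HDR.pack(flow_id, pkt_len, len(frag), ts) + frag
        for flow_id, pkt_len, ts, frag in digests
    )
    return ETH_HDR.pack(dst_mac, src_mac, ETHERTYPE_BATCH) + payload


def parse_batch_frame(frame):
    """Return the list of digests carried by a batch frame."""
    _dst, _src, ethertype = ETH_HDR.unpack_from(frame, 0)
    if ethertype != ETHERTYPE_BATCH:
        raise ValueError("not a batch frame")
    digests, offset = [], ETH_HDR.size
    while offset < len(frame):
        flow_id, pkt_len, cap_len, ts = DIGEST_HDR.unpack_from(frame, offset)
        offset += DIGEST_HDR.size
        digests.append((flow_id, pkt_len, ts, frame[offset:offset + cap_len]))
        offset += cap_len
    return digests
```

The sketch ignores Ethernet padding of short frames; a real dissector would need a way to detect the end of the digests.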
The code running on the 0x10 microengine (batch frame crafting) contains
a table with the correspondence between flowID and the MAC address of
the PC in charge of processing that flow. At any given time, the application
maintains at most one batch frame for each PC of the cluster, and each packet
digest is copied onto the batch frame corresponding to its flowID.
We will not deal with the details of the classiﬁcation application here, since
it has already been published; readers interested in some classiﬁer internals
can look at [47].
4.4.2 XScale Application Scheme
The XScale processor handles the functionalities concerning timestamp
calibration, classification table setup and update, and parameter reconfiguration.
Moreover, it takes care of interfacing with the user side by means of an
instance of a client/server application for each PC of the cluster.
Timestamp calibration is the operation of resolving the correspondence
between the microengine timestamp, which is a relative time, and the UTC
time, which is an absolute time. This correspondence (initial value plus
counter frequency) is passed to the PC-side application in order to allow
correct UTC timestamping. Indeed, the timestamps provided in the batch frame
are 64-bit values of an integrated counter of the NP and must be interpreted
correctly by the PC.
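The conversion that the PC side has to apply can be sketched as follows, assuming the calibration data consist of one (counter value, UTC time) pair plus the counter frequency. The function name, the nanosecond representation and the 37.5 MHz example frequency (one increment every 16 cycles of a 600 MHz clock) are illustrative assumptions, and counter wrap-around is ignored.

```python
from fractions import Fraction


def make_converter(counter0, utc0_ns, tick_hz):
    """Return a function mapping a raw counter value to UTC nanoseconds.

    counter0 / utc0_ns: calibration pair (counter value and matching UTC time).
    tick_hz: counter frequency in increments per second.
    """
    ns_per_tick = Fraction(1_000_000_000, tick_hz)  # exact, avoids float drift

    def to_utc_ns(counter):
        return utc0_ns + int((counter - counter0) * ns_per_tick)

    return to_utc_ns
```

Exact rational arithmetic is used here so that large 64-bit counter values do not lose precision in the conversion.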
As far as the classifier configuration is concerned, a proper file is generated
and passed to the module Class_table ([47]), which builds the decision-making
data structure. The XScale also manages the dynamic reconfiguration of the
rule set: this is done through a message exchange between its core components
and the microengines [47].
The XScale application is also responsible for setting the desired number
n of bytes to be captured for each packet belonging to a given flow and
for managing the flowID spaces, which have to be assigned to the various PCs
without collisions.
4.5 PC side
The PC-side application is composed of two components. The first is a Linux
kernel module which implements a compatibility abstraction layer, while the
second is a user space application made up of a front-end (user interface) and a
back-end which passes the user's configuration commands to the kernel module
(via ioctl system calls) and to the NP (via the above-mentioned TCP/IP
connection established on the Fast Ethernet control interface).
4.5.1 Kernel space: the compatibility abstraction layer
This module acts as a compatibility layer between the NP-PC communication
protocol (which is simply the batch frame format), and the standard packet
processing chain of the Linux kernel on which the libpcap API is based.
The module registers itself as a virtual network layer capable of processing
Ethernet frames with the type field equal to 0x9000. The module also
creates up to 2^16 virtual interface cards, mon0 to mon65535 (one for each
flowID), thus implementing an abstraction layer toward the system. Every
time a batch frame is received by the kernel, it is steered to this layer which, in
turn, extracts from its payload all the packets together with their timestamp
and flowID. For every extracted packet digest, a new correctly timestamped
packet is generated and transmitted on the virtual interface indexed by flowID.
Figure 4.4: The virtual interfaces mon_i.
Hence, a libpcap-based application configured to monitor, for example, the
interface mon5 (e.g. with the command tcpdump -i mon5) will see all (and only)
those packets with flowID 5, as if it were connected directly to the fiber (to
which the NP is actually connected).
Therefore, this layer makes it possible to instruct the NP to mark an
arbitrary micro-ﬂow with a speciﬁc ﬂowID, and to analyze this ﬂow by simply
connecting an application, such as wireshark, to the corresponding virtual
interface (see ﬁg. 4.4).
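The demultiplexing performed by the abstraction layer can be pictured with the following user-space Python sketch. It is only a model of the idea (the real layer is a Linux kernel module); the function name and the in-memory queues standing in for the virtual interfaces are assumptions of this example.

```python
from collections import defaultdict

MAX_FLOWS = 2 ** 16  # flowIDs range from 0 to 2^16 - 1


def dispatch(digests):
    """Group (flow_id, timestamp, fragment) digests by virtual interface name."""
    queues = defaultdict(list)
    for flow_id, timestamp, fragment in digests:
        if not 0 <= flow_id < MAX_FLOWS:
            raise ValueError(f"flowID out of range: {flow_id}")
        # Each flowID maps to exactly one interface, mon<flowID>.
        queues[f"mon{flow_id}"].append((timestamp, fragment))
    return dict(queues)
```

An application attached to mon5 would then see only the entries queued under that name, in arrival order.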
As shown experimentally in section 4.7, the computational overhead
introduced by this piece of code is very low since it is implemented in a
zero-copy fashion.
Probably the most important advantage of this abstraction layer is the full
compatibility with existing software: packets reach the kernel as if they were
captured on the wire, making any modification to applications
and libraries unnecessary.
4.5.2 User Space: the user interface
The user interface is made up of a back-end capable of
• configuring the NP classifier via a TCP connection whose peer is the
XScale application;
• instructing the NP to capture the desired number n of bytes from every
packet (via the TCP connection);
• reading the association timestamp-UTC;
• conﬁguring the abstraction layer via ioctl system calls.
The front-end module simply implements a user-interface from which the
user can conﬁgure the entire system.
4.6 Timestamping
The timestamping operation consists of recording the arrival time of each
packet. The arrival time is intended as the time ta at which the ﬁrst bit of
the packet reaches the network interface. Unfortunately, packet reception (the
action of retrieving packets from the wire to the CPU, which is the ﬁrst place
where timestamping can be performed) is a compound operation: to derive
the timestamping accuracy of the system, we need to accurately examine what
actually happens whenever a packet arrives at the board.
The Gigabit Ethernet interfaces of ENP2611 are controlled by a Sierra
PM3386 and a PM3387 Gigabit MAC device (see figure 4.5). Those devices
forward received frames to an FPGA Bridge connected to the Media Switch
Fabric (MSF) interface of the IXP2400. The MSF operates in POS-PHY
Level 3 (aka SPI-3, aka PL3) mode and splits packets in ﬁxed-sized chunks
called mpackets (whose size is conﬁgurable as 64, 128 or 256 bytes). To avoid
contention on the PM3386, in our application one of the two interfaces con-
nected to this chip is used for transmission, while the remaining one, together
with the one connected to the PM3387 chip, is used for packet capturing.
At start-up time, all the RX microengine threads place themselves on a
freelist (rx_freelist), thus stating they are ready to handle a new mpacket.
Each time the MSF receives an mpacket, it awakes the ﬁrst thread in the list
and delivers the data to it. Then RX threads gather the set of incoming
mpackets from MSF and merge them, thus reassembling original packets.
Figure 4.5: Hardware packet receiving chain.
4.6.1 Time Budget
When dealing with the timestamp operation, the main concern is the jitter of
the delay each stage introduces (a fixed and known amount of delay between
the real and measured time can simply be subtracted from the measurement). In
the following, we will show that the delay between the arrival time and the
timestamp operation is almost constant on the ENP2611.
Both the PM3386 controller and the SPI-3 bridge forward incoming frames
as soon as a certain amount of bytes (hereafter we will call it the forwarding
threshold) is received. This threshold can be conﬁgured to 64, 128 or 256
bytes. Since the minimum packet size on Ethernet is 64 bytes, in order to
avoid timestamp jitter due to diﬀerent packet lengths, we set the threshold
and the mpacket size to 64 bytes.
This way, the first thread in the rx_freelist is awakened at time tx with
a ﬁxed delay from the arrival time ta. The delay tx − ta consists of the sum
of three latencies corresponding to the three interfaces that data has to cross:
1. the time to reach the "forwarding threshold" within the PM3386, given
by d0 = 512 bit / 1 Gbps = 0.512 µs;
2. the time required by the PM3386 to transfer data across the second
interface toward the SPI-3 bridge, given by
d1 = (512 bit / 32 bit) / 104 MHz ≈ 0.154 µs;
3. the time required by the SPI-3 to transfer data across the third interface,
which operates at the same speed as the second one, thus adding an
equal delay d2 = d1.
Therefore, provided there is an available thread ready to timestamp the packet
as soon as it arrives, the time lag tx − ta = d0 + d1 + d2 ≈ 0.820 µs is fixed
and known.
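The arithmetic of this time budget can be checked with a few lines of Python. The link rate, threshold, bus width and bus clock are the values given above; the 600 MHz microengine clock is the one implied by the tick duration used in section 4.6.2, and the function and key names are assumptions of this sketch.

```python
LINK_RATE_BPS = 1e9        # Gigabit Ethernet line rate
THRESHOLD_BITS = 64 * 8    # forwarding threshold and mpacket size: 64 bytes
BUS_WIDTH_BITS = 32        # width of the interfaces toward and after the bridge
BUS_CLOCK_HZ = 104e6       # clock of those interfaces
NP_CLOCK_HZ = 600e6        # microengine clock (assumed from section 4.6.2)
CC_PER_TICK = 16           # the timestamp counter advances every 16 cc


def time_budget():
    """Return the three latencies and their sum in seconds, cc and ticks."""
    d0 = THRESHOLD_BITS / LINK_RATE_BPS                     # fill the threshold
    d1 = (THRESHOLD_BITS / BUS_WIDTH_BITS) / BUS_CLOCK_HZ   # PM3386 -> bridge
    d2 = d1                                                 # bridge -> MSF
    total = d0 + d1 + d2
    cycles = total * NP_CLOCK_HZ
    return {"d0": d0, "d1": d1, "d2": d2, "total": total,
            "cycles": cycles, "ticks": cycles / CC_PER_TICK}
```

Rounded to the nearest integer, the cycle and tick counts give the correction that is subtracted from the raw timestamp.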
4.6.2 The Accuracy of Timestamp
We now have to prove that there is always such a thread. The operations
performed by a thread of the RX microengine when a packet arrives are
timestamping (i.e., reading a timestamp counter and storing its value into an
internal register), copying packet data and timestamp into the DRAM, and
context switching.
The first two steps together are very fast and take about 80 clock cycles
(cc) to be executed. In the third step the thread puts itself in an idle
state until the memory executes the requested operations, and the microengine
switches to the first ready thread in the rx_freelist; when all the memory
operations are completed, the thread is signaled by the memory hardware itself
and can resume its operation. Clearly, while the memory controller is executing
the requested operations, the microengine can be used to perform other tasks by
means of other threads.
The worst case occurs when all packets are 65 bytes long: in this case,
we have a 64-byte mpacket plus one extra 1-byte mpacket. The
total amount of time it takes to process this packet is twice the time needed
for one mpacket (i.e. 2 × Tproc), while the interarrival time is slightly higher
than in the single-mpacket case: 408 cc instead of 364 cc.
As reported in the IXP2400 data sheet, the signaling delay to awaken an RX
thread is constant and very small. Thus, by using Nth threads per port, we
make sure that the RX microengine threads can receive and timestamp the
first mpacket of a packet with a fixed delay from the real arrival time if the
following inequality holds:
Tproc ≤ 204 cc × Nth (4.1)
Since Tproc depends on a very large number of factors (memory accesses,
number of threads, instantaneous conditions, etc.), it has been measured
experimentally. As shown in fig. 4.6, for both 4 and 8 threads it satisfies
(4.1) by a wide margin.
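One way to read inequality (4.1) is that, in the worst case, each packet costs 2 × Tproc of thread time while a new packet arrives every 408 cc, so Nth threads keep up as long as 2 × Tproc ≤ 408 cc × Nth. The sketch below encodes this reading; the function names are hypothetical and the default values are the worst-case figures quoted above.

```python
def max_tproc_cc(n_threads, interarrival_cc=408, mpackets_per_packet=2):
    """Largest per-mpacket processing time (in cc) allowed by (4.1)."""
    return interarrival_cc * n_threads / mpackets_per_packet


def satisfies_4_1(tproc_cc, n_threads):
    """True if a measured Tproc (in cc) meets the bound for n_threads."""
    return tproc_cc <= max_tproc_cc(n_threads)
```

With 4 and 8 threads the bound evaluates to 816 cc and 1632 cc respectively.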
Figure 4.6: Histogram of measured Tproc. Inequality (4.1) is satisﬁed for 4
threads (Tproc < 816cc) and 8 threads (Tproc < 1632cc).
As for packet batch creation, both the amount of data taken from each
packet and the packet batch total size are conﬁgurable. Once the amount
of data in the packet batch reaches the conﬁgured size, it is sent to the TX
microengine. Moreover, a timeout is provided to make sure that non-full
packet batches are transmitted if no more packets arrive.
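The flush policy described above (send when the configured size is reached, or when a timeout expires on a non-full batch) can be sketched as follows. The class and method names are hypothetical, and measuring the timeout from the first digest placed in the batch is an assumption of this example.

```python
class Batcher:
    """Accumulates packet digests and decides when a batch must be sent."""

    def __init__(self, max_bytes, timeout_s):
        self.max_bytes = max_bytes
        self.timeout_s = timeout_s
        self.buf = bytearray()
        self.first_arrival = None  # time of the first digest in the batch

    def _flush(self):
        batch = bytes(self.buf)
        self.buf = bytearray()
        self.first_arrival = None
        return batch

    def add(self, digest, now):
        """Append a digest; return the batch if the size threshold is hit."""
        if not self.buf:
            self.first_arrival = now
        self.buf += digest
        return self._flush() if len(self.buf) >= self.max_bytes else None

    def poll(self, now):
        """Return a non-full batch whose timeout has expired, else None."""
        if self.buf and now - self.first_arrival >= self.timeout_s:
            return self._flush()
        return None
```

In this model the caller passes the current time explicitly, which keeps the logic independent of any particular clock source.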
Timestamps are provided by 64-bit timestamp registers within the
RX microengine. Such registers are incremented by one every 16 cc (we shall
call this interval an "NP-tick" or simply a tick). Each packet is then
timestamped with a value given by
tx − d0 − d1 − d2 = tx − 492 cc ≈ tx − 31 ticks.
Related to the accuracy of timestamp, note that the timestamp counter
increases every 16 cc, thus each tick represents a time of (16/600)µs ∼=
0.0267µs. To quantify the goodness of such granularity, it is worth reminding
that the most error sensitive application is traﬃc characterization; in this
application the measure that has to be very accurate is the inter-arrival time
of packets. Since the minimum inter-arrival time on a Gigabit Ethernet link
is 0.68µs, we obtain a very good maximum error of 4%.
If packets are concurrently captured from two interfaces, a timestamp
error arises when two mpackets are presented to the SPI-3 chip by the PM3386
and the PM3387. An upper bound of the timestamping error is obtained in
the worst case, which takes place when two mpackets arrive at the SPI-3 at
exactly the same time. In this case one of the two mpackets has to wait
Emax = d1 ≈ 0.154 µs before being timestamped. Comparing this error with
the minimum inter-arrival time, we obtain a maximum error of 22.6%, which
is much larger than the 4% due to the clock granularity.
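The two error figures can be reproduced from the quantities given in this section: a 16 cc tick on a 600 MHz clock, a minimum inter-arrival time of 0.68 µs, and the transfer delay of one 64-byte mpacket over a 32-bit interface clocked at 104 MHz. The variable names below are chosen for this example only.

```python
NP_CLOCK_HZ = 600e6            # microengine clock
CC_PER_TICK = 16               # timestamp counter period in clock cycles
MIN_INTERARRIVAL_S = 0.68e-6   # minimum inter-arrival time on Gigabit Ethernet
BUS_CLOCK_HZ = 104e6           # clock of the interface toward the SPI-3 bridge
WORDS_PER_MPACKET = 512 // 32  # 64-byte mpacket over a 32-bit interface

# Error due to the counter granularity: one tick over the shortest interval.
tick_s = CC_PER_TICK / NP_CLOCK_HZ
granularity_error = tick_s / MIN_INTERARRIVAL_S

# Error due to contention: one mpacket waits for the other one's transfer.
contention_delay_s = WORDS_PER_MPACKET / BUS_CLOCK_HZ
contention_error = contention_delay_s / MIN_INTERARRIVAL_S

print(f"granularity: {granularity_error:.1%}, contention: {contention_error:.1%}")
```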
4.7 Experimental Results
In the experimental testbed, the NP-based capturing device is connected to
a high-end personal computer equipped with two Intel Xeon 2.8 GHz CPUs
(with hyper-threading enabled), 1 GByte of Rambus RAM and a 3COM
Gigabit Ethernet optical fiber network interface using the tg3 driver. The
installed operating system is Ubuntu Linux 7.04 with a 2.6.18 vanilla
kernel. Unfortunately, the tg3 driver, along with the majority of the drivers
for gigabit interfaces available for Linux, does not support the polling working
mode (NAPI). Nonetheless, the interrupt mitigation mechanism supported in
hardware by this 3COM interface proved to be suﬃcient to avoid the PC
livelock.
In order to perform packet capturing, a standard tcpdump and libpcap
distribution is used. Data streams are generated by a Spirent ADTECH AX/4000
hardware packet generator and analyzer.
In the ﬁrst experiment, a bulk traﬃc stream is generated and fed to the
personal computer either directly or through the NP. The main purpose of
this experiment is, on one hand, the evaluation of the processing overhead
introduced by the abstraction layer and, on the other hand, the evaluation of
the beneﬁts introduced by the packet batching operation performed in the NP
(due to the lower packet rate which means a lower rate of calls to the driver
function). The NP has been set up to mark all the traﬃc with the ﬂowID 3,
thus making it available through the mon3 virtual network interface on the
receiving PC.
The stream is captured in both cases using the tcpdump raw capturing
features. Hence, for the direct capture the command line is:
user@hostname# tcpdump -i eth4 -w file1
while for the NP-driven one the command line is:
user@hostname# tcpdump -i mon3 -w file2
The second experiment aims to show the capabilities of the system in
extracting and processing a mouse flow in the presence of an elephant one. Therefore
Figure 4.7: Packets rawly saved to trace ﬁle.
two ﬂows are involved: one (the mouse) from IP host 100.3.3.3 to 10.3.3.3
with TCP source port 100 and destination port 3357, the second (the ele-
phant) with a diﬀerent source port (3). The ﬁrst ﬂow is generated at a rate
of 50kpps, while the second one is generated at increasing packet rates. The
compound ﬂow is once again captured by the PC alone and through the NP.
The NP is conﬁgured to mark the mouse with ﬂowID 4 (available through
mon4 at the receiving PC). In both cases tcpdump is simply asked to decode
each packet and write it in a trace file (in a real context, this is the minimum
real-time packet processing). The command lines issued are the following:
user@hostname# tcpdump -nttv -i eth4 -w file1 src host 10.3.3.3
and dst host 100.3.3.4 and src port 100 and dst port 3357
while, in the NP-driven experiment:
user@hostname# tcpdump -nttv -i mon4 -w file1.
In both experiments, packets are minimum sized.
Fig. 4.7 shows that for the bulk stream capturing the NP-based system
outperforms the PC alone, meaning that the beneﬁt of a lower number of
calls to the driver is greater than the processing overhead introduced by the
abstraction layer.
Fig. 4.8 shows the full advantage obtained using the NP in flow
extraction. In this context the PC shows all its architectural flaws, losing a
huge number of packets, while the NP-based system performs this operation
with no loss.
Figure 4.8: Packets captured from the mouse ﬂow.
4.8 Final considerations
This chapter proposes a packet capturing system architecture which combines
the power of a Network Processor card with the flexibility of software-based
solutions; this brings wire-speed capturing capacity to applications such as
traffic monitors, intrusion detection systems and protocol analyzers.
The system has been tested and shown to overcome or alleviate a number
of limitations of standard PC traffic monitoring schemes. Moreover, it appears
to be scalable towards multi-gigabit environments by simply adopting a more
powerful NP.
Some refinements are planned for this architecture to overcome some design
limitations. The first limitation is the use of a batch frame regardless of
the fragment length configured by the user. Hence, if a user asks the system
to capture the entire length of the packets, the resulting batch frame is longer
than the maximum Ethernet frame every time a full-size packet is captured.
To overcome this problem, we plan to use oversized Ethernet frames handled
by modern Gigabit Ethernet NICs.
A second limitation is the use of a Gigabit Ethernet interface of the NP
board to send batch frames to the cluster; as a matter of fact, the number of
interfaces on an NP board is one of the most important sources of cost, and
interfaces should therefore be used sparingly. A less expensive choice would be
to use the PCI-X bus (which currently only gives power supply to the board) to
transmit batch frames to the host PC and make it forward them to the cluster
via a standard Gigabit Ethernet PCI-X NIC.
Finally, another limitation of the system is that it cannot currently classify
and duplicate packets to more than one flow in order to send a given packet
to more than one processing PC or application. A possible solution involves
upgrading the classification application (which must be able to handle
rules with multiple target flowIDs), the batch frame format (and hence the
PC-side abstraction layer) to carry multiple flowIDs per packet digest, and
the batch frame crafting application to make it capable of copying a fragment
across many packet digests (possibly with different fragment lengths).

Chapter 5
BRUNO: a high performance
traffic generator
Evaluating the performance of high speed networks is a critical task due to the
lack of reliable tools to generate traﬃc workloads at high rates. The current
open-source software tools (BRUTE, KUTE, RUDE) are not suitable to deal
with high-speed networks as they present poor performance in terms of
generated frames per second and low timing/rate accuracy in traffic generation.
These issues are due to the intrinsic limitations of the PC architecture, for
which these tools are designed. This chapter proposes a diﬀerent approach
based on the Intel Network Processor IXP2400. The design aims to keep the
high ﬂexibility of PC solutions while outperforming them in terms of band-
width/packet rates. Moreover BRUNO achieves excellent results in terms of
time/rate accuracy compared to other solutions based on Network Processors
and, of course, the PC-based tools.
5.1 The main idea
The last few years have been marked by a steady growth of the Internet,
which has triggered an increasing demand for reliable networks offering high
transmission capacity and quality of service. This dramatic rise has created
the need for network testing, in order to measure performance and find
any "weaknesses" of systems. The evaluation of modern networks, however,
is a very difficult task. In fact, given the high speed of current networks,
simulating their behavior (for example by means of tools such as the widely
used ns2) is not possible with the accuracy required for functional and
performance tests: the unavoidable simplifications introduced by simulations
show their limits and become unacceptable.
Moreover, in most cases network operators' equipment is not easily
accessible to collect the required information. Therefore, the only viable way
to test modern networks is emulation, by injecting synthetic traffic and
measuring the responses of such networks. Clearly, in order to have meaningful
results, it is necessary to generate and inject traffic very similar to actual
Internet traffic, in terms of high speeds and statistical properties. This is a
very complex issue, because of both the lack of software tools which generate
reliable IP traffic at high speeds and the high cost of hardware generators.
The current open-source software tools (KUTE [48], RUDE [49]) are not
suitable to deal with high-speed networks as they present poor performance
in terms of generated frames per second and low timing/rate accuracy in
reproducing traffic models. To overcome these limits, an accurate traffic
generator, called BRUTE, has been implemented by the TLCNetGroup of
the University of Pisa [50]. It runs on top of the Linux operating system and
traﬃc ﬂows up to very high bit-rates. BRUTE has a ﬂexible architecture and
an extensible design and provides a number of library modules implementing
common traﬃc proﬁles (like CBR, Poisson process and Poissonian Arrival of
Burst process).
Although BRUTE outperforms all the widespread software traffic generators
in terms of both achieved throughput and time precision, it is still
limited by PC capabilities in terms of sustainable bit rates. In fact, it is able
to generate a maximum traffic load of 400 Mbps, which makes it unsuitable,
for example, for evaluating gigabit networks at full rate. As already said, these
issues are due to the intrinsic limitations of the PC architecture, for which all
these tools are designed. Therefore, different solutions have to be devised.
The use of ﬂexible hardware platforms to improve performance and accu-
racy looks unavoidable, and network processors (NPs) appear as appealing
solutions for such purpose. Traﬃc generators based on NPs are presented
in [51] and [52]. Such papers inspired our work, but we look for an overall
improvement of performance, accuracy and traﬃc models reproduction.
Hence, this work illustrates the design of BRUNO (BRUte on Network
prOcessor), which is a traffic generator built on the IXP2400 Intel Network
Processor and based on BRUTE. In this design, BRUTE runs on the PC
that hosts the NP card and is in charge of computing departure times according
to the traffic models. Then the host PC writes this information in the
memory shared with the NP microengines, which in turn use such data to
generate packets and send them at the proper times. The code simulation
through the Developer Workbench has shown a sustainable rate of 2 Gbps
with high precision.
The next section presents previous work in the traffic generation area. In
particular, BRUTE and NP-based solutions are analyzed. Section 5.3 illustrates
the overall design of BRUNO, starting from a functional mapping
with BRUTE and then reporting the basic ideas which have guided the design. In
section 5.4 all the components of our application are accurately described,
along with the mechanisms of time correction and feedback for packet
transmission. Finally, section 5.5 explains the communication between BRUTE
and the network processor by means of the PCI bus and the issue of
synchronization.
5.2 Related Works
A number of open-source tools for traﬃc generation have been proposed over
the years, most of them designed for Linux boxes. In this section we brieﬂy
introduce some of the most used and powerful traﬃc generators based on PC,
FPGA and NP architectures. As for accuracy and precision, in [53], Paredes-
Farrera et al. study and compare some traﬃc generators running on top of a
standard Linux and on a Real-Time patched Linux. As intuition suggests, the
paper shows that best results are obtained on a real time system. Moreover
the authors provide a simple deﬁnition for precision and accuracy: the ﬁrst
is related to the quality and stability of a system that makes it possible to
create the same or very similar values or measurements, the latter measures
the similarity of the created values and the true values. Therefore, since we are
interested in traffic generators and their packet timelines, we refer to precision
as the standard deviation of generated timelines, while accuracy describes the
average error.
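Under these definitions, both figures can be computed from the per-packet difference between actual and target departure times, as in the minimal sketch below. The function name is hypothetical, and the use of the population standard deviation is a choice made for this example.

```python
from statistics import mean, pstdev


def timeline_quality(target_times, actual_times):
    """Return (accuracy, precision) of a generated packet timeline.

    accuracy:  average error between actual and target departure times
    precision: standard deviation of that error
    """
    errors = [a - t for a, t in zip(actual_times, target_times)]
    return mean(errors), pstdev(errors)
```

A generator that is always late by the same amount is therefore precise but not accurate, while one whose errors average out to zero but fluctuate widely is accurate but not precise.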
KUTE [48] (formerly known as UDPgen) is a UDP traffic generator designed
to achieve high performance over Gigabit Ethernet. It is a Linux kernel
module that operates directly on the network device driver, bypassing
the Linux kernel networking subsystem. This means that its architecture is
strictly related to the kernel, thus limiting its extensibility.
RUDE [49] and MGEN [54] are user-space tools. The former is able to
instantiate a number of simultaneous patterns of traﬃc, but provides a non-
extensible script language. Moreover, the software architecture of RUDE does
not provide any explicit support for extensible interfaces and is not suitable
to work at high rates, especially with small frames, as shown in [50]. MGEN
provides both a command line and a GUI for traﬃc generation in user-space.
It supports diﬀerent Unix-based Operating Systems such as FreeBSD, Linux,
NetBSD and Solaris, but its accuracy is limited because of the system timers it
employs (e.g., in the Linux kernel on PC-platforms the timer resolution is only
10ms [49]). The Internet Traﬃc Generator (ITG) [55, 56] aims to reproduce
TCP and UDP traﬃc and replicate appropriate stochastic processes for Inter
Departure Time and Packet Size. It is able to achieve performance comparable
to that of RUDE and MGEN but provides more traﬃc patterns.
The Browny and RobUst Traffic Engine (BRUTE) [50] takes advantage of
the Linux kernel potential in order to accurately generate traffic flows up to
very high bit rates. Because of its excellent flexibility, due to a simple script
language and an extensible architecture, it has been chosen as the basis for the
development of BRUNO. We present BRUTE in more detail in
subsection 5.2.1.
However, all the PC-based generators, even if well designed, are limited by
the capacity of the PC architecture. For instance, in a gigabit ethernet sce-
nario, the highest throughput achievable with BRUTE (1.09 Mfps) is reached
only in intermittent bursts.
As for other architectures, an Altera Stratix GX FPGA has been employed
by Abdo et al. [57] to develop an OC-48 traﬃc generator. This tool provides
high performance but it suﬀers from a lack of ﬂexibility and available traﬃc
models, also because of the diﬃculties in the deﬁnition of new models.
This work has been inspired by the need for a generator with the high
ﬂexibility of PC-based tools such as BRUTE and the high performance of
dedicated hardware instruments.
To the best of our knowledge, only two traﬃc generation tools have been
proposed on Network Processors, both for Intel® IXP2XXX NPs. Such tools
are reviewed in subsection 5.2.2.
5.2.1 BRUTE
In this subsection, the main features of BRUTE are illustrated. BRUTE
[50] exploits the capabilities of Linux kernels (2.4 - 2.6) to generate traffic
at high bit rates. This software is easily extensible thanks to the availability
of optimized functions that enable users to implement additional traffic
sources (named T-modules). The development of these modules
is made even easier by an interface (API) that allows them to be
defined in the C language.
BRUTE runs under the POSIX.1b FIFO real-time scheduling policy and has
been designed as a user-space application, in order to obtain high
flexibility at the expense of a slight loss of performance in terms of latency.
As shown in fig. 5.1, the parser reads script files that contain the traffic
generation requests of users. This information is then stored in an internal
database (called mod-line). The traffic engine examines the mod-line entries
and instantiates the proper traffic handlers (called micro-engines) defined in
the T-modules. All the micro-engines are executed sequentially to generate
the requested traffic.
Figure 5.1: Architecture of BRUTE.
As illustrated in the ﬁgure, the modular design involves a distributed
parser algorithm, in particular:
• the core parser handles the grammar and part of lexical tasks;
• micro-parsers distributed in the T-modules complete the lexical parsing.
BRUTE is currently maintained under CVS (Concurrent Versions System
[58]) and is available at [59] along with different traffic patterns: Constant
Bit Rate, Poisson, Poisson Arrival of Burst, constant inter-departure time,
trimodal Ethernet distribution and more.
The script language is organized as a list of statements, each
occupying a single line that consists of an optional label, a command identifier
and a sequence of parameters of the traffic class. A short example of the script
language is reported in the following:
lab: cbr msec=1000; rate=1000;
daddr=10.0.1.10; len=512;
This statement instructs the traffic engine to generate a 1 Kfps CBR traffic
flow with 512-byte frames for a duration of 1 s. When some parameters
are not specified, BRUTE uses default values. For instance, in the previous
example the destination IP address is given by the statement, while the
default value is used for the source address.
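The structure of such a statement can be illustrated with a small sketch. The parser below is not BRUTE's own (which is distributed between the core and the T-modules and written in C); the function name and the interpretation of `msec` as milliseconds and `rate` as frames per second are assumptions derived from the description above.

```python
def parse_statement(text):
    """Split '[label:] command key=value; ...' into label, command, params.

    Hypothetical helper for illustration only; not part of BRUTE.
    """
    text = " ".join(text.split())           # join wrapped lines
    label = None
    first, rest = text.split(" ", 1)
    if first.endswith(":"):                  # optional label
        label = first[:-1]
        command, rest = rest.split(" ", 1)
    else:
        command = first
    params = {}
    for field in rest.split(";"):
        field = field.strip()
        if field:
            key, value = field.split("=", 1)
            params[key.strip()] = value.strip()
    return label, command, params


stmt = "lab: cbr msec=1000; rate=1000;\n daddr=10.0.1.10; len=512;"
label, command, params = parse_statement(stmt)

# Assumed semantics: rate [frames/s] * duration [ms] / 1000 = frames to send.
frames = int(params["rate"]) * int(params["msec"]) // 1000
```

Under these assumptions the example statement corresponds to 1000 frames of 512 bytes each.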
5.2.2 Traffic generators on the IXP2400 Network Processor
IXPktgen [51] is a generator created at the University of Kentucky and
based on the Intel IXP2400. Given the lack of specific information about this
generator, a comprehensive analysis is not possible. Nevertheless,
a careful study of its source code shows that it employs 4 µEs (working in
8-thread mode) for traffic generation. Since each thread is statically
assigned a single flow, only 32 flows can be generated
at the same time. IXPktgen is developed in microcode assembly and can
generate any kind of Ethernet frame.
Pktgen [52] is a traffic generator implemented in the TNT laboratory
of the University of Genova. It is based on the Radisys ENP-2611 board
equipped with the Intel IXP2400 NP. It can generate CBR (Constant Bit
Rate) and burst traffic with high throughput, but its performance in
terms of jitter is uncertain. In its design, 5 µEs (working in 4-thread mode
with a single flow per thread) are in charge of traffic generation. Therefore
only 20 flows can be generated at the same time. The code is developed
in microC, whose compiler is not as optimized as the microcode assembler
(according to Intel's guidelines [60]).
5.3 BRUNO
The aim of BRUNO is to combine the flexibility of software-based generators
such as BRUTE with the throughput and accuracy achievable only by
hardware-assisted applications. Therefore our architecture exploits both
a general-purpose PC and a Radisys ENP-2611 PCI board equipped with the
Intel IXP2400 NP. The link between them, in terms of shared data structures,
is given by the DRAM and SRAM memories on the Radisys board, which are
accessible through the PCI bus. On the PC, a modified version of BRUTE (that
we simply call BRUTE in the following) generates departure times and packet
lengths according to the specifications of the traffic script file, and writes them
in the DRAM of the card. On the NP side, the µEs read these values from
DRAM and handle the actual creation of the packets to be sent.
5.3.1 BRUTE and BRUNO
Figure 5.2 depicts the architecture of our generator in a functional mapping
with BRUTE. As described in detail in the following, the user interface,
the overall parsing process and the creation of internal flow structures
are grouped and assigned to the host PC, while the remaining functions are
committed to the µEs of the IXP2400. In particular, the role of Traffic Engine
is covered by a µE named Load Balancer (LB), which is in charge of
reading the data written by BRUTE in DRAM and correcting packet
departure times. The task of actually creating packets is assigned to 4 µEs
named Traffic Generators (TGs).
Figure 5.2: Mapping BRUTE in BRUNO.
5.3.2 Design of BRUNO
As stated previously, one of the main limitations of NP-based traffic
generators is the maximum number of flows that can be generated
simultaneously. This is mainly due to the fixed association between flows
and µE threads. In order to overcome this issue, in BRUNO a given flow
is not bound to a particular thread, so that there is no fixed
maximum number of simultaneous flows. This is achieved by ensuring that
threads process packets regardless of the flows they belong to.
A second issue is the optimal use of the available processing resources. Indeed,
if each thread is in charge of a single flow, it is likely that some
threads work more than others, or even that all threads on a certain µE work
while other µEs just sleep. This is not desirable, since a high number of active
threads on the same µE could affect the timeliness of packets. In BRUNO the
LB µE guarantees an equal balance of load among TG µEs: it distributes the
packet generation requests among the various TGs in a round-robin fashion.
This way, even when only a single flow has to be generated, all the threads in the
TGs can work for it.
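The round-robin distribution can be sketched in a few lines. This is only an illustrative model: in BRUNO the queues are scratch rings and the code runs in µE microcode, while here plain Python lists stand in for the rings and the function name is hypothetical.

```python
from itertools import cycle


def distribute(requests, num_tgs=4):
    """Assign packet requests to Traffic Generators in round-robin order."""
    queues = [[] for _ in range(num_tgs)]    # one queue per TG (stand-in for a ring)
    targets = cycle(range(num_tgs))
    for request in requests:
        queues[next(targets)].append(request)
    return queues


# Ten requests of a single flow are spread over all four TGs.
queues = distribute([("flow0", seq) for seq in range(10)])
loads = [len(q) for q in queues]
```

Because requests rather than flows are assigned, the load stays balanced regardless of how many flows are active.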
In order to provide high accuracy in traffic generation, a time correction
mechanism is also introduced. It works by comparing the ideal departure
times (computed by the Load Balancer) with the real ones (measured in the
transmission stage) and by modifying the time information accordingly. A
feedback mechanism is included to collect the observed real transmission times.
All these considerations have led to the design of BRUNO, which is depicted
in fig. 5.3. In the diagram, µEs are represented by tagged boxes. The first µE
on the left (Load Balancer) reads data from DRAM and, after some operations,
forwards the information through a ring structure to the other µEs, called
Traffic Generators. The rings are small, fast circular FIFO queues, used by
all processing units to communicate. They are allocated in the scratchpad
memory of the IXP2400 [61] (the only shared embedded memory). Traffic
Generators finally send requests for packet transmission to the transmitter
(TX) µE. The transmitter, in turn, is connected to the Load Balancer through
a feedback ring (Timing Correction in the figure).
As for the storage of all the information, packet requests are kept in
DRAM, as previously stated, in a memory window that is continuously re-
freshed by the PC with new data. Traﬃc parameters (i.e.: source and desti-
nation L2 and L3 addresses and L4 ports) are kept in SRAM, because they
need to be accessed very frequently for the creation of each packet. More
details of PC-to-NP communications are provided in sec. 5.5.
5.4 Components of BRUNO
5.4.1 Load Balancer
As shown above, the Load Balancer µE first takes the data provided by
BRUTE in DRAM, then adjusts the packet departure time and finally forwards
the sending request to a Traffic Generator. Fig. 5.4 depicts the structure
of a packet generation request (hereafter simply "packet
request"). It is written in DRAM by the host PC which runs BRUTE and is
composed of two 32-bit longwords. The first one contains the interdeparture
time, while the second carries the packet size (16 bits), the pointer to the
SRAM location of the flow structure (Flow Index, 15 bits), and the type of
address (IPv4 or IPv6, 1 bit).
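The layout of a packet request can be made concrete with a short packing sketch. The field widths and bit positions follow fig. 5.4 (size in bits 31-16, Flow Index in bits 15-1, type in bit 0); the function names and the convention that a type bit of 1 means IPv6 are assumptions, since the text does not state the encoding.

```python
def pack_request(interdeparture, size, flow_index, is_ipv6):
    """Build the two 32-bit longwords of a packet request."""
    if not 0 <= interdeparture < (1 << 32):
        raise ValueError("interdeparture time must fit in 32 bits")
    if not 0 <= size < (1 << 16):
        raise ValueError("packet size must fit in 16 bits")
    if not 0 <= flow_index < (1 << 15):
        raise ValueError("flow index must fit in 15 bits")
    word0 = interdeparture
    # Assumed encoding: type bit 1 = IPv6, 0 = IPv4.
    word1 = (size << 16) | (flow_index << 1) | int(bool(is_ipv6))
    return word0, word1


def unpack_request(word0, word1):
    """Recover (interdeparture, size, flow_index, type_bit) from the longwords."""
    return word0, word1 >> 16, (word1 >> 1) & 0x7FFF, word1 & 1


words = pack_request(interdeparture=1000, size=512, flow_index=3, is_ipv6=False)
fields = unpack_request(*words)
```

With 15 bits for the Flow Index, up to 32768 flow structures can be addressed by this format.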
The threads in the Load Balancer µE are divided into two equal groups
that run completely different code.
Figure 5.3: Architecture of BRUNO.
Longword 0: Interdeparture Time (bits 31-0)
Longword 1: Packet Size (bits 31-16) | Flow Index (bits 15-1) | Type (bit 0)
Figure 5.4: Structure of packet request.
Even Threads
They are in charge of moving packet requests from DRAM to the local memory
of the µE. More precisely, packet departure times are converted from "relative"
to "absolute" and written, along with the second longword of the packet
request, in a circular FIFO queue in the local memory (hereafter this structure
is called "Modified Packet Request", MPR). The DRAM is read in blocks of
16 longwords (32 bits each), therefore 8 MPRs are processed at a time.
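The conversion from relative to absolute times amounts to a running sum over the interdeparture times. The sketch below is an illustrative model with hypothetical names; it ignores counter wrap-around and the circular queue management of the real microcode.

```python
def block_to_mprs(block, last_departure):
    """Turn one DRAM block (16 longwords = 8 packet requests) into 8 MPRs.

    block alternates (interdeparture time, second longword); each MPR holds
    the absolute departure time plus the untouched second longword.
    """
    mprs = []
    for i in range(0, len(block), 2):
        last_departure += block[i]            # relative -> absolute
        mprs.append((last_departure, block[i + 1]))
    return mprs, last_departure


# Eight requests, each 100 ticks after the previous one.
block = [100, 0x02000006] * 8
mprs, last = block_to_mprs(block, last_departure=0)
```

The returned `last` value is carried over to the next block so that absolute times remain continuous across blocks.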
Odd Threads
They read MPRs from local memory, adjust their departure times, and then
forward them to Traffic Generators according to a defined scheduling policy.
The correction of packet departure times (notice that they are absolute
times, since they have been read from local memory) takes place in two
successive steps:
• check the feedback scratchring;
• check the temporal gap between two successive MPRs.
In the first step, a thread checks whether the feedback scratchring contains any
new information about sent packets. This scratchring is filled by the
transmission µE with the measured packet interdeparture times, each marked with
the sequence number of the second packet of the pair, which acts as an unambiguous
identifier. The interdeparture time is used in a simple correction algorithm
based on an exponential moving average:
$$\varphi_n = A \cdot \varphi_{n-1} + B \cdot [\hat{\Delta}_{n-k} - \Delta_{n-k}] \qquad (5.1)$$
where $\varphi_n$ represents the correction applied to the $n$-th packet departure time,
$\hat{\Delta}_{n-k}$ is the measured interdeparture time of the $(n-k)$-th and $(n-k-1)$-th
packets taken from the feedback scratchring and $\Delta_{n-k}$ is the ideal
interdeparture time (kept in local memory) of the same pair of packets. The term
$k$ takes into account the feedback and system delay. In fact, when the Load
Balancer is working on the $n$-th MPR, there is a certain number of MPRs
in the TGs and on the rings; moreover, the feedback mechanism is obviously
not instantaneous. The corrected departure time $X_n$ is then computed
by subtracting the correction term $\varphi_n$ from the original departure time $\tau_n$:
$$X_n = \tau_n - \varphi_n$$
For simplicity and because of the lack of floating point support in the µE
microcode, we have set $A = 3/4$ and $B = 1/4$ (these values are easily obtained by
means of bit-shift operations). This way, possible spurious delay peaks
caused by a transitory change of state of the system (e.g., a fast change in the
traffic to generate) are softened.
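A minimal integer-only sketch of this update, with A = 3/4 and B = 1/4 realized through shifts, is given below. It is an illustration rather than the BRUNO microcode: the function names are hypothetical, the feedback delay k is not modelled, and the rounding behaviour of the shifts is an assumption.

```python
def update_correction(phi_prev, measured_gap, ideal_gap):
    """One step of phi_n = 3/4 * phi_{n-1} + 1/4 * (measured - ideal)."""
    error = measured_gap - ideal_gap
    # 3/4 * phi is computed as phi - phi/4; divisions by 4 are right shifts.
    return (phi_prev - (phi_prev >> 2)) + (error >> 2)


def corrected_departure(tau, phi):
    """X_n = tau_n - phi_n."""
    return tau - phi


# A constant error of 8 ticks (measured 108 against ideal 100).
phi = 0
history = []
for _ in range(7):
    phi = update_correction(phi, measured_gap=108, ideal_gap=100)
    history.append(phi)
```

In this run the correction grows gradually and settles at the steady error of 8 ticks, which shows how isolated delay peaks are smoothed instead of being applied at once.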
After the correction, the second step consists of checking the temporal
gap between the current MPR and the previous one. This step is required
because the corrected departure time may be physically unachievable. This
happens if $X_n < X_{n-1} + l_n/c$, where $l_n$ is the packet length, $c$ is the line
data rate (1 Gbps in this case) and $l_n/c$ is therefore the transmission delay of the
packet. In this case, we set $X_n = X_{n-1} + l_n/c$.
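The gap check can be sketched as a simple clamp. For readability the sketch expresses times in nanoseconds and lengths in bytes, which is an assumption (the microcode works in timestamp ticks); it also ignores preamble and inter-frame gap, and the names are hypothetical.

```python
LINE_RATE_BPS = 1_000_000_000  # 1 Gbps line, as in the text


def transmission_delay_ns(length_bytes, rate_bps=LINE_RATE_BPS):
    """l_n / c expressed in nanoseconds."""
    return length_bytes * 8 * 1_000_000_000 // rate_bps


def enforce_gap(x_n, x_prev, length_bytes):
    """Return X_n, raised to X_{n-1} + l_n/c when it would be unachievable."""
    earliest = x_prev + transmission_delay_ns(length_bytes)
    return max(x_n, earliest)


too_early = enforce_gap(x_n=5000, x_prev=2000, length_bytes=512)
unchanged = enforce_gap(x_n=9000, x_prev=2000, length_bytes=512)
```

A 512-byte packet occupies the 1 Gbps line for 4096 ns, so a departure requested only 3000 ns after the previous one is moved forward, while a later one is left untouched.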
Load Balancer's communications
It may happen that the Load Balancer has to deal with packets that must
be sent by the Traffic Generators too far in the future. Since the timestamp
counter in the TGs is limited to 16 bits, the LB should not send to TGs
MPRs scheduled more than $2^{16}$ ticks ahead in the future. Therefore
odd threads stop when the difference between the time written in the MPR
and the present time is greater than a parameter (AWAITING_THRESHOLD),
which has to be set at the start of the BRUNO application. This parameter can
be at most $2^{16}$ ticks, and the timestamp counter is increased every 16 clock
cycles. Since the µE clock frequency is set to 600 MHz:
$$1\,\text{tick} = \frac{16}{600 \cdot 10^6}\,\text{s} \approx 26.7\,\text{ns}$$
The choice of the AWAITING_THRESHOLD must be carefully considered.
In fact, low values lead to an under-utilization of the Traﬃc Generators that
may not respond properly to abrupt changes in the traﬃc. On the other hand,
high values of this parameter can easily saturate the scratch rings between
the Load Balancer and Traﬃc Generators.
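The arithmetic behind these limits can be checked with a few lines. The constants come from the text (600 MHz clock, one tick every 16 cycles, 16-bit counter); the variable and function names are only illustrative.

```python
UE_CLOCK_HZ = 600_000_000   # µE clock frequency
CYCLES_PER_TICK = 16        # timestamp counter increments every 16 cycles
COUNTER_BITS = 16           # width of the timestamp counter in the TGs

tick_seconds = CYCLES_PER_TICK / UE_CLOCK_HZ
max_threshold_ticks = 1 << COUNTER_BITS
max_horizon_seconds = max_threshold_ticks * tick_seconds


def us_to_ticks(microseconds):
    """Convert a whole number of microseconds into timestamp ticks."""
    return microseconds * (UE_CLOCK_HZ // 1_000_000) // CYCLES_PER_TICK


one_ms_in_ticks = us_to_ticks(1000)
```

One tick lasts about 26.7 ns, so the largest admissible AWAITING_THRESHOLD of 65536 ticks corresponds to a scheduling horizon of roughly 1.75 ms; a 1 ms horizon, for instance, is 37500 ticks.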
5.4.2 Traﬃc Generators
In the current design of BRUNO (fig. 5.3), 4 µEs are used for packet creation.
Each thread belonging to a Traffic Generator processes one packet at a time, by
taking the corresponding information (i.e., the MPR) from the scratch ring
between Load Balancer and Traffic Generator. In particular, the thread
determines the flow to which the packet belongs through the field INDEX_FLOW.
It is an identifier that also provides an offset to find the SRAM location of
the structure which describes the flow. During initialization, we allocate in
SRAM one structure (shown in fig. 5.5) per flow to be generated.
Longword 0: Output_Port (bits 31-23) | Protocol (bits 22-15) | TOS (bits 14-7) | Ind_Type (bits 6-5) | Res (bits 4-0)
Longword 1: Source_Port (bits 31-16) | Destination_Port (bits 15-0)
Longword 2: Index (bits 31-24) | Total (bits 23-16) | SRC_ADDs (bits 15-0)
Longword 3: Index (bits 31-24) | Total (bits 23-16) | DEST_ADDs (bits 15-0)
Figure 5.5: Structure for a flow.
Output_Port indicates the physical output port for the flow. Protocol,
TOS, Source_Port, and Destination_Port provide the corresponding fields of
the layer 3-4 packet headers. SRC_ADDs and DEST_ADDs point to two SRAM
locations which contain a list of source and destination addresses respectively.
Total provides the number of these addresses, while Index indicates the next
address to be taken if the choice is made linearly. In fact, addresses
can be chosen from the list in two different ways: randomly or
linearly. In the former case, a random number generator (embedded inside
each µE) indicates the address to be taken from the list, while in the latter
the address is given by Index, which is then incremented by one. The
two bits of Ind_Type indicate the selection mode for source and
destination addresses (which can be IP or MAC addresses).
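The two selection modes can be sketched as follows. The class and method names are hypothetical, and the wrap-around of Index when it reaches Total is an assumption: the text only states that Index is incremented by one.

```python
import random


class AddressList:
    """Model of one address list with its Index and Total fields."""

    def __init__(self, addresses):
        self.addresses = list(addresses)   # list pointed to by SRC_ADDs / DEST_ADDs
        self.total = len(self.addresses)   # Total field
        self.index = 0                     # Index field

    def next_linear(self):
        """Linear mode: take the address at Index, then advance Index."""
        address = self.addresses[self.index]
        self.index = (self.index + 1) % self.total   # assumed wrap-around
        return address

    def next_random(self, rng=random):
        """Random mode: a random number selects the address."""
        return self.addresses[rng.randrange(self.total)]


sources = AddressList(["10.0.1.1", "10.0.1.2", "10.0.1.3"])
linear_picks = [sources.next_linear() for _ in range(4)]
random_pick = sources.next_random()
```

In linear mode the addresses are used cyclically, whereas in random mode Index is left untouched.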
The structures related to flows are loaded into SRAM by BRUTE in the
initialization phase, according to the user settings. From these data, a thread
is able to create packet metadata and layer 2-3-4 headers. Then the thread
is put to sleep until the time to send the packet arrives. At this
point, the thread wakes up and sends the packet transmission request to the
transmission block, which takes care of transmitting the packet.
5.4.3 Transmitter
The last stage of the generator is the transmitter. The code used is the
one provided by Intel®, as it is optimized for transmission. The
feedback system for time correction previously described has been added. It
has been placed before the last state of the transmission process, in order to
measure "real departure times" as faithfully as possible.
5.5 Communication between BRUTE and NP
The communication between the BRUTE application, which is in charge of
generating the "packet requests", and the Network Processor, where the traffic
generation actually takes place, makes use of the PCI bus. This bus
provides support for fast communication, thus allowing real-time interaction
between the Network Processor and the application running on the host PC.
In particular, both the DRAM and SRAM memory banks on the board are
accessible through the local PCI bus, which, in turn, is connected to the
PCI bus of the host PC through the Intel 21555 non-transparent PCI-to-PCI
bridge [62]. Since the address plans of the two buses are different (this
is the main reason why non-transparent bridging is used), address translation
is implemented on this device. Up to three non-overlapping intervals in the
host PC PCI address space (called downstream windows) can be configured to
be translated to corresponding address intervals in the Radisys PCI address
space: every time the 21555 bridge receives a transaction referring to an
address falling into one of the downstream windows, it maps this address to the
corresponding address of the Radisys PCI bus and forwards the transaction
over it. Symmetrically, three upstream windows in the Radisys PCI
address space can be defined in order to forward transactions from the board
bus to the host PC bus.
Different address translation methods are provided by the bridge, but the
simplest and most efficient one is direct base translation: a downstream
(or upstream) memory window is defined by a base address, and address
translation is performed by simply replacing this base with a corresponding
translated base, which defines an address region on the target bus. Since the
base length is variable, the size of an address window can be defined by the
user: in general the window size can range from 4 kB up to 2 GB,
thus making it possible to completely map each memory bank of the Radisys board.
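Direct base translation can be illustrated with a small numeric sketch. It models only the address arithmetic, not the 21555 registers; the function name and the example addresses are hypothetical, and the bases are assumed to be aligned to the window size so that replacing the base is equivalent to the offset arithmetic used here.

```python
def translate(address, window_base, window_size, translated_base):
    """Map an address inside a window to the corresponding target-bus address.

    Returns None when the address does not fall into the window, i.e. when
    the bridge would not claim the transaction.
    """
    if not window_base <= address < window_base + window_size:
        return None
    offset = address - window_base          # part of the address kept as-is
    return translated_base + offset         # base replaced by translated base


# Hypothetical 16 MB downstream window.
hit = translate(0xF0001234, window_base=0xF0000000,
                window_size=0x01000000, translated_base=0x20000000)
miss = translate(0xE0000000, window_base=0xF0000000,
                 window_size=0x01000000, translated_base=0x20000000)
```

Only the base bits change, so the offset of a location inside the window is the same on both buses.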
The memory translation map can be configured by setting
some control registers of the 21555 bridge; in our implementation this
is done by a Linux kernel module which is inserted in the host PC operating
system. After the initialization of this module, both the SRAM and the
DRAM memory banks are accessible as PCI resource regions by the host PC
operating system and can be read and written using system calls referring
to memory-mapped I/O. In order to offer a simple interface to user
applications, our module registers two virtual character devices in the Linux kernel,
which are associated with the Radisys DRAM and SRAM banks respectively.
These devices support the mmap access method [63], which allows
registering a direct binding between a statically defined physical address
region and a user-space virtual address region; when a user process accesses a
virtual address falling into the mapped area, the virtual memory management
part of the kernel directly converts it to the corresponding physical address.
This allows a user process to directly access the resources associated with a
device, without any data buffering in kernel memory. This mechanism,
which is generally used for accessing high-performance devices such as
graphics cards, provides the maximum speed for accessing peripheral devices,
since no data copying is required.
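The access pattern enabled by mmap can be sketched from user space as follows. This is not the BRUTE code: a zero-filled temporary file stands in for the DRAM character device (whose name is not given in the text), the helper names are hypothetical, and the little-endian layout is an assumption. With a real device an explicit mapping length would normally be passed instead of 0.

```python
import mmap
import os
import struct
import tempfile


def write_longwords(path, offset, words):
    """Write 32-bit longwords at a byte offset through a memory mapping."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mem:
            struct.pack_into("<%dI" % len(words), mem, offset, *words)


def read_longwords(path, offset, count):
    """Read `count` 32-bit longwords back through a read-only mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mem:
            return struct.unpack_from("<%dI" % count, mem, offset)


# Stand-in for the device: a 4 kB file filled with zeros.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 4096)
    window_path = tmp.name

write_longwords(window_path, 8, [1000, 0x02000006])   # one packet request
values = read_longwords(window_path, 8, 2)
os.remove(window_path)
```

Once the mapping exists, storing a packet request is an ordinary memory write, with no read/write system call per request.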
Figure 5.6: Address Translation.
By accessing the two character devices, BRUTE can both configure the
parameters defining each generated traffic flow (which are stored in SRAM)
and set the interdeparture time and length of each packet, by writing the
corresponding data structure in DRAM. During the actual traffic generation
process, the corresponding DRAM region must be accessed both by the
BRUTE process, which has to add metadata referring to new packets, and
by the traffic generator running on the Network Processor, which reads these
metadata and generates the outgoing traffic. In order to coordinate
the two processes, the DRAM region is accessed as a ring buffer
composed of fixed-size blocks of metadata structures; both the host CPU and
the NP access these blocks in round-robin order. Synchronization by
interrupts is not used, because of the considerable and variable latencies
involved in such a mechanism.
5.5.1 Synchronization
In operational conditions, the DRAM area containing the packet requests
must be accessed both by the µEs and by the host CPU through the PCI
communication mechanism described in the previous section. Therefore we
need a mechanism to cope with the simultaneous presence of
readers and writers on the same memory area. In particular, each packet
request must be written by the host CPU and read by the NP only once.
For this reason a synchronization mechanism has to be implemented
between the NP and the host CPU. We choose a method that does not rely on
the classic interrupt-based solutions typically used for PCI devices,
because of the variable and possibly long latencies involved in such
schemes. Instead, we adopt a polling-based mechanism in which
the CPU takes care of the whole synchronization task.
The rationale of this choice is that the data in the buffer can be written
at a rate that is significantly higher than the one at which they are read by
the NP. Indeed the mean rate at which the data structures in the buffer are
read must be of the same order as the packet sending rate (otherwise, in the
long run, the internal buffers of the LB µE, where the packet requests
are temporarily stored, would overflow). Since, even at full rate, fewer than a
few million packets per second can be sent by the traffic generator, an average
PC can easily fill the buffer faster than it is emptied. Besides, in order to limit
the performance degradation caused by context switching, the BRUTE
process can be assigned a higher priority in the Linux process scheduling,
thus guaranteeing that other tasks interfere only marginally with the traffic
generation. As a consequence, there is no need for the NP to wait for the
CPU, while it is likely that the CPU has to stop and wait for the NP to read
the data in the buffer. In addition, the code running on the NP is optimized
for maximum performance, and implementing waiting mechanisms there could lead
to a major performance degradation.
Since the DRAM window containing the packet requests is cyclically read
by the Network Processor, it can be considered as a circular buffer. Transferring
a large block of data over the PCI bus (and from/to the DRAM) is, in
terms of overall delay, more profitable for the host CPU than moving small
amounts of packet requests at a time. Therefore the circular buffer
is partitioned into blocks containing a given number of data structures (for
instance 8); each time either the host CPU or the NP accesses the buffer, a
whole block of data structures is read or written. Fig. 5.7 depicts the DRAM window
divided into blocks (arranged in a circular FIFO queue).
Figure 5.7: DRAM window circular buffer.
The NP keeps in its SRAM a pointer to the last block it has read and, in
turn, the CPU maintains in its own memory a pointer to the last block it has
written. Before performing a write operation, the CPU reads both pointers
to check whether the buﬀer is full. In such a case, the CPU enters a busy
waiting status for a certain waiting time and, after that, it checks the pointers
again.
We avoid a polling mechanism that continuously reads the pointer
to the last read block, because that would imply a huge number of accesses to
the NP SRAM which, in turn, could delay NP accesses to that
memory and lead to performance degradation.
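The pointer-based check described above can be modelled with a small block ring. The sketch is an assumption-laden illustration: class and method names are hypothetical, one block is deliberately left unused to distinguish a full buffer from an empty one, and the NP side is given an explicit emptiness check even though, as explained above, in practice the NP is not expected to wait.

```python
class BlockRing:
    """Circular buffer of blocks with 'last written' and 'last read' pointers."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.blocks = [None] * num_blocks
        self.last_written = 0   # kept by the host CPU in its own memory
        self.last_read = 0      # kept by the NP in SRAM, polled by the CPU

    def is_full(self):
        # Writing the next block would catch up with the reader.
        return (self.last_written + 1) % self.num_blocks == self.last_read

    def write(self, block):
        """CPU side: returns False when the caller should wait and retry."""
        if self.is_full():
            return False
        self.last_written = (self.last_written + 1) % self.num_blocks
        self.blocks[self.last_written] = block
        return True

    def read(self):
        """NP side: returns the next unread block, or None if there is none."""
        if self.last_read == self.last_written:
            return None
        self.last_read = (self.last_read + 1) % self.num_blocks
        return self.blocks[self.last_read]


ring = BlockRing(num_blocks=4)
accepted = [ring.write("block%d" % i) for i in range(4)]
first = ring.read()
retry = ring.write("block3")
```

The fourth write is refused because the buffer is full; after the NP consumes one block, the retried write succeeds.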
The waiting time must be accurately calculated so as to prevent the NP from
reading the whole buffer while the CPU is waiting. Since, as already pointed out,
the average buffer reading rate must be equal to the average packet sending
rate of the NP (otherwise the internal buffers of the µEs would overflow or
become empty), a good estimate of this waiting time can be computed by
the host CPU as
$$T_{delay} = \rho \times \frac{B}{R} \qquad (5.2)$$
where $B$ is the overall number of packet requests that can be contained in the
buffer, $R$ is the average packet rate produced by the generator (known by the
host PC) and $\rho$ is an arbitrary safety parameter smaller than one (if $\rho = 1$
then $T_{delay}$ is the time it takes to empty the whole buffer). For very low values
of $\rho$ the CPU floods the SRAM with read requests, thus affecting precision.
On the other hand, if $\rho \simeq 1$ then the CPU may not be able to fill the buffer
properly and the system would not be able to respond in time to abrupt changes of
the generated traffic. Preliminary experimental results (limited to the
PCI communication) seem to confirm that setting $0.1 \le \rho \le 0.4$ is a good
choice.
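As a worked example of (5.2), the waiting time can be computed for a hypothetical configuration. The buffer size of 8192 packet requests and the rate of one million packets per second are illustrative values, not measurements from this work.

```python
def waiting_time(buffer_requests, packet_rate, rho=0.25):
    """T_delay = rho * B / R, in seconds."""
    if not 0 < rho < 1:
        raise ValueError("rho is a safety parameter between 0 and 1")
    return rho * buffer_requests / packet_rate


# Hypothetical values: B = 8192 requests, R = 1 Mpps, rho = 0.25.
t_delay = waiting_time(buffer_requests=8192, packet_rate=1_000_000, rho=0.25)
```

With these values the whole buffer would be drained in about 8.2 ms, so the CPU re-checks the pointers after roughly 2 ms, well before the NP can run out of data.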
The implementation of the busy waiting mechanism on the host PC
relies on the real-time capabilities included in the original BRUTE.
It provides busy waiting functions which, by actually counting the CPU clock
cycles and by taking advantage of the Linux process scheduling policies, make
it possible to set a waiting period with fairly good accuracy.
Chapter 6
Smart data structures for NPs
This chapter illustrates a series of smart data structures for set representation
and membership query, which are useful in devices provided with a limited
amount of memory, such as Network Processors. Memory saving, which is a
paramount issue in NP applications, led us towards the design of different
structures and algorithms with many appealing properties.
All these solutions are based on principles of Bloom Filters, which are
eﬃcient randomized data structures for membership queries on a set with a
certain known false positive probability.
6.1 The main idea
Nowadays streamed data processing is a basic problem in many areas related
to computer applications. In particular, detecting whether an item belongs
to a set is one of the most challenging tasks, especially when the amount of
data to be processed per unit of time is very large and rapidly changes.
A Bloom Filter (BF) is a simple data structure for information representation
and query processing. It is a randomized method based on hash
functions; thus it allows false positives, but the space savings often outweigh
this drawback. BFs were introduced by Burton Bloom [64] in the 1970s for
database applications, but recently they have also received great attention
in the networking area [65], for collaboration in overlay and peer-to-peer
networks, packet routing, and measurements. BFs are also proposed for many
distributed networking protocols: for example, in order to share web caches,
proxies periodically broadcast BFs that represent the contents of their caches.
In these situations, BFs are not only data structures but also messages being
transmitted in a network. Thus, several performance parameters have to be
taken into account in designing BFs: the probability of false positives, memory
size, number of items to be managed and transmission size.
Standard BFs cannot handle sets that change over time, with elements being
inserted and deleted: insertion only sets bits, but
deletion cannot be done by simply changing ones into zeros, as
a single bit may correspond to multiple elements. In order to allow these
operations, Counting Bloom Filters (CBFs) have been designed [66]. They
are based on the same idea as BFs, but they use fixed-size counters (also
called bins) instead of single presence bits. When an item is inserted, the
corresponding counters are incremented; deletions can then be safely done by
decrementing the counters. CBFs present the problem of counter overflow,
which has to be considered in the design.
This chapter first presents a new upper bound for the counter overflow
probability in CBFs. This bound is much tighter than the one commonly used in
the literature and it is useful for our design of efficient solutions. Then, several
new data structures are proposed, which exploit the bound and improve CBFs
in terms of fast access and limited memory consumption. The target could
be the implementation of the compressed data structures in the small but
fast local memory or on-chip SRAM of devices such as Network Processors
(NPs).
6.2 Background on Bloom Filters
A Bloom Filter represents a set $S$ of $n$ elements from a universe $U$ by using an
array of $m$ bits, denoted by $B[1], \ldots, B[m]$, initially all set to 0. The filter uses
$k$ independent hash functions $h_1, \ldots, h_k$ with $\log_2(m)$-bit output, which
independently map each element in the universe to a random number uniformly
distributed over the range. For each element $x$ in $S$, the bits $B[h_i(x)]$
are set to 1, for $1 \le i \le k$ (a bit can be set to 1 multiple times).
To answer a query of the form "Is $y$ in $S$?", we check whether all $B[h_i(y)]$
are set to 1. If not, $y$ is not a member of $S$, by construction. If all
$B[h_i(y)]$ are set to 1, it is assumed that $y$ is in $S$, hence a BF may yield a false
positive. The probability of a false positive $f$ can be tuned by choosing
proper values for $m$ and $k$. It is a well-known result [66] that the minimum
$f$ is obtained for $k = (m/n) \ln 2$. In this configuration, each bit $B[1], \ldots, B[m]$
is set with probability $p = 1/2$ (thus, roughly, the same number
of ones and zeros is present in the BF).
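The construction can be illustrated with a minimal sketch. It is not the implementation used in this work: the k hash functions are emulated by salting SHA-256 with the function index (an assumption made for portability, not for speed), one byte is used per bit for clarity, and all names are illustrative.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)            # B[0..m-1], one byte per bit

    def _positions(self, item):
        # Emulate k independent hash functions by prefixing the index i.
        for i in range(self.k):
            digest = hashlib.sha256(b"%d:" % i + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(item))


def optimal_k(m, n):
    """Number of hash functions minimizing f: k = (m/n) ln 2, rounded."""
    return max(1, round(m / n * math.log(2)))


k = optimal_k(m=1024, n=100)
bf = BloomFilter(m=1024, k=k)
bf.add(b"10.0.1.10")
present = b"10.0.1.10" in bf
```

For m = 1024 bits and n = 100 elements the formula gives about 7.1, hence 7 hash functions; an inserted element is always reported as present.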
Figure 6.1: A Bloom Filter.
Many works about BFs have been presented, and the major improvements
are compressed BFs [67], distance-sensitive BFs [68], dynamic BFs [69], space-
code BFs [70].
As previously stated, BFs do not support the deletion of items from
the set. Therefore, CBFs have been introduced, which use $m$ fixed-size bins
instead of $m$ single presence bits (see fig. 6.2). When an item is inserted
(or deleted), the corresponding counters are incremented (or decremented).
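A counting variant can be sketched in the same spirit. Again this is only an illustration with hypothetical names: hash functions are emulated with salted SHA-256, and the handling of full counters (saturate on insertion, never decrement a saturated counter) is one common convention assumed here, not a choice stated in the text.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, m, k, counter_bits=4):
        self.m, self.k = m, k
        self.max_count = (1 << counter_bits) - 1
        self.counters = [0] * m
        self.overflows = 0                   # number of saturated increments

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(b"%d:" % i + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            if self.counters[p] < self.max_count:
                self.counters[p] += 1
            else:
                self.overflows += 1          # counter overflow: value is lost

    def remove(self, item):
        for p in self._positions(item):
            # A saturated counter is left alone: its true value is unknown.
            if 0 < self.counters[p] < self.max_count:
                self.counters[p] -= 1

    def __contains__(self, item):
        return all(self.counters[p] > 0 for p in self._positions(item))


cbf = CountingBloomFilter(m=64, k=3)
cbf.add(b"flow-1")
cbf.add(b"flow-1")
cbf.remove(b"flow-1")
still_present = b"flow-1" in cbf
cbf.remove(b"flow-1")
gone = b"flow-1" not in cbf
```

After two insertions and one deletion the item is still reported; after the second deletion all its counters return to zero.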
However, CBFs present the problem of counter overflow, which has to
be considered in the design. Although for most network applications
four-bit counters are sufficient [65], the counter load varies
dramatically across bins (according to Poisson arrivals [66]), suggesting that
four bits for every bin is a safe choice and that a certain amount of compression is
achievable. Moreover, with a fixed number of bits the problem of counter
overflow in CBFs is not completely solved, which results in a lack of adaptiveness
and in inaccuracy of the stored information.
In order to overcome these limitations and obtain better performance, many
improvements to CBFs have been proposed. Mitzenmacher [67] shows that
unbalancing the number of ones and zeros in a standard BF can help achieve
a good compression ratio before transmission (e.g., for web-caching
applications). This way, by keeping the same number of bits as in the uncompressed
case, it is possible to either reduce the false positive probability or use a lower
number of hash functions.
Figure 6.2: A Counting Bloom Filter.
Spectral Bloom Filters (SBFs) [71] are an extension of standard BFs to
multi-sets, allowing estimates of the multiplicities of individual items with a
small error probability. The word "Spectral" means that SBFs allow only the
filtering of elements whose multiplicities are within a requested spectrum
(therefore they do not protect bins from overflow in a conclusive way). The main
goal of SBFs is optimal counter space allocation, so they dynamically vary
the size of their counters in order to minimize the number of necessary bits. To
achieve this flexibility, SBFs include additional slack bits among the counters
and complex index structures, which increase both memory needs and access
time as compared to standard CBFs. Finally, SBFs introduce techniques for
filter compression based on Elias codes, which reduce the transmission size of
the data structures but further increase the processing load.
Dynamic Count Filters (DCFs) [72] are data structures designed for speed
and adaptiveness in a very simple way. They do not require the use of
indexes, thus obtaining a fast access time, and permanently avoid counter
overflow. DCFs consist of two different vectors: the first one is a basic CBF
with counters of fixed size, the second one is the Overflow Counter Vector,
which has a counter for each element of the first vector that keeps track of the
number of overflow events. The size of the counters in the Overflow Counter Vector
changes dynamically to avoid saturation; this implies that, for each update,
the structure has to be rebuilt. Moreover, the decision of having the same
size for all these counters (for direct access) entails that many bits are not
used. Therefore, this solution can be improved, especially in terms of memory
consumption.
The d-left CBFs (dlCBFs) [73] are a simple alternative based on d-left
hashing and fingerprints. They do not rely on the principles of Bloom
Filters, but they offer the same functionality. The dlCBFs use less space,
generally saving a factor of two or more for the same fraction of false positives,
and the construction is very simple and practical, much like the original Bloom
Filter construction. Indeed, the simplicity in constructing and maintaining
the data structures is perhaps the greatest contribution of [73] as compared with
previous works. However, even dlCBFs have the limitation of potential
counter overflow and need an additional fingerprint for each bin in
the data structure.
Memory utilization is the parameter given the most attention in this work. As
previously mentioned, there are several cases where network bandwidth is still
expensive and transmission size becomes a fundamental parameter (e.g., Web
cache sharing or P2P applications). Moreover, although memory appears
plentiful today, many hardware architectures used in network devices (e.g.,
Network Processors) can benefit from very space-efficient data structures, in
terms of both performance and cost. Indeed, saving memory can greatly speed up
a device by making accesses to slower off-chip memory rare; further, while
ordinary DRAM is cheap, fast SRAM and especially on-chip SRAM continue to be
comparatively scarce. All these issues have guided our research, whose target
was an efficient and practical data structure for CBFs.
6.3 The new upper bound of Overflow Probability
In this section, a new bound for counter overflow probability in CBFs is
presented.
The following classical result [65] on CBF gives a bound on the overflow
probability $P(\varphi \ge j)$ that is widely adopted to design the bin size:

$$P(\varphi \ge j) \le \left(\frac{e\,nk}{jm}\right)^{j} \tag{6.1}$$

However, (6.1) is pretty loose; the next theorem presents a tighter bound for
$P(\varphi \ge j)$.
Lemma 1. Let $\varphi$ be a CBF counter value and $\alpha = nk/(m-1)$. If
$\alpha < 1$, the function $\chi(j) = P(\varphi = j)$ is a monotonically
decreasing function.
Proof: The probability of the event $\{\varphi = j\}$, for $j \ge 1$, is given [65] by:

$$\chi(j) = \binom{nk}{j}\left(\frac{1}{m}\right)^{j}\left(1-\frac{1}{m}\right)^{nk-j}$$
The ratio between two consecutive values is:

$$\frac{\chi(j+1)}{\chi(j)} = \frac{nk-j}{j+1}\,\frac{1}{m-1} < \frac{\alpha}{j+1} < 1 \tag{6.2}$$
which gives the proof. For $k = (m/n)\ln 2$, $\alpha = (m \ln 2)/(m-1)$, which
is less than 1 for $m > (1-\ln 2)^{-1} \approx 3.26$. In CBFs, this condition
is always satisfied, since $m \gg 1$.
Theorem 1. Let $\varphi$ be a CBF counter value and $\alpha = nk/(m-1)$. If the
number of hash functions is chosen so as to minimize the probability $f$ of
false positive (i.e., $k = (m/n)\ln 2$), then:

$$P(\varphi \ge j) < \frac{\alpha(j+1)}{j(j+1-\alpha)}\,P(\varphi = j-1)$$
Proof: By repeatedly applying eq. 6.2:

$$P(\varphi \ge j) = \sum_{i=j}^{+\infty} P(\varphi = i) < P(\varphi = j)\sum_{i=0}^{+\infty}\frac{j!\,\alpha^{i}}{(j+i)!} \tag{6.3}$$
The right-hand sum of (6.3) can be bounded as:

$$\sum_{i=0}^{+\infty}\frac{j!\,\alpha^{i}}{(j+i)!} < \sum_{i=0}^{+\infty}\left(\frac{\alpha}{j+1}\right)^{i} = \frac{j+1}{j+1-\alpha}$$
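which, together with $P(\varphi = j) < (\alpha/j)\,P(\varphi = j-1)$ from (6.2),
finally gives:

$$P(\varphi \ge j) < \frac{\alpha(j+1)}{j(j+1-\alpha)}\,P(\varphi = j-1) \tag{6.4}$$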
Corollary 1. Under the previous results, if $\alpha < 1$:

$$P(\varphi > j) < P(\varphi = j-1) \tag{6.5}$$

Proof: From (6.3), by changing the lower limit of the series from 0 to 1,
we obtain:

$$P(\varphi > j) < \frac{\alpha^{2}}{j(j+1-\alpha)}\,P(\varphi = j-1) \tag{6.6}$$

Then, considering that $j \ge 1$, $\alpha^{2}/(j(j+1-\alpha)) < 1$.
Lemma 1 allows us to approximate $P(\varphi = 0)$ and $E[\varphi]$:

$$1 = \sum_{j=0}^{+\infty} P(\varphi = j) \simeq P(\varphi = 0)\sum_{j=0}^{\infty}\frac{\alpha^{j}}{j!} = P(\varphi = 0)\,e^{\alpha} \tag{6.7}$$
Figure 6.3: Bounds comparison. P ′b is always tighter than Pb.
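Then $P(\varphi = 0) = e^{-\alpha}$. As for the expectation of $\varphi$ we get:

$$E[\varphi] = \sum_{j=0}^{+\infty} j\,P(\varphi = j) \simeq P(\varphi = 0)\sum_{j=0}^{\infty} j\,\frac{\alpha^{j}}{j!} = \alpha \tag{6.8}$$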
If the CBF minimizes $f$, $E[\varphi] \simeq \ln 2 \approx 0.693$, which is a
very tight approximation in several cases.
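As a quick numerical check of (6.7) and (6.8), the following Python sketch compares the exact binomial values of $P(\varphi = 0)$ and $E[\varphi]$ with the approximations $e^{-\alpha}$ and $\alpha$. The function name `counter_stats` and the choice n = 2000, k = 10 (the values used later in the simulations) are assumptions made for illustration only.

```python
import math

def counter_stats(n: int, k: int, m: int):
    """Exact P(phi = 0) and E[phi] for one bin, next to the approximations (6.7)-(6.8)."""
    alpha = n * k / (m - 1)
    p0_exact = (1 - 1 / m) ** (n * k)   # none of the nk hash outputs lands in the bin
    mean_exact = n * k / m              # mean of Binomial(nk, 1/m)
    return p0_exact, math.exp(-alpha), mean_exact, alpha

n, k = 2000, 10
m = round(n * k / math.log(2))          # number of bins that minimizes f
p0, p0_approx, mean, mean_approx = counter_stats(n, k, m)
print(p0, p0_approx, mean, mean_approx)
```

With these parameters the exact and approximated values should agree to several decimal places, which is what the text means by a tight approximation.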
Figure 6.3 shows a comparison of the bounds for n = 1000, k = 10 and
m = nk/ln 2. $P$ is the actual $P(\varphi \ge j)$, $P_b$ is the well-known
bound (6.1), while $P'_b$ is the bound provided by theorem 1. The smaller graph
zooms in on the neighborhood of j = 2. It is interesting to see that our bound
can be much tighter than the widely used one (eq. 6.1). For instance, eq. 6.1
yields $P(\varphi > 15) \le 1.37 \times 10^{-15}$, while our bound produces
$P(\varphi > 15) < 1.51 \times 10^{-16}$, a gain of an order of magnitude.
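The comparison can be reproduced with a few lines of Python. The sketch below is only an illustration under stated assumptions: the counter is modelled as Binomial(nk, 1/m), m is rounded to an integer, and the helper names (`pmf`, `exact_tail`, `classical_bound`, `new_bound`) are ours. It is not the code used to draw fig. 6.3, and the printed values need not coincide digit by digit with those quoted in the text.

```python
import math

def log_pmf(j, trials, m):
    """log P(phi = j) for phi ~ Binomial(trials, 1/m), computed in log space."""
    return (math.lgamma(trials + 1) - math.lgamma(j + 1) - math.lgamma(trials - j + 1)
            - j * math.log(m) + (trials - j) * math.log1p(-1 / m))

def pmf(j, trials, m):
    return math.exp(log_pmf(j, trials, m))

def exact_tail(j, trials, m, terms=200):
    """P(phi >= j), truncated after `terms` summands (the remainder is negligible)."""
    return sum(pmf(i, trials, m) for i in range(j, min(j + terms, trials) + 1))

def classical_bound(j, n, k, m):
    """Bound (6.1)."""
    return (math.e * n * k / (j * m)) ** j

def new_bound(j, n, k, m):
    """Bound of theorem 1, eq. (6.4)."""
    alpha = n * k / (m - 1)
    return alpha * (j + 1) / (j * (j + 1 - alpha)) * pmf(j - 1, n * k, m)

n, k = 1000, 10
m = round(n * k / math.log(2))
for j in (2, 4, 8, 16):
    print(j, exact_tail(j, n * k, m), new_bound(j, n, k, m), classical_bound(j, n, k, m))
```

For every j, the three printed columns should appear in increasing order: exact tail, bound of theorem 1, classical bound.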
6.4 MultiLayer Hashed CBF (ML-HCBF)
6.4.1 The algorithm
The basic idea of the first data structure we propose is to add a stack of
hash-based vectors (HBVs) to the usual CBF vector (CBFV). To the best of our
knowledge, although a limited degree of hierarchy is sometimes obtained by
adding a CAM [74] or another counter [72], this is the first attempt to
introduce a hierarchy of arrays in CBFs, which results in a MultiLayer
structure where counters may span different levels.
The target of our data structure is to improve CBFs by avoiding counter
overflow and reducing memory needs. The drawback, as we will see below, is a
very slight increase in complexity for the insertion/deletion of an element.
The probability f of false positives, instead, does not change, because it
depends on the first CBF vector only.
Let us denote by $x_0$ and $m_0$ the bit-size and the number of bins in the
CBFV, respectively, while $x_1, \ldots, x_N$ and $m_1, \ldots, m_N$ (all of
them powers of 2) represent the bit-size and the number of bins of the N
vectors $HBV_1, \ldots, HBV_N$. These values are set to obtain a low
probability of counter overflow. Basically, we define a counter value as the
sum of its CBFV element and its potential corresponding elements in the HBV
stack.
A set of k + N hash functions $h_i$ is needed. The first k functions (which
address the CBFV) have an output length of $\log_2(m_0)$ bits, while the other
N (one for each HBV) provide $\log_2(m_i)$ bits (where i = 1, ..., N) and
must be minimal and perfect. This requirement increases the operational
complexity whenever an element has to be inserted into or deleted from the
sets involved in the perfect hashing. However, the construction cost of the
minimal perfect hash functions (MPHFs) we need is limited, since the
cardinality ($m_1, \ldots, m_N$) of the sets involved is small.
When an element s has to be inserted or deleted, the current counter $Cc_i$
(with i = 1, ..., k) for each of the first k hash functions must be found.
Therefore, $CBFV[h_i(s)]$ is considered first: if it is already saturated
(i.e., it has reached $2^{x_0} - 1$), the counter position is hashed. The
obtained value is used to address a counter in $HBV_1$. If that counter is
also saturated, the same procedure is repeated: the position in $HBV_j$ is
hashed to obtain the index for $HBV_{j+1}$, until a non-saturated counter
($Cc_i$) is found. Hence, the actual value of the counter is the sum of
$CBFV[h_i(s)]$ and all the explored $HBV_j$ elements. The overall process of
finding the current counters is shown in Algorithm 5 and in the simple example
of fig. 6.4. In the example, $x_0 = 3$ bits and $x_1 = x_2 = 2$ bits. The hash
function $h_i$ gives 4 as outcome: CBFV[4] is saturated, therefore the counter
position (4) is hashed and the output is 6. $HBV_1[6]$ is also saturated, so
its position is hashed: the outcome is 1. $HBV_2[1]$ is the current counter
$Cc_i$ for s. Its actual value is 1 + 3 + 7 = 11.

HBV2:  0  1  0  1
          ⇑  h_{k+2}(6) = 1
HBV1:  1  3  2  1  0  2  3  0
                         ⇑  h_{k+1}(4) = 6
CBFV:  4  2  1  6  7  0  4  1  5  1  3  · · ·
                   ⇑  h_i(s) = 4
Figure 6.4: Process of finding a current counter for ML-HCBF.
Algorithm 5 Pseudocode for finding current counters
for i ← 1, k do
    pos ← h_i(s)
    j ← 0
    Cc_i ← CBFV[pos]
    while Cc_i = 2^{x_j} − 1 and j < N do
        j ← j + 1
        pos ← h_{k+j}(pos)
        Cc_i ← HBV_j[pos]
    end while
end for
To insert an element, each current counter $Cc_i$ (where i = 1, ..., k) is
incremented by one. To delete an element, $Cc_i$ is decremented instead; if it
is already equal to 0, the corresponding counter of the previous layer is
decremented.
To perform a membership query for an element s, we check whether all
$CBFV[h_i(s)]$ are non-zero. This is exactly the same simple procedure as in a
standard CBF.
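The insertion, deletion and query rules above can be sketched in Python as follows. This is a toy model under explicit assumptions, not the implementation evaluated in this chapter: the class name `MLHCBF` is ours, the k hash functions are derived from SHA-256, and the minimal perfect hash functions are replaced by plain dictionaries keyed by the CBFV position of the saturated bin, so the compact size of the HBVs is not reproduced.

```python
import hashlib

class MLHCBF:
    """Toy ML-HCBF: a CBFV plus a stack of overflow layers (HBVs)."""

    def __init__(self, m0, k, bits=(2, 2, 3, 3)):
        self.m0, self.k = m0, k
        self.caps = [(1 << x) - 1 for x in bits]   # saturation value 2^x_j - 1 per layer
        self.cbfv = [0] * m0
        # Stand-in for the MPHFs (assumption of this sketch): one dict per HBV.
        self.hbv = [dict() for _ in bits[1:]]

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m0

    def _get(self, layer, pos):
        return self.cbfv[pos] if layer == 0 else self.hbv[layer - 1].get(pos, 0)

    def _set(self, layer, pos, value):
        if layer == 0:
            self.cbfv[pos] = value
        else:
            self.hbv[layer - 1][pos] = value

    def _current_layer(self, pos):
        """Index of the first non-saturated layer (len(caps) if none is left)."""
        for layer, cap in enumerate(self.caps):
            if self._get(layer, pos) < cap:
                return layer
        return len(self.caps)

    def value(self, pos):
        """Actual counter value: sum over the CBFV and all HBV layers."""
        return sum(self._get(layer, pos) for layer in range(len(self.caps)))

    def insert(self, item):
        for pos in self._positions(item):
            layer = self._current_layer(pos)
            if layer == len(self.caps):
                raise OverflowError("all layers saturated")
            self._set(layer, pos, self._get(layer, pos) + 1)

    def delete(self, item):
        for pos in self._positions(item):
            layer = min(self._current_layer(pos), len(self.caps) - 1)
            if self._get(layer, pos) == 0:
                layer -= 1                         # fall back to the previous layer
            if layer < 0:
                raise KeyError(item)
            self._set(layer, pos, self._get(layer, pos) - 1)

    def __contains__(self, item):
        # Membership only looks at the first layer, as in a standard CBF.
        return all(self.cbfv[pos] > 0 for pos in self._positions(item))

f = MLHCBF(m0=16, k=1, bits=(3, 2, 2))   # same bit sizes as the example of fig. 6.4
for _ in range(11):
    f.insert("s")
pos = next(f._positions("s"))
print([f._get(layer, pos) for layer in range(3)], f.value(pos))
```

Inserting the same element 11 times with bit sizes (3, 2, 2) fills the layers with 7, 3 and 1, mirroring the sum 1 + 3 + 7 = 11 of the worked example.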
6.4.2 Properties
If the number of HBVs is $N$, the number of bins in the CBFV is $m_0$ (set to
minimize $f$) and the number of bins in $HBV_j$ is $m_j = \alpha_j m_0$ (where
$\alpha_j$ is a negative power of 2, as hashing reduces the range of values for
each layer):
• the maximum supported value for a counter is

$$\varphi_{max} = \sum_{j=0}^{N} 2^{x_j} - N - 1$$

• the average size of the overall data structure is

$$E[S] = m_0\left(x_0 + \sum_{j=1}^{N} x_j\,\alpha_j\right)$$

where $\alpha_j = 2^{\lceil \log_2 P(\varphi > \varphi_{j-1}) \rceil}$.
6.4.3 Operational Complexity
As stated above, a membership query for an element s is performed in the same
way as in standard CBFs: we check whether all $CBFV[h_i(s)]$ (where
i = 1, ..., k) are non-zero. Therefore k hash operations are required and the
lookup complexity is O(1).
The complexity of insertion and deletion is almost the same. We use k hash
functions for the CBFV and one for each HBV we reach, which results in the
following average number ω of hash operations in the ordinary case:

$$\omega = k\left(1 + \sum_{j=1}^{N} P(\varphi > \varphi_j)\right)$$
Finally, a counter has to be incremented or decremented.
On the other hand, the cost of insertions/deletions increases because of the
complexity of the minimal perfect hashing scheme, which must cope with a
change in its set. This complexity depends on the number of elements to be
hashed, which is the number of saturated bins ($m_{j+1}$ at layer j). However,
the overall cost must be weighted by the probability of a transition toward a
saturated bin. Even if the cost of the perfect hashing [75] at layer j is
$O(m_{j+1})$, in the average case the increment is negligible. Only the MPHF of
the top layer of the counter must change during insertions/deletions, and the
probability of reaching a saturated counter at layer j is simply $m_j/m_0$.
Thus the increment of complexity due to the MPHFs is:

$$\Delta\omega = O\left(\frac{1}{m_0}\sum_{j=1}^{N} m_j^{2}\right)$$
The average total number of operations is very close to k, provided that
the data structure is designed to have low overflow probabilities (see tab.
6.1). Moreover, if the number of layers is bounded by N, the number of
operations in the worst case is constant and equal to k(1 + N). Therefore, the
operational complexity of inserting/deleting an element is also O(1).
If the data structure is well designed (i.e., it has a low overflow
probability), it is worth observing that the more levels there are, the larger
$\varphi_{max}$ is, and thus the smaller the overall overflow probability
$P(\varphi > \varphi_{max})$, with almost the same number of operations.
6.4.4 Simulation results
In this section, in order to better understand how the parameters affect the
behavior of the data structure, simulation results on the memory usage of
ML-HCBF compared with standard CBF are shown. In the simulation runs, a set
of 2000 elements has been generated to test the different data structures. For
the CBF and for the first layer of the ML-HCBF, 10 hash functions are used and
the number of bins $m_0$ is selected so as to minimize the probability of false
positives (according to [66]).
For each run (i.e., an entry in tab. 6.1), we have set a configuration for
ML-HCBF (i.e., the number of layers and the size of the bins per layer) and
calculated the overflow probability. Then, by theorem 1, we have found the
number of bits per bin required in a standard CBF to obtain the same overflow
probability. Therefore, the results shown in tab. 6.1 refer to structures with
the same probabilities of counter overflow and false positives: the memory
saving of ML-HCBF compared with standard CBF is clear.
Finally, notice that the size of ML-HCBF depends almost exclusively on the
number of bits of the first level. The best memory savings are obtained by
using 2 bits for the CBFV bins. Further, as shown by the first three entries in
the table, increasing the number of levels in ML-HCBF leaves memory consumption
nearly unchanged, while the overflow probability decreases and the gain with
respect to CBF increases.
Table 6.1: Data Structures Comparison. Sizes are expressed in KBytes.
(x0, ..., xN)               P(φ > φmax)      ML-HCBF   CBF     ratio
(2, 2, 2, 2)                1.7 × 10^-7      7.55      11.7    0.64
(2, 2, 2, 2, 2, 2)          3.8 × 10^-11     7.55      13.03   0.58
(2, 2, 2, 2, 2, 2, 2, 2)    9.8 × 10^-15     7.55      14.08   0.53
(2, 2, 3, 3)                1 × 10^-15       7.55      14.08   0.53
(2, 2, 4)                   2 × 10^-20       7.55      15.7    0.48
(2, 3, 3)                   1.5 × 10^-15     7.6       14.08   0.54
(3, 4, 4)                   2 × 10^-23       10.56     14.96   0.7
6.5 Huﬀman Spectral Bloom Filters
6.5.1 Theoretical basis
We start by presenting the theoretical basis of Huffman Spectral Bloom Filters.
Theorem 2. Let H(σ) be the Huﬀman coding of σ, len(.) the bit-length
operator, ϕ a CBF counter value; then:
len(H(ϕ)) = ϕ+ 1
Proof: Huﬀman codes can be obtained by using a binary tree. The tree
is constructed from a list of N nodes (symbols) whose weights correspond to
the symbol probabilities.
The whole procedure is the following:
• let x and y be the two nodes with the lowest weight;
• x and y are aggregated into a parent node whose weight is set to the
sum of the two nodes;
• the parent node replaces x and y in the list.
These steps are repeated until the list contains one node only.
To perform Huffman coding of CBF bin counters, we first construct a
tree whose nodes $X_0, \ldots, X_N$ correspond to the possible values of the
counters $j = 0, \ldots, N$; the weight of the j-th node is set to
$P(\varphi = j)$. Let $L_\tau$ be the list of nodes at step $\tau$ and let
$X^{\tau}$ be the parent node to be created at this step. Suppose we have
$L_\tau = \{X_0, X_1, \ldots, X_{N-\tau-1}, X^{\tau-1}\}$; the weight of the
parent node $X^{\tau-1}$ created at the previous step is
$P(X^{\tau-1}) = P(j > N-\tau-1)$. By corollary 1:

$$P(X^{\tau-1}) < P(j = N-\tau-2)$$

Moreover, the previous inequality also implies that $P(X^{\tau-1})$ is smaller
than any of the values in the set $\{P(X_0), \ldots, P(X_{N-\tau-2})\}$. Then,
at step $\tau$, the nodes with the smallest weights are $X^{\tau-1}$ and
$X_{N-\tau-1}$, and they are aggregated into the parent node $X^{\tau}$. Thus:

$$L_\tau = \{X_0, \ldots, X_{N-\tau-2}, X_{N-\tau-1}, X^{\tau-1}\} \Rightarrow L_{\tau+1} = \{X_0, \ldots, X_{N-\tau-2}, X^{\tau}\}$$

The resulting tree turns out to be completely unbalanced (i.e., the depths
of all N nodes are given by the sequence of the first N naturals), such as the
one in fig. 6.5. Therefore, the depth of node $\varphi$ is $\varphi + 1$, i.e.,
the encoding of the value $\varphi$ is $\varphi + 1$ bits long.

Figure 6.5: A Huffman tree for the CBF bin counters.
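Theorem 2 can be checked numerically with a small Python sketch. It builds a Huffman code over the counter values 0, ..., N, weighting them with the Poisson approximation of section 6.3 ($\alpha = \ln 2$), and returns the code length of each value. The function name `huffman_lengths`, the truncation N = 10 and the use of the Poisson weights instead of the exact binomial ones are assumptions made for illustration.

```python
import heapq
import math

def huffman_lengths(weights):
    """Code length of every symbol for a Huffman code built on `weights`."""
    heap = [(w, sym, [sym]) for sym, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    tie = len(weights)                   # unique tie-breaker for merged nodes
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:        # every merge adds one bit to its leaves
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, tie, syms1 + syms2))
        tie += 1
    return lengths

alpha, N = math.log(2), 10
weights = [math.exp(-alpha) * alpha**j / math.factorial(j) for j in range(N + 1)]
lengths = huffman_lengths(weights)
print(lengths)
```

Each counter value j receives j + 1 bits. The only exception is the largest value N, which shares the deepest level of the tree with N − 1 because the alphabet is truncated.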
6.5.2 The algorithm
In order to introduce our second method, the Huffman Spectral Bloom Filter
(HSBF), we start by recalling Spectral Bloom Filters [71]. They use a
memory-efficient structure that encodes every bin with Elias coding. As a
result, the bins do not have a fixed position and, for each of the k hash
functions, we have to find the bin it points to by scanning a certain number of
words. A lookup thus implies decoding a number of bins until the right one is
found. Moreover, each insertion and deletion implies a potential shift of the
whole structure.
To simplify these operations, SBFs divide the entire structure into
subsegments and use a set of tables to aid the lookup. In addition, a certain
number ε of empty bits (called slack bits) are inserted to reduce shift
operations for insertions and deletions. The Elias compression scheme is a
perfect choice when dealing with large numbers, such as those of multiset
membership query applications. However, for smaller values (recall that in a
regular CBF, 16 is widely considered a loose upper bound), other codings can
perform better.

w_i:     0111011010011011      (popcount = 10)
w_{i+1}: 1101001011001 · · ·
⇓
16 − 10 = 6 symbols in w_i
Figure 6.6: Example of fast lookup through popcount.
Our proposal is to encode a number σ with σ consecutive ones and a
trailing zero (fig. 6.5). This way, the encoding produces σ + 1 bits for symbol
σ: it is a Huffman coding, as shown in theorem 2. This is a major advantage,
since Huffman is the minimum-redundancy coding for independent symbols
such as the bins of a CBF.
Moreover, our coding scheme allows an easy lookup, since most processors
provide an instruction that counts the number of bits set to one in a word
(popcount). By taking advantage of such an instruction, we do not have to
decode each value we find during lookup, but simply count the number of
cleared bits in a word. The number of cleared bits is the number of symbols
encoded in that word (see the example in fig. 6.6). Clearly, we still have to
perform a shift for each insertion or deletion and we need a table to speed up
the lookup, but the total size of the structure is very close to the minimum
(given by the entropy of all symbols).
On this basis, we have defined our second method, whose features are reported
hereafter.
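The coding and the popcount-based lookup can be illustrated with a short Python sketch. Bit strings stand in for machine words and `str.count("1")` plays the role of the popcount instruction; the function names, the 8-bit word size of the usage example and the padding with ones are assumptions of this sketch, and the block tables and slack bits of the real structure are left out.

```python
def encode(counters):
    """Each counter sigma becomes sigma ones followed by a terminating zero."""
    return "".join("1" * c + "0" for c in counters)

def symbols_in_word(word):
    """Number of counters that end inside `word`, i.e. its cleared bits."""
    return len(word) - word.count("1")

def read_counter(words, index):
    """Value of the counter number `index`, skipping whole words by popcount only."""
    skipped = start = 0
    for word in words:
        zeros = symbols_in_word(word)
        if skipped + zeros >= index:     # the wanted counter starts in this word
            break
        skipped += zeros
        start += len(word)
    stream = "".join(words)
    pos = start
    while skipped < index:               # step over the remaining terminators
        if stream[pos] == "0":
            skipped += 1
        pos += 1
    value = 0
    while stream[pos] == "1":            # unary decoding, may cross a word boundary
        value += 1
        pos += 1
    return value

counters = [0, 3, 2, 1, 0, 2, 5, 1, 0, 4]
stream = encode(counters)
stream += "1" * (-len(stream) % 8)       # pad with ones so no spurious symbol appears
words = [stream[i:i + 8] for i in range(0, len(stream), 8)]
print(symbols_in_word("0111011010011011"), [read_counter(words, i) for i in range(len(counters))])
```

The first printed number is the symbol count of the word shown in fig. 6.6 (16 bits, popcount 10, hence 6 symbols).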
6.5.3 Size
We divide the bins into B blocks of D bins and we address the blocks through a
table. The average size of the HSBF is:

$$E[S] = m(1 + E[\varphi]) + B\left(\varepsilon + \log_2\left[(m-D)(\varphi_{max}+1)\right]\right)$$

where ε is the number of slack bits kept at the end of each block. The last
part of the above formula takes the table size into account. The table is
addressed by the first $\log_2 B$ bits of the hash, while the remaining bits
represent the bin index. Each entry of the table holds the starting address of
the corresponding block, which is smaller than $(m-D)(\varphi_{max}+1)$ and
therefore requires at most $\log_2[(m-D)(\varphi_{max}+1)]$ bits.
6.5.4 Lookup
As for operational complexity, a lookup requires k hash functions and, for
each of them, a check in the table and a search in the corresponding block for
the bin we need (see fig. 6.7). On average, D/2 bins have to be scanned and
$W/(E[\varphi] + 1)$ bins are found in a word of W bits. The overall average
number of operations for a lookup is then:

$$\omega = k\left(\frac{D(E[\varphi]+1)}{2W}\right)$$

As shown in section 6.3, $E[\varphi] \simeq \ln 2$. Therefore, the average
number of operations for a lookup is constant and its complexity is O(1).
6.5.5 Insertion/Deletion
In order to insert a new element, we need to perform a lookup and to add a
1 digit to the code of each of the k bins. This corresponds to shifting all the
bits to the right of the bin by one position and updating the table. Thus, for
all insertions, the number of operations is:

$$\omega = k\left(\frac{D(E[\varphi]+1)}{W}\right)$$

It is straightforward to see that, since deletion also requires a lookup and
a shift, its overall cost is the same as that of insertion. The complexity of
these operations is O(1), as for lookup.
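To give a feeling for the magnitudes involved, the size and cost formulas of the HSBF can be evaluated directly. In the Python sketch below, m and B follow the simulation settings of sec. 6.7, while the word size W = 32, the slack ε = 8 bits and $\varphi_{max}$ = 15 are purely illustrative assumptions; the function names are ours.

```python
import math

LN2 = math.log(2)   # E[phi] when the CBF minimizes f (section 6.3)

def hsbf_size_bits(m, B, eps, phi_max, mean=LN2):
    """Average HSBF size E[S] in bits, with D = m / B bins per block."""
    D = m / B
    table_entry = math.log2((m - D) * (phi_max + 1))
    return m * (1 + mean) + B * (eps + table_entry)

def hsbf_lookup_ops(k, D, W, mean=LN2):
    """Average number of word operations for a lookup."""
    return k * D * (mean + 1) / (2 * W)

def hsbf_update_ops(k, D, W, mean=LN2):
    """Average number of word operations for an insertion or a deletion."""
    return k * D * (mean + 1) / W

m, B, k, W = 28854, 64, 10, 32
D = m / B
size_kb = hsbf_size_bits(m, B, eps=8, phi_max=15) / 8 / 1024
print(size_kb, hsbf_lookup_ops(k, D, W), hsbf_update_ops(k, D, W))
```

By construction an update costs twice a lookup, and the table term is small compared with the $m(1 + E[\varphi])$ bits taken by the encoded counters.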
6.6 MultiLayer Compressed CBF
The drawbacks of the algorithm and data structure described in sec. 6.5, as
well as of SBFs, are the memory wasted on slack bits, the complexity of a
lookup performed by searching (even if aided by index tables), and the need
for a shift at each insertion or deletion.
Figure 6.7: An example of HSBF.
In the following, the MultiLayer Compressed Counting Bloom Filter (ML-CCBF) is
presented: a CBF that reduces both the memory requirements and the complexity
of lookup. The idea is, again, to expand the CBF along another dimension,
hence creating a multilayer structure. This construction, in conjunction with
the Huffman coding defined in sec. 6.5, provides a stack of bitmaps
($L_0, \ldots, L_N$), where the first layer $L_0$ is a standard BF. The other
layers are built and modified dynamically when needed.
Let popcount(u) be the number of 1s in the bitmap (0, ..., u − 1); the
construction is as follows:
• $L_i$ keeps all the i-th binary digits of our Huffman-encoded counters;
• on $L_i$, the j-th bit belongs to the counter whose popcount on $L_{i-1}$ is j.
Figure 6.8 shows an example of ML-CCBF, in which we read the counter φ of the
bin for symbol σ. The bin at layer 0 is pointed to by the hash function h(σ).
The number of ones before h(σ) is computed (i.e., popcount(h(σ)) = 5) and
used as the index for layer 1. The procedure is repeated until we find a 0
digit (that is, the end of the code). Therefore the resulting Huffman code for
the counter is 1110, which corresponds to the value 3.
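The walk through the layers can be replayed in Python on the bitmaps of fig. 6.8. In this sketch the layers are plain bit strings, the position h(σ) = 8 is inferred from the five ones that precede it in the figure, and the function names are ours; the per-layer index tables of the real structure are omitted.

```python
def popcount(bits, u):
    """Number of ones in bits[0 .. u-1]."""
    return bits[:u].count("1")

def read_code(layers, position):
    """Collect the Huffman code of one counter by climbing the layers."""
    code, u = "", position
    for layer in layers:
        bit = layer[u]
        code += bit
        if bit == "0":                   # a zero terminates the code
            return code
        u = popcount(layer, u)           # index of the next digit, one layer up
    raise ValueError("code is not terminated inside the given layers")

# Visible prefixes of the layers L0..L3 of fig. 6.8.
layers = ["10111001101", "11010100", "0101", "00"]
code = read_code(layers, 8)              # h(sigma) = 8 is an assumption, see above
print(code, len(code) - 1)
```

The visited indices are 8, 5, 3 and 1, as in the figure, and the counter value is the number of ones in the code.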
6.6.1 Complexity and properties
One of the most significant advantages of our algorithm is that it is an
extension of a standard BF. Thus, the lookup is as simple and fast as in a
standard BF, as we need to check only bits at layer 0. Therefore the lookup
complexity is O(1).

L3:  0 0 · · ·                        ⇑ index 1
L2:  0 1 0 1 · · ·                    ⇑ index 3 (1 one before it)
L1:  1 1 0 1 0 1 0 0 · · ·            ⇑ index 5 (3 ones before it)
L0:  1 0 1 1 1 0 0 1 1 0 1 · · ·      ⇑ index h(σ) (5 ones before it)
Figure 6.8: ML-CCBF example. The resulting Huffman code for φ is 1110.
For insertion and deletion, instead, we need to explore different layers of
the structure. We refer to $m_i$ as the number of bits in layer i. The size of
layer i can be obtained as:

$$m_i = m_0\,P(\varphi \ge i)$$

Since jumping one layer up requires a popcount on a potentially large
number of bits, we divide all layers into blocks of the same bit-size D and
add a table for each level. When computing popcount($u_j$) at layer j, the
first $\log_2(m_j/D)$ bits of $u_j$ are used as the index into table j. Each
entry of the table holds the number of ones preceding the start of the block.
Thus, if W is the number of bits in a word, the actual popcount operation works
on fewer than D/W words. Therefore, the average cost of a popcount is
$1 + \frac{D}{2W}$.
Algorithms 6 and 7 show the pseudocode for the insertion and deletion
procedures in ML-CCBF. Both operations require, for all k bins, the complete
lookup of the multiplicity (by exploring a certain number of layers), a shift
by one position and the update of the last explored table. Such an update
simply consists of an increment or a decrement of a limited number of entries.
Therefore the average number of operations for insertion and deletion is given
by:

$$\omega = k\left[E[\varphi]\left(1 + \frac{D}{2W}\right) + 2\right]$$

Once again, $E[\varphi] \simeq \ln 2$, thus the average number of operations is
fixed and the complexity of insertion/deletion is O(1).
Algorithm 6 The insertion of an element in a ML-CCBF
for i ← 1, k do
    j ← 0
    u_0 ← h_i(s)
    while L_j(u_j) = 1 do
        u_{j+1} ← popcount(u_j)
        j ← j + 1
    end while
    L_j(u_j) ← 1
    u_{j+1} ← popcount(u_j)
    j ← j + 1
    L_j(u_j + 1, ..., m_j + 1) ← L_j(u_j, ..., m_j)
    m_j ← m_j + 1
    L_j(u_j) ← 0
    UpdateTable(L_j)
end for
Algorithm 7 The deletion of an element in a ML-CCBF
for i ← 1, k do
    j ← 0
    u_0 ← h_i(s)
    while L_j(u_j) = 1 do
        u_{j+1} ← popcount(u_j)
        j ← j + 1
    end while
    L_j(u_j, ..., m_j) ← L_j(u_j + 1, ..., m_j + 1)
    m_j ← m_j − 1
    L_{j−1}(u_{j−1}) ← 0
    UpdateTable(L_j)
end for
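A compact way to experiment with Algorithms 6 and 7 is the following Python sketch, which keeps each layer as a list of bits. It is a simplified model under explicit assumptions: it works on a single bin position (the k hash functions are left out), popcounts are computed by summing the list rather than through per-block tables, and the class and method names are ours.

```python
class MLCCBF:
    """Toy ML-CCBF: layer j holds the j-th digit of every Huffman-coded counter."""

    def __init__(self, m0):
        self.layers = [[0] * m0]         # L0 is a plain Bloom filter bitmap

    def _popcount(self, j, u):
        return sum(self.layers[j][:u])   # ones before position u on layer j

    def count(self, pos):
        """Counter value of bin `pos`: number of ones met before the first zero."""
        j, u, value = 0, pos, 0
        while self.layers[j][u] == 1:
            u = self._popcount(j, u)
            j += 1
            value += 1
        return value

    def increment(self, pos):
        j, u = 0, pos
        while self.layers[j][u] == 1:
            u = self._popcount(j, u)
            j += 1
        self.layers[j][u] = 1            # the old terminator becomes a one
        v = self._popcount(j, u)
        if j + 1 == len(self.layers):
            self.layers.append([])       # layers are created on demand
        self.layers[j + 1].insert(v, 0)  # shift and write the new terminator

    def decrement(self, pos):
        j, u, prev = 0, pos, None
        while self.layers[j][u] == 1:
            prev = (j, u)
            u = self._popcount(j, u)
            j += 1
        if prev is None:
            raise ValueError("counter is already zero")
        del self.layers[j][u]            # drop the terminator (shift left)
        self.layers[prev[0]][prev[1]] = 0  # the last one becomes the terminator

f = MLCCBF(8)
for pos in [3, 3, 5, 3, 0, 5, 7, 3]:
    f.increment(pos)
print([f.count(pos) for pos in range(8)])
```

The invariant maintained by both operations is that layer j + 1 always contains exactly as many bits as there are ones on layer j.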
Figure 6.9: Size comparison among ML-CCBF, CBF and m× Entropy.
6.6.2 Size
ML-CCBF is a multilayer transposition of the algorithm shown in sec. 6.5,
with no need for slack bits. Hence, it results in a lower memory requirement:

$$S = m_0 + \sum_{i=1}^{m_0}\varphi_i + \sum_{i=1}^{n_{tab}} TS_i$$

$TS_i$ is the size of the table required for layer i, which needs
$n_i = \lceil m_i/D \rceil$ entries of size $\log_2(m_i)$, thus resulting in:

$$TS_i = n_i\log_2(m_i) = \left\lceil\frac{m_0}{D}\right\rceil P(\varphi \ge i)\,\log_2\left[m_0\,P(\varphi \ge i)\right]$$
Figure 6.9 shows the comparison among the sizes of ML-CCBF, standard
CBF and the minimum amount of bits for independent symbols (BF entropy =
m× entropy), for k = 10 and m = 32768. The memory saving of our method
is clear as it approaches the minimum value. Note that the optimal number
of elements n = 2270 that minimizes f , minimizes the distance from the BF
entropy as well.
The average amount of required memory is then:

$$E[S] = m_0(1 + E[\varphi]) + TS$$

A closed-form expression for $TS = \sum_{i=1}^{n_{tab}} TS_i$ is not simple to
obtain in the general case. However, we use the results of theorem 1 (see the
Appendix) to compute a bound for TS.
If $\alpha = \ln 2$ to minimize the false positive probability, then:

$$TS \le \left\lceil\frac{m_0}{D}\right\rceil\left(2\log_2(m_0) - 1.85\right)$$
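The layer sizes and the table bound can be evaluated numerically. The Python sketch below uses the Poisson approximation of section 6.3 for $P(\varphi \ge i)$ and stops at the first layer expected to hold less than one bit; the values $m_0$ = 28854 and D = 451 (about 64 blocks) and all function names are assumptions chosen to resemble the simulation settings, not figures taken from the evaluation.

```python
import math

def poisson_tail(i, alpha, terms=60):
    """P(phi >= i) under the Poisson approximation of section 6.3."""
    return sum(math.exp(-alpha) * alpha**j / math.factorial(j) for j in range(i, i + terms))

def layer_sizes(m0, alpha=math.log(2)):
    """Expected sizes m_i = m0 * P(phi >= i) of the layers above L0."""
    sizes, i = [], 1
    while m0 * poisson_tail(i, alpha) >= 1:
        sizes.append(m0 * poisson_tail(i, alpha))
        i += 1
    return sizes

def table_bits(m0, D, alpha=math.log(2)):
    """TS = sum of TS_i, with TS_i = ceil(m0/D) * P(phi >= i) * log2(m_i)."""
    blocks = math.ceil(m0 / D)
    return blocks * sum(m_i / m0 * math.log2(m_i) for m_i in layer_sizes(m0, alpha))

def table_bound(m0, D):
    return math.ceil(m0 / D) * (2 * math.log2(m0) - 1.85)

m0, D = 28854, 451
sizes = layer_sizes(m0)
print([round(s) for s in sizes], table_bits(m0, D), table_bound(m0, D))
```

The upper layers add up to about $m_0 E[\varphi] = m_0 \ln 2$ bits, in agreement with the expression for E[S], and the computed table size stays below the bound.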
6.7 Comparative Analysis
To evaluate the algorithms proposed in the previous sections and compare them
with others known in the literature, the Intel IXP2800 Network Processor has
been taken as the reference hardware architecture.
As shown in tab. 6.2, we have weighted the operations of the algorithms
in terms of microengine clock cycles. Microengines are the processors
designed to handle the fast data path; the operation weights are set according
to the IXP2800 Hardware Reference Manual [61]. For the construction of an
MPHF, we have simulated the algorithm in [75] and measured its cost.
Each algorithm has been simulated and its performance has been measured in
terms of memory consumption and processing load for lookup and
insertion/deletion. In the simulation runs, the total number of data elements
is n = 2000, k = 10, and the number of bins of the main vector is
2.8 × 10^4, thus minimizing the probability of false positives. For the
algorithms that divide the data structure into subsegments, the number of
blocks is B = 64. All other parameters are set so as to obtain about the same
probability of false positives among the different algorithms and to manage the
same number n of elements. Moreover, for the algorithms with a hierarchical
structure, we have placed each substructure in the fastest possible memory
(see tab. 6.3).
In ML-HCBF, we have set a configuration of four layers (2, 2, 3, 3). The
first vector CBFV has been stored in scratchpad memory, thus a lookup
requires accesses to scratchpad. Vectors HBV1, HBV2 and HBV3 can be located
in local memory, thus an insertion/deletion also needs a certain number of
accesses to this memory. However, in this layer configuration the probability
of overflowing the CBFV is very low (0.033), thus accesses to local memory
are infrequent.
Table 6.2: Number of Clock Cycles for Operations in the IXP2800
Operation                           Number of cycles
hash                                10
popcount                            1
shift                               1
read/write in local memory          2
read/write in scratchpad memory     60
construction of a MPHF              1000
Concerning ML-CCBF, the main BF vector L0 and index tables are stored
in local memory, while the remaining vectors in scratchpad. A lookup only
requires checking the ﬁrst vector, therefore only local memory is accessed. For
insertion and deletion we still need to explore diﬀerent layers in the structure,
thus both memories are accessed.
For a standard CBF, built with four bits per bin, the overall structure has
been located in scratchpad. Therefore lookup, insertion and deletion require
accesses to this memory.
With the data of our simulation, DCF does not experience any overflow of
counters in the CBF vector. Therefore, the Overflow Counter Vector is not
necessary and DCF exhibits exactly the same behavior as CBF, in terms of both
size and complexity.
Regarding HSBF, we have stored the main structure in scratchpad and the index
tables in local memory. As said above, a lookup requires, for each hash
function, checking the table in local memory, searching the corresponding
block in scratchpad for the bin we need and computing a popcount. The same
operations are required for inserting/deleting an element, with the addition
of shifting the bits in the bin by one position to increment/decrement a
counter. Remember that HSBF is a simple alternative version of SBF, which
is a structure optimized for multi-sets. SBFs use, for values greater than 2,
Elias code instead of Huffman code, and several more index tables, thus
resulting in higher memory consumption and operational complexity.
Finally, the single overall structure of dlCBF has been located in
scratchpad. A lookup requires 2k hash operations and k accesses to scratchpad,
while an insertion or a deletion needs 2k hash operations, k accesses to
scratchpad and, finally, k increment or decrement operations.
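The lookup costs implied by these placements can be recomputed from the weights of tab. 6.2. The short Python sketch below is a back-of-the-envelope model under our own assumptions: a lookup is charged only for its hash operations and for one memory read per probed bin, and the dictionary and function names are illustrative.

```python
# Cycle weights taken from tab. 6.2.
CYCLES = {"hash": 10, "popcount": 1, "shift": 1, "local": 2, "scratchpad": 60}

def lookup_cycles(hashes, reads, memory):
    """Cycles for `hashes` hash operations plus `reads` accesses to `memory`."""
    return hashes * CYCLES["hash"] + reads * CYCLES[memory]

k = 10
cbf = lookup_cycles(k, k, "scratchpad")        # standard CBF (and DCF) in scratchpad
mlccbf = lookup_cycles(k, k, "local")          # ML-CCBF: L0 is kept in local memory
dlcbf = lookup_cycles(2 * k, k, "scratchpad")  # dlCBF: 2k hash operations, k accesses
print(cbf, mlccbf, dlcbf)
```

Under this model a standard CBF lookup costs 700 cycles, an ML-CCBF lookup 120 and a dlCBF lookup 800, which agrees with the lookup figures reported for these structures in the comparison table.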
From the results in tab. 6.3, it is clear that the solutions proposed in this
work show a significant memory saving compared with standard CBF and DCF (a
saving of 46% for ML-HCBF, 56% for ML-CCBF, and 54% for HSBF), and also
compared with SBF. There is, instead, an increase in memory consumption
compared with dlCBF (from 0.93 KB up to 2.35 KB). However, our methods,
inspired by dynamic approaches (e.g., DCF), conclusively avoid the problem of
counter overflow, thus preserving the accuracy of the stored information.
Moreover, the introduction of a hierarchical structure allows ML-CCBF to
achieve a remarkable decrease in clock cycles for the lookup operation. Indeed,
the main structure is stored in local memory, thus enabling lookups that access
local memory only. The membership query is the most frequent operation for
these data structures, therefore the reduction of about 83% in lookup clock
cycles is a great outcome. It outweighs the drawback of a 50% increase in
processing for inserting/deleting an element.
Finally, note that our HSBF outperforms SBF in terms of both memory
consumption and operational complexity. This is an expected result, due to the
simplicity of our method and to the use of Huffman code (SBFs are optimized
for multi-sets). Compared with the complexity of standard algorithms, HSBF
shows a reduction of 13% for lookup and an increase of 45% for
insertion/deletion. Given the different frequencies of these operations, we
consider the tradeoff advantageous.
6.8 Blooming Tree
This section proposes another data structure with the same functionalities as
a CBF. This structure is also arranged in different layers, thus exploiting the
built-in memory hierarchy of many packet processing systems. Because of
the similarities with binary trees and the tunable multilayer design, we call
it Blooming Tree (BT). A naive construction of BT reduces the memory
consumption by a factor of at least 2/ln 2 ≃ 2.88 compared to a standard CBF,
while an optimized version achieves a saving of up to 4/ln 2 ≃ 5.77 times. The
main idea behind Blooming Trees is the construction of a binary tree upon a
plain BF, thus creating a multilayered structure where each layer represents a
different depth-level of tree nodes. The aim is to achieve both a low false
positive probability f and low memory requirements. The drawback is the
increased cost of the lookup operation, which can be mitigated by the low
memory consumption that enables the deployment of the proposed structure in
faster on-chip memories.
Table 6.3: Performance Algorithms Comparison
                               ML-HCBF      ML-CCBF      CBF          DCF          HSBF         SBF          dlCBF
Size (KB)                      7.55         6.13         14.1         14.1         6.42         12.12        5.2
Main structure (KB)            7.04 (scr.)  3.52 (loc.)  14.1 (scr.)  14.1 (scr.)  5.92 (scr.)  8.12 (scr.)  5.2 (scr.)
Secondary structures (KB)      0.41 (loc.)  2.4 (scr.)   -            -            -            -            -
Index tables (KB)              -            0.21 (loc.)  -            -            0.5 (loc.)   4 (loc.)     -
Prob. of false positives       10^-3        10^-3        10^-3        10^-3        10^-3        10^-3        1.5 × 10^-3
Lookup (clock cycles)          700          120          700          700          606          801          800
Insert./Del. (clock cycles)    1043         1064         710          710          1058         1217         810
6.8.1 The algorithm
To begin the description, we first show the simplest construction of BT with
no optimization (we shall call it NBT, i.e., Naive BT); an optimized version
of BTs will be elaborated upon in sec. 6.8.3.
To build an NBT for n elements, L + 2 layers are defined:
• a plain BF ($B_0$) with $k_0$ hash functions $h_j$ ($j = 1 \ldots k_0$) and m
bins such that $m = nk_0/\ln 2$;
• L layers ($B_1 \ldots B_L$), each composed of $m_i$ ($i = 1 \ldots L$) blocks
of $2^b$ bits;
• a final layer ($B_{L+1}$) composed of c-bit counters.
The j-th hash function $h_j$ provides a $\log_2 m + L \times b$ bit long
output: the first group ($s_{0,j}$) of $\log_2 m$ bits is used to address the
BF at layer 0, while the other $L \times b$ bits are divided into L substrings
($s_{1,j} \ldots s_{L,j}$) of b bits, one for each layer.
Let popcount(B[u]) be the number of ones in the bitmap B[0] ... B[u − 1]
and let us consider the simplest case b = 1 (this way, blocks become couples
and the substrings $s_{i,j}$ collapse into single bits). The lookup for an
element σ consists of a check on $k_0$ elements in the BF and an exploration of
the corresponding $k_0$ branches of the Blooming Tree. As shown in Algorithm
8, we jump from layer i to layer i + 1 by:
• computing a popcount on layer i, which gives us the index of the couple
to be observed in layer i + 1;
• checking the bit expressed by $s_{i,j}$: if $s_{i,j}$ is equal to 0, we check
the first bit of the couple, otherwise the second;
• processing the bit of the couple: if it is 0, then σ is not in the set and
the lookup result is NOT FOUND, otherwise the overall process must
be iterated for the next layers.
Therefore, a lookup of an element requires, for each hash function, a
popcount and a bit check on the hash and on the block; these operations are
needed for each layer, until the result is found, thus resulting in a maximum
number of operations equal to $k_0[hash + L(popcount + 2 \times check)]$.
However, the computational cost of a lookup is negligible, as popcount and hash
operations are supported by hardware in most modern NPs (such as the Intel
IXP2350 [61]) and often replicated in multi-core architectures.
Algorithm 8 Pseudo-code for the lookup of element σ in BT
for l ← 1, k0 do
    x0 ← s0,l
    for i ← 0, L + 1 do
        if Bi(xi) = 0 then
            return NOT FOUND
        end if
        if i < L then
            xi+1 ← 2^b × popcount(Bi[xi]) + si+1,l
        else if i = L then
            xL+1 ← popcount(BL[xL])        ▷ one counter per set bit of layer L
        end if
    end for
end for
return FOUND

[Figure 6.10, layers from top to bottom]
B0: 0 1 1 0 1 0 0 1        (bins annotated "3 items", "1 item", "2 items")
B1: 1 0 | 0 1 | 1 0 | 1 0  (hash bits labelling the branches: 0 1 0 0)
B2: 1 1 | 1 0 | 0 1 | 1 1  (hash bits labelling the branches: 0 1 0 1 0 1)
B3: 1 2 1 1 1 1            (counters)
Figure 6.10: An example of a Naive Blooming Tree with b = 1.

An example of the lookup process is shown in fig. 6.10, where the tree
structure of NBT is clear and a single hash is used (k0 = 1). For instance,
let us observe the last bit of the BF, where two items collide. The popcount
(equal to 3) leads to the proper block of layer B1. The bit s1 of the hash is
equal to 0 for both items, so the first bit of the couple is set. Then, the
items present a different s2 bit of the hash and they split: in the fourth
couple of layer B2 (as indicated by the popcount on B1) both bits are set.
Therefore, in layer B3, two different bins count the two items.
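The walk just described can be replayed in a few lines of Python. This is only a sketch: the four lists are our reading of the bit rows of fig. 6.10 (an assumption about how the figure is laid out), and the function name counter_index is hypothetical.

```python
def popcount(bits, u):
    """Number of ones in bits[0..u-1], as defined in the text."""
    return sum(bits[:u])


# Layers as read from fig. 6.10 (b = 1, so every block is a couple).
B0 = [0, 1, 1, 0, 1, 0, 0, 1]
B1 = [1, 0, 0, 1, 1, 0, 1, 0]
B2 = [1, 1, 1, 0, 0, 1, 1, 1]
B3 = [1, 2, 1, 1, 1, 1]          # counters, one per set bit of B2


def counter_index(x0, s1, s2):
    """Follow one branch; return the index of its counter in B3, or None."""
    if B0[x0] == 0:
        return None
    x1 = 2 * popcount(B0, x0) + s1   # couple chosen by popcount, bit by s1
    if B1[x1] == 0:
        return None
    x2 = 2 * popcount(B1, x1) + s2
    if B2[x2] == 0:
        return None
    return popcount(B2, x2)


# The two items colliding on the last bit of the BF share s1 = 0
# and differ in s2, so they end up in two distinct counters.
first = counter_index(7, 0, 0)
second = counter_index(7, 0, 1)
print(popcount(B0, 7), first, second)
```

With these inputs the popcount on B0 is 3, as stated in the text, and the two items are counted by two different entries of B3.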
When inserting an element σ, we start from the standard BF (B0). For
each hash function hj(σ) (j = 1 ... k0), we extract x0 = s0,j and set B0[x0].
Then we compute x1 = popcount(B0[x0]) and jump onto layer 1 at the x1-th
couple. Now we have two different possibilities according to the bit we
just set in the BF. If it was already set (i.e. we have a collision in the BF),
we set in the couple the position given by the next bit (s1,j) of hj(σ). If
it was clear before setting, we first allocate a couple in the x1-th position
(by right-shifting the following couples by 2 positions) and then set the
position given by s1,j in the new couple. Then, for all other layers, we repeat
the previous steps, as shown in pseudocode 9. Therefore the insertion of an
element requires a maximum number of operations equal to
k0[hash + L(popcount + shift + bitset)].
Algorithm 9 Pseudo-code for the insertion of element σ
for l ← 1, k0 do
    x0 ← s0,l
    for i ← 0, L do
        prev ← Bi(xi)
        Bi(xi) ← 1
        w ← 2^b if i < L, otherwise w ← 1      ▷ a block per set bit, or a counter at layer L + 1
        xi+1 ← w × popcount(Bi[xi])
        if prev = 0 then                          ▷ new branch: make room by a right-shift
            Bi+1(xi+1 + w ... mi+1 + w) ← Bi+1(xi+1 ... mi+1)
            Bi+1(xi+1 ... xi+1 + w − 1) ← 0
            mi+1 ← mi+1 + w                       ▷ mi+1 is here the length of layer i + 1
        end if
        if i < L then
            xi+1 ← xi+1 + si+1,l
        end if
    end for
    BL+1(xL+1) ← BL+1(xL+1) + 1
end for
For the deletion of an element σ (see pseudocode 10), the corresponding
counter in layer BL+1 has to be found (by following the lookup process)
and decremented. If the new counter value is equal to 0, its associated block
of the previous layer has to be checked: if only the bit concerning σ is set,
the whole block has to be removed (by a left-shift by 2^b bits) and the lower
layer has to be processed in the same way. Otherwise, if other bits are set in
the block, the deletion process ends.
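The three routines can be put together in a small Python sketch. It is not the microengine implementation: Python lists stand in for packed bitmaps, the SHA-256 based split of the hash into s0 ... sL is a hypothetical choice, and, as a simplification of this sketch, a counter that reaches zero is removed at once together with the bit that points to it.

```python
import hashlib


def popcount(bits, u):
    """Number of ones in bits[0..u-1], as defined in the text."""
    return sum(bits[:u])


class NaiveBloomingTree:
    def __init__(self, m, L, b=1, k0=1):
        self.m, self.L, self.b, self.k0 = m, L, b, k0
        # layers[0]: Bloom filter; layers[1..L]: blocks of 2**b bits;
        # layers[L+1]: one counter per set bit of layer L.
        self.layers = [[0] * m] + [[] for _ in range(L + 1)]

    def _substrings(self, key, j):
        """Hypothetical hash split: s0 addresses layer 0, s1..sL have b bits."""
        h = int.from_bytes(hashlib.sha256(f"{j}:{key}".encode()).digest(), "big")
        s, h = [h % self.m], h // self.m
        for _ in range(self.L):
            s.append(h % (1 << self.b))
            h >>= self.b
        return s

    def _width(self, i):
        """Entries that layer i+1 holds for each set bit of layer i."""
        return (1 << self.b) if i < self.L else 1

    def _walk(self, s):
        """Positions x0..x(L+1) of a branch, or None if it hits a zero bit."""
        xs = [s[0]]
        for i in range(self.L + 1):
            if self.layers[i][xs[i]] == 0:
                return None
            nxt = self._width(i) * popcount(self.layers[i], xs[i])
            xs.append(nxt + (s[i + 1] if i < self.L else 0))
        return xs

    def insert(self, key):
        for j in range(self.k0):
            s = self._substrings(key, j)
            x = s[0]
            for i in range(self.L + 1):
                prev, w = self.layers[i][x], self._width(i)
                self.layers[i][x] = 1
                nxt = w * popcount(self.layers[i], x)
                if prev == 0:  # new branch: allocate w zeroed entries (right shift)
                    self.layers[i + 1][nxt:nxt] = [0] * w
                x = nxt + (s[i + 1] if i < self.L else 0)
            self.layers[self.L + 1][x] += 1

    def lookup(self, key):
        return all(self._walk(self._substrings(key, j)) is not None
                   for j in range(self.k0))

    def delete(self, key):
        """Assumes that key has been inserted before."""
        for j in range(self.k0):
            xs = self._walk(self._substrings(key, j))
            counters = self.layers[self.L + 1]
            counters[xs[-1]] -= 1
            if counters[xs[-1]] > 0:
                continue
            del counters[xs[-1]]              # left shift by one counter
            for i in range(self.L, -1, -1):
                layer = self.layers[i]
                layer[xs[i]] = 0
                if i == 0:
                    break
                w = 1 << self.b
                start = xs[i] - xs[i] % w
                if any(layer[start:start + w]):
                    break                     # the block still serves other branches
                del layer[start:start + w]    # left shift by 2**b bits


bt = NaiveBloomingTree(m=64, L=4, b=1, k0=2)
bt.insert("10.0.0.1")
print(bt.lookup("10.0.0.1"))
bt.delete("10.0.0.1")
print(bt.lookup("10.0.0.1"))
```

Removing empty counters and blocks eagerly keeps the invariant that layer i + 1 holds exactly one block (or counter) per set bit of layer i, which is what makes the popcount-based addressing valid.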
6.8.2 Properties of Blooming Tree
By construction, in an NBT, layer i + 1 has as many blocks as the number
of ones in layer i. Thus, the worst case for the overall structure size occurs
whenever all the branches start at layer 0 (i.e. there are no collisions in B0).
In this case 2^b n k0 bits are necessary for each layer. Thus, considering also
layers B0 and BL+1, the size SNBT can be bounded by:
\[ S_{NBT} \le n k_0 \left( \frac{1}{\ln 2} + 2^b L + c \right) \]
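As a quick numerical illustration, the bound can be evaluated for the parameters used later in sec. 6.8.4 (n = 8192, k0 = 1, L = 11, b = 1, c = 3). The function name is hypothetical and the conversion assumes 1 KB = 1024 bytes.

```python
from math import log


def nbt_size_bound_bits(n, k0, L, b, c):
    """Worst-case NBT size in bits: n*k0*(1/ln 2 + 2**b * L + c)."""
    return n * k0 * (1 / log(2) + (2 ** b) * L + c)


bits = nbt_size_bound_bits(n=8192, k0=1, L=11, b=1, c=3)
print(f"{bits / 8 / 1024:.2f} KB")
```

The result is about 26.4 KB, slightly above the 25.25 KB reported for NBT in tab. 6.5, as expected for a worst-case bound.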
To compute the overflow probability (Pov) at layer L + 1, we observe the
collisions for each bit of all layers. This is equivalent to substituting each
bit with a counter and observing its value. Let us consider a given counter at
layer i + 1. It descends from a parent counter at layer i. When jumping
from layer i to layer i + 1, some of the elements colliding at layer i may have
a different hash substring, thus reducing the value of the counter at the next
layer. Therefore the probability P(i + 1, ϕ) of having a counter with value ϕ
in layer i + 1 is the sum, over j, of the probability that the parent counter on
layer i has the value ϕ + j times the probability that the (i + 1)-th hash
substring is the same for ϕ elements and different for the other j:
\[ P(i+1, \varphi) = \sum_{j=0}^{\infty} P(i, \varphi + j) \binom{\varphi + j}{\varphi} \left( \frac{1}{2^b} \right)^{\varphi} \left( 1 - \frac{1}{2^b} \right)^{j} \tag{6.9} \]
Algorithm 10 Pseudo-code for the deletion of element σ
for l ← 1, k0 do
    x0 ← s0,l
    for i ← 0, L − 1 do
        xi+1 ← 2^b × popcount(Bi[xi]) + si+1,l
    end for
    xL+1 ← popcount(BL[xL])
    BL+1(xL+1) ← BL+1(xL+1) − 1
    if BL+1(xL+1) = 0 then
        remove counter xL+1 from BL+1             ▷ left-shift by one counter
        i ← L
        loop
            Bi(xi) ← 0
            if i = 0 or the block of Bi containing xi still has a set bit then
                break
            end if
            remove that block from Bi             ▷ left-shift by 2^b bits, mi ← mi − 2^b
            i ← i − 1
        end loop
    end if
end for
The probability of collision occurrences in hash tables is generally
approximated by a Poisson model [76]. Moreover, as stated in [73], such a model
(with parameter α0 = n k0/m = ln 2) can be applied also to CBFs:
\[ P(i, \varphi) \simeq \frac{e^{-\alpha_i} \alpha_i^{\varphi}}{\varphi!} = \mathrm{Poisson}(\alpha_i, \varphi) \tag{6.10} \]
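By using (6.10) in (6.9), we can now compute P(i + 1, ϕ):
\[
\begin{aligned}
P(i+1, \varphi) &\simeq \sum_{j=0}^{\infty} \frac{e^{-\alpha_i} \alpha_i^{\varphi + j}}{(\varphi + j)!} \binom{\varphi + j}{\varphi} \left( \frac{1}{2^b} \right)^{\varphi} \left( 1 - \frac{1}{2^b} \right)^{j}
 = e^{\alpha_i (1 - 2^{-b})} \, e^{-\alpha_i} \, \frac{(\alpha_i / 2^b)^{\varphi}}{\varphi!} \\
&\simeq e^{\alpha_i (1 - 2^{-b})} \, 2^{-b\varphi} \, P(i, \varphi) = \mathrm{Poisson}\!\left( \frac{\alpha_i}{2^b}, \varphi \right)
\end{aligned}
\tag{6.11}
\]
The Poisson pmf is invariant with respect to a binomial transform such
as (6.9), except for the parameter, which is divided by 2^b; thus, P(L + 1, ϕ)
is a Poisson pmf as well.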
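This invariance is easy to check numerically. The sketch below evaluates the right-hand side of (6.9) for a Poisson-distributed parent counter and compares it with a Poisson pmf whose parameter is divided by 2^b. The truncation of the infinite sum after 60 terms and the function names are assumptions of this example.

```python
from math import comb, exp, factorial, log


def poisson(alpha, phi):
    """Poisson pmf with parameter alpha evaluated at phi."""
    return exp(-alpha) * alpha ** phi / factorial(phi)


def next_layer(alpha, phi, b, terms=60):
    """Right-hand side of (6.9) with a Poisson parent, truncated after `terms` addends."""
    p = 2.0 ** -b
    return sum(
        poisson(alpha, phi + j) * comb(phi + j, phi) * p ** phi * (1 - p) ** j
        for j in range(terms)
    )


alpha0 = log(2)   # parameter at layer 0
for phi in range(4):
    print(phi, next_layer(alpha0, phi, b=1), poisson(alpha0 / 2, phi))
```

The two columns agree up to floating-point rounding, for every value of ϕ.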
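Finally, by iterating (6.11) and by approximating
$\prod_{i=1}^{L} e^{\frac{\alpha_0}{2^{ib}} \left( 1 - \frac{1}{2^b} \right)}$ with $e^{\alpha_0}$, we obtain:
\[ P(L+1, \varphi) \simeq e^{\alpha_0} P(0, \varphi) \, 2^{-Lb\varphi} \simeq \frac{(\ln 2 / 2^{Lb})^{\varphi}}{\varphi!} \tag{6.12} \]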
Eq. (6.12) states that, for reasonable values of L (e.g. L ≥ 10), the
probability P(L + 1, ϕ + 1) becomes much smaller than P(L + 1, ϕ). Therefore,
the probability of overflow P(L + 1, ϕ ≥ 2^c − 1) can be safely approximated
by:
\[ P(L+1, \varphi \ge 2^c - 1) \simeq P(L+1, 2^c - 1) \simeq \frac{\left( \ln 2 / 2^{Lb} \right)^{2^c - 1}}{(2^c - 1)!} \tag{6.13} \]
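To give a feeling for the orders of magnitude involved, (6.13) can be evaluated for the configuration used in sec. 6.8.4 (L = 11, b = 1, c = 3). The function name is hypothetical.

```python
from math import factorial, log


def overflow_probability(L, b, c):
    """Approximation (6.13) of the overflow probability of a c-bit counter."""
    top = 2 ** c - 1
    return (log(2) / 2 ** (L * b)) ** top / factorial(top)


print(f"{overflow_probability(L=11, b=1, c=3):.2e}")
```

The value is of the order of 10^-28, in agreement with the Pov quoted for NBT and OBT in the measurements.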
Since in each layer a hash substring of b bits is used to choose the block
at the next layer, the probability of false positives f for our BT decreases by
a factor of 2^b per layer. Then, considering also the false positive probability
of layer 0 (the BF), we have:
\[ f = 2^{-(k_0 + Lb)} \tag{6.14} \]
To give an idea, a standard 4-bit CBF with the same f requires k0 + L
hash functions and SCBF bits:
\[ S_{CBF} = \frac{4 n (k_0 + L)}{\ln 2} \ge 4 \, \frac{1 + L/k_0}{1 + (2^b L + c) \ln 2} \, S_{NBT} \tag{6.15} \]
This means that a NBT can be more than 2/ln 2 ≃ 2.89 times smaller than
its equivalent CBF.
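Both (6.14) and the ratio in (6.15) are simple enough to be evaluated directly. The following sketch (function names are hypothetical) uses k0 = 1, L = 11, b = 1, c = 3 and then lets L grow to show the limit for b = 1.

```python
from math import log


def false_positive(k0, L, b):
    """False positive probability (6.14)."""
    return 2.0 ** -(k0 + L * b)


def cbf_to_nbt_ratio_bound(k0, L, b, c):
    """Lower bound on S_CBF / S_NBT given by (6.15)."""
    return 4 * (1 + L / k0) / (1 + (2 ** b * L + c) * log(2))


print(false_positive(k0=1, L=11, b=1))
print(cbf_to_nbt_ratio_bound(k0=1, L=11, b=1, c=3))
print(cbf_to_nbt_ratio_bound(k0=1, L=10 ** 6, b=1, c=3), 2 / log(2))
```

For L = 11 the false positive probability is 2^-12, i.e. about 0.24 × 10^-3, and the guaranteed gain over the CBF is roughly 2.6; the bound approaches 2/ln 2 only for large L.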
6.8.3 Memory Optimization
An optimized version of BT is described in this section. It follows from three
main observations about NBTs:
• as suggested by the Poisson approximation (6.10) and as shown in
fig. 6.10, P(i, 1) gives the most relevant contribution for all layers;
indeed, once there are no collisions in a certain block at layer u, there
are no collisions also in the corresponding blocks in all upper layers
u + 1 ... L, but we use 2^b (L − u) + c bits for those blocks;
• all blocks always have at least a bit set: a block with 2^b zeros (let us
call it zero-block) has no meaning;
• looking up w layers yields f = 2^-(k0 + wb).
Therefore, whenever there are no collisions in a block, a zero-block can be
used to indicate this situation and stop the branch from growing. But we
cannot stop the lookup there, since it would increase the probability of a false
positive. The solution of the Optimized BT (OBT) is to add a bitmap and
an array of hash substrings for each layer. The array of substrings for layer i
is composed of all the [(L − i)b]-long hash substrings that complete the hash
of the branches that stop at layer i. In the bitmap (of mi bits), the generic
j-th bit is set if the j-th block has no collision (i.e. zero-block); this way it
can be used to address the substring array (see ﬁg. 6.11). The optimization
can be also done at runtime.
Obviously, operational routines change. As for the lookup of an element σ,
whenever the xi-th block is a zero-block, we compute yi = popcount(bitmapi[xi])
and compare the last (L − i)b bits of the hash of σ with the yi-th element in
the i-th substring array. This way, the lookup becomes faster as zero-blocks
are very likely to occur in any layer, thus avoiding all the steps required to
jump up to layer L+ 1.
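The shortcut taken at a zero-block can be sketched as follows. The data layout (a list of bits for the bitmap, a list of bit strings for the substring array) and the sample values are hypothetical and only illustrate the addressing through popcount.

```python
def popcount(bits, u):
    """Number of ones in bits[0..u-1], as defined in the text."""
    return sum(bits[:u])


def matches_at_zero_block(bitmap_i, substrings_i, x_block, remaining_bits):
    """Finish a lookup at a zero-block of layer i.

    bitmap_i[j] == 1 marks block j as a zero-block; substrings_i holds, in
    block order, the hash tails of the branches that stop at this layer.
    """
    if bitmap_i[x_block] == 0:
        raise ValueError("block is not a zero-block")
    y = popcount(bitmap_i, x_block)      # position in the substring array
    return substrings_i[y] == remaining_bits


# Hypothetical layer with four blocks, three of which are zero-blocks.
bitmap = [1, 0, 1, 1]
substrings = ["10", "01", "11"]
print(matches_at_zero_block(bitmap, substrings, 2, "01"))
print(matches_at_zero_block(bitmap, substrings, 3, "01"))
```

The comparison of the stored tail with the remaining hash bits is what preserves the false positive probability of a full walk up to layer L + 1.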
The insertion routine, however, can be slightly slower since it must also
be aware of zero-blocks. If we have no collisions at layer 0, we add a
zero-block and set the corresponding bit in the bitmap as well as the
corresponding substring in the substring array. Instead, if there is a
collision, we have to check the colliding elements and create the corresponding
branches up to the layer (let us say j) where the hash substrings differ. At
layer j + 1 we repeat the ordinary steps: add two zero-blocks, set the
corresponding bits in the j-th bitmap and add the two hash substrings in the
j-th substring array. The computational cost of deletion, in turn, is about the
same as that of insertion since, again, zero-blocks require additional
processing but reduce the number of accesses to upper layers.

[Figure: layers B0 ... B3 of the example tree, with the per-layer bitmapi and
hash substring arrays]
Figure 6.11: An example of an Optimized Blooming Tree with b = 1.
The average size of the overall structure is:
\[ S_{OBT} \simeq n k_0 \left( L + 2^b + \frac{1 + 2^{b-1}}{\ln 2} \right) \tag{6.16} \]
In fact, the first layer B0 is constructed with m = n k0/ln 2 bits and has, on
average, m/2 ones. For each 1 present in B0 we build a block (2^b bits) in
B1. Let us suppose that all the n k0 branches start at layer B2. This means
that we have n k0 zero-blocks in B2, a bitmap of n k0 bits and n k0 entries (of
L − 1 bits) in the hash substring array of layer 2. All these components sum
up to n k0 (L + 2^b + (1 + 2^(b−1))/ln 2) bits.
If we introduce a collision in B2, the number of blocks and the bitmap
length become n k0 − 1 while the substring array is reduced by 2 × (L − 1) bits.
At layer B3 we add 2 blocks, 2 bits in the bitmap and 2 elements of L − 2 bits
in the substring array. Thus a collision adds 2^b − 1 bits. However, the higher
the layer, the fewer the collisions and so the bits they introduce. Indeed, by
means of simulations, eq. (6.16) proves to be very precise and it shows that,
as L increases, an OBT can become up to 4/ln 2 ≃ 5.77 times smaller than
an equivalent CBF.
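Eq. (6.16) can be checked against the sizes reported in sec. 6.8.4 with a few lines of Python. The function name is hypothetical and 1 KB is taken as 1024 bytes.

```python
from math import log


def obt_size_bits(n, k0, L, b):
    """Average OBT size in bits according to (6.16)."""
    return n * k0 * (L + 2 ** b + (1 + 2 ** (b - 1)) / log(2))


for n in (8192, 2048):
    kb = obt_size_bits(n, k0=1, L=11, b=1) / 8 / 1024
    print(n, round(kb, 2))
```

For n = 8192 and n = 2048 the formula gives about 15.9 KB and 3.97 KB, close to the 15.87 KB and 3.97 KB listed for OBT in tab. 6.5.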
Fig. 6.12 shows that the overall gain in size is evident for OBT as compared
to NBT and to the most compact CBF-like structure in literature (dl-CBF).
The ﬁgure reports the size for the above mentioned structures as a function
of − log2(f) (i.e. the number of layers in OBT and NBT) for n = 2048.
Figure 6.12: Size comparison for NBT, OBT, dl-CBF and CBF with n = 2048.
6.8.4 Measurements
To evaluate the proposed algorithms and compare their performance to that
of other known algorithms in the literature, the Intel IXP2350 NP has been
taken as the reference hardware architecture. It has an XScale core and four
32-bit MEv2 microengines, 4 KB of local memory, 16 KB of scratchpad memory,
and 128 KB of message SRAM.
As shown in tab. 6.4, the operations of the algorithms have been weighted
in terms of clock cycles for microengines, according to the IXP2350 Reference
Manual [61] and ignoring operations with negligible costs such as shift and
popcount.
Each algorithm has been simulated and evaluated in terms of memory
consumption and processing load. In simulation runs, the parameters are set
to obtain about the same probability of false positives and overﬂow among
the diﬀerent algorithms. Moreover, each structure (and each substructure for
hierarchical algorithms) was located in the fastest possible memory.
Table 6.4: Number of Clock Cycles for Operations in the IXP2350
Operation                            Number of cycles
hash                                 10
read/write in local memory           2
read/write in scratchpad memory      60
read/write in SRAM memory            100

Table 6.5: Performance Algorithms Comparison
                                     NBT      OBT      CBF      dl-CBF
n = 8192
Size (KB)                            25.25    15.87    69.25    21.3
Size Ratio                           2.74     4.36     1        3.25
Lookup (clock cycles)                724      160      1320     1200
Insertion/Deletion (clock cycles)    1036     520      1320     1200
n = 2048
Size (KB)                            6.3      3.97     12       5.3
Size Ratio                           1.9      3.02     1        2.26
Lookup (clock cycles)                256      9        720      800
Insertion/Deletion (clock cycles)    368      29       720      800

In the first simulation run, we set n = 8192. For NBT and OBT, L = 11,
k0 = 1, c = 3, and b = 1, thus obtaining Pov = 10^-28 and f = 0.24 × 10^-3.
Regarding NBT, we have stored in local memory the first layers B0 and B1,
which fill 2.9 KB. Then layers B2 ... B8 have been located in scratchpad memory
and layers B9 ... B11 in SRAM. For OBT, we put the first two layers (3.6 KB)
in local memory and the other ones in scratchpad, including all bitmaps and
substring arrays.
To obtain about the same probabilities, in a standard CBF (with 4 bits
per bin) we set k = 12 (this way, Pov = 1.5 × 10^-16 and f = 0.24 × 10^-3).
The overall structure has been located in SRAM, therefore lookup, insertion
and deletion require k accesses to this memory and k hashing operations.
Finally, for dl-CBF we set k = 10, thus obtaining Pov = 2.96 × 10^-23
and f = 1.46 × 10^-3. The data structure has been located in SRAM, thus
processing an element requires 2k hashing operations and k accesses to SRAM.
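The cycle counts of the two reference structures follow directly from the costs of tab. 6.4. A minimal sketch, with hypothetical function names:

```python
# Clock cycles per operation, from tab. 6.4.
CYCLES = {"hash": 10, "local": 2, "scratchpad": 60, "sram": 100}


def cbf_cycles(k, memory="sram"):
    """k hash computations and k memory accesses."""
    return k * (CYCLES["hash"] + CYCLES[memory])


def dlcbf_cycles(k, memory="sram"):
    """2k hash computations and k memory accesses."""
    return 2 * k * CYCLES["hash"] + k * CYCLES[memory]


print(cbf_cycles(12), dlcbf_cycles(10))
```

With k = 12 and k = 10 respectively, the sketch gives 1320 and 1200 clock cycles, the values listed for CBF and dl-CBF at n = 8192 in tab. 6.5.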
From the results shown in tab. 6.5, it is clear that OBT is the best solution
in terms of memory consumption. OBT also outperforms dl-CBF, which
presents the best results in the literature. Moreover, the multilayer structure
of our solutions allows for a remarkable decrease of operational costs. For
n = 8192, OBT shows, in comparison with dl-CBF, a reduction of 86% of
clock cycles for lookup and 56% for insertion/deletion.
For n = 2048, the trends are the same: our solutions outperform the
previous ones in terms of both memory consumption and operational load.
The structure of OBT can even be located completely in local memory, thus
drastically reducing operational costs (e.g., 98% less than dl-CBF for lookup).
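The percentages quoted above can be recomputed from the clock cycles of tab. 6.5. The helper name is hypothetical.

```python
def reduction(new, old):
    """Relative saving of `new` with respect to `old`, in percent."""
    return 100 * (old - new) / old


lookup_8192 = reduction(160, 1200)    # OBT vs dl-CBF, lookup, n = 8192
update_8192 = reduction(520, 1200)    # OBT vs dl-CBF, insertion/deletion, n = 8192
lookup_2048 = reduction(9, 800)       # OBT vs dl-CBF, lookup, n = 2048
print(lookup_8192, update_8192, lookup_2048)
```

The three savings are slightly above 86%, 56% and 98%, so the figures in the text are rounded down.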

Conclusions
This work has presented a detailed analysis of network processors and their
role as packet processing components in network devices. A survey of the
currently available solutions has been given, comparing the fundamental
features of such multiprocessor systems. A more detailed description of the
Intel IXP2XXX family has then been provided; this family is the hardware
reference architecture for our activities, owing to its wide diffusion in
academic contexts and to its flexibility and reconfigurability.
In the central chapters of this thesis, we have shown a series of network
applications that we have designed and realized. The targets of these
components are very different, which allowed us to "test" all the features of
the Intel network processors and to understand their limits and advantages. We
have tried to solve many issues related to architectural constraints, either by
adopting innovative solutions or by designing new algorithms and mechanisms
ourselves.
The last chapter introduces the algorithms that we originally designed to
overcome the limits of network processors and that we have, in some cases,
further developed and customized for other applications.
As already said, NPs are nowadays claimed to be "outdated" by multicore
or manycore systems, which combine general purpose and specialized cores
and, above all, provide an easy programming environment. NPs, instead,
often require knowledge of proprietary microcode and present many
architectural limits, such as small memories and instruction stores.
However, our research shows that network processors can be appealing
for different packet processing applications in the networking area. We have
realized many devices which have shown high performance, outperforming
current solutions. The activities related to traffic generation, packet
classification, and traffic measurement have highlighted a flexibility and an
adaptability of network processors that encourage further research and
development on these platforms. On the other hand, the difficulty of
programming and the remarkable architectural limits restrict the development
of NP-based networking applications, thus paving the way for more efficient
solutions.
Finally, the research activity on such hardware platforms has forced
us to solve more general and interesting problems, which concern a wide
range of systems. We have already discussed the issue of memory savings
and the proposed algorithms; moreover, the features of multicore systems
have been carefully analyzed in our research on resource scheduling, while
the cooperation with general purpose PCs has been addressed for the
traffic meter.
Bibliography
[1] Agere, The challenge for next generation network processors. [Online].
Available: www.agere.com/docs/challenge_new.pdf
[2] Alchemy, Alchemy semiconductor unveils au1000 in-
ternet edge processor. [Online]. Available: http:
//www.thefreelibrary.com/Alchemy+Semiconductor+Unveils+
Au1000+Internet+Edge+Processor.-a062704288
[3] AMCC, Product family for packet processors. [Online]. Avail-
able: https://www.amcc.com/MyAMCC/jsp/public/browse/controller.
jsp?networkLevel=EMBE&superFamily=NETP
[4] CISCO, Parallel express forwarding in the cisco 10000 edge
service router. [Online]. Available: http://whitepapers.zdnet.co.uk/0,
1000000651,260007268p-39000421q,00.htm
[5] EZchip, Network processor designs for next-generation networking
equipment. [Online]. Available: http://whitepapers.silicon.com/0,
39024759,60001341p-39000410q,00.htm
[6] J. R. A. Jr., B. M. Bass, C. Basso, and R. H. B. et al, Ibm powernp
network processor: Hardware, software, and applications. [Online].
Available: http://www.research.ibm.com/journal/rd/472/allen.pdf
[7] Intel®, IXP2400/2800 programmer's reference manual.
[8] E. J. Johnson and A. R. Kunze, Ixp2400-2800 Programming: The Com-
plete Microengine Coding Guide. Intel Press, 2003.
[9] Intel®, IXP2400/2800 hardware reference manual.
[10] D. E. Comer, Network systems design using network processors: Intel
2xxx version, 2005.
[11] P. Gupta and N. McKeown, Algorithms for packet classiﬁcation, 2001.
[Online]. Available: citeseer.ist.psu.edu/gupta01algorithms.html
[12] P. Gupta and N. McKeown, Packet classification on multiple
fields, in SIGCOMM, 1999, pp. 147–160. [Online]. Available:
citeseer.ist.psu.edu/gupta99packet.html
[13] P. Gupta and N. McKeown, Packet classiﬁcation using hierarchical in-
telligent cuttings, in Proceedings of Hot Interconnects VII, 1999.
[14] F. Baboescu and G. Varghese, Scalable packet classiﬁcation, in SIG-
COMM '01: Proceedings of the 2001 conference on Applications, tech-
nologies, architectures, and protocols for computer communications. New
York, NY, USA: ACM, 2001, pp. 199–210.
[15] M. Nourani and M. Faezipour, A single-cycle multi-match packet classi-
ﬁcation engine using tcams, in HOTI '06: Proceedings of the 14th IEEE
Symposium on High-Performance Interconnects. Washington, DC, USA:
IEEE Computer Society, 2006, pp. 73–80.
[16] F. Baboescu, S. Singh, and G. Varghese, Packet classiﬁcation for core
routers: Is there an alternative to cams, 2003. [Online]. Available:
citeseer.ist.psu.edu/baboescu03packet.html
[17] M. E. Kounavis, A. Kumar, H. Vin, R. Yavatkar, and A. T. Campbell,
Directions in packet classiﬁcation for network processors, in Proceedings
of Second Workshop on Network Processors (NP2), 2003.
[18] M. J. Rashti, H. R. Rabiee, A. Foroutan, and M. Lavasani, A
multi-dimensional packet classiﬁer for np-based ﬁrewalls. in SAINT.
IEEE Computer Society, 2004, pp. 250–254. [Online]. Available:
http://dblp.uni-trier.de/db/conf/saint/saint2004.html#RashtiRFL04
[19] D. Srinivasan and W. chang Feng, Performance analysis of multi-
dimensional packet classiﬁcation on programmable network processors,
in LCN '04: Proceedings of the 29th Annual IEEE International Con-
ference on Local Computer Networks. Washington, DC, USA: IEEE
Computer Society, 2004, pp. 360–367.
[20] C. R. Hsu, C. Chen, and C.-Y. Lin, Fast packet classiﬁcation using bit
compression, in Global Telecommunications Conference, 2005. IEEE
Computer Society, 2005.
[21] T. Chiueh and P. Pradhan, Cache memory design for network
processors, in HPCA, 2000, pp. 409. [Online]. Available: citeseer.ist.
psu.edu/312404.html
[22] I. A. Troxel, A. D. George, and S. Oral, Design and analysis of a dy-
namically reconﬁgurable network processor, in LCN '02: Proceedings of
the 27th Annual IEEE Conference on Local Computer Networks. Wash-
ington, DC, USA: IEEE Computer Society, 2002, p. 0483.
[23] K. Lee and G. Coulson, Supporting runtime reconﬁguration on network
processors, in AINA '06: Proceedings of the 20th International Confer-
ence on Advanced Information Networking and Applications - Volume 1
(AINA'06). Washington, DC, USA: IEEE Computer Society, 2006, pp.
721726.
[24] S. Giordano, G. Procissi, F. Rossi, and F. Vitucci, Design of a multi-
dimensional packet classiﬁer for network processors, in Proceedings of
International Conference of Communications 2006, 2006.
[25] K. S. Kim, Efficient construction of pipelined multibit-trie router-
tables, IEEE Trans. Comput., vol. 56, no. 1, pp. 32–43, 2007, fellow-
Sartaj Sahni.
[26] A. M. Odlyzko, Internet traﬃc growth: sources and implications, in
Optical Transmission Systems and Equipment for WDM Networking II,
vol. 5247, Aug. 2003, pp. 1–15.
[27] J. Williams, Architectures for network processing, in Proc. IEEE
International Symposium on VLSI, 2001, pp. 61–64.
[28] A. Srinivasan, P. Holman, J. Anderson, S. Baruah, and J. Kaur, Network
Processor Design: Issues and Practices. Morgan Kaufmann, 2004.
[29] F. Sabrina, S. Kanhere, and S. Jha, Implementation and performance
analysis of a packet scheduler on a programmable network processor, in
Proc. IEEE LCN 2005, Sydney, Australia, Nov. 2005, pp. 242–249.
[30] T. Wolf, P. Pappu, and M. A. Franklin, Predictive scheduling of network
processors, Computer Networks, vol. 41, no. 5, pp. 601621, Apr. 2003.
[31] J. L. B. Patrick Crowley, Worst-case performance estimation for
hardware-assisted multithreaded processors. in Proceedings of the
HPCA-9 Workshop on Network Processors, 2003.
[32] H. Xiao, L. Zhang, and D. Wu, Component based performance predic-
tion for network processor based system, in PDCAT '05: Proceedings
of the Sixth International Conference on Parallel and Distributed Com-
puting Applications and Technologies. Washington, DC, USA: IEEE
Computer Society, 2005, pp. 356–358.
[33] A. Agarwal, Performance tradeoﬀs in multithreaded processors, Paral-
lel and Distributed Systems, IEEE Transactions on, vol. 3, Sept. 1992.
[34] J. Blanquer and B. Ozden, Fair queuing for aggregated multiple links,
in Proc. ACM SIGCOMM, San Diego, CA, Aug. 2001, pp. 189–198.
[35] L. N. B. Satya R. Mohanty, On fair scheduling in heterogeneous link
aggregated services, in Proc. of ICCCN 2005., Oct. 2005, pp. 199205.
[36] Intel R©, Ixp2400/2800 developer's tool user guide.
[37] M. Laor and L. Gendel., The eﬀect of packet reordering in a backbone
link on application throughput, IEEE Network, vol. 16, Sept. 2002.
[38] tcpdump/libpcap. [Online]. Available: http://www.tcpdump.org/
[39] Wireshark protocol analyzer (was ethereal). [Online]. Available:
http://www.wireshark.org
[40] Ntop network traﬃc probe. [Online]. Available: http://www.ntop.org
[41] P. Wood, libpcap-mmap. [Online]. Available: http://public.lanl.gov/
cpw
[42] J. C. Mogul and K. K. Ramakrishnan, Eliminating receive livelock
in an interrupt-driven kernel, ACM Transactions on Computer
Systems, vol. 15, no. 3, pp. 217–252, 1997. [Online]. Available:
citeseer.ist.psu.edu/article/mogul95eliminating.html
[43] L. Deri, Improving passive packet capture:beyond device polling.
[Online]. Available: citeseer.ist.psu.edu/695645.html
[44] L. Deri, Passively monitoring networks at gigabit speeds using commod-
ity hardware and open source software, in Proceedings of PAM, 2003.
[45] K. Xinidis, I. Charitakis, S. Antonatos, K. G. Anagnostakis, and E. P.
Markatos, An active splitter architecture for intrusion detection and
prevention, IEEE Trans. Dependable Secur. Comput., vol. 3, no. 1, p. 31,
2006.
[46] T. Wolf, R. Ramaswamy, S. Bunga, and N. Yang, An architecture for
distributed real-time passive network measurement, in MASCOTS '06:
Proceedings of the 14th IEEE International Symposium on Modeling,
Analysis, and Simulation. Washington, DC, USA: IEEE Computer
Society, 2006, pp. 335–344.
[47] D. Ficara, S. Giordano, and F. Vitucci, Design and implementation
of a multi-dimensional packet classiﬁer for network processor -
technical report. [Online]. Available: http://wwwtlc.iet.unipi.it/
research/classiﬁer.pdf
[48] http://caia.swin.edu.au/genius/tools/kute/.
[49] http://rude.sourceforge.net/.
[50] N. Bonelli, S. Giordano, G. Procissi, and R. Secchi, Brute: A high perfor-
mance and extensibile traﬃc generator, in Proc. of Int'l Symposium on
Performance of Telecommunication Systems (SPECTS'05), July 2005.
[51] http://protocols.netlab.uky.edu/~esp/pktgen/.
[52] R. Bolla, R. Bruschi, M. Canini, and M. Repetto, A High Performance IP
Traﬃc Generation Tool Based On The Intel IXP2400 Network Processor,
ser. Distributed Cooperative Laboratories: Networking, Instrumentation,
and Measurements. Springer Berlin Heidelberg, 2006, pp. 127–142.
[53] M. Paredes-Farrera, M. Fleury, and M. Ghanbari, Precision and ac-
curacy of network traﬃc generators for packet-by-packet traﬃc anal-
ysis, in Proc. of 2nd International Conference on Testbeds and Re-
search Infrastructures for the Development of Networks and Communi-
ties(TRIDENTCOM 2006), 2006.
[54] http://mgen.pf.itd.nrl.navy.mil/.
[55] A. Botta, A. Dainotti, and A. Pescape, Multi-protocol and multi-
platform traﬃc generation and measurement, in Proc. of INFOCOM
2007 DEMO Session, May 2007.
[56] S. Avallone, A. Pescapè, and G. Ventre, Analysis and experimentation
of internet traﬃc generator, in Proc. of New2an 2004, International
Conference on Next Generation Teletraﬃc and Wired/Wireless Advanced
Networking, February 2004.
[57] A. Abdo, H. Awad, S. Paredes, and T. J. Hall, Oc-48 configurable ip
traffic generator with dwdm capability, in Proc. of the Canadian
Conference on Electrical and Computer Engineering, May 2006,
pp. 1842–1845.
[58] http://ximbiot.com/cvs/cvshome/.
[59] http://netgroup-serv.iet.unipi.it/brute/.
[60] Intel® IXP2400/2800 Developer's Tool reference manual.
[61] Intel® IXP2350 Hardware reference manual.
[62] Intel Corporation, 21555 Non-Transparent PCI-to-PCI Bridge User's
manual.
[63] Linux Device Drivers, Third Edition, http://lwn.net/Kernel/LDD3/.
[64] B. Bloom, Space/time trade-oﬀs in hash coding with allowable errors,
Communications of the ACM, vol. 13, no. 7, pp. 422–426, July 1970.
[65] A. Broder and M. Mitzenmacher, Network applications of bloom ﬁlters:
A survey, Internet Mathematics, vol. 1, no. 4, 2005. [Online]. Available:
http://www.internetmathematics.org/volumes/1/4/Broder.pdf
[66] L. Fan, P. Cao, J. Almeida, and A. Z. Broder, Summary cache: a scalable
wide-area web cache sharing protocol, SIGCOMM Comput. Commun.
Rev., vol. 28, no. 4, pp. 254–265, 1998.
[67] M. Mitzenmacher, Compressed bloom ﬁlters, in PODC '01: Proc. of
the twentieth annual ACM symposium on Principles of distributed
computing. New York, NY, USA: ACM Press, 2001, pp. 144–150.
[68] A. Kirsch and M. Mitzenmacher, Distance-sensitive bloom ﬁlters, in
ALENEX '06: Proc. of Algorithm Engineering and Experiments, 2006.
[69] D. Guo, J. Wu, H. Chen, and X. Luo, Theory and network applica-
tions of dynamic bloom ﬁlters, in Proc. of INFOCOM 2006. 25th IEEE
International Conference on Computer Communications., vol. 1, 2006.
[70] A. Kumar, J. J. Xu, L. Li, and J. Wang, Space-code bloom ﬁlter for
eﬃcient traﬃc ﬂow measurement, in IMC '03: Proc. of the 3rd ACM
SIGCOMM conference on Internet measurement. New York, NY, USA:
ACM Press, 2003, pp. 167–172.
[71] S. Cohen and Y. Matias, Spectral bloom ﬁlters, in SIGMOD '03: Proc.
of the 2003 ACM SIGMOD international conference on Management of
data. New York, NY, USA: ACM Press, 2003, pp. 241–252.
[72] J. Aguilar-Saborit, P. Trancoso, V. Muntes-Mulero, and J. L. Larriba-
Pey, Dynamic count filters, SIGMOD Rec., vol. 35, no. 1, pp. 26–32,
2006.
[73] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese,
An improved construction for counting bloom ﬁlters, in LNCS 4168,
14th Annual European Symposium on Algorithms, 2006, pp. 684–695.
[74] H. Song, S. Dharmapurikar, J. Turner, and J. Lockwood, Fast hash
table lookup using extended bloom ﬁlter: an aid to network processing,
in SIGCOMM '05: Proceedings of the 2005 conference on Applications,
technologies, architectures, and protocols for computer communications.
New York, NY, USA: ACM, 2005, pp. 181–192.
[75] F. C. Botelho, Y. Kohayakawa, and N. Ziviani, An approach for
minimal perfect hash functions for very large databases, 2006. [Online].
Available: http://homepages.dcc.ufmg.br/~fbotelho/cv/pub/tr06.pdf
[76] W. Buchholz, File organization and addressing, IBM Systems Journal,
no. 2, pp. 86111, June 1963.
157

Acknowledgments
My first thanks go to the Telecommunication Networks research group of the
University of Pisa, which gave me the opportunity to carry out this thesis
work in a spirit of full collaboration and love for research.
In particular, thanks to Profs. Franco Russo and Stefano Giordano, my tutors
and constant points of reference. An even more "specific" thank-you goes to
the NP team (Mimmo, Fede, Luca, Piero, Gianni), who worked closely with me
in the research area of Network Processors. Thanks also to Ericsson Italia,
which funded these three years of my research.
And then I cannot fail to thank all my friends, from Mezzogiorno di Fuoco
to the Briganti, from my old university classmates to all those who in these
years have crossed my path and shared part of it, certainly making it
better.
Finally, thanks to Gabriella, a constant beacon in my life, to my family,
always close despite the distance, and to the Lord, who has given me all of
this.

