Peak-Performance DFA-based String Matching on the Cell Processor
Daniele Paolo Scarpazza (1), Oreste Villa (1,2) and Fabrizio Petrini (1)
(1) Pacific Northwest National Laboratory, Computational & Information Sciences Division, Richland, WA 99352 USA
(2) Politecnico di Milano, Dipartimento di Elettronica e Informazione, Milano I-20133, Italy
{fabrizio.petrini, daniele.scarpazza}@pnl.gov, ovilla@elet.polimi.it
Abstract
The security of your data and of your network is in the
hands of intrusion detection systems, virus scanners and
spam ﬁlters, which are all critically based on string match-
ing. But network links are getting faster and faster, and
string matching is getting more and more diﬃcult to per-
form in real time. Traditional processors are not keeping up
with the performance demands, whereas specialized hard-
ware will never be able to compete with commodity hard-
ware in terms of cost eﬀectiveness, reusability and ease of
programming.
Advanced multi-core architectures like the IBM Cell
Broadband Engine promise unprecedented performance at
a low cost, thanks to their popularity and production vol-
ume. Nevertheless, the suitability of the Cell processor to
string matching has not been investigated so far.
In this paper we investigate the performance attainable
by the Cell processor when employed for string matching
algorithms based on Deterministic Finite-state Automata
(DFA). Our findings show that the Cell is an ideal candidate
to tackle modern security needs: two processing elements
alone, out of the eight available on one Cell processor,
provide sufficient computational power to filter a network
link with bit rates in excess of 10 Gbps.
1 Introduction
A Network Intrusion Detection System (NIDS) is an effective
way to provide security to systems connected to the network.
At the heart of a NIDS is a string matching algorithm, which
allows the system to make decisions based not only on the
packet headers, but also on the actual content of the data flow.

The research described in this paper was conducted under the Laboratory
Directed Research and Development Program for the Data Intensive
Computing Initiative at Pacific Northwest National Laboratory, a
multi-program national laboratory operated by Battelle for the
U.S. Department of Energy under Contract DEAC0576RL01830.
1-4244-0910-1/07/$20.00 Copyright © 2007 IEEE.

Most payload scanning applications have a common re-
quirement for string matching. For example, the presence
of a speciﬁc string of bytes can identify the presence of an
Internet worm or a malicious executable program. Because
the location of such strings in the packet payload and their
length are unknown, string matching algorithms must be able
to detect strings of different lengths starting at arbitrary
locations in the packet payload.
Packet inspection applications must be able to operate at
wire speed: with network performance quickly increasing
thanks to newer, more powerful versions of Gigabit Ethernet,
it is becoming increasingly difficult for software-based
solutions to keep up with the line rates. In addition to the
technological advances in network communication, we are
also experiencing an explosion in the size of the data dic-
tionaries that contain the strings that we need to match with
the incoming stream of data.
Several hardware-based techniques have been employed
for implementing packet inspection applications. In most
cases, special algorithms have been developed on Field
Programmable Gate Arrays (FPGAs) [8, 15], exploiting the
potentially high level of parallelism available on these devices.
A typical example is the implementation of Bloom ﬁlters
on FPGAs [7, 13, 14]. A Bloom ﬁlter [2] is a data struc-
ture that stores a set of signatures compactly by computing
multiple hash functions on each member of the input set.
This technique queries a database of strings to check for
the membership of a particular string. Another approach to
implement string matching algorithms is through the use of
Deterministic Finite State Automata (DFA). DFAs can
efficiently implement algorithms (such as Aho-Corasick [1])
which search for the strings in a given dictionary. If the
dictionary is expressed as a set of regular expressions rather
than a set of exact strings, efficient techniques exist to
generate a DFA which recognizes occurrences of all the regular
expressions at the same time [4]. The literature presents
numerous FPGA-based implementations of Aho-Corasick
search algorithms [5, 10, 16, 17], with different degrees of
performance and different dictionary sizes. Many other
efficient string search algorithms exist, such as
Knuth-Morris-Pratt [12], Boyer-Moore [3], Commentz-Walter [6],
Wu-Manber [18] and their numerous derivatives, but they are
not frequently adopted in security applications. In fact, they
are based on heuristics and their workload depends on the
input data, which makes them vulnerable to attacks based
on malicious input streams specifically designed to
overload them.
Meanwhile, the computing community is witnessing a
paradigm shift in processor design: gate densities, clock
frequencies and heat dissipation are all reaching their
physical limits, and the future promise of more computational
power resides in the availability of more and more
parallelism, through multiple cores on the same silicon die. The
IBM Cell Broadband Engine (Cell BE, for short) is the most
prominent member of the advanced multi-core processor
family. This class of architectures promises unprecedented
performance at a low cost, thanks to their popularity and
production volume. Because of their relatively low cost and
their significantly higher flexibility and ease of programming,
multi-core processors could be tough competitors
to FPGA-based solutions in many application domains
including network security, provided that they can deliver
a level of performance comparable with that of FPGAs.
Despite this performance potential, to the best of our
knowledge, no work in the literature has specifically tackled
the problem of optimally mapping DFA-based string matching
algorithms onto advanced multi-core processors. In this
paper, we analyze the performance attainable by the Cell BE
processor when performing string matching through a
DFA-based algorithm. Although we report experimental setup
and data referring to the Cell BE architecture, the overall
parallelization strategies we propose are general, and
applicable to any other advanced multi-core architecture, e.g.
the one announced by Intel in its TeraScale initiative,
which hosts 80 homogeneous processor cores on the same
silicon die.
Our experiments show that the Cell BE architecture
provides a considerable amount of performance: two of the
eight processing elements which compose a Cell processor
proved capable of filtering a network link with a bit rate
of 10 Gbps in real time. The Cell also provides a
remarkable degree of freedom in terms of the wide range of
configurations in which multiple processing elements can
be combined to reach higher performance or to
support larger dictionaries. In summary, the Cell architecture
proves to be an ideal candidate for the implementation
of DFA-based string search algorithms.
The remainder of this paper is organized as follows. In
Section 2 we review the peculiar architectural characteris-
tics of the Cell processor. Section 3 introduces the concept
of DFA tile, which maps a DFA string acceptor to a process-
ing element in the Cell BE, and Section 4 presents its imple-
mentation and discusses its performance. In Section 5 we
show how tiles can be combined to increase performance
and dictionary size. In Section 6 we present a technique
which allows the use of arbitrarily large dictionaries at the
price of a smooth degradation in performance. Section 7
concludes the paper.
2 Overview of the Cell BE Processor
This section discusses peculiar characteristics of the
IBM Cell BE which are relevant in this context. These char-
acteristics determine the way in which programmers must
write software if they want to exploit the computational
power available on the Cell.
The Cell BE [9] is a heterogeneous, multi-core chip ca-
pable of massive ﬂoating point processing, optimized for
compute-intensive workloads and broadband, rich media
applications. It is composed of one 64-bit Power Proces-
sor Element (PPE), 8 specialized co-processors called Syn-
ergistic Processing Elements (SPE), a high-speed memory
controller and a high-bandwidth bus interface.
The PPE is responsible for running the operating sys-
tem and coordinating the SPEs. It is a traditional 64-bit
PowerPC processor core with a VMX unit, 32 KB Level 1
instruction cache, 32 KB Level 1 data cache, and 512 KB
Level 2 cache. The PPE is a dual issue, in-order execution
design, 2-way SMT, running at 3.2 GHz.
Each of the 8 SPEs contains a Synergistic Processing
Unit (SPU), a memory flow controller, a memory management
unit, a bus interface and an atomic unit for synchronization
mechanisms. The SPU is a RISC-style processor
with SIMD instructions and a large number (128) of wide
registers (128 bit). The large number of registers facilitates
efficient instruction scheduling and loop unrolling. To fully
exploit the computational potential of the SPUs, the
programmer must write explicit data-level parallel code by
using an extended C syntax with intrinsics. Scalar code should
be avoided: a scalar operation is compiled as a SIMD
operation of which only a slice of the output is used, so it
costs as much as the full SIMD operation even though it
produces only a portion of its output.

Figure 1. The Element Interconnect Bus. (Diagram: the PPE,
SPE0 to SPE7, the MIC, the BIF and the IOIF0 and IOIF1
interfaces, connected through the bus and its data arbiter.)

SPU instructions cannot access the main memory di-
rectly. Rather, they access a 256 kbyte local store (LS)
memory, which holds both instructions and data. The lo-
cal store is in fact a software-controlled scratchpad, and it
is the programmer’s responsibility to manage its contents
by transferring data from main memory to the LS and back
via explicit DMA commands. Program execution continues
unaﬀected in the SPU while a DMA transfer is in progress.
This allows the programmer to overlap computation and
data transfer steps in such a way that the transfer latency
is completely hidden (e.g. via double-buﬀering techniques,
as we detail in Section 6).
The PPE and SPEs communicate through an internal
high-speed Element Interconnect Bus (EIB) (see Figure 1),
which is the heart of the Cell processor’s communication
architecture. It has separate communication paths for com-
mands and data, and the data network consists of four 16
byte-wide data rings, two of which run clockwise and the
other two counter-clockwise. The EIB operates at half the
processor clock speed, and exhibits a peak bandwidth equal
to 204.8 Gbyte/s [11].
This speed is attainable only by intra-chip data transfers.
For transfers involving main memory, the peak bandwidth is
25.6 Gbyte/s. The actual achieved data bandwidth depends
on a number of factors: mutual alignment of the source and
destination addresses, interference with transfers already in
progress, the number of Cell BE chips in the system,
direction of the transfer, the efficiency of the data arbiter and,
above all, the size of the transferred block. The programmer
must transfer data in blocks large enough to amortize
the bus negotiation overhead. Figure 2 shows that, in
conditions of heavy traffic, bandwidth values close to the peak
can be reached only when the transferred blocks are 256
bytes or larger. In practice, programs should access the
main memory at medium-to-large granularities only.

Figure 2. Available aggregate bandwidth from
the main memory to the SPEs, at varying
block size. (Plot: aggregate memory bandwidth in Gbyte/s,
from 0 to 25, against the number of Synergistic Processing
Elements, from 1 to 8, with one curve for each block size:
64 bytes, 128 bytes, 256 bytes, 512 bytes and larger.)

3 Using Deterministic Finite Automata as
String Acceptors
String matching is the core of network security systems
like intrusion detectors, deep-inspection filters, spam filters
and on-line virus scanners. A number of exact string matching
algorithms are based on a Deterministic Finite Automaton
(DFA) used as a language acceptor. DFA-based algorithms
perform one state transition per input symbol
consumed, and enter one of the final states when the input
matches one of the words in the dictionary. DFA-based
algorithms are very common in security applications, because
their workload is content-independent, which makes them
immune to overload attacks based on malicious contents.
A ﬁnite state acceptor is a quintuple (Σ,S, s0,δ,F),
where Σ is the input alphabet (a ﬁnite non empty set of
symbols), S is a ﬁnite non empty set of states, s0 ∈ S is
the initial state, δ is the state transition function and F is
the set of ﬁnal states, a (possibly empty) subset of S. Given
a current state and an input symbol, the transition function
yields a new state: δ : S × Σ → S.
We consider each byte in the input stream as one input
symbol. The DFA reads in the input one symbol at a time
and performs a state transition according to the current state
and the value of the input. If the destination state s is ﬁnal
(s ∈ F), then the current string is recognized. The acceptor
enters a ﬁnal state whenever a portion of the stream matches
a word in the dictionary. In network security, this usually
marks the detection of malicious contents. Therefore, DFAs
should transit across non-final states the vast majority of
the time, whereas each final state should be associated with a
type of malicious content. Consequently, non-final to non-final
state transitions should be considered the steady state of the
system, and the case to optimize.
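
To make the acceptor concrete, the following is a minimal sketch in Python of a DFA used as a dictionary matcher. It is an illustration only, not the implementation measured in this paper: the construction shown is the classic Aho-Corasick automaton [1] expanded into a complete transition table, and the names build_dfa and count_matches are hypothetical.

```python
import string
from collections import deque

def build_dfa(words, alphabet):
    """Build a complete transition table delta[state][symbol] and the set
    of final states for a dictionary of words (Aho-Corasick construction)."""
    goto = [{}]                      # trie edges, state 0 is the initial state
    final = set()
    for w in words:
        s = 0
        for c in w:
            if c not in goto[s]:
                goto.append({})
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        final.add(s)                 # a dictionary word ends in this state

    fail = [0] * len(goto)
    delta = [dict() for _ in goto]
    for c in alphabet:
        delta[0][c] = goto[0].get(c, 0)
    queue = deque(goto[0].values())  # breadth-first visit of the trie
    while queue:
        s = queue.popleft()
        if fail[s] in final:         # a shorter word ends here as a suffix
            final.add(s)
        for c in alphabet:
            if c in goto[s]:
                t = goto[s][c]
                fail[t] = delta[fail[s]][c]
                delta[s][c] = t
                queue.append(t)
            else:
                delta[s][c] = delta[fail[s]][c]
    return delta, final

def count_matches(delta, final, stream):
    """One state transition per input symbol; count entries into final states."""
    state, hits = 0, 0
    for c in stream:
        state = delta[state][c]
        if state in final:
            hits += 1
    return hits

delta, final = build_dfa(["HE", "SHE", "HIS", "HERS"], string.ascii_uppercase)
print(count_matches(delta, final, "USHERS"))
```

Note that the loop in count_matches performs exactly one table lookup per input symbol regardless of the input contents, which is the content-independence property discussed above.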
We express and compare the performance of different
implementations of string matching algorithms in terms of
throughput, i.e. the number of input symbols which are
processed in the unit of time, which also measures the
maximum number of state transitions possible in the unit of
time. This quantity is consistent across implementations
with different internal degrees of parallelism, which
process a different number of streams concurrently. We express
throughput values in terms of billion bits processed per
second (Gbps) when we compare filtering throughput against
the bitrate of a network link.

Implementation Version                        1       2       3       4       5
SIMD Vectorization                           No     Yes     Yes     Yes     Yes
Loop Unroll Factor                            –       –       2       3       4
Total cycles per action                  311316  123976   90200   82182   91833
State transitions (= input block size)    16384   16384   16384   16416   16384
Clock cycles per DFA transition           19.00    7.57    5.51    5.01    5.61
Throughput (M transitions/s)             168.41  422.89  581.25  639.21  570.91
Throughput (Gbps)                          1.35    3.38    4.65    5.11    4.57
Average CPI                                2.60    0.67    0.63    0.64    0.62
Dual issue %                                0.0    43.8    48.3    48.7    48.6
Stall %                                    63.2     7.4     0.0     0.0     0.6
Registers used                                4      40      81     124   spill
Speedup                                    1.00    2.51    3.45    3.79    3.39
Table 1. The highest performance is obtained with SIMDization and accurate loop unrolling.

                                         Case 1  Case 2  Case 3
DFA state transition table                190 k   206 k   214 k
  States (32 input symbols each)           1520    1648    1712
Input buffer 0                             16 k     8 k     4 k
Input buffer 1                             16 k     8 k     4 k
Code and stack                             34 k    34 k    34 k
Total (size of the local store)           256 k   256 k   256 k
Figure 3. SPE local store usage in our
proposed implementation.

For the moment, we focus on a single SPE at a time. We
call a DFA tile the implementation of a DFA acceptor
realized on a single SPE, with a state transition table which fits
in the local store of that SPE. A DFA tile alone can
perform a string search against a dictionary compatible with the
limit just introduced, at a given speed. If a larger dictionary
is desired, or higher performance is needed, multiple DFA
tiles can be combined, as explained later.
4 Mapping DFA Acceptors onto the Cell BE:
Implementation and Results
We now derive the maximum performance attainable by
a single DFA tile. All the experiments and the measurements
refer to implementations of the algorithm in the C
language, as described in this section, written using the
Cell BE intrinsics and language extensions, and compiled with
GNU GCC version 4.0.2. We ran the experiments on a
pre-production IBM DD3 Cell Blade. Profiling data come
from the full-system simulator provided with the IBM Cell
BE SDK version 1.1.
First, we describe the data structures we have chosen to
represent the DFA. All our design choices are aimed at making
the algorithm fit the hardware characteristics of the SPE
as closely as possible, thus maximizing its performance.
Our optimized representation of the DFA is as follows.
We represent the State Transition Table (STT)
as a complete table of words,
having a row for each state and a column for each of the
possible inputs. We represent the current state in the form
of a pointer to the table row corresponding to that state.
Experimental results show that a realistic upper bound for the
number of states of a tile is between 1520 and 1712 (see
Figure 3).
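
The state counts reported in Figure 3 can be reproduced with simple arithmetic on the local store budget. The sketch below is an illustration under one assumption that is not stated explicitly in the text: each STT entry is a 4-byte word, which is consistent with the table sizes given in the figure. The function name max_states is hypothetical.

```python
LOCAL_STORE = 256 * 1024      # bytes of local store per SPE
CODE_AND_STACK = 34 * 1024    # bytes reserved for code and stack (Figure 3)
SYMBOLS = 32                  # columns of the STT
ENTRY_BYTES = 4               # assumption: one 32-bit word per STT entry

def max_states(input_buffer_bytes, n_buffers=2):
    """Number of STT rows that fit next to the double input buffers."""
    stt_bytes = LOCAL_STORE - CODE_AND_STACK - n_buffers * input_buffer_bytes
    return stt_bytes // (SYMBOLS * ENTRY_BYTES)

for kbytes in (16, 8, 4):
    print(kbytes, "kbyte buffers:", max_states(kbytes * 1024), "states")
```

Smaller input buffers leave more room for the table, which is the trade-off among the three cases of Figure 3.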
In accordance with the requirements of most security ap-
plications whose ﬁlters do not need to be case-sensitive, we
consider a restrained input range, containing 32 rather than
256 input symbol choices. For the sake of reduced mem-
ory footprint, we assume that an appropriate data-reduction
strategy has been employed to fold the 0-255 character
range into a smaller interval, e.g. the 32 values from 0x40
to 0x5F, which comprise the uppercase Latin alphabet plus
six other characters. Such a data reduction can be
implemented trivially and inexpensively.
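
One possible data-reduction strategy is sketched below in Python. It is only an assumption made for illustration, since the text does not prescribe a specific folding: keeping the five low-order bits of each byte makes lowercase and uppercase letters coincide, and setting bit 0x40 places the result in the interval from 0x40 to 0x5F. The names fold and to_symbol are hypothetical.

```python
def fold(byte):
    """Fold a byte (0-255) into the interval 0x40-0x5F (assumed scheme)."""
    return 0x40 | (byte & 0x1F)

def to_symbol(byte):
    """Column index (0-31) into the state transition table."""
    return fold(byte) - 0x40

print(chr(fold(ord("a"))), chr(fold(ord("A"))), to_symbol(ord("z")))
```

A folding this simple also maps unrelated bytes onto letters (for example, 0x21 folds onto 'A'), so a real filter would need to confirm candidate matches or use a more selective reduction.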
Appropriate data alignment allows us to reduce the
computation associated with a state transition, and to compact
the STT representation. In detail, we allocate the STT at
an aligned location and choose an input set width which is
a power of two. So, we can represent states by pointers to
STT rows, and the last bits in these pointers are zero.
Therefore, these last bits can be used to encode whether the next
state is final, plus other small output values if needed.
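
The encoding can be illustrated with plain integer arithmetic. The following Python sketch assumes 4-byte STT entries and a hypothetical base address STT_BASE; the two masks are the ones that appear in Figure 4. All function names are illustrative.

```python
SYMBOLS = 32
ENTRY_BYTES = 4                      # assumption: 32-bit STT entries
ROW_BYTES = SYMBOLS * ENTRY_BYTES    # 128 bytes, a power of two
STT_BASE = 0x10000                   # hypothetical base, aligned to ROW_BYTES

def row_pointer(state):
    """Pointer to the STT row of a state; its low 7 bits are always zero."""
    return STT_BASE + state * ROW_BYTES

def encode(next_state, is_final):
    """STT entry: pointer to the next row, with the final flag in bit 0."""
    return row_pointer(next_state) | int(is_final)

def decode(entry):
    """Split an STT entry into (next state pointer, final flag)."""
    return entry & 0xFFFFFFFE, entry & 0x00000001

def cell_address(state_pointer, symbol):
    """Address of the entry to load: one add, no multiplication by row size."""
    return state_pointer + symbol * ENTRY_BYTES

entry = encode(5, True)
print(hex(entry), decode(entry))
```

Because the state is already a row pointer, a transition needs only an offset computation, an add, a load and two masks.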
We consider a bare-bone, purely sequential implemen-
tation of a DFA acceptor which complies with the above
design choices, and counts the number of occurrences of
dictionary entries in the given block of input data. This is
suﬃcient in a large majority of application contexts, where
a single match is suﬃcient to trigger further checks or to
discard a network packet. This corresponds to “Implementation
Version 1” in Table 1. This code does not efficiently
exploit the hardware features of the Cell BE: it does not
use SIMD (single instruction multiple data) instructions and
it uses only 4 out of the 127 available registers. An SPE
contains two distinct pipelines which are able to issue two
instructions at the same time, provided that they do not
conflict. This implementation is unable to exploit this feature.
As a result, the number of clock cycles per instruction (CPI)
is high, 2.6, and the throughput is low, 1.35 Gbps.
We obtained a more eﬃcient implementation by exploit-
ing SIMD instructions to process in parallel the 16 bytes
which compose a 128-bit word. A SIMD-ized implementa-
tion which processes 16 streams in parallel is 2.51 times
faster than a sequential implementation which processes
one stream (see Implementation Version 2 in Table 1). This
implementation maintains 16 distinct DFAs, each working
on a distinct input stream but sharing the same STT. The
input streams are interleaved such that each quadword of
the input (128 bit, 16 byte) contains at its i-th position a byte
from the i-th stream, which will be processed by the i-th
DFA in the tile. Stream interleaving is a reasonably
inexpensive operation, and can actually be mapped onto the PPE,
thus leaving all the 8 SPEs in a Cell BE available for other
tasks.
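
The interleaved layout and the lock-step advance of the 16 DFAs can be sketched as follows. This Python fragment only models the data organization; it does not use SIMD instructions, and the names interleave, step_all and run_tile are hypothetical. The transition table is assumed to be a list of rows indexed by symbols in the range 0 to 31.

```python
LANES = 16  # bytes in a 128-bit quadword, one per input stream

def interleave(streams):
    """Quadword k holds, at position i, the k-th byte of stream i."""
    assert len(streams) == LANES
    assert len({len(s) for s in streams}) == 1, "streams must have equal length"
    return [bytes(s[k] for s in streams) for k in range(len(streams[0]))]

def step_all(delta, states, quadword):
    """Advance the 16 DFAs by one symbol each; all share the same table."""
    return [delta[s][c] for s, c in zip(states, quadword)]

def run_tile(delta, final, streams):
    """Count, for each stream, how many times its DFA enters a final state."""
    states = [0] * LANES
    hits = [0] * LANES
    for quadword in interleave(streams):
        states = step_all(delta, states, quadword)
        hits = [h + (s in final) for h, s in zip(hits, states)]
    return hits

# Toy two-state table: state 1 is entered whenever symbol 3 is read.
toy_delta = [[1 if c == 3 else 0 for c in range(32)] for _ in range(2)]
toy_streams = [bytes([i] * 3) for i in range(LANES)]
print(run_tile(toy_delta, {1}, toy_streams))
```

In the toy run only stream 3 contains the symbol 3, so only the fourth counter is non-zero.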
The data-flow structure of this implementation is given
in Figure 4. This implementation makes better use of the
registers (40 out of 127) and of the two pipelines (43.8% of the
issues are dual), reaching a lower CPI of 0.67. The compiler is
still unable to schedule instructions in such a way that no
stalls occur (7.4% of the cycles are dependency stalls). To
help the compiler remove these stalls, we manually unroll
instances of this data flow. We have unrolled it a different
number of times to determine the optimal configuration (see
Implementation Versions 3–5 in Table 1). The optimal
unroll factor is 3 (Implementation Version 4), which ensures
almost complete register utilization (124 out of 127)
without spills, a high dual issue rate and no stalls. To the best of
our knowledge, this leads to the highest possible throughput
attainable by a single DFA tile, which is 5.11 Gbps.

Figure 4. An optimal implementation of a DFA
acceptor. The diagram shows which operations
are SIMDized and which ones are not. (Data flow: a quadword
of 16 input characters from the 16 interleaved input streams is
turned, with SIMD shifts, into 16 offsets to the transition
table cells; 16 SISD adds combine these offsets with the current
state pointers of the 16 DFAs, giving the addresses of the cells
containing the next state pointers; 16 loads (and shuffles)
fetch those cells from the State Transition Table; two groups
of 16 SISD ands, with masks 0xFFFFFFFE and 0x00000001, extract
the next state pointers and the final state flags for the
16 DFAs.)

Figure 5. We hide the cost of data transfers
by overlapping them with computation. (Timeline: the
computation track processes buffer 0 and buffer 1 alternately,
25.64 µs each; meanwhile, the data transfer track loads the
other buffer, 5.94 µs per load.)

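
The throughput rows of Table 1 follow from the measured cycle counts and the 3.2 GHz clock, with one input byte (8 bits) consumed per state transition. The short Python sketch below redoes this arithmetic for two of the implementation versions; the function name throughput is hypothetical.

```python
CLOCK_HZ = 3.2e9  # SPE clock frequency

def throughput(total_cycles, transitions):
    """Return (cycles per transition, M transitions/s, Gbps)."""
    cycles_per_transition = total_cycles / transitions
    transitions_per_s = CLOCK_HZ / cycles_per_transition
    gbps = transitions_per_s * 8 / 1e9   # one byte per transition
    return cycles_per_transition, transitions_per_s / 1e6, gbps

for version, cycles, n in ((1, 311316, 16384), (4, 82182, 16416)):
    cpt, mtps, gbps = throughput(cycles, n)
    print(f"version {version}: {cpt:.2f} cycles, {mtps:.2f} M/s, {gbps:.2f} Gbps")
```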
For simplicity, so far we have not considered the
overhead due to transferring new blocks of the input streams into
the tile, i.e. into the SPE’s local store. We now show that
the latency associated with this transfer can be completely
hidden by overlapping computation and communication. To
do so, we adopt a double buffering technique for the input
stream blocks, and transfer buffers with DMA commands.
Computation can continue unaffected while DMA transfers
are in progress. With the block sizes considered before
(4–16 kbyte), computation always takes more time than
data transfer. In the worst traffic conditions on the memory
bus (all the SPEs accessing the memory at the same
time), the available aggregate memory bandwidth is 22.05
Gbyte/s (see Figure 2), i.e. 2.76 Gbyte/s per SPE. Therefore,
the time required to transfer a block of 16 kbyte is 5.94
µs, while the time required to process it is 25.64 µs. This
leads to the schedule depicted in Figure 5: computation and
data transfer can be overlapped in such a way that the cost
of all data transfers (except the first one) is completely
hidden. Values in the figure refer to an input data block size of
16 kbyte. The same considerations hold even when smaller
block sizes are chosen, down to 512 bytes.
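
The timing argument can be checked numerically. The Python sketch below is a simplified model, not a measurement: it assumes the worst-case per-SPE bandwidth quoted above and the processing rate of Implementation Version 4 in Table 1, and the function names are hypothetical.

```python
AGGREGATE_BW = 22.05e9           # bytes/s, worst case with 8 SPEs active
PER_SPE_BW = AGGREGATE_BW / 8    # about 2.76 Gbyte/s
TRANSITIONS_PER_S = 639.21e6     # bytes processed per second by one tile

def transfer_us(nbytes):
    """Time to DMA one input block into the local store, in microseconds."""
    return nbytes / PER_SPE_BW * 1e6

def process_us(nbytes):
    """Time to run the DFA over one input block, in microseconds."""
    return nbytes / TRANSITIONS_PER_S * 1e6

def pipeline_us(n_blocks, nbytes):
    """Double buffering: only the first transfer is exposed; afterwards each
    step costs the longer of processing and the overlapped transfer."""
    step = max(process_us(nbytes), transfer_us(nbytes))
    return transfer_us(nbytes) + n_blocks * step

block = 16 * 1024
print(f"transfer {transfer_us(block):.2f} us, process {process_us(block):.2f} us")
print(f"100 blocks: {pipeline_us(100, block):.1f} us")
```

Under this model the transfer takes less than a quarter of the processing time, so the tile never waits for input once the first block has arrived.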
Figure 6. Composition “in series” and “in parallel”
of DFA tiles. (Diagram: (a) DFA tiles “in parallel”: two SPEs,
each hosting 1 DFA with 1.5k states at 5.11 Gbps and the same
STT 1, process input streams 1/2 and 2/2 respectively; together
they act as 2 DFAs x 1.5k states at 10.22 Gbps. (b) DFAs “in
series”: two SPEs, each hosting 1 DFA with 1.5k states at
5.11 Gbps, with STT 1 and STT 2 respectively, process the same
input streams; together they act as 1 DFA x ~3k states at
5.11 Gbps.)
5 Composing DFA Tiles to Increase Perfor-
mance and Dictionary Size
In the previous section we have determined the maximum
throughput attainable by a single DFA tile. But,
depending on the application, the functionality offered by a
single tile may not be sufficient, either in terms of throughput,
or in terms of dictionary size, or both. To address these
needs, a system composed of multiple tiles can be
implemented.
When more throughput is needed, multiple identical tiles
can be used concurrently on distinct portions of the input
streams, with a minor overlap. We say that two DFAs
combined in this way are “in parallel”, by analogy with
an electric circuit (see Figure 6(a)). The two tiles have an
identical STT and therefore recognize the same dictionary.
Since they operate on two portions of the input streams
which are separate (except for a small overlapping region,
to allow matching of strings which cross the boundary), the
combined throughput is effectively doubled.
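
The role of the overlapping region can be shown with a small Python sketch. It rests on an assumption that the text does not spell out: an overlap of one byte less than the longest dictionary word is enough for every boundary-crossing occurrence to be seen by exactly one tile. A naive matcher stands in for the DFA tile, and all names are hypothetical.

```python
def find_all(text, words):
    """Naive reference matcher: sorted (position, word) pairs."""
    return sorted((i, w) for w in words
                  for i in range(len(text) - len(w) + 1)
                  if text.startswith(w, i))

def parallel_find(text, words, n_tiles):
    """Split the input among n_tiles identical matchers with a small overlap."""
    overlap = max(len(w) for w in words) - 1   # assumed overlap size
    chunk = max(1, -(-len(text) // n_tiles))   # ceiling division
    hits = []
    for start in range(0, len(text), chunk):
        piece = text[start:start + chunk + overlap]
        for i, w in find_all(piece, words):
            if i < chunk:        # matches starting later belong to the next tile
                hits.append((start + i, w))
    return sorted(hits)

text = "ABCDEFGHIJ"
words = ["DEFG", "FG", "IJ"]
print(parallel_find(text, words, 2) == find_all(text, words))
```

With two tiles the boundary falls after the fifth byte, so "DEFG" straddles it and is found only thanks to the overlap.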
The use of multiple tiles in a parallel configuration
allows designers to effectively multiply the throughput. This
is possible because string matching is an “embarrassingly
parallel” problem, which does not need communication
among processing elements, thus allowing for the use of an
arbitrarily large number of processors without incurring
congestion issues. Mapping a DFA tile to each of the 8 SPEs
in a Cell BE leads to a performance limit of 5.11 × 8 =
40.88 Gbps attainable by a single Cell BE processor, under
the assumption that stream interleaving is performed by the
PPE, and the remaining computational power of the PPE is
sufficient to carry out the accessory tasks that the specific
application demands. When even higher performance is
desired, multiple Cell processors can be used in parallel;
e.g. a Cell Blade hosting two processors can reach 81.76
Gbps.

Figure 7. Example of a mixed series/parallel
tile configuration. (Diagram: two groups of four SPEs; the
first group processes input streams 1/2 and the second group
processes input streams 2/2. Within each group the four SPEs
hold STT 1, STT 2, STT 3 and STT 4 respectively; each SPE hosts
1 DFA with 1.5k states at 5.11 Gbps.)
On the other hand, a state transition table of approxi-
mately 1500 states may not be suﬃcient to recognize the
desired dictionary. To overcome this limitation, multiple
DFA tiles can be combined in a series conﬁguration, as in
Figure 6(b). In this conﬁguration, the tiles operate on the
same input data at a given time, but each tile has a distinct
STT. Each STT corresponds to just a portion of the dictio-
nary.
To improve throughput and dictionary size at the same
time, an application designer can adopt a mixed series/parallel
configuration, in which groups of tiles operate on
distinct portions of the input, while the tiles within a group
operate on the same input, each with a distinct STT, as
depicted in Figure 7. The configuration in the figure
corresponds to an overall throughput equal
to 10.22 Gbps, and a dictionary size which is roughly four
times larger than the one which fits in a single tile.
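
The effect of a series/parallel composition on throughput and dictionary capacity reduces to two multiplications, sketched below in Python. The per-tile figures are the ones reported in this paper; the function name compose and the returned field names are hypothetical.

```python
TILE_GBPS = 5.11      # throughput of a single DFA tile
TILE_STATES = 1500    # approximate state capacity of a single DFA tile

def compose(parallel_groups, tiles_in_series):
    """Parallel groups multiply throughput; tiles in series multiply states."""
    return {
        "spes": parallel_groups * tiles_in_series,
        "throughput_gbps": round(TILE_GBPS * parallel_groups, 2),
        "states": TILE_STATES * tiles_in_series,
    }

print(compose(2, 4))   # the configuration of Figure 7
print(compose(8, 1))   # eight tiles in parallel on one Cell BE
```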
6 Achieving Arbitrary Dictionary Size
through Dynamic STT Replacement
If the space available for the state transition tables (STT)
in the local storage of all the SPEs is not suﬃcient for the
desired application, a diﬀerent approach can be employed,
which allows us to support virtually unlimited dictionary
sizes, at the price of a smooth degradation in performance.
We call this approach dynamic STT replacement. The
dictionary is partitioned into smaller subsets, each
corresponding to an STT half as large as the one considered
before, i.e. approximately 100 kbytes, which roughly
corresponds to 800 states. Each SPE now contains two STTs,
which are managed in a double-buffering fashion. While
one STT is used to filter the input, the other is being loaded
with the next STT portion from main memory. The chain of
DMA transfers required to load the STTs from main memory
can be orchestrated to happen in the data-transfer idle
time of the schedule of Figure 5. Each tile is assigned a
number of STTs, and it will filter the input streams against
each of those STTs, loading them cyclically one after the
other.
In a conﬁguration where an STT occupies 95 kbytes, a
complete STT load can happen every two scheduling pe-
riods, as Figure 8 shows. We are assuming the steady-
state condition in which the ﬁrst STT is already loaded
at the beginning of the periods shown. In the general
case, if the dictionary needed by the application requires
n STTs, each SPE can now provide an effective throughput
of 5.11/(2(n − 1)) Gbps. The n − 1 term is due to the fact
that, except for the very first period, an entire cycle through
all the STTs requires n − 1 transfers, rather than n. Figure 9
shows how the throughput provided by this approach varies
when the number of STTs grows.
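
The schedule can be checked with the same bandwidth figures used for the input buffers. The Python sketch below is illustrative: it assumes the worst-case per-SPE bandwidth of 22.05/8 Gbyte/s, splits a 95 kbyte STT into chunks of 48 and 47 kbytes as in Figure 8, and evaluates the throughput expression given above, which is meaningful for n ≥ 2. Function names are hypothetical.

```python
PER_SPE_BW = 22.05e9 / 8     # bytes/s, worst-case bandwidth per SPE
PERIOD_US = 25.64            # time to process one 16 kbyte input buffer
INPUT_LOAD_US = 5.94         # time to load one 16 kbyte input buffer
TILE_GBPS = 5.11

def transfer_us(kbytes):
    """DMA time for a block of the given size, in microseconds."""
    return kbytes * 1024 / PER_SPE_BW * 1e6

def fits_in_period(chunk_kbytes):
    """True if an STT chunk plus one input load fit in one processing period."""
    return transfer_us(chunk_kbytes) + INPUT_LOAD_US <= PERIOD_US

def effective_gbps(n_stts):
    """Throughput of one SPE cycling through n_stts tables (n_stts >= 2)."""
    return TILE_GBPS / (2 * (n_stts - 1))

print(f"{transfer_us(48):.2f} us, {transfer_us(47):.2f} us")
print(fits_in_period(48), fits_in_period(47))
print([round(effective_gbps(n), 3) for n in (2, 3, 5)])
```

Each half-table transfer plus an input load takes under 24 µs, which is why a complete STT can be replaced every two scheduling periods.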
Figure 8. Schedule of a dynamic state transition
table (STT) replacement. (Timeline: the computation track
processes buffer 0 and buffer 1 alternately, 25.64 µs each,
matching both buffers against STT 0, then both against STT 1,
then again against STT 0. During each period the data transfer
track loads the input for the other buffer (5.94 µs) and one
chunk of the next STT into the STT slot not in use: chunk 1/2,
48 kbyte, 17.83 µs; chunk 2/2, 47 kbyte, 17.46 µs.)

Figure 9. Throughput provided by dynamic
STT replacement, when a variable number (1,
2, 4, 8) of tiles are employed. (Plot: throughput in Gbps,
from 0 to 45, against the aggregate state transition table
size in kbytes, from 0 to 600, with one curve for each number
of SPEs used: 1, 2, 4 and 8.)
7 Conclusions and Future Work
We have analyzed the suitability of the Cell BE architecture
when employed to perform string matching of a stream
of characters against a given dictionary. We have identified
a modular component, the DFA tile, which represents the
optimal mapping of a finite state acceptor on a Cell SPE.
A DFA tile can process input streams at a throughput up
to 5.11 Gbps, with a state space comprising approximately
1500 states. We have shown how multiple tiles can be flexibly
combined to achieve higher performance or to accept
arbitrarily sized dictionaries. In summary, the Cell BE proves
to be a fast and flexible architecture when applied in this
domain.
Further work in this direction is under way, including an
exploration of the potential of the Cell BE when implementing
probabilistic string matching algorithms like Bloom filters.
References
[1] A. V. Aho and M. J. Corasick. Eﬃcient string matching: an
aid to bibliographic search. Communications of the ACM,
18(6):333–340, 1975.
[2] B. H. Bloom. Space/time trade-oﬀs in hash coding with al-
lowable errors. Commun. ACM, 13(7):422–426, 1970.
[3] R. S. Boyer and J. S. Moore. A fast string searching algo-
rithm. Communications of the ACM, 20(10):62–72, 1977.
[4] C. Chang and R. Paige. From regular expressions to DFAs
using compressed NFAs. In Proc. CPM ’92, A. Apostolico,
M. Crochemore, Z. Galil, and U. Manber, editors,
Lecture Notes in Computer Science, No. 644, pages 88–108.
Springer-Verlag, 1992.
[5] Y. H. Cho and W. H. Mangione-Smith. Deep packet ﬁl-
ter with dedicated logic and read only memories. In Field-
Programmable Custom Computing Machines, 2004. FCCM
2004. 12th Annual IEEE Symposium on, pages 125–134,
April 2004.
[6] B. Commentz-Walter. A string matching algorithm fast on
the average. In Proceedings of the 6th Colloquium, on Au-
tomata, Languages and Programming, pages 118–132, Lon-
don, UK, 1979. Springer-Verlag.
[7] S. Dharmapurikar, P. Krishnamurthy, T. S. Sproull, and J. W.
Lockwood. Deep packet inspection using parallel Bloom
ﬁlters. IEEE Micro, 24(1):52–61, 2004.
[8] B. L. Hutchings, R. Franklin, and D. Carver. Assisting
network intrusion detection with reconfigurable hardware. In
FCCM ’02: Proceedings of the 10th Annual IEEE Symposium
on Field-Programmable Custom Computing Machines,
page 111, Washington, DC, USA, 2002. IEEE Computer
Society.
[9] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R.
Maeurer, and D. Shippy. Introduction to the Cell Multipro-
cessor. IBM Journal of Research and Development, pages
589–604, July/September 2005.
[10] T. Katashita, A. Maeda, K. Toda, and Y. Yamaguchi. Highly
eﬃcient string matching circuit for IDS with FPGA. Pro-
ceedings of the 14th Annual IEEE Symposium on Field-
Programmable Custom Computing Machines (FCCM’06),
0:285–286, 2006.
[11] M. Kistler, M. Perrone, and F. Petrini. Cell Processor Inter-
connection Network: Built for Speed. IEEE Micro, 25(3),
May/June 2006.
[12] D. E. Knuth, J. Morris, and V. R. Pratt. Fast pattern matching
in strings. SIAM Journal on Computing, 6(2):323–350, 1977.
[13] J. W. Lockwood, N. Naufel, J. S. Turner, and D. E. Tay-
lor. Reprogrammable network packet processing on the ﬁeld
programmable port extender (FPX). In Proceedings of the
ACM Intl. Symposium on Field Programmable Gate Arrays
(FPGA 2001), pages 87–93, 2001.
[14] J. Moscola, J. Lockwood, R. Loui, and M. Pachos. Im-
plementation of a Content-Scanning Module for an Inter-
net Firewall. In Proceedings of IEEE Symposium on Field-
Programmable Custom Computing Machines (FCCM),
pages 31–38, Napa, CA, USA, Apr. 2003.
[15] R. Sidhu and V. Prasanna. Fast regular expression match-
ing using FPGAs. In Proceedings of the IEEE Sympo-
sium on Field-Programmable Custom Computing Machines
(FCCM01), April 2001.
[16] I. Sourdis and D. Pnevmatikatos. Fast, large-scale string
match for a 10 Gbps FPGA-based network intrusion. In
Proceedings 13th Conference on Field Programmable Logic
and Applications., September 2003.
[17] Y. Sugawara, M. Inaba, and K. Hiraki. Over 10 Gbps
string matching mechanism for multi-stream packet scan-
ning systems. Lecture Notes in Computer Science, Field
Programmable Logic and Application, 3203/2004:484–493,
2004.
[18] S. Wu and U. Manber. A fast algorithm for multi-pattern
searching. Technical Report TR-94-17, Department of
Computer Science, University of Arizona, May 1994.