Abstract-With its 9 cores per chip, the IBM Cell/Broadband Engine (Cell) can deliver an impressive amount of compute power and benefit the string-matching kernels of network security, business analytics and natural language processing applications. However, the available amount of main memory on the system limits the maximum size of the dictionary supported by the string matching solution.
I. INTRODUCTION
The evolution of "Web 2.0" applications and business analytics applications is driving an increasingly prevalent production and use of unstructured data. For example, Natural Language Processing (NLP) applications can determine the language in which a document is written. E-mail web applications extract semantically tagged information (dates, places, delivery tracking numbers, etc.) from messages. Business analytics applications can automatically detect business events like the merger of two companies.
In these applications and many others, it is crucial to process huge amounts of sequential text to extract matches against a predetermined set of strings (the dictionary). Arguably, the most popular way to perform this exact, multi-pattern string matching task is the Aho-Corasick [1] (AC) algorithm. However, AC, especially in its optimized form based on a Deterministic Finite Automaton (DFA), is not space-efficient. In fact, the state-transition table that its DFAs use can be highly redundant. Uncompressed DFAs have a low transition cost (and therefore a high throughput) but also a large footprint and, consequently, a low dictionary capacity per unit of memory. For example, a dictionary of 200,000 patterns with average length 15 bytes occupies 1 Gbyte of memory when encoded for an uncompressed AC DFA. Low space efficiency limits the algorithm's applicability to domains that require very large dictionaries, like automatic language identification, which employs dictionaries with millions of entries, coming from hundreds of distinct natural languages.
In this paper, we address precisely this space inefficiency by exploring a variant of AC that employs compressed paths. Our algorithm is inspired by those proposed by Tuck et al. [19] and Zha and Sahni [28]. These algorithms are based on the Non-deterministic Finite Automaton (NFA) version of AC, and achieve significant memory reduction.
We choose an established multi-core architecture, the IBM Cell/Broadband Engine (Cell) for our work because it is a prominent architecture in the high-performance computing community, it has shown potential in string matching applications, and it presents software designers with non-trivial challenges that are representative of the next generations of multi-core architectures.
With our proposed algorithm, we achieve an average compression ratio of 1:34 for English words and 1:58 for random binary patterns. Our implementation provides a sustained throughput between 0.90 and 2.35 Gbps per Cell blade in different application scenarios, while supporting dictionary densities up to 9.26 million average patterns per Gbyte of main memory.
The remainder of this paper is organized as follows. Section II introduces the Cell architecture. Sections III and IV introduce the AC algorithm and the compression method of Tuck et al. Section V demonstrates a parallel, Cell-based implementation of our technique. Section VI discusses the experimental results. Section VII reviews the related work. Section VIII concludes the paper.
II. THE CELL/BROADBAND ENGINE ARCHITECTURE
The Cell processor [26] contains 9 heterogeneous cores on a silicon die. One of them is a traditional 64-bit processor with cache memories and 2-way simultaneous multithreading, called the Power Processor Element (PPE), capable of running a full-featured operating system and traditional PowerPC applications. The other 8 cores are called Synergistic Processor Elements (SPEs). They have no caches, but rather a small amount of scratch-pad memory (256 kbyte) that the programmer must manage explicitly, by issuing DMA transfers to and from main memory. The cores are connected with each other via the Element Interconnect Bus (EIB), a fast double-ring on-chip network.
The Cell delivers its best performance when the SPEs are kept highly utilized by streaming tasks that load data from main memory, process data locally and commit the results back to main memory. These tasks exhibit a regular, predictable memory access pattern that the programmer can exploit to implement double buffering, and overlap computation and data-transfer over time.
Achieving high performance on the Cell with non-streaming applications is anything but trivial, and algorithms based on DFAs like ours are arguably the most difficult to port. In fact, these algorithms exhibit unpredictable memory access patterns and a complex latency interaction between compute code and data-transfer code. These circumstances make it difficult to determine what represents the critical path in the code, and how to optimize it.
III. THE AHO-CORASICK ALGORITHM
Aho-Corasick (AC) [1] is a multi-pattern matching algorithm, commonly employed in Network Intrusion Detection System (NIDS) applications. There are two versions of it: a deterministic and a non-deterministic one. Both versions use finite state machines. The version we adopt is the one based on a Non-deterministic Finite Automaton (NFA).
In this version, states are connected by success and failure transitions. Each state has one outgoing failure transition and one or more success transitions. A success transition is labeled with a symbol from the accepted alphabet. Each state has a set of matches (from the dictionary) that are matched when the NFA transitions into that state as a result of a success transition.
The NFA is initialized in its start state, with its read head on the first symbol of the input text string S. At each step, the NFA performs a state transition by examining the current input symbol in S. If the current state has a success transition labeled with the current input symbol, the NFA follows that transition, and the read head moves one symbol ahead over S. When no such success transition exists, the NFA follows the failure transition without advancing its read head.
Whenever the NFA lands in a state that has a non-empty match set, the automaton reports that all the strings in the match set have just been matched. In NIDS applications, this is usually associated with the detection of malicious signatures, and it triggers appropriate alerts.
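The matching procedure above can be sketched in portable C. This is a minimal illustration, not the compressed layout used later in the paper: the names ac_nfa, ac_init, ac_add, ac_build and ac_search, the MAX_STATES capacity and the uncompressed transition table are all assumptions made for the example. It also assumes the usual convention that the start state consumes a symbol for which it has no success transition.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_STATES 64   /* hypothetical capacity for a toy dictionary */
#define ALPHABET   256

typedef struct {
    int next[MAX_STATES][ALPHABET]; /* success transitions, -1 when absent */
    int fail[MAX_STATES];           /* one failure transition per state */
    int matches[MAX_STATES];        /* patterns reported on entering the state */
    int count;                      /* states in use; state 0 is the start */
} ac_nfa;

void ac_init(ac_nfa *a) {
    memset(a->next, -1, sizeof a->next);   /* all-ones bytes read back as -1 */
    memset(a->fail, 0, sizeof a->fail);
    memset(a->matches, 0, sizeof a->matches);
    a->count = 1;
}

/* Insert one pattern by following or creating success transitions. */
void ac_add(ac_nfa *a, const char *pattern) {
    int s = 0;
    for (const unsigned char *p = (const unsigned char *)pattern; *p; p++) {
        if (a->next[s][*p] < 0) {
            assert(a->count < MAX_STATES);
            a->next[s][*p] = a->count++;
        }
        s = a->next[s][*p];
    }
    a->matches[s]++;
}

/* Breadth-first pass that fills in failure transitions and match counts. */
void ac_build(ac_nfa *a) {
    int queue[MAX_STATES], head = 0, tail = 0;
    queue[tail++] = 0;
    while (head < tail) {
        int s = queue[head++];
        for (int x = 0; x < ALPHABET; x++) {
            int c = a->next[s][x];
            if (c < 0) continue;
            int f = a->fail[s];
            while (f != 0 && a->next[f][x] < 0) f = a->fail[f];
            int t = a->next[f][x];
            a->fail[c] = (t >= 0 && t != c) ? t : 0;
            /* a state also reports whatever its failure target reports */
            a->matches[c] += a->matches[a->fail[c]];
            queue[tail++] = c;
        }
    }
}

/* Count dictionary hits in text, following the success/failure rules. */
long ac_search(const ac_nfa *a, const char *text, size_t len) {
    long hits = 0;
    int s = 0;
    size_t i = 0;
    while (i < len) {
        int t = a->next[s][(unsigned char)text[i]];
        if (t >= 0) {            /* success: follow it and advance the read head */
            s = t;
            i++;
            hits += a->matches[s];
        } else if (s != 0) {     /* failure: follow it without advancing */
            s = a->fail[s];
        } else {                 /* start state has no transition: skip the symbol */
            i++;
        }
    }
    return hits;
}
```

With the dictionary {he, she, his, hers}, scanning the text "ushers" reports three matches (she, he and hers).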
IV. TUCK ET AL.'S COMPRESSED AUTOMATON
In this work, we focus on an adaptation of Tuck et al.'s [19] compressed NFA method. We choose to do so despite the fact that Zha and Sahni's [28]
A state transition for an input i works as follows. We first determine whether Success[i] is null by examining bit i of the bitmap. If this bit is zero, the next state is the one pointed to by the failure pointer. Otherwise, we determine the popcount of all the bits in the bitmap having position < i. Then we transition to the state pointed to by firstChild, offset by a state record size (45 bytes) as many times as the popcount.
To reduce the cost of popcount, Tuck et al. propose the use of precomputed summaries, which give the popcount for the first 32 · j, 1 ≤ j < 8 bits of the bitmap. Each summary is 8 bits long, and 7 summaries are needed. The size of a bit-compressed node with summaries is therefore 52 bytes.
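The bitmap lookup with summaries can be sketched as follows. This is an illustration under stated assumptions: the struct field layout, the bit ordering (bit i lives in word i/32 at position i mod 32) and the names bitmap_node, set_symbol, build_summaries and next_state are hypothetical, and the struct omits the match list, so it does not reproduce the 52-byte record exactly. Only the record size of 52 bytes and the seven 8-bit summaries come from the text.

```c
#include <assert.h>
#include <stdint.h>

#define NODE_SIZE 52u  /* bytes per bit-compressed node with summaries */

typedef struct {
    uint32_t bitmap[8];    /* bit i set: Success[i] is non-null */
    uint8_t  summary[7];   /* summary[j-1]: popcount of the first 32*j bits */
    uint32_t first_child;  /* address of the first child record */
    uint32_t failure;      /* address of the failure state */
} bitmap_node;             /* field layout is an assumption */

/* Portable 32-bit population count. */
static unsigned popcount32(uint32_t v) {
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;
    return (v * 0x01010101u) >> 24;
}

/* Mark symbol i as having a success transition. */
void set_symbol(bitmap_node *n, unsigned i) {
    assert(i < 256);
    n->bitmap[i >> 5] |= 1u << (i & 31u);
}

/* Precompute the running popcounts at 32-bit block boundaries. */
void build_summaries(bitmap_node *n) {
    unsigned running = 0;
    for (unsigned j = 0; j < 7; j++) {
        running += popcount32(n->bitmap[j]);
        n->summary[j] = (uint8_t)running;
    }
}

/* Next state for input i: failure pointer, or the popcount-indexed child. */
uint32_t next_state(const bitmap_node *n, unsigned i) {
    assert(i < 256);
    unsigned word = i >> 5, bit = i & 31u;
    if (((n->bitmap[word] >> bit) & 1u) == 0)
        return n->failure;
    /* ones before position i = summary of earlier blocks + ones in this block */
    unsigned before = (word > 0 ? n->summary[word - 1] : 0u)
                    + popcount32(n->bitmap[word] & ((1u << bit) - 1u));
    return n->first_child + before * NODE_SIZE;
}
```

With a summary in hand, only the bits of one 32-bit block need to be counted per transition, instead of up to 255 bits.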
Path Compression. Path compression is similar to end-node optimization [5], [9]. An end-node sequence is a sequence of states at the bottom of the automaton (the start state is at the top of the automaton) that comprises states having a single non-null success transition (except for the last state in the sequence, which has no non-null success transition). States in the same end-node sequence are packed together into one or more path-compressed nodes.
For each state s_i packed into a path-compressed node, we store one 1-byte success input character, the failure pointer and the match list.
Since several automaton states are packed into a single compressed node, a 32-bit failure pointer is not sufficient to address packed states within a compressed node. With an additional 3-bit offset, we handle nodes with capacity c ≤ 8. Now, 3c/8 bytes are needed for the offsets. A path-compressed node with capacity c needs 9c + 3c/8 bytes for the state information. 4 more bytes are needed to point to the next node (if any) in the sequence of path-compressed nodes. One more byte identifies the node type (bitmap or path-compressed) and its size (number of packed states). So, the size of a compressed node is 9c + 3c/8 + 5 bytes. Figure 3 shows a path-compressed node.
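As a worked instance of the size formula, take c = 8, the largest capacity that a 3-bit offset can address and a value for which 3c/8 is a whole number of bytes. The per-state figure in the last line is derived here for illustration and is not stated in the text.

```latex
\begin{align*}
\mathrm{size}(c) &= 9c + \tfrac{3c}{8} + 5 \\
\mathrm{size}(8) &= 72 + 3 + 5 = 80 \text{ bytes} \\
\mathrm{size}(8)/8 &= 10 \text{ bytes per packed state}
\end{align*}
```

A fully packed path-compressed node therefore spends about 10 bytes per original state, compared with 52 bytes for a bit-compressed node with summaries.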
V. CELL-ORIENTED ALGORITHM DESIGN
This section describes the implementation choices we made to adapt our AC NFA algorithm to the Cell processor. To compute popcounts efficiently, we employ the CNTB and SUMB instructions (available at the C level via the spu_cntb() and spu_sumb() intrinsics). These reduce the number of operations to compute the popcount from 31 additions (summary+bit0+bit1+...+bit30) to two SPU instructions plus one summary addition. Sample code to compute the popcount for child node i (0 ≤ i ≤ 255) of a compressed Aho-Corasick node is given below.
    popcount = get_summary(i);
    bitblock = get_bitmapblock(i);
    charvector = spu_promote(bitblock, 0);
    countbyteones = spu_cntb(charvector);                     /* CNTB: count the ones in each byte */
    countblockones = spu_sumb(countbyteones, countbyteones);  /* SUMB: add up the per-byte counts */
    popcount = popcount + spu_extract(countblockones, 0);
Also, we employ vector comparison instructions to get the longest match between the input and compressed paths.
For alignment reasons, we only consider path-compressed nodes with packing factors (c) of 4, 8 and 12. Figure 4 shows the corresponding compression ratio. Note that 4 is the best choice for the English dictionary and 8 is best for random binary patterns. For simplicity, we consider a packing factor of 4 in the experiments that follow. The difference in compression gain obtained with a packing factor of 8 is not significant enough to justify the increase in algorithm complexity. By using these compressed automata, we can compress dictionaries with an average compression ratio of 1:34 for English dictionaries and 1:58 for random binary patterns.
We now describe the optimizations we employed to map our compressed AC algorithm to the Cell architecture, and their impact. Results were obtained with the IBM Cell SDK 3.0 on IBM QS22 blades. Figure 5 shows the impact of the optimization steps on the performance and quality of code. We started from a naïve compressed AC implementation, a straightforward implementation of our compressed AC algorithm running on eight SPEs in parallel. We then applied branch hinting, branch replacement with conditional expressions, vertical unrolling, data structure realignment, branch removal, arithmetic strength reduction and horizontal unrolling. The aggregate effect of these optimizations is to increase the throughput (by reducing the number of cycles spent per character), reduce the cycles per instruction (CPI), reduce stalls and increase the dual-issue rate (i.e., the fraction of clock cycles in which both pipelines in an SPE issue a new instruction).
These techniques help to decrease the CPI, the branch stall cycle rate and the dependency stall cycles. They also decrease the single-issue rate and increase the dual-issue rate. Overall, the optimization effort results in a 16 to 25 times throughput speedup against the unoptimized PPE baseline implementation.
A. Step (2): Branch replacement and hinting
Whenever possible, we restructure the control flow so as to replace if statements with conditional expressions. We inspect the assembly output to make sure that the compiler renders conditional expressions with select-bits instructions rather than branches.
A major if statement in the compressed AC NFA kernel does not benefit from this strategy, i.e., the one that branches depending on whether the node type is bitmap or path-compressed. The two branches are too different to reduce to conditional expressions. We reduce the misprediction penalty associated with this branch by using a hint that marks the bitmap case as the more likely one, as suggested by our profiling on realistic data.
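One common way to express such a hint with GCC-family compilers is the __builtin_expect builtin; the text does not say which mechanism was used, so the sketch below is an assumption. The names LIKELY, dispatch_node, handle_bitmap and handle_path are hypothetical, and the two handlers are trivial stand-ins for the real node-processing code.

```c
#include <assert.h>

/* GCC/Clang builtin: tells the compiler which outcome to favor. */
#define LIKELY(x) __builtin_expect(!!(x), 1)

enum node_type { NODE_BITMAP = 0, NODE_PATH = 1 };

/* Hypothetical stand-ins for the two node-processing paths. */
static int handle_bitmap(int state) { return state + 1; }
static int handle_path(int state)   { return state + 4; }

/* Dispatch on the node type, hinting that bitmap nodes are more frequent. */
int dispatch_node(enum node_type type, int state) {
    assert(type == NODE_BITMAP || type == NODE_PATH);
    if (LIKELY(type == NODE_BITMAP))
        return handle_bitmap(state);
    return handle_path(state);
}
```

The hint changes only code layout and prediction, never the result: both node types are still handled correctly.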
B. Step (3): Loop Unrolling, Data alignment
We apply unrolling to a few relevant bounded innermost loops, and we apply data structure alignment. Our algorithm consists of two major parts: a compute part and a memory access part. Since the compressed AC is too large to fit entirely in the SPEs' local stores, we store it in main memory.
We safely ignore the impact of memory accesses required to load input text from main memory to local store and write back matches in the opposite direction. In fact, we implement both transfers in a double-buffered way, overlapping computation and data transfer in time. The pseudocode below shows the major part of the vertical unrolling method in the algorithm.
... }
When a single instance of an AC NFA runs, it computes its next-iteration node pointer and then fetches this node via a DMA transfer from main memory. DMA transfers have a round-trip time of hundreds of clock cycles. To utilize these cycles, we run multiple concurrent automata, each checking matches in different segments of the input, unrolling their code together vertically. Multiple automata can pipeline memory accesses, overlapping the DMA transfer delays. Figure 6 shows how two automata overlap their computation part with their DMA transfer wait time. Figure 7 illustrates how different vertical unrolling factors affect the performance. We choose a vertical unrolling factor of 8 in our implementation, as it gives the minimal DMA transfer delay. We also performed an experiment to find the DMA transfer size that makes full use of the bandwidth and minimizes the DMA transfer delay. The pseudocode below shows how to measure the DMA transfer time with different DMA transfer sizes.
    }
    record time2
    single_DMA_transfer_time=(time2-time1)/n
Figure 10 shows that the optimal transfer size is 64 bytes over the eight SPUs.
C. Step (4): Branch removal, select-bits intrinsics
After replacing if statements with conditional expressions, the branch miss stalls still account for about one fifth of the total compute cycles.
We use IBM asmvis [27] to inspect the static timing analysis of our code at the assembly level. It helps us to get a clear view of what the compiler is doing, instruction by instruction. The inspection reveals that conditional expressions are often translated by the compiler as expensive branch instructions. In this case, our code still suffers from expensive branch miss penalties, which can cost as much as 26 clock cycles each. To eliminate branches, we manually replace conditional expressions with the spu_sel intrinsic [23]. The basic idea is to compute the two possible results for both branches and select one of the results using a select-bits instruction. For example, the transformation reduces branch miss stalls from 19.5% to 2.7% of the cycle count for the full-text search scenario.
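The select idea can be shown in portable scalar C. The function select_bits below mimics the per-bit semantics of a select-bits operation (take the bit from the second operand where the mask bit is one, otherwise from the first); it is an illustration and not the SPU intrinsic itself, and the names select_bits and next_state_branchless are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Per-bit select: result bit comes from b where mask is 1, from a where it is 0. */
uint32_t select_bits(uint32_t a, uint32_t b, uint32_t mask) {
    return (a & ~mask) | (b & mask);
}

/* Both candidate results are already computed; pick one without branching.
   has_child must be 0 or 1. */
uint32_t next_state_branchless(uint32_t has_child, uint32_t failure, uint32_t child) {
    assert(has_child <= 1u);
    uint32_t mask = 0u - has_child;   /* 0 -> 0x00000000, 1 -> 0xFFFFFFFF */
    return select_bits(failure, child, mask);
}
```

The cost is that both results are always computed, which pays off only when each side is cheap compared with a branch miss.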
D. Step (5): Strength reduction
We manually apply operator strength reduction (i.e., replacing multiplications and divisions with shifts and additions) where the compiler did not. In addition, we use cheap pointer arithmetic to load four adjacent integer elements into a 128-bit vector. This reduces the load overhead. For example, manual strength reduction reduces the overall clock cycles by 3% for the full-text search scenario.
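As an illustration of strength reduction on the kind of arithmetic a bitmap lookup needs, the sketch below replaces a multiplication by the 52-byte node size with shifts and additions (52 = 32 + 16 + 4), and a division and remainder by 32 with a shift and a mask. Which expressions were actually rewritten is not stated in the text; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* n * 52 as shifts and additions: 52 = 32 + 16 + 4. */
uint32_t times52(uint32_t n) {
    return (n << 5) + (n << 4) + (n << 2);
}

/* i / 32: index of the 32-bit bitmap block that holds bit i. */
uint32_t block_of(uint32_t i) { return i >> 5; }

/* i % 32: position of bit i inside its block. */
uint32_t bit_of(uint32_t i) { return i & 31u; }

/* Address of the child selected by a popcount, without a multiply. */
uint32_t child_address(uint32_t first_child, uint32_t popcount) {
    assert(popcount < 256u);   /* a 256-bit bitmap has at most 255 earlier ones */
    return first_child + times52(popcount);
}
```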
E. Step (6): Horizontal unrolling
After Steps 1-4, dependency stalls occupy about 25% of the computation time. Within the NFA compute code, one branch handles bitmap nodes, while the other one handles path-compressed nodes. In the code of both cases, there are frequent read-after-write data dependencies.
To reduce the dependency stalls, we interleave the code of multiple, distinct automata; we call this operation horizontal unrolling. These multiple automata process independent input streams against the same dictionary. They have distinct states and input/output buffers, and they require multiple, distinct DMA operations to perform the associated streamed double buffering. The buffer size is 4096 bytes in our experiments.
The horizontal unroll factor must be chosen carefully to reflect the trade-off between the decreased dependency stalls and the potentially increased branch stalls. Our experiments show that unrolling 2 NFAs achieves the highest performance improvement, 10%. For example, for the full-text search scenario, dependency stalls decrease from 26.0% to 17.4%, while branch stalls increase from 1.8% to 3.2%.
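The interleaving can be sketched with a toy automaton in place of the compressed AC NFA. In the example below, two independent automata (each counting occurrences of "ab" in its own stream) advance inside the same loop body, so the dependency chain of one can fill the stall slots of the other. The automaton, the function names ab_step, scan_one and scan_pair, and the equal-length streams are assumptions made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Toy automaton: state 1 means the previous symbol was 'a'.
   A hit is recorded when 'b' arrives in state 1. */
static inline int ab_step(int state, char ch, long *hits) {
    *hits += (state == 1 && ch == 'b');
    return ch == 'a';
}

/* Reference: a single automaton over a single stream. */
long scan_one(const char *in, size_t len) {
    int s = 0;
    long h = 0;
    for (size_t i = 0; i < len; i++)
        s = ab_step(s, in[i], &h);
    return h;
}

/* Horizontal unroll factor 2: two automata with distinct state and
   distinct input streams, interleaved statement by statement. */
void scan_pair(const char *in0, const char *in1, size_t len,
               long *hits0, long *hits1) {
    assert(in0 && in1 && hits0 && hits1);
    int s0 = 0, s1 = 0;
    long h0 = 0, h1 = 0;
    for (size_t i = 0; i < len; i++) {
        s0 = ab_step(s0, in0[i], &h0);   /* automaton 0 */
        s1 = ab_step(s1, in1[i], &h1);   /* automaton 1, independent of 0 */
    }
    *hits0 = h0;
    *hits1 = h1;
}
```

Each automaton produces exactly the result it would produce alone; only the instruction schedule changes.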
VI. EXPERIMENTAL RESULTS
In this section, we benchmark our software design in a set of representative scenarios.
We use two dictionaries to generate compressed AC automata: Dictionary 1 contains the 20,000 most common words in the English language, while Dictionary 2 contains 8,000 random binary patterns. We benchmark the algorithm on three input files: the King James Bible, a tcpdump stream of captured network traffic and a randomly generated binary file. The last two scenarios in the figure are representative of systems (with Dictionary 1 and 2, respectively) under a malicious, content-based attack. In fact, a system whose performance degrades dramatically when the input exhibits frequent matches with the dictionary is subject to content-based attacks. An attacker who gains partial or full knowledge of the dictionary could provide the system with traffic specifically designed to overload it. In scenarios five and six, we provide our system with inputs entirely composed of words from the dictionary. Our experiments show a desirable property of our algorithm: its performance actually increases when matches are frequent.
The reason is that our NFA spends a similar amount of time to process a bitmap node or a path-compressed node. A mismatch therefore takes a comparable amount of time to the match of an entire path.
As a result, the cycles spent per input character decrease when more input characters match the dictionary. Path-compressed nodes pack as many as 4 or 8 original AC nodes, and allow a multi-character match in a single step. Figure 13 shows how the percentage of matched patterns affects the aggregate throughput on the IBM Cell blade with 16 SPUs for the virus scanning scenario. As the percentage of the matched patterns increases, the aggregate throughput increases as well.
We explore the trade-offs between the AC compression ratio and the throughput in a Pareto space. We choose the English dictionary as the dictionary to compress, and packing factors of 4, 8 and 12 for path-compressed nodes. As shown in Figure 14, the compression ratio decreases as the packing factor increases. However, the throughput is better with a packing factor of 8 than with one of 4.
The reason is that the input is English text that matches the dictionary 100% of the time. Matching 8 nodes of a path-compressed node at a time therefore gives better performance than matching 4 at a time. However, a packing factor of 12 shows some throughput degradation compared to a packing factor of 8. One conclusion we draw from this Pareto chart is that the compression ratio affects the throughput: to get a better compression ratio, we have to sacrifice throughput.
VII. RELATED WORK
Snort [14] and Bro [13], [4], [7], [15], [3] are two of the more popular public domain Network Intrusion Detection Systems (NIDSs). The current implementation of Snort uses the optimized version of the AC automaton [1]. Snort also uses SFK search and the Wu-Manber [20] multi-string search algorithm.
The memory required to store the optimized Aho-Corasick and Wu-Manber data structures is excessive [19] .
To reduce the memory requirement of the AC automaton, Tuck et al. [19] have proposed starting with the nondeterministic AC automaton and using bitmaps and path compression.
We note that the use of bitmaps to obtain compact representations was proposed first by Jacobson [12] .
In the network security domain, bitmaps have been used also in the tree bitmap scheme [5] and in shape-shifting and hybrid shape-shifting tries [16], [9]. Path compression has been used in several IP address lookup structures including tree bitmap [5] and hybrid shape-shifting tries [9]. These compression methods reduce the memory required to about 1/30-1/50 of that required by an AC DFA or a Wu-Manber structure, and to slightly less than what is required by SFK search [19]. However, lookups on path-compressed data require more computation at search time, e.g., more additions at each node to compute popcounts, thus requiring hardware support to achieve competitive performance.
Zha and Sahni [28] have suggested a compressed AC trie inspired by the work of Tuck et al. [19]: they use bitmaps with multiple levels of summaries, as well as an aggressive path compaction. Zha and Sahni's technique requires 90% fewer additions to compute popcounts than that of Tuck et al. [19], and occupies 24%-31% less memory. Scarpazza et al. [24] propose a memory-based implementation of the deterministic AC algorithm that is capable of supporting dictionaries as large as the available main memory, and achieves a search performance of 1.5-2.2 Gbps per Cell chip. Scarpazza et al. [25] also propose regular expression matching against small rule sets (which suits the needs of search engine tokenizers), delivering 8-14 Gbps per Cell chip.
Song and Lockwood [18], Fang et al. [6], and Yu and Katz [22], for example, propose the use of TCAMs (in the case of [18], the TCAM is supplemented with bit-vector hardware) for NIDS applications. Yazadani et al. [21] propose a two-level state machine architecture that employs a TCAM for packet content examination. Dharmapurikar and Lockwood [2] have proposed a hardware implementation of the AC [1] string matching algorithm for NIDS applications. They assert that their hardware design is more scalable than FPGA and TCAM based designs because of its reliance on "embedded on-chip memory blocks in VLSI hardware." Song et al. [17] propose the use of an FPGA pre-filter to reduce the network traffic actually examined by a NIDS, and Lockwood et al. [8] propose an extensible system-on-programmable-chip design for content-aware filtering. Their design employs TCAMs and FPGAs. Tuck et al. [19] propose a way to represent unoptimized AC automata in a compact format. They predict a processing rate of about 8 Gbps for an ASIC design. Van Lunteren [11] has proposed a B-FSM (Bart Finite State Machine) for NIDS applications. The proposed B-FSM employs a finite state machine similar to that used in the AC string matching algorithm and the packet classification scheme Bart developed earlier by Van Lunteren [10]. It is estimated that an FPGA version of the B-FSM will process at 10 Gbps and an ASIC version at 20 Gbps.
VIII. CONCLUSIONS
We present an optimized software design that exploits compressed AC automata to perform high-throughput multipattern string matching on the IBM Cell Broadband Engine.
We have presented a detailed overview of the algorithmic-level and implementation-level optimizations that we applied in order to improve the algorithm's performance.
Our solution delivers impressive compression ratios in experiment scenarios representative of natural language processing and network security applications: respectively, 1:34 on dictionaries containing English words, and 1:58 on dictionaries containing random binary patterns. Also, our solution provides a remarkable throughput between 0.90 and 2.35 Gbps per Cell blade, depending on the statistical properties of dictionary and input.
