Design Alternatives for a High-Performance Self-Securing Ethernet Network Interface by Schuff, Derek L. & Pai, Vijay S.
Purdue University
Purdue e-Pubs
ECE Technical Reports Electrical and Computer Engineering
1-14-2007
Schuff, Derek L. and Pai, Vijay S., "Design Alternatives for a High-Performance Self-Securing Ethernet Network Interface" (2007).
ECE Technical Reports. Paper 342.
http://docs.lib.purdue.edu/ecetr/342
Design Alternatives for a High-Performance
Self-Securing Ethernet Network Interface ∗
Derek L. Schuff and Vijay S. Pai
Purdue University
West Lafayette, IN 47907
{dschuff, vpai}@purdue.edu
Abstract
This paper presents and evaluates a strategy for inte-
grating the Snort network intrusion detection system into
a high-performance programmable Ethernet network inter-
face card (NIC), considering the impact of several possi-
ble hardware and software design choices. While currently
proposed ASIC, FPGA, and TCAM systems can match in-
coming string content in real-time, the system proposed also
supports the stream reassembly and HTTP content transfor-
mation capabilities of Snort. This system, called LineSnort,
parallelizes Snort using concurrency across TCP sessions
and executes those parallel tasks on multiple low-frequency
pipelined RISC processors embedded in the NIC. LineSnort
additionally exploits opportunities for intra-session concur-
rency. The system also includes dedicated hardware for
high-bandwidth data transfers and for high-performance
string matching.
Detailed results obtained by simulating various software
and hardware configurations show that the proposed system
can achieve intrusion detection throughputs in excess of 1
Gigabit per second for fairly large rule sets. Such perfor-
mance requires the system to use hardware-assisted string
matching and a small shared data cache. The system can
extract performance through increases in processor clock
frequency or parallelism, allowing additional flexibility for
designers to achieve performance within specified area or
power budgets. By efficiently offloading the computation-
ally difficult task of intrusion detection to the network inter-
face, LineSnort enables intrusion detection to run directly
on PC-based network servers rather than just at power-
ful edge-based appliances. As a result, LineSnort has the
potential to protect servers against the growing menace of
LAN-based attacks, whereas traditional edge-based intru-
sion detection deployments can only protect against exter-
nal attacks.
∗This work is supported in part by the National Science Foundation
under Grant Nos. CCF-0532448 and CNS-0532452.
1 Introduction
Edge-based firewalls form a traditional approach to net-
work security, based on the assumption that malicious at-
tacks are sent to a target system across the global Internet.
However, firewalls do not protect applications running on
externally exposed ports, nor do they protect against attacks
originating inside the local-area network. Such attacks may
arise once a machine inside the network has been compro-
mised by an otherwise undetected attack, virus, or user error
(such as opening an email attachment). LAN-based attacks
are potentially even more dangerous than external ones as
they may propagate at LAN speeds and attack internal ser-
vices such as NFS.
Network intrusion detection systems (NIDSes), such as
Snort, may run on individual host machines on a LAN [25].
However, the overheads of running such systems are quite
high, as they must reassemble network packets into streams,
preprocess the data, and scan the streams for matches
against specified string content and regular expressions.
These overheads limit Snort to a traffic rate of less than
500 Mbps on a modern host machine (2.0 GHz AMD
Opteron processor system), even with full access to the
CPU. Profiling shows that no single portion of Snort is
the sole bottleneck: although string content matching is
the most important subsystem, the other portions of Snort
consume nearly half the required cycles. Further, the CPU
power on a running server machine would also need to be
shared with the server applications and the operating sys-
tem. Consequently, it is not feasible to deploy Snort directly
on high-end network servers that must serve data at Gigabit
rates or higher. These observations tend to limit the deploy-
ment of NIDSes to high-end edge appliances.
To allow for the reliable detection and logging of attacks
within the LAN, Ganger et al. propose self-securing net-
work interfaces, by which the network interface card exe-
cutes the NIDS on data as it streams into and out of the
host [11]. Their prototype used a PC placed between the
switch and the target PC as the self-securing network inter-
face. The newly inserted PC ran a limited operating system
and several custom-written applications that scanned traffic
looking for suspicious behaviors. Although valuable as a
proof-of-concept, a more usable self-securing network in-
terface would need to be implemented as an Ethernet net-
work interface card and execute a more standard NIDS.
This paper has two key contributions. First, this pa-
per presents a strategy for integrating the Snort network
intrusion detection system into a high-performance pro-
grammable Ethernet network interface. The architecture
used in this paper draws from previous work in pro-
grammable Ethernet controllers, combining multiple low-
frequency pipelined RISC processors, nonprogrammable
hardware assists for high-bandwidth memory movement,
and an explicitly-managed partitioned memory system [4,
34]. The firmware of the resulting Ethernet controller par-
allelizes Snort at the granularity of TCP sessions, and also
exploits opportunities for intra-session concurrency. Previ-
ous academic and industrial work on special-purpose archi-
tectures for high-speed NIDS has generally dealt only with
content matching [5, 9, 24, 29, 30, 36]. General Snort-style
intrusion detection allows stronger security than content-
matching alone by reassembling TCP streams to detect at-
tacks that span multiple packets, by transforming HTTP
URLs to canonical formats, and by supporting other types
of Snort tests.
This paper’s second contribution is to explore and ana-
lyze the performance impact of various hardware and soft-
ware design alternatives for the resulting self-securing net-
work interface, which is called LineSnort. The paper con-
siders the impact of the Snort rulesets used and scalability
with regard to processors and frequency, policies for assign-
ing flows to processors, the benefits of hardware-assisted (as
opposed to pure software) string content matching, and the
importance of data caches. The results show that through-
puts in excess of 1 Gigabit per second can be achieved for
fairly large rule sets using hardware-assisted string match-
ing and a small shared data cache. LineSnort can ex-
tract performance through increases in processor clock fre-
quency or parallelism, allowing an additional choice for




Snort is the most popular intrusion-detection system
available. The system and its intrusion-detection rule set
are freely available, and both are regularly updated to ac-
count for the latest threats [25]. Snort rules detect attacks
based on traffic characteristics such as the protocol type
(TCP, UDP, ICMP, or general IP), the port number, the size
of the packets, the packet contents, and the position of the
suspicious content. Packet contents can be examined for
exact string matches and regular-expression matches. Snort
can perform thousands of exact string matches in parallel
using one of several different multi-string pattern matching
algorithms [2, 8, 14, 35]. Snort maintains separate string-
matching state machines for each possible target port, al-
lowing each state machine to scan for only the traffic signa-
tures relevant to a specific service.
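As an illustration of multi-string matching with per-port state machines, the following minimal sketch builds an Aho-Corasick automaton, one well-known algorithm of this class. It is not Snort's implementation; the function names `build_automaton` and `search` and the `per_port` dictionary are hypothetical, and the example strings are taken from the sample rules in this section.

```python
from collections import deque

def build_automaton(patterns):
    """Build an Aho-Corasick automaton: goto table, failure links, outputs."""
    goto = [{}]      # goto[state][byte] -> next state
    fail = [0]       # failure link of each state
    out = [set()]    # patterns recognised on entering each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    # Breadth-first pass fills in failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def search(automaton, data):
    """Scan data once and return (start_offset, pattern) for every match."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for pos, ch in enumerate(data):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((pos - len(pat) + 1, pat))
    return hits

# One automaton per destination port, so each scan only considers the
# signatures relevant to that service (hypothetical two-rule example).
per_port = {
    25: build_automaton([b"Content-Type:"]),
    27665: build_automaton([b"betaalmostdone"]),
}
```

A packet destined for port 25 would be scanned with `search(per_port[25], payload)`, so the Trin00 string is never considered for SMTP traffic.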
Beyond content signature matching alone, Snort includes
preprocessors that perform certain operations on the data
stream. Some important preprocessors include stream4 and
HTTP Inspect. The stream4 preprocessor takes multiple
packets from a given direction of a TCP flow and builds a
single conceptual packet by concatenating their payloads,
allowing rules to match patterns that span packet bound-
aries. It accomplishes this by keeping a descriptor for each
active TCP session and tracking the state of the session.
It also keeps copies of the packet data and periodically
“flushes” the stream by reassembling the contents and pass-
ing a pseudo-packet containing the reassembled data to the
detection engine. The HTTP preprocessor converts URLs
to normalized canonical form so that rules can specifically
match URLs rather than merely strings or regular expres-
sions. For example, it decodes URLs containing percentage
escape symbols (e.g., “%7e” instead of the tilde symbol) to
their normal forms [6]. It also generates absolute directories
instead of the “../” forms sometimes used to hide direc-
tory traversals. After this decoding, the same pattern match-
ing algorithm that Snort uses for packet data is used on the
normalized URL as specified by “URI-content” rules.
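The two normalization steps described above, decoding percent escapes and resolving "../" components, can be sketched with the Python standard library. This is only an approximation of what HTTP Inspect does; the function name `normalize_uri` is hypothetical, and the real preprocessor handles many more encodings.

```python
import posixpath
from urllib.parse import unquote

def normalize_uri(raw_uri):
    """Return a canonical form of a request URI (illustrative sketch only)."""
    # Keep the query string apart so that path normalization does not touch it.
    path, sep, query = raw_uri.partition("?")
    # Decode %xx escapes first, so an encoded "../" cannot slip through,
    # then collapse "." and ".." components into an absolute directory path.
    path = posixpath.normpath(unquote(path))
    return path + sep + unquote(query)
```

After this step, a "URI-content" string such as "/modules.php?" can be matched against the normalized form rather than against every possible encoding of it.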
The following rules illustrate how Snort uses network
data characteristics to detect attacks:
• Buffer overflow in SMTP Content-Type: TCP traffic
to SMTP server set, established connection to port 25,
string “Content-Type:”, regular expression
“Content-Type:[^\r\n]{300,}” (i.e., 300 or more characters after
the colon other than carriage return or newline)
• Cross-site scripting in PHP Wiki: TCP traffic to HTTP
server set, established connection to HTTP port set,
URI contains string “/modules.php?”, URI contains
string “name=Wiki”, URI contains string “<script”
• Distributed Denial-of-Service (DDOS) by Trin00 At-
tacker communicating to Master: TCP traffic to home
network, established connection to port 27665, string
“betaalmostdone”
The above rules test multiple conditions and use the logical
AND of those conditions to confirm an attack. The Snort
ruleset language includes 15 tests based on packet payload
and 20 based on headers. Over 95% of the rules in recent
Snort rulesets specify string content to match, but no rule
specifies only string content matching. Some rules specify
multiple string matches, and all are augmented with other
tests, such as specific ports, IP addresses, or URLs. About
30% of the rules use regular expressions, but nearly all of
these check for an exact string match as a first-order filter
before the more time-consuming regular expression match.
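The structure of such a rule, a logical AND of header tests, an exact string, and an optional regular expression, can be sketched as follows. The `Rule` class and `rule_matches` function are hypothetical and cover only a few of the tests in the Snort rule language; the example rule mirrors the SMTP Content-Type rule listed above.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    name: str
    protocol: str
    dst_port: int
    content: bytes                      # exact string, used as a cheap first filter
    pcre: Optional[re.Pattern] = None   # optional, more expensive regular expression

def rule_matches(rule, protocol, dst_port, established, payload):
    """Logical AND of all conditions, cheapest tests first."""
    if protocol != rule.protocol or dst_port != rule.dst_port or not established:
        return False
    if rule.content not in payload:     # exact-string prefilter
        return False
    if rule.pcre is not None and rule.pcre.search(payload) is None:
        return False
    return True

smtp_overflow = Rule(
    name="smtp-content-type-overflow",
    protocol="tcp",
    dst_port=25,
    content=b"Content-Type:",
    pcre=re.compile(rb"Content-Type:[^\r\n]{300,}"),
)
```

The regular expression only runs on payloads that already contain the exact string, which is the first-order filtering described above.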
Figure 1 depicts the Snort packet processing loop. Snort
first reads a packet from the operating system using the
pcap library (also used by tcpdump). The decode stage
translates the network packet’s tightly-encoded protocol
headers and associated information into Snort’s loosely-
encoded packet data structure. Snort then invokes pre-
processors that use and manipulate packet data in various
ways. The rule-tree lookup and pattern matching stage de-
termines which rules are relevant for the packet at hand
Figure 1. The Snort packet processing loop with percentage of time spent in each phase
fined in the string rules using the multi-string matching al-
gorithms. Packets may match one or more strings in the
multi-string match stage. Each match is associated with a
different rule, and for each of those rules, all the remain-
ing conditions are checked, including strings, non-content
conditions and regular expressions. Because each match
from the multi-pattern algorithm may or may not result in a
match for its rule as a whole, this stage may be thought of
as a “verification” stage. The last stage notifies the system
owner through alerts related to the specific rule matches.
Below each stage in Figure 1 is the percentage of ex-
ecution time spent in that phase for a representative net-
work test pattern drawn from the 1998-1999 DARPA intru-
sion detection evaluation [13]. The profile shown here was
gathered using the oprofile full-system profiling utility
running on a Sun Fire v20z with two 2.0 GHz dual-core
Opterons (4 processors total). Snort only runs on one pro-
cessor. The system has 4 GB of DRAM and uses Linux ver-
sion 2.6.8. The Snort configuration used a full recent rule-
set and included the most important preprocessors: stream4
and HTTP Inspect as described above. The Snort code pro-
filed here is modified to read all of its packets sequentially
from an in-memory buffer to allow the playback of a large
network trace, so packet reading consumes no time. The
system achieves 463 Mbps inspection throughput for this
trace. The stages shown in Figure 1 account for 92% of
execution time, with an additional 8% going into miscel-
laneous library calls, operating system activity, and other
applications running on the system. This additional compo-
nent is not relevant in the context studied in this paper and
is thus not considered further.
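To put the measured throughput in perspective, a rough back-of-the-envelope calculation converts it into processor cycles per inspected byte. This is an illustration rather than a result from the profile: it assumes 1 Mbps = 10^6 bits per second and a single 2.0 GHz core fully dedicated to Snort, and the function name is hypothetical.

```python
def cycles_per_byte(clock_hz, throughput_bps):
    """Processor cycles available (or spent) per byte at a given bit rate."""
    return clock_hz / (throughput_bps / 8)

# Cycles the profiled host spends per byte at 463 Mbps (roughly 34.6).
host_cost = cycles_per_byte(2.0e9, 463e6)

# Cycles the same core could afford per byte at 1 Gbps (16).
gigabit_budget = cycles_per_byte(2.0e9, 1.0e9)
```

Under these assumptions the host spends roughly twice the per-byte budget that Gigabit line rate would allow, before any cycles are left for server applications.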
As Figure 1 shows, string content matching is a ma-
jor component of intrusion detection, constituting 46% of
execution time. Similar observations by others have led
some researchers to propose custom hardware support for
this stage, based on ASICs [3, 30], TCAMs [36], or FP-
GAs [5, 24, 29]. Some have also proposed support for regu-
lar expressions [9, 24]. Such hardware engines increase per-
formance both by eliminating instruction processing over-
heads and by exploiting concurrency in the computation.
Although some of these hardware systems have been tested
using strings extracted from Snort rules, all have used uni-
fied state machines for all string matching rather than per-
port state machines as in the Snort software.
Although Figure 1 shows the importance of string con-
tent matching, it also shows that no single component of in-
trusion detection makes up the majority of execution time.
This is not unexpected since the rules invariably include
multiple types of tests, not just string content matching.
Thus, any performance optimization strategy should target
the full intrusion detection system.
2.2 Programmable Ethernet Controllers
The first widely-used Gigabit Ethernet network interface
cards were based on programmable Ethernet controllers
such as the Alteon Tigon [4]. Although more recent Gigabit
Ethernet controllers have generally abandoned programma-
bility, programmable controllers are again arising for 10 Gi-
gabit Ethernet and for extended services such as TCP/IP of-
floading, iSCSI, message passing, or network interface data
caching [1, 15, 18, 22, 27]. Although integrating processing
and memory on a network interface card may add additional
cost to the NIC, this cost will still likely be a small portion
of the overall cost of a high-end PC-based network server.
Consequently, such costs are reasonable if programmabil-
ity can be used to effectively offload, streamline, or secure
important portions of the data transfer.
Network interface cards must generally complete a se-
ries of steps to send and receive Ethernet frames, which
may involve multiple communications with the host using
programmed I/O, DMA, and interrupts. Programmable net-
work interfaces implement these steps with a combination
of programmable processors and nonprogrammable hard-
ware assists. The assists are used for high-speed data trans-
fers between the NIC and the host or Ethernet, as efficient
hardware mechanisms can be used to sequentially transfer
data between the NIC local memory and the system I/O
bus or network. Willmann et al. studied the hardware and
software requirements of a programmable 10 Gigabit Eth-
ernet controller [34]. They found that Ethernet processing
firmware does not have sufficient instruction-level paral-
lelism or data reuse for efficient use of multiple-issue, out-
of-order processors with large caches; however, a combi-
nation of frame-level and task-level concurrency allows the
use of parallel low-frequency RISC cores. Additionally, the
specific data access characteristics of the network interface
firmware allow the use of a software-managed, explicitly-
partitioned memory system that includes on-chip SRAM
scratchpads for low-latency access to firmware metadata
and off-chip graphics DDR (GDDR) DRAM for high-
[Figure 2 diagram not reproduced; recoverable labels: ICache 0, ICache 1, ICache P-1, Instruction Memory Interface]
Figure 2. Block diagram of proposed Ethernet controller architecture
3 A Self-Securing Network Interface Archi-
tecture
As discussed in Section 2, programmable network inter-
faces have been proposed as resources for offloading var-
ious forms of network-based computation from the host
CPU of high-end PC-based network servers. Offloading in-
trusion detection from the host could be quite valuable be-
cause this service potentially requires processing every byte
sent across the network. This paper proposes a self-securing
Ethernet controller that uses multiple RISC processors
along with event-driven protocol processing firmware and
special-purpose hardware for string matching. Figure 2
shows the architecture of the proposed Ethernet controller,
which is based upon the 10 Gigabit Ethernet design by Will-
mann et al. [34]. The individual components target the fol-
lowing uses:
• Programmable processors: control-intensive computa-
tion, needs low latency
• Memory transfer assists (DMA and MAC): data trans-
fers to external interfaces, need high bandwidth
• String matcher: data processing, needs high bandwidth
• Graphics DDR (GDDR) SDRAM: for high capacity
and high bandwidth access to network data and dy-
namically allocated data structures
• Banked SRAM scratchpads: for low latency on ac-
cesses to fixed control data
• Instruction caches: for low latency instruction access
• Data cache: for low latency on repeated data accesses
Of these components, the data cache did not appear in
the previous works discussed. The remainder of this section
discusses the architectural components in greater detail.
Programmable Processing Elements. Ethernet proto-
col processing and the portions of Snort other than string
matching are executed on RISC processors that use a single-
issue five-stage pipeline and an ISA based on a subset of
the MIPS architecture. Parallelism is a natural approach to
achieving performance in this environment because packets
from independent flows have no semantic dependences be-
tween them. Additionally, alternatives to parallelism such
as higher frequency or superscalar issue are not viable op-
tions because network interfaces have a limited power bud-
get and limited cooling space.
Memory Transfer Assists. As in previous pro-
grammable network interfaces, this system includes hard-
ware engines to transfer data between the NIC memory and
the host memory or network efficiently [4, 19, 34]. These
assists, called the DMA and MAC assists, operate on com-
mand queues that specify the addresses and lengths of the
memory regions to be transferred to the host by DMA or
to the network following the Ethernet medium-access con-
trol policy. The DMA assist is also used to offload TCP/IP
checksumming from the host operating system [20].
String Matching Hardware. High-performance net-
work intrusion detection requires the ability to match string
content patterns efficiently. Thus, the controller integrates
string matching assist hardware. This assist could be based
on almost any of the previously-proposed ASIC, TCAM, or
FPGA-based designs mentioned in Section 2. The string-
matcher operates by reading from a dedicated task queue
describing regions of memory to process and then read-
ing those regions of memory (just like the DMA/MAC as-
sists). Since previously-reported string and regular expres-
sion matchers have been shown to match content at line-
rate (e.g., [3, 5, 24, 29, 30, 36]), the only potential slow-
down would be reading the input data from memory. This
penalty can be managed by allocating data appropriately in
the memory hierarchy.
One major difference from conventional Snort string
matching, however, is the use of a single table for all strings,
as in the hardware matchers described in Section 2; in contrast,
standard Snort uses separate tables for each set of network
ports. Per-port tables may not be feasible for embedded
hardware string matchers that have limited storage capacity.
Consequently, fast multipattern string matching alone will
likely report matches for rules that only apply to other ports;
these false positives must be rejected by the remaining rule tests.
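The effect of a unified string table can be sketched in a few lines. Here a plain substring test stands in for the hardware matcher, and the rule dictionaries, `unified_match`, and `verify` are hypothetical names used only for illustration; the point is that the matcher has no port knowledge, so a later software check must discard hits for rules bound to other ports.

```python
RULES = [
    {"sid": "smtp-content-type", "port": 25, "content": b"Content-Type:"},
    {"sid": "trin00-master", "port": 27665, "content": b"betaalmostdone"},
]

def unified_match(payload):
    """Stand-in for the hardware matcher: one table, no port knowledge."""
    return [rule for rule in RULES if rule["content"] in payload]

def verify(candidates, dst_port):
    """Software check that drops candidates whose rule targets another port."""
    return [rule["sid"] for rule in candidates if rule["port"] == dst_port]

# An HTTP response also carries a Content-Type header, so the unified
# table reports the SMTP rule even though the traffic is on port 80.
http_payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
candidates = unified_match(http_payload)
alerts = verify(candidates, dst_port=80)
```

In this example `candidates` contains the SMTP rule but `alerts` is empty, which is the kind of false positive a unified table produces and a per-port table avoids.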
Memory System. The memory system consists of small
instruction caches for low-latency access to the instruc-
tion stream, banked SRAM scratchpads for access to pro-
tocol processing and Snort control data, and an external
GDDR SDRAM memory system for high-bandwidth access
to high-capacity frame data. Two Micron GDDR SDRAM
chips can provide a 64 MB memory with a 64-bit interface
operating at speeds up to 600 MHz, double-clocked [23].
Additionally, the memory system includes a shared data
cache to allow the programmable processors to use DRAM-
based frame data when needed without incurring row ac-
tivation overhead on every access. The cache is not used
for scratchpad accesses or by the assists. For simplicity,
the cache is write-through; consequently, it is important to
avoid allocating heavily-updated performance-critical data
in the DRAM. The cache does not maintain hardware co-
herence with respect to uncached accesses from the as-
sists (which are analogous to uncached DMAs from I/O
devices in a conventional system). Instead, the code ex-
plicitly flushes the relevant data from cache any time it re-
quires communication with the assists. The system adopts
a sharing model based on lazy release consistency (LRC),
with explicit flushes before loading data that may have been
written by the hardware assists [17]; unlike true LRC, this
system need not worry about merging updates from mul-
tiple writers because the cache is write-through. An alter-
nate design is also evaluated which uses private caches for
each processor. Unlike the shared cache, the private caches
can provide single-cycle access. However, flushes are also
required to maintain coherence between processors. Each
time a processor acquires a lock that protects cacheable
data, it checks if it was the last processor to own the lock. If
not, it must flush its cache before reading because another
processor may have written to the data.
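The lock-based flush rule for the private-cache design can be modelled with a small single-threaded sketch. It does not implement real mutual exclusion or hardware behaviour; the `Lock` and `Processor` classes are hypothetical, and the private caches are assumed here to be write-through like the shared cache.

```python
class Lock:
    """Records only which processor held the lock most recently."""
    def __init__(self):
        self.last_owner = None

class Processor:
    def __init__(self, pid, memory):
        self.pid = pid
        self.memory = memory    # shared backing store (models DRAM)
        self.cache = {}         # private cache: address -> value
        self.flushes = 0

    def acquire(self, lock):
        # If another processor owned the lock last, cached copies of the
        # protected data may be stale, so discard them before reading.
        if lock.last_owner != self.pid:
            self.cache.clear()
            self.flushes += 1
        lock.last_owner = self.pid

    def read(self, addr):
        if addr not in self.cache:
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def write(self, addr, value):
        self.cache[addr] = value
        self.memory[addr] = value   # write-through: memory is always current

memory = {"session_state": 1}
lock = Lock()
p0, p1 = Processor(0, memory), Processor(1, memory)
```

Because writes go straight to memory, a flush on acquire is enough to make the next read observe the previous owner's updates; no write-back or merge step is needed in this model.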
Table 1 summarizes the default architectural configura-
tion parameters for this controller.
Parameter                  Value
Processors                 2–8 @ 500 MHz
Scratchpad banks           4
Private I-cache size       8 KB
Shared D-cache size        64 KB
Cache block size           32 bytes
Cache associativity        2-way
GDDR SDRAM chips           2 × 32 MB
DRAM frequency             500 MHz
Row size                   2 KB
Row activation latency     52 ns
Table 1. Default architectural parameters
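A few derived quantities follow from these parameters by simple arithmetic. The sketch below is a theoretical-peak estimate, not a simulation result: it assumes the 64-bit double-clocked interface described above operates at the default 500 MHz, ignores refresh and row-activation stalls, and uses hypothetical constant names.

```python
BUS_WIDTH_BITS = 64          # two GDDR chips, 64-bit interface
DRAM_CLOCK_HZ = 500e6        # default DRAM frequency
TRANSFERS_PER_CLOCK = 2      # double-clocked
ROW_ACTIVATION_NS = 52
PROCESSOR_CLOCK_HZ = 500e6
ROW_SIZE_BYTES = 2 * 1024
CACHE_BLOCK_BYTES = 32

# Peak DRAM bandwidth: 8 bytes per transfer, two transfers per clock.
peak_bytes_per_s = BUS_WIDTH_BITS / 8 * DRAM_CLOCK_HZ * TRANSFERS_PER_CLOCK

# Processor cycles that elapse during one row activation.
activation_cycles = ROW_ACTIVATION_NS * PROCESSOR_CLOCK_HZ / 1e9

# Cache blocks that fit in one open DRAM row.
blocks_per_row = ROW_SIZE_BYTES // CACHE_BLOCK_BYTES
```

Under these assumptions the peak is 8 GB/s, while a row activation costs about 26 processor cycles, which is why the shared data cache is used to avoid paying that latency on every access.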
4 Integrating IDS into Network Interface
Processing
Section 2 described the steps in the Snort network intru-
sion detection system and in programmable Ethernet pro-
tocol processing firmware. A self-securing programmable
Ethernet NIC integrates these two tasks. Achieving high
performance requires analyzing and extracting the concur-
rency available in these tasks and mapping the computation
and data to the specific resources provided by the architec-
ture. Previous work has shown strategies to extract frame-
level and task-level concurrency in programmable Ethernet
controller firmware [19, 34]. Consequently, the following
will focus on executing Snort efficiently for the architecture
of Section 3. The resulting self-securing Ethernet network
interface is called LineSnort.
The architecture of LineSnort requires parallelism to ex-
tract performance. Since both Snort and Ethernet protocol
firmware process one packet at a time, packet-level concur-
rency seems a natural granularity for parallelization. Re-
call the Snort processing loop depicted in Figure 1. Unlike
Snort on a conventional host, LineSnort does not need a sep-
arate stage to read packets since this is already done for
protocol processing itself. The remaining stages, however,
present some obstacles that must be overcome. In partic-
ular, the data structures used by stream reassembly must
be shared by different packets, making packet-level paral-
lelization impractical.
Stream reassembly is implemented using a tree of TCP
session structures that is searched and updated on each
packet so that the packet’s payload is added to the correct
stream of data. A TCP session refers to both directions of
data flow in a TCP connection, considered together. The in-
formation related to the TCP session must be updated based
on each TCP packet’s header and can track information such
as the current TCP state of the associated connection. Be-
cause the session structure tracks the TCP state and data
content of its associated connection, packets within each
TCP stream must be processed in-order. However, there are
no data dependences between different sessions, so a paral-
lelization that ensured that each session would be handled
by only one processor could access the different sessions
without any need for synchronization. Such a paralleliza-
tion scheme implies that the Snort processing loop should
only use session-level concurrency (rather than packet-level
concurrency), at least until stream reassembly completes.
It is worth noting that the IDS itself can be the target of
denial-of-service attacks, and the stream reassembly stage is
a vulnerable target component, which is doubly true when
detection is hardware-assisted. Fortunately, Snort’s stream
preprocessor already contains code which attempts to detect
and mitigate the effects of such attacks, and this code can
also be used by LineSnort.
Figure 3 shows the stages of Snort operating using this
parallelization strategy. The parallel software uses dis-
tributed task queues, with one per processor. When a packet
arrives in NIC memory and is passed to the intrusion detec-
tion code, it must be placed into the queue corresponding to
its flow. The source and destination IP addresses are looked
up in a global hash table. If the stream has an entry, the
queue listed in the table is used. Otherwise the stream is as-
signed to whichever queue is currently shortest and the en-
try is added to the table, ensuring subsequent packets from
the stream will go into the same queue. Stream reassembly
is then performed for each session using the steps described
in Section 2. The processor assigned to the queue places a
pointer to the incoming packet data in its stream reassembly
tree. Eventually, enough of the stream is gathered that Snort
decides to flush it. This processor then finds a free stream
reassembly buffer and copies the packet data from DRAM
into the free buffer. These stream reassembly buffers are al-
located in the scratchpad for fast access; the buffers are then
passed to the hardware pattern matching assist by enqueue-
ing a command descriptor. The reassembler also enqueues
a descriptor for HTTP inspection if the stream is to or from
an HTTP port.
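The queue assignment step described above can be sketched as a small dispatcher. This is an illustrative model rather than the LineSnort firmware: the `FlowDispatcher` class is hypothetical, and sorting the address pair so that both directions of a connection share one key is an assumption made here to reflect that a session covers both directions.

```python
from collections import deque

class FlowDispatcher:
    def __init__(self, num_processors):
        self.queues = [deque() for _ in range(num_processors)]
        self.flow_table = {}    # models the global hash table: flow key -> queue index

    def enqueue(self, src_ip, dst_ip, packet):
        # Direction-independent key (assumption): both halves of a session
        # must land in the same queue.
        key = tuple(sorted((src_ip, dst_ip)))
        index = self.flow_table.get(key)
        if index is None:
            # New stream: pick whichever queue is currently shortest.
            index = min(range(len(self.queues)),
                        key=lambda i: len(self.queues[i]))
            self.flow_table[key] = index
        self.queues[index].append(packet)
        return index

dispatcher = FlowDispatcher(num_processors=2)
first = dispatcher.enqueue("10.0.0.1", "10.0.0.2", "pkt-1")
second = dispatcher.enqueue("10.0.0.3", "10.0.0.2", "pkt-2")
reply = dispatcher.enqueue("10.0.0.2", "10.0.0.1", "pkt-3")
```

Because each flow key maps to exactly one queue, the processor draining that queue can update the session's reassembly state without synchronizing with the others.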
Although the above assignment strategy parallelizes
stream reassembly and other portions of Snort using
session-level concurrency, it provides no benefit for situa-
tions where there are fewer flows than processors (such as a
high-bandwidth single-session attack). Such a situation can
be supported by exploiting a property of stream reassembly:
packets from the same flow that are separated by a stream
reassembly flush point actually have no reassembly-related
dependences between them since they will be reassembled
into separate stream buffers. Consequently, if the proces-
sor enqueueing a new packet finds that its destination queue
is too full (and that another queue is sufficiently empty),
it can change the assignment of the flow to the emptier
queue. This works because the stream reassembler is robust
enough to handle flows encountered mid-stream (in the new
queue), and, after a timeout occurs, will inspect and purge
any streams that have not seen any recent packets (in the old
queue). It is also possible that a flow may be assigned away
from a queue and then later assigned back to it. To prevent
improper reassembly of these disjoint stream segments, a
flush point is inserted before the new section is added to the
reassembly tree. Reassignment achieves better load balance
and supports intra-flow concurrency for higher throughput;
the cost is additional overhead in stream reassembly. In ad-
dition, if reassignment were too frequent (i.e., on almost
every packet), the benefits of stream reassembly would
be negated; the requirement that the shortest queue be suf-
ficiently empty prevents this.
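The reassignment policy can be sketched as follows. The thresholds `HIGH_WATER` and `LOW_WATER` are hypothetical stand-ins for "too full" and "sufficiently empty", whose actual values are not given here, and the sketch conservatively inserts a flush-point marker on every reassignment rather than only when a flow returns to a queue it used before.

```python
from collections import deque

HIGH_WATER = 4   # hypothetical "destination queue is too full" threshold
LOW_WATER = 1    # hypothetical "another queue is sufficiently empty" threshold

class RebalancingDispatcher:
    def __init__(self, num_processors):
        self.queues = [deque() for _ in range(num_processors)]
        self.flow_table = {}
        self.reassignments = 0

    def _shortest(self):
        return min(range(len(self.queues)), key=lambda i: len(self.queues[i]))

    def enqueue(self, flow_key, packet):
        index = self.flow_table.get(flow_key)
        shortest = self._shortest()
        if index is None:
            index = shortest
        elif (len(self.queues[index]) >= HIGH_WATER
              and len(self.queues[shortest]) <= LOW_WATER):
            # Move the flow; the marker keeps the new segment from being
            # reassembled together with older data from this flow.
            index = shortest
            self.reassignments += 1
            self.queues[index].append(("FLUSH_POINT", flow_key))
        self.flow_table[flow_key] = index
        self.queues[index].append(("PACKET", flow_key, packet))
        return index

balancer = RebalancingDispatcher(num_processors=2)
placements = [balancer.enqueue("flowA", n) for n in range(6)]
```

Requiring both conditions keeps a single busy flow from bouncing between queues on every packet, while still letting it spill onto an idle processor.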
The string content matching assist dequeues the descrip-
tor passed in from stream reassembly, scans the reassembled
stream in hardware, and notes any observed rule matches.
For each rule matched by the stream, it writes a descriptor
into the global match queue, containing a pointer to the rule
data. The next processor to dequeue from the match queue
must verify each match reported by the content matcher.
Each Snort rule specifies several conditions (which may in-
clude multiple strings), but the matcher only checks for the
longest exact content string related to each rule. The ver-
ification stage checks each condition for the rule and han-
dles alerting if necessary. At the same time, another proces-
sor may also dequeue the HTTP inspect descriptor if one
was generated. The HTTP inspect portion of the firmware
performs URL normalization and matches “URI-Content”
rules to check for HTTP-specific attacks. If any “URI-
Content” matches are found, verification is performed in
the same manner as for normal content rules. The HTTP
Inspect and verification stages are essentially unmodified
from the original Snort. The rule data structures are allo-
cated in DRAM since they are not updated while process-
ing individual streams; the HTTP inspection data structures
are allocated in the scratchpad, with one per processor. The
notification stage shares common alerting mechanisms; no
attempt was made to privatize this stage since it represents
only 1.5% of the total computation even with complete se-
rialization.
The Snort stages described above integrate into Ethernet
protocol processing after new data arrives at the NIC (trans-
mit or receive). Currently, LineSnort only acts to detect
intrusions, just like the standard Snort; thus, it passes the
frame onto the remainder of Ethernet processing regardless
of whether or not there was a match. LineSnort could be
used to prevent intrusions by dropping a suspicious frame or
sending a reset on a suspicious connection, but that choice
is orthogonal to the basic design of the system.
The firmware uses the scratchpads for all statically-
allocated data (e.g., Ethernet processing control data, inter-
stage queues, and target buffers for stream reassembly and
HTTP Inspect). DRAM is used for frame contents and
all dynamically-allocated data (e.g., the stream reassembly
trees, the session assignment table, and the per-port rule-
tree data). This data allocation does not suffer as a result of
the simple write-through cache in the system because very
little data in DRAM is actually updated by the processors
during operation.
5 Evaluation Methodology
The architecture and firmware described in the previous
sections are evaluated using the Spinach simulation infras-
tructure [33]. Spinach is a toolkit for building network in-
terface simulators and is based on the Liberty Simulation
Environment [32]. Spinach includes modules common to
general-purpose processors (such as registers and memory)
as well as modules specific to network interfaces (such as
DMA and MAC assist hardware). Modules are connected
[Figure 3 diagram not reproduced; recoverable labels: Packet granularity, Stream granularity]
Figure 3. Firmware parallelization strategy used for LineSnort
hierarchical composition of an actual piece of hardware.
Spinach has previously been validated against the Tigon-
II programmable Gigabit Ethernet controller, has been used
for studying architectural alternatives in achieving 10 Giga-
bit Ethernet performance, and has been extended to model
graphics processing units [16, 33, 34].
Evaluating IDS performance is a difficult problem for
several reasons. IDS performance is sensitive to the ruleset
used and to packet contents, but there are very few network
traces available with packet contents because of size and
privacy concerns. In addition, programmable NIC perfor-
mance is sensitive to packet size. Because of its method of
distributing flows to processors, LineSnort is also sensitive
to flow concurrency, the number of active flows running at
one time. Hence, the traces used for evaluation should be
realistic in terms of the packet and flow contents and the
average packet sizes. The rulesets used should also be rea-
sonable for the target machine.
The rulesets used are taken from those distributed by
Sourcefire, the authors of Snort. Sourcefire’s ruleset con-
tains over 4000 rules describing vulnerabilities in all kinds
of programs and services. Although an edge-based NIDS
would need to protect against all possible vulnerabilities,
a per-machine NIDS such as LineSnort only needs to pro-
tect against the vulnerabilities relevant to that particular
server. For example, a Windows machine has no need to
protect against vulnerabilities that exist only in Unix, and
vice versa. Two rulesets are used in these tests: one with
rules for email servers and one with rules for web servers.
Both rulesets also include rules for common services such
as SSH. The mail ruleset contains approximately 330 rules,
and the web ruleset approximately 1050. Tailoring the rule-
set to the target machine allows LineSnort to consume less
memory and see fewer false positives.
LineSnort is evaluated using a test harness that models
the behavior of a host system interacting with its network
interface. The test harness plays packets from a trace at a
specified rate, and Section 6 reports the average rate sus-
tained by LineSnort over the trace. The rules used for test-
ing do not distinguish between sent and received packets, so
all traces are tested using only the send side. Although all
steps related to performing DMAs and network transmis-
sion are included in the LineSnort firmware, the actual PCI
and Ethernet link bandwidths are not modeled, as these are
constantly evolving and can be set according to the achieved
level of performance.
The packet traces used to test LineSnort come from the
1998-1999 DARPA intrusion detection evaluation at MIT
Lincoln Lab, which simulates a large military network [13].
Because they were generated specifically for IDS testing,
the traces have a good collection of traffic and contain at-
tacks that were known at the time. However, because they
were designed for testing IDS efficacy rather than perfor-
mance, they are not realistic with respect to packet sizes
or flow concurrency. To address this problem, a variety
of flows were taken from several traces and reassembled to
more closely match average packet sizes (≈ 778 bytes) seen
in publicly available header traces from NLANR’s Passive
Measurement and Analysis website (pma.nlanr.net). Two
traces, called LL1 and LL2, were created in this manner,
using different ordering and interleaving among the flows.
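To give a feel for what the 778-byte average implies, the following Python sketch estimates the packet rate needed to sustain a given line rate and the aggregate processor cycles available per packet. It is only a back-of-the-envelope illustration. It assumes the 778-byte figure is the full frame size and ignores the Ethernet preamble and inter-frame gap. The function names are hypothetical and are not part of LineSnort or Spinach.

```python
def packets_per_second(line_rate_bps: float, avg_packet_bytes: float) -> float:
    """Packets per second needed to sustain line_rate_bps (framing overhead ignored)."""
    return line_rate_bps / (avg_packet_bytes * 8)


def cycles_per_packet(num_cpus: int, freq_hz: float, pps: float) -> float:
    """Aggregate processor cycles available per packet, summed over all CPUs."""
    return num_cpus * freq_hz / pps


if __name__ == "__main__":
    # Assumption: 1 Gb/s target rate and the 778-byte average packet size from the text.
    pps = packets_per_second(1e9, 778)
    print(f"{pps:,.0f} packets/s at 1 Gb/s")
    print(f"{cycles_per_packet(4, 400e6, pps):,.1f} cycles per packet with 4 CPUs at 400 MHz")
```

Under these assumptions, a Gigabit stream of average-sized packets leaves roughly ten thousand aggregate cycles per packet for a 4-processor, 400 MHz configuration, which must cover Ethernet processing, stream reassembly, and rule verification.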
6 Experimental Results
Figure 4 shows the base results for the Lincoln Lab traces
using the web and mail rulesets for varying numbers of
CPUs. There are many factors, both in hardware and soft-
ware, that affect performance.
Figure 4. LineSnort throughput results achieved with 2–8 CPUs for mail and web rulesets with LL1 and LL2 input traces
Figure 5. Relative speedup for increasing numbers of processors
Rulesets. Rulesets have an important effect on any IDS,
and the choice of rules will always involve performance
tradeoffs. In LineSnort, the ruleset affects how much work
the firmware must do in verifying content matches. Since
the content matcher has only one string table, many matches
will be for rules that do not ultimately lead to alerts; these
“false positives” will be filtered out by the verification stage.
A false positive can arise because the rule does not apply to
the TCP/UDP ports in use or because it specifies additional
conditions that are not met by the packet. If the ports do not
match, the verification will complete quickly. However, if
additional string or regular expression matching is required,
verification will be much slower as the processor will need
to read the packet and process its data using the matching
algorithms. The mail ruleset actually generates more false
positives at the string matching hardware because this rule-
set contains many short strings that appear in many pack-
ets; however, the traces used contain far more web flows
than email flows, so verification for the mail ruleset typi-
cally completes quickly after a simple port mismatch. In
contrast, many web rules continue on to the more expensive
checks, causing the web ruleset to spend nearly twice as
much of its time on verification as the mail ruleset. Within
the verification stage, the largest fraction of CPU time is
consumed by regular expression matching.
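The cost asymmetry described above (a cheap port check versus expensive payload scans) can be sketched as follows. This is a minimal illustration, not Snort's or LineSnort's verification code. The `Rule` fields and the `verify` function are hypothetical and far simpler than real Snort rule options.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """Hypothetical, simplified rule: a port constraint plus optional extra checks."""
    dst_port: Optional[int] = None         # None means "any port"
    extra_content: Optional[bytes] = None  # additional string that must appear
    regex: Optional[re.Pattern] = None     # additional regular expression


def verify(rule: Rule, dst_port: int, payload: bytes) -> bool:
    """Return True only if a content-matcher hit is a real rule match."""
    # Cheap check first: a port mismatch rejects without reading the payload.
    if rule.dst_port is not None and rule.dst_port != dst_port:
        return False
    # Expensive checks: these scan the packet data.
    if rule.extra_content is not None and rule.extra_content not in payload:
        return False
    if rule.regex is not None and rule.regex.search(payload) is None:
        return False
    return True


if __name__ == "__main__":
    mail_rule = Rule(dst_port=25, extra_content=b"HELO")
    web_rule = Rule(dst_port=80, extra_content=b"cmd.exe", regex=re.compile(rb"\.\./"))
    payload = b"GET /../cmd.exe HTTP/1.0"
    print(verify(mail_rule, 80, payload))  # rejected at the port check
    print(verify(web_rule, 80, payload))   # passes all three checks
```

In this sketch, a mail rule triggered by web traffic is rejected at the first comparison, while a web rule on port 80 goes on to the string and regular expression scans, mirroring the behavior the text reports for the two rulesets.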
Processor and Frequency Scaling. Figure 5 shows
the speedup obtained by increasing the number of proces-
sors, normalized to the 2-processor throughput, for each
trace and ruleset. Although the mail ruleset has higher raw
throughput in all cases, the web ruleset scales much better
with increasing numbers of processors. For the mail rule-
set, two main factors contributed to the limitation in scal-
ability. Because the mail ruleset requires a comparatively
small amount of verification work per packet, its through-
put is more limited by the rate at which the processors can
perform the TCP stream reassembly and copy the reassem-
bled data from the DRAM into the scratchpad for inspec-
tion. Thus, there is more contention for the shared cache
with the mail ruleset even with 4 processors than there is
with the web ruleset with 8. The mail ruleset also triggers a
larger increase in lock contention as the number of proces-
sors increases. Conversely, the web ruleset requires much
more verification work per packet, and thus has lower over-
all throughput. Verification reads the reassembled packet
data from scratchpad rather than DRAM, so the overall
workload is more balanced between the scratchpad and the
DRAM for the web ruleset, making increased DRAM con-
tention less important. Moreover, because the scratchpad
is banked, the increase in processors causes less contention
there than for the cache.
Figure 6 shows the relationships between processor fre-
quency and throughput for the LL1 trace with the web and
mail rulesets. Scaling the processor frequency gives less
than linear speedups because the DRAM latency and band-
width are unchanged, so processors at higher frequencies
spend more cycles waiting for DRAM accesses. The web
ruleset scales better with frequency for the same reason it
scales better with additional processors; it is slower overall
and DRAM access makes up a smaller fraction of its time.
Flow Assignment. One of the most important influences
on overall performance is the assignment of packet flows to
processor queues. The assignment must be as balanced as
possible to keep the processor load balanced. Otherwise, if
the load imbalance is enough to fill up one of the stream re-
assembly queues, head-of-line blocking can occur, and all
incoming packets must wait for the full queue. We eval-
uate two primary methods for queue assignment. In the
static assignment method, the source and destination IP ad-
dresses are hashed in a symmetric manner, with the queue
determined directly from the hash, so that packets from the
same TCP session are always in the same queue. In dy-
namic assignment, the source and destination IP addresses
are looked up in a hash table. If the stream has an entry,
the queue stored in the table is used. Otherwise the stream
is assigned to whichever queue is currently shortest, and the
entry is stored, ensuring subsequent packets from the stream
will go into the same queue. Dynamic assignment is then
further enhanced with reassignment as discussed in Section
4.
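The two assignment methods can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the LineSnort firmware. The XOR-based hash is an assumption (the text says only that the hash is symmetric). The `DynamicAssigner` class and its queue-length bookkeeping are hypothetical, and reassignment is not modeled.

```python
from typing import Dict, List, Tuple


def static_queue(src_ip: int, dst_ip: int, num_queues: int) -> int:
    """Static assignment: a symmetric hash picks the queue directly."""
    # Assumption: XOR is used because it is commutative, so swapping
    # source and destination yields the same queue.
    return (src_ip ^ dst_ip) % num_queues


class DynamicAssigner:
    """Dynamic assignment: table lookup, shortest queue for new streams."""

    def __init__(self, num_queues: int) -> None:
        self.queue_lengths: List[int] = [0] * num_queues
        self.table: Dict[Tuple[int, int], int] = {}

    def assign(self, src_ip: int, dst_ip: int) -> int:
        # Order the address pair so both directions share one table entry.
        key = (min(src_ip, dst_ip), max(src_ip, dst_ip))
        queue = self.table.get(key)
        if queue is None:
            # New stream: pick the currently shortest queue and remember it.
            queue = self.queue_lengths.index(min(self.queue_lengths))
            self.table[key] = queue
        self.queue_lengths[queue] += 1
        return queue

    def complete(self, queue: int) -> None:
        """Record that a processor has dequeued one packet from `queue`."""
        self.queue_lengths[queue] -= 1
```

The static function needs no state and no table lookup, which is why it is faster on the serialized assignment path. The dynamic version pays for a lookup on every packet, but it can steer new streams away from a queue that is already long.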
The static assignment method is simpler and faster
(which is good because flow assignment is serialized); how-
ever, dynamic assignment does a better job balancing the
flows, and the reduction in head-of-line blocking compen-
sates for the cost of using the hash table, particularly as
more processors are added. The addition of dynamic re-
assignment provides further improvements for all combi-
nations with 4 or more processors. Reassignment enforces
balance and prevents head-of-line blocking, which compen-
sates for the extra stream reassembly overhead; improve-
ments ranged from 2% to 16%, with an overall average of 7%.
Figure 6. Impact of CPU frequency for LL1 trace: (a) Web ruleset, (b) Mail ruleset
            Web Ruleset   Mail Ruleset
LL1 Trace   1.8           2.8
LL2 Trace   1.4           2.4
Table 2. Speedup for hardware-assisted string matching
Hardware String Matching. One of the main features
of the design is the use of a hardware assist for multi-string
matching. According to Amdahl’s Law, the benefit of ac-
celerating string matching is determined primarily by the
fraction of time spent in software string matching. Table 2
shows the actual speedups for each ruleset and trace combi-
nation when comparing hardware-assisted string matching
with Snort’s baseline software string matcher. As discussed
in Section 3, the hardware string matcher uses only a single
rule table, while the software matcher uses separate rule sets
per target port; consequently, the hardware string matcher
sees more false positives for content checks on ports other
than the actual target. Nevertheless, hardware string match-
ing provides benefits of 1.4–2.8x for these tests. The mail
ruleset benefits much more from the hardware matcher be-
cause, as previously discussed, the mail ruleset requires less
time in the verification stage for our traces than the web
ruleset. Thus, a larger fraction of time is spent in the multi-
string matching phase, allowing a larger benefit by acceler-
ating it. A common practice to improve IDS performance is
to eliminate unneeded rules, or those that often trigger ex-
pensive verification such as regular expressions. Doing so
would serve to increase the benefit of the hardware string
matcher even more.
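The Amdahl's Law reasoning above can be made concrete with a short Python sketch. It uses the speedups from Table 2 and asks what fraction of the software-only run time must have been spent in string matching to make each speedup possible. This is an illustrative bound, not a measurement from the paper. It assumes the idealized form of Amdahl's Law and ignores the extra verification work caused by the hardware matcher's additional false positives. The function names are hypothetical.

```python
def amdahl_speedup(fraction: float, accel: float) -> float:
    """Overall speedup when `fraction` of the run time is accelerated by `accel`."""
    return 1.0 / ((1.0 - fraction) + fraction / accel)


def min_fraction_for_speedup(speedup: float) -> float:
    """Smallest accelerated fraction that can yield `speedup` (limit accel -> infinity)."""
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # Speedups taken from Table 2.
    table2 = {
        ("web", "LL1"): 1.8,
        ("mail", "LL1"): 2.8,
        ("web", "LL2"): 1.4,
        ("mail", "LL2"): 2.4,
    }
    for (ruleset, trace), speedup in table2.items():
        frac = min_fraction_for_speedup(speedup)
        print(f"{ruleset:4s} {trace}: at least {frac:.0%} of time in software string matching")
```

Under this idealized model, the mail ruleset's larger speedups imply that a larger share of its software-only time went to multi-string matching, which agrees with the explanation given in the text.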
Figure 7. Effect of shared cache size on 4-processor throughput
Caches. Processor cache sizes were evaluated using the
static queue assignment method. Willmann et al. showed
that the simple NIC firmware does not benefit much from
data caches [34]; however, the same is not true of intru-
sion detection. Figure 7 shows the impact of cache size on
throughput for the LL1 trace. The working set of the firmware
is small; even a 512-byte cache is about twice as fast as a
system with no cache, and the largest cache is only 20-30%
faster than the smallest. Important data in the DRAM in-
cludes the stream reassembly tree (which is shared for
static assignment) and flow assignment table, which are ac-
cessed frequently, as well as the packet data itself; the pay-
load is only accessed once but the IP and TCP headers are
reused during flow assignment and stream reassembly. The
32-byte cache blocks also help to exploit spatial locality, re-
ducing DRAM row activations by almost 40% with 1 kB of
cache. Since the mail ruleset spends less time in verification
than the web ruleset, it spends a relatively greater fraction in
stream reassembly and thus depends more on DRAM access
for its performance. Consequently, it benefits more from the
presence of and increased size of the cache.
Private per-processor caches were also tested. For most
tests, however, the shared cache performed 10-30% faster
for the same total amount of cache, despite the fact that the
private cache had single-cycle access. The important shared
data structures are frequently accessed and updated, and the
private cache must often be flushed before accessing them
to maintain coherence. By contrast, sharing the data cache
gives the benefits of inter-processor prefetching for these
structures, and processors using the shared cache need only
flush specific lines when reading frame data written by the
assists.
The addition of flow reassignment requires that the
stream reassembly tree be made private per processor. This
means that although the overall number of packets and
packet descriptor structures in the system is the same, the
number of stream descriptors will be larger when reassign-
ment occurs because one stream can be in multiple reassem-
bly trees at once. This has two potential effects. First, it may
increase the demand for cache because of the duplication of
stream descriptors. Second, it may reduce the fraction of
shared data accessed by the processors, potentially bene-
fitting configurations with private caches, reducing or even
reversing the performance gap between shared and private
cache configurations.
Summary and Discussion. Performance of any intru-
sion detection system is heavily dependent on the ruleset
and workload. For our traces the mail ruleset requires less
time in the verification phase, and is thus more dependent
on bandwidth and contention for the cache and DRAM.
The web ruleset requires more verification time and so runs
slower overall, but benefits more from increased number
and frequency of CPUs. Load balancing and assignment of
flows to processors for stream reassembly is also of primary
importance for our parallelized approach. Dynamic assign-
ment, though less predictable and more variable than static
assignment, performs better, particularly as the number of
processors increases. Hardware-assisted string matching
was found to be essential to approach Gigabit speeds, as
was at least a small amount of cache. However, the DRAM
footprint and amount of reuse are small enough that increas-
ing cache size provides only limited benefit. In addition,
a shared cache configuration performed better than private
caches since coherence-related flushes were not needed for
inter-processor sharing. Outside of these considerations,
several paths can be taken to achieve near-Gigabit speeds;
for simpler rulesets, such as the mail ruleset, 4 processors
can be used at 400MHz, 6 at 300MHz, or 8 at 200MHz, de-
pending on costs and power budgets. Likewise, a workload
like the more demanding web ruleset would require 6 pro-
cessors at 400MHz or 8 at 300MHz. Scaling the architec-
ture beyond 8 processors would probably require improve-
ments to the cache; a banked architecture for the shared
cache, or private caches with hardware coherence could po-
tentially address any contention problems. Alternatively,
adding hardware assistance for stream reassembly (such as
“gather” support at the string matcher) could substantially
reduce the overall workload by eliminating copy overheads.
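As a rough way to compare the processor configurations listed above, the following Python sketch computes the aggregate clock rate (number of processors times frequency) for each one. The configurations are taken from the text. The aggregate-MHz metric is an assumption introduced here for illustration, and it ignores the cache and DRAM contention effects that the results show to be significant.

```python
# (num_cpus, freq_mhz) configurations reported in the text as reaching near-Gigabit speeds.
CONFIGS = {
    "mail": [(4, 400), (6, 300), (8, 200)],
    "web": [(6, 400), (8, 300)],
}


def aggregate_mhz(num_cpus: int, freq_mhz: int) -> int:
    """Total clock rate summed over all processors (a crude capacity measure)."""
    return num_cpus * freq_mhz


if __name__ == "__main__":
    for ruleset, configs in CONFIGS.items():
        for cpus, mhz in configs:
            total = aggregate_mhz(cpus, mhz)
            print(f"{ruleset}: {cpus} x {mhz} MHz = {total} MHz aggregate")
```

By this crude measure the mail configurations fall between 1600 and 1800 MHz of aggregate clock rate, and both web configurations come to 2400 MHz. This is consistent with the web ruleset needing more processing per packet, although the measure says nothing about memory behavior.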
7 Related Work
Section 2 gives the background information related to
the architectures and software used in this work. This sec-
tion discusses other related issues in high-performance in-
trusion detection. Other researchers have also considered
incorporating intrusion detection software onto network in-
terfaces, but they have focused on the use of network pro-
cessors (NPUs). Clark et al. used the multithreaded mi-
croengines of an Intel IXP1200-based platform to reassem-
ble incoming TCP streams that were then fed to an FPGA-
based string content matcher [10]. Bos and Huang also pro-
pose a solution based on an IXP1200 that performs both
stream reassembly and string content matching on the mi-
croengines [7]. Both of these systems target traffic rates of
no more than 100 Mbps, although the former performs the
string matching portion at Gigabit speeds. Although nei-
ther of these NPU-based works includes the HTTP prepro-
cessor, they could most probably add it with some perfor-
mance degradation. Although NPUs can move network data
at high rates, their multithreaded latency tolerance features
are targeted toward the memory latencies seen in routers
rather than the much higher DMA latencies seen when NICs
must transfer data to and from the host. Consequently, pre-
vious work on using NPUs in NICs has seen performance
imbalances related to the cost of DMA transfers [21].
In their work proposing self-securing network interfaces,
Ganger et al. suggest integrating the secure network inter-
faces into switch ports rather than keeping them in individ-
ual machines [11]. This strategy aims to protect against
physical intruders swapping out the NICs with standard
ones. Although LineSnort’s design includes host-specific
elements such as DMA, it would not need to change sub-
stantially to support switch integration.
The primary approach to high-performance intrusion de-
tection today is to cluster multiple PCs that execute IDS
software behind a load-balancing switch. Schaelicke et al. have
proposed SPANIDS, a system in which a specially designed
FPGA-based load-balancing switch considers flow information
and system load when redirecting packets to commodity PCs
that run intrusion detection software [26]. Commercial offerings by companies such as Top
Layer use L4–7 load-balancing switches to redirect traffic
to a pool of intrusion-detection nodes, allowing high overall
throughput scalability [31]. The expense and size of these
clusters tend to make them practical only in edge-based de-
ployments. LineSnort supports a different model, with the
self-securing NIC as a key defense against both external and
LAN-based attacks.
The research community has also proposed distributed
NIDS, in which nodes at various points in the network track
anomalies and collaboratively collect data that may indicate
a system-level intrusion even if no specific host triggers an
alert [12, 28]. Efforts in distributed NIDS have targeted
collaboratively gathering additional information to identify
intrusions, rather than processing packets at a faster rate.
Thus, distributed NIDS is largely orthogonal to LineSnort.
8 Conclusions
This paper presents the architecture and software design
of LineSnort, a programmable network interface card (NIC)
that offloads the Snort network intrusion detection system
(NIDS) from the host CPU of a high-end PC-based network
server. The paper investigates and analyzes design alterna-
tives and workloads that impact the performance of Line-
Snort, including the Snort rulesets used and the number and
frequency of processor cores.
Leveraging previous work in programmable NICs and
hardware content matching, LineSnort protects a single
host from both LAN-based and Internet-based attacks, un-
like an edge-based NIDS, which guards only against the lat-
ter. LineSnort exploits TCP session-level parallelism using
several lightweight processor cores and a dynamic assign-
ment of TCP flows to cores. LineSnort also exploits intra-
session concurrency through flow reassignment, providing
better load balance and higher throughput as the number
of cores increases, or for workloads with poor flow con-
currency. Simulation results using the Spinach toolkit and
Liberty Simulation Environment show that LineSnort can
achieve Gigabit Ethernet network throughputs while sup-
porting all standard Snort rule features, reassembling TCP
streams, and transforming HTTP URLs. To achieve these
throughput levels, LineSnort requires a small shared cache,
a string-matching assist in the hardware, and one of sev-
eral options for the number and frequency of processors.
Lightweight rulesets can achieve Gigabit throughput with
4 CPU cores at 400MHz, 6 at 300MHz, or 8 at 200MHz,
while a more demanding ruleset requires 6 CPU cores at
400MHz or 8 at 300MHz.
Substantial prior work has considered the use of net-
work interface cards as a resource for optimizing the flow of
communication in a system, with targets ranging from sim-
ple checksumming to full protocol offload and customized
services [1, 15, 18, 20, 22, 27]. LineSnort helps to de-
liver on the promise of programmable network interfaces
by demonstrating that efficiently offloading the challenging
problem of intrusion detection enables the new service of
per-machine NIDS for network servers, providing a higher
level of protection against both Internet-based and LAN-
based attacks.
References
[1] Adaptec. ANA-7711 Network Accelerator Card Specifica-
tion, Mar. 2002.
[2] A. V. Aho and M. J. Corasick. Efficient String Matching: An
Aid to Bibliographic Search. Commun. ACM, 18(6):333–
340, 1975.
[3] M. Aldwairi, T. Conte, and P. Franzon. Configurable
string matching hardware for speeding up intrusion detec-
tion. SIGARCH Comput. Archit. News, 33(1):99–107, 2005.
[4] Alteon Networks. Tigon/PCI Ethernet Controller, Aug.
1997. Revision 1.04.
[5] Z. K. Baker and V. K. Prasanna. A Methodology for the
Synthesis of Efficient Intrusion Detection Systems on FP-
GAs. In Proc. of the Twelfth Annual IEEE Symposium on
Field Programmable Custom Computing Machines 2004,
Apr. 2004.
[6] T. Berners-Lee, R. Fielding, and L. Masinter. Uniform Re-
source Identifiers (URI): Generic Syntax. IETF RFC 2396,
Aug. 1998.
[7] H. Bos and K. Huang. Toward software-based signature de-
tection for intrusion prevention on the network card. In Proc.
of the Eighth International Symposium on Recent Advances
in Intrusion Detection, September 2005.
[8] R. S. Boyer and J. S. Moore. A Fast String Search Algo-
rithm. Commun. ACM, 20(10):762–772, Oct. 1977.
[9] B. C. Brodie, D. E. Taylor, and R. K. Cytron. A scalable
architecture for high-throughput regular-expression pattern
matching. In ISCA ’06: Proceedings of the 33rd Interna-
tional Symposium on Computer Architecture, pages 191–
202, Washington, DC, USA, 2006. IEEE Computer Society.
[10] C. Clark, W. Lee, D. Schimmel, D. Contis, M. Koné, and
A. Thomas. A Hardware Platform for Network Intrusion
Detection and Prevention. In Proc. of the Third Workshop
on Network Processors and Applications, February 2004.
[11] G. R. Ganger, G. Economou, and S. M. Bielski. Find-
ing and Containing Enemies Within the Walls with Self-
securing Network Interfaces. Technical Report CMU-CS-
03-109, Carnegie Mellon School of Computer Science, Jan.
2003.
[12] R. Gopalakrishna and E. H. Spafford. A Framework for Dis-
tributed Intrusion Detection using Interest Driven Cooperat-
ing Agents. In Proc. of the 4th International Symposium on
Recent Advances in Intrusion Detection, Oct. 2001.
[13] J. W. Haines, R. P. Lippmann, D. J. Fried, E. Tran,
S. Boswell, and M. A. Zissman. 1999 DARPA Intrusion De-
tection System Evaluation: Design and Procedures. Techni-
cal Report 1062, MIT Lincoln Laboratory, 2001.
[14] R. N. Horspool. Practical Fast Searching in Strings. Soft-
ware: Practice and Experience, 10(6):501–506, 1980.
[15] Y. Hoskote et al. A TCP Offload Accelerator for 10 Gb/s
Ethernet in 90-nm CMOS. IEEE Journal of Solid-State Cir-
cuits, 38(11):1866–1875, Nov. 2003.
[16] G. S. Johnson, J. Lee, C. A. Burns, and W. R. Mark. The
irregular z-buffer: Hardware acceleration for irregular data
structures. ACM Transactions on Graphics, 24(4):1462–
1482, 2005.
[17] P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy Release
Consistency for Software Distributed Shared Memory. In
Proc. of the 19th Annual International Symposium on Com-
puter Architecture, pages 13–21, May 1992.
[18] H. Kim, V. S. Pai, and S. Rixner. Improving Web Server
Throughput with Network Interface Data Caching. In Proc.
of the Tenth International Conference on Architectural Sup-
port for Programming Languages and Operating Systems,
pages 239–250, October 2002.
[19] H. Kim, V. S. Pai, and S. Rixner. Exploiting Task-Level
Concurrency in a Programmable Network Interface. In Proc.
of the ACM SIGPLAN Symposium on Principles and Prac-
tice of Parallel Programming, June 2003.
[20] K. Kleinpaste, P. Steenkiste, and B. Zill. Software Sup-
port for Outboard Buffering and Checksumming. In Proc. of
the ACM SIGCOMM ’95 Conference on Applications, Tech-
nologies, Architectures, and Protocols for Computer Com-
munication, pages 87–98, Aug. 1995.
[21] K. Mackenzie, W. Shi, A. McDonald, and I. Ganev. An
Intel IXP1200-based Network Interface. In Proc. of the 2003
Annual Workshop on Novel Uses of Systems Area Networks,
Feb. 2003.
[22] K. Z. Meth and J. Satran. Design of the iSCSI Protocol. In
Proc. of the 20th IEEE Conference on Mass Storage Systems
and Technologies, Apr. 2003.
[23] Micron. 256Mb: x32 GDDR3 SDRAM MT44H8M32 data
sheet, June 2003. Available from www.digchip.com.
[24] J. Moscola, J. Lockwood, R. P. Loui, and M. Pachos. Im-
plementation of a Content-Scanning Module for an Internet
Firewall. In Proc. of the 11th Annual IEEE Symposium on
Field-Programmable Custom Computing Machines, pages
31–38, Apr. 2003.
[25] M. Roesch. Snort – Lightweight Intrusion Detection for Net-
works. In Proc. of the 13th USENIX Conference on System
Administration, pages 229–238, 1999.
[26] L. Schaelicke, K. Wheeler, and C. Freeland. SPANIDS:
A Scalable Network Intrusion Detection Loadbalancer. In
Proc. of the 2nd Conference on Computing Frontiers, pages
315–322, 2005.
[27] P. Shivam, P. Wyckoff, and D. Panda. EMP: Zero-copy OS-
bypass NIC-driven Gigabit Ethernet Message Passing. In
Proc. of the 2001 ACM/IEEE Conference on Supercomput-
ing, Nov. 2001.
[28] S. R. Snapp et al. DIDS (Distributed Intrusion Detection
System) - Motivation, Architecture, and An Early Prototype.
In Proc. of the 14th National Computer Security Conference,
pages 167–176, Washington, DC, Oct. 1991.
[29] I. Sourdis and D. Pnevmatikatos. Fast, Large-Scale String
Match for a 10Gbps FPGA-based Network Intrusion Detec-
tion System. In Proc. of the 13th International Conference
on Field Programmable Logic and Applications, pages 880–
889, Sept. 2003.
[30] L. Tan and T. Sherwood. A High Throughput String Match-
ing Architecture for Intrusion Detection and Prevention. In
Proc. of the 32nd Annual International Symposium on Com-
puter Architecture, pages 112–122, June 2005.
[31] Top Layer Networks. Network intrusion detection systems:
Important IDS network security vulnerabilities. White Pa-
per, September 2002.
[32] M. Vachharajani, N. Vachharajani, D. A. Penry, J. A. Blome,
and D. I. August. Microarchitectural Exploration with Lib-
erty. In Proc. of the 35th Annual International Symposium
on Microarchitecture, pages 271–282, November 2002.
[33] P. Willmann, M. Brogioli, and V. S. Pai. Spinach: A Liberty-
Based Simulator for Programmable Network Interface Ar-
chitectures. In Proc. of the ACM SIGPLAN/SIGBED 2004
Conference on Languages, Compilers, and Tools for Embed-
ded Systems, pages 20–29, June 2004.
[34] P. Willmann, H. Kim, S. Rixner, and V. S. Pai. An Effi-
cient Programmable 10 Gigabit Ethernet Network Interface
Card. In Proc. of the 11th International Symposium on High-
Performance Computer Architecture, pages 96–107, Febru-
ary 2005.
[35] S. Wu and U. Manber. A fast algorithm for multi-pattern
searching. Technical Report TR-94-17, Department of Com-
puter Science, University of Arizona, 1994.
[36] F. Yu, R. H. Katz, and T. V. Lakshman. Gigabit Rate Packet
Pattern-Matching Using TCAM. In Proc. of the 12th IEEE
International Conference on Network Protocols, pages 174–
183, Oct. 2004.
