Power and Memory Efficient Hashing Schemes for Some Network Applications by Yu, Heeyeol
POWER AND MEMORY EFFICIENT HASHING SCHEMES
FOR SOME NETWORK APPLICATIONS
A Dissertation
by
HEEYEOL YU
Submitted to the Oﬃce of Graduate Studies of
Texas A&M University
in partial fulﬁllment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
May 2009
Major Subject: Computer Science
Approved by:
Chair of Committee, Rabi Mahapatra
Committee Members, Duncan M. Walker
Riccardo Bettati
Gwan Choi
Head of Department, Valerie E. Taylor
ABSTRACT
Power and Memory Eﬃcient Hashing Schemes
for Some Network Applications. (May 2009)
Heeyeol Yu, B.S., Korea Advanced Institute of Science and Technology;
M.S., University of California, Los Angeles
Chair of Advisory Committee: Dr. Rabi Mahapatra
Hash tables (HTs) are used to implement various lookup schemes, and they need
to be efficient in terms of speed, space utilization, and power consumption. For IP
lookup, hashing schemes are attractive due to their deterministic O(1) lookup
performance and low power consumption, in contrast to TCAM- and trie-based
approaches. As the size of IP lookup tables grows exponentially, scalable lookup
performance is highly desirable. For next-generation high-speed routers, this is a
vital requirement because IP lookup remains in the critical data path and demands
predictable throughput. However, recently proposed hash schemes, such as a Bloomier
filter HT and a Fast HT (FHT), suffer from a number of flaws, including setup failures,
update overheads, duplicate keys, and pointer overheads. In this dissertation, four
novel hashing schemes and their architectures are proposed to address these
concerns by using pipelined Bloom filters and a fingerprint filter, which are designed
for memory-efficient approximate matching. For IP lookup, two new hash schemes,
a Hierarchically Indexed Hash Table (HIHT) and a Fingerprint-based Hash
Table (FPHT), are introduced to assure a perfect match without pointer
overhead. Further, two hash mechanisms are proposed to provide memory- and
power-efficient lookup for packet processing applications.
Among the four proposed schemes, the HIHT and the FPHT are evaluated
for their performance and compared with TCAM- and trie-based IP lookup schemes.
Various sizes of IP lookup tables are considered to demonstrate scalability in terms
of speed, memory use, and power consumption. While an FPHT uses less memory
than an HIHT, an FPHT-based IP lookup scheme reduces power consumption by a
factor of 51 compared to a TCAM-based scheme and requires 1.8 times less memory
than a trie-based IP lookup scheme. In this dissertation, a multi-tiered packet
classifier is also proposed that consumes up to 3.2 times less power than the
existing parallel packet classifier.
Hashing schemes intrinsically lack high throughput, unlike partitioned Ternary
Content Addressable Memory (TCAM)-based schemes, which are capable of parallel
lookups at the cost of large power consumption. To address this, a hybrid CAM
(HCAM) architecture is introduced. Simulation results indicate that the HCAM
achieves the same throughput as contemporary schemes while using 2.8 times less
memory and 3.6 times less power.
To my family
ACKNOWLEDGMENTS
I would like to thank Dr. Mahapatra for his direction and support over the last
3 years. His faith in my abilities helped mould my transition from a graduate student
into a researcher. I would also like to thank Drs. Walker, Bettati, and Choi for
serving on my committee and being excellent teachers.
I want to thank my family, who supported me mentally and financially. In addition,
my soccer club, the Korean Aggies Soccer Association (KASA), gave me wonderful joy
during my study at Texas A&M University. In particular, I miss Bong Su Koh, who
always gave me a smile; Sanghyub Kang, who was an ex-professional soccer player;
Jaewoo Suh, who always loves an overnight drink; Won Ju Sung, whom I met for just
one semester; and Hyeongil Kwak, who gave spiritual help. Furthermore, I want to
express my lifelong appreciation to Uichin Lee at the University of California,
Los Angeles, who helped me in many ways.
NOMENCLATURE
HT Hash Table
LHT Legacy Hash Table
FHT Fast Hash Table
TCAM Ternary Content Addressable Memory
CTCAM Cool TCAM
UTCAM Ultra TCAM
STCAM Selective TCAM
BTCAM Beyond TCAM
ACSM Approximate Concurrent State Machines
SRAM Static Random Access Memory
DRAM Dynamic Random Access Memory
PC Preﬁx Collapse
CPE Controlled Preﬁx Expansion
TCP Transmission Control Protocol
IP Internet Protocol
SIP Source IP
DIP Destination IP
BF Bloom Filter
MBF Multi-predicate Bloom Filter
FF Fingerprint Filter
SBF Segmented Bloom Filter
SL Successful Lookup
UL Unsuccessful Lookup
SS Successful Search
US Unsuccessful Search
BMF Bloomier Filter
PPC Parallel Packet Classiﬁer
MPC Multi-tiered Packet Classiﬁer
2TPC 2-tiered Packet Classiﬁer
3TPC 3-tiered Packet Classiﬁer
MBHT Multi-predicate Bloom ﬁlter Hash Table
HIHT Hierarchically Indexed Hash Table
IT Indexing Tree
HIT Hierarchical Indexing Tree
FPHT Fingerprint-based Hash Table
HCAM Hybrid CAM
SMT Segmented Multibit Trie
TABLE OF CONTENTS
CHAPTER
I INTRODUCTION
II RELATED WORKS IN PACKET PROCESSING
A. IP Lookup
B. Packet Classification
C. Other Packet Processing Applications
D. Parallel IP Lookup Using TCAM or SRAM
III BASICS ON HASH FOR PACKET PROCESSING
A. Basic Bloom Filter Theory
B. A Memory- and Power-Efficient Fingerprint Filter
C. IP Lookup Using Hashing
1. Controlled Prefix Expansion
2. Prefix Collapse
3. IPv6 IP Lookup
D. Packet Classification Using Hashing
IV A MULTI-TIERED PACKET CLASSIFIER WITH N BFS
A. Building a Multi-tiered Packet Classifier
B. Insert Operation in an MPC
C. Query Operation in an MPC
1. False classification in a successful lookup
2. False classification in an unsuccessful lookup
D. Delete Operation in an MPC
E. Simulation Result for an MPC
1. Experiment for Power
2. Experiment for Throughput
V MULTI-PREDICATE BLOOM-FILTERED HASH TABLE
A. Index Address to a Key Table in Base-b
B. Memory Efficiency with a Larger Base-b
C. Insert Operation in an MBHT
D. Query Operation in an MBHT
1. False indexing for an SS in an MBHT
2. False indexing in a US in an MBHT
3. Hardware consideration for pipelining
E. Delete Operation in MBHTs
F. Analysis and Simulation for an MBHT
1. Average access time of query
2. Memory usage
VI A HIERARCHICALLY INDEXED HASH TABLE
A. Building a Conceptual HIT in Stacked SRAMs
B. Insert Operation in an HIT
C. Delete Operation in dual HITs
D. Query Operation Making Index Paths in Dual HITs
1. False indexing to a key table in on-chip for a US
2. False indexing to a key table in on-chip for an SS
3. Detailed procedures for query and delete
4. Parallel accesses to a key table in an interleave way
E. Simulation Result for an HIHT
1. Memory comparison with other hash mechanisms
2. Power comparison with TCAM for IP lookup
3. Memory comparison with Trie for IP lookup
VII HASHING USING BLOOM AND FINGERPRINT FILTERS
A. Building a Conceptual IT of a Binary Prefix Tree
B. Insert Operation in an IT
C. Query Operation Making Indexes in an IT
1. False indexing to a key table for a UL
2. False indexing to a key table for an SL
3. Detailed algorithm for query
D. Delete Operation with Counting BFs
E. FPHT Optimization in a b-ary Prefix Tree
F. Simulation Results for an FPHT
1. Memory size in consideration of speed and scalability
2. Power comparison with TCAM for IP lookup
3. Memory comparison with Trie for IP lookup
VIII HASH-BASED IP LOOKUP ARCHITECTURE
A. Hash-based IP Lookup Architecture Build
B. Simulation Result of HIHT and FPHT-based IP Lookup Schemes
1. Power-efficient hash-based IP lookup
2. Memory-efficient hash-based IP lookup
IX HYBRID CAMS OF CAM AND SRAM FOR IP LOOKUP
A. HCAM-based IP Lookup Architecture
B. Prefix Transformation with CAM & SRAM
1. Prefix collapse
2. A complete prefix match through an STB in SRAM
C. A Bloom Filter-based Lookup Distributor
D. Experimental Results for an HCAM-based Scheme
1. Throughput
2. Power
X SUMMARY
A. Conclusion
B. Future Works
REFERENCES
VITA
LIST OF TABLES
TABLE
I Lookup & update complexities.
II Hardware features of each scheme.
III Power value by CACTI in PPC(31Kx1, 20 ports), 2TPC(29Kx1, 19 ports), and 3TPC(14Kx1, 18 ports).
IV Complexities of operations to off-chip in four schemes.
V On-chip memory usage for three traces. The load factor is 0.034, K=1024.
VI AAS in a successful search of NLANR trace for three schemes. f=2^-10.
LIST OF FIGURES
FIGURE
1 Comparison of power and area for a BF and an FF through CACTI.
2 Prefix conversion of a CPE with 3 prefixes in stride 3.
3 Prefix conversion of PC with the same 3 prefixes of Fig. 2.
4 Parallel packet classifier engine of n BFs in a given packet.
5 Throughput comparison in a different number of BFs, ps, and k.
6 Power and area in multi memory read ports for 64K×1-bit memory.
7 Pipeline memory architecture of a 2TPC in a forest. S1 and S2 are pipeline stages. Bij means the j-th BF at layer i. n=4. k=w due to Eq. (3.3). w2=1, w1=k-1. b is a buffer size.
8 Memory architecture of a 3TPC in a forest and in pipeline. Bij means the j-th BF at layer i. n=8. k=w due to Eq. (3.3).
9 (a) The total number of read ports in different number of BFs. w3=w2=1, w1=13 for a 3TPC. w2=1, w1=14 for a 2TPC. f=2^-15. (b) 2TPC and PPC area costs with n=8 in .13μm process technology.
10 The average packet misclassification for a PPC-n and a 3TPC-n in a different SL rate. f=2^-w=2^-30, w1=28, w2=w3=1. n ∈ {32, 64, 128}.
11 The number of read ports and average number of memory reads in different number of BFs. w3=w2=1, w1=13 for a 3TPC. w2=1, w1=14 for a 2TPC. f=2^-15.
12 Power consumption by two traces in PPCs, 2TPCs, and 3TPCs. Also, n ∈ {8, 16, 32}.
13 Throughput ratios of a 2TPC against a PPC with four traces in different number of buffer size b and n BFs. w1=28, w2=2.
14 Macro view of an MBHT in on/off-chip memory of base-2. n=22.
15 Partitioning of 8 elements in base-2 with 0-BFs and 1-BFs.
16 Conversion of the base-2 number system to base-4 and base-8 for 64 elements. n = 2^6. By (X), X means the number of the same digits in a BF.
17 Memory size Mb for b = 2, 4, 8, 16, and 32 with f and n.
18 Probability of Xs, the number of f-indexes, in an SS. n = 2^16. Required f=2^-10 for b=2.
19 Probability of Xu, false memory access, in a US. n = 2^16. Required f=2^-10 for b=2.
20 The benefit of pipeline in an MBF returning ’no’ in a query for two cases of k=12 or 24.
21 An example of delete for item e located at 0124 in base-4.
22 Probabilities of memory access in an SS and a US and the average access time to off-chip for an LHT, an FHT, and an MBHT with the same memory 128K log2 n to fully utilize the saved memory for increase in precisions of base-8 and base-16. k=10, and n=64K.
23 Memory efficiency ratios of RM,L and RM,F with various b and n. wF=wM=20. Note that although an MBHT is set to have the same average access as others, the actual average access times are different from each other as shown in Fig. 22.
24 Basic configuration of hierarchical indexing tree of 0- and 1-tree.
25 Dual configuration of HITs for delete operation.
26 Examples of an i-path, f-segments, and f-paths. Probability of f-paths.
27 An i-path and d-trees in an SS, and P i(n) of Eq. (7.1) for each d-tree in an HIT.
28 Memory efficiency ratios of RH,B and RH,F with various s and w. Note a corrected-FHT is considered.
29 Consumed energy per read clock in 0.09μm process technology.
30 Memory comparison of Tree Bitmap and an HIHT in different table sizes.
31 A pipelined FPHT architecture with s stages.
32 Conceptual IT construction with BFs and tables of FPs and keys.
33 Examples of an i-path and f-paths for a given query of key e4 in an IT without a virtual root.
34 An IT of 3 layers (or stages) with an i-path and dangling trees.
35 A sample configuration of a 4-SBF in k=2 banks. A 4-SBF represents S0 through S3. The memory size is 2×4×4.
36 Memory efficiency ratios of an FPHT over an MBHT and an HIHT at various n and w. In an FPHT, a lookup precision of a CBF is set to 6 for a 16-ary prefix tree.
37 Consumed energy per read clock in 0.09μm process technology.
38 Memory comparison of Tree Bitmap and an FPHT in different table sizes.
39 The number of collapsed prefixes and the average number of duplicate next-hops at various stride s. The prefix number for AS 65000 and AS 6447 are 233451 and 235307, respectively.
40 IP lookup architecture with parallel Hash Lookup Engines (HLEs) for a wildcard support. Each HLE has different c and s values.
41 Consumed energy per read clock in 0.09μm process technology.
42 Memory size comparison of Tree Bitmap, an HIHT, and an FPHT in different table sizes.
43 HCAM-based IP lookup architecture for a prefix set. Stride s=2. The collapsed prefix lengths, d1, d2, d3, are 2, 5, and 8, respectively.
44 A sample prefix set and a subtrie in a uni-bit trie for the set.
45 The number of collapsed prefixes and the number of transistors at various stride s.
46 A stride tree for 2 prefix strides and an index method to an NH table.
47 The memory comparison of all schemes in terms of a transistor. Lookup precision w=10. Note that ’HCAM’ includes all CPs, STBs, and BFs.
48 Queuing model of nc pipelines in an HCAM. nc=3.
49 Goodput vs. measured throughput of a CAM block in an SDA trace. ρ=0.95.
50 a) Total energy consumption in one clock for an NTCAM, a UTCAM, and an HCAM. Symbols ’N’, ’U.14’, and ’H.6’ denote NTCAM with a block of whole prefixes, UTCAM with 16 blocks of 14K prefixes, and HCAM with 16 blocks of 6K prefixes, respectively. .13μm process technology is used. b) The energy consumptions for a single lookup operation in a block for three schemes.
CHAPTER I
INTRODUCTION
In packet processing, a router quickly associates packets with a set of rules for
packet forwarding or various network services. Providing such fast packet
processing, as in IP lookup and packet classification, becomes harder as the demand
for high-speed and large-scale routers continues to surge. It has been reported that
Internet traffic is doubling every two years, following Moore’s law of data traffic
[1], and that the number of hosts is tripling every two years [2].
This rapid growth in traffic and host numbers leads to two major packet processing
problems for core routers. 1) Speed: a high-speed router needs to look
up a rule table at a rate that satisfies its bandwidth requirement. For example,
IP lookup at a rate of 160 Gbps must process 500M lookups per second, which
implies that a minimum-size packet of 40 bytes must be forwarded to a next hop
within 2 ns in the worst case. 2) Scalability: fast packet processing must find
an associated rule even among hundreds of thousands of rules. For instance, in the
packet classification domain the maximum number of rules is up to 2^104 due to the
104-bit length of a tuple of source and destination IPs, etc.
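The speed requirement above is a short piece of arithmetic, sketched here in Python. The 160 Gbps rate and the 40-byte minimum packet come from the text; the function names are illustrative only.

```python
# Worked arithmetic for the line-rate example in the text.
LINE_RATE_BPS = 160e9      # 160 Gbps, from the text
MIN_PACKET_BYTES = 40      # minimum packet size, from the text


def lookups_per_second(line_rate_bps: float, packet_bytes: int) -> float:
    """Worst-case lookup rate: every packet has the minimum size."""
    return line_rate_bps / (packet_bytes * 8)


def time_budget_ns(line_rate_bps: float, packet_bytes: int) -> float:
    """Time available for one lookup, in nanoseconds."""
    return 1e9 / lookups_per_second(line_rate_bps, packet_bytes)


rate = lookups_per_second(LINE_RATE_BPS, MIN_PACKET_BYTES)
budget = time_budget_ns(LINE_RATE_BPS, MIN_PACKET_BYTES)
print(f"{rate / 1e6:.0f}M lookups/s, {budget:.1f} ns per lookup")
```

A 40-byte packet is 320 bits, so 160 Gbps corresponds to 500M packets per second and a 2 ns budget per lookup.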
Since fast packet processing is in the router’s critical data path, the literature
on packet processing has developed numerous fast lookup schemes using three major
techniques: Ternary Content Addressable Memory (TCAM) [3–6], trie [7–10], and
hashing [11–17]. Although a TCAM provides deterministic, high-speed packet
lookup [5, 6], its non-commodity nature and brute-force search method make its die
area cost and power dissipation prohibitive for a large number of rules and high
line rates. Unlike a TCAM, a trie-based scheme uses a tree-like
data structure to successively classify a packet a few bits at a time [7–10], but it
inherently suffers from the space needed to hold pointers from nodes to their
children and from the sequential memory accesses introduced by these pointers. In
addition, imbalanced memory access, caused by the irregular prefix distribution in
a trie’s tree structure, hinders high IP lookup performance. In contrast, hash-based
schemes neither perform brute-force lookups as in a TCAM nor suffer from imbalanced
memory access, so they can potentially achieve order-of-magnitude power and memory
savings, respectively.
This dissertation follows the style of IEEE Transactions on Networking.
Traditionally, a hash table (HT) is widely used for fast search because of
its O(1) average memory access per lookup under reasonable assumptions.
Recently, HTs have been used in a wide variety of packet processing applications
such as intrusion detection systems [18], packet classification [19, 20], TCP/IP
system management [21], and IP lookup [12, 16].
In particular, the binary search on prefix lengths algorithm [22], which uses HTs,
has the best theoretical performance of any sequential algorithm for the longest
prefix match in IP lookup. In addition, packet classification applications utilize
HTs [23, 24]: they first perform a lookup on a single header field and then leverage
the results to narrow down the search to a smaller subset of packet classifiers.
These applications use HTs with the expectation of O(1) memory access and a more
predictable worst-case lookup.
However, as the table occupancy, or load, increases, collisions occur more
frequently, which in turn reduces performance by increasing the cost of the
primitive operations. Although there are two collision resolution methods (i.e.,
open addressing and chaining), schemes based on open addressing are not suitable
for fast lookup because of their worst-case performance. In addition, chaining
suffers from pointer overhead as it maintains a set of linked lists.
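The effect of load on a chained hash table can be illustrated with a minimal sketch. This is a generic textbook structure, not one of the schemes proposed in this dissertation; Python lists stand in for linked lists, and integer keys are used so that the bucket assignment is deterministic.

```python
class ChainedHashTable:
    """Hash table with separate chaining; each bucket is a list of (key, value)."""

    def __init__(self, num_buckets: int):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value) -> None:
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:              # overwrite an existing key
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def lookup(self, key):
        """Return (value, probes); probes counts the chain entries examined."""
        probes = 0
        for k, v in self._bucket(key):
            probes += 1
            if k == key:
                return v, probes
        return None, probes


def average_probes(num_keys: int, num_buckets: int) -> float:
    """Average chain entries examined per successful lookup."""
    table = ChainedHashTable(num_buckets)
    for key in range(num_keys):
        table.insert(key, key)
    return sum(table.lookup(key)[1] for key in range(num_keys)) / num_keys


print(average_probes(64, 64))    # load factor 1
print(average_probes(256, 64))   # load factor 4
```

With these integer keys every chain has one entry at load factor 1 and four entries at load factor 4, so the average cost of a successful lookup grows from 1 to 2.5 entries examined.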
While these solutions are designed to maintain good average performance despite
high loads and increased collisions, their performance nevertheless falls short of
the packet processing needs: high speed and scalability. To satisfy such needs, an
on-chip Bloom filter (BF) is widely used because a BF built on an m-bit vector can
provide both memory efficiency and high throughput through approximate membership
testing. Packet processing applications using such a BF include IP lookup
[12], intrusion detection [13], and packet classification [19, 20]. Also, in
trading off space, computation, and the impact of false positive lookups, an
efficient lookup using a fingerprint filter (FF) or d-left hashing has been
considered preferable in the networking literature [21, 25, 26]. The reason is that
even though a BF provides memory-efficient approximate lookup, an FF is found to be
more efficient in power and memory usage for set representation than a BF.
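As a minimal sketch of approximate membership testing with an m-bit vector, the following Python class implements a standard Bloom filter. The use of salted SHA-256 to derive the k bit positions is an assumption made for a self-contained example; a hardware design would use much cheaper hash functions.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter over an m-bit vector with k hash functions."""

    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Assumption: the i-th hash is SHA-256 salted with the byte i (k < 256).
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bf = BloomFilter(m_bits=1024, k_hashes=3)
keys = [b"10.0.0.0/8", b"10.1.0.0/16", b"192.168.0.0/24"]
for key in keys:
    bf.add(key)
print(all(bf.might_contain(key) for key in keys))
```

A stored key always tests positive, while a key that was never stored may still test positive with a small probability. That false positive is why a BF alone gives only an approximate match.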
However, such an approximate match is not suitable for all packet processing
applications; one exception is IP lookup, where packets must strictly
be forwarded to a next hop according to a prefix table. For instance, the recently
proposed IP lookup approaches [12, 16] have the following design flaws that make
them unsuitable for a high-speed and large-scale router: 1) A Bloomier filter-based
hash table (BFHT) [16] utilizes a Bloomier filter [27]. However, it inherits two
disadvantages of a Bloomier filter: a possible setup failure in saving n keys and
O(n log n) setup complexity for n keys. 2) The authors of [12] propose a
memory-efficient IP lookup using BFs, each assigned to a set of same-length
prefixes. Although this scheme provides fast approximate matches on-chip, the
perfect prefix match is completed in an off-chip HT because of the BFs’ false
positive matches. Thus, the lookup time is bounded by slow off-chip memory access.
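The lookup flow described for the scheme in [12] can be sketched roughly as follows. This is a simplified illustration and not the authors’ implementation: Python sets stand in for the per-length on-chip BFs, a dictionary stands in for the off-chip HT, and the hypothetical `inject_false_positive` helper imitates a BF false positive, which a set would never produce by itself.

```python
def ip_to_bits(addr: str) -> str:
    """Convert a dotted-quad IPv4 address to a 32-character bit string."""
    return "".join(f"{int(octet):08b}" for octet in addr.split("."))


class PerLengthLookup:
    def __init__(self):
        self.filters = {}  # prefix length -> set of prefixes (stand-in for one BF)
        self.table = {}    # prefix bits -> next hop (stand-in for the off-chip HT)

    def insert(self, prefix_bits: str, next_hop: str) -> None:
        self.filters.setdefault(len(prefix_bits), set()).add(prefix_bits)
        self.table[prefix_bits] = next_hop

    def inject_false_positive(self, prefix_bits: str) -> None:
        """Hypothetical helper: mark a prefix in its filter without storing it."""
        self.filters.setdefault(len(prefix_bits), set()).add(prefix_bits)

    def lookup(self, addr: str):
        """Return (next_hop, off_chip_accesses) for the longest matching prefix."""
        bits = ip_to_bits(addr)
        off_chip_accesses = 0
        for length in sorted(self.filters, reverse=True):   # longest first
            candidate = bits[:length]
            if candidate in self.filters[length]:           # filter says "maybe"
                off_chip_accesses += 1                      # verify in the HT
                if candidate in self.table:
                    return self.table[candidate], off_chip_accesses
        return None, off_chip_accesses


lpm = PerLengthLookup()
lpm.insert("00001010", "A")                              # 10.0.0.0/8
lpm.insert("0000101000000001", "B")                      # 10.1.0.0/16
lpm.inject_false_positive("000010100000000100000010")    # /24 filter misfires
print(lpm.lookup("10.1.2.3"))
```

In this run the address 10.1.2.3 costs two off-chip accesses: one wasted on the false positive at length 24 and one that finds the /16 entry. Each false positive adds a slow off-chip access, which is the limitation the paragraph above points out.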
A scheme complementing that of [12] is proposed in [15] to provide fast off-chip HT
access. However, this scheme suffers from duplicate keys saved in off-chip
memory, and the number of duplicates depends on k, the number of hash
functions, which controls the lookup precision. Also, the insert and delete
operations take approximately k times as long. Such dependence on a large k is not
suitable for performing fast lookups and key updates in high-speed routers. Other
schemes, such as Peacock and multilevel hashing for packet processing [28, 29],
suffer from setup failures as well.
To address these flaws, such as key duplicates, complicated key updates, and
setup failures, we propose scalable hash schemes that maintain fast packet
processing by using BFs and an FF in a pipeline. The four hash schemes in this
dissertation are 1) a multi-tiered packet classifier (MPC), 2) a multi-predicate
Bloom-filtered HT (MBHT), 3) a hierarchically indexed HT (HIHT), and 4) a
fingerprint-based HT (FPHT). The first two schemes are designed for general packet
processing applications, while the last two are designed for IP lookup because they
provide a memory-efficient perfect match. Overviews of the four schemes follow.
First, a multi-tiered packet classifier (MPC) with n BFs distributes lookups
to achieve higher power and throughput efficiencies than a parallel packet
classifier (PPC) of n parallel BFs [12–14, 17]. A PPC accesses n BFs for one lookup
every cycle, while an MPC accesses n BFs for several lookups every cycle with the
same total BF memory as that of a PPC. To build 2-tiered BFs, as an example of an
MPC, the total PPC memory is split between a pre-stage of small BFs with one
read port and a post-stage of large BFs with k-1 read ports. Then, each small
BF is logically connected to two large BFs, so that a forest of binary trees is
built [30].
Secondly, a multi-predicate Bloom-filtered HT (MBHT) with a set of multi-predicate
BFs (MBFs) generates index addresses to a key table in different base number
systems. The generated indexes allow parallel access to
a key table on-chip with simple switching circuitry, so that for a successful query
at most one off-chip memory access is guaranteed, meeting the bandwidth requirement
of a router. An MBHT benefits both on-chip and off-chip
memory. For the on-chip memory, an MBF in the base-2^x
number system reduces the memory size by x times compared to the base-2^1 number
system with a binary predicate BF, where x is a positive integer larger than 1 [31].
Thirdly, a hierarchically indexed hash table (HIHT) is proposed, which performs
approximate testing on keys’ index paths in trees. Once the BFs of the last
pipeline stage complete the index addresses to entries in a table, a perfect match
is made by comparing the keys saved in the indexed entries with the given key, so
that at most one off-chip access is made to retrieve the rule associated with the
given key [32].
Finally, a fingerprint-based HT (FPHT) generates indexes to a key table with
the help of both memory-efficient BFs, which are approximate membership testers,
and an FF, which is the most memory-efficient set representation. Specifically, in
an FPHT with no pointers, BFs perform key searching in a b-ary prefix tree,
b∈{2, 4, 8, · · · }, and an FF ensures fewer false indexes to a key table in
the worst lookup case [33].
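For readers unfamiliar with fingerprint filters, the following is a minimal generic sketch of the idea: each key is reduced to a short fingerprint stored in a bucket, and a query compares fingerprints instead of full keys. It is not the FPHT design of [33]; the bucket layout, the SHA-256 hash, and all names here are assumptions chosen for a self-contained example.

```python
import hashlib


class FingerprintFilter:
    """Generic fingerprint filter: buckets hold short fingerprints, not keys."""

    def __init__(self, num_buckets: int, fp_bits: int):
        self.num_buckets = num_buckets
        self.fp_bits = fp_bits
        self.fp_mask = (1 << fp_bits) - 1
        self.buckets = [set() for _ in range(num_buckets)]

    def _locate(self, key: bytes):
        # Assumption: one 64-bit hash supplies both the bucket and the fingerprint.
        h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        fingerprint = h & self.fp_mask
        bucket = (h >> self.fp_bits) % self.num_buckets
        return bucket, fingerprint

    def add(self, key: bytes) -> None:
        bucket, fingerprint = self._locate(key)
        self.buckets[bucket].add(fingerprint)

    def might_contain(self, key: bytes) -> bool:
        """False means definitely absent; True means a fingerprint matched."""
        bucket, fingerprint = self._locate(key)
        return fingerprint in self.buckets[bucket]


ff = FingerprintFilter(num_buckets=256, fp_bits=10)
stored = [b"prefix-a", b"prefix-b", b"prefix-c"]
for key in stored:
    ff.add(key)
print(all(ff.might_contain(key) for key in stored))
```

A false positive requires another key to land in the same bucket with the same fingerprint, so a longer fingerprint lowers the false positive rate at the cost of more bits per entry.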
In addition to hash-based approaches in packet processing, TCAMs have
become the de facto industry-standard solution for high-speed IP lookup. More
than 6 million TCAM devices were deployed worldwide in 2004 [34], and TCAMs
are projected to be increasingly used in next-generation network search engines.
Part of the TCAMs’ success is that they can store a
“don’t care” state for a prefix and compare an input key against every TCAM
entry, thereby achieving a single-clock-cycle lookup.
Table I. Lookup & update complexities.
Scheme    Complexity
Trie      O(W)†
Hash      O(1)
TCAM      1
† W: number of IP address bits

Table II. Hardware features of each scheme.
          TCAM    CAM     SRAM
Clock†    266     333     400
Power‡    ≈15     ≈1      ≈0.1
Cell◦     16      8       6
† In MHz
‡ In watts
◦ Number of transistors per bit
Despite TCAMs’ popularity and simplicity, TCAMs have their own limitations
with respect to IP lookup. 1) Throughput: parallel searches over all prefixes are
made in one clock cycle for a single lookup, so the throughput is simply one lookup
per cycle, as shown in Table I. 2) Power: although a TCAM provides a one-cycle
lookup, such a lookup, made by searching all prefixes in parallel, requires up to
150 times more power than an SRAM-based scheme. Table II shows
this power consumption difference as measured by CACTI [35]. Thus, reducing TCAM
power usage is a paramount goal for a deterministic TCAM lookup.
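The factor of 150 follows directly from the approximate power figures in Table II, as the short calculation below shows. The dictionary names are illustrative; the values are the approximate ones listed in the table.

```python
# Approximate figures from Table II.
POWER_WATTS = {"TCAM": 15.0, "CAM": 1.0, "SRAM": 0.1}
CELL_TRANSISTORS = {"TCAM": 16, "CAM": 8, "SRAM": 6}


def ratio(table: dict, a: str, b: str) -> float:
    """How many times larger the value for a is than the value for b."""
    return table[a] / table[b]


tcam_vs_sram_power = ratio(POWER_WATTS, "TCAM", "SRAM")
tcam_vs_cam_power = ratio(POWER_WATTS, "TCAM", "CAM")
tcam_vs_cam_cell = ratio(CELL_TRANSISTORS, "TCAM", "CAM")

print(f"TCAM/SRAM power: {tcam_vs_sram_power:.0f}x")
print(f"TCAM/CAM power:  {tcam_vs_cam_power:.0f}x")
print(f"TCAM/CAM transistors per bit: {tcam_vs_cam_cell:.0f}x")
```

The same table shows that a binary CAM needs about one fifteenth of the power and half the transistors per bit of a TCAM.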
High TCAM throughput has been achieved through a partitioning technique
[36, 37], which, combined with pipelining, relies on a parallel architecture that
performs multiple lookups per clock cycle. Likewise, an SRAM-based parallel scheme
[38] partitions a trie and maps subtries to pipelines by solving an NP-complete
problem for high throughput. However, these approaches suffer from high power
consumption and a complicated mapping algorithm, respectively.
In this dissertation, we propose a hybrid CAM (HCAM) IP lookup architecture for
maintaining high throughput and power efficiency. Our approach adopts prefix
collapse and partitioning schemes with Bloom filters (BFs). Prefix collapse (PC)
reduces the number of prefixes, as opposed to prefix expansion. With prefix
collapse, the collapsed prefixes can be put in a CAM capable of deterministic
lookup, which is more hardware-efficient than a TCAM in power and in the number of
transistors per cell, as shown in Table II. A complete prefix match beyond the
collapsed prefix match is made through a stride tree bitmap (STB) saved in SRAM.
Also, the CAM for same-length collapsed prefixes can be partitioned into CAM
blocks to provide multiple lookups on the collapsed prefixes per clock cycle.
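The idea of prefix collapse can be sketched as follows, under simplifying assumptions. Each prefix, written as a bit string, is truncated to the largest collapsed length that does not exceed its own length, and the bits that were cut off are kept so that the original prefix can still be told apart later. The collapsed lengths and the sample prefixes are hypothetical, and the list of remainders is only a stand-in for the stride tree bitmap (STB) mentioned above.

```python
def collapse(prefix: str, collapsed_lengths):
    """Split a prefix into (collapsed head, remaining tail bits)."""
    targets = [d for d in collapsed_lengths if d <= len(prefix)]
    if not targets:
        raise ValueError("prefix is shorter than every collapsed length")
    d = max(targets)
    return prefix[:d], prefix[d:]


def collapse_table(prefixes, collapsed_lengths):
    """Group prefixes by collapsed head; the tails stand in for STB contents."""
    groups = {}
    for prefix in prefixes:
        head, tail = collapse(prefix, collapsed_lengths)
        groups.setdefault(head, []).append(tail)
    return groups


prefixes = ["10", "100", "101", "1011", "11010", "110101"]
groups = collapse_table(prefixes, collapsed_lengths=[2, 5])
print(len(prefixes), "prefixes collapse to", len(groups), "CAM entries")
print(groups)
```

Here six prefixes collapse to two entries that can be matched exactly, which is why a binary CAM suffices for the collapsed set while the remaining bits are resolved separately.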
This dissertation makes the following contributions.
• An MPC hashing scheme with n BFs is proposed in a multi-tiered configuration
of BFs with the same memory capacity as that of a PPC.
• An MBHT scheme is proposed that uses a contiguous memory space in off-chip
memory without pointers to conduct a perfect match and a fast search.
• An HIHT scheme for fast and memory-efficient packet processing is introduced.
It provides a per-key information lookup that serves as an index to a key table in
on-chip memory without pointer operations.
• An FPHT scheme provides indexes to a key table using BFs and an FF without
incurring pointer operations.
• For each of the above schemes, new algorithms for insert, query, and delete
operations are proposed, and they are as easy to implement as those of a BF or
an LHT.
• In an MPC evaluation, the proposed MPC scheme is shown to achieve 4.2 times the
power efficiency and 2 times the throughput efficiency of a PPC.
• In comparison for scalability and speed, memory efficiency analyses are made for
an MBHT, an HIHT, and an FPHT, and a multi-fold memory efficiency is achieved over
other contemporary schemes.
• In the IP lookup application of the proposed HIHT and FPHT schemes, memory and
power comparisons with TCAM-based and trie-based IP lookups are made. The proposed
hash-based IP lookup schemes show at least 51 times power efficiency and 1.8 times
memory efficiency compared to TCAM- and trie-based schemes, respectively.
• In addition, the proposed HCAM-based IP lookup scheme achieves the same
throughput as contemporary schemes while using 2.8 times less memory and
3.6 times less power.
The rest of the dissertation is organized as follows. Sec. II presents several
hash-based schemes for packet processing, such as an FHT, a BFHT, and Peacock
hashing. Sec. III discusses the basics of a BF and an FF in terms of their memory
size and power consumption. This section also shows two applications of an HT:
IP lookup and packet classification. Then, a detailed MPC build with n BFs for
packet classification is shown in Sec. IV. In Sec. V, the details of
an MBHT for a perfect match are discussed. In Sec. VI, the details of an HIHT for
a perfect match are explained. A detailed FPHT build in a binary prefix tree for a
perfect match is illustrated in Sec. VII. As the last scheme, an HCAM is proposed
for high throughput and power saving in Sec. IX. Each of Secs. IV, V, VI, VII,
and IX analyzes memory, power, or throughput efficiency in comparison to
other contemporary schemes. A conclusion and future work are presented in
Sec. X.
CHAPTER II
RELATED WORKS IN PACKET PROCESSING
Packet processing has different objectives in each networking layer. For instance, in
layer 2 a router needs to forward a packet to a corresponding port in a limited time
with a large-scale routing table. In layer 3 a packet is classified into a flow for various
purposes like firewalling or quality of service. In this chapter, major related research
works on packet processing, such as IP lookup and packet classification, are enumerated.
A. IP Lookup
Song et al. [15] claimed that for a perfect match an FHT with help of a BF improves
the performance over an LHT by reducing the number of oﬀ-chip memory accesses
needed for the most time-consuming lookups. This beneﬁt is possible by combining
hashed linked lists with k hash functions so that only the shortest linked list is used in
the search. Although chaining in a linked list for resolving a collision is one solution,
accessing a key in a linked list costs as many memory accesses as the number of
keys in the linked list because of pointer operations. Beyond this generic limitation of
a linked list implementation, overlapping k linked lists in an FHT suffers from several
other problems. First, due to merging k linked lists, duplicate keys may be saved in
off-chip memory, depending on k. Because k is inversely proportional to the collision
rate, the very low collision rate needed by a high-speed router produces a number of
copies of a key, proportional to k, in off-chip memory. Although searching for a key
is expedited by choosing the shortest linked list, the insert and delete operations each
take at least k memory accesses due to the k shared linked lists. These operations are
not suitable for a dynamically changing set because any change in the set needs 2k
off-chip memory accesses. Besides
time complexities of insert and delete operations, to obtain better performance over
an LHT in terms of reduced collisions, an FHT needs a plethora of buckets used as
pointers to oﬀ-chip memory and it holds a large wasted portion of buckets in on-chip
memory. Also, perfect match is made in oﬀ-chip memory, so that every query needs
at least one access to oﬀ-chip memory. Furthermore, due to the inherent drawback of
a BF, the delete operation was designed by introducing a 4-bit counter in each bucket
[39]. Yet, they did not consider the memory size of the counters, but just the number
of buckets.
There is a fundamental limitation in an HT using a linked list: a sequential access
to a key along the linked list. For example, to access key e located at the end of a
linked list of t keys, t sequential accesses are necessary in t cycles, because memory
address of key e is known after a previous key e′ with a pointer to the next key
e is obtained in the previous cycle. However, accessing a few entries with known
indexes in a table can be processed in one cycle with a simple switching circuitry.
To provide collision-free lookup with a key table, a BFHT [16] utilizes a Bloomier
filter [27] capable of per-key information lookup. The per-key information given by a
Bloomier filter is used as an index into a key table for a given packet, so that a BFHT
performs a perfect match to make a deterministic IP lookup with the key table. Although
a BFHT contributes prefix collapsing as well, it also inherits two disadvantages of a
Bloomier filter. First, there can be a setup failure in saving the per-key information
of n keys in a BFHT, so that another lookup mechanism is used for the keys that failed
in the setup. Thus the number of hash functions is increased to reduce the setup failure
rate, leading to a larger memory need. Second, the setup complexity for n keys is
O(n log n), implying that an update with a new key is worked out on a copy of the BFHT
in the background so that lookups of other keys continue seamlessly.
B. Packet Classiﬁcation
The packet classiﬁcation goal is to identify a ﬂow characterized with a 5-tuple of
source IP (SIP), destination IP (DIP), protocol, source port (SP), destination port
(DP) and to forward the ﬂow to a corresponding output port. Several types of packet
classiﬁers like TCAM-based and SRAM-based ones are suggested [6, 20, 40–42]. In a
hash-based approach, a packet classiﬁer in [14] uses BFs in parallel, so that in a given
packet lookup all BFs need to be checked to ﬁnd the packet-associated ﬂow and the
packet is forwarded to a corresponding port where a BF returns ’yes’. However, in a
high-speed lookup to a BF, the number of memory read ports in the BF is considerably
large. Also, the number of BFs to be probed is as large as the number of a high-speed
router's ports. Unlike the above schemes with Θ(n) BF access complexity among
n BFs, our MPC probabilistically needs fewer than Θ(n) BF accesses for a lookup.
C. Other Packet Processing Applications
Besides BF applications for packet processing in the previous section, applications in
other domains have utilized the benefit of BFs, such as a dynamic BF for data
management [17], wide-area web caching [39], content delivery across overlay networks
[43], IP traceback [11], and query routing in peer-to-peer networks [44]. Even in
wireless sensor networks where power saving is a paramount issue, a coordinated packet
traceback mechanism in [45] is introduced with the concept of dimensions in hash
algorithms in which a dimension can expand by the number of either hash functions,
hash tables, or both.
A legacy BF does not support a delete operation because a bit location in a
bit vector indexed by hash functions can be shared by more than one key. To
avoid this problem, Fan et al. [39] introduced the idea of a counting BF in which
each entry in the BF is not a single bit, but rather a small counter of a few bits.
Bonomi et al. [21] introduced Approximate Concurrent State Machines (ACSM).
While similar in spirit to BFs, the scheme is based on a combination of hashing and
fingerprints, using d-left hashing to obtain a near-perfect hash function in a dynamic
setting. Although its data structure takes much less space than a comparable
counting BF, the fundamental problem in their approach is that on an f-positive
there is no way to verify a result given by an ACSM. In contrast, our
three schemes (MBHT, HIHT, FPHT) provide a perfect match mechanism without a
pointer. Cohen and Matias [46] introduced the Spectral Bloom Filter (SBF), an extension
of the original BF to multi-sets, allowing the filtering of elements whose multiplicities
are below a threshold given at query time. However, an SBF does not support a mapping
between a key and arbitrary per-key information.
Unlike previous BF approaches for approximate membership testing, the Bloomier
filter in [27] is the first to provide storage and retrieval of arbitrary per-key
information. It guarantees perfect hashing for a constant-time lookup in the worst case.
However, a disadvantage is that it supports only static membership. Also, there is a
probability of setup failure in encoding all keys, depending on k, the number of hash
functions.
In an application of overlay networks, continuous reconﬁguration of virtual topol-
ogy by overlay management strives to establish paths with the most desirable end-
to-end characteristics. The approximate reconciliation tree for overlay networks by
Byers et al. [43] uses BFs on top of a tree structure to minimize the amount of data
transmitted for veriﬁcation.
D. Parallel IP Lookup Using TCAM or SRAM
Except for a parallel SRAM scheme in [38], most parallel IP lookup engines for high
throughput are TCAM-based due to the benefit of employing parallel searches on TCAM
prefixes [36, 37]. They partition the full routing table into several TCAM blocks and
make parallel lookups on different blocks. This parallelism yields power efficiency
and throughput improvement.
Cool TCAM (CTCAM) was proposed in two separate schemes: bit-selection and
trie-based schemes [47]. In the former, selected bits are used to index diﬀerent TCAM
blocks directly. The latter scheme splits the trie by carving subtries out of the full
trie. However, the preﬁx distribution imbalance among the TCAM blocks can be
noticeably high, resulting in low worst case performance.
Ultra TCAM (UTCAM) in [36] increases the throughput 4.0 times with a 25%
TCAM entry redundancy. It uses distributed and parallel TCAM blocks aided by
having an index logic to choose the destination TCAM block for a given packet.
Likewise, Selective TCAM (STCAM) in [37] uses multiple TCAM-block selectors
with prefix TCAM caches. Collisions among TCAM-block selections require an
STCAM to resolve TCAM block contentions with arbiters, and these arbiters
block the reception of a new lookup request. Thus, the STCAM throughput gain was
reported to be at most 1.5 times even with multiple TCAM blocks without caches.
Unlike TCAM partitioning, the beyond TCAM (BTCAM) scheme in [38, 48] was
introduced for trie partitioning using SRAMs, where subtries are mapped to SRAM
blocks with consideration of memory balance. However, such a mapping is proved to
be NP-complete, so that remapping for a prefix update during lookup operation is not
feasible. Furthermore, leaf-pushing increases the number of trie nodes, resulting
in memory overhead.
Trie- and hash-based schemes shown in the above subsections lack high
lookup performance. In contrast, a TCAM's lookup complexity is O(1) and a TCAM
has been considered a natural choice for multiple lookups due to its parallel searches
through partitioning [36, 37]. The same characteristic is preserved in a CAM except
the preﬁx match. After discussing an issue in preﬁx match by preﬁx collapse or
expansion in Sec. C, a hybrid CAM (HCAM) scheme using CAM blocks is presented
for high throughput in Sec. D.
CHAPTER III
BASICS ON HASH FOR PACKET PROCESSING
This chapter introduces the basics of a BF and an FF as well as their applications to
packet processing, namely IP lookup and packet classification.
A. Basic Bloom Filter Theory
To understand the fundamental relationship among the number of buckets, m; the
number of items, n; and the number of hash functions, k, the mathematics about a
BF and a false positive, or f -positive are presented.
A legacy BF for representing set $S=\{e_0, e_1, \ldots, e_{n-1}\}$ of n elements is
described by an array of m bits with each initially set to 0. A BF uses set H of k
independent hash functions $h_0, h_1, \ldots, h_{k-1}$ with range $[0:m-1]$, implying
that in hardware implementation a memory module for a BF needs k ports for memory read.
For mathematical convenience, a natural assumption is made that these hash functions map
each item in the universe to a random number uniform over the range. For each element
$e_{j'} \in S$, the bits indexed by $h_{k'}(e_{j'})$ are set to 1 for $0 \le k' \le k-1$,
$0 \le j' \le n-1$. To verify that item e′ is in S, it is checked whether the k bits in
a BF indicated by $h_{k'}(e')$ are all 1. If not, then clearly e′ is not a member of S.
Even if all bits indexed by $h_{k'}(e')$ have a value 1, there is a probability, called
an f-positive, that item e′ is falsely believed to belong to set S due to the random
gathering of k bits of value 1 set by independent items.
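The insert and membership test described above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware design discussed in this dissertation: the class name `BloomFilter`, the sizes `m=1024` and `k=4`, and the use of salted SHA-256 to emulate k independent hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Legacy Bloom filter: an m-bit array probed by k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # every bit is initially 0

    def _indexes(self, item: str):
        # Assumption: k independent hash functions are emulated by salting
        # SHA-256 with the hash-function number k'.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        # Set the k bits indexed by h_0(item) ... h_{k-1}(item).
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def query(self, item: str) -> bool:
        # False: item is definitely not in S. True: item is in S or an f-positive.
        return all(self.bits[idx] for idx in self._indexes(item))


keys = ("10.0.0.1", "10.0.0.2", "192.168.1.7")  # hypothetical keys
bf = BloomFilter(m=1024, k=4)
for key in keys:
    bf.insert(key)
```

A query for any inserted key always returns `True`, while a query for another key returns `True` only when all k probed bits happen to have been set by other keys.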
The probability f of an f-positive can be formulated in a straightforward way,
given our assumption that hash functions are perfectly random. Among m bits, the
chance that a specific bit is set to 1 by one hash function $h_{k'}$ is $1/m$. After
all n elements of S are hashed k times into the BF, i.e. $k \cdot n$ times in total,
the probability that a specific bit is still 0 is asymptotically
$p=(1-1/m)^{kn} \approx e^{-kn/m}$. Then, the probability of an f-positive by
randomly choosing k bits among m bits is
\[ f \ge \{1-(1-1/m)^{kn}\}^{k} \approx (1-p)^{k} \ge (1/2)^{(m/n)\ln 2} \tag{3.1} \]
because each of the k probed bits, which is 0 with probability p, may independently
have been set to 1 by the time a membership test is requested. This probability is
bounded, and the optimal k, the number of hash functions, that minimizes f is easily
found to be $k=\ln 2 \cdot (m/n)$ according to the results of Broder and Mitzenmacher
[49]. After some algebraic manipulation, Broder and Mitzenmacher [49] claimed that the
requirement of $f \le \epsilon = 2^{-w}$ suggests
\[ m \ge n\,\frac{\log_2(1/\epsilon)}{\ln 2} \approx 1.44\,n\log_2(1/\epsilon) = 1.44\,nw, \tag{3.2} \]
where w is a precision in query operation. Furthermore, in an optimal configuration,
k becomes w according to the following derivation:
\[ k = \ln 2 \cdot \frac{m}{n_i} = \ln 2 \cdot \left( n_i\,\frac{\log_2(1/f)}{\ln 2} \right) / n_i = w. \tag{3.3} \]
Also, k needs to be at least 29 ($\approx -\log_2(1/500\mathrm{M})$) to be a scheme of a
deterministic O(1) lookup processing 500M packets a second for a 160Gbps router.
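The sizing rules of Eqs. (3.2) and (3.3) can be checked numerically with a short Python sketch. The function names and the key count `n = 1000` are hypothetical and chosen only for illustration; the 500M packets per second figure is the one used above.

```python
import math


def bf_size_bits(n: int, eps: float) -> float:
    """Lower bound on m from Eq. (3.2): m >= n * log2(1/eps) / ln 2."""
    return n * math.log2(1.0 / eps) / math.log(2)


def optimal_k(m: float, n: int) -> float:
    """Optimal number of hash functions: k = ln 2 * (m / n)."""
    return math.log(2) * m / n


def precision_for_rate(lookups_per_second: float) -> int:
    """Smallest integer w such that 2**-w <= 1 / lookups_per_second."""
    return math.ceil(math.log2(lookups_per_second))


n = 1000                        # hypothetical number of keys in one BF
w = precision_for_rate(500e6)   # 500M packets per second
m = bf_size_bits(n, 2.0 ** -w)  # required number of bits
k = optimal_k(m, n)             # equals w in the optimal configuration
```

With these inputs, `w` evaluates to 29, `m / (n * w)` is about 1.44, and `k` equals `w` up to floating-point rounding, which matches Eq. (3.3).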
Two important lemmas can be derived from Eq. (3.2), described as follows.
Lemma 1 (Linear Property) A linear property between m and n exists in Eq. (3.2)
because, for a given f, variable m is linearly proportional to variable n. Therefore,
if n is reduced by half or decreased by constant α, the desired m for a given f is
reduced by half or decreased by the constant $\alpha \cdot 1.44 \log_2(1/\epsilon)$,
respectively.
Proof: Suppose function $F_m(n, f)$ of Eq. (3.2) has domain variables n and f. Once
f is set to a constant ε as a requirement, this function becomes a polynomial of variable
n, $F'_m(n)=a \cdot n$, with degree one, where $a=1.44 \cdot \log_2(1/\epsilon)$. Therefore,
\[ F'_m(n/2) = a \cdot (n/2) = an \cdot 1/2 = F'_m(n)/2 \quad \text{and} \]
\[ F'_m(n-\alpha) = a \cdot (n-\alpha) = an - a\alpha = F'_m(n) - a\alpha, \]
proving Linear Property.
Lemma 2 (Reverse Exponential Property) The change of m has an exponential
effect on f for a given n from Eq. (3.2). That is, if m is increased by constant α
or multiplied x times, f is divided by $2^{\alpha/c}$ or raised to the power x,
respectively, where x>1 and constant c=1.44n.
Proof: Suppose function $F_f(n,m)$ is derived from Eq. (3.2) and rearranged as
$2^{-m/c}$, where $c=1.44 \cdot n$. Once n is set to a constant, this function becomes
an exponential function of m, $F'_f(m)$. Therefore,
\[ F'_f(m+\alpha) = 2^{-(m+\alpha)/c} = 2^{-m/c-\alpha/c} = F'_f(m)/2^{\alpha/c} \quad \text{and} \]
\[ F'_f(xm) = 2^{-xm/c} = (2^{-m/c})^{x} = F'_f(m)^{x}, \]
proving Reverse Exponential Property. These Linear and Reverse Exponential
Properties are used in introducing an MBF, so that an MBHT has the benefit of
memory saving in on-chip memory by the Linear Property, and thereafter the
saved memory is used to decrease f exponentially by the Reverse Exponential
Property.
We have linked the theoretical relationships between k, m, and n for the required
f-positive, ε, in a query. If a BF is to be used for IP lookup despite producing an
approximate query result, a lookup precision w should be at least 29
($\approx -\log_2(1/500\mathrm{M})$) for 160Gbps routers because a collision in 500M
lookups in a second is not tolerable in satisfying the bandwidth requirement. Also,
Eq. (3.3) suggests that a BF memory implementation for 160Gbps routers needs 29 read
ports for the same number of hash functions, but this is not feasible in terms of cost
and power. To lessen these overheads, we adopt a segmented BF (SBF) [14] with memory
banking. Using this scheme with commodity memory is more practical since IDT currently
produces high-speed bank-switchable memory organized into a 64-bank memory array.
In an SBF, an m-bit vector is divided into k subvectors of $m'(=m/k)$ bits, each
put in an independent memory bank. The k hash functions with the range $[0:m'-1]$ are
assigned to their corresponding subvectors, and a one-clock query in an SBF is based
on the k indexed values in the k subvectors (or banks) together, as in a legacy BF.
An SBF's memory banking scheme removes the multiport overhead, while the SBF's false
positive probability, f′, stays approximately the same as a legacy BF's f as follows:
\[ f = \left(1-(1-1/m)^{kn}\right)^{k} = \left(1-\left(1-1/(km')\right)^{kn}\right)^{k}
   = \left(1-\left(1-k/(km')+o(1)\right)^{n}\right)^{k} \approx \left(1-(1-1/m')^{n}\right)^{k} = f' \tag{3.4} \]
where the small-o term is negligible at a large m′ value.
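The banking idea and the approximation in Eq. (3.4) can be illustrated with the following Python sketch. It is a software model under stated assumptions: the class name `SegmentedBloomFilter`, the salted SHA-256 hashes, and the parameters `m=11520`, `n=1000`, `k=8` are illustrative choices, not values taken from [14].

```python
import hashlib


class SegmentedBloomFilter:
    """SBF model: k banks of m' = m/k bits, one hash function per bank."""

    def __init__(self, m: int, k: int) -> None:
        if m % k != 0:
            raise ValueError("m must be a multiple of k")
        self.k = k
        self.sub_bits = m // k                         # m' bits per bank
        self.banks = [[0] * self.sub_bits for _ in range(k)]

    def _index(self, bank: int, item: str) -> int:
        # Assumption: the bank number salts SHA-256 to emulate the bank's hash.
        digest = hashlib.sha256(f"{bank}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.sub_bits

    def insert(self, item: str) -> None:
        for b in range(self.k):
            self.banks[b][self._index(b, item)] = 1

    def query(self, item: str) -> bool:
        # Each bank is read once, so every bank needs a single read port.
        return all(self.banks[b][self._index(b, item)] for b in range(self.k))


def f_legacy(m: int, n: int, k: int) -> float:
    """f of a legacy BF, first expression in Eq. (3.4)."""
    return (1 - (1 - 1 / m) ** (k * n)) ** k


def f_segmented(m: int, n: int, k: int) -> float:
    """f' of an SBF with m' = m / k, last expression in Eq. (3.4)."""
    return (1 - (1 - k / m) ** n) ** k


sbf = SegmentedBloomFilter(m=11520, k=8)
sbf.insert("flow-a")
ratio = f_segmented(11520, 1000, 8) / f_legacy(11520, 1000, 8)
```

For these parameters `ratio` stays within about one percent of 1, consistent with the claim that banking leaves the f-positive probability essentially unchanged.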
B. A Memory- and Power-Eﬃcient Fingerprint Filter
Authors in [49, 50] claim that an FF is the most memory-eﬃcient set representation
scheme. In this section, beyond the theoretical FF beneﬁt, it will be claimed that an
FF is the power-eﬃcient data structure in hardware implementation as well.
One method to determine the efficiency of a set representation scheme is to
consider how many bits, m, are necessary for a set of n keys from a universe. An
efficient scheme must not allow any false negative but can allow an f-positive
for at most a fraction ε of the universe. As claimed in [49], the following inequality
for m holds for a given required ε:
\[ m \ge n\log_2(1/\epsilon) = nw = w + \cdots + w = \sum_{i=1}^{n} w. \tag{3.5} \]
Thus, an FF can be regarded as an array of n fingerprints (FPs) of $\log_2(1/\epsilon)$
bits for the approximate set representation. Since an FF does not have a constant-time
indexing mechanism like the hash functions in a BF, finding the index of a key's FP
requires a separate, more complicated search. However, once the index of the key's FP
is known, the same index can be used in a key table for a perfect match. Also, an FF
needs 1.44 times less memory than a BF for the required ε, based on Eq. (3.2) and
Eq. (3.5).
In addition to the theoretical benefit, in terms of memory architecture an
FF SRAM requires a simpler memory read port design than an SBF SRAM does, so
that area and power benefits are gained. Suppose there are two SRAMs for an SBF and
an FF and the required false positive ε is $2^{-29}$ for 160Gbps.
The SBF requires 29 read ports to query a key as a result of its simultaneous accesses
while the FF needs only one read port to an FP of 29 bits. That is, an SBF SRAM
with n keys is designed as a w×1.44n×1-bit memory array with w read ports of 1-bit
output width while an FF SRAM is made of a 1×n×w-bit memory array with one
read port of w-bit output width.
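The two memory organizations can be compared with a small Python sketch. The function names are hypothetical, and `n = 1000` with `w = 8` is an assumed example; it is chosen because it yields an 11520-bit SBF and an 8000-bit FF, the sizes compared in Fig. 1.

```python
def sbf_layout(n: int, w: int) -> dict:
    """SBF SRAM: w banks of 1.44n one-bit entries, one 1-bit read port per bank."""
    bits_per_bank = round(1.44 * n)
    return {
        "banks": w,
        "read_ports": w,
        "port_width_bits": 1,
        "total_bits": w * bits_per_bank,
    }


def ff_layout(n: int, w: int) -> dict:
    """FF SRAM: one array of n fingerprints, read through one w-bit port."""
    return {
        "banks": 1,
        "read_ports": 1,
        "port_width_bits": w,
        "total_bits": n * w,
    }


sbf = sbf_layout(n=1000, w=8)   # assumed example values
ff = ff_layout(n=1000, w=8)
memory_ratio = sbf["total_bits"] / ff["total_bits"]
```

The ratio `memory_ratio` is 1.44 regardless of `n` and `w`, and the SBF needs `w` read ports where the FF needs one.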
[Plots: (a) total read power (W) and (b) total area (mm2) for cost, each versus the # of ports in a BF or the output bit-width in an FF (2 to 10), for a BF and an FF.]
Fig. 1. Comparison of power and area for a BF and an FF through CACTI.
Fig. 1 shows the power and area comparisons for an 11520-bit SBF and an
8000-bit FF in different numbers of ports or output bit widths. 0.09μm process
technology in CACTI 4.2 [35] is used. In Fig. 1 (a) and (b), the gaps between the SBF
and the FF become significant as w gets larger. For example, an FF memory consumes
9.5 times less power than an SBF while needing 76 times less area. Thus, for a
perfect match query, utilizing few-port SBFs in a binary search for a key's fingerprint
in an FF and accessing a key table through the indexed fingerprint is a memory- and
power-efficient hash scheme, and this scheme is introduced in the following section.
C. IP Lookup Using Hashing
A hash function maps a value in its domain to a specific value in its range uniformly.
Thus, hash-based schemes, like a BFHT and an FHT, do not by themselves address the
issue of supporting wildcard bits in prefixes. In this section, we present two kinds of
schemes to support prefix match in a hash-based IP lookup: Controlled Prefix Expansion
(CPE) and Prefix Collapse (PC).
1. Controlled Preﬁx Expansion
[Diagram: the original prefixes P1: 1001 1*, P2: 1010 11*, and P3: 1001 101 are expanded in stride 3. New database of expanded prefixes: 1001100 (P1), 1001101 (P3), 1001110 (P1), 1001111 (P1), 1010110 (P2), 1010111 (P2).]
Fig. 2. Preﬁx conversion of a CPE with 3 preﬁxes in stride 3.
A CPE in [7, 16] transforms a set of prefixes by combining prefix expansion
and prefix capture, reducing any set of arbitrary-length prefixes to an expanded
set of prefixes with an optimized sequence of lengths. With dynamic programming, it was
applied to tries where the worst-case IP lookup time is O(W), where W is the length
of an IP address. For a hash-based scheme, CPE was used in [16] to support wildcards
in prefixes.
Fig. 2 shows a CPE mechanism with a prefix database of 3 prefixes as a running
example. In expanding the bits and wildcards of the 3 prefixes, prefix 1001101 overlaps
one expansion of prefix 10011*, so that the total number of expanded prefixes is 6.
The 6 expanded prefixes in a new database are keyed to a hash function in hash-based
IP lookup schemes. Although a CPE removes wildcards in prefixes for the hashing
mechanism, the number of expanded prefixes, along with the same number of next hops,
becomes 2 times larger than the original prefix set in this example. In general, the
expansion is multi-fold. In simulations on BGP tables, AS65000 and AS6447 [51, 52],
we found that the number of expanded prefixes increases as the stride size gets larger
and that the number is about 5 times larger in stride 5. The reason is that a given
prefix can be expanded to up to $2^l$ prefixes in stride l if there is no overlap with
another prefix, unlike prefixes 10011* and 1001101.
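The expansion step of the running example can be sketched in Python as follows. This is a simplified model that expands every prefix to one target length and lets the longer original prefix win on overlap; it is not the dynamic-programming CPE of [7], and the function name `expand_prefixes` is hypothetical.

```python
def expand_prefixes(prefixes: dict, target_len: int) -> dict:
    """Expand wildcard-terminated prefixes to fixed-length keys.

    `prefixes` maps a bit string (the prefix without its trailing '*')
    to a next hop. Longer prefixes override shorter ones on overlap.
    """
    expanded = {}
    # Shorter prefixes first, so more specific ones overwrite their entries.
    for prefix in sorted(prefixes, key=len):
        free = target_len - len(prefix)
        if free < 0:
            raise ValueError("prefix longer than target length")
        for i in range(2 ** free):
            suffix = format(i, f"0{free}b") if free else ""
            expanded[prefix + suffix] = prefixes[prefix]
    return expanded


# The three prefixes of Fig. 2: 4 leading bits plus one stride of 3 bits.
table = expand_prefixes({"10011": "P1", "101011": "P2", "1001101": "P3"}, 7)
```

The resulting `table` holds 6 fixed-length keys, and key `1001101` maps to P3 rather than P1 because the longer prefix takes precedence.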
2. Preﬁx Collapse
[Diagram: the 3 prefixes of Fig. 2 are expanded in stride 3 and collapsed into prefixes 1001 and 1010. Each collapsed prefix has an 8-bit bit vector and a pointer; counting the 1's in the bit vector gives the index of a next hop (P1, P3, P1, P1 for 1001 and P2, P2 for 1010).]
Fig. 3. Preﬁx conversion of PC with the same 3 preﬁxes of Fig. 2.
Unlike a CPE, which inflates the number of expanded prefixes and the next-hop
information, a PC converts a prefix of length x into a single prefix of shorter length
x-l by replacing its l least significant bits with a wildcard [16]. The truncated prefix
of length x-l is collapsed with others of the same x-l bits, so that the number of
collapsed prefixes is reduced. Fig. 3 shows the prefix collapse mechanism with the
same set of prefixes as in Fig. 2 for a CPE. Although the first conversion expands
the wildcards of the prefixes in stride 3 like a CPE, the second conversion adopts a bit
vector indicating the relative index to a next-hop table. In addition, after expansion
of the wildcard, the first and the third prefixes are the same among the 3 truncated
prefixes of length 4. Thus, the final collapsed prefixes are prefixes 1001 and 1010
with bit vectors (00001111) and (00000011). Compared to the example in Fig. 2, the
number of collapsed prefixes is reduced, while the number of next hops stays the same
as that for a CPE, which is still 2 times that of the original set.
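A minimal Python sketch of the collapse step and of the bit-vector lookup is given below. It assumes that every prefix is at least `base_len` bits long and fits in one stride; the function names and the list-based bit vector are illustrative choices, not the data layout of [16].

```python
def collapse_prefixes(prefixes: dict, base_len: int, stride: int) -> dict:
    """Return {collapsed prefix: (bit vector, packed next hops)}."""
    full_len = base_len + stride
    # First conversion: expand wildcards within the stride (longest prefix wins).
    expanded = {}
    for prefix in sorted(prefixes, key=len):
        free = full_len - len(prefix)
        for i in range(2 ** free):
            suffix = format(i, f"0{free}b") if free else ""
            expanded[prefix + suffix] = prefixes[prefix]
    # Second conversion: truncate to base_len bits, mark the stride offset in
    # a bit vector, and pack next hops in increasing offset order.
    collapsed = {}
    for key in sorted(expanded):
        base, offset = key[:base_len], int(key[base_len:], 2)
        bits, hops = collapsed.setdefault(base, ([0] * 2 ** stride, []))
        bits[offset] = 1
        hops.append(expanded[key])
    return collapsed


def lookup(collapsed: dict, address: str, base_len: int, stride: int):
    """Find the next hop by counting 1's in the bit vector up to the offset."""
    entry = collapsed.get(address[:base_len])
    if entry is None:
        return None
    bits, hops = entry
    offset = int(address[base_len:base_len + stride], 2)
    if not bits[offset]:
        return None
    return hops[sum(bits[:offset + 1]) - 1]


pc = collapse_prefixes({"10011": "P1", "101011": "P2", "1001101": "P3"}, 4, 3)
vec_1001 = "".join(map(str, pc["1001"][0]))
vec_1010 = "".join(map(str, pc["1010"][0]))
```

Only two keys, 1001 and 1010, remain to be hashed, while the next-hop table still holds 6 entries as described above.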
3. IPv6 IP Lookup
The addressing architecture for IPv6 is detailed in RFC 3513. In terms of the number
of prefix lengths in forwarding tables, the important address type is the global unicast
address, which may be aggregated. RFC 3513 states that IPv6 unicast addresses may
be aggregated with arbitrary prefix lengths like IPv4 addresses under classless
inter-domain routing. While this provides extensive flexibility, it is not foreseen that
this flexibility necessarily results in an explosion of unique prefix lengths. The global
unicast address format has three fields: a global routing prefix, a subnet ID, and an
interface ID. All global unicast addresses, other than those that begin with 000, must
have a 64-bit interface ID in the Modified EUI-64 format. These identifiers may be of
global or local scope; however, we are only interested in the structure they impose on
routing databases. In such cases, the global routing prefix and subnet ID fields must
consume a total of 64 bits. If these policies are followed, it could be anticipated that
IPv6 routing tables will not differ significantly from the current IPv4
tables except in the prefix length distribution. Thus, hash-based IP lookup schemes can
play a major role in saving memory and power for IPv6 as for IPv4, compared to
TCAM- and trie-based IP lookup schemes.
D. Packet Classiﬁcation Using Hashing
The issue of how to reduce the number of used BFs in processing a packet with n
BFs is a paramount power concern in any packet processing [12, 14, 53] as well as
in network applications including wireless sensor networks [45]. However, in this
section we formalize and restrict the issue to the packet classification domain.
[Diagram: a 5-tuple (SIP ..... DP) extracted from one entering packet is looked up in n on-chip BFs B1, B2, ..., Bi, ..., Bn, where Bi is the i-th BF; an off-chip hash table confirms the results before packets leave.]
Fig. 4. Parallel packet classiﬁer engine of n BFs in a given packet.
A parallel lookup with n BFs is a common configuration in packet classification
[14] as shown in Fig. 4 where a 5-tuple of SIP, DIP, protocol, SP, and DP is extracted
from a packet and a lookup of the 5-tuple is made among n BFs. Fast on-chip
packet processing with n BFs is beneficial because it reduces the number of off-chip
hash probes [12, 22]. Due to f-positives from the BFs, all positives are required to
be confirmed by a hash table of recorded flows. Due to QoS and security concerns,
providing a perfect match is necessary in packet classification. Thus, the BFs contend
for access to the hash table. BFs can be fabricated on-chip due to their memory
efficiency while the hash table is located off-chip due to its large size as in other
schemes [12, 13, 20]. Thus, the packet lookup throughput is bounded by the processing
time in the off-chip hash table.
The worst case throughput can be calculated in the following way: given a lookup
of a minimum 40-byte packet, there are two kinds of lookups, an unsuccessful lookup
(UL) in which a key is searched for in all BFs although it does not exist in any of
them, and a successful but time-consuming lookup (SL) in which a key exists in the BFs.
Let ts and tu denote processing times in an off-chip hash table (HT) for an SL and a
UL, respectively. Then, the packet lookup throughput in n BFs is calculated as
\[ T = \frac{40 \cdot 8}{p_s\{t_s + t_u \cdot (n-1)f\} + (1-p_s)\{t_u \cdot nf\}}\ \text{bits/sec.}, \tag{3.6} \]
where ps is an SL rate and the (n-1)f and nf terms are the expected numbers of
f-positives, based on the binomial distribution of identical and independent
BFs, in an SL and a UL, respectively.
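Eq. (3.6) can be evaluated directly, as in the following Python sketch. The function name is hypothetical, and setting $f=2^{-k}$ is an assumption that follows from k=w in Eq. (3.3); the timing values correspond to 1.001 and 0.5 times 2ns, and n=100 is an arbitrary example.

```python
def lookup_throughput(n: int, f: float, p_s: float,
                      t_s: float, t_u: float, packet_bytes: int = 40) -> float:
    """Packet lookup throughput in bits/sec. following Eq. (3.6)."""
    sl_time = t_s + t_u * (n - 1) * f   # SL: one true match plus f-positives
    ul_time = t_u * n * f               # UL: only f-positives reach the HT
    return packet_bytes * 8 / (p_s * sl_time + (1 - p_s) * ul_time)


T_NS = 2e-9  # 2ns base access time
# Assumption: f = 2**-k, i.e. each BF runs at its optimal configuration.
t10 = lookup_throughput(n=100, f=2.0 ** -10, p_s=1.0,
                        t_s=1.001 * T_NS, t_u=0.5 * T_NS)
t15 = lookup_throughput(n=100, f=2.0 ** -15, p_s=1.0,
                        t_s=1.001 * T_NS, t_u=0.5 * T_NS)
```

Under these assumptions a larger k lowers f and therefore raises the throughput, which is the trade-off between read ports and bandwidth discussed in this section.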
Based on Eq. (3.6), Fig. 5 shows the throughput where HT’s processing time in
an SL, ts, is 1.001 times of 2ns in a modern T-RAM [54] and tu is set to 0.5 times
of 2ns. In the worst case of ps=1, the lookup throughput with BFs of k=10 read
ports can barely keep up with 160Gbps while BFs of k=15 read ports can meet the
bandwidth. Thus, a large number of read ports in a BF memory are required for a
high throughput, and avoiding irrelevant BFs of such a large number of ports for a
lookup is preferable. In the following section, such an avoidance is made by a PPC
which distributes lookups through small-sized BFs of a few ports, so that a subset
of the lookups are processed in large-sized BFs in one clock cycle for a higher power
and throughput efficiencies.
[Plot: throughput in bits/sec. (x 10^11) versus the number of BFs (50 to 150) for k=10 and k=15 with ps=1 and ps=0.5, with a 160Gbps reference line.]
Fig. 5. Throughput comparison in a different number of BFs, ps, and k.
CHAPTER IV
A MULTI-TIERED PACKET CLASSIFIER WITH N BFS
This chapter introduces how to build an MPC and implement insert, query, and
delete operations in an MPC for better lookup performance.
[Plots: power (W) and area (mm2), each versus the # of read ports (2 to 8).]
Fig. 6. Power and area in multi memory read ports for 64K×1-bit memory.
Each hash function corresponds to one random lookup in an m-bit BF. Thus, a
BF having k hash functions for high throughput needs the same number, k, of memory
read ports in an m-bit memory module. Although state-of-the-art VLSI technology
can fabricate memory with multiple ports, supporting more than 10 ports is
tremendously hard as noted in a concise summary of the recent embedded memory
technologies [55]. Fig. 6 shows such a difficulty in terms of the power and area costs
measured by CACTI [35], according to the number of read ports in a single memory
module. The conclusion from the figure is that the power and area costs are superlinear
in the number of read ports. Thus, a BF is considered a costly computation element
due to the large value of k for the high-speed router, and thereby reconfiguring such
BFs for a power- and throughput-efficient lookup is necessary.
[Diagram: a 2TPC. Buffered packets (buffer size b) are hashed in pipeline stage S1 into the layer-2 BFs $B^2_1$ and $B^2_2$ with 1 read each; where the result == 1, stage S2 probes the layer-1 BFs $B^1_1$ to $B^1_4$ with k-1 hashes and k-1 reads each. A: address port and D: data port of a BF memory.]
Fig. 7. Pipeline memory architecture of a 2TPC in a forest. S1 and S2 are pipeline
stages. $B^i_j$ means the j-th BF at layer i. n=4. k=w due to Eq. (3.3). $w_2=1$,
$w_1=k-1$. b is a buffer size.
A. Building a Multi-tiered Packet Classiﬁer
In this section, we show mathematically that an MPC uses the same memory
size as that of a PPC, while the detailed insertion and query are described in Secs. B
and C. Fig. 7 shows a configuration example of an MPC, a 2-tiered PC (2TPC) on
top of 4 BFs, in place of a PPC used in a dashed box of Fig. 4. Also, Fig. 8 shows
a 3-tiered PC (3TPC) on top of 8 BFs. Given desired f-positive $f=2^{-w}$, the total
PPC memory in bits with n BFs is n·m, where m is a BF's memory based on Eq.
(3.2). However, with the linear property between m and $n_i$ and an additive operation
on memory size $m_t$, we can reconfigure BFs in an (r+1)-tiered way, r>0, while the same
memory size, $m_M$, for an MPC is used as follows:
\[
\begin{aligned}
n \times m &= n \times \{1.44 \cdot n_i \cdot \log_2(1/f)\} \\
&= n \times \{1.44 \cdot n_i \cdot w\} = n \times \{1.44 \cdot n_i \cdot (w-r+r)\} \\
&= n \cdot 1.44 \cdot n_i \cdot (w-r) + \sum_{t=1}^{r} \{n \cdot 1.44 \cdot n_i \cdot 1\} \\
&= \sum_{i=1}^{n} \left(1.44 \cdot n_i \cdot (w-r)\right) + \sum_{t=1}^{r} \sum_{i=1}^{n/2^t} \left(1.44 \cdot (2^t n_i) \cdot 1\right) \\
&= m_1 + \sum_{t=1}^{r} m_{t+1} = m_M,
\end{aligned}
\tag{4.1}
\]
[Diagram: a 3TPC. Buffered packets (buffer size b) pass through pipeline stages S1, S2, and S3 over layer 3, layer 2, and layer 1. The upper-layer BFs are probed with 1 hash and 1 read each; where the result == 1, the layer-1 BFs $B^1_1$ to $B^1_8$ are probed with k-2 hashes and k-2 reads each. A: address port and D: data port of a BF memory.]
Fig. 8. Memory architecture of a 3TPC in a forest and in pipeline. $B^i_j$ means the j-th
BF at layer i. n=8. k=w due to Eq. (3.3).
where $m_t$ is the total memory of BFs on layer t, r+1 is the number of tiers, $2^t n_i$
is the number of keys covered by a BF on layer t+1, and the lookup precisions of a BF on
layer 1 and on layer t>1, $w_1$ and $w_t$, are w-r and 1, respectively. Based on Eq.
(3.2), the f-positives of BFs on layer 1 and 2 in a 3TPC are expected to be
$2^{-(w-2)}$ and $2^{-1}$, respectively, and the second term,
$\sum_{t=1}^{r}\sum_{i=1}^{n/2^t}(1.44 \cdot (2^t n_i) \cdot 1)$, in Eq. (4.1) is the
sum of small-sized BFs from layer 2 to layer r+1. Also, a BF from layer 1 covers $n_i$
elements, and a BF from layer 2 covers $2n_i$ keys. Generally, $B^j_i$ covers all keys
from $B^{j-1}_{2i-1}$ and $B^{j-1}_{2i}$, $1 \le i \le n/2$, $1 < j \le r$ in an MPC.
[Plots: (a) total # of memory read ports versus the # of BFs, n (log scale, 8 to 64), for PPC, 2TPC, and 3TPC; (b) area (mm2) (x10^3) versus w(=k) and nk (x10^4) for PPC and 2TPC, with w2=1.]
Fig. 9. (a) The total number of read ports in different number of BFs. $w_3=w_2=1$,
$w_1=13$ for a 3TPC. $w_2=1$, $w_1=14$ for a 2TPC. $f=2^{-15}$. (b) 2TPC and PPC
area costs with n=8 in 0.13μm process technology.
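The memory equality of Eq. (4.1) can be verified numerically with the Python sketch below. The function names are hypothetical, n is assumed to be divisible by $2^r$, and the example values (n=8, $n_i$=1000, w=15, r=2) are chosen only to resemble a 3TPC.

```python
import math


def ppc_memory(n: int, n_i: int, w: int) -> float:
    """Total PPC memory: n BFs of 1.44 * n_i * w bits each."""
    return n * 1.44 * n_i * w


def mpc_memory(n: int, n_i: int, w: int, r: int) -> float:
    """Total MPC memory with r+1 tiers, summed layer by layer as in Eq. (4.1)."""
    # Layer 1: n BFs with precision w - r.
    layer1 = sum(1.44 * n_i * (w - r) for _ in range(n))
    # Layers 2 .. r+1: n / 2**t BFs, each covering 2**t * n_i keys at precision 1.
    upper = sum(
        1.44 * (2 ** t * n_i) * 1
        for t in range(1, r + 1)
        for _ in range(n // 2 ** t)
    )
    return layer1 + upper


ppc = ppc_memory(n=8, n_i=1000, w=15)
mpc = mpc_memory(n=8, n_i=1000, w=15, r=2)
```

Both totals come out at 172800 bits for these values, since each upper layer holds half as many BFs that are each twice as large.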
In this multi-tiered and pipelined configuration with b=1, power in accessing
memory (or probing BFs) can be saved. For example, suppose $B^1_2$ has a key and there
is a lookup for the key. By preprocessing the lookup in stage S1 with $B^2_1$ and
$B^2_2$, if $B^2_2$ returns 'no' in the lookup there is no need to probe $B^1_3$ and
$B^1_4$. Thus, the power used to probe them can be saved.
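The power-saving query path of this example can be modeled in software as follows. This Python sketch is not the pipelined hardware design: the class names, the bit-vector sizes, and the salted SHA-256 hashes are assumptions, and indices are 0-based, so $B^1_2$ in the text corresponds to flow index 1.

```python
import hashlib


class BF:
    """Small Bloom filter used as a building block (salted SHA-256 hashes)."""

    def __init__(self, m: int, k: int, salt: str) -> None:
        self.m, self.k, self.salt = m, k, salt
        self.bits = [0] * m

    def _idx(self, key: str):
        for i in range(self.k):
            d = hashlib.sha256(f"{self.salt}:{i}:{key}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def insert(self, key: str) -> None:
        for j in self._idx(key):
            self.bits[j] = 1

    def query(self, key: str) -> bool:
        return all(self.bits[j] for j in self._idx(key))


class TwoTierClassifier:
    """2TPC model: n layer-1 BFs (k-1 hashes) under n/2 layer-2 BFs (1 hash)."""

    def __init__(self, n: int, m1: int, m2: int, k: int) -> None:
        self.layer1 = [BF(m1, k - 1, f"L1-{j}") for j in range(n)]
        self.layer2 = [BF(m2, 1, f"L2-{i}") for i in range(n // 2)]

    def insert(self, flow: int, key: str) -> None:
        self.layer1[flow].insert(key)
        self.layer2[flow // 2].insert(key)  # parent covers flows 2i and 2i+1

    def query(self, key: str):
        probed, matches = [], []
        for i, upper in enumerate(self.layer2):   # stage S1
            if not upper.query(key):
                continue                          # skip both children
            for j in (2 * i, 2 * i + 1):          # stage S2
                probed.append(j)
                if self.layer1[j].query(key):
                    matches.append(j)
        return matches, probed


pc = TwoTierClassifier(n=4, m1=256, m2=64, k=5)  # assumed sizes
pc.insert(1, "flow-key")                         # B^1_2 in 1-based notation
matches, probed = pc.query("flow-key")
```

Because the second layer-2 BF is empty here, only the two layer-1 BFs under the first layer-2 BF are probed, which is where the power saving comes from.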
In addition to the power concern, simply setting b to more than 1 does not achieve
a higher throughput efficiency. Although the derivation of Eq. (4.1) shows that an MPC
has the same memory size as a PPC, processing a lookup in small-sized BFs of one
read port does not provide a higher throughput in large-sized BFs on a lower layer.
For instance, even if b in Fig. 7 with w2=1 is set to 2, a one-read-port BF on layer 2
cannot process 2 lookups in one cycle. Thus, the number of read ports in the
small-sized BF needs to be the same as b. In general, the number needs to be b · w2 for
a throughput-efficient MPC. As suggested in [12], using mini-BFs with few read ports is
a solution that does not degrade lookup accuracy. However, even if a BF is broken into
several mini-BFs, the total number of read ports in the mini-BFs is the same as that
of a PPC. Thus, breaking a BF into mini-BFs only makes it possible to fabricate
BFs for packet processing; it does not provide the benefit of high throughput. In
contrast, the proposed MPC has two benefits, fewer read ports and a smaller area cost,
which make it possible to fabricate small-sized BFs with multiple read ports for a high
throughput without area overhead.
Figs. 9(a) and 9(b) show these two benefits: the smaller number of fabricated read ports and the smaller area for a 2TPC. Fig. 9(a) shows the number of read ports required to fabricate different numbers of BFs for a PPC, a 2TPC, and a 3TPC, respectively. In fabrication, a 2TPC and a 3TPC use 4% and 10% fewer read ports than a PPC in all cases. Fig. 9(b) shows 2TPC and PPC area costs for different values of w and $n_i$; in each case the area cost of using 4 mini-BFs for a BF is measured with the CACTI model [35].
Now, we show how to fabricate multiple ports in a small-sized BF without hardware overhead. There is a noticeable gap between the dotted and solid meshes in Fig. 9(b); the reason is that fabricating multiple ports in a small-sized memory needs less area than in a large-sized memory. Due to the page limit, we did not plot the area costs for 2 through 5 read ports in a small-sized BF memory on layer 2. However, the multi-port memory adds only a small area increase compared to a PPC's area. Thus, the buffer size b can amount to 5 at the most. Also, utilizing dual reads on the falling and rising edges of a clock [56] can double the memory read capacity and the lookup throughput (as the double data rate scheme does in DRAM and the AMD Athlon64). Thus, the buffer size doubles and the maximum b is 10 without memory overhead in an MPC.
B. Insert Operation in an MPC
The insert operation of a key in a BF on layer 1 is as simple as the key's insertion in a legacy BF. Similarly, on layer j, if a key to hash is assigned to $B^j_i$, the key is given to $B^{j+1}_{i/2}$ for the insert operation, 1<j≤s. The detailed procedure is shown in Procedure insert, which performs $k_j$ memory writes on layer j. Therefore, the memory write complexity of one key insertion is $\sum_{t=1}^{s} k_t = w = k_P$, which is the same as that of a PPC, where $k_P$ is based on Eq. (3.3). Also, note that the first (outer) for loop can be pipelined because the BF memories on one layer are independent of those on other layers. Thus, one key insertion is performed in every cycle on the condition that $B^1_i$ on layer 1, 1≤i≤n, supports multiple ports.
Procedure insert
Input: Key e and index i for a BF on layer 1
Output: Encoded 2TPC for key e
1  for layer j = 1 to s do
2    for t = 0 to k_j − 1 do            // h_t is the t-th hash function
4      B^j_{i/2^{j-1}}[h_t(e)] = 1;     // B is memory on SRAM for a BF
5    end
6  end
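Procedure insert can be sketched in software as follows. This is an illustrative model only, not the hardware design: the salted SHA-256 hash `h`, the helper `make_mpc`, and the 0-based indexing (so that the ancestor of BF i on layer j+1 is `i >> j`) are assumptions introduced here.

```python
import hashlib


def h(t: int, key: str, m: int) -> int:
    """t-th hash function: salted SHA-256 reduced modulo m (an assumption)."""
    digest = hashlib.sha256(f"{t}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m


def make_mpc(n: int, s: int, m: list[int]) -> list[list[list[int]]]:
    """layers[j][i] is the bit array of BF i (0-based) on layer j + 1."""
    return [[[0] * m[j] for _ in range(n >> j)] for j in range(s)]


def insert(layers: list[list[list[int]]], k: list[int], key: str, i: int) -> None:
    """Encode `key` in BF i of layer 1 and in its ancestor on every upper layer."""
    for j in range(len(layers)):      # layer j + 1 in the text's numbering
        bf = layers[j][i >> j]        # ancestor index floor(i / 2^j)
        for t in range(k[j]):         # k_j memory writes on this layer
            bf[h(t, key, len(bf))] = 1


if __name__ == "__main__":
    layers = make_mpc(n=8, s=2, m=[64, 16])   # a small 2TPC (sizes assumed)
    insert(layers, k=[3, 1], key="10.0.0.1", i=5)
    print(sum(layers[0][5]), sum(layers[1][2]))
```

Only one BF per layer is written, so the number of writes per key is the sum of the per-layer hash counts, matching the complexity stated above.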
C. Query Operation in an MPC
Unlike the insert operation, where only the involved BFs are accessed, a query operation on parallel BFs needs to access all BFs to find which BFs return 'yes'. Except for the one involved BF, the irrelevant BFs can only give f-positives, which lead to packet misclassification. In an MPC, the irrelevant BFs are therefore not considered for probing, so that the BF access complexity in processing a lookup with n BFs is far less than n. To provide such a complexity, we split the memory of a PPC into small-sized BFs and large-sized BFs in multiple tiers, and connect them in binary trees. Then, accesses to large-sized BFs are made only if their parent small-sized BFs return 'yes' (or value 1 in D), as shown in Fig. 7. Also, BFs in multiple tiers can be pipelined so that there is no performance degradation. Before the detailed procedure, let us introduce the definitions of a true path and a false path in an MPC.
Definition 1 (True Path)
In a query operation on the forest shown in Fig. 7, a true path, t-path, occurs. It is composed of the shadowed BFs, starting from the root of a tree, that return 'yes'. These BFs were involved in the previous insert operation for the key. The length of a t-path is 2 in the case of 2TPCs.
For example, if a key is assigned to set $B_2$ in PBFs, the BFs on the t-path for 2TPCs are $B^2_1$ and $B^1_2$, as shown in Fig. 7. From the above definition, in a query operation all BFs on the t-path should return 'yes' for the given key, just as a legacy BF returns 'yes', because each of them has the key as a member.
Unlike a t-path, a false path is made from a group of BFs giving f-positives, so that packet misclassification occurs. The detailed definition of a false path is as follows:
Definition 2 (False Path)
In a query along consecutive layers, a group of BFs giving f-positives makes a false path, f-path. The series of BFs can run either from an off-branch BF of a t-path or from the root of a tree to the bottom of the tree, as shown in the checked boxes of Figs. 7 and 8.
The f-positives of BFs that neither stem from a branch of a t-path nor form a complete path in a tree of the forest cannot contribute to an f-path by the definition. Also, the number of f-paths equals the number of packet misclassifications. An important consequence of the above definition is that the probability that an f-path contributes one packet misclassification is the product of the f-positive probabilities of the BFs on the f-path.
1. False classiﬁcation in a successful lookup
We divide lookups into two kinds: 1) a successful lookup (SL) and 2) an unsuccessful lookup (UL). In a network application, given a packet, a router needs to determine the destination based on a flow table holding classification information. If the packet's flow is in the table, we call the lookup an SL. Now, we show the misclassification probability in an SL.
By a recursive definition, the probability $P_a(i)$ that the binary tree rooted at a has i packet misclassifications is obtained by summing, over j, the product of the following three: the probability of an f-positive in root a of the binary tree, the probability that the left subtree has i-j packet misclassifications, and the probability that the right subtree has j packet misclassifications, as follows:
\[
P_a(i) = \sum_{j=0}^{i} f_a \times P_l(i-j) \times P_r(j), \tag{4.2}
\]
where $f_a$ is the probability of an f-positive from BF a, and as a base case, $P_{B^1_1}(1) = f_{B^1_1}$. Finally, the dominant probability, $P_s(1)$, that a single packet misclassification occurs across a forest is the following:
\[
P_s(1) = \sum_{j=1}^{r-1} P_{B^j_t}(1) + \sum_{i=2}^{n/2^{r-1}} P_{B^r_i}(1), \tag{4.3}
\]
where r is the number of tiers, the first term is the summation of Eq. (4.2)'s probabilities of BFs attached on a t-path, and the second term is the summation of the probabilities of the remaining trees in the forest.
2. False classiﬁcation in an unsuccessful lookup
Since not all packets belong to a specific flow in the flow table, a UL is as important as an SL. Unlike an SL, in a UL there is no t-path. This means that whatever a BF returns, if anything, is an f-positive. The dominant probability, $P_u(1)$, that a single packet misclassification happens in a UL is
\[
P_u(1) = \sum_{i=1}^{n/2^{r-1}} P_{B^r_i}(1). \tag{4.4}
\]
Procedure query shows the details of the query operation on an MPC.
Procedure query
Input: Forest F of binary trees for an MPC and key e
Output: Set S for a true path and a group of false paths
S = ∅;
for each tree T in forest F do
  S = S ∪ query_BT(T, e);
end
return S;
The for loop of Procedure query can be implemented in parallel. It calls subroutine query_BT, which works recursively and in pipeline on each layer of a binary tree to check a BF for the key e as a legacy BF does. Also, pipelining on the layers of a binary tree ensures that the query complexity is Θ(1), as a PPC's complexity is.
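The pruning behavior of Procedure query can be illustrated with a small software model. This is a sketch under assumptions, not the pipelined hardware: the hash `h`, the helpers `bf_add` and `bf_test`, the name `query_bt`, and the 0-based layout (children of BF i are BFs 2i and 2i+1 one layer below) are all introduced here for illustration. The `probes` list records which BFs are actually read.

```python
import hashlib


def h(t, key, m):
    """t-th hash function: salted SHA-256 reduced modulo m (an assumption)."""
    digest = hashlib.sha256(f"{t}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m


def bf_add(bf, k, key):
    for t in range(k):
        bf[h(t, key, len(bf))] = 1


def bf_test(bf, k, key):
    return all(bf[h(t, key, len(bf))] for t in range(k))


def query_bt(layers, k, key, j, i, probes):
    """Return the layer-1 indexes below BF i of layer j + 1 that answer 'yes'."""
    probes.append((j, i))
    if not bf_test(layers[j][i], k[j], key):
        return set()                  # 'no': the children are never probed
    if j == 0:
        return {i}                    # a complete path down to layer 1
    return (query_bt(layers, k, key, j - 1, 2 * i, probes)
            | query_bt(layers, k, key, j - 1, 2 * i + 1, probes))


def query(layers, k, key):
    """Probe every tree of the forest; trees are independent (parallel in hardware)."""
    probes, result = [], set()
    top = len(layers) - 1
    for root in range(len(layers[top])):
        result |= query_bt(layers, k, key, top, root, probes)
    return result, probes


if __name__ == "__main__":
    layers = [[[0] * 256 for _ in range(8)], [[0] * 64 for _ in range(4)]]
    k = [4, 1]
    bf_add(layers[0][5], k[0], "flow-A")   # layer-1 BF holding the key
    bf_add(layers[1][2], k[1], "flow-A")   # its parent on layer 2
    print(query(layers, k, "flow-A"))
```

With a single key stored, only the four roots and the two children of the matching root are read, instead of all eight layer-1 BFs. The returned set may in general also contain f-paths, which this sketch does not distinguish from the t-path.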
Based on Eqs. (4.3) and (4.4), the expected packet misclassification considering SL and UL rates is
\[
p_s \sum_{i=1}^{n-1} i \cdot P_s(i) + (1 - p_s) \sum_{i=1}^{n} i \cdot P_u(i) = p_s \cdot E_s + (1 - p_s) E_u, \tag{4.5}
\]
where $p_s$ is an SL rate, and $E_s$ and $E_u$ are the average packet misclassifications for an SL and a UL, respectively.
Using an MPC causes only a minuscule degradation in classification performance. Fig. 10 shows the average packet misclassification of a PPC and a 2TPC based on Eq. (4.5) as a function of the successful lookup rate $p_s$. There are three important considerations: 1)
Fig. 10. The average packet misclassification for a PPC-n and a 3TPC-n in a different SL rate. $f=2^{-w}=2^{-30}$, $w_1=28$, $w_2=w_3=1$. n ∈ {32, 64, 128}.
Given a desired f-positive, f, the larger n is, the larger the average packet misclassification becomes, due to the bigger binomial coefficient value in B(f, n). 2) Given the same memory size, the probabilities of PPC-n and 2TPC-n for a UL are the same, while at a dominant SL rate there is a minuscule difference, 2E-9, between them. 3) The difference gets smaller as n gets larger. In conclusion, as the number of BFs, n, and the rate $p_s$ become larger, the difference in packet misclassifications between a PPC and a 2TPC is negligible. The one-packet misclassifications of Eqs. (4.3) and (4.4) show the same phenomenon as shown in Fig. 10.
D. Delete Operation in an MPC
The delete operation is not as easy as insert because a basic BF in [12–14, 17] does not support deletion of a key that was encoded in the BF. If a counting BF [39] or a low power counting BF (L-CBF) [57] is adopted, the delete operation can be as easy as insertion in the basic BF. Line 4 in Procedure insert, $B^j_{i/2^{j-1}}[h_t(e)]=1$, shows bit setting for the basic BF. However, if the delete operation is to be supported, that line needs to be changed to $B^j_{i/2^{j-1}}[h_t(e)]$++ because a counting BF is used, and line 4 of Procedure delete then decrements the same counter.
Procedure delete
Input: Key e and index i for a BF on layer 1
Output: Deleted 2TPC for key e
1  for layer j = 1 to s do
2    for t = 0 to k_j − 1 do            // h_t is the t-th hash function
4      B^j_{i/2^{j-1}}[h_t(e)] −−;      // B is memory on SRAM for a BF
5    end
6  end
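The change from bit setting to counter increment and decrement can be sketched for a single counting BF as follows. This is a generic counting Bloom filter written for illustration, not the L-CBF of [57]; the class name, the salted SHA-256 hash, and the guard against underflow are assumptions.

```python
import hashlib


def h(t: int, key: str, m: int) -> int:
    """t-th hash function: salted SHA-256 reduced modulo m (an assumption)."""
    digest = hashlib.sha256(f"{t}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m


class CountingBF:
    """A counting Bloom filter: each position holds a counter instead of a bit."""

    def __init__(self, m: int, k: int) -> None:
        self.counters = [0] * m
        self.k = k

    def insert(self, key: str) -> None:
        for t in range(self.k):
            self.counters[h(t, key, len(self.counters))] += 1   # B[h_t(e)]++

    def delete(self, key: str) -> None:
        # Only valid for keys that were inserted before; the guard merely
        # keeps counters non-negative if that precondition is violated.
        for t in range(self.k):
            g = h(t, key, len(self.counters))
            if self.counters[g] > 0:
                self.counters[g] -= 1                           # B[h_t(e)]--

    def query(self, key: str) -> bool:
        return all(self.counters[h(t, key, len(self.counters))] > 0
                   for t in range(self.k))


if __name__ == "__main__":
    bf = CountingBF(m=64, k=3)
    bf.insert("flow-A")
    bf.insert("flow-B")
    bf.delete("flow-A")
    print(bf.query("flow-B"))
```

In an MPC the same pair of operations would be applied to the one BF per layer selected by Procedure insert and Procedure delete.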
E. Simulation Result for an MPC
CACTI [35] models SRAM architecture in terms of area, access time, and power. With the help of the CACTI model, we measured the throughput and power of a PPC and an MPC with IP traces from NLANR PMA and the Internet Traffic Research Group [58]. We assume that a PPC needs one cycle to process a packet lookup on n parallel BFs, and that in an MPC a small-sized BF with multiple ports can process a group of lookups in one cycle while a large-sized BF with multiple ports processes a lookup in high precision. The IP traces used are PUR, SDA, FRG, and PSC, which have 19.4K, 29.5K, 39.7K, and 37.9K flows as rules, respectively. The simulation used 193.3K, 292.2K, 337K, and 314.3K packets in flow identification with different numbers of router ports, each having the same number of flows.
Fig. 11. The number of read ports and average number of memory reads in different number of BFs. $w_3=w_2=1$, $w_1=13$ for a 3TPC. $w_2=1$, $w_1=14$ for a 2TPC. $f=2^{-15}$.
1. Experiment for Power
For power estimation, each pipeline stage is designed to process a single lookup, in contrast to the multi-lookup capability in the throughput experiment of the following section. For theoretical comparison, we calculate the average number of memory reads per lookup in MPCs based on Eq. (4.5). As suggested in [12], using mini-BFs with few read ports is a solution that does not degrade lookup accuracy. However, even if a BF is broken into several mini-BFs, the total number of read ports in the mini-BFs is the same as that of the original BF. Thus, breaking a BF into mini-BFs only makes it feasible to fabricate BFs for high throughput in packet processing; it does not reduce power or area costs. In contrast, the proposed MPCs offer benefits such as fewer read ports and reduced power during the lookup operation.
Fig. 11 shows these two benefits: the smaller number of fabricated read ports and
Fig. 12. Power consumption (mW) by two traces, (a) AMP and (b) PSC, in PPCs, 2TPCs, and 3TPCs. Also, n ∈ {8, 16, 32}.
the smaller number of memory reads for a lookup in 2TPCs and 3TPCs. Suppose 15 ports are required in a BF's fabrication in PPCs. The three solid lines show the number of read ports required to fabricate different numbers of BFs for PPCs, 2TPCs, and 3TPCs, respectively. The other two marked lines show the number of operational memory reads for a given lookup. In fabrication, 2TPCs and 3TPCs use 4% and 10% fewer read ports than PPCs. In addition, for a given packet lookup with 64 BFs, the average number of operational memory reads is reduced by factors of 1.9 and 3.8 for 2TPCs and 3TPCs, respectively, compared to PPCs. Thus, less power is consumed during a lookup in MPCs in real packet classification.
Table III. Power values by CACTI in a PPC (31K×1, 20 ports), a 2TPC (29K×1, 19 ports), and a 3TPC (14K×1, 18 ports).

        A BF power (W)    A small-sized BF power (W)
PPC     0.120             N/A
2TPC    0.110             0.002
3TPC    0.097             0.008
Table III shows the typical power values used in CACTI in the case of the AMP trace. Based on these values, we measure the power for the other trace, PSC, as shown in Fig. 12. Fig. 12 shows the average power of four traces over 10 runs in different configurations (PPCs, 2TPCs, and 3TPCs). We set w=20 for a PPC, and the lookup precisions of a large-sized BF in layer 1 are set to 19 and 18 for 2TPCs and 3TPCs, respectively. The power efficiency ratios of 3TPCs against PPCs in AMP and PSC are at most 4.2, 4.1, 3.7, and 3.2, respectively. Also, the power efficiency ratios of 3TPCs against 2TPCs in AMP and PSC are 1.9, 1.9, 1.7, and 1.5, respectively. From these results, it is clear that an MPC is more power efficient than a PPC, and as the number of tiers becomes larger, the power efficiency becomes better.
2. Experiment for Throughput
The throughput is defined as the number of packets divided by the number of simulation cycles needed to process the whole IP traces, and we assume that each small- or large-sized BF takes one clock cycle to process a lookup. Fig. 13 shows the average throughput ratios of four traces over 10 runs in a 2TPC architecture where each small-sized BF on layer 2 has a b-sized buffer and processes the b packets in the buffer in one cycle. Once the small-sized BFs process the packets in their buffers, the results are forwarded to large-sized BFs on layer 1. A BF on layer 1 works on a partially processed packet only if its parent BF returns 'yes' to the packet. Thus, if a BF on layer 2 returns 'no' for a packet, its large-sized children BFs can process other following packets, leading to a higher throughput. In each subfigure, for every number of BFs, the larger the buffer size is, the higher the throughput ratio is, showing that our MPC gives a higher throughput than a PPC. A throughput ratio of at most 2.0 was observed in the PSC trace. Although we simulated a case of, at most, 64 BFs, our MPC shows higher throughput than those in Fig. 13 if a larger number of BFs and a larger buffer size b are used.
Fig. 13. Throughput ratios of a 2TPC against a PPC with four traces (PUR, SDA, FRG, and PSC) in different number of buffer size b and n BFs, n ∈ {16, 32, 64}. $w_1=28$, $w_2=2$.
CHAPTER V
MULTI-PREDICATE BLOOM-FILTERED HASH TABLE
In this chapter, we propose a novel hash architecture that uses a set of BFs in parallel for a perfect match. The BFs used in our hash mechanism are designed to support a multi-predicate rather than the simple membership test, i.e. binary predicate, of a legacy BF. Our scheme using multi-predicate BFs reduces the memory size in the base-$2^x$ number system by x times compared to that of the base-$2^1$ number system with a binary-predicate BF, where x is a positive integer larger than 1.
Fig. 14. Macro view of an MBHT in on/off-chip memory of base-2. $n=2^2$.
Fig. 14 shows the macro view of our architecture with two MBHTs, l-MBHT and r-MBHT, and a key table residing in on-chip memory, while a rule table of $n=2^2$ entries and a queue of free addresses reside in off-chip memory. One of the MBHTs is involved in the insert operation depending on the l/r-register. This register is switched between l and r whenever n inserts are made on one MBHT, so that once a window of the queue is used up the peer MBHT is cleaned up for future inserts. Through this rotation, without counting BFs costing 4 times the memory, dual MBHTs can seamlessly provide insert and delete operations for incremental updates of rules. In contrast, both MBHTs are involved in query and delete operations because it is not known where a wanted key is located.
A. Index Address to a Key Table in Base-b
Unlike a legacy BF [12–14], the proposed hashing architecture is capable of indexing a key table in on-chip memory in the base-b number system for perfect match, and of accessing a rule table in off-chip memory by the index address of the matched entry for a given key. Whereas a BF only returns an approximate 'yes', an MBF is capable of telling arbitrary per-key information associated with a given key when a membership test is met.
Assume there are n keys to hash in key and rule tables where the keys are saved in contiguous and flat memory space. In a perfect hash like an LHT and an FHT [15], the buckets are an array of pointers to linked lists in off-chip memory. These pointers are in the form of a binary address, and the total number of buckets, $n_b$, is determined in relation to the collision rate. Thus, an HT in an LHT and an FHT is an array of $n_b$ pointers of length $\log_2 n$. In an MBHT, however, the on-chip memory for the array is partitioned into MBFs, and the MBFs are grouped column-wise. The n keys are saved in an arbitrary order at index $A_b$ of the two tables, where b, a positive integer, is the base of the number system. Given n keys and the base-b number system, there are $r=\log_b n$ digits in an index for a key in a key table, and each of the r digits in the index $A_b=a_0a_1 \cdots a_{r-1}$ is expressed in $\log_2 b$ bits, i.e. $a_i \in \{0, \cdots, b-1\}$. Denote by a-BF a multi-predicate BF embedded in an on-chip memory module, implying that if a membership test is met, value a in $\log_2 b$ bits for the base-b number system is considered as a part of the index of a given query. Provided that the address space is based on a number system of base-b, the address space is partitioned with a set of a-BFs, $a \in \{0, \cdots, b-1\}$, so that each digit $a_i$ of base-b in $A_b$ is covered by $a_i$-BF$^i$, 0≤i≤r-1. After the digits are partitioned column-wise by the set of MBFs, $a_i$-BF$^i$ is involved for $a_i$ in the insert operation described in Sec. C, and the relevant a-BF from each column implies value a in the query operation explained in Sec. D.
Fig. 15. Partitioning of 8 elements in base-2 with 0-BFs and 1-BFs.
Fig. 15 shows an example of the base-2 number system with $2^3$ keys and three pairs of 0-BFs and 1-BFs. The indexes in the upper figure are drawn in a rectangle of 0- and 1-bits based on the base-2 number system, where each column has the same number of 0/1 digits as shown in the right table. Below, the 8 keys are regrouped in every column according to their bits in the column, so that each of the 0-BF$^v$s and 1-BF$^v$s, $v \in \{0, 1, 2\}$, has its own set for insert as shown on the right side. For instance, suppose $e_1=e_{001_2}$ is to be saved at address $001_2$ of off-chip memory. In an MBHT, 0-BF$^0$, 0-BF$^1$, and 1-BF$^2$ from columns 0, 1, and 2, respectively, are involved in saving $e_1$ as shown in the figure.
The base-2 number system used in Fig. 15 can be expanded into an arbitrary
number system for the beneﬁt of memory eﬃciency, as shown in Fig. 16. All address
Fig. 16. Conversion of the base-2 number system to base-4 and base-8 for 64 elements: (a) base-2 number system, (b) base-4 number system, (c) base-8 number system. $n = 2^6$. By (X), X means the number of the same digits in a BF.
spaces in subfigures are partitioned column-wise and grouped by MBFs. The address space for $2^6$ elements in the base-2 number system in Fig. 16 (a) is transformed to other address spaces of base-4 and base-8 number systems in Fig. 16 (b) and (c), respectively, resulting in fewer columns in each address space. However, this transformation does not affect addressing an index to a key table. For example, suppose item $e_{011010_2}$ for base-2 is located at $011010_2$. This can be at $122_4$ and $32_8$ of base-$2^2$ and base-$2^3$, respectively, as shown in Fig. 16 (b) and (c).
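The conversion of one index address into its base-b digits, one digit per column, can be sketched as below. The function name `to_base_digits` is hypothetical; digits are returned most significant first, which matches the column order used in the examples of this section.

```python
def to_base_digits(index: int, b: int, r: int) -> list[int]:
    """Return the r base-b digits a_0 ... a_{r-1} of index, most significant first."""
    digits = []
    for _ in range(r):
        index, d = divmod(index, b)   # peel off the least significant digit
        digits.append(d)
    return digits[::-1]


if __name__ == "__main__":
    addr = 0b011010                       # the example address in base-2
    print(to_base_digits(addr, 2, 6))     # six columns in base-2
    print(to_base_digits(addr, 4, 3))     # three columns in base-4
    print(to_base_digits(addr, 8, 2))     # two columns in base-8
```

For the address in the example the three calls give the digit sequences 011010, 122, and 32, so the same key table entry is addressed in every base while the number of columns shrinks from 6 to 3 to 2.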
Even if the address space in one column is partitioned by b MBFs in the base-b system, they can be accessed with the same memory address by stacking the MBFs in the following way for hardware implementation: the on-chip memory for b MBFs is an array of words of b bits, so that an address to on-chip memory indicated by a hash function involves all b MBFs. Therefore, if all k words indicated by k hash functions have bits of value 1 at offset a, a-BF passes the membership test in the query for a given key, and the offset a is used as a part of address $A_b$. In this way the number of on-chip memory modules is r and they work in parallel for insert.
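The stacking of b MBFs into b-bit words can be modeled for one column as follows. This is an illustrative sketch: each word is a Python integer used as a bit vector, and the salted SHA-256 hash `h` and the function names are assumptions. A bitwise AND over the k selected words leaves a 1 exactly at the offsets whose a-BF answers 'yes'.

```python
import hashlib


def h(t: int, key: str, m: int) -> int:
    """t-th hash function: salted SHA-256 reduced modulo m (an assumption)."""
    digest = hashlib.sha256(f"{t}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m


def column_insert(words: list[int], k: int, key: str, a: int) -> None:
    """Encode digit a for key: set bit a in each of the k selected words."""
    for t in range(k):
        words[h(t, key, len(words))] |= 1 << a


def column_query(words: list[int], k: int, key: str, b: int) -> list[int]:
    """Return every digit a whose a-BF passes the membership test for key."""
    acc = (1 << b) - 1                        # start with all b offsets set
    for t in range(k):
        acc &= words[h(t, key, len(words))]   # AND across the k words
    return [a for a in range(b) if (acc >> a) & 1]


if __name__ == "__main__":
    words = [0] * 128                         # one column: m = 128 words of b = 4 bits
    column_insert(words, k=3, key="e1", a=2)
    print(column_query(words, k=3, key="e1", b=4))
```

A query may return more than one digit for a column when other a-BFs give f-positives; such extra digits are what later combine into false indexes.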
B. Memory Eﬃciency with a Larger Base-b
With an invariant about addressing systems shown in Fig. 16, given the base-$b_1$ number system and a requirement of f, the Linear Property of Lemma 1 regarding variables m and n claims that even if the number of new BFs, $b_2$, is increased in a column in the new base-$b_2$, the total memory size for the column remains the same. The reason is that although the number of elements to hash for each new BF in the column is reduced, the total number of items for the new column in base-$b_2$ is the same as that for base-$b_1$.
In general, considering two MBHTs, the total memory usage in bits for the base-b number system as a function of a requirement of $f=2^{-w_M}$ is calculated as follows:
\[
\begin{aligned}
M_b(f) &= 2 \times C \times B = 2 \times \{\log_b n\} \times \{b\,(1.44\,(n/b)\,\log_2(1/f_M))\} \\
&= 2 \cdot \left\{\frac{\log_2 n}{\log_2 b}\right\} \cdot \{1.44\, n \log_2(1/f_M)\} \\
&= 2.88\, \frac{n\, w_M \log_2 n}{\log_2 b},
\end{aligned} \tag{5.1}
\]
where $w_M$ is the precision of the query operation, C is the number of columns, and B is the number of bits for a series of MBFs in one column. In this equation, the denominator term $\log_2 b$ makes $M_b$ smaller as b increases, provided that n and f are constants. This is manifested in Fig. 17, which shows the total memory usage in bits considering only MBFs in several number systems based on Eq. (5.1). Along the n axis for b=2, $M_b$ increases greatly for a given f due to the $\log_2 n$ and n terms in Eq. (5.1). Similarly, the rates of change along the f and n axes for a smaller b are much larger than those for a larger b. Furthermore, the gap in $M_b$ among different values of b is large enough that the saved memory can be used to reduce the f-positive of each MBF. Thus, rather than using base-$b_1$, using a larger base-$b_2$ number system is advantageous because of $\log_2 b_2/\log_2 b_1$ times on-chip memory saving, that is, $M_{b_1}/M_{b_2}=\log_2 b_2/\log_2 b_1$. However, choosing the appropriate base system depends on the current technology of memory hardware. For example, $b=2^{20}$, the largest base system, could be the best choice for $n=2^{20}$ because it gives the highest memory efficiency. In contrast to this theoretical benefit, in a real hardware implementation it is very hard to probe sequentially all $2^{20}$ bits in a word indicated by a hash function to find a bit position having value 1. Even worse, k such words selected by k hash functions need to be checked to return the multi-predicate value for the involved MBF, taking an unsupportable time in hardware.
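Eq. (5.1) and the memory ratio between two bases can be evaluated directly. The sketch below is only a numerical illustration; the function name and the sample values of n and $w_M$ are assumptions.

```python
import math


def mbht_memory_bits(n: int, b: int, w_m: float) -> float:
    """Eq. (5.1): M_b = 2.88 * n * w_M * log2(n) / log2(b) for two MBHTs."""
    return 2.88 * n * w_m * math.log2(n) / math.log2(b)


if __name__ == "__main__":
    n, w_m = 2**16, 10                     # illustrative values (assumed)
    base2 = mbht_memory_bits(n, 2, w_m)
    for b in (2, 4, 8, 16, 32):
        m_b = mbht_memory_bits(n, b, w_m)
        # The ratio M_2 / M_b equals log2(b), the claimed memory saving.
        print(b, round(m_b), round(base2 / m_b, 3))
```

The printed ratio grows as log2 of the base, which is the saving $M_{b_1}/M_{b_2}=\log_2 b_2/\log_2 b_1$ with $b_1=2$; the sketch says nothing about the hardware cost of scanning wider words, which is the practical limit discussed above.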
Fig. 17. Memory size $M_b$ (in bits) for b = 2, 4, 8, 16, and 32 with f and n.
C. Insert Operation in an MBHT
The detailed procedure of the insert operation is described in Algorithm insert, where r on-chip memories $M_{on}$, each of m words of b bits, $m=1.44nw_M$ by Eq. (3.2), are involved. Address A fed to Algorithm insert is provided by a queue of free addresses. The first (outer) for loop is executed in parallel at each column. Also, the second for loop is done in parallel, as long as a conventional BF supports k hash functions in parallel. According to [55], fabrication of 6 to 8 read ports in on-chip memory is attainable. Even if the number of needed hash functions is larger than the attainable number of ports, splitting into several on-chip memories with the attainable port numbers is a solution, as suggested in [12]. Therefore, the time complexity in on-chip memory is Θ(1) on the conditions that hash functions return indexes in constant time and each column conducts hashing in parallel. Moreover, the number of off-chip memory accesses through $M_{off}[A]$ is exactly 1 because the rule for item e is saved at the designated address A, as shown in the last line. Thus, the complexity of Algorithm insert for off-chip memory access is Θ(1). In contrast, an FHT was calculated to have a time complexity of $O(nk^2/m + k)$, which is not suitable for dynamic updates in packet processing [8].
Algorithm 4: insert(x, e, rule, A)
Input: x-MBHT, x ∈ {l, r}, key e, its rule, and address A = a_0 a_1 ··· a_{r-1} in base-b
Result: Encoded MBHT for key e
for column i = 0 to r − 1 do          /* on-chip operation */
  for t = 0 to k − 1 do
    g = h_t(e);                       /* hashing, g ∈ {0, ···, m-1} */
    M^i_on[g][a_i] = 1;               /* the size of M^i_on is m × b */
  end
end
M_key[A] = e;                         /* a key table in on-chip memory */
M_off[A] = rule;                      /* off-chip memory access */
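Algorithm 4 can be modeled in software as follows. This is a sketch of a single MBHT only: the l/r rotation and the queue of free addresses are left out, and the class name, the salted SHA-256 hash `h`, and the use of Python integers as b-bit words are assumptions made for illustration.

```python
import hashlib


def h(t: int, key: str, m: int) -> int:
    """t-th hash function: salted SHA-256 reduced modulo m (an assumption)."""
    digest = hashlib.sha256(f"{t}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m


class MBHT:
    """One MBHT: r column memories of m b-bit words, a key table, a rule table."""

    def __init__(self, n: int, b: int, m: int, k: int) -> None:
        self.b, self.m, self.k = b, m, k
        self.r = 0
        while b ** self.r < n:                # r = log_b n columns
            self.r += 1
        self.m_on = [[0] * m for _ in range(self.r)]   # on-chip MBF columns
        self.m_key = [None] * n               # on-chip key table
        self.m_off = [None] * n               # models the off-chip rule table

    def digits(self, addr: int) -> list[int]:
        """Base-b digits a_0 ... a_{r-1} of addr, most significant first."""
        out = []
        for _ in range(self.r):
            addr, d = divmod(addr, self.b)
            out.append(d)
        return out[::-1]

    def insert(self, key: str, rule: str, addr: int) -> None:
        for i, a in enumerate(self.digits(addr)):   # columns: parallel in hardware
            for t in range(self.k):
                g = h(t, key, self.m)
                self.m_on[i][g] |= 1 << a           # M_on^i[g][a_i] = 1
        self.m_key[addr] = key
        self.m_off[addr] = rule                     # the single off-chip write


if __name__ == "__main__":
    table = MBHT(n=64, b=4, m=128, k=3)
    table.insert("e", "rule-for-e", 0b011010)
    print(table.digits(0b011010), table.m_key[26], table.m_off[26])
```

Each insert touches k words in each of the r columns and performs exactly one write to the rule table, mirroring the complexity argument above.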
D. Query Operation in an MBHT
After all elements are saved contiguously in oﬀ-chip memory and encoded in a set
of MBFs in on-chip memory, the remaining and ultimate goal of an HT is to search
for an item by a fast query operation. There are two kinds of search patterns: a
successful search (SS) in which an item is found; and an unsuccessful time-consuming
search (US) for an item that does not exist in an HT.
Before the two kinds of searches with possible false accesses to off-chip memory are examined, let us introduce the definitions of a true index and a false index.
Definition 3 (True Index)
A true index, or t-index, is deﬁned as a series of MBFs assigned to encode a key.
They are interconnected and back-to-back of each other from column 0 to column r-1,
where r is the number of columns in the base-b number system, making a sequence of
full index address bits. The sequence of bits is also matched with an arbitrary memory
address associated with a key saved in key and rule tables.
For instance, key e is to be saved at address $122_4$ in base-4 and $32_8$ in base-8 as shown in Fig. 16 (b) and (c), respectively. In base-4, for sequence $122_4$, 1-BF$^0$, 2-BF$^1$, and 2-BF$^2$ are involved to save the key e, while in base-8, for sequence $32_8$, 3-BF$^0$ and 2-BF$^1$ form the t-index of key e. From the definition of a t-index, the following corollary can be concluded:
Corollary 1 Once key e is saved at index A with a series of $r=\log_b n$ MBFs, i.e. $a_0$-BF$^0 \cdots a_{r-1}$-BF$^{r-1}$, in base-b, the involved BFs should return 'yes', making $a_0 \cdots a_{r-1}$ in the query of key e, as a legacy BF always returns 'yes' for true membership.
Due to the independent and identically distributed (i.i.d.) property of BFs, it is possible that irrelevant BFs return f-positives in the query operation. Due to these irrelevant f-positives, a false index, defined as follows, can occur:
Definition 4 (False Index)
In a query of key e, in each column i of an MBHT, a group of MBF$^i$s in column i not pertaining to the t-index for the insert of the key can return f-positives. Through these f-positives, the indexes in a word of b bits in column i could lead to a false index, f-index, together with the other MBF$^j$s, j≠i, involved in the insert. Thus, an f-index is a combination of index values of MBFs irrelevant and relevant to the insert of the key, where the MBFs responding to a membership test for the key return their indexes in a word of b bits. Also, the length of an f-index should be $r=\log_b n$.
Given a query of a key, the set of a t-index and f-indexes should be probed to determine whether or not the key exists in a key table, implying perfect match unlike the approximate match in [12, 14, 17]. In our query for perfect match, the numbers of f-indexes for an SS and a US are at most n-1 and n, respectively. These numbers are comparable to the numbers of memory accesses for an SS and a US in a linked list of an LHT and an FHT. However, the difference between indexing a key table in an MBHT and a sequential access in a linked list of an FHT is that the key table resides in on-chip memory while a pair of a key and its rule exists in off-chip memory, so that at most one off-chip memory access is made in an MBHT while more than one access is necessary in an FHT. Most importantly, to find a matched key, a small number of indexes to a key table in an MBHT can be processed in one cycle in parallel, while a sequential access in a linked list of length t needs at most t cycles. Thus, although an MBHT with less memory can give more f-indexes, they are processed in one cycle, so that less memory can be used while high bandwidth is preserved with a low collision rate. The next important step is to recognize a t-index and annul the false positives randomly scattered in an MBHT so that the possibility of an f-index can be reduced.
1. False indexing for an SS in an MBHT
We have explained the definitions of a t-index and an f-index and how they can both occur in the query operations on an MBHT. Now, we derive and calculate the probability of the number of false accesses, i.e. f-indexes, in an SS. In a query for an SS, at least one MBF in each column needs to return its index value in k words, so that a sequence $a_0a_1 \cdots a_{r-1}$ of length r forms the full address A, i.e. the t-index. Furthermore, in the case of an f-index, false addresses can be created through f-positives by each irrelevant MBF in each column.
Suppose $X^s_i$ is a random variable for the number of f-positives from MBF$^i$ irrelevant to the t-index of an SS. Due to the i.i.d. f-positives of b-1 BFs in an MBF, the probability distribution of $X^s_i$ is a binomial distribution, B(b-1, f). Also, assume that the random variable $X^s$ denotes the total number of f-indexes in a given query operation for an SS. Then, $X^s$ is defined through the product of the random variables $X^s_i$, i.e. $(\prod_{i=0}^{r-1}(X^s_i+1))-1$, because of the i.i.d. property of each column, and the probability of $X^s = x$ is the following:
\[
\begin{aligned}
\Pr\{X^s=x\} &= \sum_{(x_0+1)\cdots(x_{r-1}+1)=x+1} \Pr\{X^s_0=x_0, \cdots, X^s_{r-1}=x_{r-1}\} \\
&= \sum_{(x_0+1)\cdots(x_{r-1}+1)=x+1} \Pr\{X^s_0=x_0\} \cdot \Pr\{X^s_1=x_1\} \cdots \Pr\{X^s_{r-1}=x_{r-1}\}.
\end{aligned} \tag{5.2}
\]
For example, $X^s=0$ means that no column i has a random variable $X^s_i$ larger than 0. Therefore, the probability becomes
\[
\Pr\{X^s = 0\} = \Pr\{X^s_0 = 0\} \cdot \Pr\{X^s_1 = 0\} \cdots \Pr\{X^s_{r-1} = 0\}. \tag{5.3}
\]
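Also, in the cases of one and two f-indexes, their probabilities are derived from the
following:

\[
\Pr\{X^s = v\} = \sum_{t=0}^{r-1} \Pr\{X^s_t = v\} \prod_{t' \neq t} \Pr\{X^s_{t'} = 0\},
\]

where $v\in\{1, 2\}$, because $v+1$ is then the prime number 2 or 3, which can be
factored in only one way: $2\times1\times\cdots\times1$ or $3\times1\times\cdots\times1$.
That is, exactly one column $t$ contributes the factor $X^s_t+1=v+1$ and every other
column $t'$ contributes the factor $X^s_{t'}+1=1$. Also, the mean of $X^s$ is calculated
based on the i.i.d. property of the $X^s_i$ as shown:

\[
E[X^s] = \sum_{t=0}^{n-1} t\cdot\Pr\{X^s = t\}
       = E\Big[\prod_{i=0}^{r-1}(X^s_i + 1) - 1\Big]
       = \prod_{i=0}^{r-1} E[X^s_i + 1] - 1
       = [(b-1)f + 1]^r - 1.
\tag{5.4}
\]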
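The relation between Eq. (5.2) and Eq. (5.4) can be checked numerically. The following is a minimal sketch, not the author's code: it enumerates every combination of per-column f-positive counts for small, purely illustrative values of b, f, and r (assumptions, not taken from the text) and compares the resulting mean with the closed form.

```python
from itertools import product
from math import comb


def binom_pmf(n, p):
    """Return [Pr{X = 0}, ..., Pr{X = n}] for X ~ B(n, p)."""
    return [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]


def ss_false_index_pmf(b, f, r):
    """Exact pmf of X^s = prod_i (X^s_i + 1) - 1 with X^s_i ~ B(b-1, f) i.i.d."""
    col = binom_pmf(b - 1, f)
    pmf = {}
    # One f-positive count (0 .. b-1) per column, as in the sum of Eq. (5.2).
    for counts in product(range(b), repeat=r):
        prob, addresses = 1.0, 1
        for x in counts:
            prob *= col[x]
            addresses *= x + 1
        pmf[addresses - 1] = pmf.get(addresses - 1, 0.0) + prob
    return pmf


# Illustrative parameters only (hypothetical, chosen to keep enumeration tiny).
b, f, r = 4, 2**-10, 3
pmf = ss_false_index_pmf(b, f, r)
mean = sum(x * p for x, p in pmf.items())
closed_form = ((b - 1) * f + 1) ** r - 1  # Eq. (5.4)
print(mean, closed_form)
```

Enumeration costs b^r terms, so this is only practical for small b and r; it is meant as a sanity check of the formulas rather than as a way to reproduce Fig. 18.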
[Figure: Pr{$X^s$} and E[$X^s$] (log scale) versus $X^s$ = 1 to 4, for b=8, b=16, and b=32.]
Fig. 18. Probability of $X^s$, the number of f-indexes, in an SS. $n = 2^{16}$. Required
$f=2^{-10}$ for b=2.
Fig. 18 shows the probabilities for three base systems ($2^3$, $2^4$, and $2^5$) derived
from Eqs. (5.2) and (5.4). For a fair comparison, the memory sizes $M_{2^3}$, $M_{2^4}$,
and $M_{2^5}$ are set equal, so that the inequality $f_{2^3}>f_{2^4}>f_{2^5}$ is satisfied
based on Lemma 2, where $f_{2^3}$, $f_{2^4}$, and $f_{2^5}$ are the f-positive rates of each
MBF in base-$2^3$, base-$2^4$, and base-$2^5$, respectively. The curves in Fig. 18 do not
decrease monotonically because of the binomial coefficient in the binomial distribution
$B(b-1, f)$. However, the average value of $X^s$ from Eq. (5.4) decreases as the base b
increases, and for b>32 the probability of Eq. (5.2) decreases monotonically due to the
memory gained by a larger base number system.
2. False indexing in a US in an MBHT
In addition to ensuring a low probability of more than one access to a key table in
an SS, a design of an HT must also ensure that this probability is low in a US. Unlike
an SS, a US has no valid index, which means that every BF returning ’yes’ produces
an f-positive. However, by the definition of an f-index, each column must have at least
one BF returning ’yes’ as an f-positive; otherwise a group of f-positives cannot
constitute an f-index. Therefore, we expect a much lower probability because of the
product of the independent f-positive probabilities of the BFs.
Let $X^u_i$ denote a random variable for the number of f-positives from the BFs at
column $i$. Then $X^u_i$ follows a binomial distribution $B(b, f)$ due to the i.i.d.
f-positives of the BFs. Also, suppose the random variable $X^u$ is the number of
f-indexes in a US on an MBHT. Then $X^u$ can be formulated with the $X^u_i$ as
$X^u = \prod_{i=0}^{r-1} X^u_i$. In general, the probability of $X^u$ becomes

\[
\begin{aligned}
\Pr\{X^u = x\}
 &= \sum_{x_0\cdots x_{r-1}=x} \Pr\{X^u_0 = x_0, \cdots, X^u_{r-1} = x_{r-1}\} \\
 &= \sum_{x_0\cdots x_{r-1}=x} \Pr\{X^u_0=x_0\}\cdot\Pr\{X^u_1=x_1\}\cdots\Pr\{X^u_{r-1}=x_{r-1}\}.
\end{aligned}
\tag{5.5}
\]
For example, the probability that there is no f-index can be calculated in the
complementary way, as in the following:

\[
\Pr\{X^u = 0\} = 1 - \prod_{t=0}^{r-1} \Pr\{X^u_t \geq 1\}
               = 1 - \prod_{t=0}^{r-1} \left(1 - \Pr\{X^u_t = 0\}\right).
\]
Also, in the cases of one, two, and three f-indexes, the probabilities are derived by
factorization, as for Pr{$X^s$}. Similarly, in the case of four f-indexes, there are two
possibilities of factoring, i.e. $4\times1\times\cdots\times1$ and
$2\times2\times1\times\cdots\times1$. Thus, the probability becomes the summation of
two cases as follows:

\[
\begin{aligned}
\Pr\{X^u = 4\}
 &= \sum_{0 \le t_1 < t_2 \le r-1}
    \Big( \Pr\{X^u_{t_1}=2\}\cdot\Pr\{X^u_{t_2}=2\} \prod_{t' \neq t_1, t_2} \Pr\{X^u_{t'}=1\} \Big) \\
 &\quad + \sum_{t=0}^{r-1}
    \Big( \Pr\{X^u_t = 4\} \prod_{t' \neq t} \Pr\{X^u_{t'} = 1\} \Big).
\end{aligned}
\]
Finally, the mean of the random variable $X^u$ can be calculated with the i.i.d. property:

\[
E[X^u] = \sum_{t=0}^{n} t\cdot\Pr\{X^u = t\} = E\Big[\prod_{i=0}^{r-1} X^u_i\Big] = [bf]^r.
\tag{5.6}
\]

Fig. 19 shows the probabilities for three base systems ($2^3$, $2^4$, and $2^5$) derived from
Eqs. (5.5) and (5.6) as Fig. 18 does.
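The product distribution of Eq. (5.5) can also be built column by column instead of by full enumeration. The sketch below does this and checks the special cases discussed above (no f-index, four f-indexes, and the mean of Eq. (5.6)); the parameter values and the function names are hypothetical and serve only as an illustration.

```python
from math import comb, isclose


def binom_pmf(n, p):
    """Return [Pr{X = 0}, ..., Pr{X = n}] for X ~ B(n, p)."""
    return [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]


def us_false_index_pmf(b, f, r):
    """pmf of X^u = prod_i X^u_i with X^u_i ~ B(b, f) i.i.d., folded per column."""
    col = binom_pmf(b, f)
    pmf = {1: 1.0}  # value of the empty product
    for _ in range(r):
        nxt = {}
        for value, p in pmf.items():
            for x, q in enumerate(col):
                nxt[value * x] = nxt.get(value * x, 0.0) + p * q
        pmf = nxt
    return pmf


# Illustrative parameters only (not from the text).
b, f, r = 4, 0.01, 3
col = binom_pmf(b, f)
pmf = us_false_index_pmf(b, f, r)

no_index = 1 - (1 - col[0]) ** r                   # complementary form for Pr{X^u = 0}
four = (comb(r, 2) * col[2] ** 2 * col[1] ** (r - 2)
        + r * col[4] * col[1] ** (r - 1))          # the two factorings of 4
mean = sum(x * p for x, p in pmf.items())          # should equal (b f)^r, Eq. (5.6)
print(pmf[0], no_index, pmf[4], four, mean, (b * f) ** r)
```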
Algorithm 5: query(MBHT, e)
Input: An MBHT and key e
Output: Set of $A_b = a_0 \cdots a_{r-1}$ including false indexes
for column i = 0 to r-1 in an MBHT do
    for t = 0 to b-1 do
        if e ∈ t-BF_i then                  /* i.e. $M^i_{on}$[][t]==1 */
            $S_{A_i}$ = $S_{A_i}$ ∪ {t};
        end
    end
end
$S_A$ = ∅;                                  /* Set of a t-index and f-indexes */
$S_A$ = make_paths($S_{A_0}$, ···, $S_{A_{r-1}}$);
return $S_A$;                               /* No off-chip memory access */
[Figure: Pr{$X^u$} and E[$X^u$] (log scale) versus $X^u$ = 1 to 3, for b=8, b=16, and b=32.]
Fig. 19. Probability of $X^u$, false memory access, in a US. $n = 2^{16}$. Required
$f=2^{-10}$ for b=2.

The query operation shown in Algorithm query only considers the on-chip
operation, and it needs to be called twice, on the l-MBHT and the r-MBHT. Therefore,
the averages of the random variables $\mathcal{X}^u$ and $\mathcal{X}^s$ for a US and an
SS, respectively, using two MBHTs are the following, based on Eqs. (5.6) and (5.4):

\[
E[\mathcal{X}^u] = 2\cdot E[X^u] \quad\text{and}\quad E[\mathcal{X}^s] = E[X^u] + E[X^s],
\tag{5.7}
\]

because for a US neither MBHT has the wanted key, and for an SS one of the
MBHTs does not have the wanted key. Function make_paths makes candidate indexes
based on the sets $S_{A_i}$, $0\le i\le r-1$. For example, given inputs $\{2_4\}$,
$\{1_4, 3_4\}$, $\{0_4\}$ for $S_{A_0}$, $S_{A_1}$, and $S_{A_2}$ in a base-4 system, it
returns the address set $\{210_4, 230_4\}$ by concatenating one member from each
$S_{A_i}$. The time complexity of the overall query is Θ(1) on the condition
that function make_paths is performed in constant time, which is possible. Also, the
time complexities of accessing off-chip memory depend on Eqs. (5.4) and (5.6) for an SS
and a US.
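As an illustration of the query and of make_paths described above, here is a minimal software sketch. It is not the hardware design: plain Python sets stand in for the t-BFs (so false positives appear only where they are placed by hand), addresses are returned as tuples of base-b digits, and the names `column_filters` and `query` are assumptions made for this example.

```python
from itertools import product


def make_paths(column_sets):
    """Concatenate one digit from every column set into candidate addresses.

    A column with no 'yes' makes the Cartesian product empty, so no index results.
    """
    return set(product(*(sorted(s) for s in column_sets)))


def query(column_filters, key):
    """column_filters[i][t] is a membership structure standing in for t-BF_i."""
    column_sets = [
        {t for t, bf in enumerate(column) if key in bf}
        for column in column_filters
    ]
    return make_paths(column_sets)


# The base-4 example from the text: S_A0 = {2}, S_A1 = {1, 3}, S_A2 = {0}.
paths = make_paths([{2}, {1, 3}, {0}])
print(sorted(paths))  # the two candidate addresses 210 and 230 in base 4

# The same result through query(): key "e" is reported by BF 2 in column 0,
# by BFs 1 and 3 in column 1 (one of them standing for a false positive),
# and by BF 0 in column 2.
filters = [
    [set(), set(), {"e"}, set()],
    [set(), {"e"}, set(), {"e"}],
    [{"e"}, set(), set(), set()],
]
candidates = query(filters, "e")
```

In hardware the per-column tests run in parallel; the software loop only mirrors the data flow, not the timing.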
The query described above returns the set $S_A$ of candidate indexes to probe in a
key table. However, the size of $S_A$ is designed to be 1 with high probability, to
sustain the bandwidth requirement of a high-speed router. According to Eq. (3.2), once
each MBF has the memory size appropriate to $w_M$ for a given bandwidth requirement,
it gives practically no f-positives, resulting in no f-index. That is, the requirement of
160Gbps needs deterministic lookups of 500M keys (packets) in a second without an
f-positive, implying that the f-positive rate should be as low as 1/500M = 2n
(i.e. $2\times10^{-9}$), without consideration of the binomial coefficients in $X^s$ and $X^u$.
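The numbers quoted above can be written out as a short worked calculation. The packet size of 320 bits (40 bytes) is not stated in the text; it is only what the two quoted figures imply when taken together, so it should be read as an inference.

```latex
\[
\frac{160\times10^{9}\ \text{bit/s}}{500\times10^{6}\ \text{packets/s}}
  = 320\ \text{bits per packet} = 40\ \text{bytes},
\qquad
f \lesssim \frac{1}{500\times10^{6}} = 2\times10^{-9} \approx 2^{-28.9}.
\]
```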
3. Hardware consideration for pipelining
Previously, in insert and query, the MBFs were operated in parallel. However, in a
hardware configuration, pipelining the MBFs from column 0 to column r-1 is better than
operating them in parallel in terms of operational power, for both an SS and a US. The
reason is that in one column shown in Fig. 16 only one MBF among the b MBFs of base-b
is involved in insert and is to return ’yes’ in a query, while the rest are to return ’no’.
Also, in the case of a US, all b MBFs in one column are to return ’no’ in a query. These
situations hold for all MBFs in a column-wise view. That is, although an SS needs
to search all columns, in the case of a US, if none of the MBFs in a column returns
’yes’, the MBFs in the following columns do not have to perform the query, and
there is no t-index or f-index by Definitions 3 and 4. Thus, two ways of pipelining,
i.e. in order of the MBFs in one column and then in order of columns, can maximize the
power efficiency compared with probing all MBFs at the same time.
[Figure: upper panel, average number of steps versus $k_s$ = 2 to 6; lower panel, average
number of bits to probe versus $k_s$ = 2 to 6; curves for k=12 and k=24.]
Fig. 20. The benefit of pipeline in an MBF returning ’no’ in a query for two cases of
k=12 or 24.

Fig. 20 shows the benefit of a pipelined MBF by measuring the average number
of steps to proceed and the average number of bits to probe under one MBF. Given a
required precision w=k for an MBF based on Eq. (3.2), $k_s$ bit locations are probed in
one step, so that there are $k/k_s$ steps to proceed in a query of a US. The upper figure
shows how many steps proceed until the last step returns ’no’ in a US. By virtue
of the principle of a BF, i.e. several hash functions, using a larger value of $k_s$ results
in fewer steps. However, the average number of total bits to probe
until the last step of ’no’ shows the reverse trend, as shown in the bottom figure, where
the average number of bits to probe is calculated as $k_s\times$(average # of steps). That is,
probing a smaller number of bits step by step results in fewer memory reads,
implying that pipelining in an MBF needs less power. Note that this benefit applies
only to an MBF returning ’no’ in a query. However, this benefit is multiplied by the
remaining b-1 MBFs in one column as well as the MBFs in other columns, so that the
total benefit of a pipelined MBHT becomes $(b-1)\times r$ times in an SS and $b\times r$ times in
a US larger than that of a single pipelined MBF, where $r=\log_2 n$.
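The trade-off between steps and probed bits can be reproduced with a simple model. The sketch below is an assumption-laden illustration, not the measurement behind Fig. 20: it assumes every probed bit of a non-member key is set independently with probability `p_one` (0.5, i.e. a half-full filter), and that the pipeline stops at the first step that reads a 0. The function name `pipeline_cost` is hypothetical.

```python
def pipeline_cost(k, ks, p_one=0.5):
    """Expected (steps, bits probed) for a non-member key in a pipelined BF.

    k bits are probed in total, ks per step; when ks does not divide k the
    last step probes the remaining bits. The pipeline stops at the first 0.
    """
    full, rest = divmod(k, ks)
    step_bits = [ks] * full + ([rest] if rest else [])
    exp_steps = exp_bits = 0.0
    reach = 1.0  # probability that the pipeline reaches the current step
    for bits in step_bits:
        exp_steps += reach
        exp_bits += reach * bits
        reach *= p_one**bits  # every probed bit was 1, so the next step runs
    return exp_steps, exp_bits


for ks in (2, 3, 4, 6):
    steps, bits = pipeline_cost(12, ks)
    print(ks, round(steps, 4), round(bits, 4))
```

Under this model a larger `ks` lowers the expected number of steps but raises the expected number of probed bits, which is the qualitative behaviour the text describes; when `ks` divides `k`, the expected bits equal `ks` times the expected steps.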
[Figure: the l-MBHT and the r-MBHT with the l/r reg. and the queue of free indexes (head,
tail), shown before and after delete.]
Fig. 21. An example of delete for item e located at $012_4$ in base-4.
E. Delete Operation in MBHTs
Unlike the two kinds of searches in query operations, we consider the delete operation
only for a successful deletion. The delete operation needs two query operations, on both
the l-MBHT and the r-MBHT, where only one of the MBHTs has the relevant key. Fig. 21
shows an example of delete with n=64. Initially, the l-MBHT has been fully used for
insert, and the l/r-reg. indicates the r-MBHT for the next n insert operations. After
deletions of the keys located at indexes $113_4$ and $132_4$ of a key table, the queue
holds $113_4$ and $132_4$ as candidate indexes for further insertions. Suppose key e was
inserted in 0-BF$_0$, 1-BF$_1$, and 2-BF$_2$, the checked boxes in the figure. Once key e
for the delete operation is confirmed by accessing a key table with the address $012_4$,
the address is put on the queue for a future insert.
Like the query operation, if there are an f-index and a t-index associated with
a key, two accesses to a key table are necessary in delete. Therefore, when the random
variable $Z$ denotes the number of accesses to a key table with both MBHTs,
the average memory access for a delete operation on the condition that the item
exists, i.e. a successful delete, is

\[
E[Z] = \Big( \sum_{v=1}^{n} v\cdot\Pr\{X^u = v\} \Big)
     + \Big( 1 + \sum_{v=1}^{n-1} v\cdot\Pr\{X^s = v\} \Big)
     = [bf]^r + [1 + (b-1)f]^r,
\tag{5.8}
\]

where the first term accounts for a US in one of the MBHTs based on Eq. (5.5), while
the second term accounts for an SS in the other based on Eq. (5.2).
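For a feel of the magnitude of Eq. (5.8), the closed form can be evaluated directly. This is a minimal sketch; the function name and the parameter values (b=16, f=2^-10, r=4) are illustrative assumptions.

```python
def expected_delete_accesses(b, f, r):
    """E[Z] of Eq. (5.8): key-table accesses for a successful delete."""
    us_term = (b * f) ** r            # E[X^u], the MBHT that lacks the key
    ss_term = (1 + (b - 1) * f) ** r  # 1 + E[X^s], the MBHT that holds the key
    return us_term + ss_term


print(expected_delete_accesses(16, 2**-10, 4))
```

With no false positives (f = 0) the value is exactly 1, i.e. the single access that confirms the key.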
The detailed procedure of the delete operation is shown in Algorithm delete.
The complexity in on-chip memory is O(1) because the complexity of query used in
the algorithm is O(1). The complexity of memory access is O(E[Z]) on average for
a successful delete, which is constant because $E[\mathcal{X}^s]$ is O(1), while the
complexity for an FHT is $O(nk^2/m + k)$.
Algorithm 6: delete(l-MBHT, r-MBHT, e)
Input: Two MBHTs and key e
Result: Update associated BF in each column
$S_{l\text{-MBHT}}$ = query(l-MBHT, e);                 /* Only on-chip Op. */
$S_{r\text{-MBHT}}$ = query(r-MBHT, e);                 /* Only on-chip Op. */
for A ∈ $S_{l\text{-MBHT}}$ ∪ $S_{r\text{-MBHT}}$ do    /* A = $a_0 a_1 \cdots a_{r-1}$ */
    if $M_{key}$[A] == e then               /* On-chip Mem. Acc. */
        $M_{key}$[A] = ∅;
        push(A, queue);                     /* push A to Q. for insert */
    end
end
F. Analysis and Simulation for an MBHT
This section presents analyses of memory efficiency and the average access time per
query to a key table for four schemes: an LHT, an FHT, a corrected FHT (c-FHT), and
an MBHT. Also, the phenomenon of duplicated keys in an FHT is analyzed. Finally, one
simulation is performed to determine the on-chip memory usage for an IP lookup
application with BGP tables available from [51, 52]. Among a class of universal hash
functions, a hardware scheme in [59] is adapted for simulation.
1. Average access time of query
[Figure: three panels. (a) and (b): probability (log scale) versus # of cycles to process
memory reads (1 to 3) for LHT, c-FHT, FHT, and MBHT (b=8, b=16). (c): avg. access per
search (log scale) versus successful search probability $p_s$ (log scale) for an LHT, a
corrected FHT, an FHT, and an MBHT.]
(a) Probability of cycles to process mem. reads to a table or a linked list in an SS.
For an MBHT, Pr{$\mathcal{X}^s$}.
(b) Probability of cycles to process mem. reads to a table or a linked list in a US.
For an MBHT, Pr{$X^u$}.
(c) Avg. access time as a function of successful-search rate.
Fig. 22. Probabilities of memory access in an SS and a US and the average access
time to off-chip for an LHT, an FHT, and an MBHT with the same memory
$128K \log_2 n$ to fully utilize the saved memory for increase in precisions of
base-8 and base-16. k=10, and n=64K.
Let us define the average access time to off-chip as the number of accesses to
off-chip memory for a given query operation. For an LHT with chaining, the load factor,
$\alpha_L$, can be given as $n/m_L$, where n is the number of items and $m_L$ is the
number of buckets used to point to an address of off-chip memory after hashing. Let
$T^s_L$ and $T^u_L$ denote the average access times for an SS and a US, respectively,
which are defined in [60] as

\[
T^s_L = 1 + \alpha_L/2 - 1/(2m_L), \qquad T^u_L = \alpha_L.
\]

To evaluate the average access time over both an SS and a US, another parameter
$p_s$, which denotes the frequency of an SS of a key in off-chip memory, is introduced.
With these notations, the average access time $T_L$ for an LHT can be expressed as

\[
T_L = p_s T^s_L + (1-p_s) T^u_L
    = p_s \Big( 1 + \frac{n-1}{2m_L} \Big) + (1-p_s) \frac{n}{m_L}.
\tag{5.9}
\]
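Eq. (5.9) is straightforward to evaluate. The sketch below is an illustration with hypothetical values of n and m_L (the bucket count is an assumption, not a figure from the text) and shows how the average access time of an LHT moves between the US case and the SS case as p_s changes.

```python
def lht_access_time(n, m_l, ps):
    """Average access time T_L of Eq. (5.9) for an LHT with chaining."""
    alpha = n / m_l                          # load factor alpha_L
    t_ss = 1 + alpha / 2 - 1 / (2 * m_l)     # successful search, from [60]
    t_us = alpha                             # unsuccessful search, from [60]
    return ps * t_ss + (1 - ps) * t_us


n, m_l = 65536, 131072  # hypothetical sizes for illustration
for ps in (0.0, 0.5, 1.0):
    print(ps, lht_access_time(n, m_l, ps))
```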
For an FHT, let $E_p$ be the expected length of a linked list in the FHT for an
item in a positive match and $E_f$ be the expected length of a linked list in the FHT for
an f-positive match. $E_p$ can be derived from the average number of items for which
all buckets’ length $> j$, or $n\cdot B((n-1)\cdot k, 1/m_F, >(j-1))^k$, where
$B(n, 1/m_F, >j) = 1 - \sum_{i=0}^{j} \binom{n}{i} (1/m_F)^i (1-1/m_F)^{n-i}$ and $m_F$
is the number of buckets in an FHT. Also, $E_f$ can be derived from Eq. (9) in [15].
Therefore, the average access time $T_F$ for an FHT is

\[
T_F = p_s E_p + (1-p_s) f E_f = p_s E_p + (1-p_s) (1/2)^{(m_F/n)\ln 2} E_f,
\tag{5.10}
\]

where f is the f-positive probability in a shared linked list.
Finally, for an MBHT, Eqs. (5.7) are used to get an average access time $T_M$ as
the following:

\[
T_M = p_s E[X^s] + (1-p_s) E[X^u].
\tag{5.11}
\]
The average access times of other schemes can be calculated as easily as Eq. (5.11) of
an MBHT. Based on [60] and [15], those of an LHT and an FHT are related with the
load factor deﬁned as the number of keys over the number of buckets, i.e. n/m. Note
that an FHT considers neither buckets in bits used for pointers to oﬀ-chip memory
nor counters of linked lists in bits.
Fig. 22 shows the probabilities of memory access in an SS and a US and the
average access time calculated from Eq. (5.11), in terms of the number of oﬀ-chip
memory accesses for four schemes under diﬀerent successful search rates. Note that
a modiﬁed FHT is marked as c-FHT, considering counters and the existing FHT is
marked as FHT, not considering counters as memory. In Fig. 22 (a) it is shown that
Pr{Xs} of an MBHT is always less than those of an LHT and an FHT. Fig. 22 (b)
shows the probabilities Pr{Xu} of an MBHT and other schemes. Particularly, the
result in Fig. 22 (c) indicates that the lower the successful search rate, the better the
performance of the proposed MBHT is than those of an LHT and an FHT.
Table IV. Complexities of operations to off-chip in four schemes.

Operation   insert             query    delete
LHT         O(1)               O(1)     O(1)
FHT         $O(nk^2/m + k)$*   O(1)     $O(nk^2/m + k)$*
MBHT        Θ(1)◦              Θ(1)†    Θ(1)

* In optimal configuration, O(k).
◦ In detail, that is just 1.
† In detail, $\Theta(p_s(1 + E[\mathcal{X}^s]) + (1-p_s)E[\mathcal{X}^u])$.
Table IV summarizes the complexities of off-chip memory access for the insert,
query, and delete operations in an LHT, an FHT, and an MBHT. The main difference
lies in an FHT, whose insert and delete operations have complexities that depend
on the variables n, k, b, and m. In contrast, the complexities of an
MBHT and an LHT are constant.
2. Memory usage
This section presents and compares the on-chip and off-chip memory usage of each
scheme. Given the f-positive rate $f=2^{-w}$ and the number of elements n, the memory
usages in bits of an LHT and an FHT are the following:

\[
M_L = \log_2 n \times 1.44\,n\,w_L, \quad\text{and}\quad
M_F = \{\log_2 n + 4\} \times \{1.44\,n\,w_F\},
\]

respectively, where 4 in $M_F$ accounts for the number of bits in a counter, and
$w_F=w_L=w$. The memory efficiency ratio $R_{M,F}$ of $M_F$ to $M_M$, whose value is
derived from Eq. (5.1), becomes

\[
R_{M,F} = \frac{M_F}{2\{\log_b n \cdot 1.44\,n(\log_2(1/f)+\alpha)\}}
        = \frac{x(\log_2 n+4)\,w}{2\cdot\log_2 n\,(w+\alpha)}
        \approx \frac{x(\log_2 n+4)}{2\log_2 n},
\tag{5.12}
\]

where $\alpha=\log_2(b-1)$ is due to the coefficients in the binomial functions of $X^s$
and $X^u$, and the size of the queue for free addresses in off-chip memory, $n\log n$, is
not considered. Also, the memory efficiency ratio $R_{M,L}$ of $M_L$ to $M_M$ is

\[
R_{M,L} = \frac{M_L}{2\times\{\log_b n \cdot 1.44\,n(\log_2(1/f)+\alpha)\}}
        = \frac{x}{2}\cdot\frac{w}{w+\alpha}
        \approx \frac{x}{2},
\tag{5.13}
\]

where $w_F$ and $w_M$ need to be set to the w of a given precision requirement for a fair
comparison with each other.
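The two ratios can be computed directly from the memory expressions that appear in Eqs. (5.12) and (5.13). The sketch below is an illustration with hypothetical function names; it takes the MBHT memory to be the denominator shown in those equations, and it uses the identity log_b n = log_2 n / log_2 b, under which the factor x in the simplified forms corresponds to log_2 b.

```python
from math import log2


def lht_bits(n, w):
    return log2(n) * 1.44 * n * w            # M_L


def fht_bits(n, w):
    return (log2(n) + 4) * 1.44 * n * w      # M_F, with 4 counter bits


def mbht_bits(n, b, w):
    alpha = log2(b - 1)
    log_b_n = log2(n) / log2(b)
    return 2 * (log_b_n * 1.44 * n * (w + alpha))  # denominator of Eqs. (5.12)-(5.13)


def ratios(n, b, w):
    """Return (R_ML, R_MF) for n keys, base b, and precision w."""
    m = mbht_bits(n, b, w)
    return lht_bits(n, w) / m, fht_bits(n, w) / m


for b in (4, 8, 16):
    print(b, ratios(2**16, b, 20))
```

With w = 20, the LHT ratio is below 1 for base-4 and above 1 from base-8 onward, and for base-16 it is about 1.67.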
[Figure: memory efficiency ratios $R_{M,L}$ and $R_{M,F}$ (0 to 3.5) plotted over b ($\log_2$
scale, 1 to 6) and n ($\log_2$ scale, 12 to 32).]
Fig. 23. Memory efficiency ratios $R_{M,L}$ and $R_{M,F}$ with various b and n. $w_F=w_M=20$.
Note that although an MBHT is set to have the same average access as the others,
the actual average access times differ from each other, as shown in Fig. 22.
Fig. 23 shows the two ratios, $R_{M,F}$ and $R_{M,L}$, calculated from Eqs. (5.12) and (5.13)
in the range $[2 : 2^6]$ for base-b and in the range $[2^{10} : 2^{30}]$ for n. The figure shows
that the turning point for a better memory efficiency ratio begins
at $b=2^3$, due to the use of a set of two MBHTs. Also, even with large values of the
coefficients in the binomial functions $B(b-1, f)$ and $B(b, f)$, the memory gains of both
ratios increase as b increases. Given b, the memory gain does not change much over the
range of n, as shown in the figure, whereas for a given n the memory gain changes
markedly along the b axis. Thus, compared to an LHT and an FHT, Fig. 23 shows
that an MBHT approach can gain much memory as long as a larger base number
system is used.
As one application of an MBHT to packet processing, i.e. URL switching, we
used NePSim [61] for URL switching where all the incoming packets to a switch are
parsed and forwarded according to URL. This kind of switching is a commonly used
content-based load balancing mechanism [62, 63]. Kachris et al. [63] used a simple
XOR hash to reduce the collisions among Block RAMs in connection manager for
web switching, and Prodanoﬀ et al. in [64] proposed URL signatures using CRC32
to reduce the size of routing tables and aggressive hashing with chaining of a linked
list to speed-up routing lookups in large-scale content distribution networks.
Table V shows the memory size in bytes of an LHT, an FHT, and an MBHT for
three trace databases on the condition that the required f is $2^{-20}$ and the load
factor accordingly becomes 0.034. The traces of UC Berkeley, NLANR, and CA*netII
have 149,344, 504,967, and 2,552,045 URLs, respectively. The result shows an on-chip
memory reduction of at most 1.7 times for an MBHT in base-16 against an LHT, as shown
in the table. Compared with an FHT, a memory reduction of about 2 times
is observed because the counter bits of an FHT are taken into account.

Table V. On-chip memory usage for three traces. The load factor is 0.034, K=1024.

URL Traces         LHT         FHT         MBHT base-8   MBHT base-16
UC Berkeley [65]   9024KB      11124KB     6860KB        5393KB
NLANR [66]         33634KB     40735KB     25570KB       20102KB
CA*netII [67]      190953KB    226841KB    145171KB      114127KB

While the authors of [61] validated NePSim with SDRAM, SRAM, and six microengines
against the IXP1200 architecture in terms of performance and power, here the
number of accesses to SDRAM with the NLANR trace is measured on the condition that
an LHT, an FHT, and an MBHT are implemented in SRAMs. In particular, given
a query, an MBHT returns indexes with a set of SRAMs. Table VI shows the
measured accesses to SDRAM in NePSim with NLANR. The first row is the average
access for a successful search (AAS). While an FHT needs 2.4E-3 extra accesses on
average for a successful search, the proposed MBHT with b=16 needs 9.2E-5. Although
this value is minuscule, the gap between the total numbers of
off-chip accesses of an FHT and an MBHT is 2117.

Table VI. AAS in a successful search of NLANR trace for three schemes. $f=2^{-10}$.

Schemes       LHT         FHT         MBHT (b=8)   MBHT (b=16)
AAS*          1.026306    1.002472    1.002411     1.000092
# of Acc.◦    968861.7    946231.9    946303.9     944114.4

◦ It means the total number of off-chip accesses provided the URL queries of NLANR.
CHAPTER VI
A HIERARCHICALLY INDEXED HASH TABLE
Unlike an FHT using a BF, a hierarchical indexing tree (HIT) is conceptually
embedded, with a set of BFs, into an HT of smaller memory size than an FHT. That is,
the memory area of an HT used for pointers to a key table is partitioned to make an
HIT. An HIT for n keys, where n is a power of 2, is composed of $s=\log_2 n$ layers
(i.e. SRAM modules) and partitions the address space into a rectangle of $n\times s$ 0/1
bits, so that a BF covers a column group of the same bits, either 0 or 1, in the index
address space to a key table. The details of how to build an HIT are as follows.
A. Building a Conceptual HIT in Stacked SRAMs
[Figure: memory architecture and conceptual tree construction. The binary index space
(MSB to LSB) of a key table holding $e_0$ to $e_7$ is partitioned by the BFs $B^i_j$ of the
on-chip modules $M^i_{on}$, with the key table $M_{key}$ and the off-chip rule table $M_{off}$;
the partition is redrawn as a 0-tree and a 1-tree, with the path for $e_4$ marked.]
Fig. 24. Basic configuration of hierarchical indexing tree of 0- and 1-tree.
Fig. 24 shows a hierarchical partition for an HIHT. Let $B^i_j$ denote the j-th BF
in layer i, hereinafter $0\le i\le s-1$, and let all n keys be filled in a key table in on-chip
memory sequentially from index $0_0 0_1 \ldots 0_{s-1}$ to index $1_0 1_1 \ldots 1_{s-1}$. If key
$e\in S$ is to be inserted at index address $A = a_0a_1\ldots a_{s-1}$, where $a_t\in\{0, 1\}$,
$0\le t\le s-1$, a BF, denoted $B^i_{a_0\cdots a_i}$, at each layer i is involved to encode key e
as a legacy BF does. In this hierarchical partitioning and encoding, $B^i_j$ at each layer i
takes care of $n^i=n/2^{i+1}$ keys of set S. That is, $B^i_j$ covers the $n^i=n/2^{i+1}$ keys
from $j\cdot2^{s-1-i}$ to $(j+1)\cdot2^{s-1-i}-1$ in index address space. For instance,
$B^0_0$, $B^1_2$, and $B^2_7$ take care of the sets $\{e_0, \cdots, e_3\}$, $\{e_4, e_5\}$, and
$\{e_7\}$, respectively. Eq. (3.2) states that m is linearly proportional to n for a given f.
Thus, given $f^i=2^{-w^i}$ for a BF on layer i, where $w^i$ is the precision of a BF on layer i,
the total size in bits of memory $M^i_{on}$ for the BFs on layer i is
$2^{i+1}(1.44\,n^i w^i)\times1$. Finally, an embedded HIT is comprised of a 0-tree and a
1-tree, covering the half of the n keys located at $0x_1\ldots x_{s-1}$ of a key table and the
remaining half at $1x'_1\ldots x'_{s-1}$, where $x_t, x'_t\in\{0, 1\}$, $1\le t\le s-1$.
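The mapping between an index address and the BFs that cover it reduces to bit shifts. The following minimal sketch (the function names are hypothetical) computes, for an s-bit address, the BF index j on each layer and the index range a given BF covers, and reproduces the examples given above for s = 3.

```python
def bf_index(address, s, layer):
    """Index j of the BF B^layer_j on the path of `address`: j = a_0 ... a_layer."""
    return address >> (s - 1 - layer)


def covered_range(j, s, layer):
    """Key-table indexes covered by B^layer_j: j*2^(s-1-layer) .. (j+1)*2^(s-1-layer)-1."""
    width = 1 << (s - 1 - layer)
    return range(j * width, (j + 1) * width)


s = 3
print([bf_index(0b100, s, i) for i in range(s)])  # BFs on the path of index 100_2
print(list(covered_range(0, s, 0)))               # B^0_0
print(list(covered_range(2, s, 1)))               # B^1_2
print(list(covered_range(7, s, 2)))               # B^2_7
```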
B. Insert Operation in an HIT
Fig. 24 shows the basic structure of our HIT, consisting of 3 layers of BFs for 8 keys.
The left side of Fig. 24 shows the binary address space with a set of BFs partitioning
the address space of a key table, and the right side shows the transformed dual trees,
a 0-tree and a 1-tree, where each node represents a $B^i_j$. For example, the insertion
of key $e_4$ at index address $100_2$ involves $B^0_1$ at layer 0, $B^1_2$ at layer 1, and
$B^2_4$ at layer 2, since $j=a_0\cdots a_i$ at each layer i.
Procedure insert shows the detailed insert operation, which is as simple as that
for a BF. Although conceptually all BFs in an HIT are separate from each other, for
hardware implementation assume that the BFs on layer i are embedded in one on-chip
memory module $M^i_{on}$, as shown in Fig. 24, and that there are $s=\log_2 n$ memory
modules. The base address for $B^i_j$ is easily calculated, as shown in line 3. The first
(outer) for loop in Procedure insert is executed in parallel and pipelined
at each layer. Also, the second for loop is done in parallel, as in a legacy BF.
Procedure insert
Input: key e, rule r, and its given address $A = a_0a_1\cdots a_{s-1}$ in a binary mode
Output: Encoded HIT for key e
1  for layer i = 0 to s-1 do                /* On-chip Op. */
2      $m^i = 1.44\,n^i w^i$;  $j = a_0\cdots a_i$;
3      idx = $j\cdot m^i$;                  /* Find right base index idx for $B^i_j$ */
4      for t = 0 to k-1 do                  /* One M. for BFs on layer i */
5          $M^i_{on}$[idx + $h_t(e)$] = 1;  /* $h_t(e)\in\{0, \cdots, m^i-1\}$, $M^i_{on}$ of $2^{i+1}m^i\times1$ bits */
6      end
7  end
8  $M_{off}$[A] = r;  $M_{key}$[A] = e;
Therefore, the time complexity in on-chip memory is Θ(1) on the condition that
the hash functions return indexes in constant time and each layer conducts hashing
in parallel. Moreover, the number of off-chip memory accesses to $M_{off}$ is exactly 1,
because key e and its associated rule are saved in $M_{key}$ and $M_{off}$ at the designated
address A, as shown in line 8. Thus, the complexity of Procedure insert for off-chip
memory access is Θ(1). In contrast, an FHT claimed a time complexity of
$O(nk^2/m + k)$.
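A software sketch of Procedure insert may help to see how the base index of line 3 and the per-layer modules fit together. Everything beyond the procedure itself is an assumption made for illustration: the class name `HIT`, the use of SHA-256 as a stand-in for the k hardware hash functions, one byte per filter bit, a single precision w for all layers, and a simple `query` that follows every prefix whose BF answers ’yes’ (the text describes the query operation in a later section).

```python
import hashlib
from math import ceil


class HIT:
    def __init__(self, s, w, k):
        self.s, self.k = s, k
        n = 1 << s
        # m^i = 1.44 * n^i * w bits per BF on layer i, with n^i = n / 2^(i+1).
        self.m = [ceil(1.44 * (n >> (i + 1)) * w) for i in range(s)]
        # One module per layer holds all 2^(i+1) BFs of that layer.
        self.layers = [bytearray((1 << (i + 1)) * self.m[i]) for i in range(s)]
        self.keys = [None] * n    # stands in for M_key
        self.rules = [None] * n   # stands in for M_off

    def _hashes(self, key, layer):
        # Hypothetical hash family: k values in {0, ..., m^layer - 1}.
        for t in range(self.k):
            digest = hashlib.sha256(f"{layer}:{t}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m[layer]

    def insert(self, key, rule, address):
        for i in range(self.s):
            j = address >> (self.s - 1 - i)   # j = a_0 ... a_i
            idx = j * self.m[i]               # base index of B^i_j (line 3)
            for h in self._hashes(key, i):
                self.layers[i][idx + h] = 1   # line 5
        self.rules[address] = rule            # line 8
        self.keys[address] = key

    def query(self, key):
        """Return every address whose BFs on all layers report 'yes'."""
        prefixes = [0]
        for i in range(self.s):
            hashes = list(self._hashes(key, i))
            nxt = []
            for p in prefixes:
                for bit in (0, 1):
                    j = (p << 1) | bit
                    base = j * self.m[i]
                    if all(self.layers[i][base + h] for h in hashes):
                        nxt.append(j)
            prefixes = nxt
        return prefixes


hit = HIT(s=3, w=20, k=4)
hit.insert("e4", "rule4", 0b100)
print(hit.query("e4"))
```

The software loops are sequential, whereas the text assumes that the layers and the k hash probes operate in parallel in hardware.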
C. Delete Operation in dual HITs
The delete operation is not as easy as insert, because a basic BF does not support
deletion of a key that was inserted in the BF. However, dual HITs, an l-HIT and an
r-HIT, in on-chip memory are used to rotate between a target HIT for insert operations
and another target HIT for delete operations, as shown in Fig. 25. Once one HIT is full
with the previous n keys, the query operation stays with that HIT. But if set S is
dynamic, though limited in size to n, a new HIT takes care of the insert of a new key by
setting BFs in the new HIT as well as a bit in a valid bit array (VBA). The index for the
new key is indicated by ’next’, which is updated every time from a free address stack
(FAS) in off-chip memory. Also, the old HIT handles a delete operation by simply
clearing a bit in the VBA coupled with the corresponding HIT. Updating ’next’ and the
FAS is not a burden, because insert and delete operations need off-chip access anyway;
thus, ’next’ and the FAS can be updated without additional cost.
Checking $V_l$ and $V_r$ with the indexes given by the two HITs ensures that an
unnecessary access to off-chip memory is blocked on-chip. Also, when all n keys
are encoded in one HIT, i.e. at the moment that the FAS is empty, the other HIT needs
to be initialized to 0 for the next set of insert operations.
[Figure: an l-HIT and an r-HIT on-chip, each with a 0-tree and a 1-tree of BFs $B^i_j$ and
with the valid bit arrays $V_l$ and $V_r$; ’next’ and the FAS off-chip; keys $e^l_0$ to $e^l_7$
and $e^r_6$ in the key table.]
Fig. 25. Dual configuration of HITs for delete operation.
Suppose, for example, that 8 keys, $e^l_0, \ldots, e^l_7$, are inserted in an l-HIT as shown
in Fig. 25. Then, after the 8 keys, the target HIT for new insertions becomes the r-HIT.
Now the ensuing operations are deletions of $e^l_4$, $e^l_1$, and $e^l_6$. After the deletions,
suppose the next operation is the insertion of $e^r_6$ in the r-HIT with the proper setting
of a bit for $e^r_6$ in array $V_r$, as shown in Fig. 25, where ’next’ and the FAS hold 001 and
100, respectively. By rotating the target HIT for insertion among the dual HITs and
confirming an index returned by each HIT with a VBA, insert and delete operations
are processed seamlessly. Also, by using two rotated HITs, an HIT does not need
counters in each BF, i.e. a counting BF, which costs 4 times more memory than
a BF. Thus, using two HITs takes half the memory of counting BFs. The detailed
procedure and complexity of delete are provided in Sec. C.
D. Query Operation Making Index Paths in Dual HITs
Once all keys are saved in a key table and encoded in a set of BFs in on-chip memory,
the remaining and ultimate goal of an HIT is to search for a key quickly by a query
operation. There are two kinds of search patterns: an unsuccessful search (US), in which
a key is searched for although it is not in the HIT, and a successful but time-consuming
search (SS), in which a key is found in the HIT. Before a
discussion of these two kinds of searches, let the definitions of index path, index
segment, false index path, and false segment be introduced.
Definition 5 (Index Path)
In an HIT, an index path, or i-path, is defined as a series of $B^i_j$s used in an insert
operation and hierarchically connected to each other from layer 0 to layer s-1, making a
sequence of address bits. The sequence of indexing bits in the $B^i_j$s also matches
an index for a key saved in a key table, and the size of the sequence of bits
from the series of $B^i_j$s must be s.
As a corollary, it can be concluded that in a query for key e previously encoded by
insert, an i-path for the key e should show up, as a BF returns ’yes’ in
true membership testing.
In an HIT, besides an i-path dedicated to a key, a false index to a key table is possible
due to f-positives from irrelevant BFs. For example, suppose key $e_4$ is inserted
with i-path $1_0 0_1 0_2$ in Fig. 24 and then a query for $e_4$ is requested. The result of the
query may give an ambiguous $1_0 0_1 x_2$, $x_2\in\{0, 1\}$, due to an f-positive of $B^2_5$.
Thus, this ambiguity needs two accesses to a key table. Given a query for an i-path of
size s, there are in total $2^s-1$ possible false indexes, because each $B^i_j$ is independent
and identically distributed (i.i.d.). Besides the definition of an i-path, a false index path
is defined by the result of a query operation, leading to a false indexing to a key table in
on-chip memory.
Definition 6 (False Index Path and False Segment)
In query, from hierarchically consecutive layers, a group of $B^i_j$s not pertaining to an
i-path can be formed in a series of at most size s. To become a false index path, or
f-path, this series needs to be either connected to an i-path or a completely different
path of size s, i.e. independent of an i-path in an HIT. Also, the group attached to an
i-path is called a false segment, or f-segment. The number of f-paths plus an i-path
is comparable to the length of the shared linked list used for the query of a key in an FHT.
Even if a set of BFs may give f-positives in a query, only BFs that are hierarchically
connected to each other and to an i-path can be part of
an f-segment. Thus, f-positives from the remaining BFs can be ignored. For the previous
example of $100_2$ for $e_4$ in Fig. 24, even if $B^1_1$ and $B^1_3$ randomly make f-positives
in a query, there is no f-segment starting from $B^1_1$ or $B^1_3$. By the definition of
an f-path, the probability of the f-path is calculated cumulatively as the product of the
f-positive probabilities of the BFs along the f-path.
Figs. 26(a) and 26(b) show an example of the calculation of the probability of an
f-path in an HIT with one i-path and three f-paths. The series $a_0a_1a_2a_3$ in the dark
boxes is an i-path. The probability of the f-segment $b_2b_3$ forming f-path $a_0a_1b_2b_3$
is $\prod_{t=2}^{3} f^t$, where $f^t$ is the f-positive rate of a BF on layer t. Also, the
probabilities of the remaining f-paths, $c_0c_1c_2c_3$ and $c_0c_1c_2d_3$, are both
$\prod_{t=0}^{3} f^t$, because BFs on the same layer have the same f-positive probability.
[Figure: (a) a 0-tree and a 1-tree with the BFs of the i-path $a_0a_1a_2a_3$ and the false
positives $b_2$, $b_3$, $c_0$ to $c_3$, and $d_3$ marked; (b) a tree T rooted at $B^i$ with
sub-trees $T_l$ and $T_r$, illustrating Case 1 and Case 2 for f-segments.]
(a) Examples of an i-path, f-segments, and f-paths.
(b) Probability of cumulative f-positives in an HIT.
Fig. 26. Examples of an i-path, f-segments, and f-paths. Probability of f-paths.
Once the probability of an individual f -path is known, the ﬁnal attention is paid
to the probability that an HIT has t f -paths, 0<t<n. Suppose binary tree T of height
l has two sub-trees, Tl and Tr of height l-1 with cumulative false positives FTl and FTr ,
respectively, as shown in Fig. 26(b). Also, let Tl and Tr have nl and nr f -segments
of size l-1. To have nl+nr f -segments of size l, the binary tree T rooted at B
i with
height l needs itself to be an f -positive. Therefore, the probability FT that the binary
tree T with its sub-trees has nl+nr f -paths is the product of three: the probability
that T needs to be an f -positive, the probability that Tl has nl f -paths, and the
probability that Tr has nr f -paths, i.e. f
i·FTl·FTr . Fig. 26(b) shows the cases of the
probability that tree T has 2 or 3 f -paths as follows:
Case 1: The right child tree $T_r$ does not have any f-segment. By definition, for an
f-segment to exist in T, the root $B^i$ of T must be an f-positive. Once $B^i$ is
an f-positive, each f-segment of size l-1 in $T_l$ becomes part of an f-segment of size l.
Therefore, T has the same number of f-segments as $T_l$.
Case 2: Both $T_r$ and $T_l$ have f-segments, and T has the sum of the f-segments
in $T_l$ and $T_r$, 3 f-segments in total, because $B^i$ is an f-positive and both $T_l$
and $T_r$ have their own f-segments.
Suppose $P^i(t)$ is defined as the probability of t f-segments starting on layer i in an
HIT of height s. It is calculated recursively as follows:
$$P^i(t) \ge \sum_{v=0}^{t} P^{i+1}(v) \cdot P^{i+1}(t-v) \cdot f^i \quad \text{if } t \le 2^{i+1}, \tag{6.1}$$
where the base cases of $t > 2^{i+1}$ and $i=s$ are 0 and 1, respectively. Also, as $f^i$ based on
Eq. (3.1) is bounded and has a global minimum at $k^i = m^i \ln 2 / n^i = w^i$ of Eq. (3.3), the
inequality in Eq. (6.1) for layer i can be removed at the optimal configuration of $k^i$.
1. False indexing to a key table in on-chip for a US
Besides keeping the probability of more than one access to a key table low for
an SS, it is equally important that the probability of f-indexes in a US is low.
Unlike an SS, in a US there is no i-path for a given key, meaning that every BF that
returns 'yes' in a query is an f-positive. However, there is a chance that each of the 0-tree
and the 1-tree gives several f-paths. In contrast to one f-positive in an FHT [15] leading
to off-chip memory accesses, one f-path formed by a series of f-positives of hierarchically
connected $B^i_j$s in each layer i of an HIT becomes one index access to a key table.
Thus, a far lower probability is expected because it is the product of the f-positive
probabilities of the BFs.
Suppose random variable $X_u$ is the number of f-paths in a US on an HIHT.
Then, it is the number of key table entries to check and equally the number of memory
accesses if one memory access reads one entry. The probability
$\Pr\{X_u = v\}$, v>0, can be derived from Eq. (6.1) at the optimal configuration
of k=w as follows:
$$\Pr\{X_u = v\} = \sum_{v=t_0+t_1+t_2+t_3} P^0(t_0) \cdot P^0(t_1) \cdot P^0(t_2) \cdot P^0(t_3), \tag{6.2}$$
because there are two HITs, the l-HIT and the r-HIT, each having 0- and 1-trees. The
sum in Eq. (6.2) runs over the combinations of $t_0$, $t_1$, $t_2$, and $t_3$ that add up to v.
That is, if v=1, there are four cases: 0+0+0+1, 0+0+1+0, 0+1+0+0, and 1+0+0+0.
Although choosing a large value for the precision $w^i$ of layer i>0 is possible, the
number of BFs probed on each layer in an HIT must be upper bounded, so that the total
number of memory reads to $M^i_{on}$ for the BFs is sustainable in a hardware
implementation. The expected number of BFs to probe on layer i>0 for a US becomes
$$N^i_{B,U} = 2 \times \Big\{ 2^{i-1} \prod_{t=0}^{i-1} f^t \times 2 \Big\} = 2 \times \{ 2^{i-1} \times 2^{-2i} \times 2 \} = 2^{1-i}, \tag{6.3}$$
where $f^i = 2^{-w^i} = 2^{-2}$ except on layer s-1. On layer s-1, $f^{s-1} = 2^{-w^{s-1}}$ is set to a
collision rate as low as a high-speed router requires, such as $2^{-29}$ for 160 Gbps. In the limit,
$$\lim_{i \to \infty} N^i_{B,U} = 0, \tag{6.4}$$
meaning that as the layer index i increases, the expected number of BFs
on layer i becomes small enough that simple memory hardware can support the
resulting small number of memory reads.
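As a quick numerical check of Eq. (6.3), the following sketch evaluates the expression with exact fractions, assuming the same precision on every layer below i. The function name and the default `w=2` are illustrative choices, not part of the original text.

```python
from fractions import Fraction


def expected_bfs_us(i, w=2):
    """Evaluate Eq. (6.3) for layer i > 0 with f^t = 2**-w on every layer t < i."""
    f = Fraction(1, 2**w)
    product = Fraction(1)
    for _ in range(i):  # product of f^t over t = 0 .. i-1
        product *= f
    return 2 * (2 ** (i - 1) * product * 2)


for i in range(1, 6):
    print(i, expected_bfs_us(i))  # equals 2**(1 - i) when w = 2
```

With w = 2 the value halves on every layer, which is the convergence stated in Eq. (6.4).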
2. False indexing to a key table in on-chip for an SS
The probability of f-paths in a US has been derived. Now, the probability of the
number of f-paths in an SS is derived. The situation in an SS is very
different from that of a US because there must be one i-path and possibly several
f-paths, while there is no i-path in a US. Fig. 27(a) shows an example of 5 layers
for $2^5$ keys where along an i-path there are 5 dangling trees, or d-trees, contributing
f-paths, if any. All d-trees except the one rooted on layer 0 are attached to the i-path,
and they contribute f-paths with different probabilities related to $P^i(n)$
of Eq. (6.1).
[Fig. 27. An i-path and d-trees in an SS, and $P^i(n)$ of Eq. (6.1) for each d-tree in an HIT. (a) An HIT of 5 layers with an i-path and dangling trees. (b) $P^i(t)$ of Eq. (6.1), plotted as probability against the number of f-segments t for $P^4(n)$ and $P^3(n)$ at $w^4 = 5, 6, 7$, with $w^j=3$ for j<4, s=5, and $n=2^5$. Note that as to $P^4(t)$, there is only one possible f-segment, i.e. t=1.]
Fig. 27(b) shows $P^i(t)$ in an HIT with various $w^4$ for $n=2^5$ keys. The two
lines for $w^4=5$ differ by a factor of 4, meaning that $P^4(t)$ is
the dominant contributor to the number of f-segments. This is also true for the other
cases of $w^4=6, 7$. Now, to compare precision choices for
$w^{s-1}$, i.e. $w^4$, look at the three solid lines with markers at the top of the figure.
The ratio of $P^4(1)$ between two precision choices for layer 4, $w_1$ and $w_2$,
is 2 to the power of $w_1-w_2$. Therefore, no matter what precision $w^j$ is
chosen for layer j, 0≤j≤s-2, the dominant probability of f-segments in an HIT comes
from layer 4. Based on this result, it can be concluded that choosing a reasonably
low precision for layer j, 0≤j≤s-2, does not affect the probability of the number of
f-paths, but saves memory for these layers.
Besides the decision rule for $w^i$ on layer i, it is necessary to calculate the number
of partially found f-segments on each layer in an SS. Twice this number is
the number of BFs to probe on the next layer, like the expected
number of BFs in Eq. (6.3) for a US. Note that in an SS there is an i-path, and at
least two child BFs are probed from each BF on the i-path because the tree is binary. The
expected number of BFs to probe on layer i>0, except layer s-1, for an SS becomes
$$N^i_{B,S} = 2 + f^{i-1} \cdot 2 + 2 \cdot f^{i-1} f^{i-2} \cdot 2^2 + \cdots + 2^{i-1} \cdot f^{i-1} \cdots f^0 \cdot 2^i \tag{6.5}$$
$$= 2 + \sum_{t=0}^{i-1} 2^t \prod_{v=0}^{t} f^{i-v-1} 2^{t+1} = 2 + \left(1 - 2^{-i}\right),$$
where 2 accounts for the two BFs on an i-path and $f^i=2^{-2}$. With Eq. (6.3), the
limiting maximum of the expected numbers of BFs to probe in an SS and
a US becomes 3:
$$\lim_{i \to \infty} \max\{N^i_{B,U}, N^i_{B,S}\} = 3. \tag{6.6}$$
Thus, the total number of memory reads to $M^i_{on}$ for layer i, 0<i<s-1, is $k^i$ times 3,
because each BF needs $k^i$ hash functions. In conclusion, in a query, either
unsuccessful or successful, the expected number of memory reads on $M^i_{on}$ for layer i,
0≤i<s-1, is upper bounded by $3k^i$, where $k^i$ is set to 2. On layer s-1, $k^{s-1}=w^{s-1}$ is set to
the log of the inverse collision rate, i.e. 29 for 160 Gbps, so that the bandwidth requirement
is met by our deterministic Θ(1) lookup. The $k^{s-1}$ memory reads of random
locations by $k^{s-1}$ hash functions can be supported by simple switching circuitry in
one cycle. Even if a commodity SRAM has a limit on the number of memory reads,
a large SRAM can be divided into smaller SRAMs that each serve fewer memory
reads, without worsening the f-positive, as suggested in [15].
Based on the observation from Fig. 27(b), it is necessary to calculate the number of f-paths,
i.e. false indexes to a key table in on-chip memory, in an SS. Let random
variable $X_s$ be the number of f-paths from an HIHT, given an i-path for a key in
a query operation. Then, $X_s+1$ is the total number of entries in a key table to check
in an SS and is equal to the length of the searched linked list of an FHT [15]. The
probability that an SS has no f-paths is
$$\Pr\{X_s = 0\} = P^0(0) \cdot P^1(0) \cdot P^2(0) \cdots P^{s-1}(0), \tag{6.7}$$
because the d-trees along an i-path and the 0- and 1-trees are independent of one
another. In general, $\Pr\{X_s=v\}$ follows from the independence
of the d-trees along an i-path:
$$\Pr\{X_s = v\} = \sum_{v=t_0+\cdots+t_{s-1}} P^0(t_0) \cdot P^1(t_1) \cdot P^2(t_2) \cdots P^{s-1}(t_{s-1}). \tag{6.8}$$
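The sum over $t_0+\cdots+t_{s-1}=v$ in Eq. (6.8) is a convolution of the per-d-tree distributions. The sketch below shows that computation under the assumption that each $P^i(t)$ is available as a normalized list of probabilities; the numbers in `per_tree` are placeholders for illustration, not results from this work.

```python
def convolve(p, q):
    """Distribution of the sum of two independent counts with mass functions p and q."""
    out = [0.0] * (len(p) + len(q) - 1)
    for a, pa in enumerate(p):
        for b, qb in enumerate(q):
            out[a + b] += pa * qb
    return out


def total_f_paths(per_tree):
    """Combine per-d-tree mass functions: sum over t_0 + ... + t_{s-1} = v."""
    dist = [1.0]
    for p in per_tree:
        dist = convolve(dist, p)
    return dist


# Hypothetical mass functions P^i(t) for three d-trees (placeholder numbers).
per_tree = [[0.5, 0.5], [0.75, 0.25], [0.5, 0.25, 0.25]]
pr_xs = total_f_paths(per_tree)
expected_xs = sum(v * p for v, p in enumerate(pr_xs))
print(pr_xs[0], expected_xs)
```

The first entry `pr_xs[0]` is the product form of Eq. (6.7), and the weighted sum gives the expected number of f-paths.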
3. Detailed procedures for query and delete
A complete query operation consists of query-i, shown in Procedure query-i, which
covers only the on-chip operation for layer i. The time complexity of this procedure is
Θ(1) on the condition that 2‖L‖, twice the size of L, is bounded by the number
of memory reads that the hardware supports without burden. The reason is that,
given the candidate partial addresses in L, the number of BFs to probe is doubled
because each node in a binary tree has two children. Thus, by pipelining the layers
starting from layer 0, a query is performed in one cycle, so that query-(s-1) returns complete
indexes to a key table for a given query. On the last layer s-1, the average number
of complete indexes is $E[X_s]+1$ or $E[X_u]$ for an SS and a
US, respectively, where $E[X_s]$ and $E[X_u]$ can be derived from Eqs. (6.8) and (6.2) as
$$E[X_s] = \sum_{t=0}^{n-1} t \cdot \Pr\{X_s = t\}, \qquad E[X_u] = \sum_{t=0}^{n} t \cdot \Pr\{X_u = t\}, \tag{6.9}$$
and they are considered Θ(1) because $\Pr\{X_s=1\} \ll 1$ and $\Pr\{X_u=1\} \ll 1$. Moreover,
$\Pr\{X_s = 1\} \gg \Pr\{X_s=t\}$ and $\Pr\{X_u = 1\} \gg \Pr\{X_u=t\}$, t > 1, as seen in Fig.
27(b).
Procedure query-i
Input: M^i_on for layer i, list L of partial indexes found up to layer i-1, including
the i-path, and key e
Output: A set S of partial indexes A = a0···ai of i+1 bits, including f-segments
S = ∅; n = ‖L‖;  /* S: set of partial paths. L = {A0, ···, An-1} */
for t = 0 to n-1 do
    m_i = 1.44 n_i w_i; At = L[t]; idx0 = 2At·m_i; idx1 = (2At+1)·m_i;
    cnt0 = cnt1 = 0;
    for u = 0 to k_i-1 do  /* One memory for BFs on layer i; k_i hash functions */
        if M^i_on[idx0 + h_u(e)] == 1 then cnt0++;  /* idx0, idx1 are the base addresses of the child BFs B^i_j */
        if M^i_on[idx1 + h_u(e)] == 1 then cnt1++;  /* Two sets of k_i accesses due to binary children BFs in an HIT */
    end
    if cnt0 == k_i then S = S ∪ {At·0};  /* concatenate a 0 or 1 bit to the end of At */
    if cnt1 == k_i then S = S ∪ {At·1};  /* both children may match, so the tests are independent */
end
return S;  /* No off-chip memory access */
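The layer-by-layer probing of Procedure query-i can be sketched in software as follows. This is a minimal illustration under several assumptions that are not in the original text: a single tree instead of two HITs with 0- and 1-trees, hash functions derived from SHA-256, one byte per filter bit, $k^i=w^i$ hash functions per layer, $n^i=n/2^{i+1}$ keys per filter, and filter sizes rounded up. The class and helper names are hypothetical.

```python
import hashlib
import math


def hashes(key, k, m):
    """k illustrative hash values h_t(key) in [0, m), derived from SHA-256."""
    return [
        int.from_bytes(hashlib.sha256(f"{t}:{key}".encode()).digest()[:8], "big") % m
        for t in range(k)
    ]


class LayeredFilters:
    """s layers of Bloom filters; layer i holds 2**(i+1) filters in one flat array."""

    def __init__(self, s, w):
        self.s, self.w = s, w
        n = 1 << s
        # m_i = ceil(1.44 * n_i * w_i) bits per filter, n_i = n / 2**(i+1) keys each
        self.m = [math.ceil(1.44 * (n >> (i + 1)) * w[i]) for i in range(s)]
        self.mem = [bytearray((1 << (i + 1)) * self.m[i]) for i in range(s)]

    def insert(self, key, index):
        """Encode key along the path given by the s-bit key table index."""
        for i in range(self.s):
            j = index >> (self.s - i - 1)  # prefix a_0..a_i selects filter j
            base = j * self.m[i]
            for h in hashes(key, self.w[i], self.m[i]):
                self.mem[i][base + h] = 1

    def query_layer(self, i, partial, key):
        """One step of query-i: extend each partial index by 0 and/or 1."""
        found = []
        hs = hashes(key, self.w[i], self.m[i])
        for a in partial:
            for child in (2 * a, 2 * a + 1):
                base = child * self.m[i]
                if all(self.mem[i][base + h] for h in hs):
                    found.append(child)
        return found

    def query(self, key):
        partial = [0]  # empty prefix above layer 0
        for i in range(self.s):
            partial = self.query_layer(i, partial, key)
        return partial


tree = LayeredFilters(s=4, w=[2, 2, 2, 8])
for index in range(16):
    tree.insert(f"key{index}", index)
print(tree.query("key5"))  # candidate indexes, including 5
```

Because a Bloom filter has no false negatives, the true index is always among the candidates; any extra candidates correspond to f-paths and are resolved by comparing against the key table.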
As for delete, like the query operation, if there is an f-path beside the i-path
associated with a key, two accesses to a key table are necessary. Thus, when random
variable Z denotes the number of accesses to a key table in on-chip memory,
the average number of memory accesses for a delete operation on the condition that the
target key exists, i.e. in a successful deletion, is
$$E[Z] = 1 + \sum_{v=1}^{n-1} v \cdot \Pr\{X_s = v\} = 1 + E[X_s].$$
The detailed procedure is shown in Procedure delete. Its complexity in
on-chip memory is Θ(1) because the complexity of query is Θ(1). The number
of indexes used to access a key table is Θ(E[Z]) on average for a successful deletion, and
it is constant as $E[X_s]$ is Θ(1).
Procedure delete
Input: Two HITs, l-HIT and r-HIT, and key e
Output: Updated valid bit array Vl or Vr for the l- and r-HIHT
Sl = query(l-HIT, e);
Sr = query(r-HIT, e);  /* Only on-chip Op. */
if ‖Sl ∪ Sr‖ == 1 then
    A ∈ St; Vt[A] = 0;  /* t s.t. ‖St‖ == 1 */
else if ‖Sl ∪ Sr‖ > 1 then
    foreach A in St, t ∈ {l, r} do  /* Off-chip M. Acc. */
        if Vt[A] == 1 and Mkey[A] == e then
            Vt[A] = 0;  /* Save A in FAS via 'next' */
    end
end
4. Parallel accesses to a key table in an interleaved way
In a linked list used in an LHT and an FHT, accessing the last key in a linked list
of t keys takes t sequential cycles, because the memory address of key e is known only after
the previous key e′, which holds a pointer to e, is read in the previous cycle.
In contrast, in an HIHT, the candidate indexes to a key table are generated at the
last layer s-1 in every cycle, so that parallel accesses to the key table are possible,
as shown in query-i. Thus, unlike the sequential access in a linked list, our memory
access works as follows: every cycle, a set of indexes is generated for a given
query, and no other index is needed to access the key table for the query because the i-path
is in the set. Also, with simple interleaving or switching circuitry in the SRAM for a
key table, a group of memory reads to entries in the key table via the indexes can be
processed in parallel in one cycle, because the indexes differ from each other and are
known in advance. As an instance of interleaving, Garcia et al. [68] provide a worst-case
bandwidth guarantee by exploiting bank interleaving in an SRAM/DRAM
hybrid for a packet buffer. Once the perfect match against the set of indexes to an on-chip key
table completes in one cycle, an HIHT provides a deterministic lookup that needs
only one access to off-chip memory to fetch the rule for every SS query. If a
query is a US, there is no access to off-chip memory. The detailed analysis of the size
of the set of indexes in SS and US queries is shown in the following section.
E. Simulation Result for an HIHT
This section presents an analysis of memory efficiency for three schemes: an FHT
[15], a BFHT [16], and an HIHT. As to a perfect hash function, Mitzenmacher and
Vadhan [69] claimed that a simple hash function can behave like a truly random hash
function. A class of universal hash functions is suitable for hardware implementation,
and thus a scheme from [59] is chosen.
1. Memory comparison with other hash mechanisms
[Fig. 28. Memory efficiency ratios $R_{H,B}$ and $R_{H,F}$ with various s and w, plotted against the number of layers $s=\log_2 n$ and the precision $w=\log_2 1/f$. Note a corrected-FHT is considered.]
To maximize the memory efficiency of an HIHT, the precision of layer t is set to 2 for
0≤t<s-1, and that of layer s-1 is set to w, the same as the precisions of an FHT and
a BFHT. The precision value 2 is chosen based on the hardware considerations for
pipelining and memory read ports stated in Sec. 2. Also, for a fair comparison,
the on-chip memory tables are counted for a BFHT and an HIHT. According to
Eq. (3.2) for the requirement of $f=2^{-w}$, the total HIHT memory usage, $M_H$, counts
two copies of HITs plus two VBAs and a key table, and is calculated as follows:
$$M_H = 2\{2\beta n^0 w^0 + \cdots + 2^{s}\beta n^{s-1} w^{s-1} + n\} + n \log_2 n \tag{6.10}$$
$$= 2\beta n \sum_{i=0}^{s-1} w^i + 2n + n \log_2 n = 2\beta n \Big( \sum_{i=0}^{s-2} 2 + w^{s-1} \Big) + 2n + n \log_2 n,$$
where β=1.44. In contrast, the FHT memory usage is $M_F = \beta w n s + 4\beta w n = \beta w (s+4) n$
when 4-bit counters are considered, which the FHT in [15] did not count,
for a fair memory comparison. The memory size $M_B$ for a BFHT is 2 times
$kn(2\log_2 n + \log_2 k + w)$ due to the seamless update in O(n log n) complexity. Thus,
considering $n \log_2 n$ for an on-chip key table, the memory efficiency ratios $R_{H,F} = M_F/M_H$ and
$R_{H,B} = M_B/M_H$ become
$$R_{H,F} = \frac{\beta n w (s+4)}{2\beta n \sum_{i=0}^{s-1} w^i + 2n + n \log_2 n} \tag{6.11}$$
$$R_{H,B} = \frac{2kn(\log_2 n + \log_2 k + w) + n \log_2 n}{2\beta n \sum_{i=0}^{s-1} w^i + 2n + n \log_2 n}. \tag{6.12}$$
Fig. 28 shows the two ratios, $R_{H,F}$ and $R_{H,B}$, calculated from Eqs. (6.11) and (6.12)
in the range [10:30] for w, the required precision of each scheme, and in the range
[10:30] for s. Note that the LHT memory size against $M_H$ is neither considered nor
drawn in the figure, because with the given high precision w of 29, the LHT memory size
is more than a hundred times larger than that of an HIHT. A change in w has
a greater effect on $R_{H,F}$ than a change in s does, implying that as long as the
collision rate must stay low for a high bandwidth, our HIHT maintains a multiple-fold,
not fractional-fold, efficiency gain. For instance, 160 Gbps needs w to be
at least 29 (≈ $\log_2$ 500M). However, $R_{H,B}$ changes only slightly with w
and s, and it gives around two times the memory efficiency over all ranges. These results
on $R_{H,F}$ and $R_{H,B}$ support our claim that our HIT offers a more space-efficient
hashing architecture, as shown in the previous section.
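To make the behavior of $R_{H,F}$ concrete, the following sketch evaluates Eq. (6.11) per key, using $s=\log_2 n$ so that n cancels. The function names are illustrative, and the sample point (s=20, w=29) is chosen here only as an example within the plotted ranges.

```python
def hiht_bits_per_key(s, w, beta=1.44):
    """M_H / n from Eq. (6.10): precision 2 on layers 0..s-2 and w on layer s-1."""
    precisions = [2] * (s - 1) + [w]
    return 2 * beta * sum(precisions) + 2 + s  # log2(n) = s


def fht_bits_per_key(s, w, beta=1.44):
    """M_F / n = beta * w * (s + 4), counting the 4-bit counters."""
    return beta * w * (s + 4)


def ratio_hf(s, w):
    """R_{H,F} of Eq. (6.11); n cancels because every term is linear in n."""
    return fht_bits_per_key(s, w) / hiht_bits_per_key(s, w)


print(round(ratio_hf(20, 29), 2))
for w in (10, 20, 30):
    print(w, round(ratio_hf(20, w), 2))
```

Sweeping w at fixed s shows the ratio growing with the required precision, in line with the trend described for Fig. 28.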
2. Power comparison with TCAM for IP lookup
In addition to the memory comparison among hash mechanisms, a power comparison
between an HIHT-based IP lookup and a TCAM-based IP lookup is made. Although the
detailed hash-based IP lookup architecture with the proposed hash schemes, an HIHT
and an FPHT, will be shown in Sec. VIII, this section and the following section show
preliminary power and memory comparisons with TCAM and a trie for IP lookup.
Fig. 29 shows the consumed energy per read clock in two IP lookup schemes: a
TCAM-based IP lookup and an HIHT-based IP lookup. The consumed energy per
read clock is measured by CACTI [35] and a tool [70]. The table size was varied from
236K to 1M entries. The first table of 236K entries is taken from [51] and the remaining
tables are randomly generated. It is found that the proposed HIHT-based IP lookup
scheme consumes 51 times less power than the TCAM-based IP lookup. The TCAM
uses a tremendous amount of power as the table size increases,
while the proposed HIHT-based scheme uses only a small amount of power.
[Fig. 29. Consumed energy per read clock in 0.09μm process technology, for TCAM and HIHT, against routing table size.]
3. Memory comparison with Trie for IP lookup
Fig. 30 shows the total memory for IP lookup in two schemes: a Tree Bitmap
[71] and an HIHT. As Table I discussed, trie-based IP lookup schemes suffer O(W)
lookup complexity, where W is the IP address length, while hash-based schemes provide
O(1) lookup complexity. Comparing the memory efficiencies of IP lookup schemes with
different lookup complexities is not fair, because hash-based schemes can provide a
higher throughput. However, the proposed HIHT-based hash scheme can provide
O(1) lookup complexity as well as memory efficiency, as follows. The table size was
varied from 236K to 1M entries. The first table of 236K entries is taken from [51]
and the remaining tables are randomly generated, as in the previous section. In each memory
calculation, other bitmaps and pointer memory overhead for the hash-based and
trie-based approaches are considered. It is found that the HIHT-based IP lookup scheme
consumes 1.75 times less memory than Tree Bitmap. Furthermore, as the
table size increases, the HIHT-based IP lookup scheme uses up to 2.15 times less
memory.
[Fig. 30. Memory comparison of Tree Bitmap and an HIHT in different table sizes (memory in bits against routing table size).]
CHAPTER VII
HASHING USING BLOOM AND FINGERPRINT FILTERS
Sec. III.B shows that an SBF and an FF are memory- and power-efficient approximate
membership testers. Such membership testers are used to propose an FPHT in this
chapter. In an FPHT, a group of SBFs is used to do a pipelined binary search for
a key's FP, and subsequently the found FP of w bits is used to access a key table
with a desired lookup precision w. Suppose there is no f-positive in an SBF or
an FP. Then, there is only one found FP in an FPHT's binary search. However,
since an SBF and an FP can produce an f-positive, the binary search result may contain
several candidate FPs and key indexes. Since the number of key table accesses through
FPs is the inverse of the lookup throughput, a high probability of few
key table accesses is desirable for high-speed packet processing.
Fig. 31 illustrates our FPHT pipeline architecture, where a set of SRAM modules
is pipelined and the key and rule tables are accessed by the indexes that the FPHT generates
for a perfect-match lookup. Based on a key's query results from the SBFs, which are
built on an SRAM module as described in Sec. A, each stage logic decides the SBFs' base
addresses based on Eq. (3.2) and accesses the SBFs in the next stage.
[Fig. 31. A pipelined FPHT architecture with s stages: SRAM modules 0 to s-1 with their stage logic generate key indexes, which are checked for a perfect match against the key table before the rule table is read.]
In a logical view, a tree with a set of SBFs is conceptually embedded into multiple
memory modules (hereafter, BF and SBF are used interchangeably).
In the tree, nodes (or BFs) are used for a binary search in which a BF's query result
for a key chooses the next BF nodes to probe. With the query results of the BFs and an FP, the
embedded tree generates indexes to a key table. The tree, called an indexing tree (IT),
is memory efficient while preserving the required lookup speed, because it never uses
pointers to implement the tree: the FP and key tables are accessed through the same
indexes. In addition, the memory cost for the BFs and an FF in an IT is far less than in other
schemes, because the BFs' role is limited to a binary search for a key's FP and an FF.
This ensures one access to a key table with a high probability. This memory saving
is beneficial to modern fast packet processing, which is challenged by scalability.
The detailed IT construction is as follows.
A. Building a Conceptual IT of a Binary Preﬁx Tree
[Fig. 32. Conceptual IT construction with BFs and tables of FPs and keys. Below a virtual root (VR), stage 0 (layer 0) holds $B^0_0$ and $B^0_1$, stage 1 holds $B^1_0$ to $B^1_3$, and stage 2 holds the FP table above a key table with entries 0 to 7; the address bits run from MSB to LSB.]
Suppose n is a power of 2. Then, an IT in a binary prefix tree is built as
follows: an IT for n keys is composed of $s=\log_2 n$ stages, and a prefix tree is built
based on the index bits for a key table, so that a BF in the prefix tree is shared by a
prefix among indexes. Fig. 32 shows the IT partition, where keys are stored in a key
table consecutively and the keys' index addresses are partitioned by BFs on stage j,
0≤j≤s-2, or FPs on stage s-1, so that each key has its own BF-FP path in the IT.
In general, n keys are filled in a key table sequentially from index $0_0 0_1 \ldots 0_{s-1}$ to index
$1_0 1_1 \ldots 1_{s-1}$. Let $B^i_j$ denote the j-th BF in stage i, hereafter 0≤i≤s-1, while $F^{s-1}_j$ denotes
the j-th FP on stage s-1. Then, if key e is to be inserted at index $A = a_0a_1 \ldots a_{s-1}$, where
$a_t \in \{0, 1\}$, 0≤t≤s-1, a BF, denoted $B^i_{a_0 \cdots a_i}$, at each layer i is involved to encode key e
just like a legacy BF. In this hierarchical binary encoding, $B^i_j$ covers $n_i = n/2^{i+1}$ keys
whose indexes in a key table range from $j \cdot 2^{s-i-1}$ to $(j+1) \cdot 2^{s-i-1} - 1$. For instance, $B^0_0$
and $F^2_5$ cover key sets $\{e_0, \cdots, e_3\}$ and $\{e_5\}$, respectively.
Regarding memory hardware, although BFs are conceptually partitioned in a
layer (or a stage) by their key sets, they are concatenated with each other in an SRAM module
and separated by their base addresses. That is, as Eq. (3.2) specifies the required BF
memory size for a bounded f-positive, $B^i_j$ has base address $j \cdot 1.44 n_i$ in each memory
bank of $M^i_{on}[k]$. Also, since Eq. (3.2) states that for a given f, m is linearly proportional to n,
given $f_i = 2^{-w_i}$ for a BF on stage i, the total memory of $M^i_{on}$ for the
BFs on layer i is of size $2^{i+1}(1.44 n_i w_i)$. Finally, the index order of the keys' FPs in an
FP table corresponds exactly to that of the keys in a key table, and the FP table
size in bits is nw based on Eq. (3.5).
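The partition and sizing rules above can be sketched as a few small helpers. The function names are hypothetical, and rounding each filter size up to a whole number of bits is an assumption made here for illustration.

```python
import math


def covered_indexes(i, j, s):
    """Key-table indexes covered by the j-th filter on stage i of an s-stage tree."""
    width = 1 << (s - i - 1)
    return range(j * width, (j + 1) * width)


def path_for_index(index, s):
    """(stage, j) pairs visited for a key stored at index; j is the prefix a_0..a_i."""
    return [(i, index >> (s - i - 1)) for i in range(s)]


def filter_bits(i, s, w_i):
    """Bits of one BF on stage i: ceil(1.44 * n_i * w_i) with n_i = n / 2**(i+1)."""
    n_i = (1 << s) >> (i + 1)
    return math.ceil(1.44 * n_i * w_i)


def stage_bits(i, s, w_i):
    """Bits of the whole stage-i memory: 2**(i+1) filters laid side by side."""
    return (1 << (i + 1)) * filter_bits(i, s, w_i)


s = 3  # n = 8 keys, as in Fig. 32
print(list(covered_indexes(0, 0, s)))  # indexes under the first stage-0 filter
print(path_for_index(0b100, s))        # filters on the path of index 100
print(stage_bits(0, s, 2), stage_bits(1, s, 2))
```

Because $2^{i+1} n_i = n$, every stage with the same precision occupies about the same number of bits, which is why no pointers are needed to locate a filter: its position follows from j and the per-filter size alone.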
An IT is named after its tree-like layout, because each pipeline stage holds a
sequence of bits, and the sub-block of bits for a BF on stage i has a binary relationship
with sub-blocks on stages i-1 and i+1. Yet, to maintain the tree, an IT does not use
explicit pointers as in a binary search tree [60], but an implicit index for each BF
sub-block in $M^i_{on}$. That is, $B^i_j$ is located at $j \cdot 1.44 n_i w_i$ in memory $M^i_{on}$. Also, all $2^{i+1}$
BFs on layer i, which are independent of each other, contribute to the memory size of $M^i_{on}$. Thus, the
large memory volume otherwise reserved for pointers is saved.
B. Insert Operation in an IT
Fig. 32 shows the binary address space with a set of BFs that hierarchically partition
a key table's address space. In this tree structure, the insertion of key $e_0$ at index
$100_2$, for example, means that $B^0_1$, $B^1_0$, and $F^2_0$ of layers 0, 1, and 2 are involved. Algorithm
insert-i shows the detailed insert operation on stage i, which is as simple as that for a BF.
Algorithm 10: insert-i()
Input: key e, rule r, and partial index A = a0a1···ai in binary bits
Output: Encoded IT for key e on stage i
1  m_i = 1.44 n_i w_i; j = a0···ai;
2  for t = 0 to k-1 do
       // h_t(e) ∈ {0, ···, m′_i - 1}; M^i_on holds 2^(i+1)·m_i bits
3      M^i_on[t][h_t(e)][j] = 1;  // M^i_on: BFs on layer i
4  end
Although conceptually all BFs are separate from each other in an IT, their hardware
implementation assumes that the BFs on layer i are embedded in one memory $M^i_{on}$ and
that there are s memory modules. The base address for $B^i_j$ is easily calculated as
shown in line 1. The for loop is done in parallel, as in a legacy BF, and
Algorithm insert-i works on stage i in a pipeline. Thus, the time complexity is O(1)
under the condition that hash functions return indexes within a constant time and
each layer conducts hashing in parallel. This condition is achievable in a hardware
implementation as noted in [15]. After the last stage s-1, key e and its associated
rule are saved as $M_{rule}[A]=r$ and $M_{key}[A]=e$, where A is the designated address.
The complexity of Algorithm insert for memory access to the key and rule tables is
O(1), because key e and its associated rule are saved in $M_{key}$ and $M_{rule}$ at A.
In contrast, an FHT claims a time complexity of $O(nk^2/m+k)$, while a BFHT takes
O(n log n).
C. Query Operation Making Indexes in an IT
Once all the keys are saved in a key table and encoded in a set of BF and FP memory
modules, the remaining goal of an IT is to search for a key with a fast query
operation. There are two kinds of search patterns: an unsuccessful lookup (UL), in
which a key is searched for although it does not exist in an IT, and a successful
but time-consuming lookup (SL), in which a key that exists in an IT is searched for. Before
a discussion of these two kinds of searches, the definitions of an index path, a false
index path, and a false segment are introduced.
Definition 7 (Index Path)
In an IT, an index path, or i-path, is defined as a series of $B^i_j$s that are used in an insert
operation and are hierarchically connected to each other from layer 0 to layer s-1 to produce
an index bit sequence. The sequence of indexing bits in the $B^i_j$s matches the
index of a key saved in a key table, and the size of the bit sequence from the
series of $B^i_j$s must be s.
As a corollary, in a query for a key e that was previously encoded
by insert, an i-path for the key e should show up as the BFs return 'yes' for their true
membership.
A false index to a key table, other than the i-path dedicated to a key, is
possible due to f-positives from irrelevant BFs or FPs in an IT. For example,
suppose key $e_4$ is inserted with i-path $100_2$ in Fig. 32 and then a query for $e_4$
is requested. This query result may give an ambiguous 00xx′, x, x′∈{0, 1}, due to
f-positives of $B^1_1$ and other FPs. Thus, this ambiguity needs to be resolved with
other accesses to a key table. Given a query with an i-path of size s, there are in total
$2^s-1$ possible false indexes, because each $B^i_j$ is independent and identically distributed, or i.i.d.
Besides the i-path definition, I define a false index path in a query operation, leading
to a false index to a key table.
Definition 8 (False Index Path and False Segment)
In a query, a group of $B^i_j$s or $F^i_j$s not pertaining to an i-path can form a series of
at most size s from hierarchically consecutive layers. To become a false index path, or
f-path, this series needs to be either connected to an i-path or a completely different
path of size s, i.e. independent of the i-path in an IT. Also, the group attached to an
i-path is called a false segment, or f-segment. The number of f-paths plus the i-path is
the total number of key table accesses, which corresponds to the shared-linked-list length
searched in an FHT key query.
Even if a set of BFs gives f-positives in a query, only BFs that
are hierarchically connected to each other and to an i-path can be part of an
f-segment. Thus, f-positives from the remaining BFs can be ignored. For the previous
example of $100_2$ for $e_4$, even if $B^1_1$ and $B^1_3$ randomly give f-positives in the
query, there is no f-segment starting from $B^1_1$ and $B^1_3$. By the definition of an
f-path, the probability of the f-path is calculated cumulatively as the product of the
f-positives of the BFs along the f-path.
 
 

[Fig. 33. Examples of an i-path and f-paths for a given query of key e4 in an IT without a virtual root. The tree holds $B^0_0$ and $B^0_1$ on layer 0, $B^1_0$ to $B^1_3$ on layer 1, and $F^2_0$ to $F^2_7$ on layer 2; the legend distinguishes a BF or FF answering 'no', answering 'yes' on the i-path, and answering 'yes' out of the i-path.]
Fig. 33 shows an example of calculating the probability of an f-path in an IT
with one i-path and two f-paths. The series $B^0_1 B^1_2 F^2_4$ in the dark boxes is an i-path.
The probability of the f-segment $B^1_3 F^2_7$ forming the f-path $B^0_1 B^1_3 F^2_7$ is $\prod_{t=1}^{2} f_t$, where $f_t$
is the f-positive of a BF or an FP on layer t. Also, the probability of the remaining
f-path, $B^0_0 B^1_0 F^2_0$, is $\prod_{t=0}^{2} f_t$, since the f-positive probabilities of BFs or FPs on
the same layer are equal.
Once the probability of an individual f-path is known, the final step is
the probability that an IT has t f-paths, 0<t<n. Suppose a binary tree T
of height $\ell$ has sub-trees $T_l$ and $T_r$ of height $\ell-1$, which have $n_l$ and $n_r$ f-segments
of size $\ell-1$. Also, let $T_l$ and $T_r$ have probabilities $F_{T_l}$ and $F_{T_r}$ for their f-segments.
To obtain $n_l+n_r$ f-segments of size $\ell$, the root of the binary tree T of height $\ell$ needs to
be an f-positive. Thus, the probability $F_T$ of the binary tree T with its sub-trees
having $n_l+n_r$ f-segments is the product of three terms: the probability that the root of T is
an f-positive, the probability that $T_l$ has $n_l$ f-segments, and the probability that $T_r$
has $n_r$ f-segments, i.e. $f_i \cdot F_{T_l} \cdot F_{T_r}$. Applying this recursively, the probability $P_i(t)$
of t f-segments starting on layer i for an IT of height s is calculated as follows:
$$P_i(t) = \sum_{v=0}^{t} P_{i+1}(v) \cdot P_{i+1}(t-v) \cdot f_i \quad \text{if } t \le 2^{i+1}, \tag{7.1}$$
where the base cases of $t > 2^{i+1}$ and $i=s$ are 0 and 1, respectively.
1. False indexing to a key table for a UL
Besides producing a low probability of multiple accesses to a key
table in an SL, it is equally important that the probability of f-indexes in a UL is
low. Unlike an SL, there is no i-path for a given key in a UL, meaning that every BF
that returns 'yes' in a query is an f-positive. However, there is a chance that an IT gives
several f-paths. In contrast to an f-positive in an FHT [15] leading to off-chip memory
accesses, an f-path formed by a series of f-positives of hierarchically connected $B^i_j$s in
each layer i becomes one index access to a key table. Thus, a far lower probability is
expected due to the product of the f-positive probabilities of the BFs.
Suppose random variable $X_u$ is the number of f-paths in a UL on an IT. Then,
the probability $\Pr\{X_u = v\}$, v>0, can be derived from Eq. (7.1) as
follows:
$$\Pr\{X_u = v\} = \sum_{v=t_0+t_1} P_0(t_0) \cdot P_0(t_1), \tag{7.2}$$
because an IT has two child trees on layer 0. The sum in Eq. (7.2) runs over
the combinations of $t_0$ and $t_1$ that add up to v. That is, if v=1, there are two cases:
0+1 and 1+0.
2. False indexing to a key table for an SL
The probability of f-paths in a UL has been derived. Now, the probability of the number
of f-paths in an SL is calculated. The situation in an SL is very different from that
of a UL, because there must be one i-path, with possible f-paths of very low
probability, while there is no i-path in a UL. Fig. 34 shows an example of 3 layers
for $2^3$ keys where along an i-path there are 3 dangling trees, labeled d-trees,
contributing to f-paths, if any. All d-trees except the one rooted on layer 0 are attached
to the i-path, and they contribute to the number of f-paths with different probabilities
related to $P_i(n)$ of Eq. (7.1).
Based on the observation from Fig. 34, to calculate the f -paths number, i.e.
false indexes to a key table in an SL, is necessary. Let random variable Xs be the
number of f -paths in query operation with an i -path for a key. Then, Xs+ 1 is the
total indexes in an SL which equals to the searched-linked-list length of an FHT [15].
Fig. 34. An IT of 3 layers (or stages) with an i-path and dangling trees.
(Figure labels: index bits a0, a1, a2; i-path = 100; dangling trees marked P0(n), P1(n),
and P2(n); legend entries "BF or FF on i-path" and "f-positive on f-segment".)
The probability that an SL has no f-paths is
\[ \Pr\{X_s = 0\} = P_0(0) \cdot P_1(0) \cdot P_2(0) \cdots P_{s-1}(0), \tag{7.3} \]
because the d-trees along an i-path are mutually independent. In general, Pr{Xs=v}
follows from the same independence of the d-trees along an i-path:
\[ \Pr\{X_s = v\} = \sum_{t_0 + \cdots + t_{s-1} = v} P_0(t_0) \cdot P_1(t_1) \cdot P_2(t_2) \cdots P_{s-1}(t_{s-1}). \tag{7.4} \]
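The sums in Eqs. (7.2) and (7.4) are discrete convolutions of the per-tree distributions,
so they can be evaluated numerically once the Pi(t) values are known. The following is a
minimal Python sketch under that reading. The per-layer values P0, P1, and P2 below are
made-up illustrative numbers, not values derived from Eq. (7.1), and the function names
are hypothetical.

```python
from functools import reduce


def convolve(p, q):
    """Return the pmf of the sum of two independent counts with pmfs p and q."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def f_path_pmf(tree_pmfs):
    """Pmf of the total number of f-paths over independent trees (Eq. 7.2 or 7.4)."""
    return reduce(convolve, tree_pmfs)


def expected(pmf):
    """Mean of a pmf given as a list indexed by the count."""
    return sum(v * p for v, p in enumerate(pmf))


# Hypothetical per-layer pmfs P_i(t) for t = 0, 1, 2 (illustrative values only).
P0 = [0.98, 0.015, 0.005]
P1 = [0.99, 0.009, 0.001]
P2 = [0.995, 0.005, 0.0]

pmf_xs = f_path_pmf([P0, P1, P2])  # SL: one d-tree per layer, Eq. (7.4)
pmf_xu = f_path_pmf([P0, P0])      # UL: two children trees on layer 0, Eq. (7.2)
```

Here `pmf_xs[0]` is the product in Eq. (7.3), and `pmf_xu[1]` contains exactly the two
cases 0+1 and 1+0 mentioned for v=1.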
3. Detailed algorithm for query
A complete query operation consists of query-i for each layer i, 0≤i≤s-1, shown in
Algorithm query-i. The time complexity of this algorithm is Θ(1) under the condition
that ‖L‖ is bounded by the number of memory reads supported by the hardware
without overhead. This condition matters because, for the candidate partial indexes in
L, the number of BFs to probe is doubled, since each binary tree node has two children.
By pipelining from layer 0, a query is performed in one cycle, so that query-(s-1)
returns the complete indexes to a key table for a given query. On the last layer s-1, the
average number of complete indexes is E[Xs]+1 for an SL and E[Xu] for a UL, where
E[Xs] and E[Xu] can be derived from Eqs. (7.4) and (7.2) as
\[ E[X_s] = \sum_{t=0}^{n-1} t \cdot \Pr\{X_s = t\}, \qquad E[X_u] = \sum_{t=0}^{n} t \cdot \Pr\{X_u = t\}. \tag{7.5} \]
These expectations are considered O(1) because having even one f-path is very unlikely:
both Pr{Xs=1} and Pr{Xu=1} are o(1) because of the way a high-speed router is
designed.
Algorithm 11: query-i()
Input: M_i^on for layer i≤s-1, list L of partial indexes found up to layer i-1
       (including the i-path), and key e
Output: A set S of partial indexes A = a0···ai of i+1 bits, including f-segments
// S: set of partial paths. L = {A0, ···, An-1}
1:  S = ∅; n = ‖L‖;   // ‖L‖ is the size of L
2:  for t = 0 to n-1 do
3:      mi = 1.44 ni wi; At = L[t]; j0 = At·0; j1 = At·1; cnt0 = cnt1 = 0;
        // One memory for the BFs on layer i, with ki hash functions
4:      for u = 0 to ki-1 do
            // j0 and j1 indicate Bij
5:          if M_i^on[u][hu(e)][j0] == 1 then cnt0++;
6:          if M_i^on[u][hu(e)][j1] == 1 then cnt1++;
7:      end
        // concatenate a 0 or 1 bit at the end of At; both children may match
8:      if cnt0 == ki then S = S ∪ {At·0};
9:      if cnt1 == ki then S = S ∪ {At·1};
10: end
11: return S;   /* No memory access for a key table */
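The layer-by-layer query can be illustrated with a small Python sketch. It is only a
software model under stated assumptions: each tree node gets its own `BloomFilter`
object rather than sharing one on-chip memory per layer, the SHA-256 based hash
functions are a stand-in, and the names `BloomFilter`, `insert`, `query_layer`, and
`query` are hypothetical.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter with k hash functions over m bits (illustrative)."""

    def __init__(self, m=256, k=4):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, key):
        for t in range(self.k):
            digest = hashlib.sha256(f"{t}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def insert(layers, key, index):
    """Program the key into the BF of every node on its i-path, e.g. index '100'."""
    for i, filters in enumerate(layers):
        filters.setdefault(index[: i + 1], BloomFilter()).add(key)


def query_layer(filters, partial_indexes, key):
    """One stage: extend each partial index by every child bit whose BF says yes."""
    extended = set()
    for prefix in partial_indexes:
        for bit in "01":
            bf = filters.get(prefix + bit)
            if bf is not None and key in bf:
                extended.add(prefix + bit)
    return extended


def query(layers, key):
    """Run all stages; the result holds the i-path (if any) plus any f-paths."""
    candidates = {""}
    for filters in layers:
        candidates = query_layer(filters, candidates, key)
        if not candidates:
            break
    return candidates


layers = [dict() for _ in range(3)]  # 3 layers, i.e. 3-bit key table indexes
insert(layers, "e4", "100")
insert(layers, "e1", "001")
result = query(layers, "e4")         # always contains "100"; may contain f-paths
```

Because a Bloom filter has no false negatives, the i-path is always among the returned
indexes. Any extra entry is an f-path and costs one additional key table access.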
D. Delete Operation with Counting BFs
A BMF in [27] handles dynamic membership changes poorly, because its index table
stores a key's k hash values based on the key's neighborhood with other keys, and this
neighborhood is collected by avoiding collisions with other keys' hash values. Thus,
an index-table setup in a Bloomier filter takes O(n log n) time, implying that
a BFHT using a Bloomier filter needs the same time complexity for updating keys.
In contrast, our FPHT takes O(1) for an update. Unlike an MBHT [31] and an HIHT [32],
an FPHT adopts CBFs for dynamic updates. Since CBFs are used for the delete operation,
the insert operation needs to be modified at line 3, as the query operation does at
lines 5 and 6, to operate on counters. The details are as follows.
To remove a key with an i-path, all CBFs on the i-path need to delete the key.
Deleting the key in a CBF is as easy as decrementing the counters indexed by the hash
functions. Also, the FP for the key is reset to 0 to indicate an empty FP. Since the
i-path for the key is known, resetting the FP is easy. If the membership of the key to
remove is not known, a lookup on the key is necessary to find its i-path. In this query, if
there are any f-paths besides the i-path associated with the key, the necessary number
of key table accesses is one plus the number of f-paths. Once the CBFs and the FP are
updated for the key deletion, the i-path (or the FP index) is saved in an index pool,
so that when a new key is inserted, an index from the pool is reused as the new key's
i-path.
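The delete steps above can be sketched in Python as follows. This is a toy model, not the
hardware design: a dictionary `index_of` stands in for the FP and the key table, each
tree node has its own `CountingBloomFilter`, and all class and method names are
hypothetical.

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom filter: counters instead of bits, so keys can be removed."""

    def __init__(self, m=64, k=3):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, key):
        for t in range(self.k):
            digest = hashlib.sha256(f"{t}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.counters[pos] += 1

    def remove(self, key):
        for pos in self._positions(key):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1


class IndexedTable:
    """Toy table: one CBF per node on an i-path, plus a pool of free indexes."""

    def __init__(self, height):
        self.height = height
        self.cbfs = {}      # node prefix -> CountingBloomFilter
        self.index_of = {}  # key -> i-path (stand-in for the FP and key table)
        self.pool = [format(i, f"0{height}b") for i in range(2 ** height)]

    def insert(self, key):
        path = self.pool.pop()  # take a free index as the new key's i-path
        for i in range(1, self.height + 1):
            self.cbfs.setdefault(path[:i], CountingBloomFilter()).add(key)
        self.index_of[key] = path
        return path

    def delete(self, key):
        path = self.index_of.pop(key)   # the i-path is known, so "reset the FP"
        for i in range(1, self.height + 1):
            self.cbfs[path[:i]].remove(key)
        self.pool.append(path)          # the freed index is reused later
        return path


table = IndexedTable(height=3)
first = table.insert("a")
freed = table.delete("a")
reused = table.insert("b")
```

In this sketch the freed index is handed to the next inserted key, which mirrors the
index pool described above.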
When the random variable Z denotes the number of accesses to a key table,
the average number of memory accesses for a delete operation, given that the target key
exists (i.e. a successful deletion), is
\[ E[Z] = 1 + \sum_{v=1}^{n-1} v \cdot \Pr\{X_s = v\} = 1 + E[X_s]. \tag{7.6} \]
The delete complexity is O(1), following from the O(1) complexity of a query. The
number of indexes used to access a key table is Θ(E[Z]) on average for a successful
deletion, which is constant because E[Xs] is O(1).
Fig. 35. A sample configuration of a 4-SBF in k=2 banks. A 4-SBF represents S0
through S3. The memory size is 2×4×4.
(Figure labels: h0(e), h1(e), S0 to S3, bank 0, bank 1, selector, MUX.)
E. FPHT Optimization in a b-ary Preﬁx Tree
The IT so far is built as a binary prefix tree in a base-2 number system. Since a
BF acts as a binary predicate in an IT, a BF assigned for bit 0 in its index bits
returns 'yes' just as a BF assigned for bit 1 does. However, when a b-ary prefix tree,
b∈{2^2, 2^3, ···}, is adopted in an IT, a BF is assigned for digit x, x∈{0,···,b-1}, and a
node in a b-ary prefix tree is implemented as a b-SBF, such as the 4-SBF shown in Fig.
35. Also, the IT height, i.e. the number of pipeline stages, is reduced to s=log_b n,
and the total IT memory is reduced accordingly. Using a b-ary prefix tree requires
rewriting a key table index from the base-2 number system in base b. For instance,
index 0100_2 for e4 in a binary prefix tree is simply transformed to 10_4 in a 4-ary
prefix tree. However, this change does not disturb index addressing. Thus, memory is
saved by adopting a b-ary prefix tree without any change to the key table.
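The index rewrite amounts to grouping the binary index bits into digits of log2 b bits
each. A minimal Python sketch of that regrouping follows; the function name
`to_base_b_digits` is hypothetical, and b is assumed to be a power of two as in the text.

```python
def to_base_b_digits(binary_index, b):
    """Regroup a binary index string into base-b digits, where b is a power of two."""
    bits_per_digit = b.bit_length() - 1
    if b < 2 or 2 ** bits_per_digit != b:
        raise ValueError("b must be a power of two")
    if len(binary_index) % bits_per_digit:
        raise ValueError("index length must be a multiple of log2(b)")
    return [
        int(binary_index[i:i + bits_per_digit], 2)
        for i in range(0, len(binary_index), bits_per_digit)
    ]


digits = to_base_b_digits("0100", 4)  # index 0100 in base 2 becomes digits 1, 0 in base 4
```

The numeric value of the index is unchanged, which is why the key table itself needs no
modification.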
F. Simulation Results for an FPHT
This section analyzes the memory efficiency of three schemes: an MBHT
[31], an HIHT [32], and an FPHT.
1. Memory size in consideration of speed and scalability
Fig. 36. Memory efficiency ratios of an FPHT over an MBHT and an HIHT at various
n and w. In an FPHT, a lookup precision of a CBF is set to 6 for a 16-ary
prefix tree.
(Axes: w, lookup precision; log2 n, number of keys; memory efficiency ratio. Series: RM, RH.)
This section calculates the memory efficiency ratios among an MBHT,
an HIHT, and an FPHT to address speed and scalability. Since the authors
of [31, 32] already compared the memory efficiency of their schemes against an FHT
[15] and a BFHT [16], that comparison is not repeated here. Also, a
16-ary prefix tree is used for the FPHT optimization. The MBHT memory size is
\[ M_M = 2\beta n (w + \log_2 b) \log_b n + 2n + n \log_2 n, \quad b = 16, \]
and the HIHT memory size is
\[ M_H = 2\beta n \cdot (3(\log_2 n - 1) + w) + 2n + n \log_2 n. \]
In contrast, the FPHT memory size becomes
\[ \beta n (3 + \log_2 b) \times (\log_b n - 1) \times C + nw + n \log_2 n, \]
where log_b n is the prefix tree height and C=3 is the number of counter bits.
Fig. 36 shows two memory efficiency ratios, RM and RH, of an FPHT over
an MBHT and an HIHT based on Eqs. (3.2) and (3.5). As shown in this figure, an
MBHT is not suitable when speed and scalability are concerns. Although RM is smaller
at small w and n than at larger w and n, over the whole range of w and n an FPHT
needs less memory than an MBHT, and the highest memory efficiency ratio is 3.0.
For RH, a memory efficiency of 2.1 times is shown. The memory capacities of an MBHT
and an FPHT at the highest efficiency, with n=2^10 and w=30, are 262,964 and 87,409
bits, respectively.
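The three memory formulas can be evaluated directly. The sketch below is a plain Python
transcription of them. The value β = 1.44 is an assumption: it is not stated next to the
formulas, but it is the Bloom filter sizing constant that appears in Algorithm 11, and
with it the rounded-up results match the quoted capacities for n=2^10 and w=30. The
function names are hypothetical.

```python
import math

BETA = 1.44  # assumed value of beta; see the note above


def mbht_bits(n, w, b=16, beta=BETA):
    """MBHT memory size M_M in bits."""
    log2n = math.log2(n)
    height = log2n / math.log2(b)  # log_b n
    return 2 * beta * n * (w + math.log2(b)) * height + 2 * n + n * log2n


def hiht_bits(n, w, beta=BETA):
    """HIHT memory size M_H in bits."""
    log2n = math.log2(n)
    return 2 * beta * n * (3 * (log2n - 1) + w) + 2 * n + n * log2n


def fpht_bits(n, w, b=16, counter_bits=3, beta=BETA):
    """FPHT memory size in bits, with C counter bits per CBF cell."""
    log2n = math.log2(n)
    height = log2n / math.log2(b)  # log_b n, the prefix tree height
    return (beta * n * (3 + math.log2(b)) * (height - 1) * counter_bits
            + n * w + n * log2n)


n, w = 2 ** 10, 30
print(math.ceil(mbht_bits(n, w)), math.ceil(fpht_bits(n, w)))  # 262964 87409
print(round(mbht_bits(n, w) / fpht_bits(n, w), 1))             # 3.0
```

Under the same assumption, the ratio of the HIHT size to the FPHT size at this point
rounds to 2.1.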
2. Power comparison with TCAM for IP lookup
In addition to the memory comparison among hash mechanisms, a power comparison
between an FPHT-based IP lookup and a TCAM-based IP lookup is made. Although the
detailed hash-based IP lookup architecture with the proposed hash schemes, an HIHT
and an FPHT, will be shown in Sec. VIII, this section and the following section show
preliminary power and memory comparisons with a TCAM and a trie for IP lookup.
Fig. 37 shows the consumed energy per read clock in two IP lookup schemes: a
TCAM-based IP lookup and an FPHT-based IP lookup. The consumed energy per
read clock is measured by CACTI [35] and a tool [70]. The table size was varied from
236K to 1M entries. The first table of 236K entries is taken from [51] and the remaining
tables are randomly generated. The proposed FPHT-based IP lookup scheme consumes
51 times less power than the TCAM-based IP lookup. The TCAM's power consumption
grows steeply as the table size increases, while the proposed FPHT-based scheme uses
only a small amount of power.
Fig. 37. Consumed energy per read clock in 0.09μm process technology.
(Axes: routing table size, 236,029 to 1,000,000; energy per read clock, log scale. Series: TCAM, FPHT.)
3. Memory comparison with Trie for IP lookup
Fig. 38 shows the total memory for IP lookup in two schemes: a Tree Bitmap
[71] and an FPHT. As discussed in Table I, trie-based IP lookup schemes suffer O(W)
lookup complexity, where W is the IP address length, while hash-based schemes
provide O(1) lookup complexity. Comparing the memory efficiencies of IP lookup schemes
with different lookup complexities is not fair because hash-based schemes can provide
a higher throughput. Nevertheless, the proposed FPHT-based hash scheme provides
O(1) lookup complexity as well as memory efficiency, as follows. The table size was
varied from 236K to 1M entries. The first table of 236K entries is taken from [51]
and the remaining tables are randomly generated, as in the previous section. In each
memory calculation, other bitmap and pointer memory overheads for the hash-based and
trie-based approaches are considered. The FPHT-based IP lookup scheme consumes
1.75 times less memory than Tree Bitmap. Furthermore, as the table
size increases, the FPHT-based IP lookup scheme saves at most 2.4 times memory.
Fig. 38. Memory comparison of Tree Bitmap and an FPHT in different table sizes.
(Axes: routing table size; memory in bits, ×10^7. Series: Tree Bitmap, FPHT.)
CHAPTER VIII
HASH-BASED IP LOOKUP ARCHITECTURE
This chapter presents HIHT and FPHT IP lookup architectures based on the proposed
hashing schemes and compares their performances with contemporary IP lookup ar-
chitectures in terms of power consumption and memory overhead.
A. Hash-based IP Lookup Architecture Build
The authors of [15, 16] show that hash-based IP lookup schemes are capable of providing
better memory and power performance. However, since a hash supports only a singleton
match, either a prefix collapse (Sec. III.C.2) or a controlled prefix extension
(Sec. III.C.1) is necessary when hash schemes are applied to IP lookup. Since a
controlled prefix extension inflates the number of next-hops, a prefix collapse scheme is
a better way to build a hash-based IP lookup architecture when the proposed HIHT and
FPHT schemes are applied to IP lookup.
Fig. 39. The number of collapsed prefixes and the average number of duplicate
next-hops at various stride s. The prefix numbers for AS 65000 and AS 6447 are
233451 and 235307, respectively.
(a) The number of collapsed prefixes vs. stride s. (b) Average ratio of duplicate next-hops
vs. stride s. (Series in both panels: AS65000, AS6447.)
Fig. 39(a) shows the benefit of using the prefix collapse. In the figure, the
number of collapsed prefixes is smaller than the number of original prefixes at
every stride s. As the stride size increases, the number of collapsed prefixes decreases,
and at stride 5 it is 2.7 times smaller than the original prefix set. However, as the
stride size increases, a problem arises with the number of next-hops. When a bitmap is
used for the prefix collapse, the ratio of next-hop duplicates increases, as shown in
Fig. 39(b). For example, the duplicate ratio of 5.8 at stride 4 indicates that a bitmap
of size 2^4 has 5.8 times duplicate next-hops on average. The BGP tables used, AS 65000
and AS 6447, are obtained from [51], and other BGP tables show a similar pattern.
However, since this dissertation aims for power and memory efficiency in the hash itself,
we leave the issue of next-hop inflation open for hash-based IP lookup.
Fig. 40. IP lookup architecture with parallel Hash Lookup Engines (HLEs) for a
wildcard support. Each HLE has different c and s values.
(Figure labels: dst. IP, collapsed pref. of c bits, stride s, hash engine, key table,
perfect match?, bitmap table, parse bitmap, idx., base pnt., NH table.)
Fig. 40 illustrates a general hash-based IP lookup architecture using the prefix
collapse and the bitmap scheme. Prefixes are divided into collapsed prefixes and
bitmaps. Each HLE then saves the collapsed prefixes of the same length c in a key table
for a perfect match, and their corresponding bitmaps in a bitmap table in order to index
the next-hop table. In the figure, an HIHT or an FPHT is substituted for an HLE.
For each IP lookup operation, an HLE strips the first c bits and the following s bits
from a destination IP, hashes the c bits, and, if they match perfectly, accesses the
next-hop table by parsing the indexed bitmap. Among the perfect matches, the match with
the longest collapsed prefix is the final match for a given IP lookup.
B. Simulation Result of HIHT and FPHT-based IP Lookup Schemes
This section compares HIHT- and FPHT-based IP lookup schemes
against contemporary schemes in terms of power and memory. To address scalability
in routing table size, we consider four sizes: 236,029, 500K, 750K, and 1M.
1. Power-eﬃcient hash-based IP lookup
Fig. 41 shows the consumed energy per read clock in three schemes: a TCAM, an
HIHT shown in Sec. VI, and an FPHT shown in Sec. VII. We use CACTI [35] and
a tool [70] to measure the consumed energy per read clock. The table size was varied
from 236K to 1M entries. The first table of 236K entries is taken from [51] and the
remaining tables are randomly generated. For the first table, the proposed FPHT-based
scheme consumes 51 times less power than the TCAM-based IP lookup and 1.5 times less
power than the HIHT-based scheme. The TCAM's power consumption grows steeply
with the table size, while our hash-based schemes, an HIHT and an FPHT, use only a
small amount of power. Furthermore, an FPHT-based scheme uses less power than an
HIHT-based scheme at all table sizes, since an FF uses less power than a BF, as
discussed in Sec. B.
Fig. 41. Consumed energy per read clock in 0.09μm process technology.
(Axes: routing table size; energy (nJ) per read clock, log scale. Series: TCAM, HIHT, FPHT.)
2. Memory-eﬃcient hash-based IP lookup
Fig. 42 shows the total memory size for IP lookup in three schemes: a Tree Bitmap
[71], an HIHT, and an FPHT. As discussed in Table I, trie-based IP lookup schemes
suffer O(W) lookup complexity, where W is the IP address length, while hash-based
schemes provide O(1) lookup complexity. Comparing the memory efficiencies of IP lookup
schemes with different lookup complexities is not fair because hash-based schemes can
provide a higher throughput. Nevertheless, our hash-based schemes provide O(1)
lookup complexity as well as memory efficiency, as follows. The table size was varied
from 236K to 1M entries. The first table of 236K entries is taken from [51] and the
remaining tables are randomly generated, as in the previous section. In each memory size
calculation for the hash-based and trie-based approaches, other bitmaps, pointer memory
overhead, and hash-engine memory are considered. For the first table, the HIHT-based
scheme consumes 1.8 times less memory than the Tree Bitmap scheme, and the
FPHT-based scheme uses 1.1 times less memory than the HIHT-based scheme.
In conclusion, the FPHT-based scheme is the most memory-efficient IP lookup scheme
in this result. Furthermore, as the table size increases, the FPHT-based scheme saves
at most 2.4 times memory.
Fig. 42. Memory size comparison of Tree Bitmap, an HIHT, and an FPHT in different
table sizes.
(Axes: routing table size; memory in bits, ×10^7. Series: Tree Bitmap, HIHT, FPHT.)
CHAPTER IX
HYBRID CAMS OF CAM AND SRAM FOR IP LOOKUP
In this chapter, we propose a hybrid CAM (HCAM) IP lookup architecture for high
throughput and power efficiency. Our approach adopts both a prefix collapse scheme
and circuit-level redundancy in the multiple ports of a Bloom filter (BF). A prefix
collapse reduces the number of prefixes, and a collapsed prefix no longer needs wildcard
matching. The collapsed prefixes (CPs) can therefore be put in a CAM capable of
deterministic lookup, which is more hardware-efficient than a TCAM in power and in the
number of transistors per cell. The details are as follows.
A. HCAM-based IP Lookup Architecture
Using a TCAM for a prefix match has been considered prohibitively expensive, despite
the TCAM's advantages of deterministic lookup and partitioning for multiple lookups.
This section presents the details of the HCAM-based architecture with high throughput
and power efficiency.
Fig. 43. HCAM-based IP lookup architecture for a prefix set. Stride s=2. The
collapsed prefix lengths, d1, d2, d3, are 2, 5, and 8, respectively.
(Figure labels: packets pkt1 to pkt3; on-chip BLD, Qs and CAMs; SRAMs with NH indexes;
table for the longest prefix match; CAM for CPs, holding 10, 11, 10010, 10110, 10011010,
and 10101010; SRAM for STBs. Prefix set: P1: 100*, P2: 101*, P3: 1101*, P4: 100101*,
P5: 1011001*, P6: 100110101*, P7: 101010100*, P8: 1010101001*.)
A pipeline in an HCAM-based scheme has three stages: a distributor
with BFs, CAMs with queues, and SRAMs, as in Fig. 43. To provide a high throughput,
multiple pipelines can work in parallel. In the figure, 3 packets are
fed into a distributor together, and the distributor disseminates the packets to their
associated queues. A queue is a buffer zone between the fast distributor stage and the
slow CAM stage, as in [36, 38]. A Bloom filter is well known for binary approximate
membership queries [49], and it filters out lookup queries that are irrelevant to the
collapsed prefixes saved in a CAM block. Thus, a high-power-consuming CAM query
is avoided. Once a CAM block entry perfectly matches a collapsed prefix, we
retrieve the SRAM block entry at the same index. The retrieved entry is the
STB associated with the collapsed prefix, used for the stride match. Thus, the prefix
match is achieved by performing CAM and STB matches.
To complete the longest prefix match, a table is used to record all CAM
matches for a given lookup. Once a lookup is forwarded to its associated queues by a
distributor, a record of the match statuses in all pipelines is created in the table.
Whenever a match is found in any pipeline, the match is recorded in the lookup's record.
Once a found match is known from the record to be the longest prefix match, the packet
associated with the lookup is forwarded without waiting for the other query results of
the lookup.
By IP lookup policy, a router forwards packets based on a prefix set while
preserving the packet order. Although packets can be reordered due to queueing delays
in parallel lookups, the reordering does not disturb the order of packets belonging to
a single flow. A flow is defined as a set of packets between applications on two end
hosts identified by two IP addresses, and a prefix represents a set of IP addresses to
which packets are forwarded if the prefix is the longest match. As long as the match
order of the packets associated with the longest prefix is preserved, the order of the
outgoing packets is preserved as well. Since our HCAM scheme preserves the packet order
in a pipeline queue by selectors, the match order after an SRAM is the same as the order
of packets in a flow.
B. Preﬁx Transformation with CAM & SRAM
The proposed PC uses one bitmap and one pointer to encode a subtrie in a uni-bit
trie, so that the number of prefixes is reduced and the need for 'don't care' (or *) bit
comparison is eliminated. The collapsed prefixes can then be put in a CAM, which
has less hardware complexity than a TCAM and provides only a singleton match, but
supports the same parallel lookups through partitioning. To support a prefix match
on the stride bits, we build an STB in an SRAM, so that the overhead of TCAM usage
disappears. The details of a PC and an STB are given in the following sections.
1. Preﬁx collapse
Fig. 44. A sample prefix set and a subtrie in a uni-bit trie for the set.
(Prefix set: P1: 100*, P2: 101*, P3: 1101*, P4: 100101*, P5: 1011001*, P6: 100110101*,
P7: 101010100*, P8: 1010101001*. Figure labels: root, node a, P1, P2, depth 2, depth 5.)
Given a uni-bit trie for a prefix set, our PC encodes every subtrie that is rooted
at a specific trie depth and whose depth is s, so that the prefixes in a subtrie share a
common path from the trie root to the subtrie root. Suppose there are 8 prefixes and
a subtrie with root node a at trie depth 2 is encoded as in Fig. 44. Since prefixes P1
and P2 share a common prefix part, i.e. '10', one collapsed prefix '10' is used in a
CAM. Thus, the number of collapsed prefixes for a CAM can be far smaller than the
number of original prefixes.
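The collapse can be illustrated on the sample prefix set with a short Python sketch. It
assumes that subtrie roots sit at depths first_depth, first_depth+(s+1),
first_depth+2(s+1), and so on, which is inferred from the collapsed prefix lengths 2, 5,
and 8 given for s=2 in Fig. 43. The function name `collapse_prefixes` and the handling
of prefixes shorter than the first depth are assumptions.

```python
from collections import defaultdict


def collapse_prefixes(prefixes, first_depth, stride):
    """Map each collapsed prefix (CP) to the stride parts of the prefixes it covers."""
    groups = defaultdict(list)
    for prefix in prefixes:
        if len(prefix) < first_depth:
            raise ValueError(f"prefix {prefix!r} is shorter than the first subtrie depth")
        # A subtrie rooted at depth d covers prefix lengths d .. d + stride.
        level = (len(prefix) - first_depth) // (stride + 1)
        depth = first_depth + level * (stride + 1)
        groups[prefix[:depth]].append(prefix[depth:])
    return dict(groups)


# Sample prefix set P1..P8 from Fig. 44, without the trailing '*'.
PREFIXES = ["100", "101", "1101", "100101", "1011001",
            "100110101", "101010100", "1010101001"]

collapsed = collapse_prefixes(PREFIXES, first_depth=2, stride=2)
for cp, strides in sorted(collapsed.items(), key=lambda item: len(item[0])):
    print(cp, strides)
```

For this set the eight prefixes collapse into six CPs, and P1 and P2 both fall under the
CP '10' with stride parts '0' and '1'.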
Fig. 45. The number of collapsed prefixes and the number of transistors at various
stride s.
(a) The number of collapsed prefixes (×10^5) vs. stride s. Series: AS39202 with PC, AS39202.
(b) Memory size in terms of the number of transistors (×10^7) vs. stride s. Series: TCAM,
HCAM (CAM+SRAM), SMT (SRAM).
The benefit of the PC is shown in Fig. 45(a) by counting the collapsed prefixes of
routing table AS 39202 [72], which has 252,951 prefixes. As the stride size gets larger,
the number of collapsed prefixes gets smaller. At stride 5, the number of collapsed
prefixes is 4.5 times smaller than the number of original prefixes, which is marked as a
line; for stride 5, 66, 3376, 52254, and 63 collapsed prefixes are found at depths 7, 13,
19, and 25, respectively. Other stride sizes do not cause any significant reduction.
2. A complete preﬁx match through an STB in SRAM
Since a CAM does not support a prefix match, a supplementary match is necessary
even after a CAM match occurs. Given a collapsed prefix, there are 2^(s+1)-1 possible
prefixes at stride s, and they can be represented in a subtrie bitmap. Fig. 46 a) shows
two prefix strides at stride s=3 and a stride tree for them. In a stride tree, a node
is marked as '1' when there is a corresponding prefix stride. Thus, when scanning the
nodes' bits in the horizontal order followed by the vertical order, we get a 15-bit
STB (00100000,0010,01,0) for three prefix strides.
Fig. 46. A stride tree for 2 prefix strides and an index method to an NH table.
a) Stride tree for the stride set P1: 1* and P2: 101, with STB (00000100, 0000, 01, 0) for
2 prefixes and the scan order marked.
b) Index calculation for a given packet stride (pkt stride: 100): index to find bit 1 and
scan to sum the bits of value 1; the sum Σ plus a base indexes the NH table.
Given an STB for the stride s, there are s+1 groups of bits, each holding the
bits on the same layer of a stride tree. In each group, bits are scanned while the
number of bits of value 1 is counted, and when a bit indexed by the most significant
bits of a stride is 1, the counting stops. Then the summed number of bits of value 1, Σ,
becomes a relative index to an NH table. Fig. 46 b) shows such an index calculation
in STB (00100000,0010,01,0) for the packet stride 100. Once a CAM block match
happens, the match's index in the CAM block is used to access an STB in the
corresponding SRAM block. Once the STB is known after one SRAM access, the index Σ
can be calculated within one CPU clock. Procedure stride match
shows the detailed steps.
Procedure stride match
Input: Stride S of a0···a(s-1), stride size s, and STB B of b0···b(2^(s+1)-2)
Output: Relative index Σ to an NH table, or "no match"
scan_B = 0;   // base in B of the group being checked
for (idx_B = s-1; idx_B ≥ -1; idx_B--) do
    idx_S = a0···a(idx_B);   // value of the first idx_B+1 stride bits; 0 if idx_B = -1
    if B[scan_B + idx_S] == 1 then   // match happens
        sum = 0;
        for (idx_scan = 0; idx_scan ≤ scan_B + idx_S; idx_scan++) do
            if B[idx_scan] == 1 then sum++;
        end
        return sum;
    end
    scan_B = scan_B + 2^(idx_B+1);   // move the base to the next, shorter group
end
return "no match";
As to the subtrie memory in Fig. 44, the proposed PC needs a (2^(2+1)-1)-bit STB
with one base pointer to an NH table for a 2-bit stride subtrie. In general, the
STB size is 2^(s+1)-1 bits for a subtrie of s stride bits. By comparison, the subtrie
size for a segmented multibit trie (SMT) in [20] is 19 bits, which is 2.7 times more than
the STB size. Generally, an SMT needs 3k+1(=2k+k+k+1) bits for k neighboring nodes. In
addition, an SMT needs two pointers to maintain connectivities among SMTs and an
NH table.
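A small Python sketch can make the STB layout and the index calculation concrete. It
reads the relative index as the number of 1 bits from the start of the STB up to and
including the matched bit, which is one interpretation of the scanning described above
and is labeled here as an assumption. The helper names `build_stb` and `stride_match`
are hypothetical, and the STB is modeled as a string of '0' and '1' characters.

```python
def build_stb(prefix_strides, s):
    """Build the (2^(s+1) - 1)-bit STB: longest group first, root bit last."""
    groups = []
    for length in range(s, -1, -1):
        group = ["0"] * (2 ** length)
        for p in prefix_strides:
            if len(p) == length:
                group[int(p, 2) if p else 0] = "1"
        groups.append("".join(group))
    return "".join(groups)


def stride_match(stb, stride):
    """Return the relative NH index of the longest matching prefix stride, or None."""
    base = 0
    for length in range(len(stride), -1, -1):
        idx = int(stride[:length], 2) if length else 0
        pos = base + idx
        if stb[pos] == "1":
            # Assumed meaning of the sum: 1 bits up to and including the match.
            return stb[: pos + 1].count("1")
        base += 2 ** length  # skip this group and try one stride bit fewer
    return None


# The STB used in the text for stride size s=3.
stb_text = build_stb(["010", "10", "1"], 3)   # "00100000" "0010" "01" "0"
index = stride_match(stb_text, "100")         # longest match is '10*'
```

With this reading, the packet stride 100 matches the 2-bit prefix stride 10 and the
relative index is 2, since one 1 bit precedes the matched bit in the STB.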
This pointer overhead is manifest in Fig. 45(b). In general, as the stride gets
larger, the numbers of transistors for an SMT scheme and an HCAM are reduced
significantly, except at stride 5, and the numbers are at most 3.9 times smaller than
that of a native TCAM scheme. Comparing an SMT and an HCAM, an
HCAM uses 1.7 times less memory at stride 4, because an SMT is designed to encode
a lightly loaded subtrie and to maintain connectivity with others through pointers.
C. A Bloom Filter-based Lookup Distributor
STCAM [37] and BTCAM [38] schemes use a distributor that forwards multiple packets
per clock cycle. Distributing packets to their corresponding pipelines in this way is
necessary for high throughput. Here, the distributor adopts a multi-tiered BLD with a
set of nc BFs [30], each forwarding a packet to a corresponding pipeline. The BLD is
designed to distribute multiple lookups per cycle and to remove unnecessary lookups
from the queues for power-efficient lookup.
The total memory usage of various contemporary schemes is compared with that of the
proposed HCAM-based scheme in Fig. 47. For the other schemes, only the TCAM or SRAM
blocks that store prefixes for the prefix match are counted, not the memory block
selectors. However, the BFs' memory is included in the HCAM
memory calculation. Although the BTCAM shows the highest throughput among the
other contemporary schemes, our HCAM uses 2.8 times less memory while achieving
the same throughput as the BTCAM.
Fig. 47. The memory comparison of all schemes in terms of a transistor. Lookup
precision w=10. Note that 'HCAM' includes all CPs, STBs, and BFs.
(Axes: scheme, one of CTCAM, STCAM, UTCAM, BTCAM, HCAM; number of transistors, ×10^8.
Routing table: AS3257.)
D. Experimental Results for an HCAM-based Scheme
This section presents an analysis on throughput and power eﬃciencies for an HCAM-
based scheme and other contemporary schemes.
1. Throughput
It is difficult to analyze the HCAM's throughput theoretically because of
the non-determinism of the lookup traffic. However, the upper and lower bounds of
its performance can be estimated based on the following lookup traffic assumptions.
1) Queuing theory is used to model the lookup engine, and the arrival
process of the incoming IP addresses is assumed to be a Poisson process with average
arrival rate λ. 2) The service process of the lookup operation is deterministic
with a constant service rate μ, due to the CAM's deterministic lookup. Then the service
time to process a lookup in a queue is Ts=1/μ, and it is independent of the
arrival process. 3) The queue in each pipeline is finite and holds nq lookup requests.
If nc CAM blocks perform independent IP lookups, the system
can be modeled as an M/D/nc/nq·nc queue. In this case, the lookup traffic can
always be balanced among all nc CAMs. Thus, the M/D/nc/nq·nc queuing model gives
the upper throughput bound.
However, the lower throughput bound is more interesting, since it determines whether
the proposed scheme is practical in the field. By neglecting the adaptive load
balancing process and assuming that the traffic is evenly distributed to the nc CAMs
by the BFs, an HCAM can be modeled as a network of nc independent and identically
distributed M/D/1/nq queues for the nc pipelines, as in Fig. 48.
Fig. 48. Queuing model of nc pipelines in an HCAM. nc=3.
(Figure labels: arrival rate λ; three BFs, marked B; three queues of size nq with arrival
rates λ/3 and λ/3+f; service rate μ for each queue.)
Now, one of the identically distributed M/D/1/nq queues is analyzed, taking into
account the increased arrival rate in a queue due to a BF's f-positive. That
is, the arrival rate λ/nc is increased by the f-positive probability f, because a BF may
falsely assign a lookup to a queue. Once such a lookup is in a queue, a CAM block needs
to perform a lookup operation for the unsuccessful lookup,
and this consequently undermines throughput and wastes power. Thus, the traffic
intensity of each queue is defined as
\[ \rho' = (1 + f)\,\lambda / n_c \times T_s, \tag{9.1} \]
while the traffic intensity of successful lookups to a queue is ρ=λ/nc×Ts.
Let \(\{Q_i\}_{i=1}^{\infty}\) be the stochastic process of the number of IP addresses
in the queue at the time of the i-th arrival. Then a queue's loss probability, which can
be derived from [36, 73, 74], is
\[ P_L = P(Q = n_q) = \frac{1 + (\rho' - 1)\,\alpha_{n_q}(\rho')}{1 + \rho'\,\alpha_{n_q}(\rho')}, \tag{9.2} \]
where
\[ \alpha_{n_q}(\rho') = \sum_{i+m=n_q-2} \frac{e^{\rho'(i+1)} (-1)^m \rho'^m (i+1)^m}{m!}, \quad n_q \ge 2. \tag{9.3} \]
Now, since processing ULs is considered wasted lookup time in a CAM block,
the throughput of concern, Goodput, is defined as
\[ \mathrm{Goodput} = \rho (1 - P_L), \tag{9.4} \]
because with probability 1-PL that a queue is not full, a CAM block processes
successful lookups at traffic intensity ρ, not ρ′. The overall goodput with nc
CAM blocks is obtained by multiplying Eq. (9.4) by nc.
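Eqs. (9.1) through (9.4) can be evaluated numerically for a single queue. The Python
sketch below is a direct transcription of those equations for illustration. The function
names are hypothetical, and ρ is passed in directly instead of being computed from λ, nc,
and Ts.

```python
import math


def alpha(rho_prime, n_q):
    """Eq. (9.3): sum over all i, m >= 0 with i + m = n_q - 2."""
    if n_q < 2:
        raise ValueError("Eq. (9.3) is stated for n_q >= 2")
    total = 0.0
    for m in range(n_q - 1):  # m = 0 .. n_q - 2
        i = n_q - 2 - m
        total += (math.exp(rho_prime * (i + 1)) * (-1) ** m
                  * rho_prime ** m * (i + 1) ** m / math.factorial(m))
    return total


def loss_probability(rho_prime, n_q):
    """Eq. (9.2): probability that an arriving lookup finds the queue full."""
    a = alpha(rho_prime, n_q)
    return (1 + (rho_prime - 1) * a) / (1 + rho_prime * a)


def goodput(rho, f, n_q):
    """Eq. (9.4) for one queue, with rho' = (1 + f) * rho from Eq. (9.1)."""
    rho_prime = (1 + f) * rho
    return rho * (1 - loss_probability(rho_prime, n_q))


# Per-queue goodput for the f values on the axis of Fig. 49, with rho=0.95 and n_q=5.
for exponent in (3, 7, 10):
    print(exponent, goodput(0.95, 2.0 ** -exponent, 5))
```

Multiplying the per-queue value by nc gives the overall goodput, and a smaller f gives a
larger value because fewer unsuccessful lookups occupy the queue.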
In addition to the theoretical analysis, a series of experiments is made to
measure the throughput performance of an HCAM-based scheme. Since it is difficult to
obtain a BGP table together with its corresponding IP trace, an SDA trace from [58]
is used, and prefixes are extracted from its packet streams by treating each unique
destination IP as a prefix. In the experiment runs, it is assumed that a distributor
disseminates lookup requests fast enough that the queues of successful or unsuccessful
lookups are full, and that a CAM block processes one lookup from a queue in one clock.
Also, four CAM blocks are used, each with a queue size nq=5. Fig. 49 shows Goodput as
defined by Eq. (9.4) and a CAM block's measured throughput, defined as the number of
SLs in a queue over the total number of clocks needed to process the packets. The figure
shows that the smaller the f-positive probability f is, the higher the achieved
throughput. Also, the total Goodput of the 4 CAM blocks is marked as 3.7.
Fig. 49. Goodput vs. measured throughput of a CAM block in an SDA trace. ρ=0.95.
(Axes: f at 2^-3, 2^-7, and 2^-10; throughput from 0.7 to 1. Series: SDA, Goodput.)
2. Power
Using the TCAM and CAM modeling tools [35, 70], we measured the total energy
in one clock and the individual energy for a single lookup in a TCAM or CAM block
for three approaches: a naive TCAM (NTCAM), a UTCAM, and an HCAM. These
energy consumptions are shown in Fig. 50 for the AS 3257 and AS 3333 routing tables.
The NTCAM in the figure provides only one lookup over the entire prefix set, while
a UTCAM and an HCAM can provide multiple lookups with 16 TCAM or CAM
blocks, respectively. To make the number of blocks in the UTCAM and the HCAM equal,
the UTCAM and HCAM block sizes are set to 14K and 6K entries, respectively. In this
configuration, the same throughput can be achieved. On average, an HCAM saves 3.6
and 4.6 times the total energy compared to a UTCAM and an NTCAM, respectively,
even though an HCAM and a UTCAM have the same number of blocks. The power usage
can be calculated by dividing the consumed energy by the lookup access time, which
depends on the process technology of memory chip fabrication.
Fig. 50. a) Total energy consumption in one clock for an NTCAM, a UTCAM, and an
HCAM. Symbols 'N', 'U.14', and 'H.6' denote NTCAM with a block of whole
prefixes, UTCAM with 16 blocks of 14K prefixes, and HCAM with 16 blocks
of 6K prefixes, respectively. 0.13μm process technology is used. b) The energy
consumption for a single lookup operation in a block for the three schemes.
(Axes: total energy (nJ) in panel a and individual energy (nJ) in panel b, for N, U.14,
and H.6 on AS3257 and AS3333.)
CHAPTER X
SUMMARY
A. Conclusion
It was discussed that the existing hash schemes for packet processing, such as an FHT,
a BFHT, and Peacock hashing, suffer from key duplicates, complicated updates, and
setup failures, and that they do not scale in size or speed. To overcome
these problems, one packet classifier and three hashing schemes are proposed for
large-scale and high-speed packet processing: an MPC, an MBHT, an HIHT, and an FPHT.
An MPC is proposed by reconfiguring BFs into small-sized BFs and large-sized
BFs in a multi-tiered way without memory overhead, compared to a PPC. By Linear
Property 1 in Sec. III.A, it is shown how an MPC is built with the same memory
capacity as that of a PPC in Sec. A. It is observed that the number of
fabricated read ports in the BFs' memory, as well as the MPC area cost, is
reduced with the same memory. In simulation with NLANR's IP traces for flow
identification, an MPC shows higher efficiency than a PPC in all traces, by up
to 2.0 times in throughput and 4.2 times in power.
Also, an MBHT, a novel hash architecture, is proposed that generates indexes
to a key table with a set of MBFs in a base-b number system. The MBFs work in
a pipeline during a query so that a subset of them in row i determines $a_i$,
which is one part of the whole index address $A_b = a_0 \cdots a_{r-1}$ in the
base-b number system. From Lemmas 1 and 2, it is shown that adopting a larger
base number system saves significant on-chip memory compared with an LHT and
an FHT, and that base-23 is the starting point of better memory efficiency for
an MBHT, as shown in Fig. 23.
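How per-row digits compose into a single key-table address can be sketched in
a few lines of Python. This is only an illustration of base-b positional
arithmetic and does not model the MBFs themselves; the function names and the
choice of $a_0$ as the most significant digit are assumptions.

```python
def digits_to_index(digits, base):
    """Combine base-b digits a_0 ... a_{r-1} into one table index.

    Assumption: a_0 is the most significant digit.
    """
    index = 0
    for d in digits:
        if not 0 <= d < base:
            raise ValueError("digit out of range for the given base")
        index = index * base + d
    return index


def index_to_digits(index, base, r):
    """Split a table index back into r base-b digits (a_0 first)."""
    digits = []
    for _ in range(r):
        digits.append(index % base)
        index //= base
    if index:
        raise ValueError("index does not fit in r digits")
    return digits[::-1]


# Hypothetical example: r = 3 pipeline rows in base 8.
print(digits_to_index([1, 2, 3], 8))
print(index_to_digits(83, 8, 3))
```

With r digits in base b, the address space covers b**r key-table entries, so a
larger base needs fewer pipeline rows for the same table size.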
Thirdly, a novel hash architecture with two HITs is proposed, generating
indexes to a key table with a set of BFs. The BFs in the two HITs work
systematically, that is, in a pipelined and hierarchical fashion, to minimize
the number of indexes. Only one off-chip memory access is required, in
addition to achieving efficiency in on-chip access. For insert, an i-path is
assigned to a key, and one BF on each layer is involved in encoding the key in
one of the HITs. For query, one on-chip memory module for each layer is probed
for candidate BFs having their base indexes on the memory derived from
Eq. (3.2). After the last probing in layer s-1, the returned indexes are used
for a perfect match in an on-chip key table, so that a deterministic Θ(1)
lookup is guaranteed. For delete, rotating the two HITs provides a seamless
update of keys without counters, which would cost four times the memory, so
that only half of the memory is used.
As the last hash scheme, an FPHT is proposed. By using CBFs in a binary search
for a key's fingerprint and utilizing the keys' FF in a high-precision query
for a high-speed router, the FPHT produces an i-path and no f-path to a key
table with high probability and memory efficiency. In a throughput comparison
against Peacock hashing, it was shown that while Peacock hashing suffers from
a lower throughput in a UL, the FPHT throughput is proportional to the number
of threads regardless of the kind of lookup.
In hash-based IP lookup architectures with an HIHT or an FPHT, it is observed
that an FPHT-based IP lookup reduces power by a factor of 51 compared to
TCAM-based IP lookup and memory by a factor of 1.8 compared to trie-based IP
lookup.
In addition to these power- and memory-efficient hash schemes, a hybrid CAM is
also proposed, in which a high-performance lookup is achieved by parallel
lookups among CAM and SRAM blocks. In an HCAM, a prefix is broken into a
collapsed prefix in CAM and a stride in SRAM. The prefix collapse reduces the
number of prefixes, which in turn reduces memory usage by a factor of 2.8.
High throughput is achieved by storing the collapsed prefixes in partitioned
CAMs that perform multiple IP lookups per cycle. A stride tree bitmap with a
matched collapsed prefix completes the longest prefix match.
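The split of a prefix into a collapsed part and a stride remainder can be
sketched in software as follows. This is a minimal illustration under stated
assumptions, not the HCAM design itself: the stride width of 4 bits, the
collapse rule (truncating a prefix to a multiple of the stride), the class and
function names, and the use of Python dictionaries in place of CAM and SRAM
blocks are all hypothetical.

```python
STRIDE = 4  # hypothetical stride width in bits


def _slot(remainder: str) -> int:
    """Bit position of a remainder (0..STRIDE-1 bits) in the tree bitmap."""
    return (1 << len(remainder)) - 1 + (int(remainder, 2) if remainder else 0)


class HybridTable:
    """Dictionary stand-ins for the CAM (exact match) and SRAM (stride) parts."""

    def __init__(self):
        self.cam = {}   # collapsed prefix -> bitmap of stored remainders
        self.sram = {}  # (collapsed prefix, slot) -> next hop

    def insert(self, prefix: str, next_hop: str) -> None:
        # Assumed collapse rule: cut the prefix at a multiple of STRIDE.
        cut = len(prefix) // STRIDE * STRIDE
        collapsed, remainder = prefix[:cut], prefix[cut:]
        slot = _slot(remainder)
        self.cam[collapsed] = self.cam.get(collapsed, 0) | (1 << slot)
        self.sram[(collapsed, slot)] = next_hop

    def lookup(self, addr: str):
        # Try the longest collapsed prefix first, then shorter ones.
        for cut in range(len(addr) // STRIDE * STRIDE, -1, -STRIDE):
            collapsed = addr[:cut]
            bitmap = self.cam.get(collapsed)
            if bitmap is None:
                continue
            tail = addr[cut:cut + STRIDE - 1]
            # Within the stride tree, prefer the longest stored remainder.
            for n in range(len(tail), -1, -1):
                slot = _slot(tail[:n])
                if (bitmap >> slot) & 1:
                    return self.sram[(collapsed, slot)]
        return None


table = HybridTable()
table.insert("1011", "A")
table.insert("101100", "B")
table.insert("10", "C")
print(table.lookup("10110011"), table.lookup("10111111"), table.lookup("10000000"))
```

In this sketch the prefixes 1011 and 101100 share one collapsed entry, which
mirrors how collapsing reduces the number of stored prefixes. The sequential
loop over collapsed lengths stands in for what partitioned CAM blocks would
search in parallel.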
B. Future Work
Since hashing provides only a singleton match for a one-dimensional key, any
hash-based packet processing application needs a lookup-key transformation for
its application domain. For instance, since IP lookup needs a prefix match,
hash-based IP lookup needs prefix expansion or collapse, as discussed in
Sec. C. Although this dissertation proposed one packet classifier and three
hashing schemes shown to be memory and power efficient, these all perform a
one-dimensional singleton match. As future work, a hashing scheme for a
two-dimensional key in packet classification will be considered.
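The lookup-key transformation mentioned above can be illustrated with prefix
expansion, which turns a prefix match into an exact match that a hash table
can serve. The following Python sketch is a simplified illustration with
hypothetical function names and a single target length; it is not the
expansion procedure used in this dissertation.

```python
def expand_prefix(prefix: str, target_len: int) -> list:
    """Enumerate all bit strings of target_len that start with prefix."""
    if len(prefix) > target_len:
        raise ValueError("prefix is longer than the target length")
    free = target_len - len(prefix)
    if free == 0:
        return [prefix]
    return [prefix + format(i, "0{}b".format(free)) for i in range(1 << free)]


def build_table(routes: dict, target_len: int) -> dict:
    """Build an exact-match table from prefix routes.

    Shorter prefixes are inserted first so that longer (more specific)
    prefixes overwrite them, which preserves longest-prefix semantics.
    """
    table = {}
    for prefix in sorted(routes, key=len):
        for key in expand_prefix(prefix, target_len):
            table[key] = routes[prefix]
    return table


# Hypothetical 3-bit example.
routes = {"1": "A", "101": "B"}
table = build_table(routes, 3)
print(table)
```

Expansion multiplies the number of stored keys, which is the memory cost that
motivates the collapse alternative.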
Power- and memory-efficient hash mechanisms have been shown in this
dissertation. However, a review of the importance of the throughput metric in
a high-speed router implementation encourages us to consider mapping an
m-trie, which was developed for IP lookup, onto multiple pipelines. Since the
principle of a pipeline is to deliver one result per clock cycle, multiple
m-trie-mapped pipelines can give a multi-fold throughput for IP lookup or
packet classification.
REFERENCES
[1] K. G. Coffman and A. M. Odlyzko, Internet Growth: Is There a “Moore’s Law”
for Data Traffic?, Handbook of Massive Data Sets, Kluwer, New York, 2002.
[2] M. Gray, (1996), [Online]. Available:
http://www.mit.edu/people/mkgray/net/internet-growth-summary.html.
[3] E. Spitznagel, D. Taylor, and J. Turner, “Packet Classiﬁcation Using Extended
TCAMs,” in ICNP ’03: Proceedings of the 11th IEEE International Conference
on Network Protocols, 2003, p. 120.
[4] V.C. Ravikumar and R.N. Mahapatra, “TCAM Architecture for IP Lookup
Using Preﬁx Properties,” MICRO, IEEE, vol. 24, no. 2, pp. 60–69, 2004.
[5] V. C. Ravikumar, R. N. Mahapatra, and L. N. Bhuyan, “EaseCAM: An Energy
and Storage Eﬃcient TCAM-Based Router Architecture for IP Lookup,” IEEE
Trans. Comput., vol. 54, no. 5, pp. 521–533, 2005.
[6] K. Lakshminarayanan, A. Rangarajan, and S. Venkatachary, “Algorithms for
Advanced Packet Classiﬁcation with Ternary CAMs,” in SIGCOMM ’05: Pro-
ceedings of the 2005 Conference on Applications, Technologies, Architectures,
and Protocols for Computer Communications, 2005, pp. 193–204.
[7] V. Srinivasan and G. Varghese, “Fast Address Lookups Using Controlled Preﬁx
Expansion,” ACM Trans. Comput. Syst., vol. 17, no. 1, pp. 1–40, 1999.
[8] A. Basu and G. Narlikar, “Fast Incremental Updates for Pipelined Forwarding
Engines,” IEEE/ACM Trans. Netw., vol. 13, pp. 690–703, 2005.
[9] S. Sahni and K.S. Kim, “Eﬃcient Construction of Multibit Tries for IP Lookup,”
IEEE/ACM Trans. Netw., vol. 11, no. 4, pp. 650–662, 2003.
[10] S. Sahni and K.S. Kim, “Eﬃcient Construction of Pipelined Multibit-trie
Router-Tables,” IEEE Trans. Comput., vol. 56, no. 1, pp. 32–43, 2007.
[11] A.C. Snoeren, “Hash-based IP Traceback,” in SIGCOMM ’01: Proceedings of
the 2001 Conference on Applications, Technologies, Architectures, and Protocols
for Computer Communications, 2001, pp. 3–14.
[12] S. Dharmapurikar, P. Krishnamurthy and D.E. Taylor, “Longest Preﬁx Match-
ing Using Bloom Filters,” in SIGCOMM ’03: Proceedings of the 2003 Confer-
ence on Applications, Technologies, Architectures, and Protocols for Computer
Communications, 2003, pp. 201–212.
[13] S. Dharmapurikar, P. Krishnamurthy, T.S. Sproull, and J.W. Lockwood, “Deep
Packet Inspection Using Parallel Bloom Filters,” in MICRO 37: Proceedings
of the 37th Annual ACM/IEEE International Symposium on Microarchitecture,
New York, 2004, pp. 52–61.
[14] F. Chang, W-C. Feng, and K. Li, “Approximate Caches for Packet Classiﬁca-
tion,” in INFOCOM 2004. 23rd Annual Joint Conference of the IEEE Computer
and Communications Societies. Proceedings IEEE, Hong Kong, China, 2004,
vol. 4, pp. 2196–2207.
[15] H. Song, S. Dharmapurikar, J. Turner, and J. Lockwood, “Fast Hash Table
Lookup Using Extended Bloom Filter: An Aid to Network Processing,” in SIG-
COMM ’05: Proceedings of the 2005 Conference on Applications, Technologies,
Architectures, and Protocols for Computer Communications, 2005, pp. 181–192.
[16] J. Hasan, S. Cadambi, V. Jakkula, and S. Chakradhar, “Chisel: A Storage-
Eﬃcient, Collision-free Hash-based Network Processing Architecture,” in ISCA
’06: Proceedings of the 33rd International Symposium on Computer Architec-
ture, 2006, pp. 203–215.
[17] D. Guo, J. Wu, G. Chen, and X. Luo, “Theory and Network Applications of
Dynamic Bloom Filters,” in INFOCOM 2006. 25th Annual Joint Conference
of the IEEE Computer and Communications Societies. Proceedings IEEE, 2006,
pp. 1233–1242.
[18] J. Moscola, D.V. Schuehler, and J.W. Lockwood, “Architecture for a Hardware-
based TCP/IP Content Scanning System,” in Hot Interconnect: IEEE Sympo-
sium on High Performance Interconnects, 2003.
[19] S. Dharmapurikar, H. Song, J. Turner and J. Lockwood, “Fast Packet Classiﬁ-
cation Using Bloom Filters,” in ANCS ’06: Proceedings of the 2006 ACM/IEEE
Symposium on Architecture for Networking and Communications Systems, San
Jose, 2006, pp. 61–70.
[20] H. Song, J. Turner, and S. Dharmapurikar, “Packet Classiﬁcation Using Coarse-
grained Tuple Spaces,” in ANCS ’06: Proceedings of the 2006 ACM/IEEE
Symposium on Architecture for Networking and Communications Systems, San
Jose, 2006, pp. 41–50.
[21] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, “Be-
yond Bloom Filters: From Approximate Membership Checks to Approximate
State Machines,” in SIGCOMM ’06: Proceedings of the 2006 Conference on
Applications, Technologies, Architectures, and Protocols for Computer Commu-
nications, Pisa, Italy, 2006, pp. 315–326.
[22] M. Waldvogel, G. Varghese, J. Turner, and B. Plattner, “Scalable High Speed
IP Routing Lookups,” in SIGCOMM ’97: Proceedings of the ACM SIGCOMM
’97 Conference on Applications, Technologies, Architectures, and Protocols for
Computer Communication, Seattle, 1997, pp. 25–36.
[23] F. Baboescu and G. Varghese, “Scalable Packet Classiﬁcation,” IEEE/ACM
Trans. Netw., vol. 13, no. 1, pp. 2–14, 2005.
[24] T. V. Lakshman and D. Stiliadis, “High-speed Policy-based Packet Forwarding
Using Eﬃcient Multi-dimensional Range Matching,” in SIGCOMM ’98: Pro-
ceedings of the ACM SIGCOMM ’98 Conference on Applications, Technologies,
Architectures, and Protocols for Computer Communication, Vancouver, Canada,
1998, pp. 203–214.
[25] A.Z. Broder and M. Mitzenmacher, “Using Multiple Hash Functions to Improve
IP Lookups,” in INFOCOM 2001. 20th Annual Joint Conference of the IEEE
Computer and Communications Societies. Proceedings IEEE, 2001, pp. 1454–
1463.
[26] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, “An
Improved Construction for Counting Bloom Filters,” in ESA’06: Proceedings of
the 14th Conference on Annual European Symposium, 2006, pp. 684–695.
[27] B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal, “The Bloomier Filter: An Eﬃ-
cient Data Structure for Static Support Lookup Tables,” in SODA ’04: Proceed-
ings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms,
2004, pp. 30–39.
[28] S. Kumar, J. Turner, and P. Crowley, “Peacock Hashing: Deterministic and
Updatable Hashing for High Performance Networking,” in INFOCOM 2008. 27th
Annual Joint Conference of the IEEE Computer and Communications Societies.
Proceedings IEEE, pp. 101 – 105.
[29] A. Kirsch and M. Mitzenmacher, “Simple Summaries for Hashing with Choices,”
IEEE/ACM Trans. Netw., vol. 16, no. 1, pp. 218–231, 2008.
[30] H. Yu and R. Mahapatra, “A Throughput-eﬃcient Packet Classiﬁer with n
Bloom Filters,” in Proc. of IEEE Global Communications Conference (GLOBE-
COM), New Orleans, 2008, pp. 1 – 5.
[31] H. Yu and R. Mahapatra, “A Memory-eﬃcient Hashing by Multi-predicate
Bloom Filters for Packet Classiﬁcation,” in INFOCOM 2008. 27th Annual Joint
Conference of the IEEE Computer and Communications Societies. Proceedings
IEEE, Phoenix, 2008, pp. 1795 – 1803.
[32] H. Yu and R. Mahapatra, “A Space- and Time-eﬃcient Hash Table Hierarchi-
cally Indexed by Bloom Filters,” in IPDPS 2008. IEEE International Symposium
on Parallel and Distributed Processing, 2008, pp. 1 – 12.
[33] H. Yu and R. Mahapatra, “A Pipelined Indexing Hash Table Using Bloom and
Fingerprint Filters for IP Lookup,” in SIGCOMM 2008, pp. 463 – 464.
[34] The Linley Group, A Guide to Search Engines and Networking Memory, (2006,
Nov.), [Online]. Available: http://www.linleygroup.com/pdf/NMv4.pdf.
[35] CACTI, (2001, Feb.), [Online]. Available:
http://www.hpl.hp.co.uk/personal/Norman_Jouppi/cacti5.html.
[36] K. Zheng, C. Hu, H. Lu and B. Liu, “An Ultra High Throughput and Power
Eﬃcient TCAM-based IP Lookup Engine,” in INFOCOM 2004. Proceedings
IEEE 23rd Annual Joint Conference of the IEEE Computer and Communications
Societies, 2004, pp. 7–11.
[37] J. Akhbarizadeh, M.M. Nourani, R. Panigrahy, and S. Sharma, “A TCAM-
Based Parallel Architecture for High-speed Packet Forwarding,” IEEE Trans.
Comput., vol. 56, no. 1, pp. 58–72, 2007.
[38] W. Jiang, Q. Wang, and V. Prasanna, “Beyond TCAMs: An SRAM-based
Parallel Multi-pipeline Architecture for Terabit IP Lookup,” in INFOCOM ’08.
Proceedings of IEEE 27th Annual Joint Conference of the IEEE Computer and
Communications Societies.
[39] L. Fan, P. Cao, J. Almeida, and A.Z. Broder, “Summary Cache: A Scalable
Wide-area Web Cache Sharing Protocol,” IEEE/ACM Trans. Netw., vol. 8, no.
3, pp. 281–293, 2000.
[40] V. Srinivasan, G. Varghese, S. Suri, and M. Waldvogel, “Fast and Scalable
Layer Four Switching,” in SIGCOMM ’98: Proceedings of the ACM SIGCOMM
’98 Conference on Applications, Technologies, Architectures, and Protocols for
Computer Communication, New York, 1998, pp. 191–202.
[41] M. Nourani and M. Faezipour, “A Single-Cycle Multi-Match Packet Classiﬁ-
cation Engine Using TCAMs,” in HOTI ’06: Proceedings of the 14th IEEE
Symposium on High-Performance Interconnects, Washington, DC, 2006, pp. 73–
80.
[42] M. Singhal, J. Xu and J. Degroat, “A Novel Cache Architecture to Support
Layer-Four Packet Classiﬁcation at Memory Access Speeds,” in INFOCOM
2000. Proceedings IEEE of the 19th Annual Joint Conference of the IEEE Com-
puter and Communications Societies, 2000, pp. 1445–1454.
[43] J. Byers, J. Considine, M. Mitzenmacher, and S. Rost, “Informed Content De-
livery Across Adaptive Overlay Networks,” in SIGCOMM ’02: Proceedings of
the 2002 Conference on Applications, Technologies, Architectures, and Protocols
for Computer Communications, 2002, pp. 47–60.
[44] A. Kumar, J. Xu and E. W. Zegura, “Eﬃcient and Scalable Query Routing for
Unstructured Peer-to-Peer Networks,” in INFOCOM 2005. Proceedings IEEE
24th Annual Joint Conference of the IEEE Computer and Communications So-
cieties, 2005, pp. 13–17.
[45] D. Sy and L. Bao, “CAPTRA: Coordinated Packet Traceback,” in IPSN ’06:
Proceedings of the Fifth International Conference on Information Processing in
Sensor Networks, 2006, pp. 152–159.
[46] S. Cohen and Y. Matias, “Spectral Bloom Filters,” in SIGMOD ’03: Proceedings
of the 2003 ACM SIGMOD International Conference on Management of Data,
2003, pp. 241–252.
[47] F. Zane, G. Narlikar, and A. Basu, “CoolCAM: Power-eﬃcient TCAMs for
Forwarding Engines,” in INFOCOM 2003. Proceedings of IEEE the 22nd Annual
Joint Conference of the IEEE Computer and Communications Societies, 2003,
pp. 42 – 52.
[48] W. Jiang and V. Prasanna, “Parallel IP Lookup Using Multiple SRAM-based
Pipelines,” in IPDPS ’08. 22nd IEEE International Parallel and Distributed
Processing Symposium, 2008, pp. 1–14.
[49] A. Broder and M. Mitzenmacher, “Network Applications of Bloom Filters: A
Survey,” Internet Mathematics, vol. 1, no. 4, pp. 485–509, 2002.
[50] A. Pagh, R. Pagh, and S. S. Rao, “An Optimal Bloom Filter Replacement,”
in SODA ’05: Proceedings of the Sixteenth Annual ACM-SIAM Symposium on
Discrete Algorithms, Philadelphia, PA, 2005, pp. 823–829.
[51] BGP Routing Tables Analysis Report, (2008), [Online]. Available:
http://bgp.potaroo.net.
[52] University of Oregon Route Views Project, (2005, Jan.), [Online]. Available:
http://www.routeviews.org/.
[53] I. Kaya and T. Kocak, “Energy-eﬃcient Pipelined Bloom Filters for Network
Intrusion Detection,” in IEEE International Conference on Communications,
2006, pp. 2382 – 2387.
[54] F. Nemati, H.-J. Cho, S. Robins, R. Gupta, M. Tarabbia, K.J. Yang, D. Hayes,
and V. Gopalakrishnan, “Fully Planar 0.562 μm² T-RAM Cell in a 130nm SOI
CMOS Logic Technology for High-density High-performance SRAMs,” in IEEE
International Electron Devices Meeting ’04, 2004, pp. 273–276.
[55] B. Dipert, “Special Purpose SRAM Smooth the Ride,” Electronics Design,
Strategy, News, 1999, pp. 9–13.
[56] D. A. Patterson and J. L. Hennessy, Computer Architecture: A Quantitative
Approach, Morgan Kaufmann Publishers Inc., San Francisco, CA, 1990.
[57] E. Saﬁ, A. Moshovos, and A. Veneris, “L-CBF: a Low-Power, Fast Counting
Bloom Filter Architecture,” in ISLPED ’06: Proceedings of the 2006 Interna-
tional Symposium on Low Power Electronics and Design, Tegernsee, Germany,
2006, pp. 250–255.
[58] Passive Measurement and Analysis Project, National Laboratory for Applied
Network Research (NLANR), (2006, July), [Online]. Available:
http://pma.nlanr.net/traces/traces.
[59] M. V. Ramakrishna, E. Fu, and E. Bahcekapili, “A Performance Study of Hash-
ing functions for Hardware Applications,” in Proceedings of Int. Conf. on Com-
puting and Information, 1994, pp. 1621–1636.
[60] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms,
McGraw-Hill, New York, 1990.
[61] Y. Luo, J. Yang, L. N. Bhuyan, and L. Zhao, “NePSim: A Network Processor
Simulator with a Power Evaluation Framework,” IEEE Micro, vol. 24, no. 5,
pp. 34–44, 2004.
[62] A. Apostolopoulos, D. Aubespin, V. Peris, P. Pradhan, and D. Saha, “Design,
Implementation, and Performance of a Content-based Switch,” in INFOCOM
2000. Proceedings of IEEE the 19th Annual Joint Conference of the IEEE Com-
puter and Communications Societies, 2000, pp. 1117 – 1126.
[63] C. Kachris and S. Vassiliadis, “Design of a Web Switch in a Reconﬁgurable
Platform,” in ANCS ’06: Proceedings of the 2006 ACM/IEEE Symposium on
Architecture for Networking and Communications Systems, San Jose, 2006, pp.
31–40.
[64] Z. G. Prodanoff and K. J. Christensen, “Managing Routing Tables for URL
Routers in Content Distribution Networks,” Int. J. Netw. Manag., vol. 14, no.
3, pp. 177–192, 2004.
[65] Monthly Log Files 2000, Computer Science Division, University of California,
Berkeley.
[66] NLANR Sanitized Cache Access Logs, (2006), [Online]. Available:
ftp://ircache.nlanr.net/Traces/.
[67] Sanitized Log Files from Canada’s Coast to Coast Broadband Research Network
(CA*netII), (2000), [Online]. Available: ftp://ircache.nlanr.net/Traces.
[68] J. García, J. Corbal, L. Cerdà, and M. Valero, “Design and Implementation
of High-performance Memory Systems for Future Packet Buﬀers,” in MICRO
36: Proceedings of the 36th Annual IEEE/ACM International Symposium on
Microarchitecture, 2003, p. 373.
[69] M. Mitzenmacher and S. Vadhan, “Why Simple Hash Functions Work: Exploit-
ing the Entropy in a Data Stream,” in SODA ’08: Proceedings of the Nineteenth
Annual ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA, 2008.
[70] B. Agrawal and T. Sherwood, “Modeling TCAM Power for Next Generation
Network Devices,” in ISPASS ’06: IEEE International Symposium on Perfor-
mance Analysis of Systems and Software, 2006.
[71] W. Eatherton, G. Varghese, and Z. Dittia, “Tree Bitmap: Hardware/Software
IP Lookups with Incremental Updates,” SIGCOMM Comput. Commun. Rev.,
vol. 34, no. 2, pp. 97–122, 2004.
[72] RIPE Network Coordination Centre, (2006), [Online]. Available:
http://www.ripe.net/.
[73] K.S. Trivedi, Probability & Statistics with Reliability, Queueing, and Computer
Science Applications, Prentice-Hall, Inc., Englewood Cliﬀs, NJ, 1990.
[74] S. Alouf, P. Nain, and D. Towsley, “Inferring Network Characteristics via
Moment-based Estimators,” in INFOCOM 2001. Proceedings of IEEE the 20th
Annual Joint Conference of Computer and Communications Societies, 2001, pp.
1045 – 1054.
VITA
Heeyeol Yu was born in Kimje, Korea. After completing his schooling at Po-
hang Jechul High School, he went on to obtain his Bachelor of Science in Computer
Science from Korea Advanced Institute of Science and Technology, Taejon, Korea in
February 1994. He graduated with his Master of Science in Computer Science from
the University of California, Los Angeles in December 2003.
Contact address:
Department of Computer Science and Engineering
Texas A&M University
TAMU 3112
College Station, TX 77843-3112
The typist for this dissertation was Heeyeol Yu.
