A low-power network search engine based on statistical partitioning by Basci, F & Kocak, T
                          Basci, F., & Kocak, T. (2004). A low-power network search engine based on
statistical partitioning. In IEEE Workshop on High Performance Switching
and Routing, Phoenix AZ. (pp. 264 - 268). Institute of Electrical and
Electronics Engineers (IEEE). 10.1109/HPSR.2004.1303484
A Low-Power Network Search Engine Based on
Statistical Partitioning
Taskin Kocak and Faysal Basci 
Department of Electrical and Computer Engineering 
University of Central Florida 
Orlando, Florida 32816-2450 
E-mail : tkocak@cpe.ucf.edu; tel: 407-823-4758; fax: 407-823-5835 
Abstract-Network search engines based on ternary CAMs (TCAMs)
are widely used in routers. However, because TCAMs search all
entries in parallel, power consumption becomes a critical issue. In
this work we propose an architecture that partitions the lookup table
into multiple TCAM chips based on the statistical distribution of
prefixes and on the power consumed by individual TCAM cells in the
match, mismatch and don't care states, and achieves lower power
figures.
I. INTRODUCTION
Content addressable memory (CAM) provides access by
data rather than by memory address. CAMs have a performance
advantage over other memory search schemes, such as
look-aside tag buffers and binary or tree-based searches. However,
this performance advantage comes at the price of larger silicon
area and higher power consumption. Today, commercial CAM
chips and embedded CAMs are used in many different
applications, including pattern recognition, neural networks,
encryption, firewalls, switches and routers. In this paper, we
are interested in the ones for networking applications.
CAMs are generally used in the packet forwarding lookup
tables of routers [1]. They are used to extract and process
the address information from incoming packets: the destination
address of the packet is compared with the stored data and,
if a match occurs, the associated routing information is given to
the forwarding circuit. Despite their performance advantages,
CAMs have serious power consumption problems. In [1], it is
reported that a 500K-entry lookup table for an IPv6 network
processor, formed by CAM chips, will consume up to 133 W,
which is around 3 W per chip. Moreover, in [7] it is reported
that a 64K-word by 40-bit CAM chip consumes 5.2 W.
In this paper, we address the power consumption issues
in ternary CAMs (TCAMs) used in IP forwarding tables
and propose an approach which reduces the
expected power consumption. In Section II, an introduction
to the fundamentals of CAMs is presented, Section III discusses
the usage of TCAMs in IP forwarding, and Section IV presents
our statistical partitioning approach. In Section V, the power
consumption formulation of the statistically partitioned TCAM is
presented, and Section VI discusses the experimental results.
Finally, in Section VII, we present the conclusion.
II. CAM BASICS
Fig. 1 shows a basic CAM architecture [3]. This is an n-bit
k-word CAM. Data stored in the CAM are searched by applying
the reference word to the bit lines, which run vertically, through bit
line drivers. When a cell detects a mismatch, it pulls down
the match line it is connected to. If a word is fully matched, the
match line remains high. The encoder selects a single
row in the case of multiple matches, and asserts a hit signal
and the corresponding address of the selected row. As can be
seen from the figure, any search word presented is compared
against all stored words in parallel. This makes the search very
fast; however, it also implies that every cell is active in every
search operation, resulting in excessive power consumption.
There are two classes of CAMs: binary CAMs and ternary
CAMs. A binary CAM cell can store either a 0 or a 1. A TCAM
cell, on the other hand, can also store a “don’t care” (x).
To store an x we need an extra bit. This can be achieved
by simply combining two binary CAM cells [4], or a more
elaborate CAM cell design is also possible, as in [9]. In TCAM
cells the stored data is encoded to represent 0, 1 and x. CAM
architectures utilize wired-ANDing: before a search operation is
performed, the matchline is precharged and the search lines are
discharged. During a search operation, a cell whose stored data
matches the data on the search line does nothing, whereas a cell
that detects a mismatch pulls the match line low. So, even if
only one bit mismatches, a mismatch is issued for the whole word.
On the other hand, if a don’t care is stored in a cell, the search
data is ignored by that cell; in effect it behaves like a matching
cell.
CAMs (or TCAMs) consume power mainly in three parts:
matchline (and searchline) precharging (pre-discharging),
comparison, and clock and control signaling. Most of the
power is consumed during the precharging operation [6].
Therefore most CAM designs try to minimize precharging
events, as in [5], [8] and [10]. Some approaches try to reduce
the voltage swing across search lines and matchlines, as in [11],
whereas others use system-level optimizations. For
example in [12], a parameter characterizing the data is also
stored in the CAM besides the original data. First, a search
among the parameters is performed, and a second search
is done only among the original data whose corresponding
parameter search returned a hit. This way, unnecessary evaluation
and precharging activities are avoided.
Other system-level approaches utilize application-specific
properties, as in the case of IP forwarding engines. The next
section discusses the usage of TCAMs in IP forwarding.
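The wired-AND behaviour of a matchline described in this section can be illustrated with a small behavioural model. This is only a sketch in Python, not a circuit description: the function names (`cell_matches`, `matchline`, `search`) and the string encoding of stored words ('0', '1', 'x') are assumptions made for illustration.

```python
from typing import List


def cell_matches(stored: str, search_bit: str) -> bool:
    """One ternary cell: 'x' ignores the search bit, otherwise the bits must agree."""
    return stored == "x" or stored == search_bit


def matchline(stored_word: str, search_word: str) -> bool:
    """Wired-AND of all cells in a word: the precharged line stays high
    only if no cell pulls it down."""
    if len(stored_word) != len(search_word):
        raise ValueError("stored word and search word must have the same width")
    line_high = True  # matchline is precharged before the search
    for stored, bit in zip(stored_word, search_word):
        if not cell_matches(stored, bit):
            line_high = False  # a single mismatching cell discharges the line
    return line_high


def search(table: List[str], search_word: str) -> List[bool]:
    """Every stored word is evaluated on every search, which is why
    all cells contribute to the power consumed per lookup."""
    return [matchline(word, search_word) for word in table]
```

For example, `search(["1010", "10xx", "0xxx"], "1011")` evaluates all three words and reports a match only for the second one.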
0-7803-8375-3/01/$20.00 © 2004 IEEE
Authorized licensed use limited to: UNIVERSITY OF BRISTOL. Downloaded on July 9, 2009 at 06:10 from IEEE Xplore.  Restrictions apply.
Fig. 1. CAM architecture. Data to be searched (or stored) is fed through bit line drivers. Address is the encoded address of the matched data. Hit is a control
signal indicating if a match is found. Word lines are used when writing data to the cells.
III. TCAMS IN IP FORWARDING
Nowadays, IP lookups are based on the classless interdomain
routing (CIDR) scheme. Since the adoption of CIDR in 1993,
IP routes have been characterized by a routing prefix and
the prefix length. In the CIDR scheme, route lookups aim at the
longest prefix match (LPM).
Because a TCAM can store don’t cares, it is a
favorable lookup hardware: in IP lookup engines, when
an entry is stored in a TCAM, depending on its prefix length, some
of its rightmost bits are stored as x. For example, in IPv4,
for a 24-bit prefix the last 8 bits will be x. Moreover, because of
the x storage capability, routing entries that belong to different
prefix sets can be placed on the same chip. When multiple matches
occur, a priority encoder chooses the longest matching prefix.
In the case of binary CAMs, one needs to use a separate CAM
chip for each prefix set.
Routing table entries stored in TCAMs are ordered according
to their prefix lengths. For example, in Fig. 2, the highest prefix
set lies at the top of the entries (the lowest addresses), and
the lowest prefix set lies at the bottom [14]. So in the case
of multiple matches the priority encoder will choose the one with
the lowest address, which is indeed the longest matching prefix.
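The ordering rule above (longer prefixes at lower addresses, priority encoder picks the lowest matching address) can be sketched as follows. This is a software illustration under stated assumptions, not the hardware: prefixes are given as bit strings, `build_table` and `lookup` are hypothetical helper names, and the sequential loop stands in for the priority encoder that resolves the parallel match results.

```python
from typing import List, Optional, Tuple


def to_ternary(prefix_bits: str, width: int = 32) -> str:
    """Pad a prefix with don't cares ('x') on the right up to the word width."""
    return prefix_bits + "x" * (width - len(prefix_bits))


def build_table(routes: List[Tuple[str, str]], width: int = 32) -> List[Tuple[str, str]]:
    """Order entries by descending prefix length so that longer prefixes
    occupy lower addresses (smaller list indices)."""
    ordered = sorted(routes, key=lambda r: len(r[0]), reverse=True)
    return [(to_ternary(prefix, width), next_hop) for prefix, next_hop in ordered]


def lookup(table: List[Tuple[str, str]], address_bits: str) -> Optional[str]:
    """Model of the priority encoder: return the match at the lowest address,
    which is the longest matching prefix because of the ordering."""
    for word, next_hop in table:
        if all(s == "x" or s == a for s, a in zip(word, address_bits)):
            return next_hop
    return None
```

With an 8-bit toy width, `build_table([("1010", "A"), ("101011", "B"), ("1", "C")], width=8)` places the 6-bit prefix first, so the address `10101100` resolves to `B` even though it also matches the entries for `A` and `C`.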
IV. STATISTICAL PARTITIONING 
When we look at the prefix distribution in core routers,
we see that they all have a similar characteristic. Fig. 3 shows
the distribution of prefixes in the routing tables that belong to
routers of various networks [13]. It can be seen that a great
portion of the entries are accumulated at prefix set 24. The
question is whether we can make use of this. If we check only
this prefix set, we can achieve around a 50% hit ratio, provided
that we have a random traffic pattern. However, as our aim is
to find the longest matching prefix, we should also look at
the higher prefixes. Fig. 3 suggests that prefixes higher than
24 are considerably rare. So, we exploit this fact and propose
to partition the routing lookup table as shown in Fig. 4. In
this architecture, the routing lookup table is divided into
two parts, TCAM1 and TCAM2. In a lookup
operation, TCAM1 is searched first, and if a mismatch
occurs, then TCAM2 is searched. The search procedure
can be summarized as follows:
Perform a search in TCAM1
If there is a match
    DONE
Otherwise
    search TCAM2
    If there is a match
        DONE
    Otherwise
        issue mismatch
end
Fig. 2. TCAM storage scheme. Higher prefix sets are at the top, lower prefix
sets are at the bottom.
Fig. 3. Prefix distribution in the routing tables of various networks.
There is a buffer between TCAM1 and TCAM2 which is
used to store the current search word. This enables pipelining
of the search operation. In the case of a mismatch in TCAM1,
TCAM2 uses the value stored in the buffer to do a search.
This way, TCAM1 can accept a new search word every clock cycle.
Although pipelining ensures a sustained average throughput of
one result per clock cycle, the average latency of the search
operation will be higher. However, considering the overall
latency of packet processing in network nodes, we believe that
this latency will not be significant as long as the throughput
rate is sustained.
Besides the distribution of prefixes, we should also take power
consumption and average latency into account when partitioning
the table. When we have all of these figures we can deduce
an optimal partitioning.
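The two-stage search procedure of this section can be sketched in Python as below. This is a functional illustration only: it does not model the pipeline buffer or timing, the helper names (`partition`, `search_tcam`, `lookup`) are hypothetical, and the split rule "prefix length at least m goes to TCAM1" is taken from the partitioning described in the text.

```python
from typing import List, Optional, Tuple

Entry = Tuple[str, str]  # (ternary word, next hop)


def partition(routes: List[Tuple[str, str]], m: int,
              width: int = 32) -> Tuple[List[Entry], List[Entry]]:
    """Split routes by prefix length: TCAM1 holds lengths >= m, TCAM2 the rest.
    Inside each part, longer prefixes are placed at lower addresses."""
    def to_entries(subset: List[Tuple[str, str]]) -> List[Entry]:
        ordered = sorted(subset, key=lambda r: len(r[0]), reverse=True)
        return [(p + "x" * (width - len(p)), hop) for p, hop in ordered]

    tcam1 = to_entries([r for r in routes if len(r[0]) >= m])
    tcam2 = to_entries([r for r in routes if len(r[0]) < m])
    return tcam1, tcam2


def search_tcam(tcam: List[Entry], address: str) -> Optional[str]:
    """Return the next hop of the lowest-address match, or None on a mismatch."""
    for word, hop in tcam:
        if all(s == "x" or s == a for s, a in zip(word, address)):
            return hop
    return None


def lookup(tcam1: List[Entry], tcam2: List[Entry],
           address: str) -> Tuple[Optional[str], int]:
    """Search TCAM1 first; only on a mismatch search TCAM2.
    Returns (next hop or None, number of TCAM parts searched)."""
    hop = search_tcam(tcam1, address)
    if hop is not None:
        return hop, 1
    return search_tcam(tcam2, address), 2
```

Because TCAM1 holds only the longer prefixes, a hit there is already the longest match, and TCAM2 is activated only for the remaining lookups; the second return value is the quantity that drives the average latency.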
V. POWER CONSUMPTION FORMULATION OF CAM CIRCUITS
Power consumption in a TCAM cell can be written as follows:

$$P_{TCAM} = P_{STC} + P_{CLK} + P_{MISS} + P_{MATCH} + P_{X} \tag{1}$$

where $P_{STC}$ and $P_{CLK}$ represent the static power consumption
and the power dissipation due to the clock circuitry, and $P_{MISS}$,
$P_{MATCH}$ and $P_{X}$ represent the average power consumption in the
case of a mismatch, match and don't care, respectively. Please
note that the latter three also include the power consumption due
to matchline and searchline switching.
Fig. 4. Partitioned TCAM.
Let $r_i$ represent the set of entries associated with a prefix $i$,
and let the prefix sets stored in TCAM1 be $\{r_m, \ldots, r_n\}$ where
$m < n$, and the prefix sets stored in TCAM2 be $\{r_1, r_2, \ldots, r_{m-1}\}$.
The number of entries in set $r_i$ is represented by $N_i$,
where $1 \le i \le n$. The number of don't care cells in a prefix
$i$ is $n - i$. Then the power consumption for a comparison operation
of the $j$-th word that belongs to the set $r_i$ is:

$$P_{COMP}^{i,j} = P_{MATCH} \cdot \mathit{Mt}_j + P_{MISS} \cdot \mathit{Ms}_j + P_{X} \cdot (n - i) \tag{2}$$

where $\mathit{Mt}_j$ represents the number of matching cells and $\mathit{Ms}_j$
represents the number of mismatching cells in stored word
$j$. Let $p_1$ represent the probability that a match will occur
in TCAM1; then the total power consumption for a search
operation can be written as:

$$P_{TOTAL} = P_{TCAM1} + (1 - p_0 \cdot p_1) \cdot P_{TCAM2} \tag{3}$$

where $p_0$ represents the probability that the search word has
a match in the whole table. Then we can write the power
consumptions for each TCAM portion as:
$$P_{TCAM1} = \sum_{i=m}^{n} \sum_{j=1}^{N_i} P_{COMP}^{i,j} \tag{4}$$

$$P_{TCAM2} = \sum_{i=1}^{m-1} \sum_{j=1}^{N_i} P_{COMP}^{i,j} \tag{5}$$

To proceed further, note that

$$0 \le \mathit{Mt}_j \le i \tag{6}$$

$$0 \le \mathit{Ms}_j \le i \tag{7}$$

Assuming that the number of mismatching and matching cells
are equal (the contradicting case will be discussed later):

$$E\{\mathit{Mt}_j\} = i/2 \tag{8}$$

$$E\{\mathit{Ms}_j\} = i/2 \tag{9}$$

where $E\{\ldots\}$ represents the expected value operation. Then
the expected value of power consumption for each TCAM
reduces to the following:

$$E\{P_{TCAM1}\} = \sum_{i=m}^{n} N_i \left[ \frac{i}{2} \left( P_{MATCH} + P_{MISS} \right) + (n - i) \, P_{X} \right] \tag{10}$$

$$E\{P_{TCAM2}\} = \sum_{i=1}^{m-1} N_i \left[ \frac{i}{2} \left( P_{MATCH} + P_{MISS} \right) + (n - i) \, P_{X} \right] \tag{11}$$

Hence we can write the expected total power expression as
follows:

$$E\{P_{TOTAL}\} = E\{P_{TCAM1}\} + (1 - p_0 \cdot p_1) \cdot E\{P_{TCAM2}\} \tag{12}$$

Fig. 5. Power consumption in search operation for partitioned CAM with
varying m.

TABLE I
EXPECTED POWER SAVINGS AND AVERAGE LATENCY WHEN
PARTITIONING TCAM INTO THREE

Network              m1, m2   Power savings (%)   Average latency (clock cycles)
Reach                24, 21   28.65               1.6871
Route-Views.Oregon   24, 21   27.62               1.6321
Teleglobe Europe     24, 21   28.89               1.6932
AOL-PR. GNN          24, 21   28.84               1.6914
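The expected-power formulation of this section, in which a word of prefix length i contributes i/2 matching cells, i/2 mismatching cells and n - i don't care cells, can be evaluated numerically with a short Python sketch. The function names, the histogram argument `hist` (prefix length mapped to the number of entries N_i) and the numbers in the usage note are illustrative assumptions, not data from the experiments.

```python
from typing import Dict, Tuple


def expected_word_power(i: int, n: int, p_match: float,
                        p_miss: float, p_x: float) -> float:
    """Expected comparison power of one word with prefix length i, assuming
    i/2 matching cells, i/2 mismatching cells and n - i don't care cells."""
    return (i / 2) * (p_match + p_miss) + (n - i) * p_x


def expected_powers(hist: Dict[int, int], m: int, n: int, p_match: float,
                    p_miss: float, p_x: float) -> Tuple[float, float]:
    """Return (E{P_TCAM1}, E{P_TCAM2}) for split point m.
    hist maps prefix length i to the number of entries N_i."""
    tcam1 = sum(count * expected_word_power(i, n, p_match, p_miss, p_x)
                for i, count in hist.items() if i >= m)
    tcam2 = sum(count * expected_word_power(i, n, p_match, p_miss, p_x)
                for i, count in hist.items() if i < m)
    return tcam1, tcam2


def expected_total_power(hist: Dict[int, int], m: int, n: int, p_match: float,
                         p_miss: float, p_x: float, p0: float, p1: float) -> float:
    """E{P_TOTAL} = E{P_TCAM1} + (1 - p0 * p1) * E{P_TCAM2}."""
    e1, e2 = expected_powers(hist, m, n, p_match, p_miss, p_x)
    return e1 + (1 - p0 * p1) * e2
```

As a toy check with made-up values, `expected_total_power({2: 1, 4: 2}, m=3, n=4, p_match=1.0, p_miss=3.0, p_x=2.0, p0=1.0, p1=0.5)` gives 16 for TCAM1 plus half of the 8 for TCAM2, that is 20. Sweeping `m` over 2..n with a measured histogram and suitable values of `p0` and `p1` for each split would reproduce the kind of curve the formulation is meant to produce.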
VI. EXPERIMENTAL RESULTS 
After formulating the power, the problem is to find the value of
m that minimizes the expected total power consumption
while maintaining an acceptable average search latency. As
the value of m depends on the routing table entries and the traffic
pattern, one would expect it to be different for each
lookup table. To calculate the expected power consumption
for different values of m for a typical lookup table, we have
done calculations for a few networks including Telstra and
Reach [13]. We have implemented an example TCAM circuit
in TSMC 0.18 µm CMOS technology and ran it at 100
MHz. PMT, PMS and PX (the per-cell match, mismatch and
don't care powers) are obtained from Cadence Spectre
simulations. In these simulations, the number of TCAM cells
is varied, and in each case the TCAM block is tested with different
matching and mismatching bit patterns. Finally, all of the
obtained results are averaged for matching, mismatching and
don't care cells. The results are as follows:
$$P_{MT} = 397\ \mathrm{nW}, \quad P_{MS} = 515\ \mathrm{nW}, \quad P_{X} = 336\ \mathrm{nW} \tag{13}$$

Fig. 5 shows the power consumption in search operations
for 5 different networks, based on the above values. It can
be seen that for all of the networks the minimum power
consumption is achieved for m = 24. The power savings
gained by doing so are between 20% and 24%. We can also use
the expected power-latency product (PLP) as another metric. If
we represent the search latency in terms of clock cycles with $L$,
the expected value of the latency can be shown to be equal to the
following:

$$E\{L\} = 2 - p_1 \tag{14}$$

where $p_1$ is the probability that a match will occur in TCAM1.
Fig. 6 shows the plot of PLP for the same network set. It can be
seen that until m = 25 the PLP increases very slowly, and
then it shows a sharp increase. When we calculate the average
latency for the case where m = 24, we see that it varies from
1.36 to 1.45 clock cycles. The same experiment is run for the
case of 3 partitions. This time, we have 3 TCAM parts. The
prefix set stored in TCAM1 in this case is $\{r_{m_1}, \ldots, r_n\}$, $m_1 < n$,
the prefix set stored in TCAM2 is $\{r_{m_2}, \ldots, r_{m_1 - 1}\}$, $m_2 < m_1$,
and the prefix set stored in TCAM3 is $\{r_1, \ldots, r_{m_2 - 1}\}$. After
running the simulations based on this scheme, the obtained
minimum power values are tabulated in Table I. From the
table it can be seen that, with some added latency, power
consumption can be reduced by up to 30%. Experiments we
conducted show that further partitioning does not increase
power savings considerably.
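The latency expression E{L} = 2 - p1 follows from a lookup taking one cycle with probability p1 (hit in TCAM1) and two cycles otherwise, so E{L} = p1 + 2(1 - p1). A small Python sketch of the latency and of the power-latency product is given below; the function names and the candidate numbers in the usage note are hypothetical and are not the measured values behind Fig. 6.

```python
from typing import Dict, Tuple


def expected_latency(p1: float) -> float:
    """Expected search latency in clock cycles for two partitions:
    1 cycle with probability p1, 2 cycles with probability 1 - p1."""
    if not 0.0 <= p1 <= 1.0:
        raise ValueError("p1 must be a probability")
    return 2.0 - p1


def power_latency_product(expected_power: float, p1: float) -> float:
    """Expected power multiplied by the expected latency (PLP)."""
    return expected_power * expected_latency(p1)


def best_split(candidates: Dict[int, Tuple[float, float]]) -> int:
    """candidates maps a split point m to (expected power, p1);
    return the m with the smallest power-latency product."""
    return min(candidates, key=lambda m: power_latency_product(*candidates[m]))
```

Read in reverse, the reported average latencies of 1.36 to 1.45 cycles correspond to p1 between 0.64 and 0.55 under this expression. As a made-up example, `best_split({23: (10.0, 0.7), 24: (8.0, 0.6), 25: (9.0, 0.3)})` selects 24, since 8.0 x 1.4 is the smallest product.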
In the experiments, it is assumed that the number of cells
that return a match is equal to the number of cells returning
a mismatch. However, to exemplify the power consumption
figures for different miss rates, a simulation is done for the Telstra
network. The result is shown in Fig. 7. As can be seen,
the optimal value of m is not affected; however, the power
consumption figures are scaled up or down with the miss rate.
Fig. 6. Expected power-latency product (PLP) with varying m.
Fig. 7. Power consumption in search operation for partitioned CAM with
varying m for Telstra network with different miss rates for individual cells.
VII. CONCLUSION 
We have presented a partitioning scheme which utilizes the
statistical distribution of prefixes and the individual power
consumption of cells in the cases of match, mismatch and don't
care. We showed that partitioning indeed helps to reduce
the power consumption in IP lookup applications. Partitioning
into two and three parts reduces power consumption considerably,
whereas further partitioning shows little improvement.
REFERENCES
[1] White Paper, EZchip Technologies, "IPv6 to IPv4 is Not Merely 50
More." (web: http://www.ezchip.com/html/tech_IPv6.html)
[2] "Content-Addressable memory (CAM) and its network
applications." (web: www.eetasia.com/ARTICLES/2000MAY/
2000MAY03_MEM_NTEK_TAC.PDF)
[3] K. J. Schultz, "Content-addressable memory core cells: A survey."
Integration, the VLSI Journal 23, Page(s): 171-188, 1997
[4] R. Sergio, R. Chavez, "Encoding don't cares in static and dynamic
Content-Addressable Memories." IEEE Transactions on Circuits and
Systems-II: Analog and Digital Signal Processing, Vol. 39, No. 8,
August 1992
[5] G. Thirugnanam, N. Vijaykrishnan, M.J. Irwin, "A novel low power
CAM design." 14th Annual IEEE International ASIC/SOC Conference
Proceedings, Page(s): 198-202, 12-15 Sept. 2001
[6] I.Y. Liang Hsiao, D.H. Wang, C.W. Jen, "Power modeling and
low-power design of content addressable memories." ISCAS, The 2001
IEEE International Symposium on Circuits and Systems, Volume: 4,
Page(s): 926-929, 6-9 May 2001
[7] F. Shafai, K.J. Schultz, G.F.R. Gibson, A.G. Bluschke, D.E. Somppi,
"Fully parallel 30-MHz, 2.5-Mb CAM." IEEE Journal of Solid-State
Circuits, Volume: 33 Issue: 11, Page(s): 1690-1696, Nov. 1998
[8] C.A. Zukowski, S.Y. Wang, "Use of selective precharge for low-power
on the match lines of content addressable memories." International
Workshop on Memory Technology, Design and Testing, Proceedings,
Page(s): 64-68, 11-12 Aug. 1997
[9] I. Arsovski, T. Chandler, A. Sheikholeslami, "A ternary
content-addressable memory (TCAM) based on 4T static storage and
including a current-race sensing scheme." IEEE Journal of Solid-State
Circuits, Volume: 38 Issue: 1, Page(s): 155-158, Jan. 2003
[10] T. Chadwick, T. Gordon, R. Nadkarni, J. Rowland, "An
ASIC-embedded content addressable memory with power-saving and design
for test features." IEEE Conference on Custom Integrated Circuits,
Page(s): 183-186, 6-9 May 2001
[11] H. Miyatake, M. Tanaka, Y. Mori, "A design for high-speed
low-power CMOS fully parallel content addressable memory macros."
IEEE Journal of Solid-State Circuits, Volume: 36 Issue: 6, Page(s):
956-968, June 2001
[12] C.S. Lin, J.C. Chang, B.D. Liu, "A low-power precomputation-based
fully parallel content-addressable memory." IEEE Journal of
Solid-State Circuits, Volume: 38 Issue: 4, Page(s): 654-662, April 2003
[13] BGP table statistics. (web: http://bgp.potaroo.net/). September 22,
2003
[14] D. Shah, P. Gupta, "Fast updating algorithms for TCAMs." IEEE Micro,
Volume: 21 Issue: 1, Page(s): 36-47, Jan.-Feb. 2001
