Statistically partitioned, low power TCAM by Basci, F & Kocak, T
                          Basci, F., & Kocak, T. (2004). Statistically partitioned, low power TCAM. In
2nd Annual IEEE Northeast Workshop on Circuits and Systems, Montreal,
Canada. (pp. 129 - 132). Institute of Electrical and Electronics Engineers
(IEEE). 10.1109/NEWCAS.2004.1359039
Peer reviewed version
Link to published version (if available):
10.1109/NEWCAS.2004.1359039
Poster Session II: Applications & Emerging Technologies
Statistically Partitioned, Low Power TCAM
Faysal Basci and Taskin Kocak
Department of Electrical and Computer Engineering
University of Central Florida, Orlando, FL 32816-2450
Abstract-Network search engines based on ternary CAMs
are widely used in routers. However, due to the parallel search
nature of TCAMs, power consumption becomes a critical issue.
In this work we propose an architecture that partitions the
lookup table into multiple TCAM portions based on individual
TCAM cell status and achieves up to 30% power reduction.
Keywords: CAM, ternary CAM, low power, IP lookup,
routing table
I. INTRODUCTION 
Content addressable memory (CAM) provides access by
data rather than by memory address. CAMs have a performance
advantage over other memory search approaches, such as
look-aside tag buffers and binary or tree-based searches.
However, this performance advantage comes at the price of
larger silicon area and higher power consumption. Today,
commercial CAM chips and embedded CAMs are utilized in many
different applications, including pattern recognition, neural
networks, encryption, firewalls, switches and routers. In this
paper, we are interested in the ones for networking
applications. CAMs are generally used in the packet forwarding
lookup tables of routers [1]. They are used to extract and
process the address information from incoming packets: the
destination address of the packet is compared with the stored
data and, if a match occurs, the associated routing information
is given to the forwarding circuit. Despite their performance
advantages, CAMs have serious power consumption problems.
In [1], it is reported that a 500K-entry lookup table for an
IPv6 network processor, formed by CAM chips, will consume up
to 133 W, which is around 3 W per chip. Moreover, in [6] it is
reported that a 64K-word by 40-bit CAM chip consumes 5.2 W.
In this paper, we address the power consumption issues
in ternary CAMs (TCAMs) used in IP forwarding tables
and propose an approach which reduces the expected power
consumption. In Section II, an introduction to the fundamentals
of CAMs is presented, Section III discusses the usage of TCAMs
in IP forwarding, and Section IV presents our statistical
partitioning approach. In Section V, the power consumption
formulation of a statistically partitioned TCAM is presented,
and Section VI discusses the experimental results. Finally, in
Section VIII, we present the conclusion.
II. CAM BASICS
Fig. 1 shows a basic CAM architecture [2]. This is an n-bit,
k-word CAM. Data stored in the CAM is searched by applying
the reference word to the bit lines, which run vertically, through
the bit line drivers. Any search word presented is compared
against all stored words in parallel. Because of this, the search
is very fast; however, it also implies that all of the cells are
utilized for each search operation, resulting in excessive power
consumption.

Fig. 1. CAM architecture. Data to be searched (or stored) is fed through
the bit line drivers. Address is the encoded address of the matched data.
Hit is a control signal indicating whether a match was found. Word lines
are used when writing data to the cells.

0-7803-8322-2/04/$20.00 ©2004 IEEE.

There are two classes of CAMs: binary CAMs and ternary CAMs. A
binary CAM cell can store either a 0 or a 1. TCAM cells, on the
other hand, have the capability to store a "don't care" (x). To
store an x we need an extra bit. This can be achieved by simply
combining two binary CAM cells [3], or a more elaborate CAM
cell design is also possible, as in [7]. In TCAM cells the stored
data is encoded to represent 0, 1 and x. CAM architectures
utilize wire-ANDing: before a search operation is performed,
the matchline is precharged and the search lines are discharged.
During a search operation, if a cell finds that its stored data
matches the bit on the search line, it does nothing; however,
when a mismatch is detected, the cell pulls the matchline
low. So, even if only one bit mismatches, a mismatch is
issued for the whole word. On the other hand, if a don't care is
stored in a cell, the search data is ignored by that cell. In
effect, it behaves like a matching cell.
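The matchline behaviour described above can be illustrated with a small behavioural model. This is only a sketch, not a circuit description: the string encoding of words ('0', '1', 'x') and the function names cell_matches, matchline and search are assumptions made for illustration.

```python
def cell_matches(stored: str, search_bit: str) -> bool:
    """One ternary cell: a stored 'x' ignores the search bit, otherwise the bits must agree."""
    return stored == "x" or stored == search_bit


def matchline(stored_word: str, search_word: str) -> bool:
    """Wired-AND of all cells in a word: True means the matchline stayed high."""
    if len(stored_word) != len(search_word):
        raise ValueError("stored word and search word must have the same width")
    line_high = True  # the matchline is precharged before the search
    for stored, bit in zip(stored_word, search_word):
        if not cell_matches(stored, bit):
            line_high = False  # any single mismatching cell pulls the line low
    return line_high


def search(table: list[str], search_word: str) -> list[int]:
    """Compare the search word with every stored word and return the matching addresses."""
    return [addr for addr, word in enumerate(table) if matchline(word, search_word)]


# Hypothetical 4-bit table; every row takes part in every search.
table = ["1010", "10xx", "0xxx"]
print(search(table, "1011"))
```

Every row is evaluated on every search in this model, which mirrors why a fully parallel search is fast but keeps all cells active.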
CAMs (or TCAMs) consume power mainly in three parts:
matchline (and searchline) precharging (pre-discharging),
comparison, and clock and control signaling. Most of the
power is consumed during the precharging operation [5].
Therefore, most CAM designs try to minimize precharging
events, as in [4] and [8]. Some approaches try to reduce the
voltage swing across search lines and matchlines, as in [9],
whereas others use system level optimizations, as in [10].
Authorized licensed use limited to: UNIVERSITY OF BRISTOL. Downloaded on July 9, 2009 at 06:20 from IEEE Xplore.  Restrictions apply.
Other system level approaches utilize application-specific
properties, as in the case of IP forwarding engines. The next
section discusses the usage of TCAMs in IP forwarding.
III. TCAMS IN IP FORWARDING
Nowadays, IP lookups are based on the classless interdomain
routing (CIDR) scheme. Since the adoption of CIDR in 1993,
IP routes have been characterized by a routing prefix and
the prefix length. In the CIDR scheme, route lookups aim at the
longest prefix match (LPM).
Fig. 2. Partitioned TCAM

As TCAMs provide a don't care storage capability, they are
favorable lookup hardware: in IP lookup engines, when
an entry is stored in a TCAM, depending on its prefix length,
some of its rightmost bits will be stored as x. For example, in
IPv4, for a 24-bit prefix, the last 8 bits will be x. Moreover,
because of the x storage capability, routing entries belonging to
different prefix sets can be placed on the same chip. Generally,
routing table entries stored in TCAMs are ordered according to
their prefixes. For example, the highest prefix set lies at the top
of the entries (the lowest addresses), and the lowest prefix set
lies at the bottom [12]. When multiple matches occur, a priority
encoder chooses the longest matching prefix, which, in this
example, is the match with the lowest address.
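The prefix ordering and priority encoding described above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the hardware: the route list, the next-hop labels and the helper names (to_bits, prefix_to_ternary, build_table, lookup) are hypothetical.

```python
import ipaddress

WIDTH = 32  # IPv4 address width in bits


def to_bits(address: str) -> str:
    """Dotted-quad IPv4 address as a 32-character bit string."""
    return format(int(ipaddress.IPv4Address(address)), f"0{WIDTH}b")


def prefix_to_ternary(address: str, length: int) -> str:
    """Keep the first `length` bits and store the rightmost bits as don't care (x)."""
    return to_bits(address)[:length] + "x" * (WIDTH - length)


def build_table(routes):
    """Order entries so that the longest prefixes sit at the lowest addresses."""
    ordered = sorted(routes, key=lambda route: route[1], reverse=True)
    return [(prefix_to_ternary(addr, length), hop) for addr, length, hop in ordered]


def lookup(table, destination: str):
    """Find all matching rows, then let the 'priority encoder' pick the lowest address."""
    key = to_bits(destination)
    matches = [
        addr
        for addr, (word, _) in enumerate(table)
        if all(s == "x" or s == k for s, k in zip(word, key))
    ]
    if not matches:
        return None
    return table[min(matches)][1]


# Hypothetical routes: (network address, prefix length, next hop label).
routes = [("10.0.0.0", 8, "A"), ("10.1.2.0", 24, "C"), ("10.1.0.0", 16, "B")]
table = build_table(routes)
print(lookup(table, "10.1.2.3"))
```

Because the table is sorted by decreasing prefix length, choosing the lowest matching address is equivalent to choosing the longest matching prefix.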
IV. STATISTICAL PARTITIONING
When we look at the prefix distribution in core routers,
we see that they all have a similar characteristic [11]. A great
portion of the entries are accumulated at prefix set 23. If we
just check this prefix set, we can achieve around a 50% hit ratio,
provided that we have a random traffic pattern. However, as
our aim is to find the longest matching prefix, we should also
look at the higher prefixes. Indeed, prefixes higher than 24
are considerably rare. We exploit this fact and propose to
partition the routing lookup table as shown in Fig. 2. The
routing lookup table is divided into two parts, TCAM1 and
TCAM2. In a lookup operation, TCAM1 is searched first, and
only if a mismatch occurs is TCAM2 searched.
There is a buffer between TCAM1 and TCAM2 which is
used to store the current search word. This enables pipelining
of the search operation. In the case of a mismatch in TCAM1,
TCAM2 uses the value stored in the buffer to do a search.
This way, TCAM1 can accept a new search word in each cycle.
Although pipelining ensures a sustained average throughput of
one result per clock cycle, the average latency of the search
operation will be higher. However, considering the overall
latency of packet processing and the queuing delays in network
nodes, this latency will not be significant.
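The two-stage lookup can be sketched behaviourally as below. This is an illustration only: it models latency as a cycle count and does not model the pipeline timing itself. The 8-bit entries, the labels and the names first_match, split_table and partitioned_lookup are assumptions; placing prefixes of length m and above in TCAM1 follows the definition of the two partitions given in Section V.

```python
def first_match(tcam, key):
    """Rows are ordered longest prefix first, so the first match is the longest one."""
    for word, hop in tcam:
        if all(s == "x" or s == k for s, k in zip(word, key)):
            return hop
    return None


def prefix_length(word: str) -> int:
    """Number of non-don't-care bits in a stored ternary word."""
    return len(word) - word.count("x")


def split_table(entries, m: int):
    """TCAM1 holds prefixes of length >= m, TCAM2 holds the shorter ones."""
    ordered = sorted(entries, key=lambda e: prefix_length(e[0]), reverse=True)
    tcam1 = [e for e in ordered if prefix_length(e[0]) >= m]
    tcam2 = [e for e in ordered if prefix_length(e[0]) < m]
    return tcam1, tcam2


def partitioned_lookup(tcam1, tcam2, key: str):
    """Return (result, latency in cycles): TCAM2 is searched only after a TCAM1 miss."""
    hop = first_match(tcam1, key)
    if hop is not None:
        return hop, 1
    buffered_key = key  # the buffer holds the word while TCAM1 accepts the next one
    return first_match(tcam2, buffered_key), 2


# Hypothetical 8-bit table partitioned at m = 6.
entries = [("1010xxxx", "A"), ("101011xx", "B"), ("10101101", "C")]
tcam1, tcam2 = split_table(entries, 6)
print(partitioned_lookup(tcam1, tcam2, "10100000"))
```

Since every prefix in TCAM1 is longer than every prefix in TCAM2, a hit in TCAM1 is already the longest prefix match and TCAM2 does not need to be activated.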
Besides the distribution of prefixes, we should also take
power consumption and average latency into account when
partitioning the table. When we have all of these figures, we
can deduce an optimal partitioning.
V. POWER CONSUMPTION FORMULATION OF CAMS 
Power consumption in a TCAM cell can be written as follows:
P_{TCAM} = P_{STC} + P_{CLK} + P_{MS} + P_{MT} + P_{X}    (1)
where P_{STC} and P_{CLK} represent the static power consumption
and the power dissipation due to the clock circuitry, and P_{MS},
P_{MT} and P_{X} represent the average dynamic power consumption
in the case of a mismatch, a match and a don't care, respectively.
Please note that the latter three also include the power
consumption due to matchline and searchline switching. Here we
are concerned with the dynamic power consumption caused by
the search operation. This is the dominant power drain, even
under very low load. In fact, for networking applications the
TCAM generally works at full load.
Let r_i represent the set of entries associated with prefix i,
let the prefix sets stored in TCAM1 be {r_m, ..., r_n}, where
m < n, and let the prefix sets stored in TCAM2 be
{r_1, r_2, ..., r_{m-1}}. The number of entries in set r_i is
denoted by N_i, where 1 <= i <= n (n=32 for IPv4). The number of
don't care cells in an entry with prefix i is n-i. Then the power
consumption for a comparison operation on a stored word j that
belongs to the set r_i is:
P_{CMP}^{i,j} = P_{MT} * Mt_j + P_{MS} * Ms_j + P_{X} * (n - i)    (2)
where Mt_j represents the number of matching cells and Ms_j
represents the number of mismatching cells in stored word j.
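We can write the power consumption of each TCAM portion by
summing the comparison power of Eq. (2) over every word stored
in that portion:
P_{TCAM1} = \sum_{i=m}^{n} \sum_{j=1}^{N_i} P_{CMP}^{i,j}    (3)
P_{TCAM2} = \sum_{i=1}^{m-1} \sum_{j=1}^{N_i} P_{CMP}^{i,j}    (4)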
Let p_1 represent the fraction of entries that reside in
TCAM1; then the total power consumption for a search
operation can be written as:
P_{TOTAL} = P_{TCAM1} + (1 - p_0 * p_1) * P_{TCAM2}    (5)
where p_0 represents the probability that the search word has
a match in the whole table.
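The formulation above can be evaluated numerically along the following lines. This is a sketch under several assumptions that should be read as such: the power of a partition is taken to be the sum of the per-word comparison power of Eq. (2) over all words stored in it; half of the prefix bits of each word are assumed to match and half to mismatch (the assumption stated for the experiments in Section VI); the default per-cell values are the figures quoted in Section VI, used here as plain numbers without units; and the prefix histogram `counts` is hypothetical.

```python
def word_power(i, n, p_mt, p_ms, p_x, match_fraction=0.5):
    """Eq. (2) for one stored word with prefix length i in an n-bit TCAM."""
    mt = match_fraction * i  # matching cells (assumed fraction of the prefix bits)
    ms = i - mt              # mismatching cells
    return p_mt * mt + p_ms * ms + p_x * (n - i)


def partition_power(counts, lo, hi, n, p_mt, p_ms, p_x, match_fraction=0.5):
    """Sum the per-word power over all entries with prefix length in [lo, hi]."""
    return sum(
        counts.get(i, 0) * word_power(i, n, p_mt, p_ms, p_x, match_fraction)
        for i in range(lo, hi + 1)
    )


def total_power(counts, m, n=32, p0=1.0, p_mt=397.0, p_ms=515.0, p_x=336.0,
                match_fraction=0.5):
    """Eq. (5): TCAM1 is always searched, TCAM2 only when TCAM1 does not hit."""
    args = (n, p_mt, p_ms, p_x, match_fraction)
    p_tcam1 = partition_power(counts, m, n, *args)
    p_tcam2 = partition_power(counts, 1, m - 1, *args)
    total_entries = sum(counts.values())
    in_tcam1 = sum(c for i, c in counts.items() if i >= m)
    p1 = in_tcam1 / total_entries  # fraction of entries that reside in TCAM1
    return p_tcam1 + (1.0 - p0 * p1) * p_tcam2


# Hypothetical prefix histogram: prefix length -> number of entries.
counts = {8: 10, 16: 100, 24: 500}
unpartitioned = total_power(counts, m=1)  # everything in TCAM1
partitioned = total_power(counts, m=24)
print(partitioned / unpartitioned)
```

Setting m=1 places every entry in TCAM1, which gives the unpartitioned baseline against which a candidate partitioning point can be compared.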
VI. EXPERIMENTAL RESULTS 
The value of m depends on the routing table entries and the
traffic pattern, so it will be different for each lookup table. To
calculate the expected power consumption for different values
of m for a typical lookup table, we have done calculations
for a few networks, including Telstra and Reach [11]. We have
implemented an example TCAM circuit in TSMC 0.18 µm
CMOS technology and run it at 100 MHz. P_{MT}, P_{MS}
and P_{X} are obtained from Cadence Spectre simulations. In
these simulations, the number of TCAM cells is varied and in
each case the TCAM block is tested with different matching and
mismatching bit patterns. Finally, all of the obtained results
are averaged for matching, mismatching and don't care cells.
The results are as follows:
P_{MT} = 397 nW, P_{MS} = 515 nW, P_{X} = 336 nW    (6)
Fig. 3. Power consumption in search operation for partitioned CAM with
varying m

Fig. 3 shows the expected dynamic power consumption in
search operations for 5 different networks, based on the above
values. It can be seen that for all of the networks the minimum
power consumption is achieved for m=24. The power
savings gained by doing so are between 20 and 24%. We can
also use the expected power - latency square product (PLSP) as
another metric. If we represent the search latency in terms of
clock cycles with L, the expected value of the latency can be
shown to be equal to the following:
E{L} = 2 - p_0 * p_1    (7)
Here we assume that the system utilizing the results of the
TCAM has the ability to handle two results per clock cycle;
otherwise, there will be a stall stage and the average latency will
be equal to 2, independent of the hit rate.
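The latency and PLSP metrics can be sketched as follows. The sketch assumes that a search completes in one cycle when it hits in TCAM1, which happens with probability p_0 * p_1 as in Eq. (5), and in two cycles otherwise, so that E{L} = 1 * (p_0 * p_1) + 2 * (1 - p_0 * p_1). The function names and the numeric inputs are hypothetical.

```python
def expected_latency(p0: float, p1: float, two_results_per_cycle: bool = True) -> float:
    """Expected search latency in clock cycles for a two-partition TCAM."""
    if not two_results_per_cycle:
        return 2.0  # a stall stage makes every search take two cycles
    hit_in_tcam1 = p0 * p1
    return 1.0 * hit_in_tcam1 + 2.0 * (1.0 - hit_in_tcam1)


def plsp(power: float, latency: float) -> float:
    """Power - latency square product."""
    return power * latency ** 2


# Hypothetical inputs: every search word matches somewhere, 60% of entries in TCAM1.
latency = expected_latency(p0=1.0, p1=0.6)
print(latency, plsp(100.0, latency))
```

Squaring the latency weights the metric towards latency, so a partitioning that saves power at the cost of many second-stage searches scores worse than the raw power figure alone would suggest.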
Fig. 4 shows the plot of PLSP for the same network set. It can
be seen that until m=25 the PLSP increases very slowly,
and then it shows a sharp increase. When we calculate the average
latency for the case where m=24, we see that it varies from
1.36 to 1.45 clock cycles. The same experiment is run for
the case of 3 partitions. This time, we have 3 TCAM parts,
with m1, m2 as partitioning points. The obtained minimum
power values are tabulated in Table I. From the table it can
be seen that, with some added latency, power consumption
can be reduced by up to 30%. The experiments we conducted
show that further partitioning does not increase power savings
considerably.

TABLE I
Network               m1, m2    Min. power    Latency
Route-Views Oregon    24, 2?    27.90         1.?32?
AOL-PR. GNN           24, 21    29.12         1.6916
Teleglobe Europe      2?, 21    2?.20         1.6?3?
In the experiments, it is assumed that the number of cells
that return a match is equal to the number of cells returning
a mismatch. However, to illustrate the power consumption
figures for different miss rates, a simulation is done for the
Telstra network. The result is shown in Figure 5. As can be seen,
the optimal value of m is not affected; however, the power
consumption figures are scaled up or down with the miss rate.
VII. DISCUSSION
When the TCAM is partitioned, it can be claimed that,
except for the first partition, the other(s) can be implemented
using a smaller number of bits. For example, in the case of two
partitions, if the prefixes down to 24 are stored in the first
partition, the first TCAM would be 32 bits wide and the second one
needs to be only 23 bits wide. However, that would limit the
ability to reconfigure the partitioning. On the other hand,
software control of partitioning without compromising speed
or power reduction complicates the design considerably. Giving
full software control over partitioning requires each and
every word in the TCAM system to have an input selection
mechanism, which will increase the amount of silicon area
needed and will increase the power consumption. However,
the prefix sets with a lot of entries are centered in the 20
to 25 range. As can be seen from the results obtained, for the
case of two partitions, partitioning is generally most beneficial
at prefix 24, and for the case of three partitions, the partitioning
again occurs in the 20 to 25 range. A compromise would be to
make only the part of the TCAM where the partitioning takes
place reconfigurable. Furthermore, instead of providing an
input selection mechanism to each word, a group of words can
use the same input selection mechanism. This way, the speed
and power penalty could be minimized.

Fig. 6. Architecture allowing partial software control for partitioning

Figure 6 shows an architecture allowing partial software control
for the case of three partitions. In this architecture, TCAM1_{fix}
represents the fixed part of the first partition, and TCAML_{fix} is
the fixed part of the last partition. The configurable sub-partitions
sp_i each have a multiplexer at their input, which is
controlled by the partitioning controller. Among all of the
multiplexers, either two (three partitions) or one (two partitions)
of them will get their input from buffers, and from that point
on, that input will be propagated down to either the next
partitioning point or TCAM2_{fix}, for three and two partitions
respectively. For example, for the case of two partitions, if
sub-partition i gets its input from buffer-1, all of the sub-partitions
between TCAM1_{fix} and sp_i will have the same input; that
means TCAM1 contains all the sub-partitions up to sp_i. Every
other sub-partition below sp_i and TCAML_{fix} will
constitute the last partition. Another multiplexer is connected
at the input of the buffer, which selects one of the outputs
of the sub-partitions. If sp_i is the partitioning point, then the
multiplexer will choose the output of sp_{i-1}. This architecture
actually allows us to use a smaller number of bits in the fixed
portion of the last partition. This way, the added power
consumption due to the selection mechanism can be compensated.
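The multiplexer configuration described above can be sketched as a simple mapping from partitioning points to partitions. This is one possible reading of the text, not the authors' implementation: sub-partitions are assumed to be indexed from 0, the sub-partition whose multiplexer selects the buffer is assumed to be the first member of the next partition, and the names partition_of, mux_sources and buffer_input are hypothetical.

```python
def partition_of(num_sub: int, points: list[int]) -> list[int]:
    """Partition number (1-based) of each configurable sub-partition sp_0..sp_{num_sub-1}."""
    if any(p < 0 or p >= num_sub for p in points):
        raise ValueError("partitioning points must index an existing sub-partition")
    partition = 1
    assignment = []
    for i in range(num_sub):
        if i in points:
            partition += 1  # sp_i reads the buffered word, so a new partition starts here
        assignment.append(partition)
    return assignment


def mux_sources(num_sub: int, points: list[int]) -> list[str]:
    """Input chosen by each sub-partition's multiplexer."""
    return ["buffer" if i in points else "previous" for i in range(num_sub)]


def buffer_input(point: int) -> str:
    """Output selected by the multiplexer that feeds the buffer ahead of sp_point."""
    # Assumption: when the point is the first sub-partition, the fixed part feeds the buffer.
    return f"sp_{point - 1}" if point > 0 else "TCAM1_fix"


# Two partitions, five configurable sub-partitions, partitioning point at sp_2.
print(partition_of(5, [2]), mux_sources(5, [2]), buffer_input(2))
```

One partitioning point corresponds to two partitions and two points to three partitions, matching the one or two multiplexers that select a buffer in the description above.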
VIII. CONCLUSION
We have presented a partitioning scheme which utilizes the
statistical distribution of prefixes and the individual power
consumption of cells in the cases of match, mismatch and don't
care. We showed that partitioning indeed helps to reduce
the power consumption in IP lookup applications. Partitioning
into two and three parts reduces power consumption considerably,
whereas further partitioning shows little improvement. We also
presented an architecture which allows software control over
partitioning. This architecture can be made more flexible at
the cost of power and silicon area.
REFERENCES 
[1] White Paper, EZchip Technologies, "IPv6 to IPv4 is not merely SO
mcm." (web: hnp://~~u.w.eichip.com/hfml/t~~h Ih6.html)
[2] K. J. Schultz, "Content-addressable memory core cells: A survey,"
Integration, the VLSI Journal 23, Page(s): 171-188, 1997
[3] R. Sergio, R. Chavez, "Encoding don't cares in static and dynamic
content-addressable memories," IEEE Transactions on Circuits and
Systems-II: Analog and Digital Signal Processing, Vol. 39, No. 8,
August 1992
[4] G. Thirugnanam, N. Vijaykrishnan, M.J. Irwin, "A novel low power
CAM design," 14th Annual IEEE International ASIC/SOC Conference
Proceedings, Page(s): 198-202, 12-15 Sept. 2001
[5] I.Y. Liang Hsiao, D.H. Wang, C.W. Jen, "Power modeling and low
power design of content addressable memories," ISCAS 2001, The 2001
IEEE International Symposium on Circuits and Systems, Volume: 4,
Page(s): 926-929, 6-9 May 2001
[6] F. Shafai, K.J. Schultz, G.F.R. Gibson, A.G. Bluschke, D.E. Somppi,
"Fully parallel 30-MHz, 2.5-Mb CAM," IEEE Journal of Solid-State
Circuits, Volume: 33 Issue: 11, Page(s): 1690-1696, Nov. 1998
[7] I. Arsovski, T. Chandler, A. Sheikholeslami, "A ternary content-
addressable memory (TCAM) based on 4T static storage and including
a current-race sensing scheme," IEEE Journal of Solid-State
Circuits, Volume: 38 Issue: 1, Page(s): 155-158, Jan. 2003
[8] T. Chadwick, T. Gordon, R. Nadkarni, J. Rowland, "An ASIC-
embedded content addressable memory with power-saving and design
for test features," IEEE Conference on Custom Integrated Circuits,
Page(s): 183-186, 6-9 May 2001
[9] H. Miyatake, M. Tanaka, Y. Mori, "A design for high-speed low-
power CMOS fully parallel content addressable memory macros,"
IEEE Journal of Solid-State Circuits, Volume: 36 Issue: 6, Page(s):
956-968, June 2001
[10] C.S. Lin, J.C. Chang, B.D. Liu, "A low-power precomputation-based
fully parallel content-addressable memory," IEEE Journal of Solid-
State Circuits, Volume: 38 Issue: 4, Page(s): 654-662, April 2003
[11] BGP table statistics, (web: http://bgp.potaroo.net/), September 22,
2003
[12] D. Shah, P. Gupta, "Fast updating algorithms for TCAMs," IEEE Micro,
Volume: 21 Issue: 1, Page(s): 36-47, Jan-Feb. 2001
