Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0 by Balasubramonian, Rajeev & Muralimanohar, Naveen
40th IEEE/ACM International Symposium on Microarchitecture
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches
With CACTI 6.0 *
Naveen Muralimanohar†, Rajeev Balasubramonian†, Norm Jouppi‡
† School of Computing, University of Utah
‡ Hewlett-Packard Laboratories
Abstract
A significant part of future microprocessor real estate will be dedicated to
L2 or L3 caches. These on-chip caches will heavily impact processor
performance, power dissipation, and thermal management strategies. There are
a number of interconnect design considerations that influence
power/performance/area characteristics of large caches, such as wire models
(width/spacing/repeaters), signaling strategy (RC/differential/transmission),
router design, etc. Yet, to date, there exists no analytical tool that takes
all of these parameters into account to carry out a design space exploration
for large caches and estimate an optimal organization. In this work, we
implement two major extensions to the CACTI cache modeling tool that focus on
interconnect design for a large cache. First, we add the ability to model
different types of wires, such as RC-based wires with different power/delay
characteristics and differential low-swing buses. Second, we add the ability
to model Non-uniform Cache Access (NUCA). We not only adopt state-of-the-art
design space exploration strategies for NUCA, we also enhance this
exploration by considering on-chip network contention and a wider spectrum of
wiring and routing choices. We present a validation analysis of the new tool
(to be released as CACTI 6.0) and present a case study to showcase how the
tool can improve architecture research methodologies.

Keywords: cache models, non-uniform cache architectures (NUCA), memory
hierarchies, on-chip interconnects.
1. Introduction

Multi-core processors will incorporate large and complex cache hierarchies.
The Intel Montecito employs two 12 MB private L3 caches, one for each core
[25]. Intel is already prototyping an 80-core processor [30, 38] and there is
speculation that entire dies in a 3D package may be employed for large SRAM
caches or DRAM main memory [7, 23, 30]. Therefore, it is expected that future
processors will have to intelligently manage many megabytes of on-chip cache.
Future research will likely explore architectural mechanisms to (i) organize
the L2 or L3 cache into shared/private domains, (ii) move data to improve
locality and sharing, and (iii) optimize the network parameters (topology,
routing) for efficient communication between cores and cache banks. Examples
of ongoing research in these veins include [5, 6, 10, 11, 12, 15, 19, 20, 21,
22, 33, 44].

* We thank the anonymous reviewers for their helpful suggestions. This work
was supported in parts by NSF grant CCF-0430063 and NSF CAREER award
CCF-0545959.
Many cache evaluations employ the CACTI cache access modeling tool [43] to
estimate delay, power, and area for a given cache size¹. The CACTI estimates
are invaluable in setting up baseline simulator parameters, computing
temperatures of blocks neighboring the cache, evaluating the
merits/overheads of novel cache organizations (banking, reconfiguration,
additional bits of storage), etc. While CACTI 5.0 produces reliable
delay/power/area estimates for moderately sized caches, it does not model the
requirements of large caches in sufficient detail. Besides, the search space
of the tool is limited and hence so is its application in power/performance
trade-off studies. With much of future cache research focused on
multi-megabyte cache hierarchy design, this is a serious shortcoming. Hence,
in this work, we extend the CACTI tool in many ways, with the primary goal of
improving the fidelity of its large cache estimates. The tool can also aid in
trade-off analysis: for example, with a comprehensive design space
exploration, CACTI 6.0 can identify cache configurations that consume three
times less power for about a 25% delay penalty.
The main enhancement provided in CACTI 6.0 is a very detailed modeling of the
interconnect between cache banks. A large cache is typically partitioned into
many smaller banks and an inter-bank network is responsible for communicating
addresses and data between banks and the cache controller. Earlier versions
of CACTI have employed a simple H-tree network with global wires and have
assumed uniform access times for every bank (a uniform cache access
architecture, referred to as UCA). Recently, non-uniform cache architectures
(NUCA [21]) have also been proposed that employ a packet-switched network
between banks and yield access times that are a function of where data blocks
are found (not a function of the latency to the most distant bank). We add
support for such an architecture within CACTI.
Whether we employ a packet-switched or H-tree network, the delay and power of
the network components dominate the overall cache access delay and power as
the size of the cache scales up. Figure 1 shows that the H-tree of the CACTI
5.0 model contributes an increasing percentage to the overall cache delay as
the cache size is increased from 2 to 32 MB. Its contribution to total cache
power is also sizeable, around 50%². The inter-bank network itself is
sensitive to many parameters, especially the wire signaling strategy, wire
parameters, topology, router configuration, etc. The new version of the tool
carries out a design space exploration over these parameters to estimate a
cache organization that optimizes a combination of power/delay/area metrics
for UCA and NUCA architectures. Network contention plays a non-trivial role
in determining the performance of an on-chip network design. We also augment
the design space exploration with empirical data on network contention.

¹ The first four versions of CACTI [31, 32, 35, 43] have been cited by more
than 1000 papers and are also incorporated into other architectural
simulators such as Wattch [8].

Figure 1. Contribution of H-tree network to overall cache delay and power.

Components of the tool are partially validated against detailed Spice
simulations. We also present an example case study to demonstrate how the
tool can facilitate architectural evaluations. The paper is organized as
follows. Section 2 describes related work. Section 3 describes the baseline
CACTI 5.0 model. Section 4 provides details on the new interconnect models
and other enhancements integrated into CACTI 6.0. A case study using CACTI
6.0 is discussed in Section 5. We draw conclusions in Section 6.
2. Related Work

The CACTI tool was first released by Wilton and Jouppi in 1993 [43] and it
has undergone four major revisions since then [31, 32, 35]. More details on
the latest CACTI 5.0 version are provided in Section 3. The primary
enhancements of CACTI 2.0 [31] were power models and multi-ported caches;
CACTI 3.0 [32] added area models, independently addressed banks, and better
sense-amp circuits; CACTI 4.0 [35] improved upon various SRAM circuit
structures, moved from aluminium wiring to copper, and included leakage power
models; CACTI 5.0 adds support for DRAM modeling. The CACTI 6.0 extensions
described in this paper represent a major shift in focus and add support for
new interconnect components that dominate cache delay and power. Unlike the
prior revisions of CACTI that focused on bank and SRAM cell modeling, the
current revision focuses on interconnect design. The results in later
sections demonstrate that the estimates of CACTI 6.0 are a significant
improvement over the estimates of CACTI 5.0.

² As the cache size increases, the bitline power component also grows. Hence,
the contribution of H-tree power as a percentage remains roughly constant.

[Figure 1 plot: H-tree delay percentage and H-tree power percentage versus
Cache Size (MB), for 2, 4, 8, 16, and 32 MB.]
A few other extensions of CACTI can also be found in the literature,
including multiple different versions of eCACTI (enhanced CACTI). eCACTI from
the University of California-Irvine models leakage parameters and gate
capacitances within a bank in more detail [24] (some of this is now part of
CACTI 4.0 [35]). A prior version of eCACTI [1] has been incorporated into
CACTI 3.0 [32]. 3DCACTI is a tool that implements a cache across multiple
stacked dies and considers the effects of various inter-die connections [37].
A number of tools [3, 9, 14, 16, 34, 39, 41] exist in the literature to model
network-on-chip (NoC). The Orion toolkit from Princeton does a thorough
analytical quantification of dynamic and leakage power within router elements
[41] and some of this is included in CACTI 6.0. However, Orion does not
consider interconnect options, nor carry out a design space exploration
(since it is oblivious of the properties of the components that the network
is connecting). NOCIC [39] is another model that is based on Spice
simulations of various signaling strategies. Given a tile size, it identifies
the delay and area needs of each signaling strategy.
A recent paper by Muralimanohar and Balasubramonian [27] describes a
methodology to extend CACTI's design space exploration to estimate an optimal
NUCA cache organization. That work also proposed techniques to exploit
specific wires and topologies for the address network. This work adopts a
similar initial strategy when considering NUCA organizations. In addition to
the creation of a tool for public distribution, we include a number of
features that are not part of prior work:
• Extend the design space exploration to different wire and router types.
• Consider the use of low-swing differential signaling in addition to
  traditional global wires.
• Incorporate the effect of network contention during the design space
  exploration.
• Take bank cycle time into account in estimating the cache bandwidth.
• Validate a subset of the newly incorporated models.
• Improve upon the tool API, including the ability to specify novel metrics
  involving power, delay, area, and bandwidth.
• Provide insight on the tool's design space exploration and trade-off
  analysis, as well as example case studies.
3. Background

This section presents some basics on the CACTI 5.0 model, beginning with the
logical structure of a uniform cache access (UCA) organization (Figure 2(a)).
The address is provided as input to the decoder, which then activates a
wordline in the data array and tag array. The contents of an entire row
(referred to as a set) are placed on the bitlines, which are then sensed. The
multiple tags thus read out of the tag array are compared against the input
address to detect if one of the ways of the set does contain the requested
data. This comparator logic drives the multiplexor that finally forwards at
most one of the ways read out of the data array back to the requesting
processor.

Figure 2. Logical and physical organization of the cache (from CACTI 3.0
[32]). (a) Logical organization of a cache. (b) Example physical organization
of the data array.

The CACTI cache access model [35] takes in the following major parameters as
input: cache capacity, cache block size (also known as cache line size),
cache associativity, technology generation, number of ports, and number of
independent banks (not sharing address and data lines). As output, it
produces the cache configuration that minimizes delay (with a few
exceptions), along with its power and area characteristics. CACTI models the
delay/power/area of eight major cache components: decoder, wordline, bitline,
senseamp, comparator, multiplexor, output driver, and inter-bank wires. The
wordline and bitline delays are two of the most significant components of the
access time. They are quadratic functions of the width and height of each
array, respectively. In practice, the tag and data arrays are large enough
that it is inefficient to implement them as single large structures. Hence,
CACTI partitions each storage array (in the horizontal and vertical
dimensions) to produce smaller sub-arrays and reduce wordline and bitline
delays. The bitline is partitioned into Ndbl different segments, the wordline
is partitioned into Ndwl segments, and so on. Each sub-array has its own
decoder and some central pre-decoding is now required to route the request to
the correct sub-array. The most recent version of CACTI employs a model for
semi-global (intermediate) wires and an H-tree network to compute the delay
between the pre-decode circuit and the furthest cache sub-array. CACTI
carries out an exhaustive search across different sub-array counts (different
values of Ndbl, Ndwl, etc.) and sub-array aspect ratios to compute the cache
organization with optimal total delay. Typically, the cache is organized into
a handful of sub-arrays. An example of the cache's physical structure is
shown in Figure 2(b).

4. CACTI 6.0 Enhancements

4.1. Interconnect Modeling for UCA Caches
As already shown in Figure 1, as cache size increases, the interconnect
(within the H-tree network) plays an increasingly greater role in terms of
access time and power. The interconnect overhead is impacted by (i) the
number of sub-arrays, (ii) the signaling strategy, and (iii) the wire
parameters. While prior versions of CACTI iterate over the number of
sub-arrays by exploring different values of Ndbl, Ndwl, Nspd, etc., the model
is restricted to a single signaling strategy and wire type (global wire).
Thus, the design space exploration sees only a modest amount of variation in
the component that dominates the overall cache delay and power. We therefore
extend the design space exploration to also include a low-swing differential
signaling strategy as well as the use of local and fat wires.
Figure 3. (a) Design space exploration with global wires. (b) Design space
exploration with full-swing global wires (red, bottom region), wires with 30%
delay penalty (yellow, middle region), and differential low-swing wires
(blue, top region).

Figure 4. 8-bank data array with a differential low-swing broadcast bus.

Table 1. Benchmark sets

Memory intensive benchmarks:          applu, fma3d, swim, lucas,
                                      equake, gap, vpr, art
L2/L3 latency sensitive benchmarks:   ammp, apsi, art, bzip2,
                                      crafty, eon, equake, gcc
Half latency sensitive & half
non-latency sensitive benchmarks:     ammp, applu, lucas, bzip2,
                                      crafty, mgrid, mesa, gcc
Random benchmark set:                 Entire SPEC suite

The delay of a wire is a function of its RC time constant, which increases
quadratically with the length of the wire. The insertion of optimally sized
repeaters at regular intervals makes the delay a linear function of wire
length. For long latency wires, the wire throughput can be increased by
inserting pipelined latches at regular intervals. However, the use of
repeaters and pipeline latches at regular intervals requires that the voltage
levels on these wires swing across the full range (0 to VDD) for proper
operation. Given the quadratic dependence between voltage and power, these
full-swing wires dissipate a large amount of power. Also, the silicon area
requirement imposed by repeaters and latches precludes the possibility of
routing these wires on top of other modules. In essence, traditional global
wires, of the kind employed in CACTI 5.0, are fast but entail high power and
routing complexity. Further, the delay, power, and bandwidth properties of
these wires can be varied by changing the following parameters: (i) wire
width, (ii) wire spacing, (iii) repeater size, and (iv) repeater spacing. By
considering these choices, a segment on the H-tree network can be made to
have optimal delay while having much higher power and area requirements.
Alternatively, the segment can be made to have low power requirements while
incurring a delay or area penalty. The considered wire types and their
properties are summarized later in this section. By examining these choices
and carrying out a more comprehensive design space exploration, CACTI 6.0 is
able to identify cache organizations that better meet the user-specified
metrics. Figure 3(a) shows a power-delay curve where each point represents
one of the many hundreds of cache organizations considered by CACTI 5.0. The
red points represent the cache organizations that would have been considered
by CACTI 5.0 with its limited design space exploration with a single wire
type (global wire with delay-optimal repeaters). The yellow points (middle
region) in Figure 3(b) represent cache organizations with different wire
types that are considered by CACTI 6.0. Clearly, by considering the
trade-offs made possible with wires and expanding the search space, CACTI 6.0
is able to identify cache organizations with very relevant delay and power
values.
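The quadratic-versus-linear delay behaviour of unrepeated and repeated wires described above can be illustrated with a minimal sketch. It is not CACTI code: the 0.38 coefficient is the usual textbook approximation for a distributed RC line, and the function names, units, and the fixed per-repeater delay are assumptions made for illustration.

```python
# Illustrative sketch only (not CACTI 6.0 code). The 0.38 coefficient is the
# common textbook approximation for a distributed RC line; the per-repeater
# delay is a hypothetical lumped constant supplied by the caller.

def unrepeated_delay(r_per_mm, c_per_mm, length_mm):
    """Delay of an unbuffered wire; grows with the square of its length."""
    return 0.38 * (r_per_mm * length_mm) * (c_per_mm * length_mm)


def repeated_delay(r_per_mm, c_per_mm, length_mm, segment_mm, repeater_delay):
    """Delay of a wire cut into equal segments, each driven by a repeater.

    Every segment contributes the same delay, so the total grows linearly
    with wire length.
    """
    segments = length_mm / segment_mm
    per_segment = unrepeated_delay(r_per_mm, c_per_mm, segment_mm) + repeater_delay
    return segments * per_segment
```

Doubling the wire length quadruples the result of `unrepeated_delay` but only doubles the result of `repeated_delay`, which is the reason repeaters are inserted on long global wires.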
One of the primary reasons for the high power dissipation of global wires is
the full swing requirement imposed by the repeaters. While we are able to
somewhat reduce the power requirement by reducing repeater size and
increasing repeater spacing, the requirement is still relatively high. Low
voltage swing alternatives represent another mechanism to vary the wire
power/delay/area trade-off. Reducing the voltage swing on global wires can
result in a linear reduction in power. In addition, assuming a separate
voltage source for low-swing drivers will result in a quadratic savings in
power. But these lucrative power savings are accompanied by many caveats.
Since we can no longer use repeaters or latches, the delay of a low-swing
wire increases quadratically with length. Since such a wire cannot be
pipelined, it also suffers from lower throughput. A low-swing wire requires
special transmitter and receiver circuits for signal generation and
amplification. This not only increases the area requirement per bit, but also
assigns a fixed cost in terms of both delay and power for each bit traversal.
In spite of these issues, the power savings possible through low-swing
signaling make it an attractive design choice. The detailed methodology for
the design of low-swing wires and their overhead is described later in this
section. In general, low-swing wires have superior power characteristics but
incur high area and delay overheads.
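The linear and quadratic power savings mentioned above follow from the energy drawn from a supply when a capacitance is charged through a given swing. The sketch below is a minimal illustration under that standard model; the function names and example voltages are assumptions, not values from the tool.

```python
# Illustrative sketch only. Charging a capacitance c_load through a swing of
# v_swing draws charge c_load * v_swing from the supply, so the energy taken
# from a supply at v_supply is c_load * v_swing * v_supply.

def full_swing_energy(c_load, v_dd):
    """Energy per transition when the wire swings across the full 0..v_dd range."""
    return c_load * v_dd * v_dd


def low_swing_energy(c_load, v_swing, v_supply):
    """Energy per transition for a reduced swing drawn from a supply at v_supply.

    v_supply == v_dd      -> saving is linear in v_swing / v_dd.
    v_supply == v_swing   -> saving is quadratic (separate low-voltage source).
    """
    return c_load * v_swing * v_supply
```

For a hypothetical 1.0 V supply and a 0.1 V swing, the shared-supply case uses one tenth of the full-swing energy, while a dedicated 0.1 V source uses one hundredth.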
The choice of an H-tree network for CACTI 5.0 (and earlier versions of CACTI)
was made for the following reason: it enables uniform access times for each
bank, which in turn simplifies the pipelining of requests across the network.
Since low-swing wires cannot be pipelined and since they better amortize the
transmitter/receiver overhead over long transfers, we adopt a different
network style when using low-swing wires. Instead of the H-tree network, we
adopt a collection of simple broadcast buses that span across all the banks
(each bus is shared by half the banks in a column; an example with eight
banks is shown in Figure 4). The banks continue to have uniform access times,
as determined by the worst-case delay. Since the bus is not pipelined, the
wire delay limits the throughput as well and decreases the operating
frequency of the cache. The cycle time of a cache is equal to the maximum
delay of a segment that cannot be pipelined. Typically, the sum of bitline
and sense amplifier delay decides the cycle time of a cache. In a low-swing
model, the cycle time is determined by the maximum delay of the low-swing
bus. We also consider low-swing wires with varying width and spacing that
further play into the delay/power/area trade-offs.

Table 2. Simplescalar simulator parameters.

Parameter                    Value
Fetch queue size             [missing]
Branch predictor             comb. of bimodal and 2-level
Bimodal predictor size       [missing]
Level 1 predictor            16K entries, history 12
Level 2 predictor            [missing]
BTB size                     16K sets, 2-way
Branch mispredict penalty    at least 12 cycles
Fetch width                  8 (across up to 2 basic blocks)
Dispatch and commit width    8
Issue queue size             60 (int and fp, each)
Register file size           100 (int and fp, each)
Re-order Buffer size         80
L1 I-cache                   32KB 2-way
L1 D-cache                   32KB 2-way set-associative,
                             3 cycles, 4-way word-interleaved
L2 cache                     32MB 8-way SNUCA
L2 Block size                64B
[label missing]              128 entries, 8KB page size
Memory latency               300 cycles for the first chunk
Network topology             Grid
Flow control mechanism       Virtual channel
No. of virtual channels      4 /physical channel
Back pressure handling       Credit based flow control
With low-swing wires included in the CACTI design space exploration, the tool
is able to identify many more points that yield low power at a performance
and area cost. The blue points (top region) in Figure 3(b) represent the
cache organizations considered with low-swing wires. Thus, by leveraging
different wire properties, it is possible to generate a broad range of cache
models with different power/delay characteristics.
4.2. NUCA Modeling
The UCA cache model discussed so far has an access time that is limited by
the delay of the slowest sub-bank. A more scalable approach for future large
caches is to replace the H-tree bus with a packet-switched on-chip grid
network. The latency for a bank is determined by the delay to route the
request and response between the bank that contains the data and the cache
controller. Such a NUCA model was first proposed by Kim et al. [21] and has
been the subject of many architectural evaluations. We therefore extend CACTI
to support such NUCA organizations as well.
The tool first iterates over a number of bank organizations: the cache is
partitioned into 2^N banks (where N varies from 1 to 12); for each N, the
banks are organized in a grid with 2^M rows (where M varies from 0 to N). For
each bank organization, CACTI 5.0 is employed to determine the optimal
sub-array partitioning for the cache within each bank. Each bank is
associated with a router. The average delay for a cache access is computed by
estimating the number of network hops to each bank, the wire delay
encountered on each hop, and the cache access delay within each bank. We
further assume that each traversal through a router takes up R cycles, where
R is a user-specified input. Router pipelines can be designed in many ways: a
four-stage pipeline is commonly advocated [13], and recently, speculative
pipelines that take up three, two, and one pipeline stage have also been
proposed [13, 26, 28]. While we give the user the option to pick an
aggressive or conservative router, the tool defaults to employing a
moderately aggressive router pipeline with three stages.
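The iteration over bank organizations described above can be sketched as follows. This is a simplified illustration, not the tool's implementation: it assumes the cache controller sits at grid corner (0, 0), counts hops as Manhattan distance, charges each hop one link traversal plus one router traversal in each direction, and takes the per-bank access time and per-hop link delay from hypothetical caller-supplied functions `bank_cycles_for` and `link_cycles_for`.

```python
# Illustrative sketch only (not CACTI 6.0 code). Controller placement, the hop
# model, and the two callback functions are assumptions for this example.

def grid_organizations(max_n=12):
    """Yield (n, m, rows, cols): 2^n banks arranged in 2^m rows."""
    for n in range(1, max_n + 1):
        for m in range(0, n + 1):
            yield n, m, 2 ** m, 2 ** (n - m)


def average_access_cycles(rows, cols, bank_cycles, link_cycles, router_cycles=3):
    """Mean round-trip latency over all banks, controller assumed at (0, 0)."""
    total = 0
    for r in range(rows):
        for c in range(cols):
            hops = r + c  # Manhattan distance to the bank
            total += 2 * hops * (link_cycles + router_cycles) + bank_cycles
    return total / (rows * cols)


def best_organization(bank_cycles_for, link_cycles_for, router_cycles=3, max_n=12):
    """Return (cycles, rows, cols) for the lowest average latency found."""
    best = None
    for n, m, rows, cols in grid_organizations(max_n):
        cycles = average_access_cycles(
            rows, cols, bank_cycles_for(n), link_cycles_for(n), router_cycles
        )
        if best is None or cycles < best[0]:
            best = (cycles, rows, cols)
    return best
```

With N ranging from 1 to 12 and M from 0 to N, the search visits 90 grid shapes. The default of three router cycles mirrors the three-stage router pipeline mentioned in the text.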
More partitions lead to smaller delays (and power) within each bank, but
greater delays (and power) on the network (because of the constant overheads
associated with each router and decoder). Hence, the above design space
exploration is required to estimate the cache partition that yields optimal
delay or power. The above algorithm was recently proposed by Muralimanohar
and Balasubramonian [27]. We further extend this algorithm in the following
ways.
First, we explore different wire types for the links between adjacent
routers. These wires are modeled as low-swing differential wires as well as
local/global/fat wires to yield many points in the power/delay/area spectrum.

Second, we model different types of routers. The sizes of buffers and virtual
channels within a router have a major influence on router power consumption
as well as router contention under heavy load. By varying the number of
virtual channels per physical channel and the number of buffers per virtual
channel, we are able to achieve different points on the router power-delay
trade-off curve.
Third, we model contention in the network in much greater detail. This itself
has two major components. If the cache is partitioned into many banks, there
are more routers/links on the network and the probability of two packets
conflicting at a router decreases. Thus, a many-banked cache is more capable
of meeting the bandwidth demands of a many-core system. Further, certain
aspects of the cache access within a bank cannot be easily pipelined. The
longest such delay within the cache access (typically the bitline and
sense-amp delays) represents the cycle time of the bank: it is the minimum
delay between successive accesses to that bank. A many-banked cache has
relatively small banks and a relatively low cycle time, allowing it to
support a higher throughput and lower wait-times once a request is delivered
to the bank. Both of these components (lower contention at routers and lower
contention at banks) tend to favor a many-banked system. This aspect is also
included in estimating the average access time for a given cache
configuration.
Figure 5. NUCA design space exploration. (a) Total network contention
value/access for CMPs with different NUCA organizations (x-axis: Bank Count).
(b) Optimal NUCA organization.

The contention values for each considered NUCA cache organization are
empirically estimated for typical workloads and incorporated into CACTI 6.0
as look-up tables. For each of the grid topologies considered (for different
values of N and M), we simulated L2 requests originating from single-core,
two-core, four-core, eight-core, and sixteen-core processors. Each core
executes a mix of programs from the SPEC benchmark suite. We divide the
benchmark set into four categories, as described in Table 1. For every CMP
organization, we run four sets of simulations, corresponding to each
benchmark set tabulated. The generated cache traffic is then modeled on a
detailed network simulator with support for virtual channel flow control.
Details of the architectural and network simulator are listed in Table 2. The
contention value (averaged across the various workloads) at routers and banks
is estimated for each network topology and bank cycle time. Based on the
user-specified inputs, the appropriate contention values in the look-up table
are taken into account during the design space exploration. Some of this
empirical data is represented in Figure 5(a). We observe that for many-core
systems, the contention in the network can be as high as 30 cycles per access
(for a two banked model) and cannot be ignored during the design space
exploration.
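A look-up of this kind might be organized as in the sketch below. The key layout, the function names, and the rule of rounding a requested core count up to the next simulated configuration are all assumptions for illustration; the actual table format inside the tool is not described here. The single table entry in the usage note reuses the 30-cycle figure quoted above for a two-banked cache and is otherwise a placeholder.

```python
# Illustrative sketch only (not the CACTI 6.0 table format). Keys are assumed
# to be (n, m, cores) for a grid of 2^n banks in 2^m rows; values are average
# contention cycles per access.

SIMULATED_CORE_COUNTS = (1, 2, 4, 8, 16)  # core counts simulated in the text


def nearest_core_count(cores):
    """Round a requested core count up to the next simulated configuration."""
    for simulated in SIMULATED_CORE_COUNTS:
        if cores <= simulated:
            return simulated
    return SIMULATED_CORE_COUNTS[-1]


def contention_cycles(table, n, m, cores):
    """Fetch the contention value for a grid shape and core count."""
    return table[(n, m, nearest_core_count(cores))]


def access_cycles_with_contention(base_cycles, table, n, m, cores):
    """Add the empirical contention term to a contention-free access time."""
    return base_cycles + contention_cycles(table, n, m, cores)
```

For example, with `table = {(1, 0, 16): 30.0}`, a request for 12 cores on the two-bank grid resolves to the sixteen-core entry and adds 30 cycles to the contention-free access time.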
For a network with completely pipelined links and routers, these contention
values are only a function of the router topology and bank cycle time and
will not be affected by process technology or L2 cache size³. If CACTI is
being employed to compute an optimal L3 cache organization, the contention
values will likely be much less because the L2 cache filters out most
requests. To handle this case, we also computed the average contention values
assuming a large 2 MB L1 cache and this is incorporated into the model as
well. In summary, the network contention values are impacted by the following
parameters: M, N, bank cycle time, number of cores, router configuration
(VCs, buffers), and size of preceding cache. We plan to continue augmenting
the tool with empirical contention values for other relevant sets of
workloads such as commercial, multi-threaded, and transactional benchmarks
with significant traffic from cache coherence.
Figure 5(b) shows an example design space exploration for a 32 MB NUCA L2
cache while attempting to minimize latency. The X-axis shows the number of
banks that the cache is partitioned into. For each point on the X-axis, many
different bank organizations are considered and the organization with optimal
delay (averaged across all banks) is finally represented on the graph. The
Y-axis represents this optimal delay and it is further broken down to
represent the contributing components: bank access time, link and router
delay, router and bank contention. We observe that the optimal delay is
experienced when the cache is organized as a 2 x 4 grid of 8 banks.

³ We assume here that the cache is organized as static-NUCA (SNUCA), where
the address index bits determine the unique bank where the address can be
found and the access distribution does not vary greatly as a function of the
cache size. CACTI is designed to be more generic than specific. The
contention values are provided as a guideline to most users. If a user is
interested in a more specific NUCA policy, there is no substitute for
generating the corresponding contention values and incorporating them in the
tool. As a case study in Section 5, we examine a different NUCA policy.
4.3. Wire Models
This section details the analytical model for delay and power calculation of
different wires. We begin with a description of the delay and power model for
global wires, then describe how wires with different power-delay
characteristics can be modeled. Finally, we discuss the methodology for
calculating delay and power for low-swing wires.
4.3.1 Full-Swing Repeated Wires

For full-swing wires, the delay of a wire is governed by its RC time constant
(R is resistance, C is capacitance). The resistance and capacitance per unit
length are governed by the following equations [17]:

$$R_{wire} = \frac{\rho}{(thickness - barrier)(width - 2\,barrier)} \qquad (1)$$

$$C_{wire} = \epsilon_0 \left( 2K\,\epsilon_{horiz}\,\frac{thickness}{spacing} + 2\,\epsilon_{vert}\,\frac{width}{layerspacing} \right) + fringe(\epsilon_{horiz}, \epsilon_{vert}) \qquad (2)$$

Thickness and width represent the geometrical dimensions of the wire
cross-section, barrier represents the thin barrier layer around the wire to
prevent copper from diffusing into surrounding oxide, and ρ is the material
resistivity. The potentially different relative dielectrics for the vertical
and horizontal capacitors are represented by ε_horiz and ε_vert, K accounts
for Miller-effect coupling capacitances, spacing represents the gap between
adjacent wires on the same metal layer, and layerspacing represents the gap
between adjacent metal layers.
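The per-unit-length resistance and capacitance expressions above translate directly into code. The sketch below is illustrative: the function and argument names are assumptions, SI units are assumed throughout, and the fringe term is taken as a caller-supplied number because its functional form is not given in the text.

```python
# Illustrative sketch only (not CACTI 6.0 code). All lengths in metres,
# resistivity in ohm-metres; the fringe capacitance is passed in as a number
# because the text does not define the fringe() function.

EPSILON_0 = 8.854e-12  # permittivity of free space, F/m


def wire_resistance_per_m(rho, thickness, width, barrier):
    """Resistance per unit length; the barrier layer shrinks the conducting area."""
    return rho / ((thickness - barrier) * (width - 2 * barrier))


def wire_capacitance_per_m(k_miller, eps_horiz, eps_vert,
                           thickness, width, spacing, layer_spacing, fringe):
    """Capacitance per unit length: sidewall coupling, vertical plates, fringe."""
    sidewall = 2 * k_miller * eps_horiz * thickness / spacing
    vertical = 2 * eps_vert * width / layer_spacing
    return EPSILON_0 * (sidewall + vertical) + fringe
```

Widening a wire lowers its resistance, and widening the spacing lowers its sidewall coupling capacitance, which is why the fat wires considered below are faster than global wires.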
A  change in the w id th  and spacing  o f  w ires y ie lds d if­
ferent delay  and pow er characteristics. F or the  CA C TI 
design  space  exploration , w e restrict ourselves to sem i- 
g lobal, g lobal, and fat w ires that have  w idth  and spacing 
in the  ratio  1:2 :16. A s a  result, the latencies for these w ires 
are  in the ratio  8:4:1, and their pow er consum ption  is in the 
ratio  2:1.8:1.
It has also  been  derived that a  w ire yie lds optim al delay 
if  repeaters have the  fo llow ing spacing  (L optinlaf) and  size 
(Sopt irnal ) 14]:






In the above equations, c_0 is the capacitance of the
minimum sized repeater, c_p is its output parasitic capacitance,
and r_s is its output resistance. Banerjee et al. [4] describe
a methodology to compute a repeater configuration that
minimizes power while giving up performance. We adopt
a similar methodology to compute the power-delay trade-off
for various repeater configurations. Figure 6 shows the
relative power and delay as a function of wire length for
a delay-optimal global wire and wires that trade off delay
for lower power.
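As a rough illustration of the delay-optimal repeater configuration, the sketch below uses the standard closed forms from repeater-insertion theory, expressed with the symbols defined above (c_0, c_p, r_s, and the per-unit-length wire resistance and capacitance). The function name is hypothetical, and the power-optimal methodology of [4] is not modeled here.

```python
import math


def optimal_repeater(r_wire, c_wire, r_s, c_0, c_p):
    """Return (spacing, size) of delay-optimal repeaters.

    r_wire, c_wire: wire resistance and capacitance per unit length.
    r_s, c_0, c_p: output resistance, input capacitance, and output
    parasitic capacitance of a minimum-sized repeater.
    Size is expressed as a multiple of the minimum-sized repeater.
    """
    spacing = math.sqrt(2 * r_s * (c_0 + c_p) / (r_wire * c_wire))
    size = math.sqrt(r_s * c_wire / (r_wire * c_0))
    return spacing, size
```

Under these expressions, a wire with four times the resistance per unit length needs repeaters that are half as large and placed half as far apart.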
4.3.2 Differential Low-swing Wires
A low-swing interconnect system consists of three main
components: (1) a transmitter that generates the low-swing
signal, (2) twisted differential wires, and (3) a receiver
amplifier. For the transmitter circuit, we employ
the model proposed by Ho et al. [18]. To improve delay
characteristics (equations not reproduced here), the
transmitter circuit uses pre-emphasis and pre-equalization
optimization techniques. Pre-emphasis reduces the wire
charging/discharging time by using a drive voltage
significantly higher than the minimum signal required by the
receiver. Pre-equalizing the wires enables signal transfer
with half the voltage swing.
The total capacitance of the low-swing segment is
given by

$$C_{load} = C_{wire} + 2\,C_{drain} + C_{sense\text{-}amp}$$

where C_drain is the drain capacitance of the driver
transistor. The dynamic energy is expressed as

$$E = C_{load} \cdot V_{overdrive} \cdot V_{lowswing}$$

For our evaluations,
we assume an overdrive voltage of 200mV and a low-swing
voltage of 100mV. At the receiver, we employ the
same sense-amplifier circuit used by CACTI for its bitline
sensing [35]. The power and delay characteristics of
low-swing wires are also represented in Figure 6.
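The energy calculation for a low-swing segment can be sketched as follows. The sketch assumes that the load is the wire capacitance plus two driver drain capacitances plus the sense-amplifier capacitance, and that dynamic energy is the product of load, overdrive voltage, and low-swing voltage; the function name is hypothetical and the default voltages are the 200mV and 100mV values quoted above.

```python
def low_swing_energy(c_wire, c_drain, c_sense_amp,
                     v_overdrive=0.2, v_low_swing=0.1):
    """Dynamic energy in joules for one transition on a low-swing segment.

    Capacitances are in farads, voltages in volts. The load expression
    and the energy product are assumptions stated in the lead-in.
    """
    c_load = c_wire + 2 * c_drain + c_sense_amp
    return c_load * v_overdrive * v_low_swing


# Example: a 1 pF wire with 50 fF driver drains and a 10 fF sense amplifier.
example = low_swing_energy(1e-12, 50e-15, 10e-15)
```

Because both voltages are a fraction of the supply, the energy per transition is far below the C·V² cost of a full-swing wire driving the same load.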
4.4. Router Models
As discussed earlier, various routers have been proposed
with differing levels of speculation and pipeline
stages [13, 26, 28]. The number of stages for each
router is left as a user-specified input, defaulting to 3 cycles.
For router power, we employ the analytical power
models for crossbars and arbiters employed in the Orion
toolkit [41]. CACTI's RAM model is employed for router
buffer power. These represent the primary contributors
to network power (in addition to link power, which was
discussed in the previous sub-section). We restrict ourselves
to a grid topology where each router has 5 inputs
and 5 outputs, and consider three points on the
power-performance trade-off curve. Each point provides a
different number of buffers per virtual channel and a different
number of virtual channels per physical channel. Accordingly,
we see a significant variation in buffer capacity
(and power) and contention cycles at routers. As before,
the contention cycles are computed with detailed network
simulations. Table 3 specifies the three types of routers
and their corresponding buffer, crossbar, and arbiter
energy values.
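To show how the per-component energies in Table 3 might combine, the sketch below adds them up for one flit passing through one router. The composition is an assumption made for illustration (one buffer write, one buffer read, one arbitration, and one crossbar traversal per flit); the dictionary and function names are hypothetical, and the numbers are those listed in Table 3.

```python
# Per-operation energies in joules from Table 3 (32nm, 128-bit flits).
ROUTER_ENERGY = {
    "config1": {"arbiter": 0.33e-12, "crossbar": 0.99e-11,
                "buffer_read": 0.11e-11, "buffer_write": 0.14e-11},
    "config2": {"arbiter": 0.27e-12, "crossbar": 0.99e-11,
                "buffer_read": 0.76e-12, "buffer_write": 0.10e-11},
    "config3": {"arbiter": 0.27e-12, "crossbar": 0.99e-11,
                "buffer_read": 0.50e-12, "buffer_write": 0.82e-12},
}


def energy_per_hop(config):
    """Assumed router energy for one flit: write, read, arbitrate, traverse."""
    e = ROUTER_ENERGY[config]
    return e["buffer_write"] + e["buffer_read"] + e["arbiter"] + e["crossbar"]


for name in ROUTER_ENERGY:
    print(name, energy_per_hop(name))
```

Under this assumption the crossbar dominates the per-hop energy in all three configurations, so the configurations differ mainly in buffer cost and in the contention cycles they incur.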
4.5. Improvement in Trade-Off Analysis
For architectural studies, especially those related to
memory hierarchy design, an early estimate of cache access
time and power for a given input configuration is crucial
in making a sound evaluation. As described in Section 3,
CACTI 5.0 carries out a design space exploration
over various sub-array partitions; it then eliminates
organizations that have an area that is 50% higher than the
optimal area; it further eliminates those organizations that
have an access time value more than 10% above the minimum
value; and finally selects an organization using a cost
function that minimizes power and cycle time.
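The three-step selection described above can be sketched as a simple filter pipeline. This is an illustration under stated assumptions, not CACTI 5.0's code: the `Org` record is hypothetical, the "optimal area" is taken to be the smallest area among all candidates, the minimum access time is taken among the organizations that survive the area filter, and the final cost is assumed to be an equal-weight sum of normalized power and cycle time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Org:
    """Hypothetical summary of one candidate cache organization."""
    name: str
    area: float
    access_time: float
    power: float
    cycle_time: float


def select_organization(orgs, max_area_overhead=0.50, max_time_overhead=0.10):
    # Step 1: drop organizations whose area exceeds the smallest area by over 50%.
    min_area = min(o.area for o in orgs)
    by_area = [o for o in orgs if o.area <= min_area * (1 + max_area_overhead)]

    # Step 2: drop survivors whose access time exceeds the minimum by over 10%.
    min_time = min(o.access_time for o in by_area)
    by_time = [o for o in by_area
               if o.access_time <= min_time * (1 + max_time_overhead)]

    # Step 3: pick the survivor with the lowest (assumed) power + cycle-time cost.
    min_power = min(o.power for o in by_time)
    min_cycle = min(o.cycle_time for o in by_time)
    return min(by_time,
               key=lambda o: o.power / min_power + o.cycle_time / min_cycle)
```

Note that the fixed thresholds can discard an organization that is much more power-efficient but only slightly outside the area or access-time limits.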
Modern processor design is not singularly focused on
performance and many designers are willing to compromise
some performance for improved power. Many future
studies will likely carry out trade-off analyses involving
performance, power, and area. To facilitate such analyses,
the new version of the tool adopts the following cost
function to evaluate a cache organization (taking into
account delay, leakage power, dynamic power, cycle time,
and area):
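$$\mathit{cost} = W_{acc\_time}\frac{\mathit{acc\_time}}{\mathit{min\_acc\_time}} + W_{dyn\_power}\frac{\mathit{dyn\_power}}{\mathit{min\_dyn\_power}} + W_{leak\_power}\frac{\mathit{leak\_power}}{\mathit{min\_leak\_power}} + W_{cycle\_time}\frac{\mathit{cycle\_time}}{\mathit{min\_cycle\_time}} + W_{area}\frac{\mathit{area}}{\mathit{min\_area}}$$

The weights for each term (W_acc_time, W_dyn_power,
W_leak_power, W_cycle_time, W_area)
indicate the relative importance of each term and these
are specified by the user as input parameters in the
configuration file:
-weight 100 20 20 10 10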
Figure 6. Energy/Delay values for different wires: (a) delay characteristics of different wires and (b) energy characteristics of different wires, each plotted against wire length (mm).
Component                   Configuration 1           Configuration 2          Configuration 3
                            4 VCs/PC; 16 buffers/VC   2 VCs/PC; 8 buffers/VC   2 VCs/PC; 2 buffers/VC
Arbiter                     0.33e-12                  0.27e-12                 0.27e-12
Crossbar (avg)              0.99e-11                  0.99e-11                 0.99e-11
Buffer read operation/VC    0.11e-11                  0.76e-12                 0.50e-12
Write buffer operation/VC   0.14e-11                  0.10e-11                 0.82e-12
Table 3. Energy consumed (in J) by arbiters, buffers, and crossbars for various router configurations at 32nm technology
(flit size of 128 bits).
The above default weights used by the tool reflect the
priority of these metrics in a typical modern design. In
addition, the following default line in the input parameters
specifies the user's willingness to deviate from the
optimal set of metrics:
-deviate 1000 1000 1000 1000 1000
The above line dictates that we are willing to consider 
a cache organization where each metric, say the access 
time, deviates from the lowest possible access time by 
1000%. Hence, this default set of input parameters spec­
ifies a largely unconstrained search space. The following 
input lines restrict the tool to identify a cache organiza­
tion that yields least power while giving up at most 10% 
performance:
-weight 0 100 100 0 0
-deviate 10 1000 1000 1000 1000
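The interaction between the weight and deviate parameters can be illustrated with a small sketch. It assumes that the five values are ordered as access time, dynamic power, leakage power, cycle time, and area (consistent with the power-focused example above), that each metric is normalized to the best value found across all candidates, and that candidate organizations are plain dictionaries; the function and key names are hypothetical.

```python
METRICS = ("acc_time", "dyn_power", "leak_power", "cycle_time", "area")


def pick(orgs, weights, deviate):
    """Choose the lowest-cost organization within the allowed deviations.

    weights: one weight per metric, as on the -weight line.
    deviate: allowed percentage above the best value, as on the -deviate line.
    """
    best = {m: min(o[m] for o in orgs) for m in METRICS}
    allowed = [
        o for o in orgs
        if all(o[m] <= best[m] * (1 + d / 100) for m, d in zip(METRICS, deviate))
    ]
    return min(
        allowed,
        key=lambda o: sum(w * o[m] / best[m] for m, w in zip(METRICS, weights)),
    )


orgs = [
    {"name": "fast", "acc_time": 1.0, "dyn_power": 10, "leak_power": 10,
     "cycle_time": 1.0, "area": 1.0},
    {"name": "frugal", "acc_time": 1.08, "dyn_power": 4, "leak_power": 5,
     "cycle_time": 1.2, "area": 1.1},
    {"name": "slow", "acc_time": 2.0, "dyn_power": 2, "leak_power": 2,
     "cycle_time": 2.0, "area": 1.0},
]
unconstrained = pick(orgs, (0, 100, 100, 0, 0), (1000,) * 5)
constrained = pick(orgs, (0, 100, 100, 0, 0), (10, 1000, 1000, 1000, 1000))
```

With these made-up candidates, the power-only search picks the slowest organization when deviations are unconstrained, and the 10% access-time limit steers it to the middle option instead.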
4.6. Validation
In this work, we mainly focus on validating the new
modules added to the framework. This includes low-swing
wires, router components, and improved bitline and
wordline models. Since SPICE results depend on the
model files for transistors, we first discuss the technology
modeling changes made to the recent version of CACTI
(version 5) and later detail our methodology for validating
the newly added components to CACTI 6.0.
Earlier versions of CACTI (versions one through four)
assumed linear technology scaling for calculating cache
parameters. All the power, delay, and area values are first
calculated for 800nm technology and the results are linearly
scaled to the user-specified process value. While this
approach is reasonably accurate for old process technologies,
it can introduce non-trivial error for deep sub-micron
technologies (less than 90nm). This problem is fixed in
CACTI 5 [36] by adopting ITRS parameters for all
calculations. The current version of CACTI supports four
different process technologies (90nm, 65nm, 45nm, and
32nm) with process-specific values obtained from ITRS.
Though ITRS projections are invaluable for quick analytical
estimates, SPICE validation requires technology
model files with greater detail and ITRS values cannot
be directly plugged in for SPICE verification. The only
non-commercial data available publicly for this purpose
for recent process technologies is the Predictive Technology
Model (PTM) [2]. For our validation, we employ
the HSPICE tool along with the PTM 65 nm model file
for validating the newly added components. The simulated
values obtained from HSPICE are compared against
CACTI 6.0 analytical models that take PTM parameters
as input (footnote 4). The analytical delay and power calculations
performed by the tool primarily depend on the resistance
and capacitance parasitics of transistors. For our validation,
the capacitance values of source, drain, and gate of
n and p transistors are derived from the PTM technology
model file. The threshold voltage and the on-resistance
of the transistors are calculated using SPICE simulations.
In addition to modeling the gate delay and wire delay of
different components, our analytical model also considers
the delay penalty incurred due to the finite rise time and
fall time of an input signal [42].
Figure 7 (a) & (b) show the comparison of delay
and power values of the differential, low-swing analytical
models against SPICE values. As mentioned earlier,
a low-swing wire model can be broken into three components:
transmitter (that generates the low-swing signal),
differential wires (footnote 5), and sense amplifiers. The modeling
details of each of these components are discussed in
Section 4.3.2. Table 4 shows the delay and power values
of each of these components for a 5mm low-swing
wire. Though the analytical model employed in CACTI
6.0 dynamically calculates the driver size appropriate for
a given wire length, for the wire length of our interest, it
ends up using the maximum driver size (which is set to
100 times the minimum transistor size) to incur minimum
delay overhead. Earlier versions of CACTI also had the
problem of overestimating the delay and power values of
the sense-amplifier. CACTI 6.0 eliminates this problem
by directly using the SPICE generated values for
sense-amp power and delay. On average, the low-swing wire
models are verified to be within 12% of the SPICE values.

Footnote 4: The PTM parameters employed for verification can be directly used
for CACTI simulations. Since most architectural and circuit studies rely

Component            Delay (SPICE)   Delay (CACTI)   Energy (SPICE)   Energy (CACTI)
Transmitter          84ps            92ps            7.2fJ            7.5fJ
Differential wires   1.204ns         1.395ns         29.9fJ           34fJ
Sense amplifier      200ps           -               5.7fJ            -
Table 4. Delay and energy values of different components for a 5mm low-swing wire.

Figure 7. Low-swing model verification: (a) delay verification and (b) energy verification, each plotted against wire length (mm).
The lumped RC model used in prior versions of CACTI
for bitlines and wordlines is replaced with a more accurate
distributed RC model in CACTI 6.0. Based on a
detailed SPICE modeling of the bitline segment along with
the memory cells, we found the difference between the old
and new model to be around 11% at 130 nm technology.
This difference can go up to 50% with shrinking process
technologies as wire parasitics become the dominant factor
compared to transistor capacitance [29]. Figure 8(a)
& (b) compare the distributed wordline and bitline delay
values and the SPICE values. The lengths of the wordlines
or bitlines (specified in terms of memory array size)
are carefully picked to represent a wide range of cache
sizes. On average, the new analytical models for the
distributed wordlines and bitlines are verified to be within
13% and 12% of SPICE generated values, respectively.
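The gap between a lumped and a distributed RC model can be seen with a textbook Elmore-delay calculation. The sketch below is not CACTI's bitline or wordline model (which also accounts for drivers and memory cell loads); it only treats a bare wire split into equal RC segments, and the function name is hypothetical.

```python
def elmore_delay(r_total, c_total, segments):
    """Elmore delay at the far end of a wire split into equal RC segments.

    With one segment this is the lumped model (R * C). As the number of
    segments grows, the result approaches the distributed value R * C / 2.
    """
    r_seg = r_total / segments
    c_seg = c_total / segments
    # Capacitor i is charged through the i segments between it and the driver.
    return sum((i * r_seg) * c_seg for i in range(1, segments + 1))


lumped = elmore_delay(1.0, 1.0, 1)
distributed = elmore_delay(1.0, 1.0, 1000)
```

For a bare wire the lumped model overestimates the Elmore delay by up to a factor of two, so the error of a lumped model grows as the wire contributes a larger share of the total resistance and capacitance.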
Buffers, crossbars, and arbiters are the primary components
in a router. CACTI 6.0 uses its scratch RAM
model to calculate read/write power for router buffers. We
employ Orion's arbiter and crossbar model for calculating
router power and these models have been validated by
Wang et al. [40].

Footnote 5: Delay and power values of the low-swing driver are also reported as part
of differential wires.
5. Case Study
We expect that CACTI 6.0 will continue to be used in
architectural evaluations in many traditional ways: it is
often used to estimate cache parameters while setting up
architectural simulators. The new API makes it easier for
users to make power/delay/area trade-offs and we expect
this feature to be heavily used for architectural evaluations
that focus on power-efficiency or are attempting to
allocate power/area budgets to cache or processing. With
many recent research proposals focused on NUCA organizations,
we also expect the tool to be heavily used in that
context. Since it is difficult to generalize NUCA implementations,
we expect that users modeling NUCA designs
may need to modify the model's parameters and details to
accurately reflect their NUCA implementation. Hence, as
a case study of the tool's operation, we present an example
NUCA evaluation and its interplay with CACTI 6.0.
Many recent NUCA papers have attempted to improve
average cache access time by moving heavily accessed
data to banks that are in close proximity to the
core [6, 12, 19, 20, 21]. This is commonly referred to as
dynamic-NUCA or D-NUCA because a block is no longer
mapped to a unique bank and can move between banks
during its L2 lifetime. We first postulate a novel idea and
then show how CACTI 6.0 can be employed to evaluate
that idea. Evaluating and justifying such an idea could
constitute an entire paper; here we simply focus
on a high-level evaluation that highlights the changes
required to CACTI 6.0.
The Proposal: For a D-NUCA organization, most requests
will be serviced by banks that are close to the cache
controller. Further, with D-NUCA, it is possible that initial
banks will have to be searched first and the request forwarded
on if the data is not found. All of this implies that
initial banks see much higher activity than distant banks.
To reduce the power consumption of the NUCA cache, we
propose that heterogeneous banks be employed: the initial
banks can employ smaller power-efficient banks while the
distant banks can employ larger banks.

Figure 8. Distributed wordline and bitline model verification: (a) wordline and (b) bitline.
For our case study evaluation, we will focus on a grid- 
based NUCA cache adjacent to a single core. The ways of 
a set are distributed across the banks, so a given address 
may reside in one of many possible banks depending on 
the way it is assigned to. Similar to D-NUCA propos­
als in prior work [21], when a block is brought into the 
cache, it is placed in the most distant way and it is gradu­
ally migrated close to the cache controller with a swap be­
tween adjacent ways on every access. While looking for 
data, each candidate bank is sequentially looked up until 
the data is found or a miss is signaled.
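The placement, lookup, and migration policy just described can be sketched for a single set as follows. This is a simplified illustration: the class name is hypothetical, position 0 stands for the way closest to the cache controller, and a new block is assumed to replace whatever occupies the most distant way.

```python
class DnucaSet:
    """One cache set whose ways are spread across banks of increasing distance."""

    def __init__(self, num_ways):
        self.ways = [None] * num_ways  # index 0 is closest to the controller

    def access(self, tag):
        """Return (hit, banks_searched) and apply the migration policy."""
        # Candidate banks are looked up sequentially, closest first.
        for position, resident in enumerate(self.ways):
            if resident == tag:
                if position > 0:
                    # Swap with the adjacent way that is one step closer.
                    closer = position - 1
                    self.ways[closer], self.ways[position] = (
                        self.ways[position], self.ways[closer])
                return True, position + 1
        # Miss: the block is brought into the most distant way.
        self.ways[-1] = tag
        return False, len(self.ways)
```

Repeated accesses to the same block shorten its search by one bank each time, which is what concentrates activity in the banks near the controller.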
Recall that CACTI 6.0 assumes an S-NUCA organi­
zation where sets are distributed among banks and each 
address maps to a unique bank. When estimating average 
access time during the design space exploration, it is as­
sumed that each bank is accessed with an equal probabil­
ity. The network and bank contention values are also es­
timated for an S-NUCA organization. Thus, two changes 
have to be made to the tool to reflect the proposed imple­
mentation:
• The design space exploration must partition the 
cache space into two: the bank sizes for each parti­
tion are estimated independently, allowing the initial 
banks to have one size and the other banks to have a 
different size.
• Architectural evaluations have to be performed to es­
timate the access frequencies for each bank and con­
tention values so that average access time can be ac­
curately computed.
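The second change amounts to replacing a uniform average with a frequency-weighted one. A minimal sketch, with a hypothetical function name and made-up per-bank latencies (each assumed to already include bank, network, and contention delay):

```python
def average_access_time(bank_latencies, access_fractions=None):
    """Average access time over banks.

    With no fractions given, every bank is assumed equally likely
    (the S-NUCA assumption). Otherwise each latency is weighted by the
    fraction of accesses that the bank services.
    """
    if access_fractions is None:
        access_fractions = [1 / len(bank_latencies)] * len(bank_latencies)
    if abs(sum(access_fractions) - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    return sum(lat * frac for lat, frac in zip(bank_latencies, access_fractions))


latencies = [10, 20, 30, 40]  # cycles, closest bank first (illustrative values)
uniform = average_access_time(latencies)
skewed = average_access_time(latencies, [0.7, 0.2, 0.1, 0.0])
```

When most accesses hit the closest banks, the weighted average drops well below the uniform estimate, so the exploration would otherwise mis-rank organizations.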
With our simulation infrastructure, we considered a
32 MB 16-way set-associative L2 cache and modeled the
migration of blocks across ways as in the above D-NUCA
policy. Based on this, the access frequency as shown in
Figure 9 was computed, with many more accesses to initial
banks (unlike the S-NUCA case where the accesses
per bank are uniform). With this data integrated into
CACTI 6.0, the design space exploration loop of CACTI
6.0 was wrapped with the following loop structure:
    for i = 0 to 100
        # Assume i% of the cache has one
        # bank size and the remaining
        # (100-i)% has a different bank size
        for the first i% of cache,
            perform CACTI 6.0 exploration
            (with new access frequencies
            and contention)
        for the remaining (100-i)% of cache,
            perform CACTI 6.0 exploration
            (with new access frequencies
            and contention)

Figure 9. Access frequency for a 32MB cache. The
y-coordinate of a point in the curve corresponds to the
percentage of accesses that can be satisfied with x KB
of cache.
As an input, we provide the -weight and -deviate
parameters to specify that we are looking for an organization
that minimizes power while yielding performance
within 10% of optimal. The output from this modified
CACTI 6.0 indicates that the optimal organization employs
a bank size of 4MB for the first 16MB of the cache
and a bank size of 8MB for the remaining 16MB. The
average power consumption for this organization is 20%
lower than the average power per access for the S-NUCA
organization yielded by unmodified CACTI 6.0.
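The wrapper loop used in this case study can be expressed as a short runnable sketch. The `explore` callback is a hypothetical stand-in for one CACTI 6.0 design space exploration: it is assumed to take a partition size and a region label and return the power of the best organization for that partition under the chosen -weight and -deviate settings. Summing the two partition powers is also an assumption made for illustration.

```python
def explore_partitions(total_mb, explore, percents=range(0, 101)):
    """Try every split of the cache into a near and a far partition.

    explore(size_mb, region) is a hypothetical callback returning the power
    of the best organization found for that partition.
    """
    best = None
    for i in percents:
        near_mb = total_mb * i / 100   # first i% of the cache
        far_mb = total_mb - near_mb    # remaining (100 - i)%
        total_power = explore(near_mb, "near") + explore(far_mb, "far")
        if best is None or total_power < best["power"]:
            best = {"percent": i, "near_mb": near_mb,
                    "far_mb": far_mb, "power": total_power}
    return best


def toy_explore(size_mb, region):
    """Made-up cost model used only to exercise the loop."""
    return size_mb ** 2


result = explore_partitions(32, toy_explore)
```

In a real study the callback would rerun the exploration with the access frequencies and contention values measured for that partition, as the pseudocode above indicates.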
6. Conclusions
This paper describes major revisions to the CACTI
cache modeling tool. Interconnect plays a major role in
the delay and power of large caches and we extended
CACTI's design space exploration to carefully consider
many different implementation choices for the interconnect
components, including different wire types, routers,
signaling strategy, and modeling contention. We also
added modeling support for a wide range of NUCA
caches. CACTI 6.0 identifies a number of relevant design
choices on the power-delay-area curves. The estimates
of CACTI 6.0 can differ from the estimates of CACTI
5.0 significantly, especially when more fully exploring the
power-delay trade-off space. CACTI 6.0 is able to identify
cache configurations that can reduce power by a factor
of three, while incurring a 25% delay penalty. We
validated components of the tool against SPICE simulations
and showed good agreement between analytical and
transistor-level models. Finally, we presented an example
case study of heterogeneous NUCA banks that demonstrates
how the tool can benefit architectural evaluations.
References
[1] V. Agarwal, M. Hrishikesh, S. Keckler, and D. Burger.
Clock Rate versus IPC: The End of the Road for Conventional
Microarchitectures. In Proceedings of ISCA-27,
pages 248-259, June 2000.
[2] Arizona State University. Predictive Technology Model.
http://www.eas.asu.edu/~ptm.
[3] H. Bakoglu and J. Meindl. A System-Level Circuit Model
for Multi- and Single-Chip CPUs. In Proceedings of
ISSCC, 1987.
[4] K. Banerjee and A. Mehrotra. A Power-optimal Repeater
Insertion Methodology for Global Interconnects in
Nanometer Designs. IEEE Transactions on Electron Devices,
49(11):2001-2007, November 2002.
[5] B. Beckmann, M. Marty, and D. Wood. ASR: Adaptive
Selective Replication for CMP Caches. In Proceedings of
MICRO-39, December 2006.
[6] B. Beckmann and D. Wood. Managing Wire Delay in
Large Chip-Multiprocessor Caches. In Proceedings of
MICRO-37, December 2004.
[7] B. Black, M. Annavaram, E. Brekelbaum, J. DeVale,
L. Jiang, G. Loh, D. McCauley, P. Morrow, D. Nelson,
D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. Shen, and
C. Webb. Die Stacking (3D) Microarchitecture. In Proceedings
of MICRO-39, December 2006.
[8] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A
Framework for Architectural-Level Power Analysis and
Optimizations. In Proceedings of ISCA-27, pages 83-94,
June 2000.
[9] A. Caldwell, Y. Cao, A. Kahng, F. Koushanfar, H. Lu,
I. Markov, M. Oliver, D. Stroobandt, and D. Sylvester.
GTX: The MARCO GSRC Technology Extrapolation System.
In Proceedings of DAC, 2000.
[10] J. Chang and G. Sohi. Co-Operative Caching for Chip
Multiprocessors. In Proceedings of ISCA-33, June 2006.
[11] Z. Chishti, M. Powell, and T. Vijaykumar. Distance
Associativity for High-Performance Energy-Efficient
Non-Uniform Cache Architectures. In Proceedings of
MICRO-36, December 2003.
[12] Z. Chishti, M. Powell, and T. Vijaykumar. Optimizing
Replication, Communication, and Capacity Allocation in
CMPs. In Proceedings of ISCA-32, June 2005.
[13] W. Dally and B. Towles. Principles and Practices of
Interconnection Networks. Morgan Kaufmann, 1st edition,
2003.
[14] J. Eble. A Generic System Simulator (Genesys) for ASIC
Technology and Architecture Beyond 2001. In Proceedings
of 9th IEEE International ASIC Conference, 1996.
[15] N. Eisley, L.-S. Peh, and L. Shang. In-Network Cache
Coherence. In Proceedings of MICRO-39, December 2006.
[16] B. M. Geuskins. Modeling the Influence of Multilevel
Interconnect on Chip Performance. PhD thesis, Rensselaer
Polytechnic Institute, Troy, New York, 1997.
[17] R. Ho, K. Mai, and M. Horowitz. The Future of Wires.
Proceedings of the IEEE, Vol.89, No.4, April 2001.
[18] R. Ho, K. Mai, and M. Horowitz. Managing Wire Scaling:
A Circuit Perspective. Interconnect Technology Conference,
pages 177-179, June 2003.
[19] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, and
S. Keckler. A NUCA Substrate for Flexible CMP Cache
Sharing. In Proceedings of ICS-19, June 2005.
[20] Y. Jin, E. J. Kim, and K. H. Yum. A Domain-Specific
On-Chip Network Design for Large Scale Cache Systems. In
Proceedings of HPCA-13, February 2007.
[21] C. Kim, D. Burger, and S. Keckler. An Adaptive,
Non-Uniform Cache Structure for Wire-Dominated On-Chip
Caches. In Proceedings of ASPLOS-X, October 2002.
[22] F. Li, C. Nicopoulos, T. Richardson, Y. Xie, N. Vijaykrishnan,
and M. Kandemir. Design and Management of 3D
Chip Multiprocessors Using Network-in-Memory. In Proceedings
of ISCA-33, June 2006.
[23] G. Loi, B. Agrawal, N. Srivastava, S. Lin, T. Sherwood,
and K. Banerjee. A Thermally-Aware Performance Analysis
of Vertically Integrated (3-D) Processor-Memory Hierarchy.
In Proceedings of DAC-43, June 2006.
[24] M. Mamidipaka and N. Dutt. eCACTI: An Enhanced
Power Estimation Model for On-Chip Caches. Technical
Report CECS Technical Report 04-28, University of California,
Irvine, September 2004.
[25] C. McNairy and R. Bhatia. Montecito: A Dual-Core,
Dual-Thread Itanium Processor. IEEE Micro, 25(2),
March/April 2005.
[26] R. Mullins, A. West, and S. Moore. Low-Latency
Virtual-Channel Routers for On-Chip Networks. In Proceedings
of ISCA-31, May 2004.
[27] N. Muralimanohar and R. Balasubramonian. Interconnect
Design Considerations for Large NUCA Caches. In Proceedings
of the 34th International Symposium on Computer
Architecture (ISCA-34), June 2007.
[28] L.-S. Peh and W. Dally. A Delay Model and Speculative
Architecture for Pipelined Routers. In Proceedings
of HPCA-7, 2001.
[29] J. M. Rabaey. Digital Integrated Circuits.
[30] J. Rattner. Predicting the future, 2005. Keynote at Intel
Developer Forum, http://www.anandtech.com/tradeshows
/showdoc.aspx?i=2367&p=3.
[31] G. Reinman and N. Jouppi. CACTI 2.0: An Integrated
Cache Timing and Power Model. Technical Report 2000/7,
WRL, 2000.
[32] P. Shivakumar and N. P. Jouppi. CACTI 3.0: An Integrated
Cache Timing, Power, and Area Model. Technical Report
TN-2001/2, Compaq Western Research Laboratory, August
2001.
[33] E. Speight, H. Shafi, L. Zhang, and R. Rajamony. Adaptive
Mechanisms and Policies for Managing Cache Hierarchies
in Chip Multiprocessors. In Proceedings of ISCA-32, June
2005.
[34] D. Sylvester and K. Keutzer. System-Level Performance
Modeling with BACPAC - Berkeley Advanced Chip Performance
Calculator. In Proceedings of 1st International
Workshop on System-Level Interconnect Prediction, 1999.
[35] D. Tarjan, S. Thoziyoor, and N. Jouppi. CACTI 4.0. Technical
Report HPL-2006-86, HP Laboratories, 2006.
[36] S. Thoziyoor, N. Muralimanohar, and N. P. Jouppi. CACTI
5.0: An Integrated Cache Timing, Power, and Area Model.
Technical report, HP Laboratories Palo Alto, 2007.
[37] Y.-F. Tsai, Y. Xie, N. Vijaykrishnan, and M. Irwin.
Three-Dimensional Cache Design Using 3DCacti. In Proceedings
of ICCD, October 2005.
[38] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson,
J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain,
S. Venkataraman, Y. Hoskote, and N. Borkar. An 80-Tile
1.28TFLOPS Network-on-Chip in 65nm CMOS. In Proceedings
of ISSCC, February 2007.
[39] V. Venkatraman, A. Laffely, J. Jang, H. Kukkamalla,
Z. Zhu, and W. Burleson. NOCIC: A Spice-Based Interconnect
Planning Tool Emphasizing Aggressive On-Chip
Interconnect Circuit Methods. In Proceedings of International
Workshop on System Level Interconnect Prediction,
February 2004.
[40] H.-S. Wang, L.-S. Peh, and S. Malik. A Power Model for
Routers: Modeling Alpha 21364 and InfiniBand Routers.
IEEE Micro, volume 23, January/February 2003.
[41] H.-S. Wang, X. Zhu, L.-S. Peh, and S. Malik. Orion:
A Power-Performance Simulator for Interconnection Networks.
In Proceedings of MICRO-35, November 2002.
[42] S. Wilton and N. Jouppi. An Enhanced Cache Access and
Cycle Time Model. IEEE Journal of Solid-State Circuits,
May 1996.
[43] S. J. E. Wilton and N. Jouppi. An Enhanced Access and
Cycle Time Model for On-Chip Caches. Technical Report
93/5, WRL, 1994.
[44] M. Zhang and K. Asanovic. Victim Replication: Maximizing
Capacity while Hiding Wire Delay in Tiled Chip
Multiprocessors. In Proceedings of ISCA-32, June 2005.