A DRAM backend for the impulse memory system by Zhang, Lixin
A DRAM Backend for The Impulse Memory System
Lixin Zhang 
UUCS-00-002
Department of Computer Science
University of Utah
Salt Lake City, UT 84112, USA
December 16, 1998
Abstract
The Impulse Adaptable Memory System exposes DRAM access patterns not seen in conventional memory
systems. For instance, it can generate 32 DRAM accesses, each of which requests a four-byte word, in 32
cycles. Conventional DRAM backends are optimized for accesses that request full cache lines. They may
not be able to handle smaller accesses effectively.
In this document, we describe and evaluate a DRAM backend that reduces the average DRAM access latency
by exploiting the potential parallelism of DRAM accesses in the Impulse system. We design the DRAM
backend by studying each of its important design options: DRAM organization, hot row policy, dynamic
reordering of DRAM accesses, and interleaving of DRAM banks. The experimental results obtained from the
execution-driven simulator Paint [10] show that, compared to a conventional DRAM backend, the proposed
backend can reduce the average DRAM access latency by up to 98%, the average memory cycles by up to
90%, and the execution time by up to 80%.
This effort was sponsored in part by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research 
Laboratory (AFRL) under agreement number F30602-98-1-0101 and DARPA Order Numbers F393/00-01 and F376/00. The views 
and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official 
policies or endorsements, either express or implied, of DARPA, AFRL, or the US Government.
Contents
1 Introduction
2 Overview of The Impulse Memory System
    2.1 Hardware Organization
    2.2 Remapping Algorithms
3 DRAM Basics
    3.1 Synchronous DRAM
    3.2 Direct Rambus DRAM
4 Design
    4.1 DRAM Dispatcher
    4.2 Slave Memory Controller
    4.3 Others
        4.3.1 Hot row policy
        4.3.2 Access reordering
        4.3.3 Interleaving
5 Experimental Framework
    5.1 Simulation Environment
    5.2 Benchmarks
    5.3 Methodology
6 Performance
    6.1 The Impacts of DRAM Organization
    6.2 The Impacts of Slave Busses
    6.3 The Impacts of Hot Row Policy
    6.4 The Impacts of Access Reordering
    6.5 The Impacts of Interleaving
    6.6 Putting It All Together
7 Conclusion and Future Work
1 Introduction
The Impulse memory system adds two important features to a traditional memory system. First, it supports
application-specific optimizations through configurable physical address remapping. By remapping physical
addresses at the memory controller, applications can control how their data is accessed and cached, thereby
improving cache performance and bus utilization. Second, it can prefetch data from DRAM to an SRAM
buffer in the memory controller. For accesses that hit in the SRAM buffer, Impulse effectively hides DRAM
access latency from the processor.
As a result, Impulse exhibits DRAM access patterns different from those of a conventional memory system.
For example, it may gather a 128-byte cache line by generating 32 four-byte DRAM accesses directed
to 32 different memory locations. Since conventional DRAM backends are designed to handle accesses that
fetch cache lines, they may not work well with smaller DRAM accesses. To further improve the performance
of the Impulse memory system, we explore the potential of redesigning the DRAM backend for Impulse.
To handle the large number of small DRAM accesses in the Impulse memory system, a DRAM backend for
Impulse must be able to exploit the inherent parallelism of those DRAM accesses. The design options that
can significantly affect the efficiency of such a backend include DRAM organization, hot row policy, access
scheduling, and bank interleaving. DRAM organization determines how the DRAM banks are connected
together, how the DRAM backend communicates with the memory controller, and how functionality such as
access scheduling, bank interleaving, and DRAM refreshing is distributed inside the DRAM backend. Hot
row policy tries to reduce the average DRAM access latency by judiciously opening and closing the hot rows of
DRAMs. Access scheduling reorders DRAM accesses to exploit parallelism. The interleaving of memory
banks may affect the performance dramatically because it directly determines the potential parallelism that
a sequence of DRAM accesses might have.
The remainder of this document is organized as follows. Section 2 provides an overview of the Impulse
memory system, focusing on the master memory controller. Section 3 provides some background
information about two common types of DRAMs: Synchronous DRAM and Direct Rambus DRAM. Section 4
describes the proposed DRAM backend. Section 5 describes the simulation environment and the
benchmarks used in our experiments. Section 6 presents the performance results. Section 7 discusses future work
and concludes this document.
2 Overview of The Impulse Memory System
The most distinguishing feature of Impulse is the addition of another level of address translation at the
memory controller. The key insight exploited by this feature is that “unused” physical addresses can undergo
a translation to “real” physical addresses at the memory controller. For example, in a conventional system
with 32-bit physical addressing and only one gigabyte of installed DRAM, the other three gigabytes of
physical address space are not directly backed by DRAM and will generate errors if presented to a
conventional memory controller. We call these otherwise-unused physical addresses shadow addresses,
and they constitute a shadow address space. In an Impulse system, applications can reorganize their data
structures in the shadow address space to explicitly control how their data is accessed and cached. When
the Impulse memory controller receives a shadow address, it translates the shadow address to a set of
“real” physical addresses (a.k.a. physical DRAM addresses) instead of generating an error as a conventional
memory controller does. In the current Impulse design, the mapping from the shadow address space to the
real physical address space can be at any power-of-two granularity from word size to page size.
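As a rough illustration of the example above, the following sketch classifies a 32-bit physical address as real or shadow. It assumes, purely for illustration, that the one gigabyte of installed DRAM occupies the lowest addresses; the names `classify`, `INSTALLED_DRAM`, and `ADDRESS_BITS` are hypothetical and are not part of the Impulse design.

```python
GIB = 1 << 30
ADDRESS_BITS = 32           # 32-bit physical addressing, as in the example
INSTALLED_DRAM = 1 * GIB    # assumption: DRAM backs the low end of the address space


def classify(paddr):
    """Return "real" if paddr is backed by DRAM, otherwise "shadow"."""
    if not 0 <= paddr < (1 << ADDRESS_BITS):
        raise ValueError("address does not fit in the physical address space")
    return "real" if paddr < INSTALLED_DRAM else "shadow"


# Size of the shadow address space under these assumptions: three gigabytes.
SHADOW_BYTES = (1 << ADDRESS_BITS) - INSTALLED_DRAM
```

A conventional controller would signal an error for every address that this sketch labels "shadow"; Impulse instead hands such addresses to its remapping hardware.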
Data items whose virtual addresses are not contiguous can be mapped to contiguous shadow addresses,
so that sparse data items in virtual memory can be compacted into dense cache lines in shadow memory
before being transferred to the processor. To map data items in these compacted cache lines back to physical
memory, Impulse must recover their offsets within the virtual layout of the original data structures. We call
these offsets pseudo-virtual addresses. Pseudo-virtual memory mirrors real virtual memory and is necessary
to map data structures larger than a page. The memory controller translates pseudo-virtual addresses to
physical DRAM addresses at page granularity. The shadow → pseudo-virtual → physical mappings all take
place within the Impulse memory controller. The shadow → pseudo-virtual mapping involves some simple
arithmetic operations and is implemented by ALU units. The pseudo-virtual → physical mapping involves
page table lookups and is implemented by a small translation lookaside buffer (TLB) at the memory controller.
The second important feature of Impulse is that it supports Memory-Controller-based prefetching
(MC-based prefetching). A small amount of SRAM integrated at the memory controller, the so-called
Memory Controller cache or MCache, stores data prefetched from DRAM. For this document, we
assume a simple next-line sequential prefetch scheme for MC-based prefetching: when an access misses
in the MCache, fetch the requested cache line and prefetch the next one; when an access hits in the
MCache, prefetch the next one. For normal data, prefetching is useful for reducing the memory latency
of sequentially-accessed data. For shadow data, prefetching enables the controller to hide the cost of
remapping shadow addresses and issuing multiple DRAM accesses.
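The next-line scheme described above can be sketched as a toy model. The class below is illustrative only: `MCache`, `fetch_line`, and the unbounded dictionary are assumptions made for brevity (a real MCache has limited capacity and a replacement policy), and the sketch skips a prefetch when the next line is already present.

```python
class MCache:
    """Toy model of the MCache with next-line sequential prefetching."""

    def __init__(self, fetch_line):
        # fetch_line(line_no) stands in for a DRAM read (hypothetical hook).
        self.fetch_line = fetch_line
        self.lines = {}  # cache line number -> data; capacity is ignored here

    def access(self, line_no):
        """Return (hit, data) for the requested cache line."""
        hit = line_no in self.lines
        if not hit:
            # Miss: fetch the requested cache line on demand.
            self.lines[line_no] = self.fetch_line(line_no)
        # On both hits and misses, prefetch the next line if it is absent.
        if line_no + 1 not in self.lines:
            self.lines[line_no + 1] = self.fetch_line(line_no + 1)
        return hit, self.lines[line_no]
```

With a sequential access stream, only the first access misses; every later access finds its line already prefetched.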
The shadow address space is managed by the operating system in a way similar to the real physical address
space. The operating system guarantees that the shadow address space image of any remapped shadow region
is contiguous even if it spans multiple pages. This guarantee not only simplifies the translation hardware
at the memory controller, but also allows the CPU to use superpage TLB entries to translate remapped data.
The operating system provides an interface for applications to specify optimizations for their particular data
structures and configure the Impulse memory controller to reinterpret the shadow addresses presented to
it. The programmer (or the compiler, in the future) inserts directives into the applications to configure the
Impulse memory controller. To keep the memory controller simple and fast, Impulse restricts remapping in
two ways. First, any data item being remapped must be a power of two in size. Second, an application that
uses remapping must ensure data consistency through appropriate flushing of the caches.
2.1 Hardware Organization
Figure 1 shows the block diagram of the Impulse memory system, which includes the following components:
Figure 1: The Impulse memory architecture. The arrows indicate how data flows within an Impulse memory system.
• a small number of control registers, which are split into a set of Shadow Descriptors (SDescs) and
store configuration information for remapped shadow regions;
• a simple ALU unit (AddrCalc), which translates shadow addresses to pseudo-virtual addresses;
• a Memory Controller TLB (MTLB), which is backed up by main memory and maps pseudo-virtual
addresses to physical DRAM addresses, along with a small DRAM buffer to hold prefetched page
table entries;
• a Memory Controller Cache (MCache), which holds data prefetched from DRAM;
• a DRAM Scheduler, which contains circuitry that orders and issues DRAM accesses;
• DRAM chips, which constitute main memory.
The extra level of address translation at the memory controller is optional, so an address appearing on
the system memory bus may be a real physical or a shadow address (a). A real physical address passes
untranslated to the MCache/DRAM scheduler (b). A shadow address has to go through the matching shadow
descriptor (d). The AddrCalc unit translates the shadow address into a set of pseudo-virtual addresses using
the remapping information stored in the matching shadow descriptor (e). These pseudo-virtual addresses
are translated into real physical addresses by the MTLB (f). The real physical addresses pass to the DRAM
scheduler (g). The DRAM scheduler orders and issues the DRAM accesses (h) and sends data back to the
memory controller (i). Finally, when a full cache line has been gathered, the master memory controller (MMC)
sends it to the system memory bus (j).
2.2 Remapping Algorithms
Currently, the address translation at the Impulse memory controller can take four forms, depending on how
the MMC is used to access a particular data structure: direct remapping, strided remapping, transpose
remapping, or remapping through an indirection vector.
• Direct mapping maps one contiguous cache line in shadow memory to one contiguous cache line in
real physical memory. The pseudo-virtual address for the shadow address $\mathit{saddr}$ is
$(\mathit{saddr} - \mathit{ssaddr})$, where $\mathit{ssaddr}$ is the starting address (assigned by the OS) of the
data structure's shadow address space image. Examples of using this mapping include recoloring physical
pages without copying [2] and constructing superpages from non-contiguous physical pages without copying [11].
• Strided mapping creates dense cache lines from data items that are not contiguous but are distributed
at a fixed stride in virtual memory. The MMC maps a cache line addressed by the shadow address $\mathit{saddr}$
to multiple pseudo-virtual addresses:
$(\mathit{stride} \times (\mathit{saddr} - \mathit{ssaddr}) / \mathit{size\_of\_data\_item} + \mathit{stride} \times i)$,
where $i$ ranges from 0 to $(\mathit{cache\_line\_size} / \mathit{size\_of\_data\_item} - 1)$. This mapping can be
used to create tiles of a dense matrix without copying or to compact strided array elements [2].
• Transpose mapping creates the transpose of a two-dimensional matrix by mapping the element $[j][i]$ of
the transposed matrix to the element $[i][j]$ of the original matrix. This mapping can be used wherever
a matrix is accessed in a major order different from the one in which it is stored [14].
• Remapping through an indirection vector packs dense cache lines from array elements according to an
indirection vector. To remap the shadow address $\mathit{saddr}$, the MMC first computes its offset in shadow
memory as $\mathit{soffset} = (\mathit{saddr} - \mathit{ssaddr}) / \mathit{size\_of\_array\_element}$, then uses the
indirection vector $\mathit{vector}$ to map the cache line addressed by the shadow address $\mathit{saddr}$ to several
pseudo-virtual addresses $(\mathit{vector}[\mathit{soffset} + i])$, where $i$ ranges from 0 to
$(\mathit{cache\_line\_size} / \mathit{size\_of\_array\_element} - 1)$. The OS moves the indirection vector into
contiguous physical memory so that address translation for the indirection vector is not needed. One example
of using this mapping is to optimize the sparse matrix-vector product algorithm [2].
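The address arithmetic in the list above can be sketched as follows. This is a minimal illustration rather than the AddrCalc implementation: the function names are hypothetical, all sizes and the stride are assumed to be in bytes, and `saddr` is assumed to be aligned to the item size so that the division is exact. Transpose mapping is left out because its address formula is not given here.

```python
def direct(saddr, ssaddr):
    """Direct mapping: one pseudo-virtual address per shadow address."""
    return [saddr - ssaddr]


def strided(saddr, ssaddr, stride, item_size, line_size):
    """Strided mapping: one pseudo-virtual address per data item in the line."""
    # Index of the first item of this cache line, scaled by the stride.
    base = stride * ((saddr - ssaddr) // item_size)
    return [base + stride * i for i in range(line_size // item_size)]


def indirect(saddr, ssaddr, vector, elem_size, line_size):
    """Remapping through an indirection vector."""
    soffset = (saddr - ssaddr) // elem_size
    return [vector[soffset + i] for i in range(line_size // elem_size)]
```

For instance, with a 128-byte cache line, four-byte items, and a 1024-byte stride, `strided` returns 32 addresses spaced 1024 bytes apart, one for each item that the controller gathers into the dense line.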
In direct mapping, each shadow address generates exactly one DRAM access. In the other three mappings,
each shadow address generates $(\mathit{cache\_line\_size} / \mathit{size\_of\_data\_item})$ DRAM accesses if
$(\mathit{cache\_line\_size} > \mathit{size\_of\_data\_item})$, or one DRAM access if
$(\mathit{cache\_line\_size} < \mathit{size\_of\_data\_item})$.
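The access count stated above can be written as a small helper. The function name and the string labels for the mappings are illustrative assumptions; the case where the two sizes are equal is treated as one access, which is consistent with either branch of the rule.

```python
def dram_accesses_per_shadow_address(mapping, cache_line_size, size_of_data_item):
    """Number of DRAM accesses generated by one shadow address."""
    if mapping == "direct":
        return 1
    # Strided, transpose, and indirection-vector mappings gather item by item.
    return max(1, cache_line_size // size_of_data_item)
```

A 128-byte cache line built from four-byte items therefore costs 32 DRAM accesses, which is the access pattern the proposed backend is designed to handle.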
3 DRAM Basics
This section describes the basics of DRAM (Dynamic Random Access Memory) and two common types of
DRAMs: Synchronous DRAM and Direct Rambus DRAM.
DRAM is arranged as a matrix of “memory cells” laid out in rows and columns, and thus a data access
sequence consists of a row access strobe signal (RAS) followed by one or more column access strobe signals
(CAS). During RAS, data in the storage cells of the decoded row is moved into a bank of sense amplifiers
(a.k.a. page buffer or hot row), which serves as a row cache. During CAS, the column addresses are decoded
and the selected data is read from the page buffer. Consecutive accesses to the current page buffer, called
page hits, only need column addresses, saving the RAS signals. However, the hot row must first be closed
before another row can be opened. In addition, DRAM has to be refreshed hundreds of times each
second in order to retain data in its memory cells.
3.1 Synchronous DRAM
SDRAM synchronizes all input and output signals to a system clock, thereby making the memory retrieval
process much more efficient. In SDRAM, RAS and CAS signals share the same bus. SDRAM supports
burst transfer to provide a constant flow of data. The programmable burst length can be two, four, or eight
cycles, or a full page. It has both “automatic” and “controlled” precharge commands, which allow a read or
a write command to specify whether or not to leave the row open.
Figure 2 shows the sequences of some SDRAM transactions, assuming all transactions access the same
bank. Part 1 of Figure 2 displays the interleaving of two read transactions directed to the same row without
automatic precharge commands. The second read hits the hot row, so it does not need a RAS signal. Part 2
of Figure 2 shows the interleaving of two read transactions directed to two different rows without automatic
precharge commands. Since the second read needs a different row, the previous hot row has to be closed
(i.e., a precharge command must be done) before the second read can open a new row. Part 3 of Figure 2
shows two read transactions with automatic precharge commands (i.e., the row is automatically closed at
the end of an access). When automatic precharge is enabled, the sequence of two read transactions will
be the same no matter whether they access the same row or not. Part 4 of Figure 2 displays a write transaction
followed by a read transaction which accesses a new row. An explicit precharge command must be inserted
before the second transaction starts. The write transaction introduces two restrictions. First, a delay (tDPL)
must be satisfied from the start of the last write cycle to the start of the precharge command. Second, the
delay between the precharge command and the next activate command (RAS) must be greater than or equal
to the precharge time (tRP). Figure 2 also shows the key timing parameters of SDRAM [7]. Their meanings
and values are listed in Table 1.
Figure 2: Examples of SDRAM transactions.
    Part 1: Two reads to the same row, without automatic precharge
    Part 2: Two reads to two different rows, without automatic precharge
    Part 3: Two read transactions, with automatic precharge
    Part 4: A write followed by a read to a different row, without automatic precharge
Symbol   Meaning                                                Value
tRAS     minimum bank active time                               7
tRCD     RAS to CAS delay time                                  3
tAA      CAS latency                                            3
tCCD     CAS to CAS delay time                                  1
tRP      precharge time                                         3
tDPL     data in to precharge time                              2
tDAL     data in to active/refresh time (equals tRP + tDPL)     5
Table 1: Important timing parameters of Synchronous DRAM.
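To make the cost of a page hit versus a page miss concrete, the sketch below estimates the delay from issuing a read to its first data word using the values in Table 1. It is a simplified model under stated assumptions: the values are taken to be clock cycles, a row that is already closed costs tRCD + tAA, a row conflict additionally costs tRP, and effects such as tRAS, burst length, and bus contention are ignored. The names `SDRAM` and `read_latency` are hypothetical.

```python
# Timing parameters from Table 1 (assumed to be in clock cycles).
SDRAM = {"tRAS": 7, "tRCD": 3, "tAA": 3, "tCCD": 1, "tRP": 3, "tDPL": 2, "tDAL": 5}


def read_latency(bank_state, t=SDRAM):
    """Cycles from issuing a read until its first data word, per bank state."""
    if bank_state == "hit":        # requested row is already the hot row
        return t["tAA"]
    if bank_state == "idle":       # bank precharged, row must be activated
        return t["tRCD"] + t["tAA"]
    if bank_state == "conflict":   # another row is open and must be closed first
        return t["tRP"] + t["tRCD"] + t["tAA"]
    raise ValueError("bank_state must be 'hit', 'idle', or 'conflict'")
```

Under this model a page hit takes 3 cycles, an access to a precharged bank takes 6, and a row conflict takes 9, which is why the decision to leave a row open matters.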
3.2 Direct Rambus DRAM
Direct Rambus DRAM is a high speed DRAM developed by Rambus, Inc. [3]. RDRAM has independent
pins for row address, column address, and data. Each bank can be independently opened, accessed, and
precharged. Data and control information are transferred to and from RDRAM in a packet-oriented protocol.
Figure 3: Examples of RDRAM operations.
    Part 1: A read transaction with precharge followed by another read.
    Part 3: A read transaction without precharge followed by an explicit precharge command
    Part 4: Ideal interleaving of transactions directed to non-adjacent banks
Figure 3 shows some RDRAM transactions that all access the same chip. Part 1 of Figure 3 shows a
read transaction with a precharge command, followed by another transaction to the same bank. Part 2
of Figure 3 shows the overlapping of two read transactions directed to the same row. Part 3 of Figure 3
shows a read transaction without a precharge command followed by a transaction to a different row. In this
case, the hot row must be explicitly precharged before the second transaction starts. Part 4 of Figure 3
displays an ideal steady-state sequence of dual-data read transactions directed to non-adjacent banks of a
single RDRAM chip. The key timing parameters of RDRAM and their typical values in clock cycles are
presented in Table 2, which assumes a 400 MHz clock rate [6].
Symbol   Meaning                                                                    Value
tRC      the minimum delay from the first ACT command to the second ACT command     28
tRAS     the minimum delay from an ACT command to a PRER command                    20
tRCD     delay from an ACT command to its first RD command                          7
tRP      the minimum delay from a PRER command to an ACT command                    8
tCAC     delay from a RD command to its associated data out                         8
tCC      delay from a RD command to the next RD command                             4
tOFFP    the minimum delay from the last RD command to a PRER command               3
tBUB1    bubble between a RD and WR command                                         4
tBUB2    bubble between a WR and RD command to the same device                      8
Table 2: Important timing parameters of Rambus DRAM.
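As a worked example of how these parameters interact, the sketch below lays out a read transaction with precharge followed by another activation of the same bank, in the spirit of Part 1 of Figure 3. It is a simplified reading of Table 2 and not a cycle-accurate model: the ACT command is placed at cycle 0, and the function name and return format are hypothetical.

```python
# Timing parameters from Table 2, in clock cycles.
RDRAM = {"tRC": 28, "tRAS": 20, "tRCD": 7, "tRP": 8, "tCAC": 8, "tCC": 4, "tOFFP": 3}


def read_then_precharge(n_reads, t=RDRAM):
    """Schedule n_reads RD commands after an ACT at cycle 0, then a PRER."""
    if n_reads < 1:
        raise ValueError("at least one read is required")
    # RD commands start tRCD after the ACT and are spaced tCC apart.
    rd = [t["tRCD"] + k * t["tCC"] for k in range(n_reads)]
    # Each RD returns its data tCAC cycles later.
    data = [cycle + t["tCAC"] for cycle in rd]
    # PRER must respect both tRAS (from ACT) and tOFFP (from the last RD).
    prer = max(t["tRAS"], rd[-1] + t["tOFFP"])
    # The next ACT to this bank must respect both tRP and tRC.
    next_act = max(prer + t["tRP"], t["tRC"])
    return {"rd": rd, "data": data, "prer": prer, "next_act": next_act}
```

For two reads, the RD commands fall at cycles 7 and 11, data arrives at cycles 15 and 19, the precharge is held back until cycle 20 by tRAS, and the bank can be activated again at cycle 28, which agrees with tRC = tRAS + tRP in the table.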
4 Design
The proposed Impulse DRAM backend¹ contains three major components: the DRAM Dispatcher, Slave
Memory Controllers (SMCs), and plug-in memory modules (DRAM chips). The DRAM dispatcher,
SMCs, and the RAM Address busses (RA busses) connecting them constitute the DRAM scheduler shown
in Figure 1. A DRAM backend contains one DRAM dispatcher, but can have multiple SMCs, multiple RA
busses, and multiple plug-in memory modules. Figure 4 shows a configuration that has four SMCs, four
DRAM chips, eight banks, and two RA busses. Note that the DRAM dispatcher and SMCs do not have to
be in different chips. Figure 4 just shows them in a way that is easy to understand. Whether or not to implement
the DRAM scheduler in a single chip is an open question.
The Master Memory Controller [13] (MMC) is the core of the Impulse memory system. It communicates
with the processors and I/O adapters over the system memory bus, translates shadow addresses into physical
DRAM addresses, and generates DRAM accesses. A DRAM access can be a shadow access or a normal
access. The MMC sends a DRAM request to the backend via Slave Address busses (SA busses) and passes
data from or to the DRAM backend via Slave Data busses (SD busses). During our experiments, we vary
the number of SA busses/SD busses from one to one plus the number of shadow descriptors. If there is only
one SA or SD bus, normal accesses and shadow accesses share it. If there are two SA or SD busses,
normal accesses use one exclusively and shadow accesses use the other one exclusively. If there
are more than two SA or SD busses, one is used exclusively by normal accesses and each of the rest
is used by a subset of shadow descriptors. The contention on SA busses is resolved by the MMC and
the contention on SD busses is resolved by the DRAM dispatcher. One goal of our experiments is to find
out how many SA/SD busses are needed to avoid heavy contention on them. The experimental results will
show that more than one SA or SD bus does not give significant benefit over a single SA or SD bus.
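The bus-sharing rules above can be sketched as a small selection function. How shadow descriptors are grouped into subsets when there are more than two busses is not specified here, so the modulo assignment below is only an assumption; the function name and the convention that bus 0 carries normal accesses are likewise hypothetical.

```python
def bus_for_access(n_busses, shadow_descriptor=None):
    """Pick the SA (or SD) bus index for an access.

    shadow_descriptor is None for a normal access, otherwise the index of
    the shadow descriptor that the access belongs to.
    """
    if n_busses < 1:
        raise ValueError("at least one bus is required")
    if n_busses == 1:
        return 0                      # normal and shadow accesses share the bus
    if shadow_descriptor is None:
        return 0                      # bus 0 is reserved for normal accesses
    # Remaining busses are split among shadow descriptors (assumed: round-robin).
    return 1 + shadow_descriptor % (n_busses - 1)
```

With two busses every shadow access lands on bus 1; with four busses, descriptors 0, 1, 2, 3 map to busses 1, 2, 3, 1 under the assumed round-robin grouping.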
¹ Since the proposed memory system was modeled based on the HP Kitt-Hawk memory system [4], this document uses the
terminology of the Kitt-Hawk memory system.
Figure 4: Impulse DRAM Backend Block Diagram
4.1 DRAM Dispatcher
The DRAM dispatcher is responsible for sending memory accesses coming from SA busses to the relevant
SMC via RA busses and passing data between SD busses and RAM Data busses (RD busses). If there is
more than one SA bus, contention on an RA bus occurs when two accesses from two different SA busses
simultaneously need the same RA bus. For the same reason, contention on SD busses or RD busses will
occur if there is more than one RD bus or more than one SD bus. The DRAM dispatcher resolves these
contentions by picking winners according to a designated algorithm and queuing the others. If a waiting
queue becomes critically full, the DRAM dispatcher stops the MMC from sending more requests. All
waiting queues work in First-Come-First-Serve (FCFS) order.
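A waiting queue with this kind of back-pressure can be sketched as follows. The class name, the `capacity` and `high_water` parameters, and the interpretation of "critically full" as a fixed high-water mark are all assumptions made for illustration.

```python
from collections import deque


class WaitingQueue:
    """FCFS queue that asks the MMC to stall once it is critically full."""

    def __init__(self, capacity, high_water):
        if not 0 < high_water <= capacity:
            raise ValueError("need 0 < high_water <= capacity")
        self.capacity = capacity
        self.high_water = high_water   # assumed threshold for "critically full"
        self.requests = deque()

    def push(self, request):
        """Queue an access that lost arbitration."""
        if len(self.requests) >= self.capacity:
            raise OverflowError("queue overflow: the MMC should have been stalled")
        self.requests.append(request)

    def pop(self):
        """Return the oldest waiting access (first come, first served)."""
        return self.requests.popleft()

    @property
    def stall_mmc(self):
        """True while the dispatcher should stop the MMC from sending requests."""
        return len(self.requests) >= self.high_water
```

Setting the high-water mark below the capacity leaves room for requests that are already in flight when the stall signal is raised.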
4.2 Slave Memory Controller
Each slave memory controller controls one RD bus and the DRAM chips using that RD bus. The SMC has
independent control signals for each DRAM chip. Each chip has multiple memory banks. Each memory
bank normally has its own page buffer and can be accessed independently from all other banks². How many
banks each DRAM chip has depends on the DRAM type. Typically, each SDRAM chip contains two to four
banks and each RDRAM chip contains eight to 16 banks.
The SMC is responsible for several important tasks. First, it tracks each memory bank's page buffer and
decides whether or not to leave the page buffer open after an access. Second, it controls an independent waiting
queue for each bank and reorders the waiting transactions to reduce the average memory latency. Third, the
SMC manages the interleaving of memory banks. When an access is broadcast on an RA bus, only the
SMC that controls the memory to which the access goes will respond. The interleaving scheme determines
which SMC should respond to a specified physical address. Fourth, the SMC is responsible for refreshing
DRAM chips periodically.
4.3 Others
This section describes the algorithms implemented in the proposed DRAM backend: the hot row policy, which
decides whether or not to leave a hot row open at the end of an access; the bank queue reordering algorithm,
which reorders transactions to minimize the average memory latency perceived by the processor; and the
interleaving scheme, which determines how the physical DRAM addresses are distributed among DRAM banks.
4.3.1 Hot row policy
The collection of hot rows can be regarded as a cache. Proper management is necessary to make this “cache”
profitable. The Impulse DRAM backend allows hot rows to remain active after being accessed. The benefit
of leaving a row open is eliminating the RAS signals for accesses that hit the row. However, a DRAM access
has to pay the penalty of closing the row if it misses the row. We test three hot row policies: close-page
policy, where the active row is always closed after an access; open-page policy, where the active row is
always left open after an access; use-predictor policy, where predictors are used to speculate whether the
next access will hit or miss an open row.
2 Some RDRAM chips let each page buffer be shared between two adjacent banks, which introduces the restriction that
adjacent banks may not be accessed simultaneously.
The use-predictor policy was initially designed by R.C. Schumann [9]. In this policy, a separate predictor
is used for each potential open row. Each predictor records the hit/miss results for the previous several
accesses to the associated memory bank. If an access requires the same row as the previous access to the
same bank, it is recorded as a hit no matter whether the row was kept open or not. Otherwise, it is recorded
as a miss. The predictor then uses the multiple-bit history to predict whether the next access will be a hit or a
miss. If the previous recorded accesses are all hits, it predicts a hit. If the previous recorded accesses are all
misses, it predicts a miss. Since the optimum policy is not obvious for the other cases, a software-controlled
precharge policy register is provided to define the policy for each of the possible cases. Applications
can set this register to specify the desired policy or can disable the hot row scheme altogether by setting
the register to zeros. In our experiments, the precharge policy register is set to “open” when there are more
hits than misses in the history and “close” otherwise. For example, if the history has four bits, the precharge
policy register is set to 1110 1000 1000 0000 upon initialization, which keeps the row open whenever at least
three of the preceding four accesses are page hits.
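The register value quoted above can be reproduced with a short sketch. It assumes that bit h of the register holds the decision for history pattern h (a set history bit meaning a recorded hit) and that the register is printed with the all-hits pattern leftmost; the function names are hypothetical and only illustrate the rule "open when there are more hits than misses".

```python
def build_policy_register(history_bits: int = 4) -> int:
    """Return a register with bit h set when history pattern h has more hits than misses.

    Assumption: a 1 in the history means a recorded hit, and bit index h of the
    register stores the open/close decision for history pattern h.
    """
    register = 0
    for history in range(1 << history_bits):
        hits = bin(history).count("1")
        misses = history_bits - hits
        if hits > misses:
            register |= 1 << history
    return register


def keep_row_open(register: int, history: int) -> bool:
    """Look up the decision for the given history pattern."""
    return bool((register >> history) & 1)


REGISTER = build_policy_register(4)
print(format(REGISTER, "016b"))  # prints 1110100010000000
```

Under these assumptions the five set bits correspond to the histories 1111, 1110, 1101, 1011, and 0111, that is, every pattern with at least three hits.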
We expanded the original use-predictor policy with one more feature: when the bank waiting queue is not
empty, use the first transaction in the waiting queue instead of the predictor to perform speculation. If the
next transaction accesses the same row as the current transaction, the row is left open. Otherwise, the row is
closed.
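A minimal sketch of this extension follows. The queue is modeled as a deque of row numbers and the predictor's output is passed in as a boolean; both representations, and the function name, are assumptions made for illustration.

```python
from collections import deque
from typing import Deque


def leave_row_open(current_row: int,
                   waiting_rows: Deque[int],
                   predictor_says_open: bool) -> bool:
    """Decide whether to leave the active row open after the current access.

    If the bank waiting queue is not empty, the row of its first transaction
    decides; otherwise the decision falls back to the history-based predictor.
    """
    if waiting_rows:
        return waiting_rows[0] == current_row
    return predictor_says_open


# The queued access targets the same row, so the row stays open
# even though the predictor would have closed it.
print(leave_row_open(5, deque([5, 9]), predictor_says_open=False))
```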
4.3.2 Access reordering
The Impulse MMC can issue several types of DRAM accesses to the DRAM backend. Figure 1 shows the flows to the
DRAM backend from different units. Based on the issuer and the nature of a DRAM access, each DRAM
access is classified as one of the following four types.
•  Direct accesses are normal accesses coming directly from the system memory bus (arrow b in Figure 1).
As in conventional systems, each direct access requests a cache line from the DRAM backend.
•  Indirection vector accesses fetch the indirection vector for remapping through an indirection vector
(k). Each indirection vector access is for a cache line and the return data is sent to the relevant shadow
descriptor.
•  MTLB accesses are generated by the MTLB to fetch page table entries from DRAM into the MTLB
buffers (l). To reduce the number of MTLB accesses, each MTLB access requests a whole cache line,
not just a single entry.
•  Shadow accesses are generated by the Impulse memory controller to fetch remapped data (g). The
size of each shadow access varies with application-specific mappings. The return data of shadow
accesses is sent to the remapping controller for further processing.
For convenience, we use non-shadow accesses to represent direct accesses, indirection vector accesses, and
MTLB accesses. Normally, most DRAM accesses are either direct accesses or shadow accesses, with
a few being MTLB accesses and indirection vector accesses. Intuitively, different types should be treated
differently in order to reduce the average memory latency. For example, a number of shadow accesses depend
on each indirection vector access, and its waiting cycles directly contribute to the latency of the
associated memory request, so it should be serviced as early as possible. A delay on a prefetching
access is unlikely to increase the average memory latency as long as the data is prefetched early enough,
which is easy to accomplish in most situations, so a prefetching access does not have to complete as early
as possible and can yield its memory bank to indirection vector accesses and MTLB accesses.
Taking these facts into consideration, we propose a reordering algorithm with the following rules.
1. No reordering shall violate data consistency: read after write, write after read, and write after write.
2. Once an access is used to predict whether or not to leave a row open at the end of the preceding access
(see Section 4.3.1), no other accesses can get ahead of it and it is guaranteed to access the relevant
memory bank right after the preceding one.
3. Non-prefetching accesses have higher priority than prefetching accesses.
4. MTLB accesses and indirection vector accesses have higher priority than others.
5. Since it is hard to determine whether direct accesses or shadow accesses should be given higher
priority over each other, we study two opposite choices: giving direct accesses higher priority or
giving shadow accesses higher priority.
6. The priority of an access increases as its waiting time increases. This rule guarantees no access would
stay in a waiting queue “forever”. We make this rule optional so that we can find out if it is useful and
if some accesses will starve without this rule.
7. If there is a conflict among different rules, always use the first rule of the conflicting ones according
to the sequence above.
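The rules above can be sketched as a selection function over one bank's waiting queue. This is only an illustration under stated assumptions: the `Access` record, the `pinned` flag (standing for rule 2), and the same-address conflict test are hypothetical simplifications, and the priority values are those of the equal-priority vector given in Section 6.4. Rule 6 (priority updating) is left out.

```python
from dataclasses import dataclass
from typing import List

# Priorities for (kind, is_prefetch); values follow the equal-priority
# vector of Section 6.4: MTLB, indirection vector, direct, shadow, then
# their prefetching versions.
PRIORITY = {
    ("mtlb", False): 15, ("iv", False): 15,
    ("direct", False): 10, ("shadow", False): 10,
    ("mtlb", True): 7, ("iv", True): 7,
    ("direct", True): 2, ("shadow", True): 2,
}


@dataclass
class Access:
    kind: str               # "mtlb", "iv", "direct", or "shadow"
    address: int
    is_write: bool = False
    prefetch: bool = False
    pinned: bool = False    # rule 2: already used for hot row speculation


def priority(access: Access) -> int:
    return PRIORITY[(access.kind, access.prefetch)]


def conflicts(a: Access, b: Access) -> bool:
    """Rule 1 (simplified): same address and at least one write."""
    return a.address == b.address and (a.is_write or b.is_write)


def pick_next(queue: List[Access]) -> int:
    """Return the index of the access to issue next."""
    if not queue:
        raise ValueError("empty waiting queue")
    if queue[0].pinned:                 # rule 2 outranks rules 3 to 5
        return 0
    best = 0
    for i, candidate in enumerate(queue):
        # Rule 1: never pass an earlier access it conflicts with.
        if any(conflicts(candidate, earlier) for earlier in queue[:i]):
            continue
        # Rules 3 to 5 via the priority table; ties keep the older access.
        if priority(candidate) > priority(queue[best]):
            best = i
    return best
```

For example, with a prefetching shadow access, a direct access, and an MTLB access queued in that order to different addresses, `pick_next` returns index 2, the MTLB access.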
4.3.3 Interleaving
The interleaving scheme determines the mapping from physical DRAM addresses to memory banks. Interleaving
can be either page-level if the mapping granularity is the size of the page buffer or cache-line-level
if the mapping granularity is the size of an L2 cache line. Page-level interleaving means bank 0 has all pages
whose address modulo page-size is 0, bank 1 has all pages whose address modulo page-size is 1, and so on.
We consider two interleaving schemes, as shown by Figure 4: modulo-interleaving, which maps consecutive
physical addresses to different DRAM chips; and sequential-interleaving, which maps consecutive physical




We extended the execution-driven simulator Paint [10, 12] to model the Impulse memory controller and the
proposed DRAM backend. Paint models a variation of a single-issue HP PA-RISC 1.1 processor running
a BSD-based micro-kernel and an HP Runway bus. The 32K L1 data cache is non-blocking, single-cycle,
write-around, write-through, virtually indexed, physically tagged, and direct mapped with 32-byte lines.
The 256K L2 data cache is non-blocking, write-allocate, write-back, physically indexed and tagged, 2-way
set-associative, and has 128-byte lines. Instruction caching is assumed to be perfect. The unified I/D TLB
is single-cycle and fully associative, uses a not-recently-used replacement policy, and has 120 entries.
The simulated Impulse memory controller is derived from the HP memory controller used in servers and
high-end workstations [5]. We model seven shadow descriptors, each of which is associated with four
128-byte lines. The controller can prefetch the corresponding shadow data into these fully associative lines.
A 4Kbyte SRAM holds non-shadow data prefetched from DRAMs. The MTLB has an independent bank
for each shadow descriptor. Each MTLB bank is direct-mapped, has 32 eight-byte entries, and uses two
128-byte buffers to prefetch consecutive lines of page table entries.
In the simulated system, the CPU runs at a clock rate of 400MHz. The SDRAM works at 147MHz.
Each SDRAM chip contains two banks, each of which has a 16Kbyte page buffer. The RDRAM works at
400MHz. Each RDRAM chip contains 8 banks, each of which has an 8Kbyte page buffer. The DRAM
width is 16 bytes, as are the RD busses and SD busses. An SA bus can transfer one request per cycle.
Even though Paint's PA-RISC processor is single-issue, we model a pseudo-quad-issue superscalar machine
by issuing four instructions each cycle without checking structural hazards. While this model is unrealistic
for gathering processor micro-architecture statistics, it stresses the memory system in a manner similar to a
real superscalar processor.
5.2 Benchmarks
We use the following benchmarks to test the performance of the proposed Impulse DRAM backend.
•  CG from NPB2.3 [1] uses the conjugate gradient method to compute an approximation to the smallest
eigenvalue of a large, sparse, symmetric positive definite matrix. The kernel of CG is a sparse
matrix-vector product (SMVP) operation on the sparse matrix and a vector. Two
Impulse optimizations can be applied to this benchmark independently: remapping through an
indirection vector (call it CG.iv in the following discussion) and direct mapping. Remapping through
an indirection vector remaps the vector to improve its cache performance. Direct mapping
implements no-copy page-coloring, which maps one data structure to the first half of the L2 cache and the
other data structures to the second half of the L2 cache. There are two versions of page-coloring: one
remaps only the three most important data structures (call it CG.pc3); the other remaps all seven major
data structures in CG (call it CG.pc7).
•  Spark98 [8] performs a sequence of sparse matrix-vector product operations using matrices that are
derived from a family of three-dimensional finite element earthquake applications. It uses remapping
through an indirection vector to remap the vector used in SMVP operations. In Spark98, each vector
element is a sub-vector. In order to meet the restriction that the size of a remapped data item
must be a power of two, the sub-vector is padded to a power-of-two size.
•  TMMP is an Impulse version of the tiled dense matrix-matrix product algorithm
C = A × B, where A, B, and C are dense matrices. Impulse remaps a tile of each matrix into
a contiguous space in shadow memory using strided mapping. Impulse divides the L1 cache into three
segments. Each segment keeps a tile: the current output tile from C, and the input tiles from A and B. In
addition, since the same row of an input matrix is used multiple times to compute a row of the output
matrix, it is kept in the L2 cache during the computation of the relevant row of the output matrix.
•  Rotation rotates an image by performing three one-dimensional shears. The second shear operation
walks along the columns of the image matrices (assuming the image arrays are stored along the x axis),
which gives poor memory performance for large images. Impulse can transpose matrices without
copying them, so column walks are replaced by row walks. Each shear operation involves one input
image and one output image. Both the input image matrix and the output image matrix are remapped using
transpose remapping during the second shear operation.
•  ADI implements the naive “Alternating Direction Implicit” integration algorithm. The ADI
integration algorithm contains two phases: sweeping along rows and sweeping along columns. Impulse
transposes the matrices in the second phase so that the original column walks are replaced by row
walks over the shadow matrices. The algorithm involves three matrices. All of them are remapped
using transpose remapping during the second phase.
Remapping improves all of these benchmarks significantly. Where the benefits come from and how much
improvement each benchmark has are beyond the scope of this document. Instead, this document focuses on
how the DRAM backend impacts the performance. These benchmarks use remappings in different ways.
Table 3 lists the problem size and the Impulse-related resources that each benchmark uses.
“Remapping” means the remapping algorithm that the benchmark uses. “Descriptors” represents the total number
of shadow descriptors used by the benchmark. “Gathering Factor” indicates the number of DRAM accesses
needed to compact a shadow cache line.
5.3 Methodology
The design options that we care about include the interleaving scheme, the reordering algorithm, the hot row
policy, the number of DRAM banks, the number of RD busses (or slave memory controllers), the number
of SA busses, the number of SD busses, the number of SD bus queues, and the DRAM type. Each option
can have multiple values. It is infeasible to use the full factorial design, so we pick a practicable subset
of these options at a time to study their relative impacts.

Benchmark   Problem Size          Remapping        Descriptors   Gathering Factor
CG.iv       Class A               scatter/gather   1             16
CG.pc3      Class A               direct           3             1
CG.pc7      Class A               direct           7             1
Spark98     sf.5.1                scatter/gather   4             4
TMMP        512x512 (double)      strided          3             16
Rotation    1024x1024 (char)      transpose        2             128
ADI         1024x1024 (double)    transpose        3             16

Table 3: The problem size, the remapping scheme, and the number of shadow descriptors used by each benchmark;
and the number of DRAM accesses generated for a shadow address.
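One way to read the Gathering Factor column is sketched below. It assumes that a shadow L2 line (128 bytes in the simulated system) is assembled from one DRAM access per remapped data item, so the factor is the line size divided by the item size; this relation and the function name are assumptions for illustration, not a definition given in the text.

```python
L2_LINE_BYTES = 128  # L2 cache line size of the simulated system


def gathering_factor(item_bytes: int, line_bytes: int = L2_LINE_BYTES) -> int:
    """DRAM accesses needed to fill one shadow line, one access per item.

    Assumption: every remapped item is fetched by its own DRAM access and
    the item size divides the line size evenly.
    """
    if item_bytes <= 0 or line_bytes % item_bytes != 0:
        raise ValueError("item size must evenly divide the line size")
    return line_bytes // item_bytes


print(gathering_factor(8))    # eight-byte doubles, as in TMMP and ADI
print(gathering_factor(1))    # one-byte chars, as in Rotation
print(gathering_factor(128))  # a whole line per access, as in direct mapping
```

Under this reading the three calls give 16, 128, and 1, which agree with the corresponding table entries.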
In our experiments, we first set up a baseline and then vary one or a few options at a time. The baseline uses
cache-line modulo-interleaving, no access reordering, and the close-page hot row policy, and has four DRAM
chips, eight banks (for SDRAM) or 32 banks (for RDRAM), four RD busses, two SD bus queues, one SA
bus, and one SD bus.
6 Performance
A DRAM access latency, defined to be the interval between the time that the MMC generates the DRAM
access and the time that the MMC receives the return data from the DRAM backend, can be broken down
into five components:
•  SA cycles - time spent on using the SA bus, including waiting time and transferring time;
•  SD cycles - time spent on using the SD bus, including waiting time and transferring time;
•  RD waiting cycles - time waiting for the RD bus;
•  Bank waiting cycles - time spent in the bank waiting queue;
•  Bank access cycles - time accessing DRAM.
The data transferring cycles on SA/SD/RD busses are inevitable and cannot be reduced. The bank access
cycles are inevitable too, but they can be reduced by an appropriate hot row policy. The waiting cycles on
SA/SD/RD bus or DRAM bank are extra overhead and should be avoided as much as possible. Reducing
the waiting cycles is the main goal of the proposed DRAM backend.
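The breakdown can be captured in a small record that separates avoidable waiting from unavoidable transfer and access time. The field names are hypothetical, and the numbers in the usage example are made up for illustration; they are not measurements from this study.

```python
from dataclasses import dataclass


@dataclass
class LatencyBreakdown:
    """Cycle counts for one DRAM access (field names are illustrative)."""
    sa_wait: float        # waiting part of the SA cycles
    sa_transfer: float    # transferring part of the SA cycles
    sd_wait: float        # waiting part of the SD cycles
    sd_transfer: float    # transferring part of the SD cycles
    rd_wait: float        # RD waiting cycles
    bank_wait: float      # bank waiting cycles
    bank_access: float    # bank access cycles

    def total(self) -> float:
        return (self.sa_wait + self.sa_transfer + self.sd_wait
                + self.sd_transfer + self.rd_wait + self.bank_wait
                + self.bank_access)

    def waiting_overhead(self) -> float:
        """Cycles spent waiting on a bus or a bank, the backend's main target."""
        return self.sa_wait + self.sd_wait + self.rd_wait + self.bank_wait


# Hypothetical values, chosen only to exercise the arithmetic.
example = LatencyBreakdown(sa_wait=1, sa_transfer=1, sd_wait=2, sd_transfer=8,
                           rd_wait=4, bank_wait=40, bank_access=20)
print(example.total(), example.waiting_overhead())
```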
6.1 The Impacts of DRAM Organization
Figure 5: Breakdown of the average RDRAM access latency for various DRAM organizations: organization 1 -
32/2/1 (the number of memory banks/RD busses/SD bus queues); 2 - 32/4/2; 3 - 64/4/2; 4 - 128/8/4; 5 - 256/8/4; 6 -
256/16/8. Bars are grouped by benchmark (CG.iv, CG.pc3, CG.pc7, Spark, ADI, TMMP, Rotation); each bar stacks SA
cycles, SD cycles, RD waiting cycles, bank waiting cycles, and bank access cycles.
Figure 6: Breakdown of the average SDRAM access latency for various DRAM organizations: organization 1 - 8/2/1
(the number of memory banks/RD busses/SD bus queues); 2 - 8/4/2; 3 - 16/4/2; 4 - 32/8/4; 5 - 64/8/4; 6 - 64/16/8.
Bars are grouped by benchmark (CG.iv, CG.pc3, CG.pc7, Spark, ADI, TMMP, Rotation), with the same latency
components as in Figure 5.
The number of memory banks, RD busses, and SD bus queues are tightly related to one another, so we
consider them together as one compound factor: DRAM organization. They can affect the bank waiting
cycles, RD waiting cycles, and SD cycles, but cannot affect the SA cycles and bank access cycles. Intuitively,
increasing the number of memory banks should decrease the bank waiting cycles, increasing the number of
RD busses should decrease the RD waiting cycles, and increasing the number of SD bus queues should
decrease the SD cycles.

Figure 7: Execution times with various DRAM organizations. Bars are grouped by benchmark (CG.iv, CG.pc3,
CG.pc7, Spark, ADI, TMMP, Rotation).
Figure 5 shows the results for various DRAM organizations based upon RDRAM. Figure 6 shows the
corresponding results for SDRAM. Figure 7 shows their execution times. All the results shown in this
document are normalized to the baseline's, unless stated otherwise. We split execution time into two parts:
memory cycles and CPU cycles. The DRAM backend is targeted to attack the memory cycles only, so the
CPU cycles remain constant when the DRAM backend changes. Because changing the DRAM organization
does not change the bank access cycles, the performance of CG.pc3, CG.pc7, Spark, and TMMP, where the
bank access cycles dominate the DRAM access latency, changes insignificantly. However, the performance
of CG.iv, ADI, and Rotation, where the bank waiting cycles are the dominant factor of the DRAM access
latency, changes significantly. In particular, increasing the number of memory banks, RD busses, and SD
bus queues achieves nearly linear improvement in the bank waiting cycles. For example, comparing
configuration 1 with 6 in Figure 6, the average DRAM access latency of Rotation decreases from 2123
cycles (2067 of which are bank waiting cycles) to 109 cycles (52 of which are bank waiting cycles), which
results in an 87% saving in the memory cycles and a 78% saving in execution time.
All other benchmarks, except ADI, have similar results on RDRAM and SDRAM. When the DRAM
organization expands, ADI gains much more improvement on RDRAM than on SDRAM. This is because SDRAM,
which normally has fewer banks than RDRAM of the same capacity, cannot provide the minimum number of
memory banks that ADI needs to avoid long bank waiting queues. For instance, even with configuration 6,
the average length of a bank waiting queue is 13.6 transactions for SDRAM, but only 1.9 transactions for
RDRAM.
Intuitively, the ratio among the number of memory banks, RD busses, and SD bus queues has to be in a
certain range in order to keep the DRAM backend balanced. When the ratio is beyond that range, increasing
the number of one component will not increase the performance. Specifically, each RD bus or SD bus
queue can serve only a certain number of memory banks. Letting it serve fewer banks wastes its capacity,
and letting it serve more banks makes it become the bottleneck. Figure 8 shows the performance of CG.iv
and Rotation under various DRAM organizations based on SDRAM. We can make two observations from
Figure 8. First, whenever the ratio between the number of memory banks and the number of RD busses drops
from a higher value to 2:1, there is a big drop in the average DRAM access latency. This observation implies
that the best ratio between the number of memory banks and the number of RD busses is 2:1. Since each
SDRAM chip contains two banks, we can rephrase it in another way: in order to avoid significant contention
Figure 8: Breakdown of the average SDRAM access latency of CG.iv and Rotation. Each bar is labeled as ABC, where
A represents the number of memory banks: 0 - 8, 1 - 16, 2 - 32; B represents the number of RD busses: 0 - 2, 1 - 4,
2 - 8, 3 - 16; C represents the number of SD bus queues: 0 - 1, 1 - 2, 2 - 4, 3 - 8. The bars shown for each benchmark
are 000, 010, 011, 100, 110, 120, 121, 122, 210, 220, 230, 231, and 233; each bar stacks SA cycles, SD cycles, RD
waiting cycles, bank waiting cycles, and bank access cycles.
bus queues does not increase performance noticeably. The observations indicate that one SD bus queue can
match up with at least 16 RD busses.
Conclusion: a cost-effective, balanced DRAM backend should have one RD bus for each DRAM chip and
one SD bus queue for every 16 RD busses.
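As a small worked example of this sizing rule, the sketch below derives the number of RD busses and SD bus queues from the number of DRAM chips. Rounding up to a whole SD bus queue is an assumption made here, and the function name is hypothetical.

```python
import math
from typing import Tuple


def balanced_backend(num_chips: int,
                     rd_busses_per_sd_queue: int = 16) -> Tuple[int, int]:
    """Return (RD busses, SD bus queues) for a given number of DRAM chips.

    One RD bus per DRAM chip and one SD bus queue per 16 RD busses;
    partial groups are rounded up to a full queue (an assumption).
    """
    if num_chips <= 0:
        raise ValueError("need at least one DRAM chip")
    rd_busses = num_chips
    sd_queues = math.ceil(rd_busses / rd_busses_per_sd_queue)
    return rd_busses, sd_queues


print(balanced_backend(4))   # (4, 1)
print(balanced_backend(32))  # (32, 2)
```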
6.2 The Impacts of Slave Busses
We consider three alternative configurations of slave busses: one SA bus and one SD bus; two SA busses
and two SD busses; and eight SA busses and eight SD busses. The corresponding results are shown in
Figures 9, 10, and 11.
                 CG.iv     CG.pc3    CG.pc7    Spark     ADI       TMMP      Rotation
cycles (RDRAM)   1.1/4.5   1.0/2.6   1.0/2.6   1.1/4.0   2.8/5.3   1.0/1.6   1.3/2.8
cycles (SDRAM)   1.1/2.6   1.0/2.6   1.0/2.6   1.0/3.7   2.7/4.1   1.0/1.6   1.9/2.2

Table 4: The average cycles of each SA/SD wait in the baselines.
None of those benchmarks wastes a significant amount of time waiting for slave busses. Table 4 lists the
average waiting cycles of an SA or SD wait in the baseline execution of each benchmark. Compared to the
average DRAM access latency, the average SA/SD waiting cycles are negligible. If we look at Figures 5 and
6, we can also see that the SA cycles and the SD cycles are almost identical for all configurations. These

Figure 9: Breakdown of the average RDRAM access latency when the number of slave busses varies: 1 - 1 SA bus,
1 SD bus; 2 - 2 SA busses, 2 SD busses; 8 - 8 SA busses, 8 SD busses. Bars are grouped by benchmark (CG.iv,
CG.pc3, CG.pc7, Spark, ADI, TMMP, Rotation); each bar stacks SA cycles, SD cycles, RD waiting cycles, bank
waiting cycles, and bank access cycles.

Figure 10: Breakdown of the average SDRAM access latency when the number of slave busses varies: 1 - 1 SA bus,
1 SD bus; 2 - 2 SA busses, 2 SD busses; 8 - 8 SA busses, 8 SD busses. Bars are grouped by benchmark (CG.iv,
CG.pc3, CG.pc7, Spark, ADI, TMMP, Rotation); each bar stacks SA cycles, SD cycles, RD waiting cycles, bank
waiting cycles, and bank access cycles.

Figure 11: Execution times with various numbers of slave busses. Panels: RDRAM and SDRAM; each bar is split into
memory cycles and CPU cycles, and bars are grouped by benchmark (CG.iv, CG.pc3, CG.pc7, Spark, ADI, TMMP,
Rotation).

Figure 12: Breakdown of the average RDRAM access latency for three hot row policies: c - close-page; o - open-page;
p - use-predictor. Bars are grouped by benchmark; each bar stacks SA cycles, SD cycles, RD waiting cycles,
bank waiting cycles, and bank access cycles.

6.3 The Impacts of Hot Row Policy
Figures 12 and 13 display the results of three different hot row policies: close-page policy (c), open-page
policy (o), and use-predictor policy (p). In the use-predictor policy, the predictor has a 4-bit history. The next
access is predicted to be a “hit” if there are at least three hits in the history, and a “miss” otherwise. The
direct effect of a hot row policy is to reduce the bank access cycles. The indirect effect is to reduce the bank
waiting cycles. Table 5 displays the hot row hit and miss ratios under the open-page policy and use-predictor
policy. The hit ratio is computed as (total hot row hits / total accesses). The miss ratio is computed as (total
hot row misses / total accesses). If a transaction accesses a bank without an active hot row, it is counted neither
as a hit nor as a miss. In addition, a hot row has to be closed during refresh operations, so the hit ratio plus
the miss ratio for the open-page policy may not equal 100%.
To better understand the tradeoffs involved, let's first look back at Figures 2 and 3 to quantify the benefit of
hitting a row and the penalty of missing a row. For RDRAM, the hit benefit is a saving that includes
tRCD + tOFFP and depends on the pair of access types: it differs for (read, read)3, for (read, write), and for
(write, X); the miss penalty is adding tRP = 8 cycles. For SDRAM, the hit benefit is a fixed number of
cycles; and the miss penalty is the minimum of a fixed timing term and the CAS count of the previous
transaction.

3 Represents (type of previous access, type of current access). “X” means either read or write.

Figure 13: Breakdown of the average SDRAM access latency for three hot row policies: c - close-page; o - open-page;
p - use-predictor. Bars are grouped by benchmark; each bar stacks SA cycles, SD cycles, RD waiting cycles,
bank waiting cycles, and bank access cycles.

Figure 14: Execution times with three hot row policies. Each bar is split into memory cycles and CPU cycles, and
bars are grouped by benchmark (CG.iv, CG.pc3, CG.pc7, Spark, ADI, TMMP, Rotation).

                     CG.iv   CG.pc3   CG.pc7   Spark   ADI     TMMP    Rotation
open-page (RDRAM)    37/63   20/80    37/63    34/66   71/29   68/7    40/59
predictor (RDRAM)    13/18   9/32     29/28    13/20   69/18   64/6    23/19
open-page (SDRAM)    19/81   20/80    37/63    27/73   64/36   72/10   63/36
predictor (SDRAM)    9/7     5/15     24/15    11/17   63/24   65/9    63/25

Table 5: The hot row hit/miss ratios.

The impact of a hot row policy on a specific run can be approximated using the following expression:

    total-hits × average-hit-benefit − total-misses × average-miss-penalty        (1)

In order to achieve positive impact, Equation 1 must be greater than 0, i.e., the ratio between total hits and
total misses must be greater than (average-miss-penalty / average-hit-benefit). Therefore, we use the ratio
between total hits and total misses as the metric to measure a hot row policy's performance.
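A short sketch makes the break-even condition concrete. The function names are hypothetical; the sample inputs combine the open-page RDRAM hit/miss ratios of CG.iv from Table 5 (37 hits and 63 misses per 100 accesses) with the 14-cycle read hit benefit and the 8-cycle miss penalty quoted in this section, and the combination is only illustrative.

```python
from fractions import Fraction


def net_impact(total_hits: int, total_misses: int,
               avg_hit_benefit: int, avg_miss_penalty: int) -> int:
    """Evaluate Equation 1: cycles saved by hits minus cycles lost to misses."""
    return total_hits * avg_hit_benefit - total_misses * avg_miss_penalty


def break_even_ratio(avg_hit_benefit: int, avg_miss_penalty: int) -> Fraction:
    """Smallest hits/misses ratio for which Equation 1 stops being negative."""
    return Fraction(avg_miss_penalty, avg_hit_benefit)


# 37 hits and 63 misses per 100 accesses, 14-cycle benefit, 8-cycle penalty.
print(net_impact(37, 63, 14, 8))   # 14
print(break_even_ratio(14, 8))     # 4/7
```

With these inputs the hit/miss ratio 37/63 is only slightly above the break-even ratio 4/7, so the net gain per 100 accesses is small.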
The theoretical results obtained using Equation 1 match the experimental results almost perfectly, with
only a couple of exceptions. For the open-page policy, CG.iv and CG.pc7 have the same hit/miss ratios,
but different average DRAM access latency. This is because of their different ratios between reads and writes.
In CG.iv, more than 98.3% of DRAM accesses are reads. In CG.pc7, only 83% of DRAM accesses are
reads. In RDRAM, whenever a write transaction is involved, the hit benefit decreases from 14 cycles to
10 or 6 cycles. Though CG.pc7 has the same hot row hit/miss ratios as CG.iv, its higher percentage of
write transactions results in a smaller average hit benefit. Another anomaly, which occurs when SDRAM
is used, is Rotation, which has different miss ratios but the same performance for the open-page and
use-predictor policies. Most of the DRAM accesses in Rotation are shadow accesses requesting eight-byte
double words, so the average miss penalty is close to the average CAS count, which is one. With a one-cycle miss
penalty and a three-cycle hit benefit, the 11% difference in the miss ratio cannot make a big difference.
The use-predictor policy always has performance between the close-page policy's and the open-page policy's.
Wherever the open-page policy is helpful, the use-predictor policy is also helpful, but to a smaller degree.
Wherever the open-page policy hurts the performance, the use-predictor policy also hurts the performance,
but to a much smaller degree. Though both policies may degrade the performance by up to 4%, they can
improve the performance by up to 44%. In addition, they have positive gains in most test benchmarks.
Conclusion: both the open-page policy and the use-predictor policy are acceptable choices. We suggest the
use-predictor policy because it is more stable than the open-page policy.
6.4 The Impacts of Access Reordering
We have considered six different access reordering algorithms. The first one is no reordering (number it
as No.1). The other five are variants of the algorithm described in Section 4.3.2. No.2 gives direct
access higher priority over shadow access and has no priority updating. No.3 gives shadow access higher
priority over direct access and has no priority updating. No.4 gives shadow access and direct access the
same priority and has no priority updating. No.5 is No.2 plus priority updating, which increases priority
along with increased waiting time. No.6 is No.3 plus priority updating.
The highest priority in the simulated model is 15. The priority vector, in the sequence of MTLB access,
indirection vector access, direct access, and shadow access followed by their prefetching versions in the
same order, is: {15, 15, 11, 9, 7, 7, 3, 1} for No.2 and No.6; {15, 15, 9, 11, 7, 7, 1, 3} for No.3 and No.5;
{15, 15, 10, 10, 7, 7, 2, 2} for No.4. The updating policy increases an access's priority by 1 whenever it is
overtaken by another access. For example, with the first priority vector and priority updating enabled, a
non-prefetching shadow access, which starts with priority 9, may be overtaken by accesses with higher
priority at most six times. Once it has given away its position six times, it will have the highest priority
and no accesses after it can get ahead of it.
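The bound of six overtakes follows directly from the updating rule, as the sketch below shows. The helper names are hypothetical; the cap of 15 and the increment of 1 per overtake are taken from the text.

```python
MAX_PRIORITY = 15  # highest priority in the simulated model


def bump(priority: int) -> int:
    """Priority after being overtaken once, capped at the maximum."""
    return min(MAX_PRIORITY, priority + 1)


def overtakes_until_max(start_priority: int) -> int:
    """Count how many times an access can be overtaken before reaching 15."""
    count = 0
    priority = start_priority
    while priority < MAX_PRIORITY:
        priority = bump(priority)
        count += 1
    return count


print(overtakes_until_max(9))  # 6, the non-prefetching shadow access example
print(overtakes_until_max(1))  # 14, a prefetching access starting at priority 1
```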
Reordering algorithms try to reduce the bank access cycles by using the first transaction on the waiting
queue to make correct predictions about whether or not to leave a hot row open after an access.
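A minimal sketch of that decision, assuming each queued transaction records the row it targets. The names `Transaction` and `keep_row_open` are hypothetical, and falling back to a default guess when the queue is empty is an assumption.

```python
from collections import deque
from typing import Deque, NamedTuple


class Transaction(NamedTuple):
    row: int
    column: int


def keep_row_open(current_row: int, waiting: Deque[Transaction], default: bool = False) -> bool:
    """Decide whether to leave `current_row` open after an access completes."""
    if not waiting:
        # Nothing to peek at: fall back to whatever the hot row policy would guess.
        return default
    # The head of the queue is the next access to this bank, so the prediction is exact.
    return waiting[0].row == current_row


queue = deque([Transaction(row=3, column=8), Transaction(row=7, column=0)])
print(keep_row_open(3, queue))  # True: the next access hits the open row
print(keep_row_open(5, queue))  # False: precharge so row 3 can be opened sooner
```

Because reordering decides which transaction sits at the head of the queue, it also decides how often this check finds a row hit.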
The basic premise of reordering is that a waiting queue must contain more than one transaction. If a
waiting queue holds zero or one transaction most of the time, reordering cannot help because
there is nothing to reorder. Table 6 lists the average queue length in the baseline execution of each
benchmark. It shows that only ADI and Rotation have long waiting queues. Consequently, they are the
only benchmarks noticeably affected by the reordering algorithm.
        CG.iv   CG.pc3  CG.pc7  Spark   ADI     TMMP    Rotation
RDRAM   0.04    0.01    0.01    0.01    19.36   0.00    0.73
SDRAM   0.77    0.02    0.09    0.10    19.92   0.01    12.18
Table 6: The average bank queue length.
[Figure 15: stacked bar chart; bars 1-6 (reordering algorithms No.1 to No.6) for each of CG(iv), CG(pc3),
CG(pc7), Spark, ADI, TMMP, and Rotation; components: SA cycles, SD cycles, RD waiting cycles, bank waiting
cycles, bank access cycles.]
Figure 15: Breakdown of the average RDRAM access latency for various reordering algorithms.
[Figure 16: stacked bar chart in the same layout as Figure 15 (bars 1-6 for each benchmark; components:
SA cycles, SD cycles, RD waiting cycles, bank waiting cycles, bank access cycles); the caption is missing
from the source.]
[Figure 17: bar charts of execution time for CG(iv), CG(pc3), CG(pc7), Spark, ADI, TMMP, and Rotation,
in two panels.]
Figure 17: Execution times with various reordering algorithms.
Figures 15, 16, and 17 show the performance numbers for the various reordering algorithms. They show that
No.2, No.3, and No.4 perform similarly on all benchmarks. Taking a closer look at the DRAM access patterns
of these benchmarks, we found that within any short period either most DRAM accesses are direct accesses
or most are shadow accesses, which results in very few reorderings between direct accesses
and shadow accesses. These results indicate that we can simply give direct accesses and shadow accesses
the same priority to simplify the reordering algorithm. These figures also show that the updating rule does
not help much. Although it gives a 3% reduction in the average DRAM access latency for ADI on RDRAM, it
contributes a 19% increase in the average DRAM access latency for Rotation. For Rotation, the updating
rule decreases the average DRAM access latency of a prefetching shadow access from 1350 cycles to 1173
cycles, but it increases the average DRAM access latency of a non-prefetching shadow access from 338
cycles to 596 cycles.
Conclusion: the best choice is No.4, which gives direct accesses and shadow accesses the same priority and
does not update priority dynamically.
6.5 The Impacts of Interleaving
We consider four interleaving schemes, as described in section 4.3.3. Figures 18 and 19 show the
breakdown of the average DRAM access latency. Figure 20 shows the execution times. How well an interleaving
scheme performs depends heavily on the applications' access patterns. No single scheme is optimal
for all benchmarks. For example, for applications that perform sequential accesses, cache-line-level
interleaving is better; for applications that perform strided accesses, page-level interleaving may be better.
However, modulo-interleaving is always better than sequential-interleaving. Sequential-interleaving is bad
because it directs consecutive accesses to the same chip, which limits the inherent parallelism of DRAM
accesses.
Spatial locality also requires cache-line-level interleaving to ensure that consecutive requests go to different
memory banks. That is why page-level interleaving cannot work well with applications with good spatial
locality. For example, Spark tends to put non-zero elements in a row close to one another, so the four DRAM
accesses generated by a gather operation are likely directed to the same page. Page-level interleaving makes
the four DRAM accesses go to the same bank and be served serially instead of in parallel. The results show
that, compared to cache-line-level interleaving, page-level interleaving increases the average DRAM access
latency of Spark by 52% on RDRAM and 91% on SDRAM.
[Figure 18: stacked bar chart; recovered legend entries: SA cycles, SD cycles, RD waiting cycles, bank
waiting cycles.]
Figure 18: Breakdown of the average RDRAM access latency for various interleaving schemes: mp - modulo,
page-level; mc - modulo, cache-line-level; sp - sequential, page-level; sc - sequential, cache-line-level.
[Figure 19: stacked bar chart; recovered legend entries: SA cycles, SD cycles, RD waiting cycles, bank
waiting cycles.]
Figure 19: Breakdown of the average SDRAM access latency for various interleaving schemes: mp - modulo,
page-level; mc - modulo, cache-line-level; sp - sequential, page-level; sc - sequential, cache-line-level.
[Figure 20: bar chart of execution time, split into memory cycles and CPU cycles, for CG(iv), CG(pc3),
CG(pc7), Spark, ADI, TMMP, and Rotation; the recovered panel label is RDRAM.]
Figure 20: Execution times with various interleaving schemes.
[...] x + 2 × 8K, ..., x + 1023 × 8K. With cache-line-level interleaving, all accesses go to the same bank.
With page-level interleaving, the ith access (x + i × 8K) goes to bank (i % 32) if RDRAM is used,
or bank ((i/2) % 8) if SDRAM is used. In other words, all accesses go to the same bank under
cache-line-level interleaving, but they are uniformly distributed among all banks under page-level
interleaving. As a result, page-level interleaving performs much better than cache-line-level interleaving
for ADI: about 30% saving in execution time.
Rotation operates on a 1024 × 1024 gray-scale image. Walking along a column of the image generates the access
sequence x, x + 1K, x + 2 × 1K, ..., x + 1023 × 1K. When SDRAM is used, these accesses all go to
the same bank under cache-line-level interleaving; under page-level interleaving, the first 16 accesses go
to bank 0, the next 16 go to bank 1, and so on. Therefore, page-level interleaving is better than
cache-line-level interleaving for Rotation when SDRAM is used. When RDRAM is used, the ith access
goes to bank (i % 4) under cache-line-level interleaving; under page-level interleaving, the first eight
accesses go to bank 0, the next eight go to bank 1, and so on. Though page-level interleaving distributes
accesses among banks more evenly, cache-line-level interleaving achieves better performance because it
sends consecutive accesses to different banks.
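The bank mappings quoted for ADI and Rotation can be reproduced with a few lines of modulo arithmetic. The sketch below rests on assumptions: the bank counts and page sizes (32 banks with 8 KB pages for RDRAM, 8 banks with 16 KB pages for SDRAM) are inferred from the formulas above, the 128-byte cache line is not stated in this section and is chosen because it yields the bank counts described, the base address x is taken to be page-aligned, and all names are hypothetical.

```python
K = 1024
LINE = 128  # assumed cache-line size in bytes (not stated in this section)

# Bank count and page size inferred from the bank formulas quoted in the text.
RDRAM = {"banks": 32, "page": 8 * K}
SDRAM = {"banks": 8, "page": 16 * K}


def bank(addr, unit, banks):
    """Modulo interleaving: consecutive `unit`-sized blocks rotate over the banks."""
    return (addr // unit) % banks


def column_walk(stride, n=1024, base=0):
    """Addresses base, base + stride, ..., base + (n - 1) * stride."""
    return [base + i * stride for i in range(n)]


def banks_used(addrs, unit, dram):
    return [bank(a, unit, dram["banks"]) for a in addrs]


adi = column_walk(8 * K)  # ADI column walk, 8 KB stride
rot = column_walk(1 * K)  # Rotation column walk, 1 KB stride

for name, dram in (("RDRAM", RDRAM), ("SDRAM", SDRAM)):
    for label, addrs in (("ADI", adi), ("Rotation", rot)):
        line_level = banks_used(addrs, LINE, dram)
        page_level = banks_used(addrs, dram["page"], dram)
        print(name, label,
              "distinct banks: line-level", len(set(line_level)),
              "page-level", len(set(page_level)))
```

Under these assumptions, cache-line-level interleaving touches a single bank in every case except Rotation on RDRAM, where the walk cycles through four banks, while page-level interleaving reaches every bank.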
Conclusion: cache-line-level modulo-interleaving is the best choice.
6.6 Putting It All Together
Based on the experimental results above, we propose a DRAM backend with this configuration: one RD
bus for each DRAM chip, one SA bus, one SD bus, one SD bus queue, the use-predictor hot row policy, the
No.4 reordering algorithm, and cache-line-level modulo-interleaving. Since cache-line-level interleaving may
significantly slow down applications that access data with page-sized strides, we pad each such stride
with an extra cache line to avoid unbalanced loading of the memory banks.
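The effect of the padding can be checked with a small calculation. This is a sketch under assumptions: a 128-byte cache line (not stated in this section), 32 RDRAM banks as implied by the bank (i % 32) example, an 8 KB stride as in the ADI column walk, and padding modeled simply as a stride that is one line longer.

```python
LINE = 128         # assumed cache-line size in bytes
BANKS = 32         # RDRAM bank count implied by the bank (i % 32) example
STRIDE = 8 * 1024  # page-sized stride, as in the ADI column walk


def line_level_bank(addr):
    """Cache-line-level modulo interleaving."""
    return (addr // LINE) % BANKS


def banks_for_stride(stride, n=1024):
    return [line_level_bank(i * stride) for i in range(n)]


unpadded = banks_for_stride(STRIDE)
padded = banks_for_stride(STRIDE + LINE)  # one extra cache line per stride

print("unpadded distinct banks:", len(set(unpadded)))  # 1
print("padded distinct banks:  ", len(set(padded)))    # 32
```

The padded stride spans 65 cache lines, and 65 leaves remainder 1 when divided by 32, so successive accesses step through the banks one at a time instead of piling onto a single bank.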
Figures 21 and 22 present the performance numbers of four different DRAM backends: the original baseline
(o); the configuration that combines the worst setting of each option (w); the configuration that combines
the best setting of each option (b); and the recommended configuration (r). The recommended configuration
works well in all benchmarks. It performs close to or even better than the combination of the “bests”. The
interaction among different factors means that the combination of the “bests” does not always work best.
Comparing the recommended configuration to the baseline, the saving in DRAM access latency ranges from 1% to
94% with a geometric mean of 35% for RDRAM, and from 9% to 96% with a geometric mean of 49% for SDRAM.
[Figure 21: stacked bar chart; bars o, w, b, r for each of CG(iv), CG(pc3), CG(pc7), Spark, ADI, TMMP, and
Rotation; components: SA cycles, SD cycles, RD waiting cycles, bank waiting cycles, bank access cycles.]
Figure 21: The average RDRAM access latency for various DRAM backends: o - original baseline; w - worst; b -
best; r - recommended.
[Figure 22: stacked bar chart; bars o, w, b, r for each of CG(iv), CG(pc3), CG(pc7), Spark, ADI, TMMP, and
Rotation; recovered legend entries: RD waiting cycles, bank waiting cycles, bank access cycles.]
Figure 22: The average SDRAM access latency for various DRAM backends: o - original baseline; w - worst; b -
best; r - recommended.
7 Conclusion and Future Work
The Impulse memory system exposes DRAM access patterns not seen in conventional memory systems. In this
document, we investigate whether it is worthwhile to redesign the conventional DRAM backend
for Impulse and quantify how much the DRAM backend can affect performance. We do so
by proposing a DRAM backend that can effectively exploit the parallelism of DRAM accesses in the Impulse
system.
We use the execution-driven simulator Paint to evaluate the performance of the proposed backend. The 
experimental results show that the proposed backend can reduce the average DRAM access latency by up 
to 98%, the average memory cycles by up to 90%, and the execution time by up to 80%. These results 
demonstrate that it is indeed necessary to design a specific DRAM backend for the Impulse memory system.
A number of issues still need to be resolved before we build a real backend. In the proposed
design, there is a queue for each memory bank. One alternative is to use a queue for each DRAM chip,
then perform bank-level scheduling. Another alternative is to use a global queue for all banks, then perform
global scheduling. We can also extend the priority-based algorithm to include non-priority-based rules. For
example, putting transactions that access the same row together might be helpful. More interleaving schemes,
such as double-word interleaving or combinations of modulo and sequential interleaving, are possible and
remain to be explored. More research on the use-predictor policy is needed to answer questions such as
how many bits of history are enough and what the best value for the precharge policy register is.
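As an illustration of the row-grouping idea mentioned above, the following sketch regroups a bank queue so that transactions to the same row become adjacent. It is a hypothetical example rather than part of the proposed design: transactions are modeled as (row, tag) tuples, the function names are invented, and priorities and fairness are ignored.

```python
def group_by_row(pending):
    """Stable regrouping: rows keep the order of their first appearance,
    and transactions within a row keep their arrival order."""
    groups = {}
    for txn in pending:
        groups.setdefault(txn[0], []).append(txn)
    return [txn for row_txns in groups.values() for txn in row_txns]


def row_switches(sequence):
    """Count how often consecutive transactions target different rows."""
    return sum(1 for a, b in zip(sequence, sequence[1:]) if a[0] != b[0])


pending = [(5, "a"), (9, "b"), (5, "c"), (9, "d"), (5, "e")]
grouped = group_by_row(pending)

print(grouped)                # [(5, 'a'), (5, 'c'), (5, 'e'), (9, 'b'), (9, 'd')]
print(row_switches(pending))  # 4
print(row_switches(grouped))  # 1
```

Each avoided row switch is a potential hot row hit, which is why such a rule might complement the priority-based algorithm.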
References
[1] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg,
P. Frederickson, T. Lasinski, R. Schreiber, and H. Simon. The NAS parallel benchmarks. Technical Report
RNR-94-007, NASA Ames Research Center, March 1994.
[2] J. B. Carter, W. C. Hsieh, L. B. Stoller, M. R. Swanson, L. Zhang, E. L. Brunvand, A. Davis, C.-C. Kuo,
R. Kuramkote, M. A. Parker, L. Schaelicke, and T. Tateyama. Impulse: Building a smarter memory
controller. In Proceedings of the Fifth IEEE Symposium on High Performance Computer Architecture,
pages 70-79, Orlando, FL USA, January 1999.
[3] R. Crisp. Direct Rambus technology: The new main memory standard. IEEE Micro, pages 18-29,
November 1997.
[4] Hewlett-Packard. Kitt Hawk Memory System, External Reference Specification, Revision B. Dwg No
A-5180-7358-1 Rev A, May 1995.
[5] T. R. Hotchkiss, N. D. Marschke, and R. M. McClosky. A new memory system design for commercial
and technical computing products. Hewlett-Packard Journal, 47(1):44-51, February 1996.
[6] IBM. IBM Advanced 64Mb Direct Rambus DRAM, November 1997.
[7] IBM. IBM Advanced 256Mb Synchronous DRAM - Die Revision A, August 1998.
[8] D. R. O'Hallaron. Spark98: Sparse matrix kernels for shared memory and message passing systems.
Technical Report CMU-CS-97-178, School of Computer Science, Carnegie Mellon University,
October 1997.
[9] R. Schumann. Design of the 21174 memory controller for digital personal workstations. Digital
Technical Journal, 9(2), November 1997.
[10] L. Stoller, M. Swanson, and R. Kuramkote. Paint: PA instruction set interpreter. Technical Report
UUCS-96-009, University of Utah, September 1996.
[11] M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow
memory. In Proceedings of the 25th International Symposium on Computer Architecture, pages 204-213,
Barcelona, Spain, June 1998.
[12] J. E. Veenstra and R. J. Fowler. MINT tutorial and user manual. Technical Report 452, University of
Rochester, August 1994.
[13] L. Zhang. ISIM: The simulator for the Impulse adaptable memory system. Technical Report
UUCS-99-017, University of Utah, September 1999.
[14] L. Zhang, J. B. Carter, W. C. Hsieh, and S. A. McKee. Memory system support for imaging processing.
In Proceedings of the 1999 International Conference on Parallel Architecture and Compilation
Techniques, pages 98-107, Newport Beach, CA USA, October 1999.
