Rank-switching, Open-row DRAM Controller for Mixed-Critical Real-Time Systems by Krishnapillai, Yogen
 
 
Rank-switching, Open-row DRAM 
Controller for Mixed-Critical Real-Time 
Systems 
 
 
 
 
 
 
by 
 
 
Yogen Krishnapillai 
 
 
 
 
 
 
A thesis 
presented to the University of Waterloo 
in fulfillment of the 
thesis requirement for the degree of 
Master of Applied Science 
in 
Electrical and Computer Engineering 
 
 
 
 
 
 
Waterloo, Ontario, Canada, 2015 
© Yogen Krishnapillai 2015 
 
I hereby declare that I am the sole author of this thesis. This is a true copy of the
thesis, including any required final revisions, as accepted by my examiners.
 
 
Note that some contents of this thesis are taken from my previously published paper
[1], where I am the first author. In addition, some content is taken from my other
published paper [2], where I am one of the co-authors.
 
 
I understand that my thesis may be made electronically available to the public.
 
Abstract  
 
In this thesis, we present a rank-switching, open-row DRAM controller for
mixed-critical real-time systems. This memory controller is optimized for
multi-requestor and multi-rank memory systems. The key to improved performance is an
innovative rank-switching mechanism which hides the latency of write-to-read
transitions in DRAM devices without requiring unpredictable request reordering. We
further employ an open-row policy to take advantage of the data caching mechanism (row
buffering) in each device. We choose a bank privatization scheme in which each
requestor is assigned its own private bank or set of banks. This private bank mapping
guarantees that each requestor has its own row buffers and cannot be interfered with
by other requestors. The proposed memory controller design allows a maximum of
thirty-two requestors at a time, targeting either two or four ranks. The controller
provides complete timing isolation between critical and non-critical applications and
allows for compositional timing analysis over the number of requestors and memory
ranks in the system. We designed both the front-end logic for command generation and
the back-end logic for DRAM timing constraint checking and arbitration, utilizing the
rank-switching techniques. The complete design is implemented in Verilog RTL and
synthesized, and finally we evaluated its performance using various benchmarks. Our
proposed memory controller offers significantly lower worst-case latency bounds for
critical real-time applications and supports average throughput for non-critical
real-time applications compared to existing real-time memory controllers in the
literature.
  
 
Acknowledgements
 
 
I would like to thank my supervisor Rodolfo Pellizzoni for his encouragement and
excellent guidance toward the successful completion of this research. Prof. Rodolfo
has been really helpful, dedicating his valuable time to accomplishing this research
project.

I would like to thank my supervisor Manoj Sachdev for giving me the motivation to
apply for the MASc program while my attention was diverted by my full-time career.

I would also like to thank Professors Andrew Morton and Hiren Patel for reviewing
this thesis and for their valuable comments.
 
  
 
Dedication
 
This is dedicated to my parents, my brothers and sisters for their constant love and
motivation throughout my life. I would not be here without them. It has been a great
challenge studying in a thesis-based Engineering Master's program while focusing on a
full-time career in the same field of interest. Dedicating the time and energy to both
studies and career at the same time has been a rewarding experience.
  
 
Table of Contents

List of Tables

List of Figures

1 Introduction
    1.1 Problem Statement
    1.2 Contribution
    1.3 Organization
    1.4 Acknowledgement

2 Background
    2.1 DRAM Basics
    2.2 DRAM Timing Constraints
        2.2.1 Rank to Rank Timing Basics
        2.2.2 Requests targeting the same rank
        2.2.3 Requests targeting different ranks
    2.3 DRAM Row Buffer Management
    2.4 DRAM Mapping
        2.4.1 Continuous Memory Mapping
        2.4.2 Interleaved Memory Mapping
    2.5 Related Works

3 Memory Controller Design
    3.1 Design Decisions
        3.1.1 Row Management Policy
        3.1.2 Address Mapping Scheme
        3.1.3 Rank Switching Mechanism
        3.1.4 Selection of Arbitration Architecture Type
    3.2 Arbitration Rules

4 Theoretical Analysis
    4.1 Worst Case Per-Request Latency
        4.1.1 Interference delay for PRE ACT Commands
        4.1.2 CAS-to-Data

5 Memory Controller Implementation
    5.1 Front End of the Memory Controller
    5.2 Back End of Memory Controller
        5.2.1 Command Queues in Stage 4
        5.2.2 PRE ACT Arbiter and CAS Arbiter in Stage 4
        5.2.3 PRE, ACT, CAS Sequencer in Stage 4
        5.2.4 PRE, ACT Queue in Stage 4
        5.2.5 PRE, ACT Arbiter and RRD FAW Sequencer in Stage 3
        5.2.6 PRE, ACT Arbiter in Level 2
        5.2.7 CAS FIFO in Stage 4
        5.2.8 CAS Arbiter and CAS BTB Sequencer in Stage 3
        5.2.9 CAS Arbiter in Level 2
            5.2.9.1 BTB Comparator
            5.2.9.2 RTR Sequencer
        5.2.10 PRE ACT CAS Arbiter in Level 1
    5.3 Pipeline Implementation of the Memory Controller
        5.3.1 Pipeline Stage 4 – Request Arbitration
        5.3.2 Pipeline Stage 3 – Bank Arbitration
        5.3.3 Pipeline Stage 2 – Rank Arbitration
        5.3.4 Pipeline Stage 1 – Command Arbitration
        5.3.5 Timing Analysis of Pipeline Stages
        5.3.6 Data Path of the Memory Controller
        5.3.7 Testing of the Memory Controller

6 Evaluation
    6.1 Synthetic Benchmark Results
    6.2 Latency of open and close memory read access
    6.3 Simulation of Critical tasks
    6.4 Simulation of Critical and non-critical tasks

7 Conclusion

References

Appendix A: Design Block Diagrams
Appendix B: Simulation Outputs
 
 
 
List of Tables

2.1 JEDEC Timing Constraints
2.2 Summary of the related work
4.1 Timing Parameter Definition
6.1 Latency of open, close access with 0 % write
 
  
 
List of Figures

2.1 DRAM Architecture
2.2 Requests targeting different banks in the same rank
2.3 Requests targeting different ranks
2.4 Continuous Memory Mapping
2.5 Interleaved Memory Mapping

3.1 (A) Arbitration for 1 Rank
3.1 (B) Arbitration for 2 Ranks
3.1 (C) Arbitration for 4 Ranks
3.2 Choice of Arbiter Types
3.3 Three levels Arbitration

4.1 Worst Case Latency Decomposition
4.2 Arrivals-to-CAS Decomposition for Close Request
4.3 Interference Delay for ACT command, R = 2, r = 1 and Mr = 5
4.4 Read to Read Latency, R = 2 and r = 1
4.5 Write to Read Latency, Case (a) with R = 2 and r = 1
4.6 Write to Read Latency, Case (b) with R = 3 and r = 1
4.7 Initial Write Latency, R = 3 and r = 1

5.1 Memory Controller with Front and Back end logic
5.2 Front End Memory Controller
5.3 Refresh Controller
5.4 Back End Memory Controller
5.5 CMD Queues with PRE, ACT Arbiter and CAS Arbiter
5.6 Scheduling of PRE, ACT CMDs through L3, L2, L1
5.7 CAS FIFO and CAS BTB Sequencer
5.8 Level 3 CAS BTB Sequencer and Level 2 CAS Arbiter
5.9 Logic to calculate the Smallest BTB
5.10 Scheduling of CAS CMDs through L3, L2, L1
5.11 Example BTB calculation for RCAS CMD from Rank 0
5.12 Three Stage Pipeline for the backend Memory Controller
5.13 Stage 4 Pipeline – Request Arbitration
5.14 Stage 3 Pipeline – Bank Arbitration
5.15 Stage 2 Pipeline – Rank Arbitration
5.16 Stage 1 Pipeline – Command Arbitration
5.17 Timing Analysis of Pipeline Stages

6.1 Synthetic 16 Requestors 64 bits data bus result
6.2 Synthetic 8 Requestors 64 bits data bus result
6.3 Synthetic 8 Requestors 64 bits data bus result
6.4 Simulation setup for Memory Configurations 1
6.5 CHStone: 16 Requestors 64 bits data bus result
6.6 CHStone: 8 Requestors 64 bits data bus result
6.7 Simulation setup for Critical and Non-Critical requestors
  
 
Chapter 1

Introduction
 
Memory clients have either real-time or non-real-time requirements. Real-time
requests can be either critical or non-critical. Critical requests demand an upper
bound on worst-case latency, whereas non-critical requests demand a minimum average
bandwidth. Critical real-time systems such as avionic systems, nuclear plants and
safety-critical electronic medical devices demand such a worst-case latency bound. In
this thesis, memory clients such as CPUs, hardware accelerators or I/O peripherals are
referred to as requestors from now on. The proposed memory controller was implemented
utilizing rank-switching techniques, an open-row policy, private bank mapping and
dynamic scheduling. The memory controller is implemented in this fashion to show how
the latency of a memory request can be significantly reduced by applying rank-switching
techniques, thereby hiding the time-consuming read-to-write and write-to-read timing
constraints. Further, designing the memory controller in this fashion helps to show
that a large number of requestors can be handled through an effective multi-stage
arbitration mechanism that satisfies the complex memory timing parameters and serves
the memory requests of all requestors in an orderly fashion. Section 1.1 presents the
problem statement and Section 1.2 lists the contributions of this research. Section
1.3 gives a preview of how this document is organized, and finally Section 1.4
provides the content acknowledgement.
 
1.1 Problem Statement
 
Challenges arise when a large number of memory clients run in parallel, each with its
own critical and non-critical applications, targeting a single shared memory
controller in a single-channel memory environment. The shared memory controller
should be able to handle critical and non-critical applications, where heavily
inter-dependent DRAM timing constraints become very complicated to analyse. Further,
the command scheduling of all the memory clients becomes another challenge as the
number of memory clients grows.

Scheduling DRAM commands for many memory clients is not straightforward, since there
are a number of timing constraints that must be satisfied before a memory client can
be chosen and its command can be issued. It is a great challenge to design a memory
controller that provides equal chances to all the memory clients through fair
arbitration and schedules their commands dynamically while satisfying the
inter-dependent timing constraints. The design strategy of the proposed memory
controller should also ensure that neither non-critical real-time requestors nor
non-real-time requestors can interfere with the tight latency deadlines of the
critical real-time requestors. In other words, the memory controller should be able to
handle different types of requestors according to their respective latency and
throughput requirements. All these demands should be satisfied by providing tight
bounds on the worst-case execution time for the critical requestors and average
throughput for the non-critical requestors, along with guaranteed bandwidth.
 
1.2 Contribution
 
In this thesis, we present a rank-switching, open-row memory controller for
mixed-critical real-time systems. The major contributions are the following.

• The worst-case execution time was analysed for a single memory request from the
  requestor under analysis while all other requestors simultaneously produce
  worst-case memory interference. This initial phase helped to examine the available
  parallelism in order to reduce the interference among multiple requestors.

• Our rank-switching, open-row memory controller architecture has both front-end and
  back-end logic blocks. The front-end design implements address mapping, the refresh
  controller and command generation for a multi-requestor environment, where our
  memory controller can accept requests from n requestors, with 0 < n <= 32.

• The back-end design was implemented with three levels of arbitration: requestor
  arbitration, rank arbitration and command arbitration. To achieve the highest
  throughput, the back-end design was architected in a four-stage pipelined fashion.

• The verification platform, comprising the test bench, test suites and simulation,
  was developed and the design was verified.

• The final design was synthesized. Static Timing Analysis (STA) was carried out to
  fix setup and hold time violations.

• The evaluation was carried out on our memory controller through extensive hardware
  simulations to analyse how the rank-switching techniques effectively improve
  performance. The evaluation results were compared with the analytical results.
 
 
1.3 Organization
 
This document is organized as follows. Chapter 2 provides the required background on
DRAM. Chapter 3 discusses the proposed memory controller design, including the
important design decisions and arbitration rules. Next, Chapter 4 focuses on the
theoretical analysis of worst-case per-request latency. Chapter 5 is dedicated to the
implementation details of our proposed memory controller. Next, Chapter 6 discusses
the verification platform used to verify the design and also evaluates the performance
of our design. Finally, Chapter 7 provides concluding remarks. At the end, the
schematic diagrams and one sample simulation output waveform of the memory controller
design are included in the Appendices. The Verilog RTL code of the design can be
found at [18].
 
1.4 Acknowledgement
 
Section 4.0, Section 3.2, Section 2.5 and Section 2.1 were taken from the published
paper [1]. I would like to thank Professor Rodolfo Pellizzoni for his great assistance
in formulating the theoretical analysis and arbitration rules. I would also like to
thank Zheng Pei Wu for his support on the published paper [1] and for providing the
necessary benchmark memory traces used in our simulations.
 
 
 
 
 
Chapter 2

Background
 
This background chapter is dedicated to three main areas. First, it describes the
basic operation of a DRAM memory device. Second, the complex timing behaviour of DRAM
is illustrated in detail. Finally, related work performed by others is discussed,
which helps to differentiate our proposed design from the existing approaches.
 
2.1 DRAM Basics
 
Modern memory devices are organized into ranks and each rank is divided into multiple
banks, which can be accessed in parallel provided that no collisions occur on either
bus. Each bank comprises a row buffer and an array of storage cells organized as rows
and columns. This thesis considers devices with at least two ranks for the analysis of
rank-switching techniques. A memory controller controls the operation of the DRAM
device by issuing five important memory commands: Activate, Read, Write, Pre-charge
and Refresh.

To access the data in a DRAM row, an Activate (ACT) command must be issued to load
the data into the row buffer before it can be read or written. Once the data is in the
row buffer, a read CAS or write CAS command can be issued to retrieve or store the
data. If a second request needs to access a different row within the same bank, the
row buffer must be written back to the data array with a Pre-charge (PRE) command
before the second row can be activated. When each DRAM device in a rank contributes 8
bits, a rank with 8 devices has a data bus width of 64 bits. Figure 2.1 illustrates
the row, column, bank and rank configuration, and it also shows how the total 64-bit
data from the memory controller reaches a rank that has 8 DRAM devices.
 
 
    Figure 2.1 DRAM Architecture 
 
Finally, a periodic Refresh (REF) command must be issued to all ranks and banks to
ensure data integrity. Note that each command takes one clock cycle on the command
bus to be serviced. Each CAS command accesses data in a burst of length BL, and the
amount of data transferred is BL x WBUS, where WBUS is the width of the data bus.
Since DDR memory transfers data on both the rising and falling edges of the clock,
the time for one transfer is tBUS = BL/2 memory clock cycles. For example, with BL = 8
and WBUS of 64 bits, it takes 4 cycles to transfer 64 bytes of data.

A row that is cached in the row buffer is considered open; otherwise the row is
considered closed. For an open request, only a read or write CAS command is generated,
since the desired row is already cached in the row buffer. For a close request, if the
row buffer contains a row that is not the desired row, a PRE command is generated to
close the current row. Then an ACT is generated to load the new row, and finally a
read/write CAS is generated to access the data. The memory controller can employ one
of two policies to manage the row buffers. Under the open row policy, the memory
controller leaves the row buffer open for as long as possible. In contrast, the close
row policy automatically pre-charges the row buffer after every request.

Finally, the controller must map the incoming request to the correct rank, bank, row
and column. With interleaved bank mapping, each request can access all banks in
parallel. However, since all requestors share all banks, they can cause mutual
interference by closing each other's rows. With private bank mapping, each requestor
is assigned its own bank or set of banks. Therefore, the state of the row buffers of
one requestor cannot be influenced by other requestors.
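
The burst arithmetic above can be checked with a short sketch. The function name
burst_transfer and its argument names are illustrative assumptions, not part of the
controller design; only the relations "data per CAS = BL x WBUS" and "tBUS = BL/2"
come from the text.

```python
def burst_transfer(bl, wbus_bits):
    """Return (bytes moved by one CAS, tBUS in memory clock cycles).

    bl        -- burst length BL (number of data beats per CAS)
    wbus_bits -- data bus width WBUS in bits
    """
    if bl <= 0 or bl % 2 != 0:
        raise ValueError("BL must be a positive even number of beats")
    if wbus_bits <= 0 or wbus_bits % 8 != 0:
        raise ValueError("WBUS must be a positive multiple of 8 bits")
    bytes_per_cas = bl * wbus_bits // 8  # BL x WBUS, converted to bytes
    t_bus = bl // 2                      # two beats per clock cycle (DDR)
    return bytes_per_cas, t_bus


# The example from the text: BL = 8 on a 64-bit data bus.
print(burst_transfer(8, 64))
```

For BL = 8 and a 64-bit bus this reproduces the 64 bytes in 4 cycles quoted above.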
 
 
2.2 DRAM Timing Constraints
 
Every memory device has timing requirements that must be met in order to perform
read, write, and refresh operations. It is the memory controller that must satisfy
the timing constraints of the memory devices. The operation and timing constraints of
memory devices are defined by the JEDEC standard. Table 2.1 lists the timing
parameters for the DDR3-1333H device that we used in our design. Table 2.1 also shows
which timing parameters are involved when requests target the same bank in the same
rank, different banks in the same rank, or different ranks.
                                  JEDEC SPECIFICATIONS

Timing     Description                      DDR3-1333H       Same Bank,  Diff. Banks,  Diff.
Parameter                                   (cycles unless   Same Rank   Same Rank     Ranks
                                            noted)
tRCD       ACT to READ/WRITE delay          9                Yes         No            No
tRL        READ to Data Start               9                Yes         No            No
tWL        WRITE to Data Start              7                Yes         No            No
tBUS       Data bus transfer                4                Yes         Yes           Yes
tRP        PRE to ACT Delay                 9                Yes         No            No
tWR        End of WRITE to PRE Delay        10               Yes         No            No
tRTP       Read to PRE Delay                5                Yes         No            No
tRAS       ACT to PRE Delay                 24               Yes         No            No
tRC        ACT-ACT (same bank)              33               Yes         No            No
tRRD       ACT-ACT (different bank)         4                No          Yes           No
tFAW       Four ACT Window                  20               No          Yes           No
tRTW       READ to WRITE Delay              7                Yes         Yes           No
tWTR       WRITE to READ Delay              5                Yes         Yes           No
tRTR       Rank to Rank Switch Delay        2                No          No            Yes
tRFC       Time required to refresh a row   160 ns           Yes         No            No
tREFI      Refresh period                   7.8 us           Yes         No            No

Table 2.1 DRAM Timing Constraints
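
As an illustration of how Table 2.1 can be read, the sketch below stores the
cycle-valued rows as data and lists the constraints that apply for a given
relationship between two requests. The names TIMING and applicable, and the relation
labels, are assumptions made for this example; tRFC and tREFI are left out because
the table gives them in time units rather than cycles.

```python
# name: (cycles, same bank, different banks in same rank, different ranks)
# Values and Yes/No flags are copied from Table 2.1 (DDR3-1333H).
TIMING = {
    "tRCD": (9, True, False, False),
    "tRL": (9, True, False, False),
    "tWL": (7, True, False, False),
    "tBUS": (4, True, True, True),
    "tRP": (9, True, False, False),
    "tWR": (10, True, False, False),
    "tRTP": (5, True, False, False),
    "tRAS": (24, True, False, False),
    "tRC": (33, True, False, False),
    "tRRD": (4, False, True, False),
    "tFAW": (20, False, True, False),
    "tRTW": (7, True, True, False),
    "tWTR": (5, True, True, False),
    "tRTR": (2, False, False, True),
}

# Hypothetical labels for the three right-hand columns of the table.
COLUMN = {"same_bank": 1, "diff_bank_same_rank": 2, "diff_rank": 3}


def applicable(relation):
    """Return the sorted names of the constraints flagged 'Yes' for a relation."""
    idx = COLUMN[relation]
    return sorted(name for name, row in TIMING.items() if row[idx])


print(applicable("diff_rank"))
print(applicable("diff_bank_same_rank"))
```

Only tBUS and tRTR are flagged for requests to different ranks, which is the property
the rank-switching mechanism relies on.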
 
2.2.1 Rank to Rank Timing Basics
 
Before analysing timing within the same rank and between different ranks, it is
important to examine the origin of the timing requirement called Rank to Rank (RTR),
which is derived from the Data Strobe (DQS). DDR SDRAM uses both a clock and a
source-synchronous data strobe (DQS) in order to achieve high data rates. The DQS
signal is shared between the bus masters, that is, the memory controller and the DRAM
memory devices. During a read command, the DQS signal is driven by the memory device
to indicate the read data timing to the memory controller. During a write command,
the DQS signal is driven by the memory controller to indicate the write data timing
to the memory device, and the memory device uses the same DQS to sample the incoming
write data. Since DQS is shared by the bus masters, a synchronization time is needed
for one bus master to hand off the DQS signal to another. This time, tDQS, is the bus
turnaround time, inserted to account for skew on the bus and to prevent different bus
masters from driving the bus at the same time. To avoid such collisions, a second rank
must wait at least tDQS after a first rank has finished before driving the bus. This
synchronization time is called the Rank to Rank time, RTR.
 
2.2.2 Requests targeting the same Rank
 
Figure 2.2 shows the scenario where requests target different rows and banks within
the same rank. First, REQ1 targets row 1, bank 1, rank 1. Since it is a close read
request, the memory controller issues the ACT command to row 1, bank 1, rank 1. It
then waits for the RCD delay before issuing the CAS read command, and waits for the
Read Latency (RL) before receiving the first byte of read data from the memory
device. The length of the read data is equal to the burst length (BL).

Right after REQ1 arrives, a new write request, REQ3, also arrives, targeting a
different bank in the same rank (row 1, bank 2, rank 1). Since it is a close request,
an ACT command needs to be issued for REQ3. The ACT-to-ACT row-to-row delay (RRD)
constraint applies to requests targeting different banks in the same rank. Therefore,
the REQ3 ACT command cannot be issued right after the REQ1 ACT command; instead, it
must be delayed by RRD. Once the REQ3 ACT command has been issued, the memory
controller still cannot issue the CAS write command as soon as the RCD delay has
elapsed, as might be expected. This scenario is explained in the following paragraph.

When read and write requests target different banks or the same bank within the same
rank, the memory controller must satisfy the Read to Write (RTW) timing constraint.
Therefore, the memory controller needs to wait for the RTW delay after the REQ1 CAS
read command before issuing the REQ3 CAS write command. After the CAS write command
is issued, the memory controller needs to wait for the Write Latency (WL) before
sending the write data to the memory device. After the write data is written out, the
memory controller needs to wait for the Write Recovery (WR) time before issuing a
PRE, if the row needs to be closed. Whether the row is closed depends on the open or
close row policy.
 
While REQ1 is receiving its read data, a new read request, REQ2, arrives targeting
the same bank and the same rank as REQ1 but a different row (row 2, bank 1, rank 1).
Since REQ2 targets a different row in the same bank, the row being accessed by REQ1
has to be closed before a new row can be opened for REQ2. Before issuing the PRE
command to close the row, the memory controller has to wait until both the Read to
Pre-charge (RTP) and RAS timing constraints are satisfied, that is, for the later of
the two. After issuing the PRE command for REQ2, the controller can issue the ACT
command once both the RP and RC timing constraints are satisfied. After issuing the
ACT command for REQ2, the memory controller has to wait for the RCD delay in order to
issue the CAS read command as per the DRAM protocol. However, the REQ2 CAS read
command can only be issued after the Write to Read (WTR) timing constraint is
satisfied, as shown in Figure 2.2. It is important to point out the difference between
the RTW and WTR constraints. The RTW constraint runs from the read command to the
write command of the same or different requestors, whereas the WTR constraint runs
from the completion of the write data burst to the read command of the same or
different requestors.
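
The same-rank scenario can be traced numerically with the sketch below. It is only an
illustration under stated assumptions: cycle 0 is the REQ1 ACT command, every command
is issued at the earliest cycle allowed by the constraints named in the text, and the
timing values are those of Table 2.1. The function and key names are hypothetical.

```python
# Cycle values copied from Table 2.1 (DDR3-1333H).
T = {"RCD": 9, "RL": 9, "WL": 7, "BUS": 4, "RP": 9, "RTP": 5,
     "RAS": 24, "RC": 33, "RRD": 4, "RTW": 7, "WTR": 5}


def same_rank_schedule(t=T):
    """Earliest issue cycle of each command for REQ1, REQ3 and REQ2."""
    s = {}
    # REQ1: close read to (row 1, bank 1, rank 1).
    s["req1_act"] = 0
    s["req1_cas_rd"] = s["req1_act"] + t["RCD"]
    s["req1_data_start"] = s["req1_cas_rd"] + t["RL"]
    s["req1_data_end"] = s["req1_data_start"] + t["BUS"]
    # REQ3: close write to another bank of the same rank.
    s["req3_act"] = s["req1_act"] + t["RRD"]
    s["req3_cas_wr"] = max(s["req3_act"] + t["RCD"],      # own ACT to CAS
                           s["req1_cas_rd"] + t["RTW"])   # read cmd to write cmd
    s["req3_data_start"] = s["req3_cas_wr"] + t["WL"]
    s["req3_data_end"] = s["req3_data_start"] + t["BUS"]
    # REQ2: close read to a different row of REQ1's bank.
    s["req2_pre"] = max(s["req1_cas_rd"] + t["RTP"],      # read to pre-charge
                        s["req1_act"] + t["RAS"])         # ACT to PRE
    s["req2_act"] = max(s["req2_pre"] + t["RP"],          # PRE to ACT
                        s["req1_act"] + t["RC"])          # ACT to ACT, same bank
    s["req2_cas_rd"] = max(s["req2_act"] + t["RCD"],      # own ACT to CAS
                           s["req3_data_end"] + t["WTR"])  # end of write data to read
    return s


for name, cycle in same_rank_schedule().items():
    print(f"{name:16s} {cycle}")
```

With these values the RTW constraint, not RCD, decides when the REQ3 CAS write can be
issued, while for the REQ2 CAS read the RCD term happens to be the later one; the WTR
term is still evaluated because it would dominate for a later write burst.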
 
 
Figure 2.2: Requests targeting different banks in the same rank
 
2.2.3 Requests targeting different Ranks
 
Figure 2.3 shows the scenario where requests target different ranks. The first read
request (REQ1) arrives and, since it is a close request targeting rank 1, the memory
controller issues the ACT command. At the same time, another close request, the write
request REQ2, also arrives, targeting rank 2. Since it is a close request, the memory
controller issues the REQ2 ACT command. This ACT command can be issued right after
the REQ1 ACT command without waiting for the Row to Row delay (RRD), because the two
requests target different ranks and there is no constraint between ACT commands of
requestors that target different ranks. As can be seen from Figure 2.3, the REQ1 CAS
read command and the corresponding read data follow the regular timing parameters
described before. For REQ2, however, the CAS write command cannot be issued as soon
as the RCD delay has elapsed from its ACT command; instead, it needs to be scheduled
so as to satisfy a new timing constraint called Rank to Rank (RTR). The RTR delay
separates the end of the read data of REQ1 from rank 1 and the beginning of the write
data of REQ2 to rank 2. In other words, the memory controller can only begin sending
the REQ2 write data after RTR has elapsed from the end of the REQ1 read data. While
REQ1 is receiving its read data, a new request, REQ3, arrives and targets the same
rank, bank and row as REQ1.

REQ3 is considered an open request since the row has already been opened by REQ1.
Therefore, there is no need to issue an ACT command; instead, the memory controller
can issue the CAS read command to access the already opened row. However, this CAS
read command cannot be issued right away; it needs to be scheduled to satisfy the
rank to rank (RTR) constraint. An important observation is that no write to read
(WTR) or read to write (RTW) constraint applies in this scenario, where requests
target different ranks. The WTR and RTW timing constraints are only applicable to
requestors targeting either the same bank or different banks in the same rank, as we
saw in Section 2.2.2.
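
A similar sketch traces the different-rank scenario. Again this is an illustration
under assumptions rather than the controller's scheduling logic: cycle 0 is the REQ1
ACT command, the REQ2 ACT follows one cycle later because the command bus carries one
command per cycle, REQ3 is assumed to arrive at cycle 18 (when the REQ1 read data
starts), and RTR is applied between the end of one data burst and the start of the
next, as described above. Names are hypothetical; values come from Table 2.1.

```python
# Cycle values copied from Table 2.1 (DDR3-1333H).
T = {"RCD": 9, "RL": 9, "WL": 7, "BUS": 4, "RTR": 2}


def different_rank_schedule(t=T, req3_arrival=18):
    """Earliest issue cycles when consecutive requests alternate between ranks."""
    s = {}
    # REQ1: close read to rank 1.
    s["req1_act"] = 0
    s["req1_cas_rd"] = s["req1_act"] + t["RCD"]
    s["req1_data_start"] = s["req1_cas_rd"] + t["RL"]
    s["req1_data_end"] = s["req1_data_start"] + t["BUS"]
    # REQ2: close write to rank 2; no RRD between ACTs to different ranks.
    s["req2_act"] = s["req1_act"] + 1
    earliest_data = s["req1_data_end"] + t["RTR"]          # rank-to-rank gap on data bus
    s["req2_cas_wr"] = max(s["req2_act"] + t["RCD"],       # own ACT to CAS
                           earliest_data - t["WL"])        # so data starts after RTR
    s["req2_data_start"] = s["req2_cas_wr"] + t["WL"]
    s["req2_data_end"] = s["req2_data_start"] + t["BUS"]
    # REQ3: open read to the row REQ1 left open in rank 1 (no ACT needed).
    earliest_data = s["req2_data_end"] + t["RTR"]
    s["req3_cas_rd"] = max(req3_arrival, earliest_data - t["RL"])
    s["req3_data_start"] = s["req3_cas_rd"] + t["RL"]
    return s


for name, cycle in different_rank_schedule().items():
    print(f"{name:16s} {cycle}")
```

No RTW or WTR term appears anywhere in this schedule; the only coupling between the
requests is the two-cycle RTR gap on the data bus.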
 
 
   Figure 2.3: Requests targeting different ranks
 
2.3 DRAM Row Buffer Management Policy
 
Section 2.1 explains the operation of the row buffer in DRAM. The policy that manages
the operation of the row buffer is called the row buffer management policy. Two
policies exist: open row and close row. The choice between them is up to the memory
controller designer and depends on performance and power consumption. Under the open
row policy, the memory controller keeps the row buffer open until a request targets a
different row. If another memory request arrives for the same row address with a
different column address, the access is possible with the minimum latency of CAS
Latency (CL), without re-opening the row. This policy therefore saves the RAS-to-CAS
delay that re-opening the row would incur. When the controller sends a memory request
to a different row in the same bank, the open row needs to be closed by a PRE command
before the new row is opened. On the other hand, the close row policy automatically
closes the row buffer after each request; consequently, every new request has to
issue an ACT command to open the row, even if it accesses the same row as before.
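
The difference between the two policies can be illustrated by generating the command
sequence for a short request stream. This is a simplified sketch, not the front-end
command generator of this thesis: the function name, the (bank, row) request format
and the explicit PRE used to model the automatic pre-charge are all assumptions made
for the example.

```python
def generate_commands(requests, policy):
    """Return the DRAM commands for a list of (bank, row) requests.

    policy -- "open": leave the row open after each access;
              "close": pre-charge automatically after every access.
    """
    if policy not in ("open", "close"):
        raise ValueError("policy must be 'open' or 'close'")
    open_row = {}  # bank -> row currently held in that bank's row buffer
    commands = []
    for bank, row in requests:
        current = open_row.get(bank)
        if current != row:
            if current is not None:
                commands.append(("PRE", bank, current))  # close the other row
            commands.append(("ACT", bank, row))          # load the requested row
            open_row[bank] = row
        commands.append(("CAS", bank, row))              # read or write access
        if policy == "close":
            commands.append(("PRE", bank, row))          # models the auto pre-charge
            del open_row[bank]
    return commands


stream = [(0, 5), (0, 5), (0, 7)]  # two hits on row 5, then a different row
for policy in ("open", "close"):
    names = [name for name, _, _ in generate_commands(stream, policy)]
    print(policy, names)
```

For this stream the open row policy needs six commands and the close row policy nine,
since the second access to row 5 requires no ACT only when the row was left open.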
 
2.4 DRAM Mapping
 
The memory controller receives a memory request in the form of just a physical
address and the request type. It is the task of the memory controller to map the
incoming raw physical address into the correct rank, bank, row and column addresses
to access the memory devices. There are two types of mapping methodologies.
 
2.4.1 Continuous Memory Mapping
 
In continuous memory mapping, incoming physical memory addresses are mapped within a
single row of a particular bank. Sequential accesses continue through the column
addresses of the same row until the end of the row is reached. Only once the current
row has been fully accessed does the mapping switch to the same row number of the next
available bank, as shown in Figure 2.4. If no next bank is available, the logical
address is mapped to the next row in the current bank. Private bank mapping is a
subset of the continuous mapping scheme. When a private bank scheme is used in a
multi-requestor system, each requestor is assigned either one bank or a set of
disjoint banks within the same rank. Continuous memory mapping is very efficient, with
no bank conflicts, when the memory requests are to continuous sequential addresses.
However, it becomes inefficient if the memory requests reach different rows in the
same bank.
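
As a rough sketch of this scheme, the function below decodes a linear address (counted
in column-sized units) into bank, row and column. The function name, the parameters
cols_per_row and num_banks, and the exact bit ordering are assumptions made for
illustration; they are not taken from the controller described in this thesis.

```python
from typing import Tuple


def continuous_map(addr: int, cols_per_row: int, num_banks: int) -> Tuple[int, int, int]:
    """Decode a linear address into (bank, row, column), continuous style.

    Columns of one row are filled first, then the same row number of the
    next bank, and only then the next row.
    """
    column = addr % cols_per_row
    bank = (addr // cols_per_row) % num_banks
    row = addr // (cols_per_row * num_banks)
    return bank, row, column
```

With num_banks set to 1, which corresponds to a requestor that owns a single private
bank, the mapping moves to the next row of the same bank as soon as a row is full.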
 
 
 
Figure 2.4: Continuous Memory Mapping
 
2.4.2 Interleaved Memory Mapping
 
In interleaved memory mapping, incoming physical addresses are mapped to a column
address location of a row in each of the banks available in the rank. Once all banks
have been accessed, the incoming physical address is mapped to the next column address
location of the same row across all banks. When the row becomes full, the incoming
logical address is mapped to the next row across all available banks, as shown in
Figure 2.5. An interleaved memory with n banks is said to be n-way interleaved. If
there are n banks, memory location i resides in bank number i mod n. The interleaved
method has the advantage of using multiple banks simultaneously: because addresses are
spread over the banks, this mapping provides efficient bank parallelism and results in
higher memory throughput. The drawback is that it involves a more complex design and
is only efficient when burst access to all the banks is required.
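
The "i mod n" rule can be written out as a short sketch. As before, the function name
and the parameters cols_per_row and num_banks are hypothetical, and addresses are
counted in column-sized units.

```python
from typing import Tuple


def interleaved_map(addr: int, cols_per_row: int, num_banks: int) -> Tuple[int, int, int]:
    """Decode a linear address into (bank, row, column), n-way interleaved.

    Consecutive addresses rotate over the banks (location i lives in bank
    i mod n); the column advances after every bank has been visited, and
    the row advances once the row is full in all banks.
    """
    bank = addr % num_banks
    column = (addr // num_banks) % cols_per_row
    row = addr // (num_banks * cols_per_row)
    return bank, row, column
```

With two banks, addresses 0 and 1 fall into column 0 of banks 0 and 1, and address 2
returns to bank 0 at column 1.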
 
 
    Figure 2.5: Interleaved Memory Mapping
 
 
2.5 Related Works
 
Related work on memory controller design can be classified under different
implementation categories: close row, open row, critical, mixed critical, rank
switching and arbitration policy. Let us analyse the related work under each of these
categories.
 
First, we look at related work on real-time memory controllers with a close row
policy. The Analyzable Memory Controller (AMC) [7] and Predator [8] employ a close row
policy designed for critical systems. Both [7] and [8] choose interleaved banks as the
bank mapping strategy. With interleaved bank mapping, there is no guarantee that rows
opened by one requestor will not be closed by another requestor. Therefore, both [7]
and [8] offer predictable timings, but the latency can be significantly higher than
for controllers using an open row policy. The work by Yonghui et al. [3] presents the
architecture of a dynamically scheduled real-time memory controller. The paper aims to
minimize the worst case execution time through a close page policy and bank
parallelism with interleaved bank mapping. Further, [3] specifically addresses the
issue of having either a fixed or a variable transaction size for real-time memory
controllers. The papers [4], [10] and [11] also use the close page policy in their
memory controller designs.
 
In the open page policy category, Goossens et al. [9] have proposed a new type of open
page policy called the conservative open page policy. The approach in [9] works as
follows: do not precharge if the next request is known to target the open row;
precharge if the next address is not known in time, or in case of a miss. It also
ensures that the guarantees given by the close page policy are not reduced. In other
words, the approach in [9] leaves a row open for a fixed time window to take advantage
of row hits. In the worst case, this approach is the same as the close row policy if
no assumptions can be made about the exact time at which requests arrive at the memory
controller. Further, Wang et al. [5] have done extensive analysis on rank switching
based on an open row policy.
 
Next, we discuss related work that experiments with different arbitration policies.
The authors of [8] employ a credit-controlled static-priority (CCSP) arbiter to share
the memory between multiple requestors. They use a hybrid approach between static
DRAM command scheduling, which is better for timing guarantees, and dynamic command
scheduling, which is better for average-case memory bandwidth utilization. Goossens et
al. [9] use work-conserving time-division multiplexing (TDM) as the arbitration. This
TDM arbitration allows slots left unclaimed by one application to be used by another
application if it has a request available. On the other hand, the Analyzable Memory
Controller (AMC) [7] provides an upper bound on the latency of memory requests in a
multi-core system by using a round robin arbiter. Reineke et al. [10] propose a memory
controller that uses TDMA scheduling. Akesson et al. [11] propose an arbiter called
credit-controlled static-priority (CCSP), consisting of a rate regulator and a
static-priority scheduler.
 
Now, let us focus on the bank mapping methodology. Most of the research papers in this
related work use interleaved bank mapping in their designs. Only very few pay
attention to private bank mapping. Leonardo et al. [4] discuss private banking through
a concept called virtual devices (VD), where each VD is a group of two banks from the
same rank. Further, this paper proposes to share each VD between one critical and a
pre-determined number of non-critical applications. The private banking scheme helps
to define a clear boundary between critical and non-critical applications for their
mixed critical memory controller. Further, Reineke et al. [10] propose a memory
controller that uses bank privatization for predictability and temporal isolation.
 
When it comes to rank switching techniques, the papers [4], [5] and [6] use rank
switching methods. Wang et al. [5] proposed a rank hopping algorithm to maximize DRAM
bandwidth by scheduling a read group (or write group) to the same rank to leverage
bank parallelism until the t_FAW constraint is reached. At that point, another group
of CAS commands is scheduled for another rank. This way, they amortize the
rank-to-rank switching time across a group of CAS commands. However, this scheduling
policy inherently re-orders requests, and it is not suitable for critical real-time
systems that require guaranteed latency bounds. The work in [6] also uses rank
scheduling to reduce DRAM power usage by minimizing the number of state transitions
from low power to active state. In papers [5] and [6], the rank scheduling and
optimizations have only been applied to non-real-time systems. The paper [4]
introduces rank switching analysis for mixed critical systems, but that analysis is
limited to only two ranks.
 
In contrast, the approach of this thesis takes advantage of rank switching techniques
that hide the latency of write-to-read and read-to-write transitions, and thereby
enables the design to achieve tight bounds on worst case execution time (WCET) for
critical core requestors and the lowest possible average execution time for
non-critical core requestors. While most memory controllers by other researchers focus
on the close page policy, we implement a memory controller based on the open row
policy and also take advantage of the private bank scheme, where interference from
other requestors is eliminated. As a possible downside, using private banks reduces
the total memory available to each requestor compared to interleaving methods.
However, increasing the DRAM size is a minor issue compared to designing a memory
controller that can work in a multi-requestor real-time environment. Further, our
memory controller has three stages of arbitration, where each stage has its own
arbitration mechanism: FCFS, RR and priority.
 
 
 
 
 
 
Controller       Close Row or           Arbitration Policy         Critical, Mixed Critical   Rank Scheduling   Bank Mapping
                 Open Row                                          or Non Critical
---------------  ---------------------  -------------------------  -------------------------  ----------------  ----------------
AMC [7]          Close Row              RR Arbitration             Critical                   NA                Interleaved Bank
Predator [8]     Close Row              Credit-Controlled Static   Critical                   NA                Interleaved Bank
                                        Priority (CCSP)
Reineke [10]     Close Row              TDM Arbitration            Critical                   NA                Private Bank
Wang [5]         Open Row               RR Arbitration             Non Critical               Rank Hopping      Interleaved Bank
Yonghui Li [3]   Close Row              FCFS Arbitration           Mixed Critical             NA                Interleaved Bank
Goossens [9]     Conservative Open Row  TDM Arbitration            Mixed Critical             NA                Interleaved Bank
Leonardo [4]     Close Row              Fixed Priority             Mixed Critical             Rank Switching    Private Bank
Akesson [11]     Close Row              Credit-Controlled Static   Mixed Critical             NA                Interleaved Bank
                                        Priority (CCSP)
 
   Table 2.2: Summary of the related work  
 
 
 
 
 
 
 
Chapter 3

Memory Controller Design
 
This chapter discusses the important design decisions and the arbitration rules that
form the foundation of our memory controller implementation. Based on the design
decisions and arbitration rules from this chapter, the implementation of the memory
controller design will be discussed in the next chapter.
 
3.1 Design Decisions
The design decisions, namely the type of row management policy, the address mapping
scheme and the rank-switching mechanism, are discussed next.
 
3.1.1 Row Management Policy
 
The row management policy can be either open row or close row. Under the open row
policy, when the same requestor targets the same row in memory again, the CAS command
can be issued with the minimum CAS Latency (CL) without re-opening the row. The open
row policy therefore avoids the unnecessary RAS-to-CAS delay of re-opening the row and
reduces latency. To take advantage of this latency reduction, the open row policy is
chosen for our memory controller design, because we know the number of open and close
rows. On the other hand, the downside of the open row policy is that we cannot take
advantage of the automatic precharge operation. Further, the open row policy requires
additional commands on the command bus, which can create bus contention.
 
3.1.2 Address Mapping Scheme
 
The address mapping scheme can be either continuous or interleaved. Private bank
mapping is a subset of continuous address mapping. In private bank mapping, each
requestor is assigned one bank or a set of disjoint banks. The incoming logical
address is mapped to the first row; when the first row becomes full, the continuous
access is mapped to the next row, and so on within the same bank.

It is important to analyse why the interleaved banking method is not suitable for our
proposed design. Interleaved banking allows all requestors to access all the banks, so
banks are shared by requestors. In real-time systems, however, we do not want
requestors to share banks, because there is a high probability that one requestor
closes a row in a bank that was already opened by a second requestor. This kind of
interference creates unwanted latency for the second requestor, which must re-open the
row that was closed by the first requestor. For these reasons, interleaved banking is
not suitable for our memory controller design. Our rank switching memory controller
design therefore uses private bank mapping, where each requestor is allocated one bank
or a set of banks. The downside of private banks is that, since each requestor is
allocated only its own private banks, each requestor has access to a limited amount of
memory. However, increasing the memory size is not an issue, and as a solution we can
increase the amount of memory assigned to each private bank.
 
3.1.3 Rank Switching Mechanism
 
Introducing the rank switching technique provides strong isolation and composability
properties to our proposed memory controller design. Composability provides a way to
integrate components while preserving their temporal properties. In an ideal system,
we would like to achieve a data bus utilization of 100%. In practice, due to the many
timing constraints detailed in Section 2.2, data bus utilization is typically much
lower. This is true even if all requests are open, since t_RTW and t_WTR significantly
increase the time between successive read and write commands or vice versa. Let us
look at an example that illustrates the rank switching mechanism.
 
Figure 3.1 (A) depicts the worst case situation for four successive open requests of
different requestors in a single-rank system, which is an alternation of store and load
(write and read CAS commands). Note that it takes 52 clock cycles to complete all four
requests, while the data bus is only used for 16 cycles, resulting in a utilization of
only 31%. Our key idea is that we can improve the worst-case latency by noticing that
t_RTW and t_WTR do not apply between requests that target banks in different ranks.
 
Figure 3.1 (B) shows the schedule derived by assigning the four requestors to two
different ranks and alternating the servicing of requests between the two ranks. Since
the only constraint between requests to different ranks is the shorter t_RTR, the
schedule now takes 35 cycles to complete, a 33% improvement.

Similarly, Figure 3.1 (C) shows the effect of assigning each requestor to a different
rank. Note that in this case, after data transfer starts at cycle 7, we use the data
bus for 4 cycles out of every 6, resulting in a utilization of 2/3. Finally, notice
that alternating ranks also helps to reduce the latency of ACT commands of close
requests, since the t_RRD and t_FAW constraints do not apply between different ranks.
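
The percentages quoted for Figure 3.1 follow directly from the cycle counts given in
the text. The short check below only reproduces that arithmetic; the helper name
utilization is hypothetical.

```python
from fractions import Fraction


def utilization(data_cycles: int, total_cycles: int) -> Fraction:
    """Fraction of cycles in which the data bus carries data."""
    return Fraction(data_cycles, total_cycles)


# Figure 3.1 (A): one rank, 16 data cycles out of 52.
single_rank = utilization(16, 52)

# Figure 3.1 (B): two ranks, 35 cycles instead of 52.
improvement = Fraction(52 - 35, 52)

# Figure 3.1 (C): four ranks, 4 data cycles out of every 6 in steady state.
four_ranks = utilization(4, 6)

print(f"single rank utilization: {float(single_rank):.0%}")
print(f"two-rank improvement:    {float(improvement):.0%}")
print(f"four-rank utilization:   {four_ranks}")
```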
 
 
    Figure 3.1 (A): Arbitration for 1 Rank


    Figure 3.1 (B): Arbitration for 2 Ranks


    Figure 3.1 (C): Arbitration for 4 Ranks
 
Our illustrative example shows that a rank-switching mechanism in the back end can
both significantly decrease the latency of memory requests and increase bus
utilization without requiring us to reorder requests in the front end, which is
unsuitable for critical real-time requestors needing guaranteed latency bounds.

The challenge is how to implement such a mechanism in the memory controller in a
predictable way. In particular, a simple static TDMA schedule is not suitable, since
requestors can dynamically submit different types of requests at run-time. Instead, a
set of dynamic arbitration rules is proposed in Section 3.2.
 
3.1.4 Selection of Arbiter Type
 
Arbiter scheduling can be chosen from one of the following policies: Priority, Round
Robin, First Come First Served (FCFS) and TDMA. This section describes the reasons for
the selection of each arbiter type in our proposed design. Figure 3.2 shows an
overview of all the arbiters and the scheduling type used by each of them in our
design. We discuss each arbiter type under the requestor arbitration, rank arbitration
and command arbitration categories.
 
 
Figure 3.2: Choice of Arbiter Types
 
The Requestor Arbitration category consists of two arbiters: the CAS Arbiter and the
PRE ACT Arbiter. The task of these arbiters is to choose a requestor from a set of
requestors of the same type. We need to analyse why certain arbiter scheduling
mechanisms are suitable for our design and why others are not. A priority scheme
assigns one requestor the highest priority over the others. The lowest priority
requestor would starve while the high priority requestor holds the bus for a long
time. Since all requestors are of the same type, we do not want any requestor to
starve for bus ownership. A Round Robin arbiter rotates the priority level among all
the requestors, so that each requestor spends an equal amount of time as the highest
priority one. A simple static TDMA schedule, on the other hand, is not suitable since
requestors can dynamically submit different types of requests at run-time. Our
requirement is that, while achieving fairness, we do not want to waste clock cycles by
having the arbiter visit a requestor that has no requests. Therefore, the CAS Arbiter
was designed in a First Come First Served (FCFS) style. The PRE ACT Arbiter has the
additional task of giving a PRE command higher priority than an ACT command while the
ACT is waiting to satisfy its timing constraints. This design decision makes the PRE
ACT Arbiter a modified First Come First Served (M-FCFS) arbiter.
 
The Rank Arbitration category consists of the Level 2 PRE ACT Arbiter and the Level 2
CAS Arbiter. In order to maintain fairness and equal priority in choosing the ranks,
the Level 2 PRE ACT Arbiter is designed in a Round Robin style. The Level 2 CAS
Arbiter, on the other hand, has to choose among its Level 3 queues based on the burst
to burst (BTB) values of the requests arriving from Level 3. Further, the Level 2 CAS
Arbiter has to differentiate clients that arrive early from clients that are waiting
for a write-to-read (WTR) or read-to-write (RTW) timing constraint to elapse.
Considering all these requirements, the Level 2 CAS Arbiter was designed as a Modified
First Come First Served (M-FCFS) arbiter.
 
  In the Command Arbitration category, a single command arbiter handles all commands,
such as PRE, ACT and CAS. In order to give CAS commands higher priority than other
commands, a priority based arbiter is used. None of the arbiters in this design is
intended to reorder incoming requests, which helps to avoid unnecessary complexity in
the timing analysis. Now that we have seen the arbitration types, let us analyse the
details of the arbitration rules.
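
The priority decision of the command arbiter is simple enough to state as a sketch.
The function name and the representation of commands as optional strings are
assumptions for illustration; the actual arbiter is a hardware block.

```python
from typing import Optional


def command_arbiter(cas_cmd: Optional[str], pre_act_cmd: Optional[str]) -> Optional[str]:
    """Pick the command to issue in the current clock cycle.

    cas_cmd is the CAS command offered by rank arbitration (or None), and
    pre_act_cmd is the PRE or ACT command offered by rank arbitration (or
    None). CAS always wins; a PRE or ACT is issued only when no CAS is offered.
    """
    if cas_cmd is not None:
        return cas_cmd
    return pre_act_cmd
```

In a cycle where both a CAS and a PRE are offered, the CAS is issued and the PRE has to
be offered again in a later cycle.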
 
 
3.2 Arbitration Rules
 
The back end memory controller logic is built as three levels of arbiters, as shown in
Figure 3.3 below. Each level has a different type of arbiter, and it is important to
analyse them through the arbitration rules. We consider a device with R ≥ 2 ranks. The
memory controller can support both critical and non-critical real-time requestors. Our
design goal is to minimize the latency bound of critical requests, while
simultaneously attempting to maintain high data bus utilization and thus provide
memory bandwidth to all requestors. To this end, each rank is assigned either to
critical or to non-critical requestors, and each requestor uses only one rank; let
M_r, 1 ≤ r ≤ R, be the number of requestors that use rank r. The banks in critical
rank r are statically partitioned among the M_r requestors in rank r, according to the
private bank principle.
     
         
Figure 3.3: Three Levels of Arbitration

Figure 3.3 shows an example block diagram of the three levels of arbitration logic in
the back end, where Rank 1 is a critical rank and Rank R is a non-critical rank.
M_1 = 4 indicates that rank 1 has four requestors. Arbitration is performed in three
levels.
 
For critical ranks, commands generated by the front end are enqueued in the
per-requestor command queues. Level 3 (L3), or Requestor Arbitration, arbitrates among
requestors within the same rank. The command at the front of the selected requestor
queue is propagated to Level 2 (L2), or Rank Arbitration, which arbitrates among the R
ranks. Note that Level 3 and Level 2 arbitration are each split between a PRE ACT
Arbiter that handles PRE and ACT commands only, which are needed only for close
requests, and a CAS Arbiter that handles CAS commands, which are needed by all
requests. Finally, Level 1 (L1), or Command Arbitration, simply assigns higher
priority to CAS than to PRE or ACT commands; i.e., if during the current clock cycle
the L2 CAS Arbiter propagates a CAS command to Level 1, the Command Arbiter will issue
it to the device; otherwise, if the L2 PRE ACT Arbiter propagates a PRE or ACT, the L1
Arbiter will issue it. This is done to ensure that the critical timings of CAS
commands in the rank-switching mechanism are not disrupted by command bus contention
with PRE/ACT commands. The following rules capture the behavior of the Level 2
arbiters and of the Level 3 arbiters for a critical rank r.
 
(1A) A command at the head of each per-requestor queue is said to be active if all
timing constraints that are caused by previous commands of the same requestor are
satisfied;

(1B) A CAS command does not become active until the data of the previous CAS command
of the same requestor has been transmitted. In other words, an active command can be
issued immediately if there are no other requestors in the system.

(2A) The L3 PRE ACT Arbiter uses a modified First-Come-First-Serve (FCFS) arbitration;
the requestor is enqueued at the back of a modified FIFO queue as soon as it has an
active PRE or ACT command, and it is removed from the queue once the command is
finally issued by L1.

(2B) Every clock cycle, the arbiter scans the modified FIFO queue and propagates to
Level 2 the first command that can be issued (without violating timing constraints),
if any.

(2C) An active PRE command can always be issued; an active ACT command could instead
be blocked by t_RRD or t_FAW constraints caused by other requestors in the same rank.
 
 
(3) The L3 CAS Arbiter uses standard FCFS arbitration, with a requestor being enqueued
once it has an active CAS command and removed once the CAS command is issued by L1.
The L3 CAS Arbiter propagates to L2 the CAS command of the first requestor in FCFS
order (if any), together with the earliest time t_SDr at which the data transmission
associated with the CAS command could be started. The time t_SDr is calculated based
on previous CAS commands already issued either from the same or from a different rank.
Note that, contrary to L3 PRE ACT Arbitration, it is allowed to propagate a CAS command
that cannot yet be issued; this is required to properly alternate among ranks.

(4) The L2 PRE ACT Arbiter can use either FCFS or Round-Robin (RR) arbitration; we
adopt RR in our prototype since it is easier to implement in hardware than FCFS.

(5) The L2 CAS Arbiter uses a different, modified FCFS arbitration; a rank is enqueued
at the back of a FIFO queue once a new CAS command is propagated from L3, and it is
removed from the FIFO once the command is issued by L1. Let t_ED be the time at which
the data transmission of the last issued CAS command will end, or has ended. Then at
every clock cycle, if for any queued rank it holds that t_SDr ≤ t_ED + t_RTR, the
first such rank in FCFS order is selected. Otherwise, the first rank in FCFS order
with the smallest value of t_SDr is selected. In either case, the corresponding CAS
command is propagated to L1 only if it can be issued in the current clock cycle
(without violating timing constraints).

(6) The L1 arbiter receives all the commands: PRE, ACT, REF and CAS. This arbiter is
designed as a priority arbiter where the CAS command is given higher priority than
PRE, ACT and REF commands. The acknowledgement (ACK) is generated at this level when a
command is sent out from this arbiter. This ACK signal is used by Level 3 to schedule
the DRAM timing of the commands in an orderly manner.
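
The selection step of Rule 5 can be sketched in software as follows. This is only an
illustrative model: the function name, the list-of-tuples queue and the integer cycle
times are assumptions, and the final check of whether the selected CAS can actually be
issued in the current cycle is left out.

```python
from typing import List, Optional, Tuple


def l2_cas_select(fifo: List[Tuple[int, int]], t_ed: int, t_rtr: int) -> Optional[int]:
    """Select a rank according to the modified FCFS of Rule 5.

    fifo  -- queued ranks in FCFS order, each as (rank_id, t_sd), where t_sd
             is the earliest time the data of that rank's CAS could start
    t_ed  -- time at which the data of the last issued CAS ends (or ended)
    t_rtr -- rank-to-rank switching constraint
    Returns the selected rank id, or None if no rank is queued.
    """
    if not fifo:
        return None
    # First choice: the oldest rank whose data can start within t_rtr of
    # the end of the previous data transfer.
    for rank_id, t_sd in fifo:
        if t_sd <= t_ed + t_rtr:
            return rank_id
    # Otherwise: the rank with the smallest t_sd. min() returns the first
    # minimal element, so ties are broken in FCFS order.
    return min(fifo, key=lambda entry: entry[1])[0]
```

For instance, with t_ED = 20 and t_RTR = 2, a queue holding rank 0 with t_SD = 30
followed by rank 1 with t_SD = 22 selects rank 1, even though rank 0 was enqueued
first.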
 
For critical requestors, each requestor puts a request into the L3 CMD queue only
after its previous request has been successfully completed with either read or write
data. Otherwise, the request is blocked until the read or write data of the previous
request of the same requestor is completed. Note that since each requestor has at most
one active command and each L3 PRE ACT or CAS Arbiter only propagates one command at a
time, it follows that only one instance of each requestor or rank can be present in a
given FCFS queue; after a command of that requestor or rank is issued by L1, the
requestor or rank can be re-enqueued at the back of the queue.

Hence, while the system is backlogged, the scheme approximates a fair arbitration
where each rank is allowed to transmit once every R times, and thus each requestor
within that rank transmits once every R · M_r times.
 
Exceptions are made in Rules 2 and 5. The modified FCFS arbitration of Rule 2 ensures
that PRE commands do not have to suffer from t_RRD or t_FAW constraints; if the first
requestor has an active ACT command that cannot be issued right away, we still allow
the rank to propagate a PRE command of a later requestor, since issuing the PRE
command cannot delay the ACT command of the first requestor in any case. The modified
FCFS arbitration of Rule 5 implements the rank-switching mechanism for CAS commands:
as long as the "burst to burst gap" between successive data transmissions is at most
t_RTR, ranks are scheduled in FCFS order. However, if scheduling the first rank would
result in a longer gap (in particular, because of a t_WTR constraint), then we reorder
ranks to avoid stalling the data bus.
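
The scan described for Rule 2 can be sketched as below. The function name and the
boolean act_allowed, which stands for "the t_RRD and t_FAW constraints of this rank
currently permit an ACT", are assumptions made to keep the example small; the real
arbiter tracks these timings in hardware.

```python
from typing import List, Optional, Tuple


def l3_pre_act_select(queue: List[Tuple[int, str]],
                      act_allowed: bool) -> Optional[Tuple[int, str]]:
    """Scan the modified FIFO and return the first issuable command.

    queue       -- (requestor_id, command) pairs in FCFS order, where the
                   command is "PRE" or "ACT" and is already active
    act_allowed -- True if an ACT may be issued in this rank right now
    """
    for requestor_id, command in queue:
        if command == "PRE":
            return requestor_id, command      # an active PRE can always be issued
        if command == "ACT" and act_allowed:
            return requestor_id, command
        # A blocked ACT stays at its position; later entries are still examined.
    return None
```

If requestor 0 is waiting with a blocked ACT and requestor 1 has an active PRE, the PRE
of requestor 1 is propagated, while requestor 0 keeps its place at the head of the
queue.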
 
We make no assumption on arbitration for non-critical ranks, other than that the Level
3 arbiter will propagate at most one issuable PRE/ACT command and one CAS command with
associated time t_SDr to Level 2 every clock cycle; rank-level arbitration ensures
that the worst-case latency for a request of a critical requestor depends only on the
total number of ranks R and the number of requestors M_r within the same rank. L3
arbitration for non-critical requestors can be optimized for average case latency and
throughput. In particular, we can use techniques employed by high-performance
commercial controllers, such as per-bank queues rather than private banks, and request
reordering to favor load over store and open over close requests.
 
Finally, due to space limitations we only briefly discuss the issue of data sharing;
more details on our approach are discussed in [15]. If critical cores are sharing
data, we allocate a separate shared bank partition and use an additional "virtual"
critical requestor to manage accesses to the shared partition; contention between
data-sharing cores is then handled in the front end. For I/O communication, the DMA is
treated as a separate requestor. A communicating core can then access the DMA bank
partition while the DMA is not transmitting.
 
This chapter has shown the backbone structure of our proposed memory controller
design. In the next chapter, we look at the theoretical analysis of the design
structure and its timing constraints.

Chapter 4

Theoretical Analysis
 
 
Based on the arbitration rules detailed in Section 3.2, we will now show how to derive
a safe upper bound on the latency of each memory request of a critical requestor
assigned to rank r. In particular, we consider the back end worst case latency t_Req,
measured from the time when a request arrives at the front of the per-requestor
command queue until its data is transmitted. As shown in [2], such latency can then be
used to derive the overall delay suffered by a task due to main memory contention; for
example, we can use the static analysis method described in [10] to obtain the
worst-case numbers of open/close and load/store requests, which lets us derive a
worst-case request pattern for the task. Since the same strategy as in [2] can be used
to account for refresh operations, we do not cover them here. We adopt the DRAM
latency analysis framework introduced in [2].
 
 
 
Figure 4.1: Worst Case Latency Decomposition
 
4.1 Worst Case Per-Request Latency
 
The worst case latency t_Req is decomposed into two parts, t_AC and t_CD, as shown in
Figure 4.1. Time t_AC (Arrival-to-CAS) is the worst case interval between the arrival
of a request at the front of the per-requestor command queue and when the
corresponding CAS command becomes active. Time t_CD (CAS-to-Data) is the worst case
interval between the CAS becoming active and the end of the data transfer. In all
figures in this section, we use a solid arrow to indicate when a request arrives at
the front of the per-requestor command queue; we use a dashed arrow to indicate the
time instant at which a command becomes active; solid square boxes denote when
commands are issued on the command bus; dashed square boxes denote commands that are
ready to be issued but cannot be issued right away due to contention with other
requestors.
 
 
Figure 4.2: Arrival-to-CAS Decomposition for Close Request
 
For a close request, t_AC includes the latency required to process a PRE and an ACT
command; we thus further decompose t_AC into smaller parts as shown in Figure 4.2.
Each part is either a JEDEC timing constraint shown in Table I or a parameter that we
compute, as shown in Table 4.1. Times t_DP and t_DA determine the time at which a PRE
and an ACT command becomes active, respectively. Times t_IP and t_IA represent the
worst case delay between a command becoming active and when that command is issued,
and thus capture interference caused by other requestors. Times t_DP and t_DA, as well
as t_AC for an open request, are computed based only on timing constraints caused by
the previous request of the requestor under analysis, and are independent of the
specific arbitration used by the memory controller; hence, we can reuse the
expressions provided in [2]. Instead, in the following sections, we will detail how to
compute t_IP, t_IA and t_CD.
 
 
 
Timing Parameter    Definition
$t_{DP}$            End of previous DATA to PRE Active
$t_{IP}$            Interference Delay for PRE
$t_{DA}$            End of previous DATA to ACT Active
$t_{IA}$            Interference Delay for ACT

Table 4.1: Timing Parameter Definition
 
Once all timing components have been computed, the value of $t_{AC}$ for a close request is obtained as:

$$t_{AC} = \max(t_{DA},\ t_{DP} + t_{IP} + t_{RP}) + t_{IA} + t_{RCD} \qquad (1)$$

and for both open and close requests we simply compute the overall latency as $t_{Req} = t_{AC} + t_{CD}$.
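The composition in Eq. (1) can be sketched in a few lines of Python. This is only an illustration: the function names and the numeric values are assumptions chosen for the example, not values taken from the thesis or from a device datasheet.

```python
def arrival_to_cas_close(t_da, t_dp, t_ip, t_rp, t_ia, t_rcd):
    """Eq. (1): worst case Arrival-to-CAS latency of a close request (cycles)."""
    # The ACT becomes active either t_da after the previous data transfer, or
    # once the PRE has become active, been issued, and t_rp has elapsed.
    act_active = max(t_da, t_dp + t_ip + t_rp)
    # The ACT then suffers interference t_ia, and the CAS follows after t_rcd.
    return act_active + t_ia + t_rcd


def request_latency(t_ac, t_cd):
    """Overall per-request latency: t_Req = t_AC + t_CD."""
    return t_ac + t_cd


# Hypothetical component values, in clock cycles.
t_ac = arrival_to_cas_close(t_da=20, t_dp=5, t_ip=10, t_rp=9, t_ia=30, t_rcd=9)
print(t_ac, request_latency(t_ac, t_cd=60))
```

With these made-up inputs the PRE path (5 + 10 + 9 = 24 cycles) dominates $t_{DA}$, which is the situation Figure 4.2 depicts.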
 
4.1.1 Interference Delay for PRE and ACT Commands
 
We begin by computing the worst-case interference delay for PRE commands. We limit ourselves to devices for which the relation $t_{RTR} \ge t_{RL} - t_{WL}$ holds, which includes all devices except the ones with the largest timing constraints, i.e., the lowest-performance devices in each speed category, which are rarely used. The relation ensures that no more than one CAS command can be issued every $t_{BUS}$ cycles, despite the fact that $t_{RL}$ is generally larger than $t_{WL}$; this helps to bound the maximum delay suffered by PRE and ACT commands due to Level 1 (L1) arbitration, and it also simplifies the proofs. Note that due to space limitations, some proofs are provided in the appendix.
 
 
 
 
Theorem 1: The worst case value for $t_{IP}$ is:

$$t_{IP} = \alpha_{PA}(R \cdot M_r) - 1 \qquad (2)$$

where

$$\alpha_{PA}(K) = K + \left\lceil \frac{K}{t_{BUS} - 1} \right\rceil$$
 
Proof: Note that there are no interfering constraints between the PRE under analysis and commands by other requestors, since they must target different banks. Furthermore, since arbitration Rule 2 ensures that commands blocked by timing constraints are not considered for arbitration, it follows that the PRE under analysis can only be delayed due to contention on the command bus, i.e., the command bus must be continuously in use between the en-queuing of the requestor under analysis and the time when its PRE command is issued. In the worst case, when the requestor under analysis is en-queued into the L3 PA Arbiter FCFS queue, there can be a maximum of $M_r - 1$ preceding requestors in the queue. Note that requestors en-queued after the requestor under analysis cannot delay it; and after a PRE/ACT command is issued, the corresponding requestor can only be re-en-queued at the end of the queue. Hence, each other requestor in rank r can only issue one PRE/ACT command before the requestor under analysis, leading to a total of $M_r$ PRE/ACT commands from rank r, including the PRE under analysis. Furthermore, since the L2 PA Arbiter uses either FCFS or round robin arbitration, in the worst case $R - 1$ PRE/ACT commands of other ranks must be issued before each command of rank r. Hence, the worst case number of issued PRE/ACT commands is $(R - 1) M_r + M_r = R \cdot M_r$, and the L2 PA Arbiter is backlogged while issuing them. Based on Lemma 1, the worst case time required to issue all $R \cdot M_r$ commands is then $\alpha_{PA}(R \cdot M_r)$. To conclude the proof, it suffices to notice that $t_{IP}$ does not include the extra clock cycle required to transmit the PRE under analysis; hence, $t_{IP} = \alpha_{PA}(R \cdot M_r) - 1$. Note that $t_{IP}$ depends on the number of requestors $M_r$ in rank r, but it is independent of the number of requestors assigned to other ranks; this is because L2 arbitration isolates rank r from requestors in other ranks. We will show that the same is true for the derived $t_{IA}$ and $t_{CD}$, hence making our analysis compositional.
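The bound of Theorem 1 is easy to evaluate numerically. The sketch below is a hedged illustration: it assumes that the fraction in $\alpha_{PA}(K)$ is rounded up (one interfering CAS slot for every $t_{BUS} - 1$ PRE/ACT slots), and the values R = 2, $M_r$ = 5 and $t_{BUS}$ = 4 are example inputs rather than figures quoted from the thesis tables.

```python
def alpha_pa(k, t_bus):
    """Worst case cycles needed to issue k PRE/ACT commands on a backlogged bus.

    Assumption: the K / (t_bus - 1) term is a ceiling.
    """
    cas_slots = -(-k // (t_bus - 1))  # integer ceiling of k / (t_bus - 1)
    return k + cas_slots


def t_ip(num_ranks, m_r, t_bus):
    """Eq. (2): worst case interference delay for a PRE command (cycles)."""
    # The PRE's own transmission cycle is not counted as interference.
    return alpha_pa(num_ranks * m_r, t_bus) - 1


print(t_ip(num_ranks=2, m_r=5, t_bus=4))
```

Because the argument of $\alpha_{PA}$ is $R \cdot M_r$, the result changes only with the total number of ranks and with the number of requestors in the rank under analysis, which is the compositionality property noted at the end of the proof.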
 
 
 
 
We next analyze $t_{IA}$. We prove that the ACT command under analysis suffers maximal delay in the scenario shown in Figure 4.3, where R = 2 and the rank under analysis is r = 1 with $M_r = 5$. The worst case is produced when all $M_r - 1$ other requestors of rank r en-queue an ACT command at the same time $t_0$ as the requestor under analysis, which is placed last in the L3 PA Arbiter FCFS order; each other requestor triggers a $t_{RRD}$ timing constraint. Furthermore, four ACT commands have been completed as late as possible before $t_0$; this forces the first ACT after $t_0$ to wait for $t_{FAW} - 4 \cdot t_{RRD}$ before being propagated to Level 2. Once an ACT has been propagated to L2, in the worst case it will have to wait for $R - 1$ PRE/ACT commands of other ranks and for interfering CAS commands, similarly to the case of PRE commands in Theorem 1; we call this delay $\Delta_{IA}$. Finally, we need to consider the effect of $t_{FAW}$ on successive ACT commands after $t_0$.
 
 
Figure 4.3: Interference Delay for ACT command, R = 2, r = 1 and $M_r$ = 5
 
As shown in Figure 4.3, since $t_{FAW}$ applies from the time when an ACT is issued to the time when the fourth following ACT can be propagated to L2, we have to take the maximum of either $t_{FAW}$ or $4 \cdot t_{RRD} + 3 \cdot \Delta_{IA}$ for every 4 ACT commands of rank r issued before the one under analysis.
 
 
 
Theorem 2: The worst case value for $t_{IA}$ is:

$$t_{IA} = t_{FAW} - 4\,t_{RRD} + \max\big( (M_r - 1)\,t_{RRD} + M_r \cdot \Delta_{IA},\ K \cdot t_{FAW} + (M_r - 1 - 4K)\,t_{RRD} + (M_r - 3K)\,\Delta_{IA} \big) \qquad (3)$$

where $\Delta_{IA} = \alpha_{PA}(R) - 1$ and $K = \lfloor (M_r - 1)/4 \rfloor$.
 
Proof: Let $t_0$ be the time at which the requestor with the ACT under analysis is en-queued in the L3 PA Arbiter FCFS queue. We show that the worst case latency for the ACT under analysis is produced when at time $t_0$ there are $M_r - 1$ other requestors en-queued before the requestor under analysis, all with ACT commands.
 
First, note that requestors en-queued after the ACT under analysis cannot delay it: if the ACT under analysis is blocked by the $t_{RRD}$ or $t_{FAW}$ timing constraint, then any subsequent requestor with an ACT command in the L3 PA Arbiter FCFS queue would also be blocked by the same constraint. Requestors with PRE commands en-queued after the requestor under analysis can be issued before it according to arbitration Rule 2 if the ACT under analysis is blocked, but they cannot delay it because those requestors access different banks, and there are no timing constraints between an ACT and a PRE of a different bank. Furthermore, after a PRE/ACT command is issued, the corresponding requestor can only be re-en-queued at the end of the queue. Hence, each of the other $M_r - 1$ requestors on rank r can only delay the requestor under analysis by one command, either ACT or PRE. A PRE command can only interfere with the ACT under analysis due to command bus contention, i.e., for one bus cycle. On the other hand, each ACT of another requestor en-queued before the requestor under analysis can contribute at least $t_{RRD}$ to its latency, which is larger than one clock cycle on all devices. This shows that the worst case is produced when all other requestors on rank r have ACT commands.
 
Second, we show that the worst case pattern is the one in which all requestors of rank r en-queue their ACT command at the same time $t_0$. A requestor en-queuing an ACT after $t_0$ does not cause interference, as already shown. If a requestor en-queues an ACT at time $t_0 - \Delta$ with $\Delta < t_{RRD}$, the overall latency is reduced by $\Delta$, since the requestor cannot en-queue another ACT before $t_0$ due to arbitration Rule 1 (the next ACT would not be active due to $t_{RRD}$).
 
 
Third, we consider the latency of ACT commands issued after $t_0$ due to $t_{RRD}$ and L2/L1 arbitration; similarly to the proof of Theorem 1, each ACT command of rank r can suffer a command bus contention delay of $\Delta_{IA} = \alpha_{PA}(R) - 1$ (as an example, $\Delta_{IA} = 2$ in Figure 4.3). Furthermore, once an ACT command of rank r is issued, notice that the next ACT command of the same rank r cannot be propagated from L3 to L2 until after the $t_{RRD}$ constraint has elapsed; hence, each ACT command can take $\Delta_{IA} + t_{RRD}$ before being issued.
 
Finally, we consider the effect of the $t_{FAW}$ timing constraint. Note that a requestor could issue an ACT at or before $t_0 - t_{RRD}$ and then en-queue another ACT at $t_0$ before the ACT under analysis. Due to the $t_{FAW}$ constraint, ACT commands after $t_0$ could then suffer additional delay. Since the $t_{FAW}$ constraint is activated by four consecutive ACT commands, the worst case is produced when four ACT commands are issued as late as possible before $t_0$, as shown in Figure 4.3. The first ACT after $t_0$ is then blocked until time $t_1 = t_0 + t_{FAW} - 4 \cdot t_{RRD}$. Note that, similarly, the second ACT after $t_0$ cannot be propagated from L3 to L2 before $t_0 + t_{FAW} - 3 t_{RRD} = t_1 + t_{RRD}$ due to the same constraint; however, this constraint does not affect the worst case pattern, since the second ACT after $t_0$ is blocked until $t_1 + \Delta_{IA} + t_{RRD}$ anyway due to the $t_{RRD}$ constraint generated by the first ACT and L2/L1 arbitration. It remains to consider the case when $t_{FAW}$ is activated by ACT commands of rank r issued after $t_0$. Since $t_{FAW}$ applies from the time when an ACT of rank r is issued to the time when the fourth next ACT of rank r can be propagated from L3 to L2, if the constraint is activated it effectively replaces the delay of four $t_{RRD}$ constraints (generated by the ACT that starts $t_{FAW}$ and the next three ACT commands of rank r) and three $\Delta_{IA}$ times (one for each of the next three ACT commands; see also the example in Figure 4.3). Furthermore, the total number of $t_{FAW}$ constraints that can be activated for ACT commands of rank r after $t_1$ is $K = \lfloor (M_r - 1)/4 \rfloor$, since we need at least four ACT commands to block the fifth one.
 
In summary, if $t_{FAW} \le 4 \cdot t_{RRD} + 3\Delta_{IA}$, then $t_{FAW}$ is not activated after $t_0$ and the final bound on $t_{IA}$ is obtained by summing the delay $t_1 - t_0$, $M_r - 1$ times the delay $t_{RRD}$ (once for each other requestor on rank r), and $M_r$ times the delay $\Delta_{IA}$ (once for each other requestor on rank r plus once for the requestor under analysis), yielding the bound $t_{FAW} - 4 t_{RRD} + (M_r - 1) t_{RRD} + M_r \cdot \Delta_{IA}$. If instead $t_{FAW} \ge 4 t_{RRD} + 3\Delta_{IA}$, the bound on $t_{IA}$ can be obtained as $t_{FAW} - 4 t_{RRD} + K \cdot t_{FAW} + (M_r - 1 - 4K) t_{RRD} + (M_r - 3K) \Delta_{IA}$, where for each of the K times the $t_{FAW}$ constraint is activated, we replace a term $4 t_{RRD} + 3\Delta_{IA}$ with a term $t_{FAW}$. To end the proof, it suffices to notice that in Eq. (3) we consider the maximum of the two bounds.
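As a hedged numerical sketch of Eq. (3), the function below evaluates both branches of the maximum. The helper `alpha_pa` assumes a ceiling in its fractional term, and the timing values in the example call ($t_{BUS}$ = 4, $t_{RRD}$ = 4, $t_{FAW}$ = 20) are illustrative assumptions, not values read from the thesis tables.

```python
def alpha_pa(k, t_bus):
    """Cycles to issue k PRE/ACT commands on a backlogged bus (ceiling assumed)."""
    return k + -(-k // (t_bus - 1))


def t_ia(num_ranks, m_r, t_bus, t_rrd, t_faw):
    """Eq. (3): worst case interference delay for an ACT command (cycles)."""
    delta_ia = alpha_pa(num_ranks, t_bus) - 1
    k = (m_r - 1) // 4  # number of tFAW windows that can be activated after t1
    # Branch 1: tFAW is never activated after t0.
    no_faw = (m_r - 1) * t_rrd + m_r * delta_ia
    # Branch 2: each of the k activations replaces 4*tRRD + 3*delta_ia by tFAW.
    with_faw = k * t_faw + (m_r - 1 - 4 * k) * t_rrd + (m_r - 3 * k) * delta_ia
    # Initial blocking t1 - t0 caused by four ACTs issued just before t0.
    return t_faw - 4 * t_rrd + max(no_faw, with_faw)


print(t_ia(num_ranks=2, m_r=5, t_bus=4, t_rrd=4, t_faw=20))
```

For these inputs $4 t_{RRD} + 3\Delta_{IA} = 22$ exceeds $t_{FAW} = 20$, so the first branch is the larger one, matching the case distinction made in the summary of the proof.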
 
 
4.1.2  CAS-to-Data  
 
We now focus on computing a bound on $t_{CD}$ for a request using rank r. Similarly to the case of $t_{IA}$, we prove that the current request suffers worst case interference when all $M_r - 1$ other requestors have an active CAS command arriving at the same time $t_0$ as the requestor under analysis, which is then serviced last according to FCFS arbitration. Our proof scheme proceeds as follows. We first compute the delay for successive CAS commands of rank r. Specifically, Lemma 1 computes the delay for a read followed by a read and for a write followed by a write (which we denote as $t_{RRD}$ and $t_{WWD}$, respectively; in this section $t_{RRD}$ denotes this read-to-read data delay, not the JEDEC ACT-to-ACT constraint used in Section 4.1.1), while Lemma 2 covers the cases of write-to-read and read-to-write transitions ($t_{WRD}$ and $t_{RWD}$), which are more complex due to the $t_{WTR}$ and $t_{RTW}$ constraints. Then, Lemma 3 computes the delay for the first CAS of rank r issued after $t_0$. Finally, Theorem 3 uses the computed delays to derive the final value of $t_{CD}$. In the figures, the timing constraints that contribute to the worst case latency are shown as solid black horizontal arrows.
 
 
 
Figure 4.4: Read to Read Latency, R = 2 and r = 1
 
 
 
Lemma 1: Assume that the L3 CAS Arbiter for rank r propagates a read command to L2 immediately after a previous read command of rank r is issued (i.e., the L3 CAS Arbiter is backlogged). Then the worst case latency between the completion of data transmissions for the first read command and for the second read command is:

$$t_{RRD} = R\,(t_{BUS} + t_{RTR}) \qquad (4)$$

Similarly, for the case of a write followed by a write, the worst case latency is $t_{WWD} = t_{RRD}$.
 
Proof: We prove the lemma for $t_{RRD}$; the proof for $t_{WWD}$ is equivalent, by exchanging read with write commands and $t_{RL}$ with $t_{WL}$.
Let $t_0$ be the time at which the first read command of rank r is issued; then by definition after $t_0$, $t_{ED} = t_0 + t_{RL} + t_{BUS}$ (see Figure 4.4 above).

Since there are no timing constraints between consecutive read commands of the same rank, the second read command of rank r (dashed boxes in Figure 4.4) could start data transmission at time $t_{SDr} = t_{ED}$ if other ranks were not serviced before it.

After the first read command is issued at time $t_0$, rank r will be re-en-queued at the back of the L3 CAS Arbiter FIFO at time $t_0 + 1$; in the worst case, $R - 1$ ranks can be en-queued before the rank under analysis. Note that whenever another rank issues a CAS command after $t_0$, the value of $t_{ED}$ will be updated; due to the $t_{RTR}$ timing constraint between different ranks, the value of $t_{SDr}$ will instead be updated to $t_{ED} + t_{RTR}$ (see the example in Figure 4.4 after a CAS of rank 2 is issued at time $t_1$). In any case, the condition $t_{SDr} \le t_{ED} + t_{RTR}$ always holds. For this reason, and based on Arbitration Rule 5, each of the other $R - 1$ ranks can issue at most one CAS command before the second read of rank r. Furthermore, each of these $R - 1$ data transmissions (let us say, of rank j) must begin at most $t_{RTR}$ time units after the previous data transmission has finished; otherwise, the condition $t_{SDj} \le t_{ED} + t_{RTR}$ would be violated and rank j could not issue a CAS before rank r according to Rule 5. In summary, at most R CAS commands must be issued, including the second read of rank r, and each data transmission incurs a delay of at most $t_{RTR} + t_{BUS}$. Hence, the lemma follows.
 
 
 
 
 
 
Lemma 2: Assume that the L3 CAS Arbiter for rank r propagates a read command immediately after a write command of rank r is issued. Then the worst case latency between the completion of data transmissions for the write command and for the read command is:

$$t_{WRD} = \max\big( R\,(t_{BUS} + t_{RTR}),\ t_{WTR} + t_{RL} + 2\,t_{BUS} + t_{RTR} - 1 \big) \qquad (5)$$
   
 
Figure 4.5: Write to Read Latency, Case (a) with R = 2 and r = 1
 
Similarly, for the case of a read followed by a write, the worst case latency is:

$$t_{RWD} = \max\big( R\,(t_{BUS} + t_{RTR}),\ t_{RTW} + t_{WL} - t_{RL} + t_{BUS} + t_{RTR} - 1 \big) \qquad (6)$$
 
Proof: We first compute $t_{WRD}$. Let $t_0$ be the time at which the write command of rank r is issued; then by definition, the CAS Arbiters set $t_{ED} = t_0 + t_{WL} + t_{BUS}$ (see Figure 4.5 above). Due to the $t_{WTR}$ constraint, the L3 CAS Arbiter of rank r will also set a time $t_{SDr} = t_{ED} + \Delta$ for the start of the successive read command, with $\Delta = t_{WTR} + t_{RL}$. Since $t_{WTR}$ and $t_{RL}$ are larger than $t_{RTR}$, and differently from Lemma 1, we have $t_{SDr} > t_{ED} + t_{RTR}$. We consider two possible cases.
 
 
 
Case A: In this case, the read command of rank r is delayed by a CAS command of another rank j en-queued after r in the L2 CAS Arbiter FCFS order. This is possible if $t_{SDj} < t_{SDr}$; in the worst case shown in Figure 4.5, $t_{SDj} = t_{SDr} - 1$, resulting in a latency $t_{WRD} = \Delta - 1 + t_{BUS} + t_{RTR} + t_{BUS}$. Note that after the rank under analysis is delayed by a command of j, it will hold that $t_{SDr} = t_{ED} + t_{RTR}$, and thus rank r cannot be delayed by another rank en-queued after it.
 
 
Case B: The read command of rank r is delayed by CAS commands of ranks en-queued before r in the L2 CAS Arbiter FIFO, similarly to the case in Lemma 1. Note that for a rank j to be en-queued before r in the FIFO, the CAS command of rank j must have been propagated to Level 2 before or at time $t_0 + 1$ (dashed arrow for Rank 1 in Figure 4.6 below). We distinguish two sub-cases within Case B. In Case B_1, the CAS command of rank j is not delayed by a $t_{WTR}$ timing constraint. In this case, the data transmission of rank j can start at $t_{ED} + t_{RTR}$; for example, see Rank 3 in Figure 4.6. In Case B_2, a previous write command of rank j has been issued before $t_0$, and the successive read command is thus delayed by the $t_{WTR}$ constraint (Rank 2 in Figure 4.6). In this case, the read command of rank j could be associated with a value $t_{SDj} > t_{ED} + t_{RTR}$. However, since the preceding write command of rank j must have completed its data transmission at least $t_{BUS} + t_{RTR}$ before the write command of rank r completes its data transmission, it must also hold that the difference between $t_{SDj}$ and $t_{SDr}$ is at least $t_{BUS} + t_{RTR}$ (see the dotted boxes in Figure 4.6 below). Hence, rank j alone cannot delay the read command of rank r, unless there are other ranks that can start data transmission at $t_{ED} + t_{RTR}$. In either sub-case, it follows that the read of rank r can only be delayed if other ranks continuously transmit data every $t_{BUS} + t_{RTR}$ time units starting at $t_{ED} + t_{RTR}$. Furthermore, following the same reasoning as in Lemma 1, in this case no rank en-queued after rank r can cause delay on rank r. Hence, we obtain the same expression as for $t_{RRD}$, i.e., $t_{WRD} = R\,(t_{BUS} + t_{RTR})$. Finally, taking the maximum of Cases A and B yields Equation (5).
 
 
 
 
 
 
 
 
 
Figure 4.6: Write to Read Latency, Case (b) with R = 3 and r = 1
 
For $t_{RWD}$, it suffices to note that the distance $\Delta$ between the end of data transmission for the read and the start of data for the successive write is $\Delta = t_{RTW} + t_{WL} - t_{RL} - t_{BUS}$. Again, taking the maximum of Cases A and B yields Equation (6).
 
It is interesting to note that for the DDR3-1333H device in Table 2.1 and for R = 4, the term $R\,(t_{BUS} + t_{RTR})$ in Eqs. (5) and (6) is maximal, meaning $t_{WRD} = t_{RWD} = t_{RRD} = t_{WWD}$; hence, in this condition ROC guarantees a data bus utilization of $t_{BUS}/(t_{BUS} + t_{RTR}) = 2/3$ to a backlogged system. Furthermore, the worst-case latency is completely unaffected by the $t_{WTR}$ and $t_{RTW}$ timing constraints.
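The following Python sketch evaluates Eqs. (4) to (6) and the utilization ratio. The timing values are assumptions in the range of a DDR3-1333 part ($t_{BUS}$ = 4, $t_{RTR}$ = 2, $t_{WTR}$ = 5, $t_{RTW}$ = 7, $t_{RL}$ = 9, $t_{WL}$ = 7 cycles); they are not copied from Table 2.1 and should be replaced by the actual device parameters.

```python
from fractions import Fraction

# Assumed example timings in clock cycles (hypothetical, see lead-in).
T = dict(t_bus=4, t_rtr=2, t_wtr=5, t_rtw=7, t_rl=9, t_wl=7)


def t_rrd_data(num_ranks, t):
    """Eq. (4): read-to-read (and write-to-write) data delay."""
    return num_ranks * (t["t_bus"] + t["t_rtr"])


def t_wrd(num_ranks, t):
    """Eq. (5): write-to-read data delay (max of Case B and Case A)."""
    case_a = t["t_wtr"] + t["t_rl"] + 2 * t["t_bus"] + t["t_rtr"] - 1
    return max(t_rrd_data(num_ranks, t), case_a)


def t_rwd(num_ranks, t):
    """Eq. (6): read-to-write data delay (max of Case B and Case A)."""
    case_a = t["t_rtw"] + t["t_wl"] - t["t_rl"] + t["t_bus"] + t["t_rtr"] - 1
    return max(t_rrd_data(num_ranks, t), case_a)


for ranks in (2, 4):
    print(ranks, t_rrd_data(ranks, T), t_wrd(ranks, T), t_rwd(ranks, T))

# Backlogged data bus utilization when the rank-switching term dominates.
print(Fraction(T["t_bus"], T["t_bus"] + T["t_rtr"]))
```

Under these assumed values, the write-to-read turnaround still dominates with two ranks, while with four ranks all three delays collapse to $R\,(t_{BUS} + t_{RTR})$, which is the effect described in the paragraph above.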
 
Lemma 3: Assume that a CAS of the requestor under analysis in rank r becomes active at time $t_0$, and that at $t_0$ there are $M_r - 1$ other requestors with active CAS commands before it in the L3 CAS Arbiter FCFS order. Then, if the first CAS of rank r issued after $t_0$ is a read, the worst case latency between $t_0$ and the completion of data transmission for the first read command is:

$$t_{RD} = \max\big( t_{RL} + t_{BUS} - 1 + R\,(t_{BUS} + t_{RTR}),\ t_{WTR} + t_{RL} + 2\,t_{BUS} + t_{RTR} - 1 \big) \qquad (7)$$

Otherwise, if the first CAS is a write, the worst case latency is:

$$t_{WD} = t_{RL} + t_{BUS} - 1 + R\,(t_{BUS} + t_{RTR}) \qquad (8)$$
 
Proof: The proof is similar to Lemma 2. The main difference is that now a requestor of another rank could issue a request immediately before $t_0$ and still be en-queued before rank r (see Rank 2 in Figure 4.7 as an example for $t_{WD}$); this contributes the additional delay term $t_{RL} + t_{BUS} - 1$.
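Eqs. (7) and (8) can be evaluated in the same way. As before, this is a sketch with assumed timing values ($t_{BUS}$ = 4, $t_{RTR}$ = 2, $t_{WTR}$ = 5, $t_{RL}$ = 9 cycles) that are not taken from the thesis tables; the function names are hypothetical.

```python
def t_rd(num_ranks, t_bus, t_rtr, t_wtr, t_rl):
    """Eq. (7): latency from t0 to the end of data of a first read CAS."""
    queued_ranks = t_rl + t_bus - 1 + num_ranks * (t_bus + t_rtr)
    write_to_read = t_wtr + t_rl + 2 * t_bus + t_rtr - 1
    return max(queued_ranks, write_to_read)


def t_wd(num_ranks, t_bus, t_rtr, t_rl):
    """Eq. (8): latency from t0 to the end of data of a first write CAS."""
    return t_rl + t_bus - 1 + num_ranks * (t_bus + t_rtr)


# Assumed example timings in clock cycles.
print(t_rd(4, t_bus=4, t_rtr=2, t_wtr=5, t_rl=9))
print(t_wd(4, t_bus=4, t_rtr=2, t_rl=9))
```

The first argument of the maximum in Eq. (7) is exactly the expression in Eq. (8), so $t_{RD} \ge t_{WD}$ holds for any parameter choice.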
 
Theorem 3: The worst case CAS-to-Data latency for a write or read command, respectively, is:
 
 
Proof: We show that the pattern in Lemma 3 results in the worst case latency $t_{CD}$; intuitively, we maximize the number of requestors of rank r that interfere with the requestor under analysis. Hence, we can compute the latency for the first CAS of rank r after $t_0$ as either $t_{RD}$ or $t_{WD}$; for each of the remaining $M_r - 1$ CAS commands of rank r (including the one of the requestor under analysis), we then add a term $t_{RRD}$, $t_{WWD}$, $t_{WRD}$ or $t_{RWD}$ based on the sequence of CAS commands.
 
 
 
Figure 4.7: Initial Write Latency, R = 3 and r = 1
 
Now note that $t_{WRD} \ge t_{RRD} = t_{WWD}$ and $t_{RWD} \ge t_{RRD} = t_{WWD}$; hence, we prove that the worst case sequence is an alternation of read and write commands (also notice that $t_{RD} \ge t_{WD}$, but we prove that the effect of alternating read and write commands on the worst case latency is larger than the effect of starting with a read rather than a write). To conclude the proof, note that if the requestor under analysis issues a read, then in an alternating sequence of $M_r$ commands there are $\lceil (M_r - 1)/2 \rceil$ write-to-read transitions and $\lfloor (M_r - 1)/2 \rfloor$ read-to-write transitions, and vice versa for a write.
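The transition counts used at the end of the proof can be checked by building the alternating sequence explicitly. The sketch below is only a counting aid with hypothetical helper names; it does not reproduce the closed-form expression of Theorem 3.

```python
def alternating_sequence(m_r, last):
    """Alternating CAS sequence of m_r commands that ends with `last` ('R' or 'W')."""
    other = {"R": "W", "W": "R"}
    seq = [last]
    for _ in range(m_r - 1):
        seq.append(other[seq[-1]])
    return seq[::-1]  # built backwards from the command under analysis


def count_transitions(seq):
    """Return (write-to-read, read-to-write) transition counts."""
    pairs = list(zip(seq, seq[1:]))
    return pairs.count(("W", "R")), pairs.count(("R", "W"))


for m_r in (4, 5):
    seq = alternating_sequence(m_r, last="R")
    print(m_r, "".join(seq), count_transitions(seq))
```

When the command under analysis is a read, the last transition is always write-to-read, so the write-to-read count is the rounded-up half of $M_r - 1$ and the read-to-write count is the rounded-down half.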
 
So far, we have presented the theoretical analysis of our memory controller design. In the next chapter, we describe the detailed implementation of our rank-switching, open-row memory controller.
 
Chapter 5
 
Memory Controller Implementation
 
The proposed rank-switching memory controller consists of a front end and a back end, as shown in Figure 5.1. The design accepts requests from N critical and non-critical real-time requestors, where 0 < N <= 32. The front end logic receives memory requests, each carrying a physical address and a request type, and generates the corresponding DRAM commands such as ACT, PRE, REF and CAS. The commands generated by the front end are dispatched to the command queues in the back end. The command queues are the first clocked interface between the front end and the back end. The back end logic is responsible for requestor arbitration, rank arbitration and command arbitration. Each level of arbitration includes sequencers that check whether the chosen command satisfies the DRAM protocol timing. Once the timing check is satisfied, the command is dispatched to the appropriate rank and bank in the physical DDR memory device.
 
 
Figure 5.1: Memory Controller with Front and Back End Logic
 
 
 
The objective of the proposed rank-switching, open-row memory controller design is to provide a bounded worst case latency for critical requestors and average-case bandwidth for non-critical requestors. Both the front end and the back end logic operate in a single clock domain. As can be seen from Figure 5.1, three buses connect the memory controller to the memory device: the DATA bus, the ADDR bus and the CMD bus. The CMD bus carries the main device control signals such as CS, RAS, CAS and WE. Similarly, the ADDR bus carries the device address signals A0 to A15. Further, B0 to B2 select one of the 8 banks and the CS signals select the rank, as per the JEDEC standard. To carry out evaluation and testing, the design was implemented in such a way that the number of requestors, ranks and banks can be customized by the user. Next, the front end is discussed in detail.
 
5.1 Front End Memory Controller
 
 
 
Figure 5.2: Front End Memory Controller
 
The Front End Memory Controller consists of Address Mapping, Command Generator, Refresh Controller and Row Table blocks, as shown in Figure 5.2. A system with N requestors requires N address mapping logic blocks and N command generators in the front end logic shown in Figure 5.2.
 
 
First, the incoming physical addresses from all N requestors go through the N Address Mapping logic blocks, which split each incoming physical address into normalized rank, bank, row and column addresses. Then, the normalized rank, bank, row and column addresses are fed into their respective command generators. The Command Generators are responsible for generating the necessary DDR memory commands such as PRE, ACT, REF and CAS based on the incoming request type (read or write), the physical address and the status of the row table, which indicates whether the row is open or closed. When a request targets a specific rank, bank and row combination, that request entry is entered into the row table. The row table keeps a record of which rows, banks and ranks have been accessed by each incoming request. Using the previous access record in the row table, the command generator is able to generate the right command. Further, maintaining the row status of previously accessed memory requests also enhances command scheduling efficiency. Since private bank mapping is used, every requestor is assigned to one bank or a set of banks in a rank, and two requestors cannot access the same bank in a rank. The number of requestors, N, can range over 0 < N <= 32; for this implementation, however, we considered a maximum of N = 16 requestors.
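The address mapping step can be illustrated with a small bit-slicing sketch. The field order (row, rank, bank, column from most to least significant) and the field widths used here are assumptions made for the example; the thesis text does not state the actual bit layout used by the implementation.

```python
from collections import namedtuple

DramAddress = namedtuple("DramAddress", "rank bank row column")


def map_address(phys_addr, col_bits=10, bank_bits=3, rank_bits=1, row_bits=14):
    """Split a physical address into rank, bank, row and column fields.

    Hypothetical layout, least significant field first:
    column | bank | rank | row.
    """
    fields = []
    for width in (col_bits, bank_bits, rank_bits, row_bits):
        fields.append(phys_addr & ((1 << width) - 1))  # extract lowest field
        phys_addr >>= width                            # move to the next field
    column, bank, rank, row = fields
    return DramAddress(rank=rank, bank=bank, row=row, column=column)


addr = (5 << 14) | (1 << 13) | (3 << 10) | 17
print(map_address(addr))
```

With private bank mapping, a check that the decoded (rank, bank) pair belongs to the issuing requestor could be added on top of this decoding step.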
 
In the proposed design, the row table classifies each request into one of the following three scenarios in order to help the command generators produce the corresponding commands such as PRE, ACT, REF and CAS.
 
Row Conflict: A row conflict occurs when there is a new request to a row in a particular bank which already has a different row open. In this scenario, the command generator generates the following commands in order. First, it generates a PRE command to close the already opened row. Second, it issues an ACT command to open the row for the new request. Finally, it issues the read or write CAS command.
 
Row Miss: A row miss occurs when there is a request to a row in a particular bank of DRAM that does not have any row open. The Command Generator needs to send an ACT command (row open) and then the read or write CAS command.
 
 
Row Hit: A row hit occurs when there is a request to a row in a particular bank of DRAM that has already been opened by a previous request. In this case, there is no need to open the row again, and the Command Generator simply sends the read or write CAS command.
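The three scenarios above map directly onto a small lookup routine. The sketch below models the row table as a Python dictionary keyed by (rank, bank); the names `generate_commands`, `RD` and `WR` are illustrative assumptions, and the hardware implementation is of course not a dictionary.

```python
def generate_commands(row_table, rank, bank, row, is_write):
    """Return the DRAM command sequence for one request and update the row table."""
    cas = "WR" if is_write else "RD"
    open_row = row_table.get((rank, bank))  # None means no row is open
    if open_row is None:
        commands = ["ACT", cas]             # row miss: open the row, then access
    elif open_row == row:
        commands = [cas]                    # row hit: access directly
    else:
        commands = ["PRE", "ACT", cas]      # row conflict: close, open, access
    row_table[(rank, bank)] = row           # open-row policy keeps the row open
    return commands


table = {}
print(generate_commands(table, rank=0, bank=1, row=7, is_write=False))
print(generate_commands(table, rank=0, bank=1, row=7, is_write=True))
print(generate_commands(table, rank=0, bank=1, row=9, is_write=False))
```

The three calls exercise a row miss, a row hit and a row conflict on the same bank, in that order.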
 
 
Refresh Controller
 
Every row in a DRAM needs to be refreshed regularly to avoid data loss, and refresh must be handled in a way that keeps memory access predictable. The refresh process is handled separately by a refresh controller, which generates a refresh command for every refresh period. The refresh period depends on the DDR memory device. The refresh of a memory rank is partitioned into 8,192 smaller refresh operations, and one such refresh operation has to be issued approximately every 7800 ns (64 ms divided by 8192 is 7812.5 ns, rounded here to 7800 ns). In other words, a refresh operation must be performed every 7800 ns on average to refresh the entire DRAM within the 64 ms retention time. This 7800 ns interval is referred to as the refresh interval, $t_{REF}$. Each refresh operation lasts for a time that is referred to as the refresh cycle time, $t_{RFC}$, which depends on the device. For our design simulation, a DDR3-1333H device is used, for which $t_{RFC}$ = 160 ns and $t_{REF}$ = 7800 ns. These parameter values might be different for other memory devices. All of the banks must be pre-charged before a refresh command is issued. During every refresh period, all the banks and ranks need to be refreshed.
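The refresh arithmetic can be reproduced in a few lines. The sketch uses only the figures quoted in the text (64 ms retention, 8192 refresh operations, $t_{RFC}$ = 160 ns, $t_{REF}$ = 7800 ns); the variable names and the overhead calculation are added for illustration.

```python
RETENTION_NS = 64_000_000   # 64 ms retention time
REFRESH_OPS = 8192          # refresh operations per retention period
T_RFC_NS = 160              # refresh cycle time quoted for the DDR3-1333H device
T_REF_NS = 7800             # rounded refresh interval used in the text

exact_interval_ns = RETENTION_NS / REFRESH_OPS
refresh_overhead = T_RFC_NS / T_REF_NS  # fraction of time a rank spends refreshing

print(exact_interval_ns)
print(round(100 * refresh_overhead, 2), "% of time spent in refresh")
```

The exact quotient is 7812.5 ns, so using 7800 ns is slightly conservative; with the quoted $t_{RFC}$, refresh occupies roughly 2% of the time of each rank.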
 
 
 
 
Figure 5.3: Refresh Controller Timing
 
 
 
 
5.2 Back End Memory Controller
 
This section describes the detailed implementation of each logic component used to build the back end logic, as shown in Figure 5.4 below. The back end logic is designed as 3 levels of arbiters organized in a 4-stage pipeline architecture. The levels of arbiters are categorized as requestor arbitration (L1), rank arbitration (L2) and command arbitration (L3). The design was implemented as a 4-stage pipeline in order to increase command throughput by executing operations in all four stages in parallel. Each pipeline stage is separated by a sequential register element. In the next section, we look at the logic behind the command queues, which are the first interface unit of the back end logic.
 
 
Figure 5.4: Back End Memory Controller
 
5.2.1 Command Queues in Stage 4
 
The commands generated by the front end are stored into the command queues, as shown in Figure 5.5. These queues are the first sequential interface to the back end. The command queue controller is specially designed to increase the efficiency of the proposed memory controller in the following manner. In a regular FIFO controller design, a read enable must be sent to the queue before the data can be read out. This customized queue controller is instead designed to operate in a look-ahead manner, where the head of the queue is visible to the receiver, so that the receiver logic is able to decide whether the next command is a PRE, ACT, REF or CAS. This look-ahead feature makes it possible to differentiate CAS commands from PRE, ACT and REF commands. As a result, CAS commands and PRE/ACT commands can be processed in parallel by separate arbiters in order to minimize latency.
 
As outlined in arbitration rule (1A), the command at the head of each command queue is active only if the corresponding timing constraints of the previous commands of the same requestor are satisfied. Only if the command satisfies the timing is it allowed to be fetched from its command queue and propagated to stage 3. If not, the command stays in the command queue until its DDR timing is satisfied. The number of command queues depends on how many requestors are connected to the memory controller. For a system with sixteen requestors, there will be sixteen command queues to store their respective commands. The design was implemented in such a way that the number of command queues can be dynamically configured at run time depending on the number of requestors chosen by the user.
 
 
Figure 5.5: L3 CMD Queue, L3 PRE, ACT Queue and RRD FAW Sequencer
 
5.2.2 PRE, ACT Arbiter and CAS Arbiter in Stage 4
 
Stage 4 consists of two arbiters, namely the PRE ACT Arbiter and the CAS Arbiter, as
shown in Figure 5.5. Both arbiters work in parallel to arbitrate among the commands
waiting at the head of the L3 CMD queues. Arbitration rule (2A) says that a requestor
is put at the back of an L3 PRE ACT Queue in stage 4 as soon as it has an active PRE
or ACT command, and it is removed from the queue once the command is finally
issued by L1. However, the commands in the CMD queues cannot all be dispatched at
once, as rule (2A) assumes. From the hardware implementation point of view, only
one command can be dispatched from the L3 CMD queues per clock cycle. Therefore,
we need an additional arbitration mechanism to choose one PRE or ACT command
per clock cycle from the L3 CMD queues. Similarly, we also need an arbitration
mechanism to choose one CAS command per clock cycle from the CMD queues. The
CAS arbiter arbitrates among the L3 CMD queues that have a CAS command at the
front (head) of the queue. As per arbitration rule (1B), a CAS command does not
become active until the data of the previous CAS command of the same requestor has
been received by the memory controller from the memory device. While the CAS
arbiter is in action, the PRE ACT arbiter arbitrates among the queues that have a
PRE or ACT command at the front (head) of the queue. Both arbiters ignore queues
that do not have any command to be served.
 
Having two separate arbiters arbitrating at the same time plays a major role in
reducing the overall latency of our proposed rank-switching DDR memory controller
design. The L3 PRE ACT Arbiter is based on a modified FCFS arbitration style. This
modified FCFS ensures that when a requestor has an ACT waiting for its $t_{RRD}$ or
$t_{FAW}$ constraints, PRE commands do not have to suffer due to the $t_{RRD}$ or
$t_{FAW}$ constraints. This allows a late-arriving PRE command to propagate. The
CAS arbiter, on the other hand, uses the regular First Come First Served (FCFS)
arbitration style. Let us analyze an example scenario in which the REQ1, REQ2,
REQ3 and REQ4 queues receive the commands ACT, PRE, ACT and CAS
respectively. If a single arbiter were used to handle all the commands, the arbiter
would arbitrate in the order REQ1 => REQ2 => REQ3 => REQ4. The CAS command
waiting at REQ4 would have to wait until the arbiter had finished arbitrating REQ1,
REQ2 and REQ3. This is not efficient, and there is no reason to keep the important
CAS command at REQ4 waiting. Instead, this CAS command should be dispatched as
early as possible to reduce the latency in getting the data back from memory. The
proposed design has one arbiter to handle PRE and ACT and another arbiter to
handle CAS, in order to reduce the waiting time of the important CAS commands
inside the command queues and thereby minimize the total latency.
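
The REQ1 to REQ4 scenario can be replayed with a short Python sketch. It assumes that every head command already satisfies its timing and that each arbiter grants one command per clock cycle in queue order; the function names are hypothetical and the model ignores everything downstream of stage 4.

```python
# Head-of-queue commands for the four requestors in the example.
heads = [("REQ1", "ACT"), ("REQ2", "PRE"), ("REQ3", "ACT"), ("REQ4", "CAS")]


def is_cas(entry):
    return entry[1] == "CAS"


def wait_with_one_arbiter(heads, target):
    # A single arbiter grants one queue per cycle, visiting them in order.
    return heads.index(target)


def wait_with_two_arbiters(heads, target):
    # Each arbiter only competes among head commands of its own class.
    peers = [h for h in heads if is_cas(h) == is_cas(target)]
    return peers.index(target)


cas = ("REQ4", "CAS")
one = wait_with_one_arbiter(heads, cas)   # waits behind ACT, PRE, ACT
two = wait_with_two_arbiters(heads, cas)  # granted in the first cycle
```

Under these assumptions the CAS at REQ4 waits three cycles with a single arbiter and none with the split arbiters, which is the saving the text describes.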
 
5.2.3 PRE, ACT, CAS Sequencer in Stage 4
 
When a command is available at the head of the L3 CMD queue, the PRE, ACT, CAS
sequencer in stage 4 starts the timing check and makes sure that the current
command from the requestor satisfies the timing constraints with respect to the
previous command from the same requestor, as shown in Figure 5.5. The look-ahead
nature of the L3 CMD queue allows this sequencer to scan the commands at the head
of the command queues and verify the timing without actually fetching them from the
command queue. Note that both the PRE ACT Arbiter and the CAS arbiter consult
this sequencer during arbitration to determine whether a particular command can be
chosen. Each CMD queue holds the commands coming from a single requestor. This
sequencer checks timing constraints such as RAS to CAS (RCD), CAS Latency (CL),
RC, RAS and RP for commands from the very same requestor. It is important to note
that timing constraints such as ACT to ACT (RRD) or Four Active Window (FAW),
which apply to commands that come from different requestors targeting different
banks, are not verified by this sequencer.
 
As shown in Figure 5.5, the PRE ACT arbiter and the CAS arbiter work in parallel
with the PRE ACT CAS sequencer to check whether a particular command satisfies
its timing before the arbiters can choose that command. If the command does not
satisfy the timing, the arbiter does not choose it and instead moves on to the next
requestor in its arbitration path. The Stage 4 PRE, ACT, CAS Sequencer is built from
a large set of down counters that represent all the DRAM timing constraints shown in
Table 2.1. The timing counters in this sequencer only check the timing constraints for
commands within each requestor. As indicated in Figure 5.4, the feedback shown in
green is a combinational path from level 1 towards level 2 and level 3. This feedback
indicates that a command has been dispatched at level 1, and the acknowledgement is
used by the timing sequencers in level 3 and level 2 to initialize the corresponding
timing counters for the next commands. Timing constraints such as Row to Row
(RRD) or Four Active Window (FAW), which apply to commands that come from
different requestors targeting different banks in the same rank, are not verified by
this timing sequencer in stage 4; they are discussed in Section 5.2.5.
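
A minimal behavioural sketch of such a per-requestor down-counter sequencer is given below. It is a simplification under stated assumptions: only the RCD, RAS and RP constraints are modelled, the timing values are illustrative rather than taken from Table 2.1, and the class name `RequestorSequencer` is hypothetical.

```python
class RequestorSequencer:
    """Down counters for one requestor's private bank (illustrative values)."""

    def __init__(self, t_rcd=3, t_rp=3, t_ras=6):
        self.t = {"RCD": t_rcd, "RP": t_rp, "RAS": t_ras}
        # Cycles that must still elapse before each command kind may issue.
        self.wait = {"ACT": 0, "CAS": 0, "PRE": 0}

    def tick(self):
        # One clock cycle: every non-zero counter counts down by one.
        for kind in self.wait:
            self.wait[kind] = max(0, self.wait[kind] - 1)

    def ready(self, kind):
        return self.wait[kind] == 0

    def issued(self, kind):
        # Feedback from level 1: reload the counters this command starts.
        if kind == "ACT":
            self.wait["CAS"] = self.t["RCD"]  # ACT -> CAS
            self.wait["PRE"] = self.t["RAS"]  # ACT -> PRE
        elif kind == "PRE":
            self.wait["ACT"] = self.t["RP"]   # PRE -> ACT
```

The `issued` method plays the role of the level 1 acknowledgement: counters are reloaded only when the command has actually left the controller.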
 
5.2.4 PRE, ACT Queue in Stage 4
 
Once the DRAM timing constraints have been satisfied in the PRE, ACT, CAS
sequencer, the PRE and ACT commands are propagated to the PRE ACT queue for
temporary storage, as shown in Figure 5.5. This PRE ACT queue receives commands
from requestors targeting different banks within the same rank. The queue is a
custom design and is not based on either a regular First In First Out or Last In First
Out architecture. In a regular queue architecture, the data is fetched on the cycle
after the read enable is received from the receiving block. This PRE ACT queue, in
contrast, is designed so that the next available command is automatically visible
without waiting for the read enable from the receiving block. This look-ahead feature
allows the receiving block, the RRD FAW sequencer, to perform the corresponding
RRD and FAW timing checks on a command even before actually fetching it. Further,
the queue is designed to give a PRE command higher priority than an ACT command
when the ACT is subject to additional delay due to the RRD or FAW timing
constraints. The PRE ACT queue receives ACT commands coming from all four
requestors targeting different banks in the same rank. Therefore, those ACT
commands are expected to satisfy timing parameters such as Row to Row Delay
(RRD) and Four Active Window (FAW).
 
The RRD FAW sequencer evaluates the timing of each ACT command and decides
when the ACT command can be fetched from the PRE ACT queue. While the ACT is
waiting for its corresponding RRD or FAW timer to elapse, there is no reason to hold
PRE commands inside the queue. In this scenario, PRE commands are given higher
priority than ACT commands. To facilitate this priority, the PRE ACT queue is
designed to shift PRE commands up so that they can be released early, as shown in
Figure 5.8. This early release gives a reasonable improvement in the overall latency
of our proposed design. The next section describes in detail how the RRD FAW
sequencer works in parallel with the PRE ACT queue.
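
The up-shifting of PRE commands can be sketched as follows. This is a hedged behavioural model, not the RTL: the class `PreActQueue` and its `release` method are hypothetical, and the single flag `act_blocked` stands in for the RRD, FAW sequencer's verdict on the ACT at the head.

```python
class PreActQueue:
    """PRE/ACT queue that lets a PRE overtake an ACT stalled on RRD/FAW."""

    def __init__(self):
        self._items = []  # oldest entry first: (kind, requestor)

    def push(self, kind, requestor):
        self._items.append((kind, requestor))

    def release(self, act_blocked):
        """Remove and return the next command to send on, or None."""
        if not self._items:
            return None
        head_kind = self._items[0][0]
        if head_kind == "PRE" or not act_blocked:
            return self._items.pop(0)      # ordinary in-order release
        for i, (kind, _) in enumerate(self._items):
            if kind == "PRE":
                return self._items.pop(i)  # oldest PRE is shifted ahead
        return None                        # only stalled ACTs remain
```

With bank privatization each requestor has at most one pending PRE or ACT in this queue, so letting a PRE overtake an ACT of another requestor does not reorder commands within a requestor.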
               
 
 
 
 
 
 
5.2.5 PRE ACT Arbiter and RRD FAW Sequencer in Stage 3
 
This section describes the task of the PRE, ACT arbiter and the RRD, FAW sequencer
shown in stage 3 of Figure 5.5. The arbiter logic arbitrates the commands that are in
the PRE ACT queue. If the front (head) of the queue is an ACT command, the arbiter
chooses it only if the RRD, FAW Sequencer logic reports that the Row to Row Delay
(RRD) and Four Active Window (FAW) constraints are satisfied for that ACT
command. Arbitration rule (2C) says that an active PRE command can always be
issued; an active ACT command could instead be blocked by $t_{RRD}$ or $t_{FAW}$
constraints caused by other requestors in the same rank. Let us now analyse how
rule (2C) is implemented. The RRD and FAW timings of an ACT command are
verified by the RRD, FAW Sequencer logic. Once the first ACT command is inserted
into the L3 PRE ACT queue, the Row to Row Delay (RRD) counter and the Four
Active Window (FAW) counter are initialized to the RRD value and the FAW value
respectively. When the sequencer receives the second ACT command, it checks
whether the RRD counter has already elapsed. If it has not, this logic block waits and
does not send the read enable to the PRE ACT Queue until the RRD timing is
satisfied. Only after the RRD timer has elapsed does it allow the second ACT
command to propagate from the PRE ACT Queue to the next stage in the pipeline. At
the same time, it sends the read enable back to the PRE ACT Queue so that the queue
brings the next available command to its head. This process repeats every time there
is a new ACT command. The FAW counter allows only four ACT commands to pass
within the FAW time window.
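
The RRD and FAW checks can be sketched in Python as below. For brevity the sketch records the issue cycles of recent ACT commands instead of using down counters, which is a modelling substitution and not how the hardware is described; the values `t_rrd=4` and `t_faw=20` and the class name are assumptions for illustration.

```python
from collections import deque


class RrdFawSequencer:
    """Per-rank ACT spacing check using recorded issue cycles."""

    def __init__(self, t_rrd=4, t_faw=20):
        self.t_rrd = t_rrd
        self.t_faw = t_faw
        self.recent = deque(maxlen=4)  # issue cycles of the last four ACTs

    def act_allowed(self, now):
        if self.recent and now - self.recent[-1] < self.t_rrd:
            return False  # tRRD since the previous ACT has not elapsed
        if len(self.recent) == 4 and now - self.recent[0] < self.t_faw:
            return False  # a fifth ACT would fall inside the tFAW window
        return True

    def issue_act(self, now):
        if not self.act_allowed(now):
            raise ValueError("ACT issued before tRRD/tFAW elapsed")
        self.recent.append(now)
```

PRE commands never consult this check, which matches rule (2C): only ACT commands can be held back by other requestors in the same rank.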
 
 
5.2.6 PRE, ACT Arbiter in Level 2
 
As shown in the full system level view of Figure 5.4, there are four PRE ACT queues
in L3. The L2 PRE ACT Arbiter is based on the round robin arbitration architecture,
as explained in Section 3.1.4, Selection of arbiter type. When an L3 PRE ACT queue
has a command at its head waiting to be sent, it first sends a request (REQ) to the
Level 2 PRE ACT Arbiter. In full system level operation, the L2 PRE ACT Arbiter
receives REQs from all four PRE ACT queues residing in level 3. Arbitration rule (2A)
says that a PRE or ACT command is removed once it is issued by L1. This sub-rule is
implemented as follows. The L2 PRE ACT arbiter picks one of the L3 PRE ACT
queues, which represent different ranks. The L2 PRE ACT arbiter then issues the
acknowledgment (ACK) to the chosen L3 PRE ACT queue. Only after receiving the
ACK does the corresponding L3 PRE ACT queue release its PRE or ACT command to
the L2 PRE ACT Arbiter. Note that the REQ and ACK exchange between L3 and L2 is
combinational logic and does not consume any clock cycle. Arbitration rule (4) says
that the level 2 PRE ACT Arbiter can use either FCFS or Round-Robin (RR)
arbitration. The level 2 PRE ACT Arbiter was implemented using RR because it is
easier to implement in hardware than FCFS. The level 2 PRE ACT Arbiter then
dispatches the received command to the next pipeline register, which resides between
the level 2 PRE ACT Arbiter and the level 1 Arbiter.
 
 
Figure 5.6: Scheduling of PRE, ACT CMDs through L3, L2, L1
 
 
 
Let us look in detail at the scheduling of PRE and ACT commands from L3 to L2 to
L1, as shown in Figure 5.6. At clock cycle 0, the L3 PRE ACT queue of rank 0
contains the commands ACT0, ACT0, PRE0, PRE0. Similarly, at clock cycle 0, the L3
PRE ACT queue of rank 1 contains four ACT1 commands. Assume that at clock cycle
0 rank 0 has higher priority than rank 1. Even though both rank 0 and rank 1 send
REQs to the L2 PRE ACT arbiter, the arbiter chooses the L3 PRE ACT queue of rank
0 because of its higher priority. Therefore, the PRE ACT queue of rank 0 takes
ownership of the bus and dispatches the ACT0 command into the L2 PRE ACT
register at clock cycle 1. Next, the L2 PRE ACT register dispatches ACT0 from L2 to
L1 at clock cycle 2. Due to the Row to Row Delay (RRD) between ACT commands of
the same rank, the earliest time that L1 PRE ACT can receive the next ACT0
command is clock cycle 6. Therefore, the earliest time the next ACT0 command could
be dispatched from the L3 PRE ACT queue of rank 0 is clock cycle 4.
 
The moment ACT0 is dispatched by the L1 PRE ACT register at clock cycle 2, the
RRD counter is initialized. The design of this down counter takes into account the
fact that the counter actually receives its initial value on the next clock cycle, so the
initial value is set to RRD – 2. The counter then counts down to zero at clock cycle 5,
where the RRD constraint of the next ACT0 command is checked, so that the
constraint is found to have elapsed at the correct time and the next ACT0 command
is dispatched to L1 PRE ACT at clock cycle 6. Please note that this adjustment of the
initial value is made carefully for all the counters involved in checking the DDR
timing constraints.
 
On the other hand, the L3 PRE ACT queue of rank 1 dispatches its first ACT1
command into the L2 PRE ACT register at clock cycle 2, only after the ACK is given
by the L2 PRE ACT arbiter. Note that rank 1 has lower priority than rank 0 as per
our assumption. Next, the L2 PRE ACT register dispatches the ACT1 command into
L1 at clock cycle 3, after ACT0 from rank 0 has been dispatched. Note that ACT
commands from different ranks do not have any timing constraints between them and
can be processed back to back at clock cycle 2 and clock cycle 3, as shown in
Figure 5.6. A PRE command has no impact on either RRD or FAW timing. Therefore,
if there is a PRE command in the queue, there is no need for it to wait unnecessarily
for either RRD or FAW. Figure 5.6 clearly illustrates the early dispatch of PRE0 at
clock cycle 2 while an ACT0 command is waiting for RRD or FAW at the L3 PRE ACT
queue of rank 0.
 
 
5.2.7 CAS FIFO in Stage 4
 
Once the PRE, ACT, CAS Arbiter & Sequencer unit chooses a CAS command, it
propagates that CAS command to the temporary storage of the CAS FIFO, as shown
in Figure 5.7 below. The CAS FIFO has a variable depth that depends on the number
of requestors connected. When either 4 or 8 requestors are connected to the above
logic unit, the CAS FIFO depth is 4 or 8 respectively. The CAS FIFO does not release
(pop) a CAS command to the next stage downstream until the required DRAM timing
constraints have been successfully checked by the CAS BTB Sequencer, which is
described in the next section. Only when the CAS FIFO is chosen as the winner by
the level 2 CAS Arbiter does it release the CAS command and send it to the level 2
CAS Arbiter, as shown in Figure 5.10. If the CAS FIFO is not chosen as the winner, it
does not release the CAS command. The winner signal coming from the level 2 CAS
Arbiter is used as the read acknowledgement for the CAS FIFO logic, which brings
the next command to the head of the CAS FIFO for the next operation.
 
 
 
Figure 5.7: CAS FIFO and CAS BTB Sequencer
 
 
 
 
 
5.2.8 CAS Arbiter and CAS, BTB Sequencer in Stage 3
 
This logic block interacts directly with the CAS FIFO, as shown in Figure 5.7 above.
Three logic operations, namely CAS arbitration, CAS timing and BTB timing, are
built into this logic block. As per arbitration rule (3), the commands at the head of the
L3 CAS FIFO are arbitrated by the CAS Arbiter unit. The arbiter's selection depends
on the timing checks done by the CAS sequencer and BTB sequencer logic blocks.
Timing constraints such as Write to Read (WTR) and Read to Write (RTW) between
banks are verified by the CAS sequencer.
 
Note that arbitration rule (5) only discusses $t_{SDr}$ and $t_{ED}$; it does not
discuss the BTB timing parameter. It is important to understand why BTB matters
for the processing of CAS commands. BTB stands for Burst to Burst, and the BTB
sequencer calculates the time difference between two data bursts of different
requestors targeting either the same or different banks in the same rank. As per
arbitration rule (5), $BTB = t_{SDr} - t_{ED}$. The calculated BTB value for each L3
CAS is dispatched to the level 2 CAS Arbiter, and this process is repeated by the CAS
BTB sequencers of the other ranks, as shown in Figure 5.8. The L2 CAS Arbiter
receives CAS commands and the corresponding BTB values from the CAS, BTB
Sequencers representing all four ranks or two ranks, depending on the configuration.
In the next section, we will see how the level 2 CAS Arbiter chooses one of the CAS
FIFOs based on both the BTB value received and the priority of the CAS FIFOs.
 
 
5.2.9 CAS Arbiter in Level 2
 
The L2 CAS Arbiter consists of three important logic units: the CAS Arbiter, the BTB
Comparator and the Rank to Rank (RTR) Sequencer. The L2 CAS arbiter arbitrates
among the L3 CAS FIFOs representing different ranks, as shown in Figure 5.8 below.
The arbitration carried out by the L2 CAS arbiter depends not only on the arbitration
policy itself, but also on the outcome of the BTB Comparator and the Rank to Rank
(RTR) sequencer. The factors that decide how the L2 CAS arbiter chooses among the
L3 CAS FIFOs are as follows. If the L2 CAS arbiter made its selection as per a round
robin policy, each L3 CAS FIFO would be visited in a fair and orderly manner,
regardless of which requests have been waiting for a long time in the L3 CAS FIFOs.
Our design of the L2 CAS arbiter instead gives importance to requests that arrived
early and are waiting at the L3 CAS FIFO to be served. It also gives equal importance
to other requests that are waiting for their read to write, write to read or burst to
burst timing constraints to elapse. The important fact is that we cannot use either
fixed priority or round robin policies for the L2 CAS arbiter without considering the
request's arrival time. Therefore, we decided to choose a First Come First Served
structure. Since this arbitration process also depends on Burst to Burst (BTB) values,
the existing FCFS policy needs to be modified; the resulting Modified FCFS (M-FCFS)
is well suited as the arbitration policy for the L2 CAS arbiter. Now, let us look into
the BTB Comparator logic, which is part of the L2 CAS arbiter logic.
 
 
 
 
Figure 5.8: Level 3 CAS BTB Sequencer and Level 2 CAS Arbiter
 
 
 
 
5.2.9.1 BTB Comparator
 
Whenever there are CAS commands available at the L3 CAS FIFOs, the L2 CAS
Arbiter receives the Burst to Burst (BTB) value from the BTB Sequencers of all the
ranks. Note that the BTB value is issued by a BTB sequencer to the L2 CAS arbiter
even if its CAS cannot be issued in that clock cycle. In other words, the BTB value is
issued at every clock cycle, and the BTB comparator compares the BTB values coming
from all the BTB sequencers, as shown in Figure 5.9. The winner is chosen based on
the BTB values and on the priority level of each of the L3 CAS FIFOs at that time.
The winning L3 CAS FIFO is allowed to release its CAS command to the L2 CAS
Arbiter. The other L3 CAS FIFOs, which were not chosen, keep their CAS commands.
Once the CAS command is chosen from a particular L3 CAS FIFO, it still needs to go
through one more timing check, called Rank to Rank (RTR). The RTR Sequencer unit
checks this timing before the CAS command can actually be released by the L3 CAS
FIFO.
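
The selection step can be sketched in Python under the assumption, suggested by the caption of Figure 5.9, that the smallest BTB value wins and that ties are broken by FIFO priority. The tuple layout and the convention that a lower priority number means higher priority are assumptions of this sketch; the RTR check that follows selection is not modelled.

```python
def pick_winner(candidates):
    """Choose the L3 CAS FIFO to acknowledge.

    candidates: list of (rank, btb, priority) for FIFOs holding a CAS,
    where a lower priority number means a higher priority (assumed).
    Returns the winning rank, or None if no FIFO holds a CAS.
    """
    if not candidates:
        return None
    # Smallest BTB first; priority only matters when BTB values are equal.
    return min(candidates, key=lambda c: (c[1], c[2]))[0]


tie = pick_winner([(0, 3, 0), (1, 3, 1)])      # equal BTB: priority decides
smaller = pick_winner([(0, 5, 0), (1, 0, 1)])  # rank 1 has the smaller BTB
```

The two calls correspond to situations in which both ranks report the same BTB value and in which the lower-priority rank reports a smaller one.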
 
 
 
Figure 5.9: Logic to Calculate the Smallest BTB
 
 
5.2.9.2 Rank to Rank (RTR) Sequencer
 
Having completed the BTB comparison, the L2 CAS Arbiter also needs to perform
another timing check, called Rank to Rank. It is important to analyze why this rank
to rank sequencer is needed. A synchronization time is needed for one bus master to
hand off bus ownership to another bus master. This turnaround time is inserted to
account for skew on the bus and to prevent different bus masters from driving the bus
at the same time. To avoid such collisions, a second rank must wait at least $t_{RTR}$
after a first rank has finished using the bus. This synchronization time is called the
Rank to Rank time, RTR.
 
The RTR Sequencer is designed to ensure that the second rank does not drive the
data bus while the first rank is still driving it, avoiding collisions between the data
transfers of different ranks. Within a single rank, the timing conflicts caused by the
direction of data movement are called write to read (WTR) and read to write (RTW).
When consecutive data transfers belong to different ranks, however, the WTR and
RTW conflicts no longer apply. Instead, only the Rank to Rank timing needs to be
verified by the level 2 CAS Arbiter before it dispatches the received CAS command
towards level 1.
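
A possible behavioural reading of the RTR check is sketched below. It tracks when the data bus becomes free and which rank used it last; the class name, the default `t_rtr=2` and the choice to track data-bus cycles rather than command cycles are assumptions of this sketch, and same-rank WTR and RTW checks are left to the level 3 sequencers.

```python
class RtrSequencer:
    """Checks the rank-to-rank turnaround gap on the shared data bus."""

    def __init__(self, t_rtr=2):
        self.t_rtr = t_rtr
        self.last_rank = None
        self.bus_free_at = 0  # first cycle after the last burst ended

    def data_may_start(self, rank, cycle):
        if self.last_rank is None or rank == self.last_rank:
            # Same rank: no turnaround needed here (WTR/RTW checked in L3).
            return cycle >= self.bus_free_at
        # Different rank: wait the extra tRTR turnaround time.
        return cycle >= self.bus_free_at + self.t_rtr

    def record_burst(self, rank, start, length):
        self.last_rank = rank
        self.bus_free_at = start + length
```

A burst from another rank is thus pushed back by `t_rtr` cycles relative to a burst from the rank that last owned the bus.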
 
It is important to further analyse how the CAS commands are scheduled and
propagated from L3 to L2 to L1. Figure 5.10 shows an example scenario that
illustrates the scheduling of CAS commands along with their corresponding BTB
values.
 
 
 
Figure 5.10: Scheduling of CAS CMDs through L3, L2, L1
 
Let us assume the following for our example scenario. A burst size of 1 clock cycle and
read and write latencies of 3 clock cycles are used as a scaled-down view for
illustrative purposes in Figure 5.10. Further, assume that four CAS commands are
stored in the L3 CAS FIFO of rank 0 and three CAS commands are stored in the L3
CAS FIFO of rank 1. Also assume that RDATA2 is presented at clock cycle 1 by rank 2
to illustrate the BTB processing for rank 0 and rank 1. Assume that rank 0 has
higher priority than rank 1.
 
 
 
 
 
At clock cycle 0, due to RDATA2, the BTB values for rank 0 and rank 1 are both
calculated to be 3. Since both BTB values are equal to 3 at clock cycle 0 and rank 0
has higher priority than rank 1, the L3 CAS FIFO of rank 0 is chosen as the winner
by the L2 CAS arbiter. Therefore, the RCAS0 command in the L3 CAS FIFO of rank 0
is dispatched to the L2 CAS register at clock cycle 1. At clock cycle 2, RCAS0 is
further dispatched to L1 CAS for output. The read data, RDATA0, is received at clock
cycle 5, after the RL timing.
 
While RCAS0 is propagating through L3, L2 and L1, the RCAS1 waiting at the L3
CAS FIFO of rank 1 cannot be dispatched to the L2 CAS register until clock cycle 4,
because the Burst to Burst (BTB) spacing needs to be maintained between RDATA0
and RDATA1, as shown in Figure 5.10. The BTB values for both rank 0 and rank 1
remain at 3 for clock cycles 0, 1 and 2 due to the presence of RDATA2 from rank 2.
Once RCAS0 is dispatched at clock cycle 2, the next command at rank 0, WCAS0,
waits until clock cycle 8 to satisfy the read to write (RTW) timing constraint. Due to
this WCAS0, the BTB value for rank 0 is updated at clock cycle 3 to 5, as per the
following equation.
 
$$BTB = t_{RTW} + t_{WL} - t_{RL} - t_{Burst} = 6 + 3 - 3 - 1 = 5$$
 
Similarly, at clock cycle 3, the BTB for rank 1 is calculated to be 0, since there is no
data on the bus at that time for rank 1.
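
The value BTB = 5 for rank 0 can be cross-checked against the timeline of the example with a few lines of Python. The sketch assumes that BTB here measures the number of idle data-bus cycles between the end of RDATA0 and the start of the write data of WCAS0; this reading, the function name and the variable names are assumptions, while the numbers are those of the example.

```python
def btb_read_to_write(t_rtw, t_wl, t_rl, t_burst):
    # Equation from the text: BTB = tRTW + tWL - tRL - tBurst
    return t_rtw + t_wl - t_rl - t_burst


t_rtw, t_wl, t_rl, t_burst = 6, 3, 3, 1  # scaled-down example values
rcas_cycle = 2                           # RCAS0 leaves L1 at cycle 2

# Timeline view of the same quantity.
read_data_end = rcas_cycle + t_rl + t_burst  # first free cycle after RDATA0
wcas_cycle = rcas_cycle + t_rtw              # WCAS0 may issue at cycle 8
write_data_start = wcas_cycle + t_wl         # write data appears on the bus
gap = write_data_start - read_data_end

btb = btb_read_to_write(t_rtw, t_wl, t_rl, t_burst)
```

Both routes give the same number because the RCAS issue cycle cancels out of the timeline difference, leaving exactly the four terms of the equation.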
 
Figure 5.11 shows in detail how the BTB values are calculated. For example, the
RCAS from rank 0 can be dispatched only if the R1_BTB value can be satisfied
between the latest RDATA from rank 1 and the future RDATA from rank 0. T0_MAX
is calculated as the maximum delay of the data from the other ranks. If the RCAS
command at rank 0 follows previous write data, then the write to read (WTR) delay
needs to be included in the BTB calculation. All the relevant equations for the BTB
calculations are shown in Figure 5.11.
 
 
 
 
 
 
Figure 5.11: Example BTB calculation for RCAS CMD from Rank0
 
 
5.2.10 PRE ACT CAS Arbiter in Level 1
 
As can be seen from the system level view in Figure 5.4, the commands from all the
requestors of different ranks arrive simultaneously at this final stage, the level 1 PRE
ACT CAS arbiter. This L1 arbiter arbitrates among the PRE, ACT and CAS
commands coming from all ranks. It is designed as a priority arbiter, since the
proposed design expects CAS commands to be given higher priority than PRE and
ACT commands. Issuing the CAS commands as early as possible gives better read
latency and hence reduces the overall latency. Note that there is only one command
bus, and therefore only a single command can be dispatched out of level 1 towards the
memory device at a time. Since CAS commands get the highest priority, this arbiter is
built with the ability to temporarily store the PRE and ACT commands that arrive at
the same time as CAS commands from different requestors of different ranks. Only
after the CAS is dispatched out of the level 1 arbiter are the stored PRE and ACT
commands dispatched, in the order they arrived. As stated in arbitration rule (6), the
level 1 arbiter sends the ACK back to both level 2 and level 3 every time a command
has been dispatched out of the level 1 arbiter.
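
The priority behaviour of level 1 can be sketched as a small Python model. It assumes at most one CAS and one PRE or ACT arrive per cycle, which follows from the single L2 CAS register and single L2 PRE ACT register but is still a simplification; the class name `Level1Arbiter` and the string command labels are hypothetical.

```python
from collections import deque


class Level1Arbiter:
    """CAS-first arbiter with temporary storage for PRE/ACT commands."""

    def __init__(self):
        self.held = deque()  # PRE/ACT commands waiting behind a CAS

    def cycle(self, cas=None, pre_act=None):
        """Accept this cycle's arrivals and return the command issued."""
        if pre_act is not None:
            self.held.append(pre_act)   # keep arrival order
        if cas is not None:
            return cas                  # CAS wins the single command bus
        if self.held:
            return self.held.popleft()  # oldest stored PRE/ACT goes next
        return None                     # command bus idle this cycle
```

A returned command corresponds to the moment the ACK feedback would be sent to levels 2 and 3.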
 
5.3 Pipeline Implementation of the Memory Controller
 
This section analyses the pipeline implementation of the back end memory
controller. As shown in Figure 5.12, the back end is designed as a three stage
pipeline structure. This section goes into detail on each stage and on the timing
analysis, as listed below.
 
1) Pipeline Stage 4 – Request Arbitration
2) Pipeline Stage 3 – Bank Arbitration
3) Pipeline Stage 2 – Rank Arbitration
4) Pipeline Stage 1 – Command Arbitration
5) Timing Analysis of Pipeline Stages
 
 
 
 
Figure 5.12: Three Stage Pipeline for the backend Memory Controller
 
 
 
 
5.3.1 Pipeline Stage 4 – Request Arbitration
 
Pipeline stage 4 receives the commands from a sequential unit, the CMD queues.
This stage contains the PRE, ACT Arbiter, the CAS Arbiter and the PRE, ACT, CAS
Sequencer. It is important to note that this sequencer performs the timing checks on
commands coming from the same requestor, not between requestors. Once the
arbitration and timing checks are completed, the PRE and ACT commands are stored
in the L3 PRE ACT queue and the CAS commands are stored in the L3 CAS FIFO, as
shown in Figure 5.13. The commands stored in the L3 PRE ACT queue and the L3
CAS FIFO represent four requestors of the same rank.
 
 
          
                    
Figure 5.13: Stage-4 Pipeline         Figure 5.14: Stage-3 Pipeline
 
5.3.2 Pipeline Stage 3 – Bank Arbitration
 
Pipeline stage 3 contains the PRE, ACT arbiter and RRD, FAW sequencer logic block,
which receives commands from the L3 PRE ACT queue of stage 4. Similarly, stage 3
also contains the CAS arbiter and CAS, BTB sequencer, which receive commands
from the L3 CAS FIFO of stage 4. Having completed the arbitration and timing check
tasks, the PRE and ACT commands are dispatched into the L2 PRE ACT register and
the CAS commands are dispatched into the L2 CAS register, as shown in Figure 5.14.
 
 
 
5.3.3 Pipeline Stage 2 – Rank Arbitration
 
Stage 2 has two arbiters: the PRE, ACT Arbiter and the CAS Arbiter. The PRE ACT
Arbiter performs the rank arbitration for the PRE and ACT commands that are
stored in the L3 PRE ACT Queues. At the same time, the CAS Arbiter performs the
arbitration on the CAS commands that are stored in the L3 CAS FIFOs. As can be
seen in Figure 5.15, there are registers between stage 3 and stage 4. These registers
implement the pipelining of the design. It is important to note that these registers are
not buffers that store commands for a longer period; rather, they delay the commands
by just one clock cycle in order to pipeline our design. The actual storage is provided
by the L3 PRE ACT queue and the L3 CAS FIFO, where commands are buffered until
their timing constraints are satisfied and they receive the ACK from the L2 arbiters.
 
 
 
Figure 5.15: Stage 2 Pipeline – Rank Arbitration
 
 
5.3.4 Pipeline Stage 1 – Command Arbitration
 
As shown in the system level view of Figure 5.4, stage 1 receives commands from all
the requestors representing different ranks. This is the final stage, where all
commands from all the requestors of all the ranks come together. Stage 1 contains the
PRE, ACT, CAS Arbiter, which is of priority type. At this level, command arbitration
is performed such that a CAS command is given higher priority than PRE or ACT if
all 3 commands arrive at the same time. Since the stage 1 arbiter is the final stage of
our memory controller design, its output is registered so that sequential signals are
sent to the memory device. This prevents any glitches from the stage 1 combinational
logic from being passed on to the memory device. The stage 1 PRE, ACT, CAS arbiter
also sends the feedback signal to level 3 and level 2 once a command is dispatched out
of this arbiter towards the memory device. This feedback is used by the level 3 and
level 2 timing sequencers to evaluate the timing constraints for the next commands.
 
Up to now, all 3 stages of the pipeline design have been presented. The full
implementation of the front and back end design was done in Verilog RTL, and the
code can be found at [18].
 
 
                                            
Figure 5.16: Stage 1 Pipeline – Command Arbitration
 
 
 
5.3.5 Timing Analysis of Pipeline Stages
 
 
Figure 5.17: Timing Analysis of Pipeline Stages  
 
The detailed timing of each pipeline stage is shown in Figure 5.17. The data arrival
time indicates the time at which the data arrives after being subjected to the
combinational delay of the logic in each stage. The data required time indicates the
latest time by which the data must arrive in order to satisfy the setup time, Ts. The
time durations T4, T3, T2 and T1 indicate the data arrival times, and they include the
logic delay of the corresponding pipeline stages. In order to analyse the time taken by
each stage of the pipeline, Static Timing Analysis (STA) was carried out on the back-end
memory controller using the Xilinx Timing Analyzer tool. As required by the tool, a
User Constraints File (UCF) was created covering the following four timing interfaces:
input PAD to FLOP, FLOP to FLOP, FLOP to output PAD, and input PAD to output PAD.
 
The following results were obtained from the Static Timing Analysis:
T4 = Time taken by Stage 4         = 2.57 ns
T3 = Time taken by Stage 3         = 2.47 ns
T2 = Time taken by Stage 2         = 2.51 ns
T1 = Time taken by Stage 1         = 2.38 ns
Ts = Setup time of the flip-flop   = 0.29 ns
Minimum period                     = 2.86 ns
Maximum frequency                  = 350.00 MHz
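
The reported minimum period is consistent with the slowest stage delay plus the setup
time. The following Python sketch reproduces that arithmetic. It assumes the period is
bounded only by the stage delay and Ts, ignoring clock skew and other terms that a
timing tool may include; the variable names are illustrative.

```python
# Stage delays (data arrival times) and setup time from the STA results, in ns.
stage_delay_ns = {"T4": 2.57, "T3": 2.47, "T2": 2.51, "T1": 2.38}
setup_ns = 0.29

# The slowest stage determines the minimum clock period (assumed model:
# period >= stage delay + setup time).
critical_stage = max(stage_delay_ns, key=stage_delay_ns.get)
min_period_ns = stage_delay_ns[critical_stage] + setup_ns
fmax_mhz = 1000.0 / min_period_ns

# Slack of every stage when clocked at exactly the minimum period.
slack_ns = {s: min_period_ns - (d + setup_ns) for s, d in stage_delay_ns.items()}

print(f"critical stage   : {critical_stage}")
print(f"minimum period   : {min_period_ns:.2f} ns")
print(f"maximum frequency: {fmax_mhz:.2f} MHz")
for stage, slack in slack_ns.items():
    print(f"slack {stage}: {slack:.2f} ns")
```

Under this model the result is about 349.65 MHz, which matches the reported 350 MHz
after rounding.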
   
 
5.3.6 Data Path of the Memory Controller
 
Our proposed memory controller is capable of reading data from the memory device as
well as writing data into the memory device. Both read and write data are transferred
over the 64-bit bi-directional DQ bus, along with the bi-directional DQS strobe signal,
which is used to capture the read data. For a read, the read data DQ and the strobe DQS
are driven by the memory device for each of the requestors, which are mapped to
different banks and ranks. Similarly, for a write, the memory controller drives the
write data on the DQ bus along with the DQS strobe signal. The memory model is designed
to send and receive these data and strobe signals.
 
 
 
Even though the memory controller design is capable of receiving read data from the
memory device, the received read data is not forwarded to the system bus in our design.
We are not concerned with the system bus in our analysis, because the absence of this
additional data processing towards the system bus has no impact on the read and write
latencies that are the concern of this thesis.
 
 
5.3.7 Testing of the Memory Controller
 
Identifying the main corner cases and writing tests that address them is the most
challenging part of the simulation setup. Various read and write memory traces were
used as input stimuli to the design. Once the test cases and inputs were ready, the
next important task was to design the memory model that interacts with our memory
controller. The memory model was designed to mimic the interface of a DDR3 memory
device. Next, the test bench structure was designed, in which the Design Under Test
(DUT) and the memory model are instantiated. The test bench also includes the proper
clock and reset generation. For design entry, simulation and synthesis, the Xilinx ISE
Design Suite, v14.4, was used.
 
To obtain the simulation results, we implemented the entire memory controller in
Verilog RTL: the front end for command generation and the back end for arbitration and
memory timing checks. Our implementation uses a fully pipelined architecture with four
stages to increase the hardware speed. We synthesized the design for a Xilinx Kintex-7
FPGA and obtained a maximum command bus clock frequency of 350 MHz. While this
frequency is lower than the 666 MHz frequency used in our simulations, we argue that an
ASIC implementation would result in significantly higher speed. The next chapter
details the hardware simulation setup and the evaluation results.
 
 
 
 
 
Chapter 6
 
Evaluation  
 
The evaluation of our rank-switching, open-row memory controller design was carried
out through hardware simulation for various memory configurations. We evaluate the
performance of our memory controller using the CHStone and SPEC benchmarks. Since AMC
results are only suitable for critical requestors, we compare the latency results using
critical requestor arbitration, with the CHStone and SPEC benchmarks as inputs.
Further, simulation of non-critical requestor arbitration was also carried out to
measure the throughput for non-critical tasks. The hardware simulation used a
DDR3-1333H memory device with a data bus size of 64 bits. This analysis considers a
memory device with either 2 ranks of 8 banks each or 4 ranks of 4 banks each, which
allows us to assign one bank to each requestor.
 
 
6.1 Synthetic Benchmark Results
 
Synthetic benchmarks are used to show how the worst-case analytical bound varies as a
function of the benchmark's parameters. Since the latency bound is a function of the
number of open/close and load/store requests performed by the requestor under analysis,
we plot the average per-request worst-case latency in nanoseconds (y-axis) for a
synthetic task as we vary the row hit ratio (percentage of open requests, x-axis) while
fixing the percentage of store requests at 20%. Figures 6.1 and 6.2 plot the results
for a data bus width of 64 bits with 16 and 8 requestors, respectively. Figure 6.3
shows the case of 8 requestors and a 32-bit bus. In all three figures, AMC's plot is
constant because AMC uses a close-row policy; hence its latency does not depend on the
row hit ratio. When the number of requestors and ranks is increased, our approach
performs comparatively much better. For 8 requestors with a 32-bit bus and a 0% row hit
ratio, AMC still has 50% higher latency than the ROC 4-rank scenario. Similarly, the
latency from paper [2] is at least 50% higher than that of ROC with 4 ranks in all
cases. Note that the synthetic results do not account for the refresh latency.
 
 
 
 
Figure 6.1: Synthetic 16 Requestors 64 bits data bus result
 
 
 
Figure 6.2: Synthetic 8 Requestors 64 bits data bus result
 
 
       
 
 
Figure 6.3: Synthetic 8 Requestors 32 bits data bus result
 
 
6.2 Latency of open and close memory read access
 
A memory access can be one of four types: open read, close read, open write or close
write. In this experiment, the worst-case latency for memory reads and memory writes is
derived from the simulation for the requestor under analysis, REQ0, while the other
requestors are used to apply interference. The latency is the time from the moment the
read command is sent out from the front end until the read data is received by the
back-end logic from the memory device. Every time a command is dispatched from the
front end, it is identified as either an open read or a close read. For both open reads
and close reads, the latency is captured and recorded into an output file. Out of all
the captured latencies, only the worst-case value in each memory access category is
considered in this analysis and compared against the synthetic analysis results. We
only consider open reads and close reads since read latency is much more critical than
write latency. Therefore, the theoretical values are compared with the hardware
simulation results for open reads and close reads in the 0% write scenario. The
experiment was carried out for memory configurations of 16 requestors with 4 ranks and
with 2 ranks, as shown in Table 6.1. We experimented with many different benchmarks
from the CHStone family to obtain the worst-case latency for open reads and close
reads. Note that the theoretical analysis results do not include the delay caused by
refresh. Therefore, we turned off the refresh operation in the hardware simulation in
order to make a fair comparison with the theoretical results. For the hardware
simulation, the open and close latencies are extracted as a number of clock cycles and
multiplied by the clock period of 1.5 ns to obtain the actual delay shown in Table 6.1.
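
The cycle-to-time conversion described above can be sketched in Python as follows. The
helper names and the sample cycle counts are hypothetical and are not taken from the
simulation output; the sample is chosen only so that its maximum (148 cycles)
corresponds to 222.0 ns at the stated 1.5 ns clock period.

```python
from typing import Iterable

CLOCK_PERIOD_NS = 1.5  # clock period stated in the text


def cycles_to_ns(cycles: int, period_ns: float = CLOCK_PERIOD_NS) -> float:
    """Convert a latency measured in clock cycles to nanoseconds."""
    return cycles * period_ns


def worst_case_ns(latencies_in_cycles: Iterable[int]) -> float:
    """Keep only the worst (largest) captured latency of one access category."""
    return cycles_to_ns(max(latencies_in_cycles))


# Hypothetical captured open-read latencies, in cycles (illustration only).
sample_open_read_cycles = [97, 120, 148, 131]
print(f"worst-case open read: {worst_case_ns(sample_open_read_cycles):.1f} ns")
```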
 
                             REQs = 16; Ranks = 2            REQs = 16; Ranks = 4
                             Theoretical   HW                Theoretical   HW
                             Analysis      simulation        Analysis      simulation
Open Read (100 % row hit)    230.5 ns      222.0 ns          162.5 ns      136.5 ns
Close Read (0 % row hit)     364 ns        277.5 ns          278 ns        211.5 ns

Table 6.1: Latency of open and close accesses with 0 % write
 
6.3 Simulation of Critical tasks only
 
In this case, all 16 requestors are assigned to critical tasks. Figure 6.4 shows memory
Configuration 1, where the first requestor, named REQ0, is the requestor under analysis
and the remaining requestors, REQ1 to REQ15, are used to provide extensive interference
to REQ0. The requestor under analysis receives memory request inputs from the CHStone
benchmark family, which includes mips, adpcm, aes, bf, gsm, dfadd, dfdiv, dfmul, dfsin,
jpeg, motion and sha. The other requestors, REQ1 to REQ15, receive memory request
inputs from the LBM benchmark of the SPEC family to provide timing interference to
REQ0. After the simulation completes, the total execution time for REQ0 is derived from
the simulation output.

This process was repeated for each of the twelve memory traces of the CHStone benchmark
family. Having obtained the execution time for each benchmark, we plot the execution
time against each of the CHStone benchmarks. This process is repeated for all four
configurations listed below, in which all requestors are assigned to critical
applications.
 
1. Configuration 1:  Requestors = 16;  Ranks = 4;  Banks = 4
2. Configuration 2:  Requestors = 16;  Ranks = 2;  Banks = 8
3. Configuration 3:  Requestors = 8;   Ranks = 4;  Banks = 2
4. Configuration 4:  Requestors = 8;   Ranks = 2;  Banks = 4
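
In each configuration the number of requestors equals ranks times banks per rank, so
every requestor can own one private bank. The Python sketch below checks this and shows
one possible assignment. The rank-major mapping in `private_bank` is a hypothetical
choice for illustration; this chapter does not spell out the exact mapping used by the
controller.

```python
from typing import NamedTuple, Tuple


class Config(NamedTuple):
    requestors: int
    ranks: int
    banks_per_rank: int


# The four configurations listed above.
CONFIGS = {
    1: Config(16, 4, 4),
    2: Config(16, 2, 8),
    3: Config(8, 4, 2),
    4: Config(8, 2, 4),
}


def private_bank(req_id: int, cfg: Config) -> Tuple[int, int]:
    """Return (rank, bank) for a requestor; rank-major order is an assumption."""
    if not 0 <= req_id < cfg.requestors:
        raise ValueError("requestor id out of range")
    return divmod(req_id, cfg.banks_per_rank)


for number, cfg in CONFIGS.items():
    # One private bank per requestor requires exactly ranks * banks requestors.
    assert cfg.ranks * cfg.banks_per_rank == cfg.requestors
    print(number, [private_bank(r, cfg) for r in range(cfg.requestors)])
```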
 
 
 
 
 
Figure 6.4: Simulation setup for Memory Configuration 1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
  
  
  
  
  
  
  
Figure 6.5: CHStone: 16 Requestors with Rank 4 and Rank 2

 
 
Figure 6.6: CHStone: 8 Requestors with Rank 4 and Rank 2

 
Figures 6.5 and 6.6 show the total execution time (y-axis) of the requestor under
analysis (REQ0) for each of the memory traces (x-axis) in the CHStone benchmark family.
Figure 6.5 is obtained with 16 requestors issuing memory requests to the memory
controller, for both the 4-rank and 2-rank scenarios. Similarly, Figure 6.6 is obtained
with 8 requestors issuing memory requests to the memory controller, for the 4-rank and
2-rank scenarios. The y-axis is the execution time of the benchmarks normalized against
the worst-case analytical bound of our published paper [2]. The T-bars are the
worst-case analytical bounds, while the shaded rectangular boxes are the simulation
results. In terms of analytical bounds, our memory controller with 4 ranks and with 2
ranks performs well compared to the theoretical results of our published paper [2]. It
also performs well compared to AMC. However, as can be seen from Figures 6.5 and 6.6,
the latency from hardware simulation is higher than the software simulation results.
This is because the hardware design uses three pipeline stages with four levels of
arbitration. The difference between simulated and analytical time is always quite small
for AMC, less than 10%. In contrast, our simulated time is significantly lower than the
analytical bounds. This is because the analysis assumes a precise worst-case pattern of
interfering requests by other requestors. The probability that such a pattern is
produced at run time is very low, albeit non-zero.
 
6.4 Simulation of Critical and Non-Critical tasks
 
This section analyses the scenario where both critical and non-critical requestors
issue memory requests to our memory controller at the same time. Memory Configuration 1
is used for the evaluation with the following setup. Rank 1 and Rank 2 are assigned 8
critical requestors, whereas Rank 3 and Rank 4 are assigned non-critical requestors, as
shown in Figure 6.7. The LBM memory trace from the SPEC benchmark family is split into
four sub-files according to address range (64 bytes), and each sub-file is sent to one
of the four banks in Rank 3 and Rank 4. For this mixed simulation, critical tasks are
issued in order and non-critical tasks are issued out of order. It is worth mentioning
how in-order and out-of-order requests are generated at the front end. Each request
arriving at the front end carries the type of request (read or write), the physical
address and a delta delay. The delta delay indicates the time difference between the
previous request and the current request. This delta delay between requests makes them
in-order. To generate out-of-order requests, on the other hand, incoming requests with
a fixed delay are used. This fixed delay for each request enables out-of-order transfer
for the non-critical tasks.
 
 
 
 
Figure 6.7: Simulation setup for Critical and Non-Critical requestors
 
The bandwidth calculation is carried out with both critical (motion) and non-critical
(LBM) requestors sending requests to the memory controller at the same time. As shown
below, the bandwidths for critical and non-critical requestors are calculated to obtain
the total bandwidth.
 
 
 
 
Because our design includes the rank-switching technique, our total bandwidth does not
come close to the theoretical bandwidth value. Our design achieves
5.898/10.664 = 55.3% of the theoretical bandwidth, which is somewhat close to the
expected rate of 66% given the addition of rank-switching logic to our memory
controller design. As future work, the design can be optimized further to obtain higher
bandwidth.
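
The utilization figure can be reproduced with a few lines of Python. The sketch assumes
that 10.664 is the peak bandwidth in GB/s of a 64-bit DDR3-1333 interface (1333
mega-transfers per second times 8 bytes) and that 5.898 is expressed in the same unit;
the variable names are illustrative.

```python
TRANSFERS_PER_SECOND = 1333e6   # DDR3-1333 data rate (assumed interpretation)
BUS_WIDTH_BYTES = 64 // 8       # 64-bit data bus

# Peak (theoretical) bandwidth in GB/s.
peak_gb_s = TRANSFERS_PER_SECOND * BUS_WIDTH_BYTES / 1e9

# Total measured bandwidth quoted in the text; the GB/s unit is an assumption.
achieved_gb_s = 5.898

utilization = achieved_gb_s / peak_gb_s
print(f"peak bandwidth: {peak_gb_s:.3f} GB/s")
print(f"utilization   : {utilization * 100:.1f} %")
```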
 
 
 
 
 
Chapter 7

Conclusion
 
A rank-switching, open-row memory controller design for mixed-critical systems is
presented in this thesis. Our design was built to handle dynamic command scheduling,
while existing memory controllers rely solely on static command scheduling. Existing
memory controllers usually take advantage of the close-row policy to handle the complex
timing constraints easily. In contrast, our objective is to utilize both the open-row
policy and private bank mapping to guarantee a worst-case latency bound for critical
requestors and a minimum average bandwidth for non-critical requestors.
 
Further, our rank-switching mechanism improves the utilization of the data bus by
guaranteeing that consecutive data transfers are spaced by at most one rank-to-rank
transition delay. This delay is shorter than the write-to-read and read-to-write delays
that apply to data transfers within the same rank. As a result, our proposed
rank-switching memory controller design significantly improves the worst-case latency
of memory requests while guaranteeing isolation among requestors.
 
Our evaluation is carried out for both critical and non-critical requestors. The
evaluation of critical requestors demonstrates a reduction in latency. The latency
obtained from our hardware simulation of critical requestors is compared against our
theoretical values and against AMC, and our hardware design performs well. For
bandwidth, we evaluated the bandwidth of the critical and non-critical requestors and
calculated the total bandwidth, which is compared against the theoretical bandwidth.
The evaluation results show that our rank-switching, open-row memory controller
performs well as the number of ranks increases. As future work, the design can be
optimized to achieve higher speed.
 
 
 
 
 
 
 
References
 
[1] Y. Krishnapillai, Z. P. Wu, and R. Pellizzoni, “A Rank-Switching, Open-Row DRAM
Controller for Time-Predictable Systems,” in 26th Euromicro Conference on Real-Time
Systems (ECRTS), 2014, pp. 27–38.

[2] Z. Wu, Y. Krishnapillai, and R. Pellizzoni, “Worst Case Analysis of DRAM Latency in
Multi-Requestor Systems,” in Real-Time Systems Symposium (RTSS), 2013.

[3] Y. Li, B. Akesson, and K. Goossens, “Dynamic Command Scheduling for Real-Time
Memory Controllers,” in Proc. Euromicro Conference on Real-Time Systems (ECRTS), 2014.

[4] L. Ecco, S. Tobuschat, S. Saidi, and R. Ernst, “A Mixed Critical Memory Controller
Using Bank Privatization and Fixed Priority Scheduling,” in Proc. of the 20th IEEE
International Conference on Real-Time Computing Systems and Applications (RTCSA),
August 2014.

[5] D. T. Wang, “Modern DRAM Memory systems: Performance Analysis and Scheduling
Algorithm,” Ph.D. dissertation, University of Maryland at College Park, 2005.

[6] S. Kim, S. Kim, and Y. Lee, “DRAM power-aware rank scheduling,” in ISLPED, 2012.

[7] M. Paolieri, E. Quinones, F. Cazorla, and M. Valero, “An Analyzable Memory
Controller for Hard Real-Time CMPs,” Embedded Systems Letters, IEEE, vol. 1, no. 4,
pp. 86–90, 2009.

[8] B. Akesson, K. Goossens, and M. Ringhofer, “Predator: a predictable SDRAM memory
controller,” in CODES+ISSS, 2007.

[9] S. Goossens, B. Akesson, and K. Goossens, “Conservative Open-row Policy for Mixed
Time-Criticality Memory Controllers,” in DATE, 2013.

[10] J. Reineke, I. Liu, H. D. Patel, S. Kim, and E. A. Lee, “PRET DRAM Controller:
Bank Privatization for Predictability and Temporal Isolation,” in CODES+ISSS, 2011.

[11] B. Akesson, L. Steffens, E. Strooisma, and K. Goossens, “Real-time scheduling
using credit-controlled static-priority arbitration,” in RTCSA, 2008.

[12] R. Bourgade, C. Ballabriga, H. Cass, C. Rochange, and P. Sainrat, “Accurate
analysis of memory latencies for WCET estimation (regular paper),” in RTNS, 2008.

[13] I. Liu, J. Reineke, and E. A. Lee, “A PRET Architecture Supporting Concurrent
Programs with Composable Timing Properties,” in ASILOMAR, 2010.

[14] S. A. Edwards and E. A. Lee, “The Case for the Precision Timed (PRET) Machine,” in
DAC, 2011.

[15] D. Bui, E. A. Lee, I. Liu, H. D. Patel, and J. Reineke, “Temporal isolation on
multiprocessing architectures,” in DAC, 2011.

[16] JEDEC, “DDR3 SDRAM Standard JESD79-3F,” July 2012.

[17] Z. Wu, “Worst Case Analysis of DRAM Latency in Hard Real Time Systems,” MASc
Thesis, University of Waterloo, 2013.

[18] Memory Controller Design code designed for this thesis is available at
http://ece.uwaterloo.ca/rpellizz/techreps/roccode.zip
 
 
 
 
 
 
Appendix A: Design Block Diagrams  
 
 
 
     Back End Block Level View  
 
 
 
 
 
 
 
 
 
Front End Block Level View
 
 
Front and Back End Block Level View
 
Appendix B:  Simulation Output – Example snapshot 
 
 
 
