HAPPY: Hybrid Address-based Page Policy in DRAMs by Ghasempour, Mohsen et al.
HAPPY: Hybrid Address-based Page Policy in DRAMs
Mohsen Ghasempour†, Aamer Jaleel*, Jim Garside† and Mikel Luján†
School of Computer Science, University of Manchester†
NVidia Research*
ABSTRACT
Memory controllers have used static page closure poli-
cies to decide whether a row should be left open, open-
page policy, or closed immediately, close-page policy, af-
ter the row has been accessed. The appropriate choice
for a particular access can reduce the average mem-
ory latency. However, since application access patterns
change at run time, static page policies cannot guar-
antee to deliver optimum execution time. Hybrid page
policies have been investigated as a means of covering
these dynamic scenarios and are now implemented in
state-of-the-art processors. Hybrid page policies switch
between open-page and close-page policies while the ap-
plication is running, by monitoring the access pattern of
row hits/conflicts and predicting future behavior. Un-
fortunately, as the size of DRAM memory increases,
fine-grain tracking and analysis of memory access pat-
terns does not remain practical.
We propose a compact memory address-based encoding
technique which can improve or maintain the performance
of DRAM page closure predictors while reducing the
hardware overhead in comparison with state-of-the-art
techniques. As a case study, we integrate
our technique, HAPPY, with a state-of-the-art moni-
tor – the Intel-adaptive open-page policy predictor em-
ployed by the Intel Xeon X5650 – and a traditional
Hybrid page policy. We evaluate them across 70 mem-
ory intensive workload mixes consisting of single-thread
and multi-thread applications. The experimental re-
sults show that using the HAPPY encoding applied to
the Intel-adaptive page closure policy can reduce the
hardware overhead by 5× for the evaluated 64 GB mem-
ory (up to 40× for a 512 GB memory) while maintaining
the prediction accuracy.
1. INTRODUCTION
The performance of DRAM is sensitive to the mem-
ory access pattern of the running applications. Tradi-
tionally DRAM controllers have used a static row-buffer
access policy, either open-page or close-page, to decide
whether a row should be left open or closed immediately
after it is accessed. For workloads with high locality
of accesses, open-page works best since the target
row is already open and multiple accesses to that row
can be serviced with one activation. However, for work-
loads with more random memory accesses, close-page is
a better option. In this case a row will be closed im-
mediately after a memory access so the next memory
request within the same bank does not need to wait
for the precharge process of the open row. Moreover,
neither the open-page nor close-page policy can deliver
the ‘best’ execution time for all the workloads due to the
dynamic nature of memory accesses. In this situation a
hybrid-page policy, which is a mixture of open-page and
close-page, is more desirable.
Different techniques have been proposed in the lit-
erature to select between open-page and close-page in
DRAM memory controllers. Access-based techniques
are those that monitor and keep a history of the row
hits and row misses at different granularity in DRAMs
to make a prediction of the future page closure pol-
icy. On the other hand, time-based techniques focus
on predicting the optimum time that a row can be left
open. In general, these techniques rely on predictors
that monitor the number of accesses per row, the num-
ber of row hits or row misses, the time between hits or
misses, etc. to predict the open-page or close-page for
each row in DRAM. Intel included two time-based
techniques in the Xeon X5650.
As the size of DRAM is increasing with Data An-
alytic applications, having a fine-grain prediction and
monitoring scheme is inefficient and not scalable. On
the other hand going toward the coarse-grain schemes
reduces the accuracy of the prediction. A key challenge
for page-closure techniques is to balance the hardware
overhead and the prediction accuracy.
The trend towards keeping entire databases in DRAMs,
such as RAMCloud (e.g. 64 TB of DRAMs) [1] or Face-
book using 150 TB of DRAMs with memcache [2], turns
the scalability issue into a critical problem for future
DRAM systems.
Our contribution is a scalable and compact mem-
ory address-based encoding technique, called HAPPY,
that can be employed in DRAM memory controllers.
HAPPY is an efficient encoding that reduces the cost
of implementation of existing page closure techniques
while maintaining the prediction accuracy of the orig-
inal implementation. As case studies, we show how to
integrate HAPPY with a state-of-the-art implementa-
tion – the Intel-adaptive open-page policy employed by
Intel – and with a traditional hybrid-page policy. We
evaluate HAPPY and existing techniques across 70 memory
intensive workload mixes consisting of single-thread
and multi-thread applications. The experimental results
show that using the HAPPY memory address-based encoding
applied to the Intel-adaptive page policy can reduce the
hardware cost of implementation by 5× for the evaluated
64 GB memory while maintaining the prediction accuracy.
In other words, we can achieve performance similar to,
or better than, existing high-performance industry and
academic techniques while requiring less hardware
overhead.
arXiv:1509.03740v1 [cs.AR] 12 Sep 2015
2. BACKGROUND AND MOTIVATION
DRAM Structure: Figure 1 presents a high-level
structure of a typical DRAM organization. A DRAM
device (Figure 1a) consists of multiple banks, each of
which includes a data array and a sense amplifier or
row buffer. The data array is a matrix of rows and
columns comprising the storage cells. The basic oper-
ation of DRAMs requires that to access a specific cell
within a bank the entire row (e.g. 1 KB data) has to be
moved into the row buffer. Then, read or write opera-
tions can be performed on the data stored in the row
buffer. Although the banks within a DRAM device can
be accessed in parallel, they share the communication
bus, so only one bank at a time can transfer data
out of the DRAM device. Each DRAM device typically
supports read/write operations of 4-16 bits per mem-
ory request depending on the DRAM model. To sup-
port the required bandwidth, multiple DRAM devices
work in parallel within a Rank (Figure 1b). In modern
DRAMs, 64-bit data can be read/written per cycle and
typically a burst of size 4 or 8 is supported by these
modules to fill a full cache-line [3, 4].
[Figure 1: DRAM Structure. (a) DRAM Device: banks, each with a data array and a row buffer, behind an 8-bit interface. (b) DRAM Rank: 8-bit DRAM devices combined into a 32-bit interface.]
DRAM Basic Operation: To perform a read or
write operation, the target row has to be opened first
using an activation command which transfers a row to
the row buffer, imposing a delay tRCD. When the row is
in the row buffer, a read or write command can be issued
with a delay tCL. Considering the internal structure of
DRAMs, only one row can be processed at a time. Thus,
to access a different row (within the same bank),
the open row has to be closed first using the precharge
command with a delay tRP. This command prepares
the row buffer to accept the new row. Considering the
basic operation of DRAMs, each memory request can
be classified into one of the following three categories
depending on the status of the bank to be accessed:
page-hit, page-miss or page-empty.
A page-hit is defined as a read/write operation to
an open row within a bank. In this situation there is
no need to use an activation command and the memory
request can be serviced immediately. A page-miss is de-
fined as a read/write operation to a different row than
the open row within a bank. In this situation the open
row must first be closed before accessing the second row.
Finally, a page-empty is defined as a read/write com-
mand to a bank that has no open row in the row buffer.
In this case an activation command is required to open
the target row. Page-misses (also called page conflicts)
are the most expensive memory requests, while page-hits
are the cheapest to service. Page-empties are cheaper
than page-misses but more expensive than page-hits.
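The three request categories and their relative costs can be illustrated with a minimal sketch. The function names and the cycle counts used for tRCD, tCL and tRP are illustrative assumptions, not values from the paper; only the cost composition follows the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Timing:
    tRCD: int  # activation delay (row -> row buffer)
    tCL: int   # read/write command delay
    tRP: int   # precharge delay (closing the open row)


def classify(open_row: Optional[int], target_row: int) -> str:
    """Classify a request from the state of the bank it targets."""
    if open_row is None:
        return "page-empty"
    return "page-hit" if open_row == target_row else "page-miss"


def latency(kind: str, t: Timing) -> int:
    """Delay of one request, composed from the basic DRAM commands."""
    if kind == "page-hit":
        return t.tCL                      # row already open
    if kind == "page-empty":
        return t.tRCD + t.tCL             # activate, then access
    return t.tRP + t.tRCD + t.tCL         # precharge, activate, access


# Illustrative cycle counts only (assumed, not from the paper).
timing = Timing(tRCD=11, tCL=11, tRP=11)
for open_row, target in [(5, 5), (None, 5), (4, 5)]:
    kind = classify(open_row, target)
    print(kind, latency(kind, timing))
```

The ordering page-hit < page-empty < page-miss holds for any positive timing values, which is the property the page closure policies try to exploit.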
DRAM Static Page Closure Policies: DRAM
memory controllers have a page closure policy, to alle-
viate the effect of page-misses on the memory system’s
performance. The traditional schemes are the open-
page and the close-page policy. A DRAM that uses the
open-page policy would leave the last accessed row open
in the row buffer to eliminate the activation cost of the
next memory request to the same row. A DRAM that
uses the close-page policy would close the row immedi-
ately after it has been accessed to eliminate the possibil-
ity of getting a page-miss for the next memory request
[5]. In general, the open-page policy is more beneficial
for the systems with high access locality whereas the
close-page policy is more appropriate for systems with
high entropy memory access. Table 1 presents the tim-
ing cost of page-hits and page-misses when using the
static page closure policies.
Page Policy        Page Hit      Page Miss
Open-Page          tCL           tRCD + tCL + tRP
Close-Page         tRCD + tCL    tRCD + tCL
Static Profiling   tCL           tRCD + tCL
Table 1: Cost of Page-hits and Page-misses when
using different page closure policies.
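As a rough illustration of how the costs in Table 1 trade off, the sketch below computes the average per-request cost of each static policy for a given row-hit rate and the hit rate at which the two policies break even. This is a simplified model of our own (it ignores bank-level parallelism and command overlap), and the timing values are assumed for illustration.

```python
def open_page_cost(hit_rate: float, tRCD: float, tCL: float, tRP: float) -> float:
    """Average cost under open-page: hits pay tCL, misses pay tRCD + tCL + tRP."""
    return hit_rate * tCL + (1.0 - hit_rate) * (tRCD + tCL + tRP)


def close_page_cost(tRCD: float, tCL: float) -> float:
    """Under close-page every request pays tRCD + tCL, hit or miss."""
    return tRCD + tCL


def break_even_hit_rate(tRCD: float, tRP: float) -> float:
    """Hit rate above which open-page is cheaper in this simplified model.

    Derived by solving open_page_cost(h) == close_page_cost for h,
    which gives h = tRP / (tRCD + tRP).
    """
    return tRP / (tRCD + tRP)


# Illustrative timings (assumed): with equal tRCD and tRP the break-even is 50%.
h = break_even_hit_rate(tRCD=11, tRP=11)
print(h, open_page_cost(h, 11, 11, 11), close_page_cost(11, 11))
```

In this model a workload with a row-hit rate above the break-even point favors open-page and one below it favors close-page, which matches the qualitative statement about access locality.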
Motivation: Figure 2 depicts the execution time of all the workloads used in this paper under the open-page and close-page policies, normalised to the open-page policy. The results show that around 68% of workloads prefer the open-page policy while 32% of workloads deliver better performance using the close-page policy. According to this figure, a memory system that employs the open-page policy can save up to 18%, in comparison with the close-page policy, when running ‘libquantum’ from the SPEC benchmark suite and, at the same time, might lose up to 18% when running ‘tigr’ from the BIOBENCH benchmark suite. Therefore, there is almost 40% performance variation in the system depending on the static page policy that a memory controller employs. This motivates the development of dynamic page policies that switch between the open-page and close-page policy at run time based on the application access behavior. The Static Profiling row in Table 1 shows the cost of page-hits and page-misses when the memory controller selects the best static page closure policy for each workload by static profiling of memory accesses. Static profiling provides a baseline to evaluate the performance of the dynamic page closure policies discussed in this paper.
[Figure 2: Performance of Static Page Policies for standard workloads. Y-axis: Normalised Execution Time (0.70 to 1.30); x-axis: workloads; 68% Open Page, 32% Close Page.]
Motivated by observations of this kind, hybrid-page
closure policies emerged. This type of page policy uses
various prediction algorithms to switch dynamically be-
tween open- and close-page according to the applica-
tion access behavior and improve performance. Pre-
diction accuracy and scalability with increasing mem-
ory size are the two main constraints when designing
such page closure predictors. For most techniques in
the literature, there is a linear relationship between the
DRAM memory size and the required resources to mon-
itor the memory access pattern. Thus, as the memory
size grows, the required cost of page closure policy pre-
dictor grows and, as a result, the on-chip memory con-
troller complexity and area increase.
Overall, several trends demand a scalable approach to page closure prediction: the scalability of emerging memory systems like HMCs, and the increasing interest in using large amounts of DRAM instead of disk storage in servers and database analytic applications, such as the 64 TB of DRAM in RAMCloud [1, 6, 2] and the 150 TB of DRAM used by Facebook for memcache [2]. Section 3 introduces HAPPY as a compact encoding scheme to address the scalability problem of page closure policy prediction techniques for DRAM memory systems.
3. HAPPY: HYBRID ADDRESS-BASED PAGE POLICY
3.1 HAPPY – Basic Principles
HAPPY is a compact memory address-based encod-
ing built on the observation that there is a strong cor-
relation between physical address bits and the internal
structure of DRAMs. The basic operation of DRAMs shows
that one of the first steps in accessing the DRAM
structure is the address mapping
process. During this process, the physical address bits
provided by a core are translated to the correspond-
ing channel, rank, bank, row and column of a DRAM
device using a fixed and pre-defined address-mapping
algorithm. Having a fixed translation mapping creates
a strong correlation between physical address bits and
the DRAMs structure. It means that, if some useful
information can be extracted from the physical address
bits after translation, it is possible to extract the same
information before this stage.
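A fixed address mapping of this kind can be sketched as a simple bit-field decode. The field order (row, rank, bank, channel, column, block offset from most to least significant) follows the standard mapping shown later in the paper; the bit widths are our assumption, chosen to match a 4 GB system with 1 channel, 1 rank, 8 banks, 65,536 rows per bank, 128 cache lines per row and an assumed 64-byte cache line.

```python
# Fields from least to most significant bit: (name, width in bits).
# Widths are assumptions for a 4 GB, 1-channel, 1-rank, 8-bank system.
FIELDS = [
    ("offset", 6),    # byte within a 64-byte cache line (assumed line size)
    ("column", 7),    # 128 cache lines per row
    ("channel", 0),   # single channel: no bits needed
    ("bank", 3),      # 8 banks
    ("rank", 0),      # single rank: no bits needed
    ("row", 16),      # 65,536 rows per bank
]


def decode(addr: int) -> dict:
    """Split a physical address into DRAM coordinates using fixed bit fields."""
    fields = {}
    for name, width in FIELDS:
        fields[name] = addr & ((1 << width) - 1)
        addr >>= width
    return fields


example = (0x1234 << 16) | (5 << 13) | (0x2A << 6) | 0x3F
print(decode(example))
```

Because the decode is a pure function of the address bits, any statistic kept per rank, bank or row after translation corresponds to a fixed subset of physical address bits before translation, which is the correlation HAPPY relies on.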
All the page closure algorithms proposed so far fo-
cus on monitoring the memory access behavior after
the translation phase. In general, these techniques use
different performance counters on a per-channel, per-bank
or per-row basis to monitor page hits/conflicts, the time
that a row could be kept open, etc. HAPPY proposes a
novel binary-encoding scheme with performance counters
storing page closure history directly from the physical
address bits. In other words, HAPPY introduces
one predictor per physical address bit to forecast the
page closure policy of each row in the memory system
according to the run-time memory accesses.
Sections 3.2 and 3.3 show how HAPPY can be applied
to the two main page closure categories: access-based
and time-based techniques. We illustrate one traditional
and one state-of-the-art technique to demonstrate how
the HAPPY encoding can be applied to different systems
with different implementation characteristics.
The methodology proposed in this paper can be
applied to other aspects of a DRAM structure as well.
3.2 HAPPY – Access-based Prediction
To demonstrate how HAPPY can be applied to access-
based algorithms we select the traditional hybrid-page
policy algorithm. This employs simple, saturating coun-
ters to monitor the memory access pattern behavior and
dynamically switch between open- and close-page pol-
icy at run time. Figure 3 depicts the basic structure of
such page closure policy predictors.
In this technique, one saturating counter (e.g. a 2-bit counter initialized to zero, i.e. open-page policy) is assigned to each row of a DRAM bank. Every time a row-miss happens the corresponding counter is incremented. Whenever a row-hit occurs the counter is decremented. For each memory request, the accessed row’s counter value determines the page closure policy; if the value is 0 or 1 the open-page policy is predicted and if the value of the counter is 2 or 3 the close-page policy is predicted. However, having a counter for each row in a DRAM device imposes a high area and power overhead on the memory system. For example, a 4 GB DRAM memory system with 1 channel, 2 ranks, 8 banks and 32,768 rows per bank requires 524,288 counters, which is not scalable (analysis presented in Section 5).
[Figure 3: Basic structure of Hybrid Page Policy. Each row of a DRAM bank (Row 0 to Row 32,768) has a 2-bit saturating counter with states OP, Weak OP, Weak CP and CP; a conflict moves the counter towards CP and a hit moves it towards OP.]
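The per-row predictor described above can be sketched as follows. Class and method names are hypothetical; the counter behavior (increment on conflict, decrement on hit, values 0 and 1 predict open-page, 2 and 3 predict close-page) follows the description in the text.

```python
class SaturatingCounter:
    """n-bit saturating counter, initialized to zero (open-page)."""

    def __init__(self, bits: int = 2):
        self.max = (1 << bits) - 1
        self.value = 0

    def increment(self) -> None:
        self.value = min(self.value + 1, self.max)

    def decrement(self) -> None:
        self.value = max(self.value - 1, 0)


class HybridRowPredictor:
    """One saturating counter per row of a bank (traditional hybrid policy)."""

    def __init__(self, rows: int):
        self.counters = [SaturatingCounter() for _ in range(rows)]

    def update(self, row: int, was_conflict: bool) -> None:
        if was_conflict:
            self.counters[row].increment()
        else:
            self.counters[row].decrement()

    def policy(self, row: int) -> str:
        c = self.counters[row]
        # Upper half of the counter range (2 or 3 for 2 bits) predicts close-page.
        return "close" if c.value > c.max // 2 else "open"


predictor = HybridRowPredictor(rows=8)   # tiny bank, for illustration only
predictor.update(3, was_conflict=True)
predictor.update(3, was_conflict=True)
print(predictor.policy(3))
```

The storage cost of this design grows with the number of rows, which is the scalability problem the address-bit encoding is meant to avoid.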
Figure 4 depicts the HAPPY implementation of a Hybrid page policy. The binary representation of the requested physical address is a pattern of zeros and ones. HAPPY dedicates two encoding counters per physical address bit location: one counter to monitor the position when its value is ‘1’ and one to monitor it when it is ‘0’. Training of these counters is similar to the original implementation of Hybrid: for every page conflict the counter corresponding to each physical address bit (selected by that bit's value) is incremented, and for every page hit the same counter is decremented. Since page-hits and page-misses happen on a per-row basis, there is no need to monitor all the available physical address bits. Thus, the physical address bits corresponding to the column and cache-line offset are not used. This further reduces the implementation cost of HAPPY.
[Figure 4: HAPPY implementation of Hybrid Page Policy. The rank, bank and row bits of the physical address (example 0xC66EB) each index a pair of 2-bit saturating encoding counters, one for bit value ‘1’ and one for bit value ‘0’, with states OP, Weak OP, Weak CP and CP.]
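The training step of the HAPPY encoding can be sketched as below. The class name, the choice of tracked bit positions (bits 13 to 31, i.e. 19 non-column bits) and the counter width are assumptions for illustration; the update rule follows the description of the two counters per address bit.

```python
class HappyHybrid:
    """Two saturating counters per tracked address bit: one per bit value."""

    def __init__(self, tracked_bits, counter_bits: int = 2):
        # Positions of the rank/bank/row bits; column and offset bits are skipped.
        self.tracked_bits = list(tracked_bits)
        self.max = (1 << counter_bits) - 1
        # counters[v][i] is the counter of tracked bit i when that bit equals v.
        self.counters = [[0] * len(self.tracked_bits) for _ in range(2)]

    def selected(self, addr: int):
        """Yield (index, bit value) for every tracked bit of this address."""
        for i, pos in enumerate(self.tracked_bits):
            yield i, (addr >> pos) & 1

    def train(self, addr: int, was_conflict: bool) -> None:
        """Increment the selected counters on a conflict, decrement on a hit."""
        for i, v in self.selected(addr):
            c = self.counters[v][i]
            self.counters[v][i] = min(c + 1, self.max) if was_conflict else max(c - 1, 0)


# Assumed layout: bits 13..31 hold bank and row, bits 0..12 column and offset.
happy = HappyHybrid(tracked_bits=range(13, 32))
happy.train(0xC66EB000, was_conflict=True)
print(sum(len(row) for row in happy.counters))  # total number of counters
```

Only the counters selected by the address's own bit values are touched, so two addresses that differ in a given bit train different counters for that bit position.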
Having done this, each physical address bit correlates with the likelihood of getting a page hit/conflict depending on the value of that bit. Therefore, for a given physical address the likelihood of a page hit/conflict can be calculated simply by considering the counter values of all the participating bits in the requested address and using one of the following techniques: Majority vote or Aggregation.
Majority vote: Figure 5 explains this scheme using a simple example. Each physical address bit has a counter which casts its own standalone vote for an open- or close-page policy for the requested physical address. The page closure policy vote of each bit can be extracted by looking at the most significant bit of its saturating counter. If this bit is ‘0’ there is an open-page policy vote and if the value is ‘1’ there is a close-page policy vote. As the name implies, the final decision is determined by the majority vote across all the counters.
[Figure 5: Example of HAPPY Majority vote decision. For physical address 0xC66EB, each address bit selects its ‘0’ or ‘1’ counter and votes OP or CP. Close Page Votes: 9; Open Page Votes: 10; Final Decision: Open Page.]
Aggregation: the final page closure policy decision
can also be determined by comparing the sum of all
the selected counter values, Equation (1), against a
certain threshold, Equation (2).
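\[
\text{Decision} =
\begin{cases}
\text{Open Page}, & \text{if } \sum \text{AddressBitCounters} < \text{Threshold} \\
\text{Close Page}, & \text{otherwise}
\end{cases}
\tag{1}
\]
\[
\text{Threshold} = \frac{\text{AddrBitsWidth} \times \text{CounterValue}}{2}
\tag{2}
\]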
Our experiments show similar results using either
majority vote or aggregation. Thus only the majority
vote decision scheme is used in the final experimental
results.
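Both decision schemes can be sketched on the Figure 5 example. The counter values below are our transcription of the extracted figure (the assignment of the two counter rows to bit values ‘0’ and ‘1’ is our reading, and it reproduces the figure's vote totals). Interpreting CounterValue in Equation (2) as the maximum counter value (3 for a 2-bit counter) and resolving ties in favor of open-page are assumptions.

```python
# Address bits and counter values as read from the Figure 5 example (19 bits).
ADDRESS_BITS  = [1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0]
ZERO_COUNTERS = [0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 3, 0, 3, 0, 0, 0, 0, 1, 3]
ONE_COUNTERS  = [0, 1, 0, 0, 2, 2, 0, 0, 1, 1, 0, 0, 0, 0, 3, 3, 2, 0, 0]
COUNTER_BITS = 2


def selected_counters(bits, zero, one):
    """For each address bit, pick the counter that matches the bit's value."""
    return [one[i] if b else zero[i] for i, b in enumerate(bits)]


def majority_vote(values, counter_bits=COUNTER_BITS):
    """Each counter votes with its most significant bit (1 means close-page)."""
    close = sum((v >> (counter_bits - 1)) & 1 for v in values)
    open_ = len(values) - close
    decision = "close" if close > open_ else "open"   # ties go to open (assumed)
    return decision, open_, close


def aggregation(values, counter_bits=COUNTER_BITS):
    """Equations (1) and (2), with CounterValue taken as the counter maximum."""
    threshold = len(values) * ((1 << counter_bits) - 1) / 2
    return "open" if sum(values) < threshold else "close"


values = selected_counters(ADDRESS_BITS, ZERO_COUNTERS, ONE_COUNTERS)
print(majority_vote(values))   # ('open', 10, 9)
print(aggregation(values))     # 'open' (sum 27 against threshold 28.5)
```

On this example both schemes reach the same open-page decision, which is in line with the observation that the two schemes behave similarly.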
3.3 HAPPY – Time-based Prediction
As a case study to show how HAPPY can be applied
to a time-based prediction algorithm we chose the Intel-
adaptive open-page policy [7, 8] employed by the Intel
Xeon X5650 [9]. The basic structure of such a page
closure policy is presented in Figure 6.
[Figure 6: Intel-adaptive page policy predictor basic structure. Per DRAM bank, a Timeout Counter is compared (COMP) against a Timeout Register to close the row buffer; a Mistake Counter, checked against a low and a high threshold, adjusts the Timeout Register towards a less or more aggressive page closing policy.]
The integrated memory controller used in this In-
tel processor can be configured at boot time to employ
one of the following three different page closure policy
schemes: close-page, fixed-open and Intel-adaptive open
page. The fixed-open page policy keeps a row open for a
fixed period of time and closes it after that. The Intel-
adaptive scheme is an advanced version of the fixed-open
scheme. Similar to the fixed-open policy, in this
structure, each row buffer within a bank has a Time-
out Counter (TC) and a Timeout Register (TR). A row
will be kept open until TC reaches the TR and then
closed. However, the initial TR might not be a suitable
value for all the benchmarks. Thus, the Intel-adaptive
scheme provides a technique to update the TR at run
time using a 4-bit Mistake Counter (MC). Whenever
a page conflict happens that could have been a page-
empty, since there was enough time to precharge the
last accessed row, then the MC is decremented. When-
ever a page empty could have been a page-hit, since
the row being accessed is the same as the last accessed
row in that bank, then the MC is incremented. After a
specific time interval the MC will be checked against a
predefined low and high threshold to see whether a
less or more aggressive page closure policy is required.
If the MC is higher than the high-threshold then the
TR will be incremented to keep the accessed row open
for a longer period and if the MC is lower than the
low-threshold the TR will be decremented to close the
accessed row sooner.
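The adaptation loop described above can be sketched for a single bank as follows. The class name, the threshold values, the initial timeout, the mid-range starting value of the Mistake Counter and its reset at the end of each interval are all assumptions made for illustration; the source only specifies the direction of each update.

```python
class AdaptiveTimeout:
    """Per-bank Timeout Register (TR) tuned by a 4-bit Mistake Counter (MC)."""

    def __init__(self, timeout=8, low=4, high=12, mc_bits=4, tr_max=15):
        self.tr = timeout                     # cycles a row is kept open
        self.tr_max = tr_max
        self.mc_max = (1 << mc_bits) - 1
        self.mc = self.mc_max // 2            # start mid-range (assumed)
        self.low, self.high = low, high       # assumed thresholds

    def conflict_could_be_empty(self) -> None:
        """Row was left open too long: the conflict could have been a page-empty."""
        self.mc = max(self.mc - 1, 0)

    def empty_could_be_hit(self) -> None:
        """Row was closed too early: the page-empty could have been a page-hit."""
        self.mc = min(self.mc + 1, self.mc_max)

    def end_of_interval(self) -> None:
        """Compare the MC against the thresholds and adjust the TR."""
        if self.mc > self.high:
            self.tr = min(self.tr + 1, self.tr_max)   # keep rows open longer
        elif self.mc < self.low:
            self.tr = max(self.tr - 1, 0)             # close rows sooner
        self.mc = self.mc_max // 2                    # reset (assumed)

    def should_close(self, idle_cycles: int) -> bool:
        """True once the Timeout Counter has reached the Timeout Register."""
        return idle_cycles >= self.tr


bank = AdaptiveTimeout()
for _ in range(6):
    bank.empty_could_be_hit()
bank.end_of_interval()
print(bank.tr)
```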
Figure 7 depicts the HAPPY implementation of the
Intel-adaptive open page policy. This time the aim is
to extract the timeout value for each row to be kept
open from the physical address bits. We use the same
methodology explained in the previous section; the only
difference is that, instead of simple saturating counters,
a Monitoring Unit is dedicated to each physical address
bit location. Each Monitoring Unit includes an MC and
a TR with the same function as the original implemen-
tation of Intel-adaptive page policy. A global timeout
counter is still required to keep track of row closing and
opening times on a per bank basis.
[Figure 7: HAPPY implementation of Intel-adaptive page policy predictor. The rank, bank and row bits of the physical address (example 0xC66EB) each index a pair of Monitoring Units (M), one for bit value ‘1’ and one for bit value ‘0’; each Monitoring Unit holds a Mistake Counter with low and high thresholds that adjusts its Timeout Register.]
Updating the MCs is as before, but this time it is applied on a per-physical-address-bit basis rather than per bank. Moreover, instead of having a global timeout register per bank, the time period that a row can be kept open is calculated from the aggregation of the TRs of all the participating bits of the accessed physical address.
To make a fair comparison between HAPPY and the original implementation of this page policy, the size of the MC is chosen as before (e.g. 4-bit) and the size of the TR in each physical bit location is chosen small enough that the maximum time that a row can be kept open is equal to that with one global TR per bank.
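A minimal sketch of the address-bit version is given below. It assumes that "aggregation" means summing the Timeout Registers of the Monitoring Units selected by the address, and it sizes each per-bit TR so that the sum cannot exceed an assumed global maximum; the class name, thresholds, tracked bit positions and reset behavior are likewise assumptions.

```python
class HappyAdaptive:
    """One Monitoring Unit (MC, TR) per tracked address bit and per bit value."""

    def __init__(self, tracked_bits, global_tr_max=76, low=4, high=12, mc_bits=4):
        self.tracked_bits = list(tracked_bits)
        n = len(self.tracked_bits)
        # Per-bit TR range chosen so that the summed timeout stays within the
        # range of a single global TR (assumed sizing rule).
        self.unit_tr_max = global_tr_max // n
        self.mc_max = (1 << mc_bits) - 1
        self.low, self.high = low, high
        # units[v][i] = [mc, tr] for tracked bit i when that bit equals v.
        self.units = [
            [[self.mc_max // 2, self.unit_tr_max // 2] for _ in range(n)]
            for _ in range(2)
        ]

    def _units_for(self, addr: int):
        return [self.units[(addr >> pos) & 1][i]
                for i, pos in enumerate(self.tracked_bits)]

    def record(self, addr: int, delta: int) -> None:
        """delta=+1: page-empty that could have been a hit; -1: conflict that
        could have been a page-empty."""
        for unit in self._units_for(addr):
            unit[0] = max(0, min(self.mc_max, unit[0] + delta))

    def end_of_interval(self) -> None:
        for per_value in self.units:
            for unit in per_value:
                if unit[0] > self.high:
                    unit[1] = min(unit[1] + 1, self.unit_tr_max)
                elif unit[0] < self.low:
                    unit[1] = max(unit[1] - 1, 0)
                unit[0] = self.mc_max // 2        # reset (assumed)

    def timeout(self, addr: int) -> int:
        """Aggregate (sum) the TRs of the units selected by this address."""
        return sum(unit[1] for unit in self._units_for(addr))


happy = HappyAdaptive(tracked_bits=range(13, 32))
print(happy.timeout(0))
```

Addresses that share many bit values with frequently mispredicted addresses inherit most of their adjusted timeout, while unrelated addresses keep their own.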
3.4 HAPPY – Intuition
• Observation: The main intuition behind HAPPY
is based on the observation that addresses that
are spatially close together tend to have a similar
page-closure policy preference. HAPPY is devised
to exploit such behavior by fine grain monitoring
of physical address bits behavior.
• Ensemble Methods: Although HAPPY was de-
veloped by analysis and observation of the exper-
iments, we believe that there are Machine Learn-
ing principles that can justify the intuition be-
hind HAPPY. A mathematical/theoretical frame-
work that can explain HAPPY is that of Ensem-
ble Methods [10]. The family of algorithms cat-
egorized as Ensemble methods combines multiple
(normally simple) predictors. The theory explains
how combining such predictors can yield
a much improved predictor, provided certain diversity
properties among the predictors are met.
Random forests and neural networks are examples
of very successful prediction algorithms that are part
of the ensemble family. This paper addresses an
online learning scenario and uses a fixed number of
predictors with a non-linear combination function.
When applying our technique to Intel-adaptive, we
generate a pair of simple predictors per physical
address bit and solve a regression problem. Each
pair of predictors is trained using a different single
physical bit (different features) and each member
of the pair is trained using only Zero or One oc-
currences (different dataset); mechanisms that can
improve diversity. In HAPPY, we use two counters
per physical address bit; e.g. 4 GB is represented
with 38 counters. If we were to limit ourselves to
only a binary decision, then the maximum number
of possible decisions that can be stored would be 2
to the power of 38. If we increase to 8GB, then we
use 40 counters, and thus the maximum number
of possible decisions that can be stored would be
2 to the power of 40. Thus as we increase the size
of memory we are also increasing the number of
possible decisions representable.
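The counter counts quoted in the paper can be reproduced with a short calculation. The function names are hypothetical, and the sketch assumes power-of-two geometry and that only rank, bank and row bits are tracked (column and offset bits are unused, as stated in Section 3.2).

```python
def index_bits(n: int) -> int:
    """Number of address bits needed to index n items (n a power of two)."""
    return (n - 1).bit_length()


def per_row_counters(ranks: int, banks: int, rows_per_bank: int) -> int:
    """Traditional hybrid policy: one counter per DRAM row."""
    return ranks * banks * rows_per_bank


def happy_counters(ranks: int, banks: int, rows_per_bank: int) -> int:
    """HAPPY: two counters per rank/bank/row address bit."""
    bits = index_bits(ranks) + index_bits(banks) + index_bits(rows_per_bank)
    return 2 * bits


# 4 GB example from Section 3.2: 2 ranks, 8 banks, 32,768 rows per bank.
print(per_row_counters(2, 8, 32768))   # 524288
print(happy_counters(2, 8, 32768))     # 38
# Doubling the memory (assumed here as twice the rows) adds one bit: 40 counters.
print(happy_counters(2, 8, 65536))     # 40
```

The per-row scheme grows linearly with memory size, whereas the address-bit encoding grows with its logarithm.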
4. EVALUATION METHODOLOGY
Page closure prediction algorithms for DRAMs are
sensitive to the application memory access patterns. To
address this, we carry out an extensive evaluation, de-
scribed as follows:
Simulator: USIMM [11], a detailed memory system
simulator, is used as our main simulation platform. We
extended USIMM to support five different, existing page
closure policies (i.e. Open-Page, Close-Page, Hybrid,
Fixed-Open and Intel-adaptive open page) plus the two
implementations of HAPPY which are described in Sec-
tion 3 (i.e. Hybrid-HAPPY and Intel-adaptive-HAPPY).
The scheduling algorithm in the memory controller is
FR-FCFS; first ready, first come first serve. We evalu-
ated HAPPY based on different memory configurations,
2 GB for single-thread and 4 GB for multithread work-
loads. To increase the memory congestion we config-
ured USIMM with 1 channel and 1 rank. The baseline
USIMM system configuration parameters are captured
in Table 2.
Processor
  Clock Speed            3.2 GHz
  Pipeline depth         10
  ROB size               32
DRAM Parameters
  DRAM Size              2 - 4 GB
  Bus Speed              800 MHz
  Configuration          1 Channel, 1 Rank, 8 Banks
  Rows per bank          65,536
  Cache lines per row    128
Table 2: Simulation parameters.
Address Mapping Schemes: The number of page
conflicts in DRAMs and as a result the memory perfor-
mance is susceptible to the memory address mapping
scheme. The experiments consider 3 different address
mappings presented in Figure 8. The first mapping is a standard mapping that maximizes row buffer locality. The next two address interleaving policies are state-of-the-art schemes proposed by Zhang et al. [13] and Kaseridis et al. [12]. The mapping proposed by Zhang et al. [13] XORs part of the row address bits with the bank address bits to produce a new bank index (see Figure 8b). Kaseridis et al. [12] extend this technique by producing the column index using different sections of the physical address bits (Figure 8c). Both techniques aim to reduce page conflicts in DRAMs. Our experiments show that the minimalist open-page scheme (the third mapping, Figure 8c) performs better for most of the workloads. Therefore, this address mapping scheme is employed in all the experiments. Focusing on the best page closure policy (i.e. Intel-adaptive-HAPPY), we also report the sensitivity of HAPPY for all three address mapping schemes.
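The XOR step of the permutation-based mapping can be sketched in a few lines. Which row bits take part in the XOR is not stated in this section, so using the low-order row bits is an assumption, as is the function name.

```python
def permuted_bank(row: int, bank: int, bank_bits: int = 3) -> int:
    """New bank index: bank bits XORed with part of the row address.

    Assumption: the low-order `bank_bits` bits of the row are used.
    """
    mask = (1 << bank_bits) - 1
    return (bank ^ (row & mask)) & mask


# Rows 0..7 that would all collide in bank 0 are spread over all 8 banks.
print([permuted_bank(row, bank=0) for row in range(8)])
```

Spreading rows that share a bank index across different banks is what reduces page conflicts, since accesses that would have competed for one row buffer now land in different ones.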
Workloads: The workloads include a wide range of
memory intensive applications (i.e. 48 workloads) from
different benchmark suites (PARSEC [14], SPEC [15],
BIOBENCH [16], HPC and COMMERCIAL) and rep-
resentative regions of interest for each application. Ta-
ble 3 lists the workloads and their corresponding bench-
mark suites.
The USIMM simulator can run arbitrary multi-application workloads using multiple traces. To increase the variety of memory access patterns, we set up USIMM for multi-application runs to produce 22 random workload mixes; a combination of 4-thread, 8-thread and 16-thread applications. Table 4 lists these multi-core workloads using the prefixes of the single-thread workloads presented in Table 3. Overall the experiments consider 70 workload mixes.
[Figure 8: Different Address Interleaving schemes. (a) Mapping 1: Maximise row-buffer locality, with fields Row, RA, Bank, CH, Column, Block Offset. (b) Mapping 3: Permutation-based Page Interleaving [13], where row bits are XORed with the bank bits. (c) Mapping 4: Minimalist Open-Page Scheme [12], where the column bits are split into two fields (Column and Col) and row bits are XORed with the bank bits.]
Memory Footprint (MF): To evaluate the perfor-
mance of page closure predictors a careful study has to
be carried out otherwise the performance and accuracy
numbers might be misleading. For instance, if an application targets a very small portion of memory then it might be possible to predict its behavior using a very small number of performance counters, whereas if the application accesses the whole memory space then it might be more difficult to keep track of the application access pattern with only a few counters (e.g. HAPPY). To have a fair evaluation methodology we made sure that the memory traces cover a wide range of access patterns. To this end, we monitored the total physical pages accessed (Memory Footprint) per application at run time. Our single-thread applications have an average MF of 30% (up to 97%), our 4-thread workloads have an average MF of 50% (up to 75%), our 8-thread workloads have an average MF of 70% (up to 85%) and our 16-thread workloads have an average MF of 95% (up to 99.8%).
5. RESULTS AND DISCUSSIONS
This section analyzes the different page closure pol-
icy prediction schemes compared with using HAPPY
by looking at execution time, accuracy and scalability.
Before jumping to the result graphs the following sum-
mary might be helpful:
• The HAPPY implementation of Hybrid page policy is called Hybrid-HAPPY for brevity.
• The HAPPY implementation of Intel-adaptive open-page policy is called Intel-adaptive-HAPPY for brevity.
• The results in Figure 10 and Figures 11-14 are normalized to the ‘static profiling’; the lower the bar the better the performance.

Benchmark Suite   Workloads
SPEC              (a) GemsFDTD_r, (b) bzip2_l, (c) cactusADM_b, (d) gcc_2,
                  (e) gcc_cp, (f) gcc_sc, (g) milc_s, (h) soplex_r,
                  (i) xalancbmk_r, (j) libquantum, (k) astar_B, (l) bzip2_t,
                  (m) gcc_1, (n) gcc_c, (o) gcc_g, (p) mcf_r, (q) omnetpp_o,
                  (r) sphinx3_a, (s) zeusmp_z, (t) leslie
PARSEC            (u) canneal, (v) streamcluster, (w) blackscholes,
                  (x) facesim, (y) ferret, (z) fluidanimate, (A) freqmine,
                  (B) swaptions
HPC               (C) hpc1 - hpc13
COMMERCIAL        (D1) comm1, (D2) comm2, (D3) comm3, (D4) comm4, (D5) comm5
BIOBENCH          (E) mummer, (F) tigr
Table 3: Standard workloads and benchmark suites.
5.1 Prediction Accuracy
Understanding the prediction accuracy for the different types of page closure predictors has its pitfalls. For instance, prediction accuracy in the case of Hybrid predictors is straightforward as the prediction outcome is either opening or closing a page (binary classification). However, in the case of the Intel-adaptive technique the accuracy needs to be described based on the timeout value (regression). Consider a scenario where a page has to be open for 40 clock cycles to get a page hit and Intel-adaptive predicts 39 clock cycles. The prediction is off by a single cycle, yet the page hit is lost, so in this case the prediction accuracy should be calculated differently.
To evaluate predictors of such different natures fairly, we calculate the prediction accuracy from the Page-Hit and Page-Miss prediction outcomes. After all, the main purpose of a page policy predictor is to increase the page hits and reduce the page misses in the DRAM. Thus, we calculate the Oracle page-hits (the maximum possible page-hits with a perfect predictor) and the Oracle page-misses (the minimum possible page-misses with a perfect predictor) and compare the actual page-hits and page-misses that occurred during the execution of each workload against the oracle numbers. Figure 9 presents the prediction accuracy (GMEAN) of the different predictors across all the workloads evaluated in this work. The open-page and close-page policies deliver the maximum prediction accuracy for page-hits and page-misses, respectively. This happens because an open-page policy leaves all pages open, so it covers all the possible page-hits in the system and none of the page misses. A close-page policy behaves in the opposite way. The Hybrid page policy delivers moderate page-miss and page-hit prediction accuracy (around 60%). Intel-adaptive and Fixed-Open both deliver good prediction accuracy for page-hits (80% and 75.8%, respectively) and for page-misses (83.5% and 90.4%, respectively). Overall, the HAPPY implementations of both Intel-adaptive and Hybrid are slightly more accurate than the original implementations. These prediction accuracy numbers explain the execution times presented in Figure 10. It can also be concluded from the accuracy results that the page-hit prediction accuracy has a higher impact on the overall execution time than the page-miss prediction accuracy.
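The oracle-relative accuracy described above can be illustrated with a small sketch. It is only one plausible reading of the metric, under simplifying assumptions that are not taken from the paper: a single bank, no timing, and a policy that decides after each access whether to keep the row open. The function names and the example access sequence are hypothetical.

```python
def oracle_decisions(rows):
    """For each access except the last: True if a perfect predictor would
    keep the row open (the next access targets the same row)."""
    return [rows[i + 1] == rows[i] for i in range(len(rows) - 1)]


def accuracy(rows, policy):
    """Return (page-hit accuracy %, page-miss accuracy %) against the oracle.

    `policy(history)` returns True to keep the row open, False to close it.
    """
    hit_total = hit_ok = miss_total = miss_ok = 0
    for i, keep_open in enumerate(oracle_decisions(rows)):
        predicted_open = policy(rows[: i + 1])
        if keep_open:            # an oracle page-hit opportunity
            hit_total += 1
            hit_ok += predicted_open
        else:                    # an oracle page-miss to be avoided by closing
            miss_total += 1
            miss_ok += not predicted_open
    hit_acc = 100.0 * hit_ok / hit_total if hit_total else 0.0
    miss_acc = 100.0 * miss_ok / miss_total if miss_total else 0.0
    return hit_acc, miss_acc


def open_page(history):
    return True      # static open-page: always leave the row open


def close_page(history):
    return False     # static close-page: always close the row


rows = [3, 3, 3, 7, 7, 1, 3, 3]  # hypothetical row indices for one bank
print(accuracy(rows, open_page))   # (100.0, 0.0)
print(accuracy(rows, close_page))  # (0.0, 100.0)
```

Under this reading, the two static policies reproduce the extremes reported for Figure 9: open-page captures every possible page-hit and none of the page-misses, and close-page does the reverse.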
[Figure 9: bar chart; y-axis Prediction Accuracy (%), 0 to 100. Page-Hit: Open-Page 100, Close-Page 0, Hybrid 68.3, Hybrid-HAPPY 68.9, Fixed-Open 75.8, Intel-adaptive 80.0, Intel-adaptive-HAPPY 81.9. Page-Miss: Open-Page 0, Close-Page 100, Hybrid 59.0, Hybrid-HAPPY 61.5, Fixed-Open 90.4, Intel-adaptive 83.5, Intel-adaptive-HAPPY 84.9.]
Figure 9: Prediction accuracy for different predictors.
5.2 Performance Analysis
Figure 10 summarizes the performance of the different prediction algorithms, normalized to ‘Static Profiling’, for all the benchmarks. Each bar in this figure represents the geometric mean (GMEAN) of the execution time over the workloads in each category. The detailed performance of the prediction algorithms for individual workloads is presented in Figures 11–13. These figures again confirm that a static page closure policy cannot deliver the optimum execution time for all the workloads. The HPC and SPEC workloads mostly prefer the open-page policy. On the other hand, the PARSEC, BIOBENCH and COMMERCIAL workloads mostly prefer the close-page policy.
Overview: Our experimental results show that the best page closure prediction scheme (i.e. Intel-adaptive-HAPPY) delivers 5% and 8% better average performance across all the workloads (up to 12% and 20%) in comparison with the open-page and close-page policies, respectively. Overall, the HAPPY implementations of both Hybrid and Intel-adaptive achieve performance similar to the original implementations of these page closure policies, albeit with a much lower hardware overhead. Comparing Intel-adaptive with Intel-adaptive-HAPPY shows that the HAPPY implementation reduces the cost of implementation by 5× for the evaluated 64 GB memory size (up to 40× for a memory size of 512 GB) while improving performance by up to 2%. Similar behavior can be observed for Hybrid and Hybrid-HAPPY: Hybrid-HAPPY shows a 182,000× reduction in the cost of implementation for the evaluated 64 GB memory size (up to 1.2M× for a memory size of 512 GB) while improving performance by up to 3% for some workloads.
[Figures 10–14: bar charts; y-axis Normalised Execution Time. Each chart shows the same eight series: Static Profiling, Open-Page, Close-Page, Hybrid, Hybrid-HAPPY, Fixed-Open, Intel-adaptive and Intel-adaptive-HAPPY.]
[Figure 10 x-axis: SPEC, PARSEC, BIOBENCH, HPC, COMMERCIAL, GMEAN.]
Figure 10: Average relative performance to static profiling for all the single-thread workloads.
[Figure 11 x-axis: hpc1 to hpc13.]
Figure 11: Relative performance to static profiling for HPC workloads.
[Figure 12 x-axis: GemsFDTD_r, astar_B, bzip2_l, bzip2_t, cactusADM_b, gcc_1, gcc_2, gcc_c, gcc_cp, gcc_g, gcc_sc, leslie3d_l, libquantum, mcf_r, milc_s, omnetpp_o, soplex_r, sphinx3_a, xalancbmk_r, zeusmp_z.]
Figure 12: Relative performance to static profiling for SPEC workloads.
[Figure 13 x-axis: black, canneal, face, ferret, fluid, freq, stream, swapt, mummer, tigr, comm1 to comm5.]
Figure 13: Relative performance to static profiling for PARSEC, BIOBENCH and Commercial workloads.
[Figure 14 x-axis: MIX1 to MIX22, GMEAN.]
Figure 14: Relative performance to static profiling for Multi-Core workloads.
Similarly, as Figure 14 presents, for multi-thread ap-
plications, even with a very high MF, HAPPY perfor-
mance is consistent and it delivers similar performance
to the original implementation. The experimental re-
sults show that Intel-adaptive-HAPPY delivers 5% and
14% better average performance across all the work-
loads (up to 9% and 22%) in comparison with open-page
and close-page policy respectively.
Sensitivity to Address Mapping Schemes: To
investigate the sensitivity of HAPPY to different ad-
dress mappings we select the best page closure policy
(i.e. Intel-adaptive-HAPPY) across all the predictors
presented in this paper and evaluate it with the three
address mappings presented in Figure 8. Figures 15 and
16 illustrate the prediction accuracy of Intel-adaptive
(original and HAPPY implementation) using the dif-
ferent mapping schemes. These results show that the
HAPPY implementation of Intel-adaptive always deliv-
ers identical or slightly better results than the origi-
nal implementation no matter which address mapping
is used.
[Figure 15: bar chart; y-axis Prediction Accuracy (%), 0 to 100; x-axis Mapping 1, Mapping 2, Mapping 3; series Intel-adaptive and Intel-adaptive-HAPPY.]
Figure 15: Page-hit prediction accuracy with different
address mappings.
5.3 Sensitivity to Memory Size
We have evaluated HAPPY for DRAM sizes of up to 64 GB and the results show that HAPPY behaves consistently as the memory size increases. Our experimental results suggest that the factor that affects HAPPY performance is the utilization of the memory address space rather than the size of the memory. For this reason, we considered a 4 GB memory organization with up to 99.8% memory space utilization for our multi-thread experiments (results are presented in Figure 14). Even in this situation, the results show that HAPPY delivers competitive performance against the original implementations of both the Hybrid and Intel-adaptive page policies while reducing the hardware overhead significantly.
[Figure 16: bar chart; y-axis Prediction Accuracy (%), 0 to 100; x-axis Mapping 1, Mapping 2, Mapping 3; series Intel-adaptive and Intel-adaptive-HAPPY.]
Figure 16: Page-miss prediction accuracy with different address mappings.
5.4 Scalability with Memory Size
Figure 17 depicts the storage (in bytes) required by each prediction algorithm for different memory sizes. The HAPPY implementation of the Hybrid prediction technique is orders of magnitude (up to 1.2M×) cheaper than the original implementation while delivering similar performance. In the case of the Intel-adaptive page closure policy, the HAPPY implementation requires slightly more resources than the original implementation for memory sizes of less than 8 GB. However, as the memory size grows, Intel-adaptive-HAPPY scales better than the original implementation, requiring up to 40× less storage for a memory size of 512 GB. Table 5 lists the performance counters required by the different page closure policies with and without HAPPY, considering a memory system with X channels, Y ranks, Z banks and W rows.
[Figure 17: line chart; y-axis Required Storage (Bytes), logarithmic scale from 1 to 100,000,000; x-axis Memory Size: 4 GB, 8 GB, 16 GB, 32 GB, 64 GB, 128 GB, 256 GB, 512 GB; series Hybrid, Hybrid-HAPPY, Intel-adaptive and Intel-adaptive-HAPPY.]
Figure 17: Scalability of different page closure
prediction algorithms.
Implementation          Required Counters
Hybrid                  X × Y × Z × W
Hybrid-HAPPY            (log2 X + log2 Y + log2 Z + log2 W) × 2
Intel-adaptive          (X × Y × Z) × 2
Intel-adaptive-HAPPY    (log2 X + log2 Y + log2 Z + log2 W) × 4
Table 5: Required performance counters for different
page closure policies.
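The counter formulas in Table 5 can be evaluated directly. The sketch below does so for two memory organizations; both configurations are hypothetical and chosen only for illustration, so the printed numbers are not expected to match the storage figures reported in this paper. The helper names are likewise assumptions.

```python
def log2_exact(n):
    """Return log2(n) for a power of two; reject anything else."""
    if n < 1 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def required_counters(x, y, z, w):
    """Counters per Table 5 for x channels, y ranks, z banks, w rows."""
    address_bits = sum(log2_exact(v) for v in (x, y, z, w))
    return {
        "Hybrid": x * y * z * w,
        "Hybrid-HAPPY": address_bits * 2,
        "Intel-adaptive": x * y * z * 2,
        "Intel-adaptive-HAPPY": address_bits * 4,
    }


# Hypothetical organizations (not the configurations evaluated in the paper).
small = required_counters(1, 2, 8, 32768)
large = required_counters(4, 4, 16, 131072)
for name in small:
    print(f"{name:22s} {small[name]:>12,d} {large[name]:>14,d}")
```

For the small organization the Intel-adaptive-HAPPY count (76) exceeds the Intel-adaptive count (32), while for the large one it is lower (100 against 512). This mirrors the crossover described for Figure 17. Doubling the number of rows adds one address bit, and therefore only two counters to Hybrid-HAPPY and four to Intel-adaptive-HAPPY.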
Multi-Core Workloads
MIX1: (w-D3-D3-F)
MIX2: (D1-D5-E-F)
MIX3: (D1-x-y-F)
MIX4: (D2-D4-A-F)
MIX5: (D2-D4-j-E)
MIX6: (D2-y-t-F)
MIX7: (D4-D5-g-F)
MIX8: (D4-x-x-F)
MIX9: (x-y-z-F)
MIX10: (C21-C22-C4-b)
MIX11: (C5-C6-u-e)
MIX12: (b-n-s-w)
MIX13: (w-D1-D5-x-y-z-t-F)
MIX14: (D1-D4-D5-x-z-x-B-F)
MIX15: (D1-D4-D5-j-E-D5-v-F)
MIX16: (D2-D4-D5-z-A-A-D4-F)
MIX17: (C5-C6-u-l-e-o-p-h)
MIX18: (C13-C14-C17-C18-C21-C2-C4-v-k-l-c-m-e-n-h-s)
MIX19: (C13-C18-C21-C2-C6-u-v-C21-u-l-l-o-t-p-h-s)
MIX20: (C14-C17-C21-C22-C2-C4-C5-C8-C14-C21-C4-k-e-o-a-p)
MIX21: (C17-C21-u-C17-q-q-i-t-o-b-o-a-t-p-q-i)
MIX22: (C18-C22-C5-C6-C8-u-k-l-d-e-n-o-p-q-h-r)
Table 4: Randomly generated Multi-Core workloads.
5.5 Prediction Algorithms - Weaknesses & Strengths
Because of its implementation and structure, each predictor works well in some situations and poorly in others. Here, we discuss such situations.
Static Policies: the open-page policy works best for
high locality workloads but degrades the performance
of DRAMs significantly for the workloads with highly
random or dynamic memory accesses. The close-page
policy has the completely opposite behavior. PARSEC
and SPEC workloads are good examples which show the
different behavior of open-page and close-page policy.
Fixed-Open: The performance of this type of algorithm is fairly sensitive to its predefined timeout value. Following the methodology presented in [12], in our experiments we set this value equal to tRC, the minimum time between consecutive accesses to different rows within a bank. This delay provides enough opportunity to capture a possible page hit, as the results presented in Figure 10 confirm. However, for non-memory-intensive threads with high locality, or memory-intensive threads with low locality (e.g. ‘mummer’ and ‘tigr’), this technique might not work well. For the first category, the time interval between memory requests might be higher than the fixed timeout value, which means that this technique closes the row before a page hit happens. For the second category, the time interval between memory requests might be lower than the fixed timeout value, which means that a row is kept open for an unnecessarily long time, which most likely leads to further page conflicts.
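The two failure modes of a fixed timeout can be seen in a toy model. The sketch below assumes a single bank, a timer that restarts on every access, and a hypothetical timeout of 39 cycles standing in for tRC; none of these values or names come from the evaluated simulator.

```python
def classify(accesses, timeout):
    """Classify each access of one bank as 'hit', 'empty' or 'conflict'.

    accesses: list of (cycle, row) tuples sorted by cycle.
    timeout:  cycles a row stays open after its last access (assumed model).
    """
    outcomes = []
    open_row = None
    last_cycle = None
    for cycle, row in accesses:
        # The timer expired since the last access: the row was closed.
        if open_row is not None and cycle - last_cycle > timeout:
            open_row = None
        if open_row is None:
            outcomes.append("empty")      # bank precharged, activate only
        elif open_row == row:
            outcomes.append("hit")        # row still open
        else:
            outcomes.append("conflict")   # wrong row open, precharge first
        open_row, last_cycle = row, cycle
    return outcomes


TIMEOUT = 39  # hypothetical stand-in for tRC, in cycles

# High locality but sparse requests: the row closes before each reuse.
print(classify([(0, 5), (100, 5), (200, 5)], TIMEOUT))
# Low locality and dense requests: the open row causes conflicts.
print(classify([(0, 1), (10, 2), (20, 3)], TIMEOUT))
# High locality and dense requests: the case the timeout is tuned for.
print(classify([(0, 5), (10, 5), (20, 5)], TIMEOUT))
```

In the first trace every potential page hit is lost because the gap between requests exceeds the timeout. In the second, the row is still open when a different row is requested, so each access pays for a conflict.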
Hybrid: the saturating counters employed in this category (in either the original or the HAPPY implementation) are trained by the number of page-hits and page-conflicts that they observe. Therefore, the prediction accuracy of these techniques is fairly sensitive to the distribution of page hits/conflicts within DRAMs. ‘streamcluster’, for instance, exhibits this behavior.
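A saturating counter of the kind mentioned above can be sketched as follows. The counter width, initial value and decision threshold are assumptions made for illustration; the text here does not fix them, and the class name is hypothetical.

```python
class SaturatingCounter:
    """n-bit saturating counter trained by page-hits and page-conflicts."""

    def __init__(self, bits=2):
        self.max = (1 << bits) - 1
        self.value = self.max // 2 + 1  # assumed start: weakly "keep open"

    def update(self, page_hit):
        """Move toward 'open' on a page-hit, toward 'close' on a conflict."""
        if page_hit:
            self.value = min(self.max, self.value + 1)
        else:
            self.value = max(0, self.value - 1)

    def predict_open(self):
        """Predict open-page when the counter is in its upper half."""
        return self.value > self.max // 2


counter = SaturatingCounter()
for outcome in [False, False, True, True, True]:  # conflicts, then hits
    counter.update(outcome)
    print(counter.value, counter.predict_open())
```

Because the counter only reflects the recent balance of hits and conflicts, an access stream that alternates between the two keeps it near the threshold, which is one way to read the sensitivity noted above.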
Intel-adaptive: our experiments show that this prediction algorithm is the best of all the schemes presented in this paper. However, one weakness of this technique is the update granularity of TR. In our experiments, every time a check of the MC suggests a more or less aggressive page closure policy, TR is incremented or decremented by one, respectively. Updating by one step at a time (increment or decrement) allows fine tuning of TR but reduces the training rate of the overall prediction technique. This means that, for workloads whose access pattern changes frequently (e.g. between high-locality and low-locality access patterns) across different time phases, the Intel-adaptive scheme might not be able to deliver its best performance. Such behavior can be observed in ‘canneal’ or ‘comm1’.
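The cost of a one-step update granularity can be made concrete with a toy model. The sketch assumes that TR is an integer register and that each periodic check moves it by at most `step` toward a hypothetical ideal value for the current phase. Real hardware does not know that ideal value; it is used here only to count how many checks a phase change would need. The function name and the values are illustrative.

```python
def checks_to_converge(tr, target, step=1):
    """Count periodic checks until the register reaches `target`.

    Each check moves `tr` by at most `step` toward `target` (toy model).
    """
    if step <= 0:
        raise ValueError("step must be positive")
    checks = 0
    while tr != target:
        delta = max(-step, min(step, target - tr))
        tr += delta
        checks += 1
    return checks


# Hypothetical phase change: the suitable register value moves from 4 to 36.
print(checks_to_converge(4, 36, step=1))  # 32 checks with one-step updates
print(checks_to_converge(4, 36, step=8))  # 4 checks with a coarser step
```

A coarser step would adapt faster after a phase change at the price of the fine tuning that the one-step update provides.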
HAPPY: so far we have only explained the advantages of HAPPY. However, like all the other proposed techniques, HAPPY also has weaknesses. Given the global nature of a HAPPY implementation, HAPPY is not expected to perform as efficiently as fine-grain schemes for workloads with fairly dynamic behavior that target a small, local part of the DRAM. This can be seen in workloads like ‘tigr’.
5.6 Flexibility
HAPPY is a proof of concept that the physical address bits can be a source of useful information, which can be extracted using the right encoding and decoding techniques. This makes HAPPY a fairly flexible tool that can be applied to prediction algorithms that have so far been impractical due to their cost of implementation, making them feasible. In this paper we applied HAPPY to two completely different prediction schemes and showed how the performance and scalability of these schemes improved. However, page closure policies are not the only candidates that can take advantage of the knowledge presented by HAPPY in this paper, and we will present more showcases in the near future.
6. RELATED WORK
Prior research in this area can be categorized into two main groups: access-based and time-based techniques.
Access-based techniques are those that monitor and keep a history of the row hits and row misses at different granularities in DRAMs to predict the future page closure policy for each row or bank within a DRAM memory system. Xu et al. [17] proposed a two-level dynamic SDRAM policy predictor which collects the row hit/miss behavior for the last n accesses in a history register. For each entry in the history register, there is a 2-bit saturating counter that keeps track of the page closure policy for each access. Huan et al. [18] proposed the Processor-Directed dynamic page policy, where the processor keeps track of the last row accessed in each bank to predict page hits or misses for future memory requests. The processor sends this information to the memory controller to specify the page closure for the next memory access. Awasthi et al. [19] keep track, in a history table, of the number of accesses each row receives before it is closed. When a row is opened, the number of expected accesses to that row is looked up. If there is no recorded entry for the accessed row in the history table, that row is kept open. However, if there is an entry for the accessed row, it is closed after the expected number of accesses suggested by the history table. More techniques using access-based page closure prediction can be found in [20, 21, 22, 23, 24, 25, 26].
Time-based techniques mainly focus on predicting the optimum time for which a row can be left open. Blackmore [27] presented a quantitative analysis of page closure predictors. This work specifically focused on the Intel-adaptive page policy structure and tried to improve it by introducing the inter-arrival distribution concept. Stankovic et al. [24] used the concepts of live-time and dead-time to predict the page closure. The live-time is the time interval from the opening of a row until the last access to that row, while the dead-time is the interval from the last access to an open row until its closing. If the predictor predicts a zero live-time, or if it predicts that the row has entered its dead-time period, then the row is closed immediately after the DRAM access; otherwise it is kept open for future accesses. In another work, Kaseridis et al. [12] exploited the fact that in DRAMs there is a minimum time of tRC between two activations within a bank, and speculatively leave pages open for the tRC period.
To sum up, HAPPY is the only technique which considers an encoding based on the memory address bits, offering a compact means of storing history to inform the predictions. In addition, we have shown how to apply HAPPY to time-based techniques (i.e. Intel-adaptive-HAPPY) and access-based techniques (i.e. Hybrid-HAPPY).
7. CONCLUSIONS
DRAM performance depends on the memory access pattern and, more specifically, on the number of page-hits and page-conflicts that occur at run time. Modern DRAM controllers employ advanced page closure policy predictors to improve performance by trying to transform page-conflicts into page-empty cases (e.g. by closing the last accessed row at the “right time”), and page-empty cases into page-hits (e.g. by keeping the last accessed row open for a longer time). However, the main challenge is to balance the prediction accuracy of these predictors against manageable hardware overheads (scalability) as the size of DRAM increases.
We have described HAPPY – a compact and efficient
binary-encoding technique – to alleviate the scalability
problem of DRAM page closure predictors. HAPPY
relies on the simple observation that there is a strong
correlation between the physical address bits of mem-
ory addresses requested by processors and the inter-
nal structure of the DRAM as there is a fixed-address
mapping scheme. Thus, the physical address bits carry
the information that a memory controller needs to pre-
dict the page-hit or page-conflict for a particular ac-
cess. Considering this, the required performance coun-
ters and monitoring units needed by the page closure
prediction algorithms can be encoded from the physical
address bits. Doubling the size of DRAM only implies
one extra physical address bit. This means that with
HAPPY only one extra monitoring unit is required to
predict the DRAM page closure policy when the size
of memory is doubled. In other words, HAPPY offers
the smallest hardware overhead to implement dynamic
DRAM page closure predictor algorithms.
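The observation that a fixed address mapping lets the physical address bits identify the bank and row of an access can be sketched as follows. The bit layout below is hypothetical and is not one of the mappings evaluated in this paper. The outcome labels describe what a second access would experience if the row opened by the first access were still open.

```python
# Hypothetical fixed mapping, least significant field first: (name, width).
LAYOUT = [
    ("offset", 6),
    ("column", 7),
    ("channel", 1),
    ("bank", 3),
    ("rank", 1),
    ("row", 16),
]


def decode(addr):
    """Split a physical address into DRAM coordinates using LAYOUT."""
    fields = {}
    for name, width in LAYOUT:
        fields[name] = addr & ((1 << width) - 1)
        addr >>= width
    return fields


def relation(first, second):
    """Outcome of `second` if the row opened by `first` is left open."""
    a, b = decode(first), decode(second)
    if any(a[k] != b[k] for k in ("channel", "rank", "bank")):
        return "different bank"
    return "page-hit" if a["row"] == b["row"] else "page-conflict"


base = 0
print(relation(base, 1 << 6))   # next column, same row: page-hit
print(relation(base, 1 << 18))  # lowest row bit flipped: page-conflict
print(relation(base, 1 << 13))  # channel bit flipped: different bank
```

With this layout the address is 34 bits wide. Doubling the number of rows widens the row field by a single bit, which is the property that HAPPY relies on for its scalability.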
We have evaluated HAPPY by applying it to a tradi-
tional Hybrid page closure policy, as well as the state-
of-the-art Intel-adaptive open page policy included in
Intel Xeon X5650. The experimental results show that
the HAPPY implementation of Intel-adaptive page pol-
icy can reduce the cost of implementation by 5× for the
evaluated 64 GB memory size (up to 40× for a mem-
ory size of 512 GB) while maintaining the prediction
accuracy. The other case study shows 182,000× reduc-
tion in cost of implementation for the evaluated 64 GB
memory size (up to 1.2M× for memory size of 512 GB)
while maintaining the prediction accuracy. The experiments have also reported the accuracy of the predictors and have studied the sensitivity to the memory address mapping. In both scenarios, HAPPY maintains its key advantage of offering no degradation of prediction accuracy while significantly reducing the hardware overhead.
8. ACKNOWLEDGEMENTS
The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 318633; AXLE project http://axleproject.eu/. Mikel Luján is funded by a Royal Society University Research Fellowship and further supported by UK EPSRC grants DOME EP/J016330/1 and PAMELA EP/K008730/1.
9. REFERENCES
[1] John Ousterhout, “Ramcloud.”
[2] D. Ongaro, S. M. Rumble, R. Stutsman, J. Ousterhout, and
M. Rosenblum, “Fast crash recovery in ramcloud,” in
Proceedings of the Twenty-Third ACM Symposium on
Operating Systems Principles, pp. 29–41, ACM, 2011.
[3] B. Jacob, S. Ng, and D. Wang, Memory systems: cache,
DRAM, disk. Morgan Kaufmann, 2010.
[4] K. Itoh, VLSI memory chip design, vol. 5. Springer New
York, 2001.
[5] B. Keeth, DRAM circuit design: fundamental and
high-speed topics, vol. 13. Wiley, 2008.
[6] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis,
J. Leverich, D. Mazières, S. Mitra, A. Narayanan,
G. Parulkar, M. Rosenblum, et al., “The case for
ramclouds: scalable high-performance storage entirely in
dram,” ACM SIGOPS Operating Systems Review, vol. 43,
no. 4, pp. 92–105, 2010.
[7] J. Dodd, “Adaptive page management,” July 11 2006. US
Patent 7,076,617.
[8] Rajinder Gill, “Everything you always wanted to know
about sdram memory but were afraid to ask.”
[9] Intel, “Intel xeon processor x5650.”
[10] G. Brown, “Ensemble learning,” Encyclopedia of Machine
Learning, 2010.
[11] N. Chatterjee, R. Balasubramonian, M. Shevgoor,
S. Pugsley, A. Udipi, A. Shafiee, K. Sudan, M. Awasthi,
and Z. Chishti, “Usimm: the utah simulated memory
module,” University of Utah, Tech. Rep, 2012.
[12] D. Kaseridis, J. Stuecheli, and L. K. John, “Minimalist
open-page: A dram page-mode scheduling policy for the
many-core era,” in Proceedings of the 44th Annual
IEEE/ACM International Symposium on
Microarchitecture, pp. 24–35, ACM, 2011.
[13] Z. Zhang, Z. Zhu, and X. Zhang, “A permutation-based
page interleaving scheme to reduce row-buffer conflicts and
exploit data locality,” in Proceedings of the 33rd annual
ACM/IEEE international symposium on
Microarchitecture, pp. 32–41, ACM, 2000.
[14] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The parsec
benchmark suite: characterization and architectural
implications,” in Proceedings of the 17th international
conference on Parallel architectures and compilation
techniques, pp. 72–81, ACM, 2008.
[15] K. M. Dixit, “The spec benchmarks,” Parallel computing,
vol. 17, no. 10, pp. 1195–1209, 1991.
[16] K. Albayraktaroglu, A. Jaleel, X. Wu, M. Franklin,
B. Jacob, C.-W. Tseng, and D. Yeung, “Biobench: A
benchmark suite of bioinformatics applications,” in
Performance Analysis of Systems and Software, 2005.
ISPASS 2005. IEEE International Symposium on, pp. 2–9,
IEEE, 2005.
[17] Y. Xu, A. S. Agarwal, and B. T. Davis, “Prediction in
dynamic sdram controller policies,” in Embedded Computer
Systems: Architectures, Modeling, and Simulation,
pp. 128–138, Springer, 2009.
[18] D. Huan, Z. Li, W. Hu, and Z. Liu, “Processor directed
dynamic page policy,” in Advances in Computer Systems
Architecture, pp. 109–122, Springer, 2006.
[19] M. Awasthi, D. W. Nellans, R. Balasubramonian, and
A. Davis, “Prediction based dram row-buffer management
in the many-core era,” in Parallel Architectures and
Compilation Techniques (PACT), 2011 International
Conference on, pp. 183–184, IEEE, 2011.
[20] S.-I. Park and I.-C. Park, “History-based memory mode
prediction for improving memory performance,” in Circuits
and Systems, 2003. ISCAS’03. Proceedings of the 2003
International Symposium on, vol. 5, pp. V–185, IEEE,
2003.
[21] C. Ma and S. Chen, “A dram precharge policy based on
address analysis,” in Digital System Design Architectures,
Methods and Tools, 2007. DSD 2007. 10th Euromicro
Conference on, pp. 244–248, IEEE, 2007.
[22] S. Miura, K. Ayukawa, and T. Watanabe, “A
dynamic-sdram-mode-control scheme for low-power systems
with a 32-bit risc cpu,” in Proceedings of the 2001
international symposium on Low power electronics and
design, pp. 358–363, ACM, 2001.
[23] V. Stanković and N. Milenković, “Access latency reduction
in contemporary dram memories,” Facta
universitatis-series: Electronics and Energetics, vol. 17,
no. 1, pp. 81–97, 2004.
[24] V. V. Stankovic and N. Z. Milenkovic, “Dram controller
with a close-page predictor,” in Computer as a Tool, 2005.
EUROCON 2005. The International Conference on, vol. 1,
pp. 693–696, IEEE, 2005.
[25] V. Stankovic and N. Milenkovic, “Dram controller with a
complete predictor: Preliminary results,” in
Telecommunications in Modern Satellite, Cable and
Broadcasting Services, 2005. 7th International Conference
on, vol. 2, pp. 593–596, IEEE, 2005.
[26] R. C. Schumann, “Design of the 21174 memory controller
for digital personal workstations,” Digital Technical
Journal, vol. 9, pp. 57–70, 1997.
[27] M. Blackmore, “A quantitative analysis of memory
controller page policies,” Notes, vol. 2013, pp. 01–01, 2013.
