DReAM: Dynamic Re-arrangement of Address Mapping to Improve the
  Performance of DRAMs by Ghasempour, Mohsen et al.
DReAM: Dynamic Re-arrangement of Address Mapping
to Improve the Performance of DRAMs
Mohsen Ghasempour†, Jim Garside†, Aamer Jaleel‡ and Mikel Luján†
School of Computer Science, University of Manchester†
NVidia Research‡
ABSTRACT
The initial location of data in DRAMs is determined and controlled by
the ‘address mapping’, and even modern memory controllers use a fixed,
run-time-agnostic address mapping. The memory access pattern seen at the
memory interface, on the other hand, changes dynamically at run-time.
Because the access pattern is dynamic while the address mapping in the
DRAM controller is fixed, DRAM performance cannot be exploited
efficiently. DReAM is a novel hardware technique that detects a
workload-specific address mapping at run-time, based on the application
access pattern, and thereby improves the performance of DRAMs. The
experimental results show that DReAM outperforms the best evaluated
address mapping by 9% on average for mapping-sensitive workloads, by 2%
for mapping-insensitive workloads, and by up to 28% across all the
workloads. DReAM can be seen as an insurance policy capable of detecting
which scenarios are not well served by the predefined address mapping.
1. INTRODUCTION
The increasing number of general-purpose cores and accelerator cores
(e.g. GPU cores) integrated into a single chip and competing for access
to DRAM demands better performance from the main memory. In this
situation, exploiting the maximum performance obtainable from the memory
system is crucial. However, due to the internal structure and
organization of DRAMs, described in Section 2, some memory bandwidth
(and hence performance) is always wasted on internal conflicts. One of
the most serious conflicts in a DRAM memory system is the ‘page
conflict’, which occurs when two consecutive memory requests go to
different rows within the same bank. These requests must then be
serviced one after the other, which causes a high access latency for the
second request. Dealing with page conflicts is even more challenging
because they depend entirely on the memory access pattern: the rate of
page conflicts and the time at which they occur change dynamically with
the application behavior. To reduce the vulnerability of DRAM
performance to page conflicts, state-of-the-art memory controllers have
evolved into complex hardware components employing subsystems such as
schedulers. These schedulers use workload run-time information (the
sequence of memory requests) to reduce page conflicts: an important role
of the scheduler is to reorder the memory commands that are available to
issue to the DRAM. However, a scheduler is limited by the number of
options (memory requests) it has to choose from at the time of
scheduling, which in turn is bounded by data dependencies between memory
requests, the number of running threads, the number of cores, etc.
Therefore, there are conflicts that schedulers cannot eliminate. These
page conflicts result from the address mapping and data placement in
DRAMs. As discussed in the next section, address mapping is the process
that maps the physical address bits provided by processors to the
internal structure of DRAMs, and it controls the initial data placement
in memory. Thus, it is important to understand how to select a good
address-mapping scheme that places and distributes data in DRAM devices
so as to mitigate page conflicts. This is possible using a software-only
approach, e.g. with OS support and intelligent memory allocators.
However, this option faces complex problems when multiple independent
applications execute concurrently or in virtualized scenarios (both
hypervisors and containers), and it relies on software being compiled
for specific memory hardware.
This paper presents DReAM, a novel hardware technique that approximates
the entropy of each memory address bit over a set of memory requests in
order to generate workload-specific address mappings at run-time. To
rearrange the address mapping at run-time, DReAM needs to support the
online data migration imposed by changing the address mapping. DReAM
considers different scenarios for data migration with different levels
of complexity. The proposed solutions were evaluated over a wide range
of mapping-sensitive and mapping-insensitive workload mixes. Three
different address-mapping schemes were evaluated over all the workloads
and the best one was chosen to compare against DReAM. Overall, DReAM is
the first on-the-fly mechanism capable of generating workload-specific
address mappings without requiring the running applications to be
stopped.
arXiv:1509.03721v1 [cs.AR] 12 Sep 2015
2. BACKGROUND
Figure 1 presents the basic organization of a DRAM
device. Each DRAM device consists of multiple banks
each of which has a data array and one row buffer. In
practice, the data array within a bank consists of mul-
tiple subarrays, each of which has its own local row
buffer. The local row buffers within a bank are con-
nected to other local row buffers as well as the global
row buffer. There are some interesting works by Chang
et al. [1], Kim et al. [2] and Seshadri et al. [3] to exploit
these subarrays to improve the DRAM performance and
bulk data copy in DRAMs.
[Figure 1 diagram, three nested views: the DRAM device (multiple banks),
a bank (multiple subarrays connected to a global row buffer), and a
subarray (rows of cells with a local row buffer).]
Figure 1: DRAM device organisation.
The address-mapping mechanism for DRAMs transforms the flat,
one-dimensional physical address space into the internal two-dimensional
structure of DRAM devices (row and column). Figure 2 illustrates how one
physical address can be interpreted under two different mapping schemes.
Most memory systems contain DIMMs, and a DIMM can have multiple ranks of
DRAMs. Multiple DIMMs can be placed on a channel, i.e. the physical
connection between a memory controller and DRAMs [4]. The reason for
these hierarchical levels is to maximize the parallelism that can be
exploited when servicing multiple memory requests.
[Figure 2 diagram: the same memory request decoded in two ways.
Scheme 1: Row 9,518, Column 431, Bank 5, Rank 0.
Scheme 2: Row 14,767, Column 594, Bank 5, Rank 0.]
Figure 2: Two different address mapping schemes.
In general, an address-mapping scheme extracts the Channel, Rank, Bank,
Row and Column addresses from the physical address. Due to the internal
structure and electrical characteristics of DRAMs, consecutive accesses
to different memory locations can have different costs depending on the
previous state of the memory. For instance, if there are two consecutive
accesses to the same row in the same bank, the second access can have
significantly lower latency than the first, since the target row has
already been ‘opened’ by the first memory request. On the other hand, if
there are two consecutive accesses to different rows within the same
bank, the second access has significantly higher latency than the first,
because the previous row must be ‘closed’ before the new row is
‘activated’. The latter scenario is a page conflict, and it degrades the
overall performance of DRAMs. Page conflicts are sensitive to the data
placement in DRAMs, and data placement is determined in the first place
by the address-mapping scheme. Therefore, choosing an address-mapping
scheme carefully can reduce page conflicts and improve the performance
of DRAMs.
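As an illustration of the hit and conflict cases described above, the
following minimal Python sketch labels each access in a short trace. It
assumes an open-page policy (a row stays in the row buffer until another
row of the same bank is requested); the function name, the labels and
the trace are hypothetical and are not taken from the paper.

```python
def classify_accesses(accesses):
    """Label each (bank, row) access as 'empty', 'hit' or 'conflict'.

    Assumption: open-page policy, one row buffer per bank.
    """
    open_row = {}   # bank -> row currently held in that bank's row buffer
    labels = []
    for bank, row in accesses:
        if bank not in open_row:
            labels.append("empty")      # no row open yet: activate only
        elif open_row[bank] == row:
            labels.append("hit")        # same row, same bank: page hit
        else:
            labels.append("conflict")   # different row, same bank: page conflict
        open_row[bank] = row
    return labels


# Hypothetical trace of (bank, row) pairs.
trace = [(5, 100), (5, 100), (5, 200), (3, 200), (5, 200)]
labels = classify_accesses(trace)
print(labels)
```

Only the third access pays the close-then-activate penalty, because it
targets a different row in a bank that already has a row open.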
2.1 Motivation - Address Mapping Analysis
Figure 3 presents three well-known address-mapping schemes currently
employed by modern DRAM controllers. The first mapping (Figure 3a) is a
standard mapping intended to exploit spatial locality by placing the
column address in the low-order bits. The next two address interleaving
policies are schemes proposed by Zhang et al. [6] and Kaseridis et
al. [5]. The mapping proposed by Zhang et al. XORs some of the row
address bits with the bank address bits to produce a new bank index
(Figure 3b). This tends to change the bank ID whenever the row ID
changes, which reduces page conflicts. Kaseridis et al. [5] extend this
technique by producing the column index from a different section of the
physical address (Figure 3c). Both techniques aim to reduce page
conflicts in DRAMs.
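The effect of XORing row bits into the bank index can be sketched in a
few lines of Python. The field widths below and the choice of the lowest
row bits for the XOR are assumptions made for illustration only; the
paper does not state the widths, and the original permutation scheme [6]
may select different row bits.

```python
# Hypothetical field widths, listed from the least significant field upwards.
OFFSET_BITS, COLUMN_BITS, CHANNEL_BITS, BANK_BITS, RANK_BITS = 6, 7, 1, 3, 1


def _take(addr, width):
    """Split off the lowest `width` bits; return (field, remaining bits)."""
    return addr & ((1 << width) - 1), addr >> width


def decode_baseline(addr):
    """Mapping 1 layout: Row | RA | Bank | CH | Column | Block Offset."""
    offset, addr = _take(addr, OFFSET_BITS)
    column, addr = _take(addr, COLUMN_BITS)
    channel, addr = _take(addr, CHANNEL_BITS)
    bank, addr = _take(addr, BANK_BITS)
    rank, row = _take(addr, RANK_BITS)
    return {"row": row, "rank": rank, "bank": bank,
            "channel": channel, "column": column, "offset": offset}


def decode_permutation(addr):
    """Same layout, but the bank index is XORed with low row bits (assumed)."""
    fields = decode_baseline(addr)
    fields["bank"] ^= fields["row"] & ((1 << BANK_BITS) - 1)
    return fields


# Two addresses that hit the same bank but different rows under Mapping 1.
a = (1 << 18) | (5 << 14)   # row 1, bank 5
b = (2 << 18) | (5 << 14)   # row 2, bank 5
print(decode_baseline(a)["bank"], decode_baseline(b)["bank"])
print(decode_permutation(a)["bank"], decode_permutation(b)["bank"])
```

Under the baseline layout both addresses fall in bank 5 and would
conflict if accessed back to back; with the XOR step they are steered to
different banks.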
There might be other variations of address-mapping schemes than those
presented in this figure that can perform the translation required to
service a memory request. However, the important point is that current
memory controllers can use only one such address-mapping scheme to
translate the physical address to the internal structure of DRAMs.
Moreover, modern DRAM controllers are limited to performing read/write
operations in bursts (typically bursts of 4 or 8 items). This implies
that some bits are used as a block offset, as shown in Figure 3. To
motivate the technique presented in this paper, Figure 4 presents the
performance comparison of different address-mapping schemes for all the
benchmark suites evaluated in this work. Each bar in this graph
represents the execution time normalized to the baseline address-mapping
scheme (address mapping 1 in Figure 3). Our experimental results
(considering the results of individual workloads) suggest that a
predefined address-mapping scheme is not efficient in all situations,
and thus a fixed address-mapping scheme cannot deliver the best
execution time across all workloads. As Figure 4 suggests, the
permutation-based address mapping almost always (except for the
BIOBENCH benchmark) delivers a better geometric average (GMEAN)
execution time than the other two address-mapping schemes. This address
mapping is therefore chosen as the best baseline of those presented in
this paper to be compared against the DReAM mapping.
[Figure 3 diagram: bit-field layouts of the three mappings.
(a) Mapping 1: Maximise row-buffer locality (Baseline). Fields: Row, RA,
Bank, CH, Column, Block Offset.
(b) Mapping 2: Permutation-based Page Interleaving [6]. Fields: Row, RA,
Bank, CH, Column, Block Offset, with an XOR stage producing the bank
index.
(c) Mapping 3: Minimalist Open-Page Scheme [5]. Fields: Row, Col, RA,
Bank, CH, Column, Block Offset, with an XOR stage.]
Figure 3: Different address mapping schemes.
[Figure 4 plot: normalised execution time (axis from 0.94 to 1.06) of
the Baseline, Permutation and Minimalist mappings for BIOBENCH,
COMMERCIAL, HPC, PARSEC, SPEC and GMEAN.]
Figure 4: Performance comparison among different address-mapping schemes.
3. DREAM: DYNAMIC REARRANGEMENT OF ADDRESS MAPPING
DReAM is a novel technique to analyze the mem-
ory access pattern (produced either by single or multi-
threaded applications) at run-time and estimate an ef-
ficient address-mapping scheme, that reduces page con-
flicts and improves page hits. DReAM consists of two
main phases: ‘online prediction of address mapping’ and
‘on-the-fly data migration’.
3.1 Online Prediction of Address Mapping
The first step is to discover whether the current workload, i.e. the set
of executing applications, is a good match for the baseline
address-mapping scheme. A baseline address-mapping scheme decides which
physical address bits are used to address which part of a DRAM device
(e.g. rank, bank, row, etc.). A physical address is therefore divided
into sets of bits, each set pointing to a specific level of the internal
hierarchy of the DRAM system. Considering consecutive requests to a DRAM
module, the rate at which each physical address bit changes relative to
the previous access correlates strongly with the rate at which the
accessed DRAM location (row, bank, etc.) changes. Accessing different
rows within the same bank causes page conflicts and imposes a power and
performance overhead. Ideally, therefore, the change rate of the
physical address bits that address the row should be kept as low as
possible, to reduce row switches within a bank. DReAM estimates how much
each physical address bit changes by observing memory requests over a
period of time, as a means of generating improved memory mappings.
Estimating the change per bit requires minimal extra hardware: one
counter per physical address bit per memory controller. The bits that
change the most have higher entropy and the bits that change the least
have lower entropy. For a given period, these counters (or frequency
change estimators) keep track of the number of times each bit of the
physical address changes relative to the previous memory request. The
period defines a time window and can be measured in clock cycles or in
memory requests. Figure 5 shows an example of five consecutive accesses
to demonstrate the function of these counters.
[Figure 5 diagram: five consecutive requests and the resulting per-bit
change counters. Bits are listed from bit 27 (left) to bit 0 (right).]
Request     Physical address bits
0x594B9AF   0 1 0 1 1 0 0 1 0 1 0 0 1 0 1 1 1 0 0 1 1 0 1 0 1 1 1 1
0x1B42B8F   0 0 0 1 1 0 1 1 0 1 0 0 0 0 1 0 1 0 1 1 1 0 0 0 1 1 1 1
0x4B429AF   0 1 0 0 1 0 1 1 0 1 0 0 0 0 1 0 1 0 0 1 1 0 1 0 1 1 1 1
0x0B431AF   0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 0 1 1 0 1 0 1 1 1 1
0x59439AF   0 1 0 1 1 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 1 0 1 0 1 1 1 1
Counters    0 4 0 2 0 0 2 0 0 0 0 0 1 0 0 2 2 0 2 0 0 0 2 0 0 0 0 0
Figure 5: Bit-counters mechanism.
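The counters of Figure 5 can be reproduced with a short Python sketch
that XORs each request with the previous one and increments the counter
of every bit that differs. The function name and the 28-bit width are
assumptions for illustration; the five addresses are those shown in
Figure 5, with bit 0 taken as the least significant bit.

```python
ADDRESS_BITS = 28   # assumed width, matching the 28 bits drawn in Figure 5


def bit_change_counters(addresses, width=ADDRESS_BITS):
    """Count, per bit, how often it differs between consecutive requests."""
    counters = [0] * width            # counters[i] belongs to bit i (bit 0 = LSB)
    previous = addresses[0]
    for current in addresses[1:]:
        changed = previous ^ current  # 1 wherever the two addresses differ
        for bit in range(width):
            counters[bit] += (changed >> bit) & 1
        previous = current
    return counters


requests = [0x594B9AF, 0x1B42B8F, 0x4B429AF, 0x0B431AF, 0x59439AF]
counters = bit_change_counters(requests)
print(counters[15], counters[26])
```

For these five requests the sketch reports one change for bit 15 and
four changes for bit 26.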
The counter values of the two highlighted bits show that bit 15 and
bit 26 changed once and four times, respectively, over the five memory
requests. These counters generate a pattern (or signature) that is
representative of the current memory access behavior as perceived by the
memory controller. Figure 6 shows such a signature, extracted from these
counters, for all the benchmarks evaluated in this paper. The X-axis in
each plot is the counter ID (one per physical address bit) and the
Y-axis shows the overall bit-change rate over the application execution
time. There is an exponential growth in the rightmost five bits of
almost all the patterns. This is due to spatial locality, which implies
accesses to sequential physical addresses. These patterns, together with
the address-mapping schemes presented in Figure 3, justify why the
column address bits are typically placed at the bottom of the physical
address. In this way, accesses to consecutive cache lines are mapped to
consecutive columns within the same row (i.e. a page hit).
[Figure 6 plots: 48 panels, one per workload, each with counter IDs 0 to
20 and beyond on the X-axis and bit-change counts of the order of 10^5
on the Y-axis.
(.1) to (.5): Comm1, Comm2, Comm3, Comm4, Comm5.
(.6) to (.18): hpc1 to hpc13.
(.19) to (.26): black, caneal, face, ferret, fluid, freq, stream, swapt.
(.27) to (.46): astar-B, bzip2-l, bzip2-t, cactusADM-b, gcc-1, gcc-2,
gcc-c, gcc-cp, gcc-g, gcc-sc, GemsFDTD-r, leslie3d-l, libquantum, mcf-r,
milc-s, omnetpp-o, soplex-r, sphinx3-a, xalancbmk-r, zeusmp-z.
(.47) mummer, (.48) tigr.]
Figure 6: Extracted bit-change pattern for all benchmark suites.
3.1.1 Address Mapping Prediction
Given the signature for a set of running applications, the next issue is
how to generate an optimized address-mapping scheme. The idea is to map
the physical address bits with low variation to rows (to reduce row
switching, i.e. page conflicts), the bits with medium variation to
banks, and the bits with the highest rate of change to columns, to
increase locality and decrease page conflicts. Moreover, it is possible
to limit DReAM to rearranging only part of the physical address bits, to
mitigate the cost of changing the address mapping in DRAMs (data
migration, discussed later). For instance, in this paper DReAM does not
rearrange the column-address bits, to avoid cache-line-level migration.
To produce a new address-mapping scheme at run-time, (i) the bit-change
rate of each physical address bit is monitored for each time window,
(ii) a new address-mapping scheme is estimated from the monitoring
information of each time window, and (iii) the bit-change rates under
the predefined and the new address-mapping schemes are compared for each
time window. If the new address-mapping scheme improves the bit-change
rate over the baseline address mapping by more than a desired (and
programmable) threshold, for a number of consecutive time windows
defined by the ‘Consistency Threshold’, then the new address mapping is
used as the primary address-mapping scheme in the system.
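Steps (i) to (iii) can be sketched in Python as follows. This is only
one possible reading of the procedure: the bit layout (bits 18 to 27 as
row bits, bits 13 to 17 as the other movable bits, lower bits fixed),
the use of the summed row-bit counters as the "bit-change rate", and all
function names are assumptions rather than details given in the paper.
The counters are those of the Figure 5 example, listed from bit 0
upwards.

```python
def estimate_mapping(counters, row_bits, other_bits):
    """Give the quietest movable bits to the row; the rest go to bank/rank/channel."""
    # Ties are broken in favour of higher bit positions (an arbitrary choice).
    movable = sorted(row_bits + other_bits, key=lambda b: (counters[b], -b))
    n = len(row_bits)
    return sorted(movable[:n]), sorted(movable[n:])


def row_change_rate(counters, row_bits):
    """Assumed metric: total number of changes seen on the row bits."""
    return sum(counters[b] for b in row_bits)


def improvement(counters, old_rows, new_rows):
    """Relative reduction of the row change rate (0.0 if nothing to improve)."""
    old = row_change_rate(counters, old_rows)
    new = row_change_rate(counters, new_rows)
    return 0.0 if old == 0 else (old - new) / old


def should_switch(improvements, threshold, consistency):
    """True once the last `consistency` windows all exceed `threshold`."""
    recent = improvements[-consistency:]
    return len(recent) == consistency and all(i > threshold for i in recent)


# Counters of the Figure 5 example, index = bit position (bit 0 = LSB).
counters = [0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 2, 2, 0,
            0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 4, 0]
old_rows = list(range(18, 28))     # hypothetical baseline row bits
old_other = list(range(13, 18))    # hypothetical bank/rank/channel bits
new_rows, new_other = estimate_mapping(counters, old_rows, old_other)
gain = improvement(counters, old_rows, new_rows)
print(new_rows, new_other, gain)
```

Column and block-offset bits are left out of the movable set on purpose,
mirroring the restriction that column-address bits are not rearranged.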
3.1.2 Mathematical Insight
Intuitively, DReAM proposes a simple technique to detect an
application-specific address-mapping scheme based on monitoring physical
address bit changes. The question, however, is whether there is
analytical evidence that the address-mapping scheme predicted using this
method actually improves the performance of the memory system. As
discussed, the predicted address-mapping scheme is used only if it
reduces the bit-change rate, in comparison with the baseline address
mapping, beyond a certain threshold. This means that DReAM assumes a
correlation between the bit-change rate of physical address bits and the
performance of DRAMs. To investigate this, the correlation coefficient
between the average bit-change improvement reported by DReAM and the
performance improvement of the memory system when using the DReAM
address mapping was computed. The experimental results show a strong
correlation of 0.89, with a very small P-value (1.97 × 10^-15), between
the bit-change rate improvement and the final performance improvement.
This supports the claim that the address-mapping scheme predicted by
DReAM can improve the performance of DRAMs. Figure 7 shows the
bit-change rate improvement reported by DReAM and the final performance
improvement achieved using the address-mapping scheme predicted by
DReAM.
[Figure 7 plot: improvement (%) on a 0 to 35 scale for each workload
(biobench_mummer, biobench_tigr, comm1 to comm5, hpc1 to hpc13, the
parsec workloads and the spec workloads), with two series: Predicted
Bit-Change Rate Improvement and Performance Improvement.]
Figure 7: Bit-change rate improvement vs. the overall
performance improvement.
3.1.3 Mapping-Sensitive Vs. Mapping-Insensitive
Looking at the patterns presented in Figure 6 and considering the basic
principles behind the address-mapping prediction explained so far, it is
possible to categorise workloads by their sensitivity to the address
mapping. If there is an opportunity to swap a physical address bit with
a high change rate that belongs to the row address space with another
physical address bit with a smaller change rate that is not part of the
row address space, the workload is called mapping-sensitive. Otherwise,
it is categorised as a mapping-insensitive workload. For instance,
‘stream’ (Figure 6.25) is a mapping-insensitive workload, since all the
bits dedicated to the row address space have a smaller change rate than
the other bits. On the other hand, ‘libquantum’ (Figure 6.39) is a
mapping-sensitive workload, since there is an opportunity to swap bit 14
with another bit with a smaller change rate (for example, bit 10).
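The categorisation rule above reduces to a single comparison, sketched
here in Python. The function name, the six-bit layout and the two
counter vectors are made up purely for illustration; they are not
measurements from the paper.

```python
def is_mapping_sensitive(counters, row_bits, other_bits):
    """True if some row bit changes more often than some non-row bit.

    `other_bits` are the non-row bits DReAM is allowed to swap with
    (an assumption: the caller excludes any bits that must stay fixed).
    """
    busiest_row = max(counters[b] for b in row_bits)
    quietest_other = min(counters[b] for b in other_bits)
    return busiest_row > quietest_other


# Toy 6-bit layout: bits 4-5 address the row, bits 2-3 are swap candidates.
row_bits, other_bits = [4, 5], [2, 3]
sensitive_counts = [9, 8, 7, 1, 6, 0]     # row bit 4 is busier than bit 3
insensitive_counts = [9, 8, 7, 5, 1, 0]   # every row bit is quieter
print(is_mapping_sensitive(sensitive_counts, row_bits, other_bits))
print(is_mapping_sensitive(insensitive_counts, row_bits, other_bits))
```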
3.2 Data Migration Challenge
Changing the address-mapping scheme of a DRAM on the fly faces one very
important obstacle: the need for data migration. Initially, a DRAM
places data into memory according to a predefined address-mapping
scheme. Changing the address-mapping scheme therefore means that the
data previously loaded into the DRAM cannot be accessed using the new
scheme. Thus, before the new address mapping is employed, the existing
data in DRAM must be migrated to a new location determined by the new
address-mapping scheme. This imposes some overhead on the overall
performance of the memory system. To alleviate this overhead, this paper
investigates two different scenarios, explained as follows.
3.3 Data Migration Solutions
3.3.1 Scenario 1 - Offline Data Migration
This scenario describes the simplest DReAM implementation, which imposes
a minimal hardware overhead on the overall memory system. In general,
this scenario is well suited to application-specific computer
architectures, e.g. database systems, where a specific application runs
on the system over and over. For instance, in a database system,
depending on the type of database (e.g. financial, medical, etc.),
usually only a few specific queries with minor variations are used to
search for specific data. Moreover, in the big-data research area,
running a query over a database might take a few days or weeks. This
produces a specific memory access pattern that is usually consistent
over a long period of time. In this implementation, the memory access
pattern of the applications (single- or multi-threaded) is monitored at
run-time for a desired period (e.g. a few hours or a few days). This
period is called the Region Of Interest (ROI). Ideally, the ROI should
be long enough to represent the application access behavior. For
instance, if the ROI for a medical database is chosen to be one day,
then the memory access pattern of almost all the queries that are
usually run on the database during the day can be covered by the ROI.
DReAM then estimates an optimized address-mapping scheme from the
average bit-change rate extracted from the dedicated per-bit counters
over the entire ROI. This new mapping is saved in the memory controller,
and upon rebooting the system the user has the option to choose the
DReAM address-mapping scheme over the baseline from the system BIOS.
Thus, whenever the user reboots the system, the memory controller can
employ the new address mapping estimated during the DReAM calibration
mode. A similar approach has been implemented for the Intel adaptive
page policy, and a special beta BIOS provided by ASUS allows the user to
choose a desired page closure policy at system start-up [7, 8].
In this scenario there is a penalty for the reboot, but after that, as
long as the usual workloads are running on the system, the overall
performance of the memory system is improved by the new address-mapping
scheme. This is why this scenario is well suited to systems with
consistent behavior over time.
3.3.2 Scenario 2 - Online Data Migration
This scenario investigates the possibility of performing on-the-fly data
migration inside a DRAM device by proposing small modifications to the
internal structure of the memory system.
Basic Procedure: Figure 8 presents the basic flowchart for servicing a
memory request with DReAM under the second data migration scenario. To
minimize the overhead of migration, a row is migrated only when it is
accessed. In practice, this means that the migration occurs gradually,
on demand.
[Figure 8 flowchart: wait for a memory request; translate the address
using both PAMS and EAMS; if the row is migrated, service the memory
request using EAMS; otherwise, if the row is swapped, find the swapped
row and service the memory request from the swapped row; otherwise
service the memory request using PAMS. A Migration & Swap Process block
completes the chart.]
Figure 8: DReAM flowchart.
On the first access to a row, the requested physical address is
translated to the internal structure of the DRAM using both the
Predefined Address-Mapping Scheme (PAMS) and the Estimated
Address-Mapping Scheme (EAMS). The address translated by PAMS is the
source row address and the address translated by EAMS is the destination
row address. Two main functions, Migration and Swap, may be applied to
the requested address in different situations. Why they are needed and
how they work is discussed later in this section; they are introduced
here only to explain the flowchart.
The first step is to determine if the accessed row is
in its original location, pointed to by PAMS, or not.
Two bits are dedicated to each row in a DRAM bank
to keep track of the current status of that row: one bit
(Migration-Bit) to determine if the row has been moved
to its new location (migrated) and one bit (Swap-Bit)
to determine if the row has been swapped (this process
will be discussed later). Two tables can be dedicated to
accommodate these bits for the entire DRAM module:
the Migration Table (MT) and the Swap Table (ST).
At this point several situations might happen:
• If the requested row is in its original location (the migration-bit
and swap-bit are both 0) then (i) PAMS is used to access and service
the requested row, (ii) the requested row is migrated to the
destination location pointed to by EAMS, and (iii) if the destination
location is occupied by a different row, the content of the
destination row intuitively also needs to be migrated to a third
place. This can produce a chain of unnecessary data migrations, which
is costly. To avoid this, a simple row-swap algorithm is employed: in
such situations the content of the destination row is swapped with the
content of the source row (and the corresponding swap-bit changes
to 1).
• If the requested row has been migrated then EAMS is used to access and
service the requested row.
• If the requested row has been swapped then (i) the swapped location is
calculated by applying the reverse address-mapping mechanism to the
source location, (ii) step (i) is repeated until the swap-bit of the
location pointed to by the reverse address-mapping scheme is 0,
(iii) the request is serviced, (iv) the requested row is migrated to
the destination location pointed to by EAMS, and (v) a swap is
performed if necessary.
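The three cases can be exercised with a toy Python model of
migrate-on-access with the row-swap rule. It is a deliberate
simplification: each row is identified by its PAMS location, the EAMS
destinations are given as a permutation, and a swapped row is found
through an explicit `where` table instead of the repeated reverse
address mapping described above. That table, the class and method names
and the four-row example are all assumptions made for illustration.

```python
class MigratingMemory:
    """Toy model of on-demand row migration with one swap per migration."""

    def __init__(self, eams_home):
        n = len(eams_home)
        self.eams_home = eams_home      # eams_home[r]: destination of row r under EAMS
        self.content = list(range(n))   # content[loc]: row stored at loc (row r starts at loc r)
        self.where = list(range(n))     # where[r]: current location of row r (assumed bookkeeping)
        self.migrated = [False] * n     # Migration Table: one bit per row
        self.swapped = [False] * n      # Swap Table: one bit per row
        self.row_moves = 0              # number of row copies performed

    def access(self, row):
        """Service one request and return which path of the flowchart was taken."""
        if self.migrated[row]:
            path, loc = "EAMS", self.eams_home[row]
        elif self.swapped[row]:
            path, loc = "SWAPPED", self.where[row]
        else:
            path, loc = "PAMS", row
        assert self.content[loc] == row   # the request is serviced from loc
        if not self.migrated[row]:
            self._migrate(row, loc)
        return path

    def _migrate(self, row, src):
        dst = self.eams_home[row]
        if dst != src:
            displaced = self.content[dst]            # row currently occupying the destination
            self.content[dst], self.content[src] = row, displaced
            self.where[row], self.where[displaced] = dst, src
            self.swapped[displaced] = True           # displaced row now sits at src
            self.row_moves += 2
        self.migrated[row] = True


mem = MigratingMemory([1, 2, 0, 3])   # hypothetical EAMS destinations for rows 0..3
paths = [mem.access(r) for r in (1, 0, 2, 3, 1)]
print(paths, mem.content, mem.row_moves)
```

Each migration moves at most two rows, so no chain of displaced rows
builds up, and after every row has been touched once each row sits at
its EAMS destination.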
To make all this happen inside the DRAM, some modifications to the
traditional structure of DRAMs are needed, as explained below.
Required DRAM Modification: DReAM has two main requirements for
performing data migration in a DRAM device: the capability of bulk data
copy inside the DRAM and the capability of buffering an entire row on
the fly to perform the swap operation. Both requirements have been
studied individually by previous work addressing different issues, using
the existing subarray-level parallelism in DRAMs [3, 2], as described in
the following.
Bulk Data Copy in DRAM: Seshadri et al. [3] exploit the existing
subarrays per bank in DRAMs to copy an entire row from one location to
another inside the DRAM. Depending on the location of the source and
destination rows, three different scenarios should be considered:
(i) copying between two rows within the same subarray (intra-subarray),
(ii) copying between two rows in different subarrays of the same bank
(inter-subarray), and (iii) copying between two rows in different banks
(inter-bank).
Subarray-Level Parallelism: Kim et al. [2] proposed some small
modifications to DRAMs to exploit the existing subarray-level
parallelism. They discussed three different levels of modification that
improve access latency by making subarrays work independently. The part
of this work that is most relevant to this paper is called MASA. The key
idea of MASA is to allow multiple activated subarrays in the same bank.
MASA requires (i) a designated-bit latch for each subarray, (ii) a new
DRAM command, subarray-select (SA-SEL), and (iii) the routing of a new
global wire. Based on their experimental methodology, they showed that
the required extra latches impose a 0.15% area overhead and consume
72.2 µW of additional power for each ACTIVATE command. Moreover, they
estimated an extra 0.56 mW of static power in the steady state, imposed
by the multiple activation of subarrays.
Having explained the above techniques, the following sections give an
overview of the DReAM architecture.
3.4 DReAM - Overview of Architecture
Figure 9 presents a high-level overview of the DReAM
architecture. DReAM includes two main phases, Address-
Mapping Estimation and Online Data Migration.
3.4.1 Address-Mapping Estimation
[Figure 9 diagram: a stream of memory requests feeds two parts.
DReAM Monitoring: Monitoring & Pattern Extraction, History-Tables and
Address-Mapping Estimation, with PAMS and EAMS as the two mappings.
DReAM Data Migration: Address-Translation (Channel, Rank, Bank, Row,
Column), Reverse Mapping Translation, Migration Table, Swap Table,
Migration Process, Swap Process and Servicing the Memory Request.]
Figure 9: DReAM architecture.
Address-Mapping Estimation requires minimal archi-
tecture support. Only one counter per physical address
bit, a history register to hold the last accessed address
and an array of XORs to detect the bit-change between
two consecutive memory requests are required to ex-
tract the access pattern at run-time. Figure 10 presents
a simple overview of such a structure. In this struc-
ture each bit of the currently accessed address will be
XORed with the corresponding bit of the last accessed
address. Then, if there is a difference in the accessed bit
the corresponding counter will be incremented. As dis-
cussed, this will produce a pattern of physical address
bit changes over a period that can be employed to esti-
mate an application-specific address-mapping scheme.
[Figure 10 diagram: the Last Access register and the Current Access are
compared bit by bit, and each differing bit increments its entry in the
Counter Array.]
Figure 10: DReAM monitoring counter structure.
3.4.2 Data Migration - Operation
The data migration required by DReAM can be described considering the
following observations. First, all the local row buffers (one in each
subarray) within a bank are connected to the global row buffer using
global bitlines, and all the row buffers (local or global) within a DRAM
device are connected together by a narrow (64-bit wide) I/O bus [9, 2].
Second, with the modification proposed by Kim et al. [2], the DRAM
module supports MASA. This allows multiple subarrays to be activated,
although only one of them can be connected to the global bitlines at a
time. Figure 11 presents the possible scenarios in which data migration
might happen. To describe the following scenarios, it is assumed that
the destination row is always occupied by another row (the worst-case
scenario), and thus a swap is necessary.
Figure 11: Different data migration scenarios: (a) Intra-Subarray,
(b) Inter-Subarray, (c) Inter-Bank.
Intra- and Inter-subarray Migration: although Figure 11
presents all the possible scenarios for data migration, the
first two scenarios are ineffective for the main purpose of
DReAM (i.e. reducing page conflicts). In both of them the data
migration happens within the same bank, which does not reduce
the probability of a page conflict. Thus, there is no point in
paying the extra penalty to perform the data migration for
these two scenarios.
Inter-bank Migration: in this scenario (Figure 11c), the
source and destination rows are in different banks. Therefore,
both rows can be activated in parallel. To perform this
scenario the memory controller (i) activates both the source
and the destination row and loads their contents into their
local row buffers, (ii) puts bank A into read mode and bank B
into write mode, (iii) transfers the source row from local
row buffer 1 in bank A to the global row buffer of bank B
using the narrow I/O bus, (iv) puts bank A into write mode and
bank B into read mode, (v) transfers the destination row from
local row buffer 1 in bank B to the global row buffer of
bank A using the narrow I/O bus, and (vi) connects the global
bitlines of the global row buffer in bank A to the source row
and those of the global row buffer in bank B to the
destination row.
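The six steps above can be traced with a small behavioural model. This is only a sketch under simplifying assumptions: the `Bank` class and `swap_rows` function are hypothetical names, each bank is reduced to a dictionary of rows plus one local and one global row buffer, and read/write modes and timing are not modelled.

```python
class Bank:
    """Toy model of a bank: rows plus one local and one global row buffer."""

    def __init__(self, rows):
        self.rows = rows            # row index -> row contents
        self.local_buffer = None    # local row buffer of the activated subarray
        self.global_buffer = None   # global row buffer of the bank

    def activate(self, row):
        # Loading a row into the local row buffer models row activation.
        self.local_buffer = self.rows[row]


def swap_rows(bank_a, src_row, bank_b, dst_row):
    """Swap a row in bank A with a row in bank B, following steps (i)-(vi)."""
    # (i) activate source and destination rows (possible in parallel)
    bank_a.activate(src_row)
    bank_b.activate(dst_row)
    # (ii)-(iii) bank A is read, bank B is written: the source row crosses the I/O bus
    bank_b.global_buffer = bank_a.local_buffer
    # (iv)-(v) roles reversed: the destination row crosses the I/O bus
    bank_a.global_buffer = bank_b.local_buffer
    # (vi) each global row buffer is connected to the row it now replaces
    bank_a.rows[src_row] = bank_a.global_buffer
    bank_b.rows[dst_row] = bank_b.global_buffer


bank_a = Bank({3: "source data"})
bank_b = Bank({7: "destination data"})
swap_rows(bank_a, 3, bank_b, 7)
print(bank_a.rows[3], "|", bank_b.rows[7])  # destination data | source data
```

The two bus transfers in the middle of `swap_rows` are the operations that dominate the cost of a swap.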
3.4.3 Data Migration - Timing Overhead
The latency overhead imposed by the data migration for each
workload is the number of inter-bank migrations (Figure 11c)
times the cost of transferring a row over the internal narrow
(i.e. 64-bit) I/O bus. With a transfer rate of 64 bits/clock
and a row buffer size of 4 Kbit (per device), 64 clock cycles
are required to transfer a row from one bank to another.
Another 64 clock cycles are required when a swap is necessary.
Therefore, in the worst case scenario, the penalty for each
data relocation between two banks is 128 memory clock cycles.
Assuming that the CPU clock is 4 times faster than the memory
clock, the data migration penalty is 512 CPU clock cycles. As
a very pessimistic assumption, the processor is taken to be
stalled while the data migration is happening. Therefore 512
clock cycles times the number of required inter-bank data
migrations delivers a good estimate of the extra overhead
imposed on the overall execution time.
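The arithmetic above can be written out as a short calculation. This is a sketch using the figures stated in the text (4 Kbit row buffer per device, 64 bits per clock, CPU clock 4 times the memory clock); the function names are hypothetical.

```python
def row_transfer_cycles(row_bits=4 * 1024, bus_bits_per_clock=64):
    """Memory clock cycles needed to move one row over the narrow I/O bus."""
    return row_bits // bus_bits_per_clock


def migration_penalty_cpu_cycles(swap=True, cpu_to_mem_clock_ratio=4):
    """CPU cycles for one inter-bank relocation; a swap moves two rows."""
    mem_cycles = row_transfer_cycles() * (2 if swap else 1)
    return mem_cycles * cpu_to_mem_clock_ratio


def estimated_overhead_cpu_cycles(num_inter_bank_migrations):
    """Pessimistic estimate: the processor stalls for every migration."""
    return num_inter_bank_migrations * migration_penalty_cpu_cycles(swap=True)


print(row_transfer_cycles())                # 64
print(migration_penalty_cpu_cycles())       # 512
print(estimated_overhead_cpu_cycles(1000))  # 512000
```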
3.4.4 Rollback Process to Avoid Degradation Loop
DReAM predicts an application-specific address mapping scheme
based on a monitoring period of the past application access
pattern. However, there is no guarantee that the application
access pattern will not change again in the future. The
address-mapping scheme predicted by DReAM might then no longer
be efficient and, as a result, using it might degrade the
performance of the DRAMs (i.e. a Degradation Loop). To work
around this issue, DReAM supports a ‘Rollback’ procedure. As
discussed, DReAM switches to the predicted address-mapping
scheme if the new mapping improves the bit-change rate in
comparison with the baseline by more than a predefined
threshold for consecutive time windows. A similar approach is
used to evaluate the efficiency of the predicted address
mapping at run-time. DReAM keeps monitoring the bit-change
pattern over the time windows even after a new address-mapping
scheme is predicted. If the predicted address mapping scheme no
longer outperforms the baseline in terms of bit-change
improvement, DReAM switches back to the predefined address
mapping scheme. This triggers the rollback function, which
returns the migrated rows to their original locations. In this
way the memory controller can switch between at least two
address-mapping schemes based on the application access
pattern. A third address mapping scheme can be employed once
the rollback process completes, which means that all the rows
migrated by the previous address mapping have returned to their
original locations.
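The switch and rollback decisions can be sketched as a two-state controller that is evaluated once per time window. This is only an illustration under stated assumptions: the class name `MappingSelector`, the requirement of two consecutive windows, and the rollback condition (improvement at or below zero) are not taken from the text; the 7% threshold is the value used in the experiments of Section 5.1.

```python
class MappingSelector:
    """Per-window choice between the baseline and the predicted mapping."""

    def __init__(self, threshold=0.07, windows_needed=2):
        self.threshold = threshold            # minimum bit-change improvement
        self.windows_needed = windows_needed  # assumed number of consecutive windows
        self.streak = 0
        self.using_predicted = False

    def end_of_window(self, improvement):
        """improvement: bit-change gain of the predicted mapping over the baseline."""
        if not self.using_predicted:
            # Count consecutive windows in which the threshold is exceeded.
            self.streak = self.streak + 1 if improvement > self.threshold else 0
            if self.streak >= self.windows_needed:
                self.using_predicted = True
                self.streak = 0
                return "migrate"
        elif improvement <= 0.0:
            # Predicted mapping no longer outperforms the baseline.
            self.using_predicted = False
            return "rollback"
        return "keep"


selector = MappingSelector()
windows = [0.10, 0.03, 0.09, 0.12, 0.08, -0.01, 0.02]  # hypothetical values
decisions = [selector.end_of_window(w) for w in windows]
print(decisions)
# ['keep', 'keep', 'keep', 'migrate', 'keep', 'rollback', 'keep']
```

Requiring several consecutive windows before migrating is what keeps a single noisy window from triggering a costly relocation.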
4. EVALUATION METHODOLOGY
Simulator: USIMM [10] was used as the main simulation platform
for these experiments. USIMM was modified to support
Permutation-based Page Interleaving [6] and the Minimalist
Open-Page scheme, plus a full implementation of the DReAM
architecture. DReAM was evaluated on a 4 GB DRAM organized in
1 channel, running single thread applications. To increase the
randomness of memory access patterns, the size of memory was
kept fixed while running multithread applications. An FR-FCFS
scheduling algorithm is used in our experiments. Table 1
presents the configuration parameters of USIMM.
Model           Description          Value
Processor       Clock Speed          3.2 GHz
                ROB size             32
Memory System   Bus Speed            800 MHz
                Number of Channels   1
                Ranks per channel    1
                Banks per rank       8
                Rows per bank        65,536
                Cache line size      64 Byte
Table 1: USIMM configuration parameters.
Address Mapping Schemes: The memory access pattern, and as a
result the number of page conflicts in DRAMs, can be affected
by the predefined memory address mapping scheme. The
experiments consider the three different address mappings
presented in Figure 3. The experimental results presented in
Section 2.1 (Figure 4) show that the Permutation-based Page
Interleaving policy (Mapping 2) performs best for most of the
workloads. Therefore, this address mapping scheme is employed
as a fair baseline to compare with the DReAM scheme.
SPEC                                 PARSEC              COMMERCIAL
(a) GemsFDTD_r     (k) astar_B       (u) canneal         (D1) comm1
(b) bzip2_l        (l) bzip2_t       (v) streamcluster   (D2) comm2
(c) cactusADM_b    (m) gcc_1         (w) blackscholes    (D3) comm3
(d) gcc_2          (n) gcc_c         (x) facesim         (D4) comm4
(e) gcc_cp         (o) gcc_g         (y) ferret          (D5) comm5
(f) gcc_sc         (p) mcf_r         (z) fluidanimate    BIOBENCH
(g) milc_s         (q) omnetpp_o     (A) freqmine        (E) mummer
(h) soplex_r       (r) sphinx3_a     (B) swaptions       (F) tigr
(i) xalancbmk_r    (s) zeusmp_z      HPC
(j) libquantum     (t) leslie        (C) hpc1 - hpc13
Table 2: Evaluated workloads and benchmark suites.
Workloads: the workloads include a wide range of
memory intensive applications (i.e. 48 workloads) from
different benchmark suites (PARSEC [11], SPEC [12],
BIOBENCH [13], HPC and COMMERCIAL) and rep-
resentative regions of interest for each application. Ta-
ble 2 lists the workloads and their corresponding bench-
mark suites. An identifier is assigned to each appli-
cation to facilitate the naming of multithread work-
loads constructed from these applications. To increase
the variety of memory access patterns, USIMM was
set up for multithread applications to evaluate 20 ran-
domly selected workload mixes; a combination of 4-
thread and 8-thread applications. Table 3 lists these
multithread workloads employing the identifier of sin-
gle thread workloads presented in Table 2.
Multithread Workloads
M1:  C13-C5-x-t       M11: C9-C13-C5-w-x-t-j-q
M2:  C9-w-j-q         M12: C8-C3-w-x-y-a-t-j
M3:  w-x-y-t          M13: C8-C5-x-y-a-t-p-q
M4:  C8-C5-t-p        M14: C9-C12-C13-C9-C12-C12-p-q
M5:  t-t-p-g          M15: C13-x-t-g-p-t-p-g
M6:  C8-w-p-q         M16: C8-C3-C5-w-C5-C5-p-q
M7:  C3-C5-C5-C5      M17: C9-w-y-w-w-a-t-g
M8:  C9-w-y-w         M18: C13-C3-x-C13-a-a-p-g
M9:  C12-C13-a-a      M19: C12-C13-y-a-a-a-g-q
M10: x-t-j-q          M20: x-y-p-a-x-a-p-q
Table 3: Randomly selected multithread workloads.
5. RESULTS AND DISCUSSIONS
5.1 Performance Analysis
In this section the performance of DReAM is investigated.
Before turning to the result graphs, the following summary
might be helpful: (i) The performance numbers presented in this
section are normalized to the baseline (Permutation-based
address mapping), which delivers the best average execution
time among the three address-mapping schemes presented in
Figure 3. (ii) As discussed, the offline mapping is desirable
only for applications with a consistent behavior and is
obtained after a long calibration period. The rebooting cost is
therefore negligible for a long-running application. Thus, in
the results presented in Figures 12 to 16 the cost of rebooting
is ignored in the case of DReAM-Offline and only the efficiency
of the address mapping detected by DReAM is investigated, in
comparison with the baseline mapping.
Figure 12 presents the execution time for the BIOBENCH and
COMMERCIAL benchmarks. This result suggests that the baseline
address-mapping scheme is good enough for the workloads in
these benchmarks and DReAM has no margin to predict a better
address mapping scheme. Therefore, there is no bit-change rate
improvement when using DReAM in comparison with the baseline.
The small degradation by DReAM-Offline (i.e. around 1%) visible
in Figure 12 is due to slightly different access patterns
caused by reordering the baseline address bits. This can be
counted as noise. On the other hand, DReAM-Online mitigates
this issue by checking the bit-change improvement on the fly,
between two consecutive time windows, against a predefined
threshold. For instance, in these experiments DReAM-Online
employs the new address mapping only if it can improve the bit
change rate by more than 7%. Thus, although DReAM cannot
predict a better address mapping scheme than the baseline, it
does not degrade the performance in most cases. A similar
behavior can be observed in Figures 13-16.
Overall, DReAM-Offline outperforms the permutation-based
address-mapping scheme (the best evaluated baseline) by 5%, on
average, and up to 28% across all the workloads. In the case of
DReAM-Online, 12 workloads satisfy DReAM’s threshold at
run-time (i.e. improve the bit change rate by more than 7%) and
for these workloads DReAM-Online outperforms the baseline by
4.5%, on average, and up to 23%. Figure 16 depicts the
execution time for the randomly selected multithread workloads
presented in Table 3. These results show that DReAM can still
predict a better address mapping scheme than the baseline even
in the case of multithread workloads, which produce a highly
random memory access pattern. Looking at the results from a
different angle, DReAM outperforms the best evaluated baseline
address mapping on average by 9% and 2% for mapping-sensitive
and mapping-insensitive workloads respectively.
Figure 12: Results for BIOBENCH and COMMERCIAL benchmark suites.
Figure 13: Results for HPC benchmarks.
Figure 14: Results for the PARSEC benchmark suite.
Figure 15: Results for the SPEC benchmark suite.
Figure 16: Results for multithreaded benchmarks.
(Figures 12 to 16 plot Normalised Execution Time per workload for
Baseline, DReAM-Offline and DReAM-Online.)
Considering the results presented in Figure 15, libquantum
achieves a significant performance improvement by taking
advantage of DReAM. To understand this outcome, it is useful to
check the extracted pattern for this workload presented earlier
in Figure 6.39. This figure shows that there is a high change
rate for bit 14. This bit is mapped to the row address space,
which increases the possibility of accessing different rows
within the same bank (i.e. Page Conflict) and so imposes a
significant performance overhead. DReAM simply assigns this bit
to another address space (e.g. bank or column address space) by
replacing it with a bit with a minimal change rate. In this
situation, the high change rate of this bit increases the
possibility of interleaving the accesses to different banks,
which improves the level of parallelism (or access locality, if
using column address space) in the system and as a result
improves the performance significantly. A similar argument
explains the significant performance improvement for other
workloads, such as ‘face’.
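The bit reassignment described above amounts to exchanging two bit positions of the physical address. The following is a minimal sketch, assuming a hypothetical 8-bit address layout (bits 0-2 column, bits 3-4 bank, bits 5-7 row) and made-up counter values; the function names are illustrative.

```python
def swap_address_bits(addr, i, j):
    """Return addr with the bits at positions i and j exchanged."""
    if ((addr >> i) & 1) != ((addr >> j) & 1):
        addr ^= (1 << i) | (1 << j)  # flip both bits only when they differ
    return addr


def pick_swap(counters, row_bits, other_bits):
    """Pick the most-changing row bit and the least-changing non-row bit."""
    hot = max(row_bits, key=lambda b: counters[b])
    cold = min(other_bits, key=lambda b: counters[b])
    # Only worth swapping if the row bit really changes more often.
    return (hot, cold) if counters[hot] > counters[cold] else None


# Hypothetical change counts for address bits 0..7 (bit 5 is a "hot" row bit).
counters = [50, 40, 30, 2, 5, 90, 3, 1]
row_bits, bank_bits = [5, 6, 7], [3, 4]

hot, cold = pick_swap(counters, row_bits, bank_bits)
print(hot, cold)                                    # 5 3
print(bin(swap_address_bits(0b00100000, hot, cold)))  # 0b1000
```

After the exchange, two addresses that differ only in bit 5 fall into different banks rather than different rows of the same bank, which is the effect the paragraph above attributes to the remapping.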
5.2 Data Relocation Analysis
As discussed, the data relocation required by DReAM is
composed of two main phases: Migration and Rollback. The
following presents some statistics on the migrations and
rollbacks required by DReAM.
Migration vs. Rollback: The experimental results
(presented in Figure 12 to Figure 15) show that 12 stan-
dard workloads undergo dynamic data relocation. Out
of these 12 workloads only two workloads require data
rollback which are ‘ferret’, with 10% of data relocation
spent on data rollback, and ‘libquantum’, with 39% of
data relocation spent on data rollback.
Inter Bank vs. Intra Bank Data Relocation:
The results show that 87.5% of data relocation hap-
pens between banks (Inter-bank relocation) and 12.5%
happens within banks (Intra-bank relocation). As men-
tioned, DReAM does not perform the Intra-bank sce-
narios to reduce the cost of data relocation.
5.3 Storage Overhead and Scalability
Address Mapping Prediction: As discussed, only one counter and
one XOR gate per physical address bit, plus one history buffer
to keep track of the last accessed address, are required to
extract the monitoring pattern. Thus, assuming a sampling
window of 250K memory requests, 18-bit counters times the
number of physical address bits are the main storage overhead
for the first phase of DReAM. Our experimental results show
that this number is no more than 60 bytes for a 512 GB DRAM.
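The 18-bit counter width follows from the window size, since 2^17 < 250,000 <= 2^18. The sketch below shows this calculation; the function names are hypothetical, and the number of monitored address bits is not given in this section, so the value 26 is purely illustrative.

```python
def counter_width_bits(window_requests):
    """Bits needed to count up to window_requests - 1 bit changes."""
    return (window_requests - 1).bit_length()


def counter_storage_bytes(monitored_bits, window_requests=250_000):
    """Storage for one counter per monitored address bit, in bytes."""
    return monitored_bits * counter_width_bits(window_requests) / 8


print(counter_width_bits(250_000))  # 18
print(counter_storage_bytes(26))    # 58.5 (26 monitored bits is an assumption)
```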
Data Migration: the online data migration requires keeping
track of migrated and swapped pages. Therefore the required
Migration Table (MT) and Swap Table (ST) impose extra storage
overhead on the overall memory system. Figure 17 depicts the
overall storage overhead imposed by online data migration. This
result shows that DReAM imposes a storage overhead of
3 × 10^-5 % of the overall DRAM size. Depending on the
implementation choice, the MT and ST can be implemented as part
of the memory controller or as metadata inside the DRAM.
Figure 17: DReAM data migration overhead.
(Storage Overhead in MB versus Memory Capacity, 4G to 512G.)
6. RELATED WORK
The shortfalls of DRAMs with respect to page conflicts are
widely recognized in the area of memory system design. Prior
work proposed a wide range of different techniques, such as
memory interleaving schemes, scheduling algorithms and
architectural modifications to the current structure of DRAMs,
to mitigate this issue. For instance, Zhang et al. [6] proposed
a page interleaving scheme to reduce page conflicts and exploit
data locality. Hsu et al. [14] proposed another memory
interleaving scheme to address the same issue. There are many
other interesting works on new scheduling algorithms [15, 16,
17, 18, 19, 20, 21] that prioritize servicing certain memory
requests to reduce page conflicts and improve memory
performance. Other work in this area proposes either a new
architecture for DRAMs or a small modification to the
traditional structure of these memory systems. For instance,
Sudan et al. [22] proposed a technique to recognise the highly
accessed data in DRAM and place them in the same row to improve
data locality. Kim et al. [2] proposed a technique to exploit
the existing subarray-level parallelism in DRAMs to reduce bank
conflicts. PARDIS by Bojnordi et al. [23] is a programmable
memory controller that can be configured using a specific
instruction set architecture (ISA). Although the focus of that
work was not on developing an optimized address-mapping scheme,
they configured PARDIS with an application-specific address
mapping heuristic obtained by offline profiling analysis and
reported a good performance improvement in the memory system.
7. CONCLUSIONS
This paper has introduced DReAM which is a novel
hardware technique based on approximating the en-
tropy of each memory address bit for a set of memory
requests. DReAM presents three main contributions:
first, a low-cost pattern recognition technique is devel-
oped to extract the memory access pattern at run-time.
Then, a methodology is proposed to estimate an op-
timized address-mapping based on the detected access
pattern. Finally, a technique is proposed for the on-
the-fly migration of data within DRAMs to reduce page
conflicts. An extensive performance evaluation was car-
ried out with 48 different workloads from 5 benchmark
suites and 20 multithreaded applications. In summary,
DReAM-Offline outperforms the permutation-based address mapping
scheme (the state-of-the-art mapping) by 5%, on average, and up
to 28% across all workloads. In the case of DReAM-Online, 12
workloads satisfy DReAM’s threshold at run-time (i.e. improve
the bit change rate by more than 7%) and for these workloads
DReAM-Online outperforms the baseline by 4.5%, on average, and
up to 23%. Categorising workloads into mapping-sensitive and
mapping-insensitive, DReAM outperforms the best evaluated
baseline address mapping on average by 9% and 2% for the first
and second category respectively. Overall, DReAM is the first
on-the-fly mechanism capable of generating workload-specific
address mappings without requiring running applications to be
stopped.
8. ACKNOWLEDGEMENTS
The research leading to these results has received
funding from the European Union’s Seventh Framework
Programme (FP7/2007-2013) under grant agreement n°
318633; AXLE project http://axleproject.eu/. Mikel
Luján is funded by a Royal Society University Research
Fellowship and further supported by UK EPSRC grants
DOME EP/J016330/1 and PAMELA EP/K008730/1.
9. REFERENCES
[1] K. K.-W. Chang, D. Lee, Z. Chishti, A. R. Alameldeen,
C. Wilkerson, Y. Kim, and O. Mutlu, “Improving DRAM
Performance by Parallelizing Refreshes with Accesses,” in
30th Annual International Symposium on High
Performance Computer Architecture (HPCA), 2014.
[2] Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu, “A case
for exploiting subarray-level parallelism (SALP) in
DRAM,” in 39th Annual International Symposium on
Computer Architecture (ISCA), pp. 368–379, IEEE, 2012.
[3] V. Seshadri, Y. Kim, C. Fallin, D. Lee,
R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu,
P. B. Gibbons, M. A. Kozuch, and T. C. Mowry,
“Rowclone: Fast and energy-efficient in-DRAM bulk data
copy and initialization,” in Proceedings of the 46th Annual
IEEE/ACM International Symposium on
Microarchitecture, pp. 185–197, ACM, 2013.
[4] B. Jacob, S. Ng, and D. Wang, Memory Systems: Cache,
DRAM, Disk. Morgan Kaufmann, 2010.
[5] D. Kaseridis, J. Stuecheli, and L. K. John, “Minimalist
open-page: A DRAM page-mode scheduling policy for the
many-core era,” in Proceedings of the 44th Annual
IEEE/ACM International Symposium on
Microarchitecture, pp. 24–35, ACM, 2011.
[6] Z. Zhang, Z. Zhu, and X. Zhang, “A permutation-based
page interleaving scheme to reduce row-buffer conflicts and
exploit data locality,” in Proceedings of the 33rd annual
International Symposium on Microarchitecture, pp. 32–41,
ACM, 2000.
[7] Rajinder Gill, “Everything you always wanted to know
about SDRAM memory but were afraid to ask.”
http://www.anandtech.com/show/3851/everything-you-
always-wanted-to-know-about-sdram-memory-but-were-
afraid-to-ask/6. [Accessed: 28-April-2015].
[8] J. Dodd, “Adaptive page management.”
http://www.google.com/patents/US7076617, July 11 2006.
US Patent 7,076,617.
[9] K. Itoh, VLSI memory chip design, vol. 5. Springer New
York, 2001.
[10] N. Chatterjee, R. Balasubramonian, M. Shevgoor,
S. Pugsley, A. Udipi, A. Shafiee, K. Sudan, M. Awasthi,
and Z. Chishti, “USIMM: the Utah simulated memory
module,” University of Utah, Tech. Rep, 2012.
[11] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC
benchmark suite: characterization and architectural
implications,” in Proceedings of the 17th international
conference on Parallel Architectures and Compilation
Techniques, pp. 72–81, ACM, 2008.
[12] K. M. Dixit, “The SPEC benchmarks,” Parallel computing,
vol. 17, no. 10, pp. 1195–1209, 1991.
[13] K. Albayraktaroglu, A. Jaleel, X. Wu, M. Franklin,
B. Jacob, C.-W. Tseng, and D. Yeung, “BioBench: A
benchmark suite of bioinformatics applications,” in IEEE
International Symposium on Performance Analysis of
Systems and Software, 2005. ISPASS 2005., pp. 2–9,
IEEE, 2005.
[14] W.-C. Hsu and J. E. Smith, “Performance of cached DRAM
organizations in vector supercomputers,” ACM SIGARCH
Computer Architecture News, vol. 21, no. 2, pp. 327–336,
1993.
[15] E. Ebrahimi, R. Miftakhutdinov, C. Fallin, C. J. Lee, J. A.
Joao, O. Mutlu, and Y. N. Patt, “Parallel application
memory scheduling,” in Proceedings of the 44th Annual
IEEE/ACM International Symposium on
Microarchitecture, pp. 362–373, ACM, 2011.
[16] E. Ipek, O. Mutlu, J. F. Martínez, and R. Caruana,
“Self-optimizing memory controllers: A reinforcement
learning approach,” in 35th International Symposium on
Computer Architecture, 2008. ISCA’08., pp. 39–50, IEEE,
2008.
[17] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter,
“ATLAS: A scalable and high-performance scheduling
algorithm for multiple memory controllers,” in 16th
International Symposium on High Performance Computer
Architecture (HPCA), pp. 1–12, IEEE, 2010.
[18] Y. Kim, M. Papamichael, O. Mutlu, and
M. Harchol-Balter, “Thread cluster memory scheduling,”
Micro, IEEE, vol. 31, no. 1, pp. 78–89, 2011.
[19] O. Mutlu and T. Moscibroda, “Stall-time fair memory
access scheduling for chip multiprocessors,” in Proceedings
of the 40th Annual IEEE/ACM International Symposium
on Microarchitecture, pp. 146–160, IEEE Computer
Society, 2007.
[20] O. Mutlu and T. Moscibroda, “Parallelism-aware batch
scheduling: Enhancing both performance and fairness of
shared DRAM systems,” in ACM SIGARCH Computer
Architecture News, vol. 36, pp. 63–74, IEEE Computer
Society, 2008.
[21] K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith,
“Fair queuing memory systems,” in 39th Annual
IEEE/ACM International Symposium on
Microarchitecture. MICRO-39., pp. 208–222, IEEE, 2006.
[22] K. Sudan, N. Chatterjee, D. Nellans, M. Awasthi,
R. Balasubramonian, and A. Davis, “Micro-pages:
increasing DRAM efficiency with locality-aware data
placement,” ACM Sigplan Notices, vol. 45, no. 3,
pp. 219–230, 2010.
[23] M. N. Bojnordi and E. Ipek, “PARDIS: A programmable
memory controller for the DDRx interfacing standards,” in
39th Annual International Symposium on Computer
Architecture (ISCA), 2012, pp. 13–24, IEEE, 2012.
