An Energy-Efficient Heterogeneous Memory Architecture for Future Dark
  Silicon Embedded Chip-Multiprocessors by Onsori, Salman et al.
An Energy-Efficient Heterogeneous Memory
Architecture for Future Dark Silicon Embedded
Chip-Multiprocessors
SALMAN ONSORI, ARGHAVAN ASAD, KAAMRAN RAAHEMIFAR, AND MAHMOOD FATHY
S. Onsori is with the Computer Engineering Department, Bilkent University, Ankara 06800, Turkey
A. Asad and M. Fathy are with the Computer Engineering Department, Iran University of Science and Technology, Tehran, Iran
K. Raahemifar is with the Electrical and Computer Engineering Department, Ryerson University, ON M5B 2K3, Canada
CORRESPONDING AUTHOR: S. ONSORI (s.onsori66@gmail.com)
ABSTRACT Main memories play an important role in overall energy consumption of embedded systems.
Using conventional memory technologies in future designs in nanoscale era causes a drastic increase in leak-
age power consumption and temperature-related problems. Emerging non-volatile memory (NVM) technolo-
gies offer many desirable characteristics such as near-zero leakage power, high density and non-volatility.
They can significantly mitigate the issue of memory leakage power in future embedded chip-multiprocessor
(eCMP) systems. However, they suffer from challenges such as limited write endurance and high write
energy consumption, which restrict their adoption in modern memory systems. In this article, we present a
convex optimization model to design a 3D stacked hybrid memory architecture that minimizes the
energy consumption of future embedded systems in the dark silicon era. The proposed approach satisfies an
endurance constraint in order to design a reliable memory system. Our convex model optimizes the number and
placement of eDRAM and STT-RAM memory banks on the memory layer to exploit the advantages of both
technologies in future eCMPs. Energy consumption, the main challenge in the dark silicon era, is represented
as a major target in this work and it is minimized by the detailed optimization model in order to design a dark
silicon aware 3D Chip-Multiprocessor. Experimental results show that, in comparison with the Baseline
memory design, the proposed architecture reduces the energy consumption of the 3D CMP by about 61.33 percent
and improves its performance by about 9 percent on average.
INDEX TERMS Heterogeneous memory architecture, non-volatile memory (NVM), convex-optimization
problem, 3D integration technology, energy efficient design, dark silicon
I. INTRODUCTION
Energy consumption is an essential and important constraint
in embedded systems since these systems are generally
restricted by battery lifetime. It is widely acknowledged that
energy consumption of memory systems is a significant con-
tributor to the overall system energy due to integration of
increasingly larger memory closer to the processor [47].
Therefore, there is a critical need to considerably reduce
energy consumption of memory architectures. Memory
energy consists of two components: 1) leakage, and 2) energy
of the read/write access. In order to reduce memory energy,
both the leakage and dynamic energy should be minimized.
Moreover, 42 percent of the overall energy dissipation in the
90 nm generation [1] and over 50 percent of the overall energy
dissipation in 65 nm technology [4] are due to leakage. Hence,
leakage energy has become comparable to dynamic energy in
current generation memory modules and soon will exceed
dynamic energy in magnitude if voltage and technology are
further scaled down [3], [24].
Due to the physical limitations of two-dimensional integration
technologies (2D ICs), three-dimensional chip-multiprocessors
(3D CMPs) have received a lot of attention recently
[25]–[28]. Compared with 2D designs, 3D integration
technology reduces interconnection wire length, resulting in
lower power consumption and shorter communication
latency [23]. In addition, network-on-chip (NoC)
architectures have been extended to the third dimension with
the help of through silicon vias (TSVs) [44], [45]. 3D NoCs
combine the benefits of the short vertical interconnects of 3D
ICs and the scalability of NoCs. Therefore, 3D NoCs have
the potential to achieve better performance with higher
scalability and lower power consumption.
Received 2 December 2015; revised 25 April 2016; accepted 26 April 2016.
Date of publication 0 2016; date of current version 0 2016.
Digital Object Identifier 10.1109/TETC.2016.2563323
VOLUME 1, NO. X, XXXXX 2016
2168-6750 2016 IEEE. Translations and content mining are permitted for academic research only.
Personal use is also permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
In order to exploit 3D CMPs and benefit from the advantages
of 3D NoCs, CMP architectures with 3D stacked memory
systems have been proposed to reduce the power consumption
of CMPs and increase their performance [7], [35], [36], [53],
[54]. Stacking traditional memory systems on the core layer
may drastically degrade performance and aggravate power density
and temperature-related problems [46] such as negative bias
temperature instability (NBTI) [42]. For example, when
eDRAM/DRAM is stacked on top of cores as on-chip memory, the
heat generated by the core layer can significantly increase
the refresh power of the DRAM layers. In such a case, the
designer needs to consider the power consumed by
the refresh phase when designing the power management
policy for stacked DRAM memory or cache. Non-volatile
memories (NVMs) are a newly emerging class of memory technologies
with potential application in designing new classes of memory
systems due to their benefits such as higher storage density
and near zero leakage power consumption [37]–[39].
Spin-transfer torque random-access memory (STT-RAM)
as a promising candidate of NVM technology combines
the speed of SRAM, the density of DRAM and the non-
volatility of Flash memory. In addition, excellent scalability
and very high integration with conventional CMOS logic
are the other superior characteristics of STT-RAM [2].
Although NVMs have many benefits as described above,
their drawbacks, such as high write energy consumption,
long write latency and limited write endurance, prevent
their direct use as a replacement for traditional memories
[32], [48].
In order to overcome the aforementioned disadvantages,
we use eDRAM and STT-RAM as two different types of
memory banks in the stacked memory layer of a 3D eCMP.
This hybrid memory architecture lets the design exploit
the benefits of both memory technologies.
In this work, we use a Non Uniform Memory Architecture
(NUMA) stacked directly on top of the core layer in the
proposed eCMP.
Recently, dark silicon has emerged as a trend in VLSI
technology [29], [30], [49], [50]. The rise of the utilization wall
due to thermal and power budgets restricts the number of active
components and results in a large region of dark silicon. Uncore
components, such as the memory and cache subsystems, consume
a significant amount of power [31]. Thereby,
power management of uncore components is critical for
maximizing design performance in the dark silicon era. We exploit
3D die-stacking and emerging NVMs in this work to design a
high performance 3D CMP architecture that minimizes
energy consumption as a solution to combat the dark silicon
challenge. Previous research has mainly focused on energy
efficient core designs [29], [40], and the design of uncore
components for reducing energy consumption has
rarely been explored. Because CMOS technology now improves
only slightly, heterogeneous architectures can be a promising
solution to tackle the challenges of multicore scaling in
the dark silicon era. NVMs can be efficiently integrated with CMOS
circuits in energy-efficient designs.
To the best of our knowledge, this paper is the first work to
examine an energy efficient heterogeneous memory architec-
ture design based on a convex optimization approach for
future eCMPs. We exploit 3D die-stacking and emerging
NVMs to design a high performance 3D eCMP architecture
that minimizes energy consumption as a solution to combat
the dark silicon challenge for future CMPs.
Figure 1 shows an overview of the proposed design using
an example with eight homogeneous cores in the lower layer
and a hybrid memory architecture in the upper layer. In the
proposed heterogeneous memory system, STT-RAM, a
well-known NVM candidate, is combined with
eDRAM banks in the second layer.
This paper makes the following novel contributions:
• We provide a convex optimization based platform to
design a heterogeneous memory system consisting of
NVM and eDRAM memory banks.
• Our proposed model can optimally find the number of
eDRAM and STT-RAM memory banks in the memory
layer of the embedded 3D CMP based on the access
behavior of mapped applications to minimize energy
consumption.
• We demonstrate that our ILP formulation extends the
lifetime of the hybrid memory architecture and provides
significant energy savings in comparison with the
baseline designs.
• We developed a simulator with a hybrid memory and 3D
NoC platform to evaluate the proposed design in an
embedded 3D CMP using PARSEC benchmarks.
The rest of this paper is organized as follows. Section II
describes a brief background. Section III describes related
work. In Section IV, the details of convex optimization-based
problem and its formulation are investigated. In Section V,
evaluation results are presented. Finally, the paper is con-
cluded in Section VI.
II. BACKGROUND
A. STT-RAM TECHNOLOGY
STT-RAM has been one of the most popular NVM structures
due to its scalability in sub-nanometer technology and the
low writing current in comparison with the conventional
Magnetic Random Access Memory (MRAM).
As illustrated in Figure 2, to perform a read operation
from the STT-RAM cell, the NMOS transistor is turned
ON and a small voltage is applied between the bit line and
the source line. This voltage causes a current through the magnetic
tunnel junction (MTJ). The amount of this current depends
on the state of the MTJ. A current sensor senses the current
and compares it with a reference current. As a result, the
logic value of that cell is determined.
FIGURE 1. An overview of the proposed architecture.
For a write operation, the current varies and depends
on the value to be written. To write the
logic value ‘0’, a positive current is injected between the bit line and
the source line; to write the logic value ‘1’, a negative current is
injected. The amount of current required for a reliable write
operation is known as the threshold current, which depends
on the type of material used to construct the MTJ and on its
shape [14], [41].
B. 3D DIE-STACKING TECHNOLOGY
The three-dimensional integrated circuits (3D ICs) technol-
ogy, where multiple silicon layers are stacked vertically, has
proven to be a promising solution for increasing the number
of transistors on a chip [55]. In 3D IC designs, the critical
paths can be significantly shortened and the bandwidth
between processor cores and memories can be greatly
increased [22], [23]. In addition to the aforementioned advan-
tages, 3D ICs also provide heterogeneous integration, on-chip
interconnect length reduction, and a modular and scalable
design. Thus, 3D integration is envisioned as a solution for
future many-core design to tackle the memory wall problem.
In this paper, we assume that the stacking approach is used for
3D embedded CMP design, in which core and memory layers
are vertically stacked and connected by through silicon vias
(TSVs).
III. RELATED WORK
Numerous studies [8], [9], [33], [34] have proposed hybrid
architectures, wherein the SRAM is integrated with NVMs,
in order to take advantages of both technologies. Energy con-
sumption is still a primary concern in embedded systems
since they are limited by battery constraint. Several techni-
ques have been proposed to reduce energy consumption of
hybrid memory architectures in embedded systems. Fu et al.
[12] presented a technique to improve energy efficiency
through a sleep-aware variable partitioning algorithm for
reducing the high leakage power of hybrid memories.
Hajimiri et al. [11] proposed a system-level design approach
that minimizes dynamic energy of a NVM-based memory
through content aware encoding for embedded systems. Our
work is different from all the prior works as we focus on
placement of eDRAM and STT-RAM banks in a stacked
memory architecture in future CMPs to minimize energy
consumption using a convex optimization based approach.
As mentioned before, there are some obstacles to employing
STT-RAM without integration with traditional technologies
in modern memory systems. One of these obstacles is
the limited number of write operations. After the number of write
operations has reached its limit, it is not possible to write
another value into a STT-RAM cell, and only the stored values
can be read [43]. A number of studies have presented different
techniques to address the endurance problem of NVMs.
Qureshi et al. [10] proposed wear leveling techniques for a
PRAM-based memory system to enhance the lifetime. Wang
et al. [5] proposed an algorithm to evenly distribute write
events in the address space of scratchpad memory to extend
the endurance of NVM. Luo et al. [6] presented a writing tech-
nique called Min-Shift to reduce the total number of writes to
NVM and to enhance the lifetime of NVM. Hu et al. [13]
proposed a software wear leveling technique to extend the
lifetime of NVM in hybrid memory structure of embedded
systems. However, our paper is the first work to propose an
endurance model for NVM technology. This endurance model
is used as a constraint in the proposed optimization problem
to design a high endurance heterogeneous memory system
with minimum energy consumption.
IV. OPTIMIZATION MODEL
In this section, we formulate our energy optimization prob-
lem to design a minimum energy heterogeneous memory
structure in embedded 3D CMP. Figure 3 shows block dia-
gram of our model for designing the proposed hybrid mem-
ory with minimum energy consumption.
The outputs of our optimization problem are 1) finding the
optimal number of eDRAM and STT-RAM memory banks
based on the memory access behavior of mapped applica-
tions with respect to the endurance constraint, 2) the appro-
priate placement of eDRAM incorporated with STT-RAM
banks in the memory layer to minimize energy consumption.
DRC and STC represent our optimization variables. These
two binary variables indicate that a particular memory bank
in the proposed design is either an eDRAM or a STT-RAM
bank. Our convex optimization model finds the DRC and STC
variables for each bank in the second layer. Based on these
variables, the hybrid memory layer is constructed (Figure 4).
After constructing the second layer and knowing the actual
placement of eDRAM and STT-RAM banks on it, we can
count the number of banks and hence find the optimal
number of banks of each memory technology in our design.
FIGURE 2. Structure of a STT-RAM.
FIGURE 3. Overview of our model.
Table 1 gives the constant terms used in our convex for-
mulation. To solve the models, we used CVX [15], an effi-
cient convex optimization solver.
Let P denote the total number of processor
cores, DR the total available number of eDRAM memory
banks, ST the total available number of STT-RAM memory
banks, $(C_X, C_Y)$ the dimensions of the chip, and $(P_X, P_Y)$ the
dimensions of a processor core. In this work, DR and ST are
equal to P; however, these numbers can take different values.
Our approach uses 0–1 variables to specify the coordinates of
each memory bank and processor core. Note that we do not
consider application mapping in our proposed model, and
applications are randomly mapped to cores in the core layer.
We use DRC and STC to identify the coordinates of a memory
bank. We have two types of memory banks, eDRAM and
STT-RAM, so we have two variables.
• $DRC_{dr,x,y,l}$: indicates whether an eDRAM bank is in
$(x, y)$ in layer $l = 2$.
• $STC_{st,x,y,l}$: indicates whether a STT-RAM bank is in
$(x, y)$ in layer $l = 2$.
The mapping between coordinates and blocks in the second
layer is ensured by the variables DRMAP and STMAP for the
eDRAM and STT-RAM memory banks, respectively. That is,
• $DRMAP_{dr,x,y,l}$: indicates whether coordinate $(x, y)$ is
assigned to an eDRAM bank in layer $l = 2$.
• $STMAP_{st,x,y,l}$: indicates whether coordinate $(x, y)$ is
assigned to a STT-RAM bank in layer $l = 2$.
A memory bank needs to be assigned to a unique coordinate.
In Equation (1), i and j correspond to the x and y coordinates,
respectively.
$$\sum_{i=1}^{C_X-1} \sum_{j=1}^{C_Y-1} \left( DRC_{dr,i,j,l} + STC_{st,i,j,l} \right) \le 1, \quad \forall dr, \forall st, \; l = 2, \tag{1}$$
$$STMAP_{st,x,y,l} \ge STC_{st,x_1,y_1,l}, \quad \forall st, x, y, x_1, y_1 \text{ such that } x_1 + T_X \ge x > x_1 \text{ and } y_1 + T_Y \ge y > y_1, \; l = 2, \tag{2}$$
$$DRMAP_{dr,x,y,l} \ge DRC_{dr,x_1,y_1,l}, \quad \forall dr, x, y, x_1, y_1 \text{ such that } x_1 + R_X \ge x > x_1 \text{ and } y_1 + R_Y \ge y > y_1, \; l = 2. \tag{3}$$
Also, the sum of the STT-RAM and eDRAM banks used in
the second layer is equal to P, as follows:
$$\sum_{x=0}^{C_X-1} \sum_{y=0}^{C_Y-1} \left( \sum_{i=1}^{DR} DRC_{i,x,y,l} + \sum_{i=1}^{ST} STC_{i,x,y,l} \right) = P, \quad l = 2. \tag{4}$$
In this work, the memory banks and their associated
router/controller in the upper layer are the same size as the
cores in the lower layer. This prevents VLSI problems
related to layout and TSV design.
In order to prevent multiple mappings of a coordinate in
our grid, we assign each coordinate in the second layer to exactly
one memory bank (eDRAM or STT-RAM).
$$\sum_{i=1}^{DR} DRMAP_{i,x,y,l} + \sum_{i=1}^{ST} STMAP_{i,x,y,l} = 1, \quad \forall x, y, \; l = 2. \tag{5}$$
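The placement constraints can be sanity-checked on a small grid. The following is a minimal sketch, not the authors' CVX model: it assumes unit-size banks (so a bank covers only the cell it is placed on), represents the binary variables as Python dictionaries, and uses the hypothetical helper name `check_placement`. It checks that each bank is placed at most once (the intent stated before Equation (1)), that the number of placed banks equals P (Equation (4)), and that every coordinate holds exactly one bank (Equation (5)).

```python
def check_placement(drc, stc, cx, cy, p):
    """Check a candidate memory-layer placement (illustrative only).

    drc, stc: dicts mapping (bank, x, y) -> 0/1 for eDRAM and STT-RAM banks.
    Assumes unit-size banks, so a bank covers exactly the cell it sits on.
    """
    coords = [(x, y) for x in range(cx) for y in range(cy)]

    def placements(var, bank):
        # Number of coordinates at which a given bank is placed.
        return sum(var.get((bank, x, y), 0) for x, y in coords)

    def occupancy(x, y):
        # Number of banks (of either technology) placed on one coordinate.
        dr = sum(v for (_, i, j), v in drc.items() if (i, j) == (x, y))
        st = sum(v for (_, i, j), v in stc.items() if (i, j) == (x, y))
        return dr + st

    dr_banks = {b for (b, _, _) in drc}
    st_banks = {b for (b, _, _) in stc}
    unique = (all(placements(drc, b) <= 1 for b in dr_banks)
              and all(placements(stc, b) <= 1 for b in st_banks))
    total = sum(drc.values()) + sum(stc.values()) == p            # cf. Eq. (4)
    one_per_cell = all(occupancy(x, y) == 1 for x, y in coords)   # cf. Eq. (5)
    return unique and total and one_per_cell


# A 2 x 2 memory layer with P = 4: two eDRAM banks and two STT-RAM banks.
drc = {(1, 0, 0): 1, (2, 1, 1): 1}
stc = {(1, 0, 1): 1, (2, 1, 0): 1}
print(check_placement(drc, stc, 2, 2, 4))
```

A layout that stacks two banks on one coordinate and leaves another empty fails the per-coordinate check even though the total bank count still equals P.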
FIGURE 4. Construction of hybrid memory layer based on optimization variables.
TABLE 1. Constant terms used in our optimization problem. The values of $FREQ_{p,m,r}$ and $FREQ_{p,m,w}$ are obtained by collecting statistics through simulating the code and capturing accesses to each storage block.
| Constant | Definition |
| --- | --- |
| $P$ | Number of cores in the core layer |
| $DR$ | Total number of eDRAM memory banks |
| $ST$ | Total number of STT-RAM memory banks |
| $C_X$, $C_Y$ | Dimensions of the chip |
| $P_X$, $P_Y$ | Dimensions of a core |
| $R_X$, $R_Y$ | Dimensions of an eDRAM memory bank |
| $T_X$, $T_Y$ | Dimensions of a STT-RAM memory bank |
| $N$ | Number of lines in a STT-RAM memory bank |
| $l$ | Index of layers in the 3D CMP |
| $FREQ_{p,m,r}$ | Number of read accesses to memory bank m by core p |
| $FREQ_{p,m,w}$ | Number of write accesses to memory bank m by core p |
| $E^{read}_{dr}$, $E^{write}_{dr}$ | Dynamic energy consumption per read and write access by the eDRAM memory bank |
| $E^{read}_{st}$, $E^{write}_{st}$ | Dynamic energy consumption per read and write access by the STT-RAM memory bank |
| $\varphi$ | Ratio for using STT-RAM versus eDRAM |
| $t^{r}_{dr}$, $t^{w}_{dr}$ | Read and write latency of an eDRAM bank |
| $t^{r}_{st}$, $t^{w}_{st}$ | Read and write latency of a STT-RAM cache bank |
| $P^{static}_{dr}$ | Static power consumed by each eDRAM memory bank at the maximum temperature limit |
| $P^{static}_{st}$ | Static power consumed by each STT-RAM memory bank at the maximum temperature limit |
| $STTLine_{endurance}$ | Maximum write number for each line of a STT-RAM memory bank |
The static power dissipation depends on the temperature.
Since this optimization problem is solved at design time, we
make a pessimistic worst-case temperature assumption and
calculate $P^{static}_{dr}$ and $P^{static}_{st}$ at the maximum temperature limit.
$$P^{static} = \sum_{i=0}^{C_X-1} \sum_{j=0}^{C_Y-1} \left( \sum_{k=1}^{DR} DRC_{k,i,j,l} \times P^{static}_{dr} + \sum_{k=1}^{ST} STC_{k,i,j,l} \times P^{static}_{st} \right), \quad l = 2. \tag{6}$$
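Because the double sum over coordinates in Equation (6) simply counts the placed banks of each technology, the static power reduces to a weighted bank count. The sketch below illustrates this under an assumption made only for illustration: it plugs in the leakage values that Table 3 reports for 512 KB arrays at 80 °C (120 mW for eDRAM, 16 mW for STT-RAM), which are not the per-bank values of the 4 MB banks used in the evaluation. The function name `static_power_mw` is hypothetical.

```python
def static_power_mw(num_edram, num_sttram, p_edram_mw=120.0, p_sttram_mw=16.0):
    """Static power of the memory layer in mW, following the form of Eq. (6).

    Default leakage values are the 512 KB figures from Table 3 (assumption:
    used here only as stand-ins for the per-bank static power).
    """
    return num_edram * p_edram_mw + num_sttram * p_sttram_mw


# Bank mixes of the 16-bank baselines described in Table 2.
for name, n_dr, n_st in [("Baseline-eDRAM", 16, 0),
                         ("Hybrid-symmetric", 8, 8),
                         ("eDRAM-centric", 4, 12)]:
    print(name, static_power_mw(n_dr, n_st), "mW")
```

With these stand-in numbers, every eDRAM bank replaced by a STT-RAM bank removes 104 mW of static power, which is why the leakage term pushes the optimizer toward STT-RAM wherever the write behavior allows it.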
We consider the endurance problem of STT-RAM in our convex
model. Hence, we use an endurance constraint for the
optimal placement of eDRAM and STT-RAM memory
banks. In our model, if placing a STT-RAM memory bank in
a particular position would lead to the destruction of more than half of
the lines of that memory due to the write frequency of the cores,
a STT-RAM memory bank is not chosen for that position.
This endurance constraint can be expressed as follows:
$$\frac{\sum_{i=1}^{P} FREQ_{i,st,w}}{STTLine_{endurance}} \times STC_{st,x,y,2} < \frac{N}{2}, \quad \forall x, y, st. \tag{7}$$
Figure 5 shows the overview of our endurance model.
Since STT-RAM has limited write endurance, we
can only write a limited number of times to each line of
STT-RAM. If the number of writes to one line exceeds
the threshold, that line is destroyed. We assume a worst
case scenario in which all write operations go to one
line until the line is destroyed, after which a new line is
selected for the rest of the write operations. When 50 percent of the lines in a
STT-RAM memory bank have been destroyed, a new write
operation has only a 1/2 chance of going to a valid line that has
not already been destroyed. In other words, a write to the
STT-RAM bank is equally likely to succeed or to fail.
If more than half of the lines of a STT-RAM bank
are destroyed, the chance of a successful write to this bank is even
less than 1/2. Thus, the maximum tolerable number of destroyed lines that
guarantees writing to a healthy line with more than 50 percent
probability is N/2. Increasing this amount to a number like
3N/4 decreases our chance of writing to a healthy line of a
STT-RAM bank to 1/4. On the other hand, if we decrease the
amount to a number less than N/2, for example N/4, our
chance of writing to a healthy line increases to 3/4; however,
this limits our design because we can only place
STT-RAM banks in positions with fewer write
operations. We selected N/2 because it lies exactly in the middle
and offers a good tradeoff between increasing the endurance of
STT-RAM and maintaining flexibility in our design; however,
this amount can be changed based on the design’s purpose.
Note that we assume the number of lines in a STT-RAM
bank is equal to N. Thus, in our endurance constraint model,
if placing a STT-RAM memory bank in a particular position
leads to the destruction of more than half of the lines of that memory
due to the write frequency of the cores, a STT-RAM bank is not
chosen for that position. Figure 5 illustrates the workflow of
the endurance model.
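The worst-case reasoning above can be sketched in a few lines. This is an illustrative reading of Equation (7), not the authors' implementation: the function name `endurance_check` and all numeric values in the example (1000 lines per bank, 10^6 writes per line, the per-core write counts) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def endurance_check(writes_per_core, line_endurance, n_lines):
    """Worst-case endurance test for one candidate STT-RAM bank position.

    writes_per_core: write counts issued to this bank by each core.
    line_endurance:  maximum number of writes a single line tolerates.
    n_lines:         number of lines N in the bank.
    Returns (allowed, healthy_fraction).
    """
    # Worst case: writes wear out one line completely before moving on,
    # so the number of destroyed lines is total writes / per-line endurance.
    destroyed_lines = sum(writes_per_core) / line_endurance
    # Eq. (7): STT-RAM is allowed only if fewer than N/2 lines are destroyed.
    allowed = destroyed_lines < n_lines / 2
    # Chance that a new write lands on a line that is still healthy.
    healthy_fraction = max(0.0, 1.0 - destroyed_lines / n_lines)
    return allowed, healthy_fraction


# Hypothetical bank: N = 1000 lines, 10**6 writes per line, four cores.
print(endurance_check([100_000_000] * 4, 1_000_000, 1000))  # 400 lines destroyed
print(endurance_check([150_000_000] * 4, 1_000_000, 1000))  # 600 lines destroyed
```

In the first case 400 of 1000 lines are worn out, so the position remains eligible for STT-RAM; in the second case 600 lines are worn out, the chance of hitting a healthy line drops below 1/2, and the position would be left to eDRAM.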
Having specified the necessary constraints in our convex
formulation, we next consider the objective function. The
goal of our objective function is to minimize the energy
consumption of the stacked heterogeneous memory architecture
in the target 3D CMP with respect to the endurance
constraint. A weighted objective function is considered to
capture its potential effects on power consumption and overall
performance. This is achieved by the constant $\varphi$, which is
used as a knob for choosing an eDRAM versus a STT-RAM bank
at each (x, y) coordinate in the memory layer. As mentioned
before, in comparison with eDRAM technology,
STT-RAM is slower and has higher density and near zero leakage
power. Consequently, STT-RAM banks are suitable for
memory-intensive blocks and eDRAM banks are suitable
for computation-intensive blocks. Therefore, by changing
the value of $\varphi$, it is possible to obtain an optimized design based on
the designer’s preference. In this work, we select $\varphi = 0.5$ in
the objective function. Based on this selection, the STT-RAM
energy function receives half the weight of the
eDRAM cost function. Thus, the proposed optimization
model has more freedom to choose STT-RAM banks in the
memory layer. Since STT-RAM memory banks have near-zero
leakage power, we can obtain a low power design strategy
with $\varphi = 0.5$ ($\varphi < 1$ in general). The value of $\varphi$ can be set
differently for other design purposes.
The static energy of eDRAM and STT-RAM banks for
read and write operations is defined as the product of their
static power consumption and their read and write durations:
$$E^{static}_{dr} = \left( t^{r}_{dr} + t^{w}_{dr} \right) \times P^{static}_{dr}, \tag{8}$$
$$E^{static}_{st} = \left( t^{r}_{st} + t^{w}_{st} \right) \times P^{static}_{st}. \tag{9}$$
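As a worked illustration of Equations (8) and (9), the latency and leakage values that Table 3 reports for 512 KB arrays can be substituted directly. This substitution is an assumption made only for illustration; the 4 MB banks used in the evaluation would have different values. Since 1 ns × 1 mW = 1 pJ:

$$E^{static}_{dr} = (4.053 + 4.015)\,\text{ns} \times 120\,\text{mW} = 968.16\,\text{pJ}$$
$$E^{static}_{st} = (2.318 + 11.024)\,\text{ns} \times 16\,\text{mW} \approx 213.47\,\text{pJ}$$

Under these numbers, the static energy term of eDRAM is roughly 4.5 times that of STT-RAM, even though the STT-RAM write takes considerably longer.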
FIGURE 5. Overview of endurance model.
In Equation (10), $E^{read}_{dr}$, $E^{write}_{dr}$, $E^{read}_{st}$ and $E^{write}_{st}$ indicate
the dynamic energy consumed by eDRAM and STT-RAM
banks per read and write access. Figure 6 shows eDRAM
and STT-RAM banks in the second layer and illustrates the
static and dynamic energy parameters of each memory
technology. $E^{dynamic}$, the dynamic energy consumption of the
proposed heterogeneous memory system, is calculated as:
$$E^{dynamic} = \sum_{i=0}^{C_X-1} \sum_{j=0}^{C_Y-1} \sum_{p=1}^{P} \left( \sum_{k=1}^{DR} DRC_{k,i,j,l} \times \left( FREQ_{p,k,r} \times E^{read}_{dr} + FREQ_{p,k,w} \times E^{write}_{dr} \right) + \sum_{k=1}^{ST} STC_{k,i,j,l} \times \left( FREQ_{p,k,r} \times E^{read}_{st} + FREQ_{p,k,w} \times E^{write}_{st} \right) \right), \quad l = 2. \tag{10}$$
Consequently, our objective function can be expressed as:
$$\text{minimize } E_{Total} = \left( E^{static}_{dr} + E^{dynamic}_{dr} \right) + \varphi \cdot \left( E^{static}_{st} + E^{dynamic}_{st} \right) \tag{11}$$
To summarize, the objective function $E_{Total}$ is minimized
under constraints (1) through (10). The proposed memory
system and convex optimization model are very flexible. For
example, the proposed architecture can use other types
of NVM technologies, such as PCM, instead of STT-RAM
banks in the memory layer.
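To make the interplay of the weighted objective and the endurance constraint concrete, the sketch below chooses a technology for each memory-layer position independently. It is not the authors' CVX formulation: it swaps in a simple per-position comparison, which assumes the cost is separable per position, charges one static-energy term per bank, and uses the 512 KB values of Table 3 as stand-in parameters. The function name `choose_banks`, the line count, the per-line endurance and the access counts are all hypothetical.

```python
def choose_banks(accesses, phi=0.5, n_lines=1000, line_endurance=10**6):
    """Pick eDRAM or STT-RAM per position by comparing weighted energies.

    accesses: list of (reads, writes) totals seen at each position.
    Energies are in nJ; per-access values are the Table 3 stand-ins.
    """
    e_r_dr, e_w_dr = 0.790, 0.788              # eDRAM read/write energy
    e_r_st, e_w_st = 0.858, 4.997              # STT-RAM read/write energy
    e_s_dr = (4.053 + 4.015) * 0.120           # Eq. (8): ns * W = nJ
    e_s_st = (2.318 + 11.024) * 0.016          # Eq. (9): ns * W = nJ

    layout = []
    for reads, writes in accesses:
        cost_dr = e_s_dr + reads * e_r_dr + writes * e_w_dr
        # The STT-RAM term is scaled by phi, as in Eq. (11).
        cost_st = phi * (e_s_st + reads * e_r_st + writes * e_w_st)
        # Eq. (7): reject STT-RAM if half of the lines or more would wear out.
        endurance_ok = writes / line_endurance < n_lines / 2
        use_stt = endurance_ok and cost_st < cost_dr
        layout.append("STT-RAM" if use_stt else "eDRAM")
    return layout


# Read-dominated, write-heavy, and endurance-limited positions.
print(choose_banks([(1000, 10), (1000, 900), (10**10, 6 * 10**8)]))
```

In this toy setting the read-dominated position goes to STT-RAM, the write-heavy position goes to eDRAM because of the high STT-RAM write energy, and the third position goes to eDRAM only because the endurance constraint rules STT-RAM out, although its weighted energy would be lower.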
V. EVALUATION
In this section, we first describe the experimental environ-
ment for evaluation of the proposed architecture. In the next
part, different experiments are performed to quantify the
advantages of the proposed architecture over the baseline
architectures.
A. EVALUATION SETUP
We used GEM5 [16] full system simulator to implement
memories and cores. To simulate accurate behavior of the 3D
CMP design and its NoC architecture, we integrated GEM5
with 3D-Noxim [18] which is a SystemC-based NoC simula-
tor. We also integrated McPAT [17] with the aforementioned
simulation platform in order to calculate the power consump-
tion of the design. Furthermore, the cache capacities and
energy consumption of eDRAM and STT-RAM have been
estimated from CACTI [19] and NVSIM [20], respectively.
Figure 7 demonstrates the structure of the core layer and its
network on chip characteristics in the proposed 3D eCMP
design. Also, the simulation platform of this work is shown in
Figure 8. Tables 2 and 3 list the details of the system configuration
for the evaluation along with the parameters used in our
experiments for the eDRAM and STT-RAM memory technologies.
We used multithreaded workloads in our experiments.
The multithreaded applications with small working sets were
selected from the PARSEC benchmark suite [21]. Moreover,
$P_{budget}$ and $T_{max}$ were set to 100 W and 80 °C for the
experimental evaluation.
B. EXPERIMENTAL RESULTS
In this sub-section, we evaluate the target 3D eCMP with
stacked memory in four different cases: the CMP with
eDRAM-only stacked memory (Baseline-eDRAM), the CMP
with hybrid stacked memory that has four eDRAM banks in
the middle (eDRAM-centric), the CMP with hybrid stacked
memory that has the same number of eDRAM and STT-RAM
banks (Hybrid-symmetric), and the CMP with the proposed
hybrid stacked memory based on the convex optimization model.
In the proposed method, we consider 16 eDRAM banks
(4 MB each) and 16 STT-RAM banks (4 MB each) as the
maximum available memory that can be used for designing
the hybrid memory architecture. For evaluation purposes, the
results of the proposed design are compared with those of
the baseline designs. The baseline designs are shown in Figure 9.
FIGURE 6. Energy and power parameters of a memory bank in
second layer of the design.
FIGURE 8. Simulation platform of the design.
FIGURE 7. 3D eCMP configuration.
Figure 10 shows the results of energy consumption for
each PARSEC application. As shown in this figure, the pro-
posed design reduces energy consumption by, on average,
about 61.33, 32 and 36 percent compared with the Baseline-
eDRAM, eDRAM-centric and Hybrid-symmetric designs,
respectively. The reduced energy consumption is due to the
efficient use of eDRAM and STT-RAM banks on the memory
layer, which is achieved by the proposed optimization model.
Figure 11 compares the instructions per cycle (IPC) of the
proposed 3D-stacked hybrid memory architecture with the
baseline designs. The eDRAM and STT-RAM capacities are
nearly the same. Therefore, the IPC differences among the baseline
designs are due to the different read and write latencies of the eDRAM
and STT-RAM memory technologies. Based on Table 3,
although the read latency of eDRAM is higher than the read latency
of STT-RAM, the write problem of STT-RAM has a worse
impact on IPC than the read latency of eDRAM. For example, in the
Hybrid-symmetric design, half of the STT-RAM banks are
replaced with eDRAM banks. Hence, Hybrid-symmetric can
give a higher IPC than the Baseline-STTRAM design, since the
write problem of STT-RAM is mitigated by the eDRAM
banks. It is also possible for Baseline-STTRAM to have a better
IPC than the Hybrid-symmetric design in read intensive
benchmarks. This is because read intensive benchmarks issue
many read operations, which increases the time required
to access the memory layer due to the higher read latency of
eDRAM. The proposed hybrid memory architecture based
on our convex optimization model has the highest IPC
compared with the baseline designs for all the benchmarks.
Experimental results show that the proposed hybrid memory
architecture gives, on average, about 9, 2.8 and 1 percent
speedup over the Baseline-eDRAM, Hybrid-symmetric and
eDRAM-centric designs, respectively.
TABLE 2. Specification of the baseline eCMP configurations.
| Component | Description |
| --- | --- |
| Number of Cores | 16, 4 × 4 mesh |
| Core Configuration | Alpha21164, 3 GHz, area 3.5 mm², 32 nm |
| Private Cache per Core | SRAM, 4-way, 32 B line, 32 KB per core |
| On-chip Memory | Baseline-eDRAM: 64 MB (4 MB eDRAM bank on each core) |
| | Baseline-STTRAM: 64 MB (4 MB STT-RAM bank on each core) |
| | Hybrid-symmetric: 32 MB STT-RAM and 32 MB eDRAM (8 STT-RAM and 8 eDRAM banks, 4 MB each bank) |
| | eDRAM-centric: 48 MB STT-RAM and 16 MB eDRAM (12 STT-RAM and 4 eDRAM banks, 4 MB each bank) |
| | Hybrid proposed: the proposed hybrid memory based on the convex optimization model |
| Network Router | 2-stage wormhole switched, virtual channel flow control, 2 VCs per port, 5-flit buffer depth, 8 flits per data packet, 1 flit per address packet, 16 bytes in each flit |
TABLE 3. Different Memory Technology Comparisons at 65 nm.
| Technology | Area | Read Latency | Write Latency | Leakage Power at 80 °C | Read Energy | Write Energy |
| --- | --- | --- | --- | --- | --- | --- |
| 128 KB SRAM | 3.62 mm² | 2.252 ns | 2.264 ns | 131.1 mW | 0.895 nJ | 0.797 nJ |
| 512 KB STTRAM | 3.30 mm² | 2.318 ns | 11.024 ns | 16 mW | 0.858 nJ | 4.997 nJ |
| 512 KB eDRAM | 3.51 mm² | 4.053 ns | 4.015 ns | 120 mW | 0.790 nJ | 0.788 nJ |
| 2 MB PCRAM | 3.85 mm² | 4.636 ns | 23.180 ns | 31 mW | 1.732 nJ | 3.475 nJ |
FIGURE 9. Different baseline designs.
FIGURE 10. Comparison of energy consumption for the different
baselines and the proposed memory architecture normalized
with Baseline-eDRAM.
FIGURE 11. Comparison of instruction per cycle (IPC) for the dif-
ferent baselines and the proposed memory architecture normal-
ized with Baseline-eDRAM.
Figure 12 compares the lifetime of the proposed design
with the Hybrid-symmetric design for each benchmark. We assume
that the maximum endurable write numbers for eDRAM and the
different NVM technologies are as reported in Table 4
[51], [52].
To evaluate the lifetime, we assume that each benchmark
runs continuously until one of the memory lines in each
memory bank exceeds the maximum number of endurable
writes (shown in Table 4). Figure 12 shows that the lifetime
of our proposed heterogeneous memory architecture is higher
than the lifetime of the baseline designs for all the
benchmarks. The proposed hybrid memory architecture yields,
on average, a 3.03 times (and up to five times) improvement
in lifetime compared with the Hybrid-symmetric memory
design. Thus, our hybrid memory architecture results in a
more reliable 3D eCMP design due to the optimal number
and optimal placement of STT-RAM and eDRAM banks on
the memory layer.
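The lifetime estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names, the per-bank write profile, and the choice to treat the first bank that wears out as the end of the memory's life are assumptions made for this example. Only the endurance limits are taken from Table 4.

```python
# Endurance limits (maximum writes per cell) as listed in Table 4.
ENDURANCE = {"SRAM": 1e16, "eDRAM": 1e16, "STT-RAM": 4e12, "PRAM": 1e9}


def bank_lifetime_seconds(technology, hottest_line_writes, run_seconds):
    """Time until the most-written line of one bank reaches its endurance limit.

    hottest_line_writes: writes to the most-written line during one benchmark
        run (hypothetical profiling input, not a value from the paper).
    run_seconds: duration of that run; the benchmark is assumed to repeat
        continuously with the same write pattern.
    """
    if hottest_line_writes == 0:
        return float("inf")  # a line that is never written never wears out
    writes_per_second = hottest_line_writes / run_seconds
    return ENDURANCE[technology] / writes_per_second


def memory_lifetime_seconds(banks, run_seconds):
    """Lifetime of the whole memory layer.

    banks: list of (technology, hottest_line_writes) pairs, one per bank.
    Assumption: the memory layer is considered worn out as soon as the first
    bank reaches its endurance limit.
    """
    return min(bank_lifetime_seconds(tech, writes, run_seconds)
               for tech, writes in banks)


if __name__ == "__main__":
    # Hypothetical profile: two banks observed over a one-second run.
    profile = [("eDRAM", 1000), ("STT-RAM", 2000)]
    print(memory_lifetime_seconds(profile, run_seconds=1.0))
```

Under this model, moving write-intensive data from STT-RAM banks to eDRAM banks lowers the write rate seen by the endurance-limited technology, which is one way a workload-aware bank mix can lengthen the lifetime.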
Figure 13 shows the energy-delay product (EDP) for each
PARSEC application. As the figure shows, given the energy
and performance improvements of the proposed architecture,
our design improves the EDP by about 65 percent on average
compared with Baseline-eDRAM.
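For readers unfamiliar with the metric, EDP is the product of the energy consumed and the execution time, so a design can lower it by saving energy, by running faster, or both. The sketch below shows the normalization used in this kind of comparison; the function names and the sample numbers are hypothetical and are not measurements from this work.

```python
def edp(energy_joules, delay_seconds):
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds


def normalized_edp(energy, delay, base_energy, base_delay):
    """EDP of a design relative to a baseline (1.0 means no change)."""
    return edp(energy, delay) / edp(base_energy, base_delay)


def improvement_percent(normalized):
    """Convert a normalized EDP into a percentage improvement."""
    return (1.0 - normalized) * 100.0


if __name__ == "__main__":
    # Hypothetical design: 40% of the baseline energy, 90% of its run time.
    n = normalized_edp(energy=0.4, delay=0.9, base_energy=1.0, base_delay=1.0)
    print(n, improvement_percent(n))
```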
The hybrid memory architectures generated by the proposed
convex optimization model for the canneal and fluidanimate
benchmarks are shown in Figure 14. As mentioned earlier, the
number and placement of banks of each memory technology
(eDRAM and STT-RAM) in the memory layer are calculated
in order to minimize the performance cost function of the
3D eCMP while keeping the power budget at a satisfactory
level. In other words, the resulting memory layer depends on
how the threads/applications of each individual benchmark
are distributed on the core layer.
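As a rough illustration of why the bank mix follows the workload, the toy selector below picks, for a single bank, the lower-energy technology that still meets an endurance target. It is not the convex model of this work: it treats each bank independently and ignores placement, the network, eDRAM refresh energy, and the power budget. The function names, the traffic figures, and the five-year lifetime target are assumptions for this example; the per-bank parameters are the 512 KB entries of Table 3 and the endurance limits of Table 4.

```python
# 512 KB bank parameters from Table 3 (65 nm) and endurance from Table 4.
TECH = {
    "eDRAM":   {"leak_w": 0.120, "read_j": 0.790e-9,
                "write_j": 0.788e-9, "endurance": 1e16},
    "STT-RAM": {"leak_w": 0.016, "read_j": 0.858e-9,
                "write_j": 4.997e-9, "endurance": 4e12},
}


def bank_energy(tech, reads, writes, seconds):
    """Leakage plus dynamic energy of one bank over an interval (joules)."""
    p = TECH[tech]
    return p["leak_w"] * seconds + reads * p["read_j"] + writes * p["write_j"]


def meets_endurance(tech, hottest_line_writes, seconds, target_lifetime_s):
    """True if the most-written line survives the target lifetime."""
    writes_per_second = hottest_line_writes / seconds
    return writes_per_second * target_lifetime_s <= TECH[tech]["endurance"]


def choose_technology(reads, writes, hottest_line_writes, seconds,
                      target_lifetime_s):
    """Lowest-energy technology that satisfies the endurance target."""
    candidates = [t for t in TECH
                  if meets_endurance(t, hottest_line_writes, seconds,
                                     target_lifetime_s)]
    if not candidates:
        return None  # no technology meets the target for this traffic
    return min(candidates, key=lambda t: bank_energy(t, reads, writes, seconds))


if __name__ == "__main__":
    five_years = 5 * 365 * 24 * 3600  # hypothetical lifetime target
    # Hypothetical read-mostly bank, then a write-intensive bank (1 s window).
    print(choose_technology(1e6, 1e4, 100, 1.0, five_years))
    print(choose_technology(1e6, 1e8, 1e6, 1.0, five_years))
```

In this toy setting a read-mostly bank favors STT-RAM because of its low leakage, while a write-intensive bank favors eDRAM because of STT-RAM's high write energy and limited endurance.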
VI. CONCLUSION
In this work, we proposed a convex optimization based
model to design a heterogeneous memory organization
using eDRAM and STT-RAM memory banks in order to
minimize the energy consumption of future 3D eCMPs. For
the first time, we incorporated an endurance model for NVM
technologies into the optimization problem in order to
design a reliable hybrid memory structure. The experimental
results showed that the proposed method improves the
energy-delay product by 65 percent on average when compared
with traditional memory designs in which a single technology
is used. Furthermore, our 3D eCMP yields on average a
9 percent performance improvement when compared with the
baseline designs.
REFERENCES
[1] J. Kao, S. Narendra, and A. Chandrakasan, “Subthreshold leakage
modeling and reduction techniques,” in Proc. IEEE/ACM Int. Conf.
Comput.-Aided Des., 2002, pp. 141–148.
[2] A. K. Mishra, T. Austin, X. Dong, G. Sun, Y. Xie, N. Vijaykrishnan, and
C. R. Das, “Architecting on-chip interconnects for stacked 3D STT-RAM
caches in CMPs,” in Proc. 38th Annu. Int. Symp. Comput. Archit., 2011,
pp. 69–80.
[3] X. Guo, E. Ipek, and T. Soyata, “Resistive computation: Avoiding the
power wall with low-leakage, STT-MRAM based computing,” in Proc.
Annu. Int. Symp. Comput. Archit., 2010, pp. 371–382.
[4] W. Wang and P. Mishra, “System-wide leakage-aware energy minimiza-
tion using dynamic voltage scaling and cache reconfiguration in multi-
tasking systems,” IEEE Trans. Very Large Scale Integr. Syst., vol. 20,
no. 5, pp. 902–910, May 2012.
[5] Z. Wang, Z. Gu, M. Yao, and Z. Shao, “Endurance-aware allocation of
data variables on NVM-based scratchpad memory in real-time embedded
systems,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 34,
no. 10, pp. 1600–1612, 2015.
[6] X. Luo, D. Liu, K. Zhong, D. Zhang, Y. Lin, J. Dai, and W. Liu, “Enhanc-
ing lifetime of NVM based main memory with bit shifting and flipping,”
in Proc. IEEE 20th Int. Conf. Embedded Real-Time Comput. Syst. Appl.,
2014, pp. 1–7.
[7] J. Meng and A. K. Coskun, “Analysis and runtime management of 3D
systems with stacked DRAM for boosting energy efficiency,” in Proc.
Int. Conf. Des., Autom. Test Eur. Conf. Exhib., 2012, pp. 611–616.
FIGURE 12. Expected lifetime comparison of the proposed
design.
TABLE 4. Comparison of Maximum Possible Write Number for
Various Memory Technologies.

| Technology | SRAM | eDRAM | STT-RAM | PRAM |
|---|---|---|---|---|
| Endurance (writes) | 10^16 | 10^16 | 4 × 10^12 | 10^9 |
FIGURE 13. Comparison of energydelay consumption for the
different baselines and the proposed memory architecture nor-
malized with Baseline eDram.
FIGURE 14. Hybrid memory layer for the canneal and fluidanimate
benchmarks based on the proposed convex optimization model.
[8] Z. Wang, D. A. Jimenez, C. Xu, G. Sun, and Y. Xie, “Adaptive placement
and migration policy for an STT-RAM-based hybrid cache,” in Proc. Int.
Conf. High Perform. Comput. Archit., 2014, pp. 13–24.
[9] A. Valero, J. Sahuquillo, S. Petit, P. Lopez, and J. Duato, “Design of
hybrid second-level caches,” IEEE Trans. Comput., vol. 64, no. 7,
pp. 1884–1897, Jul. 2015.
[10] M. Qureshi, M. Franceschini, L. A. Lastras-Montaño, and J. Karidis,
“Morphable memory system: A robust architecture for exploiting
multi-level phase change memories,” in Proc. 37th Annu. Int. Symp.
Comput. Archit., 2010, pp. 153–162.
[11] H. Hajimiri, P. Mishra, S. Bhunia, B. Long, Y. Li, and R. Jha, “Content-
aware encoding for improving energy efficiency in multi-level cell resistive
random access memory,” in Proc. IEEE/ACM Int. Symp. Nanoscale
Archit., 2013, pp. 76–81.
[12] C. Fu, M. Zhao, C. J. Xue, and A. Orailoglu, “Sleep-aware variable parti-
tioning for energy-efficient hybrid PRAM and DRAM main memory,” in
Proc. Int. Symp. Low Power Electron. Des., 2014, pp. 75–80.
[13] J. Hu, M. Xie, C. Pan, C. J. Xue, Q. Zhuge, and E. H. Sha, “Low overhead
software wear leveling for hybrid PCM + DRAM main memory on
embedded systems,” IEEE Trans. Very Large Scale Integr. Syst., vol. 23,
no. 4, pp. 654–663, 2015.
[14] Z. Diao, Z. Li, S. Wang, Y. Ding, A. Panchula, E. Chen, L.-C. Wang,
and Y. Huai, “Spin-transfer torque switching in magnetic tunnel junctions
and spin-transfer torque random access memory,” J. Phys. Condensed
Matter, vol. 19, no. 16, p. 13, 2007.
[15] M. Grant, S. Boyd, and Y. Ye, “CVX: Matlab software for disciplined convex
programming,” (2008). [Online]. Available: www.stanford.edu/~boyd/cvx/
[16] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu,
J. Hestness, et al., “The gem5 simulator,” ACM SIGARCH Comput. Archit.
News, vol. 39, no. 2, pp. 1–7, May 2011.
[17] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and
N. P. Jouppi, “McPAT: An integrated power, area, and timing modeling
framework for multicore and manycore architectures,” in Proc. Annu.
IEEE/ACM Int. Symp. MICRO-42, 2009, pp. 469–480.
[18] M. Palesi, S. Kumar, and D. Patti. (2010). Noxim: Network-on-chip simu-
lator [Online]. Available: http://noxim.sourceforge.net.
[19] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, “CACTI 6.0: A
tool to model large caches,” HP Laboratories, Chicago, USA, Tech. Rep.
HPL-2009-85, 2009.
[20] X. Dong, C. Xu, N. Jouppi, and Y. Xie, “NVSim: A circuit-level perfor-
mance, energy, and area model for emerging non-volatile memory,” in
Proc. Emerging Memory Technol. Springer, 2014, pp. 15–50.
[21] M. Gebhart, J. Hestness, E. Fatehi, P. Gratz, and S. W. Keckler,
“Running PARSEC 2.1 on M5,” Univ. Texas Austin, Dept. Comput. Sci.,
Tech. Rep. TR-09-32, 2009.
[22] C. C. Liu, I. Ganusov, M. Burtscher, and S. Tiwari, “Bridging the proces-
sor-memory performance gap with 3D IC technology,” IEEE Des. Test
Comput., vol. 22, no. 6, pp. 556–564, Nov./Dec. 2005.
[23] Y. Xie, G. Loh, B. Black, and K. Bernstein, “Design space exploration for
3D architectures,” ACM J. Emerging Technol. Comput. Syst., vol. 2, no. 2,
pp. 65–103, 2006.
[24] W. Wang and P. Mishra, “System-wide leakage-aware energy minimiza-
tion using dynamic voltage scaling and cache reconfiguration in multitask-
ing systems,” IEEE Trans. Very Large Scale Integr. Syst., vol. 20, no. 5,
pp. 902–910, May 2012.
[25] J. Zhao, X. Dong, and Y. Xie, “An energy-efficient 3D CMP design with
fine-grained voltage scaling,” in Proc. Des., Autom. Test Eur. Conf. Exhib.,
2011, pp. 1–4.
[26] J. Meng, K. Kawakami, and A. K. Coskun, “Optimizing energy effi-
ciency of 3D multicore systems with stacked DRAM under power and
thermal constraints,” in Proc. 49th Annu. Des. Autom. Conf., 2012,
pp. 648–655.
[27] K. Swaminathan, H. Liu, J. Sampson, and V. Narayanan, “An examina-
tion of the architecture and system-level tradeoffs of employing steep
slope devices in 3D CMPs,” in Proc. Int. Symp. Comput. Archit., 2014,
pp. 241–252.
[28] J. Lee, J. Ahn, K. Choi, and K. Kang, “THOR: Orchestrated thermal man-
agement of cores and networks in 3D many-core architectures,” in Proc.
Des. Autom. Conf., 2015, pp. 773–778.
[29] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger,
“Dark silicon and the end of multicore scaling,” in Proc. 38th Annu. Int.
Symp. Comput. Archit., 2011, pp. 365–376.
[30] P. Bose, “Is dark silicon real?: Technical perspective,” Commun. ACM
Mag., vol. 56, pp. 92–92, 2013.
[31] H. Y. Cheng, J. Zhan, J. Zhao, Y. Xie, J. Sampson, and M. J. Irwin, “Core
vs. uncore: The heart of darkness,” in Proc. 52nd Annu. Des. Autom.
Conf., 2015, pp. 1–6.
[32] Q. Li, J. Li, L. Shi, C. J. Xue, Y. Chen, and Y. He, “Compiler-assisted
refresh minimization for volatile STT-RAM cache,” in Proc. Des. Autom.
Conf., 2013, pp. 273–278.
[33] M. S. Haque, A. Li, A. Kumar, and Q. Wei, “Accelerating non-volatile/
hybrid processor cache design space exploration for application specific
embedded systems,” in Proc. 20th Asia South Pacific Des. Autom. Conf.,
2015, pp. 435–440.
[34] J. Ahn, S. Yoo, and K. Choi, “Prediction hybrid cache: An energy-efficient
STT-RAM cache architecture,” IEEE Trans. Comput., vol. 65, no. 3,
pp. 940–951, Mar. 2016.
[35] S. K. Lim, “3D-MAPS: 3D massively parallel processor with stacked mem-
ory,” in Proc. Int. Conf. Design High Performance, Low Power, Reliable
3D Integrated Circuits, 2013, pp. 537–560.
[36] X. Dong, X. Wu, G. Sun, Y. Xie, H. Li, and Y. Chen, “Circuit and
microarchitecture evaluation of 3D stacking magnetic RAM (MRAM) as a
universal memory replacement,” in Proc. 45th Annu. Des. Autom. Conf.,
Jun. 2008, pp. 554–559.
[37] E. Kultursay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, “Evaluat-
ing STT-RAM as an energy-efficient main memory alternative,” in Proc.
Int. Conf. Perform. Anal. Syst. Softw., 2013, pp. 256–267.
[38] W. Xu, H. Sun, X. Wang, Y. Chen, and T. Zhang, “Design of last-level on-
chip cache using spin-torque transfer RAM (STT RAM),” IEEE Trans.
Very Large Scale Integr. Syst., vol. 19, no. 3, pp. 483–493, Mar. 2011.
[39] H. Yoon, J. Meza, N. Muralimanohar, N. P. Jouppi, and O. Mutlu, “Effi-
cient data mapping and buffering techniques for multilevel cell phase-
change memories,” ACM Trans. Archit. Code Optimization, vol. 11, no. 4,
2014, Art. no. 40.
[40] B. Raghunathan, Y. Turakhia, S. Garg, and D. Marculescu, “Cherry-
picking: Exploiting process variations in dark-silicon homogeneous
chip multi-processors,” in Proc. Design, Autom. Test Eur. Conf.
Exhib., 2013, pp. 39–44.
[41] L. Wilson, “International technology roadmap for semiconductors (ITRS),”
Semiconductor Ind. Assoc., 2013.
[42] H. Tajik, H. Homayoun, and N. Dutt, “VAWOM: Temperature and process
variation aware wearout management in 3D multicore architecture,” in
Proc. 50th ACM/EDAC/IEEE Design Autom. Conf., 2013, pp. 1–8.
[43] X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony, and Y. Xie, “Hybrid
cache architecture with disparate memory technologies,” in Proc. 36th
Annu. Int. Symp. Comput. Archit., 2009, pp. 34–45.
[44] J. Knechtel, I. L. Markov, and J. Lienig, “Assembling 2D blocks into 3D
chips,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2012,
pp. 228–241.
[45] S. Das, D. Lee, D. H. Kim, and P. P. Pande, “Small-world network enabled
energy efficient and robust 3D NoC architectures,” in Proc. 25th Edition
Great Lakes Symp. VLSI, 2015, pp. 133–138.
[46] M. Guan and L. Wang, “Temperature aware refresh for DRAM perfor-
mance improvement in 3D ICs,” in Proc. 16th Int. Symp. ISQED, 2015,
pp. 207–21.
[47] C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and T. W. Keller,
“Energy management for commercial servers,” Computer, vol. 36, no. 12,
pp. 39–48, 2003.
[48] J. Wang, Y. Tim, W. F. Wong, Z. L. Ong, Z. Sun, and H. Li, “A
coherent hybrid SRAM and STT-RAM L1 cache architecture for
shared memory multicores,” in Proc. Asia South Pacific Des. Autom.
Conf., 2014, pp. 610–615.
[49] H. Esmaeilzadeh, “Approximate acceleration: a path through the era of
dark silicon and big data,” in Proc. Int. Conf. Compilers, Archit. Synthesis
Embedded Syst., 2015, pp. 31–32.
[50] J. Henkel, H. Bukhari, S. Garg, M. U. K. Khan, H. Khdr, F. Kriebel, and
M. Shafique, “Dark Silicon: From computation to communication,” in
Proc. 9th Int. Symp. Netw.-on-Chip, 2015, Art. no. 23.
[51] Y. T. Chen, J. Cong, H. Huang, B. Liu, C. Liu, M. Potkonjak, and
G. Reinman, “Dynamically reconfigurable hybrid cache: An energy-efficient
last-level cache design,” in Proc. Des., Autom. Test Eur. Conf. Exhib.,
2012, pp. 45–50.
[52] M. T. Chang, P. Rosenfeld, S. L. Lu, and B. Jacob, “Technology
comparison for large last-level caches (L3Cs): Low-leakage SRAM, low
write-energy STT-RAM, and refresh-optimized eDRAM,” in Proc. High Perform.
Comput. Archit., 2013, pp. 143–154.
[53] D. H. Woo, N. H. Seong, D. L. Lewis, and H. H. S. Lee, “An optimized
3D-stacked memory architecture by exploiting excessive, high-density
TSV bandwidth,” in Proc. High Perform. Comput. Archit., 2010, pp. 1–12.
[54] Q. Guo, N. Alachiotis, B. Akin, F. Sadi, G. Xu, T. M. Low, L. Pileggi,
J. C. Hoe, and F. Franchetti, “3D-stacked memory-side acceleration:
Accelerator and system design,” in Proc. Workshop Near-Data Process.,
2014, pp. 1–6.
[55] G. H. Loh, “3D-stacked memory architectures for multi-core processors,”
in Proc. ACM SIGARCH Comput. Archit. News, 2008, pp. 453–464.
Salman Onsori received the BS degree in com-
puter engineering (hardware) from the Shahed
University, Iran, in 2010 and the MS degree in
computer architecture from the Shahid Beheshti
University, Iran, in 2013. He is currently working
toward the PhD degree at Bilkent University,
Turkey. His current research interests include the
design of emerging non-volatile memory and
cache architectures, 3D chip-multiprocessors, and
embedded systems, as well as their hardware
modelling.
Arghavan Asad received the MS degree in computer
architecture from the Iran University of Science
and Technology, Tehran, Iran, in 2012. She is
currently working toward the PhD degree at the
Iran University of Science and Technology. Her
research interests include interconnection networks,
low-power hardware, and memory hierarchy design.
Kaamran Raahemifar received the BSc degree
from the Sharif University of Technology, the MS
degree from the Waterloo University, and the PhD
degree from the Windsor University, all in electri-
cal and computer engineering. He is currently a
professor in the Department of Computer Engineer-
ing at Ryerson University. His research interests
are in the areas of optimization in engineering,
modeling, simulation, design and VLSI circuits.
Mahmood Fathy received the BS degree in elec-
tronics from the Iran University of Science and
Technology, Tehran, Iran, in 1985, the MS degree
in computer architecture from the Bradford Univer-
sity, West Yorkshire, United Kingdom, in 1987,
and the PhD degree in image processing and computer
architecture from the University of Manchester
Institute of Science and Technology, Manchester,
United Kingdom, in 1991. Since 1991, he has
been an associate professor with the Department of
Computer Engineering, Iran University of Science
and Technology. His research interests include the quality of service in com-
puter networks.
