Abstract-Energy consumption has become the most critical limitation on the performance of today's embedded system designs. On-chip memories are always a significant concern for embedded systems because of their major contribution to overall system energy consumption. Using conventional memory technologies in future nano-scale designs causes a drastic increase in leakage power consumption and temperature-related problems.
I. INTRODUCTION
Chip-multiprocessor (CMP) architectures have been extensively adopted to meet ever-increasing performance demands in embedded systems. The increase in the number of cores in embedded CMPs comes with an increase in energy consumption. Energy consumption is an essential constraint for embedded systems, since these systems are generally limited by battery lifetime. In addition, a significant share of an embedded system's power consumption is due to the memory system. Therefore, there is a critical need to reduce the energy consumption of the memory architecture in embedded systems.
To reduce memory energy, both leakage and dynamic energy must be addressed. Leakage accounts for 42% of overall energy dissipation in the 90nm generation [1], and this share can exceed 50% in 65nm technology [2]. Hence, leakage energy has

A number of researchers have proposed 3D CMP architectures with a 3D stacked memory system [4, 5]. Stacking main memory directly on top of a core layer is a natural way to attack the memory wall problem. However, stacking traditional memories such as SRAM and DRAM on the core layer may cause performance degradation and a drastic increase in power density and temperature-related problems.
Various non-volatile memories such as Spin-Torque

System optimization techniques are widely used to improve overall performance as well as energy efficiency. In this work, we propose a convex optimization based approach to design a heterogeneous memory system consisting of NVM and SRAM memory banks. To the best of our knowledge, this is the first time that a convex model is used for architecting an optimal hybrid memory system using the compiler. Our proposed model minimizes the energy consumption of the embedded 3D CMP subject to a performance constraint. Figure 1 shows an overview of the proposed approach. The main contributions of this work can be summarized as follows:
• To the best of our knowledge, this is the first work that proposes an optimization model to distribute data blocks into SRAM and STT-RAM banks based on compiler analysis.
• We efficiently allocate data blocks based on read and write access patterns.
• We minimize the energy consumption of a hybrid memory stacked onto an eCMP by utilizing the compiler, for the first time.
The remainder of this paper is organized as follows. Section II describes related work. In Section III, the details of the convex optimization-based problem and its formulation are investigated. In Section IV, evaluation results are presented. Finally, we draw conclusions in Section V.
II. RELATED WORK
Recent studies [6] 
III. PROPOSED METHOD

A. Data Access Pattern Extraction and Analysis
As illustrated in Figure 1, the first step of our approach is to extract data-access pattern information from the application code. While it is possible to do this by profiling the code under consideration, the resulting access pattern may be very sensitive to the particular input used in profiling. Instead, in this work, we use static compiler analysis to extract the read and write information of the data-access patterns of a given embedded application. Then, we use this read and write access pattern to allocate data blocks to the appropriate memory bank. In this work, we force write-intensive blocks to be allocated in SRAM banks because of their higher endurance and lower energy consumption for write operations.
With this policy, we can assign STT-RAM banks to read-intensive blocks to take advantage of the near-zero leakage power of NVM technology, and we obtain a more reliable design by avoiding write operations in STT-RAM banks.
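The allocation policy above can be illustrated with a minimal Python sketch. It assumes a simple write-count threshold as the criterion for "write-intensive"; the function name `classify_blocks`, the threshold value, and the per-block write counts are hypothetical and chosen only for illustration. In the proposed approach the actual placement is decided by the optimization model, not by this heuristic.

```python
def classify_blocks(write_counts, write_threshold):
    """Place write-intensive blocks in SRAM and all others in STT-RAM.

    write_counts: dict mapping a block name to its number of write accesses.
    write_threshold: hypothetical cut-off above which a block counts as
    write-intensive (an assumption made for this sketch).
    """
    placement = {}
    for block, writes in write_counts.items():
        placement[block] = "SRAM" if writes > write_threshold else "STT-RAM"
    return placement


# Hypothetical write counts for four data blocks.
write_counts = {"a": 120, "b": 95, "e": 0, "g": 3}
placement = classify_blocks(write_counts, write_threshold=50)
```

Here blocks `a` and `b` end up in SRAM, while the read-dominated blocks `e` and `g` go to STT-RAM.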
A sample data-access pattern is shown in Figure 2 (a). In this figure, we also represent the type of access (read or write) to each data block. For example, $a_r$ represents a read access and $a_w$ represents a write access to data block a in the sample data pattern. Specifically, in this work, we adopt the concept of a step to define these transitional points. Even though, in theory, we have the flexibility to assign any number of iterations between two transitional events, these points should be selected carefully. In other words, in moving from one step to another during execution, the data-access pattern should exhibit significant variation.
The unit of data that is stored in SRAM or STT-RAM banks in our experiments is a data block. The data-block size is a crucial factor that affects the data-access pattern. We manually selected suitable data-block sizes for a given application. Figure 3 shows a general view of allocating a data block to the on-chip memory layer. In the left part of this figure, a two-dimensional array is divided into data blocks. Data blocks are then allocated to SRAM or STT-RAM banks according to the result of the optimization model.
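As a small illustration of dividing a two-dimensional array into data blocks, the following Python sketch maps an array element to the block that contains it. It assumes rectangular blocks numbered in row-major order; the function name `block_index` and the array and block dimensions are hypothetical and are not taken from the experiments.

```python
import math


def block_index(row, col, array_cols, block_rows, block_cols):
    """Return the row-major index of the data block containing (row, col)."""
    # Number of blocks needed to cover one row of the array.
    blocks_per_row = math.ceil(array_cols / block_cols)
    return (row // block_rows) * blocks_per_row + (col // block_cols)


# Hypothetical 8x8 array tiled into 4x4 blocks, giving four blocks (0..3).
top_right = block_index(0, 7, array_cols=8, block_rows=4, block_cols=4)
bottom_left = block_index(5, 2, array_cols=8, block_rows=4, block_cols=4)
```

A larger block size reduces the number of blocks the model must place but makes each block's read and write pattern coarser, which is why the block size influences the extracted access pattern.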
[Figure 3 label: Dividing a two-dimensional array into data blocks]

Specifically, when this loop nest is executed, the read and write data-access pattern of the blocks is $a_w$, $e_r$, $i_r$, $g_r$, $h_r$, $b_w$. Assuming that the entire code fragment is considered a single execution step, these are also the blocks accessed in this step. However, if we assume that each step consists of only $Q^2/4$ loop iterations, then the iteration space of the code fragment shown in Figure 5 (b) spans two steps. In this case, the data blocks accessed by the first step are $a_w$, $e_r$, $i_r$, $g_r$, and $h_r$, and those accessed by the second step are $b_w$, $e_r$, $i_r$, $g_r$, and $h_r$. These two sequences collectively constitute the data-block access pattern for this code fragment. Consequently, dividing a loop nest into steps can change the access-pattern sequence of the application.

To solve the models, we use CVX [16], an efficient convex optimization solver. Let P denote the total number of cores, N the total number of SRAM memory banks, M the total number of STT-RAM memory banks, B the total number of data blocks, and S the total number of steps.
We use $R_{m,s}$ and $W_{m,s}$ to identify whether there is a read or write access to a data block in a given step. More specifically:
• $R_{m,s}$: Indicates whether data block m is read accessed at step s.
• $W_{m,s}$: Indicates whether data block m is write accessed at step s.
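The following Python sketch shows one way the step concept and the two indicators could be computed from an access trace. It is an illustration under stated assumptions: the trace format (one list of `(block, kind)` pairs per loop iteration), the function names `step_patterns` and `indicator_matrices`, and the four-iteration trace are hypothetical. The trace is chosen so that one step reproduces the pattern a_w, e_r, i_r, g_r, h_r, b_w, and two steps reproduce the two shorter sequences.

```python
def step_patterns(trace, iterations_per_step):
    """Group a per-iteration access trace into steps of distinct accesses."""
    if iterations_per_step <= 0:
        raise ValueError("iterations_per_step must be positive")
    patterns = []
    for start in range(0, len(trace), iterations_per_step):
        seen = []  # distinct accesses of this step, in first-seen order
        for iteration in trace[start:start + iterations_per_step]:
            for access in iteration:
                if access not in seen:
                    seen.append(access)
        patterns.append(seen)
    return patterns


def indicator_matrices(patterns, blocks):
    """Return (R, W): R[m][s] = 1 if block m is read in step s; W likewise."""
    row = {block: m for m, block in enumerate(blocks)}
    R = [[0] * len(patterns) for _ in blocks]
    W = [[0] * len(patterns) for _ in blocks]
    for s, pattern in enumerate(patterns):
        for block, kind in pattern:
            target = R if kind == "r" else W
            target[row[block]][s] = 1
    return R, W


# Hypothetical trace: four iterations, each a list of (block, kind) accesses.
reads = [("e", "r"), ("i", "r"), ("g", "r"), ("h", "r")]
trace = [[("a", "w")] + reads] * 2 + [[("b", "w")] + reads] * 2

blocks = ["a", "b", "e", "g", "h", "i"]
one_step = step_patterns(trace, 4)   # whole fragment as a single step
two_steps = step_patterns(trace, 2)  # half of the iterations per step
R, W = indicator_matrices(two_steps, blocks)
```

With two steps, block `a` is written only in the first step and block `b` only in the second, so the step size directly changes the indicator values the model sees.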
The assignment of a data block to a memory bank is identified by $LSR_{m,n}$ and $LST_{m,n}$. That is,
• $LSR_{m,n}$: Indicates whether data block m is assigned to SRAM bank n.
• $LST_{m,n}$: Indicates whether data block m is assigned to STT-RAM bank n.
A read or write access to a memory bank holding a data block at a particular step is captured by $SR_{n,m}$ and $ST_{n,m}$. Specifically, we have:
• $SR_{n,m}$: Indicates whether SRAM bank n is accessed by data block m.
• $ST_{n,m}$: Indicates whether STT-RAM bank n is accessed by data block m.
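After having defined the integer variables, we can now discuss the constraints:

$$SR_{n,m} \ge R_{m,s} \times LSR_{m,n}, \quad \forall m, n, s$$
$$ST_{n,m} \ge R_{m,s} \times LST_{m,n}, \quad \forall m, n, s$$
$$SR_{n,m} \ge W_{m,s} \times LSR_{m,n}, \quad \forall m, n, s$$
$$ST_{n,m} \ge W_{m,s} \times LST_{m,n}, \quad \forall m, n, s$$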
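To make the four inequalities concrete, the sketch below computes the smallest binary bank-access indicators that satisfy them for a fixed assignment. This is only an illustration: in the actual model the assignment variables are chosen by the solver, and the function name `min_bank_access` and the example matrices are hypothetical.

```python
def min_bank_access(R, W, L):
    """Smallest binary A with A[n][m] >= R[m][s]*L[m][n] and
    A[n][m] >= W[m][s]*L[m][n] for every block m, bank n and step s.

    R, W: per-block, per-step read/write indicators (R[m][s], W[m][s]).
    L: assignment indicators for one memory type (L[m][n]).
    """
    num_blocks = len(L)
    num_banks = len(L[0]) if L else 0
    A = [[0] * num_blocks for _ in range(num_banks)]
    for m in range(num_blocks):
        # 1 if block m is read or written in at least one step.
        touched = max(R[m] + W[m], default=0)
        for n in range(num_banks):
            A[n][m] = touched * L[m][n]
    return A


# Hypothetical instance: three blocks, two steps, two SRAM banks, one STT-RAM bank.
R = [[0, 0], [1, 1], [0, 0]]
W = [[1, 0], [0, 0], [0, 0]]
LSR = [[1, 0], [0, 0], [0, 1]]  # block 0 -> SRAM bank 0, block 2 -> SRAM bank 1
LST = [[0], [1], [0]]           # block 1 -> STT-RAM bank 0

SR = min_bank_access(R, W, LSR)
ST = min_bank_access(R, W, LST)
```

Block 2 is assigned to SRAM bank 1 but is never accessed, so its bank-access indicator can stay at zero; this is the slack that the inequality form leaves for the energy objective to exploit.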
Since a data block can reside only in a single bank at any given time, it must satisfy the following constraint.
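A simple way to check this single-bank requirement for a candidate assignment is sketched below. The sketch assumes that each block must appear in exactly one bank across both memory types; the function name `one_bank_per_block` and the example matrices are hypothetical, and this is not necessarily the exact form of the constraint used in the model.

```python
def one_bank_per_block(LSR, LST):
    """Return True if every data block is assigned to exactly one bank overall.

    LSR[m][n] and LST[m][n] are binary assignment indicators for SRAM and
    STT-RAM banks; both must describe the same set of blocks.
    """
    if len(LSR) != len(LST):
        raise ValueError("LSR and LST must have one row per data block")
    return all(sum(sr_row) + sum(st_row) == 1
               for sr_row, st_row in zip(LSR, LST))


# Hypothetical assignment: block 0 in SRAM bank 0, block 1 in STT-RAM bank 0.
valid = one_bank_per_block([[1, 0], [0, 0]], [[0], [1]])
# Block 0 placed in both an SRAM bank and an STT-RAM bank violates the rule.
invalid = one_bank_per_block([[1, 0], [0, 0]], [[1], [1]])
```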
The limited bank capacity establishes the basis for the next constraint that needs to be included in our model. Assume that the size of a block is $size_{block}$ and that the available memory space is $size_{SRAM}$ and $size_{STTRAM}$ for SRAM and STT-RAM, respectively. Hence, each memory bank will be of size $size_{SRAM}$ for SRAM and $size_{STTRAM}$ for STT-RAM.
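The capacity requirement can be illustrated with a short check on a candidate assignment. The sketch assumes that all blocks have the same size and that every bank of a given memory type has the same capacity; `bank_capacity` is a hypothetical per-bank parameter, and the function name and the numbers are chosen only for illustration.

```python
def banks_within_capacity(L, block_size, bank_capacity):
    """Return True if the blocks assigned to each bank fit into that bank.

    L[m][n] is 1 when data block m is assigned to bank n (one memory type).
    block_size and bank_capacity must use the same unit.
    """
    num_banks = len(L[0]) if L else 0
    for n in range(num_banks):
        used = block_size * sum(row[n] for row in L)
        if used > bank_capacity:
            return False
    return True


# Hypothetical SRAM assignment: blocks 0 and 1 in bank 0, block 2 in bank 1.
LSR = [[1, 0], [1, 0], [0, 1]]
fits = banks_within_capacity(LSR, block_size=4, bank_capacity=8)
overflows = banks_within_capacity(LSR, block_size=4, bank_capacity=4)
```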
If the number of writes to a data block exceeds a threshold, we force the data block to be allocated in an SRAM bank. We employ the following constraint for this purpose:

$$LSR_{m,n} \times \Big(\sum_{j} W_{m,j}\Big) + LST_{m,n} \times threshold_{write} \ge threshold_{write}$$
To force a data block to be allocated in SRAM, we also need to prevent the data block from being allocated to STT-RAM at the same time. Hence, constraint (11) allows the data block to be allocated to an STT-RAM bank only when its number of writes does not exceed the threshold:
$$LST_{m,n} \times \Big(\sum_{j} W_{m,j}\Big) \le threshold_{write}, \quad \forall m, q, \; m \ne q, \; \forall n \qquad (11)$$

So far in our discussion, we have not put any limit on the potential performance degradation due to using SRAM or STT-RAM memory banks for allocating data blocks. One might envision a case where only a limited degradation in performance can be tolerated. The performance overhead in our model can be captured using an additional constraint. In our design, the performance overhead is mainly due to the different delays of write/read activities in SRAM and STT-RAM banks. Assuming that $O_{max}$ is the maximum performance overhead allowed for the design (which can be 0 to obtain the best energy savings without tolerating any performance penalty), our performance constraint can be expressed as follows:
We define the dynamic energy consumption as the sum of read and write energies of data blocks in SRAM or STT-RAM banks.
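The definition of dynamic energy can be sketched as a sum over blocks of read and write counts weighted by the per-access energy of the memory type holding each block. The per-access energies below are hypothetical placeholder values in picojoules, not the values from the experimental tables, and the function name `dynamic_energy` is likewise an assumption.

```python
def dynamic_energy(reads, writes, placement, energy_per_access):
    """Sum the read and write energies of all blocks for a given placement.

    reads, writes: dicts mapping a block to its access counts.
    placement: dict mapping a block to "SRAM" or "STT-RAM".
    energy_per_access: dict mapping a memory type to its read/write energy.
    """
    total = 0
    for block, memory in placement.items():
        cost = energy_per_access[memory]
        total += reads[block] * cost["read"] + writes[block] * cost["write"]
    return total


# Hypothetical per-access energies in pJ (placeholders, not measured values).
energy_per_access = {
    "SRAM": {"read": 200, "write": 200},
    "STT-RAM": {"read": 250, "write": 1000},
}
reads = {"a": 10, "e": 100}
writes = {"a": 50, "e": 0}

hybrid = dynamic_energy(reads, writes, {"a": "SRAM", "e": "STT-RAM"},
                        energy_per_access)
all_stt = dynamic_energy(reads, writes, {"a": "STT-RAM", "e": "STT-RAM"},
                         energy_per_access)
```

Under these placeholder numbers, moving the write-heavy block `a` to SRAM lowers the dynamic energy, because a write to STT-RAM is assumed to cost more than a write to SRAM.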
In addition, we calculate the static energy. The static power dissipation depends on temperature. Since this optimization problem is solved at design time, we make a pessimistic worst-case temperature assumption and calculate $P_{static,sr}$ and $P_{static,st}$ at the maximum temperature limit. Specifically, that is:
Having specified the necessary constraints in our convex optimization model, we next present our objective function. We denote the total energy consumption of the proposed 3D-stacked heterogeneous memory system as $E_{Total}$. $E_{Total}$ comprises dynamic and static energy components:
To summarize, the objective function $E_{Total}$ is minimized under constraints (1) 
A. Experimental Setup
We use GEMS [21], McPAT [25] and a SystemC-based NoC simulator, 3D-Noxim [24], to set up the system platform. The detailed baseline system configuration is listed in Table II.
The cache capacities and energy consumption of SRAM and STT-RAM are estimated with CACTI [23] and NVSIM [22], respectively. The proposed compilation technique is implemented on LLVM [27]. The parameters we used in our experiments for SRAM and STT-RAM cache banks are shown in Table III.
We use multithreaded workloads for our experiments. The multithreaded applications with small working sets are selected from the PARSEC benchmark suite. Note that in this figure, the baseline is a memory architecture with only STT-RAM banks. We assumed the maximum endurable number of writes for SRAM and the different NVM technologies based on Table IV [28].
COMPARISON OF MAXIMUM WRITE NUMBER FOR VARIOUS MEMORY TECHNOLOGIES
To evaluate lifetime, we assumed that each benchmark runs continuously until one of the cache blocks exceeds the maximum number of endurable writes (shown in Table IV).
