ABSTRACT Scratchpad memory (SPM) is widely used in embedded systems as a software-controlled on-chip memory that replaces the traditional cache. Emerging non-volatile memory (NVM) is a promising candidate to replace SRAM in SPM because of benefits such as low power consumption and high performance. In particular, several representative NVMs, such as PCM, ReRAM, and STT-RAM, can be built as multi-level cells (MLC) to achieve even higher density. However, MLC incurs higher energy overhead and longer access latency than its single-level cell (SLC) counterpart. To address this issue, this paper proposes an SPM built with morphable NVM, in which each memory cell can be dynamically programmed to MLC mode or SLC mode. To exploit both the high density of MLC and the low energy of SLC, a simple and novel optimization technique, named the theory of thermal expansion and contraction (TTEC), is presented to minimize the energy consumption and access latency of embedded systems. The basic idea is to dynamically adjust the SLC/MLC size configuration of the SPM according to the workload of the program and to allocate the most suitable storage medium to each data item. An integer linear programming (ILP) formulation is first built to produce an optimal SLC/MLC SPM partition and data allocation. In addition, a corresponding approximation algorithm is proposed to achieve near-optimal results in polynomial time. Experimental results show that the proposed technique effectively improves system performance and reduces energy consumption.
I. INTRODUCTION
Scratchpad memory (SPM) has emerged as a software-controlled on-chip memory that replaces the hardware-controlled cache in embedded systems [1]. Owing to its low latency, power efficiency and small area, it is widely used in embedded system designs, such as the Altera Nios II and Xilinx MicroBlaze processors [2]. However, traditional SPM consists of SRAM, whose leakage power consumes, on average, 30%-50% of the total memory energy [3], [4]. This non-negligible energy in SPM can accelerate the exhaustion of the limited energy budget of embedded systems. Therefore, this paper focuses on exploiting emerging non-volatile memory to replace SRAM in SPM, in order to reduce energy consumption and improve performance in embedded systems.
New energy-efficient, byte-addressable, non-volatile memories (NVMs), such as spin-transfer torque random access memory (STT-RAM), resistive random access memory (ReRAM) and memristor, have emerged as promising memory technologies [5]-[7]. These NVMs share attractive characteristics such as high density, zero leakage power and low latency, which has led to their extensive use at different levels of the memory hierarchy. In addition, the cells of some typical NVMs, like STT-RAM and PCM, can dynamically switch between SLC mode and MLC mode [8]-[10]. An NVM cell in MLC mode can store multiple bits, which helps to implement an even higher-density memory chip in a resource-limited embedded system. Nevertheless, this comes at the price of longer latency and higher access energy compared to SLC mode. To combine the benefits of both memory modes, in this paper we propose to build an SPM with "morphable NVM" for optimizing the performance and energy consumption of embedded systems.
Many studies on morphable memory have been proposed to take advantage of the different memory modes. With the assistance of hardware techniques, Qureshi et al. first designed a hardware monitor to collect the memory requirement and dynamically tune the number of bits per memory cell [11]. Meanwhile, Jiang et al. proposed a data classification management scheme that allocates frequently accessed data to the fast SLC mode [12]. Various software techniques have also been proposed to periodically track the access information of programs or memory pages, in order to determine the optimal SLC/MLC partition or the mode of each physical page [13]-[16]. In particular, the techniques of Zhou and Li [13] and Long et al. [16] were implemented on a virtualization platform, which needs more memory resources. Combining the benefits of both SLC mode and MLC mode, these existing techniques dynamically tune the memory mode of NVM cells according to the workload of the system, in order to achieve a balance between capacity and access latency. However, all of the above work targets the main memory level as a replacement for DRAM and cannot be effectively applied to SPM.
At the cache level, Zhou et al. proposed energy-aware morphable cache management, which opened up the possibility of a new SPM with morphable NVM [10]. Related work on SPM with NVM concentrates on data allocation techniques with different optimization parameters, which statically or dynamically allocate an address to each data item according to the features of NVM [17]. This work is based on either a pure NVM SPM or a hybrid SRAM/NVM SPM. On the pure NVM SPM architecture, Wang et al. [18] and Rodríguez et al. [19] proposed allocation optimization algorithms that distribute write operations evenly over the SPM address space and minimize energy consumption. On the hybrid SRAM and NVM SPM architecture, by exploiting the different features of the two memories, Hu et al. first proposed a dynamic data allocation scheme that preferentially allocates read-intensive variables to NVM and write-intensive variables to SRAM [20]. Considering task scheduling problems, Wang et al. further proposed an energy-aware data allocation scheme on hybrid SPM [21].
However, the above work cannot be directly applied to SPM with morphable NVM, in which the sizes of the memories can change dynamically. We should dynamically change the SLC/MLC size configuration of the SPM according to the different workloads in the program, and produce a corresponding optimal data allocation for each configuration, to minimize energy consumption and improve the performance of embedded systems. The process resembles the theory of thermal expansion and contraction (TTEC), which is the tendency of matter to change in shape, area, and volume (i.e., the size of SPM) in response to a change in temperature (i.e., the workload of the program). Therefore, in this paper, we present a simple and novel data allocation technique, named TTEC, to improve performance and reduce the energy consumption of embedded systems with morphable SPM. The basic idea is first to divide a given program into multiple regions according to the application-specific features of the embedded program. Then, according to the data access information of each region, we dynamically tune the SLC/MLC size configuration of the SPM and produce an optimal data allocation for each program region.
To address this issue, this paper first formulates the data allocation problem as an Integer Linear Programming (ILP) model to obtain the optimal SLC/MLC SPM partition and data allocation. After that, a heuristic algorithm, named Energy-Aware Data Allocation Algorithm (EADA), is proposed to achieve near-optimal results in polynomial time. Finally, our evaluation based on MiBench shows that EADA and ILP improve performance by 10.21% and 12.50% on average, respectively, and decrease energy consumption by 12.8% and 15.3% on average, respectively.
In general, this paper makes the following contributions:
• Combining the benefits of both memory modes in NVM, this paper is the first to propose an SPM with morphable NVM in embedded systems, to improve performance and reduce energy consumption.
• In TTEC, an Integer Linear Programming (ILP) formulation and a heuristic algorithm, named Energy-Aware Data Allocation Algorithm (EADA), are proposed to achieve the optimal and near-optimal SLC/MLC partition in SPM, respectively, together with the corresponding data allocation scheme, for minimizing energy consumption.
• Finally, we develop a simulator with SimpleScalar to evaluate the proposed ILP formulation and the heuristic algorithm against two baseline schemes, and adopt a set of typical embedded benchmarks to conduct a series of experiments.
The rest of this paper is organized as follows. Section II presents the background of NVM and gives the hardware architecture of embedded systems. Then, a simple motivational example is illustrated in Section III. Section IV gives a detailed introduction to the TTEC technique. After that, we evaluate TTEC in Section V. Finally, we conclude this work in Section VI.
II. BACKGROUND
In this section, we first present a brief introduction to morphable non-volatile memories. Then, the hardware architecture used in this paper is illustrated in detail. Finally, we introduce data allocation and define the problem addressed in this paper.
VOLUME 6, 2018
A. MORPHABLE NVMS
New non-volatile memories such as PCM, STT-RAM and ReRAM have made great advances in storage density [22]-[24]. Optimizing the performance and power consumption of embedded systems with these high-density memory technologies is an inevitable trend. Taking PCM as an example, IBM has successfully implemented a reliable triple-level cell in a 4M-cell PCM array (i.e., 3 bits/cell), which is an important milestone in the development of NVMs. Though these memory technologies are built on different materials, all of them share features such as high density, low energy and non-volatility [25]. Moreover, they can be built in MLC mode to achieve higher density and larger capacity, as shown in Fig. 1. According to the mechanism used to store data, NVMs can be classified into resistance-based and magnetization-based memories. First, ReRAM and PCM use the resistance of the memory cell to represent data. The large resistance range can be divided into multiple resistance intervals, which can be exploited to represent multiple bits in one physical cell, as shown in Fig. 1 [16], [25]. Thanks to the large resistance range, a ReRAM cell or a PCM cell can be easily programmed to be either a fast SLC cell or a high-density MLC cell. Second, STT-RAM uses the magnetization direction in the reference layer of each STT-RAM cell to represent data. As shown in Fig. 1(c), an MLC STT-RAM cell has two reference layers, which can hold two bits of data [26], [27]. If one of its reference layers is left unused, it becomes an SLC STT-RAM cell, in which only one bit can be stored. Therefore, regardless of whether the NVM is built on a resistance mechanism or a magnetization mechanism, the NVM cell can be dynamically programmed to SLC mode or MLC mode.
When an NVM cell is programmed to MLC mode, it achieves high density and capacity. Unfortunately, it needs a complex, high-precision design to distinguish multiple resistance intervals or to recognize the magnetization direction in multiple reference layers. As shown in Fig. 1(d), this complex design can lead to high energy, long latency, and reduced lifetime and reliability. However, compared to its SLC counterpart, MLC still has the attractive advantage of high density. Therefore, this paper focuses on an SPM built with morphable NVM and presents an optimization scheme that makes the best of the benefits of both SLC mode and MLC mode.
B. THE HARDWARE ARCHITECTURE OF EMBEDDED SYSTEMS
Fig. 2 shows the hardware architecture of the embedded systems considered in this paper. Compared with traditional embedded systems, the major difference in this model is the adoption of on-chip SPM. Moreover, the on-chip SPM is built with morphable NVM, in which a memory cell can be dynamically switched to SLC mode or MLC mode. In addition, the target embedded systems use DRAM as main memory to store operating data while the program is running, and use NAND Flash as secondary storage to retain program data for a long period of time.
C. DATA ALLOCATION AND PROBLEM DEFINITION
In this paper, we adopt a dynamic data allocation technique, in which the program is divided into multiple program regions, as shown in Fig. 3. A program region is identified by specific features of the given program, such as loop structures and subfunctions [28], [29]. Before each region of the given program, corresponding data allocation code is inserted to set up the optimal SLC/MLC configuration of the SPM and the optimal data allocation scheme. The data allocation code is executed before each run of a region to minimize energy and time cost. Therefore, the purpose of this paper is to obtain the optimal SLC/MLC configuration of the SPM and the corresponding data allocation scheme for each region in the given program, in order to minimize the cost of data access in embedded systems.
FIGURE 3. Data allocation of a given program.
Formally, the problem in this paper can be defined as follows:
Given the access information of each data in program (obtained by profiling), the problem is to find an SLC/MLC partition and data allocation scheme, in order to minimize the energy cost of data access in embedded systems and improve the performance of embedded systems.
Profiling is a technique in which a compiler is used to collect information from a given program. Before presenting the optimization technique, a motivational example is first given to explain its basic idea.
III. MOTIVATIONAL EXAMPLE
In this section, a motivational example is illustrated to explain the basic idea of data allocation in this paper. First, we choose a simple program as our example. Then, different data allocations in the new architecture are provided. Finally, for each kind of data allocation, a detailed analysis is presented to explain the advantage of the technique in this paper.
As shown in Fig. 4(a), we choose a Fibonacci sequence function as our example. Assuming the function computes the Fibonacci number for the value 5 (i.e., n = 5), the write and read counts of the variables in this function can be obtained, as shown in Fig. 4(b). For brevity, we assume that the size of the morphable SPM is three bytes and that all SPM cells are initialized in SLC mode. Besides, the size of DRAM is six bytes, and the size of each variable is one byte. Six variables are accessed in this function: i, a, b, r, c and n. As shown in Fig. 4(b), i, a, b, r and c are written 5, 4, 4, 4 and 1 times, respectively, and i, a, b, r, c and n are read 4, 3, 6, 4, 4 and 3 times, respectively. At first, we assume all variables are stored in DRAM. With WD and RD denoting the energy of one DRAM write and one DRAM read, the total energy cost can be computed as (5 + 4 + 4 + 4 + 1) * WD + (4 + 3 + 6 + 4 + 4 + 3) * RD = 1260. In this paper, SPM is exploited to reduce the energy cost of DRAM accesses, and four data allocation solutions can be obtained, as shown in Fig. 6.
A. THE FIRST SOLUTION
The three variables with the largest access energy cost (i.e., i, b and r) are stored in on-chip SLC SPM, and the remaining variables (i.e., a, c and n) are located in DRAM. Then, the total energy cost can be computed as (5 + 4 + 4) * WS + (4 + 6 + 4) * RS + (4 + 1) * WD + (3 + 4 + 3) * RD + 3 * (D ← S) = 621. Therefore, compared with the original scheme, the total energy cost is reduced by 50.71%.
B. THE SECOND SOLUTION
The two variables with the largest access energy cost (i.e., i and b) are stored in on-chip SLC SPM. Then, we convert one byte of SLC SPM into two bytes of MLC SPM, and store the two remaining variables with the largest access energy cost (i.e., a and r) in on-chip MLC SPM. The other variables are still stored in DRAM. According to this data allocation, the total energy cost can be computed as:
Therefore, the total energy cost is reduced by 55.40%.
C. THE THIRD SOLUTION
Only the variable with the largest access energy cost (i.e., b) is stored in on-chip SLC SPM. Then, we convert two bytes of SLC SPM into four bytes of MLC SPM, and store the four remaining variables with the largest access energy cost (i.e., i, a, r and c) in on-chip MLC SPM. Variable n is still stored in DRAM. According to this data allocation, the total energy cost can be computed as: 4 * WS + 6 * RS + (4 + 4 + 4 + 1) * WM + (3 + 6 + 4 + 4) * RM + 3 * RD + 2 * con + 1 * (D ← S) + 4 * (D ← M) = 569. Therefore, the total energy cost is reduced by 54.84%.
D. THE FOURTH SOLUTION
We convert all of the SLC SPM into MLC SPM and move all six variables into on-chip MLC SPM. Then, the total energy cost can be computed as:
Therefore, the total energy cost is reduced by 49.52%.
In conclusion, by exploiting SPM, the total energy cost is reduced by 52.62% on average, as shown in Table 1. We trade three bytes of SPM for roughly half of the energy cost. Moreover, the second solution achieves the minimal energy cost, which is 9.50%, 1.23% and 11.63% lower than that of the other three solutions, respectively. The example shows that the energy cost depends on both the SLC/MLC partition of the morphable SPM and the data allocation of the variables in the program. Therefore, to minimize the energy cost of data access in embedded systems, Section IV-A first formulates this problem as an integer linear programming (ILP) formulation to obtain an optimal SLC/MLC partition of the morphable SPM and an optimal data allocation scheme. In addition, a polynomial-time algorithm, named Energy-Aware Data Allocation Algorithm (EADA), is proposed to achieve a near-optimal result.
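The percentages quoted in this example can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the totals 1260, 621 and 569 are taken from the text, while the totals of the second and fourth solutions are not stated explicitly and are back-computed here from the quoted reductions (55.40% and 49.52%). The helper names `reduction` and `recover_total` are hypothetical.

```python
BASELINE = 1260  # energy cost with all variables in DRAM (stated in the text)


def reduction(total, base=BASELINE):
    """Percentage reduction of `total` relative to `base`, rounded to two decimals."""
    return round(100 * (1 - total / base), 2)


def recover_total(percent, base=BASELINE):
    """Integer totals whose rounded reduction equals the quoted `percent`.

    Assumption: the missing totals are integers, like the stated ones.
    """
    return [t for t in range(base + 1) if reduction(t, base) == percent]


stated_totals = {"first": 621, "third": 569}
quoted_reductions = {"first": 50.71, "second": 55.40, "third": 54.84, "fourth": 49.52}

for name, total in stated_totals.items():
    print(name, total, reduction(total))      # should reproduce the quoted percentage
for name in ("second", "fourth"):
    print(name, recover_total(quoted_reductions[name]))
```

Under the integer assumption, the only totals consistent with the quoted reductions are 562 for the second solution and 636 for the fourth, and the mean of the four quoted reductions is 52.62% to two decimals.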
IV. TTEC: DATA ALLOCATION OPTIMIZATION FOR MORPHABLE SCRATCHPAD MEMORY IN EMBEDDED SYSTEMS
In this section, considering the benefits of both the SLC mode and the MLC mode of NVM, we present the data allocation strategy, named TTEC, to reduce energy consumption and enhance the performance of embedded systems, as shown in Fig. 7. By exploiting the features of morphable scratchpad memory, we dynamically tune its SLC/MLC configuration according to the workload of a given program, and allocate the optimal address space to each variable in the program, in order to minimize energy consumption and improve performance. We first build an Integer Linear Programming (ILP) formulation to generate the optimal configuration of the morphable scratchpad memory and the data allocation scheme. Because the ILP problem is NP-complete, this paper then presents a heuristic algorithm to achieve a near-optimal result.
A. INTEGER LINEAR PROGRAMMING FORMULATION
In this section, we formulate the data allocation problem in order to determine both the optimal SLC/MLC SPM partition and the data allocation. First, we model the sizes of SLC SPM and MLC SPM in the morphable SPM. Then, data allocation among SLC SPM, MLC SPM and DRAM is modeled. After that, the energy overhead of movements is modeled. Finally, we give the objective function.
Before presenting the detail of the model, we first give the definition of the related notations, as shown in Table 2 .
1) MORPHABLE SPM
In the data allocation strategy, the SLC/MLC configuration of the morphable SPM can be dynamically tuned between program regions. Therefore, the SLC/MLC configuration of one program region is independent of the others. Nevertheless, the total number of SLC cells and MLC cells in the SPM must be equal to the total number of cells in the SPM.
2) THE DATA ALLOCATION
Then, binary variables are adopted to model the data allocation among SLC SPM, MLC SPM and DRAM. Each data item in each region must be stored either in on-chip memory or in main memory. Therefore, the relationship can be modeled as follows. After that, the total size of the data in each type of memory cannot exceed the physical size of the SLC and MLC parts of the morphable SPM.
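The partition, assignment and capacity constraints described above can be sketched as follows. This is an illustrative reading, not the paper's exact formulation: the symbols $C^{slc}_{h}$ and $C^{mlc}_{h}$ (number of cells in each mode in region $h$), the binary indicators $x^{ss}_{h,i}$, $x^{sm}_{h,i}$, $x^{d}_{h,i}$ (variable $i$ is in SLC SPM, MLC SPM or DRAM in region $h$) and the density factor $b$ are assumed names; $b = 2$ matches the motivational example, where one byte of SLC becomes two bytes of MLC, and $size(i)$ is assumed to be measured in SLC-cell units.

```latex
\begin{align*}
C^{slc}_{h} + C^{mlc}_{h} &= N_{spm} && \forall h \\
x^{ss}_{h,i} + x^{sm}_{h,i} + x^{d}_{h,i} &= 1 && \forall h,\, i \\
\sum_{i} size(i)\, x^{ss}_{h,i} &\le C^{slc}_{h} && \forall h \\
\sum_{i} size(i)\, x^{sm}_{h,i} &\le b \, C^{mlc}_{h} && \forall h
\end{align*}
```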
3) THE MOVEMENTS AMONG SLC SPM, MLC SPM AND DRAM
Data allocation needs to dynamically move the variables of the given program between locations. Between program regions, the energy cost of the movements among SLC SPM, MLC SPM and DRAM is non-negligible. To make the data allocation effective, we model the movements among the memories. According to ILP modeling theory, given binary variables x_1, x_2, ..., x_n and y, the relation "y = 1 if and only if x_1, x_2, ..., x_n are all 1" can be modeled as follows:
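A standard way to linearize this relation, which may differ in form from the one used in the paper, is $y \le x_k$ for every $k$ together with $y \ge \sum_k x_k - (n - 1)$. The short sketch below brute-forces all binary assignments to confirm that these constraints admit exactly $y = x_1 \wedge \dots \wedge x_n$; the function names are hypothetical. For a movement such as "variable i is moved from SLC to MLC", $n = 2$: one indicator says the variable was in SLC SPM in region $h-1$ and the other says it is in MLC SPM in region $h$.

```python
from itertools import product


def feasible(xs, y):
    """Check the linear constraints for one binary assignment."""
    n = len(xs)
    upper = all(y <= x for x in xs)      # y <= x_k for every k
    lower = y >= sum(xs) - (n - 1)       # y >= sum_k x_k - (n - 1)
    return upper and lower


def linearization_is_exact(n):
    """True if, for every x in {0,1}^n, the only feasible y is AND(x)."""
    for xs in product((0, 1), repeat=n):
        allowed = [y for y in (0, 1) if feasible(xs, y)]
        if allowed != [int(all(xs))]:
            return False
    return True


for n in (1, 2, 3, 4):
    print(n, linearization_is_exact(n))
```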
Using this relation, the six possible movements of variable i are modeled in turn:
• "variable i is moved from SLC to MLC": Equation (8);
• "variable i is moved from MLC to SLC": Equation (10);
• "variable i is moved from DRAM to SLC": Equation (12);
• "variable i is moved from DRAM to MLC": Equation (14);
• "variable i is moved from SLC to DRAM": Equation (16);
• "variable i is moved from MLC to DRAM": Equation (18).
4) THE ENERGY COST
In this subsection, we formulate the energy cost of the data allocation process. The energy cost can be divided into three major parts: the energy cost of data access, of movements, and of mode conversion in the morphable SPM. First, the energy of data access of the given program can be modeled as Equation (19). Then, the total energy of movements among SLC SPM, MLC SPM and DRAM can be modeled as:
Finally, the total energy overhead of SLC/MLC mode conversion in morphable SPM can be modeled as:
5) THE OBJECTIVE FUNCTION
With the objective function, we want to minimize the total energy cost of the data allocation. Our goal is to find an optimal SLC/MLC configuration of the morphable SPM and an optimal data allocation among SLC SPM, MLC SPM and DRAM, so as to minimize the total energy cost of data access, movements and SLC/MLC mode conversion. The objective function can be modeled as follows.
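The structure of this objective can be sketched as below. The access-energy coefficients $e_{(d,h,i)}$, $e_{(ss,h,i)}$, $e_{(sm,h,i)}$, the per-unit conversion energy $e_c$ and the region sizes $SS(h)$ follow the notation used by the algorithms in Section IV-B; the binary indicators $x^{d}_{h,i}$, $x^{ss}_{h,i}$, $x^{sm}_{h,i}$ are assumed names, and $E_{move}$ is left abstract (the sum of per-movement energies weighted by the movement indicators). This is a sketch under those assumptions rather than the paper's exact equations.

```latex
\begin{align*}
E_{access} &= \sum_{h}\sum_{i}\left( e_{(d,h,i)}\, x^{d}_{h,i} + e_{(ss,h,i)}\, x^{ss}_{h,i} + e_{(sm,h,i)}\, x^{sm}_{h,i} \right) \\
E_{conv} &= \sum_{h} e_c \,\lvert SS(h) - SS(h-1) \rvert \\
\min \; E_{total} &= E_{access} + E_{move} + E_{conv}
\end{align*}
```

In an actual ILP the absolute value would itself need to be linearized, for example with one auxiliary variable per region bounded below by both $SS(h) - SS(h-1)$ and $SS(h-1) - SS(h)$.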
By exploiting the ILP model, we can obtain the optimal SLC/MLC SPM partition and data allocation that minimize the energy cost in embedded systems. Nevertheless, ILP problems are NP-complete. Therefore, in the next section, we further propose a polynomial-time algorithm, named Energy-Aware Data Allocation Algorithm (EADA), to achieve a near-optimal result.
B. ENERGY-AWARE DATA ALLOCATION ALGORITHM
In this section, we present a polynomial-time algorithm, named Energy-Aware Data Allocation Algorithm (EADA), to achieve a near-optimal result. For ease of illustration, we divide the algorithm into two major parts. In the first part, we present the overall workflow of EADA for a given program. Then, for a single region of the program, we further propose a polynomial-time data allocation algorithm, named Optimizing Region Data Allocation Algorithm (ORDA), which minimizes the energy cost of the region under a specific SLC/MLC configuration of the morphable SPM. The two algorithms are described in detail as follows.
1) THE MAJOR WORK FLOWS OF EADA
First, given the access information of each variable in each program region, we can obtain the access energy of each variable when it is allocated to DRAM, SLC SPM or MLC SPM ($e_{(d,h,i)}$, $e_{(ss,h,i)}$, $e_{(sm,h,i)}$). Then, considering the benefits of both SLC SPM and MLC SPM in the morphable SPM, we present an energy-aware data allocation algorithm to minimize the total energy overhead, as shown in Algorithm 1.
In Algorithm 1, we first build an array to record the size of SLC SPM in region h (SS(h)) and an array to record the size of MLC SPM in region h (SM(h)). Then, we initialize the sizes of SLC SPM and MLC SPM before the program runs. Next, an array (R(h,i)) is built to record the allocation of each variable in each region. Assuming all variables are located in DRAM in the initial stage, we set all initial values of R(h,i) to 0. If a variable is allocated to SLC SPM, we set R(h,i) = 1, and if a variable is allocated to MLC SPM, we set R(h,i) = 2.
After that, for the different SLC/MLC partitions of the morphable SPM, we seek the optimal data allocation and the optimal SLC/MLC configuration of each region in turn (lines 6-15). For a given region, each size of SLC SPM determines the corresponding size of MLC SPM. Given these predetermined sizes, the optimizing region data allocation algorithm (Algorithm 2) is used to obtain the allocation that minimizes the energy cost of this region ($E_{region}$). A variable (optimal_energy) maintains the minimal energy overhead of the region over the different SLC/MLC divisions. The value of optimal_energy is updated whenever the energy cost obtained from Algorithm 2 is lower than optimal_energy. Meanwhile, the SLC/MLC size configuration and the corresponding data allocation are recorded. Finally, we return the optimal SLC/MLC size configuration and the optimal data allocation once all program regions have reached their minimal energy cost.
Algorithm 1 Energy-Aware Data Allocation Algorithm (EADA).
Input: the total number of regions in the given program (m), the total number of variables (n), the size of variable i (size(i)), the access energy of variable i in region h if variable i is allocated to DRAM, SLC SPM or MLC SPM ($e_{(d,h,i)}$, $e_{(ss,h,i)}$, $e_{(sm,h,i)}$), the total number of SPM cells ($N_{spm}$).
Output: an optimal SLC/MLC configuration of the morphable SPM and an optimal data allocation when the total energy overhead of the given program is minimized.
8: for SS ← $N_{spm}$ to 0 do
10: Call Algorithm 2, obtain the allocation and the total energy cost of region h ($E_{region}$) when the energy cost is minimized;
11: if optimal_energy > $E_{region}$ then
12: optimal_energy ← $E_{region}$, SS(h) ← SS, record the data allocation of region h;
13: end if
14: end for
15: end for
16: return an optimal SLC/MLC configuration of the morphable SPM and the optimal data allocation.
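The outer loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `eada`, the callable `solve_region` (standing in for Algorithm 2) and its signature are hypothetical, the SPM is assumed to start in all-SLC mode as in the motivational example, and an MLC cell is assumed to hold twice as much as an SLC cell.

```python
from math import inf


def eada(num_regions, n_spm, solve_region):
    """Pick, region by region, the SLC size that minimizes the region energy.

    solve_region(h, ss, sm, prev_ss, prev_alloc) -> (energy, allocation) is a
    hypothetical stand-in for Algorithm 2 (ORDA).
    Returns one (energy, ss, sm, allocation) tuple per region.
    """
    prev_ss, prev_alloc = n_spm, None        # assumption: all cells start in SLC mode
    plan = []
    for h in range(num_regions):
        optimal_energy, best = inf, None
        for ss in range(n_spm, -1, -1):      # for SS <- N_spm down to 0
            sm = 2 * (n_spm - ss)            # assumption: MLC doubles the capacity
            energy, alloc = solve_region(h, ss, sm, prev_ss, prev_alloc)
            if optimal_energy > energy:      # keep the cheapest partition so far
                optimal_energy, best = energy, (ss, sm, alloc)
        plan.append((optimal_energy, *best))
        prev_ss, prev_alloc = best[0], best[2]
    return plan


# Toy usage: region energies are looked up from a table indexed by [region][ss].
toy_energy = [[9, 4, 7, 8], [3, 5, 6, 2]]


def toy_solver(h, ss, sm, prev_ss, prev_alloc):
    return toy_energy[h][ss], []


print(eada(2, 3, toy_solver))
```

Because each region is settled before the next one is examined, the choice is greedy across regions, which is one reason the result is near-optimal rather than optimal.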
2) OPTIMIZING REGION DATA ALLOCATION ALGORITHM
In this section, given the predetermined sizes of SLC and MLC SPM, we introduce the process of finding the optimal data allocation for a given region. The basic idea is to use dynamic programming to compute the best data allocation, in order to minimize the energy cost of the region. As shown in Algorithm 2, the process can be divided into three steps. The first step is to calculate the conversion energy of the morphable SPM in region h relative to the size of SLC SPM in region (h-1). Then, we build an array E[i, ss, sm] to record the energy overhead when the free sizes of SLC SPM and MLC SPM are ss and sm, respectively. Assuming all variables are in DRAM, the initial value of E[i, ss, sm] is equal to the sum of the access energy in DRAM and the conversion energy.

Algorithm 2 Optimizing Region Data Allocation Algorithm (ORDA).
Input: the total number of variables (n), the size of variable i (size(i)), the access energy of variable i in region h if variable i is allocated to DRAM, SLC SPM or MLC SPM ($e_{(d,h,i)}$, $e_{(ss,h,i)}$, $e_{(sm,h,i)}$), the size of SLC SPM in region h (SS(h)), the predetermined sizes of SLC/MLC SPM (SS, SM), the data allocation of variable i in region h (R(h,i)).
Output: a data allocation for which the total energy overhead of the given region is minimized, and the minimal energy overhead.
1: con_energy ← $\lvert SS(h-1) - SS \rvert \times e_c$;
2: Build an array E[i, SS, SM]; assuming all data is first stored in DRAM, initialize E[i, ss, sm] to the sum of the DRAM access energy and con_energy;
3: Build an array to record the data allocation of each variable in region h (r(i));
4: for i ← 1 to n do
5: for ss ← SS to 0 do
6: for sm ← SM to 0 do
7: According to R(h-1,i), compute the movement energy of variable i if the variable is allocated to DRAM ($ed_m$), SLC SPM ($es_m$) or MLC SPM ($em_m$);
8: Assuming data i is allocated to DRAM, the optimal energy cost: $E1 = E[i-1,\, ss,\, sm] + e_{(d,h,i)} + ed_m$;
9: Assuming data i is allocated to SLC SPM, the optimal energy cost: $E2 = E[i-1,\, ss - size(i),\, sm] + e_{(ss,h,i)} + es_m$;
10: Assuming data i is allocated to MLC SPM, the optimal energy cost: $E3 = E[i-1,\, ss,\, sm - size(i)] + e_{(sm,h,i)} + em_m$;
11: E[i, ss, sm] ← min(E1, E2, E3)
12: if E[i, ss, sm] = E1 then
13: r(i) = 0, variable i is allocated to DRAM;
14: else if E[i, ss, sm] = E2 then
15: r(i) = 1, variable i is allocated to SLC SPM;
16: else
17: r(i) = 2, variable i is allocated to MLC SPM;
18: end if
19: end for
20: end for
21: end for
22: return the minimal energy overhead and the array (r(i)).

The second step is to build the recursive relation for the data allocation, as shown in Equation 23.

$$E[i, ss, sm] = \min \begin{cases} E[i-1,\, ss,\, sm] + e_{(d,h,i)} + ed_m \\ E[i-1,\, ss - size(i),\, sm] + e_{(ss,h,i)} + es_m \\ E[i-1,\, ss,\, sm - size(i)] + e_{(sm,h,i)} + em_m \end{cases} \tag{23}$$

First, according to the data allocation in region (h-1) (R(h-1,i)), we compute the movement energy of variable i for each candidate location (line 7). Then, three cases are considered (lines 8-10): variable i is allocated to DRAM, to SLC SPM, or to MLC SPM. In each case, the energy cost is the sum of the optimal cost of the corresponding subproblem, the access energy of variable i in that memory (for the third case, the access energy of variable i in MLC SPM) and the energy cost of movement. Finally, we compare the energy overhead of the three cases and choose the minimal one as our solution (line 11). Therefore, using the recursive relation of the data allocation, we can obtain the optimal solution by combining the solutions of subproblems.
In the third step, we adopt an array (r(i)) to record the data allocation of each variable in region h for which the energy overhead is minimal (lines 12-18). If the energy overhead is minimal when variable i is allocated to DRAM, we set r(i) = 0; if it is minimal when variable i is allocated to SLC SPM, we set r(i) = 1; if it is minimal when variable i is allocated to MLC SPM, we set r(i) = 2. Finally, we return the optimal data allocation (r(i)) and the minimal energy overhead.
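The region-level dynamic program can be sketched in Python as follows. This is a hedged illustration of a three-way knapsack recurrence of the kind described above, not the authors' code: the function name `orda`, the movement-energy matrix `move[a][b]` (energy of moving one variable from location a to location b) and the explicit capacity checks are assumptions. The sketch seeds the table with the conversion energy and adds the DRAM access energy through the recurrence, and it recovers the allocation by backtracking through a `choice` table rather than overwriting r(i) inside the loops.

```python
DRAM, SLC, MLC = 0, 1, 2  # same encoding as R(h,i) and r(i)


def orda(size, e_d, e_ss, e_sm, prev, SS, SM, move, con_energy=0):
    """Return (minimal energy, allocation) for one region with SS/SM units free.

    size[i]: size of variable i; e_d/e_ss/e_sm[i]: access energy of variable i
    in DRAM / SLC SPM / MLC SPM; prev[i]: location of i in the previous region.
    """
    n = len(size)
    # E[i][ss][sm]: best energy for the first i variables with ss SLC and sm MLC units
    E = [[[con_energy] * (SM + 1) for _ in range(SS + 1)] for _ in range(n + 1)]
    choice = [[[DRAM] * (SM + 1) for _ in range(SS + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        v = i - 1
        for ss in range(SS + 1):
            for sm in range(SM + 1):
                best = E[i - 1][ss][sm] + e_d[v] + move[prev[v]][DRAM]
                where = DRAM
                if size[v] <= ss:  # variable fits in the remaining SLC space
                    e2 = E[i - 1][ss - size[v]][sm] + e_ss[v] + move[prev[v]][SLC]
                    if e2 < best:
                        best, where = e2, SLC
                if size[v] <= sm:  # variable fits in the remaining MLC space
                    e3 = E[i - 1][ss][sm - size[v]] + e_sm[v] + move[prev[v]][MLC]
                    if e3 < best:
                        best, where = e3, MLC
                E[i][ss][sm] = best
                choice[i][ss][sm] = where
    # Backtrack from the full capacities to recover a consistent allocation.
    r, ss, sm = [DRAM] * n, SS, SM
    for i in range(n, 0, -1):
        r[i - 1] = choice[i][ss][sm]
        if r[i - 1] == SLC:
            ss -= size[i - 1]
        elif r[i - 1] == MLC:
            sm -= size[i - 1]
    return E[n][SS][SM], r


# Toy usage with made-up energies: three one-byte variables, 1 SLC and 1 MLC byte.
no_move = [[0] * 3 for _ in range(3)]
print(orda([1, 1, 1], [30, 20, 10], [5, 4, 3], [10, 8, 6], [0, 0, 0], 1, 1, no_move))
```

With these toy numbers the first variable goes to SLC SPM, the second to MLC SPM and the third stays in DRAM, for a total energy of 23.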
The time complexity of the EADA algorithm is $O(m \times s^2 \times n)$, where m is the number of regions in the given program, s is the size of the morphable SPM, and n is the number of variables in the given program.
V. EXPERIMENTS
In this section, we conduct a series of experiments for evaluating the effectiveness of our proposed technique in this paper. First, the experimental setup is illustrated in detail. Then, we present the related schemes for comparison. Finally, we give the experimental results and the corresponding analyses.
A. EXPERIMENT SETUP
We conduct our experiments with the SimpleScalar simulator, on which we implement the target systems of this paper. We choose morphable NVM as the material of the on-chip SPM. CACTI and NVSim, which are commonly adopted to obtain parameters of typical NVMs, are used, and these parameters are integrated into the simulator. The experimental setup of the target systems follows previous work, especially for the cost of data conversion, and the detailed information is shown in Table 3 [10], [21], [28], [30].
After that, we choose benchmarks from MiBench, a representative embedded benchmark suite [31]. Eleven benchmarks, including basicmath, bitcount, crc32, dijkstra, FFT, patricia, qsort, rsynth, sha, stringsearch and susan, have been chosen; they cover the Automotive, Network, Office Automation and Security categories. According to the features of each program, we divide it into multiple program regions. Then, we run these benchmarks on the simulation platform to collect the memory trace of each data item in each program region.
As shown in Fig. 8, we feed the access information into the TTEC strategy, which includes the ILP model and the EADA algorithm, in order to obtain the optimal and near-optimal SLC/MLC SPM configurations and the corresponding data allocations. We then recompile the program and insert the instructions that perform the data allocation and set the SLC/MLC size configuration of the SPM. Finally, we execute the modified program on the simulation platform to collect the experimental results.
B. SCHEME FOR COMPARISON AND METRICS
In this paper, we compare our results with two different schemes: 16KB-SLC SPM and 32KB-MLC SPM. In 16KB-SLC SPM, the on-chip SPM consists of pure-SLC NVM, and the data allocation technique of [30] is adopted. In 32KB-MLC SPM, the on-chip SPM is built with pure-MLC NVM, and the same allocation technique as in 16KB-SLC SPM is adopted. In addition, we collect the following performance metrics to evaluate the effectiveness of our schemes:
Energy consumption of each program run: According to the access information of the given program, we use the TTEC strategy to obtain the optimal data allocation and insert the corresponding data allocation instructions into the program. Then, we run the modified program to collect the energy consumption. Finally, we compare the energy consumption of TTEC with that of 16KB-SLC SPM and 32KB-MLC SPM, and give a detailed analysis.
Running time of each program run: To evaluate the effectiveness of the proposed techniques, we collect the running time of each program run under 16KB-SLC SPM, 32KB-MLC SPM, TTEC-ILP and TTEC-EADA, and present a detailed analysis.
Overhead: As shown in Section IV-A, time and energy overhead may be caused by data movements among the SLC SPM, the MLC SPM, and DRAM. In addition, the cost of conversion in the morphable SPM cannot be neglected. Therefore, we collect the time and energy overhead of the proposed techniques and discuss it.
C. RESULTS OF EXPERIMENTS AND ANALYSIS
Table 4 shows the running time of each program run with 16KB-SLC SPM, 32KB-MLC SPM, TTEC-ILP, and TTEC-EADA. Compared with the 16KB-SLC SPM, the 32KB-MLC SPM reduces the running time by 4.85% on average. However, due to the high latency of MLC NVM, the running time of some benchmarks, such as bitcount, FFT, and blowfish, is not reduced. In this paper, we propose the SPM with morphable NVM. Under the same area overhead, the morphable SPM with the EADA scheme reduces the running time by 10.21% and 5.61% compared with the 16KB-SLC SPM and the 32KB-MLC SPM, respectively. Meanwhile, the morphable SPM with ILP reduces the running time by 12.50% and 7.45%, respectively. Because ILP is an NP-complete problem, the ILP solving tool (i.e., Lingo) cannot reach the final result for some benchmarks. For these benchmarks, such as dijkstra, FFT, and susan, we run the problem for 12 hours on the tool and report the best solution found, marked with an asterisk (*).
Table 5 presents the energy consumption of each program run with 16KB-SLC SPM, 32KB-MLC SPM, TTEC-ILP, and TTEC-EADA. Compared with the 16KB-SLC SPM, the 32KB-MLC SPM reduces the total energy consumption by 5.89% on average. As with the running time, the energy consumption of some benchmarks, such as bitcount and blowfish, increases because of the high energy consumption of MLC NVM. Compared with the previous two schemes, the 16KB morphable SPM with EADA saves 12.8% and 7.3% of the energy consumption on average, respectively. In addition, the morphable SPM with ILP cuts the energy consumption by 15.30% and 10.13%, respectively. For benchmarks with heavy read operations, such as bitcount, CRC32, and rijndael, the energy reduction is more pronounced when we adopt the MLC SPM or the morphable SPM.
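The average reductions quoted for running time and energy can be read as relative improvements over a baseline scheme. The sketch below shows one way such a figure might be computed, assuming the average is the arithmetic mean of per-benchmark reductions; the text does not state whether the authors average per benchmark or compare totals. The benchmark names and values are hypothetical placeholders, not data from Table 4 or Table 5.

```python
def percent_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent.

    A negative result means the scheme is worse than the baseline, as
    happens for benchmarks whose running time or energy increases.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - value) / baseline


def average_reduction(baseline_by_bench, scheme_by_bench):
    """Arithmetic mean of per-benchmark reductions (an assumed convention)."""
    reductions = [
        percent_reduction(baseline_by_bench[name], scheme_by_bench[name])
        for name in baseline_by_bench
    ]
    return sum(reductions) / len(reductions)


if __name__ == "__main__":
    # Hypothetical measurements in arbitrary units, for illustration only.
    baseline = {"bench_a": 100.0, "bench_b": 200.0}
    scheme = {"bench_a": 90.0, "bench_b": 150.0}
    print(average_reduction(baseline, scheme))  # (10 + 25) / 2 = 17.5
```

Averaging per-benchmark percentages weights every benchmark equally, whereas comparing summed totals would let long-running or energy-hungry benchmarks dominate; the two conventions can give noticeably different averages.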
Finally, to evaluate the effectiveness of TTEC, we collect the time overhead and the energy overhead of data allocation for EADA and ILP in TTEC. The overhead includes two major parts during a program run: 1) data movements among the SLC/MLC SPM and DRAM, and 2) the conversion overhead of the morphable SLC/MLC SPM. The time overhead is shown in Fig. 9: EADA and ILP in TTEC incur an average time overhead of 2.74% and 3.74%, respectively. Meanwhile, as shown in Fig. 10, they incur an average energy overhead of 2.15% and 2.28%, respectively. Compared with the performance improvement, the time and energy overheads are negligible. In summary, TTEC effectively enhances the performance and reduces the energy consumption of embedded systems.
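As a small illustration of how the two overhead components combine, the sketch below expresses the overhead as a share of a program run's total cost. It assumes the reported percentages are the sum of movement and conversion cost divided by the total running time (or total energy) of the run; the paper does not spell out the denominator. The function name and the sample numbers are hypothetical.

```python
def overhead_percent(movement_cost, conversion_cost, total_cost):
    """Share of a run's total cost spent on allocation overhead, in percent.

    movement_cost:   cost of moving data among SLC SPM, MLC SPM and DRAM
    conversion_cost: cost of switching morphable cells between SLC and MLC
    total_cost:      total running time or energy of the run (assumed to
                     already include both overhead components)
    """
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return 100.0 * (movement_cost + conversion_cost) / total_cost


if __name__ == "__main__":
    # Hypothetical values in arbitrary units, for illustration only.
    print(overhead_percent(movement_cost=2.0, conversion_cost=1.0, total_cost=100.0))
```

The same function applies to time and to energy, since both overheads are reported as percentages of the corresponding total.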
VI. CONCLUSION
In this paper, we have proposed the TTEC scheme, which adopts morphable NVM as the SPM in embedded systems to replace high-energy SRAM. The morphable NVM can be dynamically programmed to SLC or MLC mode. Compared with the low-latency SLC mode, the MLC mode has higher density, which increases the capacity of the SPM. To exploit the benefits of both modes in the morphable SPM, an ILP model is first built to obtain the optimal SPM partition and data allocation scheme, in order to minimize the energy consumption and the access latency of embedded systems. Then, a corresponding heuristic algorithm (EADA) is presented to achieve near-optimal results in polynomial time. According to the experimental results, EADA and ILP save an average of 12.8% and 15.3% of the energy consumption compared with the baseline scheme, and reduce the running time of each benchmark run by 10.21% and 12.50%, respectively. Given the features of morphable NVM, we expect this work to be a first step toward reducing the energy consumption of SPM in embedded systems. Meanwhile, exploring these emerging media will raise further issues for software systems.
