Abstract: Resource scheduling is one of the most important issues in mobile cloud computing due to the constraints in memory, CPU, and bandwidth. High energy consumption and low performance of memory accesses have become overwhelming obstacles for chip multiprocessor (CMP) systems used in cloud systems. To address the daunting "memory wall" problem, hybrid on-chip memory architectures have been widely investigated recently. Owing to its advantages in size, real-time predictability, power, and software controllability, scratchpad memory (SPM) is a promising technique to replace the hardware cache and bridge the processor-memory gap for CMP systems. In this paper, we present a novel hybrid on-chip SPM for CMP systems that consists of a static random access memory (RAM), a magnetic RAM (MRAM), and a zero-capacitor RAM and fully exploits the benefits of each type of memory. To reduce memory access latency, energy consumption, and the number of write operations to MRAM, we also propose a novel multidimensional dynamic programming data allocation (MDPDA) algorithm to strategically allocate data blocks to each memory. Experimental results show that the proposed MDPDA algorithm can efficiently reduce the memory access cost and extend the lifetime of MRAM.
systems to reduce energy consumption and speed up the system. We have two observations: First, there is a tradeoff between energy consumption and execution time in implementing tasks. Second, the gap between processor and memory has become wider and wider. Therefore, how to develop new energy-aware architecture, algorithms, and techniques is the major issue in this paper, which is urgently needed to contribute to the sustainable "Green" world for the new century.
Low power consumption and short-latency memory access are critical to the performance of CMP computing systems. Therefore, continuous development of current CMP systems is substantially hindered by the daunting memory wall and power wall issues. To bridge the ever-widening processor-memory speed gap, traditional computing systems widely adopted hardware caches. Caches, benefitting from temporal and spatial locality, have effectively facilitated the layered memory hierarchy. Nonetheless, caches also present notorious problems to CMP systems, such as lack of hard guarantee of predictability and high penalties in cache misses. For example, caches consume up to 43% of the overall power in the ARM920T processor [1] .
Hence, it is critical to develop alternative power-efficient techniques to replace the current hardware-managed cache memory. Scratchpad memory (SPM), which is a software-controlled on-chip memory, has been widely employed by key manufacturers because of two major advantages over the cache memory. First, SPM does not have the comparator and tag static random access memory (SRAM), since it is accessed by direct addressing. Therefore, the complex decode operations that a cache performs to support runtime address mapping for references are not needed. This property of SPM can save a large amount of energy. It has been shown that an SPM occupies 34% less chip area and consumes 40% less energy than a cache memory does [2] . Second, SPM generally guarantees single-cycle access latency, whereas accesses to cache may suffer from capacity, compulsory, and conflict misses that incur very long latency [3] . Given the advantages in size, power consumption, and predictability, SPM has been widely used in CMP systems, such as Motorola M-core MMC221, IBM CELL [4] , TI TMS370CX7X, and NVIDIA G80. Because SPM is managed by software, the most critical task is to manage the SPM and perform data allocation with the help of compilers.
Conventionally, SPMs are configured from small and fast SRAMs. SRAM is superior in access latency but inferior in cost and leakage power. Magnetic RAM (MRAM) has attracted wide interest for its appealing characteristics, such as high density, fast access speed, and nonvolatility [5] [6] [7] [8] . However, the long latency and high energy of its write operations, as well as its prohibitive cost, have impeded the extensive use of MRAM. A new memory technology, i.e., zero-capacitor RAM (Z-RAM), can overcome the high cost of SRAM and MRAM, with virtually no performance degradation. In addition, Z-RAM is manufactured with only one transistor, instead of six transistors as in SRAM. Therefore, Z-RAM affords much higher density with extremely low leakage power. In this paper, we propose a hybrid SPM on-chip memory architecture that incorporates SRAM, MRAM, and Z-RAM to enhance the overall performance of memory systems. Prior works have investigated hybrid caches that use SRAM and MRAM and have shown that a hybrid memory architecture can save a significant amount of energy [9] , [10] .
A challenging problem that a hybrid SPM architecture must resolve is how to reduce energy consumption, memory access latency, and the number of write operations to MRAM. To take advantage of the benefits of each type of memory, we must strategically allocate data to each memory module so that the total memory access cost is minimized. Recall that SPMs are software controllable, which means the data they hold can be managed by programmers or compilers. Traditional hybrid memory data management strategies, such as data placement and migration [9] , [11] [12] [13] , are unsuitable for hybrid SPMs, since they are mainly designed for hardware caches and are unaware of write activities. Fortunately, embedded system applications can take full advantage of compiler-analyzable data access patterns, which enable efficient data allocation mechanisms for a hybrid SPM architecture [14] .
In this paper, we propose a novel hybrid on-chip SPM architecture comprising SRAM, MRAM, and Z-RAM. Based on the hybrid SPM, we propose a multidimensional dynamic programming data allocation (MDPDA) strategy to efficiently allocate data on each memory unit in polynomial time. To the best of our knowledge, this is the first paper to address the data allocation issue for CMP systems with hybrid SPMs comprising three types of memory modules.
The remainder of this paper is organized as follows. Section II gives an overview of related work on data allocation for SPM. Section III presents the system model. Section IV describes a motivational example. Detailed algorithms are presented in Section V. Experimental results are given in Section VI. Section VII concludes this paper.
II. RELATED WORK
Hybrid memory architectures for processor cores and cache systems design have drawn much attention recently. Qiu and Sha designed several novel algorithms for heterogeneous computing systems [15] and cloud systems [16] . In [9] and [17] , Sun et al. and Mishra explored the performance of MRAM and confirmed its potential use in cache due to advantages in access latency and power consumption. In [18] , Saripalli investigated the advantages of heterogeneous technologies for processor cores. They discussed the integration of tunnel FET and MRAM in a cache together with an SRAM. These papers provide the foundations of this paper, but they do not explore the details of hybrid memory. Udayakumaran and Barua [12] proposed a heuristic algorithm to allocate data for an SPM, mainly considering stack and global variables. Udayakumaran et al. [19] applied a dynamic data allocation method to heap data for embedded systems with SPMs. Three types of program objects are considered in their allocation method: global variables, stack variables, and program code. They divided a program into multiple regions, where each program region is associated with a time stamp. Based on time stamps, they then utilized a heuristic algorithm to determine the data allocation for each program region. These papers address data allocation for SPMs. However, their heuristic results are not as efficient as those of our proposed dynamic programming approach.
III. DEFINITIONS AND MODELS

A. System Model
Table I lists the acronyms used in this paper, and Fig. 1 illustrates the architecture of a target CMP system with hybrid SPMs. Each core is tightly coupled with an on-chip SPM, which is composed of an SRAM, an MRAM, and a Z-RAM. An access by a core to its own SPM is called a local access, whereas an access to an SPM held by another core is called a remote access. Generally, remote access is supported by an on-chip interconnect. All cores access off-chip main memory (usually a DRAM device) through a shared bus. The CELL processor [20] is an example that adopts this architecture. In a CELL processor, a multichannel ring structure allows communication between any two cores without intervention from other cores. Consequently, we can safely assume that the data transfer cost between cores is constant. Generally, accessing the local SPM is faster and dissipates less energy than fetching data from a remote SPM, whereas accessing the off-chip main memory incurs the longest latency and consumes the most energy.
When a core reads data from a specific memory module, it may encounter a miss. In this case, we need to move the data from the memory unit holding these data. Inevitably, this movement will incur much higher overhead, since it needs to access a remote SPM or the main memory. In this case, the data transfer overhead is composed of two major parts: reading the memory module of a remote SPM or main memory that holds the data and writing the data to the target memory module.
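The two-part data transfer overhead described above can be sketched as follows. This is a minimal illustration, not the paper's cost model: the `Module` class, the `move_cost` helper, and all numeric values are hypothetical placeholders rather than entries from the paper's tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """Hypothetical per-access costs of one memory module (arbitrary units)."""
    read_cost: float
    write_cost: float


def move_cost(source: Module, target: Module) -> float:
    """Cost of moving one block: read it from the source, then write it to the target."""
    return source.read_cost + target.write_cost


# Placeholder numbers chosen only for illustration.
dram = Module(read_cost=100.0, write_cost=100.0)
local_mram = Module(read_cost=2.0, write_cost=10.0)

print(move_cost(dram, local_mram))  # 110.0
```

Under these placeholder values, the DRAM read dominates the move, which matches the observation that a miss served from main memory is the most expensive case.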
Therefore, the memory access cost (either latency or energy) of a specific data block $B_i$ consists of local access cost, remote access cost, and data move cost. It can be calculated using

$$C_{Mem}(B_i) = NL(B_i) \times CL(B_i) + NR(B_i) \times CR(B_i) + CM(B_i) \tag{1}$$

where $NL(B_i)$ and $NR(B_i)$ represent the numbers of local accesses and remote accesses to block $B_i$, respectively. $CL(B_i)$ and $CR(B_i)$ represent the cost of local access and remote access to block $B_i$, respectively. $CM(B_i)$ represents the data move cost for block $B_i$. We have the following lemma.
Lemma 1: Power consumption of memory is proportional to memory latency. By minimizing memory access latency, energy consumption can be reduced at the same time. Therefore, memory access cost in this paper refers to either latency or energy consumption.
The cost of processing a data block $B_i$, $C(B_i)$, generally involves two parts: computation cost and memory access cost, i.e.,

$$C(B_i) = C_{Compu}(B_i) + C_{Mem}(B_i) \tag{2}$$

where $C_{Compu}(B_i)$ is the computation cost, and $C_{Mem}(B_i)$ is the memory access cost. However, we only consider the memory part in this work for two reasons. First, the memory access is the bottleneck, since it accounts for most of the time and energy overheads. Second, the computation cost of a specific data block is usually constant or changes very little. Therefore, the total allocation cost $C_{total}$ of a set of $N$ data blocks can be computed as follows:

$$C_{total} = \sum_{i=1}^{N} C_{Mem}(B_i) \tag{3}$$
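As a minimal sketch of this cost model, the following code sums local access cost, remote access cost, and data move cost per block, and then sums the memory access cost over all blocks. It assumes that the local and remote parts are the access count times a per-access cost; the `BlockProfile` type and the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BlockProfile:
    """Hypothetical access profile of one data block B_i."""
    n_local: int     # NL(B_i): number of local accesses
    n_remote: int    # NR(B_i): number of remote accesses
    c_local: float   # CL(B_i): cost of one local access
    c_remote: float  # CR(B_i): cost of one remote access
    c_move: float    # CM(B_i): data move cost


def mem_cost(b: BlockProfile) -> float:
    """Memory access cost of one block: local part + remote part + move cost."""
    return b.n_local * b.c_local + b.n_remote * b.c_remote + b.c_move


def total_cost(blocks: Iterable[BlockProfile]) -> float:
    """Total allocation cost: sum of the memory access costs of all blocks."""
    return sum(mem_cost(b) for b in blocks)


# Placeholder profiles, not taken from the paper's tables.
blocks = [
    BlockProfile(n_local=10, n_remote=2, c_local=1.0, c_remote=4.0, c_move=20.0),
    BlockProfile(n_local=5, n_remote=0, c_local=2.0, c_remote=6.0, c_move=0.0),
]
print(mem_cost(blocks[0]))  # 38.0
print(total_cost(blocks))   # 48.0
```

The same functions can be evaluated once with latency values and once with energy values, since the text treats memory access cost as either quantity.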
B. Allocation Granularity
The granularity of data allocation is critical to the SPM allocation problem. Usually, three types of basic granularity exist: a variable, a block (an instruction or even a function), and a page. The advantages of each granularity vary from program to program. Most work in the literature has focused on variables and blocks, since these two kinds of granularity are easier to partition and handle by inserting programming points. However, the biggest issue of variable- and block-based allocation is the memory fragmentation incurred by their nonuniform sizes [21] .
While page-oriented data allocation can overcome the fragmentation problem due to the effectiveness of the memory management unit, it suffers from the locality problem. Our hybrid SPM architecture can enlarge the on-chip memory space with the use of high density of MRAM and Z-RAM. Therefore, the locality problem outweighs the fragmentation problem, and we use the data block as the basic allocation granularity. In this paper, we assume that the data blocks of a program are partitioned using profile tools before execution, and they can be mapped to every memory block with different latency and energy consumption.
IV. MOTIVATIONAL EXAMPLE
Similar to the mechanism used in [22] , we assume data moving latency and energy consumption between different memory modules are given in Tables II and III, respectively. For illustration purposes, we normalize latency and energy consumption of memory access to MRAM, SRAM, Z-RAM, and off-chip main memory in Table IV . In this table, the columns of "LS," "RS," "LM," "RM," "LZ," "RZ," and "MM" represent the memory access cost to local SRAM, remote SRAM, local MRAM, remote MRAM, local Z-RAM, remote Z-RAM, and off-chip DRAM, respectively. "La" and "En" represent latency and energy consumption, respectively. During the execution of an application, a piece of data may be allocated to any memory module and moved back and forth among all memory modules in SPMs.
We assume that the target system has two cores, and each core is equipped with a hybrid SPM consisting of SRAM, MRAM, and Z-RAM. The off-chip shared memory is a DRAM. In order to demonstrate the viability of our data allocation strategy, we assume a simple program that has 18 data blocks, namely, B 1 , B 2 , . . ., and B 18 . Initially, only data block R is stored in core2's SRAM, and all other blocks are stored in off-chip DRAM. For simplicity, we assume the number of accesses to each data block by each core is given in Table V . In this table, the "DATA" column indicates the data blocks used. The rows of "Read" and "Write" represent the number of reads and writes to each data block by each core.
To illustrate the efficiency of our approach, we compare it with a greedy algorithm proposed in [12] . The basic idea of this algorithm is as follows: It greedily selects the most frequently accessed data and allocates the data to a memory unit associated with the core that most frequently accesses the data. If all memory modules of this core cannot provide room for the data, the data will be allocated to the SPM of the core that accesses the data the second most frequently. Due to the very high overhead of main memory access, this algorithm does not allocate any data to the off-chip DRAM, unless all on-chip SPMs are occupied. Although their target system has SPM, the SPM is configured by a pure SRAM.
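The greedy baseline described above can be sketched roughly as follows. This is a simplified illustration, not the implementation from [12]: it treats each core's SPM as a single pool of block slots, and the function name, data layout, and access counts are hypothetical.

```python
def greedy_allocate(accesses, free_slots):
    """Greedy placement sketch.

    accesses:   {block: {core: access_count}}
    free_slots: {core: number of free block slots in that core's SPM}
    Returns {block: core name, or "DRAM" when every on-chip SPM is full}.
    """
    free = dict(free_slots)
    placement = {}
    # Consider the most frequently accessed blocks first.
    by_total = sorted(accesses, key=lambda b: sum(accesses[b].values()), reverse=True)
    for block in by_total:
        # Try cores from the heaviest user of this block to the lightest.
        by_core = sorted(accesses[block], key=accesses[block].get, reverse=True)
        for core in by_core:
            if free.get(core, 0) > 0:
                free[core] -= 1
                placement[block] = core
                break
        else:
            # No on-chip room anywhere: fall back to off-chip main memory.
            placement[block] = "DRAM"
    return placement


# Placeholder access counts, not taken from Table V.
accesses = {
    "B1": {"core1": 9, "core2": 1},
    "B2": {"core1": 6, "core2": 2},
    "B3": {"core1": 1, "core2": 3},
    "B4": {"core1": 2, "core2": 1},
}
print(greedy_allocate(accesses, {"core1": 1, "core2": 2}))
# {'B1': 'core1', 'B2': 'core2', 'B3': 'core2', 'B4': 'DRAM'}
```

Note that the sketch never looks at read/write ratios or at the type of memory module, which is the limitation that the comparison in this section highlights.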
The total memory access cost of a piece of data involves local reads, local writes, remote reads, remote writes, and data movement between different memory units. For example, if we allocate data B 1 to core1's SRAM, according to Table V, the memory access latency can be calculated as:
The memory access latency of allocating block B 18 to core2's SRAM can be computed as:
In this example, for simplicity, we first assume that each core has 800 B of on-chip SPM, including a 200-B SRAM, a 400-B MRAM, and a 200-B Z-RAM. We also assume that each data block is 100 B, which means an MRAM can accommodate four data blocks, whereas each SRAM and each Z-RAM can only provide room for two data blocks. Table VI shows one possible data block allocation obtained by the greedy algorithm. The total latency and energy consumption are 6928 and 677.99, respectively. This allocation needs 99 writes to MRAMs.
However, an improved algorithm can significantly reduce the number of writes to MRAMs, as well as reduce latency and energy consumption. The data allocation generated by the improved algorithm is shown in the "Improved" row of Table VI . The total latency, energy consumption, and the number of writes to MRAMs are 6071, 588.4, and 45, respectively. Compared with the greedy algorithm, the improved strategy reduces the total latency by 12.37%, energy consumption by 13.21%, and the number of write operations to MRAMs by 54.54%, respectively.
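The reported reductions follow directly from the totals quoted above. The short check below recomputes them; `reduction_pct` is a hypothetical helper name.

```python
def reduction_pct(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return (baseline - improved) / baseline * 100.0


latency = reduction_pct(6928, 6071)    # greedy vs. improved total latency
energy = reduction_pct(677.99, 588.4)  # greedy vs. improved total energy
writes = reduction_pct(99, 45)         # greedy vs. improved writes to MRAM

print(f"latency: {latency:.2f}%")  # latency: 12.37%
print(f"energy:  {energy:.2f}%")   # energy:  13.21%
print(f"writes:  {writes:.4f}%")
```

The write reduction evaluates to 54.5454...%, which the text reports truncated as 54.54%.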
Given the high density property of Z-RAM, we can use a larger Z-RAM to enlarge the size of an on-chip SPM. For example, suppose we use a 1000-B SPM, which is composed of a 200-B SRAM, a 400-B MRAM, and a 400-B Z-RAM. Possible allocations obtained by the greedy algorithm and an improved algorithm are shown in Table VII . In this table, we can see that the total memory access latency is reduced by 16.10%, energy consumption is reduced by 17.11%, and the number of writes to MRAMs is reduced by 69.16%.
From the above example, we can see that the data allocation scheme is critical to the memory access performance and durability of a memory hierarchy. The "improved" algorithm illustrated in the example is intrinsically the MDPDA algorithm that we will discuss in the next section.
V. DATA ALLOCATION ALGORITHM
A. Allocation Cost Table
Assume that N data blocks need to be allocated to a system with P cores. Each core has a proposed hybrid SPM configured from an SRAM, an MRAM, and a Z-RAM. In order to calculate latency and energy consumption for each data block conveniently, we build an allocation cost table to represent the cost of allocating each data block to different memory modules, as shown in Table VIII . In this table, we compute the costs for the 18 data blocks given in Section IV, with the assumption that the target CMP system is a dual-core platform with the proposed hybrid SPM memory. The "Data" column represents the data blocks given in the example in Section IV. The "Core1" and "Core2" columns represent the two cores. "SRAM," "MRAM," and "ZRAM" indicate the SRAM, MRAM, and Z-RAM of the corresponding SPM. The "La" and "En" columns are the latency and energy consumption of allocating each data block to a memory module.
The allocation cost for blocks B 1 , B 2 , B 3 , B 4 , B 5 , B 6 , and B 18 , is obtained according to (1) . We use a function 
B. Recursive Formulation
The most critical part of a dynamic programming algorithm is the construction of the recursive formulation, which breaks down the target problem into subproblems. At each step, we get the best solutions of memory allocation for the subproblems. These solutions will be used, together with the new inputs, to generate candidates for the best solution of the global problem in the next step. We discard the redundant results in each step and only keep the best result of each subproblem. This way, the dynamic programming algorithm usually can achieve polynomial-time solutions.

Our dynamic programming algorithm for hybrid memory allocation is described as follows. First, we define a memory allocation function AllocMem(n, x), which represents the total cost of the first n − 1 blocks when the nth block is allocated to memory x. Then, we define a total cost function f(n, x) to represent the total allocation cost of the first n data blocks when the nth block is allocated to memory x. For example, f(4, C1_S) indicates the total allocation cost of the first four data blocks when block 4 is allocated to Core1's SRAM. We define a multidimensional matrix, i.e., AllC, to store the total cost for data allocation. Its dimension is N × size(C1_S) × size(C1_M) × size(C1_Z) × · · · × size(CP_Z), where N is the number of input data blocks, and size(x) is the size of the memory x. For example, AllC[4, 1, 2, 1, . . . , 1] indicates the total cost of allocating the first four data blocks to on-chip hybrid SPMs, when the available space of Core1's SRAM, Core1's MRAM, Core1's Z-RAM, . . ., and Core P's Z-RAM is 1, 2, 1, . . ., 1, respectively. Then, we can compute the total allocation cost by allocating block i to different memory modules using (4)-(6).
Equation (5) shows that the minimum allocation cost is always preserved, since the total allocation cost after adding the current data block is always selected from the best allocation scheme of the previous blocks and the current one. Equations (4)-(6) give the recursive formulation for deriving the minimum total allocation cost of the target problem. In (6), AllC[b_i, s_1, m_1, . . . , z_n] records the minimum allocation cost when the numbers of available memory blocks in the on-chip memory modules are s_1, m_1, . . . , z_n, respectively. Initially, if all memory blocks in the SPMs (including SRAM, MRAM, and Z-RAM) are unavailable (which means s_1 = m_1 = · · · = z_n = 0), then all the blocks are assigned to the shared off-chip main memory. The allocation of a specific block is always determined by the optimal allocation of the previous data blocks.
For any other item in the matrix, the total cost is determined by both the allocation of the previous blocks and the cost of allocating this block to different memory modules. There are totally 3 × P + 1 choices to assign a data block, where P is the number of cores in the target CMP system. The dynamic programming algorithm always selects the combination that achieves the minimum total cost for all present data blocks. 
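The recursion is described here only in words. One plausible form that is consistent with the definitions of AllocMem, f, and AllC is sketched below; it is a reconstruction under assumptions, not the original typeset equations. Cost(b_i, x) is a hypothetical name for the allocation cost table entry of block b_i in memory x, MM denotes the off-chip main memory, and a_x denotes the available-space index of on-chip module x.

```latex
\begin{align*}
\mathit{AllocMem}(b_i, x) &=
  \begin{cases}
    \mathit{AllC}[b_{i-1}, s_1, m_1, \ldots, a_x - 1, \ldots, z_n], & x \neq \mathrm{MM},\ a_x \geq 1,\\
    \mathit{AllC}[b_{i-1}, s_1, m_1, \ldots, a_x, \ldots, z_n], & x = \mathrm{MM},
  \end{cases}\\
f(b_i, x) &= \mathit{AllocMem}(b_i, x) + \mathit{Cost}(b_i, x),\\
\mathit{AllC}[b_i, s_1, m_1, \ldots, z_n] &= \min_{x} f(b_i, x).
\end{align*}
```

In this reading, x ranges over the 3 × P on-chip modules that still have available space plus the main memory, which matches the 3 × P + 1 choices mentioned above.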
Suppose that AllC[b_i, s_1, m_1, z_1, . . . , z_n] is the minimum total allocation cost for block b_1 to block b_i. Then AllC[b_{i+1}, s_1, m_1, z_1, . . . , z_n] is the minimum total allocation cost by adding data block b_{i+1}, when the available on-chip memory resources of Core1's SRAM, Core1's MRAM, Core1's Z-RAM, . . ., and Core P's Z-RAM are s_1, m_1, z_1, . . ., and z_n, respectively. From (5), all the allocation schemes of data block b_{i+1} are searched, and their results are preserved in the total allocation cost function f(n, x). Since the minimum total allocation costs for block b_1 to block b_i are obtained from the previous step, (6) obtains the minimum cost by adding block b_{i+1} from all possible allocation schemes.
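The recursive formulation can be illustrated with a small memoized sketch. It is written under assumptions and is not the paper's Algorithm 1: the function name `mdpda_min_cost`, the `costs` and `capacities` layouts, and the numbers are hypothetical, and each block is assumed to occupy exactly one slot of the module it is placed in.

```python
from functools import lru_cache


def mdpda_min_cost(costs, capacities, main="MM"):
    """Minimum total allocation cost.

    costs:      list of dicts; costs[i][m] is the cost of placing block i+1
                in module m (every dict also has an entry for `main`).
    capacities: {on-chip module name: number of block slots}.
    """
    modules = tuple(capacities)

    @lru_cache(maxsize=None)
    def best(i, free):
        # Minimum cost of the first i blocks when `free` slots remain per module.
        if i == 0:
            return 0.0
        block = costs[i - 1]
        # Choice 1: block i goes to off-chip main memory; no slot is consumed.
        candidates = [best(i - 1, free) + block[main]]
        # Choices 2..: block i takes one free slot of an on-chip module.
        for k, m in enumerate(modules):
            if free[k] > 0:
                rest = free[:k] + (free[k] - 1,) + free[k + 1:]
                candidates.append(best(i - 1, rest) + block[m])
        return min(candidates)

    return best(len(costs), tuple(capacities[m] for m in modules))


# Placeholder costs for three blocks and a two-module SPM.
costs = [
    {"SRAM": 2, "MRAM": 5, "MM": 20},
    {"SRAM": 3, "MRAM": 4, "MM": 30},
    {"SRAM": 1, "MRAM": 9, "MM": 10},
]
print(mdpda_min_cost(costs, {"SRAM": 1, "MRAM": 1}))  # 16.0
```

In this toy instance a frequency-blind choice would put the third block in SRAM because that is its cheapest module, but the optimum keeps SRAM and MRAM for the first two blocks and sends the third to main memory.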
C. MDPDA Algorithm
Based on the recursive formulation presented in Section V-B, we describe the MDPDA algorithm in Algorithm 1. The inputs of the MDPDA algorithm are the N data blocks obtained by profiling tools, the constructed allocation cost table, and the total cost table. The output of the algorithm is the minimum total cost (latency or energy consumption) for the N data blocks.
Algorithm 1 Multi-dimensional Dynamic Programming for Data Allocation (MDPDA).
Input
. . .
19:         for z_n ← 0 to size(CP_Z) do
20:           Apply (4) to get minimum memory allocation cost for the first b_i − 1 data blocks when b_i is allocated to different memory modules.
21:           Apply (5) to calculate the cost of allocating block b_i to different modules.
22:           Apply (6) to get the minimum total allocation cost for block b_1 to block b_i.
23:         end for
24:       end for
25:     end for
26:   end for
27:   Get the data allocation using the BGDA Algorithm 2.
We initialize the algorithm in Lines 1-3. When there is no on-chip memory space available, we have to assign all the data blocks to the shared off-chip main memory, which is the worst case of the algorithm. In this case, the total cost of the first b i data blocks is the summation of the cost of allocating them to the main memory. In order to compute the total cost correctly, we add a boundary for the matrix (Lines 4-14) . Lines 15-26 are used to recursively compute the total cost from the first data block to the last one, according to the recursive formulations given in (5) and (6) . For the target system with P cores, there are 3 × P + 1 layers of loops. The first loop specifies the data block under consideration, and the second loop to the (3P + 1)th loop are employed to determine the best allocation for the first data block to the current data block. Due to space limitations, we only list several loops in the algorithm.
In Line 27, the MDPDA algorithm calls a subalgorithm, backtrack to get the data allocation (BGDA, Algorithm 2), to backtrack the path that derives the minimum total cost. Since there are N data blocks, we need to perform N traces to determine the allocation for all blocks.
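The backtracking step can be sketched as follows. This is an illustrative reconstruction, not the paper's Algorithm 2: it fills a cost table bottom-up, records the module chosen in each cell, and then walks back once per block. All names and numbers are hypothetical, and each block is assumed to occupy one slot.

```python
from itertools import product


def mdpda_with_backtrack(costs, capacities, main="MM"):
    """Return (minimum total cost, list with the module chosen for each block)."""
    modules = list(capacities)
    ranges = [range(capacities[m] + 1) for m in modules]
    n = len(costs)
    table = {}   # (i, free) -> minimum cost of the first i blocks
    choice = {}  # (i, free) -> module chosen for block i in that optimum
    for free in product(*ranges):
        table[(0, free)] = 0.0
    for i in range(1, n + 1):
        for free in product(*ranges):
            # Main memory is always a valid choice and consumes no slot.
            best_cost, best_mod = table[(i - 1, free)] + costs[i - 1][main], main
            for k, m in enumerate(modules):
                if free[k] > 0:
                    rest = free[:k] + (free[k] - 1,) + free[k + 1:]
                    cand = table[(i - 1, rest)] + costs[i - 1][m]
                    if cand < best_cost:
                        best_cost, best_mod = cand, m
            table[(i, free)] = best_cost
            choice[(i, free)] = best_mod
    # Backtrack from the full problem: one trace step per block, N in total.
    full = tuple(capacities[m] for m in modules)
    free, placement = full, [None] * n
    for i in range(n, 0, -1):
        m = choice[(i, free)]
        placement[i - 1] = m
        if m != main:
            k = modules.index(m)
            free = free[:k] + (free[k] - 1,) + free[k + 1:]
    return table[(n, full)], placement


# Placeholder costs for three blocks and a two-module SPM.
costs = [
    {"SRAM": 2, "MRAM": 5, "MM": 20},
    {"SRAM": 3, "MRAM": 4, "MM": 30},
    {"SRAM": 1, "MRAM": 9, "MM": 10},
]
print(mdpda_with_backtrack(costs, {"SRAM": 1, "MRAM": 1}))
# (16.0, ['SRAM', 'MRAM', 'MM'])
```

Storing the chosen module next to each cost keeps the trace to N lookups, at the price of a second table of the same size.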
Algorithm 2 Backtrack to Get the Data Allocation (BGDA).
Input: Allocation cost

We use a simple example, as shown in Fig. 2 , to demonstrate the execution of the MDPDA algorithm. For simplicity, we only consider blocks A, B, C, D, E, and F to introduce the motivational example. We assume that the target CMP system is a single-core system with a hybrid on-chip SPM (see Table IX ). In this case, the dimension of the total allocation cost matrix, i.e., AllC, is 6 × 1 × 2 × 2. In Fig. 2 , s, m, and z represent the available number of memory blocks in SRAM, MRAM, and Z-RAM, respectively. According to the recursive formulation and the MDPDA algorithm, we can determine that the minimum allocation cost (latency) for these six data blocks is 1481. The solution for the allocation is to assign B 1 and B 3 to MRAM, B 4 to main memory, B 5 to SRAM, and B 2 and B 6 to Z-RAM. With this allocation scheme, the total energy consumption is 147.84. It can be verified that both latency and energy consumption are optimal for this set of data blocks and the assumed parameters.
Time complexity: We can see that the time complexity of this algorithm is determined by the recursive part, which is $O(N \times size(C1_S) \times size(C1_M) \times size(C1_Z) \times \cdots \times size(CP_Z))$.

Due to the limited on-chip memory space of the CMP system, the size of each hybrid memory is generally small. Assume that the size of each memory is $K$. For a system with $P$ cores, the time complexity of the algorithm is $O(N \times K^{3P})$. Since $K$ and $P$ are constant for a given architecture, the algorithm executes in polynomial time. When the size $K$ of each memory becomes larger, the running time is still polynomial in $K$. Hence, this algorithm is very efficient.
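To give a feel for the N × K^(3P) estimate, the snippet below counts table cells for a few small configurations. `dp_cell_count` is a hypothetical helper, and the values of N, K, and P are illustrative only.

```python
def dp_cell_count(n_blocks: int, k: int, p: int) -> int:
    """Number of AllC cells under the N x K^(3P) estimate (K slots per module, P cores)."""
    return n_blocks * k ** (3 * p)


print(dp_cell_count(18, 2, 2))  # 1152
print(dp_cell_count(18, 4, 2))  # 73728
print(dp_cell_count(18, 2, 4))  # 73728
```

For a fixed number of cores the count grows polynomially with K, while the exponent 3P itself grows with the number of cores, so the estimate is most favorable for small P.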
VI. EXPERIMENTAL RESULTS
We evaluate our algorithm using a host of benchmarks selected from PARSEC [23] . We run these workloads on the M5 simulator [24] and obtain the memory traces. We implemented the MDPDA algorithm and the greedy algorithm as standalone programs. These programs take the memory traces we have collected as inputs. We also use a modified version of CACTI [25] to get the memory parameters, including memory read/write latency, energy consumption, and leakage power, using 65-nm technology.
There are two configurations for the target systems. One is a dual-core in-order CMP system where each core has a hybrid SPM with 4-KB SRAM, 16-B MRAM, and 8-KB Z-RAM. The other is a quad-core CMP where each core has a hybrid SPM with 4-KB SRAM, 8-KB MRAM, and 4-KB Z-RAM. The baseline configuration is a dual-core CMP system with a pure SPM configured from an 8-KB SRAM. The specifications of the hybrid memory modules and the baseline are given in Table X . Then, we integrate all these parameters into our custom simulator.
1) Performance Analysis of Hybrid SPM:
We compare the data allocation performance of our proposed SRAM, MRAM, and Z-RAM hybrid memory with that of the 4-K pure SRAM SPM, in terms of memory access latency and power consumption. Fig. 3 shows the effectiveness of the hybrid SPM with the MDPDA algorithm. We observe that the hybrid SPM consumes much less power and incurs much shorter memory access latency than the pure-SRAM-based SPM does. On average, the power saving and memory access latency reduction across the set of selected benchmarks are around 78.11% and 70.79%, respectively. Two major reasons contribute to the saving and reduction. First, the hybrid SPM architecture benefits from the high density of MRAM and Z-RAM. As a result, it offers much more space for data allocation than the pure-SRAM-based SPM does.
2) Comparison Results: Fig. 4 illustrates the number of writes to MRAM for the MDPDA algorithm and the greedy algorithm [12] across the set of workloads, when the target platform is a dual-core system [see Fig. 4(a) ] and a quad-core system [see Fig. 4(b) ], respectively. It can be observed that compared with the greedy algorithm, the MDPDA algorithm can reduce the number of writes to MRAM by 35.25% and 39.67% on average for the dual-core system and the quad-core system, respectively. The main reason for the reduction is that the greedy algorithm simply allocates the most frequently referenced data blocks to the core that accesses them most frequently and does not fully take advantage of the different types of memory modules. In contrast, the MDPDA algorithm is write aware, and it can intelligently allocate each data block to different on-chip memory units. For example, with the aid of the MDPDA algorithm, most writes will be assigned to SRAM and Z-RAM, whereas most frequently read data will be allocated to MRAM and Z-RAM. In Fig. 4(a) and (b), we can also see that the reduction in the number of writes on the quad-core system is more prominent than that on the dual-core system. This is mainly because more cores provide more on-chip memory in total. Through the optimal allocation, the MDPDA algorithm, therefore, can significantly reduce the number of writes to MRAM. The reduction of writes to MRAM contributes to the extension of MRAM's lifetime.
Figs. 5 and 6 exhibit the effectiveness of the MDPDA algorithm over that of the greedy algorithm [12] , with respect to memory access latency and total energy consumption across the ten workloads. We also investigate the performance for a dual-core system and a quad-core system for each case. Fig. 5(a) and (b) shows that compared with the greedy algorithm, the MDPDA algorithm can reduce the total memory access latency by 16.23% on average for a dual-core system and 23.43% on average for a quad-core system. Similarly, Fig. 6(a) and (b) demonstrates that the MDPDA algorithm outperforms the greedy algorithm in terms of energy efficiency. On average, the MDPDA algorithm can reduce the dynamic power consumption for a dual-core system and a quad-core system by 17.74% and 24.18%, respectively.
VII. CONCLUSION
This paper has proposed a novel hybrid SPM architecture comprising SRAM, MRAM, and Z-RAM and used a multidimensional dynamic programming algorithm to optimally allocate data blocks to each on-chip memory. Experimental results demonstrated the effectiveness of our approach.
