Scratchpad Memory (SPM), a software-controlled on-chip memory, has been widely used as an alternative to caches in modern embedded systems due to its energy efficiency. To further reduce energy consumption, non-volatile memory (NVM) based hybrid SPM has been proposed recently. This paper targets the problem of allocating program variables into hybrid SPM based systems. Both an ILP formulation and a graph-coloring based algorithm are proposed. The experiments show that the proposed graph-coloring framework achieves both better memory latency and lower energy costs in comparison to previous works.
Introduction
Energy consumption is a crucial issue in the design of embedded systems. On-chip caches typically consume 25% to 45% of the total chip power [15]. Scratchpad Memory (SPM), a software-controlled on-chip memory, is often applied for energy efficiency in embedded systems. The work by Banakar et al. [4] demonstrates that SPMs are more efficient than caches in performance, energy consumption, and area costs. Software techniques for SPMs also guarantee better timing predictability than a cache memory, which is critical in hard real-time systems. Given these advantages, SPMs are used as alternatives to caches in modern embedded processors such as Motorola M-core MMC221 and TI TMS370Cx7x. This paper proposes techniques for allocating program variables on hybrid SPMs.
Traditional SPMs are fabricated in pure SRAM. As the number of CMOS transistors increases, leakage power consumption becomes a critical issue. Non-volatile memories (NVMs), with the features of low leakage power and high density, present a new way of addressing the memory leakage power consumption problem. However, write operations on NVM suffer from considerably higher energy and longer latency. Recently, hybrid SPMs consisting of SRAM and NVM have been proposed [8], and smart memory management techniques are needed when NVM is applied.
Much work has been done on SPM allocation. Avissar et al. derive a static allocation approach, in which the allocation of data to SPM is fixed and unchanged throughout program execution [3]. Assuming that the live range of each data object spans the whole program execution, they model the problem as an Integer Linear Programming (ILP) problem and solve it using commercial packages. Considering dynamic program behavior, Udayakumaran et al. present a dynamic allocation method for global and stack data [17]. Hu et al. propose a novel hybrid SPM architecture consisting of NVM and SRAM, and present a dynamic allocation approach in which the code is partitioned into multiple regions, and for each region a dynamic programming algorithm is used to reallocate the data [8]. Their solution assumes that the live range of each data object spans the whole program execution, so it cannot exploit the opportunity that data objects with disjoint live ranges can share the same memory addresses in SPM. The work by Li et al. considers live ranges of variables, and can allocate arrays with non-overlapping live ranges to the same memory address [13]. Their solution implicitly assumes that accessing one memory unit is always cheaper than accessing another. This assumption is reasonable for register allocation, where a register is always superior to main memory, and for pure SRAM based SPM, where SPM is always superior to off-chip memory. However, it does not always hold in a hybrid SPM based system. For example, in a hybrid SPM consisting of NVM and SRAM, a write operation to NVM is not always cheaper in latency than a write to off-chip memory. Furthermore, when the cost considered is parameterized as a combination of latency, energy, area, etc., it is uncommon that one memory is always better than the other. Therefore, new management schemes are needed for hybrid SPM.
In this paper, we propose a compilation technique for statically allocating program data into a hybrid SPM to minimize the total memory cost. The memory cost can be latency, power consumption, or any combination of these. We first model the allocation problem using an ILP formulation. Unlike the ILP solution presented in [3], the ILP formulation proposed in this paper takes the live ranges of data objects into account. Furthermore, as ILP does not scale well with large programs, we also propose a polynomial-time heuristic algorithm, Multiple Graph-Coloring (MGC), to allocate program variables in hybrid SPMs. In MGC, the interference graph of variables is split into vertex-induced subgraphs according to each variable's favorite memory unit, and then a graph-coloring based algorithm is applied to each subgraph. Memory cost is integrated into this graph-coloring process. Experimental results show that MGC achieves better execution latency and lower energy costs in comparison to previous works.
The major contributions of this paper include:
1. For the first time, an ILP formulation is proposed for the problem of hybrid SPM allocation that exploits the fact that variables which are not alive simultaneously can share the same memory address.
2. A graph-coloring based polynomial-time algorithm is proposed for the problem of hybrid SPM allocation.
The rest of this paper is organized as follows. Problem formulations and backgrounds are introduced in Section 2. The ILP formulation and the MGC algorithm are presented in Section 3 and Section 4 respectively. Section 5 shows the experiments, and Section 6 reviews related work. Finally, Section 7 concludes this paper.
Problem Formulations and Backgrounds
Given a program and a hybrid SPM, a compilation approach is proposed in this paper to allocate program data into the hybrid SPM with the objective of minimizing the cost of memory accesses (the cost could be latency, power consumption, or any combination). There are two assumptions for this work. First, both local data and global data need static allocation. This assumption is reasonable for embedded systems, since many low-end microcontrollers (MCUs), such as the 8-bit PIC MCUs [1], do not support stack storage for local data. Second, the architecture parameters are known at compilation time. The architecture parameters include the types of memory units used (SRAM, NVM, etc.), and the size as well as the access cost of each memory unit. Note that there is no assumption that one memory unit always dominates the other, which makes this problem different from register allocation as well as pure SRAM SPM allocation. Therefore, a solution to this problem requires more than a simple cascading application of the classic graph-coloring algorithm.
To exploit the live ranges of variables, live analysis is used. Live analysis is a classic data flow analysis performed by compilers to calculate the program points where each variable is alive. Each variable lives through a set of program points, which constitute the live range of that variable. If the intersection of two variables' live ranges is not empty, these two variables interfere with each other. This interference relation can be represented by an interference graph, where each node represents a variable in the program, and undirected edges connect pairs of nodes which interfere. Fig. 2(a) shows the interference graph for the example code in Fig. 1.
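As an illustration of the interference relation described above, the following minimal sketch builds an interference graph from precomputed live ranges. The function name `build_interference_graph`, the variable names, and the live ranges are hypothetical and are not taken from Fig. 1; a real compiler would obtain the live ranges from its own liveness analysis.

```python
from itertools import combinations


def build_interference_graph(live_ranges):
    """Return an undirected interference graph as {variable: set(neighbours)}.

    live_ranges maps each variable name to the set of program points at
    which the variable is alive (the result of a liveness analysis).
    """
    graph = {var: set() for var in live_ranges}
    for u, v in combinations(live_ranges, 2):
        # Two variables interfere when their live ranges share a point.
        if live_ranges[u] & live_ranges[v]:
            graph[u].add(v)
            graph[v].add(u)
    return graph


# Hypothetical live ranges; program points are numbered 1..6.
live_ranges = {
    "x": {1, 2, 3},
    "y": {3, 4},
    "z": {5, 6},
}
graph = build_interference_graph(live_ranges)
```

In this sketch `x` and `y` interfere because both are alive at point 3, while `z` interferes with neither and could therefore share an address with either of them.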
To minimize the total cost of a memory system, it is beneficial to know the access frequency of each variable during program execution. Static profiling [18], which estimates the execution frequency at compilation time, is employed in this paper to obtain such information.
The ILP Formulation
In this section, we present the ILP formulation for the problem of allocating program data in a hybrid SPM system. There are two kinds of constraints for this problem. The allocation constraints ensure that each variable is allocated into exactly one memory unit. The interference constraints ensure that variables alive simultaneously cannot share the same memory address. The following notations are used, and are assumed constant in the ILP formulation.
• n: number of variables

We use $a_{i,j}$ to describe whether variable $V_i$ is assigned to memory unit $U_j$, as defined in Equation 2.

Then, the allocation constraints are formulated as Equation 3.
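A form of the assignment indicator and the allocation constraint that matches the description above can be sketched as follows. This is an illustrative reading and not necessarily the exact statement of Equations 2 and 3; in particular, it is assumed that the index $j$ ranges over every memory unit, including the off-chip memory.

```latex
\[
a_{i,j} =
\begin{cases}
1, & \text{if variable } V_i \text{ is assigned to memory unit } U_j \\
0, & \text{otherwise}
\end{cases}
\]
\[
\sum_{j} a_{i,j} = 1 \qquad \text{for every variable } V_i
\]
```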
We use $S_i$ to denote the start address of the memory space allocated for variable $V_i$. If variable $V_i$ is allocated to memory $U_j$, then $S_i$ must be in the range $[0, P_j - Q_i]$. Inequation 4 describes this constraint. Here, $B$ is a large enough number, and we can simply take $B = \sum_{i} P_i$. If $a_{i,j} = 1$, which means variable $V_i$ is assigned to memory $U_j$, then Inequation 4 becomes $0 \le S_i \le P_j - Q_i$.
For each pair of interfering variables $V_i$ and $V_j$, if they are allocated in the same memory, their starting addresses $S_i$ and $S_j$ must satisfy either $S_i + Q_i \le S_j$ or $S_j + Q_j \le S_i$. To indicate the order of two such variables, we define $y_{i,j}$ as Equation 5.
Then, the interference constraints for any pair of interfering variables $(V_i, V_j)$ can be formulated as Inequation 6. Notice that Inequation 6 must hold for any memory $U_t$. If two variables are allocated in different memories, i.e., $a_{i,t} + a_{j,t} < 2$, then Inequation 6 holds trivially. If they are allocated into the same memory, then Inequation 6 leads to $S_i + Q_i \le S_j$ when $y_{i,j} = 1$, and $S_j + Q_j \le S_i$ when $y_{i,j} = 0$. Hence Inequation 6 guarantees that interfering variables cannot share the same memory address.
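A standard big-M encoding that has the properties stated above is sketched below. It is offered as one consistent possibility rather than as the exact form of Equation 5 and Inequations 4 and 6. It takes $P_j$ as the size of memory $U_j$ and $Q_i$ as the size of variable $V_i$, which is how the text uses them, and assumes $B$ is at least as large as the largest memory.

```latex
% Ordering indicator for an interfering pair (V_i, V_j)
\[
y_{i,j} =
\begin{cases}
1, & \text{if } V_i \text{ is placed before } V_j \\
0, & \text{otherwise}
\end{cases}
\]
% Address range: binding only when a_{i,j} = 1
\[
0 \le S_i \le P_j - Q_i + B\,(1 - a_{i,j})
\]
% Non-overlap inside the same memory U_t: binding only when
% a_{i,t} = a_{j,t} = 1, with y_{i,j} selecting which inequality applies
\begin{align*}
S_i + Q_i &\le S_j + B\,(1 - y_{i,j}) + B\,(2 - a_{i,t} - a_{j,t}) \\
S_j + Q_j &\le S_i + B\,y_{i,j} + B\,(2 - a_{i,t} - a_{j,t})
\end{align*}
```

With $a_{i,t} = a_{j,t} = 1$ and $y_{i,j} = 1$ the first inequality reduces to $S_i + Q_i \le S_j$ and the second is slack; with $y_{i,j} = 0$ the roles swap. When the two variables are in different memories, the term $B\,(2 - a_{i,t} - a_{j,t})$ is at least $B$ and both inequalities are slack.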
The objective is to minimize the total cost of all memory accesses, which is formulated as Equation 7.
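To make the meaning of the constraints and the objective concrete, the sketch below exhaustively searches a tiny instance instead of calling an ILP solver. It is a brute-force reference under simplifying assumptions, not the formulation itself: every variable is assumed to occupy exactly one address, and the function name, memory sizes, and costs are hypothetical.

```python
from itertools import product


def brute_force_allocate(variables, edges, units, cost):
    """Exhaustively find a minimum-cost allocation for a tiny instance.

    variables: list of variable names (ASSUMPTION: one address each)
    edges:     list of (u, v) pairs of interfering variables
    units:     {unit_name: number_of_addresses}
    cost:      {(variable, unit_name): cost of placing variable in unit}
    Returns (best_cost, {variable: (unit_name, address)}),
    or (None, None) when no feasible placement exists.
    """
    slots = [(u, addr) for u, size in units.items() for addr in range(size)]
    best_cost, best = None, None
    for choice in product(slots, repeat=len(variables)):
        placement = dict(zip(variables, choice))
        # Interference constraint: interfering variables never share a slot.
        if any(placement[u] == placement[v] for u, v in edges):
            continue
        # Objective: total memory cost of the chosen units.
        total = sum(cost[(v, placement[v][0])] for v in variables)
        if best_cost is None or total < best_cost:
            best_cost, best = total, placement
    return best_cost, best


# Hypothetical instance: x and y interfere, z is disjoint from both.
variables = ["x", "y", "z"]
units = {"SRAM": 1, "NVM": 1, "DRAM": 3}
cost = {
    ("x", "SRAM"): 10, ("x", "NVM"): 30, ("x", "DRAM"): 100,
    ("y", "SRAM"): 12, ("y", "NVM"): 15, ("y", "DRAM"): 100,
    ("z", "SRAM"): 5, ("z", "NVM"): 40, ("z", "DRAM"): 80,
}
shared_cost, shared = brute_force_allocate(variables, [("x", "y")], units, cost)
all_pairs = [("x", "y"), ("x", "z"), ("y", "z")]
whole_program_cost, _ = brute_force_allocate(variables, all_pairs, units, cost)
```

In this instance `x` and `z` share the single SRAM address when live ranges are taken into account, whereas treating every pair as interfering (live ranges spanning the whole program) forces one variable off-chip and raises the total cost.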
MGC: Multiple Graph-Coloring Algorithm

Similar to register allocation, SPM allocation can also be viewed as a graph-coloring problem. The interference graph of program variables is the graph to be colored. Each type of memory is associated with a set of colors, and the number of colors in this set equals the size of this memory. In the context of hybrid SPM, each coloring has a cost, which corresponds to the memory cost. Coloring a node with any color from the same set results in the same cost, as the colors of one set represent allocations into the same memory unit. Therefore, the hybrid SPM allocation problem can be restated as follows: given the interference graph and the sets of colors, how can this graph be colored such that no two nodes connected by an edge receive the same color, while the total cost is minimized?
It is observed that a good allocation has two features. First, it tries to allocate each variable into its preferred memory unit. Second, when two variables, $V_i$ and $V_j$, compete for a low-cost memory unit $U_c$, if the cost penalty resulting from spilling $V_i$ is higher than that from spilling $V_j$, then $V_i$ has priority for allocation into $U_c$. In this section, based on the graph-coloring framework, a polynomial-time heuristic algorithm, Multiple Graph-Coloring (MGC), is proposed for hybrid SPM allocation. The input to the MGC algorithm includes the access frequency of each variable, the interference graph of the variables, and the architecture parameters. As stated in Section 2, the information about access frequency can be obtained from static profiling [18], and the interference graph is constructed by live analysis. The MGC algorithm, as detailed in Algorithm 4.1, mainly consists of two steps: computing and sorting the memory cost for each variable, and graph-coloring. The graph-coloring process splits the graph into subgraphs, and then coloring and spilling are done for each subgraph. In the rest of this section, these two main steps are discussed first, and then a walk-through example is presented.
Computing and Sorting the Access Cost
Given the access frequency, the memory cost of each variable can be calculated using Equation 1. Then, each variable $V_i$ is associated with a memory cost list,

$\{MemoryCost_{i,1}, MemoryCost_{i,2}, \ldots, MemoryCost_{i,k}\}$,

where $k$ is the number of memory units. In addition, the memory cost list for each variable is sorted in ascending order: $MemoryCost_{i,m}$ precedes $MemoryCost_{i,n}$ in the memory cost list associated with $V_i$ if and only if $MemoryCost_{i,m}$ is less than $MemoryCost_{i,n}$.
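The sorted memory cost list can be sketched as below. The cost model used here, reads times read cost plus writes times write cost, is an assumption standing in for Equation 1, and the per-access costs are hypothetical rather than the values of TABLE 1 or TABLE 2.

```python
def memory_cost_list(reads, writes, units):
    """Return [(cost, unit_name), ...] sorted by ascending cost.

    units maps a memory-unit name to (read_cost, write_cost) per access.
    ASSUMPTION: cost = reads * read_cost + writes * write_cost, used here
    as a stand-in for Equation 1.
    """
    costs = [(reads * rc + writes * wc, name) for name, (rc, wc) in units.items()]
    return sorted(costs)


# Hypothetical per-access costs: NVM reads are cheap, NVM writes are expensive.
units = {"SRAM": (2, 2), "NVM": (1, 30), "DRAM": (20, 20)}

read_heavy = memory_cost_list(reads=10, writes=0, units=units)
write_heavy = memory_cost_list(reads=0, writes=5, units=units)
```

With these numbers the read-heavy variable prefers NVM, while for the write-heavy variable even the off-chip memory is cheaper than NVM, which illustrates why no memory unit can be assumed to dominate another.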
Graph-Coloring Process
The graph-coloring process is shown in Algorithm 4.2. It consists of a loop which includes two parts: the first part splits the interference graph into subgraphs, and the second part colors and spills the nodes of each subgraph. The overall graph-coloring process for allocation of a hybrid SPM consisting of SRAM and NVM is shown in Fig. 3.
Subgraph splitting. Each variable has its preferred memory unit. According to this preference, the interference graph can be split into vertex-induced subgraphs, where the variables preferring the same memory unit are placed into the same subgraph. Each subgraph corresponds to a memory unit.

    bActualSpilled ← false;
    initialize DataAlloc to be empty;
    // Part I: splitting the interference graph
    split the InterGraph into SubGraphList;
    if SubGraphList is empty then
        print "Out of memory error";
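The splitting step can be sketched as follows. The function name and data layout are hypothetical: the graph is assumed to contain only the variables that are still unallocated, and each cost list is assumed to be sorted in ascending order so that its head names the variable's current favorite unit.

```python
def split_into_subgraphs(graph, cost_lists):
    """Split an interference graph by each variable's preferred memory unit.

    graph:      {variable: set(neighbours)} for the unallocated variables
    cost_lists: {variable: [(cost, unit), ...]} sorted ascending; the head
                of each list is the variable's current favourite unit
    Returns {unit: {variable: set(neighbours inside the same subgraph)}}.
    """
    groups = {}
    for var in graph:
        if cost_lists[var]:  # variables with no candidate unit left are skipped
            unit = cost_lists[var][0][1]
            groups.setdefault(unit, set()).add(var)
    # Vertex-induced subgraph: keep only edges with both endpoints in the group.
    return {
        unit: {v: graph[v] & members for v in members}
        for unit, members in groups.items()
    }


# Hypothetical example: a and c prefer SRAM, b prefers NVM.
graph = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
cost_lists = {
    "a": [(5, "SRAM"), (9, "NVM")],
    "b": [(3, "NVM"), (4, "SRAM")],
    "c": [(2, "SRAM"), (8, "NVM")],
}
subgraphs = split_into_subgraphs(graph, cost_lists)
```

The edge between `a` and `b` disappears from both subgraphs because its endpoints prefer different memory units and therefore cannot compete for the same address.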
Coloring and spilling for each subgraph. This process is detailed in Algorithm 4.2. It consists of four stages: simplify, potential spill, select, and actual spill. The naming and meaning of these four stages are similar to those in the graph-coloring algorithm for register allocation. For each subgraph, the simplify stage finds nodes with degree less than k, where k is the size of the memory unit related to this subgraph. When such a node is found, it is removed from the subgraph and pushed onto a stack, and all of the edges connected to it are removed too. This process is repeated until none of the remaining nodes can be simplified. If there are nodes which cannot be simplified, one of them is marked as a potential spill node, removed from the subgraph and pushed onto the stack. This process is repeated until there exist nodes with degree less than k, at which point we return to the simplify stage. We choose the potential spill node by calculating the spill cost of each node using Equation 8. Here, the penalty is the difference between the costs of a node's first and second favorite memory units. The node with the lowest spill cost is chosen as the potential spill node.

spill cost = penalty / degree (8)

After the simplify and potential spill stages, all nodes have been removed from the subgraph. We then assign colors to nodes in the order in which they are popped off the stack, under the constraint that nodes connected by an edge cannot be given the same color. When a node cannot be colored, it is marked as an actual spill node and left uncolored. Note that an actual spill node cannot be allocated into its currently favorite memory unit, so the actual spill causes the first element of this node's memory cost list to be erased. If any actual spill occurs in a subgraph, there is a need to return to the subgraph splitting phase as illustrated in Algorithm 4.1.
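The four stages can be sketched for a single subgraph as below. This is a simplified illustration and not a transcription of Algorithm 4.2: each variable is assumed to occupy one address, ties are broken alphabetically, and the function name, graph, and penalties are hypothetical.

```python
def color_subgraph(subgraph, k, penalty):
    """Color one subgraph with k colors (addresses 0..k-1).

    subgraph: {node: set(neighbours)}, a vertex-induced subgraph
    k:        number of colors, i.e. the size of the memory unit
    penalty:  {node: cost gap between its first and second favourite unit}
    Returns (coloring, actual_spills).
    """
    if k < 1:
        return {}, sorted(subgraph)
    work = {n: set(nb) for n, nb in subgraph.items()}  # mutable working copy
    stack = []
    while work:
        # Simplify: pick a node whose degree is less than k, if any.
        node = next((n for n in sorted(work) if len(work[n]) < k), None)
        if node is None:
            # Potential spill: lowest penalty / degree (Equation 8).
            # Every remaining degree is >= k >= 1, so no division by zero.
            node = min(sorted(work), key=lambda n: penalty[n] / len(work[n]))
        for nb in work.pop(node):
            work[nb].discard(node)
        stack.append(node)
    coloring, spills = {}, []
    while stack:
        # Select: color nodes in the order they are popped off the stack.
        node = stack.pop()
        used = {coloring[nb] for nb in subgraph[node] if nb in coloring}
        free = [c for c in range(k) if c not in used]
        if free:
            coloring[node] = free[0]
        else:
            spills.append(node)  # actual spill: left uncolored
    return coloring, spills


# Hypothetical subgraph: a triangle a-b-c plus an isolated node d, two colors.
subgraph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": set()}
penalty = {"a": 10, "b": 30, "c": 60, "d": 5}
coloring, spills = color_subgraph(subgraph, k=2, penalty=penalty)
```

Here node `a` has the lowest spill cost in the triangle (10 / 2 = 5), so it is pushed as the potential spill and ends up as the only actual spill; a caller would then erase the head of its memory cost list and return to subgraph splitting.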
A Walk-Through Example

A walk-through example is presented in this subsection. The access frequency and interference graph are given in Fig. 2. The architecture parameters assumed are shown in TABLE 1. The memory cost of each variable for each memory is calculated using Equation 1, as illustrated in Fig. 4(a).
First, the interference graph is split into subgraphs according to each node's favorite memory. In Fig. 4(a), only node b prefers NVM, so it is moved into the NVM subgraph. All the other nodes are moved into the SRAM subgraph. The nodes preferring the same memory are highlighted in the same color. Then, the coloring and spilling process continues for each subgraph. For the NVM subgraph, one color from the color set related to NVM needs to be assigned to b. For the SRAM subgraph, since the size of SRAM is one byte, k equals 1. It is found that the degree of node i is less than k, so i is removed from this subgraph and pushed onto the stack. Then, we need to choose a potential spill node. Among all the remaining nodes, the spill cost of node e is the lowest: the penalty is 66 (106-40), the degree is 4, and thus the spill cost is 16.5 (66/4). Node e is therefore chosen as a potential spill node and removed from this subgraph. Afterwards, there is no node with degree less than k, so we continue to mark the next potential spill node. This process continues by potentially spilling d, simplifying c, potentially spilling g and f, and finally simplifying a. The complete process is shown in Fig. 4(a), where the potentially spilled nodes are marked gray in the stack. After all nodes have been removed, we try to assign colors to the nodes in the order in which they are popped off the stack. It is found that all the potentially spilled nodes need to be actually spilled, and the coloring is shown in Fig. 4(a). Now, {a, c, i} is allocated into SRAM with the same address, and {b} is allocated into NVM. The updated interference graph is shown in Fig. 4(b). Since there are actual spills, we return to the subgraph splitting phase and start a new iteration. For the new subgraph splitting phase, a, c, i prefer SRAM; b, d, e, g prefer NVM; and f prefers DRAM. Then the coloring and spilling process for each subgraph continues. In this example, four iterations are needed in total, as illustrated in Fig. 4. The final allocation is shown in Fig. 5. It is found that only four bytes are used to hold eight variables, and only two variables.
Experiments
The experimental platform is based on the LLVM compiler infrastructure [10]. A static profiling algorithm based on [18] has been implemented and is used to estimate the access frequency of each variable. Live analysis is also implemented to obtain the interference graph. In this paper's implementation, only local scalar variables, including both source-language variables and compiler-generated temporaries, are considered. But, the proposed method will also

In the experiments, a hybrid SPM which consists of SRAM and PCM [8] is evaluated. The PCM memory simulator NVsim [5], which is a PCM-supporting variant of the CACTI tool, is used to estimate the read/write latencies and the energy consumption for a PCM memory of a given size. 45 nm technology is used with the tool. NVsim is also used to obtain the access latencies and energy consumption for a given size of SRAM memory. The parameters obtained are shown in TABLE 2.
Five SPM allocators are implemented to evaluate the proposed work. The MGC allocator implements the MGC algorithm presented in Section 4. The ILP allocator implements the ILP formulation presented in Section 3. This allocator works as a baseline to evaluate the other allocators. The ILP-N allocator implements the algorithm presented in [3], which assumes that the live range of each program data object spans the whole execution time of the program. The aDA allocator implements the algorithm presented in [8]. The MC allocator implements the Memory Coloring (MC) algorithm presented in [13]. As discussed in Section 1, the memory coloring algorithm statically assumes the preference for memory units. In the experiments, the MC allocator assumes SRAM is preferable to PCM, and PCM is preferable to off-chip memory. All five allocators output the allocation information, which associates each variable with a memory address. A simulator is designed to accept this information as input and evaluate the total memory cost using Equation 7. All the benchmarks used are from MiBench [6].
To evaluate the memory latency and dynamic energy consumption, two sets of experiments are conducted respectively. We design two memory configurations for both sets of experiments. In the first configuration, the size of SRAM space is 5% and the size of NVM space is 10% of the total data size for each benchmark. In the second configuration, the size of SRAM space is 10% and the size of NVM space is 20% of the total data size for each benchmark. The results are shown in TABLE 3 and TABLE 4.

It is found that the ILP-N allocator and the aDA allocator do not perform well for the selected benchmarks. There are two reasons. First, in embedded systems, register resources are very restricted, and thus no register allocation is conducted in the experiments. Therefore, there is still a large number of compiler temporaries even with traditional optimization techniques enabled. Second, neither the ILP-N allocator nor the aDA allocator considers the live ranges of variables, which results in much higher memory pressure.
Considering memory latency, as illustrated in TABLE 3, the proposed MGC algorithm achieves better memory latency than all the other heuristic algorithms in most cases. On average, compared to MC, the memory latency reduces to 92.0% in the first configuration, and 93.3% in the second configuration. Considering dynamic energy consumption, as illustrated in TABLE 4, MGC achieves lower dynamic energy consumption than all the other heuristic algorithms in most cases. On average, compared to MC, the dynamic energy consumption reduces to 86.9% in the first configuration, and 90.8% in the second configuration. Furthermore, it is shown that the results of MGC are very close to those of the ILP allocator. It is also found that in a few cases MGC performs slightly worse than MC. The reason may be that, as a heuristic, MGC cannot guarantee better colorability than other heuristics in all cases.
Related work
Non-volatile memories (NVMs), with the features of low leakage power and high density, present new opportunities for addressing the memory leakage power consumption problem [19]. As stated above, compared with traditional memory technologies such as SRAM and DRAM, write operations on NVMs suffer from considerably higher energy and longer latency. Much work has been done to overcome this shortcoming. [14] [11] [12] resort to hybrid memory architectures which exploit both NVMs and traditional memory technologies. [16] presents a way to reduce write activities on NVM based main memory via a small victim cache. [7] [9] propose compilation based methods to reduce write activities on NVMs.
Conclusion
In this paper, we propose a novel approach for static allocation of program variables into hybrid SPM based systems. Both an ILP formulation and a graph-coloring based algorithm are presented. The experimental results demonstrate that the proposed graph-coloring framework works well for hybrid SPM allocation, and better than previous works.
