This paper proposes a novel dynamic Scratch-pad Memory (SPM) allocation strategy to optimize the energy consumption of the memory sub-system. First, the whole program execution is sliced into several time slots along the temporal dimension; a Time-Slotted Cache Conflict Graph (TSCCG) is then introduced to model the behavior of Data Cache (D-Cache) conflicts within each time slot. Next, Integer Nonlinear Programming (INP), which avoids the time-consuming linearization process, is used to select the most profitable data pages. The Virtual Memory System (VMS) is adopted to remap those data pages that would cause severe Cache conflicts within a time slot to SPM. To minimize the swapping overhead of dynamic SPM allocation, a novel SPM controller with a tightly coupled DMA is introduced to issue the swapping operations without CPU intervention. Last but not least, this paper quantitatively discusses how the system energy profit fluctuates with the MMU page size and the Time Slot duration. According to our design space exploration, the proposed method can optimize all of the data segments, including global, heap and stack data, and reduces the total energy consumption by 27.28% on average and by up to 55.22%, with a marginal performance improvement. Compared with the conventional static Cache Conflict Graph (CCG), our approach obtains an energy profit of 24.7% on average and up to 30.5%, with a slight boost in performance.
Introduction
Reducing energy consumption remains a major concern in state-of-the-art SoC (system on chip) design [1], [2]. Many studies have shown that optimizing off-chip memory accesses contributes to reducing the total energy consumption of embedded systems [3] - [5]. In practice, traditional Caches, as one type of on-chip memory, exploit a program's locality to reduce accesses to the off-chip memory. However, severe Cache conflicts may occur because of the Cache's limited capacity and associativity, which in turn degrades system performance and eventually increases the total energy consumption.
Some researchers have proposed methods based on the static Cache Conflict Graph (CCG) [6], [7] that copy portions of a program to Scratch-pad Memory (SPM) to improve performance. However, these methods have two drawbacks. First, a static CCG cannot model the dynamic behavior of Cache conflicts. Most programs have several independent execution phases, and each phase exhibits a different memory access pattern. These methods therefore cannot optimize programs phase by phase in the temporal dimension, which reduces the utilization efficiency of the SPM. Second, partitioning programs according to branch instructions may not be suitable for the data sections. The locality of data sections is inferior to that of instructions, which may lead to severe Cache conflicts.
In this paper, a novel Time-Slotted Cache Conflict Graph (TSCCG) [8] is proposed to slice the whole program execution into several Time Slots. Based on the trace files of Cache misses, a conflict graph is established for each phase independently. The Virtual Memory System (VMS) is then adopted to remap portions of data to SPM according to the TSCCG. When the CPU requests data stored in the global, heap or stack segments, the Memory Management Unit (MMU) decides whether its physical location is in the main memory (MM) or in SPM. To minimize the swapping overhead of dynamic SPM allocation, a novel SPM controller with a tightly coupled DMA is introduced to issue the swapping operations without CPU intervention. Finally, Integer Nonlinear Programming (INP) is used to choose the most profitable pages, which should be remapped to SPM.
This paper is organized as follows. Section 1 briefly describes the research background. Section 2 illustrates the differences between our scheme and previous related work. Section 3 details the hardware model and the software strategies needed to manage the SPM dynamically. In Sect. 4, the energy model is established and INP is used to select the optimal solution. The experimental results are presented and discussed in Sect. 5. Section 6 explores the design space, showing how the virtual page size and the Time Slot size affect the system energy profit. Finally, Sect. 7 concludes this paper and outlines future work.
Related Works
Cache is transparent to programmers. However, because of its high energy consumption, large area and unpredictable execution time, its use is restricted in most embedded systems. By contrast, SPM consumes less energy and area. To take advantage of their complementary features, most high-end embedded microprocessors employ both SPM and Cache to optimize system energy consumption and performance. Optimization algorithms are needed to manage the SPM in these systems. However, remapping portions of data from MM to SPM changes the Cache's behavior. Therefore, methods that consider SPM only may not be suitable for systems with both Cache and SPM [9].
Copyright © 2012 The Institute of Electronics, Information and Communication Engineers
Verma et al. showed that the performance and energy overhead caused by Cache conflicts could be reduced [6]. They remapped the instructions that would lead to Cache conflicts to SPM. Based on an analysis of the instructions, Memory Objects (MOs) were introduced, with their boundaries determined by unconditional branches. Each MO was padded with NOP instructions to align it with the Cache line. Unfortunately, their method only considers the static situation and cannot model the program dynamically, which in turn degrades the SPM's efficiency. Moreover, because of its partitioning method, their scheme is restricted to instruction partitions and is not suitable for data segments.
Egger et al. at Seoul National University exploited dynamic management methods based on the MMU, optimizing data [4] and instructions [3] separately. The MMU manages the system and provides a logically contiguous address space that is scattered across different physical address spaces. This mechanism avoids modifying branches or adding special instructions to the source code for accessing the SPM. A direct-mapped Mini-Cache and SPM were used to replace the conventional 4-way set-associative Cache; compared with the latter, the former obtained a considerable system energy profit with a marginal performance improvement. The SPM controller managed the swapping operation of the SPM without additional handling instructions being inserted into user programs. However, the optimal architectures for code and global data are distinct, so their methods cannot optimize different memory objects as a whole. Moreover, the LDR/STR instructions used to transport the swapped contents may contaminate the traditional instruction Cache (I-Cache) and eventually compromise the total profit.
The authors of [8] applied the TSCCG to dynamic SPM allocation for energy reduction, and minimized the swapping overhead between SPM and MM with a dedicated DMA channel and the VMS. However, that work did not consider the offset of the Time Slots caused by the performance improvement, did not detail the composition of the energy reduction, and did not discuss how the virtual page size and the Time Slot size affect the system energy profit.
Spatial, temporal and hybrid methods have been introduced to make effective use of SPM in Priority-Based Preemptive Multi-Task Systems [10]. They relocate the functions of the optimal tasks to SPM and minimize the accesses to external MM. However, they apply only to the instruction space, which has good spatial and temporal locality, and are not suitable for data partitions, especially for heap and stack data that are allocated dynamically. Moreover, the basic optimization granularity is the function, which may be too coarse for data partitions dominated by values and array blocks.
Reference [11] minimized inter-task interference in SPM utilization to reduce the energy of Multi-Task systems. The SPM space was divided into data and code partitions, and each task occupied a separate SPM block. The overlapping partitions of different tasks' blocks were first sliced into active and inactive partitions, and were then optimized to minimize the swapping overhead between SPM and MM during task switching. However, their scheme needs special API functions to be inserted into the source code to count and save the overlapping partitions of the context during task switching. The source code must therefore be available and modified first, which is infeasible in most cases because of software copyright issues. Thus, it cannot optimize programs non-invasively.
As shown above, previous studies mostly focused on data or instruction partitions but not on the stack segment. However, experimental results have shown that the most frequently accessed partitions are almost all concentrated in one or several contiguous address ranges of the stack. Profiling results of the standard benchmarks show that accesses to the stack exhibit obvious spatial locality. The authors of [12] introduced a method that remaps a portion of the stack data to SPM through an arbiter called Shadow Stack Scratch-pad Memory.
Stack and heap data, which are allocated and used dynamically, cannot be analyzed by the traditional SPM optimization methods. Sjodin et al. [13] partitioned the stack data into two parts by the stack pointer and remapped the most frequently accessed part to SPM. Moreover, the most frequently accessed local variables in a function's stack were treated as global ones. Soyoung Park [14] proposed a dynamic address-mapping scheme for stack management assisted by the MMU, without modifying the architecture and without compiler assistance. Some studies transformed heap data into stack data [15], [16] and remapped it to SPM, but the efficiency of this data type conversion is very low. Dominguez et al. [17] proposed a compile-time dynamic allocation strategy to swap some heap data into SPM.
Compared with the earlier studies, Poletti et al. [5] proposed a method combining hardware and software strategies to manage the SPM dynamically and to make full use of the SPM in a multi-threaded environment through the VMS. DMA was exploited to transport data between SPM and MM to reduce the swapping overhead.
Tiago Rogério Mück et al. [30] allocated heap data to the suitable memory level (SPM or external MM) within an OS (Operating System), assisted by annotations inserted by programmers. SPM was allocated dynamically at run time, and an energy profit was obtained at a small cost in performance. However, their scheme has two drawbacks. First, special instructions must be inserted into the source code manually, which means that the source code must be available and modified first; this can be a problem in most cases because of software copyright issues. Second, the scheme can only deal with heap data, or with global data converted to heap data during system initialization, which degrades the optimization efficiency and leads to additional system overhead.
R. Pyka et al. introduced an SPMM (SPM Manager) into an OS for multi-process systems [31] to manage the SPM and optimize its allocation. The memory allocation task was migrated from user applications to the OS. The global data, functions and dynamically allocated data were combined into memory objects. The memory objects were re-allocated according to their optimization efficiency, and the optimal objects were remapped to SPM. A variety of allocation strategies were studied in their approach. Finally, an Optimal Allocation Strategy based on offline ILP (Integer Linear Programming) was developed: by constructing and solving an objective function to be maximized, the optimal system energy consumption is obtained.
Cache and SPM, as the two dominant on-chip memories, complement each other in function, with SPM in particular complementing the Cache, rather than competing with each other. However, in these works they have been considered separately and independently, which does not seem reasonable [9], [12], [31], [32]. In the experiments, the optimization results of systems with either Cache or SPM were compared with those of systems with neither Cache nor SPM, which is unfair in most cases. The authors declared that if they extended their scheme to the RTEMS OS and standard libraries, their SPM-allocation scheme would definitely outperform Caches for any kind of application; this statement is too arbitrary. In fact, many embedded processors and MCUs have adopted a memory architecture with both Cache and SPM in their physical implementation, such as the IXP network processor of Intel [33], XScale of Marvell and the i.MX27 series of Freescale [34].
In this paper, the optimization strategy is based on the VMS, which can analyze the program data as a whole, including heap, stack, global and constant pool data. The data can be swapped to SPM dynamically without modifying the compiler and without requiring the source code, so the program is optimized non-intrusively. The main contributions of this paper are twofold. First, dynamically swapping the SPM contents makes more efficient use of the limited SPM size. Second, our dynamic allocation scheme follows the time-phase characteristics of programs more precisely.
Platform
In this section, the hardware model and software strategy are described in detail. The hardware model consists of an enhanced VMS, an SPM controller with a tightly coupled DMA and a traditional timer. The software strategy sub-section presents the TSCCG, the fundamental basis of this paper. After that, the complete optimization process is illustrated step by step.
Hardware Model
As mentioned above, with the help of the traditional VMS, we can remap those memory pages that may lead to severe Cache conflicts to SPM. However, because of the large conventional virtual page size (typically 4 KB/page), the SPM cannot be managed at a fine granularity, which eventually degrades the SPM's efficiency. Moreover, a large page size significantly increases the swapping burden of the SPM. The micro-paged MMU (512 B/page) of Ref. [4] is therefore adopted in this paper.
Poletti et al. [5] first proposed a dedicated DMA for swapping data between SPM and MM, which prevents I-Cache contamination and limits the performance degradation by increasing the data transfer efficiency, and thus effectively reduces the data-copy overhead. However, their method had to insert high-level APIs into the source code to configure the DMA. This requires direct access to the source code, which might not be available in most cases because of software copyright issues.
To minimize the swapping overhead, this paper proposes a novel SPM controller model with a tightly coupled DMA, as shown in Fig. 2. The system architecture shown in Fig. 1, including the SPM controller model, is implemented in C++. Compared with ARMulator [18], it achieves almost 100% accuracy, and its simulation error is within ±7% compared with the physical ARM926EJ-S chip. Its simulation speed is 3 times that of ARMulator and 8 times that of SimpleScalar [19] - [21].
The SPM controller model consists of the control logic, a tightly coupled DMA and sets of region registers. The DMA operation is configured and issued by the SPM controller itself. The control logic is primarily used to load the configuration of the current Time Slot from the configuration pool and to configure the DMA to issue the SPM swapping.
Once the SPM dynamic allocation algorithm has reached the optimal result, C files containing the configuration information are generated. This information is added to the program when the program is recompiled and linked. The page swapping is then carried out by the dedicated DMA. The SPM Management (SPMM) is a section of software code that applies the results of the dynamic allocation algorithm to the SPM swapping operation during program execution. The configuration information it uses includes three main parts: (1) the start physical address of each page to be remapped to SPM; (2) the block number of each page to be remapped to SPM; (3) the cycle count of the Timer (the Time Slot size).
In this paper, the following three extensions are made to the traditional SPM controller:
(1) A set of base address mapping registers is designed to record the address of the page buffer in MM. After the current page table has been updated, the SPMM links the configuration information of the program image to obtain the actual mapping. During the write-back process, the DMA destination is obtained quickly by reading the write-back address in this register.
(2) A dedicated DMA channel is added to swap data between SPM and MM. Compared with the traditional LDR/STR instructions, the DMA minimizes the swapping overhead and interrupt latency by fully utilizing the burst characteristics of the MM and the AHB (Advanced High-performance Bus). The SPMM reads the configuration information loaded in MM, loads the update information into the DMA controller according to the Time Slots, and completes the DMA configuration. Data swapping is thus accomplished by a single DMA operation or by several consecutive ones.
(3) A write dirty-bit is added to control and reduce the swapping overhead. To guarantee data consistency between SPM and MM, the dirty-bit records write operations to SPM. When a page is written, its dirty-bit is set to '1'. In the later Time Slot, the SPMM decides by the dirty-bit whether to execute the write-back DMA operation. Conversely, a dirty-bit of '0' indicates that the page has not been written, so the write-back DMA operation is skipped and the page is overwritten directly.
As shown in Fig. 3, the SPM Region registers include the page size, two flags (the valid and dirty bits) and the original physical page address, which is loaded into the DMA destination address register. The size field, which occupies 2-4 bits, stands for the page size. The complete process will be presented later.
In addition, this paper introduces a conventional timer to slice the program execution into several Slots, which helps us to explicitly model the dynamic characteristics of the Cache conflicts.
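As a rough illustration of how the valid and dirty flags and the recorded original page address could steer one swap, the following Python sketch plans the DMA transfers for a single SPM region. It is a sketch under stated assumptions, not the controller's actual logic: the names `SpmRegion`, `plan_swap`, `home_addr` and `SPM_BASE`, the SPM base address, and the original address of Page C are hypothetical. Only the 512 B page size and the addresses of Pages A, D and F come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

PAGE_SIZE = 512  # bytes per page, the micro-page size adopted in this paper

# (kind, source address, destination address, byte count)
Transfer = Tuple[str, int, int, int]


@dataclass
class SpmRegion:
    """One SPM region register (field names are hypothetical)."""
    valid: bool = False
    dirty: bool = False
    home_addr: int = 0  # original physical page address in MM


def plan_swap(region: SpmRegion, spm_addr: int, new_home: int) -> List[Transfer]:
    """Plan the DMA transfers that replace the page held in `region`."""
    transfers: List[Transfer] = []
    # A valid page that was written must be copied back to MM first.
    if region.valid and region.dirty:
        transfers.append(("write_back", spm_addr, region.home_addr, PAGE_SIZE))
    # A clean page is simply overwritten by the incoming page.
    transfers.append(("load", new_home, spm_addr, PAGE_SIZE))
    # The register now describes the newly loaded, still clean page.
    region.valid, region.dirty, region.home_addr = True, False, new_home
    return transfers


SPM_BASE = 0x10000000  # hypothetical SPM base address

# Page A was written in the previous slot and is replaced by Page D.
page_a = SpmRegion(valid=True, dirty=True, home_addr=0x007FFF00)
plan_a = plan_swap(page_a, SPM_BASE, 0x0030B700)

# Page C is clean (its original address here is made up) and is replaced by Page F.
page_c = SpmRegion(valid=True, dirty=False, home_addr=0x00300000)
plan_c = plan_swap(page_c, SPM_BASE + 2 * PAGE_SIZE, 0x00300D00)
```

In this sketch the dirty page costs two transfers (write-back, then load) while the clean page costs only one, which is the saving the dirty-bit is meant to provide.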
Software Strategy
The TSCCG introduces a new dimension, time, into the Cache Conflict Graph (CCG). As shown in Fig. 4, the whole program execution is sliced into several phases, and a Conflict Graph is established within each phase independently. In fact, a practical program exhibits obvious spatial and temporal locality during its whole execution. The traditional CCG only considers the spatial locality statically over the whole execution of a program, and cannot model the Cache conflicts according to time phases.
To generate the TSCCG from the Cache miss graph, the following steps are carried out:
(1) Slice the program execution into several Time Slots. In theory, the duration of a Time Slot influences the system optimization. Oversized slots may not reflect the dynamic execution characteristics of a program. On the other hand, our optimization algorithm for SPM swapping compares each swapping overhead with its energy profit; in slots that are too small the profit and the overhead cancel out, so these small slots may eventually be merged into larger ones.
(2) Count the conflicts of each Cache line within each Time Slot. The trace file is used to determine the page replacement.
(3) Since our optimization scheme manages the program in pages, the conflicts of Cache lines are merged by page. With a page defined as 512 B and a Cache line as 32 B, the conflicts of the Cache lines in a page are added together to build up the page's conflicts.
(4) Merge the Cache lines' conflicts into page conflicts to obtain the conflict graph of each Time Slot. As shown in Fig. 5, pages 12362 and 12346 have the most conflicts because of the obvious spatial locality of Cache conflicts. There is no need to increase the associativity or capacity of the whole Cache to remove these conflicts; it is only necessary to remap the frequently conflicting pages to SPM, breaking up the conflict chains and eliminating these conflicts.
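The four steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the tool used in this paper: the trace format (cycle, address of the evicted line, address of the evicting line) and the function names `count_line_conflicts` and `merge_lines_to_pages` are hypothetical, and the example addresses are made up. Only the 512 B page and 32 B Cache line sizes are taken from the text.

```python
from collections import Counter, defaultdict

PAGE_SIZE = 512                          # bytes per page
LINE_SIZE = 32                           # bytes per Cache line
LINES_PER_PAGE = PAGE_SIZE // LINE_SIZE  # 16 Cache lines per page


def count_line_conflicts(trace, slot_cycles):
    """Steps (1)-(2): slice by Time Slot and count conflicts per Cache line.

    `trace` yields (cycle, victim_addr, evictor_addr) records, a hypothetical
    format in which the line at evictor_addr replaced the line at victim_addr.
    """
    per_slot = defaultdict(Counter)
    for cycle, victim_addr, evictor_addr in trace:
        slot = cycle // slot_cycles
        edge = (evictor_addr // LINE_SIZE, victim_addr // LINE_SIZE)
        per_slot[slot][edge] += 1
    return per_slot


def merge_lines_to_pages(per_slot_lines):
    """Steps (3)-(4): merge line-level conflicts into page-level edges."""
    graphs = {}
    for slot, line_edges in per_slot_lines.items():
        page_edges = Counter()
        for (evictor_line, victim_line), misses in line_edges.items():
            edge = (evictor_line // LINES_PER_PAGE, victim_line // LINES_PER_PAGE)
            page_edges[edge] += misses
        graphs[slot] = dict(page_edges)
    return graphs


# Two conflicts in slot 0 and one in slot 1, all between pages 24 and 8.
trace = [(10, 0x1000, 0x3000), (20, 0x1020, 0x3020), (1500, 0x1000, 0x3000)]
tsccg = merge_lines_to_pages(count_line_conflicts(trace, slot_cycles=1000))
```

Each key (evictor page, victim page) in a slot's dictionary plays the role of one weighted edge of that slot's conflict graph.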
The TSCCG G(T) = (page, E) is a directed weighted graph with node set page = {page_1, ..., page_n}, where T indicates the Time Slot. Each vertex page_i in G(T) corresponds to one page, which is also the granularity of our optimization method. The edge set E contains an edge e_ij from node page_i to page_j, which denotes that a Cache line belonging to page_j is replaced by a Cache line belonging to page_i. The weight m_ij of the edge e_ij stands for the number of Cache misses of page_i that occur due to page_j. The weight dw(n_i) of vertex page_i denotes the total number of accesses to page_i.
Figure 6 shows the swapping operation and control flow within one Time Slot. The three shaded pages will be swapped to SPM in the next Time Slot, and the other pages, marked by '0', remain in MM. Pages A and B were written (dirty) in the previous Slot and must be written back to their original physical addresses recorded in the SPM control registers. Page C has not been written and need not be written back. According to the configuration information, PAGE C is replaced by PAGE F (0x00300D00); PAGE A is replaced by PAGE D (0x0030B700) and written back to 0x007FFF00; PAGE B is swapped with PAGE E (0x0030AA00) and written back to 0x0030B700. After the swapping operation is completed, the Page Table is modified to remap the virtual addresses to the new physical addresses, so that the program accesses the right pages after the addresses of PAGE A-F have changed. After exiting the interrupt handler, the program accesses PAGE A-C in MM and PAGE D-F in SPM. The PTE (Page Table Entry) of each page in SPM is marked as non-Cacheable, which eliminates the Cache conflicts caused by these pages.
Energy Model and Algorithm
The TSCCG is established to model the Cache conflict behavior of each page according to the profile information. To select the pages that cause the most Cache conflicts (the most profitable pages), a mathematical model is derived from the TSCCG, and mathematical tools are used to obtain the optimal solution.
Energy Model
The system's energy model is first established to quantify the change in system energy when some of the nodes are remapped to the SPM. The energy consumed by the cacheable pages in a given Time Slot TS_k can be described by the following equation:
R/WHit() and R/WMiss() return the numbers of read/write hits and misses within TS_k. In our scheme, the write-through and read-allocate strategies are adopted. On this premise, (1) can be simplified as (2):
Since the read-allocate strategy is adopted for the Cache, the energy consumption of a data read miss can be described as (3):
In (3), the first term stands for the energy consumed by the two Cache read operations, and the second term denotes the energy consumed by filling a new Cache line when a read miss happens.
DMiss(page_i, TS_k) in (4) stands for the Cache miss number of page_i due to page_j in TS_k. The function N(page_i, TS_k) denotes the set of pages that lead to conflicts with page_i in TS_k. The total number of CPU accesses to a page in TS_k is a constant, described as the weight dw(n_i) of node n_i in (5).
If a page residing in MM is remapped to SPM, its energy consumption within TS_k can be described by the following equation:
In (6), l(page_i, TS_k) stands for the page swapping overhead. For a given page, this overhead does not occur in every Time Slot: for example, if page_i was already swapped to SPM in the previous Time Slot, it need not be transferred again. The factor W (0 ≤ W ≤ 1), obtained by experiments, is the probability of writing back a single page.
Mathematical Model
To select the most profitable pages, the two energy models (Cache and SPM) are combined into a unified one. The location of a page in the memory hierarchy is described by
When page_i is accessed in TS_k, the energy consumption can be described as (8). The two energy terms in (8) have been given in (2) and (6), respectively. E(TS_k) denotes the total energy consumption within TS_k:
Only p(page_i, TS_k) and l(page_j, TS_k) in (9) are unknown. Because the SPM size is limited, not all of the page nodes in the TSCCG can be placed into SPM. The constraint on the SPM size is described as follows:
The mathematical model can now be described as (11). The goal of the dynamic allocation scheme is to minimize the value of (9) subject to the constraint (10).
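To make the shape of this selection problem concrete, the following Python sketch solves a toy single-slot instance by exhaustive search. It is only an illustration under strong assumptions and not the energy model of (1)-(11): every energy constant, the function name `best_spm_pages` and the example graph are hypothetical, each selected page is assumed to pay one swap cost, and a conflict miss on an edge is assumed to disappear as soon as either of its two pages resides in SPM (which is what makes the objective nonlinear in the 0/1 placement variables).

```python
from itertools import combinations


def best_spm_pages(accesses, conflicts, spm_pages,
                   e_cache, e_spm, e_miss, e_swap):
    """Pick at most `spm_pages` pages for SPM so that slot energy is minimal.

    accesses:  {page: access count in this Time Slot}
    conflicts: {(page_i, page_j): conflict misses on that edge}
    All energy parameters are hypothetical per-event costs.
    """
    pages = sorted(accesses)
    best_energy, best_set = float("inf"), frozenset()
    for k in range(spm_pages + 1):            # capacity constraint
        for chosen in combinations(pages, k):
            in_spm = set(chosen)
            energy = 0.0
            for p in pages:
                if p in in_spm:
                    energy += accesses[p] * e_spm + e_swap
                else:
                    energy += accesses[p] * e_cache
            # A conflict miss remains only if both pages stay cacheable.
            for (i, j), misses in conflicts.items():
                if i not in in_spm and j not in in_spm:
                    energy += misses * e_miss
            if energy < best_energy:
                best_energy, best_set = energy, frozenset(in_spm)
    return best_energy, best_set


accesses = {1: 100, 2: 100, 3: 10}
conflicts = {(1, 2): 50, (2, 1): 50, (3, 1): 1}
energy, chosen = best_spm_pages(accesses, conflicts, spm_pages=1,
                                e_cache=1.0, e_spm=0.5,
                                e_miss=10.0, e_swap=20.0)
```

With room for a single page, the search places page 1 in SPM because it takes part in every conflict edge. Exhaustive search only works for a handful of pages, which is why a dedicated solver is needed for realistic conflict graphs.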
Integer Nonlinear Programming
Solving the energy model is a typical optimization problem, as formulated in (11). Since (9) contains the quadratic term (12), the problem in (11) becomes a typical integer nonlinear problem.
In [2], a new variable P(page_i, page_j, TS_k) was introduced to transform the nonlinear problem into a linear one. However, this transformation introduces more nodes and constraints. Given N nodes in a previous slot, N * (N − 1) variables P(page_i, page_j, TS_k) are inserted in the current slot. The total number of variables is given by (13):
Since each variable P(page_i, page_j, TS_k) corresponds to three constraints, the N * (N − 1) variables bring N * (N − 1) * 3 constraints; thus the total number of constraints is:
When the conflict graph contains a large number of nodes, the problem becomes very difficult to solve because of the many variables and constraints. In this paper, the MATLAB Toolbox bnb20 [22] is used to tackle the nonlinear problem in (11) directly and thus obtain the optimal pages that minimize the objective function.
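To put the growth caused by linearization into numbers, the short Python sketch below evaluates the two counts stated above, N * (N − 1) additional variables and three constraints for each of them. The function name `linearization_overhead` is hypothetical, and the totals of (13) and (14), which also include the original variables and constraints, are not reproduced.

```python
def linearization_overhead(n_nodes):
    """Extra variables and constraints added by linearizing one slot."""
    extra_variables = n_nodes * (n_nodes - 1)   # one P per ordered page pair
    extra_constraints = 3 * extra_variables     # three constraints per P
    return extra_variables, extra_constraints


for n in (10, 100):
    print(n, linearization_overhead(n))
```

For a conflict graph with 100 nodes, this already means 9900 additional variables and 29700 additional constraints in a single slot, which illustrates why avoiding the linearization is attractive.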
Time Slot Adjustment
After the profitable pages are remapped into SPM, most of the Data Cache conflicts are eliminated, and the total energy consumption of the memory subsystem decreases without any performance degradation. This shortens the execution time of each slot. Therefore, the start time of each slot should be revised so that each slot after optimization is aligned with the corresponding slot before optimization. The gap between the start times before and after optimization is determined iteratively and adjusted through the SPM controller. The iteration process is shown in Fig. 7.
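One pass of such an adjustment can be sketched as follows. This is only an illustration under assumptions, not the iteration of Fig. 7: it assumes that slot k after optimization should cover the same program phase as slot k before optimization, and that the per-slot execution times after optimization are known from simulation. The function name `realign_slot_starts` and the cycle counts are hypothetical.

```python
def realign_slot_starts(original_durations, optimized_durations):
    """Return the revised slot start times and their gaps to the originals.

    Both arguments list per-slot execution times in cycles, in slot order.
    """
    new_starts, gaps = [], []
    original_start = new_start = 0
    for before, after in zip(original_durations, optimized_durations):
        new_starts.append(new_start)
        gaps.append(original_start - new_start)  # how much earlier the slot begins
        original_start += before
        new_start += after
    return new_starts, gaps


starts, gaps = realign_slot_starts([1000, 1000, 1000], [900, 950, 1000])
```

In this example the third slot starts 150 cycles earlier than before optimization. In practice, moving the slot boundaries can change which pages are profitable, so the allocation and the adjustment would be repeated until the start times stop changing.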
Experimental Results
Our optimization scheme focuses on reducing the system energy by dynamically remapping pages that may cause severe Cache conflicts to SPM, according to the dynamic execution characteristics of the program. The optimization results can therefore be clearly illustrated by comparing the number and distribution of Cache misses before and after optimization.
In real applications, more than one program may be executed, with dynamic switching between them at run time. For this reason, we select the benchmarks JPEG-dec, MPEG2, FFT and CRC32 to construct a new benchmark called Combine. In Combine, the four programs are executed in sequence, and their multi-phase characteristics can be effectively exploited by the dynamic management of the program. Figure 8 (a) and Fig. 8 (b) compare the distribution of Cache misses before and after optimization. The X axis stands for the Time Slots, the Y axis for the data page address, and the Z axis for the number of Cache misses. Figure 9 (a) and Fig. 9 (b) show the system energy distribution before and after optimization. The total energy has been reduced by 41.41%. Here, 8KB-1A-8K stands for an 8KB-1A Cache coupled with an 8 KB SPM. Most of the accesses to external MM have been removed, so the energy consumed by accessing the MM and the off-chip bus has been significantly reduced [23], [24], and the energy consumption of the CPU core has been reduced because of the shorter execution time. Since the pages that were accessed through the Cache and might cause severe Cache conflicts have been remapped to SPM, the energy of the D-Cache has decreased by a large proportion. In addition, an SPM access consumes less energy than a Cache access, so the total energy consumption of the data portions is reduced further. Some instruction pages are also remapped to SPM, marked as non-Cacheable and accessed from SPM directly, so the energy consumption of the I-Cache is reduced as an additional profit.
Figure 10 (a) shows the Cache miss map of the Dijkstra benchmark before optimization, and Fig. 10 (b) shows the Cache miss distribution after optimization. Comparing the two shows that there are obvious Cache conflicts in the instruction and stack spaces before optimization. After an additional 4 KB SPM is configured and our dynamic optimization algorithm is applied, the conflicts in the instruction space are almost eliminated, and the conflicts in the stack space are reduced by over 75%. For the heap space, the conflicts are reduced by over 50% because most conflicting pages are remapped to SPM.
We take 14 benchmarks from MediaBench [25] and MiBench [26] to analyze our optimization results quantitatively. In the experiments, the D-Cache is configured as 4KB-1A-4K, and the optimization results are compared with 8KB-1A, 8KB-2A and 8KB-4A set-associative D-Caches. Although our method focuses on energy optimization, the system performance is also improved because of the reduced execution time. Although the energy consumption of the D-Cache accounts for a small proportion of the total energy, the accesses to external MM caused by Cache conflicts contribute a considerable proportion of the total system energy.
The processor's energy is based on Ref. [27]. The energy consumed by the Cache, TLB and SPM is obtained from CACTI 6.5 [28], and the main CACTI parameters are shown in Table 4. The energy consumption of the MM and the off-chip system bus is taken from [23] and [24]. We ignore the energy consumed by the SPM controller; following [29], only the energy of the memory accesses issued by the DMA is included. After optimization, the energy consumption of the MM and the off-chip bus is reduced significantly, while the additional SPM access energy is negligible in comparison.
The energy profits of our optimization method are shown in Table 1 and Fig. 11. Comparing 4KB-1A-4K with the 8 KB set-associative Caches, the total energy consumption is reduced by 27.28% on average and by up to 55.22%. As shown in Table 2 and Fig. 12, the system performance is improved by 9.8% on average after applying our optimization method.
Because the optimization objects, simulation platforms, algorithms, etc. differ among approaches, it is difficult to compare their experimental results directly. Our TSCCG extends the static CCG (Cache Conflict Graph) proposed by Verma et al. [6] by counting the Cache conflicts along the temporal dimension. As mentioned in Sect. 6.2, the duration of a time slot influences the optimization results of our TSCCG. A slot that is too short leaves the data pages selected and placed in SPM unchanged most of the time. Conversely, an oversized slot makes the dynamic algorithm degenerate into a static one. Therefore, our TSCCG can easily be compared with the static CCG. The experimental comparison between our TSCCG and the static CCG for the Combine benchmark is shown in Table 3.
The improvement from the static CCG to the TSCCG is substantial, as shown in Table 3. Over the six configurations, our TSCCG reduces the system energy by 24.7% on average and by up to 30.5%; the system performance is improved by 5.5% on average and by up to 6.8% compared with the static CCG. We conclude that our TSCCG matches real program behavior better than the traditional static CCG.

Table 1 Energy comparison of benchmarks in 4KB-1A-4K and 8KB set-associative D-cache configurations (Unit: nJ).
In theory, a static approach could obtain the optimal system energy by remapping all the Cache conflict pages to SPM, provided that adequate SPM space is available. Our design philosophy, in contrast, considers not only the system energy but also the limited chip area and cost. We exploit the temporal locality of Cache conflicts by slicing the whole program execution into a series of Time Slots, dynamically swapping the most profitable pages into SPM and replacing them once they are no longer profitable. In this way the limited SPM is used more efficiently, and the scheme achieves the highest cost-effectiveness of the on-chip resources while optimizing the system energy under the constraints of die area and cost.
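The difference between a static and a time-slotted selection under a limited SPM can be illustrated with a minimal sketch. It is not the INP formulation used in this paper: it assumes a simple greedy choice of the most conflicting pages, ignores the swapping overhead, and uses made-up conflict counts. The function names and the `slots` data are hypothetical.

```python
from collections import Counter

def static_selection(slot_conflicts, spm_pages):
    """Conflicts removed when one fixed page set is chosen for the whole run."""
    total = Counter()
    for slot in slot_conflicts:
        total.update(slot)  # sum conflict counts over all time slots
    chosen = {page for page, _ in total.most_common(spm_pages)}
    return sum(slot.get(page, 0) for slot in slot_conflicts for page in chosen)

def time_slotted_selection(slot_conflicts, spm_pages):
    """Conflicts removed when pages are re-chosen per time slot (swap cost ignored)."""
    covered = 0
    for slot in slot_conflicts:
        covered += sum(count for _, count in Counter(slot).most_common(spm_pages))
    return covered

# Hypothetical per-page conflict counts for three time slots (illustrative only).
slots = [
    {"A": 90, "B": 5, "C": 5},
    {"A": 5, "B": 90, "C": 5},
    {"A": 5, "B": 5, "C": 90},
]

print(static_selection(slots, spm_pages=1))
print(time_slotted_selection(slots, spm_pages=1))
```

With room for a single page in SPM, the static choice removes the conflicts of only one page, whereas the per-slot choice follows the page that conflicts most in each slot. A real comparison would also have to charge the swapping energy for each page change.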
Design Exploration

Page Size Selection
The minimal control granularity of SPM is determined by the virtual page size, which allows the on-chip memory to be utilized efficiently by the VMS. In a page-managed system, the page size is critical to the design of the TLB (Translation Look-aside Buffer). With smaller pages, the address space covered by each PTE is reduced accordingly. If the TLB is not redesigned for this situation, accesses to MM will increase dramatically due to frequent TLB misses, which increases the system energy. On the other hand, with larger pages the address space covered by each PTE expands, but swapping pages into SPM becomes more expensive because of the following two negative effects:
(1) Increased page swapping overhead. The cost of loading pages into SPM increases dramatically for large pages.
(2) Decreased SPM efficiency. In this paper, SPM holds the pages that would lead to severe Cache conflicts. However, these conflicts lie in only one or several Cache lines; if the page size is too large, the proportion of SPM that is effectively used is reduced, and the optimal result may not be reached.
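The two effects above can be made concrete with a small back-of-the-envelope sketch. The 32-byte Cache line size and the assumption of two conflicting lines per page are illustrative choices, not values taken from this paper, and the helper names are hypothetical.

```python
def swap_bytes(page_size, pages_swapped):
    """Bytes moved by the DMA when loading pages into SPM (write-back not modeled)."""
    return page_size * pages_swapped

def spm_effective_fraction(conflict_lines, line_size, page_size):
    """Share of an SPM-resident page that actually holds conflicting Cache lines."""
    return min(1.0, conflict_lines * line_size / page_size)

LINE_SIZE = 32  # bytes; assumed Cache line size, not from the paper

for page_size in (256, 512, 1024, 4096):
    frac = spm_effective_fraction(conflict_lines=2, line_size=LINE_SIZE,
                                  page_size=page_size)
    print(page_size, swap_bytes(page_size, 1), f"{frac:.3f}")
```

Under these assumptions, growing the page from 256B to 4KB multiplies the bytes moved per swap by 16 and cuts the useful share of the page by the same factor.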
To determine the appropriate TLB hierarchy, associativity and number of entries for a fixed page size, we carried out a design space exploration based on 7 benchmarks. The Dijkstra benchmark, shown as an example in Fig. 15, illustrates that, for both the L1 (Level 1) and the L2 TLB, a larger capacity and higher associativity lead to less overhead for small-page management. Since TLBs are essentially fully or set-associative Caches, however, larger capacity and higher associativity also cost more area, access energy and latency. In practice, we select 16 entries for the D-uTLB, 8 entries for the I-uTLB, a 64-entry, 32 set-associative unified L2 TLB, and a 512B page. Compared with the TLB for a 4KB page, the TLB energy consumption is 1.32% higher on average for a 1KB page, and 3.79% and 8.80% higher for 512B and 256B pages respectively, as shown in Fig. 16.
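The pressure that small pages put on the TLB can be seen from the TLB reach, i.e. the number of entries multiplied by the page size. The sketch below applies this standard definition to the 16-entry D-uTLB and the 64-entry L2 TLB chosen above; the function name is hypothetical.

```python
def tlb_reach(entries, page_size):
    """Address space (in bytes) that a TLB can map without a miss."""
    return entries * page_size

for page_size in (256, 512, 1024, 4096):
    d_utlb = tlb_reach(16, page_size)   # 16-entry D-uTLB
    l2_tlb = tlb_reach(64, page_size)   # 64-entry unified L2 TLB
    print(f"{page_size:>4}B page: D-uTLB covers {d_utlb // 1024}KB, "
          f"L2 TLB covers {l2_tlb // 1024}KB")
```

With a 512B page the D-uTLB covers 8KB and the L2 TLB covers 32KB, one eighth of what the same structures cover with a 4KB page, which is why the TLB configuration has to be revisited for small pages.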
As shown in Fig. 13 , "Non-optimization 256" denotes the system energy in 8KB-1A with 256B page before optimization; "Optimization 256" stands for the system energy in 4KB-1A-4K with 256B page after optimization. "Non-optimization 512" and "Optimization 512" stand for the system energy in 8KB-1A and 4KB-1A-4K with 512B page, before and after optimization respectively. Comparing "Non-optimization 256" with "Optimization 256", and "Non-optimization 512" with "Optimization 512", it is shown that the system energy of 512B page is lower than 256B one whenever before or after optimization. After op- timization, the energy profit is 24.52% and 27.28% on average for system with 256B and 512B page respectively. Considering the TLB energy consumption and system energy profit in different page size configuration, trading off these two sides, 512B page is taken in this paper. 
Time Slot Size Discussion
The duration of a Time Slot influences the optimization results of the TSCCG. If the slot is too short, the data pages that have been selected and placed on SPM remain unchanged most of the time. On the other hand, an oversized slot makes the dynamic algorithm degenerate into a static one. As shown in Fig. 17, the JPEG-enc benchmark is selected to illustrate the temporal and spatial locality of D-Cache miss patterns. When the memory access pattern is sampled with the most appropriate period, it exhibits obvious periodic characteristics, which SPM can exploit to allocate the optimal partitions. Based on the optimization algorithm described in Sect. 3.2, we carried out a design exploration for the time slot size. The distributions of Cache misses for Time Slots of 2,500,000, 2,000,000 and 1,000,000 cycles are compared with each other. The horizontal axis stands for time and the vertical axis for address space. Each small box denotes the number of Cache misses within a page in the current Time Slot, and its gray level indicates the severity of the Cache conflicts. When the Time Slot is set to 2,500,000 cycles, the conflicts remain highly severe within some pages; for 1,000,000 cycles, however, the conflicts show an apparent periodic characteristic.
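The per-slot, per-page miss counts that such a plot visualizes can be obtained by bucketing a miss trace, as in the following minimal sketch. It assumes the trace is available as (cycle, address) pairs; the function name and the four sample misses are hypothetical.

```python
from collections import defaultdict

def miss_histogram(misses, slot_cycles, page_size):
    """Count D-Cache misses per (time slot index, page index)."""
    hist = defaultdict(int)
    for cycle, address in misses:
        hist[(cycle // slot_cycles, address // page_size)] += 1
    return dict(hist)

# Hypothetical miss trace: (cycle, byte address).
trace = [(100, 0x1000), (900_000, 0x1010), (1_200_000, 0x1000), (2_400_000, 0x2200)]

fine = miss_histogram(trace, slot_cycles=1_000_000, page_size=512)
coarse = miss_histogram(trace, slot_cycles=2_500_000, page_size=512)
print(fine)
print(coarse)
```

With the 1,000,000-cycle slot the misses spread over three slots, so changes in the conflicting pages become visible; with the 2,500,000-cycle slot they all collapse into a single slot.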
As shown in Fig. 14, the system energy for each Time Slot size is normalized to the energy obtained with 2,500,000 cycles and then compared. For most benchmarks, the system energy is lowest with a Time Slot of 1,000,000 cycles. When the slot size is reduced further, the swapping overhead offsets the energy profit, and small slots may eventually be combined into larger ones. Based on a large number of experiments, 1,000,000 cycles is therefore selected as the Time Slot in this paper.
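The normalization used for this comparison can be written down in a few lines. The energy values below are placeholders chosen for illustration only and are not measurements from this paper; the function name is hypothetical.

```python
def normalize(energy_by_slot, baseline_slot=2_500_000):
    """Divide each energy value by the energy of the baseline Time Slot size."""
    base = energy_by_slot[baseline_slot]
    return {slot: energy / base for slot, energy in energy_by_slot.items()}

# Placeholder energies in nJ, keyed by Time Slot size in cycles (not measured data).
energy_nj = {2_500_000: 200.0, 2_000_000: 190.0, 1_000_000: 170.0}

normalized = normalize(energy_nj)
best = min(normalized, key=normalized.get)  # slot size with the lowest energy
print(normalized, best)
```

A normalized value below 1 means that the corresponding slot size consumes less energy than the 2,500,000-cycle baseline.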
Conclusions and Future Work
In this paper, we use a VMS to remap the data pages that cause severe Cache conflicts within a Time Slot to the SPM, guided by our proposed TSCCG. A novel SPM controller model with a tightly coupled DMA is implemented to minimize the swapping overhead of dynamic SPM allocation, and INP is used to select the most profitable pages. Finally, by analyzing the Cache and SPM co-architecture, we explore how to select the virtual page size and the Time Slot duration that yield the best system energy. This work may serve as a reference for optimizing CPU design. The experimental results show that our approach reduces the energy consumption considerably while slightly improving performance. In future work, the virtual page size and the Time Slot duration should be adjusted dynamically according to the optimization algorithm and the program execution characteristics, so that the model follows real program behavior more closely and the system energy and performance can be improved further.
