Abstract-Applications that run on embedded systems normally must finish within a timing constraint and in an energy-efficient fashion. Because of these two requirements, embedded systems often employ software-controlled scratch pad memory (SPM) instead of hardware-controlled cache as their on-chip memory. Data accesses in SPMs are controlled purely by software, which provides better time predictability and precise time control. In this paper, we propose a time-, energy-, and area-efficient domain wall memory (DWM)-based SPM for embedded systems. To efficiently manage this novel type of SPM, an integer nonlinear programming formulation and the instructions group schedule algorithm are proposed to generate the memory access instruction schedule and data placement. In addition, the longest move reduce algorithm is proposed to configure different types of DWM cells to achieve minimal area. Experimental results show that the proposed techniques can generate a configuration of DWM-based SPM with minimal area while satisfying the time constraint.
I. INTRODUCTION

IN EMBEDDED systems, applications should finish within a specific time frame (time constraint) and with low energy consumption. A large amount of data usually has to be processed in embedded systems such as anti-lock braking systems (ABS), energy conservation systems, and high-confidence medical systems. In addition to computation, memory accesses in these embedded systems account for a large percentage of time and energy. The data transfer between the application-specific integrated circuit and off-chip memories consumes 50%-80% of the power [1]. On-chip cache typically consumes 25%-45% of the processor's area and energy [2]. Therefore, time-efficient and energy-efficient memories are desirable for embedded systems.
To achieve timing predictability and energy efficiency, software controlled scratch pad memory (SPM) [2] is usually employed to substitute hardware controlled cache in embedded systems. Different from cache, the data transfer between SPM and off-chip memory is explicitly managed by software which leads to better timing predictability [3] . Besides, SPM also has small die area and low energy consumption compared with cache. Therefore, SPM has been used as an alternative to cache in many embedded systems [2] - [6] .
Emerging nonvolatile memories (NVMs), such as magnetic random access memory, spin-transfer torque random access memory, ferroelectric random access memory, phase-change memory, and domain wall memory (DWM), have many attractive characteristics such as low leakage power, fast read access, high density, and nonvolatility. In recent years, researchers have been trying to adopt NVMs in the existing memory hierarchy to take advantage of these benefits. Among those NVMs, DWM is a promising replacement for SRAM as on-chip memory due to its extremely high density along with ultralow standby power, efficient read/write energy, and best-case access latencies comparable to SRAM [7], [8]. Therefore, DWM has captured the attention of academia and the semiconductor industry in recent years. Several research efforts have been devoted to understanding the device characteristics [7], [9], [10] and fabricating functional prototypes [8]. However, DWM has unique characteristics that pose significantly different challenges from all other nonvolatile memory technologies.
There are two different types of DWM cells: 1) micro-cell and 2) macro-cell. In micro-cell DWM, the access speed is fast while the density is low. In macro-cell DWM, the density is high, but the access speed varies. The high density of macro-cell DWM is achieved by sharing access transistors among multiple magnetic domains, where data are sequentially stored in "tapes." Before a read/write operation is performed, several shift operations are required to align the data with the read/write port. Thus, the access speed of macro-cell DWM varies, depending on the number of shift operations required. While the best-case access latencies are close to those of SRAM, the worst-case access latencies can be several times higher. From our analysis, we found that the worst case occurs when two consecutive instructions access two data that are located at the two opposite ends of the tape. In this case, the maximum number of shift operations is required to access these two data consecutively. By carefully scheduling memory access instructions and placing the data, the worst case can be avoided.
In this paper, we propose a time-, energy-, and area-efficient DWM-based SPM as on-chip memory for embedded systems. In order to achieve area efficiency, we first adopt pure macro-cell DWM to build the SPM. To improve the performance of macro-cell DWM-based SPM, we propose an integer nonlinear programming (INLP) formulation to achieve the minimum number of shifts through memory access instruction scheduling and data placement. Since INLP takes exponential time to finish, we propose a polynomial-time solution, the instructions group schedule (IGS) algorithm. However, the time constraint of applications may not be satisfied even though the number of shifts is minimal. In such cases, it is necessary to adopt an SPM which consists of both micro-cell and macro-cell DWM. The longest move reduce (LMR) algorithm is then proposed to find a configuration of different DWM cells that can satisfy the time constraint with minimal area. The main contributions of this paper include the following.
1) We propose an INLP formulation to generate the optimal memory access instruction schedule and data placement, which achieve the minimum number of shifts in an SPM that consists only of macro-cell DWM.
2) We propose a polynomial-time algorithm, the IGS algorithm, to minimize the number of shifts.
3) We propose the LMR algorithm to find the best configuration of micro-cells and macro-cells for DWM-based SPM to balance the memory area and performance.
The experimental results show that our proposed techniques are efficient.
The rest of this paper is organized as follows. Related works are discussed in Section II. In Section III, background and definitions are introduced. Motivational examples are presented in Section IV. The proposed techniques are presented in Section V. The experiments are presented in Section VI. Section VII concludes this paper.
II. RELATED WORKS
There are various works focusing on using SPM in the memory architecture of embedded systems [2], [5], [6], [11]-[13]. In [12], an optimal data allocation for scalar variables on a single CPU with SPM was proposed to improve the performance of embedded systems. Hu et al. [5] proposed an energy-efficient hybrid on-chip SPM with NVM. Many research works show the potential of DWM, a recently emerged NVM, as a promising memory technology. DWM storage applications have been demonstrated, such as the array integration at standard IBM 90 nm technology [14] and the content addressable memory design and fabrication [15]. Exploiting its tape-like nature, DWM was shown to be well suited for designing on-chip FIFOs [16]. In [8], DWM was also proposed as a potential replacement for secondary storage.
There are several previous works focusing on the tradeoff between density and average access latency in hardware [10], [14], [17]-[20]. Techniques such as cache management policies for head selection and update were proposed to address the overhead of shifts [10], [17] when DWM is used in cache. Mao et al. [18] proposed a write buffer structure and a DWSW-RM aware warp scheduling algorithm to improve the performance and energy efficiency of a DWM-based register file. Sun et al. [19] proposed a cross-layer DWM design, covering cell size design, array design, and optimized architecture design, to achieve high performance of DWM. There are also some previous works that improve the performance of DWM with scheduling and data management techniques. Sun et al. [21] proposed an application-driven data management policy to minimize shifts. These techniques need hardware support in the cache and cannot be applied to SPM directly. In cache-based systems, task scheduling is controlled by software, and data is managed by hardware logic. In SPM-based systems, both task scheduling and data management are controlled by software. Therefore, scheduling for SPM-based systems requires additional consideration for data management compared with scheduling for cache-based systems. In this paper, we propose techniques to improve the performance of DWM-based SPM such that the real-time constraint of applications can be satisfied.
III. BACKGROUND AND DEFINITION
In this section, we first give some background knowledge about DWM. Then, the proposed DWM-based SPM and the target architecture are presented. Finally, we present the definitions that are used in this paper.
A. Micro-Cell and Macro-Cell DWMs
When designing DWM-based SPM, two different cells are considered: 1) micro-cell and 2) macro-cell.
1) Micro-Cell DWM: As shown in Fig. 1(a), the primary components of a micro-cell are two fixed domains, one free domain, three transistors, and a magnetic tunnel junction (MTJ) sensor. Each micro-cell can store only 1 bit in the free domain. If the orientation in the free domain is in the same direction as the orientation in the MTJ, the micro-cell stores a "0"; otherwise, it stores a "1." The read operation is carried out using the MTJ. The write operation is carried out by shifting the orientation of the appropriate fixed domain to the free domain.
2) Macro-Cell DWM: As shown in Fig. 1(b), a macro-cell consists of a ferromagnetic wire (also called a tape), two fixed domains, an MTJ, and five access transistors. There are multiple domains in the tape. Each domain can store 1 bit; therefore, multiple bits can be stored in the tape. The read and write operations are similar to those of the micro-cell. However, before a read or write, the tape must be shifted to align the bit with the R/W port. The macro-cell achieves high density by sharing R/W ports among multiple bits, but this sharing also leads to a shift penalty.
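The shift penalty can be made concrete with a small Python sketch. It assumes a single R/W port per tape that starts at the leftmost cell and counts one shift per cell of distance between the port and the requested bit; the function and variable names are illustrative and do not come from the paper.

```python
def count_shifts(placement, accesses, start=0):
    """Count single-cell shifts needed to serve `accesses` on one tape.

    placement: list of data names, index = cell position on the tape.
    accesses:  sequence of data names in access order.
    start:     cell initially aligned with the R/W port (assumed leftmost).
    """
    position = {name: cell for cell, name in enumerate(placement)}
    port = start
    shifts = 0
    for name in accesses:
        target = position[name]
        shifts += abs(port - target)  # shift until the target is under the port
        port = target
    return shifts


tape = ["A", "B", "C", "D"]
best = count_shifts(tape, ["A", "A", "B", "B"])   # neighbouring cells
worst = count_shifts(tape, ["A", "D", "A", "D"])  # alternates between both ends
```

Under these assumptions, the alternating sequence needs 9 shifts while the grouped sequence needs only 1, which is the kind of gap that scheduling and data placement try to close.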
B. DWM-Based SPM
The target architecture of this paper is a single core with on-chip SPM system, as shown in Fig. 2(a) . The proposed DWM-based on-chip SPM is marked with rectangle, which consists of both micro-cell and macro-cell. This design can achieve higher performance than pure macro-cell-based SPM and higher density than pure micro-cell-based SPM when the configuration of SPM is organized appropriately.
In this paper, we adopt the layout of macro-cell DWM proposed in [21], as shown in Fig. 2(b). The tapes of multiple macro-cells are placed side by side above the transistors. The corresponding R/W ports are placed in a diagonal manner. Note that these macro-cells have their own write-lines and bit-lines but share one source-line. This design makes full use of the space above the CMOS layer, which improves the density of macro-cell-based memory.
C. Definitions
In this paper, we use access instruction graph (AIG) to model the basic block of an application. The AIG is defined as follows.
Definition 1: An AIG G = <V, E, D> is a directed graph, where V represents a set of memory access instructions, E ⊆ V × V is a set of edges to represent the dependencies between memory access instructions, and D is a set of data that is accessed by V.
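As an illustration of Definition 1, the following Python sketch stores an AIG as a mapping from instructions to the data they access plus a set of dependency edges. The class and field names are hypothetical, and the load/store type of each instruction is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class AIG:
    """Access instruction graph G = <V, E, D> (names are illustrative)."""
    data_of: dict                              # instruction id -> accessed datum
    edges: set = field(default_factory=set)    # (u, v): v depends on u

    @property
    def instructions(self):                    # V
        return set(self.data_of)

    @property
    def data(self):                            # D
        return set(self.data_of.values())

    def ready(self, done):
        """Unexecuted instructions whose predecessors are all in `done`."""
        blocked = {v for (u, v) in self.edges if u not in done}
        return sorted(self.instructions - set(done) - blocked)


g = AIG(data_of={1: "A", 2: "B", 3: "A", 4: "C"},
        edges={(1, 3), (2, 3), (3, 4)})
```

The `ready` helper returns the instructions that a scheduler may pick next, which is the role played by the executable-instruction list in the algorithms of Section V.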
Formally, the problem addressed by this paper is defined as follows.
Definition 2: Given an AIG G and a time constraint T, find the instruction schedule, the data placement, and the configuration of micro-cell and macro-cell DWM-based on-chip SPM that achieve the minimum access cost (energy) with minimal area overhead while satisfying the time constraint T.
IV. MOTIVATIONAL EXAMPLE
In this section, we use a motivational example to show the main ideas of the proposed techniques. Fig. 3(a) shows the example. There are two basic blocks, which are profiled from two loops of a real application program. The two basic blocks need to access seven data. We transform the basic blocks into assembly-level code and profile the memory access instructions. Fig. 3(b) shows the transformation procedure [22]. According to the dependencies of the instructions, an AIG is constructed as shown in Fig. 3(c). The dark circles represent loads, and the white circles represent stores. For simplicity, we assign each circle a number. The time constraint for memory access is 180 ns.
A. SPM (Macro-Cell DWM)
To achieve a high-density SPM, only macro-cells are adopted initially. We assume that the capacity of each macro-cell is 4 bits and the size of each data is 1 bit. The R/W ports are initially located at the leftmost bit of the tape. The parameters of DWM are shown in Table I [23], [24]. Fig. 4 shows the schedule and data placement generated by three techniques: 1) the original method (which includes list scheduling and a simple data placement algorithm); 2) IGS; and 3) INLP.
The schedule and data placement generated by the original method are shown in Fig. 4(a). The movement of the R/W port reflects the shift operations intuitively. According to the schedule and data placement, the tracks of the R/W ports of tape 0 and tape 1 are ABCGBCBCAG and DFDFEF, respectively. The total access latency is T_o = 250.56 ns, and the total access energy consumption is E_o = 14.31 nJ. The number of shifts is 18.
Fig. 4(b) shows the schedule and data placement generated by IGS. The number of shifts is 10. The tracks of the R/W ports of tape 0 and tape 1 are ACBDA and GEFG, respectively. The total access latency is T_IGS = 203.6 ns, and the total access energy consumption is E_IGS = 10.71 nJ. Compared to the original method, the total access latency and energy consumption are reduced by 18.74% and 25.16%, respectively. However, the number of shifts is not yet minimal.
Fig. 4(c) shows the schedule and data placement generated by INLP. The number of shifts is 9, which is the minimum. The tracks of the R/W ports of tape 0 and tape 1 are ABCDA and EGF, respectively. The total access latency is T_INLP = 197.73 ns, and the total access energy consumption is E_INLP = 10.26 nJ. Compared to the original method, the total access latency and energy consumption are reduced by 21.08% and 28.3%, respectively. However, even the minimum result T_INLP is larger than the time constraint T = 180 ns. Therefore, using macro-cell DWM alone to build the SPM cannot satisfy the time constraint.
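The reported totals are consistent with a linear cost model in which every shift adds a fixed latency and energy on top of a constant read/write part. The Python sketch below checks this arithmetic. The constants 144.90 ns, 6.21 nJ, 5.87 ns, and 0.45 nJ are not quoted from Table I; they are inferred from the totals reported in this example, and the function names are illustrative.

```python
BASE_NS, BASE_NJ = 144.90, 6.21        # read/write part (inferred, see above)
T_SHIFT_NS, E_SHIFT_NJ = 5.87, 0.45    # cost of one shift (inferred, see above)
TIME_CONSTRAINT_NS = 180.0


def totals(num_shifts):
    """Total access latency (ns) and energy (nJ) for a given shift count."""
    latency = BASE_NS + T_SHIFT_NS * num_shifts
    energy = BASE_NJ + E_SHIFT_NJ * num_shifts
    return round(latency, 2), round(energy, 2)


def reduction_percent(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return round((baseline - value) / baseline * 100, 2)


original, igs, inlp = totals(18), totals(10), totals(9)
igs_gain = (reduction_percent(original[0], igs[0]),
            reduction_percent(original[1], igs[1]))
inlp_gain = (reduction_percent(original[0], inlp[0]),
             reduction_percent(original[1], inlp[1]))
```

With these inferred constants the model reproduces the three reported latency and energy pairs and the reported percentage reductions, and all three latencies remain above the 180 ns constraint.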
B. SPM (Macro-Cell and Micro-Cell DWMs)
To further improve the performance of the SPM, we adopt both micro-cells and macro-cells. We assume that the access latency and energy of a micro-cell are the same as those of a macro-cell. The area of a micro-cell is 40F^2 [24]. We assume that the area of a macro-cell (which can store 4 bits in the tape) is 55F^2 [24].
Three SPM configurations and the corresponding data placements in the macro-cell part of the SPM are shown in Fig. 5. The data placement and configuration generated by the baseline method and LMR, which consists of four micro-cells and one macro-cell, are shown in Fig. 5(a). The total area is 215F^2. According to the schedule of Fig. 4(a) and the data placement, the number of shifts is 5. The total access latency T_{o+LMR} and energy E_{o+LMR} are 174.25 ns and 8.46 nJ, respectively. Although T_{o+LMR} satisfies the time constraint, the density is low. There are better configurations with higher density whose total access latency still satisfies the time constraint. Fig. 5(b) shows a better data placement and configuration, which includes one micro-cell and two macro-cells. The total area is 150F^2, which is 30.23% smaller than that of the first configuration. According to the schedule of Fig. 4(b) and the data placement, the number of shifts is also 5. The total access latency T_{IGS+LMR} and energy E_{IGS+LMR} are 174.25 ns and 8.46 nJ, respectively. T_{IGS+LMR} also satisfies the time constraint. Fig. 5(c) shows the optimal data placement under the configuration that includes one micro-cell and two macro-cells. The minimum number of shifts is 4, which is obtained according to the optimal schedule of Fig. 4(c) and the data placement. The minimum total access latency T_{INLP+LMR} and energy E_{INLP+LMR} are 168.38 ns and 8.01 nJ, respectively.
The motivational example shows that a time-aware, high-density, and energy-efficient SPM can be constructed through software (data placement and schedule) and hardware (micro-cell and macro-cell DWM) co-design.
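The area figures of the two configurations follow directly from the per-cell areas assumed in this example. A short Python check is given below; the helper name `spm_area` is illustrative.

```python
MICRO_AREA, MACRO_AREA = 40, 55   # in F^2, as assumed in this example


def spm_area(n_micro, n_macro):
    """Total SPM area in F^2 for a given number of micro- and macro-cells."""
    return n_micro * MICRO_AREA + n_macro * MACRO_AREA


first = spm_area(4, 1)     # four micro-cells and one macro-cell
second = spm_area(1, 2)    # one micro-cell and two macro-cells
saving = round((first - second) / first * 100, 2)   # area reduction in percent
```

Both configurations hold the same seven data, so the saving comes purely from moving three data from micro-cells into a second macro-cell.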
V. SOFTWARE AND HARDWARE CO-DESIGN APPROACH
In this section, we present the proposed software techniques and hardware design technique. We first define the shift-aware scheduling and data placement problem (SSDP) and show that it is NP-complete. Then, we present the INLP formulation. The IGS algorithm is then presented to solve SSDP in polynomial time. Finally, the LMR algorithm is proposed to find an SPM configuration.
A. NP-Completeness
To show the NP-completeness of SSDP, we first define this problem.
Definition 3: The SSDP is the optimization problem of finding a memory access instruction schedule and data placement with the minimum number of shifts in macro-cell DWM.
Theorem 1: The SSDP is NP-complete.
Proof: Consider a special case of SSDP named SSDP-S. In SSDP-S, the memory access instruction schedule is fixed. The SSDP-S problem is defined as follows.
Definition 4: Given an access sequence Sch for a set of data D and configuration of target DWM (K, M), find a data placement for data in D such that the number of shift operations following access sequence Sch is minimized.
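To make Definition 4 concrete, the sketch below solves tiny SSDP-S instances by exhaustive search in Python. It assumes that (K, M) denotes K tapes of M cells each, that every tape has one R/W port starting at the leftmost cell, and that one shift moves the port by one cell; the function names are illustrative. The search is factorial in the number of cells, which is in line with the hardness result discussed in this section.

```python
from itertools import permutations


def shifts_for(placement, sequence, num_tapes):
    """Shifts needed for `sequence`; placement maps datum -> (tape, cell)."""
    ports = [0] * num_tapes             # every port starts at the leftmost cell
    total = 0
    for datum in sequence:
        tape, cell = placement[datum]
        total += abs(ports[tape] - cell)
        ports[tape] = cell
    return total


def best_placement(data, sequence, num_tapes, tape_len):
    """Try every injective placement of `data` into the K x M cells."""
    cells = [(k, j) for k in range(num_tapes) for j in range(tape_len)]
    best = None
    for chosen in permutations(cells, len(data)):
        placement = dict(zip(data, chosen))
        cost = shifts_for(placement, sequence, num_tapes)
        if best is None or cost < best[0]:
            best = (cost, placement)
    return best


one_tape = best_placement("ABC", "ABAB", num_tapes=1, tape_len=3)
two_tapes = best_placement("AB", "ABAB", num_tapes=2, tape_len=2)
```

On a single tape the alternating sequence cannot be served with fewer than 3 shifts, whereas with two tapes each datum can sit under its own port and no shift is needed.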
For a given access sequence Sch, we can construct the ASG, from which a weight function w : D × D → R for the set of data items D can be obtained. Assuming P is a set of locations (cells of DWM), there is a distance function dis : P × P → R. The SSDP-S problem can then be defined as follows.
Definition 5: Given two sets, D (data) and P (cells), of equal size, together with a weight function w : D × D → R and a distance function dis : P × P → R, find the bijection p : D → P such that the cost function
$$\sum_{a,b \in D} w(a,b) \cdot dis(p(a), p(b))$$
is minimized.
The SSDP-S problem is essentially the quadratic assignment problem [25], which is a known NP-complete problem. Therefore, the SSDP-S problem is NP-complete. Since the SSDP-S problem is a special case of the SSDP problem, the SSDP problem is NP-complete.
B. INLP Formulation
1) Instruction Constraints: There are two types of instruction constraints: 1) instruction constraints for the local schedule Sch_{u,k,ls} and 2) instruction constraints for the real schedule SCH_{u,s}. For the local schedule Sch_{u,k,ls}, there are three instruction constraints: 1) each access instruction must be executed exactly once; 2) there is at most one instruction executed at each local step on each tape; and 3) if an instruction is executed at local step ls (ls > 1) on tape k, then there must be an instruction which is executed at local step ls − 1 on the same tape. For the schedule SCH_{u,s}, there are two instruction constraints: 1) each access instruction must be executed exactly once and 2) there is at most one instruction executed at each step.
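The three constraints on the local schedule can be checked mechanically on a candidate assignment. The Python sketch below is only such a checker and not the INLP encoding itself; it assumes that the binary variable for instruction u, tape k, and local step ls is given as a set of triples (u, k, ls) whose value is 1, with local steps numbered from 1. All names are illustrative.

```python
from collections import Counter


def local_schedule_ok(instructions, ones):
    """Check the three local-schedule constraints.

    instructions: iterable of instruction ids.
    ones:         set of triples (u, k, ls) whose binary variable equals 1.
    """
    per_instruction = Counter(u for (u, _, _) in ones)
    per_slot = Counter((k, ls) for (_, k, ls) in ones)
    # 1) every instruction is executed exactly once
    once = (all(per_instruction[u] == 1 for u in instructions)
            and set(per_instruction) <= set(instructions))
    # 2) at most one instruction per local step on each tape
    exclusive = all(count <= 1 for count in per_slot.values())
    # 3) a step ls > 1 on tape k requires an instruction at step ls - 1
    gap_free = all(ls == 1 or (k, ls - 1) in per_slot for (k, ls) in per_slot)
    return once and exclusive and gap_free


valid = local_schedule_ok([1, 2, 3], {(1, 0, 1), (2, 0, 2), (3, 1, 1)})
with_gap = local_schedule_ok([1, 2], {(1, 0, 1), (2, 0, 3)})
```

The two constraints on the real schedule have the same shape with the tape index dropped, so the same pattern applies to them.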
2) Dependency Constraints: In this model, there are two kinds of dependencies: 1) dependencies between instructions that access the same tape and 2) dependencies between instructions that access different tapes. For the local schedule, we only care about the dependencies between instructions that access the same tape. For the schedule SCH_{u,s}, both kinds of dependencies should be considered.
For the local schedule, the dependency constraint is that "an instruction cannot be executed until all of its predecessors that access the same tape are finished." Let Ta(u) be the tape that instruction u accesses, which can be represented by Sch_{u,k,ls}. For two instructions u_1 and u_2, S_ta(u_1, u_2) = 0 if and only if Ta(u_1) = Ta(u_2). Let St(u) be the local step of instruction u, which can be represented by Sch_{u,k,ls}. The constraint can be formulated as follows:
For the schedule SCH_{u,s}, there are two dependency constraints: 1) an instruction cannot be executed until all its predecessors in graph G are finished and 2) if two instructions access the same tape k, they must follow the execution order in the local schedule Sch_{u,k,ls}. Let Gst(u) be the step of instruction u in the schedule, which can be represented by SCH_{u,s}. The first constraint can be represented by the following formula:
The second constraint is that if S_ta(u_1, u_2) = 0 and St(u_1) < St(u_2), then Gst(u_1) < Gst(u_2). Let h_{u_1,u_2} = 0 if and only if St(u_1) − St(u_2) < 0. The constraint can be formulated as follows:
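Independently of the exact INLP encoding, the dependency rules described in this subsection can be verified on a candidate schedule. The Python sketch below does this with plain dictionaries for Ta(u), St(u), and Gst(u); it is a checker written for illustration under the wording of the rules above, and the function and parameter names are assumptions.

```python
def dependencies_ok(edges, tape, local_step, global_step):
    """Check the dependency rules on a candidate schedule.

    edges:       iterable of (u1, u2), meaning u2 depends on u1 in G.
    tape:        dict u -> Ta(u), the tape accessed by u.
    local_step:  dict u -> St(u), the step of u on its own tape.
    global_step: dict u -> Gst(u), the step of u in the real schedule.
    """
    for u1, u2 in edges:
        if global_step[u1] >= global_step[u2]:
            return False              # predecessors in G must finish first
        if tape[u1] == tape[u2] and local_step[u1] >= local_step[u2]:
            return False              # same rule within a single tape
    for u1 in tape:
        for u2 in tape:
            same_tape = u1 != u2 and tape[u1] == tape[u2]
            if (same_tape and local_step[u1] < local_step[u2]
                    and global_step[u1] >= global_step[u2]):
                return False          # real order must follow the local order
    return True


ok = dependencies_ok(edges={(1, 2)},
                     tape={1: 0, 2: 0, 3: 1},
                     local_step={1: 1, 2: 2, 3: 1},
                     global_step={1: 1, 2: 3, 3: 2})
```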
3) Data Allocation Constraints: There are three constraints for data allocation: 1) each data must be placed into a cell of a tape, that is, there is exactly one p_{d,k,j} equal to 1 for a data d; 2) there is at most one data placed in a cell of each tape, which means that, for a cell j of tape k, there is at most one p_{d,k,j} equal to 1;
Equation (4) 
5) Objective Function: The objective of the INLP formulation is to minimize the total number of shifts.
Algorithm 1 IGS
1: Initialize p_k = 0 for every tape, N_s = 0, and s_d = 0 for every data d; store the executable instructions of G in list L;
2: while L is not empty do
3:   Find an instruction l in L where l accesses the data that is aligned to a R/W port p_k;
4:   if l is nonexistent then
5:     Find an instruction l where data d accessed by l has been placed in tape k;
6:     if l is nonexistent then
7:       Select an instruction l_r in L randomly and place the data d accessed by l_r in an available tape k;
8:     end if
9-10: ...
11:     Update the position of p_k;
12:   end if
13:   Schedule the selected instruction and update L;
14: end while
15: Compute T_t and E_t;
16: return Sch, Alloc, N_s, S, T_t, and E_t
C. Instructions Group Schedule Algorithm
The IGS algorithm is shown in Algorithm 1. The main idea of IGS is that instructions which access the same data are scheduled adjacently. In this way, the number of shifts can be reduced.
Initially, all R/W ports are aligned to the first cell of their tapes, which means p_k = 0 (p_k ∈ P). N_s is set to 0, and s_d = 0 (s_d ∈ S). The algorithm first finds the executable instructions according to G and stores them in list L.
In lines 2-14, the algorithm generates the schedule Sch and data placement Alloc. It first looks for an instruction l in L that accesses the data currently aligned to a R/W port p_k, as shown in line 3. If there are multiple such instructions, it selects one randomly; for such instructions, the execution order has no effect on the number of shifts. If no such l exists, in line 5, it looks for an instruction l in L whose data has already been placed in tape k. If there are multiple such instructions, it selects one randomly. If no such l exists either, it selects an instruction l_r in L randomly and places the data d accessed by l_r in the next available cell of tape k, as shown in line 7. Here, tape k is the tape that is currently selected by the word line.
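The selection rules described above can be sketched in Python as follows. This is a simplified illustration and not the authors' implementation: ties are broken by the smallest instruction id instead of randomly so that the result is reproducible, a full tape simply hands over to the next tape, and the latency and energy computation is omitted. All function and variable names are assumptions.

```python
def igs(data_of, edges, num_tapes, tape_len):
    """Greedy sketch of IGS; returns (Sch, Alloc, N_s, S)."""
    if len(set(data_of.values())) > num_tapes * tape_len:
        raise ValueError("not enough cells for all data")
    ports = [0] * num_tapes            # p_k: cell under each R/W port
    used = [0] * num_tapes             # next free cell of each tape
    alloc, shifts_of = {}, {}          # Alloc and S
    schedule, total_shifts = [], 0     # Sch and N_s
    current = 0                        # tape currently selected
    done = set()
    while len(done) < len(data_of):
        blocked = {v for (u, v) in edges if u not in done}
        ready = sorted(set(data_of) - done - blocked)       # list L
        if not ready:
            raise ValueError("dependency cycle")
        placed = [l for l in ready if data_of[l] in alloc]
        aligned = [l for l in placed
                   if ports[alloc[data_of[l]][0]] == alloc[data_of[l]][1]]
        if aligned:                    # datum already under a port
            pick = aligned[0]
        elif placed:                   # datum already on some tape
            pick = placed[0]
        else:                          # place a new datum
            pick = ready[0]
            while used[current] == tape_len:                 # tape is full
                current = (current + 1) % num_tapes
            alloc[data_of[pick]] = (current, used[current])
            used[current] += 1
        datum = data_of[pick]
        tape, cell = alloc[datum]
        move = abs(ports[tape] - cell)
        total_shifts += move
        shifts_of[datum] = shifts_of.get(datum, 0) + move
        ports[tape] = cell
        current = tape
        schedule.append(pick)
        done.add(pick)
    return schedule, alloc, total_shifts, shifts_of


sch, alloc, n_s, s = igs({1: "A", 2: "B", 3: "A", 4: "B", 5: "C"},
                         {(1, 3), (2, 4)}, num_tapes=1, tape_len=4)
```

In this small instance the two accesses to A and the two accesses to B are grouped, so the sketch needs 2 shifts, whereas executing the instructions in id order with the same placement would need 4.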
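Finally, the schedule Sch, data placement Alloc, number of shifts N_s, array S, total latency T_t, and total energy consumption E_t are returned.
Algorithm 2 LMR
1: Initialize T_t, N_mi = 0, N_ma = ⌈|D|/M⌉, Sch_ma = Sch, and Dp = Alloc;
2: while T_t > T do
3:   if D is not empty then
4:     Find the data max_d where s_max_d is the maximal value in S;
5:     Remove max_d from D;
6:     if D is not empty then
7:       N_ma = ⌈|D|/M⌉; count_s = 0;
8:       Initialize p_k and update Sch_ma, Dp;
9:       for l in Sch_ma do
10:         d ← Find the data accessed by l;
11:         k ← Find the tape in which data d is placed;
12:         count_s = count_s + |p_k − pos(d)|;
13:       end for
14:     else
15:       count_s = 0
16:     end if
17:     Compute T_t and update N_mi++ and s_max_d = 0;
18:   else
19:     return None;
20:   end if
21: end while
22: Compute E_t and area;
23: return N_mi, N_ma, Dp, T_t, E_t, and area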
The complexity of the IGS algorithm is O(|V|^2 × |E| × K × M), where |V| is the number of instructions, |E| is the number of edges, K is the number of tapes, and M is the capacity of a macro-cell DWM.
D. Longest Move Reduce Algorithm
In order to obtain a high-density and high-performance SPM, we propose the LMR algorithm, as shown in Algorithm 2, to generate the configuration of the SPM. In LMR, the data which leads to the largest number of shifts (the longest move of the tapes) is placed in a micro-cell, such that the largest number of shifts is removed with the least increase in area.
Initially, LMR assumes that all data are placed in macro-cells. In line 1, it initializes the total access latency T_t according to the equation T_t = t_r × n_r + t_w × n_w + t_s × N_s. The number of micro-cells N_mi and the number of macro-cells N_ma are initialized to 0 and ⌈|D|/M⌉, respectively. The schedule Sch_ma is initialized to Sch, and the data placement of macro-cells Dp is initialized to Alloc.
In lines 2-21, LMR moves data from macro-cells to micro-cells until the time constraint is satisfied. If D is not empty, it selects a data from D and places this data in a micro-cell in lines 4-17. Otherwise, it returns "None," which means that no such DWM-based SPM can satisfy the time constraint. In line 4, it finds the data max_d where s_max_d is the maximal value in S. In line 5, it removes max_d from D, which means that max_d is moved from a macro-cell to a micro-cell. If D is still not empty, it generates the data placement of the macro-cells and computes the number of shifts in lines 7-13. Otherwise, the number of shifts count_s is set to 0, as shown in line 15.
In line 7, it computes the number of macro-cells N_ma and initializes the number of shifts count_s. In line 8, an array P = [p_1, ..., p_{N_ma}] is initialized to record the positions of the R/W ports of the macro-cells. It updates the schedule Sch_ma by removing the instructions which access data max_d, and it updates the data placement of macro-cells Dp. Since data max_d has been moved out of its macro-cell, the position Dp(max_d) where max_d was placed is now available. Data that are placed in the cells behind Dp(max_d) are moved one cell ahead. If the last cell of a macro-cell becomes available, the first data of the next macro-cell is moved to this cell. After Dp is updated, only the last macro-cell has available cells.
In line 9, it selects the next instruction l according to the schedule Sch_ma. In lines 10 and 11, it finds the data d accessed by instruction l and the tape k where data d is placed. In line 12, the number of shifts count_s is updated to count_s + |p_k − pos(d)|. In line 17, it computes the total access latency T_t according to the equation T_t = t_r × n_r + t_w × n_w + t_s × count_s + t'_r × n'_r + t'_w × n'_w, where the primed terms denote the reads and writes served by micro-cells.
In line 22, it computes the energy consumption E_t according to the equation E_t = e_r × n_r + e_w × n_w + e_s × count_s + e'_r × n'_r + e'_w × n'_w and the area of the SPM according to the equation area = area_ma × N_ma + area_mi × N_mi. Finally, it returns the number of macro-cells N_ma, the number of micro-cells N_mi, the data placement of macro-cells Dp, the total access latency T_t, the total energy consumption E_t, and the area of the SPM.
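The loop of LMR can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not part of the original description: read and write latencies are the same for both cell types and are folded into one constant `base_ns`, the macro-cell placement is kept as a compact list so that removing a data automatically moves later data one cell ahead, the port position is updated after every access, and energy and area are left out. All names are illustrative.

```python
import math


def lmr(schedule, data_of, order, shifts_of, tape_len, limit_ns,
        base_ns, shift_ns):
    """Sketch of LMR; `order` lists the data in macro-cell placement order.

    Returns (N_mi, N_ma, Dp, T_t), or None if the limit cannot be met.
    """
    macro = list(order)                 # data still stored in macro-cells (D)
    s = dict(shifts_of)                 # array S produced by IGS
    n_micro = 0
    total = base_ns + shift_ns * sum(s.values())     # initial T_t
    while total > limit_ns:
        if not macro:
            return None                 # no configuration satisfies the limit
        worst = max(macro, key=lambda d: s[d])       # max_d
        macro.remove(worst)             # later data slide one cell ahead
        n_micro += 1
        s[worst] = 0
        count = 0
        ports = [0] * math.ceil(len(macro) / tape_len)
        for l in schedule:
            d = data_of[l]
            if d not in macro:
                continue                # served by a micro-cell: no shift
            tape, cell = divmod(macro.index(d), tape_len)
            count += abs(ports[tape] - cell)
            ports[tape] = cell
        total = base_ns + shift_ns * count
    return n_micro, math.ceil(len(macro) / tape_len), macro, total


data_of = {1: "A", 2: "B", 3: "A", 4: "B", 5: "C"}
result = lmr([1, 3, 2, 4, 5], data_of, ["A", "B", "C"],
             {"A": 0, "B": 1, "C": 1}, tape_len=4,
             limit_ns=115, base_ns=100, shift_ns=10)
```

In this toy instance one data is moved to a micro-cell, after which a single shift remains and the latency of 110 ns meets the 115 ns limit. With a limit below the constant read/write part the sketch returns None, which mirrors line 19 of Algorithm 2.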
The complexity of LMR algorithm is O(|D| × |V|) where |D| is the number of data and |V| is the number of instructions.
VI. EXPERIMENTS
In this section, we conduct experiments to demonstrate the efficiency of proposed DWM-based SPM and the proposed techniques.
A. Experimental Setup
In our experiments, we use the area estimation, latency, and energy parameters of DWM-based on-chip SPM from [24]. To evaluate the proposed SPM and techniques, we developed a simulator and integrated the DWM-based on-chip SPM model. The components of this simulator include a memory trace processing unit, the DWM-based on-chip SPM, and a DDR3-like simple DRAM main memory. The target system specification is shown in Table III. Based on the criteria described in [26], benchmarks from MediaBench [27] are divided into regions at the two types of program points described in [28]. We ran the benchmarks in SimpleScalar [29] and profiled the memory access trace for each region. Then, we fed the memory access trace of each region into our simulator.
B. Experimental Results and Analysis
In this section, we evaluate our proposed techniques. Table IV shows the running overhead (in seconds) of ASAP, APCO, and the proposed algorithms. From the table, we can see that ASAP and proposed IGS can finish in less than 4 min for all the benchmarks. However, INLP can only generate the optimal results for "epic," "pegwit," and "rasta" in 10 h. The APCO algorithm [30] can almost finish in 1 s for all the benchmarks. The proposed LMR algorithm can almost finish in a half second for all the benchmarks.
1) Macro-Cell DWM-Based SPM: Fig. 6 shows the comparison of the number of shifts, latency, and energy on the target architecture equipped with macro-cell-based SPM. The results generated by ASAP are used as the baseline. All other results are normalized to ASAP. The results show that the schedule and data placement have a significant impact on the performance of macro-cell DWM-based SPM. The proposed techniques, IGS and INLP, can optimize the performance and energy consumption of macro-cell DWM-based SPM.
2) Micro-Cell and Macro-Cell DWM-Based SPM Under Time Constraint: When the time constraint is tight, the deadline cannot be met even with optimization for the macro-cell DWM. In such cases, micro-cells, which have higher performance than macro-cells, need to be adopted. Fig. 7 shows the comparison of latency, energy, and area of the SPMs generated by different techniques under the time constraint.
Fig. 7(a) shows the comparison of latency, normalized to the time constraint. All benchmarks satisfy the time constraint under the schedule and data placement generated by ASAP, IGS, and INLP in SPMs whose configurations are generated by LMR. The results generated by the three techniques are close, because LMR generates the configuration of the SPM according to the time constraint and finishes as soon as the time constraint is satisfied. However, not all benchmarks satisfy the time constraint when the configurations of the SPMs are generated by APCO. Since APCO is designed for soft real-time applications, time constraint violations can be tolerated. Compared with APCO, the latency using LMR is 3.47% lower on average when the same schedule and data placement technique is adopted.
Fig. 7(b) shows the comparison of energy, normalized to the results of ASAP+LMR. ASAP+APCO consumes more energy than ASAP+LMR. The same holds when IGS and INLP are employed. Compared with APCO, the energy consumption using LMR is 3.89% lower on average. The results generated by ASAP+LMR, IGS+LMR, and INLP+LMR are close. For some benchmarks, such as "ghostscript," "mesa," and pegwit, the energy consumption of IGS+LMR is greater than that of ASAP+LMR. Moreover, for mesa, INLP+LMR consumes more energy than ASAP+LMR. Placing data in micro-cells reduces the number of shifts and therefore improves the performance of the SPM. However, LMR stops placing data in micro-cells once the time constraint is satisfied. For ghostscript, mesa, and pegwit, the latency of IGS+LMR is less optimized than that of ASAP+LMR even though both satisfy the time constraint. Therefore, the energy consumption of IGS+LMR is greater than that of ASAP+LMR. Similarly, for mesa, the energy consumption of INLP+LMR is greater than that of ASAP+LMR.
Fig. 7(c) shows the comparison of SPM area, normalized to the result of ASAP+LMR. Compared with APCO, the area using LMR is 3.58% smaller on average. However, most results generated by ASAP+APCO are better than those of ASAP+LMR. Since APCO allows time constraint violations, the configuration of the SPM generated by APCO may employ fewer micro-cells than that generated by LMR. Therefore, the area of the SPM generated by APCO may be smaller.
Although the latency and energy consumption of ASAP+LMR, IGS+LMR, and INLP+LMR are close, the areas of the resulting SPMs differ. Compared with ASAP+LMR, IGS+LMR and INLP+LMR reduce the area of the SPM by 54.15% and 61.48% on average, respectively. In order to satisfy the time constraint, ASAP+LMR places more data in micro-cells, which have lower density than macro-cells. Therefore, the area of the SPM generated by ASAP+LMR is larger than those of IGS+LMR and INLP+LMR.
3) Configuration of DWM-Based SPM: Fig. 8 shows the comparison of latency and energy among macro-cell DWM-based SPM, micro-cell DWM-based SPM, and micro-cell and macro-cell DWM-based SPM whose configuration is generated by LMR. In these experiments, the macro-cell DWM-based SPM and the micro-cell and macro-cell DWM-based SPM have the same area. The area of the micro-cell DWM-based SPM is 4.5× that of the other two SPMs. For the macro-cell DWM-based SPM, IGS and INLP are adopted. Here, macro-cell (IGS) means that IGS is used, and macro-cell (INLP) means that INLP is used. For the micro-cell and macro-cell DWM-based SPM whose configuration is generated by LMR, IGS and INLP are also adopted. Here, micro-cell and macro-cell (IGS) means that IGS is used, and micro-cell and macro-cell (INLP) means that INLP is used. None of the proposed techniques is adopted for the micro-cell DWM-based SPM.
Fig. 8(a) shows the comparison of latency, normalized to the time constraint. In the macro-cell DWM-based SPM, the latency generated by IGS cannot satisfy the time constraint. In most cases, the latency generated by INLP cannot satisfy the time constraint either. However, in the micro-cell and macro-cell DWM-based SPM whose configuration is generated by LMR, the time constraint can always be satisfied using IGS or INLP. Compared with macro-cell (IGS), micro-cell and macro-cell (IGS) reduces the latency by 13.73% on average. Compared with macro-cell (INLP), micro-cell and macro-cell (INLP) reduces the latency by 21.21% on average. Among the different SPM configurations, the micro-cell DWM-based SPM is the most time efficient. Compared with micro-cell and macro-cell (INLP), the micro-cell DWM-based SPM reduces the latency by 55.06% on average.
Fig. 8(b) shows the comparison of energy, normalized to the results generated by macro-cell (IGS). Compared with macro-cell (IGS), micro-cell and macro-cell (IGS) reduces the energy by 15.85% on average. Compared with macro-cell (INLP), micro-cell and macro-cell (INLP) reduces the energy by 25.14% on average. The energy consumption of the micro-cell DWM-based SPM is much lower than that of the other SPMs. However, the area of the micro-cell DWM-based SPM is much larger than that of the others.
We also compare the area of the micro-cell DWM-based SPM with that of the micro-cell and macro-cell DWM-based SPM. In these experiments, both types of SPMs satisfy the time constraint. Since there is no shift in micro-cells, neither IGS nor INLP is applicable to the micro-cell DWM-based SPM. For the micro-cell and macro-cell DWM-based SPM whose configuration is generated by LMR, IGS and INLP are adopted. Compared with the micro-cell DWM-based SPM, the area achieved by micro-cell and macro-cell (IGS) is 5.5× lower on average. The area achieved by micro-cell and macro-cell (INLP) is 6.5× lower on average.
VII. CONCLUSION
In this paper, we propose a time-aware and energy-efficient DWM-based SPM as on-chip memory of embedded systems.
To improve the performance of macro-cell DWM-based SPM, we propose an INLP formulation to achieve the minimum number of shifts through memory access instruction scheduling and data placement. Since INLP takes exponential time to finish, we propose a polynomial-time solution, the IGS algorithm, to minimize the number of shifts. However, the time constraint of applications that run in a system equipped with macro-cell DWM-based SPM may not be satisfied even though the number of shift operations is minimal. In such cases, it is necessary to adopt an SPM which consists of both micro-cells and macro-cells. The LMR algorithm is then proposed to generate a configuration of SPM that can satisfy the time constraint. Experimental results confirm the effectiveness of the proposed DWM-based SPM and optimization techniques.
