Abstract-A method to both reduce energy and improve performance in a processor-based embedded system is described in this paper. Comprising a scratchpad memory instead of an instruction cache, the target system dynamically (at runtime) copies into the scratchpad code segments that are determined to be beneficial (in terms of energy efficiency and/or speed) to execute from the scratchpad. We develop a heuristic algorithm to select such code segments based on a metric, called concomitance. Concomitance is derived from the temporal relationships of instructions. A hardware controller is designed and implemented for managing the scratchpad memory. Strategically placed custom instructions in the program inform the hardware controller when to copy instructions from the main memory to the scratchpad. A novel heuristic algorithm is implemented for determining locations within the program where to insert these custom instructions. For a set of realistic benchmarks, experimental results indicate the method uses 41.9% lower energy (on average) and improves performance by 40.0% (on average) when compared to a traditional cache system which is identical in size.
Exploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System
Andhi Janapsatya, Member, IEEE, Aleksandar Ignjatovic, and Sri Parameswaran, Member, IEEE
Index Terms-Embedded system, scratchpad memory.
I. INTRODUCTION

PROCESSORS are increasingly replacing gates as the basic design block in digital circuits. This rise is neither extraordinary nor rapid. Just as transistors slowly gave way to gates and gates to more complex circuits, processors are progressively becoming the predominant component in embedded systems. As microprocessors become cheap and the time to market critical, it is a natural progression for designers to abandon gates in favor of processors as the main building components. The utilization of processors in embedded systems gives rise to a plethora of opportunities to optimize designs, which are neither needed nor available to the designer of an application-specific integrated circuit (ASIC).
One critical area of optimization is the reduction of power in systems while increasing or at least maintaining the performance. The criticality stems from usage of embedded systems in battery-powered devices as well as the reduced reliability of systems which operate while emanating excessive amounts of heat. Examples of existing techniques for achieving reduced energy consumption in embedded systems are: shutting down parts of the processor [1], voltage scaling [2], addition of application-specific instructions [3], [4], feature size reduction [5], and additional cache levels [6]. Cache memory is one of the highest energy consuming components of a modern processor. The instruction cache memory alone is reported to consume up to 27% of the processor energy [7]. We present a method for reducing instruction memory energy consumed in an embedded processor by replacing the existing instruction cache [see Fig. 1(a)] with instruction scratchpad memory (SPM); see Fig. 1(b). SPM consumes less energy per access compared to cache memory because SPM dispenses with the tag-checking that is necessary in the cache architecture. In embedded systems design, the processor architecture and the application that is to be executed are both known a priori. It is therefore possible to extensively profile the application and find code segments which are executed most frequently. This information can be analyzed and decisions made about which code segments should be executed from the SPM.
Various schemes for managing SPM have been introduced in the literature and can be broadly divided into two types: static and dynamic. Both of these schemes are usually applied to the data section and/or the code section of the program. Static management refers to the partitioning of data or code prior to execution and storing the appropriate partitions in the SPM with no transfers occurring between these memories during the execution phase (occasionally, the SPM is filled from the main memory at start up). Memory words are fetched to the processor directly from either of the memory partitions. Dynamic management, on the other hand, moves highly utilized memory words from the main memory to the SPM, before transferring to the processor, thereby allowing the code or data executed from the SPM to have a larger total memory footprint than the SPM.
In this paper, we present a novel architecture, containing a special hardware unit to manage dynamic copying of code segments from the main memory to the SPM. We describe the use of a specially created instruction which triggers copying from the main memory to the SPM. We further set forth heuristic algorithms which rapidly search for the sets of code segments to be copied into the SPM and where to place the specially created instructions to maximize benefit. The whole system was evaluated using benchmarks from Mediabench [37] and UTDSP [38] .
The remainder of this paper is organized as follows. Section II provides the motivation and the assumptions made for this study. Section III describes previous works on SPM and presents the contributions of our work. Section IV introduces our hardware strategy for using the SPM. Section V defines the concomitance information, and Section VI formally describes the problem. Section VII presents the heuristic algorithm for partitioning the application code. Section VIII provides the experimental setup and results, and, finally, Section IX concludes this paper.
II. MOTIVATION AND ASSUMPTIONS
A. Motivation
An SPM is a memory comprised of only the data array (SRAM cells) and the decoding logic. This is illustrated in Fig. 2(b) . A cache shown in Fig. 2(a) , in comparison, has data array (SRAM cells), decoding logic, tag array (SRAM cells), and comparator logic. Comparator logic checks if the existing content in cache matches the current CPU request and thus determines whether a cache hit or a cache miss has occurred. The access time and the access energy of individual components within the cache have been estimated using the tool CACTI [35] . CACTI results show that, for a cache system consisting of 8 KB with associativities ranging from 1 to 32, the access time of the cache tag-RAM array is 38.9% longer on average compared with the access time of the cache data-RAM array, and the access energy of the data-RAM is, on average, 72.5% of the total cache access energy.
Each piece of data is placed into the cache according to its main memory location, identified by the tag. Thus, cache memory is transparent to the CPU. The SPM is used as a replacement of (level 1) instruction cache. The absence of the tag array within the SPM means that the CPU has to know where a particular instruction resides within the SPM. To allow the CPU to locate data, SPM and the DRAM span a single memory address space partitioned between them.
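The single address space shared by the SPM and the DRAM can be pictured with a short Python sketch. It assumes, purely for illustration, that the SPM is mapped at the bottom of the address space and is 8 KB in size; `SPM_BASE`, `SPM_SIZE`, and `memory_for` are hypothetical names, not taken from the paper.

```python
SPM_BASE = 0x0000_0000   # assumed: SPM mapped at the bottom of the address space
SPM_SIZE = 8 * 1024      # assumed: 8 KB scratchpad (a power of two)


def memory_for(address: int) -> str:
    """Return which memory serves a fetch at `address`.

    The decision is a plain range check on the address, so no tag array
    or comparator is needed, unlike a cache lookup.
    """
    if address < 0:
        raise ValueError("address must be non-negative")
    if SPM_BASE <= address < SPM_BASE + SPM_SIZE:
        return "SPM"
    return "DRAM"


print(memory_for(0x0010))      # falls inside the assumed SPM range
print(memory_for(SPM_SIZE))    # first address past the assumed SPM range
```

Because the mapping is fixed, the program itself (through its branch targets) decides which memory an instruction is fetched from.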
Selecting the correct code segments for placement onto the SPM requires a careful analysis of the way such code segments are executed. All prior work in this area has focused upon loop analysis of the trace of a program as the method for finding the appropriate code segments to be placed in the SPM. Loop analysis has several drawbacks: 1) the structure and relationship of loops can be very complex and thus difficult to analyse; 2) this structure can significantly vary for different inputs; and 3) the precise structure of loops is irrelevant for the placement of instructions in the SPM because only relative (temporal) proximity of executions of blocks of code matters, rather than the precise order of these instructions as provided by the loop analysis; see [9] - [12] . Thus, in this paper, instead of looking at the loop structure, we analyze temporal correlation of executions of blocks of code using statistical methods with a signal processing flavor. The trace is seen as a "signal" on which we perform a statistical rather than a structural analysis.
We call the measure we introduce for temporal proximity of consecutive executions of the same block of the code the selfconcomitance of the block, and we use it to decide if the block should be executed from the SPM or from the main memory. Similarly, the measure for temporal proximity of interleaved executions of two different blocks of the code we call the concomitance, and we use it to decide whether two blocks should be placed in SPM simultaneously or if they can overlap in the SPM.
Such analysis of the trace has proven to yield an algorithm for SPM placement with performance results that are superior to the loop analysis method, and the algorithm is also much simpler, more efficient, and adaptive to different types of applications by changing only the distance function used to define the concomitance and self-concomitance. Recently, we found out that a related but cruder and less general idea has been used for cache management in [13] . It uses a simple counting function rather than a decreasing function of distance, has no notion corresponding to our self-concomitance concept, and operates on a coarser level of granularity (procedures rather than basic blocks).
B. Motivational Example for Concomitance
The benefits of using temporal proximity information are illustrated in the following example. Consider an "if-clause" shown in Fig. 3(a), with the control flow graph (CFG) shown in Fig. 3(b). For and , the profiling would determine that the first 50 executions of the "if-clause" will yield 50 executions of the block corresponding to , while the next 50 executions of the "if-clause" will yield 50 executions of the block for . Thus, executions of the block corresponding to will have a large temporal distance to the executions of the block for , and we will demonstrate that, in such a situation, the concomitance of these two blocks will be small. Thus, blocks for and need not be placed into the SPM simultaneously, but can be placed to overlap with one another in the SPM, yet without degrading the performance because their execution does not overlap in time. Note that, if only the frequency of execution were used, the CFG would only inform the algorithm that both blocks were executed 50 times, without providing crucial information that the pattern of execution is such that they can overlap on the scratchpad without any penalty. Note that, should the "if-clause" be of the form if , then executions of and would alternate. Under such circumstances, executions of the blocks corresponding to these two functions would be interleaved with short temporal distances. We will see that this makes their concomitance large, and the SPM placement algorithm would be informed that these two blocks must be placed into the SPM simultaneously and thus without overlap.
C. Assumptions
During the design and implementation of our methodology, the following assumptions are made.
• The size of the SPM or cache memory is only available in powers of two, and the size is passed to the algorithm. This is a reasonable assumption since the program is to be executed in an embedded system where the underlying hardware is known in advance. This assumption allows better optimization. If a number of processors with differing sizes of SPM have to be serviced, the SPM size can be made an adjustable parameter for the SPM management algorithm.
• The program size is larger than the SPM size. This is a valid assumption, because energy consumption and cost constraints dictate architectures with relatively small SPMs, compared with the size of a typical application.
• The size of the largest basic block is less than or equal to the SPM size. This assumption is quite valid in embedded systems where basic blocks are usually rather small. However, if the basic block is too large for the SPM, it can be split into a smaller part that fits into the SPM and the remainder that is executed from the main memory.
• Each instruction is exclusively executed from either the SPM or the main memory. This ensures that it is never necessary to have duplicates of parts of the code or to alter branch destinations during the execution of the program.
• Program traces used for profiling provide an accurate depiction of the program behavior during its execution. This assumption is reasonable when a sufficiently large input space has been applied. The amount of profiling needed to obtain a particular confidence interval is given in [32] .
• We do not consider higher level caches. In an embedded system, where frequently there is no cache at all, it is unlikely that more than a single level of cache is available. Moreover, having higher level caches would not reduce the effectiveness of the approach.
III. RELATED WORK
In the past, use of the SPM replacing cache memory has been shown to improve the performance and reduce energy consumption [8] . Existing works on cache optimization techniques rely on careful placement of instructions and/or data within the memory to ensure low cache miss rates. Cache optimization methods generally increase the program memory size [14] - [21] .
In 1997, Panda et al. [22] presented a scheme to statically manage an SPM for use as data memory. They describe a partitioning problem for deciding whether data variables should be located in the SPM or DRAM (accessed via the data cache). Their algorithm maximizes cache hit rates by allowing data variables to be executed from the SPM. Avissar et al. [23] presented a different static scheme for utilizing the SPM as data memory. They presented an algorithm that can partition the data variables as well as the data stack among different memory units, such as SPM, DRAM, and ROM, to maximize performance. Experimental results show that Avissar et al. were able to achieve 30% performance improvement over traditional direct-mapped data caches. By allowing part of the data stack to operate from the SPM, a performance improvement of 44.2% was obtained compared with a system without SPM.
In 2001, Kandemir et al. introduced dynamic management of the SPM for data memory [24]. Their memory architecture consisted of an SPM and a DRAM which was accessed through cache. To transfer data from the DRAM into the SPM, they created two new software data transfer calls; one to read instructions from DRAM and another to write instructions into the SPM. Multiple optimization strategies were presented to partition data variables between the SPM and DRAM. The best results were obtained from the hand-optimized version. Their results show a 29.4% improvement in performance compared with a traditional cache memory system. In addition, they also show that it is possible to obtain up to 10.2% performance improvement when using a dynamic management scheme for SPM compared with the static version. The work presented in [24] applied only to well-structured scientific and multimedia applications, with each loop nest needing independent analysis.
In 2003, Udayakumaran and Barua [25] improved upon [24] and presented a more general scheme for dynamic management of the SPM as data memory. The method from [25] analyzes the whole program at once and aims to exploit locality throughout the whole program, instead of individual loop nesting. The reported reduction in execution time is by an average of 31.2% over the optimal static allocation scheme.
In 2002, Steinke [26] presented a statically managed SPM for both instruction and data. Their results show, on average, a 23% reduction in energy consumption over a cache solution.
In 2003, Angiolini et al. presented another SPM static management scheme for data memory and instruction memory that employs a polynomial time algorithm for partitioning data and instructions into DRAM and SPM [27] . Their results show energy improvement ranging from 39% to 84% over an unpartitioned SPM. In 2004, Angiolini et al. presented a different approach to static usage scheme for the SPM [28] by mapping applications to existing hardware.
In [29], Steinke et al. presented a dynamic management scheme for an instruction SPM. For each instruction to be copied into the SPM, they insert a load and store instruction. The load instruction brings the instruction to be copied from the main memory to the processor, and a store instruction stores it back into the SPM. Their algorithm for determining locations in the program for inserting load and store instructions is based on a solution of an integer linear programming (ILP) problem that is obtained by using an ILP solver. They conducted experiments with small applications (e.g., bubble sort, heap sort, and "biquad" from the DSPstone benchmark suite), and their results show an average 29.9% energy reduction and a 25.2% performance improvement compared with a cache solution.
In 2004, Janapsatya et al. presented a hardware/software approach for a dynamic management scheme of an SPM [30]. Their approach introduces a specially designed hardware component to manage the copying of instructions from the main memory into the SPM. They utilize instruction execution frequency information to model applications as a graph and perform graph partitioning to determine locations within the program for initiating the copying process.
A. Contributions
Our work improves upon the state of the art in the following ways.
• We design a novel architecture to perform dynamic management of instruction code between SPM and main memory.
• We replace a difficult structural loop analysis of the program by an essentially statistical method for automated decision making regarding which basic blocks should be executed from the SPM and, among them, which groups of basic blocks should be placed in the SPM simultaneously without an overlap.
• We develop a novel graph partitioning algorithm for splitting the instruction code into segments that will either be copied into the SPM or executed directly from the main memory in a way that reduces the overall energy dissipation. We evaluated the system by using realistic embedded applications and show performance and energy comparisons with processors of various cache sizes and differing associativities.
IV. SYSTEM ARCHITECTURE
A. Hardware Implementation
The proposed methodology modifies the processor by adding an SPM controller, which is responsible for copying instructions from DRAM to SPM and stalling the CPU whenever copying is in progress. Fig. 4 shows the block diagram of the SPM controller, SPM, DRAM, and the CPU. A new instruction called Scratchpad Managing Instruction (SMI) is implemented. The SMI is used to activate the SPM controller; SMIs are inserted within the program and are executed whenever the CPU encounters them. The cost of the CPU stalling while the SPM is being copied is identical to servicing a cache miss, and the number of times the SMI was executed provides useful information for determining the application execution time. Further performance improvement is possible if copying can be performed a few cycles before the instructions are to be executed, assuming that there is no conflict in the SPM location between the instructions to be executed and the instructions to be copied.
The micro-architecture of the SPM controller is shown in Fig. 5 and includes the basic block table (BBT). The BBT stores the following static information: start addresses of the basic blocks to be read from within the DRAM; how many instructions are to be copied; and the start addresses for storing the basic blocks into the SPM. Content of the BBT is filled by the algorithm shown in Fig. 6 .
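The role of the BBT can be illustrated with a small Python sketch. The three fields mirror the static information listed above; the field names, the word-addressed list model of DRAM and SPM, and the function `execute_smi` are assumptions made for illustration and do not describe the actual hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBTEntry:
    dram_start: int   # start address of the basic block(s) in DRAM
    count: int        # number of instructions to copy
    spm_start: int    # start address for storing the block(s) in the SPM


def execute_smi(bbt, index, dram, spm):
    """Model the controller's copy for the BBT entry selected by an SMI.

    `dram` and `spm` are word-addressed lists (an assumption of this sketch).
    Returns the number of instructions copied, a rough proxy for how long
    the CPU would be stalled.
    """
    entry = bbt[index]
    for offset in range(entry.count):
        spm[entry.spm_start + offset] = dram[entry.dram_start + offset]
    return entry.count


dram = list(range(100, 116))          # 16 dummy instruction words
spm = [None] * 8                      # empty 8-word scratchpad
bbt = [BBTEntry(dram_start=4, count=3, spm_start=2)]
copied = execute_smi(bbt, 0, dram, spm)
print(copied, spm)
```

Keeping the table static means the SMI itself only needs to carry an index, not the addresses.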
With the addition of SPM, an example of program execution is as follows: at the start of a program, the first instruction is always fetched from DRAM. When an SMI is executed, the CPU will activate the SPM controller. The SPM controller will then stall the CPU, read information from the BBT, and provide addresses to DRAM and SPM for copying instructions from DRAM into SPM. After copying is completed, the SPM controller will release the CPU, which will then continue to fetch and execute instructions from DRAM. Whenever a branching instruction leads program execution to the SPM, the CPU continues to fetch instructions from the SPM.
Fig. 6 shows the steps for implementing the proposed methodology. The methodology operates on the assembly/machine code level. Profiling and tracing (using a modified SimpleScalar/PISA 3.0d [31] to output the instruction trace into a gzip file) are performed on the assembly/machine code to obtain an accurate representation of the program behavior. The program trace was generated using SimpleScalar 3.0d out-of-order execution. To ensure proper order of execution of SMI in an out-of-order machine, the instructions to be copied should be made dependent on the SMI. This is similar to a load operation followed by a register operation upon the loaded data. Program behavior is captured by extracting the concomitance information as explained below. The SPM placement algorithm uses the concomitance information to determine appropriate locations for inserting SMI within the program and to construct a table (BBT) that specifies the positions where the blocks will be copied.
B. Software Modification
The new instruction, SMI, is an instruction with a single operand. The operand of the SMI represents the address of a BBT entry. For a 32-b instruction with a 16-b operand, it is possible to have up to 65 536 entries in the BBT, allowing 65 536 SMIs to be inserted into a program.
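The operand arithmetic can be checked with a short Python sketch. The opcode value and the placement of the operand in the low 16 bits are hypothetical; the text above fixes only the instruction width and the operand width.

```python
SMI_OPCODE = 0xA5A5          # hypothetical value for the upper 16 bits
OPERAND_BITS = 16            # operand width stated in the text


def encode_smi(bbt_index: int) -> int:
    """Pack a BBT index into the low 16 bits of a 32-bit instruction word."""
    if not 0 <= bbt_index < (1 << OPERAND_BITS):
        raise ValueError("BBT index does not fit in a 16-bit operand")
    return (SMI_OPCODE << OPERAND_BITS) | bbt_index


def decode_smi(word: int) -> int:
    """Recover the BBT index from an encoded SMI word."""
    return word & ((1 << OPERAND_BITS) - 1)


print(1 << OPERAND_BITS)                  # number of addressable BBT entries
print(decode_smi(encode_smi(65535)))      # largest valid index round-trips
```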
The code can be modified by insertion of an SMI in the following three cases.
1) An SMI can be inserted just before an unconditional branch [see Fig. 7(a)]; in this case, the destination address of the unconditional branch is altered to point to the correct address within the SPM.
2) An SMI can be inserted to be the new destination of a conditional branch [see Fig. 7(b)]; if the branch tests positive, such an SMI will copy the basic block to be executed from the appropriate location in the SPM; an extra jump instruction is added immediately after the SMI, to transfer the program execution to that SPM location.
3) An SMI instruction can be inserted just after a conditional branch instruction [see Fig. 7(c)]. In this case, if the conditional branch tests negative, the SMI will copy the basic block that is to be executed following the branch instruction. Again, an extra jump instruction is added immediately after the SMI to transfer program execution to the SPM.
An extra branch instruction may also be needed at the end of the basic block if execution flow is required to jump to another location within memory. If there is an unconditional branch at the end of the basic block, then the branch destination can be simply altered, or modified as in 1). If there are conditional branches, then we may have to modify as per 2) or 3) depending upon whether they are to be executed from DRAM or SPM.
The algorithm shown in Fig. 6 outputs the modified software, including the addition of any extra branching instructions.
V. CONCOMITANCE
A basic block, by definition, is the largest chain of consecutive instructions that has the following properties: 1) if the first instruction of the block is executed, then all instructions in the basic block will also be executed consecutively and 2) any instruction of the basic block is executed only as a part of the consecutive execution of the whole block.
The distance between two consecutive executions $b$ and $b'$ of a basic block $B$ in the trace $T$ of a run of a program is defined as follows. If, between the executions $b$ and $b'$ of $B$, there are no other occurrences of $B$ in $T$, we count the number of distinct instruction steps executed between $b$ and $b'$, including $B$. We call this value the distance between $b$ and $b'$ and denote it by $d(b, b')$. For example, assume that "$B_1 B_2 B_3 B_4 B_1$" is a sequence of consecutive executions in a trace and that each of the basic blocks $B_1$, $B_2$, $B_3$, and $B_4$ contains ten distinct instructions; then the distance between the two executions of $B_1$ is 40, because only $B_2$, $B_3$, and $B_4$ appear between the two executions of the basic block $B_1$ (and we include $B_1$ itself in the count).
The weight function is used to give a decreasing significance to the two consecutive executions of the same block that are further apart in the sense of the above notion of distance. Thus, it is a nonnegative real function that is decreasing, i.e., such that, if , are real numbers and , then . The trace concomitance gives information about how tightly interleaved the executions of two distinct basic blocks and in the trace are. Thus, for a basic block , we consider all of its consecutive executions and in the trace , for which there exists at least one execution of the block between the executions and ; we denote such a fact by . We now also reverse the roles of and , and define by
Here, in the sum means that ranges over all executions of the basic block that appear in the trace . Note that, for two distinct basic blocks and , the concomitance of these two blocks will be large just in case is often executed between two consecutive executions of that are a short distance apart, and/or if is often executed between two consecutive executions of that are also a short distance apart, in the sense of distance defined above.
The trace self-concomitance of a basic block is a measure of how clustered consecutive executions of the block are and is defined as Thus, trace self-concomitance has a large value for those basic blocks whose executions appear in clusters, with all successive pairs of executions within each cluster separated by short distances. Note that, even if is executed relatively frequently, but such executions of are dispersed in the trace rather than clustered, then self-concomitance will still be low. On the other hand, if, for a certain input, a particular loop is frequently executed, then the trace self-concomitance of each basic block from this loop will be large. Thus, the loop structure of a program is reflected in the statistics of the concomitance values if such statistics are taken over executions with a sufficient number of inputs reasonably representing what is expected in practice. This is the motivation for the following definitions.
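The definitions of distance, trace concomitance, and trace self-concomitance can be sketched in Python as the prose describes them. This is an illustration under stated assumptions, not the authors' exact formulas: the summation form, the reading of "distinct" as counting each basic block once per interval, and the weight function $w(x) = 1/x$ are all assumptions, as are the function names.

```python
def intervals(trace, block):
    """Index pairs (start, end) of consecutive executions of `block`."""
    positions = [i for i, b in enumerate(trace) if b == block]
    return list(zip(positions, positions[1:]))


def distance(trace, sizes, start, end):
    """Instructions in the distinct blocks executed in trace[start:end].

    Assumption: each basic block is counted once per interval; the block at
    `start` is included, as in the worked distance example.
    """
    return sum(sizes[b] for b in set(trace[start:end]))


def trace_self_concomitance(trace, sizes, block, w):
    """Sum of w(distance) over all consecutive executions of `block`."""
    return sum(w(distance(trace, sizes, s, e))
               for s, e in intervals(trace, block))


def one_sided(trace, sizes, a, b, w):
    """Same sum for `a`, restricted to intervals that contain `b`."""
    return sum(w(distance(trace, sizes, s, e))
               for s, e in intervals(trace, a) if b in trace[s + 1:e])


def trace_concomitance(trace, sizes, a, b, w):
    """Symmetric sum: `b` between executions of `a`, plus roles reversed."""
    return one_sided(trace, sizes, a, b, w) + one_sided(trace, sizes, b, a, w)


sizes = {"f": 10, "g": 10}
inverse = lambda x: 1.0 / x            # assumed weight function
blocked = ["f"] * 50 + ["g"] * 50      # 50 executions of f, then 50 of g
alternating = ["f", "g"] * 50          # executions of f and g interleaved
print(trace_concomitance(blocked, sizes, "f", "g", inverse))
print(trace_concomitance(alternating, sizes, "f", "g", inverse))
```

Under these assumptions the blocked trace gives a concomitance of zero for the two blocks, while the alternating trace gives a positive value, even though each block is executed 50 times in both traces.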
The concomitance of a pair of basic blocks and for a given probability distribution of inputs is the corresponding expected value of trace concomitance . The self-concomitance of a basic block for a given probability distribution of inputs is the corresponding expected value of trace self-concomitance . To conveniently use the concomitance and self-concomitance in our scratchpad placement algorithms, we construct the concomitance table by the following profiling procedure.
• Choose a suitable weight function . In our experiments so far, we have studied two types of weight functions:
and , where is a constant depending on the size of the scratchpad.
• Run the program with inputs that reasonably represent the probability distribution of inputs expected in practice.
• Calculate the average value of the trace self-concomitance obtained from such runs, thus obtaining the self-concomitance value .
• Set a threshold of significance for the value of self-concomitance of basic blocks. The set of all blocks with significant self-concomitance (i.e., larger than the threshold) is formed.
• Calculate the concomitance for all pairs of basic blocks from such set , by finding the average of all trace concomitances obtained from the runs of the program and then form the corresponding 
VI. PROBLEM DESCRIPTION
To copy the important code segments into SPM, SMIs are inserted into strategic locations within the program. To determine strategic locations for insertion of SMIs, we transformed the problem into a graph-partitioning problem.
Given any program, it can be represented as a graph as follows. The vertices represent basic blocks belonging to the set of basic blocks having large self-concomitance. The weight of each vertex represents the number of instructions within the basic block (size of the basic block), and the weight of each edge represents the concomitance value between the blocks joined by the edge. An illustration of the basic block is given in Fig. 8(a) , and its corresponding concomitance graph is shown in Fig. 8(b) .
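A minimal Python sketch of this graph construction follows. The dictionary-based representation, the function name `build_concomitance_graph`, and the example values are illustrative assumptions; only the roles of vertex weights (block sizes) and edge weights (concomitance values) come from the text.

```python
def build_concomitance_graph(sizes, self_conc, conc, threshold):
    """Build the vertex and edge weights of the concomitance graph.

    sizes:     {block: number of instructions}
    self_conc: {block: self-concomitance}
    conc:      {(block_u, block_v): concomitance}
    Only blocks whose self-concomitance exceeds `threshold` become vertices.
    """
    significant = {b for b, value in self_conc.items() if value > threshold}
    vertex_weights = {b: sizes[b] for b in significant}
    edge_weights = {(u, v): c for (u, v), c in conc.items()
                    if u in significant and v in significant}
    return vertex_weights, edge_weights


sizes = {"A": 4, "B": 6, "C": 2}
self_conc = {"A": 5.0, "B": 3.0, "C": 0.1}
conc = {("A", "B"): 2.5, ("A", "C"): 0.3}
vertices, edges = build_concomitance_graph(sizes, self_conc, conc, threshold=1.0)
print(vertices, edges)
```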
The problem is to find vertices within the graph (i.e., the basic blocks of instructions) where one should insert the custom instruction, SMI.
VII. CODE-PARTITIONING ALGORITHM
A. Graph-Partitioning Problem
Consider the following graph-partitioning problem. Assume that we are given:
1) a graph with a set of vertices and a set of edges ;
2) for each vertex , a vertex weight and for each edge an edge weight (the concomitance value);
3) a constant (size of SPM).
Find a partition of the graph, such that each subgraph has a total vertex weight less than or equal to , and the sum of edge weights of all of the edges connecting these subgraphs is minimal.
It is easy to see that the "Minimum Cut into Bounded Sets Problem" [33] is P-time Turing reducible to our graph-partitioning problem; thus, we cannot expect to find a polynomial time algorithm, and we must use a heuristic approximation algorithm to work on real applications. We implemented an algorithm consisting of two procedures: graph-partitioning procedure and removal of insignificant subgraphs procedure, as shown in Fig. 9 .
We start with the set of basic blocks whose self-concomitance is larger than the chosen threshold and consider a graph whose vertices are all elements of this set. All vertices of this graph are initially pairwise connected, thus forming a complete graph (a clique). The edges have weights that are equal to the concomitance between their two vertices. We now order the set of edges according to their weights in ascending order; thus, the edge at the top of the list is one of the edges with the lowest concomitance value. Notice that, since we start with a complete graph, the number of edges $|E|$ and the number of vertices $|V|$ satisfy $|E| = |V|(|V| - 1)/2$. We eliminate all of the edges of the graph which have such a minimal weight value. This usually induces a partition of the graph into several connected subgraphs; the total weight of the vertices for each of these connected subgraphs is calculated and compared with the SPM size. We then apply the same procedure to all connected subgraphs whose total vertex weight is larger than the SPM size, and the procedure stops when all of the connected subgraphs have a total vertex weight smaller than or equal to the SPM size. It is easy to see that the complexity of the graph-partitioning procedure is . After we have produced the subgraphs using the concomitance, we now consider the CFG with edges having weights equal to the total number of traversals of the edge during execution.
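The edge-elimination procedure described above can be sketched in Python as follows. This is an illustrative reading of the prose, not the procedure of Fig. 9 itself: the data layout, the function names, and the guard for a single vertex that exceeds the capacity are assumptions.

```python
def components(vertices, edges):
    """Connected components of the graph (vertices, edges), found by DFS."""
    adjacent = {v: set() for v in vertices}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    seen, result = set(), []
    for v in vertices:
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(adjacent[x] - comp)
        seen |= comp
        result.append(comp)
    return result


def partition(weights, conc, capacity):
    """Split the graph until every subgraph fits into `capacity`.

    weights: {vertex: block size}; conc: {(u, v): concomitance}.
    Repeatedly removes all edges of minimal weight from every subgraph
    whose total vertex weight is larger than `capacity`.
    """
    done, todo = [], [(set(weights), dict(conc))]
    while todo:
        verts, edges = todo.pop()
        # Accept a subgraph that fits, or one with no edges left to remove.
        if sum(weights[v] for v in verts) <= capacity or not edges:
            done.append(verts)
            continue
        lightest = min(edges.values())
        kept = {e: c for e, c in edges.items() if c > lightest}
        for comp in components(verts, kept):
            sub = {e: c for e, c in kept.items()
                   if e[0] in comp and e[1] in comp}
            todo.append((comp, sub))
    return done


weights = {"A": 4, "B": 4, "C": 4, "D": 4}
conc = {("A", "B"): 9, ("C", "D"): 8, ("A", "C"): 1,
        ("A", "D"): 1, ("B", "C"): 1, ("B", "D"): 1}
parts = partition(weights, conc, capacity=8)
print(sorted(sorted(p) for p in parts))
```

In this toy graph the four weakly related pairs are cut first, leaving two subgraphs of tightly interleaved blocks that each fit into the assumed capacity.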
First of all, there can be subgraphs that are not worthwhile to copy and execute from SPM. For example, the graph-partitioning procedure can produce results such as those shown in Fig. 11. The dotted lines indicate the partitions of the graph. It can be seen that partition 2 resulted in basic block B as a subgraph and it will only ever be executed once. Thus, it is not cost-effective to first read it from DRAM and copy it to SPM, since one can just read it and execute it directly from DRAM. To search for and remove subgraphs of this type, another heuristic procedure is implemented.
B. Removal of Insignificant Subgraphs Procedure
Such a heuristic procedure is shown in Fig. 10. It starts by calculating the energy cost of executing each subgraph from DRAM and the energy cost of executing the same subgraph from SPM. The energy cost of executing a subgraph $g$ from DRAM is calculated using
$$E_{\mathrm{DRAM}}(g) = E_{\mathrm{DRAM}} \times \sum_{v \in g} N_v \times F_v \tag{1}$$
where $E_{\mathrm{DRAM}}$ is the energy cost of a single DRAM access per instruction, $N_v$ is the number of instructions in the basic block corresponding to the vertex $v$, and $F_v$ is the total number of executions of the basic block corresponding to the vertex $v$. Equation (2) shows the energy cost of executing instructions within the subgraph from SPM, including the cost of accessing the DRAM once and copying to SPM
$$E_{\mathrm{SPM}}(g) = E_{\mathrm{SPM}} \times \sum_{v \in g} N_v \times F_v + E_{\mathrm{special}}(g) \tag{2}$$
where $E_{\mathrm{SPM}}$ is the energy cost of a single SPM access per instruction. $E_{\mathrm{special}}(g)$ is defined in (3), where $I_g$ is the set of all of the edges that are incoming to the subgraph $g$ from all other subgraphs, $E_{\mathrm{SMI}}$ is the energy cost of executing the special instruction including reading from DRAM and copying to SPM, and $E_{\mathrm{br}}$ is the cost of adding a branch instruction if necessary. This happens if the proper execution of the branching instruction should change the flow from DRAM to SPM or vice versa. If such an extra branching instruction is necessary for an edge $e$, the constant $c_e$ is equal to 1; otherwise, $c_e$ is 0. The energy calculation is used to classify which subgraphs are to be executed from SPM and which subgraphs are to be executed from DRAM; only subgraphs with $E_{\mathrm{SPM}}(g) < E_{\mathrm{DRAM}}(g)$ will be executed from SPM. We call such a subgraph an SPM-subgraph.
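The DRAM-versus-SPM comparison for one subgraph can be illustrated with a small Python sketch. The function name `classify_subgraph`, its parameters, and the energy values in the usage example are hypothetical; the overhead of the special instructions and extra branches (the term given by (3)) is not modeled here and is simply passed in by the caller.

```python
def classify_subgraph(blocks, e_dram, e_spm, overhead):
    """Decide where a subgraph should execute from.

    blocks:   list of (instructions_in_block, executions_of_block) pairs
    e_dram:   energy of one DRAM instruction access (arbitrary unit)
    e_spm:    energy of one SPM instruction access (same unit)
    overhead: energy of SMIs and extra branches for this subgraph,
              supplied by the caller (assumption: computed elsewhere)
    Returns (placement, cost_from_dram, cost_from_spm).
    """
    fetches = sum(n * f for n, f in blocks)   # total instruction fetches
    cost_dram = e_dram * fetches
    cost_spm = e_spm * fetches + overhead
    placement = "SPM" if cost_spm < cost_dram else "DRAM"
    return placement, cost_dram, cost_spm

# A hot 10-instruction loop executed 1000 times is worth copying:
hot = classify_subgraph([(10, 1000)], e_dram=5, e_spm=1, overhead=200)
# A 10-instruction block executed once is not:
cold = classify_subgraph([(10, 1)], e_dram=5, e_spm=1, overhead=60)
```

The second call mirrors the case of a basic block that is executed only once: the copy overhead outweighs the saving per fetch, so the block stays in DRAM.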
The remainder of the procedure inserts the SMIs as follows. For every SPM-subgraph, it examines all incoming edges to this subgraph. We determine whether such an edge is included in a path either from another SPM-subgraph or from the start of the program, and only in such cases is an SMI inserted. This means that, if all paths including this edge emanate from the same SPM-subgraph, then an SMI need not be inserted. The complexity of this procedure is polynomial in the number of vertices in the graph. For example, in the CFG shown in Fig. 11, there are three subgraphs created by the graph-partitioning procedure. Subgraph 1 consists of the basic block B only. Assume that the energy cost estimation using (1) and (2) classified basic block B to be executed from DRAM. Consider the edge from B to D. All paths containing the edge from B to D also contain the edge from A to B. Since A belongs to the same SPM-subgraph as D, by our procedure no SMI will be inserted in the edge from B to D. In this way, we avoid unnecessary replacement of instructions that are already in the SPM.
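One possible reading of this insertion rule is sketched below in Python. It is an interpretation under stated assumptions, not the procedure of Fig. 10: the function `edges_needing_smi`, the `region` map, and the example CFG are hypothetical, and "paths emanating from an SPM-subgraph" is approximated by walking backward through DRAM-executed blocks until an SPM-subgraph or the program entry is met. Loading the subgraph that contains the entry block itself is outside the sketch.

```python
def edges_needing_smi(cfg_edges, region, entry):
    """Return the CFG edges on which an SMI would be inserted.

    cfg_edges: list of (src, dst) basic-block pairs
    region:    dict block -> SPM-subgraph id, or None if run from DRAM
    entry:     first basic block of the program
    """
    preds = {}
    for src, dst in cfg_edges:
        preds.setdefault(dst, set()).add(src)

    def origins(block):
        """SPM-subgraphs (or 'START') that can reach `block` via DRAM blocks."""
        found, seen, stack = set(), set(), [block]
        while stack:
            b = stack.pop()
            if b in seen:
                continue
            seen.add(b)
            if region[b] is not None:
                found.add(region[b])      # control left an SPM-subgraph here
                continue
            if b == entry:
                found.add("START")
            stack.extend(preds.get(b, ()))
        return found

    result = set()
    for src, dst in cfg_edges:
        target = region[dst]
        if target is None or region[src] == target:
            continue                      # not an incoming edge of an SPM-subgraph
        # Skip the SMI only if every way of arriving comes from `target` itself.
        if origins(src) != {target}:
            result.add((src, dst))
    return result

# Hypothetical CFG: S is the entry (DRAM), A and D share SPM-subgraph 1,
# B runs from DRAM between them.
edges = [("S", "A"), ("A", "B"), ("B", "D"), ("D", "A")]
region = {"S": None, "A": 1, "B": None, "D": 1}
smi_edges = edges_needing_smi(edges, region, "S")
```

In this example only the edge from S to A receives an SMI; the edge from B to D does not, because B can only be reached from the subgraph that already occupies the SPM.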
The algorithm described in this section assumes the use of the concomitance value for the partitioning of the graph. A similar algorithm can be used with frequency (as in [30]) instead of concomitance, and the results are compared in Section VIII.
VIII. EXPERIMENTAL RESULTS
A. Experimental Setup
We simulated a number of benchmarks using SimpleScalar/PISA 3.0d [31] to obtain memory access statistics. Power figures for the CPU were calculated using Wattch [34] (for a 0.18-μm process). CACTI 3.2 [35] was used as the energy model for the cache memory. The energy model for the SPM was extracted from CACTI as in [8]. The DRAM power figures were taken from IBM embedded DRAM SA-27E [36]. The configuration for the simulated CPU is shown in Table I. All benchmarks were obtained from the Mediabench suite [37] except for the benchmark histogram, which was from the UTDSP test suite [38]. The total number of instructions executed in each benchmark is tabulated in Table II.
B. Area Cost for Inclusion of SMI Hardware Controller
In a cache, the tag array memory cells keep a record of the entries within the cache data array, while, for the SPM system proposed here, the BBT keeps a record of where each set of code segment(s) is to be placed within the SPM. Each BBT entry needs to store one DRAM address, one SPM address, and the number of instructions to be copied. Since the most significant bits of the SPM address are known from the memory map [as shown in Fig. 2(b)], only the least significant bits of the SPM address need to be stored within the BBT.
For example, given a DRAM size of 2 MB, an SPM size of 1 KB, and an instruction size of 8 B, there is enough space to store up to 256 K instructions within the DRAM and 128 instructions within the SPM. Thus, the number of instructions that need to be copied ranges from 1 to 128, requiring 7 b per entry. Each DRAM address requires 18 b per entry (to manage 256 K instructions), and each SPM address requires 7 b per entry. In total, each BBT entry requires enough space to store 32 b. Fig. 12 shows a comparison between the number of bits in the cache tag RAM cells and the average size of the BBT for different cache/SPM sizes with a main memory size of 2 MB. The average size of the BBT is calculated from all of the benchmarks shown in Table II. From Fig. 12, it can be seen that the cache tag RAM cells grow linearly as the cache size grows exponentially, but the BBT size decreases as the SPM grows exponentially.
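The bit-width arithmetic of the BBT example can be checked with a short Python sketch. The helper names `bits_for` and `bbt_entry_bits` are hypothetical, and the sketch assumes that addresses count whole instructions and that a copy length in the range 1 to 128 is stored as the length minus one.

```python
def bits_for(count):
    """Bits needed to distinguish `count` different values (count >= 1)."""
    return max(1, (count - 1).bit_length())

def bbt_entry_bits(dram_bytes, spm_bytes, instr_bytes):
    """Return (dram_addr_bits, spm_addr_bits, length_bits, total_bits)."""
    dram_instrs = dram_bytes // instr_bytes   # instructions that fit in DRAM
    spm_instrs = spm_bytes // instr_bytes     # instructions that fit in SPM
    dram_addr = bits_for(dram_instrs)
    spm_addr = bits_for(spm_instrs)
    length = bits_for(spm_instrs)             # lengths 1..spm_instrs, stored as length - 1
    return dram_addr, spm_addr, length, dram_addr + spm_addr + length

# 2 MB DRAM, 1 KB SPM, 8 B instructions, as in the example above.
widths = bbt_entry_bits(2 * 1024 * 1024, 1024, 8)
```

For these sizes the sketch yields 18 b for the DRAM address, 7 b for the SPM address, and 7 b for the copy length, for a total of 32 b per entry.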
BBT size decreases as SPM size increases because, with a larger SPM, fewer SMIs are needed. Table III shows a comparison of the total number of memory accesses. Performance and energy results measured from the experiments are shown in Tables IV and V, and the cost of adding copy locations within the program is shown in Table II.
In Table II, column 1 shows the application name, column 2 gives the size of the program, column 3 gives the number of instructions executed, column 4 gives the average number of copy locations to be inserted (the average is taken over SPM sizes ranging from 1 KB to 16 KB), column 5 shows the average number of instructions that need to be copied into the SPM using the concomitance method, and columns 6 and 7 present the results from using frequency as the edge value [30]. Comparing columns 5 and 7 of Table II shows that the number of instructions to be copied into the SPM is significantly reduced using the concomitance method compared to the frequency method.
An SPM controller with BBT size of 128 entries ( b) was implemented in Verilog. The SPM controller is a finite state machine that accesses the BBT and forwards the output of the BBT as DRAM and SPM memory addresses. Power estimation of the SPM controller was done using a commercial power estimation tool with the following settings: Clock Frequency MHz; Voltage 1.8 V; using a 0.18-μm technology. Power estimation showed that it consumed 2.94 mW of power. This is in comparison with cache tag power of 159 mW, obtained from CACTI [35], for a 0.18-μm 64 bytes of direct-mapped cache. The cache power would progressively increase with higher associativity. In addition, we also synthesized the SPM controller using Synopsys tools [39] for a 0.18-μm process. The result shows that the SPM controller is approximately 1108 gates in size. Comparing the size of the SPM controller with a synthesized version of the SimpleScalar CPU (obtained from [40]; the size of the synthesized CPU without the cache memories is approximately 82 200 gates) shows that the SPM controller is approximately 1.35% of the total CPU size.
TABLE III. TOTAL MEMORY ACCESS COMPARISON. (CACHE RESULT SHOWS THE TOTAL CACHE MISSES. DRAM ACCESS IS THE TOTAL NUMBER OF INSTRUCTIONS EXECUTED FROM DRAM PLUS THE NUMBER OF INSTRUCTIONS COPIED FROM DRAM TO THE SPM)
Table III compares the total number of memory accesses of a cache system, the system with scratchpad using the frequency method for capturing the program behavior [30], and the system with scratchpad using the concomitance metric as a method for capturing the program behavior. Column 1 in Table III gives the application name; column 2 shows the cache or SPM size; columns 3-7 show the total number of cache misses for different cache associativities; column 8 gives the total DRAM accesses using the frequency optimization method; column 9 shows the total DRAM accesses obtained from using the concomitance optimization method; column 10 shows the percentage improvement of the concomitance method over the average values of the results of the cache system (in columns 3-7); and column 11 shows the percentage improvement of the concomitance method over the frequency method. Column 11 shows that the concomitance method reduces the total number of DRAM accesses by an average of 52.94% compared with the frequency method. When the comparison is made with a cache-based system (columns 3-7), the total number of DRAM accesses for the SPM system can be larger than the total number of cache misses. The higher total DRAM accesses in an SPM system do not, however, translate into worse energy consumption or worse performance compared with a cache system (such cases are highlighted in bold in Tables III-V). This is because the energy and time cost per SPM access is far less than the energy and time cost per cache access (especially when compared with the energy and time cost of accessing a 16-way set-associative cache). It can also be noted that the total number of DRAM accesses for an SPM system comprises both the number of instructions executed from DRAM and the total number of instructions copied from DRAM to SPM. Copying instructions from DRAM to SPM causes a sequential DRAM access, which consumes less power and time compared with the random DRAM access that happens on each cache miss.
C. Performance Cost to Accommodate the Execution of SMI
D. Analysis of Results
The experimental setup for measuring the performance and energy consumption is shown in Fig. 13. Performance of the memory architecture is evaluated by calculating the total memory access time for a complete program execution. Estimation of memory access time is possible because the SPM access time, cache access time, DRAM access time, cache hit rates, and number of times the SPM contents are changed are all known.
Total access time of the cache architecture system, $T_{\mathrm{cache}}$, is calculated using (4) from the total number of cache hits, the number of cache misses, the access time to fetch one instruction from the cache, and the amount of time spent to access one DRAM instruction.
We estimate the total memory access time for the SPM architecture, $T_{\mathrm{SPM}}$, using (5) from the number of instructions executed from SPM, the amount of time needed to fetch one instruction from the SPM, the total number of times all SMIs are executed, the total number of instructions copied from DRAM to SPM during program runtime, and the number of instructions executed from DRAM.
SPM memory access time is compared with cache memory access time using the following equation:
$$\text{Improvement} = \frac{T_{\mathrm{cache}} - T_{\mathrm{SPM}}}{T_{\mathrm{cache}}} \times 100\% \tag{6}$$
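To make the access-time comparison concrete, the following Python sketch uses the simplest linear combination of the quantities listed above. The exact forms of (4) and (5) are not reproduced here, so every formula in the sketch is an assumption: in particular, the per-SMI time `t_smi` is a hypothetical parameter, a miss is charged one DRAM instruction time, and copied instructions are charged the same DRAM time as executed ones.

```python
def cache_time(hits, misses, t_cache, t_dram):
    """Assumed model: hits cost a cache access, misses cost a DRAM access."""
    return hits * t_cache + misses * t_dram

def spm_time(n_spm, n_dram, n_copy, n_smi, t_spm, t_dram, t_smi):
    """Assumed model: SPM fetches, DRAM fetches plus copies, and SMI overhead."""
    return n_spm * t_spm + (n_dram + n_copy) * t_dram + n_smi * t_smi

def improvement_percent(baseline, candidate):
    """Reduction of `candidate` relative to `baseline`, in percent."""
    return 100.0 * (baseline - candidate) / baseline

# Hypothetical counts and per-access times (arbitrary time units).
t_c = cache_time(hits=900, misses=100, t_cache=1, t_dram=10)
t_s = spm_time(n_spm=950, n_dram=30, n_copy=20, n_smi=2,
               t_spm=1, t_dram=10, t_smi=1)
gain = improvement_percent(t_c, t_s)
```

The numbers in the usage example are made up for illustration and are not taken from Tables III-V.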
Energy consumption comparison is shown in Table IV. The cache energy consumption is calculated using (7), which includes the energy cost per DRAM access and the energy consumed by the CPU. The SPM energy is calculated using (8), which includes the energy cost per SPM access, the energy cost per execution of the special instruction, the total number of times SMIs are executed, the energy cost for any additional branch instructions, and the total number of times any additional branch instructions are executed.
The percentage difference between the SPM energy consumption and the cache energy consumption is calculated using the following equation:
$$\text{Improvement} = \frac{E_{\mathrm{cache}} - E_{\mathrm{SPM}}}{E_{\mathrm{cache}}} \times 100\% \tag{9}$$
where $E_{\mathrm{cache}}$ and $E_{\mathrm{SPM}}$ are the total energy consumption of the cache system and of the SPM system, respectively. Results of the percentage energy improvement are shown in Table IV. Table IV shows an energy comparison of: 1) a cache system; 2) a scratchpad system using the frequency method [30]; and 3) a scratchpad system using the concomitance method. The table structure is identical to Table III except that the comparison is now for energy. Fig. 14 shows the energy improvement comparison between the cache system, the frequency SPM allocation procedure [30], and the concomitance SPM allocation method for all the benchmarks. The energy comparison shows that the concomitance method almost always performs better than the cache system, and superior results are seen when compared with results of the frequency method. On average, the energy consumption using the concomitance method is 41.9% better than the cache system energy and 27.1% better than the frequency method. In particular, the concomitance method is superior in cases where negative improvements over cache were shown using the frequency method.
Performance results are shown in Table V. The structure of the table is identical to that of Table IV; columns 3-7 show the execution time of a cache-based system; column 8 shows the performance results of the frequency method; column 9 shows the performance measurement results of the concomitance method; column 10 shows the average performance improvement of the concomitance method over the cache system; and column 11 shows the performance improvement over the frequency method. On average, the concomitance method improves the execution time by 40.0% compared with cache and 23.6% compared with the frequency method.
Shorter instruction memory access time does not always imply a shorter execution time, due to data memory access time and the time required by the CPU to execute multicycle instructions. For the purposes of evaluation, we minimized the effect of data memory access on the execution time by setting a large data cache so that data cache miss rates are less than 0.01%. Thus, the data cache has a very small to negligible miss rate and, hence, minimal effect on the program execution time.
For multicycle instructions, it is not possible to accurately estimate the execution time without knowing which multicycle instructions were executed. We compared cache execution times obtained by SimpleScalar simulation with cache memory access times obtained from (4). We found that, on average, a 9.5% error is seen between the two values, with a maximum error of 17%.
Thus, it is clear from the results that a dynamic SPM allocation scheme using the method is feasible. We show that the number of copy instructions inserted is far smaller than for the existing methods described in [29] and [30]. In addition, for the applications shown here, we show performance improvement and energy savings over conventional cache-based systems.
IX. CONCLUSION
We have presented a method to lower energy consumption and improve the performance of embedded systems. We presented a methodology for selecting appropriate code segments to be executed from the SPM. The methodology determined the code segments that are to be stored within the SPM by using the concomitance information. By using a custom hardware SPM controller to dynamically manage the SPM, we have successfully avoided the need to insert many instructions into a program for managing the content of the SPM. Instead, we implemented heuristic algorithms to strategically insert custom instructions, SMI, to activate the hardware SPM controller. Experimental results show that our SPM scheme can lower energy consumption by an average of 41.9% compared with traditional cache architecture, and performance is improved by an average of 40.0%.
