Code generation for embedded processors creates opportunities for several performance optimizations not applicable for traditional compilers. We present techniques for improving data cache performance by organizing variables declared in embedded code into memory, using specific parameters of the data cache. Our approach clusters variables to minimize compulsory cache misses, and solves the memory assignment problem to minimize conflict cache misses. Our experiments demonstrate significant improvement in data cache performance (average 46% in hit ratios) by the application of our memory organization technique using code kernels from DSP and other domains on the LSI Logic CW4001 embedded processor.
Introduction
Embedded microprocessors are a common feature in modern electronic systems due to the advantages they offer in terms of flexibility, reduction in design time and full-custom layout quality [11]. These processors are often available in the form of cores, which can be instantiated as part of a larger system on a chip. This is feasible in current technology due to the relatively small area occupied by the processor cores, making the rest of the on-chip die area available for RAM, ROM, coprocessors, and other modules. A core-based design methodology is driven by demands for design reuse that ultimately results in reduced development time. Apart from the processors in the Digital Signal Processing domain (such as the TMS320 series from Texas Instruments), we also find microprocessors with relatively general purpose architectures available as embedded processors. An example of such a general purpose embedded processor is LSI Logic's CW4001 [2], which is based on the MIPS family of processors.
Generation of efficient code for embedded processors has been the subject of recent investigation [1, 5, 13]. Optimization techniques that improve the performance of application programs by exploiting the irregular architectures of embedded processors have been reported [5, 9, 10, 14].
*This work was partially supported by grants from NSF (CDA-9422095) and ONR (N00014-93-1-1348).
An important determinant of performance in embedded systems is the interaction between the processor and external memory. General purpose embedded processors such as the CW4001 are equipped with on-chip instruction and data caches, which interface with larger off-chip memories. Since off-chip memory accesses usually stall the CPU execution for significant durations (each access could take 10-20 processor cycles, depending on the relative processor and memory access speeds), it is important to design the interface between cache and main memory carefully.
Cache misses can be classified into three categories: compulsory misses, capacity misses, and conflict misses [7]. In the computer architecture and compiler domains, many techniques for achieving cache miss reduction involve additional hardware assistance [6, 17, 3], which can often be expensive in terms of additional on-chip area. A well known compiler optimization technique called blocking combines strip mining and loop permutation to maximize temporal locality of reused data [16]. This technique helps in reducing capacity misses in data caches, but fails to take advantage of data placement strategies to reduce conflict misses.
In the embedded processors domain, code placement methods based on program traces for improvement of instruction cache performance have been reported [15] . A technique for estimation of instruction cache performance has also been reported [8] . However, no published literature exists on the improvement of data cache performance in embedded systems.
Embedded system design is characterized by certain features that traditional compilers typically do not consider in their optimizations. For example, compilers seldom take into account the specific cache parameters such as cache line size in their optimizations, because fast compilation speed requirements preclude the complex analysis procedures. However, in embedded systems, code generation can be tuned to the specific cache configuration to be used (or the specific configuration that is being currently explored). Further, the typical execution of only a single program and the absence of virtual addresses permits the compiler to assign the exact memory location occupied by the data. In this work, we exploit this situation to organize data in memory in order to minimize data cache misses.
Problem Description
Consider a direct-mapped cache of size C (C = 2^m) words, with a cache line size L words, i.e., L consecutive words are fetched from memory on a cache read miss. In our formulation, we assume a write-back cache with a fetch-on-miss policy [7], though the technique remains identical for other write policies, and is equally effective. We use a small example to illustrate the problem and our approach. Suppose the code fragment in Figure 1 We have:
This ensures that a[i], b[i], and c[i] are always mapped into different cache lines, and their accesses do not interfere with each other in the data cache. We extend this basic idea to organize scalars (Section 3) and arrays (Section 4).
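The effect of placement on conflict misses can be illustrated with a small direct-mapped cache simulation. This is only a sketch under stated assumptions: the loop body `c[i] = a[i] + b[i]`, the access order, the sizes `C`, `L`, `N`, and the helper names `count_misses` and `loop_trace` are all hypothetical and are not taken from Figure 1.

```python
def count_misses(addresses, cache_words, line_words):
    """Simulate a direct-mapped cache and return the number of misses.

    addresses   -- word addresses in access order
    cache_words -- cache size C in words
    line_words  -- line size L in words
    """
    num_lines = cache_words // line_words
    tags = [None] * num_lines          # memory block currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // line_words     # memory block containing this word
        line = block % num_lines       # cache line that block maps to
        if tags[line] != block:
            misses += 1
            tags[line] = block
    return misses


def loop_trace(base_a, base_b, base_c, n):
    """Word addresses touched by the assumed loop: c[i] = a[i] + b[i]."""
    trace = []
    for i in range(n):
        trace += [base_a + i, base_b + i, base_c + i]
    return trace


C, L, N = 64, 4, 64   # hypothetical cache size, line size, and array length (words)

# Arrays stored back to back: each base is a multiple of C,
# so a[i], b[i], and c[i] always compete for the same cache line.
packed = count_misses(loop_trace(0, N, 2 * N, N), C, L)

# Each later array is shifted by one more cache line (L words of padding),
# so the three accesses of an iteration land on three different lines.
padded = count_misses(loop_trace(0, N + L, 2 * N + 2 * L, N), C, L)

print(packed, padded)
```

With these hypothetical sizes, the packed layout misses on every access, while the padded layout incurs only the compulsory miss for each block.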
Memory Organization of Scalar Variables
We assume that the scheduling and register allocation of the code has already been performed, and the sequence of accesses to variables is fixed.
Constructing the Closeness Graph
We first generate an Access Sequence, which is a graph representing memory references (loads and stores are treated alike) in the code. Figure 2(a) shows an example Access Sequence. The label 3 on edge e → a represents a loop with bound = 3. We then construct a Closeness Graph of the variables, which represents the degree of desirability for keeping sets of variables in the same vicinity in memory. For example, if L words accessed successively from memory are placed in consecutive locations, a single memory access could fetch them all into cache, thereby saving up to L - 1 extra memory accesses caused by compulsory misses.
We define the distance between two nodes u and v in the access sequence as: distance(u, v) = number of distinct variable nodes encountered on a path from u to v, or v to u (including u and v). The Closeness Graph CG(V, E) is constructed from the Access Sequence by first creating a node u ∈ V for every variable in A, and initializing all 
Grouping of Variables into Clusters
The next step is to group the variables into clusters of M words, where M is the number of words in a data cache line.
Intuitively, a higher edge weight between two variables u and v in the Closeness Graph represents a reduction in the number of memory accesses, if the two variables are stored in the same cache line. That is, we can reduce cache misses by clustering variables with higher edge weights into the same cache line. We formulate the problem of maximizing the sharing of the cache lines by closely correlated variables as follows: Partition the nodes of the Closeness Graph (V, the set of n variables) into clusters of size M, so that the total weight of edges in all the clusters is maximized.
Since an exact solution to the above problem has a com- 
    Create new cluster C = {u}
    while (size of cluster C ≠ M) and (X ≠ ∅) do
        Let x be the variable ∈ X with maximum T, where T = Σ_{u ∈ C, x ∈ X - C} e(u, x)    -- sum of edge weights with nodes already in C
    -- delete all edges to nodes in cluster C just formed
    Update S(v) for all v ∈ X; F = F ∪ {C}
When procedure PerformClustering is applied on the graph in Figure 2(b), node b is selected first (line i). Next,
¹ The required values of k in case of conditionals and loops could be obtained by using profiling information. However, in this work, we use the often-used simplifying assumption that branch probability is 0.5 for an if-statement, and that the loop bounds are always known at compile time.
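The greedy growth step of the clustering procedure can be sketched as follows. This is an illustrative reading, not the authors' exact procedure: it assumes that S(u) is the sum of edge weights from u to the variables not yet clustered, that ties are broken by input order, and that the edge weights in the usage example are hypothetical.

```python
def perform_clustering(variables, weight, M):
    """Greedily partition variables into clusters of at most M words.

    weight -- dict mapping frozenset({u, v}) to a Closeness Graph edge weight
    """
    def e(u, v):
        return weight.get(frozenset((u, v)), 0)

    remaining = list(variables)        # X: variables not yet clustered
    clusters = []                      # F: clusters formed so far
    while remaining:
        # Seed: variable with the largest S(u), assumed here to be the sum of
        # its edge weights to the other unclustered variables.
        seed = max(remaining,
                   key=lambda u: sum(e(u, v) for v in remaining if v != u))
        cluster = [seed]
        remaining.remove(seed)
        while len(cluster) < M and remaining:
            # Grow: pick the variable with maximum T, the sum of its edge
            # weights to variables already in the cluster.
            best = max(remaining, key=lambda x: sum(e(u, x) for u in cluster))
            cluster.append(best)
            remaining.remove(best)
        clusters.append(cluster)
    return clusters


# Hypothetical Closeness Graph on six scalar variables.
w = {
    frozenset("ab"): 3, frozenset("bc"): 2, frozenset("ac"): 1,
    frozenset("de"): 4, frozenset("ef"): 1, frozenset("cd"): 1,
}
clusters = perform_clustering("abcdef", w, M=3)
print(clusters)
```

Removing a clustered variable from `remaining` plays the role of deleting its edges, so the seed of the next cluster is chosen from updated S values.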
The Cluster Interference Graph
After grouping the variables into clusters of size M, we build an Interference Graph (IG) of the clusters, which represents the desirability of storing clusters in memory so that they do not map into the same cache line. Each node in the Interference Graph represents one cluster of variables obtained from procedure PerformClustering. A high edge weight between two nodes indicates a large number of conflict misses in the data cache, if the respective clusters were to map into the same cache line. We first convert the Variable Access Sequence A into a Cluster Access Sequence by renaming each node u in the sequence by the cluster C, where u ∈ C. We then construct the Cluster Interference Graph by first creating a node in IG for each cluster in F (the set of clusters), and then assigning edge weight e(u, v) between nodes u and v to be the number of times the accesses to clusters u and v alternate along the execution path. E.g., [...] → x → z. This results in an Interference Graph IG, with edges e(x, y) = 2, e(x, z) = 1, and e(y, z) = 1. The pair of nodes x and y alternate twice in the execution path, due to the edges x → y and y → x, causing e(x, y) = 2. The other pairs change order only once. The composition rules to be followed for conditionals and loops are identical to those used for building the Variable Access Sequence (Section 3.1).
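For a straight-line Cluster Access Sequence, the alternation count can be sketched as below. The sequence x → y → x → z is an assumption chosen so that the counts agree with the edge weights quoted above. The function name `interference_graph` is hypothetical, and loops and conditionals are not handled.

```python
from itertools import combinations


def interference_graph(cluster_sequence):
    """Count, for every pair of clusters, how often their accesses alternate."""
    clusters = sorted(set(cluster_sequence))
    weights = {}
    for u, v in combinations(clusters, 2):
        # Keep only the accesses to u or v, then count each change
        # from one of the two clusters to the other.
        projected = [c for c in cluster_sequence if c in (u, v)]
        switches = sum(1 for p, q in zip(projected, projected[1:]) if p != q)
        if switches:
            weights[(u, v)] = switches
    return weights


ig = interference_graph(["x", "y", "x", "z"])
print(ig)
```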
Memory Location Assignment
The final assignment of variables to memory locations should consider the clustering and conflict-penalty information in the Interference Graph. To minimize the conflict misses in the data cache during code execution, we need to ensure that cluster pairs with large edge weights do not map to the same cache line when we assign memory locations.
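A minimal sketch of this objective and of one greedy placement strategy for a direct-mapped cache with k lines follows. It is an illustration under assumptions, not the authors' procedure: memory is viewed as pages of k line-sized slots, clusters are ordered by the sum of their incident edge weights, ties go to the lowest line index, and the three-cluster graph is hypothetical.

```python
def assignment_cost(weight, line_of):
    """Sum of edge weights over cluster pairs that share a cache line."""
    return sum(w for (u, v), w in weight.items() if line_of[u] == line_of[v])


def assign_clusters(clusters, weight, k):
    """Greedy placement: heaviest clusters first, each on its cheapest free line."""
    def e(u, v):
        return weight.get((u, v), weight.get((v, u), 0))

    def s(u):
        # Sum of incident edge weights, used here as the priority of a cluster.
        return sum(e(u, v) for v in clusters if v != u)

    order = sorted(clusters, key=s, reverse=True)
    placement = {}                     # cluster -> (page, cache line)
    for idx, u in enumerate(order):
        page = idx // k                # every k clusters fill one page of memory
        taken = {ln for (pg, ln) in placement.values() if pg == page}

        def cost(line):
            # Conflict weight against clusters already mapped to this cache line.
            return sum(e(u, v) for v, (_, ln) in placement.items() if ln == line)

        line = min((ln for ln in range(k) if ln not in taken), key=cost)
        placement[u] = (page, line)
    return placement


# Hypothetical interference graph on three clusters, cache with k = 2 lines.
weight = {("p", "q"): 5, ("p", "r"): 1, ("q", "r"): 2}
placement = assign_clusters(["p", "q", "r"], weight, k=2)
lines = {c: ln for c, (_, ln) in placement.items()}
total = assignment_cost(weight, lines)
print(placement, total)
```

Here cluster r must share a line with either p or q, and the sketch places it with p because that pair has the smaller edge weight.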
We define the cost of a memory assignment (C) as follows:
$$C = \sum_{x, y \in V(IG)} e(x, y) \times P(x, y)$$
where e(x, y) is the edge weight, and P(x, y) = 1 or 0, depending on whether the memory locations for x and y map into the same cache line or not. Figure 3(b) shows the effect of a sample memory assignment for an IG with six clusters (Figure 3(a)), on a cache with four lines. We have P(a, e) = P(b, f) = 1 and C = e(a, e) + e(b, f) = 1 + 3 = 4. We solve the following problem: Find an assignment of clusters in IG to memory locations, such that the assignment cost C is minimized. This problem can be easily shown to be NP-hard, by using a reduction from the Graph Colouring problem. We present below a greedy O(n^2) heuristic (where n is the number of clusters) to solve the Cluster Assignment problem for a cache of size k that is similar to the PerformClustering procedure.
    Delete u from X
For the example IG in Figure 3(a), the page size is k = 4 lines. When we apply procedure AssignClusters on this example, we first sort the vertices in decreasing order of the sum of their incident edge weights: f(13), c(11), e(9), b(8), a(6), and d(5). Clusters f, c, e, and b are placed into the first page P0. While attempting to assign a into the second page P1, we find: cost(a, 0) = 2 (since e(a, f) = 2), cost(a, 1) = 1, cost(a, 2) = 1, and cost(a, 3) = 1. Thus, we choose a line within page P1 that minimizes the cost, and assign a to line 1. Cluster d has: cost(d, i) = 1 for all i, so we assign line 0 of P1 to d. The final assignment is: P0: (0 - f; 1 - c; 2 - e; 3 - b) and P1: (0 - d; 1 - a).
For an n-way associative cache, we use the same definition of cost, except that the cost remains zero for the first n clusters assigned to the same cache line.
Memory Organization for Array Variables
We solve the memory organization problem for arrays by first constructing the Interference Graph among arrays in the code, and then assigning memory addresses to each array by minimizing the possibility of cache conflicts with other arrays in the code.
Constructing the Interference Graph
In the case of arrays, we note that if two arrays A and B are accessed repeatedly within a loop, then there is a possibility that the accesses to A and B might cause conflict misses in the data cache (Section 2). The Interference Graph (IG) of arrays represents the possibility of cache conflicts between the arrays in the code.
We first create a node for each array in the specification. Next, we determine the arrays that are repeatedly accessed in each loop, and add the loop bound B_l to the edge weights between each pair of arrays. This signifies that a total of B_l cache conflicts could possibly arise between each pair of arrays during execution of this loop. The resulting IG gives us a criterion to prioritize the order in which we assign memory addresses to arrays. The complexity of this procedure is O(Ln^2), where L is the number of loops, and n is the number of arrays in the code. [...] 15 to e(a, b), e(a, c), and e(b, c). The IG helps identify the order in which the memory address assignment to arrays should be done.
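The construction just described can be sketched in a few lines. The input format (a list of loop bounds paired with the arrays accessed in each loop) and the function name `array_interference_graph` are assumptions made for illustration. The first loop mirrors the bound of 15 on arrays a, b, and c mentioned above, and the second loop is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def array_interference_graph(loops):
    """Build edge weights from (loop_bound, arrays accessed in that loop) pairs."""
    weight = defaultdict(int)
    for bound, arrays in loops:
        # Every pair of arrays accessed in the same loop may conflict
        # up to `bound` times, so the bound is added to their edge weight.
        for u, v in combinations(sorted(set(arrays)), 2):
            weight[(u, v)] += bound
    return dict(weight)


ig_arrays = array_interference_graph([
    (15, ["a", "b", "c"]),   # loop with bound 15 touching a, b, and c
    (8, ["a", "b"]),         # hypothetical second loop touching only a and b
])
print(ig_arrays)
```

The two nested iterations, over loops and over array pairs, correspond to the O(Ln^2) bound stated above.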
Memory Assignment to Array Variables
In solving the problem of memory assignment of array variables, we assume that the loop bounds and array dimensions are known at compile time. We also assume that a unidimensional array of N elements is stored in N consecutive memory locations, and multidimensional arrays are stored in row-major format. (The issue of selection of a good storage technique for multi-dimensional arrays is addressed in [12].) The memory assignment problem is NP-hard, because the degenerate case, when the array dimension = 1, itself happens to be NP-hard (Section 3.4).
² The problem of clustering of variables to avoid compulsory misses is not relevant in the case of arrays, as most arrays are usually much larger than a cache line, and often much larger than the cache itself.
From the Interference Graph, we use the S(u) values for each node u (defined in Section 3.2) to determine the order of assignment of arrays. S(u) signifies the relative importance of the nodes, because a higher S(u) indicates that u could possibly be involved in many cache conflicts.
Central to the technique we use for memory assignment of arrays is a computation of the cost of assigning an array (u) to begin at a specific memory address A. This cost is equal to the expected number of cache conflicts with all arrays that have already been assigned, if u were to begin at A. Note that if the first element of u is fixed at address A, all the other elements of u are automatically assigned their respective locations.
To determine whether two specific accesses to two arrays in the same loop will map into the same cache line (i.e., cause a cache conflict miss), we perform a symbolic evaluation of the equality checking function. Two memory locations X and Y will map into the same cache line in a direct-mapped cache with k lines (M words per line), if the quantity:
$$\left\lfloor \frac{X}{M} \right\rfloor - \left\lfloor \frac{Y}{M} \right\rfloor$$
is an integral multiple of k, which resolves to:
$$nk - 1 < \frac{X - Y}{M} < nk + 1$$
where n is any integer. Clearly, the symbolically evaluated expression (X - Y)/M might not always reduce to a constant, because X and Y could be arbitrary functions of any variable in the code. If the expression does not resolve to a constant, then we conclude that the two arrays do not conflict.
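For concrete addresses, the same-line condition can be checked numerically rather than symbolically. The sketch below assumes word addresses and a loop in which both arrays are indexed by the same induction variable i. The function names and the sizes M = 4 and k = 16 are hypothetical.

```python
def same_cache_line(x, y, M, k):
    """True if word addresses x and y map to the same line of a
    direct-mapped cache with k lines of M words each."""
    return (x // M - y // M) % k == 0


def count_line_collisions(base_u, base_v, bound, M, k):
    """Number of iterations i in which u[i] and v[i] fall on the same line."""
    return sum(same_cache_line(base_u + i, base_v + i, M, k)
               for i in range(bound))


M, k = 4, 16   # hypothetical: 16 lines of 4 words, i.e. a 64-word cache

# Array bases exactly one cache size apart: every iteration collides.
aligned = count_line_collisions(0, 64, 16, M, k)

# Shifting the second array by one line (M words) removes all collisions.
shifted = count_line_collisions(0, 68, 16, M, k)

print(aligned, shifted)
```

In the aligned case (X - Y)/M has magnitude 16, which is exactly k, so it lies within one of a multiple of k as the inequality requires.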
To formalize a strategy to perform the memory assignment of arrays, we first describe the cost function AssignmentCost that returns the expected number of conflicts when an array is tentatively assigned a specific location. The worst case complexity of procedure AssignArrayAddresses could be O(nkP), where n, k, and P are the number of nodes (arrays), cache lines, and total array accesses in the specification respectively. However, in real behaviors, we have observed that the loop j = 0 ... h - 1 tends to converge very soon (typically less than 2 or 3 iterations), because the number of different array elements that are accessed in inner loops of code is usually small and finite.
This completes the memory address assignment of scalar and array variables in the behavior.
Experiments and Results
We now describe the experiments performed on several benchmark examples to validate our memory organization strategy. Our experimental platform was the CW4001 embedded processor core simulator from LSI Logic running on a SUN SS-5, using a sample configuration of: 1 KB instruction cache; 256 byte data cache; Line size = 4 words; Array dimension = 16 (for 1-dimensional) and 16 x 16 (for 2-dimensional); Memory latency = 5 cycles. The latency number is an aggressively low value for the fastest memories. Since the performance difference widens for higher memory latencies, the improvements we report are conservative.
Column 1 of Table 1 shows the example designs on which we performed our experiments, and Column 2 gives the number of scalar and array variables in each. All the examples are benchmark code kernels used in image processing, telecommunication, and other applications in the DSP [...] In almost all the examples, we notice that the difference in the hit ratios is substantial (46% on average). Columns 5 and 6 show the execution time (in thousands of cycles). There is a significant reduction in the total cycle time for most of the applications (Column 6). The total cycle time was reduced by an average of 21.9% over all the examples. This reduction is smaller than the improvement in the data cache hit ratio because the instruction cache performance remains unchanged.
Conclusions and Future Work
Code generation for embedded processors reveals the scope for many optimizations that have been hitherto unaddressed in traditional compilers. Important features that can be exploited while generating code for embedded processors are the parameters of the data cache. In this paper, we have demonstrated how a careful data layout strategy that takes into account the parameters of the data cache, such as cache line size and cache size, could induce significant performance improvements in the execution of embedded code.
We described techniques for clustering variables to minimize compulsory cache misses, and for solving the memory assignment problem with the objective of minimizing conflict cache misses. The experiments we performed on standard benchmark code kernels from the DSP and scientific domains indicate that significant performance improvements result from our memory assignment techniques. We noticed an average improvement of 46% in the data cache hit ratios for the benchmark examples, with the generated code executed on the simulator for the CW4001 embedded processor core from LSI Logic.
In the future, we plan to integrate our memory assignment techniques with reordering of the memory accesses in the code. Reordering holds out the possibility of obtaining further improvements in performance through reduction in both compulsory and conflict misses in the data cache.
