Abstract: Arrays in behavioral specifications that are too large to fit into on-chip registers are usually mapped to off-chip memories during behavioral synthesis. We address the problem of system power reduction through transition count minimization on the memory address bus when these arrays are accessed from memory. We exploit regularity and spatial locality in the memory accesses and determine the mapping of behavioral array references to physical memory locations to minimize address bus transitions. We describe array mapping strategies for two important memory configurations: all behavioral arrays mapped to a single off-chip memory and arrays mapped into multiple memory modules drawn from a library. For the single memory configuration, we describe a heuristic for selecting a memory mapping scheme to achieve low power for each behavioral array. For mapping into a library of multiple memory modules, we formulate the problem as three logical-to-physical memory mapping subtasks and present experiments demonstrating the transition count reductions based on our approach. Our experiments on several image processing benchmarks show power savings of up to 63% through reduced transition activity on the memory address bus in the single memory case. We also observe a further transition count reduction by a factor of 1.5-6.7 over a straightforward mapping scheme in the multiple memories configuration.
I. INTRODUCTION

POWER minimization efforts at the behavioral level usually attempt to reduce signal transition count, particularly off-chip transitions, because dynamic power dissipation accounts for a significant fraction of the total power dissipation in CMOS circuits [1]. In this paper, we attempt power reduction in memory-intensive applications by analyzing the access patterns of behavioral arrays in the specification and organizing the arrays in memory so as to minimize transitions on the memory address buses. Examples of memory-intensive applications that exhibit regular access patterns exploitable by our proposed technique are digital signal processing, and image and audio applications such as filters, relaxation algorithms, and compression algorithms such as discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT).
Given that off-chip capacitances are three orders of magnitude larger than typical on-chip capacitances [2], we can effect significant power savings by reducing the switched capacitance of the off-chip address bus drivers through reduction of transition activity on the memory address bus. Furthermore, reduced activity on the address bus also leads to reduced activity in the memory address buffers and decoding circuitry. Studies have shown that power dissipation in the address decoder and address buffers of typical memory chips constitutes a significant portion (up to 50%) of the power consumed in the memory chip [3]. Hence, design techniques that decrease power dissipation in this part of the memory will significantly reduce the overall power dissipation of the application.
In this paper, we study the impact of different memory address mapping strategies on power dissipation by counting the number of bit transitions of the address bus. We are able to guarantee a decrease in address bus transition activity, as we derive the memory access patterns from the behavior at compile time; our mapping strategy can thus result in additional power minimization after other behavioral power minimization techniques have been applied.
In Section II, we outline some related research in this field. In Section III, we define the problem and, in Section IV, we describe our approach for mapping behavioral arrays into a single physical memory. In Section V, we present an algorithm for power-efficient mapping of behavioral arrays into multiple physical memories. We report several observations on the mapping techniques from the experiments we conducted on examples from the image processing applications domain in Section VI. Finally, Section VII concludes this paper.
II. PREVIOUS WORK
Previous work on power minimization at the behavioral level has considered architectural transformations [4]. Techniques for incorporating circuit transition activity-related cost functions into high-level synthesis were presented by Musoll and Cortadella [5], [6]. Dasgupta and Karri [7] have reported algorithms for scheduling and binding in order to minimize on-chip data bus transitions. Iterative improvement techniques for scheduling and module allocation, based on switched capacitance matrices, were reported by Raghunathan and Jha [8]. Martin and Knight have incorporated profile-based techniques for scheduling and module assignment for low power in their power-profiler tool [9]. However, none of these works addresses the impact of memory accesses.
Compiler techniques dealing with reduction in the number of memory accesses during program execution [10] are directly applicable to the problem of power minimization since reduction in the number of memory accesses directly implies a reduction in the number of switchings of the memory address bus, data bus, and memory circuitry.
Wuytack et al. [3] present transformations on the initial specification into an optimized form for reducing the number of memory accesses. They report on the effect of caching schemes and reordering of loops on power dissipation. Their work focuses on reducing the number of memory accesses only and does not consider further minimization of transition count for the given number of memory accesses.
Work on reducing transition counts/bus activity has been done at several levels. Approaches to reducing transition counts in arithmetic circuits with the ultimate objective of reducing power are presented by Ercegovac and Lang [11] . They modify a conventional sign-detection circuit to reduce the average number of transitions by eliminating unnecessary transitions. Related work on exploiting data encoding to reduce transition activity was presented by Stan and Burleson [12] , who use a data-encoding strategy to decrease the number of input/output (I/O) transitions at the expense of a moderate increase in on-chip transitions, by analyzing data streams on I/O pins. This paper focuses on the reduction of power dissipation in off-chip drivers and in the memory's decoding logic by reducing the number of transitions on the memory address bus, and is an extension of the compiler techniques for reducing the number of memory accesses. We exploit regularity and spatial locality in the memory accesses and determine the mapping of behavioral array references to physical memory locations to minimize address bus transitions. Our technique can be applied in conjunction with previous approaches to obtain further power minimization at the behavioral level.
III. PROBLEM DEFINITION
The goal of this paper is to assign arrays in a behavioral specification to memory locations so that the transition count on the memory address bus, when these arrays are accessed, is minimized. The domain in which our paper is based includes memory-intensive applications such as image processing and digital signal processing (DSP), in which large arrays are processed. By memory intensive, we mean that the accesses to memory are frequent enough that optimizing the transition count of the address bus is a worthwhile effort.
The target implementation of the specification could be either software compilation or synthesis into hardware since the issue of power dissipation is relevant to both cases, and the same strategy of minimizing transition count on the external memory address bus is applicable. In this paper, we only examine the issues involved in hardware synthesis of the specification. Fig. 1 shows the model we assume for a typical memory-intensive system, synthesized into an application-specific integrated circuit (ASIC) (consisting of datapath and control blocks) and a memory. Memory accesses result in power dissipation in both these components. Power dissipated in the ASIC is the sum of power dissipation in its constituents: the datapath, control, and off-chip drivers. Reduction in transition count of the address bus also entails power savings in the off-chip drivers of the ASIC. The computed address value in the ASIC is assumed to be latched before being fed to the off-chip driver.
Power consumption during memory accesses within a memory can be roughly classified into two categories: data-related and address-related. Data-related power is the component of memory power contributed by the process of transferring the data between the data bus and memory location where the data is stored, and the component dissipated in the refresh circuitry (if any) of the memory. Address-related power refers to the component contributed by the address buffers and decoders. Since data-related power is dependent on the specific data involved (over which we have no control), we have not studied the minimization issues involved in this component. We have focused instead on the address-related power since the memory access sequences are known at the specification stage. Our approach exploits these memory access sequences to guarantee a reduction in the memory address bus transitions, thereby leading to power reduction during memory accesses.
Our technique takes as input the behavioral specification that is to be synthesized; the output is the assignment of arrays in the specification to physical addresses in memory. For each array, we output the expression for accessing an arbitrary memory element as a function of its behavioral array index and the array dimensions that corresponds to the best memory mapping strategy for the array.
We first show how the mapping is performed with a single single-port memory as the target, and later describe in Section V the extension to multiple physical memories. To simplify the discussion, we assume that every array element occupies one memory word, although this assumption is not necessary in the analysis.
IV. MAPPING OF ARRAYS TO A SINGLE MEMORY
We first assign the arrays in the specification to locations in a logical memory, and employ a Gray code converter (GCC) in the address generator (Fig. 1) to map logical memory addresses to physical memory locations. The GCC converts a logical address into Gray code [13] , thereby ensuring that access of consecutive logical memory addresses results in the transition of exactly one bit on the memory address bus. The GCC helps bring down the address bus transition count to a minimum, but our memory mapping strategies are valid techniques for ensuring reduction in transition count even in the absence of the GCC from the architecture. We illustrate our approach on a sample behavior, shown in Fig. 2(a) . Fig. 2(b) shows a sequential mapping of the three arrays , , and . In this case, the sequence of logical memory locations accessed is: 0, 3, 6, 1, 4, 7, 2, 5, and 8. For the same example, we show in Fig. 2(c) , an interleaved organization of arrays into logical memory, where the array elements have been stored in the order in which they are accessed in the loop. This mapping, when converted into Gray code, ensures the transition of exactly one bit on the memory address bus between consecutive accesses, resulting in a significant decrease in address bus transitions; for instance, the transition count decreases from 18 [see Fig. 2 
A. Array Mapping Schemes
In this paper, we concentrate on memory mapping strategies for two-dimensional arrays. 1 We consider the following three strategies when we attempt to find an effective mapping of arrays into memory.
1) Row Major:
A simple way of mapping a logical array to physical memory (unless the order is imposed by the language) is to store the elements in row-major form [14] , i.e., the elements of the first row are placed in consecutive memory locations in order of increasing column index. This is followed by the elements of the second row in the same order, and so on.
2) Column Major: In column-major mapping [14], the elements of the two-dimensional array are stored column-by-column.
3) Tile Based: This mapping style first partitions the array into smaller rectangular tiles. Here, the elements of a row (or column) of tiles are stored in consecutive memory locations, with the elements of the tiles themselves being stored in either row or column major format (Fig. 3). Similar ideas have been used in the context of compilers, where cache reuse is improved by dividing loop iteration space into tiles and transforming the loop nesting structure to iterate over the tiles [10]. Block decomposition of arrays in a multiprocessor environment [10] is also based on a similar concept.
1 The simple interleaving strategy (discussed in Section IV-D) used in Fig. 2 seems to work quite well for one-dimensional arrays. Multidimensional arrays are handled by a straightforward extension to the techniques presented here.
B. Extraction of Access Pattern
Fig. 4(a) shows a simplified version of the code kernel for a successive over-relaxation (SOR) algorithm [15], which is often used in the domain of image-processing applications. The plus-shaped contour in Fig. 4(b) shows the basic access pattern of the elements of array in the inner loop. Fig. 4(c) shows that this shape of accesses moves across by two columns to the right as we iterate through the inner loop. The second iteration of the outer loop causes the first pattern of the previous loop to move vertically down by one row. The remaining inner loop accesses are as before.
The one element that is common to successive patterns can be stored in a register instead of being accessed from memory again in the next iteration [16] ; hence, the effective access pattern for this example is as shown in Fig. 4(d) . This pattern determines the dimensions of the tile to be used in the 
C. Analysis of Mapping Schemes
We now examine the effect on the number of address bit transitions in the three mapping styles for the SOR example of Fig. 4 . This analysis helps determine the mapping style that minimizes the total transition count.
We use the observation that, in general, there is a low Hamming distance between elements in the same tile or adjacent tiles, and a possibly high Hamming distance between elements in distant tiles. We define a maximal transition as occurring when two logical addresses with a large difference are accessed in succession. A minimal transition occurs when this difference is small. In this case, large means comparable to the dimension of the array. For example, we treat all consecutive accesses to elements in the same tile as minimal transitions.
We observe in Fig. 4 (e) that row-major mapping causes one minimal transition (and, hence, three maximal transitions) per inner loop iteration since and are closely located, but column-major mapping causes two minimal transitions (i.e., two maximal transitions) owing to , , and being consecutive. In the tile-based mapping, the first iteration of the outer loop causes all minimal transitions since the elements accessed are all well aligned into single tiles. In the second iteration, we have in one tile, but the other three in a (distant) vertically adjacent tile. We see that in each inner iteration, we have two maximal transitions occurring when we switch back and forth between the two tiles across which the four elements are located. The third iteration behaves like the second, the fourth like the first, etc. Thus, tile-based mapping performs marginally better than column major, while column major is clearly preferable to row major.
To generalize this idea, we show in Fig. 5 part of an array organized into tiles (only tiles 1-4 are shown). Rectangle R encloses the contour of elements that are accessed in a particular iteration. We observe that no matter where rectangle R is placed in the grid of tiles, there will be a maximum of two maximal transitions while accessing all the elements in R. This is ensured if we access the elements in region first, followed by those in . Accesses within (or ) are from the same (or neighboring) tiles, and lead to minimal transitions only. The two maximal transitions occur when we switch back and forth between and . 2 However, in some special cases, other mapping strategies may work almost equally well. For example, if the tile has only two columns, then the same effect of two maximal transitions for successive accesses in the inner loop is also observed in the case of column-major mapping since we can access the elements in one column, followed by those in the other. The exception is that, during certain iterations of the outer loop ( in the example discussed, at intervals of three), tile-based mapping will not incur any maximal transition, whereas column major will incur two in every iteration. This, however, does not cause a significant difference, as is shown by our experiments. Further, since the implementation of tile-based mapping has an associated overhead (discussed in Section IV-E), we prefer the simpler column-major mapping in such cases. The same rule applies if the tile has two or fewer rows.
In conclusion, we observe that if column-major mapping is used when the tile has columns, there will be maximal transitions in every inner loop iteration, assuming at least one element from every column is accessed. Similarly, row-major mapping would entail maximal transitions in an -row tile. In comparison, tile-based mapping leads to a maximum of two maximal transitions, independent of the tile size. We thus choose to use tile-based mapping when the tile has both dimensions greater than two. Otherwise, we use row or column major as appropriate. The above observation leads to Heuristic 1 (Fig. 6 ) for selecting a mapping scheme for a two-level nested loop involving a two-dimensional array.
The reason for the condition " " is that, if the increment of the outer loop is the same as the tile dimension, then every iteration of the inner loop will access elements from a single tile. This obviously makes tile-based mapping a preferable style, since all the transitions will be minimal in every iteration. If condition (1) fails, the extracted tile has dimensions or . Condition (2) ensures that, for the first two cases, column-major mapping is selected, and row-major mapping is selected for the other two. If , we arbitrarily select row-major mapping. Our experiments, described in Section VI-A, show that the heuristic does predict the best mapping strategy for our experimental set of benchmarks from the image-processing domain.
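The selection rule stated in the text can be illustrated with a short Python sketch. It is not a reproduction of Heuristic 1 in Fig. 6, whose exact conditions are not available here; the function name `choose_mapping` is hypothetical. The sketch assumes only what the prose says: tile-based mapping when both tile dimensions exceed two, column-major when the tile has fewer columns than rows, row-major otherwise (including the arbitrary tie-break). The outer-loop increment condition is deliberately omitted.

```python
def choose_mapping(tile_rows, tile_cols):
    """Pick a mapping style from the extracted tile dimensions.

    Illustrative rule only (an assumption, not the exact Heuristic 1):
    - both dimensions > 2: tile-based (at most two maximal transitions)
    - otherwise: the style that follows the shorter tile dimension,
      with row-major chosen arbitrarily on ties.
    """
    if tile_rows > 2 and tile_cols > 2:
        return "tile"
    if tile_cols < tile_rows:
        # Few columns: walk down each column, then switch columns.
        return "column-major"
    return "row-major"


# A 3-row by 2-column tile (hypothetical dimensions) favors column-major.
print(choose_mapping(3, 2))
print(choose_mapping(4, 4))
```

The design choice mirrors the stated preference for the simpler row or column mapping whenever tile-based mapping offers no clear advantage, since tile-based address generation carries extra hardware overhead.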
D. Interleaving of Multiple Arrays
When multiple arrays have similar access patterns, we can interleave their storage to maintain spatial locality during accesses. Interleaving two arrays and means storing and in consecutive logical memory locations, followed by , , etc. (assuming row-major mapping). Referring back to the SOR algorithm of Fig. 4(a) , we note that the elements , , , , , and are all accessed exactly once in each inner loop iteration, making them candidates for interleaving.
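As a rough illustration of why interleaving helps, the following Python sketch compares a sequential and an interleaved placement of two hypothetical one-dimensional arrays `A` and `B` of length 4 that are read as `A[i]`, `B[i]` in each loop iteration. The array names, the length, and the helper names are assumptions made for this example; addresses are Gray coded before reaching the bus, as in the architecture of Fig. 1.

```python
def gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def bus_transitions(addresses, encode=lambda a: a):
    """Total number of address-bus bit flips over a sequence of accesses."""
    codes = [encode(a) for a in addresses]
    return sum(bin(x ^ y).count("1") for x, y in zip(codes, codes[1:]))


N = 4  # hypothetical array length

# Sequential placement: A at addresses 0..3, B at addresses 4..7.
sequential = [base + i for i in range(N) for base in (0, N)]

# Interleaved placement: A[0], B[0], A[1], B[1], ...
interleaved = [2 * i + k for i in range(N) for k in (0, 1)]

print("sequential :", sequential, bus_transitions(sequential, gray))
print("interleaved:", interleaved, bus_transitions(interleaved, gray))
```

For this small case the Gray-coded bus sees 15 bit flips with the sequential placement and 7 with the interleaved one, because interleaving turns the access stream into consecutive logical addresses.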
E. Evaluation of Mapping Schemes
Although we have demonstrated a reduction in memory address bus transitions, this could be offset by transition activity in the additional hardware overhead introduced: the GCC and the more complex address generator (Fig. 1) in tile-based mapping. To evaluate the effectiveness of the mapping schemes, we first establish a power dissipation relationship between on- and off-chip transitions. We then determine the overhead in the ASIC's address generator, and the reduction in the memory's decoders. Using these computations, we present the overall power reduction in Section VI.
To relate the power dissipation between an on-chip transition (e.g., activity in the address generation logic) and an off-chip transition (e.g., off-chip driver driving an address bus wire), we use an example transistor with from a 0.7-μm CMOS process as a typical transistor. Although the capacitance values of n and p transistors are different in a CMOS process (due to differing values), we do not make that distinction in our analysis. Instead, we assume that is a suitable average value for any type of transistor. Such a transistor has a gate input capacitance of approximately 20 fF [2]. If larger capacitances are driven, then we adjust the equivalent transition count to reflect this increased capacitance so that the equivalent transition count represents the effective number of transitions with a typical transistor gate as the load. If, e.g., a gate drives an inverter (with two transistors), the load on the output is the sum of the gate capacitances of the two transistors it drives; hence, a change in input to the inverter causes an equivalent transition count of two. Similarly, we use the value of 10 pF as the capacitive load on a typical off-chip driver [2]. This gives the value of 10 pF / 20 fF = 500 as the equivalent transition count corresponding to a bit transition in an off-chip driver.
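The normalization described above amounts to dividing the driven capacitance by the gate capacitance of the typical transistor. A minimal Python sketch follows, using the 20 fF and 10 pF figures from the text; the function and constant names are hypothetical.

```python
GATE_CAP_FF = 20          # typical transistor gate capacitance, in fF
OFFCHIP_LOAD_FF = 10_000  # 10 pF load on an off-chip driver, in fF


def equivalent_transitions(load_ff, transitions=1):
    """Express transitions on a given load in units of one typical gate load."""
    return transitions * load_ff / GATE_CAP_FF


# An inverter input presents two transistor gates as load.
print(equivalent_transitions(2 * GATE_CAP_FF))   # 2.0
# One bit flip on an off-chip address line.
print(equivalent_transitions(OFFCHIP_LOAD_FF))   # 500.0
```

Under this normalization, a single off-chip bit flip is weighted the same as 500 on-chip gate transitions, which is why modest on-chip overhead can be acceptable.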
In a typical implementation of the address generator, we have the following index expressions for generating the logical address for an arbitrary element : Row-major ; Column-major ; Tile-based where the tile dimensions are and is the start address of array of dimensions ( and range from 0 to ). Note that the expressions above need to be evaluated only if an arbitrary element is accessed outside of any regular loop structure. However, since most of the computation (and, hence, power dissipation) occurs within loops, we can use techniques such as strength reduction and induction variable elimination [14] to minimize the actual number of operations. This means, for example, that the value is never computed by a multiplication operation, but is implemented by addition of to a running counter every time is incremented in a loop.
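For orientation, the sketch below gives textbook forms of the three address computations in Python, together with a strength-reduced row-major traversal that replaces the multiplication by a running counter. These are assumptions for illustration: the row-major and column-major forms are the standard ones, while the tile-based layout shown (tiles stored row by row, elements inside a tile in row-major order, array dimensions divisible by the tile dimensions) is one common choice and may differ from the exact expression used by the authors. All function and parameter names are hypothetical.

```python
def row_major(i, j, rows, cols, base=0):
    """Logical address of element (i, j) in row-major order."""
    return base + i * cols + j


def column_major(i, j, rows, cols, base=0):
    """Logical address of element (i, j) in column-major order."""
    return base + j * rows + i


def tile_based(i, j, rows, cols, tile_rows, tile_cols, base=0):
    """Logical address of element (i, j) in an assumed tiled layout."""
    tiles_per_row = cols // tile_cols
    tile_index = (i // tile_rows) * tiles_per_row + (j // tile_cols)
    offset = (i % tile_rows) * tile_cols + (j % tile_cols)
    return base + tile_index * tile_rows * tile_cols + offset


def row_major_addresses(rows, cols, base=0):
    """Yield row-major addresses in loop order without any multiplication."""
    row_start = base
    for _ in range(rows):
        for j in range(cols):
            yield row_start + j
        row_start += cols  # running counter replaces i * cols


print(list(row_major_addresses(2, 3)))
print(tile_based(3, 0, rows=6, cols=4, tile_rows=3, tile_cols=2))
```

The generator shows the effect of strength reduction and induction variable elimination: inside the loop, each address is obtained from additions only.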
We now illustrate a possible hardware implementation of the inner loop access sequences for array in the SOR example of Fig. 4 using row-major and tile-based mapping. Fig. 7(a) shows the elements accessed in two consecutive iterations of the inner loop. The four elements of array accessed in the current loop iteration are at logical addresses , , , and . The four elements accessed in the previous iteration are at addresses , , , and . For row-major mapping, it is easy to see that , , , , and are related through the equations in Fig. 7(a) . Each address is generated from its preceding address, and we only need to maintain the most current address in a register. 4 We need only one addition operation to generate the address for every memory access in the loop [ is a constant known at compile time]. For tile-based mapping, Fig. 7(b) shows the computation of location in terms of . We have the following equations:
(Piecewise address-update equations for tile-based mapping; see Fig. 7(b).)
We present below a computation of the maximum effective transition count overhead and savings incurred in various parts of the system due to the mapping schemes.
1) Overhead in GCC:
In general, an -bit-wide GCC is implemented with XOR gates. The most significant bit (MSB) of the input becomes the MSB of the output, while the rest of the output bits are generated by XORing adjacent bits of the input [13] . A typical implementation of a two-input CMOS XOR gate uses eight transistors [1] . On examining this circuit, we notice that in an XOR gate, there can be a maximum equivalent transition count of six, which occurs when both inputs change [17] . It can be concluded that the number of internal transitions in the GCC is no more than , where is the number of transitions on its inputs (assuming the absence of glitches at the input).
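The converter structure described above can be modeled at the bit level in a few lines of Python. This is a behavioral sketch rather than a circuit description; the function name `gcc` and the explicit `width` parameter are assumptions. It passes the MSB through and XORs each pair of adjacent input bits, which is equivalent to the familiar expression `a ^ (a >> 1)`.

```python
def gcc(address, width):
    """Bit-level model of a Gray code converter for a width-bit address."""
    bits = [(address >> k) & 1 for k in range(width)]        # bits[0] is the LSB
    out = [bits[k] ^ bits[k + 1] for k in range(width - 1)]  # XOR adjacent input bits
    out.append(bits[width - 1])                              # MSB passes through
    return sum(b << k for k, b in enumerate(out))


# Consecutive logical addresses differ in exactly one output bit.
for a in range(7):
    flips = bin(gcc(a, 4) ^ gcc(a + 1, 4)).count("1")
    print(a, "->", a + 1, format(gcc(a + 1, 4), "04b"), flips)
```

Because each output bit depends on at most two input bits, the converter adds only one level of XOR logic between the address register and the off-chip drivers.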
2) Overhead in Selector:
The implementation of the address generator for the two mapping strategies is shown in Fig. 7(c) . The SELECTOR circuits in the two mapping styles are different. They use information from the controller to select the appropriate (constant) operand for addition. Fig. 8 shows the structure of the selector circuits for row-major and tile-based mapping. In both cases, a two-bit signal from the controller ( ) indicates which of the four addresses ( , , , ) is being accessed. In Fig. 8(a) , the combinational circuit generates the appropriate value for , so that the correct operand is chosen by the multiplexer. In Fig. 8(b) , the combinational circuit uses, along with the two-bit signal ( ), the current value of , which is generated by a simple three-state ring counter. For any specific value of , we have a "1" on either , or , depending on whether is "0," "1," or "2." The inputs select the appropriate operand for the adder.
Using the encoding 00, 01, 10, and 11 on controller output to indicate address , , , and , respectively, being accessed, we arrive at the logic equations , and . 5
a) Row major and column major: Since the row-major scheme is the default mapping scheme used by most conventional synthesis tools, and column-major mapping is symmetrical, there is no overhead involved in using these mapping schemes.
b) Tile based: In Fig. 7(c), the selector circuit consists of a 2-to-1 MUX for row major and a 4-to-1 MUX for tile based. We compute the maximum difference in transition count per memory access due to the 4-to-1 MUX as , where is the bit-width (details in [17]). 6 In addition, the selector for tile-based mapping also has a 3-b ring counter to keep track of . The combinational circuits that generate the MUX control signals have transistor counts of eight and 44 for row and tile mapping, respectively, a difference of 36. Assuming 50% of the transistors change output, 7 we get an equivalent transition count overhead of 18. The transition count in the ring counter can be ignored since two bits change state only once every outer loop iteration.
3) Memory Address Decoder Power: To evaluate the effect of minimizing address bus transitions on the memory address decoder power, we conducted an experiment comparing the transition count in the address decoder for two sequences: one in which the address bus takes sequential values from 0 to (for various values of ) and another in which the corresponding addresses are in Gray code sequence. Further details about the model of buffered address lines, split into row and column decoders, etc., can be found in [17]. The results show that the equivalent transition count in the decoder reduces by roughly 46%, i.e., a decrease of 50% in the address bus transition count (Gray code versus sequential) 8 led to a 46% decrease in internal transitions of the address decoder, demonstrating the close correlation between transition counts on the address bus and the decoder circuitry.
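The roughly 50% difference in bus transitions between the two sequences can be checked with a small Python sketch. It counts bit flips on the address bus only and does not model the decoder internals, for which the text refers to [17]; the function names are hypothetical.

```python
def total_transitions(codes):
    """Total bit flips between successive bus values."""
    return sum(bin(x ^ y).count("1") for x, y in zip(codes, codes[1:]))


def compare(width):
    """Bus transitions for a full sweep in binary and in Gray code."""
    sweep = list(range(2 ** width))
    binary = total_transitions(sweep)
    gray = total_transitions([a ^ (a >> 1) for a in sweep])
    return binary, gray


binary, gray = compare(8)
steps = 2 ** 8 - 1
print("binary:", binary, "average", binary / steps)
print("gray  :", gray, "average", gray / steps)
```

For an 8-bit sweep the sketch counts 502 flips for the binary sequence and 255 for the Gray-coded one, so the binary average approaches two flips per step while the Gray average is exactly one.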
V. MAPPING OF ARRAYS TO MULTIPLE MEMORIES
The previous section described the memory mapping techniques under the assumption that all the memory elements are bound to a single physical memory. In this section, we generalize our approach to the more realistic case of mapping to multiple physical memories drawn from a library. We wish to map arrays in a behavior to addresses in physical memory, where the physical memory now consists of memory modules of different sizes. We assume the availability of a library with an unlimited supply of each type of memory. Fig. 9 shows the model we assume for a typical memory-intensive system with a group of memory modules. We have a pair of data and address buses connecting the ASIC to each memory. The address value is assumed to be latched before being fed to the off-chip driver to avoid glitches. The configuration in Fig. 9 is common in performance-critical systems where the required data access rate exceeds the maximum access rate possible when only a single memory can be accessed at any instant of time (i.e., multiple memories connected to a single data bus).
5 Another possible implementation of the tile-based mapping is to maintain , and D = B + 3. In this implementation, the operands in the additions are no longer dependent on the value of i, but it requires two extra registers and additional control signals.
6 In brief, this is due to the fact that the data inputs to the MUX's are constants, and a bit slice of the 4-to-1 MUX can only be as complex as an XOR gate (6n transitions), whereas the 2-to-1 MUX can be as complex as an inverter (2n transitions).
7 This is usually an overestimation; note that the ring counter outputs never change during the inner loop execution.
8 Average bit-transition count for the sequential case is two [18]. For Gray code, the average is one (by definition).
The approach we take is summarized in Fig. 10 . The strategy consists of the following steps.
Step 1) Analyze the access patterns in the specification, independent of the size and number of available physical memories, and determine the best partitioning of the arrays into multiple logical array partitions based on how they are accessed in the loops. For example, in Fig. 10 , array is split into logical array partitions and . The partitioning ensures that different elements of an array accessed in the inner loop are located in different logical array partitions.
Step 2) Regroup the logical array partitions into logical memories based on criteria such as the possibility of interleaving the different array partitions, and nonoverlapping lifetimes of the arrays. In Fig. 10, logical array partitions and are merged into logical memory . After this step, we have all the logical memories that we need along with their required sizes.
Step 3) Map the logical memories into the available physical memories. The criterion here is minimization of the transition count overhead arising from mapping multiple logical memories to the same physical memory module. In Fig. 10 , logical memories and are assigned to physical memory , whereas is assigned to .
A. Splitting Into Logical Array Partitions
Regularity in behavioral access patterns, which is a common feature in most memory-intensive applications, especially those in the DSP/image-processing domain, allows us to extract the dimensions of the tiles (Section IV-B) for arrays accessed in loops. Fig. 11 shows an array , of dimensions 6 × 4, organized into four tiles. In the multiple memories scenario, for each array, we use the tile thus derived as the starting point and split the array into as many logical array partitions as there are array elements in the tile. Fig. 11 illustrates the splitting of array , whose tile contour has already been established, into multiple array partitions. We have six array partitions, one for each element of the tile. There are as many elements in each array partition as there are tiles in array . The rationale for this division is that if each array partition is mapped to a different logical memory, this ensures optimality in terms of bit transitions on the address bus. For example, if (in TILE1) is accessed in one loop iteration, we have , , and being accessed in subsequent iterations (because the access pattern remains constant). The partitioning in Fig. 11 results in all these four elements being mapped into the same logical partition, thereby ensuring that consecutive elements of the partition are accessed in each iteration. If the memory address is converted into Gray code, the address bus for each memory would have just one bit transitioning between consecutive iterations of the inner loop of the specification.
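The splitting step can be sketched in Python as follows. The example uses a 6 × 4 array with four tiles of six elements each, as in the text, but the tile shape of 3 rows by 2 columns and the function name `split_into_partitions` are assumptions, since the figure is not reproduced here. Each partition collects the elements that occupy the same position within their respective tiles.

```python
def split_into_partitions(rows, cols, tile_rows, tile_cols):
    """Group array elements by their position inside the tile.

    Returns a dict mapping (row offset, column offset) within the tile to
    the list of (i, j) array elements at that offset, one per tile.
    """
    partitions = {}
    for ti in range(0, rows, tile_rows):
        for tj in range(0, cols, tile_cols):
            for di in range(tile_rows):
                for dj in range(tile_cols):
                    partitions.setdefault((di, dj), []).append((ti + di, tj + dj))
    return partitions


parts = split_into_partitions(6, 4, tile_rows=3, tile_cols=2)
for offset, elements in parts.items():
    print(offset, elements)
```

With these dimensions the sketch yields six partitions of four elements each, matching the counts stated in the text: one partition per tile element and one entry per tile.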
B. Merging Logical Array Partitions Into Logical Memories
After partitioning the arrays in the specification to multiple logical array partitions, the next task is to group several array partitions into larger logical memories. This is because it might be prohibitively expensive to have a separate physical memory for each array partition. We assume that the physical memories available each have a single read and write port. We consider the following properties while merging the array partitions.
1) Interleaving:
It may be possible to interleave multiple arrays in the same physical memory without incurring any penalty in bit transitions (see Section IV-D).
2) Independence: If two arrays are never accessed in the same loop, they can be stored in the same logical memory.
3) Nonoverlapping Lifetimes: If two arrays with the same access pattern have nonoverlapping lifetimes, they can be stored in the same memory location since the same memory space can be reused for the two arrays, i.e., we need only one logical array partition for the two arrays.
Based on the above properties, we construct a compatibility graph G, in which each vertex represents an array partition and the presence of an edge indicates that the two partitions can be placed in the same physical memory (i.e., they are compatible). If the lifetimes of two arrays with the same access pattern are nonoverlapping, then we consider only one of the arrays in the analysis since the same memory space is used for storing both. If two arrays are either independent or can be interleaved, then we create an edge between the corresponding vertices.
After constructing the compatibility graph G, we apply a clique partitioning algorithm to divide the graph into subgraphs of array partitions. A clique is a fully connected subgraph of the original graph G; its significance is that all partitions in the clique can be placed in the same physical memory. The clique partitioning problem is known to be NP-complete [18]; thus, we employ an existing approximate algorithm [19] for this purpose. Each subgraph resulting from the clique partitioning corresponds to a logical memory. Fig. 12(a) shows an example behavior. The tiles corresponding to two of the arrays have three elements each [see Fig. 12(b)], which leads to three array partitions for each of these two arrays. Fig. 12(c) shows the corresponding graph for the array partitions. We use simple row-major mapping for four of the arrays. One array has edges to all other array partitions since it is accessed in a different loop; thus, the independence property allows it to be placed with any of the other array partitions. One possible partitioning of this graph into cliques is shown in Fig. 12(d). This leads to four logical memories, each consisting of clusters of array partitions.
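As an illustration of the merging step, the sketch below partitions a small compatibility graph into cliques. It uses a simple first-fit greedy rule as a stand-in, which is not the approximate algorithm of [19]; the vertex names and edges are hypothetical and do not reproduce the example of Fig. 12.

```python
def greedy_clique_partition(vertices, edges):
    """First-fit greedy clique partitioning (illustrative stand-in only).

    Each vertex joins the first existing clique whose members are all
    compatible with it; otherwise it starts a new clique.
    """
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    cliques = []
    for v in vertices:
        for clique in cliques:
            if all(u in adj[v] for u in clique):
                clique.append(v)
                break
        else:
            cliques.append([v])
    return cliques


# Hypothetical partitions: arrays "a" and "b" each split into three
# partitions that can be interleaved pairwise; "c" is accessed in a
# different loop, so it is compatible with every other partition.
vertices = ["a0", "a1", "a2", "b0", "b1", "b2", "c"]
edges = [("a0", "b0"), ("a1", "b1"), ("a2", "b2")]
edges += [("c", v) for v in vertices if v != "c"]

logical_memories = greedy_clique_partition(vertices, edges)
```

Every returned group is a clique of the compatibility graph, so all of its partitions may share one logical memory. A greedy rule of this kind gives no guarantee of using the minimum number of cliques.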
C. Mapping of Logical Memories to Physical Memory
The next step is to map the logical memories to the available physical memory modules. There is a tradeoff between memory utilization and address bus transition count: mapping each logical memory into a separate physical memory might be prohibitively expensive, forcing multiple logical memories to be mapped into the same physical memory. However, such sharing is inefficient in terms of transition count because some components of two different cliques in the compatibility graph are incompatible (otherwise, they would belong to the same larger clique). We introduce an additional user-supplied constraint, a memory packing factor, which allows the user to control this tradeoff. The packing factor represents the minimum fraction of each physical memory that needs to be filled. It can be set to one, indicating that all the physical memories should be full. If it is relaxed to less than one, a better mapping in terms of transition count might be achieved. If it is set to zero, the resulting mapping is optimum with respect to transition count, but could be expensive in area, as memory space would be wasted.
The algorithm we use for mapping the logical memories into physical memories is a variant of the best-fit decreasing heuristic for the bin-packing problem [20]. The general strategy is to consider the logical memories one at a time, largest first, and to assign each one to a physical memory module based on its own size and the sizes of the available memory modules. We start with the largest logical memory since it is a candidate that may account for a large share of the transitions. If it is larger than all available physical memories, we implement it with one or more copies of the largest available physical memory module. If a remainder of smaller size is left over, we add it to the list of logical memories and continue the mapping process. If at least one physical memory module is at least as large as the logical memory under consideration, we map the logical memory into the largest physical memory module that satisfies the memory packing factor. If the constraint cannot be satisfied, the mapping is too expensive, and we need to compromise on transition counts by mapping more than one logical memory to the same physical memory. The details of the algorithm can be found in [17].
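The strategy can be sketched roughly as below. This is a simplified reading of the description above, not the algorithm of [17]: the function name, the first-fit sharing rule used when the packing factor cannot be met, and the example sizes are all assumptions.

```python
def map_logical_to_physical(logical_sizes, module_sizes, packing_factor):
    """Assign logical memories to library modules, largest first.

    Returns a list of [module_size, [logical sizes stored in it]].
    """
    modules = sorted(module_sizes)
    largest = modules[-1]
    pending = sorted(logical_sizes, reverse=True)
    placed = []

    while pending:
        size = pending.pop(0)

        # Too big for any module: fill one copy of the largest module
        # and put the remainder back on the list.
        if size > largest:
            placed.append([largest, [largest]])
            pending.append(size - largest)
            pending.sort(reverse=True)
            continue

        # Largest module that still meets the packing factor on its own.
        candidates = [m for m in modules
                      if m >= size and size / m >= packing_factor]
        if candidates:
            placed.append([max(candidates), [size]])
            continue

        # Assumed fallback: share an already allocated module if it has
        # room, otherwise open the smallest module that fits.
        for entry in placed:
            if sum(entry[1]) + size <= entry[0]:
                entry[1].append(size)
                break
        else:
            placed.append([min(m for m in modules if m >= size), [size]])

    return placed


# Hypothetical sizes (in arbitrary units) and a packing factor of 0.5.
example = map_logical_to_physical([300, 120, 40], [128, 256, 512], 0.5)
oversize = map_logical_to_physical([700], [128, 256, 512], 0.5)
```

With a packing factor of zero, every logical memory that fits in the library receives its own module, which matches the transition-optimal but area-expensive extreme described in the text.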
VI. EXPERIMENTS
We conducted experiments to test the efficacy of various memory mapping schemes on several examples taken from the image-processing application domain [21]. We use the cumulative count of bit transitions on the memory address bus during the execution of the algorithm as the power-consumption metric for each memory mapping technique. We varied the dimension of the arrays and studied the effects of the mapping schemes on transition count. Detailed experimental results are described in [17]; we summarize the results here. Fig. 13 shows the variation of the total transition count with respect to the dimension of the arrays for the SOR example. The array dimension was varied along the horizontal axis from 50 to 1000 in steps of ten. The curves marked Row-major and Row-major (Gray) represent the total transition count for row-major mapping without and with the GCC, respectively; likewise for the Column-major and Column-major (Gray) curves. These pairs illustrate the difference in transition counts due to the GCC alone. The curve marked Tile-based shows the transitions for the tile-based mapping discussed before.
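The metric itself is straightforward to compute from an address trace. The sketch below counts address bus bit transitions for row-major and column-major mappings, with and without Gray coding. The 8 × 8 array and the column-by-column access trace are assumptions chosen for illustration; they are not the SOR benchmark.

```python
def gray(n: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def bus_transitions(addresses, use_gray=False):
    """Total number of bit flips on the address bus over a trace."""
    total, prev = 0, None
    for addr in addresses:
        word = gray(addr) if use_gray else addr
        if prev is not None:
            total += bin(prev ^ word).count("1")
        prev = word
    return total


def row_major(i, j, n):
    return i * n + j


def column_major(i, j, n):
    return j * n + i


N = 8
# Assumed access pattern: walk down each column in turn.
trace = [(i, j) for j in range(N) for i in range(N)]


def count(mapping, use_gray):
    return bus_transitions([mapping(i, j, N) for i, j in trace], use_gray)


results = {(m.__name__, g): count(m, g)
           for m in (row_major, column_major)
           for g in (False, True)}
```

For this trace, column-major mapping yields consecutive addresses, so Gray coding reduces the count to one transition per access. Sweeping `N` over a range of sizes yields curves of the kind plotted in Fig. 13.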
A. Single Memory
For the SOR example, column-major mapping with the GCC does roughly as well as the tile-based mapping. In this case, the column-major mapping is still preferable since it results in a simpler address generator than tile-based mapping, while performing almost equally well in terms of transition count. Table I summarizes our experimental results by comparing the total transition count for row-major mapping with that of the best mapping among those we have considered, for seven examples from the image-processing domain. The * beside Linear and Wavelet indicates that the counts compare interleaved against noninterleaved mapping, since the question of row major versus column major does not arise in these examples, which involve only one-dimensional arrays. The second and third columns show the total I/O transition count for two mapping schemes: row major and the best mapping predicted by Heuristic 1. Column 4 shows the percentage reduction in I/O transition count for one specific array dimension. Column 5 shows the average percentage reduction in transition count obtained by performing the same experiment for array dimensions ranging from 50 to 1000 in steps of ten. We observe a significant decrease in the memory address bus transition counts, indicating a corresponding reduction in address bus power for these examples.
Heuristic 1 selects the best memory mapping technique for all the examples on which we conducted our experiments.
1) Overall Effect on Transition Count:
We now evaluate the net reduction in equivalent transition count due to the proposed mapping strategy for the SOR example discussed before. It is clear from the graph of Fig. 13 that inclusion of the GCC makes no appreciable difference in transition count for row-major mapping, so we compare row major without GCC against tile-based mapping with GCC.
We evaluate the power reduction in the ASIC component as follows. The overall reduction in the transition count of the ASIC is the reduction in equivalent transition count on the memory address bus minus the sum of three overheads: the transition count overhead in the GCC, the transition count overhead in the multiplexer, and the transition count overhead in the combinational circuit of the selector. Assuming all variables and addresses are 16 bits wide, we obtain an average off-chip transition count on the memory address bus (over all iterations) of 3.75 for tile-based mapping and 7.51 for row-major mapping. The difference, 3.76 off-chip transitions per access, is converted into an equivalent transition count of typical on-chip transistors. We observed an average of 4.29 transitions at the input of the GCC, which determines the transition count overhead incurred at the GCC; the multiplexer overhead is evaluated for 16-bit data. Subtracting these overheads gives the overall reduction in transition count in the ASIC (see Section IV). We note that the overhead incurred in implementing the desired mapping scheme is minimal (in terms of area, delay, and transition count) and that the transition count difference is dominated by the reduction in off-chip transition count. The implementation of the selector shows that the area overhead for tile-based mapping is insignificant. The delay overhead can also be ignored since the only additional delay incurred is the delay of the single XOR gate stage in the GCC.
Note that, in this example, the GCC helped bring down the transition count per memory access from 4.29 (average transition count at the input of the GCC) to 3.75 (address bus transition count), or by approximately 12.5%. If the GCC were absent, the address bus would carry the 4.29 transitions per access observed at the GCC input, so the off-chip reduction over row-major mapping would shrink from 3.76 to 3.22 transitions per access. The GCC overhead component would not exist, and the multiplexer and selector overheads would remain the same as before, so a net reduction in equivalent transition count is still obtained. This observation corroborates our claim in Section VI that our memory mapping techniques guarantee reduction in transition count even in the absence of the GCC. Although the illustration of transition count reduction was for a specific example, the analysis is quite general. We, therefore, believe that this analysis applies to all the other examples. For example, the overhead of the GCC is always the same function of the input transition count. The implementation of tile-based mapping would follow the same hardware organization shown in Fig. 7(c); only the inputs to the selector would change. The combinational circuit shown in the selector would be different for different input specifications. However, the inputs to the selector would still be constants, and the technique used to estimate the transition count in the MUX would still be valid.
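The bookkeeping behind this comparison can be written compactly. The symbols below are introduced here purely for illustration and are not the paper's notation: R_bus denotes the reduction in equivalent transition count on the address bus, and O_GCC, O_MUX, and O_sel denote the overheads in the GCC, the multiplexer, and the selector. The last line uses only the per-access averages quoted in the text and assumes that, without the GCC, the bus carries the 4.29 transitions observed at the GCC input.

```latex
\begin{align*}
R_{\text{net}}  &= R_{\text{bus}} - \left(O_{\text{GCC}} + O_{\text{MUX}} + O_{\text{sel}}\right)
  && \text{(with the GCC)}\\
R'_{\text{net}} &= R'_{\text{bus}} - \left(O_{\text{MUX}} + O_{\text{sel}}\right)
  && \text{(without the GCC)}\\
7.51 - 3.75 &= 3.76, \qquad 7.51 - 4.29 = 3.22
  && \text{(off-chip transitions saved per access)}
\end{align*}
```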
B. Multiple Memories
Our experiments in Section VI-A showed a 27%-63% reduction in transition count when behavioral arrays were mapped into a single memory. The experiments we report in this section demonstrate a further significant reduction when the arrays are mapped to multiple memories.
Experiment 1: In our first experiment to determine the impact of multiple memories on transition count, we used a configuration in which one physical memory is available for each logical memory. In other words, this is the best improvement possible over the single memory case since the partitioning into logical memories represents the ideal mapping, according to our technique. Table II compares, for five examples, the transition counts of the best mapping for a single physical memory and for multiple physical memories. Columns 2 and 3 show transition counts for the different examples, with the array dimension set to 1000. Column 4 shows the factor by which transition counts decreased as a result of using multiple memories. Column 5 shows the average reduction over all the array sizes that were considered for each example (the dimension was varied from 50 to 1000 in steps of ten). We observe that the transition count could be reduced by a factor of 2.7 to 6.6 if multiple memories are considered. Fig. 14 shows the transition counts for the five examples. The lighter curves represent the total transition count for the best mapping targeting a single physical memory; the heavier curves indicate the transition counts when the mapping resulting from application of Algorithm 2 targeting multiple physical memories is used.
We observe that the ratio of transition counts remains roughly constant across a wide range of array dimensions, with very little deviation from the average reduction factor in transition counts shown in Table II, column 5. This demonstrates that the proposed extension of our techniques to the multiple memories scenario could lead to significant savings in power, as measured by the transition count on the memory address bus.
Experiment 2: Experiment 1 demonstrates the possible transition count reduction in the hypothetical case of physical memories of the same size as the logical memories being available. In practice, however, the designer may be constrained by a specific library of physical memory modules. In the second experiment, we used a specific library of memory modules of sizes 128, 256, and 512 kbyte, and 1 and 2 Mbyte, with the assumption that each array element in the arrays of the examples occupies 1 byte (data bus is 8-b wide). Table III shows the transition counts obtained for the same five examples when the mapping is performed with the above library.
In Table III, column 2 shows the transition counts obtained with a simple algorithm that maps the arrays in the specification (in order of decreasing size) to the smallest memory that accommodates them. Column 3 shows the corresponding transition counts using our approach. Comparing column 3 in Tables II and III, we note that for three of the examples (Compress, Laplace, and Lowpass), the transition count in Experiment 2 was the same as the best-case (one physical memory for each logical memory) transition count in Experiment 1 (i.e., Algorithm 2 performed the optimal mapping). The transition count reduction factors shown in column 4 indicate a significant reduction resulting from our approach, ranging from a factor of 1.5 to 6.7.
VII. CONCLUSIONS AND FUTURE WORK
In this paper, we studied the impact of mapping of array elements to physical memory address on the power consumption of the resulting design. We presented a heuristic for examining the important access patterns of arrays in the specification, and choosing an appropriate mapping strategy. We also formulated and solved the problem of mapping the arrays into multiple physical memories with the objective of reducing memory address bus transitions.
The concept of maximal and minimal transitions used in Section IV-C is also related to memory pages in dynamic random access memories (DRAMs). A minimal transition roughly corresponds to accesses within the same memory page, while a maximal transition indicates accesses to different pages. The implication of minimal and maximal transitions for power thus has a counterpart in memory performance in page-mode DRAMs: the access time for addresses in the same page is much less than the time for consecutive accesses to different pages. Thus, the formulation we employed in this paper holds not only for the power minimization objective, but also for improving memory performance.
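The page-mode analogy can be illustrated by counting how often consecutive accesses fall in different pages. The page size of 16 words and both traces below are assumptions chosen only to make the contrast visible.

```python
def page_changes(addresses, page_size):
    """Count consecutive access pairs that fall in different pages."""
    pages = [addr // page_size for addr in addresses]
    return sum(1 for prev, cur in zip(pages, pages[1:]) if prev != cur)


PAGE_SIZE = 16  # assumed page size, in words

# Data laid out in the order in which it is accessed.
aligned_trace = list(range(64))

# Same number of pages, but each access jumps by one page.
strided_trace = [i * PAGE_SIZE + j for j in range(4) for i in range(4)]

aligned_misses = page_changes(aligned_trace, PAGE_SIZE)
strided_misses = page_changes(strided_trace, PAGE_SIZE)
```

The aligned trace crosses a page boundary only three times in 64 accesses, whereas the strided trace changes page on every one of its 15 consecutive access pairs.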
We conducted experiments on several image-processing benchmarks that exhibited various (regular) memory access patterns. Our memory mapping algorithm selected the best mapping for low power on these examples and resulted in savings of up to 63% in transition count on the memory address bus and related circuitry in the ASIC and memory. Our experiments showed that the power consumption, as measured by address bus transition count, could be reduced by a further factor of 2.7 to 6.6 over the best mapping technique for a single physical memory if multiple memories are employed. Mapping into a sample library showed reductions by a factor of 1.5 to 6.7 in transition count on the memory address buses over a straightforward mapping scheme.
Therefore, this paper highlights the importance of selecting an appropriate memory mapping technique for low-power design and presents an effective heuristic for selecting a memory mapping technique that minimizes address bus transitions for lower power. Further, since we attempt to align data in memory in the order in which they are accessed, our strategy will help improve system performance when page-mode memories are employed (in this case, the Gray code conversion is not applicable).
The mapping strategy we have described is valid for array references in loops whose index expressions combine the loop index with a constant. The formulation and solution of the problem to handle multidimensional arrays remains exactly the same. Currently, the designer is responsible for identifying the most important loop of the specification on which the memory mapping scheme has to be based. Identification of the appropriate nested loops to examine is important; otherwise, it might lead to conflicting mapping results (since the arrays could be accessed in different ways in different parts of the specification). The concept of mapping behavioral arrays to multiple physical memories can be explored further by studying different architectural connectivities of the memories (e.g., all memories connected to the ASIC through one memory address bus), as well as different memory configurations (e.g., multiport memories).
