We developed a new modular synthesis approach for design of low-power core-based data-intensive application-specific systems on silicon. The power optimization is conducted in three steps: 
INTRODUCTION
The ever-increasing popularity of communications applications, such as wireless telephony and Internet browsing, has established them as a target for developers of consumer electronics. Systems for these markets demand high design flexibility and large-volume data management. The fact that battery life has not kept pace with the progress of semiconductor technology has established low-power design as a premier design goal for many systems. Similarly, semiconductor technologies have allowed high-level integration of processors and memory structures on a single die. Thus, low-power systems-on-silicon (SOS) consisting of a programmable processor and memory hierarchy have attracted a great deal of attention from silicon vendors. Since a significant portion of the SOS power consumption comes from the processor and cache cores, we developed a new synthesis technique that focuses on selecting a processor and an instruction and data cache configuration for minimized power consumption on a set of target applications.
The distribution of power dissipation for an application-specific SOS depends on the actual application. Extensive on-chip power distribution analyses have been conducted for a number of modern general-purpose processors. Gonzalez and Horowitz [Gon96] show that 25%-40% of the total energy dissipation of a typical general-purpose microprocessor is consumed by on-chip caches. Burd and Brodersen [Bur96] show that out of 1.2 W of total Infopad processor system power dissipation, 120 mW and 600 mW are due to the microprocessor (ARM60) and the 512Kb SRAM cache, respectively. The Toshiba design team [Nag95] showed that the cache, datapath, and control power are about 50-100 mW, 100-150 mW, and 50 mW, respectively, for 3.3 V, 50 MHz operation of their 150 MIPS/W RISC.
We developed a new synthesis approach for low-power core-based data-intensive application-specific SOS. The approach optimizes the power of the integrated system by applying a number of power and performance optimization strategies. The power [...] As an outer synthesis loop, the resource allocation algorithm performs a bounded search for a power-efficient system configuration. The result consists of a microprocessor core selection and an I- and D-cache specification (size, associativity, and line size).
The bounded search is guided by evaluating the design trade-offs for the selection of each of the configuration parameters. Sharp lower and upper bounds significantly reduce the number of cache configurations considered for power and performance estimation.
This reduction is important because accurate cache performance and power estimation is done using trace-driven simulations.
In order to bridge the gap between the profiling and modeling tools from the two traditionally independent synthesis domains (architecture and CAD), we have developed a new synthesis and evaluation platform. The platform integrates the existing modeling (CACTI [Wil94]), profiling (SHADE [Cme94]), and simulation (DINEROIII [Hil88]) tools with the developed system-level synthesis tools. The effectiveness of the approach is demonstrated on the MediaBench benchmark suite [Lee97].
PREVIOUS WORK
The related work can be traced along the following three lines: application-specific system synthesis, processor and memory hierarchy design and energy-efficiency evaluation, and cache line coloring techniques. The market-driven need for application-specific system cost reduction has spurred the development of system-level synthesis techniques [Wol94, Pot95]. As embedded applications have become more sophisticated and commercially relevant, hardware-software codesign and techniques for system-level synthesis have also become increasingly important [Gup93, Gaj94]. The importance of low-power design has resulted in adequate treatment of these optimization issues in system-level and high-level synthesis [Meh96, Lee96].
The increased interest in embedded system design with reusable components has encouraged the development of high-level architecture evaluation models. The Microprocessor Report presents a monthly summary of the power dissipation and performance of processor cores [Mic97]. Low-power processor design has been discussed in [Bur96, Gon96]. I- and D-caches, as the highest level of the memory hierarchy, have been thoroughly studied [Hil88, Jou93]. Cache models for latency and power have been developed, and power minimization strategies have been evaluated for a number of design parameters [Wil94, Wad92, Ko95, Eva95, Su95].
Compiler-directed techniques for minimizing the number of cache misses have received significant attention in the research community. Bershad et al. have proposed dynamic address remapping for avoiding conflict misses in large direct-mapped instruction caches [Ber94]. Static code repositioning by using cache line coloring at the procedure or basic block level is an alternative approach proposed and evaluated in [Hwu89, Tor96].
PRELIMINARIES
Several factors combine to influence system performance: instruction and data cache miss rates and penalties, processor performance, and system clock speed. Power dissipation of the system is estimated using the processor power dissipation per instruction and the number of executed instructions per task, the supply voltage, the energy required for cache read, cache write, and off-chip data access, and the profiling information about the number of cache and off-chip accesses. Our approach to realistically addressing the factors used to estimate power consumption leverages existing cache and processor models.
Typical cache design is illustrated in Figure 1. The influence of individual cache components on its latency is quantified in [Wad92, Wil94]. There also exist a number of reports on how energy consumption is distributed among parts of the cache structure. Evans and Franzon [Eva95] show that global decode, precharging, and address latches are each responsible for approximately one third of the power consumption. Ko and Balsara [Ko94] conclude that the bit array, address decode, control generator, and sense amplifiers consume 40%, 18%, 25%, and 17%, respectively, of the total power dissipation. We used CACTI [Wil94] as a cache delay estimation tool with respect to the main cache design choices: size, associativity, and line size. The energy model was adopted from [Eva95, Su94] and scaled with respect to the industrial implementations.
Table 2: A sample of the cache power consumption model.
Caches typically found in current embedded systems range from 128B to 32KB. Although larger caches correspond to higher hit rates, their power consumption is proportionally higher, resulting in a design trade-off. Since higher cache associativity results in significantly higher access time, we consider only direct-mapped caches. We experimented with 2-way set-associative caches, but they did not dominate in a single case. Cache line size was variable in our experiments.
Its variation corresponded to the following trade-off: a larger line size results in a higher cache miss penalty and higher power consumption by the sense amplifiers and column decoders, while a smaller line size results in larger cache decoder power consumption. Extreme values result in significantly increased access time. We estimated the cache miss penalty based on the operating frequency of the system and the external bus width and clock for each system investigated. Write-back was adopted as opposed to write-through, since it has been shown to provide superior performance and especially power savings in uniprocessor systems [Jou93], though at increased hardware cost. Each of the processors considered is constrained by Flynn's limit and is able to issue at most a single instruction per clock period. Thus, caches were designed to have a single access port. A sample of the cache model data is given in Tables 1 and 2. The cache access delay and power consumption models were computed for a number of organizations and sizes, assuming a feature size of 0.5 μm and a typical six transistors per CMOS SRAM cell. The nominal energy consumption per single off-chip memory access, 98 nJ, was adopted from [Fro97].
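The miss-penalty estimate mentioned above can be illustrated with a short sketch. It is not the model used in the experiments: the function name, the assumption that a line is transferred in bus-width beats at one beat per bus clock, and the optional fixed off-chip latency are all assumptions made for the example.

```python
import math

def miss_penalty_cycles(line_size_bytes, bus_width_bytes, cpu_mhz, bus_mhz,
                        latency_bus_cycles=0):
    """Estimate CPU cycles stalled on one cache line fill (hypothetical model)."""
    # Number of bus transfers needed to move one cache line.
    beats = math.ceil(line_size_bytes / bus_width_bytes)
    # Total bus cycles, including an assumed fixed off-chip access latency.
    bus_cycles = latency_bus_cycles + beats
    # Convert bus cycles to CPU cycles using the clock ratio.
    return math.ceil(bus_cycles * cpu_mhz / bus_mhz)
```

Under these assumptions, doubling the line size doubles the transfer part of the penalty, which is the cost side of the line-size trade-off described above.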
Data on microprocessor cores was extracted from The Microprocessor Report [MPR97]. A sample of the collected data is presented in Table 3. The table presents embedded microprocessor core operating frequency, MIPS performance, technology, area, and nominal power consumption. The last two rows of the table show two integrated microprocessor products with on-chip caches and their system performance data.
Voltage scaling is used as a powerful power-minimization strategy in the developed synthesis approach. The supply voltage Vdd vs. delay TD model used in the experiments was adopted from [Cha92] and is quantified by the following formula: 
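As a hedged illustration of how such a model is used, the sketch below assumes the widely used first-order CMOS relation in which delay is proportional to Vdd / (Vdd - VT)^2 and switching energy is proportional to Vdd^2. Treating this as the intended model is an assumption; the 0.7 V threshold voltage and the function names are likewise chosen only for the example. The 3.3 V nominal supply is the value used in the text.

```python
def relative_delay(vdd, vdd_nominal=3.3, vt=0.7):
    """Delay at vdd relative to the delay at the nominal supply (assumed model)."""
    if vdd <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")
    delay = lambda v: v / (v - vt) ** 2   # first-order CMOS delay, up to a constant
    return delay(vdd) / delay(vdd_nominal)

def relative_dynamic_energy(vdd, vdd_nominal=3.3):
    """Switching energy relative to nominal; it scales with the supply squared."""
    return (vdd / vdd_nominal) ** 2
```

Under this model, lowering the supply always slows the circuit and always saves switching energy, which is why the supply can be reduced only as far as the timing constraints allow.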
THE SYSTEM-LEVEL SYNTHESIS APPROACH
In this section we describe the function of each module in the synthesis system and how modules are combined into a modular [...] We attack these problems by scanning the space using search algorithms with conservative sharp lower and upper bounds and by providing powerful performance and power estimation techniques.
The developed synthesis tools synthesize a programmable application-specific SOS which satisfies the requirements of multiple applications. This system requirement represents a realistic design scenario for most modern non-preemptive multitask application-specific SOS (speech compression, RF signal processing, etc.). The synthesis technique considers each non-dominated processor core and competitive cache configuration, and selects the hardware which requires minimal power consumption and satisfies the performance requirements of all target applications.
System performance is evaluated using a platform which integrates simulation, modeling, and profiling tools as shown in Figure 2. SHADE is a tracing tool which allows users to define custom trace analyzers and collect rich information on runtime events. [...] where TC is the total number of cycles required to execute the application, CPUIC is the number of cycles during which the CPU is idle waiting for data to be fetched from main memory to cache (blocking caches are assumed due to their lower hardware cost), CPUEPC is the average energy consumed per cycle for a particular processor core, CA and EPCA are the number of cache accesses and the energy consumed per access, respectively, and MA and EPMA are the number of main memory accesses (reads due to cache misses and write-back updates) and the energy consumed per memory access, respectively. The function Scale(VoltageSupply) returns the scaling factor for VoltageSupply with respect to the nominal supply voltage (3.3 V). The power savings due to the decrease in transitions in the cache decoder are accounted for by scaling down 30% of the total cache power consumption by the ratio of the number of transitions due to address changes in the relocated and the traditional basic block schedules.
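One possible reading of the terms defined above is sketched below. The way the terms are combined is an assumption, not the expression used in the experiments: the CPU is charged only for non-idle cycles, on-chip CPU and cache energy is scaled with the supply voltage while off-chip memory energy is not, and Scale is taken to be the squared ratio to the nominal 3.3 V supply. The function and parameter names are hypothetical.

```python
def scale(voltage_supply, nominal=3.3):
    """Assumed scaling factor: switching energy grows with the supply squared."""
    return (voltage_supply / nominal) ** 2

def system_energy(tc, cpuic, cpuepc, ca, epca, ma, epma,
                  voltage_supply=3.3, transition_ratio=1.0):
    """Energy in joules for one application run, under the stated assumptions."""
    cpu_energy = (tc - cpuic) * cpuepc   # CPU charged for active cycles only (assumption)
    # 30% of the cache energy is scaled by the decoder transition ratio
    # (relocated schedule vs. traditional schedule), as described in the text.
    cache_energy = ca * epca * (0.7 + 0.3 * transition_ratio)
    memory_energy = ma * epma            # reads due to misses plus write-back updates
    on_chip = (cpu_energy + cache_energy) * scale(voltage_supply)
    return on_chip + memory_energy       # off-chip energy left unscaled (assumption)
```

With a transition ratio of 1.0 and the nominal supply, the estimate reduces to the plain sum of the CPU, cache, and off-chip memory terms.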
SYNTHESIS OPTIMIZATION ALGORITHMS
The problems encountered in the synthesis of a low-power SOS and competitive optimization algorithms are described in the following subsections. First, the algorithm for basic block relocation is discussed. Then, based on the obtained code repositioning table and the improved cache performance, we assemble a power-efficient system configuration (core and cache structure) which satisfies the performance requirements of a set of target applications.
Application to Cache Mapping for Low-Power
The core of the proposed application-driven system-level synthesis technique involves basic block repositioning based upon profile information. Block repositioning is undertaken in two phases. In the first phase, the relocation aims for application execution on fixed hardware resources with a minimal number of cache misses. The second phase relocates the part of the code assigned to a particular cache line to a set of addresses which map to the same cache entry, such that basic blocks that are frequently executed in sequence are stored at consecutive Gray code locations. In other words, sets of basic blocks mapped to the same cache line are reassigned to, not necessarily different, cache entries such that the number of transitions in the instruction cache decoder is minimized.
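The reason for using consecutive Gray code locations can be seen in a few lines. The sketch below uses the standard reflected binary Gray code; the helper names are chosen for the example.

```python
def gray_code(i):
    """Return the i-th reflected binary Gray code."""
    return i ^ (i >> 1)

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

codes = [gray_code(i) for i in range(8)]

# Consecutive Gray codes differ in exactly one bit, so stepping between
# blocks placed at consecutive codes toggles a single decoder input line.
gray_transitions = sum(hamming(codes[i], codes[i + 1]) for i in range(7))

# For comparison: stepping through plain binary addresses 0..7.
binary_transitions = sum(hamming(i, i + 1) for i in range(7))
```

For eight entries, the sequential walk costs 7 bit transitions with Gray codes and 11 with plain binary addresses.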
The first phase in the relocation strategy is described in detail in [Kir97]. The collected run-time information on the execution of the program is used to generate and weight a control flow graph (CFG). Then, the CFG is transformed so that spatially and temporally highly probable execution paths are identified. Once the CFG is transformed, basic blocks are selected for mapping to particular addresses relative to the starting address of the program. The goal of the mapping function is to minimize the number of cache misses for the application, i.e., to map basic blocks that are likely to be executed sequentially into memory locations which result in mappings to different cache lines. Since the problem of finding the optimal basic block mapping is computationally intractable [Gar79], we opted to employ the heuristic presented in [Kir97].
After groups of basic blocks are identified, in the second phase of the optimization strategy we assign these groups to particular addresses. As shown in Figure 3, the problem is modeled using a graph representation. Groups of basic blocks which map to the same cache lines are represented as nodes in the graph. The weights of all edges from the CFG that connect basic blocks from two groups are summed. The sum is the weight of the directed edge connecting the two vertices that represent the corresponding two groups of basic blocks. The number of nodes therefore corresponds to the number of entries in the cache structure. The goal of the optimization algorithm is to assign codes to nodes in such a way that the sum of the products of the Hamming distance and the directed edge weight between two nodes is minimal. The problem of embedding a general weighted graph in an n-dimensional cube so as to minimize the weighted lengths is NP-hard [Afr85]. Algorithms which address this optimization problem exist [Che97]. We have developed a novel probabilistic least-constraining most-constrained heuristic for minimum-switching state encoding.
The heuristic is explained in detail using the pseudo-code presented in Figure 4. The algorithm iteratively randomly selects an ordered subset of nodes with the highest value for an objective function that guides the heuristic. The objective function (see Figure 4) takes into account the constraints of each considered subset of nodes. It forces early selection of a subset of vertices with the highest sum of edge weights (most-constrained) and the smallest sum of edges connecting two nodes, one in, and the other outside, the subset (least-constraining). The edges are summed corresponding to their order in the subset. Subset cardinality and the number of iterations in the randomized search for each objective-efficient subset are parameters of the heuristic. Once the subset is selected, it is assigned a set of consecutive Gray-code addresses and removed from the list of vertices considered in the subset selection inner shell. The vertices are not removed from the graph. The assigned encodings are used to add another component to the objective function described above. The objective function of an ordered subset is now scaled with a factor which quantifies how well the new subset fits the existing encoding assignments (see pseudo-code in Figure 4).
When all subsets are selected, the graph is transformed so that all nodes in one subset are merged into a single vertex (see Figure 3). This vertex inherits only the incoming edges of the initial vertex and the outgoing edges of the ending vertex of the parent subset. Although the graph has reduced size, the goal of the optimization strategy is still the same. Since the number of nodes in the graph is reduced by a factor of SubsetCardinality, we either recursively apply the described procedure in order to further reduce the cardinality of the problem, or apply an exact algorithm.
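The quantity that the encoding step minimizes, the sum over directed edges of edge weight times the Hamming distance between the codes of the two endpoints, can be made concrete with a small sketch. The four-node graph and its weights are made-up example data, and the function names are hypothetical; the sketch only evaluates the objective and does not implement the heuristic of Figure 4.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def encoding_cost(edges, code):
    """Sum over directed edges of weight times Hamming distance of endpoint codes.

    edges: dict mapping (src, dst) node pairs to summed CFG edge weights.
    code:  dict mapping each node to its assigned cache-entry index.
    """
    return sum(w * hamming(code[u], code[v]) for (u, v), w in edges.items())

# Hypothetical example: a heavily travelled cycle A -> B -> C -> D -> A.
edges = {("A", "B"): 90, ("B", "C"): 80, ("C", "D"): 70, ("D", "A"): 5, ("A", "C"): 1}

binary_order = {"A": 0b00, "B": 0b01, "C": 0b10, "D": 0b11}  # plain binary placement
gray_order = {"A": 0b00, "B": 0b01, "C": 0b11, "D": 0b10}    # consecutive Gray codes

cost_binary = encoding_cost(edges, binary_order)
cost_gray = encoding_cost(edges, gray_order)
```

In this example the Gray-code placement of the frequently executed chain lowers the weighted transition count from 331 to 247.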
Low-Power Resource Allocation
In this phase of the synthesis approach, a search is conducted for an energy-efficient system configuration and its voltage supply value which satisfy the performance requirements of the set of target applications. The search algorithm is described using the pseudo-code shown in Figure 6. Figure 5 demonstrates how competitive hardware configurations are traced for the optimal solution. The search algorithm evaluates a number of cache systems and uses a technique to reduce the search space of competitive processor-cache systems from a continuous to a discrete domain.
Since performance and power evaluation of a single processor, I- and D-cache configuration requires trace-driven simulation, the goal of our search technique is to reduce the number of evaluated cache systems using sharp lower and upper bounds for cache system performance and power estimations. A particular cache system is evaluated using trace-driven simulation only once. The data retrieved from such a simulation can be used for overall system power consumption estimation for different embedded processor cores with minor additional computational expense.
Firstly, the algorithm excludes from further consideration processors dominated by other processor cores. One processor type dominates another if it consumes less power at higher frequency and results in higher MIPS performance at the same nominal power supply. The competitive processors are then sorted in ascending order with respect to their power consumption per instruction and frequency ratio. Microprocessors which seem to be more power efficient are, therefore, given priority in the search process. This step later provides sharper bounds for search termination. Finally, at the nominal supply voltage, power consumption is estimated for the most power-efficient processor combined with all cache configurations which satisfy the performance requirements of all target applications (Figure 5; point A).
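The dominance test and the subsequent ordering can be sketched as follows. The core data is made up for illustration, the names are hypothetical, and ordering by power per MIPS is an assumed stand-in for the sort key described above.

```python
from collections import namedtuple

# mhz: operating frequency, mips: MIPS performance,
# watts: nominal power at the common supply voltage.
Core = namedtuple("Core", ["name", "mhz", "mips", "watts"])

def dominates(a, b):
    """a dominates b: less power, at a higher frequency, with higher MIPS."""
    return a.watts < b.watts and a.mhz > b.mhz and a.mips > b.mips

def competitive_cores(cores):
    """Drop dominated cores, then order the rest by power per MIPS (assumed key)."""
    kept = [c for c in cores if not any(dominates(o, c) for o in cores)]
    return sorted(kept, key=lambda c: c.watts / c.mips)

# Made-up figures for illustration only.
cores = [
    Core("P1", mhz=40, mips=36, watts=0.20),
    Core("P2", mhz=33, mips=30, watts=0.35),   # dominated by P1
    Core("P3", mhz=100, mips=120, watts=0.90),
]
ranked = competitive_cores(cores)
```

Here P2 is excluded because P1 is better on all three criteria, while P3 survives because its higher power buys higher performance.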
The search for the most efficient cache configuration is bounded with conservative sharp lower and upper bounds. The sharp upper bound on cache increase for a fixed voltage supply and processor core is determined by measuring the number of conflict misses and comparing the energy required to fetch the data from off-chip memory due to the measured conflict misses with the energy that would have been consumed by a cache twice as large for the same number of cache accesses (with the number of cache conflicts assumed to be zero). If the former is smaller, we terminate further increase of the particular cache structure. Similarly, the lower bound is defined at the point when the energy required to fetch the data from off-chip memory due to conflict cache misses for a cache half the size, with zero energy consumed per cache access, is larger than the energy required for both fetching data from cache and off-chip memory in the case of the larger cache structure. The smallest cache system which meets the performance requirements of the set of target applications guarantees power consumption optimality for a fixed processor and fixed supply voltage (Figure 5; point A; configuration CPU1, I$1, D$1).
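A literal reading of the two termination tests is sketched below. The function and parameter names are hypothetical, and the direction of the upper-bound comparison (stop growing when the measured conflict-miss energy is the smaller quantity) is an interpretation of the text rather than a confirmed detail.

```python
def stop_growing(conflict_misses, epma, cache_accesses, epca_double):
    """Upper-bound test: True when a cache twice as large cannot pay off."""
    # Off-chip energy currently spent on the measured conflict misses.
    miss_energy = conflict_misses * epma
    # Access energy of the doubled cache, optimistically assuming zero conflicts.
    bigger_cache_energy = cache_accesses * epca_double
    return miss_energy < bigger_cache_energy

def stop_shrinking(misses_half, epma, cache_accesses, epca, misses):
    """Lower-bound test: True when a cache half the size cannot pay off."""
    # Half-size cache, optimistically assuming zero energy per cache access.
    half_energy = misses_half * epma
    # Current cache: access energy plus off-chip energy for its misses.
    current_energy = cache_accesses * epca + misses * epma
    return half_energy > current_energy
```

Both tests give the candidate cache the benefit of the doubt (zero conflicts or zero access energy), so a configuration is pruned only when even that optimistic estimate loses.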
After evaluating all competitive hardware configurations, we obtain the cache-processor configuration that represents the power-optimal solution at the nominal voltage supply (3.3 V in our experiments). The voltage supply can now be reduced to the lowest point at which the current best configuration still meets the applications' timing constraints. The key step in the search mechanism is translating the search domain for the voltage supply from a continuous into a discrete scope. As shown in Figure 5, the curves that represent power consumption behavior for a single system configuration are congruent with respect to the voltage supply change. This means that a configuration which gives the best energy savings for a specific voltage continues to give the best results throughout the domain of its definition, i.e., as long as it is able to satisfy the application constraints. Accordingly, we compare hardware configurations only at the discrete points of discontinuity of the curves which quantify the configurations' power consumption. In Figure 5, the point of discontinuity of the configuration selected at point A is point B. Point B represents the first solution (configuration and voltage) candidate.
In order to search for a configuration which results in lower power consumption, at point E, we estimate the power characteristics of all application-feasible configurations (which do not have the processor core from the first candidate) and select the one with the best performance (CPU2, I$2, D$1). However, since we already have the current best solution (from previous processor core evaluations, CPU1, I$1, D$1), the bounds are now even sharper.
Namely, we set the final upper and lower bounds on the cache structure size and organization as follows:
• Upper bound. Given: a feasible cache structure (I- and D-cache) and processor core A. We terminate further increase of the cache structure when the increase in power consumption due to a larger cache (even if cache conflicts total zero) would be larger than the energy consumed by the current best solution.
• Lower bound. Given: a feasible cache structure (I- and D-cache) and processor core A. We abort further decrease of the cache structure if the amount of energy required to bring the data from off-chip memory due to additional cache misses (even if the smaller cache power consumption totals zero) is larger than the energy consumed by the current best solution.
Evaluate power consumption of the cache structure
According to the existing cache system analysis, evaluate the power consumption of the entire system (with the processor core)
Memorize Configuration C if power consumption is minimal
Scale down the VoltageSupply until C does not satisfy the constraints
Delete the processor core PC ∈ C from the list of cores L
until L is empty
The algorithm iterates the process of scaling down the voltage supply of the current best configuration until it can no longer satisfy the deadlines. It consecutively records the best configuration and the appropriate voltage with a different processor core. These iterations are repeated until the last application-feasible processor core's configuration is evaluated for power. This configuration has the ability to satisfy the application constraints at the lowest voltage supply (point E; configuration CPU6, I$1, D$1). However, this property does not guarantee the best overall power savings. In Figure 5, the most power-efficient configuration found is CPU3, I$1, D$1 at point D.
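The observation that the configuration reaching the lowest voltage is not necessarily the most power-efficient one can be illustrated with a simplified sketch of the outer loop. It is not the algorithm of Figure 6: each core is represented by a single already-selected cache configuration, all energy is scaled with the square of the supply, the delay model (delay proportional to Vdd / (Vdd - VT)^2 with VT = 0.7 V) and the 0.05 V voltage grid are assumptions, and all names and numbers are hypothetical.

```python
def relative_delay(vdd, nominal=3.3, vt=0.7):
    """Slowdown at vdd relative to the nominal supply (assumed first-order model)."""
    delay = lambda v: v / (v - vt) ** 2
    return delay(vdd) / delay(nominal)

def lowest_feasible_vdd(time_nominal, deadline, nominal=3.3, vt=0.7, step=0.05):
    """Step the supply down while the stretched execution time meets the deadline."""
    vdd = nominal
    while (vdd - step > vt and
           time_nominal * relative_delay(vdd - step, nominal, vt) <= deadline):
        vdd -= step
    return vdd

def search(configs, deadline, nominal=3.3):
    """configs: list of (name, time_at_nominal_s, energy_at_nominal_J), one per core.

    Returns (name, vdd, energy) of the lowest-energy feasible candidate, or None.
    """
    best = None
    for name, t_nom, e_nom in configs:
        if t_nom > deadline:
            continue                      # infeasible even at the nominal supply
        vdd = lowest_feasible_vdd(t_nom, deadline, nominal)
        energy = e_nom * (vdd / nominal) ** 2
        if best is None or energy < best[2]:
            best = (name, vdd, energy)
    return best

# Made-up candidates: A is fast but power hungry, B is slower but frugal,
# C cannot meet the deadline at all.
result = search([("A", 0.5, 10.0), ("B", 0.9, 2.0), ("C", 1.2, 1.0)], deadline=1.0)
```

In this example A can be scaled to a lower supply than B because it has more timing slack, yet B remains the lower-energy choice, mirroring the distinction between points E and D in Figure 5.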
EXPERIMENTAL RESULTS
The effectiveness of our synthesis approach is demonstrated on a set of applications and input data from the MediaBench suite [Lee97], and a set of hard real-time constraints. The results are shown in Table 4. The upper part of the table presents the configurations which were found to be the most power-effective for a particular benchmark. For each application and its timing constraint, the winning configuration is described by specifying its processor type and the associated I- and D-cache structures (cache sizes followed by cache line size). These properties of the system are shown in the third column. The next column quantifies the power dissipation of the system for execution of one iteration of the application. The last column gives the percentage improvement with respect to the average configuration among all the configurations that are minimal at the discrete points of system evaluation (see Figure 5; points A, B, C, D, and E). The energy efficacy for an average configuration is presented in the lower part of Table 4.
Table 4: Experimental results: comparison of our synthesis approach to a greedy strategy.
The approach results in quantitatively diverse improvements depending on the actual applications and the selected deadline. We opted to test the approach for strict as well as relaxed deadlines. Note that the highest improvements in power dissipation occurred in cases where the number of cache misses was significantly reduced using the basic block relocation compilation technique (this heuristic was evaluated in [Kir97]). On the other hand, the heuristic used to minimize the transitions in the cache decoder resulted in, on average, 35% less estimated power dissipation in the I-cache decoder, with negligible standard deviation. The percentage of power dissipation of the overall system due to I-cache accesses ranged from 21% to 44%. For a single application and fixed timing constraint, the resulting power consumptions for various configurations appeared to cluster at particular values in the range of one to three orders of magnitude. Therefore, the power consumption results show varying improvements, from 8% to 464%, depending on the number and range of configuration clusters.
CONCLUSION
We developed algorithms for system-level power minimization of instruction cache misses, instruction and data cache size and organization selection, placement of frequently executed sequential basic blocks of code in consecutive Gray-code-addressed memory locations, and power-efficient processor and cache matching with the applications. The compilation, performance, and power optimization algorithms are the engine for synthesis of a core-based low-power system-on-silicon. The new synthesis platform integrates the existing modeling, profiling, and simulation tools with the developed system-level synthesis tools. The effectiveness of the approach is demonstrated on a variety of modern industrial-strength multimedia and communication applications.
