We present results of extensive hardware/software partitioning experiments on numerous benchmarks. We describe our loop-oriented partitioning methodology for moving critical code from software to hardware. Our benchmarks included programs from PowerStone, MediaBench, and NetBench. Our experiments included estimated results for partitioning using an 8051 8-bit microcontroller or a 32-bit MIPS microprocessor for the software, and using on-chip configurable logic or custom application-specific integrated circuit hardware for the hardware. Additional experiments involved actual measurements taken from several physical implementations of hardware/software partitionings on real single-chip microprocessor/configurable-logic devices. We also estimated results assuming voltage-scalable processors. We provide performance, energy, and size data for all of the experiments. We found that the benchmarks spent an average of 80% of their execution time in only 3% of their code, amounting to only about 200 bytes of critical code. For various experiments, we found that moving critical code to hardware resulted in average speedups of 3 to 5 and average energy savings of 35% to 70%, with average hardware requirements of only 5000 to 10,000 gates. To our knowledge, these experiments represent the most comprehensive hardware/software partitioning study published to date.
INTRODUCTION
Much previous work has shown the advantages of hardware/software partitioning in embedded system design. Hardware/software partitioning divides an application into software running on a microprocessor and some number of coprocessors implemented in custom hardware. Advantages of such partitioning include improvement in performance [e.g., Gokhale and Stone 1998; Hauser and Wawrzynek 1997], as well as reduction in power or energy [Henkel and Li 1998; Henkel 1999; Stitt et al. 2002; Wan et al. 1998]. These advantages are gained at the expense of increased silicon area, a resource that is becoming cheaper and more readily available every year.
Many previous efforts have focused on partitioning an application that consists of numerous concurrent processes [Hou and Wolf 1996; Kalavade and Lee 1994]. Our work focuses on partitioning a single sequential program among a microprocessor and one or more custom coprocessors. In such single-program partitioning, custom coprocessors execute certain functions that were previously implemented in software. Such partitioning is possible in embedded systems, where the program is often fixed for the lifetime of the system. Previous work on single-program partitioning [Balboni et al. 1996; Eles et al. 1997; Gajski et al. 1998; Henkel and Ernst 1997; Vanmeerbeeck et al. 2001] has emphasized exploration of large numbers of candidate partitionings in order to meet timing constraints, utilizing powerful search algorithms such as simulated annealing, and utilizing sophisticated estimation models.
However, we have observed that most embedded applications spend a majority of their time in a few small loops or subroutines. Therefore, we will discuss a straightforward methodology for hardware/software partitioning that capitalizes on this observation, and is easy to implement manually or automatically.
The primary contribution of our work, though, is an extensive examination of the energy savings as well as speedups possible through hardware/software partitioning. We have examined numerous benchmarks ranging from small applications from the PowerStone suite [Malik et al. 2000], to medium-sized applications from MediaBench [Lee et al. 1997] and NetBench [Mernik et al. 2001]. We have utilized an 8-bit microcontroller as well as 32-bit processors. We have analyzed energy savings by using estimation models, as well as by taking physical measurements of real platforms. We have examined partitioning for an application-specific integrated circuit (ASIC) design flow, as well as for increasingly popular single-chip platforms having a microprocessor plus configurable logic [Altera Corporation 2001; Atmel FPSLIC; E5; Triscend Corporation; Xilinx Corporation]. We have considered energy savings of partitioning using a microprocessor with low-power standby mode, and with a voltage-scalable power source.
HARDWARE/SOFTWARE PARTITIONING METHOD

Problem Description
We assume a designer is interested in reducing the energy required by a software application, in speeding up the application, or both. We assume the common situation in embedded systems of an application being made up of a particular repeating task. In some cases, that task must repeat once every X seconds; executing more frequently is not necessary. For example, an audio decompressor might have to decode a compressed audio frame once every 2 ms in order to provide a steady audio stream. In other cases, we may want the task to execute as frequently as possible. Thus, our goal is to reduce the energy for each execution of the task, or to just speed up the task. Energy, measured in joules, is the product of time (seconds) and power (watts). In general, hardware/software partitioning reduces energy by reducing the execution time of the task, that is, by speeding up the task. Speedup is defined as the old execution time (software only) divided by the new execution time (software and hardware). The speedup typically occurs because a custom coprocessor can often execute a software region in one clock cycle that would have required numerous assembly instructions in software, due to the fine-grained parallelism possible in custom hardware. For example, the following software:
might require dozens of clock cycles in software, but could easily be accomplished in one cycle using custom hardware. However, energy reduction is not obvious from this speedup because such partitioning typically increases the power of the system while the system is executing. Thus, to reduce energy, the speedup must be great enough to overcome the increase in power.
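The trade-off between speedup and added power can be illustrated with a small sketch. The function name and the numbers are hypothetical, and the model assumes one average power figure for each version of the system: since energy is power times time, the ratio of new to old energy is the power ratio divided by the speedup, so energy is saved only when the speedup exceeds the power increase.

```python
def energy_savings(speedup, power_ratio):
    """Fractional energy savings of a partitioned design.

    speedup     -- old execution time / new execution time
    power_ratio -- new average power / old average power (assumed > 0)

    E_new / E_old = (P_new * t_new) / (P_old * t_old) = power_ratio / speedup
    """
    if speedup <= 0 or power_ratio <= 0:
        raise ValueError("speedup and power_ratio must be positive")
    return 1.0 - power_ratio / speedup


# Hypothetical case: 3x faster while drawing 1.5x the power -> 50% savings.
print(energy_savings(speedup=3.0, power_ratio=1.5))

# Hypothetical case: 1.2x faster while drawing 1.5x the power -> energy increases.
print(energy_savings(speedup=1.2, power_ratio=1.5))
```

A negative result means the partitioned design consumes more energy per task execution than the software-only design.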
Critical Loop Detection
Past work in hardware/software partitioning, including our own, has typically emphasized extensive exploration of the partitioning solution space. Exploration algorithms typically examine thousands of possible partitionings. However, during our experiences with embedded applications, we have observed that most applications spend a majority of their time in just a few loops or subroutines, which we will call critical loops. We use the term critical loop for any loop that accounts for roughly 7% or more of a task's execution time. Though we use the term "critical loop" for simplicity, sometimes the region actually represents a subroutine. That subroutine is usually critical due to being called from a loop or due to containing a loop, so the term "loop" is appropriate. In very few cases, the subroutine is critical because of being called from numerous places throughout a program.
The number of critical loops in each benchmark is typically between two and four. Amdahl's law [Amdahl 1967] leads us to realize that we should focus initially on those critical loops to obtain our speedup. For example, a task may have a critical loop that accounts for 60% of the task's execution time, and eight other loops that account for only 5% each. Speeding up the critical loop may ideally result in a speedup of 100/(100 - 60) = 2.5, whereas speeding up any one of the other loops could at best result in a speedup of 100/(100 - 5) ≈ 1.05; even speeding up all of those other loops would at best result in a speedup of 100/(100 - 40) ≈ 1.67.
Table I summarizes critical loop statistics for a variety of benchmarks on several different microprocessors. The prefixes PS, MB, or NB indicate whether the benchmark application was taken from PowerStone [Malik et al. 2000], MediaBench [Lee et al. 1997; MediaBench], or NetBench [Mernik et al. 2001]. Arch indicates the microprocessor architecture onto which we compiled the application, being either an Intel 8051 8-bit microcontroller, a MIPS 32-bit embedded processor [MIPS], or the SimpleScalar (SS) processor (a MIPS extension) [Burger and Austin 1997]. Size indicates the static size in bytes of the application after being compiled to the given architecture. The Critical Loops columns provide statistics on the most critical loops of each application, up to four of them (L1, L2, L3, and L4). Within those columns, Size represents the static size of the loop in bytes. % time indicates the percentage of task execution time (in this case, the percentage of total cycles) that this loop accounts for. Ideal Cum. Speedup represents the speedup that would ideally be obtained if this loop were executed in zero time. The speedups shown are cumulative for each successive loop, so the speedup under loop L2 assumes both L1 and L2 execute in zero time.
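The ideal cumulative speedup described above follows directly from Amdahl's law. The following sketch (the function name is ours, not from the study) computes it for a list of loop percentages ordered from most to least critical, assuming each listed loop executes in zero time.

```python
def ideal_cumulative_speedups(loop_percents):
    """Ideal cumulative speedup after each successive loop runs in zero time.

    loop_percents -- percentage of total execution time for loops L1, L2, ...
    Returns one speedup per loop; entry i assumes loops 0..i take zero time.
    """
    speedups = []
    removed = 0.0
    for pct in loop_percents:
        removed += pct
        if removed >= 100.0:
            raise ValueError("loops cannot account for 100% or more of the time")
        speedups.append(100.0 / (100.0 - removed))
    return speedups


# The example from the text: one loop at 60%, eight others at 5% each.
print(ideal_cumulative_speedups([60]))          # speedup from the critical loop alone
print(ideal_cumulative_speedups([5]))           # speedup from one minor loop
print(ideal_cumulative_speedups([5] * 8)[-1])   # speedup from all eight minor loops
```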
We obtained the data in Table I as follows. After compiling the C source code of the benchmark to a binary, for the 8051, MIPS, and SimpleScalar, we executed the binary on a cycle-accurate instruction-set simulator. We developed our own instruction-set simulators for the 8051 [Univ. of California] and for the MIPS [Givargis et al. 2001], and we used the SimpleScalar simulator for the SimpleScalar processor. We configured each simulator to output an instruction trace. We also wrote a tool that parses each binary and outputs a listing of all the loops and subroutine locations. We then created a tool, called LOOAN [Villarreal et al. 2001], that reads the loop/subroutine file, and then processes the instruction trace, maintaining a wide variety of statistics on loop/subroutine behavior, including the number of visits to the loop/subroutine and the number of iterations per loop, keeping minimums, maximums, and averages of these numbers. While this tool was quite useful in isolating critical loops, the trace files it processes are rather large, exceeding several gigabytes in some cases. Such size not only results in long run times for LOOAN, but can also exceed available disk space. Thus, we plan to update the instruction-set simulators to keep LOOAN's statistics during runtime (related work at UCR has already resulted in integration of LOOAN with the Simics simulator [Werner and Magnusson 1997]).
Table I indicates the extent to which most execution time is spent in just a few small loops. Many people refer to this phenomenon as the "80-20 rule" or the "90-10 rule," wherein software spends 90% of the time in 10% of the code. From the averages at the bottom of the table, we see that the phenomenon in these benchmarks results in what we might call a "50-2 rule" (about 50% of the time is spent in 2% of the code) or an "80-3 rule" (about 80% of the time is spent in 3% of the code).
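The kind of trace processing described above can be sketched as follows. This is not LOOAN's implementation; the trace format (pairs of program counter and cycle count), the loop table (name mapped to an inclusive address range), and all names are assumptions made for illustration. The sketch streams over the trace once, so the trace never needs to be held in memory, and reports the percentage of total cycles spent inside each loop.

```python
from collections import defaultdict


def profile_loops(trace, loops):
    """Percentage of total cycles spent in each loop.

    trace -- iterable of (pc, cycles) pairs, one per executed instruction
             (hypothetical format)
    loops -- dict mapping a loop name to its (start, end) inclusive
             address range (hypothetical format)
    Nested loops are each credited with the cycles of their shared addresses.
    """
    total = 0
    per_loop = defaultdict(int)
    for pc, cycles in trace:
        total += cycles
        for name, (start, end) in loops.items():
            if start <= pc <= end:
                per_loop[name] += cycles
    if total == 0:
        return {}
    return {name: 100.0 * per_loop[name] / total for name in loops}


# Tiny hypothetical trace: 6 of 20 cycles fall inside loop "L1".
loops = {"L1": (0x100, 0x10F)}
trace = [(0x100, 2), (0x104, 2), (0x200, 4), (0x108, 2), (0x300, 10)]
print(profile_loops(trace, loops))
```

Keeping such counters inside the simulator itself, rather than writing a trace to disk, is one way to avoid the multi-gigabyte trace files mentioned above.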
The main observation we might make from these data is that most of our speedup will come from speeding up just one to three very small loops. This observation implies that an extensive solution space search during hardware/software partitioning is not necessary. This observation may also imply that only a small amount of hardware may be necessary to gain speedups of 2 to 3, and that in some cases, extremely high speedups may be possible with that small amount of hardware. With such speedups, we may also find good energy savings.
Partitioning Approach
To examine different hardware/software partitions of the applications, we began by ordering the critical loops according to their percentage of total time, with the loops labeled L1, L2, and so on. We then created a version of the application with all the critical loops moved to hardware. We modeled the hardware using a synthesizable register-transfer level VHDL process [Synopsys] . Each process described a finite-state machine, where we scheduled C-level statements into a minimal number of states.
When multiple loops from the application were synthesized, we included them as substates of a single state machine, so that when synthesized they would share hardware. Such sharing was possible because we were guaranteed that the loops would not execute concurrently, since they came from sequential software.
For our microprocessor/configurable-logic experiments, our target architecture was based on that in Figure 1, which is similar to the architecture found in Triscend's single-chip microprocessor/configurable-logic platforms [Triscend Corporation]. Unlike Triscend's platforms, we assume the configurable system logic (CSL) is a master of the bus. CSL bus mastering allows for more flexible CSL memory accesses compared to a DMA. However, a DMA can generally handle block accesses more efficiently and allows for parallel execution of the processor and CSL. The added flexibility of giving the CSL the option of using a DMA or directly accessing the bus allows for hardware to be implemented for almost any software region. Communication between the microprocessor and CSL takes place via shared memory and several direct signals. Some of this shared memory is implemented in registers in the CSL, which have a direct connection to the hardware. The processor can efficiently communicate with the hardware by writing or reading from these registers. Due to a variety of types of on-chip configurable logic (FPGA, CPLD, and PLA), we use the more general term CSL to refer to all of these.
For the core-based ASIC experiments, we use a simpler architecture shown in Figure 2. This model shares a memory between the microprocessor and custom hardware, but does not require a DMA component because the execution of the processor and custom logic is guaranteed not to overlap. The simplified architecture also has less interconnect between the microprocessor and custom logic because of the absence of the CSL.
We implemented each partitioning by replacing the selected software regions with a handshaking routine. The software would activate the custom hardware (either in the CSL or on the ASIC) using a start signal. For the microprocessor, such activation consists of simply setting a memory-mapped register with a direct connection to the hardware. The microprocessor then enters a low-power state while waiting for the hardware, achieved by setting a bit in a special function register. The processor remains in that state until the hardware asserts an interrupt, thereby waking up the processor. While the software partition is running, the hardware partition enters a low-power idle state. Waking up the processor from its low-power state requires from a few cycles to a few dozen, depending on the processor. In either case, these cycles are only expended after a loop in hardware completes (and not on every iteration), and thus are generally negligible compared to the thousands of cycles for the loop execution.
Currently, we are only considering the situation where the microprocessor and CSL execute in mutual exclusion. Mutually exclusive execution simplifies the architecture because the microprocessor and CSL will never access memory at the same time. In fact, because we are partitioning a sequential program, there would likely be little benefit to parallelizing the execution of the hardware and software partitions.
SPEEDUP AND ENERGY SAVINGS FOR MICROPROCESSOR/CSL DEVICES
Estimation Based
In Section 2.2, we discussed our method for determining software cycles, utilizing instruction-set simulators. To determine the software cycles for a partitioned design, we replaced the software loop with the required handshaking behavior. To determine the hardware cycles for a loop, we pessimistically assumed the longest path through the loop body was taken for every loop iteration. Such an assumption enabled us to avoid having to get every example simulating perfectly, which is not necessary to estimate the speedup and energy improvements. Thus, the actual speedups and energy savings from partitioning may be slightly better than what we report. Speedup data are shown in Table II for the benchmarks on which we performed estimation. Cycles_orig represents the clock cycles for the entire application executing on a microprocessor. For the critical loops, Cycles_sw represents the cycles those loops account for on the microprocessor, while Cycles_hw represents the number of clock cycles needed for those loops to execute in the CSL hardware. We see that the speedups (Sp.) range from almost none (1.1) to 12.9, averaging 3.2. For all examples, the MIPS ran at 100 MHz and the 8051 ran at 25 MHz. The custom hardware in the CSL was run at the maximum possible clock frequency reported by CSL tools after synthesis, place and route. These frequencies are specified in the column labeled Clk_hw.
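The relationship between the cycle counts and the reported speedup can be sketched as below. The function and parameter names are hypothetical, and the sketch assumes the processor and the CSL never run at the same time, so their times simply add. The handshaking cost is exposed as an optional parameter and defaults to zero.

```python
def pessimistic_hw_cycles(iterations, longest_path_cycles):
    """Hardware cycles for a loop, assuming every iteration takes the
    longest path through the loop body."""
    return iterations * longest_path_cycles


def estimated_speedup(cycles_orig, cycles_sw, cycles_hw,
                      f_sw_hz, f_hw_hz, handshake_cycles=0):
    """Speedup of the partitioned design over the software-only design.

    cycles_orig      -- processor cycles for the whole application
    cycles_sw        -- processor cycles spent in the critical loops
    cycles_hw        -- hardware cycles for those loops after partitioning
    f_sw_hz, f_hw_hz -- processor and hardware clock frequencies
    handshake_cycles -- extra processor cycles for handshaking (assumed input)
    """
    time_orig = cycles_orig / f_sw_hz
    time_sw = (cycles_orig - cycles_sw + handshake_cycles) / f_sw_hz
    time_hw = cycles_hw / f_hw_hz
    return time_orig / (time_sw + time_hw)


# Hypothetical numbers: 80% of 1M cycles in critical loops on a 100 MHz
# processor, replaced by 100k hardware cycles at 50 MHz.
hw_cycles = pessimistic_hw_cycles(iterations=10_000, longest_path_cycles=10)
print(estimated_speedup(1_000_000, 800_000, hw_cycles, 100e6, 50e6))
```

Because the hardware clock may differ from the processor clock, the comparison has to be made in seconds rather than in raw cycle counts.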
We used Xilinx's Virtex Power Estimator [Virtex] to estimate the power of the CSL executing the critical loops of each example, for a 0.18 µm FPGA technology at 1.8 V (in particular, for the XCV50E), shown in Table III as P_hw. We must also consider that the idle microprocessor continues to consume power while the CSL is active. Through physical measurement of Triscend's parts, we determined the idle microprocessor to consume 85% of the power of its active state, so we use 85% throughout the experiments.
We used typical power values for a commercial MIPS processor [MIPS Technologies] in 0.18 µm CMOS at 1.8 V, shown as P_sw, for the microprocessor active state. However, as above, we must also consider the idle power of the CSL while the processor is active, which we found through experimentation with Triscend parts to be about 12.5% of the CSL active power.
When either the microprocessor or CSL is active, we must also consider the power of interconnect and memory, P_i, which we obtained through physical measurement on Triscend parts and used throughout the experiments. Thus, we developed the following equation for total system energy E:
$$E = \mathit{Time}_{sw} \times (P_{sw} + 0.125 \times P_{hw} + P_i) + \mathit{Time}_{hw} \times (0.85 \times P_{sw} + P_{hw} + P_i)$$
Time_sw is the number of cycles the microprocessor is active times the microprocessor clock period. We multiply Time_sw by the system power consumed while the microprocessor is active. Similarly, Time_hw is the number of cycles the CSL is active times the CSL clock period, which is then multiplied by the system power consumed while the CSL is active.
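The energy equation above translates directly into code. The sketch below uses the idle fractions stated in the text (85% for the idle microprocessor, 12.5% for the idle CSL) as defaults; the function name, parameter names, and the example power values are hypothetical.

```python
def system_energy(cycles_sw, f_sw_hz, cycles_hw, f_hw_hz,
                  p_sw, p_hw, p_i,
                  idle_sw_fraction=0.85, idle_hw_fraction=0.125):
    """Total system energy in joules for one execution of the task.

    cycles_sw, f_sw_hz -- active processor cycles and processor clock (Hz)
    cycles_hw, f_hw_hz -- active CSL cycles and CSL clock (Hz)
    p_sw, p_hw, p_i    -- active processor, active CSL, and
                          interconnect/memory power in watts
    """
    time_sw = cycles_sw / f_sw_hz   # seconds the processor is active
    time_hw = cycles_hw / f_hw_hz   # seconds the CSL is active
    power_while_sw = p_sw + idle_hw_fraction * p_hw + p_i
    power_while_hw = idle_sw_fraction * p_sw + p_hw + p_i
    return time_sw * power_while_sw + time_hw * power_while_hw


# Hypothetical power values (watts) and cycle counts.
energy = system_energy(cycles_sw=200_000, f_sw_hz=100e6,
                       cycles_hw=100_000, f_hw_hz=50e6,
                       p_sw=1.0, p_hw=0.8, p_i=0.2)
print(energy)
```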
• G. Stitt et al. Energy results are shown in Table III . E orig is the energy for the unpartitioned, software-only application, while E sw/hw is the energy after partitioning. In general, we see modest energy savings, averaging 34%, due to the lack of major speedups. These speedups were obtained using an equivalent of 10,507 logic gates on average.
Physical Measurement Based
We obtained two single-chip microprocessor/CSL devices from Triscend: an E5 and an A7. The E5 contained an accelerated 8051 8-bit microcontroller, which used only four clock cycles per instruction byte instead of the typical 12 cycles per instruction, and a 40,000 gate equivalent CSL. The clock frequency for both the 8051 and the CSL was 25 MHz. The A7 contained an ARM7 32-bit microprocessor plus a 40,000 gate equivalent CSL, both of which were clocked at 40 MHz. We used these parts to determine energy and speedup through physical measurement rather than estimation. We connected a digital multimeter to each device to measure current, and we used the timer available on our workstation to measure time. By multiplying current with voltage, we obtained power, which could be multiplied by our time measurements to obtain energy.
Results for the benchmarks we implemented on the A7 and E5 are shown in Table IV. As getting complete working implementations was rather time consuming, we obtained results for a subset of the benchmarks. We see good speedups and energy savings. We also see that our estimated results were reasonably accurate, and that our energy estimates were perhaps even a bit conservative.
SPEEDUP AND ENERGY SAVINGS FOR CORE-BASED ASICs
The results given in the previous section utilize prefabricated single-chip microprocessor/CSL devices. We now describe estimated results if the critical loops could be implemented in custom hardware alongside a microprocessor core on a single ASIC. We utilized Synopsys synthesis and power estimation tools to obtain estimates for a 0.18 µm library similar to that used for the microprocessor.
The speedup results of hardware/software partitioning in an ASIC design are shown in Table V. Average speedup increased to 4.0, due to higher clock frequencies for the hardware partitions. Energy savings are shown in Table VI. The energy savings improve to nearly 50%, primarily because the hardware partition consumes nearly an order of magnitude less power in an ASIC than in CSL. The core-based ASIC required, on average, only 5738 gates to implement the hardware partition.
VOLTAGE SCALING
Dynamic power consumption in CMOS designs is proportional to the supply voltage squared. Therefore, voltage scaling can be effective at reducing energy because lowering the voltage results in a quadratic reduction in power. Lowering voltage also increases the delay of the critical path, resulting in a slower clock and decreased performance. Due to the decreased performance, voltage scaling is typically performed dynamically during nonperformance-critical parts of an application. An example of a voltage-scalable processor is the Intel XScale [Intel], having the capability to reduce the clock and voltage dynamically in order to reduce power at the expense of performance. Hardware/software partitioning introduces a new possibility for voltage scaling. Due to the speedups achieved from the custom hardware, we can reduce voltage until the performance matches the original software, while consuming much less energy due to the approximately quadratic reduction in power.
We estimated the energy savings from voltage scaling using the following formulae [Gonzalez et al. 1997] :
where T is the delay of the critical path, V is the supply voltage, V t is the threshold voltage, and k is a design-dependent constant.
Using the previous formulae, we are able to derive an equation for determining the clock frequency at a given supply voltage:
where F is the clock frequency.
We first determine how much we can reduce the clock in order to match the performance of the software-only design. We then estimate the delay of the critical path by using the maximum clock frequency. Using this delay, we can determine k. We use a threshold voltage of 0.8 V. With this information, we can determine the minimum supply voltage that allows the design to run at the reduced clock speed. Once we have determined this voltage, we can estimate power for the voltage-scaled system in the following way:
$$P_o = a C V_o^2 F_o, \qquad P = a C V^2 F = P_o \frac{V^2 F}{V_o^2 F_o}$$
where P_o, V_o, and F_o are the power, voltage, and clock frequency of the system before voltage scaling, C is the capacitance of the system and a is the switching frequency. P, V, and F are the power, voltage, and clock frequency after voltage scaling. Therefore, we are estimating the power of the voltage-scaled system by using the power of the original system and the voltages and clock frequencies of both systems.
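The procedure above can be sketched numerically. The delay model used here, T = kV/(V - V_t)^2 and hence F = (V - V_t)^2/(kV), is an assumption: it is a common form of the alpha-power law that matches the symbols defined in the text, but the exponent of 2 and all function names and example numbers are ours. Under that assumption, k follows from the maximum clock frequency at the original voltage, and the minimum voltage for a reduced clock is the larger root of a quadratic in V.

```python
import math


def delay_constant(v_o, v_t, f_max_hz):
    """k from F = (V - Vt)^2 / (k * V), evaluated at the original voltage.
    The delay model itself is an assumption (see lead-in)."""
    return (v_o - v_t) ** 2 / (v_o * f_max_hz)


def min_voltage(k, v_t, f_target_hz):
    """Smallest supply voltage above Vt that still reaches f_target_hz.

    Solves (V - Vt)^2 = k * F * V, that is,
    V^2 - (2*Vt + k*F)*V + Vt^2 = 0, and takes the root above Vt.
    """
    b = 2.0 * v_t + k * f_target_hz
    return (b + math.sqrt(b * b - 4.0 * v_t * v_t)) / 2.0


def scaled_power(p_o, v_o, f_o_hz, v, f_hz):
    """P = P_o * (V^2 * F) / (V_o^2 * F_o)."""
    return p_o * (v * v * f_hz) / (v_o * v_o * f_o_hz)


# Hypothetical design: 1.8 V supply, 0.8 V threshold, 100 MHz maximum clock,
# slowed to 50 MHz because the partitioned design has a speedup of 2.
k = delay_constant(v_o=1.8, v_t=0.8, f_max_hz=100e6)
v = min_voltage(k, v_t=0.8, f_target_hz=50e6)
p = scaled_power(p_o=1.0, v_o=1.8, f_o_hz=100e6, v=v, f_hz=50e6)
print(v, p)
```

Running the scaled design at the original software-only execution time means the energy ratio equals the power ratio, which is why the power estimate alone determines the additional savings.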
Results for voltage scaling are shown in Table VII. PSR is the percent speed reduction in order to match the original software. Clk_VS is the lower clock frequency used to achieve the required speed reduction. The 8051 had an original clock speed of 25 MHz and the MIPS had a clock of 100 MHz. V_VS is the voltage after voltage scaling. All examples originally used a supply voltage of 1.8 V. Power is the original power of the system. Power_VS is the power after voltage scaling. Energy is the amount of energy for each example on the original system. Energy_VS is the energy required after voltage scaling. E_sav is the energy savings achieved from voltage scaling.
We see that voltage scaling increases the energy savings by nearly an additional 14%, to 62%.
CONCLUSIONS
Our experiments demonstrate that significant performance and energy benefits can be obtained for a wide variety of real software applications by moving just a small amount of critical code to hardware, while in some cases the speedups and energy savings can be very large. One interesting conclusion is that increasingly popular single-chip microprocessor/configurable-logic platforms can yield big improvements over microprocessor-only platforms, using only a modest amount of hardware. A second conclusion is that good speedups can be obtained by moving just one to three small loops from software to hardware, for which extensive hardware/software exploration methods are not necessary. Thus, automation tools could conceivably be developed whose main tasks would be profiling followed by synthesis of critical loops into hardware, something becoming increasingly possible with the advent of C-based synthesis tools.
We plan in the future to analyze the impacts of various architecture features, such as microprocessor and hardware clock frequencies and power consumption, interconnect power, CSL power, available hardware size, and memory bandwidth, on obtaining speedups and energy savings. We also plan to investigate the benefits of moving additional loops to hardware after the initial critical loops have been moved.
