Abstract-Reconfigurable System-on-Chip (RSoC) devices incorporate various components, such as a processor core, reconfigurable logic, and memory, onto a single chip. They are being used to implement many wireless embedded systems, where energy efficiency is a major concern. When an application is synthesized on RSoCs, part of it can be executed using hardware implementations on the reconfigurable logic or software implementations on the processor core. In addition, the communication and reconfiguration costs between the tasks can significantly impact the overall system energy dissipation, depending on how the application is synthesized on the RSoC. In order to develop applications on RSoCs for energy efficiency, we propose a three-step design process in this paper. We develop (a) a performance model to abstract a general class of RSoC architectures for application development, (b) a mathematical formulation of the energy-efficient synthesis problem for a class of applications, and (c) a dynamic programming algorithm that minimizes the system energy dissipation. We illustrate our approach by implementing two beamforming applications on a state-of-the-art RSoC device. Beamforming is one of the key techniques for improving the capacity of wireless systems such as software defined radio. Compared with a greedy algorithm, a reduction in energy dissipation ranging from 41% to 54% is observed in our experiments.
I. INTRODUCTION
Reconfigurable System-on-Chips (RSoCs), which integrate processor core, reconfigurable logic, memory, etc., are becoming popular. This is driven by the advantages of programmable design solutions over application specific integrated circuits and a recent trend in integrating configurable logic, e.g., FPGA, and programmable processors, offering the "best of both worlds" on a single chip. An example of these RSoCs is the Triscend A7 CSoC device [19] , which integrates a 32-bit ARM7TDMI processor core with programmable logic, a memory subsystem, and a high-performance dedicated internal bus on a single chip. Another example is Xilinx Virtex-II Pro [21] , which integrates low-power IBM PowerPC 405LP processor(s), high density FPGA, on-chip memory as well as other dedicated hardware components.
In recent years, energy efficiency has become increasingly important in the design of various computation and communication systems. It is especially critical in battery operated embedded and wireless systems. RSoC architectures offer high efficiency with respect to time and energy performance. They have been shown to achieve energy reduction and increase in computational performance of at least one order of magnitude compared with traditional processors [2] . One important application of RSoCs is software defined radio (SDR). In SDR, dissimilar and complex wireless standards (e.g. GSM, IS-95, wideband CDMA) are processed in a single base station, where a large amount of data from the mobile terminals results in high computational requirement. The state-of-the-art RISC processors and DSPs are unable to meet the signal processing requirement of these base stations [6] , [7] . Minimizing the power consumption has also become a key issue for these base stations due to their high computation requirements that dissipate a lot of energy as well as the inaccessible and distributed locations of the base stations. RSoCs stand out as an attractive option for implementing various functions of SDR due to their high performance, high energy efficiency, and reconfigurability.
In the systems discussed above, the application is decomposed into a number of tasks. Each task is mapped onto different components of the RSoC device for execution. By synthesis, we mean finding a mapping that determines an implementation for the tasks. We can map a task to hardware implementations on reconfigurable logic, or software implementations using the embedded processor core. In addition, RSoCs offer many control knobs (see Section II for details), which can be used to improve energy efficiency. In order to better exploit these control knobs, a performance model of the RSoC architectures and algorithms for mapping applications onto these architectures are required. The RSoC model should allow for a systematic abstraction of the available control knobs and enable system-level optimization. The mapping algorithms should capture the parameters from the RSoC model, the communication costs for moving data between different components on RSoCs, and the configuration costs for changing the configuration of the reconfigurable logic. These communication and configuration costs are not negligible compared with the cost of computation. We show that a simple greedy mapping algorithm that maps each task onto either hardware or software, depending upon which dissipates the least amount of energy, does not always guarantee minimum energy dissipation in executing the application.
We propose a three-step design process to achieve energy efficient hardware/software co-synthesis on RSoCs. First, we develop a performance model that represents a general class of RSoC architectures. The model abstracts the various knobs that can be exploited for energy minimization during the synthesis process. Then, based on the RSoC model, we formulate a mapping problem for a class of applications that can be modeled as linear pipelines. Many embedded signal processing applications, such as the ones considered in the paper, are composed of such a linear pipeline of processing tasks. Finally, a dynamic programming algorithm is proposed for solving the above mapping problem. The algorithm is shown to be able to find a mapping that achieves minimum energy dissipation in polynomial time.
We synthesize two beamforming applications onto Virtex-II Pro to demonstrate the effectiveness of our design methodology. Virtex-II Pro is a state-of-the-art RSoC device from Xilinx. In this device, PowerPC 405 processor(s), reconfigurable logic, and on-chip memory are tightly coupled through on-chip routing resources [27]. The beamforming applications considered can be used in embedded sonar systems to detect the direction of arrival (DOA) of nearby objects. They can also be deployed at the base stations using SDR to exploit the limited radio spectrum [15].
The organization of the paper is as follows. Section II identifies the knobs for energy-efficient designs on RSoC devices. Section III discusses related work. Section IV describes the proposed RSoC model. Section V describes the class of linear pipeline applications we are targeting and formulates the energy-efficient mapping problem. Section VI presents our dynamic programming algorithm. Section VII illustrates the algorithm using two state-of-the-art beamforming applications. The modeling process and the energy dissipation results of implementing the two applications onto Virtex-II Pro are also given in this section. We conclude in Section VIII.
II. KNOBS FOR ENERGY-EFFICIENT DESIGNS
Various hardware and system level design knobs are available in RSoC architectures to optimize the energy efficiency of designs. For embedded processor cores, dynamic voltage scaling and dynamic frequency scaling can be used to lower the power consumption. The processor cores can be put into idle or sleep mode if desired to further reduce their power dissipation. For memory, the memory (SDRAM) on Triscend A7 CSoC devices can be changed to be in active, stand-by, disabled, or power-down state. Memory banking, which can be applied to the block-wise memory (BRAMs) in Virtex-II Pro, is another technique for low power designs. In this technique, the memory is split into banks and is selectively activated based on the use.
For reconfigurable logic, there are knobs at two levels that can be used to improve energy efficiency of the designs: low level and algorithm level.
Low-level knobs refer to knobs at the register-transfer or gate level. For example, Xilinx exposes the various features on their devices to designers through the unisim library [22] . One low-level knob is clock gating, which is employed to disable the clock to blocks to save power when the output of these blocks is not needed. In Virtex-II Pro, it can be realized by using primitives such as BUFGCE to dynamically drive a clock tree only when the corresponding block is used [27] . Choosing hardware bindings is another low-level knob. A binding is a mapping of a computation to a specific component on RSoC. Alternative realizations of a functionality using different components on RSoC result in different amounts of energy dissipation for the same computation. For example, there are three possible bindings for storage elements in Virtex-II Pro, which are registers, slice based RAMs, and embedded Block RAMs (BRAMs). The experiments by Choi et al. [5] show that registers and slice based RAMs have better energy efficiency for implementing small amount of storage while BRAMs have better energy efficiency for implementing large amount of storage.
Algorithm-level knobs refer to knobs that can be used during the algorithm development to reduce energy dissipation. It has been shown that energy performance can be improved significantly by optimizing a design at the algorithm level [14]. One algorithm-level knob is architecture selection. It plays a major role in determining the amount of interconnect and logic to be used in the design and thus affects the energy dissipation. For example, matrix multiplication can be implemented using a linear array or a 2-D array. A 2-D array uses more interconnects and can result in more energy dissipation compared with a 1-D array. Another algorithm-level knob is algorithm selection. An application can be mapped onto reconfigurable logic in several ways by selecting different algorithms. For example, when implementing FFT, a radix-4 based algorithm would significantly reduce the number of complex multiplications that would otherwise be needed if a radix-2 algorithm is used. Other algorithm-level knobs are parallel processing and pipelining.
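To make the algorithm selection example concrete, the following minimal sketch compares the number of complex multiplications of a radix-2 and a radix-4 FFT. It uses the standard textbook operation counts, which count every twiddle-factor multiplication (including trivial ones); the function names are illustrative and are not taken from any design in this paper.

```python
def _log2_exact(n):
    """Return log2(n) for a power of two n; raise ValueError otherwise."""
    if n < 1 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two")
    return n.bit_length() - 1


def radix2_complex_mults(n):
    """Textbook count for an n-point radix-2 FFT: (n/2) * log2(n)."""
    return (n // 2) * _log2_exact(n)


def radix4_complex_mults(n):
    """Textbook count for an n-point radix-4 FFT: (3n/4) * log4(n)."""
    stages2 = _log2_exact(n)
    if stages2 % 2 != 0:
        raise ValueError("n must be a power of four")
    return (3 * n // 4) * (stages2 // 2)


if __name__ == "__main__":
    n = 1024
    print(n, radix2_complex_mults(n), radix4_complex_mults(n))
```

Under these counts, the radix-4 algorithm needs three quarters of the multiplications of the radix-2 algorithm for the same transform size; the actual energy saving of a hardware design also depends on the other knobs discussed in this section.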
As reconfigurable architectures are becoming domain-specific and integrate reconfigurable logic with a mix of resources, such as the ASMBL (Application Specific Modular Block) architecture proposed by Xilinx [18], more control knobs are expected to be available on RSoCs in the future.
III. RELATED WORK
Gupta et al. [9] and Xie et al. [25] have considered the hardware/software co-design problem in the context of reconfigurable architectures. They use techniques such as configuration prefetching, to minimize the execution time. Energy efficiency is not addressed by their research.
Experiments for re-mapping of critical software loops from a microprocessor to hardware implementations using configurable logic are carried out by Villarreal et al. [26]. Significant energy savings are achieved for a class of applications. However, a systematic technique that finds the optimal hardware and software implementations of these applications is not addressed. Such a systematic technique is a focus of this paper.
A hardware-software bipartitioning algorithm based on network flow techniques for dynamically reconfigurable systems has been proposed by Rakhmatov et al. [16]. While their algorithm can be used to minimize the energy dissipation, designs on RSoCs are more complicated than a hardware-software bipartitioning problem due to the many control knobs discussed in the previous section.
A C-to-VHDL high-level synthesis framework is proposed by Gupta et al. [9]. The input to their design flow is C code and they employ a set of compiler transformations to optimize the resulting designs. However, generic HDL description is usually not enough to achieve best performance as the recent
In the second implementation, this is accomplished by directly controlling the related low-level knobs through the unisim library. In the third implementation, we use the IP core for multiplication from Xilinx. Low-level knobs and the device specific design constraints are already applied for performance optimization during the generation of these IP cores. The maximum operating frequencies of these implementations are shown in Table I. The implementation using the IP core has by far the highest maximum operating frequency. The reason for this performance improvement is that the specific locations of the embedded multipliers require appropriate connections between the multipliers and the registers around them. Use of appropriate location and timing constraints, as in the generation of the IP cores, leads to improved performance when using these multipliers [1]. It is expected that such constraint files and vendor IP cores will also have a significant impact on the energy efficiency of the designs. Therefore, in contrast to Gupta et al.'s approach, we consider task graphs as input to our design flow. In our approach, the energy efficiency of the designs is improved by making use of the various control knobs on the target device and the available parameterized implementations of the tasks.
System level tools are becoming available to synthesize applications onto architectures composed of both hardware and software components. Xilinx offers Embedded Development Kit (EDK) that integrates hardware and software development tools for Xilinx Virtex-II Pro [21] . In this design environment, the portion of the application to be synthesized onto software is described using C/C++ and is compiled using GNU gcc. The portion of the application to be executed in hardware is described using VHDL/Verilog and is compiled using Xilinx ISE. In the Celoxica DK2 tool [3] , Handel-C (C with additional hardware description) is used for both hardware and software designs. Then, the Handel-C compiler synthesizes the hardware and software onto the device. While these system level tools provide high level languages to describe applications and map them onto processors and configurable hardware, none of them address synthesis of energy efficient designs.
IV. PERFORMANCE MODELING OF RSOC ARCHITECTURES
An abstraction of RSoC devices is proposed in this section. A model for Virtex-II Pro is developed to illustrate the modeling process.
A. RSoC Model
In Figure 1, the RSoC model consists of four components: a processor, a reconfigurable logic (RL) such as FPGA, a memory, and an interconnect. There are various implementations of the interconnect. For example, in Triscend CSoC [19], the interconnect between the ARM7 processor and the SDRAM is a local bus while the interconnect between the SDRAM and the configurable system logic is a dedicated data bus and a dedicated address bus. In Virtex-II Pro [27], the interconnect between the PowerPC processor and the RL is implemented using the on-chip routing resource. We abstract all these buses and connections as an interconnect with (possibly) different communication time and energy costs between different components. We assume that the memory is shared by the processor and the RL. Since the operating state of the interconnect depends on the operating states of the other components, an operating state of the RSoC device, denoted as a system state, is determined only by the operating states of the processor, the RL, and the memory. Let $S$ denote the set of all possible system states. Let $PS(s)$, $RS(s)$ and $MS(s)$ be functions of a system state $s$, $s \in S$. The outputs of these functions are integers that represent the operating states of the processor, the RL and the memory, respectively. An operating state of the processor corresponds to the state in which the processor is idle or is operating with a specific power consumption. Suppose that an idle mode and dynamic voltage scaling with $v - 1$ voltage settings are available on the processor. The processor is assumed to operate at a specific frequency for each of the voltage settings. Then, the processor has $v$ operating states, $0 \le PS(s) \le v - 1$, with $PS(s) = 0$ being the state in which the processor is in the idle mode. The RL is idle when there is no input data and it is clock gated without switching activity on it.
Thus, when the RL is loaded with a specific configuration, it can be in two different operating states depending on whether it is idle or processing the input data. Suppose that there are $c$ configurations for the RL; then the RL has $2c$ operating states, $0 \le RS(s) \le 2c - 1$. We number the operating states of the RL such that (a) for $0 \le RS(s) \le c - 1$, $RS(s)$ is the state in which the RL is idle, loaded with configuration $RS(s)$; (b) for $c \le RS(s) \le 2c - 1$, $RS(s)$ is the state in which the RL is operating, loaded with configuration $RS(s) - c$. Each power state of the memory corresponds to an operating state. For example, when memory banking is used to selectively activate the memory banks, each combination of the activation states of the memory banks represents an operating state of the memory. Suppose that the memory has $m$ operating states; then $0 \le MS(s) \le m - 1$. The operating state of the interconnect is related to the operating states of the other three components. Considering the above, the total number of distinct system states is $2vcm$.
The application is modeled as a collection of tasks with dependencies (see Section V-A for details). Suppose that task $i'$ is to be executed immediately preceding task $i$. Also, suppose that task $i'$ is executed in system state $s'$ and task $i$ is executed in system state $s$. If $s' \neq s$, a system state transition is required. The transition between different system states incurs a certain amount of energy. Our model consists of the following parameters:
• $\Delta EV_{PS(s'),PS(s)}$: energy dissipation in the processor for transition from state $PS(s')$ to state $PS(s)$
• $\Delta EC_{RS(s'),RS(s)}$: energy dissipation in the RL for transition from state $RS(s')$ to state $RS(s)$
• $\Delta EM_{MS(s'),MS(s)}$: energy dissipation for changing the memory state from $MS(s')$ to $MS(s)$
• $IP$: processor power consumption in the idle state
• $IR$: RL power consumption in the idle state
• $PM_{MS(s)}$: memory power consumption in state $MS(s)$
• $MP_{MS(s)}$: average energy dissipation for transferring one bit of data between the memory and the processor when the memory is in state $MS(s)$
• $MR_{MS(s)}$: average energy dissipation for transferring one bit of data between the memory and the RL when the memory is in state $MS(s)$
The system state transition costs depend not only on the source and destination system states of the transition but also on the requirement of the application. Let $\Delta_{i',i,s',s}$ be the energy dissipation for such a system state transition. $\Delta_{i',i,s',s}$ can be calculated as
$$\Delta_{i',i,s',s} = \Delta EV_{PS(s'),PS(s)} + \Delta EC_{RS(s'),RS(s)} + \Delta EM_{MS(s'),MS(s)} + \Delta A_{i',i} \qquad (1)$$
where $\Delta A_{i',i}$ is the additional cost for transferring data from task $i'$ to task $i$ in a given mapping. For this given mapping, $\Delta A_{i',i}$ can be calculated based on the application models discussed in Section V-A and the communication costs $MP_{MS(s)}$ and $MR_{MS(s)}$.

B. A Model for Virtex-II Pro
There are four components to be modeled in Virtex-II Pro. One is the embedded PowerPC core. Due to the limitations in measuring the effects of frequency scaling, we assume that the processor has only two operating states, On and Off, and operates at a specific frequency when it is On. Thus, $v = 2$. We ignore $IP$ since the PowerPC processor does not draw any power if it is not used in a design. $\Delta EV_{0,1}$ and $\Delta EV_{1,0}$ are also ignored since changing the processor state dissipates a negligible amount of energy compared with performing computation. Two partial reconfiguration methods, module based and small bit manipulation based [20], are available on the RL of Virtex-II Pro. For the small bit manipulation based partial reconfiguration, switching the configuration of one module on the device to another configuration requires downloading the difference between the configuration files for this module. This is different from the module based approach, which requires downloading the entire configuration file for the module. Thus, the small bit manipulation based partial reconfiguration has relatively low latency and energy dissipation compared with the module based one. Therefore, we use the small bit manipulation based method. We estimate the reconfiguration cost as the product of the number of slices used by the implementation and the average cost for downloading data for configuring one FPGA slice. According to [21], the energy for reconfiguring the entire Virtex-II Pro XC2VP20 device is calculated as follows. Let $ICC_{Int}$ denote the current for powering the core of the device. We assume that the current for configuring the device is mainly drawn from $ICC_{Int}$. From the data sheet [27], $ICC_{Int}$ = 500 mA at 1.5 V during configuration and $ICC_{Int}$ = 300 mA at 1.5 V during normal operation, so the reconfiguration power is estimated as (500 − 300) mA × 1.5 V = 300 mW. The time for reconfiguring the entire device using SelectMAP (50 MHz) is 20.54 ms.
Thus, the energy for reconfiguring the entire device is 300 mW × 20.54 ms = 6162 µJ. There are 9280 slices on the device. Together with the slice usage from the post place-and-route report generated by the Xilinx ISE tool [21], we estimate the energy dissipation of reconfiguration as $\Delta EC_{RS(s'),RS(s)}$ = 6162 × (total number of slices used by the RL in operating state $RS(s)$) / 9280 µJ. The quiescent power is the static power dissipated by the RL when it is on. This power cannot be optimized at the system level if we do not power the RL on and off. Thus, it is not considered in this paper. Since $IR$ represents the quiescent power, it is set to zero. We also ignore the energy dissipation for enabling/disabling clocks to the design blocks on the RL in the calculation of $\Delta EC_{RS(s'),RS(s)}$ since it is negligible compared with the other energy costs. For memory modeling, we use BRAM. It has only one available operating state, so $m = 1$ and $MS(s) = 0$. Since the memory does not change its state, $\Delta EM_{MS(s'),MS(s)} = 0$. The BRAM dissipates a negligible amount of energy when there is no memory access. We ignore this value so that $PM_0 = 0$. Using the power model from [5] and [23], the energy dissipation $MR_0$ is estimated as 42.9 nJ/Kbyte. The communication between the processor and the memory follows certain protocols on the bus, which are specified by the vendor. Its energy efficiency differs depending on the bus protocol used. The energy cost $MP_0$ is measured through low-level simulation.
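The reconfiguration cost estimate above can be written out as a short calculation. The following is a minimal sketch that uses only the numbers quoted in this section; the constant and function names are illustrative, and the slice count passed in is whatever the place-and-route report gives for a design.

```python
# Values quoted in the text for Virtex-II Pro XC2VP20.
ICC_INT_CONFIG_MA = 500.0      # core current during configuration (mA)
ICC_INT_NORMAL_MA = 300.0      # core current during normal operation (mA)
VCC_INT_V = 1.5                # core supply voltage (V)
FULL_RECONFIG_TIME_MS = 20.54  # SelectMAP at 50 MHz, entire device (ms)
TOTAL_SLICES = 9280            # slices on the device


def full_device_reconfig_energy_uj():
    """Energy for reconfiguring the entire device, in microjoules."""
    # mA * V gives mW; the extra current during configuration is 200 mA.
    power_mw = (ICC_INT_CONFIG_MA - ICC_INT_NORMAL_MA) * VCC_INT_V
    # mW * ms gives microjoules.
    return power_mw * FULL_RECONFIG_TIME_MS


def reconfig_energy_uj(slices_used):
    """Reconfiguration energy scaled by the number of slices a design uses."""
    if not 0 <= slices_used <= TOTAL_SLICES:
        raise ValueError("slice count out of range for this device")
    return full_device_reconfig_energy_uj() * slices_used / TOTAL_SLICES


if __name__ == "__main__":
    print(round(full_device_reconfig_energy_uj(), 1))
    # 1000 slices is a made-up usage figure for illustration only.
    print(round(reconfig_energy_uj(1000), 1))
```

The linear scaling by slice count is the approximation stated in the text; it does not model the actual size of the bitstream difference downloaded by small bit manipulation based reconfiguration.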
V. PROBLEM FORMULATION
A model for a class of applications with linear dependency constraints is described in this section. Then, a mapping problem is formulated based on both the RSoC model and the application model. 
A. Application Model
The application consists of a set of tasks, T 0 , T 1 , T 2 , · · · , T n−1 , with linear precedence constraints. T i must be executed before initiating T i+1 , i = 0, · · · , n − 2. Due to the precedence constraints, only one task is executed at any time. The execution can be on the processor, on the RL, or on both. There is data transfer between adjacent tasks. The transfer can occur between the processor and the memory or between the RL and the memory, depending on where the tasks are executed.
Fig. 2. A linear pipeline of tasks
The application model consists of the following parameters:
• $D^{i}_{in}$ and $D^{i}_{out}$: amount of data input from memory to task $T_i$ and data output from task $T_i$ to memory.
• $EP_{i,s}$ and $TP_{i,s}$: processor energy and time cost for executing task $T_i$ in system state $s$. $EP_{i,s} = TP_{i,s} = \infty$ if task $T_i$ cannot be executed in system state $s$.
• $ER_{i,s}$ and $TR_{i,s}$: RL energy and time cost for executing task $T_i$ in system state $s$. $ER_{i,s} = TR_{i,s} = \infty$ if task $T_i$ cannot be executed in system state $s$.
B. Problem Definition
We now formulate the problem based on the parameters of the RSoC model and the application model. In this paper, an energy efficient mapping is defined as the mapping that minimizes the overall energy dissipation for executing the application over all possible mappings.
During the execution of the application, a task can begin execution as soon as its predecessor task finishes the execution. Thus, for any possible system state s, the processor and the RL cannot be in idle state at the same time. The total number of possible system states is |S|= (2v−1)cm. Let the system states be numbered from 0 to (2v − 1)cm − 1. Then, depending on the sources of energy dissipation, we divide the system states into three categories:
• For $0 \le s \le (v-1)cm - 1$, $s$ denotes the system state in which the processor is in state $PS(s)$ ($1 \le PS(s) \le v-1$), the RL is in the idle state loaded with configuration $RS(s)$ and the memory is in state $MS(s)$. $PS(s)$, $RS(s)$ and $MS(s)$ are determined by solving the equation $s = (PS(s) - 1)cm + RS(s)m + MS(s)$.
• For $(v-1)cm \le s \le vcm - 1$, $s$ denotes the system state in which the processor is in the idle state ($PS(s) = 0$), the RL is operating with configuration $RS(s) - c$ ($c \le RS(s) \le 2c - 1$) and the memory is in state $MS(s)$. $PS(s)$, $RS(s)$ and $MS(s)$ are determined by solving the equation $s = (RS(s) - c)m + MS(s) + (v-1)cm$.
• For $vcm \le s \le (2v-1)cm - 1$, $s$ denotes the system state in which the processor is in state $PS(s)$ ($1 \le PS(s) \le v-1$), the RL is operating with configuration $RS(s) - c$ ($c \le RS(s) \le 2c - 1$) and the memory is in state $MS(s)$. $PS(s)$, $RS(s)$ and $MS(s)$ are determined by solving the equation $s = (PS(s) - 1)cm + (RS(s) - c)m + MS(s) + vcm$.
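The three categories above define a one-to-one numbering of the $(2v-1)cm$ system states. As a minimal sketch, the equations can be solved for the component states with integer division; the function names `decode_state` and `encode_state` are illustrative and do not appear in the paper.

```python
def decode_state(s, v, c, m):
    """Return (PS, RS, MS) for system state index s.

    v: number of processor states, c: number of RL configurations,
    m: number of memory states, following the three categories in the text.
    """
    cm = c * m
    if not 0 <= s < (2 * v - 1) * cm:
        raise ValueError("system state index out of range")
    if s < (v - 1) * cm:
        # Category 1: processor active, RL idle.
        ps, rem = s // cm + 1, s % cm
        return ps, rem // m, rem % m
    if s < v * cm:
        # Category 2: processor idle, RL operating.
        rem = s - (v - 1) * cm
        return 0, rem // m + c, rem % m
    # Category 3: processor active, RL operating.
    rem = s - v * cm
    ps, rem = rem // cm + 1, rem % cm
    return ps, rem // m + c, rem % m


def encode_state(ps, rs, ms, v, c, m):
    """Inverse of decode_state: map component states to the index s."""
    cm = c * m
    if ps >= 1 and rs < c:
        return (ps - 1) * cm + rs * m + ms
    if ps == 0 and rs >= c:
        return (rs - c) * m + ms + (v - 1) * cm
    if ps >= 1 and rs >= c:
        return (ps - 1) * cm + (rs - c) * m + ms + v * cm
    raise ValueError("processor and RL cannot both be idle")


if __name__ == "__main__":
    # v = 2 and m = 1 as in the Virtex-II Pro model; c = 2 is illustrative.
    for s in range((2 * 2 - 1) * 2 * 1):
        print(s, decode_state(s, v=2, c=2, m=1))
```

The state in which both the processor and the RL are idle has no index, which matches the count $|S| = (2v-1)cm$ rather than $2vcm$.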
Let E i,s denote the energy dissipation for executing T i in state s. E i,s is the sum of the following:
• The energy dissipated by the processor or/and the RL that is/are executing T i ;
• If the processor or the RL is in the idle state, the idle energy dissipation of the component;
• The energy dissipated by the memory during the execution of T i .
The above three sources of energy dissipation are calculated as in Table II .
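We calculate the system state transition costs using Equation (1). Since a linear pipeline of tasks is considered, $i' = i - 1$. The energy dissipation for state transitions between the execution of two consecutive tasks $T_{i-1}$ and $T_i$, namely $\Delta_{i-1,i,s',s}$, is calculated as $\Delta EV_{PS(s'),PS(s)} + \Delta EC_{RS(s'),RS(s)} + \Delta EM_{MS(s'),MS(s)} + \Delta A_{i-1,i}$, where $\Delta A_{i-1,i}$ is the cost of transferring the data between $T_{i-1}$ and $T_i$ through the memory.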
Let s i denote the system state while executing T i under a specific mapping, 0 ≤ i ≤ n − 1. The overall system energy dissipation is given by
Now, the problem can be stated as: find a mapping of tasks to system states, that is, a sequence of s 0 , s 1 , · · · , s n−1 , such that the overall system energy dissipation given by Equation (2) is minimized.
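As a sketch of the objective, one form consistent with the definitions of $E_{i,s}$ and $\Delta_{i-1,i,s',s}$ above is given below. It assumes that only the transitions between consecutive tasks are charged, that is, that entering the state of the first task and leaving the state of the last task cost nothing; the symbol $E_{total}$ is introduced here for illustration.

$$E_{total} = \sum_{i=0}^{n-1} E_{i,s_i} + \sum_{i=1}^{n-1} \Delta_{i-1,\,i,\,s_{i-1},\,s_i}$$

If an initial configuration cost or a final shutdown cost has to be modeled, the corresponding boundary terms would be added to this sum.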
VI. ALGORITHM FOR ENERGY MINIMIZATION
We create a trellis according to the RSoC model and the application model. Based on the trellis, a dynamic programming algorithm is presented in Section VI-B.
A. Creation of the Trellis
A trellis is created as illustrated in Figure 3 . It consists of n + 2 steps, ranging from -1 to n. Each step corresponds to one column of nodes shown in the figure.
Step -1 and step n consist of only one node 0, which represents the initial state and the final state of the system. Step i, 0 ≤ i ≤ n−1, consists of |S| nodes, numbered from 0, 1, · · · , | S | −1, each of which represents the system state for executing task T i . The weight of node N s in step i is the energy cost E i,s for executing task T i in system state s. If task T i cannot be executed in system state s, then E i,s = ∞. Since node N 0 in step -1 and step n do not contain any tasks, E −1,0 = E n,0 = 0. There are directed edges (1) 
B. Dynamic Programming Algorithm
Based on the trellis, our dynamic programming algorithm is described below. We associate each node with a path cost $P_{i,s}$. Define $P_{i,s}$ as the minimum energy cost for executing $T_0, T_1, \cdots, T_i$ with $T_i$ executed in node $N_s$ in step $i$. Initially, $P_{-1,0} = 0$. Then, for each successive step $i$, $0 \le i \le n$, we calculate the path cost for all the nodes in the step. The path cost $P_{i,s}$ for node $N_s$ in step $i$ is calculated as
• For i = 0:
• For 1 ≤ i ≤ n − 1:
• For i = n:
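For reference, a recurrence that matches the definition of $P_{i,s}$ can be sketched as follows. It is written under two assumptions that are not stated in the text: the weight of an edge between steps $i-1$ and $i$ is the transition cost $\Delta_{i-1,i,s',s}$, and the edges leaving step $-1$ and entering step $n$ have zero weight.

$$P_{0,s} = P_{-1,0} + E_{0,s}$$

$$P_{i,s} = E_{i,s} + \min_{s' \in S} \left( P_{i-1,s'} + \Delta_{i-1,i,s',s} \right), \quad 1 \le i \le n-1$$

$$P_{n,0} = \min_{s' \in S} P_{n-1,s'}$$

With nonzero boundary costs, the first and last expressions would each gain the weight of the corresponding edge.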
Only one path cost is associated with node N 0 in step n. A path that achieves this path cost is defined as a surviving path. Using this path, we identify a sequence of s 0 , s 1 , · · · , s n−1 , which specifies how each task is mapped onto the RSoC device. From the above discussion, we have, Proposition 1: The mapping identified by a surviving path achieves the minimum energy dissipation among all the mappings.
Since we need to consider $O((2v-1)cm)$ possible paths for each node and there are $O((2v-1)cm \cdot n)$ nodes in the trellis, the time complexity of the algorithm is $O(v^2 c^2 m^2 n)$.
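The algorithm has the same structure as a shortest-path search over the trellis. The following minimal sketch illustrates it under the assumptions that the edges leaving step $-1$ and entering step $n$ carry no cost and that every step has the same number of states; the function name `min_energy_mapping` and the callback signature are illustrative choices, not part of the paper.

```python
import math


def min_energy_mapping(exec_energy, transition_energy):
    """Find a minimum-energy state sequence for a linear pipeline.

    exec_energy[i][s] is E_{i,s} (math.inf if task i cannot run in state s).
    transition_energy(i, s_prev, s) is the cost of moving from state s_prev
    of task i-1 to state s of task i.
    Returns (total_energy, [s_0, ..., s_{n-1}]).
    """
    n = len(exec_energy)
    num_states = len(exec_energy[0])
    cost = list(exec_energy[0])   # path costs for step 0
    back = []                     # back[i-1][s]: best predecessor of s at step i
    for i in range(1, n):
        new_cost, pointers = [], []
        for s in range(num_states):
            best_prev = min(
                range(num_states),
                key=lambda sp: cost[sp] + transition_energy(i, sp, s),
            )
            pointers.append(best_prev)
            new_cost.append(
                cost[best_prev]
                + transition_energy(i, best_prev, s)
                + exec_energy[i][s]
            )
        back.append(pointers)
        cost = new_cost
    last = min(range(num_states), key=lambda s: cost[s])
    total = cost[last]
    # Trace the surviving path backwards from the last task.
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return total, path


if __name__ == "__main__":
    # Made-up energies for three tasks and two states, for illustration only.
    energies = [[5.0, 1.0], [1.0, 5.0], [5.0, 1.0]]
    switch = lambda i, a, b: 0.0 if a == b else 10.0
    print(min_energy_mapping(energies, switch))
```

The two nested loops over states inside the loop over tasks give the quadratic dependence on the number of states and the linear dependence on $n$ stated above. A total of `math.inf` indicates that no feasible mapping exists.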
The configurations and the hardware resources are not reused between tasks in most cases, which means that the trellis constructed in Figure 3 is usually sparsely connected. Therefore, the following pre-processing can be applied to reduce the running time of the algorithm: (1) nodes with ∞ weight and the edges incident on these nodes are deleted from the trellis; (2) the remaining nodes within each step are renumbered. After this two-step pre-processing, we form a reduced trellis and the dynamic programming algorithm is run on the reduced trellis.
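The two-step pre-processing can be sketched as follows. The function name `reduce_trellis` and the returned index maps are illustrative; the index maps record which original system state each renumbered node corresponds to, so that a mapping found on the reduced trellis can be translated back.

```python
import math


def reduce_trellis(exec_energy):
    """Drop nodes with infinite weight and renumber the rest in each step.

    exec_energy[i][s] is E_{i,s}. Returns (reduced, index_maps), where
    reduced[i][k] is the weight of the k-th surviving node of step i and
    index_maps[i][k] is its original system state number.
    """
    reduced, index_maps = [], []
    for step in exec_energy:
        kept = [s for s, energy in enumerate(step) if math.isfinite(energy)]
        index_maps.append(kept)
        reduced.append([step[s] for s in kept])
    return reduced, index_maps


if __name__ == "__main__":
    # Made-up weights for two tasks and three states, for illustration only.
    weights = [[math.inf, 2.0, 3.0], [1.0, math.inf, math.inf]]
    print(reduce_trellis(weights))
```

Edges incident on a deleted node disappear implicitly because the node is no longer enumerated. After the reduction, different steps may contain different numbers of nodes, so a search over the reduced trellis has to iterate over the per-step node lists.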
VII. ILLUSTRATIVE EXAMPLES
To demonstrate the effectiveness of our approach, we implement a broadband delay-and-sum beamforming application and an MVDR (minimum-variance distortionless response) beamforming application on Virtex-II Pro, a state-of-the-art reconfigurable SoC device. These applications are widely used in many embedded signal processing systems and in software defined radio [7] .
A. Delay-and-Sum Beamforming
Using the model for Virtex-II Pro discussed in Section IV-B, implementing the delay-and-sum beamforming application is formulated as a mapping problem. This problem is then solved using the proposed dynamic programming algorithm.
1) Problem Formulation:
The task graph of the broadband delay-and-sum beamforming application [8] is illustrated in Figure 4. A cluster of seven sensors samples data. Each set of the sensor data is processed by an FFT unit and then all the data is fed into the beamforming application. The output is the spatial spectrum response, which can be used to determine the directions of nearby objects. The application calculates twelve beams and is composed of three tasks with linear dependences: calculation of the relative delay for different beams according to the positions of the sensors ($T_0$), computation of the frequency responses ($T_1$), and calculation of the amplitude for each output frequency ($T_2$). The data input and data output are performed via the I/O pads on Virtex-II Pro. The number of FFT points in the input data depends on the frequency resolution requirements. The number of output frequency points is determined by the spectrum of interest. The three tasks can be executed either on the PowerPC processor or on the RL. The amount of data input (D
Fig. 4. Task graph of the delay-and-sum beamforming application
We employ the algorithm-level control knobs discussed in Section II to develop various designs on the RL. There are many possible designs. For the sake of illustration, we implement two designs for each task. One of the main differences among these designs is the degree of parallelism, which affects the number of resources, such as I/O ports and sine/cosine look-up tables, used by the tasks. For example, one configuration of task $T_0$ handles two input data per clock cycle and requires more I/O ports than the other configuration that handles only one input per clock cycle. While the first configuration would dissipate more power and more reconfiguration energy than the second one, it reduces the latency to complete the computation.
Similarly, one configuration for task T 2 uses two sine/cosine tables and thus can generate the output in one clock cycle while the other configuration uses only one sine/cosine table and thus requires two clock cycles in order to generate the output.
Each task is mapped on the RL to obtain the $TR_{i,s}$ and $ER_{i,s}$ values. The designs for the RL were coded in VHDL and synthesized using XST (Xilinx Synthesis Tool) provided by Xilinx ISE 5.2.03i [21]. The VHDL code for each task is parameterized according to the application requirements, such as the number of FFT points, and the architectural control knobs, such as the precision of input data and the hardware binding for storing intermediate data. The utilization of the device resources was obtained from the place-and-route report files (.par files). To obtain the power consumption of our designs, the VHDL code was synthesized using XST for XC2VP20 and the place-and-route design files (.ncd files) were obtained. Mentor Graphics ModelSim 5.7 was used to generate the simulation results (.vcd files). The .ncd and .vcd files were then provided to Xilinx XPower [24] to obtain the average power consumption. $TR_{i,s}$ is calculated based on our designs running at 50 MHz with 16-bit precision. $ER_{i,s}$ is calculated based on both $TR_{i,s}$ and the power measurements from XPower.
For the PowerPC core on Virtex-II Pro, we developed C code for each task, compiled it using the gcc compiler for PowerPC, and generated the bitstream using the tools provided by the Xilinx Embedded Development Kit (EDK). We used the SMART model from Synopsys [17], which is a cycle-accurate behavioral simulation model for PowerPC, to simulate the execution of the C code. The data to be computed is stored in the BRAMs of Virtex-II Pro. The latencies for executing the C code are obtained directly by simulating the designs using ModelSim 5.7. The energy dissipation is obtained assuming a clock frequency of 300 MHz and the analytical expression for processor power dissipation provided by Xilinx [27]: 0.9 mW/MHz × 300 MHz = 270 mW. From these, we estimate the T P i,s and EP i,s values. Note that the quiescent power is ignored in our experiments, as discussed in Section IV-B.
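The processor energy estimate described above amounts to multiplying the 270 mW power figure by the simulated latency. The following is a minimal Python sketch of that arithmetic, not the authors' tooling. The 0.9 mW/MHz coefficient and the 300 MHz clock are taken from the text; the function names and any cycle count passed in are hypothetical placeholders.

```python
# Sketch of the software-side energy estimate: E = P * T.
# The coefficient and clock come from the text; cycle counts are
# placeholders to be replaced by values from cycle-accurate simulation.

MW_PER_MHZ = 0.9    # analytical power coefficient quoted from Xilinx [27]
CLOCK_MHZ = 300.0   # processor clock frequency assumed in the text


def processor_power_mw(clock_mhz=CLOCK_MHZ, mw_per_mhz=MW_PER_MHZ):
    """Dynamic power in mW (quiescent power is ignored, as in the text)."""
    return mw_per_mhz * clock_mhz


def task_energy_uj(cycles, clock_mhz=CLOCK_MHZ):
    """Energy in microjoules for a task that takes `cycles` clock cycles."""
    latency_us = cycles / clock_mhz                 # MHz = cycles per microsecond
    power_w = processor_power_mw(clock_mhz) / 1000.0
    return power_w * latency_us                     # W * us = uJ
```

For instance, a hypothetical task of 300,000 cycles would run for 1 ms at 300 MHz and, at 270 mW, dissipate about 270 µJ.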
Considering both the PowerPC and the FPGA, we have three system states for each of the three tasks on the reduced trellis after the pre-processing discussed in Section VI-B. Thus, 0 ≤ s ≤ 2. Table III shows the E i,s values for the three tasks when the number of input FFT points and the number of output frequency points are both 1024.
For simple designs, the values of the parameters discussed above can be obtained through low-level simulations. However, for complex designs with many possible parameterizations, such low-level simulation can be time-consuming, especially for designs on the RL. Using the domain-specific modeling technique proposed in [4] and the power estimation tool proposed by us in [13], it is possible to obtain rapid and fairly accurate system-wide energy estimates of data paths on the RL without performing time-consuming low-level simulation.
2) Energy Minimization: We create a trellis with five steps to represent this beamforming application. After the pre-processing discussed in Section VI-B, step -1 and step 3 contain one node each, while steps 0, 1, and 2 contain three nodes each on the reduced trellis. Using the values described above, we obtain the weights of all the nodes and edges in the trellis. Based on this, our dynamic programming based mapping algorithm is used to find the mapping that minimizes the overall energy dissipation. For the purpose of comparison, we consider a greedy algorithm that always maps each task to the system state in which executing the task dissipates the least amount of energy. The results are shown in Figure 5. For all the considered problem sizes, energy reductions ranging from 41% to 54% are achieved by our dynamic programming algorithm over the greedy algorithm.
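The difference between the two strategies can be sketched in Python as a shortest-path search over a trellis. This is an illustrative sketch under stated assumptions, not the authors' implementation: node weights stand for execution energy, edge weights for communication and reconfiguration energy, and the source step is folded into a per-state `entry` cost. All function names and numbers are hypothetical and are not taken from Table III.

```python
# Illustrative trellis search; all energies below are made-up values.
# node[i][s]    : energy of executing task i in system state s
# edge[i][p][s] : energy of moving from state p (task i) to state s (task i+1)
# entry[s]      : energy of reaching state s of the first task (e.g. initial
#                 configuration), standing in for the edges from the source node


def dp_mapping(entry, node, edge):
    """Return (minimum total energy, state chosen for each task)."""
    best = [entry[s] + node[0][s] for s in range(len(node[0]))]
    back = []
    for i in range(1, len(node)):
        prev, best, choices = best, [], []
        for s, e in enumerate(node[i]):
            # cheapest predecessor state for reaching state s of task i
            p = min(range(len(prev)), key=lambda q: prev[q] + edge[i - 1][q][s])
            best.append(prev[p] + edge[i - 1][p][s] + e)
            choices.append(p)
        back.append(choices)
    s = min(range(len(best)), key=best.__getitem__)
    total, path = best[s], [s]
    for choices in reversed(back):      # walk the back-pointers to the first task
        s = choices[s]
        path.append(s)
    return total, path[::-1]


def greedy_mapping(entry, node, edge):
    """Pick the cheapest execution state per task, ignoring edge costs."""
    path = [min(range(len(e)), key=e.__getitem__) for e in node]
    total = entry[path[0]] + sum(node[i][s] for i, s in enumerate(path))
    total += sum(edge[i][path[i]][path[i + 1]] for i in range(len(path) - 1))
    return total, path


# Hypothetical example: state 0 = processor, states 1 and 2 = RL designs.
ENTRY = [0.0, 300.0, 350.0]
NODE = [[50.0, 5.0, 9.0], [40.0, 6.0, 4.0], [30.0, 8.0, 3.0]]
EDGE = [
    [[1.0, 60.0, 80.0], [2.0, 60.0, 80.0], [2.0, 60.0, 80.0]],
    [[1.0, 20.0, 10.0], [2.0, 20.0, 10.0], [2.0, 20.0, 10.0]],
]
```

With these made-up weights, the greedy choice places the first task on the RL because its execution energy is lowest there and then pays the large entry cost, whereas the dynamic programming search keeps the first two tasks on the processor and moves only the last task to the RL.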
Consider the case where both the number of FFT points of the input data and the number of output frequency points are 2048. The greedy algorithm maps task T 0 on the RL, whereas the dynamic programming algorithm maps this task on the processor and thereby achieves a 54% reduction in overall energy dissipation. The reason for the energy reduction is as follows. Task T 0 is executed efficiently on the RL for both of the configuration files employed (ranging from 4.15 to 9.17 µJ). However, the configuration costs for these two files are high (ranging from 272.49 to 343.03 µJ) since task T 0 needs sine/cosine functions. The Xilinx FPGA provides the CORE Generator lookup table [27] to implement the sine/cosine functions. For 16-bit input and 16-bit output sine/cosine lookup tables, the single output design (sine or cosine) needs 50 slices and the double output design (both sine and cosine) needs 99 slices. Two and three sine/cosine look-up tables are used in the two designs employed for T 0, which increases the reconfiguration costs for this task. The computation energy dissipation of task T 0 is relatively small in this case, and thus the configuration energy cost significantly impacts the overall energy dissipation. Therefore, executing the task on the processor dissipates less energy than executing it on the RL.
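The dominance of the configuration cost can be checked with a short calculation on the ranges quoted above. The Python sketch below uses only those four figures. Because the text does not state which execution energy pairs with which configuration energy, it computes bounds rather than per-design totals; the function names are hypothetical.

```python
# Bounds on the RL cost of task T0 from the ranges quoted in the text (in uJ).
# The pairing between execution and configuration values is not given,
# so only lower and upper bounds are derived here.

EXEC_UJ = (4.15, 9.17)         # execution energy range on the RL
CONFIG_UJ = (272.49, 343.03)   # configuration energy range


def rl_total_bounds(exec_uj=EXEC_UJ, config_uj=CONFIG_UJ):
    """Lower and upper bound on execution plus configuration energy."""
    return (min(exec_uj) + min(config_uj), max(exec_uj) + max(config_uj))


def min_config_share(exec_uj=EXEC_UJ, config_uj=CONFIG_UJ):
    """Smallest possible fraction of the RL total due to configuration."""
    return min(config_uj) / (min(config_uj) + max(exec_uj))
```

Under any pairing, the RL total for T 0 lies between roughly 276.64 µJ and 352.20 µJ, and configuration accounts for more than 96% of it, which is why a mapping that looks only at execution energy is misled for this task.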
B. MVDR Beamforming
Using a similar approach as in Section VII-A, we implemented an MVDR (Minimum Variance Distortionless Response) beamforming application on Virtex-II Pro. Details of the design process are as follows.
1) Problem Formulation:
The task graph of the MVDR beamforming application is illustrated in Figure 6. It can be decomposed into five tasks with linear constraints. In T 0 , T 1 and T 2 , we implemented a fast algorithm described in [8] for MVDR spectrum calculation. It consists of Levinson-Durbin recursion to calculate the coefficients of a prediction-error filter (T 0 ), correlation of the predictor coefficients (T 1 ), and the MVDR spectrum computation using FFT (T 2 ). This fast algorithm eliminates much of the computation required by the direct calculation. We employ an LMS (Least Mean Square) algorithm (T 3 ) to update the weight coefficients of the filter due to its simplicity and numerical stability. A spatial filter (T 4 ) is used to filter the input data. The coefficients of the filter are determined by the previous tasks. We considered the low-level and algorithm-level control knobs discussed in Section II and developed various designs for the tasks, which are listed in Table IV. Different degrees of parallelism are employed in the designs for tasks T 0 and T 1 . Task T 2 uses FFT to calculate the MVDR spectrum. We employed the various FFT designs discussed in [5], which are based on the radix-4 algorithm, as well as the design from Xilinx CORE Generator. Clock gating, various degrees of parallelism, and memory bindings are used in these FFT designs to improve energy efficiency. V p and H p are the vertical and horizontal parallelism employed by the designs (see [5] for more details). Two different bindings, one using slice based RAMs and the other using BRAMs, were used to store the intermediate values. The number of dedicated multipliers used in the designs for tasks T 3 and T 4 is varied. Using the approach described in Section VII-A, we developed parameterized VHDL code for T 0 , T 1 , and T 2 . Parameterized designs for T 3 and T 4 are realized using a MATLAB/Simulink based design tool developed by us in [13].
All the designs for the RL and the processor core were mapped on the corresponding components for execution. Values of the parameters of the RSoC model and the application model were obtained through low-level simulation. The synthesis designs for the RL run at a clock rate of 50 MHz. The data precision is 10 bits. Table V shows the E i,s values when M = 8 and the number of FFT points is 16 after the pre-processing discussed in Section VI-B.
Let M denote the number of antenna elements. For task T 0 and task T 1 , we need to perform a complex multiply-and-accumulate (MAC) for problem sizes from 1 to M . Typically, M = 8, 16 are used in the area of software defined radio while M = 32, 64 are used in embedded sonar systems. Several trade-offs affect the energy efficiency when selecting the number of inputs to the complex MAC used to implement tasks T 0 and T 1 . Architectures for complex MACs with 2, 4, and 8 inputs are shown in Figure 7. For a fixed M , using a complex MAC architecture that handles more input data at the same time reduces the execution latency. However, it dissipates more power. It also occupies more FPGA slices, which increases the configuration cost. The energy dissipation when using MACs with different input sizes for task T 1 is analyzed in Figure 8. While a MAC with an input size of 4 is the most energy efficient for task T 1 when M = 8, a MAC with an input size of 2 is the most energy efficient when M = 64. Also, the numbers of slices required for a complex MAC that can handle 2, 4, and 8 inputs are 100, 164, and 378, respectively. This incurs different configuration costs between
2) Energy Minimization: A trellis with seven steps was created to represent the MVDR application. After applying the pre-processing technique discussed in Section VI-B, step -1 and step 5 contain one node each, steps 0 and 1 contain four nodes each, step 2 contains seven nodes, step 3 contains three nodes, and step 4 contains five nodes on the reduced trellis. Using the values described in the previous section, we obtain the weights of all the nodes and the edges on the reduced trellis. The proposed dynamic programming algorithm is then used to find a mapping that minimizes the overall energy dissipation.
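To give a sense of the search space on this reduced trellis, the short Python sketch below counts the candidate mappings and the edges a dynamic programming pass would examine. The node counts per step come from the text; the assumption that every node is connected to every node in the next step is ours and may overstate the edge count if some transitions are pruned.

```python
from math import prod

# Nodes per step of the reduced MVDR trellis, steps -1 through 5 (from the text).
NODES_PER_STEP = [1, 4, 4, 7, 3, 5, 1]


def count_paths(nodes_per_step):
    """Number of source-to-sink paths, i.e. candidate mappings."""
    return prod(nodes_per_step)


def count_edges(nodes_per_step):
    """Number of edges, assuming full connectivity between adjacent steps."""
    return sum(a * b for a, b in zip(nodes_per_step, nodes_per_step[1:]))
```

Under this assumption there are 1680 candidate mappings but only 89 edges, so a single pass that relaxes each edge once is enough to find the minimum-energy mapping without enumerating every path.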
The results are shown in Figure 9. For all the considered problem sizes, energy reductions ranging from 41% to 46% are achieved by our dynamic programming algorithm over a greedy algorithm that maps each task onto either hardware or software, depending upon which dissipates the least amount of energy.
For both the dynamic programming algorithm and the greedy algorithm, task T 0 and task T 1 are mapped to the RL. However, the dynamic programming algorithm maps them to the design using 2-input complex MAC while the greedy algorithm maps them to the designs based on 4-input complex 
