I. INTRODUCTION
High-Level Synthesis (HLS) is the process of translating an algorithmic specification into a Register-Transfer Level (RTL) Finite State Machine with Datapath (FSMD) implementation. Through different optimization steps, HLS aims at balancing the distribution of RTL components throughout the execution of applications. Although HLS has been a research topic for more than 25 years, it has only recently gained industrial acceptance, with the introduction of hardware description languages like VHDL and Verilog in design flows and the availability of efficient synthesis methods and tools that enable the translation of RTL designs into optimized gate-level implementations [1].
Designing at higher levels of abstraction offers better management of design complexity and a shorter design cycle. However, many optimization opportunities can still be explored at lower levels, at RTL and below. This holds for HLS as well. A strong argument of those against HLS is that an HLS environment, if not properly used, can produce definitely worse results than an RTL environment. For example, many HLS transformation algorithms have modeled multiplication and addition with two functional unit types of equal delay (1 control step) that cover all primitive operations of an application. Through RTL or gate-level optimization and architectural style exploration (array or tree multipliers, functional pipelining), this assumption can be shown to be far from reality.
Another inefficiency of an FSMD architecture is that leaving an expensive hardware component like a multiplier idle in a control step is a large waste of chip area. This would not happen, however, if a reconfiguration technique could use idle multipliers (or parts of them) with different functionality. Reconfigurable computing [2] was initially introduced to fill the gap between hardware and software, achieving potentially much higher performance than software while maintaining a higher level of flexibility than hardware. A reconfigurable computing system consists of reconfigurable devices, either fine or coarse grain, and a systematic way to apply reconfiguration data to them.
Fine grain reconfigurable devices, including Field-Programmable Gate Arrays (FPGAs), contain an array of programmable computational elements connected by a set of programmable routing resources. Any custom digital circuit can be mapped to the reconfigurable hardware by computing the logic functions of the circuit within the computational elements and using the configurable routing to connect them appropriately. Currently, the most common configuration technique is to use Look-Up Tables (LUTs). Coarse grain reconfigurable devices, on the other hand, are suited to specific applications or application kernels but are much more efficient than FPGAs, which in general have a huge routing area overhead and poor routability. Coarse grain architectures provide operator level configurable blocks, word level datapaths, and powerful and very area efficient datapath routing switches. A major benefit is the massive reduction of configuration memory and configuration time, as well as a drastic reduction in the complexity of the placement and routing problem. A survey of reconfigurable devices can be found in [3]. This paper presents a solution for idle multipliers in
in tree form. All adders that are not in the critical path can work in different modes, using multiplexers to connect different inputs and outputs. These multiplexers should be carefully selected, so as not to increase the critical path delay of the component.
The contribution of this paper is in two areas. First, an efficient gate-level design technique for morphable arithmetic components is introduced, based on the synthesis tools from Synopsys [4]. Following this technique, an extensive set of experiments is presented (under different implementation technologies, design architectures and operator bitwidths) to justify the applicability of morphable components during HLS. Second, using these components, an efficient postprocessor to UCSD's Spark HLS tool [5] that supports PRTR is presented. Spark is given resource constraints that count both modes of morphable components, and the postprocessor splits all control steps where resources for both modes are bound. In all these steps PRTR is performed, since the two modes are mutually exclusive. Through systematic experimentation with morphable instead of conventional multipliers in different DSP benchmarks, we have obtained a performance gain (shorter schedules) of 15% on average and 41% in the best case, without any increase in datapath area.
II. RELATED RESEARCH
Reconfigurable computing has been an active research topic in recent years, and different research groups have proposed various PRTR methodologies and reconfigurable component architectures. The key point in all approaches is how to conduct reconfiguration quickly and flexibly. This section presents some ideas related to coarse grain reconfigurable components and their adoption during HLS.
Coarse grain reconfigurable arithmetic components are presented in [6], [7], [8] and [9]. In [6] a component called Morphable Multiplier is presented, which is an array multiplier that can be configured through multiplexers to work as either a number of adders or a multiplier at each control step. That paper concentrates on the efficient design of such a device, maximizing hardware utilization, and not on using morphable multipliers for algorithm realization. A few manual scheduling results of DSP kernels are given for comparison, but no systematic approach. In [7], [8] and [9], a more systematic approach is given and the designs tested are full applications, like the graphics processor presented in [8]. All three contributions work at a high abstraction level, and more recently [9] considers multimode systems. However, they do not fully exploit morphable components as defined in [6]; instead, they use components that perform a single operation at each control step, chosen from a wider variety (addition, subtraction, comparison, multiplication).
Reconfigurable computing for HLS is reported in [10], [11] and [12]. In [10] the problem of register binding of the RTL description is considered and a technique to utilize on-chip embedded memory (found in modern FPGA devices) instead of LUTs is proposed. In [11] a scheduling heuristic is proposed to employ a single morphable component (like those defined in [6]) in each FSMD, while in [12] this idea is generalized to all components of the FSMD. However, in both cases very few details are given about the morphable components used, while the DSP schedules presented are divided into arbitrary equivalence classes and thus the performance improvements presented are arbitrarily biased.
Enhancements to HLS algorithms for optimal exploitation of arithmetic component architectures are given in [13]. Even though reconfiguration is not considered, the approach of [13] is similar to the proposed work because it uses a synthesis preprocessor to support bit-level synthesis. Specifically, operations are decomposed into smaller pieces that can be chained in the same control step, increasing parallelization. The preprocessor approach allows the same technique to be applied in different heuristics and environments. Also related to the proposed approach is the work presented in [14], where three different scheduling heuristics for PRTR are given. However, in that case multiprocessor scheduling is considered, represented by task graphs instead of the dataflow graphs used in HLS, and the reconfigurable device used is a complete architecture instead of a coarse grain RTL component.
III. DESIGN OF MORPHABLE RTL COMPONENTS
We use the term morphable RTL components to denote coarse grain reconfigurable functional units. Morphable components can work in different modes by proper use of specific configuration bits. A key issue in designing morphable units is to keep their delay and area not significantly greater than that of a single mode unit, thus making their deployment in an embedded processor or DSP feasible. For the rest of this paper we will use morphable multipliers, as initially presented in [6] . However, our HLS approach is general enough to cover any kind of morphable component (more details in the following section).
Morphable multipliers are based on the general purpose architecture of a binary multiplier, either in array or tree form (based on design constraints). In both cases, first the partial products are generated using an array of AND gates, or more generally, radix-k Booth's multiple generators. Next, the Partial Product Reduction Tree (PPRT) adds the partial products and produces a sum result in a redundant form. Finally, the redundant form is converted into a binary form by a carry propagate adder.
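As a rough illustration of the three stages just described, the following Python sketch models an unsigned multiplier at the word level: AND-gate partial products, a reduction to redundant (sum, carry) form, and a final carry propagate addition. All function names are hypothetical, plain AND partial products are used instead of Booth recoding, and the reduction order is arbitrary rather than a faithful array or tree layout.

```python
def partial_products(a, b, n):
    """AND-gate array: row i is (a AND bit i of b), shifted left by i."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(n)]


def carry_save_add(x, y, z):
    """One layer of full adders applied bitwise: 3 operands in, 2 out."""
    s = x ^ y ^ z                               # FA sum outputs
    c = ((x & y) | (x & z) | (y & z)) << 1      # FA carry outputs, one weight up
    return s, c


def reduce_partial_products(rows):
    """PPRT model: compress rows 3 -> 2 until a redundant pair remains."""
    rows = list(rows)
    while len(rows) > 2:
        x, y, z = rows.pop(), rows.pop(), rows.pop()
        rows.extend(carry_save_add(x, y, z))
    while len(rows) < 2:
        rows.append(0)
    return rows[0], rows[1]


def multiply(a, b, n):
    """n-bit unsigned multiply: partial products, PPRT, carry propagate adder."""
    s, c = reduce_partial_products(partial_products(a, b, n))
    return s + c                                # final carry propagate addition


print(multiply(13, 11, 4))
```

The sketch only shows the dataflow between the stages; it carries no timing information, which is what the slack analysis of the PPRT relies on.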
The design of morphable multipliers, as initially defined in [6], exploits the fact that the inputs and outputs of each Full Adder (FA) in the PPRT and the final carry propagate adder do not contribute equally to the delay of the multiplier. Only a few of all available FAs belong to the critical path. For example, consider the case of the 6x4 carry save array multiplier of figure 1. Each FA cell, both in the PPRT and in the final adder, has a numeric value in it. This value denotes the interval between the point at which the cell's inputs are ready and the latest point at which its outputs can be calculated without violating the critical path. All cells that have a 0 in them belong to the critical path. For example, the rightmost column of the PPRT along with the final adder constitute the critical path of the multiplier because they form a carry chain, where each FA has to wait for the carry output of the previous one to perform its addition. All other FAs, which are not on the critical path, have timing slack equal to the minimum delay that can be added to each of them to make it critical. Using an FA in more than one mode requires a number of multiplexers so that, in each mode, different inputs and outputs may be driven in and out of the FA to form different arithmetic operations. An FA with sufficient slack to allow the incorporation of multiplexers on its inputs and outputs is called a reusable FA. Taking into account that an nxn multiplier requires
in tree form, there are many opportunities to find reusable FAs, provided n has a reasonable value.
Two methods of identifying reusable FAs have been developed in [6]: a strict method and a relaxed method. In the strict method, given an FA that has timing slack, multiplexers are inserted on all its inputs and outputs. As a result of this insertion, the output delays of the FA are recalculated and any changes are propagated through the fan-out tree of the FA. If the recalculated delays of the PPRT outputs are less than or equal to the maximum tolerable delay, then the chosen FA is reusable. In the relaxed method, identification of reusable FAs exploits the existence of paths from the carry output of one FA to the carry input of another FA in the PPRT. For such paths, multiplexers need to be inserted only at the other inputs (A and B) of the FA. The reusable FAs found with this relaxed method are already chained in a ripple carry fashion. These chains are wired up in order to build larger structures.
For example, following the strict method, a number of steps must be followed in order to find a reusable adder chain in the multiplier of figure 1. First, in figure 2, a multiplexer is inserted at the cell in the lower left corner of the PPRT, which has the maximum slack value. The insertion of the multiplexer reduces the slack of this cell by 1. Next, another multiplexer is inserted in its top right cell, as shown in figure 3. Note that new slack values may have to be calculated for neighboring cells as well, besides the one where the multiplexer is inserted, because the arrival time of their inputs may change. Next, in figure 5, a reusable 4 bit carry chain is selected, which does not increase the delay of the multiplier, since no negative slack value has been found.
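The slack values and the strict reusability test can be sketched as a longest-path computation over the FA network. The Python sketch below is an illustration under stated assumptions, not the procedure of [6]: cells are nodes of a directed acyclic graph, every cell has a single lumped delay, a multiplexer level adds a fixed `mux_delay` on the inputs and another on the outputs, and the maximum tolerable delay is the critical path of the unmodified network. All names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def arrival_times(preds, delay):
    """Output arrival time of every cell; preds[cell] lists the cells feeding it."""
    arrival = {}
    for cell in TopologicalSorter(preds).static_order():
        start = max((arrival[p] for p in preds[cell]), default=0)
        arrival[cell] = start + delay[cell]
    return arrival


def slacks(preds, delay):
    """Slack = latest allowed output time minus actual output time, per cell."""
    arrival = arrival_times(preds, delay)
    deadline = max(arrival.values())             # critical path of the network
    required = {cell: deadline for cell in preds}
    for cell in reversed(list(TopologicalSorter(preds).static_order())):
        for p in preds[cell]:
            required[p] = min(required[p], required[cell] - delay[cell])
    return {cell: required[cell] - arrival[cell] for cell in preds}


def is_reusable_strict(preds, delay, cell, mux_delay):
    """Strict test: add multiplexers on inputs and outputs, re-time, compare."""
    deadline = max(arrival_times(preds, delay).values())
    trial = dict(delay)
    trial[cell] += 2 * mux_delay                 # one mux level in, one mux level out
    return max(arrival_times(preds, trial).values()) <= deadline


# Toy network: c0 -> c1 -> c2 is the carry chain, s is a side cell feeding c2.
preds = {"c0": [], "c1": ["c0"], "c2": ["c1", "s"], "s": []}
delay = {"c0": 1, "c1": 1, "c2": 1, "s": 1}
print(slacks(preds, delay))
print(is_reusable_strict(preds, delay, "s", 0.5))
print(is_reusable_strict(preds, delay, "c1", 0.5))
```

In this toy network only the side cell `s` has slack, so it is the only cell that can absorb multiplexers without lengthening the critical path.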
Our approach in this paper is a variation of the strict method, taking advantage of modern design solutions offered by the EDA industry and specifically Synopsys [4]. First, different multiplier architectures are designed using the DesignWare IP library. Then, for each architecture, reusable FAs are identified by inserting multiplexers at the inputs of each available FA and using PrimeTime to calculate variations in the critical path. If, after the insertion of a multiplexer set, PrimeTime reports no critical path variation, the corresponding FA was not in the critical path and thus it is reusable. Next, the identified reusable FAs are connected in ripple carry chains to form carry propagate adders. The chains are selected through exhaustive search, so that each new FA added causes the minimum critical path increase (if zero increase is not possible). Finally, the resulting architecture is given to Design Compiler with timing constraints equal to the initial multiplier's critical path (without multiplexers inserted), for further optimization with respect to the selected design library. This optimization guarantees that the final morphable multiplier will have no timing overhead and minimum (if not zero) area overhead.
For example, the number of reusable and available FAs for different multiplier implementations are given in

For all implementations, the area overhead drops as the requested component delay decreases (or the operating frequency increases). This result is very promising despite the fact that a 0% area overhead is not reported. It means that as more effort is used to design a multiplier, either morphable or non-morphable, the overhead of making it morphable becomes even less important. Furthermore, overheads around 5% can be considered not to impose any practical overhead on the resulting FSMD architecture. This happens because the FSMD with non-morphable components is larger (it has more components) than the one with morphable components. As a consequence, it requires the insertion of an application dependent number of extra multiplexers and interconnection nets to work. These extra resources compensate for the area increase of the morphable multipliers, and sum up to a practically equivalent (and in some cases, even larger) area increase in the overall architecture.
What seems strange at first sight is the last entries of table II, the 16 bit multipliers, where less area is needed to add functionality. By examining the resulting gate-level netlist in these cases, it can be found that Design Compiler optimizes the morphable component using lower level primitives from the standard cell library (AND, XOR and AOI gates) rather than FAs and multiplexers. When optimizations are allowed to cross the initial design hierarchy boundaries, there are cases where the lower level primitives of the combined higher level components can be further minimized because of the inserted multiplexers (the same minimization is not possible in the PPRT alone).
Overall, tables I and II provide enough evidence that a morphable component merging the functionality of 1 multiplier and 3 parallel adders is feasible under a wide variety of design considerations, and can be used to pack different functionalities during synthesis at higher abstraction levels, like HLS.
IV. HLS WITH MORPHABLE COMPONENTS
HLS has been a hot research topic for the last 25 years. Recently, with other design technologies (HDLs, RTL synthesis, technology libraries) offering a stable foundation, HLS is considered as efficient as RTL design has been in the past. Many solutions exist in today's EDA market that start at higher abstraction levels and support compact and easy to manage design descriptions, without compromising on output quality.
The Spark HLS tool [5] from UCSD has been motivated by the advances in parallelizing compiler technology that enable exploitation of extreme amounts of parallelization through a range of code motion and code transformation techniques. The specific set of transformations to be applied to each design is user selectable through appropriate script files. This makes Spark highly customizable. Although our methodology is general enough to be applied to other environments, Spark has been chosen for these advanced customization capabilities, as well as for the fact that, as an academic tool, it is available as a free download.
For our problem, that is PRTR with morphable components, neither Spark nor any other commercial or academic tool directly supports our ideas. Most HLS tools allow the definition of new functional units, which cover different behavioral level constructs (built-in operators or user-defined functions). Then, the HLS tool can bind such resources and perform scheduling and allocation. However, none of the available tools supports functional blocks that work in different and mutually exclusive modes, especially when the number of reconfigurable blocks is different in each mode (1 multiplier in mode 1 and 3 adders in mode 2, as reported in the previous section). Instead, each resource is considered as a single entity throughout all optimization phases.
Our proposed solution in this paper is an efficient approach to handle such cases with Spark. It is applied in two steps, as shown in figure 6. First, we let Spark perform any selected optimization with enough resources to cover all modes of operation of all morphable components. Spark will generate optimized RTL FSMD descriptions that, in some of the control steps, use mixed elements of different modes of morphable components. Each of these steps needs to be split in two, with a reconfiguration between them to change modes. Because morphable components have been designed not to impose reconfiguration delays on the critical path of the application, there is no need for extra reconfiguration control steps. All that is needed is the propagation of the correct configuration bits to the multiplexers inserted in the morphable component. Reconfiguration is also needed when a morphable component changes mode between different control steps. However, any reconfiguration along with normal operation of the morphable component is guaranteed to fit in one control step by design.
The generation of the correct multiplexer bits is the final step of our technique, which is applied as a scheduling postprocessor to the results of Spark (MM-Aware Scheduling Postprocessor in figure 6).
Figure 6. Proposed approach
In pseudocode, our design approach is given below, assuming we have n morphable multipliers with 2 modes (1 multiplier in mode 1 and 3 adders in mode 2) and m single mode adders. Note that the same approach can be followed with other component types.
number of multipliers = n;
number of adders = m+3*n;
call Spark to perform HLS;
for each control step with mode conflict
    split control step in 2;
    generate reconfiguration bits between control steps;
end for
for each control step
    if at previous control step a morphable component was used in different mode than here
        generate reconfiguration bits at the beginning of control step;
end for
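The two loops of the pseudocode can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not stated in the text: a schedule is a list of control steps, each a list of hypothetical `(kind, name)` operations with kind `"mul"` or `"add"`; operations inside one step are independent; a step has a mode conflict when its additions cannot be served by the m plain adders plus the morphable multipliers not used as multipliers in that step; and all morphable units start in multiplier mode.

```python
def split_conflicts(schedule, n, m):
    """First loop: split every control step that needs one unit in both modes."""
    steps = []
    for ops in schedule:
        muls = [op for op in ops if op[0] == "mul"]
        adds = [op for op in ops if op[0] == "add"]
        # Adders usable while len(muls) morphable units work as multipliers.
        adders_left = m + 3 * (n - len(muls))
        if len(adds) <= adders_left:
            steps.append(muls + adds)
        else:
            steps.append(muls + adds[:adders_left])   # multiplier-mode half
            steps.append(adds[adders_left:])          # adder-mode half
    return steps


def reconfiguration_bits(steps, n, m, initial_mode="M"):
    """Second loop: emit (step, unit, new_mode) whenever a unit changes mode."""
    modes = [initial_mode] * n
    events = []
    for t, ops in enumerate(steps):
        n_mul = sum(1 for kind, _ in ops if kind == "mul")
        n_add = sum(1 for kind, _ in ops if kind == "add")
        n_adder_mode = (max(0, n_add - m) + 2) // 3   # ceil((n_add - m) / 3)
        wanted = {u: "M" for u in range(n_mul)}
        wanted.update({u: "A" for u in range(n - n_adder_mode, n)})
        for unit, mode in wanted.items():
            if modes[unit] != mode:
                modes[unit] = mode
                events.append((t, unit, mode))
    return events


# One morphable multiplier, no plain adders: Spark sees 1 multiplier, 3 adders.
schedule = [[("mul", "p1"), ("add", "s1"), ("add", "s2")], [("add", "s3")]]
steps = split_conflicts(schedule, n=1, m=0)
print(steps)
print(reconfiguration_bits(steps, n=1, m=0))
```

In the example, the first control step uses the single morphable unit both as a multiplier and as adders, so it is split in two and one mode change is emitted at the start of the second half.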
The postprocessing approach has many advantages. First of all, it is easy to implement and integrate with any existing HLS tool (other than Spark), even without access to its source code (although access is preferable). Also, it can be combined seamlessly with any selected optimizations performed during HLS, either resource constrained or timing constrained. Finally, it can be executed as a separate module, and the effects of the optimizations it performs can be clearly evaluated (by performing HLS with and without postprocessing). The quality of the produced results is expected to be similar to that obtained if the same PRTR were put inside each scheduling heuristic. This is expected because the proposed approach can be considered to follow all HLS optimizations for an extended control step, which consists of 2 normal control steps. In the first step the morphable components may work in one mode and in the second step in the other, if needed. The postprocessor simply breaks the extended control step. If an operation can be scheduled in the extended control step, it can be put in either of the resulting 2 normal control steps without any overall performance overhead.
V. EXPERIMENTAL RESULTS
In order to evaluate the proposed technique, we have generated reconfigurable architectures for 11 DSP applications: an FIR filter with 7 taps (fir7), an IIR filter (iir), a lattice filter (lattice), an elliptic second order filter (elliptic), a wavelet transformation (wavelet), a discrete cosine transformation (dct), an inverse discrete cosine transformation (idct), a fast Fourier transformation (fft), a 2D discrete cosine transformation of 8x8 pixel patterns (2D dct8x8), a matrix multiplication of 4x4 matrices (mat mult4x4) and a matrix inversion of 4x4 matrices (mat inv4x4). For each application we ran experiments with different numbers of resources, for both conventional components (table III) and morphable components (table IV).
Table III shows different resource usage scenarios for all DSP applications using only conventional components. Column two shows the control steps required to schedule all applications when using 1 multiplier and 1 adder, column three 1 multiplier and 2 adders, column four 1 multiplier and 3 adders, next 2 multipliers and 1 adder and so on, until column ten, where 3 multipliers and 3 adders have been used. Smaller applications like the fir7 and the iir filters require mainly additions and so their schedules do not change when more multipliers are used. The larger applications like 2D dct8x8 and mat inv4x4 require balanced resources for additions and multiplications and so, in columns with limited resources of either type, many control steps are required for algorithm execution.

Application   1M,1A  1M,2A  1M,3A  2M,1A  2M,2A  2M,3A  3M,1A  3M,2A  3M,3A
iir               8      8      8      8      8      8      8      8      8
lattice          11     10     10     11     10     10     11     10     10
elliptic         28     17     15     28     17     15     28     17     15
wavelet          29     16     16     29     16     16     29     16     16
dct              27     14     10     27     14     10     27     14     10
idct             37     22     17     37     22     17     37     22     17
fft              23     17     17     20     13     11     20     12     10
2D dct8x8        99     54     45     99     59     43     99     59     43
mat mult4x4      23     23     23     20     15     15     20     14     12
mat inv4x4       95     95     95     52     49     49     52     38     38

Table III
SCHEDULING WITH CONVENTIONAL COMPONENTS

Table IV shows the same resource usage scenarios as table III, with morphable instead of conventional multipliers. Looking at both tables, small examples again require the same number of control steps. However, in medium and large applications, fewer control steps are required when using morphable components. Calculating the average number of control steps for all applications in both tables, the average performance gain offered by morphable components is 15%. In the best case (the architecture with 3 morphable multipliers and 1 adder in column 8), the average performance gain for all applications is as high as 41%. Moreover, looking at table IV, this improvement tends to lower the high control step values found in table III in most cases. Since high control step values come from unbalanced resource utilization, morphable components offer a second improvement by balancing the utilization of all available components through PRTR. This last result can be seen in figure 7, where the average number of control steps required for all applications is shown for all different resource usage scenarios.

Figure 7. Average number of control steps (CS) for conventional and morphable components, for resource scenarios 1M,1A through 3M,3A
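The per-scenario averages for conventional components can be recomputed from the Table III values with a few lines of Python. The sketch below uses only the ten applications whose Table III values are listed in this section, so its numbers are illustrative and need not match figure 7 exactly. The `relative_gain` helper encodes an assumed definition of performance gain (fraction of control steps saved); the text does not spell out the formula.

```python
# Scenario labels follow the column order described in the text.
SCENARIOS = ["1M,1A", "1M,2A", "1M,3A", "2M,1A", "2M,2A",
             "2M,3A", "3M,1A", "3M,2A", "3M,3A"]

# Control steps with conventional components (values from Table III).
TABLE_III = {
    "iir":         [8, 8, 8, 8, 8, 8, 8, 8, 8],
    "lattice":     [11, 10, 10, 11, 10, 10, 11, 10, 10],
    "elliptic":    [28, 17, 15, 28, 17, 15, 28, 17, 15],
    "wavelet":     [29, 16, 16, 29, 16, 16, 29, 16, 16],
    "dct":         [27, 14, 10, 27, 14, 10, 27, 14, 10],
    "idct":        [37, 22, 17, 37, 22, 17, 37, 22, 17],
    "fft":         [23, 17, 17, 20, 13, 11, 20, 12, 10],
    "2D dct8x8":   [99, 54, 45, 99, 59, 43, 99, 59, 43],
    "mat mult4x4": [23, 23, 23, 20, 15, 15, 20, 14, 12],
    "mat inv4x4":  [95, 95, 95, 52, 49, 49, 52, 38, 38],
}


def average_steps(table):
    """Average number of control steps per resource scenario."""
    rows = list(table.values())
    return {s: sum(r[i] for r in rows) / len(rows)
            for i, s in enumerate(SCENARIOS)}


def relative_gain(conventional_avg, morphable_avg):
    """Assumed gain definition: fraction of control steps saved."""
    return (conventional_avg - morphable_avg) / conventional_avg


conventional = average_steps(TABLE_III)
for scenario in SCENARIOS:
    print(scenario, round(conventional[scenario], 1))
```

For these rows the averages are highest in the scenarios with a single adder, which is the unbalanced-resource effect that morphable components are meant to remove.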
The top line in figure 7 shows the average number of control steps when using conventional components, while the bottom line shows the average when using morphable components. It can be seen that the top line has local maximum values (peaks) in places where limited resources of one type are available, especially adders (all 1A cases in tables III and IV). There, resource utilization is unbalanced. This does not happen in the bottom line where, even though there are points with few dedicated adders, PRTR balances resource utilization by using morphable multipliers as adders. In fact, the bottom line is almost straight, without local maximum or minimum values, and a constant slope that corresponds to fewer control steps
