Anand Pande* Broadcom India Pvt. Ltd 
I. Introduction:
DSP applications are built on DSP kernels such as filtering (FIR and IIR) and transforms (DCT, FFT, etc.). Many of these kernels involve weighted-sum computations in which the weights are fixed at design time. An add-shift based hardware implementation of such fixed-coefficient multiplications is preferred over a multiplier-based realization for both area and performance efficiency. The number of computations in such an implementation can be minimized with techniques like common sub-expression elimination [1-3, 10]. The computational DFG (Data Flow Graph) depicting such an add-shift implementation comprises variables and computations whose precision varies significantly. For example, with fixed coefficients and 12-bit data, the resultant DFG in an add-shift implementation would have variables and computations with precision varying from 12 bits at the input to as high as 33 bits at the output.
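The growth in precision along an add-shift structure can be illustrated with a minimal Python sketch. It assumes unsigned data and a plain binary decomposition of the coefficient (no common sub-expression elimination); the function names are hypothetical and are not taken from the paper.

```python
def shift_add_terms(coeff):
    """Shift amounts of the set bits of a non-negative integer coefficient."""
    return [k for k in range(coeff.bit_length()) if (coeff >> k) & 1]


def shift_add_multiply(x, coeff):
    """Multiply x by a fixed non-negative coefficient using only shifts and adds."""
    return sum(x << k for k in shift_add_terms(coeff))


def partial_sum_widths(data_bits, coeff):
    """Worst-case unsigned bit width of each running partial sum.

    Assumption: the input is an unsigned value of data_bits bits, so the
    widest case is x = 2**data_bits - 1.
    """
    x_max = (1 << data_bits) - 1
    acc_max = 0
    widths = []
    for k in shift_add_terms(coeff):
        acc_max += x_max << k          # add the next shifted copy of x
        widths.append(acc_max.bit_length())
    return widths


# Coefficient 10 = 2**1 + 2**3, so 10*x = (x << 1) + (x << 3).
product = shift_add_multiply(7, 10)
widths = partial_sum_widths(12, 10)
```

Even for this two-term coefficient, the intermediate values are wider than the 12-bit input, which is the multi-precision property the rest of the paper exploits.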
In this paper, we address the area-efficient, resource-shared implementation of such multi-precision DFGs. This problem can be solved as a conventional high-level synthesis problem. However, since the techniques used in conventional high-level synthesis ignore the varying precision of the DFG, the resultant implementation is likely to be sub-optimal. We present a precision sensitive behavioral synthesis technique that addresses all three major components of high-level synthesis, viz. register allocation, functional unit binding and scheduling. In each of these steps we exploit the multi-precision nature of the DFG to achieve the most area-efficient implementation. Since these three components of high-level synthesis are interdependent [7], we present an integrated methodology to take advantage of the coupling between them.
To the best of our knowledge, the problem of supporting multi-precision arithmetic has not been examined in the context of high-level synthesis of ASICs (Application Specific Integrated Circuits). Some processor architectures have been proposed to efficiently perform variable precision computations, mostly with a granularity of 8/16/32 bits [8, 9]. Multi-precision arithmetic in the context of image and video processing applications is dealt with in [4]. There, the inherent parallelism of such applications is exploited for multi-precision functional unit allocation and binding. Here also, the multi-precision arithmetic is restricted to byte or word boundaries. Restricting to word or byte boundaries is often necessary in processors to optimize memory-CPU transfers. This restriction does not apply to ASICs, so the scope for area optimization is significantly larger. Moreover, scheduling is not addressed in [4], given the massive parallelism available in image and video applications. Such parallelism is not present in the multiplierless implementation of weighted-sum transforms, and hence those techniques cannot be applied directly.
Although most of our techniques are described in the context of add-shift implementation, they are generic and could be applied to multi-precision DFGs involving any type of functional unit. The rest of the paper is organized as follows. In the next section we describe the register allocation and functional unit binding problem in the context of multi-precision arithmetic. A precision sensitive register allocation and functional unit binding algorithm is also presented. Section III describes scheduling and the integrated HLS (High Level Synthesis) methodology. Results are presented in section IV. These results justify our claims of area efficiency over conventional methods. Conclusions are drawn in section V, where areas of future work are also outlined.
II. Register allocation and functional unit binding:
In this section we address the problem of register allocation and functional unit binding in the context of multiple precision DFGs. The objective is to exploit the varying precision of the DFG to attain maximum area efficiency. The register allocation problem has been extensively studied in the literature [6, 7, 11, 12]. The input to the register allocation problem is the variable lifetime graph, which is derived from the given schedule of the DFG. The following example illustrates the improvement that can be achieved using a precision sensitive approach to register allocation. A precision sensitive approach should allocate the variable pairs (A, D) and (B, C) to two registers, each sized to the precision of the variables it holds. We now present a precision sensitive register allocation algorithm. The pseudo code for this algorithm is shown in figure 2.
II(A). Precision sensitive register assignment algorithm:
For a given schedule of the DFG, the input to this algorithm is the lifetime graph of the variables. The 2-tuple <start(v), end(v)> characterizes the lifetime of each variable. As in the left-edge algorithm [12], the variables are sorted with the <start(v)> of their lifetime as the primary key. However, instead of <end(v)>, the secondary key is the bit precision of the variables, in decreasing order. The algorithm maintains two ordered working lists ("free-unit" and "cur-var") and a regular list ("busy-unit"). The list "free-unit" holds the units (registers in this case) freed at a control step. The list "busy-unit" holds the units currently holding some variable. These two lists are modified as variables are assigned to units and as units are freed when variable lifetimes complete in each control step. The list "cur-var" holds the variables whose <start(v)> equals the control step being considered.
The algorithm starts from the 0th control step with "free-unit" and "busy-unit" being empty lists. Variables having their <start(v)> equal to 0 are added to the list "cur-var". At each control step, the members of the "cur-var" list are assigned to the members of the "free-unit" list in order. The units holding variables are moved to the list "busy-unit". Any unallocated variables in the "cur-var" list are allocated to new units, and these units are also added to the list "busy-unit". At each control step, any units being freed from the "busy-unit" list are moved to the ordered list "free-unit", maintaining the order of decreasing precision. When all the variables in "cur-var" have been allocated, the algorithm continues to the next control step and performs the same operations. This continues until all control steps are exhausted.
The register allocation algorithm rests on the following principles:
a. The variation in bit precision among the variables sharing a register should be minimal. This ensures optimum utilization of the register.
b. The number of registers used should be minimal, as in the left-edge algorithm.
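The procedure described in this subsection can be sketched in Python as follows. This is a minimal illustration, not the pseudo code of figure 2. It assumes that a register is held during the half-open interval [start, end), that a register's width is the largest precision assigned to it, and that the lifetimes and bit widths of the four variables are hypothetical values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Register:
    width: int = 0                      # widest variable stored so far
    variables: list = field(default_factory=list)
    busy_until: int = 0                 # end(v) of the variable currently held


def allocate_registers(lifetimes):
    """lifetimes maps a variable name to (start, end, bits)."""
    if not lifetimes:
        return []
    last_step = max(end for _, end, _ in lifetimes.values())
    free_unit, busy_unit, registers = [], [], []
    for step in range(last_step + 1):
        # Move units whose variable has expired from "busy-unit" to "free-unit".
        still_busy = []
        for reg in busy_unit:
            (free_unit if reg.busy_until <= step else still_busy).append(reg)
        busy_unit = still_busy
        free_unit.sort(key=lambda r: -r.width)      # decreasing precision
        # "cur-var": variables starting now, in decreasing precision.
        cur_var = sorted(
            (name for name, (start, _, _) in lifetimes.items() if start == step),
            key=lambda name: -lifetimes[name][2],
        )
        for name in cur_var:
            _, end, bits = lifetimes[name]
            if free_unit:
                reg = free_unit.pop(0)              # widest free unit first
            else:
                reg = Register()                    # no free unit: allocate a new one
                registers.append(reg)
            reg.width = max(reg.width, bits)
            reg.variables.append(name)
            reg.busy_until = end
            busy_unit.append(reg)
    return registers


# Hypothetical lifetimes: A and D are 16-bit, B and C are 8-bit.
example = {"A": (0, 2, 16), "B": (0, 2, 8), "C": (2, 4, 8), "D": (2, 4, 16)}
result = [(r.width, r.variables) for r in allocate_registers(example)]
total_bits = sum(width for width, _ in result)
```

With these assumed values, the variables of equal precision share a register and the total is 24 register bits, whereas a pairing of A with C and B with D would need two 16-bit registers.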
II(B). Functional unit binding
The effect of a precision sensitive approach to functional unit binding is shown in figure 3. Consider the DFG shown in figure 3. A precision insensitive approach could result in an implementation as shown in figure 3
III. Scheduling:
Depending upon the application under consideration, scheduling can be resource constrained or time constrained [7]. Most DSP applications, being used in real-time systems, are constrained more tightly by time than by resources. Keeping this in mind, we address time-constrained scheduling of multi-precision DFGs. The objective here is to obtain an area-efficient implementation of the computational structure represented by the DFG without increasing its latency.
A resource-shared, multi-precision, multiplier-less DFG implementation comprises adders, shifters, registers, multiplexers and interconnect. While the conventional approach minimizes the number of functional units, we aim to minimize the number of bits of the functional units.
The computational complexity of both the register allocation and the functional unit binding proposed in section II is polynomial. Hence, we can afford to perform these computations for each candidate schedule, rather than relying on an estimation mechanism. This modification contributes to the quality of the implementation, and a significant gain is achieved without being computationally expensive.
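The idea of re-running an exact binding for every candidate schedule can be sketched in Python as follows. This is an illustration under stated assumptions, not the algorithm of the paper: it uses a simple greedy improvement loop rather than a KL-based heuristic, assumes unit-latency operations, counts adder bits only, and uses a stand-in binding that sorts the operations of each control step by decreasing width. All names and the three-node DFG are hypothetical.

```python
def successors(preds):
    succs = {n: [] for n in preds}
    for n, ps in preds.items():
        for p in ps:
            succs[p].append(n)
    return succs


def asap(preds, order):
    """Earliest control step of each node; order must be topological."""
    step = {}
    for n in order:
        step[n] = max((step[p] + 1 for p in preds[n]), default=0)
    return step


def alap(preds, order, latency):
    """Latest control step of each node within the given latency."""
    succs = successors(preds)
    step = {}
    for n in reversed(order):
        step[n] = min((step[c] - 1 for c in succs[n]), default=latency - 1)
    return step


def adder_bits(schedule, width):
    """Total adder bits when each step's operations are bound by decreasing width."""
    per_step = {}
    for n, t in schedule.items():
        per_step.setdefault(t, []).append(width[n])
    unit_width = []
    for ws in per_step.values():
        for rank, w in enumerate(sorted(ws, reverse=True)):
            if rank == len(unit_width):
                unit_width.append(w)        # a further adder is needed
            else:
                unit_width[rank] = max(unit_width[rank], w)
    return sum(unit_width)


def improve(preds, order, width):
    """Greedy move-based improvement starting from the ASAP schedule."""
    early = asap(preds, order)
    late = alap(preds, order, max(early.values()) + 1)
    succs = successors(preds)
    sched, best = dict(early), adder_bits(early, width)
    improved = True
    while improved:
        improved = False
        for n in order:
            lo = max((sched[p] + 1 for p in preds[n]), default=early[n])
            hi = min((sched[c] - 1 for c in succs[n]), default=late[n])
            for t in range(lo, hi + 1):
                trial = {**sched, n: t}
                cost = adder_bits(trial, width)   # exact cost, no estimate
                if cost < best:
                    sched, best, improved = trial, cost, True
    return sched, best


preds = {"n1": [], "n2": [], "n3": ["n1"]}
width = {"n1": 20, "n2": 20, "n3": 12}
order = ["n1", "n2", "n3"]
asap_cost = adder_bits(asap(preds, order), width)
final_schedule, final_cost = improve(preds, order, width)
```

In this small case the ASAP schedule places both 20-bit operations in the same step and needs two 20-bit adders. Moving one of them within its slack lets a 20-bit and a 12-bit adder suffice, with the latency unchanged.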
III(A). Cost function
In this subsection we show the relation between the number of bits of the functional units and their area. Since we aim to reflect the implementation area in the cost function, we associate a different weight per bit with each of the resources (adder, shifter and storage). The relative weights depend on the technology used and the type (architecture) of the functional units employed.
We now illustrate the computation of the relative weights associated with each of the resources in our approach. To keep our technique fairly generic, we used a generic block synthesis tool, Synopsys MODULE COMPILER, to synthesize the adders, shifters and registers for a number of bit widths. We used the tool in area optimization mode with the Texas Instruments 0.15 micron ASIC library. The area is reported in terms of equivalent NAND gates. The objective of synthesizing each of the resources is to establish the validity of our cost function quantitatively. It is, however, not necessary to perform the whole synthesis; approximate weights can be assigned to the bits of each resource according to its architecture. Figure 5 shows the variation of adder area with respect to its width. These curves are plotted for Carry Save Adder (CSA), Ripple Carry Adder (RCA) and Carry Look Ahead Adder (CLA) architectures. Similarly, figure 6 shows the NAND gate equivalent area obtained for the shifter and the storage registers. It can be seen that the area increases in a fairly linear manner with the number of bits for the data-path elements. The cost function is therefore a weighted sum of bit widths over all units:
$$\text{Cost} = \sum_i W_i N_i$$
where $W_i$ is the weight per bit of unit i and $N_i$ is the number of bits of unit i.
We have not considered the interconnect in the optimization phase of the scheduling algorithm. We assume the availability of Over the Cell (OTC) routing area, which is usually the case in cell-based ASIC designs. This is possible because of the 5-6 or more layers of metallization in contemporary technology. We also exclude the mux area from the formulation of our cost function. Firstly, the contribution of the muxes to the area is quite small compared to that of the other components. Secondly, the mux area is directly proportional to its width and number of select signals. As our algorithm aims at minimizing the width of the functional and storage units, it implicitly reduces the mux width, and the mux area is reduced as a consequence. We now present the scheduling algorithm.
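As a worked illustration of the weighted-bits cost, the following Python sketch computes the cost of a set of units and estimates a weight per bit from area samples by a least-squares line through the origin. The weights, unit list and area samples are hypothetical placeholders, not values from the synthesis runs reported in the paper, and the function names are assumptions made for this example.

```python
def fit_weight_per_bit(samples):
    """Least-squares slope through the origin for (bits, area) samples.

    Assumption: area is roughly proportional to bit width, so area = W * bits.
    """
    numerator = sum(bits * area for bits, area in samples)
    denominator = sum(bits * bits for bits, _ in samples)
    return numerator / denominator


def implementation_cost(units, weight_per_bit):
    """Sum of (weight per bit) * (number of bits) over all units."""
    return sum(weight_per_bit[kind] * bits for kind, bits in units)


# Hypothetical area samples in NAND-gate equivalents for one resource type.
register_weight = fit_weight_per_bit([(8, 40.0), (16, 80.0), (32, 160.0)])

# Hypothetical weights per bit and a small hypothetical data path.
weights = {"adder": 7.0, "shifter": 3.0, "register": register_weight}
data_path = [("adder", 16), ("adder", 12), ("register", 16), ("shifter", 12)]
cost = implementation_cost(data_path, weights)
```

Because the cost is linear in the bit widths, narrowing any single unit lowers the cost by its weight per bit for every bit removed, which is what the scheduling step tries to exploit.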
III(B). Precision sensitive scheduling algorithm:
Problem: Schedule a given multi-precision DFG and allocate the functional and storage units with the aim of minimizing the total number of bits of the functional and storage units, with appropriate weights associated with them.
Objective: To schedule the DFG (i.e. to assign each computation to a control step) such that:
1. For each edge E_ij = (v_i, v_j), node j is scheduled in a later control step than node i.
2. The number of scheduling steps is equal to the minimum possible for that DFG (the latency is not increased).
3. The allocation results in optimized use of resources, in the sense of minimizing the total bits of the functional and storage units. This in turn minimizes the area required by the functional and storage units.
The pseudo code for the algorithm is formally presented in figure 7. For scheduling we use a KL [5]-based iterative improvement heuristic, with either As Soon As Possible (ASAP) or As Late As Possible (ALAP) [7] scheduling as the initial solution. The objective of ASAP and ALAP is to obtain the slack within which a node can be moved so that the latency bound is not exceeded. For each incremental scheduling step we perform complete allocation and binding of the functional and storage units, and the cost of the new schedule is computed with weights, as shown in the pseudo code.
IV. Results:
In this section we report the results of the implementation of our approach. The algorithms were implemented in C and were run on Solaris
V. Conclusion and Future Work
To the best of our knowledge, we have addressed for the first time the problem of High Level Synthesis (HLS) of multi-precision DFGs. We have presented a precision sensitive scheduling algorithm that uses an iterative improvement approach, with the cost function formulated in terms of the number of bits of the arithmetic operators and storage units. An algorithm for register allocation and functional unit binding for variable precision arithmetic has also been proposed. We have also proposed an integrated HLS methodology to exploit the interdependence of scheduling, allocation and binding. Optimization ratios as high as 27.21% (23.14% on average) over conventional fixed precision techniques establish the potential of our approach.
The size of a functional unit affects its area as well as its performance. The system clock period is decided by the delay of the largest or most complex functional unit. We plan to enhance this work by incorporating techniques such as scheduling high-delay functional units over multiple cycles. This will lead to smaller clock periods, and the system performance (throughput as well as latency) will improve.
VI. References:
