Abstract-
I. INTRODUCTION
DSP functions [3], [6] are computationally intensive and exhibit spatial parallelism [2], [4], temporal parallelism [5], or both. High-speed applications such as Software Defined Radio (SDR), satellite modems and HDTV need very high performance that is not achievable with currently available DSP processors [22], [23]. Although higher performance, relatively lower cost and low power dissipation are the major advantages of ASICs, their high degree of inflexibility restricts their use in the rapidly changing scenarios of the high-end applications mentioned above. On the other hand, dynamically reconfigurable FPGAs [4], [7], [8], which can map different DSP functions at run time, are becoming popular because of their flexibility and low risk. However, a lower utilization factor due to wasted area in SRAM-based CLBs, higher cost, and relatively lower performance due to complex interconnection and routing delay are the major bottlenecks of FPGAs. Although some FPGAs of the Virtex family offer basic DSP building blocks such as Multiply and Accumulate (MAC) units, the silicon utilization factor is not optimized in the LUT-based architecture [24] of the FPGA. The proposed FPDA architecture aims to eliminate these drawbacks of FPGAs and ASICs. DSP functions are mainly of two types: 'Filter Functions' (FIR, IIR etc.) and 'Linear Transforms' (DFT, FFT, DCT, DWT etc.). With this in view, this paper presents a novel reconfigurable DSP architecture that realizes different DSP functions through interconnections among different CMs.
Section-II of the paper describes the Distributed Arithmetic principle, which has been used to implement DSP functions (such as FIR, IIR, DCT and DWT) in the proposed architecture. Section-III describes the different DSP functions and their proposed implementation in the architecture. Section-IV gives a detailed description of the "Reconfigurable Architecture". Section-V analyzes the performance using simulation, implementation and comparison results, and Section-VI concludes the paper.
II. DSP FUNCTIONS AND PROPOSED IMPLEMENTATION

A. Finite Impulse Response Filter
An FIR filter with constant coefficients is an LTI digital filter. The output of an FIR filter of length L for an input series x[n] is a finite version of the convolution sum:
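In its standard form, with $c[k]$ denoting the constant filter coefficients and $y[n]$ the output (symbol names assumed here, chosen to match the notation used for the LUT contents), the sum can be written as:

$$ y[n] = \sum_{k=0}^{L-1} c[k]\, x[n-k] $$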
A 16-tap FIR filter has been implemented using parallel DA, as shown in Fig. 1 and Fig. 2. The DA [11], [12], [21] architecture replaces the multiplier block with adders and shifters. The LUT contents for the DA FIR are f(c[n-k], x[n]). A single LUT of $2^{16}$ locations would be needed to implement a 16-tap FIR using DA. This paper instead proposes an FIR architecture with 32 LUTs of $2^{4}$ locations each, which reduces the number of memory locations and speeds up execution at the cost of additional LUTs, registers and adders. Each bit of each input enters the LUTs in parallel (2 LUTs per coefficient). The proposed FIR architecture is scalable. The basic building blocks needed to implement the FIR filter are LUTs, adders and registers. From the above equations, it is observed that an IIR filter can be implemented using two FIR filters with the same inputs.
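The LUT organisation described above can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the paper's circuit: it assumes unsigned 8-bit input samples split into two 4-bit nibbles, so that each coefficient owns two 16-entry tables (giving 32 tables for 16 taps). The function names `build_luts` and `fir_output` and the example data are hypothetical.

```python
def build_luts(coeffs):
    """Two 16-entry tables per coefficient; entry v holds c * v for a 4-bit v."""
    return [([c * v for v in range(16)], [c * v for v in range(16)])
            for c in coeffs]


def fir_output(luts, window):
    """One output sample; window[k] is x[n-k], assumed unsigned 8-bit."""
    acc = 0
    for (lut_lo, lut_hi), x in zip(luts, window):
        acc += lut_lo[x & 0xF]               # low nibble addresses the first table
        acc += lut_hi[(x >> 4) & 0xF] << 4   # high nibble, weighted by 2^4 via a shift
    return acc


# Illustrative data only (not taken from the paper).
coeffs = [(-1) ** k * (k + 1) for k in range(16)]
window = [(37 * k + 11) % 256 for k in range(16)]
luts = build_luts(coeffs)
y = fir_output(luts, window)
```

No multiplier is used at run time: each product is obtained from two table reads, one shift and additions, which is the trade of extra LUTs and adders for smaller tables that the paragraph describes.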
C. Discrete Wavelet Transform
The Discrete Wavelet Transform has been widely used in digital signal processing and image compression (e.g. JPEG 2000) in recent years. The coefficients of the DWT are calculated recursively using Mallat's pyramid algorithm.
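One common way to write this recursion, assumed here to be consistent with the filter-and-down-sample decimator used in the implementation, is:

$$ W_L(n, j) = \sum_{m} W_L(m, j-1)\, h_0(2n - m), \qquad W_H(n, j) = \sum_{m} W_L(m, j-1)\, h_1(2n - m) $$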
Where $W_L(n, j)$ and $W_H(n, j)$ are the $n$-th scaling and wavelet coefficients at the $j$-th stage, and $h_0(n)$ and $h_1(n)$ are the dilation coefficients [18] corresponding to the scaling and wavelet functions. The forward DWT has been implemented using a decimator block, which consists of a PDA FIR filter and a down-sampling operator. The PDA FIR filter follows the FIR architecture described above, as shown in Fig. 4. The 8-tap Daubechies FIR filter has been chosen for the implementation, as shown in TABLE I. The FIR input is driven by the clock, i.e. it is tied to the clock input of the 1-bit counter in Fig. 5. The output port of the FIR filter is connected to the input of a parallel-load register. Whether the register accepts or blocks its input depends on the state of the counter. The input enters the decimator at a rate of 1 sample/clock, while the filtered output leaves at a rate of 1 sample per 2 clocks.
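The behaviour of the decimator block can be modelled in a few lines. The sketch below is illustrative only: the function name `decimate_by_two` is hypothetical, the filter is a plain direct-form FIR rather than the PDA structure, and loading the register when the counter is 0 (keeping the even-indexed outputs) is an assumption, since the text does not state which phase is kept.

```python
def decimate_by_two(samples, taps):
    """FIR filter followed by a register that loads on every second clock."""
    history = [0] * len(taps)   # delay line, most recent sample first
    counter = 0                 # 1-bit counter, toggles on every clock
    outputs = []
    for x in samples:           # one input sample per clock
        history = [x] + history[:-1]
        y = sum(c * h for c, h in zip(taps, history))
        if counter == 0:        # assumed phase: register loads when counter is 0
            outputs.append(y)
        counter ^= 1
    return outputs


# Illustrative data only: a 2-tap averaging-style filter.
demo = decimate_by_two([1, 2, 3, 4, 5, 6], [1, 1])
```

Six input samples produce three output samples, matching the stated rates of 1 sample/clock in and 1 sample per 2 clocks out.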
D. Fast Fourier Transform
The Discrete Fourier Transform (DFT) is a discrete transform used for Fourier analysis of a signal. For an input signal, the DFT is formulated as:
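In its standard $N$-point form, with $x[n]$ the input sequence and $X[k]$ the output (symbol names assumed here), this is:

$$ X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi n k / N}, \qquad k = 0, 1, \ldots, N-1 $$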
The FFT is essentially a method for computing the Discrete Fourier Transform using multi-dimensional index mapping, which makes it suitable for real-time applications. The proposed FFT architecture has been implemented with the Cooley-Tukey algorithm [14], [15]. An efficient complex multiplier has been implemented for the complex multiplication in the butterfly, as shown in Fig. 6. The final product of the complex multiplication is:
Instead of using only cosine and sine tables, the complex multiplication can be accomplished with three multipliers, one adder and two subtractors at the cost of one additional table, as shown in TABLE II. The butterfly has been implemented using the proposed efficient complex multiplier. Parallelism in the proposed architecture is achieved by performing each stage with only 8 butterfly units [20], which increases the speed. The output of stage n is the input of stage (n+1); the output of each butterfly unit is fed back to the input. The multiplexer select lines s0 and s1 determine the stage, while s2 provides scalability to the proposed architecture.
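The three-multiplier scheme can be sketched as follows. The stored values used here (C, C+S and C-S in place of only C and S) follow a well-known rearrangement of the complex product and are an assumption about the contents of TABLE II. The function name `complex_mul3` is hypothetical.

```python
def complex_mul3(x, y, c, c_plus_s, c_minus_s):
    """(x + jy) * (c + js) using three multiplications.

    c_plus_s and c_minus_s are the precomputed table values c + s and c - s.
    """
    e = x - y                    # subtractor 1
    z = c * e                    # multiplier 1
    real = c_minus_s * y + z     # multiplier 2, adder:       c*x - s*y
    imag = c_plus_s * x - z      # multiplier 3, subtractor 2: s*x + c*y
    return real, imag


# Illustrative check with small integers: (3 + 4j) * (2 + 5j) = -14 + 23j.
c, s = 2, 5
product = complex_mul3(3, 4, c, c + s, c - s)
```

The operation count (three multipliers, one adder, two subtractors, three stored values per twiddle factor) matches the resources listed in the paragraph above.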
E. Discrete Cosine Transform
The Discrete Cosine Transform is a Fourier-related transform that deals with real numbers only. The 2D DCT, an efficient transform, is used in various compression techniques.
The N-point 1D DCT is defined by (11), where a coefficient is defined piecewise: it takes one value when k = 0 and another value otherwise.
The 2D DCT can be computed by row-column decomposition into two 1D DCTs: 1D DCT blocks applied along the rows and then along the columns implement the 2D DCT. In the proposed architecture, the 1D fast DCT is implemented using Distributed Arithmetic [16], [19], as shown in Fig. 8. The DCT constant coefficient matrix for N = 16 can be represented as:
The matrix has been decomposed into even-subscript and odd-subscript matrices. The even-subscript matrix has been decomposed again into 4×4 matrices. The odd-subscript matrix has been decomposed into a number of 4×4 matrices followed by adders.
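The first step of this decomposition, the split into even-subscript and odd-subscript parts, can be checked numerically. The sketch below is illustrative only: it uses the unnormalized DCT-II kernel cos((2i + 1)kπ / 2N) (the scale factors are omitted as an assumption), does not model the further 4×4 splitting or the DA tables, and all function names are hypothetical.

```python
import math


def dct_kernel(n):
    """Unnormalized DCT-II kernel: kernel[k][i] = cos((2i + 1) * k * pi / (2n))."""
    return [[math.cos((2 * i + 1) * k * math.pi / (2 * n)) for i in range(n)]
            for k in range(n)]


def dct_direct(x):
    """Full N x N matrix-vector product."""
    n = len(x)
    kernel = dct_kernel(n)
    return [sum(kernel[k][i] * x[i] for i in range(n)) for k in range(n)]


def dct_even_odd(x):
    """Same result from two (N/2) x (N/2) products on sums and differences."""
    n = len(x)
    half = n // 2
    kernel = dct_kernel(n)
    sums = [x[i] + x[n - 1 - i] for i in range(half)]    # feed even-subscript outputs
    diffs = [x[i] - x[n - 1 - i] for i in range(half)]   # feed odd-subscript outputs
    out = [0.0] * n
    for k in range(n):
        source = sums if k % 2 == 0 else diffs
        out[k] = sum(kernel[k][i] * source[i] for i in range(half))
    return out


# Illustrative 16-point input (not taken from the paper).
x = [float((7 * i * i + 3) % 23) for i in range(16)]
direct = dct_direct(x)
split = dct_even_odd(x)
```

The split works because the kernel rows are symmetric about the midpoint for even k and antisymmetric for odd k, so each half-size matrix needs only N/2 inputs.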
IV. RESULTS AND ANALYSIS
The reconfigurable architecture has been validated on a Virtex-5 FPGA. The synthesis report is discussed below. The "Minimum input arrival time before clock", i.e. the worst-case input data setup time required with respect to the clock pin, is reported as 4.222 ns. This value is largest for the DCT configuration. The worst-case data output delay after the clock pin, which is the same in all cases, is termed "Maximum output required time after clock" in the final report. There is no combinational data path from input to output. The "FPDA" is expected to offer high speed, since configuration consists essentially of interconnections among basic DSP modules rather than the complex interconnections of an FPGA.
A. Final Reports
There are several advantages to realizing DSP algorithms in the proposed FPDA architecture rather than in an FPGA: different DSP functions can be formed by changing the connectivity among the basic building blocks; the basic building blocks can be placed and routed so that the delay is lower than in an FPGA; and the architecture has low design complexity, a higher utilization factor than an FPGA, a high degree of parallelism, and scalability.
V. CONCLUSIONS
The proposed "Reconfigurable Architecture" includes 'Filter Functions' and 'Linear Transforms'. The combined circuit is essentially the union of all the basic building blocks mentioned above that are required for implementing each of the functions. By interconnecting the building blocks in different ways, various DSP functions can be formed. This process can be viewed as "Configuration". The architecture also offers scalability, as transforms with a higher number of inputs or filter functions with more taps can also be implemented with the same basic building blocks. The inflexibility of ASICs and the low utilization factor and low performance of FPGAs can be overcome with the proposed architecture, because the major building blocks common to most DSP functions are implemented directly in hardware rather than in LUTs, thereby optimizing the silicon utilization factor. Only one configuration can be active at a time, which can be seen as a limitation of the architecture. However, minimization and maximum utilization of the hardware have been achieved at the cost of this limitation. Future work includes the VLSI implementation of the proposed architecture; the implementation of additional filter and linear transform functions, generalizing the architecture to cover all DSP functions; the implementation of high-speed building blocks to achieve a faster architecture; and time and hardware complexity analysis, together with feasibility analysis, of the proposed hardware with other DSP functions.
Employing the Distributed Arithmetic approach for the FIR, IIR, DWT and DCT functions and exploiting the inherent parallelism of the DSP functions substantially enhances the speed of the proposed architecture over FPGAs.
The proposed architecture was validated on a Xilinx Virtex-5 FPGA (device 5vlx330tff1738-2) using Xilinx ISE 9.1i.
