The best-basis algorithm has gained importance in texture-based image compression and signal denoising. This paper proposes an architecture for the wavelet-packet based best-basis algorithm for images, including an architecture for best-tree selection from the 2D wavelet packet decomposition. A precision analysis of the proposed architecture shows that increasing the precision of the input pixels greatly increases the Signal-to-Noise Ratio (SNR) per pixel, whereas increasing the precision of the filter coefficients improves the SNR only slightly. The proposed architecture is described in VHDL at the RTL level, simulated for functional correctness, and implemented on an FPGA.
INTRODUCTION
The inclusion of the wavelet transform in JPEG2000 clearly indicates that it is one of the best candidates for still image compression. In the Discrete Wavelet Transform (DWT), decomposition is performed on the coarse approximation subbands only, and the detail subbands are retained without further decomposition. The performance of wavelet-based applications could be improved if the transform also allowed further decomposition of the detail subbands of the wavelet tree. This type of transform is called the Wavelet Packet Transform (WPT), a generalization of the wavelet transform. Coifman and Wickerhauser [1] proposed the WPT-based Best-Basis algorithm, which selects the most suitable frequency subbands for signal or image compression by optimizing additive information cost functions. The Best-Basis algorithm has gained importance because it significantly outperforms the most advanced wavelet coders and JPEG2000 compression for textured images, such as the well-known test image "Barbara" and fingerprint images, in terms of rate-distortion performance [2]. Another important application of the Best-Basis algorithm is denoising [3]. This is achieved by finding a representation with a few significant terms and neglecting the coefficients smaller than a threshold value. The hardware realization of the best-basis algorithm has therefore attracted much attention.
An architecture for the best-basis algorithm for 1D signals is proposed in [4]. That architecture performs convolution-based WPT and does not discuss the hardware implementation of the information cost function needed to determine the cost of each wavelet packet. A few Multiple Instruction Multiple Data (MIMD) based architectures for the best-basis algorithm are presented in [2] for images or 2D signals and in [5] for 1D signals.
In this work, we propose an architecture for the wavelet-packet based best-basis algorithm for images or 2D signals using the threshold function as the information cost. We also propose architectures to implement the threshold cost function and the best-tree selection from the 2D wavelet packet nodes.
BEST-BASIS ALGORITHM
The best-basis algorithm expands a 2D signal or image into a set of wavelet packet bases having a quad-tree structure. The algorithm finds the set of wavelet bases that provides the most desirable representation of the data relative to a particular cost function. Some of the cost functions, or information costs, are listed in [6]. The threshold cost function counts the number of wavelet coefficients in a particular wavelet packet node whose absolute value is greater than a threshold value 't', as shown in eqns (1-2). The proposed architecture implements the threshold cost function, which is given below.
The cost of the wavelet packet or node is
$$C = \sum_{n} f(u(n)) \qquad (1)$$
where u(n) is the n-th coefficient of the node and
$$f(u(n)) = \begin{cases} 1, & |u(n)| > t \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
where 't' is the threshold value. The steps involved in the best-basis algorithm are as follows:
Step 1: Decompose image into a quad-tree using WPT.
Step 2: Compute the cost of each wavelet packet or node using cost function.
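Step 3: Starting from the bottom of the tree, repeat Step 4 on all nodes except those in the last level until the root is reached.
Step 4: If the total cost of the children nodes of the i-th parent node is less than the cost of the parent node, replace the cost of the parent node with the total cost of its children nodes; otherwise, keep the parent node.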
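Steps 2 to 4 can be sketched in software as follows. This is a minimal illustration, not the proposed hardware: it takes the Step 4 rule to be the one described for the best-tree selector (the parent cost is replaced by the total cost of its children when that total is smaller), and it assumes that the children of node i are stored at indices 4i+1 to 4i+4, which matches the example in which parent address 2 has children at addresses 9 to 12. The function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def threshold_cost(coeffs: Sequence[float], t: float) -> int:
    """Step 2: count the coefficients whose absolute value exceeds t."""
    return sum(1 for u in coeffs if abs(u) > t)


def children(i: int) -> List[int]:
    """Children of node i under the assumed 4*i + 1 .. 4*i + 4 indexing."""
    return [4 * i + k for k in range(1, 5)]


def best_basis(cost: Dict[int, int], levels: int) -> Tuple[Dict[int, int], List[int]]:
    """Steps 3-4: bottom-up pruning; returns (updated costs, selected nodes)."""
    best = dict(cost)
    keep_parent: Dict[int, bool] = {}
    # Nodes above the last level: 1 + 4 + ... + 4**(levels - 1).
    n_internal = (4 ** levels - 1) // 3
    # Children always have larger indices, so a descending scan is bottom-up.
    for i in range(n_internal - 1, -1, -1):
        total = sum(best[c] for c in children(i))
        keep_parent[i] = best[i] <= total
        if not keep_parent[i]:
            best[i] = total  # children are cheaper: parent takes their cost
    # Walk down from the root to list the nodes that form the selected basis.
    basis: List[int] = []
    stack = [0]
    while stack:
        i = stack.pop()
        if i >= n_internal or keep_parent[i]:
            basis.append(i)
        else:
            stack.extend(children(i))
    return best, sorted(basis)
```

For a one-level tree with costs {0: 10, 1: 2, 2: 3, 3: 1, 4: 2}, the children total 8, which is less than the root cost of 10, so the four children form the selected basis and the root cost becomes 8.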
THE PROPOSED BEST-BASIS ARCHITECTURE
The proposed architecture for the best-basis algorithm for 2D signals or images is divided into three stages. In the first stage (Step 1 of the best-basis algorithm), the WPT decomposition is carried out using lifting steps [7]. In the second stage (Step 2 of the algorithm), the cost of each wavelet packet is calculated and written into the dual-port best-basis RAM. The final stage (Step 4 of the algorithm) performs best-basis node selection. The block diagram of the best-basis architecture for 2D signals or images is shown in Fig. 1.
WPT Architecture (First Stage)
The WPT decomposition based on the lifting scheme for the (5,3) and (9,7) biorthogonal wavelets is proposed and discussed in detail by the authors in [8], [9]. In every clock cycle, the odd and even samples of the input signal are read from the dual-port input RAM, and the coarse and detail coefficients are written back into the input RAM. The input samples read from the input RAM are fed to the WPT and cost-function architectures at the same time, so both computations are carried out simultaneously.
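As an illustration of a lifting-based wavelet packet step, the sketch below applies the reversible integer (5,3) lifting steps used in JPEG2000 to a 1D signal and then decomposes both the coarse and the detail outputs. It is a software sketch under stated assumptions, not the architecture of [8], [9]: the symmetric boundary extension, the even-length restriction, and the function names are choices made here for illustration.

```python
from typing import List, Tuple


def lift53_forward(x: List[int]) -> Tuple[List[int], List[int]]:
    """One level of integer (5,3) lifting on an even-length signal."""
    assert len(x) >= 2 and len(x) % 2 == 0
    even, odd = x[0::2], x[1::2]
    half = len(even)
    # Predict step: detail = odd - floor((left even + right even) / 2).
    d = []
    for k in range(half):
        right = even[k + 1] if k + 1 < half else even[k]  # symmetric extension
        d.append(odd[k] - ((even[k] + right) >> 1))
    # Update step: coarse = even + floor((left detail + right detail + 2) / 4).
    s = []
    for k in range(half):
        left = d[k - 1] if k > 0 else d[0]  # symmetric extension
        s.append(even[k] + ((left + d[k] + 2) >> 2))
    return s, d


def lift53_inverse(s: List[int], d: List[int]) -> List[int]:
    """Undo lift53_forward by running the lifting steps in reverse order."""
    half = len(s)
    even = []
    for k in range(half):
        left = d[k - 1] if k > 0 else d[0]
        even.append(s[k] - ((left + d[k] + 2) >> 2))
    odd = []
    for k in range(half):
        right = even[k + 1] if k + 1 < half else even[k]
        odd.append(d[k] + ((even[k] + right) >> 1))
    x = [0] * (2 * half)
    x[0::2] = even
    x[1::2] = odd
    return x


def wpt53(x: List[int], levels: int) -> List[List[int]]:
    """Full 1D wavelet packet tree: split every subband at every level."""
    bands = [x]
    for _ in range(levels):
        bands = [sub for band in bands for sub in lift53_forward(band)]
    return bands
```

Because every lifting step is undone exactly by its integer inverse, `lift53_inverse(*lift53_forward(x))` returns `x` unchanged. A 2D version would apply the same 1D step along the rows and then along the columns of each subband.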
Threshold Cost-Function Architecture (Second Stage)
The threshold function architecture consists of two comparators and two adders, as shown in Fig. 2. In each clock cycle, the even and odd values fed to the architecture are compared with the threshold value 't'. If the even value is greater than the threshold value and the cnt_even_val signal (which indicates that a valid even value is being fed to the architecture) is high, the count value is incremented by one using adder A1. The output of A1 is then fed to the next adder, A2. If the odd value is greater than the threshold value and the cnt_odd_val signal is high, the value is incremented again by A2, and the output of A2 is stored in the "count" register. The output of the register gives the cost of the wavelet packet or node. The cost function output is fed to the best-basis RAM when the "best-tree enable" signal is set to zero. When it is set to one, the best-tree selector output is fed to the RAM.
Best-Tree Selector Architecture (Third Stage)
The best-tree selector architecture for 2D signals is shown in Fig. 3. To determine the Best-Basis nodes, the tree is traversed from the bottom, and each node is assigned its cost value, which in this case is computed with the threshold function. The input to the architecture is fed from the dual-port RAM named Best-Basis RAM. For a 2D signal, each parent node is linked to its four children nodes, so the costs of all four children nodes must be added before the sum is compared with the cost of the parent node.
The architecture reads the costs of two children nodes from the Best-Basis dual-port RAM in the first clock cycle and the costs of the other two children nodes in the second clock cycle. In the second clock cycle, the costs of the first two children nodes are added and stored in register R1. In the third clock cycle, the costs of the other two children nodes are added and stored in register R2. In the fourth clock cycle, the cost of the parent node is read from the Best-Basis RAM and stored in register R4; in the same clock cycle, the values in registers R1 and R2 are added and stored in register R3. In the fifth clock cycle, the total cost of the children nodes stored in R3 and the cost of the parent node stored in R4 are compared. The cost of the parent node is replaced with the total cost of the children nodes if the total cost is less than the parent cost. In the sixth and last clock cycle, the parent node cost is updated in the Best-Basis RAM.
The timing diagram for best-tree selection is shown in Fig. 4. The addresses (Best-Basis RAM Addr_A & B) '9', '10', '11' and '12' are the addresses of the children nodes, and the address '2' is the parent address; the node costs (Best-Basis RAM Data_A & B) corresponding to those addresses are also shown in Fig. 4. The total cost of the children nodes is 798, which is less than the parent cost of 839. The total cost of the children nodes is therefore selected as the best-basis value and written into the parent address '2'.
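The behaviour of the threshold counter and of the six-cycle best-tree selection can be mimicked with the following behavioural sketch. It is a software model under stated assumptions, not the VHDL description: the comparison uses the absolute value, following the definition of the threshold cost function; the function names are hypothetical; and since only the children total of 798 and the parent cost of 839 are given, the individual children costs in the usage example are made-up values chosen to sum to 798.

```python
from typing import Dict, Iterable, List, Tuple

# One tuple per clock cycle: (even, odd, cnt_even_val, cnt_odd_val).
Sample = Tuple[int, int, bool, bool]


def threshold_counter(samples: Iterable[Sample], t: int) -> int:
    """Two comparators and two adders feeding a single 'count' register."""
    count = 0
    for even, odd, even_valid, odd_valid in samples:
        a1 = count + int(even_valid and abs(even) > t)  # adder A1
        a2 = a1 + int(odd_valid and abs(odd) > t)       # adder A2
        count = a2                                      # register update
    return count


def best_tree_select(ram: Dict[int, int], parent: int,
                     kids: List[int]) -> Dict[str, int]:
    """Six-cycle schedule for one parent node; ram models the Best-Basis RAM."""
    # Cycle 1: read the first two children through ports A and B.
    port_a, port_b = ram[kids[0]], ram[kids[1]]
    # Cycle 2: R1 = sum of the first pair; read the second pair.
    r1 = port_a + port_b
    port_a, port_b = ram[kids[2]], ram[kids[3]]
    # Cycle 3: R2 = sum of the second pair.
    r2 = port_a + port_b
    # Cycle 4: R4 = parent cost; R3 = R1 + R2.
    r4 = ram[parent]
    r3 = r1 + r2
    # Cycle 5: compare the children total with the parent cost.
    best = r3 if r3 < r4 else r4
    # Cycle 6: write the selected cost back to the parent address.
    ram[parent] = best
    return {"R1": r1, "R2": r2, "R3": r3, "R4": r4, "best": best}


# Hypothetical children costs that sum to the total of 798 in the example.
ram = {2: 839, 9: 200, 10: 198, 11: 210, 12: 190}
registers = best_tree_select(ram, parent=2, kids=[9, 10, 11, 12])
```

After the call, `ram[2]` holds 798, the children total, because it is smaller than the original parent cost of 839.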
To determine the Best-Basis nodes, the tree is traversed from the bottom, and each node is assigned a flag and its cost value based on the threshold function. When the wavelet packet tree is constructed, the flags of all the nodes or wavelet packets are set to one. If the cost of the parent node is less than the total cost of the children nodes associated with it, the flag of the parent node and the flags of all the nodes in the sub-tree of the parent node are set to one. Otherwise, the cost of the parent node is replaced with the total cost of the corresponding children nodes, and the flag of the parent node is set to zero, leaving the flags of all the nodes in the sub-tree unchanged. After the Best-Basis set is determined, the Best-Basis nodes are calculated by performing the inverse wavelet transform on the nodes whose flags are set to one.
PERFORMANCE EVALUATION
To explore the effects of precision on the proposed best-basis architecture, an image database [10] of fifty images of size 256x256 is used. A C program was written to compute the average Signal-to-Noise Ratio (SNR) per pixel between the original and the reconstructed images for the proposed (9,7) wavelet-packet based best-basis architecture, after performing the best-basis algorithm from the second and third levels of WPT. The SNR value is calculated for various precisions of the filter coefficients and the pixel values and is shown in Table 1. It is evident from Table 1 that the SNR value increases considerably with the precision of the pixel values but only slightly with the precision of the filter coefficients. For the (5,3) wavelet-packet based best-basis algorithm, the original image and the image reconstructed from the best-basis selection are identical, so a study of the filter-coefficient and pixel precision is not required.
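A precision study of this kind can be sketched as follows. The SNR definition used here, the ratio of signal energy to error energy in decibels, is an assumption, since the exact per-pixel formula is not stated; the `quantize` helper is likewise a hypothetical model of fixed-point rounding to a given number of fractional bits, not the procedure of the C program.

```python
import math
from typing import Sequence


def snr_db(original: Sequence[float], reconstructed: Sequence[float]) -> float:
    """SNR in dB, assumed to be 10*log10(signal energy / error energy)."""
    signal = sum(float(o) ** 2 for o in original)
    noise = sum((float(o) - float(r)) ** 2
                for o, r in zip(original, reconstructed))
    if noise == 0.0:
        return math.inf  # perfect reconstruction, as for the integer (5,3) case
    return 10.0 * math.log10(signal / noise)


def quantize(value: float, frac_bits: int) -> float:
    """Round value to a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


# First lifting coefficient of the (9,7) wavelet, used only as an example value.
ALPHA = -1.586134342
error_8_bits = abs(ALPHA - quantize(ALPHA, 8))
error_12_bits = abs(ALPHA - quantize(ALPHA, 12))
```

With more fractional bits the rounding error of the coefficient shrinks, so `error_12_bits` is smaller than `error_8_bits`; the same helper can be applied to intermediate pixel values to model the pixel precision.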
Before hardware implementation and testing, the VHDL model of the proposed (5,3) and (9,7) lifting-based wavelet-packet Best-Basis algorithm using the threshold cost function for 2D signals or images was tested for correct functionality using the ModelSim functional simulation tool. The VHDL model was developed to perform the Best-Basis algorithm from two-level and three-level 2D WPT. The proposed architecture is synthesized and implemented on a Xilinx Virtex-2Pro FPGA. This device features dual-port Block RAMs and embedded multipliers, both of which have been used in the implementation. The image of size 256x256 is stored in the dual-port Block RAM available in the FPGA. Table 2 presents the hardware resource usage of the unit for an image of size 256x256. Except for the number of RAM blocks, the values in Table 2
Table 2: Implementation results of the (5,3) and (9,7) wavelet-packet based best-basis algorithm for images on the Xilinx Virtex-2Pro FPGA
From two-level WPT
The software implementation (a C program) that performs the Best-Basis algorithm from two-level and three-level 2D WPT using the threshold cost function runs on a Sun Blade 1500 workstation with one 1.5 GHz UltraSPARC IIIi processor. Tests are performed on all the images of size 256x256 in the image database. Table 3 shows the speedup achieved by the hardware implementation over the software implementation for images with 8-bit depth.
CONCLUSION
In this paper, an efficient best-basis architecture for images using the lifting-based (5,3) and (9,7) wavelets and the threshold cost function has been proposed. The proposed architecture performs the threshold cost-function calculation and the wavelet packet decomposition simultaneously, which increases the speed of the architecture. The proposed architecture has been described in VHDL at the RTL level and simulated to
