Abstract-Image data consumes enormous bandwidth and storage space. Neural networks can be used for image compression. Neural network architectures have proven to be more reliable, robust, and programmable, and offer better performance than classical techniques. This paper focuses on developing new architectures for neural-network-based image compression that optimize area, power and speed for ASIC implementation, and on comparing them with an FPGA implementation.
I. INTRODUCTION
The transport of images across communication paths is an expensive process. Image compression provides an option for reducing the number of bits in transmission. This in turn helps increase the volume of data transferred in a given time, along with reducing the cost required. It has become increasingly important to most computer networks, as the volume of data traffic has begun to exceed their capacity for transmission. Traditional techniques that have already been identified for data compression include predictive coding, transform coding and vector quantization [1, 2]. In brief, predictive coding refers to the decorrelation of similar neighboring pixels within an image to remove redundancy. Following the removal of redundant data, a more compressed image or signal may be transmitted [1]. Transform-based compression techniques have also been commonly employed. These techniques execute transformations on images to produce a set of coefficients. A subset of coefficients is chosen that allows good data representation (minimum distortion) while maintaining an adequate amount of compression for transmission. The results achieved with a transform-based technique are highly dependent on the choice of transformation used (cosine, wavelet, Karhunen-Loeve etc.) [2]. Finally, vector quantization techniques require the development of an appropriate codebook to compress data. The use of codebooks does not guarantee convergence and hence does not necessarily deliver infallible decoding accuracy. The process may also be very slow for large codebooks, as it requires extensive searches through the entire codebook [1].
Manuscript received June 16, 2010; revised June 20, 2011.
K. Venkata Ramanaiah, Principal, Narayana Engineering College, Gudur, Andhra Pradesh, India (ramanaiahkota@gmail.com).
C. P. Raj, Course Manager, M.S. Ramaiah School of Advanced Studies, Bangalore, Karnataka, India.
Artificial Neural Networks (ANNs) have been applied to many problems [3] , and have demonstrated their superiority over traditional methods when dealing with noisy or incomplete data. One such application is for image compression. Neural Networks seem to be well suited to this particular function, as they have the ability to preprocess input patterns to produce simpler patterns with fewer components [1] . This compressed information (stored in a hidden layer) preserves the full information obtained from the external environment. Not only can ANN based techniques provide sufficient compression rates of the data in question, but security is easily maintained. This occurs because the compressed data that is sent along a communication line is encoded and does not resemble its original form.
The basic architecture for image compression using a neural network is shown in Fig. 1.
II. FPGA IMPLEMENTATION OF 16 & 64 INPUT NEURAL NETWORK ARCHITECTURE
The neural network architectures proposed in this paper, with 64 and 16 input neurons, are modeled using HDL. Numbers in the range 0 to 1 are supported by introducing BCSD multipliers for weight multiplication [6]. The HDL code for the proposed network is verified for functionality using a test bench, and the design is synthesized on an FPGA to estimate the hardware complexity for efficient ASIC implementation. The design is mapped onto a Spartan III device from Xilinx. The synthesis results are shown in Table 1. Because the design is mapped onto an FPGA, it supports reconfigurability, which can be achieved by changing the weight matrix and the input layer for better compression [8].
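The idea of multiplying by a fractional weight without a full multiplier can be sketched in software. The sketch below assumes that BCSD stands for binary canonical signed digit and that weights are quantized to 8 fractional bits; neither detail is stated in the paper, and the function names `to_csd` and `csd_multiply` are hypothetical. It recodes the fixed-point weight into digits from {-1, 0, +1} with no two adjacent nonzero digits, then multiplies using only shifts, additions and subtractions.

```python
def to_csd(n):
    """Return the canonical signed-digit form of a non-negative integer, LSB first."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 when n mod 4 == 1, -1 when n mod 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def csd_multiply(x, weight, frac_bits=8):
    """Multiply integer x by a weight in [0, 1] using shifts and adds/subtracts only.

    frac_bits is an assumed quantization width, not a value from the paper.
    """
    w_fixed = round(weight * (1 << frac_bits))  # fixed-point weight
    acc = 0
    for shift, d in enumerate(to_csd(w_fixed)):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc >> frac_bits  # drop the fractional bits


if __name__ == "__main__":
    print(to_csd(7))               # 7 = 8 - 1, so only two nonzero digits
    print(csd_multiply(200, 0.75))
```

Because a signed-digit code has fewer nonzero digits than plain binary on average, each weight needs fewer adders, which is the usual motivation for this kind of multiplier in area-constrained designs.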
The proposed design performs a matrix multiplication of two matrices: the input image samples and the weight matrix obtained after training. The product is passed through a nonlinear transfer function to obtain the compressed output, which is transmitted or stored in compressed format. On the decompression side, the compressed data in matrix form is multiplied by a weight matrix to recover the original image. The quality of the decompressed image depends on the weight matrix. The image data of size 16x16 is multiplied by the weight matrix of size 4x16 to produce a compressed output of size 4x16. On the decompressor side, the 4x16 input matrix (the compressed image) is multiplied by the weight matrix of size 16x4 to reproduce the original 16x16 image. To achieve better compression, nonlinear functions are used at both the transmitter and the receiver. The HDL code modeled for FPGA implementation is modified for ASIC implementation, and general coding guidelines are adopted to build optimized RTL code. The results obtained from ASIC synthesis through physical implementation are compared with the FPGA implementation. The synthesized architecture is optimized for power, area and speed.
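The matrix dimensions described above can be checked with a small software model. This is a minimal sketch, not the authors' HDL: the weights are random placeholders rather than trained values, the sigmoid is an assumed choice for the unspecified nonlinear transfer function, and the names `matmul`, `sigmoid`, `w_enc` and `w_dec` are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply matrix a (r x n) by matrix b (n x c), both given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def sigmoid(m):
    """Element-wise logistic function (assumed transfer function)."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in m]


def random_matrix(rows, cols, lo, hi, rng):
    return [[rng.uniform(lo, hi) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
image = random_matrix(16, 16, 0.0, 1.0, rng)    # 16x16 block, pixels scaled to [0, 1]
w_enc = random_matrix(4, 16, -0.5, 0.5, rng)    # 4x16 hidden-layer weights (placeholder)
w_dec = random_matrix(16, 4, -0.5, 0.5, rng)    # 16x4 output-layer weights (placeholder)

compressed = sigmoid(matmul(w_enc, image))      # 4x16 compressed block
restored = sigmoid(matmul(w_dec, compressed))   # 16x16 reconstructed block

print(len(compressed), "x", len(compressed[0]))
print(len(restored), "x", len(restored[0]))
```

With untrained weights the reconstruction will not resemble the input; the sketch only shows the data flow and confirms that a 16x16 block shrinks to 4x16 (a 4:1 reduction in values) and expands back to 16x16.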
The design is synthesized using TSMC 130 nm technology and library files. Synthesis is carried out using Synopsys DC, and timing analysis using PrimeTime. The proposed neural network architecture is implemented on FPGA as well as ASIC. The HDL model developed for the entire network, supporting the 64-4-64 and 16-4-16 configurations, is verified for functionality. The input image pixels of size 64 x 1, represented in integer form, are stored in RAM and fed into the network for processing.

Simulation results for the 16-input neuron are shown in Fig. 2. The results obtained from the ModelSim simulation for the first row of the neuron exactly match the first column of the C matrix obtained from MATLAB. Fig. 3 highlights the remaining rows after compression. These results also match the MATLAB results.

Fig. 2. Simulation results for 16 input neuron.

As the complexity of the RTL code increases, the area should increase. However, due to the efficient coding style adopted ...

The above techniques are set using the tool options and by changing the coding styles. Optimization 1 refers to the results obtained with the initial settings; these results reflect the performance of the design without any optimization techniques incorporated. Optimization 2 refers to the use of built-in constraints available in the Design Compiler (DC) tool. This involves setting constraints on clock, area, power, arrival times, load capacitance, gate sizing, clock buffers and buffer insertion. Optimization 3 refers to modifying the coding style by inserting clock gating, power gating, resource sharing and memory sharing techniques. Optimizations 4, 5 and 6 refer to the techniques adopted by the tool based on the constraints set during the flow. The graphs in Fig. 5 to Fig. 8 show the variation in performance from the initial design to the final design for the optimization techniques mentioned.

Fig. 4 shows the synthesized RTL net list obtained using Synopsys Design Compiler. The green cells represent logic gates from the TSMC 130 nm technology library, the blue lines represent interconnects, and the red line represents the interconnect with the maximum delay (the critical path).

Slack indicates whether data is available at the right time to be latched into the next stage. Slack should always be positive; positive slack implies that the design meets its timing requirements. The design is first checked for slack without optimization. Fig. 5 shows that the slack, which was initially negative, becomes a positive slack of 0.18. At a slack of 0.0 the design just meets timing; with optimization techniques, by setting constraints on the clock, the slack is made positive. The results are obtained using Design Compiler, a signoff tool from Synopsys. A slack of 0.08 is recommended as optimum.
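The notion of slack used above can be made concrete with the usual setup-timing relation: slack is the time at which data is required minus the time at which it actually arrives. The sketch below is illustrative only. The clock period, arrival time, setup time and uncertainty are hypothetical values chosen so that the result equals the 0.18 reported in the text; the paper does not give these inputs or the unit of slack.

```python
def setup_slack(clock_period, arrival_time, setup_time=0.0, clock_uncertainty=0.0):
    """Setup slack = data required time - data arrival time."""
    required_time = clock_period - setup_time - clock_uncertainty
    return required_time - arrival_time


def meets_timing(slack):
    """A path meets timing when its slack is zero or positive."""
    return slack >= 0.0


# Hypothetical path before optimization: data arrives too late.
before = setup_slack(clock_period=2.0, arrival_time=2.1,
                     setup_time=0.1, clock_uncertainty=0.02)

# Hypothetical path after buffering and gate sizing shorten the critical path.
after = setup_slack(clock_period=2.0, arrival_time=1.7,
                    setup_time=0.1, clock_uncertainty=0.02)

print(f"before: {before:+.2f}, meets timing: {meets_timing(before)}")
print(f"after:  {after:+.2f}, meets timing: {meets_timing(after)}")
```

A large positive slack means the path is faster than it needs to be, which usually costs area and power; this is one reason a small positive target is preferred over a large margin.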
To achieve positive slack, buffers are introduced on the critical path, and gate sizes are increased to supply higher currents to drive the load. This drastically increases the cell area. However, by adopting coding guidelines, the percentage increase in cell area is minimized. The number of nets also increases as the layers are increased to more than three. Fig. 6 shows that, to achieve positive slack, the cell count increases from 353 to 486, and the number of nets increases from 440 to 581. The cell and net counts reach saturation after certain limits. This demonstrates the idealities of the synthesis tool. Fig. 7 shows that the total cell area increases from 6319 sq µm to 8662 sq µm; this is done to achieve low power and higher speed.

Across optimizations 1, 2, 3, 4, 5 and 6, the dynamic power varies from 449 µW to 815 µW. Frequency variation is found to affect power. The increase in power is kept under control by constraining the design during synthesis and adopting suitable power saving techniques. As a result, during optimization the gradual increase in power is limited to 713 µW, which is better than the previous results, as shown in Fig. 8. However, the leakage power increases from 18 µW to 28 µW. This is because the power saving techniques incorporated, such as clock gating and power gating, introduce additional cells that enable logic only when it is required, keeping most of the logic in standby mode by disconnecting it from the clock and power networks. These additional cells increase the leakage power. High-Vt library cells can be used to reduce the leakage power; this has not been tried in this work.

The synthesized net list is taken through the VLSI physical design flow to generate the GDSII.
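The cost of closing timing can be summarized as relative increases computed from the before and after values quoted above. The short calculation below uses only those reported numbers; the helper name `percent_increase` is hypothetical.

```python
def percent_increase(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# (before, after) pairs quoted in the text
reported = {
    "cell count": (353, 486),
    "net count": (440, 581),
    "cell area (sq um)": (6319, 8662),
    "leakage power (uW)": (18, 28),
}

increases = {name: percent_increase(b, a) for name, (b, a) in reported.items()}

for name, pct in increases.items():
    print(f"{name}: +{pct:.1f}%")
```

Cell count and cell area both grow by a little under 38%, nets by about 32%, and leakage power by about 56%, so leakage is the metric most affected by the added gating and buffering cells.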
In this flow the design is taken through the physical design steps of floor planning, placement, clock tree synthesis, routing, physical verification, parasitic extraction and, finally, timing verification to sign off the design. This ensures that pre-silicon verification is carried out using industry-standard signoff tools and that the design is taped out for fabrication with the generation of the GDSII format. The Synopsys flow, one of the standard signoff tool flows, is used for physical design and verification. Fig. 9 shows the synthesis results of the 16-input neuron RTL code synthesized using Synopsys DC, targeting the TSMC 130 nm library.
This design has to be taken through the Physical design flow. Fig. 9 shows the gate level net list, but does not give details of the cell placement, power network, clock network and pin details.
The 16-input neuron design developed in this work is treated as the major macro for image compression using a neural network; it is followed by quantization, data encoding and storage modules. The design is therefore converted into a macro that can be interfaced with other building blocks. The pin configuration and sizing of this macro determine its cost function and performance.

The design is taken through the routing phase, and Fig. 11 shows the placed and routed design. As the design has 539 nets to be routed, metal layers 1 to 8 are used, with metal layer 6 carrying the largest share of the wire length, 6209.5 µm. The total wire length is 12592.2 µm. The design is constrained to minimize congestion. In Fig. 11, the vertical and horizontal pink lines are the wires connecting the standard cells within the core area. Both local and global routing are performed; the pink lines in Fig. 11 show the final routing directions.

The color map in Fig. 12 shows the IR drop distribution within the die. Red marks regions that violate the limit and consume more power. Green marks regions with nominal power consumption. As there are very few areas with high IR drop, this is treated as a warning and neglected. The cell power shows a 97% improvement over the power reported during the synthesis phase. This shows that physical design improves the power estimate, as it accounts for the actual cell and pin placement with the clock and nets routed.
The routed design is taken through IR drop analysis to find the impact on power dissipation of the parasitics extracted from the nets and the cells. The parasitic report shows that the design has 803 internal nodes, 8 boundary nodes, 881 resistors and 500 current sources. This data makes it possible to compute the power dissipation. The total power is found to be 0.0202574 mW and the I/O net switching power is 0.000152925 mW. Fig. 13 compares the hardware implementations, focusing on the major parameters of area and power.
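The post-route figures above are given in mW while the synthesis figures are in µW, so a unit conversion makes them easier to compare. The calculation below is a sketch under one assumption that the paper does not confirm: that the 97% improvement quoted earlier compares the post-route total power with the 713 µW dynamic power from synthesis.

```python
UW_PER_MW = 1000.0

post_route_total_mw = 0.0202574   # total power after IR drop analysis (from the text)
io_switching_mw = 0.000152925     # I/O net switching power (from the text)
synthesis_dynamic_uw = 713.0      # dynamic power after synthesis optimization (from the text)

post_route_total_uw = post_route_total_mw * UW_PER_MW
io_switching_uw = io_switching_mw * UW_PER_MW

# Assumed comparison: post-route total versus synthesis dynamic power.
reduction_pct = 100.0 * (synthesis_dynamic_uw - post_route_total_uw) / synthesis_dynamic_uw

# Share of the post-route total that is I/O net switching.
io_share_pct = 100.0 * io_switching_mw / post_route_total_mw

print(f"post-route total: {post_route_total_uw:.4f} uW")
print(f"I/O switching:    {io_switching_uw:.6f} uW ({io_share_pct:.2f}% of total)")
print(f"reduction vs synthesis estimate: {reduction_pct:.1f}%")
```

Under that assumption the reduction comes out at roughly 97%, which is consistent with the reported improvement, and I/O net switching accounts for well under 1% of the post-route total.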
The results clearly show the performance metrics of each implementation. The FPGA occupies more area and consumes more power; its only advantages are reconfigurability and time to market. The ASIC results are better than the FPGA results in terms of area and power. The ASIC consumes less power and area, and is hence suitable for low-cost and reliable hardware implementation.
V. CONCLUSION
An ASIC implementation of a neural network architecture for image compression has been successfully realized using 130 nm technology. The input image is compressed and decompressed using the two-layered neural network architecture. The network, trained using the backpropagation algorithm, is realized using multipliers and adders. The trained weights are stored in memory and are used to compress and decompress the image. Low power techniques have been used to reduce the power dissipation of the complex architecture. Power can be further reduced by replacing the multipliers and adders with low power arithmetic units.
The techniques proposed in this work are modeled, designed and validated as per the Hardware requirements for ASIC and FPGA implementation. Suitable techniques have been incorporated to optimize area, power and speed.
