INTRODUCTION
Image compression techniques can be divided into two classes: lossless and lossy. Lossless image compression is particularly useful in applications such as image archiving and facsimile transmission. However, most applications today use lossy image compression because its compression ratio is higher than that of lossless compression, which is crucial for many image applications. Lossy image compression techniques play a major role in systems with limited transmission bandwidth and storage capacity. There are various schemes and standards for lossy image compression. One of them is the Joint Photographic Experts Group (JPEG) standard, the most widely used standard for image compression. The 2-D discrete cosine transform (DCT) is the main transformation core of the JPEG standard in lossy mode. The DCT algorithm introduced by Ahmed et al. (1974) remains one of the best known and most widely used transform techniques in digital signal processing (DSP), especially for image compression, because of its excellent energy compaction characteristic. This type of transform can also be computed by using the fast Fourier transform (FFT), as explained by Lim (1990). The DCT process is applied to blocks of 8×8 or 16×16 pixels, converting each block into a series of coefficients that define the spectral composition of the block. The 2-D DCT is a separable transform and comprises a forward DCT (FDCT) and an inverse DCT (IDCT). The DCT is also the basis for many other image and video compression standards such as MPEG-1, MPEG-2 and MPEG-4 (ISO/IEC 11172-2 video MPEG-1, 1991; ISO/IEC 13818-2 video MPEG-2, 1994; ISO/IEC JTC1/SC29/WG11 N4030, 2001). Hence, there is considerable scope for the development of DCT chips, which makes the design and development of a 2-D DCT chip very important.
*Corresponding author. E-mail: shabiul@ukm.my, shakir_dhaka@yahoo.com.
Cintra and Bayer (2011) introduced an orthogonal approximation for the 8-point DCT computation based on matrix polar decomposition. The proposed transformation matrix contains only zeros and ones; multiplications and bit shift operations are absent. The proposed algorithm is superior to the signed DCT. It could also outperform state-of-the-art algorithms in low and high image compression scenarios, according to PSNR, UQI and MSE measurements, while exhibiting a comparable computational complexity. Block-based quantization has been widely accepted in state-of-the-art image/video coding standards. Jin et al. (2011) proposed a block-based decontouring method that reduces the false contour artifacts in the decoded image/video by automatically dithering its direct current (DC) value according to a composite model established between gradient smoothness and block-edge smoothness, thus improving compression efficiency. A DCT-based block-level contour artifact detection mechanism ensures that the blocks within the texture region are not affected by the DC dithering. Jridi and Alfalou (2010) implemented a new design of a low-power and high-speed DCT for image compression on a field programmable gate array (FPGA) board. The proposed compression method converts the image to be compressed into many lines of 8 pixels and then applies the optimized 1-D DCT algorithm for compression. The DCT optimization is based on the hardware simplification of the multipliers used to compute the DCT coefficients. By using constant multipliers based on canonical signed digit (CSD) encoding, the number of adders, subtracters and registers is kept to a minimum. To further decrease the number of required arithmetic operators, a new technique based on common sub-expression elimination (CSE) is examined. FPGA implementations show that the CSE implies less computation, less material complexity and a dynamic power saving of about 22% at a clock frequency of 110 MHz in a Spartan3E device.
The required silicon area and power consumption were reduced and the maximum operating frequency was increased. Hong et al. (2011) proposed a quality inspection system for optical lenses using computer vision techniques. The system is able to inspect light-emitting diode (LED) lenses visually and to validate their quality level automatically based on the defect severity. The optical inspection system applies the block discrete cosine transform (BDCT), the Hotelling T² statistic, and a grey clustering technique to detect visual defects of LED lenses. A spatial domain image with equal sized blocks is converted to the DCT domain and some representative energy features of each DCT block are extracted. Hyungjun and Park (2011) proposed a new ringing-artifact reduction method for image resizing in a block DCT domain. The proposed method reduces ringing artifacts without further blurring, whereas previous approaches must find a compromise between blurring and ringing artifacts. The proposed method consists of DCT-domain filtering and image-domain post-processing, which reduces ripples on smooth regions as well as overshoot near strong edges. The DCT and the discrete sine transform (DST) are two transform techniques that are used extensively for data compression, most notably in audio, speech and image processing applications. Several algorithms for the computation of DCT and DST coefficients have been developed so far, among which the independent update algorithm seems to be the most promising technique for future applications. Dibyayan et al. (2011) presented an FPGA implementation of the DCT-II independent update algorithm and evaluated its performance. The use of this algorithm for applications such as image intelligence, biometric systems and image duplication is also discussed.
Fu et al. (2004) completed a low-power 2-D DCT IP core based on 2-D algebraic integer encoding and discussed the application of a new 2-D algebraic integer encoding scheme to the design of a 2-D DCT processor core for JPEG and MPEG applications. The processor takes advantage of the less complex, multiplier-less and high-precision nature of the algebraic integer encoding schemes to achieve low power consumption. Test results from a proof-of-concept 0.18 µm CMOS 8×8 2-D DCT chip demonstrate a low power dissipation of 7.5 mW at 75 MHz. Danian et al. (2004) presented a novel cost-effective VLSI implementation of the 2-D DCT and its inverse. The VLSI architecture, namely the transpose-free row-column decomposition method, replaced the transpose circuits with permutation networks and parallel memory modules. As a result, the timing overhead of I/O operations was eliminated and the hardware complexity was largely reduced. An accuracy testing system was designed to find the optimum wordlength parameters. Based on the accuracy testing system, the proposed architecture achieved the smallest wordlength among the reported 2-D DCT architectures. Synthesis results showed that, with 0.25 µm CMOS technology, the area was about 1.5 mm² and the speed was about 125 MHz.
With sophisticated processing schemes at hand and further promising advances in multimedia algorithm research to come, efficient VLSI implementation assumes enormous importance. Conventional DSPs are highly optimized for processing speech/audio and lack the high performance needed for image and video processing. Programmable DSPs may reach higher performance levels for desktop computing, but are typically weak at signal processing and too expensive and power consuming for typical multimedia applications. Thus, the development of the 2-D DCT chip using FPGA (as outlined in this work) becomes a subject of much attention.
The objectives of our research work are: to design a faster 2-D DCT processor with a higher operating frequency (that is, 140 MHz) for higher image compression; to develop the hardware functional blocks that the processor needs, such as adders, subtractors and multipliers; to integrate the developed functional blocks into our proposed 2-D DCT processor; and finally, to verify the performance of the complete fast 2-D DCT processor on an FPGA platform.
2-D DISCRETE COSINE TRANSFORM (DCT) COMPUTATION
Under such a technique, a source image is first partitioned into blocks of (8×8) pixels. The FDCT of each block is then computed using Equation 1 explained in detail by Loannis (1993) .
where C(k_1, k_2) are the DCT coefficients, k_1 = 0, 1, 2, 3, …, (N_1 − 1) and k_2 = 0, 1, 2, 3, …, (N_2 − 1). Here, N_1 and N_2 are the total numbers of coefficients for the row and column matrices, respectively. W is the weighting factor, Re{.} is the real part of a complex number, and V(k_1, k_2) is the DFT. The FDCT employs the 2-D FFT algorithm for the transformation from the time domain to the frequency domain. The 2-D FFT coefficients of the image signal x(n_1, n_2) can be computed using Equation 2.
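As a hedged illustration of how DCT coefficients can be obtained through an FFT, weighted and reduced to their real part, the following Python sketch computes a separable 2-D DCT of one block and checks it against a direct cosine sum. The scaling (an unnormalised DCT-II, giving a factor of 4 in two dimensions), the symmetric-extension construction, and every function name are assumptions made for illustration; they are not taken from Equations 1 and 2.

```python
import cmath
import math


def fft(a):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2])
    odd = fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dct_1d(x):
    """Unnormalised DCT-II, C(k) = 2 * sum_n x(n) cos(pi k (2n+1) / (2N)),
    computed from a 2N-point FFT of the symmetrically extended input."""
    n = len(x)
    y = list(x) + list(x)[::-1]      # symmetric extension, length 2N
    spectrum = fft(y)                # DFT of the extended sequence
    # Multiply by the weighting factor and keep the real part.
    return [(cmath.exp(-1j * math.pi * k / (2 * n)) * spectrum[k]).real
            for k in range(n)]


def dct_2d(block):
    """Separable 2-D DCT: 1-D DCT on every row, then on every column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def dct_2d_direct(block):
    """Reference: the same transform as a direct double cosine sum."""
    n1, n2 = len(block), len(block[0])
    return [[4 * sum(block[a][b]
                     * math.cos(math.pi * k1 * (2 * a + 1) / (2 * n1))
                     * math.cos(math.pi * k2 * (2 * b + 1) / (2 * n2))
                     for a in range(n1) for b in range(n2))
             for k2 in range(n2)]
            for k1 in range(n1)]
```

For an 8×8 block the extended sequences have 16 samples, so the radix-2 FFT applies directly. The direct sum is included only as a cross-check; it is not part of the FFT route.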
The FDCT outputs represent a set of 64-DCT coefficients; their values are uniquely determined by the particular 64-point input signal. The DCT coefficient values are thus regarded as the relative amount of the 2- 
If the DCT coefficients C(k_1, k_2) and the computed DFT values V(k_1, k_2) from Equation 2 are known, the decoded image signal x(n_1, n_2) can be recovered easily using Equation 3 above. Firstly, the DCT algorithm was developed in a Matlab simulation session to achieve a higher image compression ratio. Here, the gray-scale image data ("Lena") consisting of 2^8 = 256 levels in ASCII format was used as the input file. The input image file of Lena (25, 7944 pixels) is used for all sub-blocks, with a total of 4096 iterations to represent the reconstructed image. The compression ratio is calculated using Equation 4. After the quantization process, the compression ratios for the original and the developed 2-D DCT algorithms are shown in Tables 1 and 2, respectively. As an example, 15 iterations were chosen randomly, and their corresponding numbers of non-zero values were used to calculate the average compression ratio. We observed that the compression ratio of the developed algorithm (that is, 6.26) is higher than that of the original one (that is, 3.08).
Compression ratio = 64/number of non-zero values (4)
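Equation 4 can be sketched in a few lines of Python. The helper names, the handling of an all-zero block, and the choice to average the per-block ratios are assumptions for illustration only; the text does not state how the average over the 15 selected iterations was formed.

```python
def split_into_blocks(image, size=8):
    """Yield size x size sub-blocks of a 2-D list, scanning row by row."""
    for r in range(0, len(image), size):
        for c in range(0, len(image[0]), size):
            yield [row[c:c + size] for row in image[r:r + size]]


def compression_ratio(quantized_block):
    """Equation 4: number of coefficients / number of non-zero coefficients."""
    total = sum(len(row) for row in quantized_block)   # 64 for an 8x8 block
    nonzero = sum(1 for row in quantized_block for v in row if v != 0)
    if nonzero == 0:
        # Assumption: an all-zero block is reported as an infinite ratio.
        return float("inf")
    return total / nonzero


def average_compression_ratio(blocks):
    """Assumption: the average is the mean of the per-block ratios."""
    ratios = [compression_ratio(b) for b in blocks]
    return sum(ratios) / len(ratios)
```

Under these assumptions, a 512 × 512 image yields 4096 blocks, and a quantized block that keeps 16 non-zero coefficients out of 64 has a compression ratio of 4.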

COMPARISON OF SIMULATION RESULTS BETWEEN MATLAB AND VHSIC HARDWARE DESCRIPTION LANGUAGE (VHDL)
The 2-D DCT algorithm was developed in Matlab sessions, and the gray-scale image data ("Lena") consisting of 2^8 = 256 levels in ASCII format was used as the input file. The complete matrix of input image data (512 × 512 = 262144 pixels), processed in 4096 iterations, was applied to compute the image compression of the 2-D DCT algorithm. The arithmetic operations are performed with a bit-width of 16 bits. After gaining confidence in the Matlab results, the algorithm was specified in VHDL. A VHDL test bench was generated to verify the correctness of the VHDL model of the 2-D DCT algorithm.
The experimental results show that the 2-D DCT coefficients of the output image given by Matlab and VHDL are almost equal, as shown in Table 3. The minor difference is attributed to the different internal architectures of the PC (16-bit) and the workstation (64-bit). To obtain the timing information and add constraints that meet the timing goals of our design, a timing simulation was performed. As an example, the delay between the input and the output signals is approximately 140 ns, as shown in Figure 1. The data path elements in the processor architecture are bit-parallel ExUs, and they communicate via a dedicated bus network. We can therefore conclude that the 2-D DCT algorithm can be synthesized to the logic gate level of hardware design for VLSI implementation.
DEVELOPMENT OF THE ENHANCED ARITHMETIC LOGIC UNIT (ALU) BLOCK FOR 2-D DCT CHIP
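To design a faster 2-D DCT chip, we have enhanced the conventional adder, subtractor, and multiplier in the arithmetic logic unit (ALU) block. First, a faster 4-bit adder based on the carry look-ahead adder (CLA) method was chosen as the basic building block. The CLA circuits generate and propagate carries ahead of time according to the following CLA (4-bit full adder) Equations 5 to 9.
C_{i+1} = g_i + P_i C_i (5)
C_2 = g_1 + P_1 C_1 (6)
C_3 = g_2 + P_2 g_1 + P_2 P_1 C_1 (7)
C_4 = g_3 + P_3 g_2 + P_3 P_2 g_1 + P_3 P_2 P_1 C_1 (8)
C_5 = g_4 + P_4 g_3 + P_4 P_3 g_2 + P_4 P_3 P_2 g_1 + P_4 P_3 P_2 P_1 C_1 (9)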
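The carry look-ahead idea can be sketched in Python as a bit-level model. This is a textbook formulation and not the authors' VHDL: the definitions g = A AND B and P = A XOR B, the function names, and the two's-complement subtractor built on the same adder are assumptions for illustration. The carries are named c1 to c5 so that c1 plays the role of the initial carry C_1.

```python
def cla_add4(a, b, carry_in=0):
    """Add two 4-bit values with look-ahead carries.

    Returns (sum modulo 16, carry out). Every carry is written directly
    in terms of the generate/propagate bits and the initial carry, so no
    carry has to ripple through the preceding stages.
    """
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    g = [x & y for x, y in zip(a_bits, b_bits)]   # generate bits
    p = [x ^ y for x, y in zip(a_bits, b_bits)]   # propagate bits

    c1 = carry_in
    c2 = g[0] | (p[0] & c1)
    c3 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c1)
    c4 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c1))
    c5 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & c1))

    carries = [c1, c2, c3, c4]
    total = sum((p[i] ^ carries[i]) << i for i in range(4))
    return total, c5


def sub4(a, b):
    """A - B via two's complement: A + (NOT B) + 1, modulo 16.

    The returned carry is 1 when no borrow occurred (a >= b).
    """
    return cla_add4(a, (~b) & 0xF, carry_in=1)
```

For example, `cla_add4(9, 8)` returns a 4-bit sum of 1 with a carry out of 1, which together represent 17.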
Here, the C terms represent the carry bits passed to the next bit position in the CLA equations. Initially, the carry bit C_1 = 0 is applied, together with the generate (g) and propagate (P) bits, to the F_1 block (in the cascaded FA chain) to produce the next carry bit using the CLA equation. The other FA blocks then follow their respective CLA equations to produce the carry values for the subsequent bit positions, until the final carry bit is obtained. By combining these separately computed carry bits using normal addition, a faster adder block is obtained that delivers the sum values as the final output. Second, the 2's complement method was adopted to perform subtraction using the same basic building block of the faster 4-bit adder. To subtract two numbers A and B (that is, A − B), the 2's complement of B is first formed and then added to A to obtain the fast subtraction result. Third, the speed of multiplication was computed using the following equation of execution time. Three basic items were considered, including the matrix of summands and the number of reduction stages. In the faster multiplier block, the generated matrix of summands holds the partial product terms and is followed by several reduction stages using 2-2 and 3-2 adders and so on. A fast multiplier is built by creating a chain of adders that take the multiplicand as one input (controlled by the appropriate bit of the multiplier) and the output of the preceding stage as the other input.
In the final stage (that is, 2 rows), a fast adder is used to generate the result. The number of reduction stages depends on the length of the multiplier. As the block diagram shows, in order to reduce the manipulation time, the summands were reduced to 2 rows, and a faster adder was used to complete the process and deliver a faster multiplication result. For a (4-bit × 4-bit) multiplication, the summands are reduced until only two rows remain. The algorithm development and implementation processes for both the conventional and the enhanced ALU blocks were carried out in VHDL, followed by the Quartus-II integrated synthesis (QIS) software to obtain the gate-level architecture. Based on the operating clock frequency, the VHDL timing report shows that the arithmetic components of the enhanced block are faster than those of the conventional block, as shown in Table 4.
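The reduction scheme described above can be sketched as a bit-level Python model. It is a simplified illustration under stated assumptions: only 3-2 adders (full adders) are used in the reduction stages, the 2-2 adders mentioned in the text are omitted for brevity, Python's built-in addition stands in for the fast adder of the final stage, and the function name is hypothetical.

```python
def multiply_by_reduction(a, b, width=4):
    """Multiply two unsigned width-bit numbers by reducing a matrix of summands.

    Returns (product, number of reduction stages).
    """
    # Matrix of summands: columns[w] holds all partial-product bits
    # of weight 2**w.
    columns = [[] for _ in range(2 * width)]
    for i in range(width):
        for j in range(width):
            columns[i + j].append(((a >> i) & 1) & ((b >> j) & 1))

    stages = 0
    # Reduce until every column holds at most two bits (two rows remain).
    while any(len(col) > 2 for col in columns):
        reduced = [[] for _ in range(len(columns) + 1)]
        for w, col in enumerate(columns):
            bits = list(col)
            while len(bits) >= 3:
                x, y, z = bits.pop(), bits.pop(), bits.pop()
                reduced[w].append(x ^ y ^ z)                        # 3-2 sum
                reduced[w + 1].append((x & y) | (x & z) | (y & z))  # 3-2 carry
            reduced[w].extend(bits)   # leftover bits pass through unchanged
        columns = reduced
        stages += 1

    # Final stage: add the two remaining rows (stand-in for the fast adder).
    row0 = sum(col[0] << w for w, col in enumerate(columns) if len(col) > 0)
    row1 = sum(col[1] << w for w, col in enumerate(columns) if len(col) > 1)
    return row0 + row1, stages
```

Each 3-2 adder replaces three bits of one weight with a sum bit of the same weight and a carry bit of the next weight, so the value represented by the matrix is unchanged while its height shrinks. The number of stages depends only on the operand width, not on the operand values.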
FAST 2-D DISCRETE COSINE TRANSFORM (DCT) CHIP WITH ENHANCED ALU BLOCK USING QIS
In this section, the logic synthesis of the combined 2-D FDCT/IDCT blocks, incorporating the enhanced ALU block, is carried out to design the complete fast 2-D DCT processor. The structural VHDL code of the fast 2-D DCT processor is generated by the logic synthesis process. The technology library used in the synthesis process is the TSMC 90-nm digital library. We synthesized the fast 2-D DCT algorithm together with the enhanced ALU block using the QIS software, targeting an Altera FPGA device (EP2S60: EP2S60F1020C4) whose memory size and I/O pins are compatible with the design. The final step is the technology mapping, which maps the design onto a specific architecture. The QIS software generated a register transfer level (RTL) architecture of 44 sheets and a technology architecture of 2 sheets for the fast 2-D DCT processor. As an example, 1 sheet of the fast 2-D DCT chip from the technology-mapping view is shown in Figure 2. The performance evaluation parameters generated by the QIS software (for example, through the chip planner) include a power dissipation of 638.84 mW and a usage of 128 adaptive logic modules (ALMs). Note that each logic array block (LAB) in the Stratix II FPGA consists of eight ALMs. Thus, 128/8 = 16 LABs are needed in total to describe the logic cells of the fast 2-D DCT chip.
FIELD PROGRAMMABLE GATE ARRAY (FPGA) IMPLEMENTATION
To verify the functionality of the 2-D DCT chip, the SRAM Object File (.sof) generated by analysis and compilation of the developed 2-D DCT algorithm was successfully downloaded into the FPGA board. The classic timing analyzer tool reports an operating frequency of 140 MHz for the fast 2-D DCT chip, as shown in Figure 3. The FPGA implementation results from our experimental work are compared with those reported in the references (Jridi and Alfalou, 2010; Fu et al., 2004; Danian et al., 2004). The highest operating frequency reported in these works was 125 MHz. Thus, we conclude that the 140 MHz operating frequency of our developed 2-D DCT chip is higher than the previously reported results.
CONCLUSION
This paper presented the successful development of the encoder and decoder parts of the 2-D DCT algorithm in Matlab sessions. To implement a fast 2-D DCT chip, the CAD design flow of the system behavior was described. The behavior of the 2-D DCT algorithm was simulated in VHDL. The developed coding parts of the 2-D DCT algorithm were integrated with the enhanced ALU block to design the fast 2-D DCT chip. A VLSI implementation of the 2-D DCT algorithm for a higher image compression ratio using an Altera Stratix-II FPGA board was also presented. We thus propose our developed fast 2-D DCT processor, operating at 140 MHz, for a future hardware ASIC-based implementation. The proposed ASIC chip can be used in digital electronic products.
