The discrete cosine transform (DCT) is the key step in many image and video coding standards. The 8-point DCT is an important special case, for which several low-complexity approximations have been widely investigated. However, the 16-point DCT has energy compaction advantages. In this sense, this paper presents a new 16-point DCT approximation with null multiplicative complexity. The proposed transform matrix is orthogonal and contains only zeros and ones. The proposed transform outperforms the well-known Walsh-Hadamard transform and the current state-of-the-art 16-point approximation. A fast algorithm for the proposed transform is also introduced. This fast algorithm is experimentally validated using hardware implementations that are physically realized and verified on a 40 nm CMOS Xilinx Virtex-6 XC6VLX240T FPGA chip for a maximum clock rate of 342 MHz. Rapid prototypes on FPGA for 8-bit input word size show a significant improvement in compressed image quality of up to 1-2 dB at the cost of only eight adders compared to the state-of-the-art 16-point DCT approximation algorithm in the literature [S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy. A novel transform for image compression.
Introduction
The discrete cosine transform (DCT) [1, 12, 38] is a pivotal tool in digital signal processing, whose popularity is mainly due to its good energy compaction properties. In fact, the DCT is a robust approximation for the optimal Karhunen-Loève transform when first-order Markov signals, such as images, are considered [12, 30, 38].
Indeed, the DCT has found application in several image and video coding schemes [5, 12], such as JPEG [36], MPEG-1 [39], MPEG-2 [22], H.261 [23], H.263 [24], and H.264 [32, 46, 51].
Over the decades, the signal processing literature has been populated with efficient methods for the DCT computation, collectively known as fast algorithms. This can be observed in several works with efficient hardware and software implementations, including [2, 3, 13, 14, 17, 21, 31, 44]. Methods such as the Arai DCT algorithm [2] can greatly reduce the number of arithmetic operations required for the DCT evaluation.
Indeed, current algorithms for the exact DCT are mature and further complexity reductions are very difficult to achieve. Nevertheless, demands for real-time video processing and transmission are increasing [27, 42]. Therefore, complexity reductions for the DCT must be obtained using different methods.
One possibility is the development of approximate DCT algorithms. Approximate transforms aim at demanding very low complexity while offering a close estimate of the exact calculation. In general, the elements of approximate transform matrices are restricted to the set {0, ±1/2, ±1, ±2} [15]. This implies null multiplicative complexity; only addition and bit-shifting operations are usually required. While not computing the DCT exactly, such approximations can provide meaningful estimates at low complexity.
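To make the notion of null multiplicative complexity concrete, the following Python sketch evaluates one matrix row whose entries lie in {0, ±1/2, ±1, ±2} using only additions, subtractions, and bit shifts. The helper names `scale_by_entry` and `row_dot`, as well as the example row, are illustrative assumptions and are not taken from [15]. Integer (fixed-point) samples are assumed, so halving rounds toward minus infinity.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def scale_by_entry(x: int, entry) -> int:
    """Multiply integer x by an entry in {0, ±1/2, ±1, ±2} without a multiplier."""
    entry = Fraction(entry)
    magnitude = abs(entry)
    if magnitude == 0:
        return 0
    if magnitude == HALF:
        scaled = x >> 1          # arithmetic right shift, i.e. floor(x / 2)
    elif magnitude == 1:
        scaled = x               # plain wire, no operation at all
    elif magnitude == 2:
        scaled = x << 1          # left shift, i.e. 2 * x
    else:
        raise ValueError("entry must be in {0, ±1/2, ±1, ±2}")
    return scaled if entry > 0 else -scaled

def row_dot(row, samples):
    """Inner product of one matrix row with integer samples (adds and shifts only)."""
    acc = 0
    for entry, x in zip(row, samples):
        acc += scale_by_entry(x, entry)
    return acc

row = [1, HALF, 0, -2]           # illustrative row, not from any cited transform
samples = [10, 6, 7, 3]
result = row_dot(row, samples)   # 10 + (6 >> 1) + 0 - (3 << 1)
print(result)
```

In hardware, the shifts reduce to fixed wiring, which is why only the adders contribute to the arithmetic cost.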
In particular, 8-point DCT approximations have been attracting the attention of the signal processing community. This particular blocklength is widely adopted in several image and video coding standards, such as JPEG and the MPEG family [5, 30, 36]. Prominent 8-point DCT approximations include the signed discrete cosine transform [20], the level 1 approximation by Lengwehasatit-Ortega [29], the Bouguezel-Ahmad-Swamy (BAS) series of algorithms [8, 9, 10, 11], and the DCT round-off approximations [4, 15]. However, transforms with blocklength greater than eight have several advantages, such as better energy compaction and reduced quantization error [16].
In [16], an adapted version of Chen's 16-point fast DCT algorithm [13, 38] is suggested for video encoding. Chen's algorithm requires the multiplicative constants cos(kπ/32), k = 1, 2, . . . , 15, which can be approximated by fixed-precision quantities [16, Sec. 5]. Indeed, dyadic rationals were employed [12], resulting in a non-orthogonal transform [16, Sec. 5]. The International Telecommunication Union fosters image blocks of 16×16 pixels [47] instead of the 4×4 and 8×8 pixel blocks required by the H.264/MPEG-4 AVC standard for video compression [33]. The main reason for such a recommendation is the improved coding gain [28]. It is clear that for such large transform blocklengths, minimizing the computational complexity becomes a central issue [16].
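As an illustration of what a fixed-precision approximation of these constants looks like, the sketch below rounds cos(kπ/32) to the nearest dyadic rational n/2^b. The bit width b = 8 is a hypothetical choice made only for this example; the precision actually adopted in [16] is not stated in this section.

```python
import math
from fractions import Fraction

def dyadic_approx(value: float, bits: int) -> Fraction:
    """Nearest rational of the form n / 2**bits (a dyadic rational)."""
    return Fraction(round(value * (1 << bits)), 1 << bits)

BITS = 8  # hypothetical word length, chosen only for illustration

# Chen's multiplicative constants cos(k*pi/32), k = 1, ..., 15.
constants = {k: math.cos(k * math.pi / 32) for k in range(1, 16)}
approximations = {k: dyadic_approx(c, BITS) for k, c in constants.items()}

# Largest absolute deviation introduced by the rounding.
worst_error = max(abs(constants[k] - float(approximations[k])) for k in constants)

for k in (1, 8, 15):
    print(k, approximations[k], constants[k])
print("worst-case rounding error:", worst_error)
```

A multiplication by n/2^b can be carried out with shifts and additions only, but rounding each constant independently generally perturbs the orthogonality of the exact transform, which is consistent with the remark above.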
In this context, the main goal of this paper is to advance 16-point approximate DCT architectures. First, we introduce a new low-complexity 16-point DCT approximation. The proposed transform is designed to be orthogonal and to possess null multiplicative complexity. Second, we propose an efficient fast algorithm for the new transform. Third, we introduce hardware implementations for the proposed transform as well as for the 16-point DCT approximation method introduced by Bouguezel-Ahmad-Swamy (BAS-2010) in [10]. Both methods are demonstrated to be suitable for image compression.
The paper unfolds as follows. In Section 2, the proposed transform is introduced and mathematically analyzed. Error metrics are considered to assess its proximity to the exact DCT matrix. In Section 3, a fast algorithm for the proposed transform is derived and its computational complexity is compared with existing methods. An image compression simulation is described in Section 4, indicating the adequacy of the introduced transform. In Section 5, FPGA-based hardware implementations for both the proposed transform and the BAS-2010 approximation are detailed and analyzed. Conclusions and final remarks are given in Section 6.
16-point DCT Approximation
In this section, a new 16-point multiplication-free transform is presented. The proposed transform matrix T was obtained by judiciously replacing each floating-point entry of the 16-point DCT matrix with 0, 1, or −1.
Substitutions were computationally performed in such a way that: (i) the resulting matrix could satisfy the following orthogonality-like property:
$$T \cdot T^{\top} = \text{diagonal matrix},$$
(ii) DCT symmetries could be preserved, and (iii) the resulting matrix could offer good energy compaction properties [20]. Among the several possible outcomes, we isolated the following matrix:
The above matrix furnishes a DCT approximation given by
$$\hat{C} = D \cdot T,$$
where D is a diagonal scaling matrix specified through diag(·), which returns the block diagonal concatenation of its arguments.
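The normalization step can be sketched in a few lines of Python. Because the entries of T are not reproduced in this section, the code uses a small 4×4 stand-in matrix with entries in {−1, 0, 1} (an assumption for illustration only, not the proposed transform). It further assumes that the scaling matrix is D = (T·Tᵀ)^(−1/2), which is the diagonal choice with positive entries that turns a matrix with mutually orthogonal rows into an orthogonal one.

```python
import math

def gram(matrix):
    """Return matrix * matrix^T as a list of lists."""
    return [[sum(a * b for a, b in zip(r, s)) for s in matrix] for r in matrix]

def is_diagonal(matrix, tol=1e-12):
    """True if every off-diagonal entry is (numerically) zero."""
    return all(abs(v) <= tol
               for i, row in enumerate(matrix)
               for j, v in enumerate(row) if i != j)

def normalize_rows(matrix):
    """Scale each row to unit norm; requires mutually orthogonal rows."""
    g = gram(matrix)
    if not is_diagonal(g):
        raise ValueError("rows are not mutually orthogonal")
    scales = [1.0 / math.sqrt(g[i][i]) for i in range(len(matrix))]
    scaled = [[s * v for v in row] for s, row in zip(scales, matrix)]
    return scales, scaled

# Stand-in low-complexity matrix (hypothetical, NOT the 16-point matrix T).
T_DEMO = [
    [1,  1,  1,  1],
    [1,  0,  0, -1],
    [1, -1, -1,  1],
    [0,  1, -1,  0],
]

scales, C_hat = normalize_rows(T_DEMO)
print("diagonal of D:", scales)
print("orthogonal after scaling:", is_diagonal(gram(C_hat)))
```

Only the integer matrix needs to be applied to the data; the entries of `scales` are per-coefficient constants that can be absorbed elsewhere in a coding chain.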
The proposed transform Ĉ is orthogonal and requires no multiplications or bit-shifting operations. Only additions are required for the computation of the proposed DCT approximation. Moreover, the scaling matrix D may not introduce any additional computational overhead in the context of image compression. In fact, the scalar multiplications of D can be merged into the quantization step [8, 9, 11, 15, 29]. Therefore, in this sense, the approximation Ĉ has the same low computational complexity as T.
For comparison, the WHT is selected for its simplicity of implementation [19, p. 472]. The BAS-2010 method is considered since it is the most recent DCT approximation method for 16-point data.
A classical reference in this field is the signed DCT (SDCT) [20], which became a standard for comparison when considering 8-point DCT approximations. However, for 16-point data, the signed DCT is not orthogonal and its inverse transformation requires several additions and multiplications [10]. Thus, we could not consider the SDCT for any meaningful comparison.
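The lack of orthogonality of the 16-point signed DCT can be checked numerically. The sketch below assumes the usual construction of the SDCT, in which each entry of the exact DCT matrix is replaced by its sign, and takes the orthonormal DCT-II as the exact matrix. Function names are illustrative.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix of order n."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows

def sign(x):
    """Signum function returning -1, 0, or 1."""
    return (x > 0) - (x < 0)

def signed_dct(n):
    """Entry-wise sign of the DCT matrix (assumed SDCT construction)."""
    return [[sign(v) for v in row] for row in dct_matrix(n)]

def gram(matrix):
    """Return matrix * matrix^T."""
    return [[sum(a * b for a, b in zip(r, s)) for s in matrix] for r in matrix]

S16 = signed_dct(16)
G = gram(S16)

# Pairs of distinct rows with a nonzero inner product break orthogonality.
off_diagonal = [(i, j, G[i][j])
                for i in range(16) for j in range(i + 1, 16) if G[i][j] != 0]
print("number of non-orthogonal row pairs:", len(off_diagonal))
print("inner product of rows 1 and 3:", G[1][3])
```

Since the Gram matrix is not diagonal, no diagonal rescaling can make this matrix orthogonal, and its inverse is not simply a scaled transpose.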
According to the methodology employed in [20] and supported by [15], we can assess how adequate the proposed approximation is. For such analysis, each row of a 16×16 approximation matrix A can be interpreted as the coefficients of a FIR filter. Therefore, the following filters are defined:
$$h_m[n] = a_{m,n}, \quad m, n = 0, 1, \ldots, 15,$$
where a_{m,n} is the (m + 1, n + 1)-th entry of A.
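Thus, the transfer functions associated to h_m[n], m = 0, 1, . . . , 15, can be computed by the discrete-time Fourier transform:
$$H_m(\omega; A) = \sum_{n=0}^{15} h_m[n] \, e^{-j n \omega}, \quad m = 0, 1, \ldots, 15,$$
where j = √−1 and ω is the angular frequency.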
Spectral data H m (ω; A) can be employed to define a figure of merit for assessing DCT approximations.
Indeed, we can measure the distance between H_m(ω; C) and H_m(ω; A), where C is the exact DCT matrix. We adopted the squared magnitude as a distance measure function. Thus, we obtained the following mathematical expression:
$$D_m(\omega; A) = \left| H_m(\omega; C) - H_m(\omega; A) \right|^2, \quad m = 0, 1, \ldots, 15.$$
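The spectral comparison can be sketched numerically as follows. The code treats each matrix row as a FIR filter, evaluates its transfer function, and integrates the squared magnitude of the difference over [0, π]. Two points are assumptions of this sketch rather than statements from the text: the total error energy is taken as the sum of these integrals over all rows, and the approximation used is a stand-in (the sign of the DCT matrix scaled by 1/4), since the proposed matrix is not reproduced here.

```python
import cmath
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix of order n."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows

def transfer_function(row, omega):
    """H(omega) = sum_n h[n] * exp(-j * n * omega) for one matrix row."""
    return sum(h * cmath.exp(-1j * n * omega) for n, h in enumerate(row))

def row_error_energy(exact_row, approx_row, points=64):
    """Midpoint-rule integral over [0, pi] of |H(w; C) - H(w; A)|^2."""
    step = math.pi / points
    total = 0.0
    for i in range(points):
        omega = (i + 0.5) * step
        diff = (transfer_function(exact_row, omega)
                - transfer_function(approx_row, omega))
        total += abs(diff) ** 2
    return total * step

def total_error_energy(exact, approx, points=64):
    """Sum of the per-row error energies (assumed definition)."""
    return sum(row_error_energy(c, a, points) for c, a in zip(exact, approx))

C16 = dct_matrix(16)
# Stand-in approximation: sign of each DCT entry, rows scaled to unit norm.
A16 = [[math.copysign(0.25, v) for v in row] for row in C16]

energy = total_error_energy(C16, A16)
frobenius = sum((c - a) ** 2 for rc, ra in zip(C16, A16) for c, a in zip(rc, ra))
print("total error energy:", energy)
print("pi * squared Frobenius distance:", math.pi * frobenius)
```

For real-valued rows, Parseval's relation implies that this integral equals π times the squared Frobenius distance between the two matrices, which gives a convenient cross-check of the numerical integration.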
Fast Algorithm
As defined in (1), the transformation matrix T requires 208 additions, which is a significant number of operations. In the following, we present a factorization of T obtained by means of butterfly-based methods in a decimation-in-frequency structure [7]. For notational purposes, we denote I_n as the identity matrix of order n, Ī_n as the opposite diagonal identity matrix of order n, and the butterfly matrix as
We maintain that T can be decomposed into less complex matrix terms according to the following factorization:
where the required matrices are described below:
and matrix P is a permutation matrix given by
where e_j is a 16-point column vector with one in position j and zero elsewhere.
Matrix E corresponds to the even-odd part, whereas matrix O is linked to the odd part of the proposed transformation [6, p. 71]. A row-permuted version of matrix E was already reported in the literature in the derivation of the 8-point DCT approximation described in [15, Fig. 1].
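The role of the first butterfly stage in separating the two parts can be sketched as follows. The code assumes the usual decimation-in-frequency butterfly, which forms sums and differences of mirrored input samples. The exact sign and ordering conventions of the butterfly matrix used in the factorization are not reproduced in this section, so the layout below is an assumption.

```python
def butterfly(x):
    """One butterfly stage: n/2 mirrored sums, then n/2 mirrored differences."""
    n = len(x)
    half = n // 2
    sums = [x[i] + x[n - 1 - i] for i in range(half)]
    diffs = [x[i] - x[n - 1 - i] for i in range(half)]
    return sums + diffs

def dot(row, x):
    """Plain inner product, used here only for cross-checking."""
    return sum(r * v for r, v in zip(row, x))

x = list(range(1, 17))          # arbitrary 16-point test input
y = butterfly(x)                # costs 16 additions in total
even_input, odd_input = y[:8], y[8:]

# A row with even symmetry only needs the sums ...
even_row = [1] * 16
# ... and a row with odd symmetry only needs the differences.
odd_row = [1] * 8 + [-1] * 8

print(dot(even_row, x), sum(even_input))
print(dot(odd_row, x), sum(odd_input))
```

After this stage, the symmetric rows of a DCT-like matrix operate on the eight sums and the antisymmetric rows on the eight differences, so two 8×8 blocks replace one 16×16 product.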
On the other hand, matrix O does not seem to have been reported. Without any further consideration, matrix O requires 48 additions. The locations of the zero elements in (2) are such that a decimation-in-frequency operation by means of a butterfly structure is prevented. In order to obtain the required symmetry, we propose the following manipulation:
where 
The resulting matrix O can be factorized according to:
where ⊗ denotes the Kronecker product. The additive complexity of matrix O is 20 additions.
The above mathematical description can be represented as a flow diagram, which is useful for subsequent hardware implementation. Fig. 2(a) depicts the general structure of the proposed fast algorithm. Block A and Block B represent the operations associated to matrices E and O, respectively. The structure of Block A is disclosed in Fig. 2(b). Fig. 3(a) details the inner structure of Block B as described in (3). Fig. 3(b) exhibits Block C according to (4). Arithmetic complexity comparisons with selected 16-point transforms are summarized in Table 2.
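The difference between direct and fast evaluation can be illustrated with the WHT, one of the transforms used for comparison. In a direct matrix-vector product with entries in {−1, 0, 1}, a row with q nonzero entries costs q − 1 additions. The sketch below applies this count to a 16-point Hadamard matrix generated by the Sylvester construction (an assumption about row ordering, which does not affect the count) and compares it with a butterfly-based fast WHT that tallies its own additions.

```python
def direct_additions(matrix):
    """Additions needed by a direct product with a {-1, 0, 1} matrix."""
    return sum(max(sum(1 for v in row if v != 0) - 1, 0) for row in matrix)

def hadamard(n):
    """Sylvester-type Hadamard matrix; n must be a power of two."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def fast_wht(x):
    """Butterfly-based WHT; returns (transform, number of additions)."""
    y = list(x)
    n = len(y)
    additions = 0
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
                additions += 2
        h *= 2
    return y, additions

H16 = hadamard(16)
x = list(range(16))

direct_result = [sum(r * v for r, v in zip(row, x)) for row in H16]
fast_result, fast_count = fast_wht(x)

print("direct additions:", direct_additions(H16))
print("fast additions:", fast_count)
print("same output:", direct_result == fast_result)
```

If the figure of 208 additions quoted for T was obtained with the same row-by-row count, it would correspond to 224 nonzero entries in the 16×16 matrix.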
Application to Image Compression
This section presents the application of the proposed transform to image compression. We produce evidence that it outperforms the other transforms in consideration. For this analysis, we used the methodology described in [20], supported in [8, 9, 10, 11], and extended in [15].
A set of 45 512×512 8-bit greyscale images obtained from a standard public image bank [48] was considered.
We adapted the JPEG compression technique [36] for the 16×16 matrix case. Each image was divided into 16×16 sub-blocks, which were submitted to the two-dimensional (2-D) transform procedure associated to the DCT matrix, the BAS-2010 [10] matrix, the WHT [18] matrix, and the proposed matrix Ĉ. A 16×16 image block K has its 2-D transform mathematically expressed by [45]:
$$B = A \cdot K \cdot A^{\top},$$
where A is the considered transformation matrix.
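A minimal sketch of the separable 2-D transform and its inverse is given below, assuming the form B = A·K·Aᵀ with an orthogonal A, so that K = Aᵀ·B·A. To keep the example short, a 4×4 orthonormal Hadamard matrix stands in for the 16×16 transforms, and the pixel block is made-up data.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    """Matrix transpose."""
    return [list(col) for col in zip(*m)]

def forward_2d(block, a):
    """2-D transform: rows and columns are both transformed by a."""
    return matmul(matmul(a, block), transpose(a))

def inverse_2d(coeffs, a):
    """Inverse 2-D transform, valid when a is orthogonal."""
    return matmul(matmul(transpose(a), coeffs), a)

# Stand-in orthogonal transform (4-point Hadamard scaled by 1/2).
A4 = [[0.5 * v for v in row] for row in [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]]

# Made-up 4x4 block of 8-bit pixel values.
K = [
    [52, 55, 61, 66],
    [70, 61, 64, 73],
    [63, 59, 55, 90],
    [67, 61, 68, 104],
]

B = forward_2d(K, A4)
K_rec = inverse_2d(B, A4)
print("DC coefficient:", B[0][0])
print("max reconstruction error:",
      max(abs(K[i][j] - K_rec[i][j]) for i in range(4) for j in range(4)))
```

When the transform is written as a scaled low-complexity matrix, the scaling factors appear on both sides of the product and can be folded into the quantization step, as noted earlier for the matrix D.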
This computation furnished 256 approximate transform-domain coefficients for each sub-block. A hard thresholding step was applied, where only the r initial coefficients were retained, with the remaining ones set to zero. Coefficients were ordered according to the usual zig-zag scheme extended to 16×16 image blocks [35].
We adopted r ∈ {2, 4, . . . , 254, 256}. The inverse procedure was then applied to reconstruct the processed data and image quality was assessed.
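The thresholding step can be sketched as follows. The code assumes the JPEG-style zig-zag scan, which starts at the DC coefficient and first moves to the right; whether [35] uses exactly this direction for 16×16 blocks is an assumption. A 4×4 block of made-up values is used to keep the output readable.

```python
def zigzag_indices(n):
    """(row, column) pairs of an n x n block in JPEG-style zig-zag order."""
    order = []
    for s in range(2 * n - 1):                 # s = row + column (anti-diagonal)
        lo, hi = max(0, s - n + 1), min(s, n - 1)
        rows = range(lo, hi + 1)
        if s % 2 == 0:
            rows = reversed(rows)              # even diagonals run upwards
        order.extend((i, s - i) for i in rows)
    return order

def hard_threshold(coeffs, r):
    """Keep the first r coefficients in zig-zag order; zero the others."""
    n = len(coeffs)
    kept = [[0] * n for _ in range(n)]
    for i, j in zigzag_indices(n)[:r]:
        kept[i][j] = coeffs[i][j]
    return kept

demo = [[4 * i + j + 1 for j in range(4)] for i in range(4)]
retained = hard_threshold(demo, 3)
for row in retained:
    print(row)

# Fraction of coefficients discarded for a 16x16 block and a given r.
r = 40
print("discarded fraction for r = 40:", 1 - r / 256)
```

For a 16×16 block the same functions apply with n = 16, giving 256 scan positions from which the first r are retained.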
Image degradation was evaluated using three different quality measures: (i) the peak signal-to-noise ratio (PSNR), (ii) the mean square error (MSE), and (iii) the universal quality index (UQI) [49]. The PSNR and MSE were selected due to their wide application as figures of merit in image processing. The UQI is considered an improvement over PSNR and MSE as a tool for image quality assessment [49]. The UQI includes luminance, contrast, and structure characteristics in its definition. Another possible metric is the structural-similarity-based image quality assessment (SSIM) [50]. Being a variation of the UQI, SSIM results were not very different from the measurements offered by the UQI for the considered images. Indeed, whenever a difference was present, it was in the order of 10⁻⁴ only. Therefore, SSIM results are not presented here.
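The three figures of merit can be sketched in plain Python as follows. Two simplifications are assumptions of this sketch: the peak value is fixed at 255 for 8-bit images, and the UQI is evaluated over the whole image as a single window, whereas it is commonly computed over sliding windows and averaged. The small arrays are made-up data.

```python
import math

def flatten(image):
    """Turn a list of rows into a flat list of pixel values."""
    return [v for row in image for v in row]

def mse(original, processed):
    """Mean square error between two images of equal size."""
    x, y = flatten(original), flatten(processed)
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def psnr(original, processed, peak=255):
    """Peak signal-to-noise ratio in dB (infinite for identical images)."""
    error = mse(original, processed)
    if error == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / error)

def uqi_global(original, processed):
    """Universal quality index computed over one global window."""
    x, y = flatten(original), flatten(processed)
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return 4 * cov * mean_x * mean_y / ((var_x + var_y) * (mean_x ** 2 + mean_y ** 2))

original = [[52, 55], [61, 66]]      # made-up pixel values
processed = [[50, 56], [60, 70]]

print("MSE :", mse(original, processed))
print("PSNR:", psnr(original, processed))
print("UQI :", uqi_global(original, processed))
```

Because PSNR is a logarithmic function of the MSE, the two measures rank a set of reconstructions of the same image identically; the UQI adds information about structural agreement.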
Moreover, in contrast with the JPEG image compression simulations described in [8, 9, 10, 11], we considered the average measures from all images instead of the results derived from selected images. In fact, average calculations may furnish more robust results [26]. Fig. 4 shows the resulting quality measures. The proposed transform outperformed both the BAS-2010 transform and the WHT at all compression rates according to all considered quality measures. Fig. 4(b) shows that the proposed transform outperformed the BAS-2010 transform by ≈ 1 dB and the WHT by ≈ 8 dB, which corresponds to ≈ 26% and ≈ 630% gains, respectively. At the same time, Fig. 4(b) shows that the results of the proposed transform are at most 2 dB away from the DCT results at compression ratios above 85% (r < 40).
In order to convey a qualitative analysis, Figures 5 and 6 show two standard images compressed according to the considered transforms. The associated differences with respect to the original uncompressed images are also displayed. For better visualization, difference images were scaled by a factor of two. This procedure is routine and described in further detail in [40, p. 273]. The images compressed with the proposed transform are visually more similar to the images compressed with the DCT than the others. As expected, the WHT exhibits poor performance.
FPGA-based Hardware Implementation
In this section, the proposed DCT approximation and the BAS-2010 algorithm [10] were physically implemented on a field programmable gate array (FPGA) device. We employed the 40 nm CMOS Xilinx Virtex-6 XC6VLX240T FPGA for algorithm evaluation and comparison. Beforehand, it is expected that the proposed algorithm exhibits modestly higher hardware demands. This is due to the fact that it requires 72 additions, whereas the BAS-2010 algorithm demands 64 additions.
We assessed circuit performance using the following metrics: (i) the area (A), based on the quantity of required elementary programmable logic blocks (slices), the number of look-up tables (LUTs), and the flip-flop count; (ii) the speed, using the critical path delay (T); and (iii) the dynamic power consumption.
The number of occupied slices furnishes an estimate of the on-chip silicon real estate requirement, whereas LUTs and flip-flops are the main logic resources available in a slice. In Xilinx FPGAs, a LUT is employed as a combinational function generator that can implement a given Boolean function, and a flip-flop is utilized as a 1-bit register. The critical path delay corresponds to the delay associated with the longest combinational path and directly governs the operating frequency of the hardware. The total power consumption of the hardware design consists of static and dynamic components. Static power consumption in FPGAs is dominated by the leakage power of the logic fabric and the configuration static RAM. Thus, it is mostly design independent. On the other hand, the dynamic power consumption, which accounts for the dynamic power dissipation associated with clocks, logic blocks, and signals, provides a metric for the power efficiency of a given design [43]. The respective results are shown in Tables 3, 4, and 5, where the metrics corresponding to each design were measured for several choices of finite precision using input word length W ∈ {4, 8, 12, 16}.
From Table 3, it is observed that the proposed design consumes ≈ 10% more LUT resources than the design in [10]. For W = 8, the proposed design shows a ≈ 20% increase in the number of slices consumed (and 10% more LUTs) while gaining 1-2 dB of improvement in PSNR compared to [10]. The increase in area shown by the proposed design has led to an increase in the critical path delay and in the area-time (AT) and area-time-squared (AT²) metrics, and to a higher power consumption, as indicated in Tables 4 and 5. Of particular interest is the case of W = 8 input word size, where the proposed algorithm and hardware design show a 5.7% and 10% increase in the critical path delay and dynamic power consumption, respectively, when compared to the algorithm in [10].
In this paper, we define a metric consisting of the product of an error figure and the AT value: (error figure) × (area-time product). The considered error figure can be 1/PSNR, MSE, 1/UQI, or the total error energy, as given in Table 1. This metric aims at combining both the mathematical and the hardware aspects of the resulting implementation. The total error energy has the advantage of being image independent and is therefore adopted in the combined metric. Considering the proposed architecture and [10], the obtained values for the combined metric are shown in Table 6.
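The hardware figures of merit discussed in this section reduce to simple arithmetic, sketched below. All numbers in the example are placeholders for two hypothetical designs and are not the values reported in Tables 1, 3, 4, or 6; the units (slices for area, nanoseconds for delay) are likewise assumptions of the sketch.

```python
def max_clock_mhz(critical_path_ns: float) -> float:
    """Maximum operating frequency implied by the critical path delay."""
    return 1e3 / critical_path_ns

def area_time(area: float, critical_path_ns: float) -> float:
    """AT metric: area multiplied by the critical path delay."""
    return area * critical_path_ns

def area_time_squared(area: float, critical_path_ns: float) -> float:
    """AT^2 metric: area multiplied by the squared critical path delay."""
    return area * critical_path_ns ** 2

def combined_metric(error_figure: float, area: float, critical_path_ns: float) -> float:
    """(error figure) x (area-time product); lower is better."""
    return error_figure * area_time(area, critical_path_ns)

# Placeholder values for two hypothetical designs (NOT measured data).
designs = {
    "design_x": {"error": 10.0, "area": 400, "delay_ns": 3.0},
    "design_y": {"error": 25.0, "area": 350, "delay_ns": 2.8},
}

for name, d in designs.items():
    print(name,
          "f_max [MHz]:", round(max_clock_mhz(d["delay_ns"]), 1),
          "AT:", area_time(d["area"], d["delay_ns"]),
          "AT^2:", area_time_squared(d["area"], d["delay_ns"]),
          "combined:", combined_metric(d["error"], d["area"], d["delay_ns"]))
```

In this made-up case the larger and slower design still attains the lower combined value because its error figure is smaller, which is the kind of trade-off the combined metric is meant to expose.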
Although the proposed DCT approximation consumes more resources than [10] , a much better approximation for the exact DCT is achieved (see Table 1 ). This leads to superior compressed image quality (see Fig. 4 ). Indeed, the choice of algorithm is always a compromise between its mathematical properties, such as DCT proximity, energy error, and resulting image quality; and the related hardware aspects, such as area, speed, and power consumption. This implies our proposed algorithm is a better choice over [10] when picture quality is of higher importance.
Conclusion
This paper introduced a new 16-point DCT approximation. The proposed transform requires no multiplication or bit-shifting operations, is orthogonal, and its matrix elements are only {−1, 0, 1}. Using spectral analysis methods described in [15, 20], we demonstrated that the proposed transform outperforms the WHT and the BAS-2010 as an approximation for the 16-point DCT. The proposed transform was applied within a standard image compression scheme. The resulting images were assessed for quality by means of PSNR, MSE, and UQI. According to these metrics, the proposed transform outperformed the WHT and the BAS-2010 approximation at any compression ratio. We also derived an efficient fast algorithm for the proposed matrix, which requires 72 additions.
This algorithm was implemented in hardware and compared with a state-of-the-art 16-point DCT approximation [10]. FPGA-based rapid prototypes were designed, simulated, physically implemented, and tested for 4-, 8-, 12-, and 16-bit input data word sizes. A typical application having 8-bit input image data could be subject to 16-point DCT approximations at a real-time rate of 342 · 10⁶ transforms per second, for an FPGA clock frequency of 342 MHz, leading to a pixel rate of 5.488 · 10⁹ pixels/second. Both the proposed and the BAS-2010 algorithms were realized on FPGA and tested, and hardware metrics including area, power, critical path delay, and area-time complexity were measured. Additionally, an extensive investigation of relative performance in both subjective mode and objective picture quality metrics using average PSNR, average MSE, and average UQI was produced. The proposed DCT approximation algorithm improves on the state-of-the-art algorithm in [10] by 1-2 dB in PSNR at the cost of only eight extra adders.
Video coding using motion partitions larger than 8×8 pixels is investigated in [16, 47] with satisfactory application in the H.264/AVC standard for video compression. In this perspective, the proposed approximate transform is a candidate technique for image and video coding with block size equal to 16×16. This blocklength is of particular importance in the emerging H.265 reconfigurable video codec standard [25].
