Abstract-Due to the intensive use of discrete transforms in picture coding, the search for fast and power-efficient approaches to their hardware implementation has gained importance. The Discrete Tchebichef Transform (DTT) is derived from a discrete class of the Chebyshev orthogonal polynomials, and it is an alternative to the Discrete Cosine Transform (DCT) commonly used in picture coding. In this work, we propose a new approximation for the integer DTT with better quality and power efficiency, obtained by exploring truncation and pruning. The principal idea is to reduce the coefficient values to fractions, which enables truncation by shifts in the internal transform calculations and leads to lower values for the non-diagonal residues, thereby reducing non-orthogonality. We have also selectively pruned rows of the state-of-the-art approximate DTT matrix. The approximate DTT architectures were synthesized for ASIC in the Cadence RTL Compiler tool with the Nangate 45 nm standard-cell library, using a realistic power extraction methodology that considers real input vectors and delays. The synthesis results show that the proposed pruned approximate DTT hardwired solution increases the maximum frequency by about 10.78%, reduces cell area by 50.2%, and saves up to 55.9% of power dissipation, with a higher compression ratio and lower quality loss in the compressed image, when compared with state-of-the-art approximate DTT hardware designs.
I. INTRODUCTION
Nowadays, the reproduction of digital videos has become an important challenge when developing real-time and, mainly, energy-efficient VLSI solutions. The compression of digital images is a must to save data storage. The discrete transform most commonly used in picture coding is the Discrete Cosine Transform (DCT).
The DCT is the preferred choice for applications such as the JPEG standard for picture coding and video coding standards such as the state-of-the-art HEVC (High Efficiency Video Coding) standard [1]. Implementing discrete transforms is challenging due to the high computational effort of the calculations. Therefore, efficient implementations of this module are required for battery-supported systems operating in real time.
The upcoming electronic design automation challenges in the Dark Silicon era further support this claim [3]. Physical limits of cooling technology on mobile devices, allied with the increasing power density in the chip [4], [5], will make it impossible to have a fully operational device without compromising its reliability and durability. Intel's prediction indicates that, for an 8 nm technology node, only 20% of the chip will be able to operate simultaneously [6]. Some works suggest that even external memories will have to be partly powered off [7], and that multi-core scaling of general purpose processors (GPP) is also compromised by the aforementioned limitations [8]. Consequently, the performance of current software-based video encoders that rely on multi-threading, such as x264 [9], x265 [10] and xAVS [11], will be hindered. Therefore, designing low-power application-specific accelerators will contribute to reducing the Dark Silicon impact [6].
Recently, the Discrete Tchebichef Transform (DTT) has emerged as a lower-complexity discrete transform for picture coding, with characteristics close to those of the DCT [12]-[16]. This transform has a polynomial kernel whose characteristics, such as high energy compaction and decorrelation, make it comparable to the DCT, mainly when specific features of an image that profoundly influence the quality of the reconstructed image after decompression, such as its structure and content, are taken into account [14]. The approximate integer 8-point DTT was first proposed in [16].
Approximate DTT matrices for picture coding were proposed in [17], named CB-2015, and in [18], named CB-2017; their main purpose is to reduce the number of arithmetic operations while preserving the quality of the information. Since the matrices are simplified, the multiplierless hardware implementations are also simplified, using only shifts, adders and subtractors working in parallel. This approach therefore enables a reduction of power dissipation in the hardwired DTT when encoding an image, since the new approaches require a low computational effort compared with the native DCT.
The CB-2015 forward DTT matrix is non-orthogonal, hence its inverse differs from the transpose of the forward transform. The bottleneck of the CB-2015 approximation is the inverse matrix, in which all coefficients are non-zero. Since the values 3 and -3 are used in the matrix, its hardware implementation requires more additions. As the CB-2017 DTT matrix is a quasi-orthogonal approximation, its inverse matrix is approximated by the transpose of the forward matrix. The CB-2017 forward and inverse matrices are composed only of the values 2, -2, 1 and -1.
The main contributions of this work are: i) a new DTT approximation matrix that keeps a good image quality and presents both lower power dissipation and smaller circuit area when compared with the approximations proposed by CB-2015 [17] and CB-2017 [18]; ii) new approximations in which pruning is evaluated, for the first time, in the last three rows of the DTT approximation matrix, keeping an excellent tradeoff between image quality and energy efficiency when compared with the literature.
The rest of the paper is organized as follows: Section II presents the technical background on DTT realizations. Section III presents the proposed approximate DTT, followed by our novel pruning scheme. A quality evaluation is performed in Section IV, followed by the proposed approximate DTT architectures in Section V. The synthesis methodology and the experimental results are given in Section VI and, finally, Section VII concludes the paper.
II. BACKGROUND ON DISCRETE TCHEBICHEF TRANSFORM
Orthogonality is an important characteristic of discrete transforms. A matrix M is orthogonal if its transpose is its inverse; therefore, the product of the forward matrix with its transpose is the identity, $M \cdot M^T = I$. Since the CB-2017 DTT matrix in (1) is a quasi-orthogonal approximation, its inverse matrix is approximated by the transpose of the forward matrix, with a correction applied afterwards by a diagonal matrix D. In (1), B is the block of input pixels. The diagonal correction is handled in other steps, such as the quantization. Both CB-2015 [17] and CB-2017 [18] perform an entropy analysis in order to evaluate the compression.
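The orthogonality condition above can be checked numerically. The following is a minimal sketch, not taken from the paper: it builds the exact orthonormal 8-point DTT by orthonormalizing discrete polynomials of increasing degree over x = 0..7, verifies that M · M^T is the identity, and then shows how a diagonal correction normalizes a rounded integer matrix. The rounded matrix `T` is a hypothetical illustration and is NOT the CB-2017 matrix; all function names are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dtt_matrix(n=8):
    """Orthonormal n-point DTT: row k samples a degree-k polynomial at x = 0..n-1."""
    rows = [[1.0 / math.sqrt(n)] * n]
    for _ in range(1, n):
        # Raise the degree by multiplying the previous row by (x - centre) ...
        v = [(x - (n - 1) / 2) * rows[-1][x] for x in range(n)]
        # ... then remove the components along all lower-degree rows (Gram-Schmidt).
        for r in rows:
            proj = dot(v, r)
            v = [a - proj * b for a, b in zip(v, r)]
        norm = math.sqrt(dot(v, v))
        rows.append([a / norm for a in v])
    return rows

def gram(m):
    """Return M * M^T."""
    return [[dot(ri, rj) for rj in m] for ri in m]

def max_offdiag(g):
    n = len(g)
    return max(abs(g[i][j]) for i in range(n) for j in range(n) if i != j)

M = dtt_matrix()
G = gram(M)
print("exact DTT, max off-diagonal of M*M^T:", max_offdiag(G))

# Hypothetical integer approximation (NOT CB-2017): scale by 4 and round.
T = [[round(4 * a) for a in row] for row in M]
GT = gram(T)
# Diagonal correction D = diag(1 / sqrt(GT[i][i])) rescales each row to unit norm.
D = [1.0 / math.sqrt(GT[i][i]) for i in range(len(T))]
C = [[D[i] * GT[i][j] * D[j] for j in range(len(T))] for i in range(len(T))]
print("approximation, max off-diagonal of D*T*T^T*D:", max_offdiag(C))
```

After the correction, the diagonal of D · T · T^T · D is exactly one, and whatever remains off the diagonal is the non-orthogonality residue that a quasi-orthogonal approximation accepts.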
III. PROPOSED DTT APPROXIMATION
The proposed approximations presented in (2) are based on the state-of-the-art nearly-orthogonal approximate DTT CB-2017 [18]. First, the matrix elements of (1) are divided by 16, since this simplification reduces the bit-width of the adders, resulting in the proposed transposed matrix, named Proposed 1, i.e., Tp1 in (2).
Note that the divisions are performed in hardware simply as right shifts. The division requires a multiplication by 64 in the diagonal correction matrix, compared with one quarter of D, and therefore the proposed diagonal correction matrix is of the form 64 * diag(…).
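As a minimal sketch of the shift-based division mentioned above, the snippet below divides integer samples by 16 with an arithmetic right shift of 4 bits. The function name and the sample values are illustrative assumptions; the point is that the shift discards the fractional part, which is the truncation effect the proposed approximation relies on.

```python
def shift_div(value: int, shift: int = 4) -> int:
    """Divide by 2**shift with an arithmetic right shift (rounds toward minus infinity)."""
    return value >> shift

# Illustrative sample values only; print the shifted result next to the exact quotient.
for sample in [255, 100, 16, 15, -1, -17, -255]:
    print(sample, shift_div(sample), sample / 16)
```

Note that for negative values the shift rounds toward minus infinity rather than toward zero, so the truncation error depends on the sign and amplitude of the input sample.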
The approximate 8-point DTT CB-2017 [18] presented a metric to evaluate the non-orthogonality of the transforms. However, that analytical equation fails here, since it does not take the truncation effect into account, and the truncation differs for each input sample amplitude. The squared diagonal deviation from the identity matrix I was therefore evaluated according to (3). Figure 1 evaluates (3) as a function of both the quality factor and the input sample amplitude (k).
The fractional values of the proposed coefficients, considering integer calculations, enable truncation and lower non-diagonal residue values, which reduces the non-orthogonality. Figure 1 represents the distribution of the diagonal deviation residue of the discrete transform. As can be seen, the diagonal deviation residue is more spread out in the proposed transform (Fig. 1-b) than in CB-2017 (Fig. 1-a), as shown by the larger number of blue regions.
Another contribution of this work is the pruning of rows of the transform matrix. Rows 6, 7 and 8 of the matrix in (2) compute the high-frequency coefficients, which are less significant to the human visual system and tend to be removed by quantization. The proposed approach is named Proposed-Pruned DTT. Removing these rows enables a significant hardware reduction in the transform implementation. These modifications result in the matrix Tp2 presented in (4), whose diagonal correction matrix is again of the form 64 * diag(…).
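The effect of pruning rows 6, 7 and 8 can be sketched as follows. Since the entries of the proposed matrices are not reproduced here, the example uses the unnormalized integer 8-point DTT basis as a stand-in; it is NOT Tp1 or Tp2, and the helper names and the test block are assumptions. The sketch shows that the pruned 2-D transform yields only the 5×5 low-frequency corner of the full 8×8 output.

```python
# Unnormalised integer 8-point DTT basis (discrete Chebyshev polynomials),
# used only as a stand-in; NOT the approximate matrix of the paper.
T_FULL = [
    [ 1,   1,   1,   1,   1,   1,   1,  1],
    [-7,  -5,  -3,  -1,   1,   3,   5,  7],
    [ 7,   1,  -3,  -5,  -5,  -3,   1,  7],
    [-7,   5,   7,   3,  -3,  -7,  -5,  7],
    [ 7, -13,  -3,   9,   9,  -3, -13,  7],
    [-7,  23, -17, -15,  15,  17, -23,  7],
    [ 1,  -5,   9,  -5,  -5,   9,  -5,  1],
    [-1,   7, -21,  35, -35,  21,  -7,  1],
]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def transform_2d(t, block):
    """2-D transform T * B * T^T of a square input block."""
    return matmul(matmul(t, block), transpose(t))

# Pruning: keep rows 1..5 and drop rows 6, 7 and 8 (the highest frequencies).
T_PRUNED = T_FULL[:5]

# Arbitrary 8x8 test block of 8-bit samples (illustrative values only).
block = [[(3 * r + 5 * c) % 256 for c in range(8)] for r in range(8)]

full = transform_2d(T_FULL, block)      # 8 x 8 coefficients
pruned = transform_2d(T_PRUNED, block)  # 5 x 5 coefficients
print(len(full), "x", len(full[0]), "->", len(pruned), "x", len(pruned[0]))
```

With three of the eight rows removed, each 1-D stage computes five outputs instead of eight, and the 2-D result keeps 25 of the 64 coefficients, which is where the hardware savings come from.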
IV. IMAGE QUALITY EVALUATION
An image quality evaluation is presented in this section using both (a) objective metrics and (b) a subjective one.
A. Objective Image Quality Evaluation
Figure 2 presents the quality evaluation using the Structural Similarity Index (SSIM) [19]. SSIM is an objective metric that takes into account the distortion in the image brightness, contrast and structure [19]. The SSIM results (Fig. 2) show that: (a) under QF = 85, the Proposed DTT is better than all approximations in the quality evaluation, and (b) under QF = 75, the Proposed-Pruned DTT keeps the same quality as the state-of-the-art CB-2017. The SSIM results demonstrate that our proposal can maintain or even improve quality with respect to the state of the art, but the main contribution is the significant reduction in dissipated power, as will be shown later.
The SR-SIM [20] is an objective metric that takes into account the distortion in the image spectrum. SR-SIM is a better metric than the well-known SSIM and PSNR (Peak Signal-to-Noise Ratio) metrics, because it is much more correlated with human subjective assessment [20]. In the tests we used the same JPEG quality factor methodology used in CB-2017 [17], and 15 test pictures of the same repository were taken into account. Figure 3 shows that the Proposed and Proposed-Pruned DTT are better in the SR-SIM image quality metric for all quality factors when compared with the state-of-the-art CB-2017 DTT approximation. Our approximations are also better than CB-2015 at the operation point of interest (close to 50), for quality factors under 65. Note that, since there is no significant compression, there is no interest in using quality factors above 65.
Figure 4 shows the spectral entropy analysis. The transformed and quantized data are analyzed with the MATLAB entropy function [21]. The entropy is directly related to the data compressibility: the lower the spectral entropy, the higher the compression rate that an encoder can reach. Figure 4 demonstrates that both proposed DTT approximations stand out for reducing the spectral entropy when compared with the state-of-the-art CB-2015 and CB-2017.
As can be observed in Figures 2, 3 and 4, the proposed 8-point DTT approximations show better image quality and a better compression rate when compared with both CB-2015 and CB-2017.
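The link between entropy and compressibility used in the spectral entropy analysis can be illustrated with a short sketch. It computes the Shannon entropy, in bits per symbol, of a list of quantized coefficients. This is an assumption-laden stand-in for the MATLAB entropy function cited in the text, whose histogram binning may differ; the function name and the sample data are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Shannon entropy in bits per symbol: sum of p * log2(1 / p) over the histogram."""
    counts = Counter(symbols)
    total = len(symbols)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

# Hypothetical quantized blocks: one mostly zero, one with 64 distinct values.
sparse_block = [0] * 60 + [1, -1, 2, 5]
dense_block = list(range(64))

print("sparse:", shannon_entropy(sparse_block))
print("dense: ", shannon_entropy(dense_block))
```

A block in which quantization has zeroed most coefficients has a much lower entropy than a block with many distinct values, so an entropy coder needs fewer bits to represent it.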
It is important to emphasize that Proposed DTT means the proposed transform with the approximation of coefficients, and Proposed-Pruned DTT is the proposed transform with both the approximation of coefficients and the pruning of rows in the matrix.
B. Subjective Image Quality Evaluation
Figure 5 presents a subjective quality analysis comparing the uncompressed Lena picture (Fig. 5-a) with the pictures obtained after compression with an entropy rate around 0.4, which means that all the other pictures are subjected to approximately the same rate of compression. This rate is applied to the compressed images of Fig. 5, including: (d) CB-2015 DTT [17]; (e) CB-2017 DTT [18]; (f) Proposed DTT; and (g) Proposed-Pruned DTT. In order to obtain fair comparisons of subjective quality, the Proposed and Proposed-Pruned QP operation points were selected so that the entropy was less than 0.4. For this case, the CB-2015 approximation (Fig. 5-d) shows a loss of brightness and contrast. On the other hand, CB-2017 (Fig. 5-e) reduces both the brightness and contrast losses of CB-2015. We observed in the subjective tests that, at an entropy rate below 0.4, our two proposed solutions (Figs. 5-f,g) add less noise than the state-of-the-art CB-2015 and CB-2017 DTT approximations. Moreover, our solutions also improve on the brightness and contrast limitations presented by both the CB-2015 and CB-2017 solutions, and keep a better quality than the CB-2015 approach, mainly in regions with more details, such as the face, the eyes, and the area of Lena's hair.

V. PROPOSED APPROXIMATE DTT ARCHITECTURES
The Proposed and Proposed-Pruned 2-D DTT hardware architectures are based on the separability property of the DTT [2]. The separated approach uses two stages with the 1-D 8-point DTT and a transposition buffer to link the stages (Fig. 6). The main gains enabled by the Proposed architecture are related to the fractional coefficients of its base matrix (2), which enable the insertion of internal right shifts. The truncation enabled by the right shifts allows: a) a reduction in the non-orthogonality deviation error, as demonstrated in Section III (Fig. 1); b) a reduction in the bit-widths in both the data path and the transposition buffer. The 1-D Proposed DTT is shown in Fig. 7-a.
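The separable row-column organization can be sketched in software as two passes of the same 1-D transform linked by a transposition. This is only a functional model under stated assumptions, not the authors' hardware description: the matrix `T` below is an arbitrary non-symmetric integer matrix (not the proposed one), and the function names are hypothetical. Each call to `dtt_1d` models one line of eight samples entering a 1-D stage.

```python
def dtt_1d(t, line):
    """One 1-D stage: multiply the transform matrix by a line of samples."""
    return [sum(c * s for c, s in zip(row, line)) for row in t]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dtt_2d_separable(t, block):
    stage1 = [dtt_1d(t, line) for line in block]     # B * T^T
    buffered = transpose(stage1)                     # transposition buffer: T * B^T
    stage2 = [dtt_1d(t, line) for line in buffered]  # T * B^T * T^T
    return transpose(stage2)                         # T * B * T^T

# Arbitrary non-symmetric stand-in matrix and input block (illustrative only).
T = [[((i + 1) * (2 * j + 1)) % 7 - 3 for j in range(8)] for i in range(8)]
B = [[(3 * r + 5 * c) % 256 for c in range(8)] for r in range(8)]

separable = dtt_2d_separable(T, B)
direct = matmul(matmul(T, B), transpose(T))
print("separable result matches T * B * T^T:", separable == direct)
```

Using a non-symmetric stand-in matrix matters for the check, since a symmetric matrix would hide a misplaced transposition.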
VI. SYNTHESIS RESULTS AND DISCUSSIONS
The 8-point approximate 2-D DTT and the HEVC 2-D DCT architectures were described in Verilog HDL (Hardware Description Language) with the same level of parallelism, processing eight coefficients per cycle. The syntheses were performed for ASIC in the Cadence RTL Compiler tool [22], using the 45 nm Nangate standard-cell library [23].
Table I shows the area results. The Proposed approximation shows a 25% area reduction when compared with CB-2017. This reduction increases to 51.6% when compared with the 8-point DCT of the HEVC standard [1]. The Proposed-Pruned version, in turn, shows 67.9% area savings when compared with the 8-point 2-D DCT of HEVC, and 50.2% when compared with the state-of-the-art DTT CB-2017 [18].
Table II shows the critical path timing and maximum frequency results. The Proposed approximation is 11.9% faster than the state-of-the-art 8-point DTT CB-2017 [18], and 69.9% faster than the 8-point DCT of the HEVC standard [1]. The gains demonstrated by our solutions are due to the bit-width reduction enabled by the internal right shifts. The Proposed-Pruned version keeps the maximum frequency gains of the Proposed version, because it keeps the same number of adders in the critical path.
A. Power Extraction Methodology and Results
Since signal transitions cause dynamic power dissipation in digital integrated circuits, a realistic power estimation methodology requires that the synthesis tool be given the delay information of gates and interconnections, as well as input data vectors [24]. Figure 8 shows the methodology for synthesis and power estimation.
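The dependence of dynamic power on signal transitions can be summarized with the usual first-order CMOS model, given here only as background and not taken from the paper. The symbols are assumptions introduced for this illustration: $\alpha$ is the switching activity factor, $C_L$ the switched load capacitance, $V_{DD}$ the supply voltage and $f$ the clock frequency; a constant factor may appear depending on how $\alpha$ is defined.

```latex
P_{\mathrm{dyn}} \approx \alpha \, C_L \, V_{DD}^{2} \, f
```

In this model, real input vectors and delay information determine $\alpha$ (including glitches), while the interconnection estimates determine $C_L$, which is why both are needed for a realistic power figure.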
The stimulus file formats supported by the commercial synthesis tool are Value Change Dump (VCD), Toggle Count Format (TCF) and Switching Activity Interchange Format (SAIF), all of which can be generated by running a testbench in the simulation tool. There are two ways in which these files can be generated: 1) from the simulation of the Register Transfer Level (RTL) description, i.e., in the Verilog or VHDL languages; 2) from the gate-level netlist simulation, which uses the Verilog description generated by the first run of the logic synthesis tool [25]. The second option is preferred, since it is more realistic: the simulation model is the same as the final circuit.
To enable a realistic power estimation that includes temporal glitches, the delays of all transitions must be added using another output file. This Standard Delay Format (SDF) file is generated after the first preliminary synthesis, and it is essential for obtaining realistic power dissipation results.
The Cadence logic synthesis tool [22] also offers an interconnection estimation mode called Physically-aware Layout Estimation (PLE), which estimates the length of the nets and takes into account the load capacitance effects in the power dissipation by considering a relatively pessimistic layout estimation. The PLE mode also requires the inclusion of Library Exchange Format (LEF) files, which contain the physical layout information of the library. The LEF macro includes the internal capacitance of the library cells, and the tech LEF includes the process metal capacitance used for the interconnection capacitance estimation [26]. The capacitance table file (CapTable) contains the same information as the LEF, but in a more realistic and fine-grained way, by considering process variations [26].
The simulation for power dissipation extraction, which generates the .VCD stimuli file considering the .SDF delay files, uses a testbench composed of MATLAB and Cadence IES, with the netlist simulation and real input vectors from four test images: Lena, boat, airplane and baboon. Each image contains 512×512 luminance pixels, and therefore the test contains 32768 lines of eight pixel samples.
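The number of input lines quoted above follows from 512 × 512 / 8 = 32768. The sketch below reproduces this count by splitting an image into 8×8 blocks and emitting each block line by line. The traversal order and the synthetic image are assumptions made for illustration; the paper does not state how its test vectors were ordered.

```python
def block_lines(image, size=8):
    """Yield lines of `size` samples, block by block, in raster order of the blocks."""
    height, width = len(image), len(image[0])
    for top in range(0, height, size):
        for left in range(0, width, size):
            for r in range(size):
                yield image[top + r][left:left + size]

# Synthetic 512x512 stand-in for a luminance image (not one of the test pictures).
image = [[(r * 7 + c * 13) % 256 for c in range(512)] for r in range(512)]

lines = list(block_lines(image))
print("number of 8-sample lines:", len(lines))
```

Each yielded line corresponds to one set of eight samples presented to the 1-D transform stage in a single cycle.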
VII. CONCLUSIONS
This paper presented an efficient approximate 8-point DTT combining pruning and approximation of coefficients. A hardware synthesis method enabled the power results with vectors of real images as input. The main results showed that the suitable approximation in the coefficients and the appropriate pruning in the lines of the matrix, enabled by our 8-point DTT solution, resulted in clock frequency increase of about 10.78%, cells area reduction of up to 50.2%, and power savings of up to 55.9%, when compared with state-of-the-art CB-2017 DTT. Furthermore, our approaches improve compression ratio with less quality degradation in the compressed image, evaluated in both objective and subjective quality metrics, when compared with state-of-the-art approximate 8-point DTT algorithms and hardware designs.
