Two multiplierless algorithms are proposed for 4×4 approximate-DCT for transform coding in digital video.
Introduction
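Video and multimedia processing based on signal and image compression, such as the high efficiency video coding (HEVC) and H.265 reconfigurable video codecs, requires 2-D transform block coding for block sizes $N \times N$, where $N \in \{4, 8, 16, 32, 64\}$ [1]. The transform coding stage requires algorithms for the $N$-point discrete cosine transform (DCT) of types II and IV. The associated transformation matrices are defined, respectively, according to [2]:
\[
[\mathbf{C}_{\mathrm{II}}]_{m,n} = \sqrt{\frac{2}{N}}\, \alpha_m \cos\!\left( \frac{(m-1)(2n-1)\pi}{2N} \right), \qquad
[\mathbf{C}_{\mathrm{IV}}]_{m,n} = \sqrt{\frac{2}{N}} \cos\!\left( \frac{(2m-1)(2n-1)\pi}{4N} \right),
\]
where $m, n = 1, 2, \ldots, N$, $\alpha_1 = 1/\sqrt{2}$, and $\alpha_m = 1$ for $m > 1$.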
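As a numerical illustration, the exact 4-point matrices can be generated directly from the index convention and the factor $\alpha_m$ stated above. The sketch below assumes the standard orthonormal forms of the DCT-II and DCT-IV; the names `dct_matrix` and `gram` are hypothetical helpers introduced only for this example.

```python
import math

def dct_matrix(kind, N=4):
    """Return the N x N DCT matrix of type 'II' or 'IV' as a list of rows.

    Indices m, n run from 1 to N; alpha_1 = 1/sqrt(2), alpha_m = 1 otherwise.
    Assumes the standard orthonormal definitions.
    """
    rows = []
    for m in range(1, N + 1):
        alpha = 1 / math.sqrt(2) if m == 1 else 1.0
        row = []
        for n in range(1, N + 1):
            if kind == "II":
                val = alpha * math.cos((m - 1) * (2 * n - 1) * math.pi / (2 * N))
            elif kind == "IV":
                val = math.cos((2 * m - 1) * (2 * n - 1) * math.pi / (4 * N))
            else:
                raise ValueError("kind must be 'II' or 'IV'")
            row.append(math.sqrt(2 / N) * val)
        rows.append(row)
    return rows

def gram(A):
    """Return A * A^T for a matrix given as a list of rows."""
    return [[sum(x * y for x, y in zip(r, s)) for s in A] for r in A]

C_II = dct_matrix("II")
C_IV = dct_matrix("IV")
for name, C in (("II", C_II), ("IV", C_IV)):
    G = gram(C)
    dev = max(abs(G[i][j] - (1.0 if i == j else 0.0))
              for i in range(4) for j in range(4))
    print(f"DCT-{name}: max |C C^T - I| = {dev:.2e}")
```

Both matrices should pass the orthogonality check to within floating-point rounding, which is the property the approximations are later required to recover through scaling.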
In this letter, our goal is to propose multiplication-free approximations for the 4-point DCT-II and DCT-IV, together with their fast algorithms. We also aim at VLSI realisations of both the 1-D and 2-D versions of the derived approximate transforms, while maintaining high numerical accuracy and low computational complexity.
Let $M_P(4)$ be the set of all 4×4 matrices whose entries are taken from $P = \{-1, 0, 1\}$. All matrices in this set represent multiplierless transformations. Our goal is to find matrices in $M_P(4)$ that satisfactorily approximate $\mathbf{C}_{\mathrm{II}}$ and $\mathbf{C}_{\mathrm{IV}}$.
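Therefore, we propose the following multivariate non-linear optimisation problem over $M_P(4)$:
\[
\mathbf{C}^{*}_{k} = \arg\min_{\mathbf{A} \in M_P(4)} \operatorname{error}(\mathbf{A}, \mathbf{C}_k), \qquad k \in \{\mathrm{II}, \mathrm{IV}\}, \tag{1}
\]
where $\mathbf{C}^{*}_{k}$ are the optimal matrices and $\operatorname{error}(\cdot, \cdot)$ is an error measure between a given candidate matrix and the exact matrices $\mathbf{C}_{\mathrm{II}}$ and $\mathbf{C}_{\mathrm{IV}}$.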
Let $h_i[n]$ be the discrete signal formed by the $i$th row of a given matrix $\mathbf{T}$, and let $H_i(\omega; \mathbf{T})$ denote the discrete-time Fourier transform (DTFT) of $h_i[n]$. As discussed in [3, 4], we adopt the total error energy as the error measure.
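This particular measure is defined as follows:
\[
\epsilon(\mathbf{A}, \mathbf{C}_k) = \sum_{i=1}^{4} \int_{0}^{\pi} \left| H_i(\omega; \mathbf{A}) - H_i(\omega; \mathbf{C}_k) \right|^2 \mathrm{d}\omega,
\]
for $k \in \{\mathrm{II}, \mathrm{IV}\}$. In other words, $\epsilon(\mathbf{A}, \mathbf{C}_k)$ quantifies the total energy of the error between $\mathbf{A}$ and $\mathbf{C}_k$ in the DTFT domain, when the entries of each matrix row are interpreted as filter coefficients [3, 4]. This quantity can be computed numerically by standard quadrature methods [5].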
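The quadrature computation mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the error energy is integrated over $[0, \pi]$, that the DTFT is $H_i(\omega) = \sum_n h_i[n] e^{-j\omega n}$, and it uses composite Simpson quadrature as one possible standard method. The function names are hypothetical. For real-valued rows, Parseval's relation gives the closed form $\pi$ times the squared Frobenius distance, which serves as a cross-check.

```python
import cmath
import math

def dtft(h, w):
    """DTFT of a finite real sequence h[n], n = 0..len(h)-1, at frequency w."""
    return sum(x * cmath.exp(-1j * w * n) for n, x in enumerate(h))

def error_energy(A, C, steps=2000):
    """Sum over rows of the integral of |H_i(w; A) - H_i(w; C)|^2 on [0, pi].

    Uses composite Simpson quadrature; `steps` must be even.
    The integration range [0, pi] is an assumption of this sketch.
    """
    step = math.pi / steps
    total = 0.0
    for a_row, c_row in zip(A, C):
        # By linearity, the DTFT of the difference equals the difference of DTFTs.
        d = [a - c for a, c in zip(a_row, c_row)]
        f = [abs(dtft(d, k * step)) ** 2 for k in range(steps + 1)]
        total += step / 3 * (f[0] + f[-1]
                             + 4 * sum(f[1:-1:2]) + 2 * sum(f[2:-1:2]))
    return total

def error_energy_parseval(A, C):
    """Closed form for real entries: pi times the squared Frobenius distance."""
    return math.pi * sum((a - c) ** 2
                         for ar, cr in zip(A, C) for a, c in zip(ar, cr))

# Example: one candidate row against the second row of the exact 4-point DCT-II.
a_row = [1, 0, 0, -1]
c_row = [math.sqrt(0.5) * math.cos((2 * n - 1) * math.pi / 8) for n in range(1, 5)]
numeric = error_energy([a_row], [c_row])
closed = error_energy_parseval([a_row], [c_row])
print(f"quadrature: {numeric:.6f}, closed form: {closed:.6f}")
```

Under these assumptions the measure separates into independent per-row terms, which is convenient when candidate matrices are searched row by row.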
As an additional constraint to (1), we impose that the matrix $\mathbf{A} \cdot \mathbf{A}^{\top}$ must be diagonal, which ensures that orthogonality can be achieved in the obtained approximations [6]. The resulting constrained optimisation problem is algebraically intractable, so we resorted to an exhaustive computational search.
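A constrained search of this kind can be sketched with a row-by-row backtracking enumeration instead of visiting all $3^{16}$ candidate matrices. The sketch below rests on several assumptions that are not taken from the letter: the error is the squared Frobenius distance to the exact DCT-II (proportional to the DTFT error energy for real entries when integrated over $[0, \pi]$), all-zero rows are excluded so that the rows can later be normalised, and the names `dct2_exact` and `search` are hypothetical. Ties are possible under this measure, so all minimisers are collected rather than a single matrix.

```python
import itertools
import math

P = (-1, 0, 1)  # admissible entries

def dct2_exact(N=4):
    """Exact orthonormal DCT-II matrix with 1-based indices m, n."""
    return [[math.sqrt(2 / N) * (1 / math.sqrt(2) if m == 1 else 1.0)
             * math.cos((m - 1) * (2 * n - 1) * math.pi / (2 * N))
             for n in range(1, N + 1)] for m in range(1, N + 1)]

def search(C, tol=1e-12):
    """Return (smallest squared distance, list of all constrained minimisers)."""
    N = len(C)
    # Zero rows are excluded (assumption): they could not be normalised later.
    rows = [r for r in itertools.product(P, repeat=N) if any(r)]
    best = {"err": math.inf, "mats": []}

    def extend(chosen, err):
        i = len(chosen)
        if i == N:
            if err < best["err"] - tol:
                best["err"], best["mats"] = err, [chosen]
            elif abs(err - best["err"]) <= tol:
                best["mats"].append(chosen)
            return
        for r in rows:
            # Constraint: A * A^T diagonal, i.e. rows are mutually orthogonal.
            if any(sum(x * y for x, y in zip(r, s)) != 0 for s in chosen):
                continue
            e = err + sum((x - c) ** 2 for x, c in zip(r, C[i]))
            if e <= best["err"] + tol:  # prune branches that cannot win
                extend(chosen + [r], e)

    extend([], 0.0)
    return best["err"], best["mats"]

best_err, minimisers = search(dct2_exact())
print(f"smallest squared distance: {best_err:.6f}; tied matrices: {len(minimisers)}")
```

Because several matrices attain the same error under these assumptions, a secondary criterion such as the number of additions would be needed to single one out; the sketch does not attempt to reproduce the selection made in the letter.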
Proposed 4-point DCT approximations
By solving (1), we obtained the following new DCT approximations:
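Although possessing very low complexity, these matrices are not orthogonal. In several contexts, such as image processing for coding, orthogonality is often a desirable property [2]. Adopting the orthogonalisation methods detailed in [6], new orthogonal matrices $\hat{\mathbf{C}}_{\mathrm{II}}$ and $\hat{\mathbf{C}}_{\mathrm{IV}}$ can be derived from $\mathbf{C}^{*}_{\mathrm{II}}$ and $\mathbf{C}^{*}_{\mathrm{IV}}$, respectively. These orthogonal matrices are obtained as $\hat{\mathbf{C}}_{\mathrm{II}} = \mathbf{D}_1 \cdot \mathbf{C}^{*}_{\mathrm{II}}$ and $\hat{\mathbf{C}}_{\mathrm{IV}} = \mathbf{D}_2 \cdot \mathbf{C}^{*}_{\mathrm{IV}}$, where $\mathbf{D}_1$ and $\mathbf{D}_2$ are diagonal scaling matrices.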
Explicitly we obtain that
where $\mathbf{I}_4$ is the identity matrix of size 4. In the image compression context, the scaling matrices $\mathbf{D}_1$ and $\mathbf{D}_2$ need not introduce any computational overhead, because they can be merged into the quantisation step, as described earlier in [4, 7-9].
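The scaling step can be illustrated for any matrix whose rows are mutually orthogonal: dividing each row by its Euclidean norm yields an orthogonal matrix. The sketch below is a minimal example under that reading of the method in [6]. The matrix `T` is a hypothetical member of the set with entries in {−1, 0, 1}, chosen only for illustration and not claimed to be one of the matrices derived in this letter; `orthogonalise` is likewise a hypothetical helper.

```python
import math

def orthogonalise(A, tol=1e-12):
    """Return (d, C_hat), where C_hat = diag(d) * A has orthonormal rows.

    Requires A * A^T to be diagonal; d[i] = 1 / sqrt((A * A^T)[i][i]).
    """
    n = len(A)
    G = [[sum(x * y for x, y in zip(r, s)) for s in A] for r in A]
    if any(abs(G[i][j]) > tol for i in range(n) for j in range(n) if i != j):
        raise ValueError("A * A^T is not diagonal")
    d = [1 / math.sqrt(G[i][i]) for i in range(n)]
    C_hat = [[d[i] * x for x in A[i]] for i in range(n)]
    return d, C_hat

# Hypothetical example matrix with mutually orthogonal rows.
T = [[1, 1, 1, 1],
     [1, 0, 0, -1],
     [1, -1, -1, 1],
     [0, -1, 1, 0]]

d, C_hat = orthogonalise(T)
print("diagonal scaling:", [round(v, 4) for v in d])
```

Only the diagonal entries `d` involve non-trivial constants, which is why such a scaling can be absorbed into a subsequent quantisation stage while the transform itself remains multiplierless.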
The signal flow graphs for $\mathbf{C}^{*}_{\mathrm{II}}$ and $\mathbf{C}^{*}_{\mathrm{IV}}$ are shown in Fig. 1. The $\mathbf{C}^{*}_{\mathrm{II}}$ and $\mathbf{C}^{*}_{\mathrm{IV}}$ transformations require only 6 and 8 additions, respectively; no multiplications or bit-shifting operations are needed. The resulting approximations $\hat{\mathbf{C}}_{\mathrm{II}}$ and $\hat{\mathbf{C}}_{\mathrm{IV}}$ are close to the respective exact DCTs and have very low complexity. Table 1 shows the error measure and arithmetic complexity for the proposed transforms, the exact DCT computation [2], and the well-known signed DCT [3].
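To illustrate how a matrix with entries in {−1, 0, 1} maps to an addition-only signal flow graph, the sketch below evaluates a hypothetical 4-point transform with exactly six additions by sharing two butterfly sums. The matrix `T` and the function names are assumptions made for this example and are not claimed to match Fig. 1.

```python
# Hypothetical transform matrix over {-1, 0, 1}, used only for illustration.
T = [[1, 1, 1, 1],
     [1, 0, 0, -1],
     [1, -1, -1, 1],
     [0, -1, 1, 0]]

def transform_direct(x):
    """Reference computation: plain matrix-vector product T * x."""
    return [sum(t * v for t, v in zip(row, x)) for row in T]

def transform_fast(x):
    """Same result as T * x using six additions/subtractions and no multiplications."""
    x0, x1, x2, x3 = x
    a = x0 + x3      # addition 1
    b = x1 + x2      # addition 2
    c = x0 - x3      # addition 3 (second output)
    d = x2 - x1      # addition 4 (fourth output)
    return [a + b,   # addition 5
            c,
            a - b,   # addition 6
            d]

x = [1, 2, 3, 4]
print(transform_fast(x), transform_direct(x))
```

The first stage forms sums and differences of mirrored inputs, and the second stage combines only the two sums, so the odd-indexed outputs come directly from the first stage.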
FPGA prototypes
The approximate DCTs were realised as an architecture for the 4-point 1-D transforms and then extended to the 4×4 2-D transformation. The inputs were assumed to have 8-bit resolution. Rapid prototypes were realised on a Xilinx Virtex-6 field programmable gate array (FPGA) device and tested to ensure correct on-chip functionality. The consumption of configurable logic blocks (CLB), flip-flops (FF), look-up tables (LUT), and slices is shown in Table 2. The maximum operating frequency ($F_{\max}$) and dynamic power consumption ($D_p$) are also displayed.
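The extension from 1-D to 2-D can be modelled in software with the usual row-column decomposition, $\mathbf{Y} = \mathbf{T} \cdot \mathbf{X} \cdot \mathbf{T}^{\top}$: the 1-D transform is applied to every column of the input block and then to every row of the intermediate result. The sketch below is a behavioural model only, not a description of the hardware architecture; the 1-D transform is the same hypothetical six-addition example matrix over {−1, 0, 1}, and all names are assumptions.

```python
def transform_1d(x):
    """Hypothetical 4-point multiplierless transform (six additions)."""
    x0, x1, x2, x3 = x
    a, b = x0 + x3, x1 + x2
    return [a + b, x0 - x3, a - b, x2 - x1]

def transform_2d(X):
    """Row-column 2-D transform Y = T * X * T^T of a 4x4 block (list of rows)."""
    # Column pass: transform each column, then transpose back to row layout.
    cols = [transform_1d(list(col)) for col in zip(*X)]
    TX = [list(row) for row in zip(*cols)]
    # Row pass: transforming each row applies T^T on the right.
    return [transform_1d(row) for row in TX]

# A constant block at the largest unsigned 8-bit value.
X = [[255] * 4 for _ in range(4)]
Y = transform_2d(X)
print(Y[0][0], Y[0][1:], Y[1])
```

For this example matrix, a constant block maps entirely to the first output coefficient, and the 2-D output magnitude can grow by a factor of up to 16 relative to the input, which indicates the extra word length a fixed-point model of this particular matrix would need.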
The register transfer language (RTL) code corresponding to the FPGA-verified designs was targeted to a 45 nm CMOS standard-cell process using Cadence Encounter. The CMOS designs were realised up to the synthesis and place-and-route levels, leading to the estimated results in Table 3. The area-time complexities $AT$ and $AT^2$ were adopted and measured in µm²·ns and µm²·ns², respectively.
