The high efficiency video coding (HEVC) standard provides superior compression efficiency, and its introduction has prompted extensive research on integer transform matrices that closely approximate the discrete cosine transform (DCT) used in HEVC. Besides maintaining the coding performance, the hardware and power cost of the circuit that implements the derived integer DCT (Int-DCT) needs to be minimized. To address these multiple design considerations, this paper proposes a new multi-objective optimization algorithm that searches for an efficient Int-DCT matrix whose coding performance is as close as possible to that of the transform in HEVC but which can be implemented with reduced hardware and power. Experimental results show that the approximated Int-DCT matrix generated by the proposed algorithm achieves almost the same coding performance as the transforms in HEVC, measured in terms of Bjøntegaard Delta rate. The experiments also demonstrate that the proposed 16-point Int-DCT requires at least 15.5% and 26.8% less circuit area in FPGA and ASIC, respectively, than other state-of-the-art Int-DCT realizations that provide similar coding performance.
I. INTRODUCTION
Discrete cosine transform (DCT) is commonly used for image and video compression [1], for example in the published standards of the Joint Photographic Experts Group (JPEG), the Moving Picture Experts Group (MPEG) and the International Telecommunication Union Telecommunication Standardization Sector (ITU-T). Because exact DCT algorithms already operate very close to the theoretical DCT complexity, they can hardly offer dramatic computational gains or implementation cost reductions [2]. Approximate DCTs have therefore become an alternative for reducing the computational complexity. If the basic properties of the transform matrix, such as orthogonality, symmetry and equal norm, can be preserved, transform approximations can be applied to reduce the computational cost [3]-[6]. Integer discrete cosine transform (Int-DCT) [7] is one such approximation, whose finite precision transform coefficients can be computed with integer arithmetic. Compared to the exact DCT, it has a lower computation cost and causes no drifting error [8]. As a result, it has been used in recent coding standards such as H.264/Advanced Video Coding (AVC) [9], [10], Audio Video coding Standard (AVS) [11], Video Codec 1 (VC-1) [12], and high efficiency video coding (HEVC) [13], which uses 4-point to 32-point Int-DCT.
The associate editor coordinating the review of this manuscript and approving it for publication was Md. Kamrul Hasan.
The most straightforward way of deriving Int-DCT coefficients is to scale the DCT coefficients by a factor and then round them to integer values [14]. However, coefficients derived by this simple scaling cannot guarantee good coding performance. Therefore, a number of Int-DCT implementations that can be used as the core transform matrices of HEVC have been proposed in the literature. Variable block-size transform (VBT) [15] is one of them and uses 4-point to 32-point Int-DCT adaptively. This method selects similar blocks from the reconstructed area and uses them to derive the Karhunen-Loeve transform. The VBT can adapt to nonstationary video signals and can improve the coding performance. Cintra et al. have developed a sequence of approximated 4-point and 8-point DCTs [2], [3], [16]-[21]. In [2], a new class of matrices based on a parametrization of the Feig-Winograd factorization of the 8-point DCT is proposed. In [16], two multiplierless algorithms are proposed to develop 2D 4-point DCT approximations for digital video coding. In [3], the authors introduced low-complexity 3D 8-point DCT approximations, which are formalized in terms of high-order tensor theory. Other 8-point approximated DCTs are proposed in [17]-[21], using different techniques to derive efficient transforms that require fewer additions. In addition, more transforms have been proposed [22]-[24] for the 16-point and 32-point Int-DCT used in HEVC, because larger transforms such as the 16-point DCT can contribute more coding gain than 4-point or 8-point transforms [25]. In [22], Cintra's team proposed a digital very large scale integration (VLSI) architecture for computing DCT/DST transforms without multiplications. The proposed 16×16 transform is heavily approximated so that the transform matrix consists of 1, 0 and −1 only. This helps to minimize the hardware cost of implementing the transform, but it suffers from much larger errors. The method in [23] derives orthogonal, high-order Int-DCTs from lower-order transforms. The design proposed in [24] ensures a fully factorized structure and fast computation. Another method that can be included in the HEVC standard is Joint Collaborative Team on Video Coding (JCTVC)-G579 [26], which uses scaled integer transforms and supports recursive factorization.
VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
In recent years, a group of new Int-DCTs has been proposed. One of them is the recursive integer cosine transform (RICT) proposed in [8]. RICT generates high-order Int-DCTs from lower-order ones by exploiting the self-recursive property of the DCT. Compared with JCTVC-G579 [26], the basis row vectors in RICT have almost the same norm, so additional scaling is avoided. Another representative algorithm was proposed in [27] to derive a scalable and orthogonal approximation of the DCT. An approximate DCT of length N is derived from a pair of DCTs of length N/2 at the cost of additions for input preprocessing. A more recent algorithm by the authors of [27] was published in [28], where an approximated kernel for the DCT of length 4 is derived. This kernel is adopted for the computation of the DCT and IDCT of higher-order transforms whose sizes are powers of two. Another approximated DCT for HEVC was proposed in [29]. The DCT is implemented through the Walsh-Hadamard transform followed by Givens rotations. The proposed method computes four different approximations and skips some rotations. To the best of our knowledge, one of the most recently proposed algorithms is the design presented in [30], which is also relevant to our proposed work. In [30], an energy- and area-efficient architecture for approximated DCT is proposed. It achieves good compression performance with reduced computation cost by truncating a few least significant bits (LSB), most significant bits (MSB), and some zero columns. Another design was proposed recently in [31] by Chen et al. Compared with this paper, the main contribution of [31] is a new efficient DCT circuit implementation that uses the double base number system and an algorithm to minimize distinct shift counts. That design relies on existing Int-DCT coefficients for circuit implementation, and no approximation is made to the given Int-DCT coefficients.
Although these works have contributed significantly to the development of low-cost Int-DCT matrices with good compression efficiency, the demanding hardware and power requirements of emerging technologies create a continuous need to improve the performance. It is therefore meaningful to develop new Int-DCT matrices for HEVC that deliver good compression performance and at the same time achieve lower hardware cost. Because a good Int-DCT requires multiple properties, such as orthogonality, basis vector norm equality and basis vector energy compaction, an optimization approach is needed that provides the flexibility to adjust the priorities of these measures. In this paper, we propose a new algorithm to generate power-of-two-point Int-DCTs, with three main contributions:
1. We formulate the Int-DCT coefficient approximation as a multi-objective optimization problem in which the objectives are the critical properties of the Int-DCT matrix.
2. We develop hardware-efficient solutions by optimizing the hardware cost and the coding performance measures simultaneously, as two objectives.
3. We normalize the Int-DCT matrix properties and solve the optimization problem with a weighted sum approach in which the weight of each objective is adjustable according to the objective priorities.
The experimental results have shown that the proposed Int-DCT can achieve almost the same compression performance as the transform in HEVC measured by Bjøntegaard Delta rate (BD-rate) [32] . In addition, the hardware cost to implement the proposed Int-DCT is significantly reduced. This paper is organized as follows. Section II explains the Int-DCT design criteria related to compression performance and implementation cost. Section III presents the proposed algorithm to derive the efficient Int-DCT with low hardware complexity. The experimental results and discussions are presented in Section IV and the paper is concluded in Section V.
II. DCT PROPERTIES AND HARDWARE COST EVALUATION
A. DCT-II TRANSFORMS
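In a typical video codec, an N × N forward DCT and an N × N inverse DCT are usually required. The DCT can be categorized into four types, namely DCT-I to DCT-IV [33]. In this paper, we focus on new Int-DCT matrices used for the Type II forward transform. The original 2-dimensional N × N forward Type II DCT can be expressed as
$$F(u,v) = \frac{2}{N}\,\varepsilon_u\,\varepsilon_v \sum_{i=0}^{N-1}\sum_{j=0}^{N-1} S(i,j)\,\cos\frac{(2i+1)u\pi}{2N}\,\cos\frac{(2j+1)v\pi}{2N} \tag{1}$$
where $F(u,v)$ is the frequency domain output, $S(i,j)$ is the spatial domain input, and $\varepsilon_k = 1/\sqrt{2}$ for $k = 0$ and $\varepsilon_k = 1$ otherwise. This can be implemented as a separable transform by applying the 1-dimensional N-point DCT to each row and each column separately [14]. The 1-dimensional N-point Type II DCT can be expressed as [15]:
$$y(n) = C^{II}_N\, x(n) \tag{2}$$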
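where the input sequence $x(n) = [x(0), x(1), \ldots, x(N-1)]^T$ and the output $y(n) = [y(0), y(1), \ldots, y(N-1)]^T$. $C^{II}_N$ is the DCT transform matrix whose infinite precision entries can be computed as
$$C^{II}_N(m,n) = \varepsilon_m \sqrt{\frac{2}{N}}\,\cos\frac{(2n+1)m\pi}{2N}, \qquad \varepsilon_m = \begin{cases} 1/\sqrt{2}, & m = 0 \\ 1, & \text{otherwise} \end{cases} \tag{3}$$
where $m \in [0, N-1]$ and $n \in [0, N-1]$ are the row and column indices of the matrix, respectively.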
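The relationship between the 1-dimensional transform, the separable 2-dimensional transform and the matrix entries can be illustrated with a minimal Python sketch. It assumes the standard orthonormal DCT-II normalization, which is consistent with the first-row value of 1/√N quoted in Section II-B; the function names `dct2_matrix`, `dct_1d` and `dct_2d` are illustrative only and do not come from the paper.

```python
import math

def dct2_matrix(N):
    """Return the N x N orthonormal DCT-II matrix as a list of rows."""
    rows = []
    for m in range(N):
        # Row 0 carries the extra 1/sqrt(2) factor, so each entry is 1/sqrt(N).
        k = math.sqrt(1.0 / N) if m == 0 else math.sqrt(2.0 / N)
        rows.append([k * math.cos(math.pi * m * (2 * n + 1) / (2 * N))
                     for n in range(N)])
    return rows

def dct_1d(C, x):
    """1-D transform y = C x."""
    return [sum(c * xi for c, xi in zip(row, x)) for row in C]

def dct_2d(C, S):
    """Separable 2-D transform: 1-D DCT on every row, then on every column."""
    row_pass = [dct_1d(C, row) for row in S]
    col_pass = [dct_1d(C, list(col)) for col in zip(*row_pass)]
    return [list(r) for r in zip(*col_pass)]  # transpose back to F[u][v]

C4 = dct2_matrix(4)
F = dct_2d(C4, [[1.0] * 4 for _ in range(4)])  # constant block: only DC survives
```

For a constant 4 × 4 block, all the energy ends up in the DC term `F[0][0]`, which is the energy compaction behaviour the Int-DCT approximations try to preserve.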
B. TRANSFORM QUALITY EVALUATION BY CORE PROPERTIES
The DCT matrices used in HEVC are essentially finite precision approximations of the infinite precision DCT matrix computed by (3). These standardized transform matrices help to avoid encoder-decoder mismatch and drift caused by implementations with different floating point representations [14]. Int-DCT has a few core properties that can be used to measure its compression quality. These properties include basis vector symmetry, orthogonality, closeness to the original DCT and basis vector norm equality. The symmetry property of the derived Int-DCT is always preserved, because the repeated coefficients provided by the symmetry help to reduce the number of arithmetic operations. Closeness to the original DCT and orthogonality of the transform matrix basis vectors [14] are critical for good compression efficiency, while basis vector norm equality affects the quantization/de-quantization process. In the proposed method, the first property evaluated is the closeness to the original DCT, which can be computed by
where $d(m,n)$ is the entry at the $m$th row and $n$th column of the derived finite precision Int-DCT matrix $d$ and $\alpha$ is a scaling factor. $Close(m,n)$ is the matrix that stores the closeness between each element $d(m,n)$ and its value in the original DCT matrix. For the Type II DCT in (3), we scale and truncate the entries of the first row vector, i.e. $C^{II}_N(0,n)$, from their original value of $1/\sqrt{N}$ to $2^B$, where $B$ is the wordlength used to represent the finite precision coefficients. All the first-row coefficients then become power-of-two integers, so in a hardware realization cost-free hardwired shifts can be used to multiply the first row vector with the corresponding inputs. This makes the scaling factor $\alpha = 2^B\sqrt{N}$, which is also applied to the other row vectors in $C^{II}_N$. With $\alpha$ specified, we can evaluate $Close$ of any given Int-DCT $d$ using (4). The average closeness of the entire Int-DCT matrix $d$, denoted as $C_A(d)$, can therefore be computed by
$$C_A(d) = \frac{1}{N^2}\sum_{m=0}^{N-1}\sum_{n=0}^{N-1} Close(m,n) \tag{5}$$
A lower value of $C_A$ indicates that the derived Int-DCT matrix $d$ is closer to the scaled DCT matrix $\alpha C^{II}_N$. Secondly, the basis vectors need to be orthogonal. This property makes the transform coefficients uncorrelated, which is essential for good compression efficiency. Let $v_{r,m}$ and $v_{c,n}$ be the $m$th row vector and the $n$th column vector in $d$, respectively. For any two different row vectors in $d$, $v_{r,m_1} = [d(m_1,0), d(m_1,1), \ldots, d(m_1,N-1)]$ at row $m_1$ and $v_{r,m_2} = [d(m_2,0), d(m_2,1), \ldots, d(m_2,N-1)]$ at row $m_2$, the orthogonality between them is given as
Similarly, for any two different column vectors in $d$, $v_{c,n_1} = [d(0,n_1), d(1,n_1), \ldots, d(N-1,n_1)]^T$ at column $n_1$ and $v_{c,n_2} = [d(0,n_2), d(1,n_2), \ldots, d(N-1,n_2)]^T$ at column $n_2$, the orthogonality between them is given as
A total of N (N − 1)/2 different basis row or column vector pairs exist, so the average orthogonality of the Int-DCT matrix d is
Lower value of O A indicates better orthogonality of the Int-DCT matrix d.
Thirdly, the basis vectors should have almost equal norms to simplify the quantization/de-quantization. For any row vector $v_{r,m}$, its norm is computed as $v_{r,m}^T v_{r,m}$. We scale the norm of any row vector by the norm of the first basis row vector, which is $v_{r,0}^T v_{r,0}$, and the Norm Variance $NV\_R(v_{r,m})$ of any row vector $v_{r,m}$ can be computed as
Similarly, for any column vector v c,n, its norm is computed as v T c,n v c,n . With the scaling by the norm of the first column vector which is v T c,0 v c,0 , the Norm Variance NV_C(v c,n ) of any column vector v c,n can be computed as
According to (9) , if any v r,m has exactly the same norm as v r,0 , NV_R(v r,m ) will be 0 which means that v r,m does not encounter any norm variance. The same applies for NV_C(v c,n ) in (10) . The average Norm Variance of the entire matrix d, which is denoted as NV A (d), can be computed as
A lower $NV_A(d)$ indicates that the basis vectors of the Int-DCT matrix $d$ have more similar norm values. The three measures defined in (5), (8) and (11) jointly characterize the quality of the derived Int-DCT matrix.
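The three quality measures can be sketched in Python as follows. The exact forms of (4) and (6) to (11) are not reproduced here, so the formulas in this sketch are assumptions chosen only to match the stated behaviour: closeness is taken as the relative error against the scaled DCT, orthogonality as the absolute normalized inner product averaged over row pairs and column pairs, and norm variance as |1 − norm/first norm| averaged over rows and columns. All function names are hypothetical. The HEVC 4-point core transform is used purely as sample input.

```python
import math
from itertools import combinations

def dct2_matrix(N):
    """Orthonormal DCT-II matrix; every entry of row 0 equals 1/sqrt(N)."""
    return [[(math.sqrt(1.0 / N) if m == 0 else math.sqrt(2.0 / N))
             * math.cos(math.pi * m * (2 * n + 1) / (2 * N))
             for n in range(N)] for m in range(N)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rows_and_cols(d):
    return [list(r) for r in d], [list(c) for c in zip(*d)]

def closeness_avg(d, alpha):
    """C_A: mean of the ASSUMED relative error |d - alpha*C| / |alpha*C|."""
    N = len(d)
    C = dct2_matrix(N)
    total = sum(abs(d[m][n] - alpha * C[m][n]) / abs(alpha * C[m][n])
                for m in range(N) for n in range(N))
    return total / (N * N)

def orthogonality_avg(d):
    """O_A: ASSUMED mean |cosine| over all row pairs and all column pairs."""
    def avg(vectors):
        pairs = list(combinations(vectors, 2))
        return sum(abs(dot(u, v)) / math.sqrt(dot(u, u) * dot(v, v))
                   for u, v in pairs) / len(pairs)
    rows, cols = rows_and_cols(d)
    return 0.5 * (avg(rows) + avg(cols))

def norm_variance_avg(d):
    """NV_A: ASSUMED mean of |1 - v.v / v0.v0| over rows and columns."""
    def avg(vectors):
        ref = dot(vectors[0], vectors[0])
        return sum(abs(1.0 - dot(v, v) / ref) for v in vectors) / len(vectors)
    rows, cols = rows_and_cols(d)
    return 0.5 * (avg(rows) + avg(cols))

# Sample input: the HEVC 4-point core transform (first row equal to 64).
hevc4 = [[64, 64, 64, 64], [83, 36, -36, -83],
         [64, -64, -64, 64], [36, -83, 83, -36]]
alpha = 64 * math.sqrt(4)  # maps the first-row value 1/sqrt(N) to 64
c_a = closeness_avg(hevc4, alpha)
o_a = orthogonality_avg(hevc4)
nv_a = norm_variance_avg(hevc4)
```

Under these assumed definitions all three values are small but nonzero for the sample matrix, which illustrates why the measures have to be traded off rather than driven to zero simultaneously.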
C. HARDWARE COST EVALUATION
Besides the above-mentioned evaluations, the implementation cost of the Int-DCT matrices is another critical design consideration. Many existing Int-DCT realizations such as [34] use conventional multiplier-less multiple constant multiplication (MCM) techniques, which involve only adders and hardwired shifts to save hardware cost. In these designs, the total full adder cost can serve as a hardware cost indicator. To achieve lower hardware complexity, we use the reconfigurable multiplier (RM) based method proposed in [35] to implement the multiplications. The architecture of one RM is shown in Fig. 1(a), where mux stands for multiplexer. Each RM consists of a partial sum block with an add-shift network and a sequence of multiplexers followed by shifters. To further reduce the complexity, the adopted RM design uses the sporadic logarithmic shifters (SLS) proposed in [36] as shifters. To reduce the complexity of the partial sum block and the SLSs, we limit the number of different partial sums and the number of different shift amounts to be generated by the SLSs. The works in [37]-[42] have shown that the double base number system (DBNS) is efficient for implementing add-shift digital circuits. Therefore, in this design, we represent the coefficients in the same row of the finite precision Int-DCT matrix $d$ using DBNS [37] as
where $\alpha_t$ and $\beta_t$ are respectively the non-negative exponents of 2 and 3 in the $t$-th nonzero double base term, and $T$ is the total number of nonzero double base terms. The value of $\alpha_t$ is the shift to be performed by the SLSs. The values of $3^{\beta_t}$ are the partial sums to be implemented inside the partial sum block. It should be noted that the DBNS representation of a given integer is not unique. This gives us the opportunity to search for an efficient DBNS representation of each coefficient in the same row of $d$ such that the total numbers of different $3^{\beta_t}$ and $2^{\alpha_t}$ values are minimized, which reduces the cost of the partial sum block and the SLSs. After the partial sums are determined, the partial sum block design technique proposed in [42] is adopted. The outputs of the partial sum block are the products of the input and the partial sums. The multiplexers select the correct partial sum product, which is then shifted by the SLS. The carry save adder sums up all the $T$ double base terms, and the sum is the final product of the input in $x(n)$ and one Int-DCT coefficient $d(m,n)$. The buffer and accumulator at the bottom add up all the products for each row to generate one output. Ripple carry adders (RCA) are adopted in the design owing to their lower complexity compared with other adder types [43].
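The non-uniqueness of DBNS representations can be explored with a small brute-force sketch in Python. It assumes signed terms of the form ±2^a·3^b and simply returns one representation with the fewest terms for a single coefficient. The row-level selection that minimizes the number of different shifts and partial sums, as well as the functions `dbns_candidates`, `dbns_value` and `min_dbns`, are illustrative assumptions and not the procedure of [37]-[42].

```python
from itertools import combinations, product

def dbns_candidates(limit):
    """All exponent pairs (a, b) with 2**a * 3**b <= limit."""
    cands = []
    b = 0
    while 3 ** b <= limit:
        a = 0
        while 2 ** a * 3 ** b <= limit:
            cands.append((a, b))
            a += 1
        b += 1
    return cands

def dbns_value(terms):
    """Evaluate a list of (sign, a, b) terms."""
    return sum(s * 2 ** a * 3 ** b for s, a, b in terms)

def min_dbns(c, max_terms=3):
    """Return one fewest-term signed DBNS representation of c, or None."""
    if c == 0:
        return []
    target = abs(c)
    sign = 1 if c > 0 else -1
    cands = dbns_candidates(2 * target)
    for t in range(1, max_terms + 1):          # try fewer terms first
        for combo in combinations(cands, t):
            for signs in product((1, -1), repeat=t):
                total = sum(s * 2 ** a * 3 ** b
                            for s, (a, b) in zip(signs, combo))
                if total == target:
                    return [(sign * s, a, b)
                            for s, (a, b) in zip(signs, combo)]
    return None

rep_83 = min_dbns(83)   # a two-term result, for example 2**1 * 3**0 + 2**0 * 3**4
rep_36 = min_dbns(36)   # a single term: 2**2 * 3**2
```

In the RM context, each returned `a` would correspond to a shift amount for an SLS and each `3**b` to a partial sum, so a coefficient such as 36 needs only one double base term.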
With the RM, the multiplications for 1-dimension DCT can be performed with a total of N RMs, as shown in Fig.1 (b) . By selecting the correct inputs for multiplexers and the correct shift amounts for SLSs, each RM is configured to be one of the N coefficients in one row of d. At one instance, the configured RM multiplies with the corresponding input in x(n). For example, the 0 th RM is firstly configured to d(0, 0) and multiplies with x(0). In the next multiplication, the 0 th RM is configured to d(0, 1) which multiplies with x(1). This reconfiguration and multiplication repeats until all the products are generated.
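The time-multiplexed schedule described above can be modelled behaviourally in a few lines of Python. This is only a functional sketch: the N RMs operate in parallel in hardware but are looped over sequentially here, and the function name `rm_transform` is hypothetical.

```python
def rm_transform(d, x):
    """Behavioural model of N reconfigurable multipliers computing y = d x."""
    N = len(d)
    acc = [0] * N                 # one accumulator per RM (one per output row)
    for n in range(N):            # step n: every RM processes input x[n]
        for m in range(N):        # RM m is reconfigured to coefficient d(m, n)
            acc[m] += d[m][n] * x[n]
    return acc

# Sample input: the HEVC 4-point core transform.
hevc4 = [[64, 64, 64, 64], [83, 36, -36, -83],
         [64, -64, -64, 64], [36, -83, 83, -36]]
y = rm_transform(hevc4, [1, 2, 3, 4])
```

After N reconfiguration steps each accumulator holds one transform output, so the latency grows with N while the multiplier count stays at N.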
To evaluate and compare the hardware complexity fairly, we convert the costs of the multiplexers [44] and the shifters [36] to full adder counts using their approximate area complexity ratios. Because the complexity of a multiplexer is approximately proportional to its number of input lines P [35], the area ratio of one w-bit P-to-1 multiplexer to one full adder is w × P × ρ. The value of ρ depends on the targeted device technology. Because the SLSs in the RMs consist of multiplexers with different numbers of inputs, we can apply the same method to convert the shifter complexity to the equivalent full adder count. Therefore, in our method, the total approximate full adder count of a given Int-DCT matrix d is formulated as
where N R and N M are the total numbers of RCAs and multiplexers respectively in the design. w RCA_i and w m_i represent the wordlength of the ith RCA and the ith multiplexer respectively. The multiplexers used in programmable shifters are included in (13) . With this area complexity alignment, we can always estimate and compare the implementation cost in terms of FA_total for any given Int-DCT matrix when it is implemented using the RM approach.
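A minimal Python sketch of this area alignment is given below. It assumes that a w-bit RCA counts as w full adders and that each w-bit P-to-1 multiplexer contributes w × P × ρ full-adder equivalents, as stated above; the exact layout of (13) is an assumption, and the function name and argument structure are hypothetical.

```python
def fa_total(rca_wordlengths, muxes, rho):
    """Approximate full adder count.

    rca_wordlengths: wordlength of each ripple carry adder.
    muxes: (wordlength, number_of_inputs) for each multiplexer,
           including those inside the programmable shifters.
    rho: technology-dependent multiplexer-to-full-adder area ratio.
    """
    adder_cost = sum(rca_wordlengths)                 # w full adders per RCA
    mux_cost = sum(w * p * rho for w, p in muxes)     # w * P * rho per mux
    return adder_cost + mux_cost

# Two RCAs (8 and 10 bits) and two multiplexers (8-bit 4-to-1, 10-bit 2-to-1).
cost = fa_total([8, 10], [(8, 4), (10, 2)], rho=0.15)
```

With ρ = 0.15 the example evaluates to 18 + 0.15 × (32 + 20) = 25.8 full-adder equivalents.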
III. THE PROPOSED ALGORITHM FOR EFFICIENT INT-DCT
A. WEIGHTED SUM APPROACH
Section II presented a group of design criteria for evaluating the transform quality as well as its implementation cost. Searching for solutions that are optimal for all of these criteria simultaneously is a multi-objective non-linear optimization problem. Unfortunately, no single Int-DCT transform matrix can achieve the optimum of all the above measures at the same time. Because some objectives can be more critical than others depending on the application, an efficient optimization approach is desired that can locate quasi-optimal solutions with the flexibility to adjust the priorities of the different objectives. All the criteria in (5), (8) and (11) are normalized and unitless measures. Therefore, we can convert the multi-objective problem into a single-objective optimization by adopting the weighted sum approach [45], which is defined as
and $J_i$ is the $i$th objective among all the $z$ objectives, whose weight is $\lambda_i$. $sf_i$ is the scaling factor applied to $J_i$. $J_{MO}$ is the summed-up single objective. In our case, we first consider $z = 3$. $C_A$, $O_A$ and $NV_A$ are all scaled objectives, so we define the summed single objective $\eta$ as the quality measure of any Int-DCT matrix $d$, which can be expressed as
$$\eta(d) = \lambda_1 C_A(d) + \lambda_2 O_A(d) + \lambda_3 NV_A(d) \tag{15}$$
where λ 1 , λ 2 and λ 3 are the weighting factors for C A , O A and NV A respectively. When comparing the quality of two different Int-DCT matrices a and b, we can say η(a) dominates η(b) if and only if
One solution a is called efficient if and only if η(a) cannot be dominated by η of any other solution. An Int-DCT matrix a is said to be optimal if η(a) is less or equal to the η of all the remaining candidate solutions [45] .
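The weighted sum and the resulting selection of the best candidate can be sketched in Python. The sketch assumes that each scaling factor divides its objective, which matches the way the scaling factors are used later for the normalized η and FA_total; the function names and the sample values are hypothetical.

```python
def weighted_sum(objectives, weights, scales=None):
    """J_MO = sum_i weight_i * J_i / sf_i (ASSUMED form of the weighted sum)."""
    if scales is None:
        scales = [1.0] * len(objectives)   # objectives are already scaled
    return sum(w * j / s for w, j, s in zip(weights, objectives, scales))

def eta(c_a, o_a, nv_a, lambdas=(1 / 3, 1 / 3, 1 / 3)):
    """Quality measure of one Int-DCT matrix from its three scaled measures."""
    return weighted_sum([c_a, o_a, nv_a], list(lambdas))

def best(candidates):
    """Return the label of the candidate with the lowest eta.

    candidates maps a label to its (C_A, O_A, NV_A) triple.
    """
    return min(candidates, key=lambda k: eta(*candidates[k]))

# Made-up measure triples, used only to exercise the functions.
winner = best({"a": (0.03, 0.006, 0.003), "b": (0.02, 0.02, 0.02)})
```

Candidate "a" wins here even though its closeness is worse, because its orthogonality and norm variance are much better under uniform weights.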
B. THE PROPOSED ALGORITHM
Because the objective is to search for an Int-DCT matrix with minimized $\eta$, giving higher priority to a certain individual measure in (15) means assigning it a larger weighting factor. Any reduction in that measure then decreases $\eta$ more effectively. The selection of the weighting factors $\lambda_i$ depends on the transform priorities. This flexibility in assigning different weighting factors to the multiple objectives allows the design to be adapted to applications with different priorities. For example, if fast video coding/decoding with simplified quantization is desired, $NV_A$ should carry a higher weight and hence a larger $\lambda_3$ should be assigned. If compression efficiency has priority, $C_A$ and $O_A$ should be given higher weights, so $\lambda_1$ and $\lambda_2$ should be assigned higher values than $\lambda_3$. After the weights are defined for the specific application, the next step is to search for the Int-DCT matrix with low $\eta$ and low hardware cost.
To make the search more effective, it is important to reduce the number of inefficient solutions. This can be achieved by starting from the initial Int-DCT matrix solution, denoted as $c_i(m,n)$, which is obtained by directly scaling the infinite-precision coefficients as $\alpha \cdot C^{II}_N$ and then truncating the coefficients to $B$ bits. This $c_i$ has the lowest $C_A$, but it does not necessarily have a good $\eta$ because of its $NV_A$ and $O_A$ measures. When searching for solutions with lower $\eta$, we apply the constraint that only the least significant bit of each coefficient in $c_i$ can be changed, i.e.
Any other candidate $d$ whose coefficients lie beyond the range in (17) is not assessed, because it has a poorer $C_A$, which would affect the compression efficiency. In our experiments, we observed that the significant $C_A$ increase of solutions beyond the range specified by (17) cannot be offset by gains in $O_A$. In addition, solutions further away from $c_i$ generally do not help to improve $O_A$ and $NV_A$. Therefore, if we find a local minimum $c_{local}$ in the above range that produces the lowest $\eta$, we treat $c_{local}$ as the solution for this stage. In addition to $\eta$, which measures the transform quality, the proposed algorithm searches around $c_{local}$ for a solution with lower implementation cost, measured by the full adder count computed by (13). The unitless $\eta$ and FA_total are two different measures, so we need to normalize them before applying the weighted sum approach. The scaling factors are chosen to be $\eta(c_{local})$ and $FA\_total(c_{local})$, respectively. The overall performance of an Int-DCT matrix $d$ is then evaluated through another weighted sum, given as
$$p(d) = \beta_1 \frac{\eta(d)}{\eta(c_{local})} + \beta_2 \frac{FA\_total(d)}{FA\_total(c_{local})} \tag{18}$$
where the function $FA\_total(d)$ computes the total FA count using (13). $p(d)$, defined as the p value of the given Int-DCT matrix $d$, is the overall measure of compression performance and hardware optimality. $\beta_1$ and $\beta_2$, with $\beta_1 + \beta_2 = 1$, are the weighting factors for the normalized $\eta$ and the normalized FA_total, respectively. The selection of $\beta_1$ and $\beta_2$ depends on the priority between compression performance and implementation cost. For mobile and integrated devices with extremely limited hardware and power budgets, the implementation cost is more critical and $\beta_2$ should be assigned a higher value. For applications that require high compression efficiency, $\beta_1$ should be higher. After assigning initial values to the weighting factors according to this rule, we perform the design and evaluate the performance. If any performance specification is not met, the weighting factors are adjusted until the prioritized design objective is fulfilled. To limit the search space for (18), we apply the constraint that only the least significant bit of the coefficients $c_{local}(m,n)$ can be varied, as
The overall optimization can therefore be summarized as below and c f is the final solution. p_compute computes the value of p(d) using (18) for the Int-DCT matrix being evaluated. The matrix with the lowest p value is the final solution recorded as c f .
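The two-stage flow can be sketched in Python as follows. The sketch assumes that varying the least significant bit means allowing each free coefficient to move by −1, 0 or +1, and it searches that neighbourhood exhaustively, which is only practical for a handful of free coefficients; the search strategy, the function names and the toy cost functions are all assumptions for illustration and are not the paper's listing.

```python
from itertools import product

def neighbours(center):
    """All integer vectors whose entries differ from `center` by at most 1."""
    for delta in product((-1, 0, 1), repeat=len(center)):
        yield tuple(c + dl for c, dl in zip(center, delta))

def two_stage_search(c_i, eta, fa_total, beta1, beta2):
    """Stage 1 minimizes eta around c_i; stage 2 minimizes p around c_local."""
    c_local = min(neighbours(c_i), key=eta)
    eta_ref, fa_ref = eta(c_local), fa_total(c_local)

    def p_compute(d):
        # Weighted sum of eta and FA_total, both normalized by c_local.
        return beta1 * eta(d) / eta_ref + beta2 * fa_total(d) / fa_ref

    c_f = min(neighbours(c_local), key=p_compute)
    return c_local, c_f, p_compute(c_f)

# Toy stand-ins (NOT the paper's measures): a quadratic quality penalty and
# the number of nonzero bits as a crude adder-cost proxy.
def toy_eta(d):
    return 10 + (d[0] - 65) ** 2 + (d[1] - 36) ** 2

def toy_fa(d):
    return sum(bin(v).count("1") for v in d)

c_local, c_f, p_f = two_stage_search((65, 37), toy_eta, toy_fa, 0.2, 0.8)
```

In the toy run the second stage moves from (65, 36) to (64, 36): quality degrades slightly, but the cost proxy drops enough for p to fall from 1 to 0.82 when β2 = 0.8.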
IV. RESULTS AND DISCUSSIONS
A. DESIGN EXAMPLE OF INT-DCT WITH UNIFORM PRIORITY
In the first part, we demonstrate the design flow of the proposed algorithm on the 16-point Int-DCT. In this example, we set the coefficient wordlength $B = 8$ and assume that $C_A$, $O_A$ and $NV_A$ have uniform priority for optimization, i.e. $\lambda_1 = \lambda_2 = \lambda_3 = 0.333$. Next, a search space is created and the algorithm searches for a solution that generates a lower $\eta$ than $c_i$, as shown in Fig. 3. The axes of the 3D plot are $C_A$, $O_A$ and $NV_A$. The perfect point is the origin, which is achieved by the original infinite precision DCT coefficients in $C^{II}_N$. After scaling and truncation, we obtain $c_i$ and the search algorithm locates $c_{local}$. Fig. 3 clearly shows that $c_{local}$, represented by the red dot, is closer to the origin than $c_i$, represented by the green dot. The next stage is to search for $c_f$. To achieve lower hardware cost, we assign a higher weight to $\beta_2$ than to $\beta_1$. In this design example, we assign $\beta_2 = 0.8$, so $\beta_1 = 0.2$. A search space is created around $c_{local}$ and $p(d)$ of the different solutions is evaluated using (18). The solution with the lowest p value is selected as $c_f$. The $C_A$, $O_A$, $NV_A$, $\eta$, FA_total and p values obtained while searching from the initial solution $c_i$ to the final solution $c_f$ are shown in Table 1. Table 1 shows that $C_A$ increases when moving from $c_i$ to $c_{local}$. However, $c_{local}$ achieves a better $\eta$ owing to its lower $O_A$ and $NV_A$. Although $c_f$ has a higher $\eta$ than $c_{local}$, it reduces FA_total by 31% relative to $c_{local}$. With the higher $\beta_2$ used in this scenario, $c_f$ has the lowest p value and is therefore taken as the final solution. This verifies that the proposed algorithm can generate an efficient Int-DCT solution that provides a good trade-off between coding performance and implementation cost.
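The trade-off in this design example can be made explicit with a short worked calculation. It assumes that p(d) is the weighted sum of η and FA_total normalized by their values at c_local, as described for (18), and it uses only the weights and the 31% FA_total reduction quoted above; the ratio r is an auxiliary symbol introduced here.

```latex
% Assumption: p(d) = beta_1 * eta(d)/eta(c_local) + beta_2 * FA_total(d)/FA_total(c_local)
\begin{align*}
p(c_{local}) &= \beta_1 + \beta_2 = 1, \\
p(c_f) &= 0.2\, r + 0.8 \times (1 - 0.31) = 0.2\, r + 0.552,
  \qquad r = \frac{\eta(c_f)}{\eta(c_{local})}, \\
p(c_f) < p(c_{local}) &\iff r < \frac{1 - 0.552}{0.2} = 2.24 .
\end{align*}
```

Under this assumption, c_f is preferred over c_local as long as its η is less than 2.24 times that of c_local, which shows how strongly β2 = 0.8 favours hardware savings.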
In addition to the 16-point transform, we also design the 32-point Int-DCT using the proposed algorithm and compare the performance of $c_f$ with that of competing transforms proposed in recent years. Like the proposed transforms, these competing methods apply only limited approximations to the DCT coefficients, so the performance of their 16-point and 32-point transforms and the corresponding hardware costs are comparable. Besides RICT [8] and CT [14], EDCT is the hardware-efficient DCT proposed in [34]. In addition, we compare the results with the most recently published truncation scheme based DCT (TSDCT) proposed in [30]. To evaluate $\eta$, the same coefficient wordlength $B$ is assumed. Because EDCT is a hardware-efficient implementation using the Int-DCT coefficients from [14], EDCT and CT have the same $\eta$ results but different implementations. When evaluating FA_total, an input wordlength of 8 bits is assumed. We have implemented all the designs in Verilog and synthesized them for the Xilinx Spartan-6 FPGA XC6SLX45 using Xilinx ISE WebPACK with a supply voltage of 1.2 V. For this device, our experiments on multiplexers and full adders give the ratio $\rho_m \approx 0.15$, which is adopted to evaluate FA_total. To compare both the Int-DCT matrix quality and the hardware cost, $\eta$, the area in number of LUT slices, the delay in ns and the power in mW after place and route are presented in Table 2. The rows "Imp." present the percentage improvement of the proposed designs for each measure. We take the average of the percentage improvements for the 16-point and the 32-point designs in Table 2 to evaluate the performance of every method. The results show that the proposed algorithm achieves the lowest $\eta$ among all methods. The effort made by the proposed multi-objective optimization to keep $\eta$ low preserves the core properties of the derived Int-DCT. The root mean square errors due to this Int-DCT coefficient approximation are 0.00468 and 0.00522 for the 16-point and 32-point transforms, respectively.
In terms of hardware performance, the Int-DCTs designed by the proposed algorithm require 67.4%, 53.2%, 50.2% and 11.1% less area than CT, EDCT, RICT and TSDCT, respectively. The circuit delays of the proposed method are 28.5%, 48.6%, 28.5% and 33.1% shorter than those of CT, EDCT, RICT and TSDCT, respectively. For total power consumption, the proposed designs reduce the power cost by 71.1%, 19.2%, 18.1% and 4.89% relative to the designs by CT, EDCT, RICT and TSDCT, respectively. The lower implementation cost is achieved by the proposed optimization of $p(d)$, a performance measure that considers both compression efficiency and implementation cost, as presented in (18). Through this effort, the proposed algorithm searches around $c_{local}$ and finds better solutions with a very similar $\eta$ but a much lower full adder count. In addition, the adoption of the recently proposed SLSs [36] in the RMs of our designs is another reason for the lower hardware cost. Another 16-point DCT architecture, named MDA (Multiplication-free Digital Architecture), was proposed in [22]. It is relevant to the proposed transforms and can be compared with them. However, unlike the other competing methods [8], [14], [30], [34], the transform in [22] is derived with a much higher degree of approximation, so its coefficients are very different from those of the original DCT. Apart from the scaling matrix, the transform matrix consists of 1, 0 and −1 only, which leads to the minimum hardware cost. However, this hardware minimization comes at the expense of coding performance. The approximation quality $\eta$, hardware cost and power cost of MDA are listed in Table 3. It can be observed that the transform matrix with 1, 0 and −1 only reduces the hardware cost and delay by around 40% compared with the proposed transform. However, the proposed transform consumes slightly less power.
More importantly, the heavy approximation in MDA increases $\eta$ significantly, to around 26 times that of the proposed transform. This comparison shows that MDA achieves its lower hardware cost only by heavily compromising approximation error and coding performance. Such heavy approximation is unaffordable in most applications. This is the reason why all the other competing methods [8], [14], [30], [34] propose transforms that also achieve a much lower $\eta$ than MDA. In this scenario, the proposed transform is more applicable because of its much lower $\eta$. Meanwhile, the hardware cost overhead of the proposed transform over MDA is limited to below 50%.
All the proposed designs and the competing designs are also mapped to a 45 nm standard cell library and synthesized with Synopsys Design Compiler™. Synopsys Power Compiler™ (version J-2014.09-SP3) is used to perform the power analysis. A supply voltage of 1.0 V is used, and the tool optimization is set to timing constraint. The resulting areas in µm², delays in ns and total power in mW are presented in Table 4. The ASIC results show that the proposed Int-DCT reduces the silicon area by at least 26.8% and 62.7% for the 16-point and 32-point transforms, respectively, compared with other relevant designs. Owing to the multiplier reconfiguration time, the proposed Int-DCT has a slightly longer delay than some existing designs, such as EDCT and TSDCT. However, this overhead is limited to within 10.9%. In terms of total power, the proposed Int-DCT reduces the power cost relative to the competing methods in most of the comparisons, the exception being the 16-point EDCT. The reason is the shorter critical path delay of the 16-point EDCT, which helps to reduce switching activity. For the 32-point transform, however, the proposed Int-DCT reduces power by 9.9% relative to EDCT. In general, although the proposed Int-DCT sometimes has a longer delay, it achieves more significant area and power reductions in ASIC than the competing methods.
To verify that the reduced hardware cost of the proposed transform is achieved without compromising the compression performance, the proposed Int-DCT matrix is implemented in the HEVC reference software HM16.14 and the coding performance is measured. The same benchmark video sequences with different resolutions are compressed and tested under the common test conditions [46]. We verify the performance using the standard BD-rates, which are shown in Table 5. The evaluation uses the YUV color space, where Y represents luminance and U and V represent chrominance. For YUV, the ratio of significance for Y, U, V is 4:1:1. The BD-rate is the percentage bitrate difference for the same peak signal-to-noise ratio (PSNR). A positive BD-rate in Table 5 indicates coding loss compared to the anchor, and a negative BD-rate indicates coding gain. The BD-rate results show that the average difference between the proposed Int-DCT and the original transforms in HEVC is less than 0.03% for different resolutions.
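As an illustration of how a BD-rate figure is obtained, the following minimal sketch uses the commonly used Bjøntegaard procedure: a cubic polynomial is fitted to the logarithm of the bitrate as a function of PSNR for each codec, and the two fits are averaged over the overlapping PSNR range. The function name `bd_rate` and the sample rate and PSNR points are assumptions made for illustration only; they are not taken from the reference software or from Table 5.

```python
import numpy as np


def bd_rate(anchor_rate, anchor_psnr, test_rate, test_psnr):
    """Return the Bjontegaard delta rate (%) of the test codec versus the anchor.

    Positive values mean the test codec needs more bitrate for the same PSNR
    (coding loss); negative values mean coding gain.
    """
    log_ra = np.log10(np.asarray(anchor_rate, dtype=float))
    log_rt = np.log10(np.asarray(test_rate, dtype=float))
    psnr_a = np.asarray(anchor_psnr, dtype=float)
    psnr_t = np.asarray(test_psnr, dtype=float)

    # Cubic fit of log10(bitrate) as a function of PSNR for each codec.
    fit_a = np.polyfit(psnr_a, log_ra, 3)
    fit_t = np.polyfit(psnr_t, log_rt, 3)

    # Only the PSNR interval covered by both curves is compared.
    lo = max(psnr_a.min(), psnr_t.min())
    hi = min(psnr_a.max(), psnr_t.max())

    # Average log-rate over [lo, hi] via the antiderivative of each fit.
    int_a = np.polyint(fit_a)
    int_t = np.polyint(fit_t)
    avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
    avg_t = (np.polyval(int_t, hi) - np.polyval(int_t, lo)) / (hi - lo)

    # Convert the average log-rate difference back to a percentage.
    return (10.0 ** (avg_t - avg_a) - 1.0) * 100.0


# Illustrative rate (kbps) and PSNR (dB) points, one per quantization setting.
anchor_rate = [1000.0, 2000.0, 4000.0, 8000.0]
anchor_psnr = [32.0, 35.0, 38.0, 41.0]
test_rate = [1.1 * r for r in anchor_rate]  # hypothetical codec, 10% more bits

print(round(bd_rate(anchor_rate, anchor_psnr, test_rate, anchor_psnr), 3))
```

With four rate points per curve the cubic fit interpolates the data exactly, so a uniform 10% bitrate increase at equal PSNR yields a BD-rate of about 10%.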
From Table 3, we conclude that the proposed transforms have a much lower η than the MDA proposed in [22]. To verify the better coding performance over [22], we implement the transforms of [22] in the same reference software and code four WQVGA video sequences. The BD-rate results are shown in Table 6, where a positive BD-rate indicates coding loss and a negative BD-rate indicates coding gain. It can be seen that the approximated Int-DCT by MDA always has an obvious coding loss, from 0.5% up to 3.8%. On the other hand, the BD-rates of the WQVGA videos in Table 5 show that the proposed transform achieves some coding gains compared with the reference, and that its coding loss is limited to between about 0.001% and 0.128%.
The results in Tables 2 to 6 show that the proposed transforms perform similarly to the transforms in HEVC, with negligible difference. Meanwhile, the proposed transforms reduce the hardware cost and power consumption relative to existing designs without compromising the coding performance.
B. DESIGNS WITH NON-UNIFORM PRIORITIES
In Section IV.A, Int-DCT solutions with uniform optimization priority are presented. In some scenarios, one particular property can have a higher priority than the others. For example, modern communication technologies demand higher compression efficiency for faster image and video transmission in a given channel bandwidth [47]. Electrocardiogram (ECG) signal processing also requires higher compression with little information loss [48]. The ECG signal is decomposed by means of a linear orthogonal transformation before the transform coefficients are appropriately encoded. In these applications, we need to set a higher priority for orthogonality than for the other properties. Based on the proposed weighted sum approach in (15), a higher λ 2 value helps to generate a solution with higher orthogonality, while the other properties are slightly compromised. To verify the performance, we select λ 2 = 0.8 and λ 1 = λ 3 = 0.1 in this section. We re-run the proposed algorithm to generate the Int-DCT matrix c f . To achieve a lower hardware cost, we still assign β 2 = 0.8 and β 1 = 0.2. The C A , O A , NV A , η, FA_total and p values of c i , c local and c f produced by the proposed algorithm are shown in Table 7. From Table 7, we can see that C A increases when c i moves to c local to achieve a much lower η. Unlike the previous results, c f for this new set of weighting factors turns out to be the same as c local . The reason is that c local in this experiment already produces a very low η. Any attempt to change c local for a lower FA_total causes η to increase significantly. Although we have set β 2 = 0.8, the result of the algorithm still shows that c local is the solution with the lowest p. This implies that it is not worth sacrificing η for the small reduction in FA_total, and hence we take c local as the final solution c f . In spite of this, the hardware implementation cost on the same FPGA for this new Int-DCT matrix is still lower than that of the competing methods, as shown in Table 8.
The areas are given as the number of LUT slices, the delays in ns and the power in mW. The hardware cost of the proposed solution is at least 15.50% lower than that of the other Int-DCT architectures. The delay and power of the proposed architecture are also lower than those of the state-of-the-art designs, by at least 3.5% and 3.7% respectively. The root mean square error of the proposed Int-DCT approximation with non-uniform weighting factors is 0.00621.
We also implement c f in the same HEVC reference software. The BD-rates are shown in Table 9. Compared with the performance of the Int-DCT matrix generated with λ 1 = λ 2 = λ 3 = 0.333 in Table 5, the average difference between the proposed Int-DCT with the new weighting factors and the original transforms in HEVC becomes smaller, at less than 0.02%. This verifies that the proposed algorithm is capable of generating an effective Int-DCT solution with a prioritized property.
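The selection behaviour described in this subsection can be sketched with a generic weighted-sum scalarization. The sketch assumes that η is the λ-weighted sum of the three property errors and that p is the β-weighted sum of η and FA_total, each normalized by its largest value among the candidates. These forms, the helper names `weighted_sum` and `select_lowest_p`, the candidate `c_alt` and all numeric values are assumptions made for illustration; the actual definitions are those of (15) and the measured values are those in Table 7.

```python
def weighted_sum(weights, values):
    """Scalarize several objectives as sum_i weights[i] * values[i]."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return sum(w * v for w, v in zip(weights, values))


def select_lowest_p(candidates, beta1, beta2):
    """Pick the candidate with the lowest combined cost p.

    candidates maps a name to (eta, fa_total). Both terms are divided by
    their maximum over the candidates (an assumed normalization) so that
    the adder count does not dominate purely because of its scale.
    """
    max_eta = max(eta for eta, _ in candidates.values())
    max_fa = max(fa for _, fa in candidates.values())
    p = {
        name: weighted_sum((beta1, beta2), (eta / max_eta, fa / max_fa))
        for name, (eta, fa) in candidates.items()
    }
    best = min(p, key=p.get)
    return best, p


# Orthogonality-prioritized weights from this section.
lambdas = (0.1, 0.8, 0.1)

# Hypothetical (closeness, orthogonality, norm) error triples, not from Table 7.
eta_local = weighted_sum(lambdas, (0.02, 0.001, 0.01))
eta_alt = weighted_sum(lambdas, (0.02, 0.05, 0.01))

# c_alt saves a few full adders but degrades orthogonality noticeably.
candidates = {"c_local": (eta_local, 500), "c_alt": (eta_alt, 480)}
best, p = select_lowest_p(candidates, beta1=0.2, beta2=0.8)
print(best, p)
```

Under these assumed numbers the candidate with the slightly higher adder count still has the lowest p even with β 2 = 0.8, which mirrors the situation in which c local is retained as c f .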
C. POWER COST EVALUATION WITH VIDEO SAMPLE
In Sections IV.A and IV.B, the solutions with uniform and non-uniform Int-DCT priorities are evaluated on FPGA and ASIC. There, the power cost is estimated by assuming random input samples to the DCT circuits. In this section, we evaluate the power cost (mW) when the proposed 16-point Int-DCT and other competing transforms are used to compress the video sequence RaceHorses.yuv. The compression speed is set at 500 frames per second. The experiment is carried out on a Xilinx Spartan6 xc6slx45 device with a clock frequency of 50 MHz and a supply voltage of 1.2 V. The Int-DCTs generated by the proposed method with both uniform and non-uniform priorities are evaluated. From Table 2, the results of EDCT, RICT and TSDCT are more competitive than those of CT, so we compare our results with the EDCT, RICT and TSDCT solutions in this evaluation. The results are presented in Table 10. Both of our transforms, for uniform and non-uniform priority, have a lower power cost than the competing methods for this video sample, both at maximum frequency and at 50 MHz. On average, the power reductions achieved by the proposed transforms are 15.5%, 31.1% and 9.1% over EDCT, RICT and TSDCT respectively when compressing at maximum frequency. When running at 50 MHz, the reductions are 9.5%, 26.1% and 2.6% respectively over these competing methods.
V. CONCLUSION
A new algorithm to generate efficient Int-DCT matrices is proposed in this paper. The coding performance of the proposed Int-DCT is achieved by increasing the closeness to the original DCT and the orthogonality of the basis vectors using a weighted sum approach. In addition, implementation cost is addressed in the proposed algorithm, so the generated Int-DCT matrices offer a good trade-off between compression efficiency and hardware cost. The proposed algorithm can be applied flexibly to generate Int-DCTs for different compression or hardware constraints by adjusting the weighting factors. The experimental results show that the proposed algorithm can generate Int-DCTs with almost the same coding performance as the transforms in the HEVC reference software HM16.14. Meanwhile, the hardware cost is reduced compared with recent state-of-the-art implementations that provide similar coding performance.
