Abstract-This work analyses the actual throughput of the Discrete Sine Transform (DST) stage in a realistic HEVC encoder, which executes the rate-distortion optimization algorithm to achieve high compression quality. Then, a low complexity DST factorization, where all the integer multiplications are substituted with add-and-shift operations, is exploited to design an efficient 1D-DST core. The proposed 1D-DST core is employed to derive two area efficient architectures, namely Folded and Full-parallel, for computing the 4×4 2D-DST in HEVC. Finally, the proposed 2D-DST architectures are synthesized on a 90-nm standard cell technology to support the actual target throughput required to encode 4K UHD @30fps video sequences, showing better area efficiency with respect to existing DST architectures for HEVC.
I. INTRODUCTION
The latest High Efficiency Video Coding (HEVC) standard is a hybrid block-based video compression scheme, based on motion estimation and transform coding [1]. Its aim is to double the rate-distortion performance compared to its predecessor, the Advanced Video Coding (AVC) standard [2], at the cost of increased complexity [3]. Concerning the transform stage, HEVC adopts both the Discrete Cosine Transform (DCT) and the Discrete Sine Transform (DST) of different lengths [4]. While the DCT has been largely exploited in previous standards to code the inter-predicted blocks, HEVC specifies an integer DST for intra-predicted blocks of size 4×4. The benefits of applying the DST to the residual signal produced by directional intra prediction have already been shown in the literature [5], [6].
Moreover, real-time video compression, especially for ultra high definition (UHD) content (e.g. 4K UHD @30fps), poses severe throughput requirements on the transform operation at the encoder side because of the rate-distortion optimization (RDO) algorithm, as shown in [7]. Therefore, hardware accelerators for computing transforms have been designed in order to meet real-time requirements. However, while several architectures have been recently proposed for the integer DCT specified in HEVC [8]-[10], only a few works address the design of hardware architectures to efficiently compute the integer DST. Edirisuriya et al. [11] exploited the relationships between the DCT of different sizes and the DST to design a multiplication-free architecture, which is able to perform both the DCT and the DST on a block of size up to 16×16 samples. However, the reconfigurability needed to support different transforms is paid for with a large area occupation, due to a high number of hardware resources that are not fully utilized. Nam et al. [12] therefore proposed a hardware architecture to compute the integer 4×4 DST in HEVC, which combines butterfly operations with 4-2 compressors in order to simplify the circuit and to improve the area efficiency.
Stemming from these observations, the first contribution of this work is the analysis of the actual throughput requirements for the design of a DST architecture taking into account the RDO algorithm of a realistic HEVC encoder. Then, the second contribution is to show a novel architecture to compute the integer 4×4 1D-DST. The proposed architecture exploits the factorization in [13] , which reduces the number of hardware resources with respect to the matrix-vector multiplication (MVM). Finally, the 1D-DST core is used to design two 2D-DST HEVC compliant architectures, which outperform existing DST architectures.
The paper is organized as follows. Section II reports the analysis of the actual throughput requirements for the DST design taking into account the RDO algorithm of a realistic HEVC encoder. Then, the proposed low-complexity 1D-DST architecture is described in Section III, while Section IV shows two architectures for the 2D-DST computation in HEVC. Finally, implementation results are presented in Section V, while Section VI concludes the paper.
II. ACTUAL THROUGHPUT ANALYSIS
In order to design a DST module for real-time HEVC encoding, it is important to define the throughput of each processing block. This can be calculated by taking into account the resolution and frame rate of the video sequences to be encoded. However, this approach does not take into account the overhead of computations introduced by the RDO process, which is used to choose the set of coding modes that assures the best quality-compression trade-off. Specifically, in HEVC the encoder has to select one among 35 different intra prediction modes for each intra-predicted block. To perform this operation, the encoder evaluates the sum of squared differences (SSD) cost function between the original block and the reconstructed block after transform and quantization. For these reasons, the DST is applied more than once for each block. Although software profiling [3] provides information about the computational burden of encoding tasks, it is not able to catch the actual throughput requirement. On the other hand, throughput can be calculated by counting the number of transform operations that are computed during the whole encoding process [7] .
978-1-5090-6508-0/17/$31.00 ©2017 IEEE. PRIME 2017, Giardini Naxos-Taormina, Italy. Digital Circuits and Sub-Systems.
In this work, the focus is on the design of the DST module in a specific scenario, which is the HEVC intra encoding of 4K UHD @30fps video sequences. Although an exhaustive test of all coding modes would be optimal for RDO performance, it would not be practical in realistic applications. For this reason, the actual throughput analysis has been carried out by using the x265 encoder [14], which has been configured to encode the sequences using only intra frames. All the video sequences of the SJTU dataset [15] have been coded using four different quantization parameters (QPs), namely 22, 27, 32 and 37. For each case the transform complexity has been calculated as in [7], namely as the ratio between the actual throughput ($T_A = 16 \cdot N_{DST} \cdot F_s / N_f$), which considers the RDO algorithm, and the reference throughput ($T_H = W \cdot H \cdot S_c \cdot F_s$), which is computed assuming that each pixel is transformed only once. The Complexity Index ($C_I$) is calculated by only taking into account the 4×4 DST contribution:
$$C_I = \frac{T_A}{T_H} = \frac{16 \cdot N_{DST}}{W \cdot H \cdot S_c \cdot N_f}$$
where $N_{DST}$ is the count of 4×4 DSTs computed by the x265 encoder, and $W$, $H$, $S_c$, $F_s$ and $N_f$ are the width, the height, the chrominance sub-sampling factor, the frame rate and the number of frames in the video sequence, respectively. The values for the 4K UHD video sequences of the SJTU dataset are $W = 3840$, $H = 2160$, $S_c = 1.5$ (4:2:0 format), $F_s = 30$ and $N_f = 300$. Table I reports the transform complexity index of each sequence for different QPs. As can be observed, the transform complexity varies across different QPs and also depends on the video content. Specifically, it is higher in sequences that show high motion and spatial detail, since small blocks are mainly used to describe non-uniform areas. Because of the large variety of encoding scenarios, the worst case has been considered in this work ($C_I = 3.89$). To leave some margin, $C_I = 4$ has been chosen as the upper bound, which means that the actual target throughput is $T_A = T_H \cdot C_I = 1.493$ Gsps.
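The throughput figures above can be checked with a few lines of Python. This is a minimal sketch of the formulas given in the text; the function names are illustrative, and the DST count used in the last line is a hypothetical value chosen only so that the index equals the upper bound of 4 (it is not a measurement from Table I).

```python
def reference_throughput(width, height, chroma_factor, frame_rate):
    """T_H: samples per second when every sample is transformed exactly once."""
    return width * height * chroma_factor * frame_rate


def actual_throughput(n_dst, frame_rate, n_frames):
    """T_A: samples per second given the count of 4x4 DSTs (16 samples each)."""
    return 16 * n_dst * frame_rate / n_frames


def complexity_index(n_dst, width, height, chroma_factor, n_frames):
    """C_I = T_A / T_H; the frame rate cancels out of the ratio."""
    return 16 * n_dst / (width * height * chroma_factor * n_frames)


# Values stated in the text for the 4K UHD sequences of the SJTU dataset.
W, H, S_C, F_S, N_F = 3840, 2160, 1.5, 30, 300

t_h = reference_throughput(W, H, S_C, F_S)  # 373,248,000 samples/s
t_a_target = 4 * t_h                        # 1,492,992,000 samples/s, about 1.493 Gsps

# Hypothetical DST count (an assumption, not taken from Table I) giving C_I = 4.
HYPOTHETICAL_N_DST = 933_120_000
c_i = complexity_index(HYPOTHETICAL_N_DST, W, H, S_C, N_F)
```

With the stated parameters the reference throughput is about 0.373 Gsps, so the chosen bound of 4 gives the 1.493 Gsps target.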
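III. 1D-DST ARCHITECTURE
Let $x = (x_0, \ldots, x_3)$ and $X = (X_0, \ldots, X_3)$ be the input samples and the output transform coefficients, respectively. The HEVC standard specifies the core 4-point integer DST as $X = S \cdot x$, where $S$ is the 1D transform matrix:
$$S = \begin{bmatrix} 29 & 55 & 74 & 84 \\ 74 & 74 & 0 & -74 \\ 84 & -29 & -74 & 55 \\ 55 & -84 & 74 & -29 \end{bmatrix}$$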
which is derived by upscaling and rounding the coefficients of the 4-point type-VII DST matrix, as shown in [1]. The proposed 1D-DST architecture is based on the factorization suggested in [13], which derives the $N$-order type-VII DST from the imaginary part of the Discrete Fourier Transform on $2N + 1$ points and reduces the computational complexity from 16 multiplications and 12 additions to only 5 multiplications and 11 additions. According to [13], the 4-point HEVC-compliant DST matrix $S$ can be decomposed by means of three sparse matrices as:
$$S = M_3 \cdot M_2 \cdot M_1$$
where 
The proposed 1D-DST architecture is depicted in Fig. 1, where the three main computational blocks, corresponding to matrices $M_1$, $M_2$ and $M_3$, are highlighted. First, the input samples $x$ are combined as described by $M_1$, using simple additions and subtractions, to generate five intermediate results. Then, the intermediate results are multiplied by the constants of $M_2$. Resorting to the RAG-n technique [16], all the multipliers have been simplified to add-and-shift blocks, thus losing flexibility but saving hardware cost. Table II details the arithmetic complexity and the computational depth (expressed in number of cascaded adders/subtracters) of each integer coefficient. It is worth noting that the shifts do not contribute to the overall hardware complexity because they are implemented by simple wiring. Finally, the output coefficients $X$ are calculated by the last stage, which implements $M_3$ by means of adders and subtracters.
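The three-stage structure can be illustrated with a short Python sketch. The factorization below is one possible five-multiplication, eleven-addition scheme derived here from the identity 84 = 29 + 55 in the HEVC matrix. It is an assumption made for illustration and is not necessarily the decomposition of [13] or the constants of Table II; the shift-and-add decompositions are likewise hand-picked, not RAG-n outputs. The matrix `S` is the standard HEVC 4-point DST matrix.

```python
# Standard HEVC 4-point integer DST matrix.
S = [
    [29, 55, 74, 84],
    [74, 74, 0, -74],
    [84, -29, -74, 55],
    [55, -84, 74, -29],
]


def dst4_mvm(x):
    """Reference: direct matrix-vector multiplication X = S * x."""
    return [sum(S[k][j] * x[j] for j in range(4)) for k in range(4)]


# Constant multipliers written as shifts and adds (hand-picked decompositions).
def mul74(v):   # 74 = 64 + 8 + 2
    return (v << 6) + (v << 3) + (v << 1)


def mul55(v):   # 55 = 64 - 8 - 1
    return (v << 6) - (v << 3) - v


def mul26(v):   # 26 = 32 - 8 + 2
    return (v << 5) - (v << 3) + (v << 1)


def mul139(v):  # 139 = 128 + 8 + 2 + 1
    return (v << 7) + (v << 3) + (v << 1) + v


def dst4_fast(x):
    """Illustrative three-stage DST: 5 constant multiplications, 11 additions."""
    x0, x1, x2, x3 = x
    # Stage 1 (role of M1): five intermediate results (s, u, v, x2, w).
    u = x0 + x3
    v = x1 + x3
    w = x0 + x1 - x3
    s = u + v
    # Stage 2 (role of M2): constant multiplications as shift-and-add.
    m1 = mul55(s)
    m2 = mul26(u)
    m3 = mul139(v)
    m4 = mul74(x2)
    m5 = mul74(w)
    # Stage 3 (role of M3): recombination with adders and subtracters.
    p0 = m1 - m2   # 29*u + 55*v
    p3 = m1 - m3   # 55*u - 84*v
    p2 = p0 + p3   # 84*u - 29*v
    return [p0 + m4, m5, p2 - m4, p3 + m4]
```

For any integer input vector, `dst4_fast` returns the same coefficients as `dst4_mvm`, which is the property a hardware factorization has to preserve.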
The internal parallelism has been chosen according to the HEVC specifications [4]. Since the proposed 1D-DST architecture serves as the core processing block for both the Folded and the Full-parallel architectures, shown in Section IV, the four input samples are represented on 16 bits each, whereas the output samples are on 24 bits. Intermediate results after stages $M_1$ and $M_2$ are represented on 18 and 24 bits respectively in order to avoid overflow.
IV. 2D-DST ARCHITECTURES
Thanks to the separability property, the proposed HEVC-compliant 2D-DST architectures rely on the 1D-DST core with no internal pipeline, which is used to perform the transform operation first on the rows and then on the columns of the input block of samples. Two 2D-DST architectures, namely Folded and Full-parallel, have been designed, as in [8] for the 2D-DCT. The former is depicted in Fig. 2a and uses one 1D-DST module to perform both the row-wise and the column-wise transforms, and a transposition buffer to store the intermediate results. The processing rate of this architecture is 2 samples/cycle. On the other hand, the Full-parallel architecture in Fig. 2b is composed of two 1D-DST modules, which double the throughput (4 samples/cycle). The transposition buffers have been implemented by means of a 4×4 array of 16-bit registers plus additional logic to read and write either rows or columns, as in Fig. 5b and 6b of [8]. All the control signals for the Folded and the Full-parallel architectures are provided by two different control units, each composed of a state machine and a counter. In the Folded structure, the multiplexer (MUX) selects either the input rows or the intermediate columns in a time-multiplexed manner.
Moreover, rounding and scaling operations are required to make the 2D architectures compliant with the HEVC standard [4]. Both architectures are fed with 9-bit input samples, which are extended to 16 bits for row-wise computation. Intermediate results, stored in the transposition buffer, and final results are represented on 16 bits as well. The rounding operation is performed by adding $2^{B-1}$, whereas the scaling operation is a right shift of $B$ positions. The value of $B$ is equal to 1 for row-wise computation and to 8 for column-wise computation, as detailed in [4].
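The row-column computation with rounding and scaling can be sketched in Python as follows. This is a behavioural model under stated assumptions, not the hardware description: the function names are illustrative, the rounding offset is assumed to be $2^{B-1}$, and the shift values (1 for rows, 8 for columns) are those given in the text. Bit-width truncation to 16 bits is not modelled.

```python
# Standard HEVC 4-point integer DST matrix.
S = [
    [29, 55, 74, 84],
    [74, 74, 0, -74],
    [84, -29, -74, 55],
    [55, -84, 74, -29],
]


def dst4(x):
    """1D 4-point DST of one row or column: X = S * x."""
    return [sum(S[k][j] * x[j] for j in range(4)) for k in range(4)]


def round_shift(value, b):
    """Add 2**(b-1), then arithmetic right shift by b positions."""
    return (value + (1 << (b - 1))) >> b


def dst4x4(block, b_row=1, b_col=8):
    """Separable 2D DST: rows first, then columns, with rounding after each pass."""
    # Row-wise pass; the results play the role of the transposition buffer content.
    inter = [[round_shift(c, b_row) for c in dst4(row)] for row in block]
    # Read the buffer by columns and run the column-wise pass.
    cols = [list(col) for col in zip(*inter)]
    out_cols = [[round_shift(c, b_col) for c in dst4(col)] for col in cols]
    # Transpose back so that out[k][c] is the coefficient at row k, column c.
    return [list(row) for row in zip(*out_cols)]


# Example: a residual block with a single non-zero sample in the top-left corner.
impulse = [[1, 0, 0, 0]] + [[0, 0, 0, 0] for _ in range(3)]
coeffs = dst4x4(impulse)
```

Python's `>>` on negative integers is an arithmetic shift, which matches the behaviour assumed here for signed hardware operands.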
V. IMPLEMENTATION RESULTS
The proposed architectures have been described in VHDL, verified and synthesized using a 90-nm standard cell library with Synopsys Design Compiler. It is worth noting that the actual implementation of the multiple-operand adders and subtracters required in the 1D-DST core has been left to the synthesis tool. This strategy exploits the ability of Design Compiler to optimize the overall architecture for area and speed by selecting the best arithmetic implementation.
Implementation results of the 1D-DST architecture are reported in Table III, which shows the delay, the gate count, the power consumption and the gates-delay product obtained when synthesizing for the maximum operating frequency. The proposed architecture is compared with the direct implementation of $X = S \cdot x$ as an MVM, where sixteen multipliers and four 4-input adders work concurrently to compute $X$. The MVM-based implementation achieves the minimum delay, since its critical path is composed of only one constant multiplier and one multi-operand adder implementing the sum of products. On the other hand, the proposed architecture occupies a smaller area and features lower power consumption. Moreover, its gates-delay product is about 20% lower than that of the MVM-based one.
As shown in Section II, the actual throughput required for the HEVC intra encoding of 4K UHD @30fps video sequences, considering the RDO algorithm implemented in the x265 model, is 1.493 Gsps. It is worth noting that a throughput of 1.496 Gsps (which is high enough to satisfy the constraint) can be achieved by setting the clock frequency according to the processing rate of each architecture: 1.496 Gsps divided by 2 samples/cycle gives 748 MHz for the Folded architecture, and divided by 4 samples/cycle gives 374 MHz for the Full-parallel one. Table IV summarizes the synthesis results achieved for the Folded and Full-parallel architectures, in terms of technology, operating frequency, supported transform sizes, processing rate, throughput, gate count, throughput-over-gates ratio, power consumption and energy-per-sample. Since the Folded structure uses only one 1D-DST module, it saves about 17% of the gate count with respect to the Full-parallel implementation. On the other hand, it requires double the clock frequency to provide the same throughput, thus leading to higher power consumption. The last two columns of Table IV report the implementation details of the DST architectures proposed in [11], [12] as well. It is worth noting that the values for [11] refer to the implementation with an input word length of 8 bits, and that the throughput has been calculated considering 4×4 DST blocks. Moreover, the solution in [11] incurs a large area overhead due to the reconfigurability needed to support multiple-size transforms, which are not required for HEVC applications. Finally, the proposed architectures outperform the work in [12] in terms of both performance and area efficiency, showing 24% larger throughput and a higher throughput-over-gates ratio. To match the throughput of [12], the Folded and the Full-parallel architectures require only 5632 and 6514 gates respectively.
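The relation between throughput, processing rate and clock frequency used in this section can be made explicit with a small Python sketch. The helper names are illustrative, and the only inputs are the throughput values and the processing rates stated in the text (2 samples/cycle for Folded, 4 for Full-parallel); no value from Table IV is assumed.

```python
def required_clock_hz(throughput_sps, samples_per_cycle):
    """Clock frequency needed to sustain a throughput at a given processing rate."""
    return throughput_sps / samples_per_cycle


def throughput_sps(clock_hz, samples_per_cycle):
    """Throughput obtained at a given clock frequency and processing rate."""
    return clock_hz * samples_per_cycle


PROCESSING_RATE = {"Folded": 2, "Full-parallel": 4}  # samples/cycle, from Section IV

# Minimum clock for the 1.493 Gsps target and clock for the 1.496 Gsps figure.
min_clock = {name: required_clock_hz(1.493e9, rate)
             for name, rate in PROCESSING_RATE.items()}
used_clock = {name: required_clock_hz(1.496e9, rate)
              for name, rate in PROCESSING_RATE.items()}
```

Under these assumptions the Folded architecture needs twice the clock frequency of the Full-parallel one for the same throughput, which is the trade-off against its lower gate count discussed above.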
VI. CONCLUSION
In this work, an analysis of the throughput of the DST stage for a realistic HEVC encoder performing the RDO algorithm has been presented. Then, a hardware architecture for the 4-point 1D-DST, which exploits the factorization shown in [13] to reduce the number of adders and multipliers, has been proposed. Finally, two 2D-DST architectures for HEVC encoding, namely Folded and Full-parallel, have been described and implemented. Synthesis results show that the proposed architectures achieve the actual target throughput with better area efficiency than state-of-the-art implementations.
