This paper describes a unified VLSI architecture that can be applied to the various types of transforms used in MPEG-2/4, H.264, VC-1, AVS and the emerging video coding standard HEVC (High Efficiency Video Coding). A novel design named configurable butterfly array (CBA) is also proposed to support both the forward transform and the inverse transform in this unified architecture. The Hadamard transform and 4/8-point DCT/IDCT are used in traditional video coding standards, while 16/32-point DCT/IDCT are newly introduced in HEVC. The proposed architecture supports all of these transform types. Two levels of hardware sharing (architecture level and block level) are adopted in this design. At the architecture level, the forward transform shares hardware resources with the inverse transform. At the block level, the hardware for a smaller transform is recursively reused by larger transforms. The multiplications of the 4- and 8-point transforms are implemented with multiplierless MCM (Multiple Constant Multiplication). To reduce the hardware overhead, the multiplications of the 16/32-point DCT are implemented with ICM (input-muxed constant multipliers) instead of MCM or regular multipliers. The proposed design is 51% more area efficient than previous work. To the best of the authors' knowledge, this is the first published work to support both forward and inverse 4/8/16/32-point integer transforms for the HEVC standard in a unified architecture.
key words: HEVC, integer DCT/IDCT, Hadamard transform, input-muxed constant multiplier, multi-standard video coding
Introduction
As the video coding technology advances, various types of transforms have been adopted in miscellaneous video coding standards. Discrete Cosine Transform (DCT) is the most popular transform for block based video coding due to its ability to concentrate the energy of video residual data into low frequency domain.
Floating-point 8 × 8 2D DCT is used in video coding standards such as MPEG-1 [1], MPEG-2 [2] and MPEG-4 [3]. Because any actual digital circuit has limited bit precision, the floating-point DCT may cause a data mismatch between encoder and decoder.
To avoid this mismatch, integer 8 × 8 2D DCT is used in later video standards such as H.264 [4], AVS [5] and VC-1 [6]. H.264 and VC-1 also use smaller integer DCTs such as the integer 4 × 4 2D DCT, which can improve coding efficiency for video sequences with complex texture. In addition, the Hadamard transform is used in H.264 to process the DC coefficients of the intra 16 × 16 or chroma prediction modes. The High Efficiency Video Coding (HEVC) standard [7] is an emerging video coding standard jointly developed by two video standardization organizations, MPEG and ITU. It is considered the successor of H.264. According to [8], HEVC can reduce the bit rate by up to 44% compared with H.264 at the same picture quality. To achieve this bit rate reduction, HEVC adopts many new coding tools, including larger integer DCTs such as the 16 × 16 and 32 × 32 2D DCT.
Two different methods can be used to perform the 2D DCT: the direct 2D method [9], [10] and the Row Column Decomposition Method (RCDM). In [9], a direct 2D method is proposed for the floating-point 8 × 8 2D IDCT. A direct 2D method for the 4 × 4 H.264 integer DCT is also proposed in [10]. The direct 2D method has the advantage of higher processing capacity and avoids the use of a transpose memory, but its internal connections and computational logic are quite complex. It is therefore not suitable for larger transforms such as the 16 × 16 or 32 × 32 2D transform. RCDM performs a 1D transform twice in sequence instead of a direct 2D transform, which greatly reduces the hardware cost for larger transforms. The design proposed in this paper also adopts RCDM.
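As a rough illustration of RCDM, the following Python sketch computes a 2D transform as two 1D passes separated by a transpose and compares it with the plain matrix product A·X·A^T. The function names are hypothetical, the 4-point matrix holds the HEVC core values as commonly published (an assumption, since the tables are not reproduced here), and the intermediate scaling and clipping that a real codec applies between the two passes are omitted.

```python
# Row-column decomposition: a 2D transform computed as two 1D passes.
# A4 holds the commonly published HEVC 4-point core values (assumption).
A4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def transform_1d(matrix, vector):
    """Y = A * X for one input vector X."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def transpose(block):
    """Models the transpose memory between the two 1D passes."""
    return [list(col) for col in zip(*block)]


def transform_2d_rcdm(matrix, block):
    """1D transform on every row, transpose, 1D transform again."""
    stage1 = [transform_1d(matrix, row) for row in block]
    stage2 = [transform_1d(matrix, row) for row in transpose(stage1)]
    return transpose(stage2)


def transform_2d_direct(matrix, block):
    """Reference result: A * X * A^T as plain matrix products."""
    n = len(matrix)
    ax = [[sum(matrix[i][k] * block[k][j] for k in range(n))
           for j in range(n)] for i in range(n)]
    return [[sum(ax[i][k] * matrix[j][k] for k in range(n))
             for j in range(n)] for i in range(n)]


residual = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(transform_2d_rcdm(A4, residual) == transform_2d_direct(A4, residual))
```

Only the 1D routine and a transpose are needed, which is why a single 1D datapath can serve both passes.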
Various VLSI architectures based on RCDM have been proposed for multiple-standard 1D transforms [11]-[17]. A low-cost hardware-sharing architecture for the 1D inverse transform is proposed for H.264 and AVS in [11], where an offset matrix is exploited to reduce computational complexity. [12] extends this idea to support more standards such as VC-1 and MPEG-2/4, but the offset matrix becomes too complex if more standards are to be supported. In [13] and [14], a flexible architecture is proposed for multi-standard transforms. However, most previous works [11]-[14] can only support 4/8-point 1D transforms. In [15], an architecture based on regular multipliers is proposed to support 4/8/16/32-point IDCT. A regular multiplier is much larger than a constant multiplier in terms of silicon area. A VLSI architecture for the HEVC 16/32-point IDCT is proposed in [16]. The fast algorithm proposed in [16] is based on the obsolete Working Draft 2 of HEVC and can no longer be applied to the latest HEVC standard. A multiplierless design is proposed for the HEVC 16-point DCT in [17]. It is optimized for the 16-point DCT only and cannot support other transform sizes.
In order to address the above problems, we propose a unified VLSI architecture for various types of transforms. It supports both existing video coding standards such as H.264, AVS, MPEG-2/4 and VC-1 and the emerging HEVC standard. A novel design named configurable butterfly array (CBA) is also proposed to support both the forward transform and the inverse transform in this unified architecture.
Copyright © 2013 The Institute of Electronics, Information and Communication Engineers
Hardware resource sharing in this design is carried out at two different levels: the architecture level and the block level. At the top architecture level, the forward and inverse transforms share the same hardware blocks, such as the multiplication blocks and the adder trees. The CBA block can be configured to support either the forward transform or the inverse transform.
Hardware sharing is also carried out inside the multiplication and adder tree blocks. The hardware for a smaller transform can be reused for a larger transform. A hardware sharing technique called "input-muxed constant multiplier" is used to implement the multiplication circuits of the 16/32-point DCT. It reduces the hardware cost significantly in comparison with the multiplication blocks used in [15]. This architecture is also flexible enough to support future video standards as long as a DCT-like transform matrix is used.
The rest of this paper is organized as follows. In Sect. 2, the idea and the fast computational algorithm of the integer DCT/IDCT are reviewed. The Hadamard transform is also introduced in this section. Section 3 presents the proposed VLSI architecture of the unified 1D forward/inverse transform. The results of the VLSI implementation and comparisons with previous designs are shown in Sect. 4. A conclusion is drawn in Sect. 5.
Reviews of 1D DCT/IDCT Transforms
1D integer DCT/IDCT has been widely used in the latest video coding standards such as H.264/AVC, AVS, VC-1 and HEVC. The 1D integer DCT can be defined as:
\[ Y = A_N \cdot X \tag{1} \]
where X is the input signal, Y is the transform result and A_N is the N × N integer transform matrix defined by each video standard. The 1D integer IDCT can be defined by a similar equation:
\[ X = A_N^{T} \cdot Y \tag{2} \]
In HEVC, the size of the transform matrix can be 4 × 4, 8 × 8, 16 × 16 or 32 × 32. Many fast computational algorithms have been proposed for the floating-point DCT [18]-[20]. Matrix factorization is the core idea of these fast algorithms, and the fast algorithm proposed by Chen [18] is a fundamental work. The N × N transform matrix A_N can be decomposed in a recursive form, which is shown in Eq. (3). Here P_N is the permutation matrix and B_N is the butterfly operation. The transform matrix A_N is divided into an even part matrix (A_{N/2}) and an odd part matrix (R_{N/2}), and matrix A_{N/2} can be further divided in the same fashion. Matrix R_{N/2} can also be factorized and decomposed into several matrices, as shown in [18].
\[ A_N = P_N \begin{bmatrix} A_{N/2} & 0 \\ 0 & R_{N/2} \end{bmatrix} B_N \tag{3} \]
where P_N (N = 4, 8, 16 or 32) is the permutation matrix, which is used to permute the output vector Y. Matrix P_N is defined in Eq. (4) and the butterfly matrix B_N in Eq. (5).
The integer transform matrix used in H.264, VC-1, AVS or HEVC can reduce computational complexity. But the integer transform matrix is no longer orthogonal due to the integer approximation of the transform coefficients. So the fast algorithms for floating-point DCT as mentioned above cannot be applied without modification.
After a thorough inspection of the integer transform matrices used in AVS, VC-1, H.264 and HEVC, we can see that Eq. (3) can still be applied to the integer transform because the even rows of each matrix are symmetric and the odd rows are antisymmetric. The integer transform matrix can therefore be recursively divided into smaller matrices, and the permutation and butterfly operations remain the same. The only exception is that the generalized method of decomposing the odd part matrix (R_{N/2}) can no longer be used for the integer transform.
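The recursive even/odd split can be checked numerically. The sketch below rebuilds the 4-point and 8-point matrices from their even part, odd part, a butterfly matrix and a permutation matrix. The coefficient values are the HEVC core values as commonly published, and the particular conventions chosen for the butterfly (sums in the top half, differences in the bottom half) and for the permutation (interleaving even and odd outputs) are assumptions made for this illustration; they may differ in ordering or sign from the paper's own Eqs. (4) and (5).

```python
# Numerical check of the even/odd decomposition for integer matrices.
A2 = [[64, 64], [64, -64]]
R2 = [[83, 36], [36, -83]]
A4 = [[64, 64, 64, 64], [83, 36, -36, -83],
      [64, -64, -64, 64], [36, -83, 83, -36]]
R4 = [[89, 75, 50, 18], [75, -18, -89, -50],
      [50, -89, 18, 75], [18, -50, 75, -89]]
A8 = [
    [64, 64, 64, 64, 64, 64, 64, 64],
    [89, 75, 50, 18, -18, -50, -75, -89],
    [83, 36, -36, -83, -83, -36, 36, 83],
    [75, -18, -89, -50, 50, 89, 18, -75],
    [64, -64, -64, 64, 64, -64, -64, 64],
    [50, -89, 18, 75, -75, -18, 89, -50],
    [36, -83, 83, -36, -36, 83, -83, 36],
    [18, -50, 75, -89, 89, -75, 50, -18],
]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def butterfly_matrix(n):
    """Top half: x[i] + x[n-1-i]; bottom half: x[i] - x[n-1-i]."""
    h = n // 2
    b = [[0] * n for _ in range(n)]
    for i in range(h):
        b[i][i] = 1
        b[i][n - 1 - i] = 1
        b[h + i][i] = 1
        b[h + i][n - 1 - i] = -1
    return b


def block_diag(even, odd):
    h = len(even)
    out = [[0] * (2 * h) for _ in range(2 * h)]
    for i in range(h):
        for j in range(h):
            out[i][j] = even[i][j]
            out[h + i][h + j] = odd[i][j]
    return out


def permutation_matrix(n):
    """Even-part outputs go to even rows, odd-part outputs to odd rows."""
    h = n // 2
    p = [[0] * n for _ in range(n)]
    for k in range(h):
        p[2 * k][k] = 1
        p[2 * k + 1][h + k] = 1
    return p


def compose(even, odd):
    """P_N * diag(A_{N/2}, R_{N/2}) * B_N."""
    n = 2 * len(even)
    inner = matmul(block_diag(even, odd), butterfly_matrix(n))
    return matmul(permutation_matrix(n), inner)


print(compose(A2, R2) == A4, compose(A4, R4) == A8)
```

The check works only because every even row is symmetric and every odd row is antisymmetric, which is the property the text relies on.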
The 4 × 4 integer transform matrix for HEVC is given in Eq. (6).
The transform matrices of size 8 × 8, 16 × 16 and 32 × 32 can be expressed as Eqs. (7), (8) and (9), respectively.
The odd part matrices R_4, R_8 and R_16 are shown in Eqs. (10), (11) and (12). c00, c01, ..., c30, c31 are the 32 transform coefficients defined by each video coding standard. The exact values of these 32 transform coefficients are shown in Table 1 and Table 2.
The 4 × 4 Hadamard transform is used in H.264 to further improve the coding efficiency for luma or chroma DC coefficients. It can be regarded as a special integer DCT in which all the transform coefficients are either 1 or −1, so it is quite straightforward to integrate the Hadamard transform with the DCT. The transform coefficients of the 4 × 4 Hadamard transform are also shown in Table 1.
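A small sketch of why the Hadamard transform fits the same datapath: one 4-point routine with a butterfly followed by even/odd multiplications covers several transforms once the coefficients are made parameters. The function name and the coefficient triples below are illustrative assumptions based on commonly published values; Table 1 remains the authoritative source for each standard.

```python
# One 4-point datapath, three transforms: only the coefficients change.
def transform4(x, a, b, c):
    """Rows of the matrix: [a a a a], [b c -c -b], [a -a -a a], [c -b b -c]."""
    s0, s1 = x[0] + x[3], x[1] + x[2]   # butterfly sums feed the even part
    d0, d1 = x[0] - x[3], x[1] - x[2]   # butterfly differences feed the odd part
    return [
        a * (s0 + s1),
        b * d0 + c * d1,
        a * (s0 - s1),
        c * d0 - b * d1,
    ]


# (a, b, c) triples; assumed values, to be checked against Table 1.
COEFFS = {
    "hadamard": (1, 1, 1),
    "h264": (1, 2, 1),
    "hevc": (64, 83, 36),
}

for name, (a, b, c) in COEFFS.items():
    print(name, transform4([1, 2, 3, 4], a, b, c))
```

With all three coefficients set to 1 the multipliers degenerate to wires, which is consistent with treating the Hadamard transform as a special case of the integer DCT.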
VLSI Architecture for Unified 1D Forward/Inverse Transform
In this section, the unified VLSI architecture for the 1D forward/inverse transform is described in more detail. This architecture supports all of the transform types considered here: 8-point floating-point DCT/IDCT, 4/8/16/32-point integer DCT/IDCT and the 4-point Hadamard transform.
Top Architecture and Pipeline Design
The top level architecture and pipeline design of the forward/inverse transform are shown in Fig. 1. When the transform size is set as N and N is less than 32, only the first N input signals are used and only the first N output signals are valid. Forward/inverse transforms with different transform sizes (4/8/16/32-point) can be supported by setting the input ports (BP32, BP16, BP8 and type) to the proper values. The exact setting is described in more detail in Sect. 3.2.
The hardware modules used in Fig. 1 can be categorized into three types: (a) configurable butterfly array (CBA). This block is used to perform the butterfly operation which is defined in Eq. (5). It can be configured to support different transform type or transform size. This block will be described in Sect. 3.2 with more details. (b) The multiplication blocks. Four multiplication blocks named Multi0, Multi1, Multi2 and Multi3 are used in this proposed design. (c) Adder tree blocks. Four adder tree blocks are used to sum up the outputs of the corresponding multiplication blocks.
When this design is configured to support 4-point transform, only Multi0 and adder tree 0 are used and other pipeline stages are bypassed. Multi0/1 and adder tree 0/1 are used to calculate 8-point transform. Multi0/1/2 and adder tree 0/1/2 are used to calculate 16-point transform.
As shown in Fig. 1, the multiplication blocks and adder tree blocks used for the 4/8/16/32-point transforms are marked with dotted, dot-dashed, dashed and solid lines, respectively. The CBA block is also shared by the 4/8/16/32-point transforms, so hardware sharing among different transform sizes is achieved. In order to reach a higher working frequency, the proposed design is divided into 6 pipeline stages. Three butterfly operations (defined by the butterfly matrices B_8, B_16 and B_32) are performed in the configurable butterfly array. Each butterfly operation is arranged as one pipeline stage, so three pipeline stages in total are used for the configurable butterfly array. The multiplication blocks (Multi0, Multi1, Multi2 and Multi3) are arranged as the 4th pipeline stage. Adder Tree0 and Adder Tree1 are arranged in the 5th pipeline stage. The computational complexity of Adder Tree2 and Adder Tree3 is much higher than that of Adder Tree0 and Adder Tree1, so Adder Tree2 and Adder Tree3 are each divided into two separate pipeline stages: the 5th and the 6th stage. The internal details of Adder Tree2 and Adder Tree3 are introduced in Sect. 3.3.
VLSI Implementation of Configurable Butterfly Array
The butterfly operation is required by both the forward and the inverse transform. Three butterfly matrices are used for a 32-point transform: B_8, B_16 and B_32. These three matrices are defined by Eq. (5). In the case of the 32-point forward transform, the 32-point butterfly operation (B_32) is performed first, followed by the 16-point butterfly operation (B_16), and the 8-point butterfly operation (B_8) is performed last. In the case of the 32-point inverse transform, the order is reversed: the 8-point butterfly operation (B_8) is performed first, followed by the 16-point butterfly operation (B_16), and the 32-point butterfly operation (B_32) is performed last.
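The opposite butterfly orders of the forward and inverse transforms can be reproduced with a short software model. The sketch below is not the CBA hardware; it is a recursive even/odd evaluation that records the order in which the butterflies occur. The coefficient list is the first column of the HEVC 32-point core matrix as commonly published (an assumption), the matrix generator uses cosine symmetries to place signs, and treating the 4-point case as a direct matrix product is a modelling choice.

```python
# Assumed first column of the HEVC 32-point core matrix (index = row).
C32 = [64, 90, 90, 90, 89, 88, 87, 85, 83, 82, 80, 78, 75, 73, 70, 67,
       64, 61, 57, 54, 50, 46, 43, 38, 36, 31, 25, 22, 18, 13, 9, 4]


def core_matrix(n):
    """n-point matrix (n = 4, 8, 16, 32) built from cosine symmetries."""
    step = 32 // n
    rows = []
    for k in range(n):
        row = []
        for j in range(n):
            m = ((2 * j + 1) * k * step) % 128   # angle in units of pi/64
            if m > 64:
                m = 128 - m                      # cos(2*pi - t) = cos(t)
            sign = 1
            if m > 32:
                sign, m = -1, 64 - m             # cos(pi - t) = -cos(t)
            row.append(sign * C32[m])
        rows.append(row)
    return rows


def forward(x, stages):
    """Y = A_N * X: butterfly first, then even and odd parts."""
    n = len(x)
    a = core_matrix(n)
    if n == 4:
        return [sum(c * v for c, v in zip(row, x)) for row in a]
    h = n // 2
    stages.append(f"B{n}")
    sums = [x[i] + x[n - 1 - i] for i in range(h)]
    diffs = [x[i] - x[n - 1 - i] for i in range(h)]
    y = [0] * n
    y[0::2] = forward(sums, stages)              # even part, A_{N/2}
    y[1::2] = [sum(a[2 * k + 1][j] * diffs[j] for j in range(h))
               for k in range(h)]                # odd part, R_{N/2}
    return y


def inverse(y, stages):
    """X = A_N^T * Y: even and odd parts first, butterfly last."""
    n = len(y)
    a = core_matrix(n)
    if n == 4:
        return [sum(a[k][j] * y[k] for k in range(n)) for j in range(n)]
    h = n // 2
    even = inverse(y[0::2], stages)
    odd = [sum(a[2 * k + 1][j] * y[2 * k + 1] for k in range(h))
           for j in range(h)]
    stages.append(f"B{n}")
    x = [0] * n
    for j in range(h):
        x[j] = even[j] + odd[j]
        x[n - 1 - j] = even[j] - odd[j]
    return x


fwd_order, inv_order = [], []
coeffs = forward(list(range(1, 33)), fwd_order)
inverse(coeffs, inv_order)
print(fwd_order, inv_order)
```

In this model the forward pass logs B32, B16, B8 and the inverse pass logs B8, B16, B32, which is the ordering a single configurable array has to accommodate.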
A flexible architecture named configurable butterfly array is proposed to support the butterfly operation required by both the forward transform and the inverse transform. The detailed diagram of CBA is depicted in Fig. 2. The input ports (BP32, BP16 and BP8) are used to configure the transform size. The input port "type" is used to configure the transform type. When "type" is set as 0, the system in Fig. 1 is configured to support the forward transform and the outputs of the adder tree blocks (bus signal F in Fig. 1) are selected as the final transform results. BF0-BF3 are configured to perform the 32-point butterfly operation and BF6 is used to perform the 8-point butterfly operation.
When "type" is set as 1, the system in Fig. 1 is configured to support the inverse transform. The outputs of the configurable butterfly array (bus signal D in Fig. 1) are selected as the final transform results. BF0 is configured to perform the 8-point butterfly operation and BF6-BF9 are configured to perform the 32-point butterfly operation. BF4 and BF5 are used to perform the 16-point butterfly operation.
The internal architecture of the BF block is shown in Fig. 3. There are six data input ports (0-5) and two output ports (6-7) in the BF block. Input port "type" is used to configure the transform type (forward or inverse transform). Input port "BP" is used to support the bypass function. Each BF block can perform an 8-point butterfly operation. When the input "BP" is set as 1, the 8-point butterfly block is bypassed and the input ports 0/1 are passed through as the output signals. Otherwise the 8-point butterfly operation is performed for either the forward transform or the inverse transform according to the input signal "type".
Signals A, B and C in Fig. 2 represent the intermediate bus signals. To simplify the diagram in Fig. 2, each of these bus signals is divided into 8 groups, named A0-A7, B0-B7 and C0-C7. For example, the input bus signal X in Fig. 2 actually consists of 32 separate input signals (x_0, x_1, ..., x_31). If each input signal is assumed to be 16-bit and the design is configured to support the forward transform, A0 consists of four input signals (x_0, x_1, x_2, x_3), so A0 is a 64-bit bus signal. The bus width of A1-A7 is the same as that of A0. Bus signals B and C are divided into 8 groups in a similar way.
VLSI Implementation for Multiplication and Add Tree Blocks
Previous work [13] has shown that the multiplication and adder tree circuits account for more than 80% of the whole hardware area. In most previous works [11]-[14], [21] on the integer transform, multiplication is performed by multiplierless Multiple Constant Multiplication (MCM). MCM requires fewer hardware resources than regular multipliers when the transform size is small. The experimental results in Table 4 show that MCM is more area efficient for the 4 × 4 or 8 × 8 transform block. Therefore the multiplication blocks Multi0 and Multi1 in the proposed design are implemented with the MCM approach. The coefficients used in Multi0 and Multi1 are defined by Eqs. (6) and (10). The internal design of Adder Tree blocks 0/1 is the same as the one used in [15]. However, the number of constant multipliers in an MCM block increases exponentially as the transform size increases, so a direct VLSI implementation of such an MCM block becomes unaffordable for large transforms.
The MCM based multiplication block Multi2 is shown in Fig. 4(a) as an example. There are 8 input data (E2, E3) of Multi2. Each input data will be multiplied with 8 different transform coefficients (c02, c06, c10, c14, c18, c22, c26, c30). So the number of multiplications in MCM block Multi2 is 64.
When the 1D transform is done in multiple clock cycles, a hardware sharing technique can be used to reduce the number of constant multipliers. The more clock cycles are used, the fewer multipliers and adder trees are needed. Such an approach is adopted in [15] to reduce the hardware cost. The multiplication blocks Multi2 and Multi3 in [15] are implemented with regular multipliers, as shown in Fig. 4(b). One operand of each multiplier is the corresponding transform coefficient and the other is the input signal. The transform coefficients are stored in SRAM and are read out while the transform is in progress. Compared with the MCM approach in Fig. 4(a), the number of multipliers used in Fig. 4(b) is reduced from 64 to 16.
A novel architecture named input-muxed constant multiplier (ICM) is proposed in this paper to further reduce the hardware cost of multiplication block. The diagram of this architecture is shown in Fig. 4(c) . The number of multipliers used in ICM approach is still 16. But each multiplier is a constant multiplier instead of a regular multiplier. At each clock cycle, one input signal among the eight input signals (d8-d15) is selected and then multiplied by the corresponding transform coefficients. The transform coefficients used in ICM can be either positive or negative, which is decided by the control logic of the multiplication block.
Here we denote the 8 constant multipliers (C02, C06, C10, C14, C18, C22, C26, C30) as one set of constant multipliers, which is marked with a dotted line in Fig. 4(c). If the multiplication in Multi2 is done in one cycle, 8 sets of constant multipliers are needed. If the multiplication is done in 8 cycles, only one set of constant multipliers is needed. The more cycles are used for multiplication, the fewer multipliers are required.
The minimum throughput of this design is set as 4 pixels per cycle, so the 16-point DCT needs to be done in 4 cycles. Two sets of constant multipliers are used in the Multi2 block, as shown in Fig. 4(c). Two outputs of adder tree 2 can be generated in each cycle, so all 8 outputs of adder tree 2 can be generated in 4 cycles. The number of adders in adder tree 2 of Fig. 4(a) is 48. This number is reduced from 48 to 14 for the adder tree 2 block in Fig. 4(b) or Fig. 4(c).
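The following behavioural sketch illustrates the input-muxed idea for the odd part of a 16-point transform. It is a software model under stated assumptions, not the circuit of Fig. 4(c): the eight magnitudes are the HEVC odd coefficients of the 16-point matrix as commonly published and stand in for c02, c06, ..., c30; the mux table is derived from the matrix itself; and the assignment of outputs to cycles is arbitrary. The model relies on each row of the odd matrix being a signed permutation of the same eight magnitudes, so each constant multiplier only needs to know which input to select and with which sign.

```python
# Assumed magnitudes of the eight Multi2 constants (HEVC 16-point odd part).
MAGNITUDES = [90, 87, 80, 70, 57, 43, 25, 9]


def odd_matrix_16():
    """8 x 8 odd part of the 16-point matrix, signs from cosine symmetries."""
    rows = []
    for k in range(8):
        row = []
        for j in range(8):
            m = ((2 * j + 1) * (2 * k + 1)) % 64   # angle in units of pi/32
            if m > 32:
                m = 64 - m
            sign = 1
            if m > 16:
                sign, m = -1, 32 - m
            row.append(sign * MAGNITUDES[(m - 1) // 2])
        rows.append(row)
    return rows


def build_mux_table(matrix):
    """Per output: (input index, sign) selected by each constant multiplier."""
    table = []
    for row in matrix:
        select = [None] * len(MAGNITUDES)
        for j, coeff in enumerate(row):
            select[MAGNITUDES.index(abs(coeff))] = (j, 1 if coeff > 0 else -1)
        table.append(select)
    return table


def icm_multi2(d, sets=2):
    """Odd-part outputs computed with `sets` sets of 8 constant multipliers."""
    table = build_mux_table(odd_matrix_16())
    outputs = [0] * 8
    cycles = 0
    for first in range(0, 8, sets):              # one loop pass = one cycle
        for k in range(first, first + sets):     # each set yields one output
            products = [sign * MAGNITUDES[i] * d[j]
                        for i, (j, sign) in enumerate(table[k])]
            outputs[k] = sum(products)           # adder tree: 8 products
        cycles += 1
    return outputs, cycles


d = [3, -1, 4, 1, -5, 9, 2, -6]
outs, cycles = icm_multi2(d)
reference = [sum(c * v for c, v in zip(row, d)) for row in odd_matrix_16()]
print(outs == reference, cycles)
```

With two sets the model needs four passes for the eight outputs, matching the four-cycle schedule described above; the constants never change, only the mux selection and the sign do.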
The same hardware sharing technique can also be applied to Multi3. As shown in Fig. 5, two sets of constant multipliers are used in Multi3 and each set consists of 16 constant multipliers (C01, C03, ..., C29, C31). All 16 outputs of adder tree 3 can be generated in 8 cycles. Four adder tree blocks are used in adder tree 3, and the internal detail of each adder tree is the same as the one marked with a dashed line in Fig. 4(c).
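The trade-off between clock cycles and constant multipliers can be written as a few lines of arithmetic. The helper below is hypothetical and only restates the counts given in the text: one set of constant multipliers produces one output per cycle, so the number of sets is the number of outputs divided by the number of cycles.

```python
def constant_multipliers(coeffs_per_set, outputs, cycles):
    """Return (sets, multipliers) when `outputs` results take `cycles` cycles."""
    if outputs % cycles:
        raise ValueError("cycles must divide the number of outputs")
    sets = outputs // cycles
    return sets, sets * coeffs_per_set


# Multi2: 8 coefficients per set, 8 odd outputs of the 16-point transform.
print(constant_multipliers(8, 8, 1))    # single-cycle MCM case
print(constant_multipliers(8, 8, 8))    # fully serial case
print(constant_multipliers(8, 8, 4))    # 4-cycle schedule of Multi2
# Multi3: 16 coefficients per set, 16 odd outputs of the 32-point transform.
print(constant_multipliers(16, 16, 8))  # 8-cycle schedule of Multi3
```

The first three calls reproduce the 64, 8 and 16 multipliers discussed for Multi2, and the last one gives the two sets of 16 used in Multi3.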
Experimental Results
Verilog HDL is used to implement the proposed design. Synthesized with the SMIC 0.13 µm standard cell library, this design can work at 191 MHz and the gate count at this working frequency is 54.1 K. The hardware cost of each transform is shown in Table 3. The three approaches shown in Fig. 4 are used to implement the multiplication blocks Multi0, Multi1, Multi2 and Multi3, and the gate count of each block is shown in Table 4. ICM is not applied to Multi0 and Multi1 due to the design requirement, so the gate count of Multi0 or Multi1 with the ICM approach is not available in Table 4. It can be seen from Table 4 that the ICM based multiplication block is much more area efficient than the MCM or the regular multiplier based approach. Compared with the regular multiplier based design in [15], the silicon area of Multi2 or Multi3 is reduced by more than 50% in the proposed architecture.
The precision of each input signal can be described as N-bit. Each butterfly operation increases the precision by one bit, so the CBA block in Fig. 1 increases the precision by 3 bits. According to Table 1 and Table 2, the maximum transform coefficients used in Multi0/Multi1/Multi2/Multi3 are 473, 502, 90 and 90. The bit precisions increased by these four multiplication blocks are 9/9/7/7. The bit precisions increased by adder tree blocks 0/1/2/3 are 2/2/3/4. The width of the bus signal F0 in Fig. 1 is therefore up to N + 14 bits (3 + 9 + 2). In this proposed design, the width of the input signal is set as 16-bit, so the transform results can be up to 30-bit (30 = 16 + 14). Some video coding standards such as AVS, VC-1, H.264 and HEVC use integer transforms, for which the bit precision of the 1D integer transform result is limited to no more than 16-bit. Before the column transform starts, the results of the row transform should therefore be clipped to 16-bit. The clipping module is quite simple and straightforward, so it is not shown in Fig. 1. IEEE standard 1180-1990 also defines the required bit precision for the floating-point 8 × 8 transform. The 16-bit input signals and 10-bit transform coefficients in this proposed design can meet the constraint of the IEEE standard.
Table 4 Gate count of multiplication block.
If two of the proposed 1D transform architectures are used to perform the row and column transforms in a pipelined fashion, the working frequency needed to support a specific video sequence can be calculated by Eq. (13):
where W × H is the resolution of video sequence. Format is set as 1.
The comparison between previous work and this work is summarized in Table 5. The designs proposed in [13], [14], [21], [22] can only support 4- or 8-point transforms. The work in [17] can only support the 16-point HEVC DCT and the work in [15] can only support the 4/8/16/32-point IDCT. The proposed design supports both the forward and inverse 4/8/16/32-point transforms in a unified architecture, and it is still 51% more efficient than [15] in terms of silicon area. In addition, on-chip SRAM is used in [15] while the proposed design does not require any on-chip SRAM, so the hardware cost is further reduced.
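The bit-growth figures quoted earlier in this section (3 bits for the CBA, 9/9/7/7 for the multiplication blocks, 2/2/3/4 for the adder trees, and a 30-bit worst case for 16-bit inputs) can be reproduced with a short calculation. In the sketch below the maximum coefficients come from the text, while the number of products summed by each adder tree (4, 4, 8 and 16) is an assumption inferred from the stated adder tree growth; all names are hypothetical.

```python
INPUT_BITS = 16
CBA_GROWTH = 3   # one extra bit per butterfly stage (B32, B16, B8)

# path: (largest coefficient magnitude, products summed by the adder tree)
PATHS = {
    "Multi0 + adder tree 0": (473, 4),
    "Multi1 + adder tree 1": (502, 4),
    "Multi2 + adder tree 2": (90, 8),
    "Multi3 + adder tree 3": (90, 16),
}


def path_growth(max_coeff, terms):
    """Extra bits added by the CBA, one multiplication and one adder tree."""
    mult_bits = max_coeff.bit_length()      # bits needed for the magnitude
    tree_bits = (terms - 1).bit_length()    # ceil(log2(terms))
    return CBA_GROWTH + mult_bits + tree_bits


def widest_result_bits():
    return INPUT_BITS + max(path_growth(c, t) for c, t in PATHS.values())


for name, (coeff, terms) in PATHS.items():
    print(name, path_growth(coeff, terms))
print("widest result:", widest_result_bits(), "bits")
```

The widest path grows by 14 bits, which gives the 30-bit result mentioned in the text and shows why the row-transform output is clipped to 16-bit before the column transform.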
Conclusion
To the best of the authors' knowledge, this is the first published work to support both forward and inverse 4/8/16/32-point integer transforms for the HEVC standard in a unified architecture. It supports multiple video standards such as MPEG-2/4, VC-1, H.264, AVS and HEVC. A novel design named configurable butterfly array (CBA) can be configured to support either the forward or the inverse transform, as well as different transform sizes. The hardware for a smaller integer transform is recursively reused for larger integer transforms, and the Hadamard transform shares hardware resources with the DCT/IDCT. In order to reduce the hardware cost, the multiplication of the 16/32-point transform is implemented with ICM. This 6-stage pipelined architecture can support 4K × 2K video at 30 fps (4:2:0 YUV format) at a working frequency of 191 MHz. The gate count at this working frequency is 54.1 K. Our design is 51% more area efficient than the previous work in [15].
