In this paper, we present an area-efficient 4/8/16/32-point inverse discrete cosine transform (IDCT) architecture for an HEVC decoder. Compared with previous work, this work reduces the hardware cost in two ways. First, we reduce the logic cost of the 1D IDCT by proposing a reordered parallel-in serial-out (RPISO) scheme, which reduces the calculations required for the butterfly inputs in each cycle. Second, we reduce the area of the transpose architecture by proposing a cyclic data mapping scheme that achieves 100% I/O utilization of each SRAM. To design a fully pipelined 2D IDCT architecture, we propose a pipelining schedule for the row and column transforms. The results show that the area of the logical IDCT part, normalized by maximum throughput, is reduced by 25%, and the memory area is reduced by 62%. The maximum throughput reaches 1248 Mpixels/s, which can support real-time decoding of a 4K × 2K 60 fps video sequence.
Introduction
High Efficiency Video Coding (HEVC) [1] is a new video compression standard that has been approved as an ITU-T standard. HEVC aims to double the compression ratio compared with H.264. To achieve this target, many new coding tools have been used [2], [3]. Since the development of H.264, the discrete cosine transform (DCT) has been widely employed for the transform operation because it provides energy compaction. To reduce the computational complexity and solve the mismatch problem between the forward and inverse transforms, an integer DCT has been adopted instead of a floating-point DCT. The largest DCT size provided in H.264 is 8 × 8, while HEVC also provides larger DCTs, including 16 × 16 and 32 × 32, since a larger transform size may enable higher compression ratios.
For H.264, several area-efficient inverse DCT (IDCT) architectures have been presented to reduce the complexity of the transform. In [4], the authors proposed an architecture for a multi-standard inverse transform; two circuit-sharing strategies, factor sharing and adder sharing, were applied to reduce the required circuit resources. A low-cost hardware-sharing architecture was proposed in [5] by adding offset computations and a pipelined design. In [6], [...] multipliers in the matrix calculation blocks. In addition, an adder-sharing strategy was adopted to save circuit area. A high-performance inverse transform circuit was also developed in [7] based on the butterfly architecture; this work shares resources efficiently by exploiting the similarities between transforms. In [8], extensive mathematical analysis and decomposition were performed for the H.264 transform and quantization so that all multiplication and division operations were avoided. The above low-cost architectures were all designed for the 4/8-point IDCT in H.264, and they cannot be used directly in HEVC because of the differences in transform size.
The large transforms in HEVC cause significant complexity problems from two perspectives. One is the logical computation part of the transform: a larger transform requires more multipliers and adders, which increases the required hardware resources. The other is the transpose memory, which must store more intermediate results, so the transpose architecture consumes more area. To address these two problems, several low-cost architectures have recently been reported for the HEVC transform.
To reduce the hardware cost of the logical IDCT computation, [9] utilizes Chen's algorithm [19] to reuse the small transform in the larger transform architecture; the authors used multiple constant multiplication (MCM) for the 4/8-point IDCT and shared regular multipliers for the 16/32-point IDCT. In [10], the multiplications of the 16/32-point IDCT are implemented with input-muxed constant multipliers. Still based on Chen's algorithm, a fully pipelined transform architecture was proposed for an HEVC codec in [11]. The authors in [12] proposed a 16/32-point IDCT architecture that has no multiplications in the design. The similarity property of the transform matrices was utilized in [13], so that a calculation unit can be shared among different transform modes. In [14], the authors presented a unified forward/inverse transform architecture for HEVC; the unified architecture makes use of the symmetrical properties of the HEVC transform matrices to achieve hardware sharing. In [15], the authors turned the multiplications by constants into shift and sum operations to achieve a multiplierless HEVC transform architecture. In [16], the N-point 1D DCT was performed recursively using an (N/2)-point 1D DCT unit and a constant matrix multiplication. [17] developed two efficient DCT architectures for HEVC. One is based on Chen's algorithm and implemented without multipliers; the other uses a pruning scheme that truncates some transformed bits in order to reduce the complexity. The pruned architecture cannot be used in the decoder since the transformed bits are truncated.
Copyright © 2014 The Institute of Electronics, Information and Communication Engineers
To reduce the area of the transpose memory, some previous works use SRAM instead of registers. [9] used four single-port 8 × 512-bit SRAMs, and [11] used 32 dual-port 32 × 16-bit SRAMs. The total SRAM capacity is 16384 bits in both designs. In the transpose memory designs of [9] and [11], the data width of the SRAM is larger than the data parallelism of the logical IDCT, so the I/O utilization of each SRAM cannot reach 100%.
The main contributions and new features of this paper are listed as follows:
1) To reduce the hardware resources of the IDCT logical computation, we utilize the symmetric property of the butterfly structure. We propose a reordered parallel-in serial-out (RPISO) scheme to reduce the redundant butterfly inputs and the calculations required for the butterfly inputs in each cycle.
2) To reduce the hardware area of the transpose memory, we present a cyclic data mapping that achieves 100% I/O utilization for each SRAM. In addition, to make the 2D IDCT design fully pipelined, a two-port SRAM is used. A pipelining schedule for the row and column 1D IDCTs is proposed to avoid address conflicts between writing and reading.
The results show that our design can significantly reduce the circuit area in the logical computation part and transpose memory compared with existing works. The maximum throughput is able to support real-time decoding of a 4 K × 2 K 60 fps video sequence.
The rest of this paper is arranged as follows. Section 2 explains the inverse transform in HEVC. Section 3 demonstrates the proposed 1D IDCT architecture design. Section 4 presents the SRAM-based transpose memory. Section 5 shows the experimental results, followed by the conclusion in the last section.
HEVC Inverse Transform
The HEVC IDCT uses the integer transform, and the largest transform size is 32 × 32. A complete 2D IDCT can be decomposed into two 1D IDCTs, and the decomposition method is shown in the following equation.
where $T_{2N}$ is a $2N \times 2N$ transform matrix defined by HEVC, $Y$ is the $2N \times 2N$ matrix before the IDCT, and $X$ is the result after the 2D IDCT operation. From the matrix in Eq. (2), we can see that it has both symmetric and anti-symmetric properties, as shown in Eq. (3).
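As a hedged illustration of the row-column decomposition, the sketch below uses the 4 × 4 HEVC core transform matrix and checks that a 1D inverse transform on every row, followed by a transpose and a second 1D pass, matches a direct evaluation of $T^T Y T$. The form $X = T^T Y T$, the omission of the intermediate scaling and rounding stages, and all function names are assumptions made for illustration rather than details taken from this paper.

```python
# 4x4 HEVC core transform matrix (each row is one basis vector).
T4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def matmul(a, b):
    """Plain integer matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def idct_1d(vec, t):
    """1D inverse transform of one vector: x[n] = sum_k t[k][n] * vec[k]."""
    size = len(t)
    return [sum(t[k][n] * vec[k] for k in range(size)) for n in range(size)]


def idct_2d_row_column(y, t):
    """Two 1D passes with a transpose in between (no scaling or rounding)."""
    first_pass = [idct_1d(row, t) for row in y]                       # Y * T
    second_pass = [idct_1d(row, t) for row in transpose(first_pass)]  # (T^T Y T)^T
    return transpose(second_pass)


def idct_2d_direct(y, t):
    """Reference result: T^T * Y * T."""
    return matmul(matmul(transpose(t), y), t)


Y = [[3, -1, 0, 2], [5, 0, -2, 1], [0, 4, 1, -3], [-2, 1, 0, 0]]
print(idct_2d_row_column(Y, T4) == idct_2d_direct(Y, T4))
```

The transpose between the two passes is the step that the transpose memory implements in hardware.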
The matrices for each transform size can be found in [18] . Using the symmetry, a fast algorithm was developed to decompose the transform matrix [19] . Using Chen's algorithm, the decomposition method for the HEVC transform matrix is shown in the following equation.
where $T_{2N}$ is a $2N \times 2N$ transform matrix, $T_N$ is an $N \times N$ transform matrix, $O_N$ is an odd part matrix, $P_{2N}$ is a permutation matrix, and $B_{2N}$ is a 2N-point butterfly structure. Detailed definitions for $P_{2N}$, $B_{2N}$, $T_N$, and $O_N$ are given as follows.
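One form of this decomposition that is consistent with the symbols defined above can be sketched as follows. It assumes that $P_{2N}$ interleaves the rows of the even and odd blocks, that $O_N$ holds the first $N$ entries of the odd-indexed rows of $T_{2N}$, and that $I_N$ and $J_N$ denote the $N \times N$ identity and exchange (anti-diagonal) matrices; $I_N$, $J_N$, and the exact block layout are assumptions rather than notation confirmed by this paper.

```latex
T_{2N} = P_{2N}
\begin{bmatrix} T_N & 0 \\ 0 & O_N \end{bmatrix}
B_{2N},
\qquad
B_{2N} = \begin{bmatrix} I_N & J_N \\ I_N & -J_N \end{bmatrix},
\qquad
\left[ O_N \right]_{k,n} = \left[ T_{2N} \right]_{2k+1,\,n},
\quad k, n = 0, 1, \ldots, N-1
```

Multiplying out the right-hand side gives even-indexed rows of the form $[\,T_N \;\; T_N J_N\,]$ and odd-indexed rows of the form $[\,O_N \;\; -O_N J_N\,]$, which is the symmetric and anti-symmetric structure that the butterfly exploits.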
Proposed 1D IDCT Architecture
First, we design an 8-point 1D IDCT architecture; the corresponding overall architecture for a 1D 8-point IDCT is obtained using Chen's algorithm. $Y_n$ represents the 8-point inputs and $X_n$ represents the 8-point outputs of the IDCT.
As shown in Eq. (6), the transform can be divided into even and odd parts. To generate the 8-point results, four results are required from each of the even and odd parts. The results from the even part are $E_n$ ($n = 0, 1, 2, 3$) and the four results from the odd part are $O_n$ ($n = 0, 1, 2, 3$). Based on $E_n$ and $O_n$, $X_n$ can be obtained after an 8-point butterfly structure. The relationship is shown in the following equation.

$$X_n = E_n + O_n, \qquad X_{7-n} = E_n - O_n, \qquad n = 0, 1, 2, 3 \tag{8}$$
The even part may be further divided into two parts, namely the EE (Even-Even) part and the EO (Even-Odd) part. Each part is required to generate two results. A detailed architecture for the EE engine is shown in Fig. 2(a). The architecture for the EO engine is similar to that of the EE engine. Based on $EE_n$ and $EO_n$, $E_n$ can be calculated as follows.

$$E_n = EE_n + EO_n, \qquad E_{3-n} = EE_n - EO_n, \qquad n = 0, 1 \tag{9}$$
For the odd part, four results are generated, and details of the architecture for the odd engine are shown in Fig. 2(b) .
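The even/odd split described above can be checked numerically. The following sketch uses the 8 × 8 HEVC core transform matrix and compares the EE/EO/odd decomposition followed by the two butterfly stages against a direct matrix evaluation. The function names are hypothetical, and scaling and rounding are left out; it is meant as an illustration of the data flow, not as the authors' implementation.

```python
# 8x8 HEVC core transform matrix (each row is one basis vector).
T8 = [
    [64, 64, 64, 64, 64, 64, 64, 64],
    [89, 75, 50, 18, -18, -50, -75, -89],
    [83, 36, -36, -83, -83, -36, 36, 83],
    [75, -18, -89, -50, 50, 89, 18, -75],
    [64, -64, -64, 64, 64, -64, -64, 64],
    [50, -89, 18, 75, -75, -18, 89, -50],
    [36, -83, 83, -36, -36, 83, -83, 36],
    [18, -50, 75, -89, 89, -75, 50, -18],
]


def idct8_direct(y):
    """Reference: x[n] = sum_k T8[k][n] * y[k]."""
    return [sum(T8[k][n] * y[k] for k in range(8)) for n in range(8)]


def idct8_even_odd(y):
    """8-point 1D IDCT through EE, EO, and odd parts plus two butterflies."""
    # EE part uses inputs 0 and 4; EO part uses inputs 2 and 6.
    ee = [T8[0][n] * y[0] + T8[4][n] * y[4] for n in range(2)]
    eo = [T8[2][n] * y[2] + T8[6][n] * y[6] for n in range(2)]
    # 4-point butterfly: E_n and E_(3-n) share EE_n and EO_n.
    e = [0] * 4
    for n in range(2):
        e[n] = ee[n] + eo[n]
        e[3 - n] = ee[n] - eo[n]
    # Odd part uses the odd-index inputs 1, 3, 5, 7.
    o = [sum(T8[k][n] * y[k] for k in (1, 3, 5, 7)) for n in range(4)]
    # 8-point butterfly: X_n and X_(7-n) share E_n and O_n.
    x = [0] * 8
    for n in range(4):
        x[n] = e[n] + o[n]
        x[7 - n] = e[n] - o[n]
    return x


coeffs = [10, -3, 7, 0, 2, -5, 1, 4]
print(idct8_even_odd(coeffs) == idct8_direct(coeffs))
```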
Conventional Parallel-In Serial-Out Scheme
In our design, the parallelism is set at 4 pixels/cycle. Therefore, the 4-point IDCT can be finished within one cycle, and the 8/16/32-point IDCT can be finished within 2/4/8 cycles. Using the 8-point 1D IDCT as an example, two cycles are required to generate the eight results $X_n$ ($n = 0, 1, 2, \ldots, 7$). Conventionally, $\{X_0, X_1, X_2, X_3\}$ are generated in the first cycle, and $\{X_4, X_5, X_6, X_7\}$ are generated in the second cycle. In this way, according to Eq. (8), all of the $E_n$ and $O_n$ results are required in both cycles. To generate all of the $E_n$ and $O_n$ results in each cycle, the architecture for the odd and EE engines is the same as shown in Fig. 2.
Proposed Reordered Parallel-In Serial-Out (RPISO) Scheme
In Eq. (8), we see that by using an 8-point butterfly structure, $X_n$ and $X_{7-n}$ are generated from the same $E_n$ and $O_n$. If $X_n$ and $X_{7-n}$ can be generated within one cycle, they can share $E_n$ and $O_n$. Similarly, using a 4-point butterfly structure, $E_n$ and $E_{3-n}$ are calculated from the same $EE_n$ and $EO_n$. Likewise, if $E_n$ and $E_{3-n}$ can be generated within one cycle, they can share $EE_n$ and $EO_n$. In order to reduce the redundant butterfly inputs and the calculations required for the butterfly inputs in each cycle, we reorder the outputs. After the reordering, $\{X_0, X_3, X_4, X_7\}$ are generated in the first cycle, and $\{X_1, X_2, X_5, X_6\}$ are generated in the second cycle. To calculate $X_{0,3,4,7}$ in the first cycle, $O_{0,3}$ and $E_{0,3}$ are required, while to obtain $E_{0,3}$, $EE_0$ is required. In the second cycle, to calculate $X_{1,2,5,6}$, $O_{1,2}$ and $E_{1,2}$ are required, while to obtain $E_{1,2}$, $EE_1$ is required. Within a single cycle, only one result is generated from the EE part and two results are generated from the odd part. It should be noted that the EO engine is similar to the EE engine; within each cycle, only one result is generated from the EO engine.
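The reordering can be illustrated with a small software model. In the sketch below, each call to the hypothetical function `rpiso_cycle` computes only one EE value, one EO value, and two odd values, and returns the four outputs that belong to that cycle. It assumes the 8 × 8 HEVC core transform matrix without scaling or rounding, and it models the arithmetic only, not the multiplexer-level circuit.

```python
# 8x8 HEVC core transform matrix (each row is one basis vector).
T8 = [
    [64, 64, 64, 64, 64, 64, 64, 64],
    [89, 75, 50, 18, -18, -50, -75, -89],
    [83, 36, -36, -83, -83, -36, 36, 83],
    [75, -18, -89, -50, 50, 89, 18, -75],
    [64, -64, -64, 64, 64, -64, -64, 64],
    [50, -89, 18, 75, -75, -18, 89, -50],
    [36, -83, 83, -36, -36, 83, -83, 36],
    [18, -50, 75, -89, 89, -75, 50, -18],
]


def rpiso_cycle(y, cycle):
    """Return {output index: value} for the four outputs of one cycle.

    cycle 0 needs EE_0, EO_0, O_0, O_3; cycle 1 needs EE_1, EO_1, O_1, O_2.
    """
    a = cycle
    ee = T8[0][a] * y[0] + T8[4][a] * y[4]      # one EE result per cycle
    eo = T8[2][a] * y[2] + T8[6][a] * y[6]      # one EO result per cycle
    e = {a: ee + eo, 3 - a: ee - eo}            # 4-point butterfly
    o = {n: sum(T8[k][n] * y[k] for k in (1, 3, 5, 7)) for n in (a, 3 - a)}
    out = {}
    for n in (a, 3 - a):                        # 8-point butterfly
        out[n] = e[n] + o[n]
        out[7 - n] = e[n] - o[n]
    return out


coeffs = [10, -3, 7, 0, 2, -5, 1, 4]
first, second = rpiso_cycle(coeffs, 0), rpiso_cycle(coeffs, 1)
print(sorted(first), sorted(second))
```

Merging the two dictionaries reproduces all eight outputs of the direct transform, while each cycle evaluates only half of the EE, EO, and odd results.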
Using the RPISO scheme, the architecture for the EE engine is shown in Fig. 3. In the first cycle, $EE_0$ is selected for the calculation, and the multiplexers select T[0, 0] and T[4, 0]. In the second cycle, $EE_1$ is required, so the multiplexers select T[0, 1] and T[4, 1]. Compared with the original EE engine in Fig. 2(a), the number of EE sample generators is halved at the cost of some extra multiplexers. The architecture for the odd engine is shown in Fig. 3. The multiplexers are used to select the transform coefficients. In the first cycle, the transform coefficients for $O_{0,3}$ are selected, while in the second cycle, the transform coefficients for $O_{1,2}$ are selected. Compared with the original architecture shown in Fig. 2(b), half of the multipliers and adders in the odd engine are eliminated.

The proposed architecture for an 8-point 1D IDCT is shown in Fig. 4(A), where $Y_n$ represents the 8-point inputs; the circuits for EE8, EO8, and O8 are shown in Fig. 3. The values of sel_e8 and sel_o8 and the reordered outputs in each cycle are shown in Fig. 4(B). Note that the pixel parallelism does not have to be 4. When the parallelism is 8 or 16, the RPISO scheme can still reduce the redundant butterfly inputs and the calculations required for the butterfly inputs in each cycle.

The overall flowchart is shown in Fig. 5. It is composed of a permutation part, an even part, an odd part, and a butterfly structure. By using a 16-point butterfly structure, $X_n$ and $X_{15-n}$ are calculated from the same $E_n$ and $O_n$, as shown in the following equation.

$$X_n = E_n + O_n, \qquad X_{15-n} = E_n - O_n, \qquad n = 0, 1, \ldots, 7 \tag{10}$$
By using the proposed RPISO scheme, $X_n$ and $X_{15-n}$ can share $\{E_n, O_n\}$ in the same cycle. To generate four outputs in one cycle, two even results and two odd results are calculated in one cycle. The reordered outputs in each cycle are stated in Fig. 6(A). Four cycles are required to generate the 16-point results.
For the even part of the 16-point IDCT, $E_n$ is equal to the result of the 8-point IDCT according to [19]. The architecture in Fig. 4(A) can therefore be reused as the even engine for the 16-point IDCT. Within each cycle, four results are generated by the 8-point IDCT. A multiplexer is required to select two of them as the results for the even part. The selection signal is sel_e16. The selection signals sel_o8 and sel_e8 determine which four results are generated by the 8-point IDCT. The values of the selection signals in each cycle are given in Fig. 6(A). For example, in the first cycle, sel_o8 and sel_e8 are 0, so $\{E_0, E_3, E_4, E_7\}$ are generated by the 8-point IDCT; sel_e16 is 0, so $\{E_0, E_7\}$ are selected as the two even results for the 16-point IDCT.
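The grouping of outputs per cycle can be enumerated with a short sketch. The hypothetical function `rpiso_schedule` below lists, for a transform of a given size at 4 pixels/cycle, the four output indices that share two even and two odd results in each cycle. The first cycle of the 16-point case follows the text ($E_0$ and $E_7$ selected, giving $X_0$, $X_7$, $X_8$, $X_{15}$); the ordering of the remaining cycles is an assumption, since the actual order is defined by Fig. 6 and is not recoverable from the text.

```python
def rpiso_schedule(size):
    """List the output indices produced in each cycle at 4 pixels/cycle.

    Cycle a uses the even/odd results with indices a and size/2 - 1 - a,
    so the four outputs are X_n and X_(size-1-n) for those two values of n.
    The order of the cycles is an illustrative assumption.
    """
    half = size // 2
    schedule = []
    for a in range(size // 4):
        outputs = []
        for n in (a, half - 1 - a):
            outputs.extend([n, size - 1 - n])
        schedule.append(sorted(outputs))
    return schedule


for size in (8, 16, 32):
    print(size, rpiso_schedule(size))
```

For an 8-point transform this yields the two groups named earlier, and for 16 and 32 points it yields four and eight groups, matching the stated cycle counts.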
For the odd part of the 16-point IDCT, $O_n$ is generated by multiplication and addition. An odd engine for the 16-point IDCT (O16) generates one $O_n$ result. As shown in Fig. 7, O16 is composed of 8 multiple constant multiplications (MCMs) and 7 adders. For each MCM in O16, one input is the odd-index input $Y_n$, and the other is a constant coefficient. The constants are selected by a look-up table (LUT16), and the selection signal is sel_o16. For example, in the first cycle, sel_o16 is 0, and the corresponding constants for $O_0$ and $O_7$ are selected so that $\{O_0, O_7\}$ are generated as the odd results for the 16-point IDCT.
Similarly, the 32-point IDCT can reuse the 16-point IDCT as the even engine. A multiplexer with the selection signal sel_e32 is used to select two results from the 16-point IDCT. sel_e16 and sel_o16 determine the outputs of the 16-point IDCT. For the odd part of the 32-point IDCT, an O32 engine generates one result. O32 is composed of 16 MCMs and 15 adders, as shown in Fig. 7. sel_o32 is used to select the constant coefficients from LUT32. The values of all the selection signals in each cycle are shown in Fig. 6(B). Eight cycles are required to generate the 32-point results. The overall architecture for the 8/16/32-point IDCT is shown in Fig. 7. Note that at the RTL level, we implement the multiplication directly (the * operator in Verilog). However, each multiplier in the odd engine is a multiple constant multiplier (MCM), which the Design Compiler tool can implement as an adder tree. Figure 8(A) gives the inputs of an MCM in O32: the constant input is selected by an 8-to-1 multiplexer, and the variable is an odd-index input of the 32-point IDCT.
Similarly, for each multiplier in O16, the constant input is selected by a 4-to-1 multiplexer, and the variable is an odd-index input of the 16-point IDCT. Moreover, the circuit of O32 is shown in Fig. 7. The 16 multipliers operate in parallel, so the number of multipliers does not influence the critical path. According to the timing report from Design Compiler, the critical path is as shown in Fig. 7. We can see that the critical path does not include O32.
It should be noted that in our design, four final IDCT outputs are generated in each cycle regardless of the TU size. No registers are used to store the intermediate results. The 4-point IDCT does not adopt the RPISO scheme because all of its outputs are generated within one cycle; thus, there is no need to reorder the outputs. The proposed unified architecture supports only the 8/16/32-point IDCT. Because the 4-point IDCT incurs a small hardware cost, we implemented it separately: the overhead required to embed a 4-point IDCT into the unified architecture is higher than the cost of a direct implementation.
Area-Efficient Transpose Memory
As shown in Eq. (1), a complete 2D IDCT is composed of IT1 and IT2. IT2 can only start after a complete column of IT1 results is available. However, the IT1 results are obtained row by row. Therefore, a transpose memory is required to store the IT1 results.
Data Mapping Scheme for SRAM
To reduce the hardware cost, we use SRAM to realize the transpose memory. The IT1 results are stored in the SRAM. The parallelism in our design is 4 pixels/cycle, so four SRAMs are used. To make full use of the transpose memory, we propose two data mapping modes. Taking a 32 × 32 TU as an example, the mapping scheme for mode 0 is shown in Fig. 9. In Fig. 9, $(m, n)$ indicates the IT1 result $X_{m,n}$ ($m, n = 0, 1, 2, \ldots, 31$). The samples in different columns are marked using different shades: column $4k$ ($k = 0, 1, 2, \ldots, 7$) is shaded in white, column $4k+1$ in light gray, column $4k+2$ in dark gray, and column $4k+3$ in black. In each cycle, four IT1 results are written into the SRAMs. Note that we write $X_{4m,n}$, $X_{4m+1,n}$, $X_{4m+2,n}$, and $X_{4m+3,n}$ into four different SRAMs so that they can be read out in one cycle. For example, $X_{0,0}$, $X_{1,0}$, $X_{2,0}$, and $X_{3,0}$ are written into four different SRAMs, so these four samples can be read within one cycle. Similarly, $X_{4m,0}$, $X_{4m+1,0}$, $X_{4m+2,0}$, and $X_{4m+3,0}$ are written into four different SRAMs, so four IT1 results of the first column can be read within one cycle. After eight cycles, all the IT1 results of the first column have been fetched, and IT2 can begin. Every eight cycles, the IT1 results of one complete column are read out for IT2.
When the transpose memory is full of the IT1 results of transform block N, the IT1 results of the next transform block N+1 are written in from the next cycle. At the same time, the data mapping mode is switched. In the other mapping mode, $X'_{n,m}$ (an IT1 result of block N+1) is stored at the same address as $X_{m,n}$ of block N. The detailed mapping scheme for mode 1 is shown in Fig. 10. As in mode 0, the IT1 results of one complete column can be read within eight cycles.

We aim to design a transpose memory that can store the IT1 results for the various transform sizes on the decoder side. The data mapping scheme used to store the 32 × 32 IT1 results is given in Fig. 9 and Fig. 10. To store the 32 × 32 IT1 results in four SRAMs, the depth of each SRAM is 256 (32 × 32/4). If the transform size is smaller, the IT1 results are stored at the corresponding addresses in the SRAM. For example, when writing the 16 × 16 IT1 results, $X_{m,n}$ ($m, n = 0, 1, 2, \ldots, 15$) is written at the corresponding position shown in Fig. 9 and Fig. 10. One quarter of the transpose memory is used for a 16 × 16 transform block.

By using the proposed data mapping scheme, all the input/output ports of the SRAMs are used for writing/reading in every cycle. Therefore, we achieve 100% I/O utilization of each SRAM. In addition, the data width of the SRAM is significantly reduced compared with previous work, so the hardware area can be reduced. The detailed comparison is given in Sect. 5.2.
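A cyclic mapping with the properties described above can be sketched as follows. The exact layout of Fig. 9 and Fig. 10 is not recoverable from the text, so the bank rule `(m + n) % 4` and the address rule `m * 8 + n // 4` are assumptions chosen only because they satisfy the stated requirements: four row-adjacent samples and four column-adjacent samples always fall into four different SRAMs, and mode 1 stores $X'_{n,m}$ where mode 0 stores $X_{m,n}$. All names are hypothetical.

```python
N = 32            # TU size (32 x 32)
P = 4             # pixels per cycle, equal to the number of SRAMs
GROUPS = N // P   # cycles needed to move one row or one column (8)


def bank_and_address(m, n, mode=0):
    """Map IT1 result X[m][n] to (SRAM index, word address).

    Assumed rule: mode 0 uses bank (m + n) % 4 and address m * 8 + n // 4;
    mode 1 swaps m and n, so X'[n][m] reuses the slot of X[m][n].
    """
    if mode == 1:
        m, n = n, m
    return (m + n) % P, m * GROUPS + n // P


def mapping_is_conflict_free(mode):
    """Check bank uniqueness for every row group and every column group."""
    slots = set()
    for a in range(N):
        for k in range(GROUPS):
            row_banks = {bank_and_address(a, P * k + i, mode)[0] for i in range(P)}
            col_banks = {bank_and_address(P * k + i, a, mode)[0] for i in range(P)}
            if len(row_banks) != P or len(col_banks) != P:
                return False
        for b in range(N):
            slots.add(bank_and_address(a, b, mode))
    # Every sample needs its own (bank, address) slot: 4 banks x 256 words.
    return len(slots) == N * N


print(mapping_is_conflict_free(0), mapping_is_conflict_free(1))
```

Because each group of four samples touches every SRAM exactly once, each 16-bit port carries one sample per cycle, which is what 100% I/O utilization means here.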
Pipelining Schedule for the SRAM
The type of SRAM does not influence the data mapping scheme proposed in Sect. 4.1. Whether a single-port or a two-port SRAM is adopted, the I/O utilization of the SRAM always reaches 100% with the proposed data mapping scheme.
If a single-port SRAM is used as the transpose memory, writing and reading cannot proceed in parallel. IT1 is performed first and the results are written into the memory. After that, the IT1 results are read out and IT2 is performed. IT1 and IT2 can share one 1D IDCT architecture. The overall architecture is shown in Fig. 11(A).
To make the 2D IDCT architecture fully pipelined, a two-port (1r1w) SRAM is used in our design. Writing IT1 results into the memory and reading IT1 results from the memory can be performed in parallel. IT1 and IT2 each require their own 1D IDCT hardware. The overall architecture is shown in Fig. 11(B).
The processes for IT1 and IT2 are pipelined in our design. In this way, writing the IT1 results and reading the results for IT2 can be done in parallel. The read and write addresses of each SRAM must be different in each cycle. To avoid address conflicts, we propose a pipelining schedule for a two-port SRAM, as shown in Fig. 12.
Taking a 32 × 32 TU as an example, from cycles 0 to 7, the IT1 results of row 31 in block N (blkN, row31) are written into the SRAM. Assume that the data mapping scheme for writing block N is mode 0. Within each cycle, $X_{31,4k}$, $X_{31,4k+1}$, $X_{31,4k+2}$, and $X_{31,4k+3}$ are written. The corresponding data mapping scheme is shown in Fig. 9. Then, the IT1 results of the next block N+1 are written. The data mapping mode switches to mode 1, and the corresponding data mapping scheme is shown in Fig. 10. The IT1 results of row 0 in block N+1 (blkN+1, row0) are written from cycles 8 to 15. Conventionally, we begin to read the IT1 results of block N only after all of them have been written. In this way, the IT1 results of column 0 in block N (blkN, col0) are read from cycles 8 to 15. However, we can see in Fig. 12 that the addresses for writing (blkN+1, row0) and reading (blkN, col0) are the same, which leads to an address conflict. To solve this problem, reading (blkN, col0) is done eight cycles in advance.
In our pipelining schedule for the SRAM, the IT1 results of (blkN, col0) are read out from cycles 0 to 7. Within each cycle, $X_{4k,0}$, $X_{4k+1,0}$, $X_{4k+2,0}$, and $X_{4k+3,0}$ are read out. It should be noted that the last sample of the first column, $X_{31,0}$, is read in cycle 7 and has already been written into the SRAM in cycle 0; thus, it can be fetched in cycle 7. From cycles 8 to 15, the IT1 results of (blkN, col1) are read out, while the IT1 results of (blkN+1, row0) are written in. There is also no address conflict between them. After that, the IT1 results of (blkN+1, rowK) are written in while (blkN, colK+1) is read out for IT2; using this schedule, the address conflict between reading and writing is avoided.
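The schedule can be exercised with a small cycle-level model. The sketch below streams two consecutive 32 × 32 blocks through four two-port banks, starts reading column 0 of the first block in the same eight cycles in which its last row is written, and counts any cycle in which a read and a write touch the same location. The bank/address rule is the same assumed cyclic mapping as in the earlier sketch, not the exact layout of Fig. 9 and Fig. 10, and cycle 0 of the text corresponds to cycle 248 of this model.

```python
N, P = 32, 4
GROUPS = N // P             # 8 cycles per row or column
BLOCK_CYCLES = N * GROUPS   # 256 cycles per block, also the bank depth


def location(mode, m, n):
    """Assumed cyclic mapping; mode 1 swaps the roles of m and n."""
    if mode == 1:
        m, n = n, m
    return (m + n) % P, m * GROUPS + n // P


def simulate():
    banks = [[None] * BLOCK_CYCLES for _ in range(P)]
    read_back, conflicts = [], 0
    first_read = (N - 1) * GROUPS   # read col 0 while row 31 is written
    for t in range(2 * BLOCK_CYCLES):
        # Read port: column c of block 0, rows 4*kr .. 4*kr+3.
        reads = []
        if first_read <= t < first_read + BLOCK_CYCLES:
            c, kr = divmod(t - first_read, GROUPS)
            reads = [location(0, P * kr + i, c) for i in range(P)]
        # Write port: row r of block b, columns 4*kw .. 4*kw+3.
        b, rem = divmod(t, BLOCK_CYCLES)
        r, kw = divmod(rem, GROUPS)
        writes = [(location(b % 2, r, P * kw + i), (b, r, P * kw + i))
                  for i in range(P)]
        conflicts += len(set(reads) & {loc for loc, _ in writes})
        for bank, addr in reads:
            read_back.append(banks[bank][addr])
        for (bank, addr), value in writes:
            banks[bank][addr] = value
    return read_back, conflicts, banks


read_back, conflicts, banks = simulate()
expected = [(0, m, c) for c in range(N) for m in range(N)]
print(conflicts, read_back == expected)
```

In this model the first block is returned column by column without corruption, and the second block is fully stored in mode 1 by the time the first has been read out.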
Experimental Results
We used Verilog HDL to implement the proposed design. The design was synthesized with a TSMC 90 nm cell library. As mentioned previously, two critical problems lead to significant hardware cost: one is the logical computation part, and the other is the transpose memory part. We analyze the results from these two perspectives.
Comparison of the Logical Computation of IDCT
In our design, the 4-point IDCT is not included in the unified architecture. The reason is explained at the end of Sect. 3. The gate count of the individual 4-point IDCT is 2.4 k, and the gate count of the unified 8/16/32-point IDCT is 63.9 k. Therefore, an overall gate count of 66.3 k is required for a 1D IDCT design that supports all transform sizes on the decoder side. The concept of normalized area (NA) is introduced to obtain a fair performance comparison, and the normalized area by maximum throughput is defined by the following equation.

$$\text{Normalized area by maximum throughput} = \frac{\text{Gate}}{\text{MaxFreq} \times \text{Tp}} \tag{11}$$

where MaxFreq is the maximum frequency of the design and Tp is the throughput within each cycle, which is equal to the pixel parallelism. "Gate" refers to the gate count of the logical calculation part of the IDCT, excluding the transpose memory.
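As a worked example of this metric, the sketch below evaluates it for the figures quoted in the text: a gate count of 66.3 k and a maximum throughput of 1248 Mpixels/s at 4 pixels/cycle, which implies a maximum frequency of 312 MHz. The form Gate / (MaxFreq × Tp) and the chosen units (k gates per Mpixel/s) are assumptions, so the resulting number may be scaled differently from the entries of Table 1.

```python
def normalized_area(gate_count_k, max_freq_mhz, pixels_per_cycle):
    """Gate count (in k gates) divided by maximum throughput (in Mpixels/s)."""
    return gate_count_k / (max_freq_mhz * pixels_per_cycle)


# 1248 Mpixels/s at 4 pixels/cycle corresponds to 312 MHz.
max_freq_mhz = 1248 / 4
proposed = normalized_area(66.3, max_freq_mhz, 4)
print(f"{proposed:.4f} k gates per Mpixel/s")
```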
In order to demonstrate the effectiveness of our proposed design, we performed a comparison with those in references [9] - [13] , which describe the low-cost designs for the HEVC IDCT. The comparison result is shown in Table 1 . It should be noted that the throughput in [10] , [11] , and [13] is different for various 2N-point IDCTs. On the decoder side, the worst-case scenario should be considered. For example, in [11] , the authors achieve 4/8/16/32 pixels/cycle for a 4/8/16/32-point IDCT, respectively. In the worst case, when the TU size is always 4 × 4 on the decoder side, the parallelism is 4 pixels/cycle. The smallest parallelism is used to calculate the normalized area. For the logical IDCT computation part, compared with [9] - [13] , we can reduce the gate count by more than 25%.
For the 1D DCT architecture in [17] , the pixel parallelism is 16 pixels/cycle. In fact, for the same function, the architecture with higher pixel parallelism can share more computations and hardware resources. So the architecture with higher pixel parallelism usually has less normalized area than one with lower pixel parallelism. The normalized area by parallelism is defined as follows.
$$\text{Normalized area by parallelism} = \frac{\text{Gate}}{\text{pixel parallelism}} \tag{12}$$

To have a fair comparison with [17], we designed architectures with higher parallelism, namely 8 pixels/cycle and 16 pixels/cycle. The proposed RPISO scheme is still adopted in these architectures. When the pixel parallelism is 8 pixels/cycle, the 16/32-point IDCT can take advantage of RPISO to reduce the redundant inputs and the corresponding calculations of the 16/32-point butterfly structure. Similarly, when the pixel parallelism is 16 pixels/cycle, the 32-point IDCT can take advantage of the RPISO scheme to save hardware cost.
The results are shown in Table 1. We can see that the design with higher pixel parallelism consumes much less normalized area than the design with lower pixel parallelism. Our IDCT architecture with a parallelism of 16 pixels/cycle consumes less normalized area by parallelism than [17]. This is partly because we use the RPISO scheme to reduce the calculations of the butterfly inputs for the 32-point IDCT.
Note that in [17], the authors synthesized the 1D architecture with the timing constraint required for 8K × 4K 60 fps. The corresponding frequency is 187 MHz (7680 × 4320 × 60 × 1.5/16). Therefore, we take the 187 MHz mentioned in [17] to be the operating frequency.
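The arithmetic behind these frequency and throughput figures can be reproduced with a few lines. The sketch below recomputes the 187 MHz figure from the expression given in the text and compares the stated 1248 Mpixels/s maximum throughput with the sample rate of a 4K × 2K 60 fps sequence. Treating the factor 1.5 as the luma-plus-chroma sample count of 4:2:0 video and taking 4K × 2K as 4096 × 2160 are assumptions; the function name is hypothetical.

```python
def required_samples_per_second(width, height, fps, chroma_factor=1.5):
    """Samples per second, with 1.5 samples per pixel assumed for 4:2:0 video."""
    return width * height * fps * chroma_factor


# Frequency implied for [17]: 8K x 4K 60 fps at 16 pixels/cycle.
freq_17_hz = required_samples_per_second(7680, 4320, 60) / 16

# Requirement for 4K x 2K 60 fps (assumed 4096 x 2160) versus the
# stated maximum throughput of 1248 Mpixels/s.
need_4k_2k = required_samples_per_second(4096, 2160, 60)
max_throughput = 1248e6

print(round(freq_17_hz / 1e6), need_4k_2k / 1e6, need_4k_2k <= max_throughput)
```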
The results in Table 1 show a comparison of our results with other unified architectures for HEVC IDCT. Using the RPISO scheme, it is possible to realize savings in area. Each transform block is also implemented individually. The results are shown in Table 2 . In [10] , the author presented the hardware area for each transform block. Compared with [10] , we can realize savings in the required hardware resources for each transform block by using the RPISO scheme.
In addition, the unification method proposed in Sect. 3.3 reduces the hardware overhead. The overall gate count of the individually implemented transform blocks is 83.1 k (2.4 + 8.6 + 24.3 + 47.8), which is greater than the 66.3 k of the proposed unified architecture.
Comparison of the Transpose Architecture
For the transpose memory part, four SRAM blocks are used since the parallelism is 4 pixels/cycle. To store the IT1 results of a 32 × 32 transform block, the width of each SRAM is 16 bit (one IT1 result) and the depth is 256 (32 × 32/4). A total of 16384 SRAM bits were used. To store the IT1 results of the smaller transform, a part of the transpose memory is used. The detailed data mapping scheme was given in Sect. 4 .
In [9] and [11], the authors also used SRAM to realize the transpose architecture, and the total number of SRAM bits is the same as in our design. However, we achieve a much smaller area since the width of our SRAM is significantly reduced compared with [9] and [11]. To compare the various transpose memory designs, we used a memory compiler for the same TSMC 90 nm process. A single-port SRAM is used in [9]. However, the 1D IDCT architecture in [9] can also be adopted in a fully pipelined 2D IDCT architecture when a two-port SRAM is used. To have a fair comparison of all the SRAM-based transpose memory designs, we used a two-port memory compiler to generate the area. The total area of the SRAMs is shown in Table 3. Using our proposed narrower and deeper SRAMs, the total SRAM area is 80988 µm². Therefore, we realize a saving of at least 62% in SRAM area compared with [9] and [11]. In [12] and [13], the authors used registers to implement the transpose memory based on the traditional method, which is less area-efficient. The authors of [10] did not develop a transpose architecture.
In summary, we have developed an area-efficient SRAM-based transpose architecture.
Conclusions
In this paper, we presented a low-cost multi-size IDCT architecture for HEVC. To reduce the required amount of hardware resources, the RPISO scheme was proposed. In this scheme, the final outputs are reordered so that we can reduce the redundant butterfly inputs and the calculations required for the butterfly inputs in each cycle. Based on the RPISO scheme, we presented a unified architecture for the 8/16/32-point IDCT. Using Chen's algorithm, the N-point architecture was reused in the 2N-point architecture. We used SRAM instead of registers to realize the transpose memory. We proposed a data mapping scheme and a pipelining schedule for the SRAM so that the data width of the SRAM is significantly reduced. The results show that we can realize savings of about 25% in the area of the logical computation part and more than 60% in the area of the transpose memory part. Our design can therefore support real-time decoding of a 4K × 2K 60 fps video sequence.
Table 3 Comparison with other SRAM-based transpose memory.
