## POLITECNICO DI TORINO

Repository ISTITUZIONALE

## Low-Complexity Reconfigurable DCT-V Architecture

Original
Low-Complexity Reconfigurable DCT-V Architecture / Kello, Jurgen; Roch, Massimo Ruo; Masera, Guido; Martina, Maurizio. - In: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS. II, EXPRESS BRIEFS. - ISSN 1549-7747. STAMPA. - 67:12(2020), pp. 3417-3421. [10.1109/TCSII.2020.2998604]

## Availability:

This version is available at: 11583/2853948 since: 2020-11-27T08:39:49Z

Publisher:
IEEE

Published
DOI:10.1109/TCSII.2020.2998604

Terms of use:
openAccess
This article is made available under terms and conditions as specified in the corresponding bibliographic description in the repository

Publisher copyright
IEEE postprint/Author's Accepted Manuscript
©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collecting works, for resale or lists, or reuse of any copyrighted component of this work in other works.

# Low-Complexity Reconfigurable DCT-V Architecture 

Jurgen Kello, Massimo Ruo Roch, Guido Masera, Senior Member, IEEE, and Maurizio Martina, Senior Member, IEEE


#### Abstract

This brief presents a low-complexity, reconfigurable architecture for the Discrete Cosine Transform (DCT) of type V (DCT-V) of length 32. The proposed architecture can be reconfigured to compute five DCT-V of length 4 with negligible area overhead. As the DCT-V is one of the odd type transforms employed in the Adaptive Multiple Transform (AMT) scheme, the effect of the fixed point implementation has been assessed in the Joint Exploration Model (JEM) developed by the JVET group for the forthcoming Versatile-Video-Coding (VVC) standard. Simulation results show that the proposed architecture is not only low-complexity and reconfigurable, but also features imperceptible quality loss. Moreover, when implemented in 90 nm CMOS technology it occupies only 90 k eq. gates running at 187 MHz.


Index Terms—Low-complexity, DCT, Video coding

## I. Introduction

The Discrete Cosine Transform (DCT) [1] is one of the most popular transforms for image and video coding. Although the DCT has been studied for several years, most of the works available in the literature concentrate on even type DCTs, mainly on the DCT of type II (DCT-II) [2], [3], [4] and its approximations [5], [6], [7], to be employed in several image and video standards, such as High-Efficiency-Video-Coding (HEVC) [8].

In recent years several researchers, e.g. [9], have shown that signals produced by intra and inter prediction schemes in video coding systems are better represented by a blend of trigonometric transforms rather than the DCT-II. In particular, an Adaptive Multiple Transform (AMT) scheme [10], derived from the Enhanced Multiple Transform in [11], has been recently proposed to encode the residual signal for both intra and inter coded blocks in the forthcoming Versatile-Video-Coding (VVC) standard. Based on the coding mode, the encoder chooses for each block the best set of transforms from a certain pool. This pool contains odd type DCTs and odd type Discrete Sine Transforms (DSTs), namely DCT-V, DCT-VIII, DST-I and DST-VII [11]. Since each set is composed of two transform candidates, each of which is evaluated both for horizontal and vertical transforms, a total of five different transform candidates (DCT-II plus four multiple transform candidates of the AMT) have to be computed for each block. Moreover, the block length can be $N=4,8,16,32$; thus, as argued in [10], the computational complexity is very high. As a consequence, efficient computation of odd type DCTs is an important issue, which has been only partially addressed in the literature. Indeed, while several fast algorithms have been proposed and implemented for the computation of even type DCTs and DSTs [12], only a few works address the problem of finding low-complexity factorizations and implementations of odd type transforms (i.e. types V, VI, VII and VIII), e.g. [13]. In [14], the $(2M+1)$-point DCT-II matrix is decomposed into an $(M+1)$-point DCT-VI and an $M$-point DST-VII, by means of the Discrete Fourier Transform (DFT) decomposition of Winograd. Recently, we showed in [15] that the DCT-V of length $N=4,8$ can be easily obtained from the $N=M+1$ DCT-VI and implemented as low complexity architectures. However, such decompositions lead to irregular data flows; as a consequence the hardware reuse of the corresponding architectures is very limited.[^0]

Stemming from the general theory presented in [16], in this brief we derive a new factorization of the DCT-V of length $N=32$, which relies on five instances of the DCT-V of length $N=4$. Such a factorization leads not only to an architecture with a reduced number of multiplications but also to a noteworthy hardware reuse. Indeed, the proposed 1D-DCT architecture, which relies on butterfly and butterfly-like structures, can compute either one DCT-V of length $N=32$ or five DCT-V of length $N=4$. The proposed factorization has been tested with a fixed point model in the encoder of the Joint Exploration Model (JEM, version HM-16.6-JEM-7.2) developed by the JVET group for the forthcoming Versatile-Video-Coding (VVC) standard, showing negligible quality loss. The corresponding architecture has been implemented on a 90 nm standard cell technology, featuring low complexity and power consumption.

## II. Factorization of the DCT-V of Length $N=32$

Let

$$
\begin{align*}
{\left[\mathbf{C}_{N}^{I I}\right]_{k, l} } & =\cos \frac{\pi k\left(l+\frac{1}{2}\right)}{N}  \tag{1}\\
{\left[\mathbf{C}_{N}^{I I I}\right]_{k, l} } & =\cos \frac{\pi l\left(k+\frac{1}{2}\right)}{N}  \tag{2}\\
{\left[\mathbf{C}_{N}^{V}\right]_{k, l} } & =\cos \frac{2 \pi k l}{2 N-1}  \tag{3}\\
{\left[\mathbf{S}_{N}^{V I I}\right]_{k, l} } & =\sin \frac{2 \pi\left(k+\frac{1}{2}\right)(l+1)}{2 N+1} \tag{4}
\end{align*}
$$

with $k, l=0, \ldots, N-1$ be the matrix representations of the DCT-II, DCT-III, DCT-V and DST-VII of length $N$. Moreover, let

$$
\begin{equation*}
\mathbf{A} \oplus \mathbf{B}=\left[\begin{array}{ll}
\mathbf{A} & \\
& \mathbf{B}
\end{array}\right] \quad \bigoplus_{i=1}^{n} \mathbf{A}_{i}=\left[\begin{array}{llll}
\mathbf{A}_{1} & & & \\
& \mathbf{A}_{2} & & \\
& & \ddots & \\
& & & \mathbf{A}_{n}
\end{array}\right] \tag{5}
\end{equation*}
$$

be the matrix direct sum operator and let $\otimes$ be the Kronecker (or tensor) product between two matrices. Any DCT of a signal $\mathbf{x}=\left\{x_{0}, x_{1}, \ldots, x_{N-1}\right\}$ of length $N$ can be written as $\mathbf{Y}=\mathbf{C}_{N} \cdot \mathbf{x}$, where $\mathbf{Y}=\left\{Y_{0}, Y_{1}, \ldots, Y_{N-1}\right\}$ is the transform result and $\mathbf{C}_{N}$ is the matrix representation of the DCT (either of even or odd type). From the theory presented in [16] the following factorization can be obtained:

$$
\begin{equation*}
\mathbf{C}_{32}^{V}=\mathbf{Q}_{10}^{32} \cdot\left[\mathbf{C}_{11}^{V} \oplus \mathbf{C}_{21}^{I I I}\left(\frac{2}{3}\right)\right] \cdot \mathbf{B}_{32}^{(C 5)} \tag{6}
\end{equation*}
$$

where $\mathbf{Q}_{10}^{32}$ is a permutation matrix defined as

$$
\begin{equation*}
\mathbf{Q}_{m}^{3 m+2}: i_{1}+3 i_{2} \mapsto \begin{cases}i_{2}, & \text { for } i_{1}=0 \\ 2 i_{2}+m+1, & \text { for } i_{1}=1 \\ 2 i_{2}+m+2, & \text { for } i_{1}=2\end{cases} \tag{7}
\end{equation*}
$$

with $i_{2}=0, \ldots, m$ and $i_{1}+3 i_{2}<3 m+2$, and $\mathbf{B}_{32}^{(C 5)}$ is the pre-addition matrix obtained from

$$
\begin{equation*}
\mathbf{B}_{3 m+2}^{(C 5)}=\left(\begin{array}{ccc|cc}
1 & & & 1 & \\
& \mathbf{I}_{m} & \mathbf{J}_{m} & & \mathbf{I}_{m} \\
\hline
1 & & & -1 / 2 & \\
& \mathbf{I}_{m} & & & -\mathbf{I}_{m} \\
& & \mathbf{I}_{m} & & -\mathbf{J}_{m}
\end{array}\right) \tag{8}
\end{equation*}
$$

with $\mathbf{I}_{m}$ and $\mathbf{J}_{m}$ the m-order identity and anti-diagonal identity matrices and $m=10$.
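The first factorization stage in (6) can be checked numerically. The following is a minimal NumPy sketch, not the authors' code, and it rests on three assumptions that are not spelled out in the text: the skew DCT-III uses the row ordering $r_j=(r+2i)/n$ for $j=2i$ and $r_j=(2-r+2i)/n$ for $j=2i+1$ (which reduces to the ordinary DCT-III for $r=1/2$), the map in (7) is read as "output $i$ takes input $f(i)$", and the lower block of (8) is $\mathbf{I}_{2m+1}$ next to the stacked column $-1/2$, $-\mathbf{I}_{m}$, $-\mathbf{J}_{m}$. All function names are hypothetical.

```python
import numpy as np

def dct5(n):
    """DCT-V matrix of eq. (3): entry (k, l) is cos(2*pi*k*l / (2n - 1))."""
    idx = np.arange(n)
    return np.cos(2 * np.pi * np.outer(idx, idx) / (2 * n - 1))

def skew_dct3(n, r):
    """Skew DCT-III under the ASSUMED ordering: row j is cos(r_j * l * pi),
    r_j = (r + 2i)/n for j = 2i and r_j = (2 - r + 2i)/n for j = 2i + 1."""
    j = np.arange(n)
    i = j // 2
    r_j = np.where(j % 2 == 0, r + 2 * i, 2 - r + 2 * i) / n
    return np.cos(np.pi * np.outer(r_j, np.arange(n)))

def direct_sum(a, b):
    """Block-diagonal matrix a (+) b, as in eq. (5)."""
    out = np.zeros((a.shape[0] + b.shape[0], a.shape[1] + b.shape[1]))
    out[:a.shape[0], :a.shape[1]] = a
    out[a.shape[0]:, a.shape[1]:] = b
    return out

def perm_q(m):
    """Permutation of eq. (7), read as: output i takes input f(i) (assumption)."""
    n = 3 * m + 2
    q = np.zeros((n, n))
    for i in range(n):
        i1, i2 = i % 3, i // 3
        q[i, i2 if i1 == 0 else 2 * i2 + m + i1] = 1.0
    return q

def pre_addition_c5(m):
    """Pre-addition matrix of eq. (8) for size 3m + 2 (assumed block layout)."""
    n = 3 * m + 2
    eye, flip = np.eye(m), np.fliplr(np.eye(m))
    b = np.zeros((n, n))
    # Upper block, feeding the DCT-V of length m + 1.
    b[0, 0] = b[0, 2 * m + 1] = 1.0
    b[1:m + 1, 1:m + 1] = eye
    b[1:m + 1, m + 1:2 * m + 1] = flip
    b[1:m + 1, 2 * m + 2:] = eye
    # Lower block, feeding the skew DCT-III of length 2m + 1.
    b[m + 1:, :2 * m + 1] = np.eye(2 * m + 1)
    b[m + 1, 2 * m + 1] = -0.5
    b[m + 2:2 * m + 2, 2 * m + 2:] = -eye
    b[2 * m + 2:, 2 * m + 2:] = -flip
    return b

def check_factorization(m):
    """Compare the DCT-V of length 3m + 2 with the product in eq. (6)."""
    middle = direct_sum(dct5(m + 1), skew_dct3(2 * m + 1, 2 / 3))
    product = perm_q(m) @ middle @ pre_addition_c5(m)
    return bool(np.allclose(product, dct5(3 * m + 2)))

print(check_factorization(10))  # length 32, m = 10
print(check_factorization(3))   # length 11, m = 3
```

The same check with $m=3$ covers the length-11 case, since the structure of the matrices depends only on $m$.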

The term $\mathbf{C}_{21}^{I I I}\left(\frac{2}{3}\right)$ is a skew DCT-III of length $N=21$. According to [16] it can be written as:

$$
\begin{equation*}
\mathbf{C}_{21}^{I I I}\left(\frac{2}{3}\right)=\mathbf{K}_{7}^{21} \cdot\left[\bigoplus_{0 \leq i<3} \mathbf{C}_{7}^{I I I}\left(r_{i}\right)\right] \cdot \mathbf{U}_{21} \cdot \mathbf{B}_{3,7}^{(C 3)} \tag{9}
\end{equation*}
$$

where $r_{0}=2 / 9, r_{1}=4 / 9, r_{2}=8 / 9$,

$$
\begin{equation*}
\mathbf{U}_{21}=\mathbf{C}_{3}^{I I I}\left(\frac{2}{3}\right) \otimes \mathbf{I}_{7} \tag{10}
\end{equation*}
$$

and $\mathbf{C}_{3}^{I I I}\left(\frac{2}{3}\right)$ is the skew DCT-III of length $N=3$; $\mathbf{K}_{7}^{21}$ and $\mathbf{B}_{3,7}^{(C 3)}$ are a permutation and a pre-addition matrix, respectively, defined as

$$
\begin{equation*}
\mathbf{K}_{7}^{21}=\left[\bigoplus_{i=1}^{3}\left(\mathbf{I}_{3} \oplus \mathbf{J}_{3}\right)_{i} \oplus \mathbf{I}_{3}\right] \cdot \mathbf{L}_{7}^{21} \tag{11}
\end{equation*}
$$

with

$$
\begin{equation*}
\mathbf{L}_{7}^{21}: i \mapsto \begin{cases}7 \cdot i \bmod 20, & \text { for } 0 \leq i<20 \\ 20, & \text { otherwise }\end{cases} \tag{12}
\end{equation*}
$$

and

$$
\begin{equation*}
\mathbf{B}_{3,7}^{(C 3)}=\left\{\mathbf{I}_{7} \oplus\left[\mathbf{I}_{2} \otimes \operatorname{diag}_{7}(1,2, \ldots, 2)\right]\right\} \cdot \mathbf{W}_{21} \tag{13}
\end{equation*}
$$

where

$$
\begin{equation*}
\mathbf{W}_{21}=\left[\begin{array}{ccc}
\mathbf{I}_{7} & -\mathbf{Z}_{7} & \mathbf{I}_{7}^{\prime} \\
& \mathbf{I}_{7} & -\mathbf{Z}_{7} \\
& & \mathbf{I}_{7}
\end{array}\right] \quad \mathbf{Z}_{7}=\left[\begin{array}{cccc}
0 & \cdots & \cdots & 0 \\
\vdots & & & 1 \\
\vdots & & \cdot^{\cdot^{\cdot}} & \\
0 & 1 & &
\end{array}\right] \tag{14}
\end{equation*}
$$

and $\mathbf{I}_{7}^{\prime}=\operatorname{diag}_{7}(0,1, \ldots, 1)$ with $\operatorname{diag}_{n}(a, b, \ldots, b)$ the diagonal matrix of size $n$, whose elements are the ones listed inside the parentheses.

Figure 1. Architecture of the DCT-V $N=32$.

The factorization shown in (6) can be exploited to write:

$$
\begin{equation*}
\mathbf{C}_{11}^{V}=\mathbf{Q}_{3}^{11} \cdot\left[\mathbf{C}_{4}^{V} \oplus \mathbf{C}_{7}^{I I I}\left(\frac{2}{3}\right)\right] \cdot \mathbf{B}_{11}^{(C 5)} \tag{15}
\end{equation*}
$$

where $\mathbf{Q}_{3}^{11}$ and $\mathbf{B}_{11}^{(C 5)}$ are obtained as in (7) and (8) with $m=3$. The skew DCTs whose length is a prime number can be conveniently rewritten as $\mathbf{C}_{m}^{I I I}(r)=\mathbf{C}_{m}^{I I I} \cdot \mathbf{P}_{m}^{(C 3)}(r)$, with

$$
\begin{equation*}
\mathbf{P}_{m}^{(C 3)}(r)=\left[\begin{array}{ccccc}
1 & 0 & \cdots & \cdots & 0 \\
0 & c_{1, r, m} & & & s_{m-1, r, m} \\
\vdots & & \ddots & \cdot^{\cdot^{\cdot}} & \\
\vdots & & \cdot^{\cdot^{\cdot}} & \ddots & \\
0 & s_{1, r, m} & & & c_{m-1, r, m}
\end{array}\right] \tag{16}
\end{equation*}
$$

where $c_{l, r, m}=\cos \frac{(1 / 2-r) l \pi}{m}$ and $s_{l, r, m}=\sin \frac{(1 / 2-r) l \pi}{m}$. As a consequence, $\mathbf{C}_{32}^{V}$ is now a function of $\mathbf{C}_{4}^{V}$, $\mathbf{C}_{7}^{I I I}$ and $\mathbf{C}_{3}^{I I I}$. Since $\mathbf{C}_{N}^{I I I}=\left(\mathbf{C}_{N}^{I I}\right)^{T}$, with $(\cdot)^{T}$ being the transposition operator, the factorization proposed in [14] can be exploited to obtain:

$$
\begin{equation*}
\mathbf{C}_{7}^{I I I}=\mathbf{H}_{7} \cdot\left[\begin{array}{cc}
\mathbf{J}_{4} \cdot \mathbf{C}_{4}^{V} \cdot \mathbf{D}_{4} & \\
& \left(\mathbf{S}_{3}^{V I I}\right)^{T}
\end{array}\right] \cdot \mathbf{G}_{7}^{T} \cdot \mathbf{D}_{7}^{\prime} \tag{17}
\end{equation*}
$$

where

$$
\begin{equation*}
\mathbf{H}_{7}=\left[\begin{array}{ccc}
\mathbf{I}_{3} & & -\mathbf{J}_{3} \\
& 1 & \\
\mathbf{J}_{3} & & \mathbf{I}_{3}
\end{array}\right] \tag{18}
\end{equation*}
$$

$\mathbf{D}_{4}=\operatorname{diag}_{4}(1,-1,1,-1)$, $\mathbf{S}_{3}^{V I I}$ is the DST-VII of length $N=3$,

$$
\begin{equation*}
\mathbf{G}_{7}: i \mapsto \begin{cases}2 i, & i=0,1,2,3 \\ 2(i \bmod 4)+1, & i=4,5,6\end{cases} \tag{19}
\end{equation*}
$$

and $\mathbf{D}_{7}^{\prime}=\operatorname{diag}_{7}(1,-1,1,1,1,-1,1)$. Thus, $\mathbf{C}_{32}^{V}$ is factorized in terms of five DCT-V with length $N=4$, four DST-VII of length $N=3$ and seven DCT-III of length $N=3$, which can be implemented as in [15] and [17].
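The block counts quoted above follow from applying the decomposition rules recursively. The sketch below tallies the leaf transforms; the rule table is a reading of (6), (9), (10), (15) and (17), with the Kronecker product $\mathbf{C}_{3}^{I I I}\left(\frac{2}{3}\right) \otimes \mathbf{I}_{7}$ counted as seven length-3 transforms, and the block names are hypothetical labels chosen only for this example.

```python
from collections import Counter

# Decomposition rules read off the factorization of Section II.
RULES = {
    "DCT-V/32": {"DCT-V/11": 1, "skew DCT-III/21": 1},              # eq. (6)
    "DCT-V/11": {"DCT-V/4": 1, "skew DCT-III/7": 1},                # eq. (15)
    "skew DCT-III/21": {"skew DCT-III/7": 3, "skew DCT-III/3": 7},  # eqs. (9), (10)
    "skew DCT-III/7": {"DCT-V/4": 1, "DST-VII/3": 1},               # eqs. (16), (17)
}

def leaf_blocks(block, copies=1):
    """Return how many instances of each non-decomposable block are needed."""
    if block not in RULES:
        return Counter({block: copies})
    total = Counter()
    for child, count in RULES[block].items():
        total += leaf_blocks(child, copies * count)
    return total

tally = leaf_blocks("DCT-V/32")
for name, count in sorted(tally.items()):
    print(f"{count} x {name}")
```

The tally reproduces the five DCT-V of length 4, four DST-VII of length 3 and seven DCT-III of length 3 stated in the text.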

Figure 2. Architecture of the skew DCT-III $N=7$ ($\sigma=$ sign alternation, $\Pi=$ permutation).

## III. Proposed Architecture

Stemming from the factorization detailed in Section II, the architecture depicted in Fig. 1 has been obtained, where several butterfly and butterfly-like structures are exploited. Let, for the sake of simplicity, $\hat{\mathbf{x}}=\left\{\hat{x}_{0}, \hat{x}_{1}, \ldots, \hat{x}_{N-1}\right\}$ and $\hat{\mathbf{Y}}=\left\{\hat{Y}_{0}, \hat{Y}_{1}, \ldots, \hat{Y}_{N-1}\right\}$ be the input and the output of each building block. As can be inferred from (8), $\mathbf{B}_{32}^{(C 5)}$ can be implemented by resorting to adders, which properly combine the inputs, e.g. the first and the second results are $\hat{Y}_{0}=\hat{x}_{0}+\hat{x}_{21}$ and $\hat{Y}_{1}=\hat{x}_{1}+\hat{x}_{20}+\hat{x}_{22}$. As a consequence, the total number of adders to implement $\mathbf{B}_{32}^{(C 5)}$ is 42. Similarly, from (8) and (13) one can derive that $\mathbf{B}_{11}^{(C 5)}$ and $\mathbf{B}_{3,7}^{(C 3)}$ require 14 and 18 adders, respectively. Since permutations are fixed, they have been implemented by correctly wiring inputs to outputs, e.g. from (7) one can derive that the first and the second results of $\mathbf{Q}_{3}^{11}$ are $\hat{Y}_{0}=\hat{x}_{0}$ and $\hat{Y}_{1}=\hat{x}_{3}$. Similarly, from (11) and (7) it is possible to derive the connections required to build $\mathbf{K}_{7}^{21}$ and $\mathbf{Q}_{10}^{32}$.
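The adder counts of the pre-addition matrices can be reproduced by counting, for each output row, the number of non-zero entries minus one. The sketch below does this under stated assumptions: the lower block of (8) is $\mathbf{I}_{2m+1}$ next to the stacked column $-1/2$, $-\mathbf{I}_{m}$, $-\mathbf{J}_{m}$; $\mathbf{Z}_{7}$ in (14) is zero in its first row and column with an anti-diagonal of ones elsewhere; and the scalings by $1/2$ and $2$ are treated as shifts that cost no adder. Function names are hypothetical.

```python
import numpy as np

def pre_addition_c5(m):
    """Pre-addition matrix of eq. (8) for size 3m + 2 (assumed block layout)."""
    n = 3 * m + 2
    eye, flip = np.eye(m), np.fliplr(np.eye(m))
    b = np.zeros((n, n))
    b[0, 0] = b[0, 2 * m + 1] = 1.0
    b[1:m + 1, 1:m + 1] = eye
    b[1:m + 1, m + 1:2 * m + 1] = flip
    b[1:m + 1, 2 * m + 2:] = eye
    b[m + 1:, :2 * m + 1] = np.eye(2 * m + 1)
    b[m + 1, 2 * m + 1] = -0.5
    b[m + 2:2 * m + 2, 2 * m + 2:] = -eye
    b[2 * m + 2:, 2 * m + 2:] = -flip
    return b

def pre_addition_c3(m=7):
    """Pre-addition matrix of eqs. (13)-(14) for three blocks of size m."""
    eye, zero = np.eye(m), np.zeros((m, m))
    z = np.zeros((m, m))
    z[1:, 1:] = np.fliplr(np.eye(m - 1))       # assumed shape of Z_m
    eye_prime = np.diag([0.0] + [1.0] * (m - 1))
    w = np.block([[eye, -z, eye_prime],
                  [zero, eye, -z],
                  [zero, zero, eye]])
    d = [1.0] + [2.0] * (m - 1)                # diag_m(1, 2, ..., 2)
    scale = np.diag(np.concatenate([np.ones(m), np.tile(d, 2)]))
    return scale @ w

def count_adders(matrix):
    """Each output row with k non-zero entries needs k - 1 two-input adders."""
    nonzeros = np.count_nonzero(matrix, axis=1)
    return int(np.sum(np.maximum(nonzeros - 1, 0)))

print(count_adders(pre_addition_c5(10)))  # B_32^(C5)
print(count_adders(pre_addition_c5(3)))   # B_11^(C5)
print(count_adders(pre_addition_c3(7)))   # B_{3,7}^(C3)
```

Under these assumptions the three counts agree with the 42, 14 and 18 adders reported in the text.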

The gray shaded blocks in Fig. 1, corresponding to DCT-V of length $N=4$ and skew DCT-III of length $N=7$ and $N=3$, are depicted in Figs. 2, 3 and 4, respectively. In particular, Fig. 2 shows that the DCT-III of length $N=7$, whose inputs and outputs are $\tilde{\mathbf{x}}=\left\{\tilde{x}_{0}, \tilde{x}_{1}, \ldots, \tilde{x}_{N-1}\right\}$ and $\tilde{\mathbf{Y}}=\left\{\tilde{Y}_{0}, \tilde{Y}_{1}, \ldots, \tilde{Y}_{N-1}\right\}$, can be obtained as the cascade of some building blocks. The simplest ones are shown as white boxes where the inside gray shaded lines detail the implementation. On the other hand, blocks corresponding to trigonometric transforms are shown as gray shaded boxes and detailed in Figs. 3 and 5, respectively. As shown in Fig. 2, the $\mathbf{P}_{7}^{(C 3)}(r)$ matrix, which is described by (16), requires 3 butterfly structures (gray shaded lines and dots) to compute $\hat{Y}_{0}=\hat{x}_{0}, \hat{Y}_{1}=c_{1, r, 7} \cdot \hat{x}_{1}+s_{6, r, 7} \cdot \hat{x}_{6}, \ldots, \hat{Y}_{6}=s_{1, r, 7} \cdot \hat{x}_{1}+c_{6, r, 7} \cdot \hat{x}_{6}$. As a consequence, the implementation of $\mathbf{P}_{7}^{(C 3)}(r)$ relies on 12 multipliers and 6 adders. Permutation blocks, namely $\mathbf{G}_{7}^{T}$ and $\mathbf{J}_{4}$, are hardwired. Sign alternation for $\mathbf{D}_{4}$ and $\mathbf{D}_{7}^{\prime}$ requires 4 adders to perform 2's complement operations, namely $\hat{Y}_{1}=-\hat{x}_{1}, \hat{Y}_{3}=-\hat{x}_{3}, \hat{Y}_{4}=-\hat{x}_{4}$ and $\hat{Y}_{6}=-\hat{x}_{6}$. $\mathbf{H}_{7}$ relies on 3 multiplierless butterfly structures (gray shaded lines and diamonds), implementing $\hat{Y}_{0}=\hat{x}_{0}-\hat{x}_{6}$, $\ldots, \hat{Y}_{6}=\hat{x}_{0}+\hat{x}_{6}$; thus, it requires 6 adders.
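The butterfly structure of $\mathbf{P}_{7}^{(C 3)}(r)$ and its operation count can be illustrated with a short sketch. It builds the X-shaped matrix exactly as printed in (16), so that column $l$ carries $c_{l, r, m}$ on the diagonal and $s_{l, r, m}$ on the anti-diagonal, and counts one multiplier per entry different from $0$ and $\pm 1$ and one adder per row with two entries. The function names are hypothetical, and the signs simply follow (16) as printed.

```python
import numpy as np

def rotation_p(m, r):
    """X-shaped matrix of eq. (16) for odd m."""
    p = np.zeros((m, m))
    p[0, 0] = 1.0
    for l in range(1, m):
        angle = (0.5 - r) * l * np.pi / m
        p[l, l] = np.cos(angle)      # c_{l,r,m} on the diagonal
        p[m - l, l] = np.sin(angle)  # s_{l,r,m} on the anti-diagonal
    return p

def count_multipliers(matrix):
    """Entries other than 0 and +/-1 need a real multiplier."""
    trivial = np.isclose(matrix, 0.0) | np.isclose(np.abs(matrix), 1.0)
    return int(np.sum(~trivial))

def count_adders(matrix):
    """Each output row with k non-zero entries needs k - 1 adders."""
    nonzeros = np.sum(~np.isclose(matrix, 0.0), axis=1)
    return int(np.sum(np.maximum(nonzeros - 1, 0)))

p7 = rotation_p(7, 2 / 9)
x = np.arange(7, dtype=float)
y = p7 @ x  # y[1] combines x[1] and x[6], y[6] combines x[1] and x[6]
print(count_multipliers(p7), count_adders(p7))
```

For $r=1/2$ all angles vanish and the matrix reduces to the identity, which is consistent with the skew DCT-III collapsing to the ordinary DCT-III in that case.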

Finally, according to [15] and [17], the trigonometric transforms represented by $\mathbf{C}_{4}^{V}, \mathbf{C}_{3}^{I I I}(r)$ and $\left(\mathbf{S}_{3}^{V I I}\right)^{T}$ have been implemented as shown in Figs. 3, 4 and 5 by resorting to 4 multipliers and 13 adders, 6 multipliers and 6 adders, and 4 multipliers and 10 adders, respectively. The values of the constants required by $\mathbf{C}_{4}^{V}, \mathbf{C}_{3}^{I I I}(r)$ and $\left(\mathbf{S}_{3}^{V I I}\right)^{T}$ are shown in Table I.


Figure 3. Architecture of the DCT-V $N=4$, as in [15].


Figure 4. Architecture of the skew DCT-III $N=3$.
Table I
Coefficients of the DCT-III with $N=3$, DCT-V with $N=4$ and DST-VII with $N=3$.

| block | coefficient | value |
| :---: | :---: | :---: |
| $\mathbf{C}_{3}^{I I I}$ | $C 31$ | $-\frac{\sqrt{3}}{2}$ |
|  | $C 32$ | 1.5 |
| $\mathbf{C}_{4}^{V}$ | $C 51$ | $\frac{7}{6}$ |
|  | $C 52$ | $\frac{2 \cos (u)-\cos (2 u)-\cos (3 u)}{3}$ |
|  | $C 53$ | $\frac{\cos (u)-2 \cos (2 u)+\cos (3 u)}{3}$ |
|  | $C 54$ | $\frac{\cos (u)+\cos (2 u)-2 \cos (3 u)}{3}$ |
| $\left(\mathbf{S}_{3}^{V I I}\right)^{T}$ | $S 31$ | $\frac{\sin (u)+\sin (2 u)-\sin (3 u)}{3}$ |
|  | $S 32$ | $\frac{2 \sin (u)-\sin (2 u)+\sin (3 u)}{3}$ |
|  | $S 33$ | $\frac{\sin (u)-2 \sin (2 u)-\sin (3 u)}{3}$ |
|  | $S 34$ | $\frac{\sin (u)+\sin (2 u)+2 \sin (3 u)}{3}$ |

The total number of multipliers and adders required to implement the proposed architecture is summarized in Table II and is equal to 126 and 285 respectively, which is significantly less than the number of multiplications and additions required by the $\mathbf{C}_{32}^{V}$ matrix product (i.e. 1024 multiplications and 992 additions).
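The totals can be re-derived from the per-block figures. The sketch below copies the (multiplier, adder) pairs from Table II, sums them first for one skew DCT-III of length 7 and then for the whole architecture, and compares the result with a direct $32 \times 32$ matrix-vector product; the dictionary keys are hypothetical labels for the blocks.

```python
# (multipliers, adders) per block, copied from Table II.
SKEW_DCT3_7_PARTS = {
    "P7(r)": (12, 6),
    "D7'": (0, 2),
    "D4": (0, 2),
    "C4^V": (4, 13),
    "(S3^VII)^T": (4, 10),
    "H7": (0, 6),
}

def total(pairs):
    """Sum a collection of (multipliers, adders) pairs component-wise."""
    pairs = list(pairs)
    return (sum(p[0] for p in pairs), sum(p[1] for p in pairs))

skew_dct3_7 = total(SKEW_DCT3_7_PARTS.values())

top_level = [
    (0, 42),   # B_32^(C5)
    (0, 14),   # B_11^(C5)
    (0, 18),   # B_{3,7}^(C3)
    (4, 13),   # C_4^V inside C_11^V
]
top_level += 7 * [(6, 6)]        # seven skew DCT-III of length 3
top_level += 4 * [skew_dct3_7]   # four skew DCT-III of length 7

proposed = total(top_level)
direct = (32 * 32, 32 * 31)      # plain matrix-vector product of size 32

print("skew DCT-III N=7:", skew_dct3_7)
print("proposed DCT-V N=32:", proposed)
print("direct product:", direct)
```

With these figures the multiplier count drops by roughly a factor of eight with respect to the direct product, and the adder count by roughly a factor of 3.5.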

## IV. Implementation Results

Table II
NUMBER OF MULTIPLIERS AND ADDERS REQUIRED TO IMPLEMENT THE PROPOSED ARCHITECTURE FOR THE DCT-V OF LENGTH $N=32$

|  | $\mathbf{P}_{7}^{(C 3)}(r)$ | $\mathbf{D}_{7}^{\prime}$ | $\mathbf{D}_{4}$ | $\mathbf{C}_{4}^{V}$ | $\left(\mathbf{S}_{3}^{V I I}\right)^{T}$ | $\mathbf{H}_{7}$ | $\mathbf{C}_{7}^{I I I}(r)$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| MUL | 12 | 0 | 0 | 4 | 4 | 0 | 20 |
| ADD | 6 | 2 | 2 | 13 | 10 | 6 | 39 |

|  | $\mathbf{B}_{32}^{(C 5)}$ | $\mathbf{B}_{11}^{(C 5)}$ | $\mathbf{B}_{3,7}^{(C 3)}$ | $\mathbf{C}_{4}^{V}$ | $7 \times \mathbf{C}_{3}^{I I I}(r)$ | $4 \times \mathbf{C}_{7}^{I I I}(r)$ | $\mathbf{C}_{32}^{V}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| MUL | 0 | 0 | 0 | 4 | $7 \times 6$ | $4 \times 20$ | 126 |
| ADD | 42 | 14 | 18 | 13 | $7 \times 6$ | $4 \times 39$ | 285 |

Figure 5. Architecture of the transpose DST-VII $N=3$.

Table III
Bjøntegaard Delta rate loss.

|  | mean [\%] | std-var [\%] | $\min [\%]$ | $\max [\%]$ |
| :---: | :---: | :---: | :---: | :---: |
| AI | 0.0398 | 0.0356 | 0.0139 | 0.1103 |
| RA | 0.0438 | 0.0404 | -0.0259 | 0.0874 |

In order to properly size the proposed architecture, the corresponding factorization has been implemented in fixed point in the JEM, version HM-16.6-JEM-7.2 [18]. Input data are represented with 16 bits, the internal bit-width increases up to 32 bits to have enough precision where $\mathbf{C}_{3}^{I I I}(r)$ is cascaded with $\mathbf{C}_{7}^{I I I}(r)$, and the output data are scaled to be represented with 16 bits as well. Experiments showed that 8 fractional bits can be used to correctly represent each constant coefficient for the multiplications. Indeed, as shown in Table III, the proposed solution achieves an average Bjøntegaard Delta rate loss [19] of about $0.04 \%$ in both all-intra (AI) and random-access (RA) configurations with the standard video sequences suggested in the common test conditions [20]. We also observed that the selection of the modified DCT-V with respect to the original one is always above $95 \%$ and $80 \%$ for $N=4$ and $N=32$ respectively.
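To make the 8-fractional-bit representation of the constants concrete, the following sketch quantizes two coefficients of Table I and applies one of them to an integer sample. It is only an illustration: the round-to-nearest quantization, the rounding offset before the right shift and the helper names are assumptions, since the text does not specify how rounding is handled in the modified JEM.

```python
import math

FRAC_BITS = 8  # fractional bits used for constants, as reported in the text

def quantize(coefficient, frac_bits=FRAC_BITS):
    """Nearest integer to coefficient * 2**frac_bits (assumed rounding)."""
    return int(round(coefficient * (1 << frac_bits)))

def fixed_mul(sample, q_coefficient, frac_bits=FRAC_BITS):
    """Integer multiply followed by a rounding right shift (assumed)."""
    return (sample * q_coefficient + (1 << (frac_bits - 1))) >> frac_bits

c31 = quantize(-math.sqrt(3) / 2)  # Table I, coefficient C31
c32 = quantize(1.5)                # Table I, coefficient C32

sample = 1000                      # arbitrary value within the 16-bit input range
approx = fixed_mul(sample, c31)
exact = sample * (-math.sqrt(3) / 2)
print(c31, c32, approx, exact)
```

With 8 fractional bits the coefficient error is at most $2^{-9}$, so for a 16-bit sample the product deviates from the exact value by a few tens of least significant bits at most before any later rescaling.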

Moreover, the proposed architecture for the computation of the DCT-V of length $N=32$ contains five DCT-V of length $N=4$. As a consequence, by adding a few multiplexers the proposed architecture can be configured to compute one DCT-V of length $N=32$ or five DCT-V of length $N=4$, as shown in Fig. 6. This reconfigurable architecture has been implemented in VHDL with the TSMC 90 nm standard cell technology (typical) at 1.1 V and $0^{\circ}$, by means of Synopsys Design Compiler Graphical, reaching a maximum clock frequency of 222 MHz with an area of $0.32 \mathrm{~mm}^{2}$ (about 113 k eq. gates, NAND2X1) and a power consumption of 17.5 mW.

Figure 6. Architecture of the proposed reconfigurable DCT-V with $N=4,32$.

The proposed architecture cannot be directly compared with other DCT-V architectures, as, to the best of our knowledge, this is the first work addressing the implementation of an architecture for the DCT-V with $N=32$. However, it can be compared with some flexible architectures able to compute the DCT-II of length $N=32$ to quantify the complexity of the proposed solution. For this reason we also synthesized the proposed architecture for a target clock frequency of 187 MHz, as in [2], reaching an area occupation of $0.25 \mathrm{~mm}^{2}$ (about 90 k eq. gates, NAND2X1). Table IV compares the proposed low complexity and reconfigurable DCT-V architecture with some recent DCT-II architectures in terms of size support $N$, number of eq. gates, maximum/target clock frequency $f_{c k}$, power consumption and throughput (number of produced samples per cycle). It is worth noting that the different speed achieved by different architectures depends also on some architectural choices, such as the number of pipeline registers. As an example, the solution referred to as [21] (1) is a pipelined architecture, whereas the proposed one contains registers only at the input and at the output. Moreover, the number of required multipliers can be different as well. As an example, the architecture referred to as [21] (2) requires only 80 multipliers.

Table IV
1D DCT ARCHITECTURES COMPARISON.

| Arch. | $N$ | eq. <br> gates | $f_{c k}$ <br> $[\mathrm{MHz}]$ | P <br> $[\mathrm{mW}]$ | T <br> [samples/cycle] |
| :---: | :---: | :---: | :---: | :---: | :---: |
| $[2]$ | $4,8,16,32$ | $131 \mathrm{k}(@ 90 \mathrm{~nm})$ | 187 | 23.17 | 32 |
| $[3]$ | $4,8,16,32$ | $88 \mathrm{k}(@ 90 \mathrm{~nm})$ | 256 | 16.20 | 32 |
| $[22]$ | $4,8,16,32$ | $97 \mathrm{k}(@ 45 \mathrm{~nm})$ | 50 | 24.20 | 32 |
| $[23]$ | $4,8,16,32$ | $163 \mathrm{k}(@ 90 \mathrm{~nm})$ | 250 | 15.30 | 32 |
| $[21](1)$ | $4,8,16,32$ | $113 \mathrm{k}(@ 90 \mathrm{~nm})$ | 401 | 15.98 | 32 |
| $[21](2)$ | $4,8,16,32$ | $88 \mathrm{k}(@ 90 \mathrm{~nm})$ | 187 | 32.09 | 32 |
| Prop. $(\mathbf{1})$ | 4,32 | $113 \mathrm{k}(@ 90 \mathrm{~nm})$ | 222 | 17.50 | 20,32 |
| Prop. $(\mathbf{2})$ | 4,32 | $90 \mathrm{k}(@ 90 \mathrm{~nm})$ | 187 | 13.10 | 20,32 |
| Prop. $(\mathbf{*})$ | $4,8,16,32$ | $\approx 158 \mathrm{k}(@ 90 \mathrm{~nm})$ | 187 | $\approx 23$ | $20,8,16,32$ |

As can be observed, the proposed architecture supports only $N=4$ and $N=32$, whereas the other ones support all the DCT-II sizes specified in both the HEVC standard and the forthcoming VVC standard, namely $N=4,8,16,32$. Indeed, the DCT-II of size $N$ can be factorized in terms of at least one DCT-II of size $N / 2$ [12]. As a consequence, DCT-II architectures for $N=32$ allow for a great hardware reuse to support $N=4,8,16$. In order to take this aspect into account, we assume that the area and the power consumption required to implement an architecture for the DCT-V of size $N / 2$ are roughly half the area and the power consumption required for size $N$. As a consequence, we can estimate that the total area and the total power consumption are roughly 1.75 times the area and the power consumption of the proposed architecture and that the critical path is located in the architecture that supports $N=4,32$. These estimates are summarized in the last row of Table IV, marked (*). Although this comparison is not entirely fair, as the proposed architecture and the compared ones implement different types of DCTs, they have similar complexities, clock frequencies, power consumption and throughput, thus showing the effectiveness of the proposed solution. Finally, the proposed 1D-DCT-V architecture can be used to implement either a folded or a fully parallel 2D-DCT-V architecture by resorting to the schemes which have already been proposed for the 2D-DCT-II in [2].
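The factor of 1.75 used for the last row of Table IV can be reproduced with a few lines of arithmetic. The sketch assumes, as one reading of the text, that dedicated DCT-V datapaths for $N=16$ and $N=8$ are added to the existing $N=4,32$ architecture and that cost scales linearly with $N$; the function name is hypothetical.

```python
def multi_size_estimate(value_at_32, sizes=(32, 16, 8)):
    """Sum the cost of each datapath, assuming cost proportional to N."""
    return sum(value_at_32 * size / 32 for size in sizes)

scale = multi_size_estimate(1.0)          # 1 + 1/2 + 1/4
area_kgates = multi_size_estimate(90.0)   # 90 k eq. gates at 187 MHz
power_mw = multi_size_estimate(13.1)      # 13.1 mW at 187 MHz

print(scale, area_kgates, power_mw)
```

The results, about 158 k eq. gates and about 23 mW, match the rounded estimates reported in the table.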

## V. Conclusions

In this brief, we presented a low-complexity architecture to compute the DCT-V of length $N=32$, which involves only 126 multiplications. We have also shown that the proposed solution features near-optimal rate-distortion performance in all-intra and random-access configurations with an average Bjøntegaard Delta rate loss of about $0.04 \%$, thus being well suited to implement the AMT scheme, which is part of the VVC forthcoming standard. We have used the proposed architecture to derive a flexible architecture, which can be reconfigured to compute five DCT-V of length $N=4$. Implementation results show that the proposed architecture features complexity, speed and power consumption similar to the best architectures for the DCT-II available in the literature.

## REFERENCES

[1] N. Ahmed, T. Natarajan, and K. Rao, "Discrete Cosine Transform," IEEE Trans. Comput., vol. C-23, no. 1, pp. 90-93, Jan 1974.
[2] P. Meher, S. Y. Park, B. Mohanty, K. S. Lim, and C. Yeo, "Efficient Integer DCT Architectures for HEVC," IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 1, pp. 168-178, Jan 2014.
[3] S. Chatterjee and K. Sarawadekar, "An optimized architecture of HEVC core transform using real-valued DCT coefficients," IEEE Trans. Circuits Syst. II, vol. 65, no. 12, pp. 2052-2056, Dec 2018.
[4] S. Chatterjee and K. Sarawadekar, "WHT and matrix decomposition based approximated IDCT architecture for HEVC," IEEE Trans. Circuits Syst. II, vol. 66, no. 6, pp. 1043-1047, Jun 2019.
[5] M. Jridi, A. Alfalou, and P. K. Meher, "Efficient approximate core transform and its reconfigurable architectures for HEVC," Journal of Real-Time Image Processing, pp. 1-11, Apr 2018.
[6] R. S. Oliveira, R. J. Cintra, F. M. Bayer, T. L. T. da Silveira, A. Madanayake, and A. Leite, "Low-complexity 8-point DCT approximation based on angle similarity for image and video coding," Multidimensional Systems and Signal Processing, pp. 1-32, Jul 2018.
[7] S. B. Jdidia, M. Jridi, F. Belghith, and N. Masmoudi, "Low-complexity algorithm using DCT approximation for POST-HEVC standard," in Proceedings of SPIE, 2018, pp. 1-7.
[8] G. J. Sullivan, J. R. Ohm, W. J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649-1668, Dec 2012.
[9] A. Saxena and F. C. Fernandes, "DCT/DST-Based Transform Coding for Intra Prediction in Image/Video Coding," IEEE Trans. Image Process., vol. 22, no. 10, pp. 3974-3981, Oct 2013.
[10] T. Biatek, V. Lorcy, P. Castel, and P. Philippe, "Low-complexity adaptive multiple transforms for post-HEVC video coding," in Picture Coding Symposium, 2016, pp. 1-5.
[11] X. Zhao, J. Chen, M. Karczewicz, X. Li, and C. Wei-Jung, "Enhanced Multiple Transform for Video Coding," in Proc. 2016 Data Compression Conference, 2016, pp. 73-82.
[12] V. Britanak, P. C. Yip, and K. R. Rao, Discrete Cosine and Sine Transforms: General Properties, Fast Algorithms and Integer Approximations. Elsevier, Sep. 2006.
[13] W. Park, B. Lee, and M. Kim, "Fast computation of integer DCT-V, DCT-VIII, and DST-VII for video coding," IEEE Trans. Image Process., vol. 28, no. 12, pp. 5839-5851, Dec 2019.
[14] Y. A. Reznik, "Relationship between DCT-II, DCT-VI, and DST-VII transforms," in Proc. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 5642-5646.
[15] M. Masera, M. Martina, and G. Masera, "Odd Type DCT/DST for Video Coding: Relationships and Low-Complexity Implementations," in IEEE International Workshop on Signal Processing Systems (SiPS), 2017, pp. 1-6.
[16] M. Puschel and J. M. F. Moura, "Algebraic Signal Processing Theory: Cooley-Tukey Type Algorithms for DCTs and DSTs," IEEE Trans. Signal Process., vol. 56, no. 4, pp. 1502-1521, April 2008.
[17] X. Shao and S. G. Johnson, "Type-II/III DCT/DST algorithms with reduced number of arithmetic operations," Signal Processing, vol. 88, no. 6, pp. 1553-1564, 2008.
[18] M. Martina. (2020, Apr.) Modified JEM. [Online]. Available: http://personal.det.polito.it/maurizio.martina/material/JEM/HM-16.6-JEM-7.2_mod.tar.gz
[19] G. Bjontegaard, Calculation of Average PSNR Differences Between RD Curves, document VCEG-M33, ITU-T SG16/Q6, Austin, TX, Apr 2001.
[20] J. Boyce, K. Suehring, X. Li, and V. Seregin, JVET common test conditions and software reference configurations, Apr. 2018.
[21] M. Masera, G. Masera, and M. Martina, "An Area-Efficient Variable-Size Fixed-Point DCT Architecture for HEVC Encoding," IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 1, pp. 232-242, 2020.
[22] J. Goebel, G. Paim, L. Agostini, B. Zatt, and M. Porto, "An HEVC Multi-Size DCT Hardware with Constant Throughput and Supporting Heterogeneous CUs," in IEEE International Symposium on Circuits and Systems, 2016, pp. 2202-2205.
[23] M. Masera, M. Martina, and G. Masera, "Adaptive Approximated DCT Architectures for HEVC," IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 12, pp. 2714-2725, Dec 2017.


[^0]:    The authors are with the Dipartimento di Elettronica e Telecomunicazioni, Politecnico di Torino.

