Multiplierless Approximate 4-point DCT VLSI Architectures for Transform
  Block Coding by Bayer, F. M. et al.
arXiv:1405.0413v1 [cs.AR] 2 May 2014
F. M. Bayer∗ R. J. Cintra† A. Madanayake‡ U. S. Potluri‡
Abstract
Two multiplierless algorithms are proposed for the 4×4 approximate DCT for transform coding in digital video.
Computational architectures for 1-D/2-D realisations are implemented using Xilinx FPGA devices. CMOS synthesis
at the 45 nm node indicates real-time operation at 1 GHz, yielding 4×4 block rates of 125 MHz at less than 120 mW
of dynamic power consumption.
1 Introduction
Video and multimedia processing systems based on signal and image compression, such as the high efficiency video coding
(HEVC) and H.265 reconfigurable video codecs, require 2-D transform block coding for block sizes N×N, where
N ∈ {4,8,16,32,64} [1]. The transform coding stage requires algorithms for the N-point discrete cosine transform
(DCT) of types II and IV. The associated transformation matrices are defined, respectively, according to [2]:
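$$
[C_{\mathrm{II}}]_{(m,n)} = \sqrt{\frac{2}{N}} \cdot \alpha_m \cdot \cos\left[ (m-1) \cdot \frac{\pi \left(n - \frac{1}{2}\right)}{N} \right],
$$
$$
[C_{\mathrm{IV}}]_{(m,n)} = \sqrt{\frac{2}{N}} \cdot \cos\left[ \left(m - \frac{1}{2}\right) \cdot \left(n - \frac{1}{2}\right) \cdot \frac{\pi}{N} \right],
$$
where m, n = 1, 2, ..., N, α1 = 1/√2, and αm = 1 for m > 1.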
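As a quick numerical check of these definitions, the following minimal sketch builds both exact matrices for N = 4 and forms their Gram matrices. It assumes the usual convention in which the row index m is the frequency index, so that the first row of the DCT-II matrix is constant. The function names `dct_ii`, `dct_iv`, and `gram` are illustrative choices, not part of the original work.

```python
import math

def dct_ii(N):
    """Exact N-point DCT-II matrix; m (rows) and n (columns) are 1-based."""
    matrix = []
    for m in range(1, N + 1):
        alpha = 1 / math.sqrt(2) if m == 1 else 1.0
        matrix.append([
            math.sqrt(2 / N) * alpha * math.cos((m - 1) * math.pi * (n - 0.5) / N)
            for n in range(1, N + 1)
        ])
    return matrix

def dct_iv(N):
    """Exact N-point DCT-IV matrix; m (rows) and n (columns) are 1-based."""
    return [
        [math.sqrt(2 / N) * math.cos((m - 0.5) * (n - 0.5) * math.pi / N)
         for n in range(1, N + 1)]
        for m in range(1, N + 1)
    ]

def gram(T):
    """Return the product T * T^T as a nested list."""
    return [[sum(a * b for a, b in zip(r, s)) for s in T] for r in T]

# Print both 4-point matrices; gram(T) should be close to the identity.
for name, T in (("C_II", dct_ii(4)), ("C_IV", dct_iv(4))):
    print(name)
    for row in T:
        print("  ".join(f"{v:+.4f}" for v in row))
```

Both Gram matrices should agree with the 4×4 identity up to rounding, which is the orthogonality property that the approximations discussed in this letter aim to retain.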
In this letter, our goal is to propose multiplication-free approximations for the 4-point DCT-II and DCT-IV, as well
as their fast algorithms. We also aim at VLSI realisations of both 1-D and 2-D versions of the derived approximate
transforms, while maintaining high numerical accuracy and low computational complexity.
∗F. M. Bayer is with the Departamento de Estatística and Laboratório de Ciências Espaciais de Santa Maria (LACESM), Universidade Federal
de Santa Maria, RS, Brazil, E-mail: bayer@ufsm.br
†R. J. Cintra is with the Signal Processing Group, Departamento de Estatística, Universidade Federal de Pernambuco, PE, Brazil, E-mail:
rjdsc@stat.ufpe.org
‡A. Madanayake and U. S. Potluri are with the ECE, The University of Akron, Akron, OH, USA, E-mail: arjuna@uakron.edu
2 Optimisation and orthogonalization
Let MP(4) be the set of all 4×4 matrices whose entries are defined over P = {−1,0,1}. In this set, all matrices
represent multiplierless transformations. Our goal is to find matrices in MP(4) that satisfactorily approximate CII
and CIV.
Therefore, we propose the following multivariate non-linear optimisation problem over MP(4):
$$
C_k^* = \arg\min_{A \in M_P(4)} \operatorname{error}(A, C_k), \quad k \in \{\mathrm{II}, \mathrm{IV}\}, \tag{1}
$$
where C∗k are the optimal matrices and error(·, ·) is an error measure between a given candidate matrix and the exact
matrices CII and CIV.
Let hi[n] be the discrete signal formed by the ith row of a given matrix T and the discrete-time Fourier transform
(DTFT) of hi[n] be denoted by Hi(ω;T). As discussed in [3,4], we adopted the total error energy as the error measure.
This particular measure is defined as follows:
$$
\epsilon(A, C_k) = \sum_{m=1}^{4} \int_0^{\pi} \left| H_m(\omega; A) - H_m(\omega; C_k) \right|^2 \, d\omega,
$$
for k ∈ {II, IV}. In other words, ε(A,Ck) quantifies the total energy of the error between A and Ck in the DTFT
domain, when the entries of each matrix row are interpreted as filter coefficients [3,4]. This quantity can be computed
numerically by standard quadrature methods [5].
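To illustrate how this measure could be evaluated, the sketch below integrates the squared DTFT difference with a composite Simpson rule, used here only as a simple stand-in for the quadrature routines of [5]. It also evaluates π times the squared Frobenius norm of A − C, which equals the same integral for real-valued rows by Parseval's relation. The function names and the one-row test matrices are illustrative assumptions.

```python
import cmath
import math

def dtft(h, w):
    """DTFT of the finite sequence h[0], h[1], ... at angular frequency w."""
    return sum(c * cmath.exp(-1j * w * n) for n, c in enumerate(h))

def total_error_energy(A, C, steps=2000):
    """Composite Simpson estimate of sum_m int_0^pi |H_m(w;A) - H_m(w;C)|^2 dw."""
    if steps % 2:
        raise ValueError("steps must be even for Simpson's rule")
    dw = math.pi / steps
    total = 0.0
    for a_row, c_row in zip(A, C):
        diff = [a - c for a, c in zip(a_row, c_row)]
        acc = 0.0
        for k in range(steps + 1):
            weight = 1 if k in (0, steps) else (4 if k % 2 else 2)
            acc += weight * abs(dtft(diff, k * dw)) ** 2
        total += acc * dw / 3
    return total

def closed_form(A, C):
    """pi * ||A - C||_F^2, valid for real entries (Parseval's relation)."""
    return math.pi * sum((a - c) ** 2
                         for a_row, c_row in zip(A, C)
                         for a, c in zip(a_row, c_row))

# Hypothetical one-row example: an unscaled row of ones against the row 0.5 * ones.
A = [[1, 1, 1, 1]]
C = [[0.5, 0.5, 0.5, 0.5]]
print(total_error_energy(A, C), closed_form(A, C))
```

The two printed values should agree to within rounding error, so the closed form offers a convenient cross-check on any quadrature-based implementation.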
As an additional constraint to (1), we impose that the matrix A ·A⊤ must be a diagonal matrix to ensure that
orthogonality can be achieved in the obtained approximations [6]. The resulting constrained optimisation problem is
algebraically intractable, so we resorted to an exhaustive computational search.
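One possible form of such a search is sketched below. It rests on two assumptions that are not stated in the text: the error is evaluated after each candidate row is scaled to unit norm (a choice that reproduces the error figures reported in Table 1), and the integral is replaced by its Parseval closed form, π times the squared Frobenius norm of the difference. All function names are hypothetical, and the routine returns every minimiser it finds because ties are possible.

```python
import itertools
import math

def dct_ii(N=4):
    """Exact DCT-II with the row index as frequency index (assumed convention)."""
    return [[math.sqrt(2 / N) * (1 / math.sqrt(2) if m == 1 else 1.0)
             * math.cos((m - 1) * math.pi * (n - 0.5) / N)
             for n in range(1, N + 1)] for m in range(1, N + 1)]

def dct_iv(N=4):
    """Exact DCT-IV matrix."""
    return [[math.sqrt(2 / N) * math.cos((m - 0.5) * (n - 0.5) * math.pi / N)
             for n in range(1, N + 1)] for m in range(1, N + 1)]

def exhaustive_search(C, tol=1e-9):
    """Search matrices over {-1,0,1} with mutually orthogonal nonzero rows.

    Returns (minimum error, list of minimising matrices)."""
    N = len(C)
    candidates = [v for v in itertools.product((-1, 0, 1), repeat=N) if any(v)]
    best = {"error": math.inf, "matrices": []}

    def row_error(v, target):
        # Error contribution of row v after scaling it to unit norm.
        scale = 1.0 / math.sqrt(sum(x * x for x in v))
        return math.pi * sum((scale * x - t) ** 2 for x, t in zip(v, target))

    def extend(rows, error):
        if error > best["error"] + tol:
            return  # prune: row errors are non-negative
        if len(rows) == N:
            if error < best["error"] - tol:
                best["error"], best["matrices"] = error, []
            best["matrices"].append(rows)
            return
        target = C[len(rows)]
        for v in candidates:
            # Keep A * A^T diagonal: v must be orthogonal to all chosen rows.
            if all(sum(x * y for x, y in zip(v, r)) == 0 for r in rows):
                extend(rows + [v], error + row_error(v, target))

    extend([], 0.0)
    return best["error"], best["matrices"]

for name, C in (("DCT-II", dct_ii()), ("DCT-IV", dct_iv())):
    error, matrices = exhaustive_search(C)
    print(name, round(error, 3), len(matrices), "minimiser(s)")
```

Because the rows must be mutually orthogonal, the recursion discards most of the 3^16 raw candidates early, which keeps the search tractable even in an interpreted language.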
3 Proposed 4-point DCT approximations
By solving (1), we obtained the following new DCT approximations:
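$$
C^*_{\mathrm{II}} =
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & 0 & 0 & -1 \\
1 & -1 & -1 & 1 \\
0 & -1 & 1 & 0
\end{bmatrix}
\quad \text{and} \quad
C^*_{\mathrm{IV}} =
\begin{bmatrix}
1 & 1 & 1 & 0 \\
1 & 0 & -1 & -1 \\
1 & -1 & 0 & 1 \\
0 & -1 & 1 & -1
\end{bmatrix}.
$$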
[Figure 1: Signal flow graph for proposed transforms. (a) DCT-II approximation; (b) DCT-IV approximation.]

Although possessing very low complexity, these matrices are not orthogonal. In several contexts, such as image
processing for coding, orthogonality is often a desirable property [2]. Adopting the orthogonalization methods detailed
in [6], new orthogonal matrices ĈII and ĈIV can be derived based on C∗II and C∗IV, respectively. These orthogonal
matrices are given by:
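$$
\hat{C}_{\mathrm{II}} = D_{\mathrm{II}} \cdot C^*_{\mathrm{II}} \quad \text{and} \quad \hat{C}_{\mathrm{IV}} = D_{\mathrm{IV}} \cdot C^*_{\mathrm{IV}},
$$
where
$$
D_{\mathrm{II}} = \sqrt{\left[ C^*_{\mathrm{II}} \cdot (C^*_{\mathrm{II}})^\top \right]^{-1}} \quad \text{and} \quad D_{\mathrm{IV}} = \sqrt{\left[ C^*_{\mathrm{IV}} \cdot (C^*_{\mathrm{IV}})^\top \right]^{-1}}.
$$
Explicitly, we obtain
$$
D_{\mathrm{II}} = \operatorname{diag}\left( \frac{1}{2}, \frac{1}{\sqrt{2}}, \frac{1}{2}, \frac{1}{\sqrt{2}} \right) \quad \text{and} \quad D_{\mathrm{IV}} = \frac{1}{\sqrt{3}} \cdot I_4,
$$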
where I4 is the identity matrix of size 4. In the image compression context, the scaling matrices DII and DIV may not
introduce any computational overhead, because they can be merged into the quantisation step, as described
in [4, 7–9].
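The scaling matrices can be checked numerically. The minimal sketch below computes the diagonal of D from A·A⊤ for both low-complexity matrices and confirms that the scaled matrices are orthogonal. The helper `quantiser_scale` is a hypothetical illustration of how the factors might be folded into a quantiser for a 2-D transform; it relies only on the identity Ĉ·X·Ĉ⊤ = D·(C∗·X·C∗⊤)·D, which scales coefficient (i, j) by di·dj.

```python
import math

C_II_STAR = [[1, 1, 1, 1], [1, 0, 0, -1], [1, -1, -1, 1], [0, -1, 1, 0]]
C_IV_STAR = [[1, 1, 1, 0], [1, 0, -1, -1], [1, -1, 0, 1], [0, -1, 1, -1]]

def gram(A):
    """Return A * A^T as a nested list."""
    return [[sum(a * b for a, b in zip(r, s)) for s in A] for r in A]

def scaling_diagonal(A):
    """Diagonal of D = sqrt((A A^T)^-1); requires A * A^T to be diagonal."""
    G = gram(A)
    n = len(A)
    if any(G[i][j] != 0 for i in range(n) for j in range(n) if i != j):
        raise ValueError("A * A^T is not diagonal")
    return [1 / math.sqrt(G[i][i]) for i in range(n)]

def orthogonalise(A):
    """Return D * A, whose rows are orthonormal."""
    d = scaling_diagonal(A)
    return [[d[i] * x for x in row] for i, row in enumerate(A)]

def quantiser_scale(d):
    """Per-coefficient factor d_i * d_j for a 2-D transform (illustrative)."""
    return [[di * dj for dj in d] for di in d]

print(scaling_diagonal(C_II_STAR))  # diagonal of D_II
print(scaling_diagonal(C_IV_STAR))  # diagonal of D_IV
```

Since the factors di·dj depend only on the coefficient position, they could be precomputed once and absorbed into the quantisation table, leaving the transform itself free of multiplications.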
The signal flow graphs for C∗II and C∗IV are shown in Fig. 1. We note that the C∗II and C∗IV transformations require only
6 and 8 additions, respectively. Multiplications and bit-shifting operations are totally absent. The resulting approximations
ĈII and ĈIV are very close to the respective exact DCTs and offer extremely low complexities. In Table 1, we show
the error measure and arithmetic complexity for the proposed transforms, the exact DCT computation [2], and the
well-known signed DCT [3].
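The stated addition counts can be reproduced with a small sketch. The factorisations below are one possible arrangement consistent with the matrices C∗II and C∗IV and with the counts of 6 and 8 additions; they are an assumption and are not claimed to match the flow graphs of Fig. 1 exactly. The `Tracked` wrapper is a hypothetical helper that counts additions and subtractions.

```python
class Tracked:
    """Numeric wrapper that counts every addition or subtraction."""
    count = 0

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        Tracked.count += 1
        return Tracked(self.value + other.value)

    def __sub__(self, other):
        Tracked.count += 1
        return Tracked(self.value - other.value)

def fast_cii(x0, x1, x2, x3):
    """Butterfly evaluation of C*_II: two shared sums, then four outputs."""
    a = x0 + x3
    b = x1 + x2
    return a + b, x0 - x3, a - b, x2 - x1

def fast_civ(x0, x1, x2, x3):
    """Direct evaluation of C*_IV: each output combines three inputs."""
    return (x0 + x1) + x2, (x0 - x2) - x3, (x0 - x1) + x3, (x2 - x1) - x3

def count_additions(algorithm):
    """Run the algorithm on tracked inputs and report the operation count."""
    Tracked.count = 0
    algorithm(*(Tracked(v) for v in (1, 2, 3, 4)))
    return Tracked.count

print(count_additions(fast_cii), count_additions(fast_civ))
```

The same functions also accept plain integers, so `fast_cii(3, -7, 12, 5)` can be compared directly against the matrix-vector product with C∗II.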
Table 1: Total error energy and arithmetic complexity analysis

                                             Complexity
Method                      Error energy   Add.   Mult.   Total
Exact 4-point DCT-II [2]       0.000         8      4      12
4-point Signed DCT-II [3]      0.957         8      0       8
Proposed ĈII                   0.957         6      0       6
Exact 4-point DCT-IV [2]       0.000        12      8      20
4-point Signed DCT-IV [3]      2.359        10      0      10
Proposed ĈIV                   0.838         8      0       8
Table 2: Resource consumption on Xilinx XC6VSX475T-2FF1156

Proposed approx.   CLB    FF    LUT   Slices   Fmax (MHz)   Dp (W)
1-D DCT-II          56    76     92       35        743.5    0.535
1-D DCT-IV          76   132    128       52        735.3    0.574
2-D DCT-II         166   408    330      108        704.2    0.884
2-D DCT-IV         210   528    472      148        689.2    0.921
4 FPGA prototypes
The approximate DCTs were realised as an architecture for the 4-point 1-D transforms and then extended to the 4×4 2-D
transformation. The inputs were assumed at 8-bit resolution. Rapid prototypes were realised on a Xilinx Virtex-6 field
programmable gate array (FPGA) device and tested to ensure correct on-chip functionality. The results concerning
the consumption of configurable logic blocks (CLB), flip-flops (FF), look-up tables (LUT), and slices are shown in
Table 2. The maximum operating frequency (Fmax) and dynamic power consumption (Dp) are also displayed.
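A software model of the 2-D operation can be useful when checking hardware outputs. The sketch below assumes the usual separable form Y = T·X·T⊤, computed as a column pass followed by a row pass; the text does not spell out the 2-D data path, so this arrangement and the helper names are assumptions. The function `max_gain` gives the largest sum of absolute row entries, which bounds the growth of the output magnitude per 1-D pass.

```python
C_II_STAR = [[1, 1, 1, 1], [1, 0, 0, -1], [1, -1, -1, 1], [0, -1, 1, 0]]
C_IV_STAR = [[1, 1, 1, 0], [1, 0, -1, -1], [1, -1, 0, 1], [0, -1, 1, -1]]

def transform_1d(T, x):
    """Matrix-vector product T * x using integer arithmetic only."""
    return [sum(t * v for t, v in zip(row, x)) for row in T]

def transform_2d(T, X):
    """Separable 2-D transform Y = T * X * T^T of a square block X."""
    ZT = [transform_1d(T, col) for col in zip(*X)]  # ZT[j] = T * X[:, j]
    Z = [list(row) for row in zip(*ZT)]             # Z = T * X
    return [transform_1d(T, row) for row in Z]      # Y = Z * T^T

def max_gain(T):
    """Largest sum of absolute row entries: worst-case growth per 1-D pass."""
    return max(sum(abs(t) for t in row) for row in T)

# Hypothetical 8-bit test block: a flat block at the maximum unsigned value.
block = [[255] * 4 for _ in range(4)]
for row in transform_2d(C_II_STAR, block):
    print(row)
print(max_gain(C_II_STAR), max_gain(C_IV_STAR))
```

Under these assumptions the output magnitude of the 2-D transform is bounded by the square of `max_gain` times the largest input magnitude, which indicates how many extra bits the internal word length would need beyond the 8-bit inputs.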
The register transfer language (RTL) code corresponding to the FPGA-verified designs was targeted to a 45 nm
CMOS standard cell process using Cadence Encounter. The CMOS designs were realised up to the synthesis and
place-and-route levels, leading to the estimated results in Table 3. The area-time complexities AT and AT² were adopted and
measured in µm²·ns and µm²·ns², respectively.
Table 3: Resource consumption for 45 nm CMOS

Proposed approx.   ASIC gates   Area (µm²)   Fmax (GHz)   Dp (mW)      AT     AT²
1-D DCT-II                849       3386.9         1.10      6.31    3160    2948
1-D DCT-IV               1207       4870.4         1.00      8.62    4846    4822
2-D DCT-II               7400      31217.8         0.95     59.33    7770    8159
2-D DCT-IV              13770      59052.5         0.94    115.66   14596   15472
5 Conclusion
Numerical optimisation methods have led to 4-point approximations for the DCT-II and DCT-IV. Such matrices
are tailored for minimal computational complexity and are adequate for computing realisations linked to coding
operations with applications in digital video and multimedia. Fast algorithms were derived and the associated physical
realisations do not require VLSI area- and power-intensive multiplier circuits. Both 1-D and 2-D realisations were
proposed with FPGA prototypes for architecture validation and CMOS synthesis results at the 45 nm node. Results
indicate a real-time block rate of 125 MHz for processing 4×4 blocks at a 1 GHz clock frequency.
Acknowledgments
We thank The College of Engineering at UA, CNPq, FACEPE, and FAPERGS for the partial financial support.
References
[1] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC)
standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, pp. 1649–1668, Dec. 2012.
[2] V. Britanak, P. Yip, and K. R. Rao, Discrete Cosine and Sine Transforms. Academic Press, 2007.
[3] T. I. Haweel, “A new square wave transform based on the DCT,” Signal Processing, vol. 82, pp. 2309–2319,
2001.
[4] R. J. Cintra and F. M. Bayer, “A DCT approximation for image compression,” IEEE Signal Processing Letters,
vol. 18, pp. 579–582, Oct. 2011.
[5] R. Piessens, E. deDoncker-Kapenga, C. Uberhuber, and D. Kahaner, Quadpack: a Subroutine Package for
Automatic Integration. Springer-Verlag, 1983.
[6] R. J. Cintra, “An integer approximation method for discrete sinusoidal transforms,” Journal of Circuits, Systems,
and Signal Processing, vol. 30, no. 6, pp. 1481–1501, 2011.
[7] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, “Low-complexity 8×8 transform for image compression,”
Electronics Letters, vol. 44, pp. 1249–1250, Sept. 2008.
[8] F. M. Bayer and R. J. Cintra, “DCT-like transform for image compression requires 14 additions only,” Electronics
Letters, vol. 48, pp. 919–921, July 2012.
[9] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, “A low-complexity parametric transform for image
compression,” in Proceedings of the 2011 IEEE International Symposium on Circuits and Systems, 2011.
