VLSI Design of a Fast Pipelined 8x8 Discrete Cosine Transform by Zabidi, Nurulnajah Mohd & Rahman, Ab Al-Hadi Ab
International Journal of Electrical and Computer Engineering (IJECE)





Institute of Advanced Engineering and Science 
www.iaesjournal.com
 
VLSI Design of a Fast Pipelined 8x8 Discrete Cosine Transform
Nurulnajah Mohd Zabidi and Ab Al-Hadi Ab Rahman
Faculty of Electrical Engineering, Universiti Teknologi Malaysia, Malaysia
Article Info
Article history:
Received Jan 4, 2017
Revised Mar 18, 2017







This paper presents a Very Large Scale Integration (VLSI) design and implementation of a
fixed-point 8x8 multiplierless Discrete Cosine Transform (DCT) using the ISO/IEC 23002-2
algorithm. The standard DCT algorithm, which is mainly used in image and video compression
technology, consists of only adders, subtractors, and shifters, making it efficient for
hardware implementation. The VLSI implementation of the algorithm given in this paper further
enhances the performance of the transform unit. Furthermore, circuit pipelining has been
applied to the base design of the DCT, which significantly improves performance by shortening
the longest path of the non-pipelined design. The DCT has been implemented with a semi-custom
VLSI design methodology in the TSMC 0.13um process technology. Results show that our DCT
designs can run at up to around 1.7 Giga pixels/s, which is well above the rate required for
real-time ultra-high definition 8K video.
Copyright © 2017 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Ab Al-Hadi Ab Rahman
Faculty of Electrical Engineering, Universiti Teknologi Malaysia




1. INTRODUCTION
Image compression reduces data redundancy in an image so that the image requires less storage.
The Discrete Cosine Transform (DCT) is widely used for lossy compression. Most image and video coding
standards, such as Joint Photographic Experts Group (JPEG) and Moving Picture Experts Group (MPEG), use the DCT as
the standard transform coding scheme. High Efficiency Video Coding (HEVC), which was finalized in 2013, also
uses the DCT as its main transform unit [1].
The DCT basis function is the cosine, and multiplication and addition are the main arithmetic operations
involved. Extensive research on the DCT in the past few years has produced a variety of
algorithms, such as the Arai DCT scheme, Wang factorization, the Lee DCT for power-of-two block lengths, the Loeffler
algorithm, and the Feig-Winograd factorization [2]. These algorithms have been used in practical applications. In
recent image processing technology, various hardware implementations of the DCT use the Arai DCT scheme [3]. It
requires only five multiplications and twenty-nine additions, which is fewer arithmetic operations than the other
algorithms listed. For MPEG technology, the International Organization for Standardization (ISO) released an optimized
fixed-point multiplierless version of the DCT algorithm, suitable for image and video compression. This standard,
ISO/IEC 23002-2 [4], is described and implemented in the present work.
VLSI designs of the DCT can be found in numerous articles, with an overview given in [5]. For comparison
purposes, we have analyzed three similar designs. The work by Madanayake et al. [6] presents a VLSI architecture
of the DCT using the Arai DCT scheme. It proposes a fast algorithm that reduces the number of integer channels.
The design is implemented in 45nm technology. The work by Wahid et al. [7] proposes an area-efficient fixed-point
DCT architecture implemented in 0.18um CMOS technology. Another interesting work is by Fu et al. [8], where a
low-power implementation is proposed based on an algebraic integer encoding technique. This work also uses 0.18um
CMOS technology. Performance results for these works are given in the results section.
The present paper on the other hand, describes the semi-custom Very Large Scale Integration (VLSI) design





IJECE ISSN: 2088-8708 1431
[9]. The design has also been optimized by applying circuit pipelining.
1.1. DCT Theoretical Background







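The 1D-DCT of an N-point input sequence x(n) is defined as

$$X(k) = \alpha(k) \sum_{n=0}^{N-1} x(n) \cos\left[\frac{(2n+1)k\pi}{2N}\right], \quad 0 \le k \le N-1 \qquad (1)$$

where

$$\alpha(k) = \begin{cases} \sqrt{1/N}, & k = 0 \\ \sqrt{2/N}, & 1 \le k \le N-1 \end{cases} \qquad (2)$$

For N = 8, the transform can be written in matrix form as

$$\begin{bmatrix} X(0) \\ X(1) \\ X(2) \\ X(3) \\ X(4) \\ X(5) \\ X(6) \\ X(7) \end{bmatrix} = \frac{1}{2} \begin{bmatrix}
c_4 & c_4 & c_4 & c_4 & c_4 & c_4 & c_4 & c_4 \\
c_1 & c_3 & c_5 & c_7 & -c_7 & -c_5 & -c_3 & -c_1 \\
c_2 & c_6 & -c_6 & -c_2 & -c_2 & -c_6 & c_6 & c_2 \\
c_3 & -c_7 & -c_1 & -c_5 & c_5 & c_1 & c_7 & -c_3 \\
c_4 & -c_4 & -c_4 & c_4 & c_4 & -c_4 & -c_4 & c_4 \\
c_5 & -c_1 & c_7 & c_3 & -c_3 & -c_7 & c_1 & -c_5 \\
c_6 & -c_2 & c_2 & -c_6 & -c_6 & c_2 & -c_2 & c_6 \\
c_7 & -c_5 & c_3 & -c_1 & c_1 & -c_3 & c_5 & -c_7
\end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix} \qquad (3)$$

with

$$c_i = \cos\left(\frac{i\pi}{16}\right), \quad i = (2n+1)k \qquad (4)$$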
Equation (3) shows that applying the matrix column-wise produces the 1D-DCT. Applying it again
row-wise produces the 2D-DCT. Because the DCT is computationally intensive, many efficient algorithms
have been proposed in the literature, as mentioned above, including the ISO/IEC 23002-2 algorithm used in the present work. The
algorithm is given in Figure 1. The inputs are ind0 to ind7, each 32 bits wide. Several variables are
defined for intermediate operations, and the final results are stored in the outputs outd0 to outd7.
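The column-then-row computation described above can be illustrated with a short floating-point sketch. It is not the ISO/IEC 23002-2 fixed-point algorithm of Figure 1; it assumes the orthonormal DCT-II scaling, under which the 8-point matrix entries equal one half of the cosine constants c_i = cos(i*pi/16) of Equation (3). The function names `dct_matrix`, `dct_1d`, and `dct_2d` are hypothetical.

```python
import math

N = 8  # block length used throughout the paper


def dct_matrix(n=N):
    """Return the n x n orthonormal DCT-II matrix (row k, column j)."""
    rows = []
    for k in range(n):
        # Assumed scaling: sqrt(1/n) for the DC row, sqrt(2/n) otherwise.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows


def dct_1d(x, c=None):
    """1D-DCT of a sequence x as a plain matrix-vector product."""
    c = c or dct_matrix(len(x))
    return [sum(c[k][j] * x[j] for j in range(len(x))) for k in range(len(x))]


def transpose(m):
    return [list(r) for r in zip(*m)]


def dct_2d(block):
    """2D-DCT: 1D-DCT on every column, then on every row of the result."""
    cols = [dct_1d(col) for col in transpose(block)]  # one result per column
    col_pass = transpose(cols)                        # back to row-major order
    return [dct_1d(row) for row in col_pass]
```

For a constant 8x8 block of ones, `dct_2d` concentrates all of the energy in the DC coefficient `out[0][0]`, which is the property that makes the transform useful for compression.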
Figure 1. The ISO/IEC 23002-2 1D-DCT algorithm
One of the critical components in the DCT algorithm is the PMUL operation, given in Figure 2. The PMUL
performs polynomial multiplication. Equation (5) shows an example of a polynomial equation, where the highest degree
in this equation is 11. The exponents can be eliminated by replacing them with shift-right operations.
Equation (6) shows the result of converting every exponent in Equation (5) into an arithmetic shift right.
$$y = (y^{3} - y^{7}) + ((y^{3} - y^{7}) - y^{11})^{1} \qquad (5)$$
$$y = ((y \gg 3) - (y \gg 7)) + (((y \gg 3) - (y \gg 7)) - (y \gg 11)) \gg 1 \qquad (6)$$
Figure 2. PMUL components used in the ISO/IEC 23002-2 1D-DCT algorithm
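As a minimal sketch of the shift-only arithmetic in Equation (6), the expression can be evaluated with integer right shifts alone. The function name `shift_add_eq6` and the intermediate names `y2` and `y3` are hypothetical, and the sketch covers only this one expression, not the complete PMUL components of Figure 2. Summing the shift weights, 2^-3 - 2^-7 + (2^-3 - 2^-7 - 2^-11)/2, gives 719/4096, so for large inputs the result approximates a multiplication by about 0.1755 without using a multiplier.

```python
def shift_add_eq6(y: int) -> int:
    """Evaluate Equation (6) using only shifts, subtractions, and one addition."""
    # Python's >> on int is an arithmetic (sign-preserving, floor) shift.
    y2 = (y >> 3) - (y >> 7)   # the shared term (y >> 3) - (y >> 7)
    y3 = y2 - (y >> 11)        # the inner bracket of the second term
    return y2 + (y3 >> 1)      # the final shift by 1 applies to the whole bracket


# Weight that the shifts approximate, derived by summing the powers of two.
EQUIVALENT_FACTOR = 719 / 4096
```

For inputs that are multiples of 4096 no bits are lost and the result equals `y * EQUIVALENT_FACTOR` exactly; for other inputs each shift truncates, so the result can differ slightly from the ideal product.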
2. PIPELINE CONCEPT
The main idea of circuit pipelining is to split a computation into smaller stages, which improves
performance by shortening the combinational critical path. Figure 3 shows the difference between the non-pipelined and
pipelined structures in terms of combinational logic. A non-pipelined circuit consists of a single block of combinational
logic between an input and an output. A pipelined circuit consists of combinational logic that has been partitioned into
smaller portions connected by registers. Essentially, pipelining allows the design to run at a higher operating frequency at a
negligible cost in latency caused by filling the pipeline stages.
IJECE Vol. 7, No. 3, June 2017: 1430 – 1435
Figure 3. Non-pipeline and pipeline concept
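The trade-off between clock period and fill latency can be sketched with a simple timing model. It assumes an ideally balanced split of the combinational delay and a fixed per-stage register overhead; the function name `total_time_ns`, its parameters, and the example delays are hypothetical and are not measurements from this work.

```python
def total_time_ns(n_items, logic_delay_ns, stages=1, reg_overhead_ns=0.0):
    """Time to process n_items through a pipeline with the given stage count.

    Assumes the logic delay divides evenly across the stages and that each
    stage adds a fixed register overhead to the clock period.
    """
    period = logic_delay_ns / stages + reg_overhead_ns
    cycles = stages + n_items - 1  # (stages - 1) fill cycles, then one result per cycle
    return period * cycles


# Hypothetical 8 ns block, split into 2 stages with 0.5 ns register overhead.
single = total_time_ns(1, 8.0, stages=2, reg_overhead_ns=0.5)     # latency of one item
stream = total_time_ns(1000, 8.0, stages=2, reg_overhead_ns=0.5)  # long input stream
```

Under these assumptions a single item takes slightly longer through the pipelined circuit, while a long stream finishes in close to half the time of the non-pipelined circuit.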
3. PROPOSED DCT HARDWARE ARCHITECTURE
The ISO/IEC 23002-2 algorithm presented in Figure 1 is translated into a hardware architecture via a dataflow
graph (DFG), shown in Figure 4. There are a total of 54 intermediate variables, which translate into wires in the
non-pipelined implementation. Like the inputs and outputs, the wires are 32 bits wide. The algorithm also
uses 27 subtractors, 19 adders, and 24 shifters. From the DFG, it can be seen that the longest path contains 7 arithmetic
operators. The design can therefore be split for pipelining to reduce the path length. Essentially, the DFG is
partitioned into stages, and at each partition boundary, registers are added to hold intermediate data. It should
be noted that a very small initial delay is expected while the pipeline registers fill. The optimum partitioning boundary can
be found using the algorithms proposed in [11].
Figure 4. Data Flow Graph of Non-pipeline DCT
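To illustrate how a partition boundary affects the longest path, the sketch below searches for the best single cut in a linear chain of operators. This is a deliberate simplification: the real DFG is a graph rather than a chain, the unit delays are assumed, and the exhaustive search shown here is not one of the algorithms from [11]. The function name `best_two_stage_cut` is hypothetical.

```python
def best_two_stage_cut(delays):
    """Return (cut, bottleneck) for the best 2-stage split of an operator chain.

    delays: per-operator delays along the path, in order (at least two entries).
    cut: number of operators placed in the first stage.
    bottleneck: delay of the slower stage, which sets the clock period.
    """
    best = None
    for cut in range(1, len(delays)):
        bottleneck = max(sum(delays[:cut]), sum(delays[cut:]))
        if best is None or bottleneck < best[1]:
            best = (cut, bottleneck)
    return best


# A 7-operator path with assumed unit delays, matching the longest path length above.
cut, bottleneck = best_two_stage_cut([1] * 7)
```

With seven equal operators the best a single register boundary can do is a 3/4 split, so the longest stage is four operators instead of seven. This is why a 2-stage pipeline is expected to fall somewhat short of doubling the clock rate.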
4. RESULT AND ANALYSIS
In this work, Mentor Graphics EDA tools and the TSMC 0.13um CMOS process technology are used to
design, implement, and validate the DCT. The tools used include Pyxis for schematic and physical design,
ModelSim for high-level simulation, and Eldonet and EZwave for low-level SPICE simulation. A semi-custom VLSI
design methodology is used, whereby the arithmetic operators are instantiated manually from a standard cell library
to form a complete DCT. The simulation results obtained from ModelSim at the RTL level are compared with those
obtained from SPICE simulation to verify correctness. Here, we show results for the non-pipelined and a 2-stage
pipelined implementation.
The number of components used for the DCT is as follows. A total of 19 adders, 27 subtractors,
and 24 shifters (all 32-bit) are used in both the non-pipelined and the pipelined DCT. As for registers, the
non-pipelined DCT uses 16 32-bit registers and the pipelined DCT uses 42. The pipelined design therefore
uses roughly 2.6 times as many registers as the non-pipelined design. In terms of total transistor count, the non-pipelined
DCT consists of 74240 transistors, while the pipelined DCT consists of 85120, an increase of roughly 14% due to
the pipeline registers.
Simulation was performed with two test patterns, pattern 1 and pattern 2 (derived from the Foreman QCIF
video frame). The critical path delay (cpd) of the non-pipelined DCT is found to be 7.97ns for input pattern 1 and 6.59ns
for input pattern 2. For the pipelined DCT, the cpd is 4.61ns for input pattern 1 and 3.83ns for input
pattern 2. Taking the worst case, it can be estimated that the non-pipelined DCT can run at a maximum of 125MHz and
the pipelined DCT at 217MHz. In terms of throughput, the maximum output rates achieved are 1 Giga pixels/s and 1.7 Giga
pixels/s for the non-pipelined and pipelined DCTs respectively. Figure 5 plots maximum operating
frequency versus throughput. Both designs exceed the speed requirement of real-time 8K UHD
resolution. It should be noted, however, that many other factors and components may affect the speed of
this DCT when it is integrated into a complete video codec.
Figure 5. Maximum frequency (Fmax) vs Throughput in Mpixels/s
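The frequency and throughput estimates above follow from the worst-case critical path delay, as the following sketch shows. The factor of 8 pixels per clock cycle is an assumption inferred from the ratio of the reported throughput to the reported frequency (one 8-point result per cycle); the names `fmax_mhz`, `throughput_mpixels`, and `PIXELS_PER_CYCLE` are hypothetical.

```python
PIXELS_PER_CYCLE = 8  # assumption: one 8-point 1D-DCT result per clock cycle


def fmax_mhz(cpd_ns):
    """Maximum clock frequency in MHz for a critical path delay in ns."""
    return 1e3 / cpd_ns


def throughput_mpixels(cpd_ns):
    """Estimated throughput in Mpixels/s at the maximum clock frequency."""
    return PIXELS_PER_CYCLE * fmax_mhz(cpd_ns)


# Worst-case delays reported for input pattern 1.
non_pipelined = (fmax_mhz(7.97), throughput_mpixels(7.97))
pipelined = (fmax_mhz(4.61), throughput_mpixels(4.61))
```

Rounded, these reproduce the 125MHz and 217MHz estimates and throughputs of roughly 1 and 1.7 Giga pixels/s.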
Table 1 compares our designs with three similar works in the literature: Madanayake et al. [6] in 45nm CMOS,
Wahid et al. [7] in 0.18um CMOS, and Fu et al. [8] in 0.18um CMOS. In terms of throughput, it can
be seen that our designs are superior to [7, 8] but trail the design from [6]. The design in [6] shows higher performance due
to the smaller feature size used. However, our design has been shown to meet the requirements of state-of-the-art
video coding resolution even when using a larger technology.
Table 1. Comparison with similar works in literature
Metric | Madanayake et al. [6] | Wahid et al. [7] | Fu et al. [8] | Non-pipeline DCT | Pipeline DCT
CMOS technology (nm) | 45 | 180 | 180 | 130 | 130
Operating frequency (MHz) | 946 | 194.7 | 75 | 116 | 250
Throughput (Mpixels/s) | 7568 | 194.7 | 75 | 1000 | 1739
5. CONCLUSION
The objective of the present paper is to present results for a fast VLSI implementation of a pipelined multiplierless
fixed-point 8x8 DCT. The algorithm used is ISO/IEC 23002-2 for image and video compression technology.
The design methodology begins with modeling the algorithm as a dataflow graph in order to analyze it for pipelining.
The results show that the implementations are feasible for the latest ultra-high definition 8K video,
even when using the relatively large 0.13um technology. We have implemented and compared the DCT using two
architectures, non-pipelined and 2-stage pipelined. Simulation results validated the expectation that throughput can be
improved by almost a factor of two, from around 1 Giga pixels/s to 1.7 Giga pixels/s, at a small cost of roughly 14%
more resources (i.e. pipeline registers). Future work aims to extend the methodology to DCTs of different
dimensions for other applications.
ACKNOWLEDGEMENT
The authors would like to thank the Malaysia Ministry of Education for providing the funds for the work in
this paper (Vote no. 4F659).
REFERENCES
[1] V. Sze, M. Budagavi, and G. J. Sullivan, "High Efficiency Video Coding," Springer, 2014.
[2] B. Heyne and J. Gotze, "A Low Power and High Quality Implementation of the Discrete Cosine Transformation,"
Advances in Radio Science, 2007.
[3] U. S. Potluri, A. Madanayake, R. J. Cintra, F. M. Bayer, S. Kulasekera, and A. Edirisuriya, "Improved 8-point
approximate DCT for image and video compression requiring only 14 additions," IEEE Transactions on Circuits
and Systems I: Regular Papers, vol. 6, no. 6, pp. 1727-1740, June 2014.
[4] ISO/IEC, "Information technology - MPEG video technologies - Part 2: Fixed-point 8x8 inverse discrete cosine
transform and discrete cosine transform," International Standard, Dec. 2014.
[5] A. Madanayake et al., "Low-Power VLSI Architectures for DCT/DWT: Precision vs Approximation for HD
Video, Biomedical, and Smart Antenna Applications," IEEE Circuits and Systems Magazine, vol. 15, no. 1, pp.
25-47, 2015.
[6] A. Edirisuriya, A. Madanayake, R. Cintra, V. Dimitrov, and N. Rajapaksha, "A single-channel architecture for
algebraic integer-based 8x8 2-D DCT computation," IEEE Transactions on Circuits and Systems for Video Technology,
vol. 23, no. 12, pp. 2083-2089, Dec. 2013.
[7] K. A. Wahid, M. Martuza, M. Das, and C. McCrosky, "Efficient hardware implementation of 8x8 integer cosine
transforms for multiple video codecs," Journal of Real-Time Processing, pp. 1-8, July 2011.
[8] M. Fu, G. A. Jullien, V. S. Dimitrov, and M. Ahmadi, "A low-power DCT IP core based on 2D algebraic integer
encoding," Proceedings of the International Symposium on Circuits and Systems (ISCAS), vol. 2, pp. 765-768.
[9] A. A. H. Ab-Rahman, I. Kamisian, and A. Z. Sha'ameri, "VLSI design and implementation of adaptive channel
equalizer," 2008 International Conference on Computer and Communication Engineering, pp. 1121-1124, May
2008.
[10] Rachmad Vidya Wicaksana Putra, Rella Mareta, Nurfitri Anbarsanti, and Trio Adiono, "A New RTL Design Approach
for a DCT/IDCT-Based Image Compression Architecture using the mCBE Algorithm," Journal of ICT Research
and Applications, vol. 6, no. 2, pp. 131-150, 2012.
[11] A. Prihozhy, E. Bezati, A. A. H. Ab Rahman, and M. Mattavelli, "Synthesis and Optimization of Pipelines
for HW Implementations of Dataflow Programs," IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, vol. 34, no. 10, pp. 1613-1626, 2015.
