RTL implementation of one-sided jacobi algorithm for singular value decomposition by Wan Mohamad, Wan Ahmad Zainie
RTL IMPLEMENTATION OF ONE-SIDED JACOBI ALGORITHM FOR 
SINGULAR VALUE DECOMPOSITION 
WAN AHMAD ZAINIE BIN WAN MOHAMAD 
A thesis submitted in fulfilment of the 
requirements for the award of the degree of 
Master of Engineering (Electrical – Electronic and Telecommunication) 
Faculty of Electrical Engineering 
Universiti Teknologi Malaysia 
JANUARY 2016
Specially dedicated to my parents, 
my other half and 
my children. 
ACKNOWLEDGEMENT 
I would like to thank my supervisor, Associate Professor Dr. Muhammad 
Nasir bin Ibrahim for giving me the opportunity to partake in this project. Most 
importantly, I would like to thank my family for providing love, support and 
encouragement throughout my life, and for being so understanding during times 
when I have neglected all things other than this work. 
ABSTRACT 
Multi-dimensional digital signal processing, such as image processing and
image reconstruction, involves the manipulation of matrix data. Better quality
images involve large amounts of data, which results in unacceptably slow
computation. A parallel processing scheme is a possible solution to this problem.
This project presents an analysis and comparison of various algorithms for widely
used matrix decomposition techniques and of various computer architectures. As a
result, a parallel implementation of the one-sided Jacobi algorithm for computing
the singular value decomposition (SVD) of a 2×2 matrix on field programmable gate
arrays (FPGA) is developed. The proposed SVD design is based on a
pipelined-datapath architecture. The design process starts by evaluating the
algorithm in Matlab, followed by designing the datapath unit and control unit,
coding in SystemVerilog HDL, verification and synthesis using Quartus II, and
simulation on ModelSim-Altera. Original matrices of size 4×4 and 8×8 are used
with the SVD processing element (PE). The results are compared with the Matlab
version of the algorithm to evaluate the PE. The SVD computation can be sped up
by a factor of more than 2 by increasing the number of PEs, at the cost of
increased circuit area.

ABSTRAK 
Multi-dimensional digital signal processing, such as image processing and
image reconstruction, involves the manipulation of matrix data. High-quality
images involve processing large amounts of data, which in turn results in
unacceptably slow computation. One way to solve this problem is to use a parallel
processing scheme. This project presents an analysis and comparison of various
algorithms that are widely used in matrix decomposition techniques and of
various computer architectures. As a result, a parallel implementation of the
one-sided Jacobi algorithm for computing the singular value decomposition of a
2×2 matrix is designed for FPGA. The proposed SVD design is based on a
pipelined-datapath architecture. The design process begins by rewriting the
algorithm for Matlab, designing the datapath and the control system, and then
coding in SystemVerilog HDL, followed by verification and simulation using the
Quartus II and ModelSim-Altera software. Matrices of size 4×4 and 8×8 are used
with the SVD processing element. The computed results are compared with the
Matlab results to evaluate the implementation. The SVD computation speeds up by
more than a factor of 2 as the number of processing elements is increased, at
the cost of increased circuit area.
 
TABLE OF CONTENTS
CHAPTER TITLE
 DECLARATION
 DEDICATION
 ACKNOWLEDGEMENT
 ABSTRACT
 ABSTRAK
 TABLE OF CONTENTS
 LIST OF TABLES
 LIST OF FIGURES
 LIST OF ABBREVIATIONS
 LIST OF APPENDICES

1 INTRODUCTION
 1.1 Background
 1.2 Problem Statement
 1.3 Objectives of the Project
 1.4 Scope of the Project
 1.5 Significance of the Study
 1.6 Thesis Organization

2 LITERATURE REVIEW
 2.1 Introduction
 2.2 Parallel Computer Architecture
 2.3 Matrix Decomposition Techniques
 2.4 Hardware Implementation of Matrix Decomposition
  2.4.1 CORDIC-based Processor
  2.4.2 GPU-based Processor
  2.4.3 Various FPGA-based Processor
 2.5 Speedup Analysis

3 PROJECT METHODOLOGY
 3.1 Project Design and Procedure
 3.2 Singular Value Decomposition
  3.2.1 SVD Algorithms
  3.2.2 One-Sided Jacobi Algorithm for SVD
 3.3 RTL Design and Implementation of SVD Processor
  3.3.1 RTL Design Methodology
  3.3.2 RTL Implementation of the SVD Processor
  3.3.3 SVD Processor Architecture Modelling

4 RESULT AND DISCUSSION
 4.1 Introduction
 4.2 Simulation Results
  4.2.1 2x2 SVD Module
  4.2.2 Matrix-To-Matrix Multiplication
  4.2.3 Parallel Ordering
  4.2.4 Serial Implementation Single 2x2 SVD
  4.2.5 Parallel Implementation Multiple 2x2 SVD
 4.3 Speedup Analysis
 4.4 Synthesis Result

5 CONCLUSION AND FUTURE WORKS
 5.1 Conclusion
 5.2 Future Works

REFERENCES
LIST OF TABLES
TABLE NO. TITLE
4.1 Row-column (i,j) pairs for parallel implementation for 8×8 matrix.
4.2 Speedup of the parallel design.
4.3 Resource utilization for serial and parallel design.
LIST OF FIGURES
FIGURE NO. TITLE
3.1 The flowchart for overall project workflow.
3.2 One-Sided Jacobi algorithm to compute SVD.
3.3 2×2 SVD module.
3.4 matrix_mult module.
3.5 parallel_order module.
3.6 4×4 matrix annihilation sequence.
4.1 2×2 SVD module simulation result.
4.2 matrix_mult module simulation result.
4.3 parallel_order module simulation result.
4.4 Single 2×2 SVD module in serial simulation result.
4.5 Multiple 2×2 SVD module in parallel simulation result.
LIST OF ABBREVIATIONS
iobd   - I/O block diagram 
RTL   - Register Transfer Level 
SVD   - Singular Value Decomposition 
LIST OF APPENDICES 
CHAPTER 1  
INTRODUCTION 
1.1 Background 
Multi-dimensional digital signal processing, such as image processing and
image reconstruction, involves the manipulation of matrix data. Better quality
images involve large amounts of data, which results in unacceptably slow
computation. A parallel processing scheme is a possible solution to this problem.
1.2 Problem Statement  
The time to compute a matrix decomposition increases significantly as the
size of the matrix increases. Using a parallel algorithm alone is not sufficient
to reduce the computation time for the huge matrices seen in image processing;
the parallel algorithm must also be implemented on parallel processors.

Therefore, it is justified to design a datapath unit which implements a
parallel algorithm, and a control unit capable of running multiple such datapath
units in parallel.
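
As an illustration of how independent work could be handed to several datapath units at once, the sketch below generates a round-robin ordering of column pairs. This is a minimal sketch under stated assumptions: it uses the generic round-robin ("circle") scheme, which may differ from the ordering produced by the thesis's parallel_order module, and the function name `round_robin_pairs` is hypothetical.

```python
def round_robin_pairs(n):
    """Return n-1 rounds, each holding n/2 disjoint (i, j) index pairs.

    Assumption: n is even. Pairs within one round share no index, so each
    pair could be given to a separate processing element in the same step.
    """
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be an even number >= 2")
    idx = list(range(n))
    rounds = []
    for _ in range(n - 1):
        # Pair the k-th index from the front with the k-th from the back.
        pairs = [tuple(sorted((idx[k], idx[n - 1 - k]))) for k in range(n // 2)]
        rounds.append(pairs)
        # Keep idx[0] fixed and rotate the remaining indices by one position.
        idx = [idx[0]] + [idx[-1]] + idx[1:-1]
    return rounds


if __name__ == "__main__":
    for step, pairs in enumerate(round_robin_pairs(8), start=1):
        print(step, pairs)
```

For an 8×8 matrix this gives 7 rounds of 4 disjoint pairs, covering all 28 column pairs once per sweep, so up to four units could work concurrently in each round.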
1.3 Objectives of the Project 
The objectives of this project are as follows:
(1) To study matrix decomposition techniques and parallel processing hardware 
implementation on FPGA. 
(2) To design, simulate and verify the RTL implementation of the matrix 
decomposition algorithm using SystemVerilog HDL. 
(3) To evaluate the speedup improvement by comparing parallel processing 
against single processing. 
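
Objective (3) can be made concrete with the usual definition of speedup as the ratio of serial to parallel execution time, and efficiency as speedup per processing element. The sketch below assumes execution time is measured in clock cycles; the function names and the numbers in the usage example are placeholders chosen for illustration, not measurements from this project.

```python
def speedup(t_serial, t_parallel):
    """Speedup = serial execution time / parallel execution time."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("execution times must be positive")
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, num_pe):
    """Efficiency = speedup divided by the number of processing elements."""
    if num_pe < 1:
        raise ValueError("num_pe must be at least 1")
    return speedup(t_serial, t_parallel) / num_pe


if __name__ == "__main__":
    # Placeholder cycle counts, NOT results from the thesis.
    print(speedup(100.0, 40.0))        # 2.5
    print(efficiency(100.0, 40.0, 4))  # 0.625
```

An efficiency below 1 indicates that part of the added hardware is idle or that some of the work remains serial.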
1.4 Scope of the Project 
This project covers the study of matrix decomposition techniques for
translating a single processing task into two or more processing tasks. A matrix
decomposition algorithm is implemented and verified using MATLAB. The algorithm
is then mapped into an RTL implementation using SystemVerilog HDL, targeted for
an Altera FPGA board. The design is captured, simulated, verified and synthesized
using ModelSim-Altera and Altera Quartus II. A speedup analysis is done by
comparing single processing against parallel processing.
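
A software reference model of the kind used for verification could look like the sketch below. It is written in Python rather than MATLAB and follows the textbook one-sided Jacobi (Hestenes) formulation, in which pairs of columns are rotated until they are mutually orthogonal. It is not the project's own algorithm listing; the function name, tolerance and sweep limit are assumptions made for illustration.

```python
import math


def one_sided_jacobi_svd(a, tol=1e-12, max_sweeps=50):
    """Return (u, sigma, v) with a ~= u * diag(sigma) * v^T.

    a is a list of rows (m x n, m >= n). Singular values are not sorted.
    """
    m, n = len(a), len(a[0])
    u = [list(map(float, row)) for row in a]   # working copy, becomes U*Sigma
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        for i in range(n - 1):
            for j in range(i + 1, n):
                alpha = sum(u[k][i] ** 2 for k in range(m))
                beta = sum(u[k][j] ** 2 for k in range(m))
                gamma = sum(u[k][i] * u[k][j] for k in range(m))
                # Skip column pairs that are already orthogonal.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Apply the same plane rotation to columns i, j of u and v.
                for mat in (u, v):
                    for row in mat:
                        ri, rj = row[i], row[j]
                        row[i] = c * ri - s * rj
                        row[j] = s * ri + c * rj
        if not rotated:
            break
    # Column norms are the singular values; normalising the columns gives U.
    sigma = [math.sqrt(sum(u[k][j] ** 2 for k in range(m))) for j in range(n)]
    for j in range(n):
        if sigma[j] > 0.0:
            for k in range(m):
                u[k][j] /= sigma[j]
    return u, sigma, v
```

Because each rotation touches only two columns, rotations on disjoint column pairs are independent of one another, which is the property a parallel hardware implementation can exploit.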
1.5 Significance of the Study
This project proposes a parallel RTL implementation of a matrix
decomposition algorithm to speed up the computation for large matrices, suitable
for image processing.
1.6 Thesis Organization 
The rest of the thesis is organized based on the following structure. 
Chapter 2 covers the literature review for this project, comprising the
related theoretical background and related works. The review mainly focuses on
hardware implementations of various matrix decomposition techniques, especially
the singular value decomposition.

Chapter 3 describes the methodology used to achieve the project objectives.
This includes an explanation of the architecture components, implementation flow,
development environment and verification techniques.

Chapter 4 presents the simulation results of the proposed RTL design and
implementation. This chapter also includes an evaluation of the implemented
algorithmic processor for verification and benchmarking.

Chapter 5 summarizes this thesis, states the limitations of this project and
provides suggestions for future works.
 
 
