Acceleration of Hyperspectral Image Compression for Remote Sensing Applications. by Egho, Chafik.
Acceleration of Hyperspectral Image 
Compression for Remote Sensing Applications
Chafik Egho
Submitted for the Degree of 
Doctor of Philosophy 
from the 
University of Surrey
Surrey Space Centre 
School of Engineering and Physical Sciences 
University of Surrey 
Guildford, Surrey GU2 7XH, UK
August 2014
ProQuest Number: 27558474
Published by ProQuest LLC (2019). Copyright of the Dissertation is held by the Author.
Abstract
The growing demands for advanced space applications have intensified the interest in on-board 
satellite processing. High performance on-board computing is an essential capability to meet the high 
speed and throughput requirements of these applications. The main overheads of high performance 
computing are the large hardware resources and the high power consumption, which are major design 
challenges in power-constrained embedded applications such as spacecraft. Investigating a high 
performance intelligent System-on-a-Chip (SoC) based architecture for space applications is the main 
objective of this research. In the proposed architecture, a spectral decorrelation module for 
hyperspectral image compression is developed for FPGA-based System-on-a-Chip platforms for 
space applications. While various techniques have been developed for spectral decorrelation, the 
Karhunen-Loève Transform (KLT) outperforms other techniques in terms of compression 
performance. However, this algorithm consists of sequential processes that are computationally 
intensive, such as covariance matrix computation, eigenvector evaluation, and matrix factorisations 
and multiplications. These processes slow down the overall computation significantly and increase the 
latency. In the proposed spectral decorrelation system, the KLT is utilised for lossy compression and 
the reversible Integer KLT is utilised for lossless compression. The computation of the algorithm is 
investigated in depth from mathematical and hardware perspectives in order to achieve a feasible 
solution within limited hardware resources and, therefore, a limited power budget.
The novelty of this work lies in the new architecture for the acceleration of the Integer 
Karhunen-Loève Transform computation on FPGA-based System-on-a-Chip platforms for lossless 
hyperspectral image compression. Moreover, the proposed KLT architecture for lossy hyperspectral 
image compression outperforms previously proposed hardware architectures, as it offers a further 
level of parallelism, which is more significant for hyperspectral data with a large number 
of spectral bands. The experiments with the proposed system on the AVIRIS and the Hyperion data 
showed an overall improvement to the level of parallelism of up to 4.9%, 11.8% and 18.4% for 8, 16 
and 32 spectral bands, respectively. Furthermore, this work addresses the KLT computations for a large 
number of spectral bands from a hardware perspective, which has not been addressed in other works. 
In addition, this work proposes a novel eigenvalue / eigenvector computing hardware algorithm 
for large symmetric matrices, which can reduce the number of required iterations and can offer 
partial computations of the eigenvectors and eigenvalues. Consequently, this 
can improve the parallelism level not only for the KLT computations but also for other applications 
where some of the eigenvectors / eigenvalues can be utilised in the next computation stage. 
Therefore, this work contributes towards improving the acceleration of hyperspectral image 
compression and other high performance applications where the hardware and the power resources 
are limited, such as space applications.
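The KLT stages named in the abstract (band-mean subtraction, covariance matrix computation, eigenvector evaluation and the final matrix multiplication) can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the hardware architecture developed in the thesis: the function names `covariance`, `jacobi_eigen` and `klt` are hypothetical, the image is assumed to be a small list of bands held in memory, and the eigenvectors are found with a plain cyclic Jacobi iteration in floating point rather than the fixed-point, accelerated variants the thesis investigates.

```python
import math

def covariance(bands):
    """Centre each band on its mean; return (means, centred bands, covariance)."""
    n = len(bands[0])
    means = [sum(b) / n for b in bands]
    centred = [[x - m for x in b] for b, m in zip(bands, means)]
    cov = [[sum(x * y for x, y in zip(ci, cj)) / n for cj in centred]
           for ci in centred]
    return means, centred, cov

def jacobi_eigen(sym, sweeps=50, tol=1e-18):
    """Cyclic Jacobi for a symmetric matrix.

    Returns (eigenvalues, V) where the eigenvectors are the columns of V.
    """
    n = len(sym)
    a = [row[:] for row in sym]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[p][q] ** 2 for p in range(n) for q in range(p + 1, n))
        if off < tol:  # off-diagonal energy is negligible: converged
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes a[p][q]: tan(2*theta) = 2*a_pq / (a_qq - a_pp)
                theta = (math.pi / 4 if diff == 0.0
                         else 0.5 * math.atan(2.0 * a[p][q] / diff))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # column update: A <- A J, V <- V J
                    a[k][p], a[k][q] = (c * a[k][p] - s * a[k][q],
                                        s * a[k][p] + c * a[k][q])
                    v[k][p], v[k][q] = (c * v[k][p] - s * v[k][q],
                                        s * v[k][p] + c * v[k][q])
                for k in range(n):  # row update: A <- J^T A
                    a[p][k], a[q][k] = (c * a[p][k] - s * a[q][k],
                                        s * a[p][k] + c * a[q][k])
    return [a[i][i] for i in range(n)], v

def klt(bands):
    """Return (band means, basis rows sorted by eigenvalue, decorrelated bands)."""
    means, centred, cov = covariance(bands)
    eigvals, v = jacobi_eigen(cov)
    order = sorted(range(len(eigvals)), key=lambda i: -eigvals[i])
    basis = [[v[k][i] for k in range(len(v))] for i in order]
    out = [[sum(w * band[t] for w, band in zip(row, centred))
            for t in range(len(centred[0]))] for row in basis]
    return means, basis, out

# Usage: three strongly correlated "bands" of six pixels each (made-up values).
bands = [[10.0, 12.0, 14.0, 13.0, 11.0, 15.0],
         [20.0, 24.5, 27.0, 26.0, 22.5, 30.0],
         [5.0, 7.0, 6.0, 8.0, 6.5, 9.0]]
means, basis, out = klt(bands)
```

After the transform, the covariance matrix of `out` is diagonal to within rounding, which is the decorrelation property that makes the KLT attractive for compression; the original bands are recovered by multiplying with the transposed basis and adding the means back. The nested loops also make visible why the covariance and eigenvector stages dominate the cost as the number of bands grows.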
Acknowledgments
First and foremost, I would like to sincerely thank my supervisor, Professor Tanya 
Vladimirova. I highly appreciate all her invaluable contributions to making my PhD experience 
productive and stimulating, her essential efforts on our publications, and her 
encouragement and patience throughout the PhD journey. I would also like to thank my 
second supervisor, Professor Sir Martin Sweeting, for the precious guidance and support he 
has provided.
I would like to thank the administrator of the Surrey Space Centre, Mrs. Karen Collar, for all 
the support and guidance she has provided. I would also like to thank all my friends and 
colleagues for their personal support; these include Elisabetta Iorfida, Jason Forshaw, 
Christopher Brunskill, Rizuan Mat Noor, Vivien Rohwedder, Shokollah Karimian and Fahed 
Maida.
My final and special thanks go to my mother and siblings for all their support and backing; 
I thank them from the bottom of my heart.
Table of Contents
Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
List of Abbreviations
1 Introduction
1.1 Research Motivation and Scope
1.2 Novelty Summary
1.3 Publications
1.4 Thesis Outline
2 Literature Review
2.1 Introduction
2.2 Space Radiation Effects
2.2.1 Galactic Cosmic Rays
2.2.2 Trapped Radiation Belts
2.2.3 Solar Particle Events
2.3 FPGA Types and their Suitability for Space Applications
2.3.1 FPGA Architecture
2.3.1.1 FPGA Input/Output Cells
2.3.1.2 FPGA Programmable Blocks
2.3.2 FPGAs Trends
2.3.3 FPGAs in Space
2.3.3.1 Mitigation Techniques for Space Radiations
2.4 Reconfigurable Computing
2.4.1 Reconfigurable Systems Architecture
2.4.2 Reconfigurable Fabric Structure
2.4.3 Reconfiguration Scheme
2.4.3.1 Static Reconfiguration
2.4.3.2 Dynamic Reconfiguration
2.4.4 Reconfigurable Computing Trends and Challenges
2.4.5 Reconfigurable Computing in Space
2.5 System-on-a-Chip
2.5.1 Intellectual Property Cores
2.5.2 Soft Processors
2.5.2.1 LEON
2.5.2.2 Altera NIOS II
2.5.2.3 Xilinx MicroBlaze
2.5.3 Hard Core Processors
2.5.3.1 PowerPC
2.5.3.2 Cortex-M3
2.5.4 On-Chip Bus Protocols
2.5.4.1 AMBA
2.5.4.2 CoreConnect
2.5.4.3 Avalon
2.5.5 System-on-a-Chip Challenges and Trends
2.5.5.1 SoC Challenges
2.5.5.2 SoC Trends
2.6 Acceleration of High-Performance Applications
2.6.1 High Performance Computing (HPC) on FPGAs
2.6.2 Discrete Transforms
2.7 On Board Computing
2.7.1 OBC386
2.7.2 OBC750
2.7.3 X-SAT On-Board Payload Computer
2.7.4 System-on-a-Chip based on OBC386
2.8 Conclusion
3 Hyperspectral Images and Their Compression Techniques
3.1 Introduction
3.2 Overview of Satellite Imaging
3.2.1 Passive and Active Imaging Sensors
3.2.2 Scanning Mechanisms
3.3 Hyperspectral Imaging
3.3.1 Hyperspectral Imaging Applications
3.3.2 Overview of the Current Space-borne Hyperspectral Imagers
3.3.3 The Test Hyperspectral Data
3.4 Hyperspectral Image Compressions
3.4.1 Classifications of Hyperspectral Image Compressions Techniques
3.4.2 Lossy, Lossless and Near-Lossless Compressions
3.4.3 Evaluation Factors for Compression Performance
3.5 Consultative Committee for Space Data Systems
3.5.1 CCSDS Predictor
3.5.2 CCSDS Encoder
3.6 Spectral Decorrelation Techniques
3.6.1 Performance Comparison
3.6.2 Complexity Comparison
3.6.3 Clustering and Tiling Techniques
3.7 Conclusion
4 Investigation of the Karhunen-Loève Transform Computational Process
4.1 Introduction
4.2 Computations Overview
4.3 Computational Requirements
4.3.1 BandMean and MeanSub Computations
4.3.2 Covariance Matrix
4.3.3 Eigenvectors and Eigenvalues
4.3.3.1 Jacobi Algorithm
4.3.3.2 Matrix Reduction Technique for the Jacobi Algorithm
4.3.3.3 QR Algorithm
4.3.3.4 Jacobi versus QR Algorithm
4.3.4 Eigen Mapping
4.4 Fixed Point Error Analyses
4.4.1 Covariance Matrix Computation Data Format
4.4.2 Eigenvectors Computation Data Format
4.4.3 Eigen Mapping Computation Data Format
4.5 Conclusion
5 Acceleration of the Karhunen-Loève Transform
5.1 Introduction
5.2 Overview
5.3 Prototyping Platforms
5.4 KLT Computation Flow
5.5 Acceleration of the Eigenvectors Computation
5.5.1 Implementation of the Jacobi Algorithm on Embedded Processors
5.5.2 Matrix Reduction Technique
5.6 KLT SoC Architecture on the Altera SRAM FPGA
5.6.1 Acceleration of Stage 1
5.6.2 Acceleration of Stage 2
5.6.2.1 Hardware Acceleration of the Eigenvectors Computation
5.6.3 Acceleration of Stage 3
5.6.4 Hardware Utilisation and Processing Time
5.6.5 Advantage of the Proposed Architecture
5.7 KLT SoC Architecture on the SmartFusion Flash FPGA
5.7.1 Approach 1
5.7.2 Approach 2
5.7.3 Discussion of Experimental Results
5.8 Conclusion
6 Investigation of the Integer Karhunen-Loève Transform
6.1 Introduction
6.2 Overview of the Integer KLT
6.3 Computational Process of the Integer KLT
6.3.1 Overall Computation Process
6.3.2 PLUS Matrix Factorization
6.3.3 Lifting Scheme
6.4 Fixed-point Implementation Analysis
6.4.1 PLUS Matrix Factorization
6.4.2 Lifting Scheme
6.5 KLT versus Integer KLT Computational Requirements
6.6 Conclusion
7 Hardware Acceleration of the Integer Karhunen-Loève Transform
7.1 Introduction
7.2 Overview
7.3 Computational Flow
7.4 System Architecture
7.4.1 DE2-115 SRAM Altera Cyclone IV System Architecture
7.4.1.1 PLUS Matrix Factorization
7.4.1.2 Lifting Scheme
7.4.1.3 Hardware Utilisation and Processing Time
7.4.2 Integer KLT SoC Architecture on the SmartFusion
7.4.2.1 The Experimental Results
7.5 Adaptive KLT / Integer KLT Computation
7.6 Conclusion
8 Conclusions and Future Work
8.1 Conclusions
8.2 Novelty Claims
8.3 Publications
8.4 Suggestions of Future Work
References
Appendix A. Radiation Effects on Microelectronics
Appendix B. MATLAB Simulations (Error Analysis)
Appendix C. Hardware Prototyping Platforms
C.1 SRAM Altera Cyclone IV DE2-115 Board
C.2 Flash Actel SmartFusion Kit
Appendix D. ModelSim Simulations
D.1 Simulation of the θ Computation
D.2 Simulation of the Lifting Process
Appendix E. Summary of the Hardware Utilizations
List of Figures
Figure 1.1: Hyperspectral Image
Figure 1.2: On-Board Hyperspectral Image Compression Module
Figure 2.1: FPGA Internal Architecture
Figure 2.2: The Implementation of Reconfigurable Computing
Figure 2.3: System-level Architectures for RC Systems
Figure 2.4: (a) Static Reconfiguration (b) Dynamic Reconfiguration
Figure 2.5: Reconfiguration Schemes
Figure 2.6: A Block Diagram of LEON-4
Figure 2.7: A Block Diagram of the NIOS II
Figure 2.8: A Block Diagram of MicroBlaze
Figure 2.9: The CPU Transistor Count over the Last 4 Decades
Figure 2.10: System on Chip Motives, Approaches, Trends and Challenges
Figure 2.11: The Technology Gap between Single CPU and HPC
Figure 2.12: The OBC386 Diagram
Figure 2.13: The OBC750 Diagram
Figure 2.14: The X-SAT Block Diagram
Figure 2.15: Block Diagram of SoC for Space Applications
Figure 3.1: (a) The Whisk-broom and (b) the Push-broom Scanning Mechanism
Figure 3.2: Hyperspectral Image
Figure 3.3: (a) Multispectral Image, and (b) Hyperspectral Image
Figure 3.4: The AVIRIS Cuprite Hyperspectral Image
Figure 3.5: The EO-1 Hyperion Boston Hyperspectral Images
Figure 3.6: CCSDS Hyperspectral Image Compression Module
Figure 3.7: Local Sums Calculations
Figure 3.8: The Structure of the Compressed Image
Figure 3.6: The Spectral Range of 4 Different Pixels before the Spectral Decorrelation
Figure 3.7: The Spectral Range of 4 Different Pixels after the Spectral Decorrelation
Figure 3.8: Comparison of the KLT and the DWT Lossy Compression Performance
Figure 4.1: The Computation Process of Karhunen-Loève Transform
Figure 4.2: The Computation Process of the Covariance Matrix
Figure 4.3: The Jacobi Algorithm
Figure 4.4: The Jacobi Algorithm Convergence of Different Hyperspectral Data
Figure 4.5: The Proposed Matrix Reduction Technique
Figure 4.6: The QR Algorithm for Eigenvalues Computations
Figure 4.7: The QR versus the Jacobi Algorithm (AVIRIS Cuprite)
Figure 4.8: The QR versus the Jacobi Algorithm (Hyperion Boston)
Figure 4.9: The Required Number of Operations for the KLT Computation
Figure 4.10: The Maximum Output Errors of the Eigenvectors for Fixed-point Computation (AVIRIS Cuprite 32 bands)
Figure 4.11: The Maximum Output Errors of the Eigenvectors for Fixed-point Computation (Hyperion Boston 32 bands)
Figure 4.12: The Maximum Output Errors of the Eigenvectors for Single Precision Floating Point of the Hyperion Boston
Figure 4.13: The Maximum Output Errors of the Eigenvectors for Single Precision Floating Point of the AVIRIS Cuprite
Figure 4.14: The Maximum Output Errors of the Eigenvectors for Double Precision Floating Point of the Hyperion Boston
Figure 4.15: The Maximum Output Errors of the Eigenvectors for Double Precision Floating Point of the AVIRIS Cuprite
Figure 5.1: The KLT Computation Flow
Figure 5.2: The KLT Computation Flow Proposed in [4]
Figure 5.3: The Execution Time (microseconds) of a Single Jacobi Iteration
Figure 5.4: The Variation of θ over the Jacobi Algorithm (Hyperion Greenland)
Figure 5.5: The Approximation Error of Arctangent and Sine Functions for small θ
Figure 5.6: The System-on-a-Chip Architecture for the KLT Computation
Figure 5.7: The Hardware Architecture of Stage 1 Accelerator
Figure 5.8: The Hardware Architecture for the Jacobi Algorithm
Figure 5.9: The Multiply and Add / Sub Timing Diagram
Figure 5.10: The Hardware Architecture for the θ Computer
Figure 5.11: The Hardware Architecture for the Eq(3,4,5) Computer
Figure 5.12: The Processing Flow Diagram for the Eq(3,4,5) Computer
Figure 5.13: The Hardware Architecture of Stage 3 Accelerator
Figure 5.14: The Computation Flow of the Proposed Architecture
Figure 5.15: The Proposed Computation Flow for the SmartFusion Architecture
Figure 5.16: The Block Diagram of the Proposed Computation (Approach 1)
Figure 5.18: The Block Diagram of the Multiplication Unit (Approach 2)
Figure 5.16: The Block Diagram of the Proposed Computation (Approach 2)
Figure 6.1: The Computations Process of the Integer KLT Algorithm
Figure 6.2: The PLUS Factorization
Figure 6.3: Pivoting Process
Figure 6.4: The Lifting Process of 4 Pixels
Figure 6.5: The Required Number of Floating Point Operations for the KLT and RKLT
Figure 6.6: The Required Number of Fixed Point Operations for the KLT and RKLT
Figure 7.1: The Integer KLT Computation Flow
Figure 7.2: The Proposed System Architecture for the Integer KLT
Figure 7.3: The Lifting Hardware Architecture of 4 Pixels
Figure 7.4: The Block Diagram of the SmartFusion Architecture
Figure 7.5: The Computation Flow of the Adaptive KLT / Integer KLT System
Figure B.1: The Jacobi Convergence of Different Data-width of 8x8 Matrix
Figure B.2: The Jacobi Convergence of Different Data-width of 16x16 Matrix
Figure B.3: The Jacobi Convergence of Different Data-width of 32x32 Matrix
Figure B.4: The Jacobi Convergence of Different Data-width of 64x64 Matrix
Figure B.5: The Maximum Output Errors of the Eigenvectors for Fixed-point Computation (AVIRIS Cuprite 8 bands)
Figure B.6: The Maximum Output Errors of the Eigenvectors for Fixed-point Computation (AVIRIS Cuprite 16 bands)
Figure B.7: The Maximum Output Errors of the Eigenvectors for Fixed-point Computation (Hyperion Boston 8 bands)
Figure B.8: The Maximum Output Errors of the Eigenvectors for Fixed-point Computation (Hyperion Boston 16 bands)
Figure C.1: The Altera DE2-115 Board
Figure C.2: The SmartFusion Kit
Figure D.1: The Post-Synthesis Simulation of the θ Computer
Figure D.2: The Post-Synthesis Simulation of the Lifting Process
List of Tables
Table 2.1: The Main Differences of FPGA Types
Table 2.2: Logic Blocks Comparison
Table 2.3: A Comparison between SRAM, Flash and TAS-MRAM Technologies
Table 2.4: A List of a Variety of Soft Processors
Table 2.5: Comparison of FPGA Accelerators with General Purpose Processors
Table 2.6: Compression Discrete Transforms Comparison
Table 2.7: OBC386 Specifications
Table 3.1: List of Hyperspectral / Multispectral Space-borne Missions
Table 3.2: The Improvements of the Integer KLT Lossless Compression Performance
Table 3.3: The Required Silicon Surface for the KLT Transform
Table 3.4: Complexity Comparison between the DWT, KLT and the Integer KLT
Table 4.1: The Computation Requirements for the Covariance Matrix
Table 4.2: Number of Iterations Required for Certain Output Error (Mean Square MSB)
Table 4.2: Covariance Data Bit-length of Various Hyperspectral Images
Table 4.3: Eigenvalues Convergence throughout 10 Sweeps
Table 4.4: Iterations Reduction by the Proposed Matrix Reduction Technique
Table 4.5: Computational Requirements of Different QR Factorization Techniques
Table 4.6: The Computational Requirements of the KLT Algorithm
Table 4.7: The Data Format of the Covariance Matrix Computation Process
Table 5.1: The DE2-115 Development Board versus the SmartFusion Evaluation Kit
Table 5.2: Operations Occurrence of a Jacobi Iteration for Different Matrix Sizes
Table 5.3: The Execution Time of the Jacobi Algorithm for Different Matrix Sizes
Table 5.4: The Execution Time using the Matrix Reduction Technique (MRT)
Table 5.5: Stage 1 Accelerator Hardware Usage
Table 5.6: The Required MegaFunctions for the Jacobi Hardware Acceleration
Table 5.7: The Processing Time of a Jacobi Iteration on the Proposed Architecture
Table 5.8: The Execution Time using the Matrix Reduction Technique (MRT)
Table 5.9: The FPGA Resources for the Computation Components of the Proposed Architecture
Table 5.11: The Hardware and Power Resources of the Proposed Architecture
Table 5.12: The Execution Time (milliseconds) of the Proposed KLT Architecture
Table 5.13: The Hardware Resources of the Proposed SmartFusion Architecture
Table 5.14: The Power Consumption of the Proposed SmartFusion Architecture
Table 5.15: The Execution Time of the Proposed KLT Architecture (SmartFusion)
Table 6.1: The Computation Requirements of the PLUS Factorization
Table 6.2: The Computational Requirements of the Integer KLT vs the KLT Algorithm
Table 7.1: The Execution Time of the PLUS Factorization on the Embedded Processors
Table 7.2: The Execution Time of the Lifting Scheme for Different Data Sets' Sizes
Table 7.3: The Hardware Resources Utilised for the Proposed Lifting Scheme
Table 7.4: The Hardware and Power Resources of the Proposed Architecture
Table 7.5: The Execution Time of the Proposed Integer KLT Architecture
Table 7.6: The Processing Time of the PLUS over the Overall Processing Time
Table 7.7: The Required Resources of the SmartFusion Architecture
Table 7.8: The Execution Time of the Integer KLT Architecture (SmartFusion)
Table 7.9: The Execution Time (milliseconds) of the Adaptive KLT / Integer KLT Architecture of the Altera SRAM DE2-115
Table 7.10: The Execution Time of the Adaptive KLT / Integer KLT (SmartFusion)
Table 7.11: The Hardware and Power Resources of the Adaptive KLT / Integer KLT Architecture of the Altera SRAM DE2-115
Table 7.12: The Required Resources (Hardware and Power) of the Adaptive KLT / Integer KLT SmartFusion Architecture
List of Abbreviations
ACE Analogue Computing Engine
ADCS Attitude Determination and Control System
AHP Advanced High-Performanee Bus
AMBA Advanced Micro-controller Bus Architeeture
APB Advanced Peripheral Bus
ASIC Application-Specifie Integrated Circuit
AVIRIS Airborne Visible/Infrared Imaging Spectrometer
AXI Advanced extensible Interface
BJT Bipolar Junction Transistors
BPP Bits per Pixel
BPPPB Bits per Pixel Per Band
C&DH Commands and Data Handling
CAN Controller-Area Network
CHRIS Compact High Resolution Imaging Spectrometer
CLB Configurable Logic Blocks
CMOS Complementary Metal-Oxide-S emieonductor
CORDIC coordinate Rotation Digital Computer
COTS Components off the Shelf
CPU Central Processing Unit
DCM Digital Clock Manager
DCR Device Control Register
DCT Discrete Cosine Transform
DPT Discrete Fourier Transform
DPCM Differential Pulse Code Modulation
DSP Digital Signal Processing
DST Discrete Sine Transform
xiv
DWT Discrete Wavelet Transform
EDAC Error Deteetion and Correction Code
EPS Eleetrieal Power Subsystem
ESA European Space Agency
FPGA Field-Programmable Gate Array
FFT Fast Fourier Transform
FU Funetional Unit
GPL General Publie License
HDL Hardware Description Language
HLDC High-Level Data Link Control
HPC High-Performanee Computing
JPEG Joint Photographic Experts Group
JPL Jet Propulsion Laboratory
JTAG Joint Test Action Group
KLT Karhunen-Loève Transform
IP Intellectual Property
LGPL Lesser General Public License
LUT Look-up Table
LVDS Low Voltage Differential Signal
MAC Multiply Accumulate
MBU Multiple Bit Upset
MSE Mean Square Error
MSPS Mega-Samples per Second
MSS Microcontroller Subsystem
NASA National Aeronautics and Space Administration
NoC Network-on-Chip
OBC On-Board Computer
OPB On-chip Peripheral Bus
OTP One-Time Programming
PLB Processor Local Bus
POR Power on Reset
PPU Payload Processing Unit
RC Reconfigurable Computing
RHBD Radiation Hardening by Design
RISC Reduced Instruction Set Computer
SEE Single Event Effects
SEFI Single Event Functional Interrupt
SEL Single Event Latch-up
SET Single Event Transient
SEU Single Event Upset
SIMD Single Instruction Multiple Data
SNR Signal to Noise Ratio
SoC System-on-a-Chip
SRAM Static Random Access Memory
SSTL Surrey Satellite Technology Ltd
SSC Surrey Space Centre
TAS-MRAM Thermally Assisted Switching Magnetic Random Access Memory
TID Total Ionising Dose
TMR Triple Module Redundancy
UART Universal Asynchronous Receiver/Transmitter
VHDL VHSIC Hardware Description Language
VHSIC Very-High-Speed Integrated Circuits
VLSI Very-Large-Scale Integration
VQ Vector Quantization
Chapter 1
Introduction
Over the last decades, space technologies have advanced dramatically, offering more 
services with higher quality to satisfy the growing demands of payload processing 
applications. This growth has been supported by the rapid advancement of electronics 
technology at different levels, such as components, hardware systems and software packages. 
As a result, more efficient multi-functional devices have become available to end users at 
more affordable prices. Such advancements have shaped the trend of various technologies 
and products, but associated concerns such as power and hardware resources still exist.
Satellite imaging is an example of this dramatic advancement, which has been stimulated 
mainly by its applications and the relatively affordable cost of small satellites. Hyperspectral 
imaging, which evolved from multispectral imaging in the 1980s [1], has a wide range of 
applications in agriculture, forestry, natural resources exploration, intelligence, warfare, 
cartography and disaster observation. The significance of this technology has therefore 
grown, and various governments and organisations have expressed increasing interest in it. 
However, the advantages of hyperspectral images come with the overhead of 
their large sizes, which increase the memory and the bandwidth requirements of the satellite. 
The process of eliminating the spectral redundancies is defined as spectral decorrelation; 
it significantly improves the compression performance of hyperspectral images and, as 
such, has been the subject of various research works in the last decades. The acceleration 
of the spectral decorrelation for space applications is the main objective of this thesis.
In this chapter, the motivation of this research is presented in Section 1.1, where 
hyperspectral images are defined along with the importance of their compression. The 
objectives and the scope of this work are defined within the context of hyperspectral image 
compression. The novelty of this research and the publications are outlined in Sections 1.2 
and 1.3, respectively, and the structure of the thesis is presented in Section 1.4.
1.1 Research Motivation and Scope
Hyperspectral images are defined as “the collection of measurements in a large number of 
contiguous and narrow spectral bands” in [2], and the technique is also referred to as imaging 
spectrometry in some of the literature [3]. The data of these images have high-resolution 
spectral details, which include comprehensive information about the reflecting subjects, such 
as the material of a building, rock strata or the type of vegetation.
[Figure: hyperspectral sensors measure the spectrum of the light reflected at each pixel; 
example spectra over wavelengths of 0.4 to 2.5 µm are shown for green vegetation, dry 
vegetation and kaolinite.]
Figure 1.1: Hyperspectral image [4]
From a computational perspective, hyperspectral images are three-dimensional arrays (two 
spatial dimensions and one spectral dimension) of large size. Therefore, such images require 
large memory resources if stored as raw data and too much power if transmitted as raw data. 
In order to overcome these overheads, the data (image) should be compressed. The 
compression process is performed by eliminating different types of redundancy, i.e. the 
statistical, the human vision, the spectral and the spatial redundancy.
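To give a feel for the memory overhead mentioned above, the following minimal sketch computes the raw size of a hyperspectral cube. The dimensions used (512 x 512 pixels, 224 bands, 16 bits per sample) are illustrative assumptions, not figures taken from this thesis, and the function name is hypothetical.

```python
def raw_cube_size_bytes(rows, cols, bands, bits_per_sample):
    """Return the raw storage size, in bytes, of a rows x cols x bands cube."""
    total_bits = rows * cols * bands * bits_per_sample
    # Round up to a whole number of bytes.
    return (total_bits + 7) // 8


if __name__ == "__main__":
    # Assumed example dimensions, chosen only for illustration.
    size = raw_cube_size_bytes(rows=512, cols=512, bands=224, bits_per_sample=16)
    print(f"Raw cube size: {size / 2**20:.1f} MiB")
```

Because the size grows linearly with the number of bands, any redundancy removed along the spectral dimension translates directly into memory and downlink savings.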
The human vision and the statistical redundancies are beyond the scope of this research. This 
research will focus on the spectral redundancy and the ultimate objective is to incorporate the 
spectral decorrelation module with a spatial decorrelation one, which was developed at 
Surrey Space Centre (SSC) [5]. Figure 1.2 illustrates the block diagram of the complete 
system, which will target hyperspectral image compression on-board satellites.
[Figure: block diagram with the input Hyperspectral Image, the blocks Spectral Decorrelation 
(Lossy: KLT; Lossless: Integer KLT), Spatial Decorrelation, Lossy Compression and Lossless 
Compression, and the outputs Lossy Compressed Image and Lossless Compressed Image.]
Figure 1.2: On-Board Hyperspectral Image Compression Module
Spectral decorrelation plays a significant role in hyperspectral data compression: this process 
eliminates the spectral redundancies and hence makes the hyperspectral data more 
compressible. Different sets of hyperspectral data were considered for spatial and spectral 
decorrelations in [6], where significant improvements in the compression (up to more than 
50%) were achieved by performing the spectral decorrelation. While different techniques can 
be employed for the spectral decorrelation, the Karhunen-Loève Transform (KLT) exhibits 
superior compression performance (up to more than 15%) compared with other techniques [7] 
[8] [9] [10].
There are two schemes of the KLT algorithm: the traditional KLT for lossy compression and 
the reversible Integer KLT for lossless compression. The computations of both schemes 
consist of sequential processes, which are computationally intensive, such as the covariance 
matrix computation, eigenvectors evaluation and matrix factorization and multiplications. 
Therefore, these computationally intensive processes require a longer processing time, which 
increases the latency and affects the real-time characteristics of the operations. The 
acceleration of these processes within the context of limited power and hardware budgets is 
the main aim of this work.
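The processing chain listed above (mean removal, covariance matrix, eigenvectors, matrix multiplication) can be sketched for the smallest possible case of two spectral bands, where the eigenvectors of the 2 x 2 covariance matrix follow from a single rotation angle. This is only an illustrative sketch in plain Python with made-up sample values; the function names are hypothetical and it is not the architecture developed in this thesis.

```python
import math


def covariance(u, v):
    """Population covariance of two equally long sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    return sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v)) / n


def klt_two_bands(band_a, band_b):
    """Decorrelate two bands with the KLT; returns the two principal components."""
    n = len(band_a)
    # Step 1: remove the mean of each band.
    mean_a = sum(band_a) / n
    mean_b = sum(band_b) / n
    a = [x - mean_a for x in band_a]
    b = [y - mean_b for y in band_b]
    # Step 2: covariance matrix [[caa, cab], [cab, cbb]].
    caa = covariance(a, a)
    cbb = covariance(b, b)
    cab = covariance(a, b)
    # Step 3: eigenvectors of a symmetric 2 x 2 matrix are (c, s) and (-s, c),
    # where the angle satisfies tan(2*theta) = 2*cab / (caa - cbb).
    theta = 0.5 * math.atan2(2.0 * cab, caa - cbb)
    c, s = math.cos(theta), math.sin(theta)
    # Step 4: project the mean-removed data onto the eigenvectors.
    pc1 = [c * x + s * y for x, y in zip(a, b)]
    pc2 = [-s * x + c * y for x, y in zip(a, b)]
    return pc1, pc2


if __name__ == "__main__":
    # Made-up, strongly correlated samples from two neighbouring bands.
    band_a = [10.0, 12.0, 14.0, 16.0, 18.0]
    band_b = [21.0, 24.0, 29.0, 31.0, 37.0]
    pc1, pc2 = klt_two_bands(band_a, band_b)
    print("covariance before:", covariance(band_a, band_b))
    print("covariance after: ", covariance(pc1, pc2))
```

For a real image with many bands the eigenvector step no longer has a closed form and must be solved iteratively, which is where most of the computational cost discussed above arises.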
Images from the Airborne Visible / Infrared Imaging Spectrometer (AVIRIS) are commonly 
used in hyperspectral imaging research. Therefore, they will be considered in this research; 
however, as they are not space-borne images, the space-borne EO-1 Hyperion hyperspectral 
images will be considered as well.
1.2 Novelty Summary
The research outcome of this PhD work provides novel contributions to the state of the art as 
follows:
• A new architecture for the acceleration of the Integer Karhunen-Loève Transform 
computation on a Field Programmable Gate Array (FPGA) based System-on-a-Chip 
platform for lossless hyperspectral image compression has been proposed. This 
includes comprehensive investigations of the power consumption, hardware resources 
and performance constraints. Moreover, since no hardware work had been done 
before, this work has highlighted the computational constraints of the algorithm from 
a hardware perspective.
• A novel architecture for the acceleration of the Karhunen-Loève Transform 
computation on an FPGA based System-on-a-Chip platform for lossy hyperspectral 
image compression has been proposed. Compared with the previously proposed 
hardware architectures [11] and [12], this work has improved the level of parallelism 
with a minor overhead of a convergence check (comparison); the improved level of 
parallelism is more significant for hyperspectral data with a large spectral dimension.
• A novel technique for the eigenvalues and eigenvectors computations based on the 
Jacobi algorithm has been proposed; this technique reduces the number of required 
iterations for large symmetric matrices and can also offer partial computations of the 
eigenvectors and eigenvalues. Therefore, the proposed technique can improve the 
parallelism level not only for the KLT computations but also for other applications 
where some of the eigenvectors or the eigenvalues can be utilised in the next 
computation stage.
• A novel hardware algorithm for computing the eigenvalues and the eigenvectors of 
large symmetric matrices, which is required for the KLT, has been addressed; this 
employs the proposed matrix reduction technique with a selectable level of output
accuracy. In addition to KLT applications, this eigenvectors computer can be utilised 
for different applications that employ eigenvectors computations.
• A comprehensive analysis of both fixed- and floating-point implementations of the 
proposed system has been presented. This analysis includes a comprehensive 
comparison considering different design and performance aspects such as power 
consumption, hardware resources, accuracy and processing time.
• Two different approaches to the KLT computations have been proposed: the first is for 
fast processing time on a Static Random-Access Memory (SRAM) FPGA platform and 
the second is for low power and hardware resources on a Flash FPGA platform.
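As background to the eigenvector contributions listed above, the following is a minimal sketch of the classical cyclic Jacobi eigenvalue method for a real symmetric matrix, written in plain Python. It is the standard textbook formulation and is not the matrix reduction technique or the hardware algorithm proposed in this thesis; the function name, tolerance and sweep limit are assumptions made for illustration.

```python
import math


def jacobi_eigen(matrix, tol=1e-12, max_sweeps=50):
    """Classical cyclic Jacobi method for a real symmetric matrix.

    Returns (eigenvalues, eigenvectors); eigenvector j is column j of the
    returned matrix. The input matrix is not modified.
    """
    n = len(matrix)
    a = [list(row) for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        # Convergence check: norm of the off-diagonal part.
        off = math.sqrt(sum(a[i][j] * a[i][j]
                            for i in range(n) for j in range(n) if i != j))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation (smaller angle) that zeroes the element a[p][q].
                tau = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, tau) / (abs(tau) + math.hypot(1.0, tau))
                c = 1.0 / math.hypot(1.0, t)
                s = t * c
                for k in range(n):  # columns p and q: A <- A J
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rows p and q: A <- J^T A
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
                for k in range(n):  # accumulate eigenvectors: V <- V J
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p] = c * vkp - s * vkq
                    v[k][q] = s * vkp + c * vkq
    return [a[i][i] for i in range(n)], v


if __name__ == "__main__":
    # Small symmetric matrix standing in for a covariance matrix.
    cov = [[4.0, 1.0, 0.0],
           [1.0, 3.0, 1.0],
           [0.0, 1.0, 2.0]]
    eigenvalues, eigenvectors = jacobi_eigen(cov)
    print(eigenvalues)
```

Rotations that act on disjoint index pairs (p, q) touch different rows and columns, which is the property that makes Jacobi-type methods attractive for parallel hardware.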
1.3 Publications
The results of this thesis are reported in six conference publications. A list of the published
papers related to this thesis is given below.
[1] C. Egho and T. Vladimirova. “Adaptive Hyperspectral Image Compression using the 
KLT and Integer KLT algorithm”, NASA/ESA Conference on Adaptive Hardware and 
Systems (AHS-2014), July 2014, Leicester, UK.
[2] C. Egho and T. Vladimirova. “Hardware Acceleration of the Integer Karhunen-Loève 
Transform Algorithm for Satellite Image Compression”, IEEE International 
Geoscience and Remote Sensing Symposium (IGARSS 2012), July 2012, Munich, 
Germany.
[3] C. Egho, T. Vladimirova and M. Sweeting. “Acceleration of the Karhunen-Loève 
Transform for System-on-a-Chip Platform”, NASA/ESA Conference on Adaptive 
Hardware and Systems (AHS-2012), June 2012, Nuremberg, Germany.
[4] C. Egho and T. Vladimirova. “Hardware Acceleration of the Karhunen-Loève 
Transform for Compression of Hyperspectral Satellite Imagery”, The 11th Australian 
Space Science Conference (ASSC2011), September 2011, Canberra, Australia.
[5] C. Egho and T. Vladimirova. “Eigenvectors Computation on a System-on-Chip 
Platform for Satellite On-Board Use”, 7th Jordanian International Electrical and 
Electronics Engineering Conference (JIEEEC 2011), April 2011, Amman, Jordan.
[6] C. Egho and T. Vladimirova. “Design of Low-Power Multifunctional 
System-on-a-Chip Based On-Board Controllers”, Surrey Postgraduate Research 
Conference, September 2010, Guildford, UK.
1.4 Thesis Outline
The outline of this thesis is as follows:
Chapter 2 presents a comprehensive literature review, which includes: space radiation effects 
and their mitigation techniques, FPGA platforms and their suitability for space applications, 
reconfigurable computing and the System-on-a-Chip technology along with their trends and 
challenges. The needs and the benefits of high-performance computing acceleration for 
different applications are also highlighted. Finally, an overview of different on-board satellite 
computers is presented in this chapter.
Chapter 3 presents an overview of hyperspectral satellite imaging, its significance 
for different applications and the mechanism of its operation. The compression process of 
the hyperspectral data is also discussed and the Consultative Committee for Space Data 
Systems (CCSDS) standards are addressed. A discussion of the spectral decorrelation 
techniques is presented; this includes the compression performance of these techniques, 
their complexities and the approaches to reduce these complexities.
Chapter 4 presents a comprehensive analysis of the KLT computation process; this includes 
an overview of the KLT computation and the computational requirements of each individual 
computation process. Different techniques for the computation of the eigenvectors are 
investigated, analysed and compared. In addition, a comprehensive error analysis of the 
fixed-point implementation of the KLT algorithm is presented in this chapter.
Chapter 5 addresses the acceleration of the KLT algorithm from a hardware perspective. 
The computation flow of the algorithm is presented and the acceleration of the eigenvectors 
computations is addressed, where the Jacobi algorithm is demonstrated on 
the on-chip processors (Cortex-M3 and NIOS II) and the proposed Matrix Reduction 
Technique is investigated in the context of embedded processors. In addition, this chapter 
addresses the proposed hardware architecture on both platforms (Flash and SRAM based 
FPGAs) and outlines the required resources (hardware and power) and the execution time in 
detail, highlighting the main benefits over previous work.
Chapter 6 presents a comprehensive analysis of the Integer KLT computation; this includes 
an overview of the integer KLT, the computational requirements, an investigation of the fixed 
point and floating point implementation so the design considerations can be defined for 
lossless compression. In addition, this chapter outlines the differences in the computational 
requirements between the KLT and the Integer KLT algorithm.
Chapter 7 presents the acceleration of the Integer KLT computation from a hardware 
perspective with a specific emphasis on the processing time, hardware resources and power 
consumption. The computation flow of the algorithm and the proposed hardware architecture 
on the hardware platforms are discussed. The required resources (hardware and 
power) and the execution time are outlined, and the main constraints and challenges of 
accelerating the Integer KLT from a hardware perspective are defined. An adaptive KLT / 
Integer KLT system for lossy / lossless compression is also presented in this chapter.
Chapter 8 presents a summary of the research work and suggests potential future research 
work.
Chapter 2
Literature Review
2.1 Introduction
This research is mainly concerned with high-performance computing (hyperspectral image 
compression) on FPGA-based System-on-a-Chip platforms for space applications. The 
multidisciplinary nature of this research dictated a broad literature review. Since the research 
is targeting space applications, space radiation effects will be presented in Section 2.2. 
Sections 2.3, 2.4 and 2.5 will present overviews of the FPGA, the reconfigurable computing 
and the System-on-a-Chip technologies, respectively. These three sections will also highlight 
the suitability of these technologies for space applications along with their current trends and 
challenges. Section 2.6 will address the needs and the benefits of high-performance computing 
acceleration for different applications. Finally, an overview of different on-board satellite 
computers will be presented in Section 2.7.
2.2 Space Radiation Effects
Our planet is sheltered by its atmosphere, which can block or attenuate many space 
phenomena such as high-energy cosmic rays and solar radiation. Electronic 
systems operating in space are therefore far more vulnerable to radiation effects than the 
ones operating on the ground, which raises one of the most challenging problems in 
space electronic design. Space is a hostile environment for electronic 
systems, as numerous radiation types are present and can affect their performance, and 
in some cases they can even cause these systems to malfunction [13]. The main sources of 
these radiation effects are:
2.2.1 Galactic Cosmic Rays
Galactic cosmic rays (GCRs) are high-energy particles (100 MeV to 10 GeV) 
composed of electrons, protons and ionized nuclei that originate from outside the 
solar system [14]. After a supernova, clouds of gas and magnetic fields are formed, 
from which the GCRs are produced and accelerated to high speed (almost the speed 
of light) [15]. However, some GCRs have extremely high energy and their creation 
process is still unknown.
2.2.2 Trapped Radiation Belts
The trapped radiation belts are two tori of energetic charged particles (plasma) around 
our planet. The radiation belts are also known as the Van Allen belts, after James Van 
Allen, who first discovered them. The inner belt can extend up to 2.5 Earth radii, while the 
outer belt can extend up to 10 Earth radii [16]. Electrons and protons constitute most 
of the inner belt, while electrons mainly constitute the outer belt. The electrons can 
have energies up to 1 MeV in the inner belt and up to 20 MeV in the outer belt, while 
the protons can have energies up to 500 MeV [16].
2.2.3 Solar Particle Events
Solar Particle Events coincide with solar flares, whose activity follows a cycle of 
approximately 11 years. They last for a few days and result in an intense stream of 
charged particles (protons and heavy ions) in space. These particles are similar to the 
galactic cosmic rays but have slightly lower energies [16] [17].
These radiation sources lead to different effects on electronic systems; these effects and their 
mitigation techniques are discussed in detail in Appendix A.
2.3 FPGA Types and their Suitability for Space Applications
A Field Programmable Gate Array (FPGA) is a reprogrammable integrated circuit, usually 
programmed using a Hardware Description Language (HDL), such as VHDL or Verilog. In 
comparison to Application Specific Integrated Circuit (ASIC) devices, FPGAs have a shorter 
time to market and a lower non-recurring engineering cost; moreover, their on-board 
reprogramming ability makes it possible to debug and further develop the product 
after production [18] [19].
There are three types of FPGA: Anti-fuse, Flash and SRAM FPGAs; Table 2.1 outlines the 
main differences between these types. The anti-fuse FPGAs have high radiation tolerance; 
however, since they can only be programmed once, they are in some respects regarded as 
ASICs [20]. Therefore, the anti-fuse FPGAs will not be considered in this research.
The main advantages of Flash FPGA over SRAM FPGAs are:
• The power consumption of Flash FPGAs is lower than that of SRAM FPGAs [21].
• Flash FPGAs are more immune to single-event radiation effects [22].
• Flash FPGAs are non-volatile, so they retain the configuration when powered down; 
thus, they do not require an external non-volatile memory for the configuration bits, 
which can reduce the power and footprint requirements.
On the other hand, the main advantages of SRAM FPGAs over Flash FPGAs are:
• SRAM FPGAs offer much larger logic resources than Flash FPGAs.
• SRAM FPGAs offer dynamic reconfigurability, while Flash FPGAs can only offer static 
reconfigurability.
• SRAM FPGAs have a faster reprogramming speed than Flash FPGAs (approximately 3 
times faster) [23].
                                          SRAM                   Flash                     Anti-fuse
Manufacturing process                     Standard CMOS          Flash process             Anti-fuse (needs special development)
Area (storage elements)                   High (6 transistors)   Moderate (1 transistor)   Low (0 transistors)
Power consumption                         High                   Low                       Low
Requires external configuration memory    Yes                    No                        No
Reconfigurability                         Dynamic                Static                    Not possible
Reprogramming speed                       Fast                   Slow                      Not possible
Radiation tolerance                       Low                    High                      High
Altera and Xilinx dominate almost 90% of the FPGA market with their SRAM FPGAs [24]. 
However, the US aerospace market is mainly dominated by Actel and Xilinx, whereas 
Atmel has a larger presence in the EU aerospace market (ESA and CNES) [25]. In this work, 
the Flash-based SmartFusion from Actel and the SRAM-based Cyclone IV from 
Altera will be considered; both devices and their development boards are presented in detail 
in Section 5.3 and Appendix C.
2.3.1 FPGA Architecture
As shown in Figure 2.1, the FPGA architecture consists of three main elements: 
programmable blocks (Logic, memory and multipliers), their interconnecting resources 
(grey) and the I/O Cells (red).
[Figure: rows of programmable blocks (Logic, Memory, Logic, Multiplier) joined by 
interconnecting resources, with I/O cells at both ends of each row.]
Figure 2.1: FPGA Internal Architecture [23]
2.3.1.1 FPGA Input/Output Cells
The FPGA I/O Cells serve as interfaces between the internal logic of the FPGA and the other 
components on the board. Some of these cells are designated for power, ground and analogue 
interface, while others contain input and output buffers, so they can be configured as digital 
inputs or outputs. These cells support different voltage and interface standards to protect the 
FPGA and to reduce the complexity of the communication with the outside world.
2.3.1.2 FPGA Programmable Blocks
The programmable blocks are the logic fabric where the hardware computations are 
performed; these can be simple functions such as logical functions (AND, OR etc.) or 
complicated functions such as trigonometric functions. Some of these blocks are generic 
(gates and LUTs) while others are function-specific (multipliers, MACs and DSPs); in 
addition, some of these blocks are designed and optimised to work as memory blocks 
(RAM and ROM).
General Blocks:
Different vendors use different terms for their general blocks: Xilinx uses Configurable 
Logic Blocks (CLBs) and Slices [26], Altera uses Logic Elements (LEs) and Adaptive Logic 
Modules (ALMs) [27] and Actel uses VersaTiles [28]. The architecture of these blocks 
consists mainly of Look-up Tables (LUTs), latch elements and some combinational and 
sequential logic. However, blocks from different vendors have entirely different 
architectures, which exhibit different performance depending on the application. Therefore, 
an objective comparison between these FPGAs is not a straightforward procedure. 
Nevertheless, a comparison based on the logic densities of FPGA block fabrics from Xilinx, 
Altera and Actel was undertaken in [29], which is summarised in Table 2.2.
Device               CMOS Technology   Logic density (in LUT bits)   Register (bits)   Effective size (approximately)
Xilinx Virtex-4      90 nm (SRAM)      32                            2                 1
Xilinx Virtex-5      65 nm (SRAM)      256                           4                 2
Altera Cyclone IV    65 nm (SRAM)      64                            2                 1.3
Actel ProASIC3       130 nm (Flash)    4                             0.5               0.25
Functional Blocks:
Altera and Xilinx provide hardware blocks (multipliers and MACs) to improve the 
performance of Digital Signal Processing (DSP) applications. In addition, these blocks 
consume much less power than the same functions implemented on the general 
blocks. The low-cost Altera Cyclone FPGAs are supported with dedicated multipliers and the 
high-end Altera Stratix FPGAs are supported with dedicated multiplier-accumulator (MAC) 
blocks [30]. Xilinx, on the other hand, supports its FPGA devices (Virtex and Spartan) 
with the dedicated DSP blocks XtremeDSP; these blocks support multiplication, addition 
and accumulation [31].
Soft Functional Blocks (soft Intellectual Property, IPs) are also provided by FPGA and 
third-party vendors; these blocks can perform more complicated functions (complex 
mathematics, bus interfaces etc.). However, the soft IPs utilise the dedicated DSP blocks and 
the general blocks, so they do not offer much further power reduction beyond that of the DSP 
blocks. Therefore, the major advantage of the soft IPs is reducing the development time. 
Other IPs can be much more complicated, such as microprocessors; these will be discussed 
in the System-on-a-Chip section.
2.3.2 FPGA Trends
Since their introduction in the mid-1980s, FPGA devices have become fundamental to many 
electronic applications. They have therefore been through extensive development processes, 
which dramatically altered their architecture, significantly improved their performance 
and reduced their power consumption. This development has revolutionized the FPGA 
technology at the system-level architecture as well as in the manufacturing process of the die. 
In February 2013, Altera and Intel announced the development of an FPGA based on the 
14 nm Tri-Gate process technology. Also known as FinFET, this technology is based on a 
3-dimensional fabrication technique, which offers significant advantages over the traditional 
planar CMOS technology [32]. The main advantages of this technology are reducing the 
power consumption, improving the performance, improving the transistor design density and 
reducing the susceptibility to single event upsets (SEUs) [34]. Altera has adopted the new 
technology in their high-end Stratix family, while the low-cost and midrange families are still 
fabricated using the traditional planar CMOS technology. Xilinx has also introduced the 
FinFET technology (16 nm) for the UltraScale architecture [19].
In addition to the development of the manufacturing process, the FPGA system-level 
architecture has gone through a thorough development in order to satisfy the demanding 
requirements of the hi-tech market. Most modern FPGA devices can be seen as 
systems-on-chip (SoCs): they incorporate hardwired functional units for general and specific 
applications, such as DSP units, processors, interface blocks and transceivers. The 
high-performance dual-core ARM Cortex-A9 MPCore processor has been introduced in 
different Xilinx FPGAs. The same processor was also introduced in the new low-cost Altera 
Cyclone V, while the high-end Stratix 10 incorporates the 64-bit quad-core ARM Cortex-A53 
processor [18].
2.3.3 FPGA in Space
FPGAs experience many different types of radiation effects in space: Total Ionising Dose 
(TID), Single Event Latch-up (SEL), Single Event Functional Interrupt (SEFI) and Single 
Event Upsets (SEU). Xilinx and Actel introduced radiation-tolerant FPGAs, which are 
believed to show TID and SEL immunity up to a certain level [35].
• The SEUs in FPGAs can be classified into three categories:
Configuration Upsets: these occur when the configuration memory experiences an SEU, so 
only SRAM-based FPGAs can experience this type of error [36]. It can be detected 
through readback and verification of the configuration memory [37].
User Logic Upsets: the contents of the user logic vary during the normal operation of 
the FPGA, so not all of these contents are accessible through the bit-stream. Hence, 
detecting this type of error is not always possible with the read-back technique [36]. 
Architecture Upsets: this type of error occurs when an upset hits the control logic of 
the FPGA (reset control, configuration circuits or JTAG TAP controller). This type is 
also defined as a Single Event Functional Interrupt (SEFI) [37].
• A Single Event Functional Interrupt (SEFI) can lead to:
Device De-configuration: the purpose of the Power-On-Reset (POR) circuit is to 
initialize the device and prepare it for configuration by clearing the configuration 
memory. This process starts as soon as the POR circuit detects a device power-up. 
When an upset hits the POR circuit during normal operation, a device 
de-configuration arises and all the configuration data are lost. Consequently, the 
device must be reconfigured [35].
Interruption from JTAG operations: the JTAG / boundary scan circuit implements a 
standard test access port (TAP), which is a 4-bit binary encoded state machine. When 
an upset occurs in any of the state machine registers, it can change or shift the current 
state to any available one, hence affecting its normal operation [38] [39].
Activating output drivers on an input pin: although the probability of this is very small, 
it results in bus contention when it occurs. An SEU might short-circuit two internal 
output drivers, resulting in an unintentional high-current state that may exceed the 
current density requirements for reliable operation [40].
In addition to the above potential problems, the design tools can sometimes introduce 
unwanted modifications. It has been observed that some synthesis tools can eliminate the 
redundant logic that is intentionally implemented for SEU mitigation. Moreover, some 
place-and-route tools can sometimes insert flip-flops in the design, which can undermine 
the SEU mitigation techniques [41].
2.3.3.1 FPGA Mitigation Techniques for Space Radiations
There are two types of mitigation techniques for FPGAs: techniques for the configuration 
memory, to ensure that the functionality of the design is not corrupted, and techniques for 
the user logic, to ensure that the processed data are not corrupted.
Configuration Memory Techniques: Reconfiguration of the configuration memory is used 
for the detection and correction of SEUs. This can be performed by either completely 
reconfiguring the FPGA design or partially reconfiguring the affected part. The complete 
reconfiguration is discussed in [42]; however, this approach requires interrupting the 
operation and leads to loss of data. On the other hand, the partial reconfiguration can be 
performed using two methods:
1. SEU detection and single frame correction: this is performed by reading back the 
configuration memory and, when an upset is detected, correcting the data frame 
containing that upset. In this method, the configuration logic is in read mode most of 
the time and in write mode only when a correction is made, which limits the 
consequences of any upset that might occur in the configuration logic itself. However, 
this method requires some hardware overhead for the read-back and detection 
mechanisms.
2. SEU scrubbing: the read-back and scrubbing processes require less hardware than 
the first method. However, in this method, the configuration logic is in write 
mode for a longer time, hence increasing the potential risk of configuration memory 
upsets.
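The difference between the two methods above can be illustrated with a toy software model. In this sketch the configuration memory is a list of byte strings ("frames"), a stored CRC per frame stands in for the detection mechanism, and a golden copy stands in for the external configuration source. All names and the frame layout are assumptions made for illustration; real devices perform these steps through vendor-specific configuration interfaces that are not modelled here.

```python
import zlib


def flip_bit(frame, bit_index):
    """Return a copy of `frame` (bytes) with one bit inverted, modelling an SEU."""
    corrupted = bytearray(frame)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def readback_and_correct(config_memory, golden_frames, golden_crcs):
    """Method 1: read every frame back and rewrite only the frames whose CRC
    differs from the stored one. Returns the indices of the rewritten frames."""
    corrected = []
    for index, frame in enumerate(config_memory):
        if zlib.crc32(frame) != golden_crcs[index]:
            config_memory[index] = golden_frames[index]  # single frame write
            corrected.append(index)
    return corrected


def blind_scrub(config_memory, golden_frames):
    """Method 2: rewrite every frame without any detection step.
    Returns the number of frame writes performed."""
    for index, frame in enumerate(golden_frames):
        config_memory[index] = frame
    return len(golden_frames)


if __name__ == "__main__":
    golden_frames = [bytes([value] * 8) for value in range(4)]
    golden_crcs = [zlib.crc32(frame) for frame in golden_frames]

    config_memory = list(golden_frames)
    config_memory[2] = flip_bit(config_memory[2], 5)  # inject one upset
    print("frames rewritten by method 1:",
          readback_and_correct(config_memory, golden_frames, golden_crcs))
    print("frame writes needed by method 2:",
          blind_scrub(config_memory, golden_frames))
```

In the model, the first method writes one frame per detected upset but needs the CRC store and comparison logic, while the second needs no detection logic but performs a write for every frame on every pass, mirroring the trade-off described above.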
User Logic Techniques: Triple Module Redundancy (TMR) is a very common technique, which 
can be achieved by tripling the design at the system, subsystem or component level. At the 
output, a voting scheme is applied and only the majority vote is considered [35]. In this 
technique, the entire design should not exceed 1/3 of the device. The voter of the traditional 
TMR is not protected; hence, the system will still be vulnerable if the voter is hit. Xilinx 
developed the Xilinx TMR approach, in which all the inputs, the feedbacks and the outputs 
are tripled in order to protect the voter. With some additional complexity, Xilinx TMR has 
shown better protection against SEUs and SETs [43] [44].
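The voting scheme of the traditional TMR can be sketched in a few lines. The example below is a software model only: the three replicas are calls to the same function, and upsets are modelled as bit masks XORed onto each replica's output. The function names and the fault model are assumptions made for illustration, and the voter here is unprotected, as in the traditional scheme.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three integer words: each output bit takes the
    value held by at least two of the three inputs."""
    return (a & b) | (a & c) | (b & c)


def tmr_execute(function, value, fault_masks=(0, 0, 0)):
    """Run three replicas of `function` and vote on their outputs.

    `fault_masks` holds one bit mask per replica; a set bit inverts the
    corresponding output bit of that replica, modelling an upset.
    """
    outputs = [function(value) ^ mask for mask in fault_masks]
    return majority_vote(*outputs)


if __name__ == "__main__":
    def increment(x):
        return x + 1

    # One faulty replica is out-voted by the other two.
    print(tmr_execute(increment, 41, fault_masks=(0, 0b100, 0)))
    # Two replicas hit in the same bit defeat the voter.
    print(tmr_execute(increment, 41, fault_masks=(0b100, 0b100, 0)))
```

The second call shows the limit of the scheme: TMR masks a fault in a single replica, but simultaneous upsets of the same bit in two replicas, or an upset in the voter itself, are not corrected.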
2.4 Reconfigurable Computing
Reconfigurable computing (RC) is defined as a system whose hardware adapts to 
specific functions [45]; in other words, reconfigurability is the ability to change the system’s 
hardware in order to optimise the overall performance by improving the processing speed 
while reducing the power and hardware requirements. The concept of reconfigurable 
computing was first introduced by Prof. Gerald Estrin in 1962. Estrin proposed this concept 
by introducing the idea of a “fixed-plus-variable” computer structure [46]. However, the 
limitations of the silicon technology at the time prevented the feasible realisation of this 
concept. In the 1980s and 1990s, different reconfigurable architectures were developed and 
proposed in academia and industry [47]. However, it was not until 1991 that the concept was 
realised commercially, when Algotronix announced the first reconfigurable system [48]; 
the company was later bought by Xilinx Inc.
[Figure: block diagram with the blocks Reconfigurable hardware, CPU and Memory system.]
Figure 2.2: The implementation of reconfigurable computing [49].
This technique employs a processor and adaptive hardware that can be 
reconfigured by the processor at any time. As shown in Figure 2.2, the processor can remove, 
add and swap certain parts of the hardware depending on the requirements of the tasks being 
executed [49]. Therefore, a significant reduction in space and power requirements can be 
achieved when implementing reconfigurability in the system. Moreover, since it offers 
the capability of altering the hardware, reconfigurability increases the system flexibility and 
can increase the processing speed. Reconfigurable computing has been implemented in many 
cutting-edge technologies such as medical imaging, networking, robotics, video applications, 
emulation systems and space applications. In addition to the traditional benefits of 
reconfigurable computing, the ability to update mission objectives and fix design errors 
boosts the significance of reconfigurable computing in space applications, as it can increase 
the mission lifetime. Therefore, reconfigurable computing has been adopted in various space 
missions such as the Mars Pathfinder and Surveyor [50] [51].
2.4.1 Reconfigurable Systems Architecture
Consisting of one or more CPUs, reconfigurable fabric and memory units, reconfigurable 
systems can be presented in different topologies at their system-level architecture [52]. 
Figure 2.3 depicts five possible system level architectures for reconfigurable systems as 
presented in [52] and [55].
The first architecture utilizes the reconfigurable fabrics as stand-alone units, and the CPU 
communicates with these units through the processor’s inputs and outputs. Therefore, data 
transfer between the CPU and the reconfigurable fabrics is rather slow. Furthermore, when 
the CPU is heavily involved with the reconfigurable fabrics in processing certain tasks, the 
processing time increases substantially. Therefore, this architecture is convenient only when 
the level of communication between the CPU and the reconfigurable fabrics is very limited. 
This is the case in emulation systems, where this architecture is quite popular, for example 
in Cadence Palladium II or Mentor Graphics VStation Pro [53] [54].
In the second and third architectures, shown in 2.3.b and 2.3.c, the coupling between the 
CPU and the reconfigurable fabric is tighter; thus, there is less delay in the communication 
between them, but also less parallelism in program execution because the CPU intervenes 
more in the reconfigurable fabrics. The reconfigurable fabrics in these architectures can be 
seen as a coprocessor, which is configured by the main (host) processor. After configuration, 
the reconfigurable fabrics can work independently, without supervision by the main 
processor, until the task is done; when the results are ready, they are sent to the main 
processor. The noticeable difference between the two architectures is the coupling level, 
which is tighter in 2.3.c. Moreover, the main CPU cache is only accessible to the 
reconfigurable fabric in architecture 2.3.c, so there is more delay in data transfer (input, 
result and configuration data) in architecture 2.3.b. In other words, architecture 2.3.b can be 
seen as a compromise between architectures 2.3.a and 2.3.c.
In architecture 2.3.d, the reconfigurable fabric is embedded inside the main CPU. 
This fabric, referred to as a functional unit (FU), can provide additional instructions, which 
can be amended because the FU is made of reconfigurable fabric [55]. In the last architecture, 
2.3.e, the CPU is embedded inside the reconfigurable fabric. The CPU in this architecture can 
be either a hard processor or a soft processor programmed into the reconfigurable fabric; the 
latter case is more vulnerable to radiation effects.
Each of the above architectures has certain advantages and disadvantages; therefore, an 
appropriate architecture can be selected according to the application requirements (execution 
parallelism, data communication delay etc.).
[Figure 2.3, panels (a) to (e); legible labels: Reconfigurable Processing Unit in (a), FU and CPU in (d), Programmable Fabric and CPU in (e)]
Figure 2.3: System-level architectures for RC Systems [52]
2.4.2 Reconfigurable Fabric Structure:
From the granularity perspective, reconfigurable systems can be classified into two 
categories: fine-grained and coarse-grained. While fine-grained systems consist of a large 
number of small components, coarse-grained systems consist of a small number of large 
components. Having large components makes coarse-grained systems more efficient for 
certain applications, at the expense of flexibility; on the other hand, fine-grained systems 
provide more flexibility but are less efficient for applications requiring a high level of 
computation. In order to exploit the advantages of both categories, some reconfigurable 
systems comprise both fine-grained and coarse-grained fabrics.
2.4.3 Reconfiguration Scheme:
From a procedural perspective, the process of reconfigurability can be performed either 
dynamically or statically as shown in Figure 2.4 [56].
Figure 2.4 (a) Static Reconfiguration (b) Dynamic Reconfiguration [56]
2.4.3.1 Static Reconfiguration
When static reconfiguration is implemented, the hardware is configured at the beginning 
of the application process and then stays fixed during the whole operation. Any change in 
the hardware requires halting the system, loading the configuration bit-stream and then 
restarting with the new configuration. In this scheme, only a single configuration can be 
loaded at a time; hence, it is referred to as single-context reconfiguration, as will be 
explained in the following subsection.
[Figure 2.5, panels (a) to (c): each panel shows an incoming configuration and the logic and routing fabric after reconfiguration]
Figure 2.5: Reconfiguration Schemes [56]
2.4.3.2 Dynamic Reconfiguration
In this scheme, the hardware reconfiguration can be performed while the operation is 
running; thus, it is referred to as run-time reconfiguration. Figure 2.5 depicts three different 
methods that can be adopted when implementing dynamic reconfiguration.
• Single Context Reconfiguration: The first method, shown in Figure 2.5(a), is 
referred to as single context and is usually used for static reconfiguration. This 
method requires reloading the entire configurable hardware even if only a small part 
needs to be reconfigured.
• Multiple Context Reconfiguration: This method requires the hardware to have 
multiple programming memories, so that it can be loaded with multiple configuration 
contexts at the same time. These contexts can be seen as planes, as shown in Figure 
2.5(b). When in operation, the system executes one configuration (plane); if a 
different configuration is needed, the hardware can swiftly switch to the required 
configuration (plane). As many contexts are programmed at the same time, a large 
programming memory is required. Coarse-grained fabric is usually used for this 
approach because it can fulfil the memory requirements [57].
• Partial Reconfiguration: In the previous approaches, the entire hardware is 
reconfigured; however, in some cases only a part of the hardware needs to be 
reconfigured. This is referred to as partial reconfiguration, Figure 2.5(c). This 
feature is only supported by relatively modern FPGA families. Unlike the previous 
approaches, reconfiguration can be performed without halting the operation of the 
system. In other words, the system is divided into many modules, and while certain 
modules are being reconfigured, the other modules continue to execute their normal 
functionality without interruption. Unlike traditional design tools, which perform 
full reconfiguration by loading the entire system with a new bit-stream, the partial 
reconfiguration procedure requires special design tools, which can reconfigure 
certain modules without interrupting the others [58]. SoCWire, an on-chip bus 
architecture based on the well-known SpaceWire, has been used to perform partial 
reconfiguration for space applications [59], where it guarantees reliable performance 
in the harsh space environment.
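The practical difference between the three schemes above is which parts of the fabric are disturbed by a reconfiguration. The following Python sketch is a toy behavioural model under stated assumptions: the Fabric class, its method names and the region names are all hypothetical, and no vendor tool flow or bit-stream format is modelled. Each method returns the set of regions whose operation is interrupted.

```python
from dataclasses import dataclass, field


@dataclass
class Fabric:
    """Toy model: region name -> configuration currently loaded."""
    modules: dict
    # Extra configuration planes held on chip (multi-context devices only).
    contexts: list = field(default_factory=list)

    def single_context_reload(self, new_config: dict) -> set:
        """Load a full bit-stream: every region is replaced and interrupted."""
        interrupted = set(self.modules)
        self.modules = dict(new_config)
        return interrupted

    def switch_context(self, index: int) -> set:
        """Swap the active plane with a stored one; all regions change at once."""
        interrupted = set(self.modules)
        self.modules, self.contexts[index] = self.contexts[index], self.modules
        return interrupted

    def partial_reconfigure(self, region: str, config: str) -> set:
        """Rewrite one region only; the other regions keep running."""
        if region not in self.modules:
            raise KeyError(region)
        self.modules[region] = config
        return {region}


if __name__ == "__main__":
    fabric = Fabric(
        modules={"transform": "v1", "io": "v1"},
        contexts=[{"transform": "v2", "io": "v2"}],
    )
    print(fabric.partial_reconfigure("transform", "v2"))  # only one region touched
    print(fabric.switch_context(0))                       # whole plane swapped
    print(fabric.single_context_reload({"transform": "v3", "io": "v3"}))
```

The model deliberately ignores reconfiguration time and memory cost; it only captures the point made in the text that partial reconfiguration leaves the remaining modules undisturbed, whereas the other two schemes affect the whole device.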
2.4.4 Reconfigurable Computing Trends and Challenges:
The field of reconfigurable computing is paving the way for new applications and, when 
appropriately implemented, it can dramatically improve many system characteristics such as 
speed, area and power consumption. Therefore, since its introduction, reconfigurable 
computing has been the subject of intensive research. These activities aim to:
1. Introduce reconfigurable techniques to new applications, by investigating the 
suitability of RC for these applications and proposing optimal approaches for them.
2. Improve the reconfigurable system characteristics (speed, power consumption etc.) by 
proposing new approaches to the system-level architecture or the fabric granularity.
3. Introduce new approaches to increase the fault tolerance level of reconfigurable 
systems.
4. Investigate new methodologies to make the reconfiguration schemes more efficient.
5. Investigate new technologies to make dynamic reconfiguration possible in the 
presence of radiation in harsh environments such as space (this will be discussed in 
the next subsection).
However, there are many challenges facing the reconfigurable computing paradigm:
• First, an efficient RC design is complex and requires competence and expertise that 
most embedded designers lack. However, as RC is becoming more important for 
many embedded applications, more universities have started offering modules and 
labs focusing on this technique. Moreover, the continuous development of the RC 
design tool chain and CAD tools helps to simplify this procedure. For these two 
reasons, the lack of expertise can be gradually overcome [49].
• The second challenge is the power consumption. Conceptually, the RC technique can 
reduce the power consumption; however, this is the case only when a proficient 
design methodology is followed. Because of the hardware overhead, RC systems can 
consume more power when not designed carefully. Therefore, in order to encourage 
their clients to use this technique, RC platform vendors continuously work on the 
development of intelligent CAD tools that help the designer to deliver an optimal 
RC design [49].
• The third challenge comes with the hardware overhead, which increases the system 
susceptibility to errors.
These are the main challenges facing RC applications; however, when RC is considered for 
space applications, the presence of space radiation introduces new challenges [49].
2.4.5 Reconfigurable Computing in Space
In addition to its traditional advantages, reconfigurability offers further features for space 
applications; after-launch hardware modification is the most important of these. This has 
increased the interest in the technique for space applications; hence, many research 
activities have been working toward utilising it in space. In addition to the challenges 
illustrated in the previous section, the space radiation environment presents a new challenge. 
Thus, protecting the RC system from space radiation is another concern. The susceptibility 
of FPGAs to space radiation can be ranked from high to low as follows: SRAM, Flash and 
then anti-fuse. No reconfigurability is possible on anti-fuse FPGAs; only Flash and SRAM 
FPGAs offer this feature, where Flash FPGAs can only offer static reconfiguration and 
SRAM FPGAs can offer both static and dynamic reconfiguration.
Xilinx offers the space-grade radiation-hardened Virtex-4 QProV FPGA, which offers 
significant immunity against TID, SEU and SEE latch-up [60]; in addition, the Xilinx 
TMRTool offers further SEU mitigation. A recent study proposed an innovative 
non-volatile FPGA architecture based on Thermally Assisted Switching Magnetic Random 
Access Memory (TAS-MRAM) [61]. In terms of configuration time, this approach can be 
seen as a compromise between Flash and SRAM technology. Nevertheless, the configuration 
speed offered by TAS-MRAM is fast enough to handle dynamic reconfiguration. TAS-MRAM 
also offers radiation hardening; however, this incurs a larger die area, which means more 
power consumption. Table 2.3 presents a brief comparison between Flash, SRAM and 
TAS-MRAM FPGAs [61].
Table 2.3: A Comparison between SRAM, Flash and TAS-MRAM Technologies
Technology                  SRAM     TAS-MRAM   Flash
Read Speed                  Fast     Medium     Low
Write Speed                 Fast     Medium     Low
Die Area                    Medium   Large      Small
Retention (years)           0        10         10
Radiation Tolerance         Low      High       Medium
Dynamic Reconfigurability   Yes      Yes        No
2.5 System-on-a-Chip
The term System on Chip (SoC) refers to a set of components embedded into a single 
heterogeneous silicon chip, forming a complete system [62]. This chip typically comprises 
one or more processors, memories, peripheral blocks and accelerated functional units such 
as encoder and decoder blocks. In addition, an SoC might also contain analogue and RF 
components, micro-electro-mechanical systems (MEMS) and optical inputs and outputs. 
Therefore, a variety of complex components can be integrated into one SoC, and each of 
these components is a subsystem by itself. In addition to the power and area reductions, 
the coherent structure of SoCs offers better performance. Therefore, there has been 
significant interest in this technology, which paves the way for new generations of embedded 
systems. The architecture of an SoC varies according to the designated application; however, 
all SoCs contain at least one CPU, interface blocks and different types of memory blocks. 
Therefore, modern microcontrollers can be seen as SoCs in their simplest form. 
Nevertheless, as the objective of the SoC is to perform far more complex tasks, some 
application-specific components are usually incorporated into the same chip; these 
components are referred to as Intellectual Property cores (“IP cores”). While some cores can 
be relatively simple, such as communication interface blocks, others can be very 
complicated, such as DSP and image processors [63].
2.5.1 IP Cores
The main objective of the System on Chip is to perform complicated tasks within a specified 
budget (power consumption, memory, hardware resources). In order to meet these 
requirements, various highly complex blocks are required, and each of these blocks is 
designed to execute certain advanced functions. Designing these blocks requires a long 
development time; if all blocks had to be designed from scratch, the design life cycle of 
any SoC would take an unreasonable time and incur an excessive cost, so the whole process 
would not be feasible. Hence the significance of the reuse methodology when designing 
any system on chip: previously designed and verified blocks (IP cores) can be utilized 
in new designs to reduce the design life cycle. These IP cores can be obtained from many 
semiconductor vendors such as ARM and Xilinx. In addition, many open source providers 
offer IP cores at no cost; however, some of these can only be used under certain 
conditions. Moreover, some companies design their own IP cores for internal use 
[64] [65].
Hard versus Soft IP Cores
IP cores can be classified into two categories: soft cores, which come as synthesizable RTL 
code, and hard cores, which come completely designed, placed and routed as ASIC. While 
hard cores can offer higher performance and higher speed, soft cores can offer higher 
flexibility, re-usability and portability [66].
2.5.2 Soft Processors
Table 2.4 lists a variety of 8, 16, 32 and 64-bit soft processors [67]. While some 
of these processors are only compatible with the devices of their vendors, such as Nios II for 
Altera and MicroBlaze for Xilinx, others, mainly the open source ones, are not restricted to 
certain FPGA vendors. Three different soft processors will be presented in this report: the 
processors of the main FPGA vendors, namely the Altera NIOS II and the Xilinx MicroBlaze, 
and the LEON processor, which is designed for space applications.
Table 2.4: A list of a variety of soft processors
CPU Architecture  Bit  License      Pipeline  Cycles per instruction  FPU  Area (LEs)   Comment
SPARC-V9          64   Open GPL     6         1                       Yes  37000-60000  Single core version of UltraSPARC T1
SPARC-V8          32   Proprietary  7         1                       Yes  ~4000
SPARC-V8          32   Proprietary  7         1                       Yes  ~4000        Fault Tolerant
SPARC-V8          32   Open GPL     7         1                       Yes  3300
SPARC-V8          32   Open LGPL    5         1                       ext  3000
OpenRISC 1000     32   Open LGPL    5         1                       No   6000
MicroBlaze        32   Proprietary  3,5       1                       Opt  7324         Limited to Xilinx
MicroBlaze        32   Open LGPL    3         1                       No   2336         Open source clone of MicroBlaze
MicroBlaze        32   Open (MIT)   3         1                       No   7P2&
NIOS II           32   Proprietary  6         1                       Opt  7&00         Limited to Altera
NIOS II           32   Proprietary  5         1                       Opt  7770         Limited to Altera
NIOS II           32   Proprietary  No        6                       Opt  390          Limited to Altera
LatticeMicro32    32   Open         6         1                       No   79&4
ARM v6            32   Proprietary  3         1                       No   2600
DSPuva16          16   Open         No        4                       No   370
PicoBlaze         8    Proprietary  No        2                       No   792          Limited to Xilinx
PicoBlaze         8    Open (BSD)   No        2                       No   204          Open source clone of PicoBlaze
LatticeMicro8     8    Open         No        2                       No   200
2.5.2.1 LEON
Initially designed by the European Space Agency (ESA), the LEON processor was 
subsequently developed further by Gaisler Research [68]. Based on the SPARC-V8 
architecture, LEON is a 32-bit soft processor that is available either as open source without 
any license fees (standard LEON) or under a proprietary license (Fault Tolerant LEON FT). 
LEON is also available as a radiation-hardened ASIC from Atmel in the AT697E processor 
[69] and the AT7913E SpaceWire Remote Terminal Controller [70].
There are two families of LEON: the standard LEON and the Fault-Tolerant LEON FT (with 
EDAC) for the hostile space environment. The latest versions of these families are:
1. LEON-3FT: As the space-grade version of LEON-3, LEON-3FT provides 
efficient protection against Single Event Upsets (SEUs). In addition to most of the 
functions of LEON-3, LEON-3FT utilizes error correction mechanisms, which can 
handle up to 4 errors per 32-bit word in both the register file and the cache memory 
without affecting the performance. However, some LEON-3 functionalities are 
not supported by LEON-3FT, such as cache locking, local scratch RAM and LLR 
cache replacement. The overhead incurred by the fault tolerant mechanisms is less 
than 15%; so, depending on the supporting cache, LEON-3FT can occupy 4K-4.5K 
Virtex-4 LUTs (about 2K-2.5K Virtex-5 LUTs) [67] [71].
2. LEON-4: Released in January 2010, this is the latest version of LEON; its 
performance enhancement makes it specifically suitable for multi-processor 
system-on-chip solutions, for both synchronous and asynchronous approaches. 
Moreover, with its high configurability, LEON-4 can be optimized for higher 
performance, lower power consumption, input/output speed and physical area. 
When implemented on Virtex-5 FPGAs, LEON-4 can attain a speed of 125 MHz 
and occupies an area of 4K LUTs [67] [72].
[Figure 2.6 block labels: IRQ, MAC 16, MUL 32, DIV 32, 4-Port Register File, 7-Stage Integer Pipeline, I-RAM, D-RAM, Interrupt Control, Instruction Cache, Data Cache, Memory Management Unit, AMBA AHB Interface, IEEE 754 Floating-Point Unit, Co-Processor, Debug Interface, Power Down, Trace Buffer; legend: Minimum Configuration, Optional Blocks, Co-Processors]
Figure 2.6: A Block Diagram of LEON-4 [72]
2.5.2.2 Altera NIOS II
Designed by Altera, NIOS II is a 32-bit proprietary soft-core RISC processor; thus, it achieves 
better hardware utilization on Altera FPGAs than other soft processors. Three different 
versions of the NIOS II are available: the economy core (e), the standard core (s) and the fast 
core (f). Moreover, it supports an optional hardware floating point unit and hardware 
division [73], which can accelerate floating point and division operations, respectively.
[Figure 2.7 block labels: TCM, I-MEM, D-MEM, Custom Instr IF, Nios II, Int Cntrl, Exp Cntrl, MMU, MPU, JTAG Debug, HW BP, I & D Trace, Trace Port]
Figure 2.7: A Block Diagram of the NIOS II [73]
2.5.2.3 Xilinx MicroBlaze
Designed and developed by Xilinx, MicroBlaze is a 32-bit proprietary soft processor; thus, 
when implemented on Xilinx FPGAs, it occupies less area and offers better performance than 
any other soft processor. MicroBlaze can be optimized for higher performance (with an 
FPU), lower power consumption, speed and physical area [49].
[Figure 2.8 block labels: Instruction-side bus interface (IOPB, I/F), Program Counter, Instruction Buffer, Instruction Decode, Special Purpose Registers, Barrel Shift, Multiplier, Divider, Register File 32 x 32b, Data-side bus interface (I/F)]
Figure 2.8: A Block Diagram of MicroBlaze [54]
2.5.3 Hard Core Processor
Two hard processors that are used in SoC FPGA platforms will be presented: the PowerPC, 
which is incorporated in different Xilinx FPGAs, and the Cortex M-3, which is incorporated 
in the Actel SmartFusion SoC.
2.5.3.1 PowerPC (Performance Optimization with Enhanced RISC Performance Computing)
PowerPC is a 32-bit RISC processor introduced by AIM (Apple-IBM-Motorola) in 
1991. Initially it was meant to target personal computers; however, it has been 
employed for embedded system applications as well [74]. Xilinx has incorporated two 
PowerPC cores into its FPGAs: the PowerPC 405 in Virtex-II Pro and Virtex-4 FX, 
and the PowerPC 440 in Virtex-5.
The PowerPC is accompanied by different IP cores; these include bus 
infrastructure and bridge cores, memory and memory controller cores, DMA 
controllers, peripherals and arithmetic cores. Moreover, the PowerPC supports DSP 
algorithms with multiply-and-accumulate extensions; also, the PowerPC 440 is 
supported by an Auxiliary Processing Unit (APU), which provides further DSP 
functionalities such as Single-Instruction-Multiple-Data (SIMD) and floating point 
hardware [74] [75].
2.5.3.2 Cortex M-3
The ARM Cortex M-3 is a 32-bit RISC processor developed to provide high 
performance with low power consumption. It is incorporated in the SmartFusion and 
SmartFusion 2 SoC platforms, with operating frequencies of 100 and 166 MHz, 
respectively. The Cortex M-3 supports advanced features such as single-cycle 
multiply and hardware divide [76]. However, there is no DSP support or hardware 
floating point unit in the Cortex M-3.
In addition to the midrange Cortex M-3 processor, the high-performance multicore 
application processor Cortex A-9 has also been introduced in different FPGAs as a 
hard processor, such as in the Xilinx Zynq and the Altera Cyclone V. In October 2013, 
Altera announced that it will include the quad-core Cortex A-53 in its Stratix FPGAs; 
this will be the first 64-bit hard processor on an FPGA SoC [77].
2.5.4 On-Chip Bus Protocols
The main objective of an on-chip bus is to maintain highly reliable and efficient 
communication between the SoC’s cores [65]. There are different on-chip bus architectures:
AMBA (Advanced Microcontroller Bus Architecture): an open standard protocol 
introduced in 1996 by ARM Ltd, where it is described as “the de facto standard for 
on-chip communication”. The first version of AMBA included only two bus/interface 
protocols: the Advanced System Bus (ASB) as a high-speed, high-performance 
communication standard, and the Advanced Peripheral Bus (APB) as a low-power, 
low-performance standard. In AMBA 3, three additional standards were introduced: 
the Advanced eXtensible Interface (AXI), the Advanced High-performance Bus (AHB) 
and the Advanced Trace Bus (ATB). In the latest versions, AMBA 4 and 5, the 
Coherency Extensions were introduced; these enable the processors to access each 
other’s caches [78].
CoreConnect: introduced by IBM, used in the PowerPC and supported by the 
Xilinx MicroBlaze. CoreConnect defines three bus protocols: the processor local bus 
(PLB), the on-chip peripheral bus (OPB) and the device control register (DCR) bus [79].
Avalon: introduced by Altera and supported by the NIOS II and other Altera IP cores. 
The architecture supports 7 different types of interface [80]:
1. Avalon Streaming Interface (Avalon-ST): for unidirectional flow of data, 
including multiplexed streams and DSP packets.
2. Avalon Memory Mapped Interface (Avalon-MM): for the address-based 
read/write interface typical of master-slave connections.
3. Avalon Conduit Interface: for connecting an arbitrary collection of signals.
4. Avalon Tri-State Conduit Interface (Avalon-TC): for point-to-point 
connections between on-chip and off-chip peripherals.
5. Avalon Interrupt Interface: for driving interrupts that allow components to 
signal events to each other.
6. Avalon Clock Interface: for interfaces that drive or receive clocks.
7. Avalon Reset Interface: for interfaces that provide reset signals.
2.5.5 System-on-a-Chip Challenges and Trends
The remarkable advancement in VLSI technology in the past two decades has 
revolutionized the IC industry; by drastically increasing the transistor count per die, as 
shown in Figure 2.9, it has brought this industry into the era of the multibillion-gate chip, 
e.g. the 32 nm CMOS IBM zEC12.
[Figure 2.9 axes: transistor count (millions), logarithmic scale from 1 to 10,000, against year, 1992 to 2014; annotation: > 1 billion]
Figure 2.9: The CPU transistor count over the last 4 decades [81]
In addition to the continuously increasing transistor count, the need for a rapid design cycle, 
high performance and flexibility has been the main motive for System-on-a-Chip 
technology. To overcome the complexity of SoC design, a divide-and-conquer strategy 
is usually adopted. After specifying the system requirements, the design flow can 
be divided into digital and analogue domains, each of which can be broken down further in 
order to reduce the complexity of the design. These segments can usually be designed using 
programmable and IP cores, which help to improve the flexibility as well as 
accelerate the design cycle, as explained earlier in this chapter. These approaches play a 
significant role in lessening the design complexity; however, new challenges continually 
emerge and prompt different design trends. An overview of the System on Chip motives, 
approaches, emerging issues, trends and challenges is outlined in Figure 2.10 [82]. The main 
challenges and trends will be discussed in the next subsections.
[Figure 2.10 row headings: Basic Driving Forces, Prevailing Divide-and-Conquer Strategy, Emerging Issues, Modern Design Trends, Future Challenges; legible entries: rapid design cycle, high performance, more flexibility, transistors, HW/SW co-design, IP reuse, integration, programmable core, memory, power consumption, transistor variability, power management, MPSoC, verification, embedded memory, Network-on-Chip, IP integration, reliability, scalable and reusable architecture]
Figure 2.10: System on Chip motives, approaches, emerging issues, trends and challenges [82]
2.5.5.1 SoC Challenges
With the advancement of the SoC technology, many challenges arise. The main challenges
are:
1. Power Consumption (Power Wall): the growing demand for a higher transistor count 
per die incurs more power consumption, as shown in Figure 2.10. In addition, 
the demand for better performance requires higher speed, which also increases the 
power consumption. The higher transistor count and the higher speed do promise 
better performance; however, these come at the cost of higher power consumption. 
Therefore, it is always important to optimize the design to achieve high performance 
within the available power budget [82].
2. Memory Bandwidth and Latency (Memory Wall): While the computational speed can 
increase by 50% every year, the speed of off-chip memory access cannot increase 
by more than 20% a year. This gap is known as the “Memory wall”. This issue has 
raised the interest in embedded memories, which can offer higher speed. This type of 
memory has become very common in most SoCs, in which it is incorporated as SRAM, 
ROM, Flash or DRAM. However, embedded memories raise other problems such 
as cache coherence, memory consistency and indeterminate delay [82].
3. IP Integration: Implementing different IP cores on a single chip raises many issues:
• Working with black-box IP cores
• Implementing IP cores from different sources with different standards
• Handling the IP cores’ input/output requests in a timely manner
• Handling the noise coupled from analogue to digital components, or vice versa, 
when implementing analogue IP cores
Therefore, implementing IP cores is not a straightforward procedure, and special 
attention must be given in order to achieve stable performance [82].
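The growth rates quoted for the memory wall in item 2 above imply that the gap compounds: each year the ratio of computational speed to memory access speed grows by a factor of 1.5/1.2 = 1.25. The short Python sketch below works this out; the function name and the assumption that both rates stay constant over the whole period are illustrative simplifications, not claims taken from [82].

```python
def performance_gap(years: int,
                    compute_growth: float = 0.50,
                    memory_growth: float = 0.20) -> float:
    """Ratio of compute speed to memory access speed after `years`,
    assuming both start equal and grow at constant annual rates."""
    return ((1.0 + compute_growth) / (1.0 + memory_growth)) ** years


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(f"after {n:2d} years the gap is {performance_gap(n):.2f}x")
```

Under these assumptions the gap is 1.25x after one year and a little over 9x after ten years, which illustrates why embedded memories attracted so much interest.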
2.5.5.2 SoC Trends
The above challenges have steered the design trends as follows:
1. Power Efficiency: Power consumption has always been a major concern 
for System on Chip, so many techniques have been developed to tackle this problem. 
The first technique is implemented at the system level: unused parts are switched 
off and are switched on again, upon request, when needed. The second technique 
relaxes the global clock distribution, which is a significant contributor to the 
power consumption. This can be done by using a mesochronous design, in which a 
single frequency with different phases is used [82]. The drawback of this technique is 
that it can compromise the synchronisation of the system. In the last technique, 
asynchronous design is used to save the power consumed by the global clock. In this 
technique, hand-shaking protocols are used to synchronize the operation of the 
system, which increases the complexity of the design [82].
2. Multiple Processor System on Chip: These SoCs are usually known as 
MPSoCs, in which more than one processor is integrated into the chip in order to 
achieve more powerful processing at both data and instruction levels. Different types 
of processors can be combined in one chip; for example, application-specific 
processors (DSP, graphics or multimedia) combined with general-purpose processors 
can deliver better performance with more flexibility. Furthermore, this technique 
improves the performance with less power consumption overhead compared to a 
single processor [83].
3. Network-on-Chip (NoC): The need for efficient communication between the 
functional blocks has highlighted the importance of NoC technology [65].
2.6 Acceleration of High-Performance Applications
The growing demands of cutting-edge applications, which exceed the expectations of Moore’s 
Law (Figure 2.11), have driven researchers to examine the issue from different perspectives. 
Therefore, instead of tying development (speed, performance, power consumption 
etc.) to the advancement of CMOS technology alone, new paradigms have been 
introduced, such as hardware acceleration, application-specific processing, multi-processing 
and parallel processing.
[Figure 2.11 labels: title "Applications are outpacing Moore's Law"; curves "Data-driven computing" and "x86 performance" plotted against time, separated by "The Technology Gap"; listed drivers: HD video and image manipulation, complex multiphase models, real-time data demands, new algorithms. Source: AMD]
Figure 2.11: The Technology Gap between single CPU and HPC [84]
2.6.1 High Performance Computing on FPGAs
In the last two decades, FPGAs have gone through extensive development in different 
aspects: fabric, embedded memories, embedded processors, DSP accelerators and analogue 
computers and interfaces. This heterogeneous character has made FPGAs strong candidates 
for System-on-a-Chip platforms offering solutions for various applications. In addition, their 
reconfigurability makes FPGAs potentially versatile platforms for HPC. Many techniques 
have been employed on FPGAs to achieve high performance, including parallel processing, 
hardware co-processing, reconfigurable hardware and evolvable hardware; the latter two 
support fault tolerance as well [84] [85]. Most HPC systems are application-specific; hence, 
they are designed to deal with the problems of a certain application and to benefit from the 
computational properties of that application in order to provide optimum performance.
Table 2.5 outlines the use and benefits of FPGA acceleration across a wide range of
applications: financial analysis, medical, bioinformatics, as well as image and video
processing.
Table 2.5: Comparison of FPGA Accelerators with General Purpose Processors [84]
| Application | Processing Time: General Purpose Processor | Processing Time: FPGA Accelerator | Acceleration |
|---|---|---|---|
| Hough and Inverse Hough Processing | 12 minutes processing time, Pentium 4 at 3 GHz | 2 seconds of processing time at 20 MHz | 370X |
| Spatial Statistics (Two-Point Angular Correlation Cosmology) | 3,397 CPU hours with 2.8-GHz Pentium (approximate solution) | 36 hours (exact solution) | 96X |
| Black-Scholes (Financial Application) | 2.3M experiments/sec with a 2.8-GHz CPU | 299M experiments/sec | 130X |
| Smith-Waterman ssearch34 from FASTA | 6461 sec processing time (Opteron) | 100-sec FPGA processing | 64X |
| Prewitt Edge Detection (compute intensive video and image processing) | 327M clocks (1-GHz processing power) | 131K clocks at 0.33 MHz | 83X |
| Monte Carlo Radiative Heat Transfer | 60-ns processing time (3-GHz processor) | 6.12 ns of processing time | 10X |
| BJM Financial Analysis (5M paths) | 6300 sec processing time (Pentium 4 at 1.5 GHz) | 242 sec of processing at 61 MHz | 26X |
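As a quick consistency check, the acceleration factor can be recomputed as the ratio of the two processing times for the rows of Table 2.5 whose timings share a unit. The sketch below is only illustrative: the helper name `speedup` and the row selection are assumptions, and the recomputed ratios are close to, but not always identical with, the factors quoted in the table.

```python
# Rows of Table 2.5 whose two timings share a unit, so the ratio can be recomputed.
# Each entry: (application, general purpose time, FPGA time, factor quoted in the table)
rows = [
    ("Hough and Inverse Hough", 12 * 60.0, 2.0, 370),          # seconds
    ("Two-Point Angular Correlation", 3397.0, 36.0, 96),        # hours
    ("Smith-Waterman", 6461.0, 100.0, 64),                      # seconds
    ("Monte Carlo Radiative Heat Transfer", 60.0, 6.12, 10),    # nanoseconds
    ("BJM Financial Analysis", 6300.0, 242.0, 26),              # seconds
]


def speedup(t_baseline, t_accelerated):
    """Acceleration factor: baseline time divided by accelerated time."""
    return t_baseline / t_accelerated


for name, t_cpu, t_fpga, quoted in rows:
    print(f"{name}: recomputed {speedup(t_cpu, t_fpga):.1f}X, quoted {quoted}X")
```

The Black-Scholes row is a throughput comparison rather than a time comparison, so its factor is the ratio of experiments per second (299M / 2.3M) instead.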
2.6.2 Discrete Transforms
In digital systems, discrete transforms are fundamental to many applications; these
transforms assist with the processing of digital signals, which, from a computational
perspective, are seen as data sets. Based on the Fourier transform, the first
discrete transform was invented by the German mathematician Carl Gauss in 1805 [86] and it
is still considered efficient for many applications. Nowadays, there are many discrete
transforms, each of which exhibits different efficiencies for different applications. Moreover,
with the advances of modern applications and their higher demands, more computationally
intensive transforms have been developed.
Table 2.6 outlines some of the common discrete transforms and their applications.
Mathematically fast computing schemes have been introduced to reduce the number of
computations needed. However, developing such fast schemes is not possible for all of these
transforms; for example, the computationally intensive Karhunen-Loève Transform (KLT)
has no fast computation scheme. The KLT has a substantial advantage when used in satellite
hyperspectral image compression. Therefore, a powerful acceleration of this transform on
embedded hardware is of high importance, and this has not been addressed before. The KLT
computations will be comprehensively discussed in Chapter 4 and Chapter 6 of this thesis.
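To illustrate what a fast computing scheme buys, the following sketch compares a direct evaluation of the discrete Fourier transform (a full matrix-vector product) with NumPy's FFT routine, which produces the same result with far fewer operations. The function name `naive_dft` and the test signal are assumptions made for illustration. The KLT cannot be factorised in this way because its basis is derived from the data themselves.

```python
import numpy as np


def naive_dft(x):
    """Direct DFT: an N x N matrix-vector product, about N*N complex multiplications."""
    x = np.asarray(x, dtype=complex)
    n = x.size
    k = np.arange(n)
    # Twiddle matrix W[k, m] = exp(-2j*pi*k*m/N)
    w = np.exp(-2j * np.pi * np.outer(k, k) / n)
    return w @ x


rng = np.random.default_rng(0)
signal = rng.standard_normal(64)

direct = naive_dft(signal)
fast = np.fft.fft(signal)          # fast scheme: roughly N*log2(N) operations

n = signal.size
print("results agree:", np.allclose(direct, fast))
print("rough operation counts:", n * n, "versus", int(n * np.log2(n)))
```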
[Table 2.6: common discrete transforms and their applications; the rotated table body is not
recoverable from the extracted text]
2.7 On Board Computing
Since this research work targets space applications, it is important to present an overview of 
some of the current on-board satellite computer systems.
2.7.1 OBC386
The OBC386 is a general purpose on-board data handling computer manufactured by Surrey
Satellite Technology Ltd (SSTL) for LEO applications. The block diagram of the OBC386 is
shown in Figure 2.12; the Intel 386EX is the main processor and the Intel 387SL is a
co-processor, which speeds up the floating point calculations. The 128MB RAM is protected
using a Reed-Solomon technique (in software) and the 4MB program memory is protected with
Triple Modular Redundancy (TMR). Moreover, the interface is provided via a Controller Area
Network (CAN) at a speed of 32Kbps. Table 2.7 shows the main specifications of the OBC386 [87].
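The principle behind Triple Modular Redundancy can be sketched as a bitwise two-out-of-three majority vote over three stored copies of a word. This is only the generic voting idea; how the OBC386 hardware implements its TMR protection is not described here, and the function name `tmr_vote` and the sample values are hypothetical.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


stored = 0b10110010
# Simulated single event upset: one bit flipped in the second copy only.
copies = [stored, stored ^ 0b00001000, stored]

recovered = tmr_vote(*copies)
print(recovered == stored)
```

A single corrupted copy is outvoted by the two intact ones; two copies corrupted in the same bit position would defeat the vote.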
[Block diagram; labelled blocks include: 386EX processor, 387SL co-processor, bus controller,
bus arbiter, SEB bus controller, two 16C35 SCC+DMA controllers, TMR logic, EPROM, clock
generator, DC/DC converters and two CAN nodes (CAN0, CAN1)]
Figure 2.12: The OBC386 Diagram [87] 
Table 2.7: OBC386 Specifications
| Parameter | Value |
|---|---|
| Processor | Intel 386EX |
| Co-processor | 387SL |
| Clock | 8/16/20/25 MHz |
| Power | 2.5 W when non-isolated and 5 W if isolated |
| Dimensions | 330 x 330 x 32 mm |
2.7.2 OBC750
The OBC750 is a high performance on-board data handling computer manufactured by
Surrey Satellite Technology Ltd (SSTL) for LEO applications. The block diagram of the OBC750
is shown in Figure 2.13; the PowerPC 750FL from IBM is the main processor. In addition to the
256MB of EDAC protected SDRAM, the OBC750 has 2 MB of non-volatile MRAM and 16 MB
of Flash memory [88], and it supports the following interfaces:
• Controller Area Network (CAN)
• MIL-STD-1553B
• 8 X LVDS Inputs
• 8 X LVDS Outputs
• 8 X RS485 / MLVDS Transceivers
• X Optical Inputs
• X Optical Driver Outputs
Moreover, different protocols are also supported by the OBC750, such as MIL-STD-1553
and SpaceWire.
[Block diagram; labelled blocks include: IBM PPC750FL processor, peripheral bridge FPGA,
MIL-STD-1553 bridge FPGA, memory bridge FPGA, boot ROM, MRAM and SDRAM]
Figure 2.13: The OBC750 Diagram [88]
2.7.3 X-SAT On-Board Payload Computer:
X-Sat is a high resolution imaging satellite designed, developed and manufactured by NTU
(Nanyang Technological University) and DSO National Laboratories, Singapore. The Parallel
Processing Unit (PPU) of the payload computer, which utilises commercial off-the-shelf (COTS)
components, is shown in Figure 2.14. This PPU consists of 20 SA1110 StrongARM processors
interconnected by 2 anti-fuse FPGA devices. In addition to the high resolution imaging,
image compression is performed within the payload as well. The PPU interfaces with other
subsystems via CAN and 2 Low Voltage Differential Signaling (LVDS) links (@200MB/s). The
operational power consumption is 15 W, and the maximum power can reach 22 W [89].
[Block diagram; labelled blocks include: PPU with Flash memories, LVDS 1, LVDS 2 and CAN 1
links, attitude system (actuators), communication (X-band), power system (solar panels,
batteries), payloads (camera, data handler), ADAM, GPS, OBC and RAM-disk]
Figure 2.14: The X-SAT Block Diagram [89]
2.7.4 System-on-a-Chip based on OBC386:
Satellite miniaturization has shown a remarkable saving in mission cost; hence, it has been
the main interest of various research works. One of these works was conducted by the Surrey
Space Centre (SSC) and the European Space Agency (ESA) [90]; the objective of this research
was to build an on-board computer on a single programmable chip. The OBC386 of SSTL was
used as a reference design for this project, while the Xilinx Virtex was the targeted chip. The
block diagram of the system on a chip is depicted in Figure 2.15, which is intended to be
functionally similar to the SSTL OBC386. Since the Intel 386EX processor was not available
as a soft core, the open source LEON was used instead; moreover, instead of the Intel 387SL, an
in-house developed Coordinate Rotation Digital Computer (CORDIC) IP core was used as a
co-processor. For the interface, an open source, ESA-developed CAN IP core named
HurriCANe was implemented; since this IP core has a standard AMBA APB bus interface,
interfacing it with the LEON was straightforward. Finally, a Quasi-Cyclic (16, 8) EDAC,
which can correct up to two bits in a single word, was used instead of the traditional
Hamming code (12, 8) [90].
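The CORDIC principle mentioned above, computing trigonometric functions with only shifts and additions, can be sketched in rotation mode as follows. This is a floating-point model of the general algorithm under stated assumptions; the function name `cordic_sin_cos`, the iteration count and the input range are illustrative choices, and the actual IP core of [90] is not described here.

```python
import math


def cordic_sin_cos(angle, iterations=16):
    """Rotation-mode CORDIC for |angle| <= pi/2; returns (sin, cos) approximations."""
    # Elementary rotation angles atan(2^-i), normally stored in a small lookup table.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Every micro-rotation stretches the vector; start from the inverse of the total gain.
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = gain, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        # In hardware the multiplications by 2^-i are plain right shifts.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x


s, c = cordic_sin_cos(math.pi / 6)
print(abs(s - 0.5) < 1e-3, abs(c - math.sqrt(3) / 2) < 1e-3)
```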
[Block diagram; labelled blocks include: LEON SPARC V8 processor, CORDIC coprocessor, AMBA
AHB/APB buses, HDLC RX and TX controllers with FIFOs, CAN interface, parallel port
interface, UART, bootstrap ROM, (16,8) EDAC, CF+ I/F (True IDE), SRAM, Microdrive and
linear regulator]
Figure 2.15: Block Diagram of SoC for Space Applications [90]
The outcomes of this research were as follows:
• Integrating some IP cores took more time than developing these cores.
• Due to the imprecise modelling of the LEON interface, the hardware testing results
did not exactly match the simulation.
• IP developers’ support is needed when using ready-made IP cores.
• The implementation of high-performance processor and peripheral IP cores requires
a significant amount of memory blocks on the programmable logic device.
• Availability of a prototyping FPGA board enables early software development.
2.8 Conclusion
This chapter presented a comprehensive literature review covering the relevant theoretical
background, challenges and trends for this research. The space radiation effects were
presented in section 2. Sections 3, 4 and 5 presented the FPGA, the reconfigurable computing
and the System-on-a-Chip technologies, respectively. These three sections also presented the
suitability of these technologies for space applications along with their current trends and
challenges. Section 6 presented the needs and the benefits of high-performance computing for
different applications; the acceleration of different discrete transforms was discussed as
well. The last section presented an overview of different on-board satellite computers. This
chapter highlighted the need for a powerful acceleration of the KLT for
hyperspectral image compression on embedded hardware, which has not been addressed
before. Therefore, the importance of hyperspectral imaging and its compression techniques
will be addressed in the next chapter, where the advantages and the challenges of the KLT
algorithm will be analysed and compared to other techniques.
Chapter 3
Compression Techniques for 
Hyperspectral Images
3.1 Introduction
The selected high-performance application in this work is hyperspectral image compression;
therefore, hyperspectral imaging and its compression techniques will be addressed in this
chapter. Section 3.2 will present an overview of satellite imaging and its operational
mechanisms. An overview of hyperspectral images and their applications will be presented in
section 3.3; this section will also outline the hyperspectral images that will be used as test
data in this work. The importance of hyperspectral image compression will be highlighted in
sections 3.4 and 3.5, along with the compression techniques used and the recommended
standards of the Consultative Committee for Space Data Systems (CCSDS). A discussion of
the spectral decorrelation techniques will be presented in section 3.6; this includes the
compression performances of these techniques, their complexities and the approaches to
reduce these complexities.
3.2 Overview of Satellite Imaging
Remote sensing dates back to the mid-1850s, when balloons were used as platforms [91].
Recent remote sensing applications use satellites (space-borne) or airplanes (airborne) as
platforms. These platforms usually incorporate the imaging sensors, image processing units
(processor or FPGA) and memory units. Since the processing and memory units have already
been discussed in Chapter 2, the image sensors and imaging mechanisms will be briefly
addressed in this section.
3.2.1 Passive and Active Imaging Sensors
Satellite imagers can be classified into two categories: active and passive [92].
• Active Imagers: microwave signals are transmitted from the satellite imager
toward the surface of the earth, or a planet, and the reflections of these signals are
received and synthesised as an image. Synthetic aperture radar (SAR) is an example of
this type of imager; such imagers can penetrate obstacles, such as clouds, and can
operate regardless of whether sunlight is present; therefore, they are characterised by
high reliability and efficiency. However, because of the radio wave transmission and
reception, the power consumption of these imagers is very high.
• Passive Imagers: unlike the active type, passive imagers do not transmit any signal;
rather, they use solar electromagnetic waves reflected off the earth's surface. Therefore,
this type can be strongly affected by atmospheric conditions and requires sunlight.
Since this type does not transmit microwave signals, its power consumption is
significantly lower than that of the active type.
In this work, only the data of the passive imagers are considered. The concept of the passive
imagers is based on the human vision sense, where an object can be identified by the
perception of the light waves reflected off that object. Electro-optical components are usually
used in passive imagers to absorb and process the electromagnetic waves; these waves can be
light waves, as for panchromatic imaging, or thermal waves, as for Long-Wave InfraRed
(LWIR) imaging. In the last decades, there have been significant advancements in this type
of imager, which have enabled such imagers to process other electromagnetic waves, such as
Near InfraRed (NIR) and Short-Wave InfraRed (SWIR). Therefore, the produced images
can incorporate more information about the reflecting object; these images comprise
several spectral bands and are referred to as multispectral or hyperspectral images [93].
3.2.2 Scanning Mechanisms
The scanning mechanisms of remote sensing can be classified as follows [94]:
• Whisk-broom scanning: an opto-mechanical mechanism performs a side to side
scanning in the cross-track direction of the orbit as shown in Figure 3.1 (a).
• Push-broom scanning: an opto-electronic mechanism of a linear array of solid
semiconductive elements (detectors), in which the scanning is performed in parallel for all
the detectors in the cross-track direction as shown in Figure 3.1 (b).
The push-broom technique provides a wider spectrum of sensed signal with better spatial and
radiometric resolution; besides, its scanners are smaller and less complex than the
whisk-broom scanner. However, since push-broom scanners use more detectors, more
calibration is required [95].
[Figure panels (a) and (b); labels include: satellite sensor, field of view (FOV), scan line,
scan direction and flight direction]
Figure 3.1 (a) The Whisk-broom and (b) the Push-broom Scanning Mechanism [96]
3.3 Hyperspectral Imaging
A hyperspectral image is a collection of measurements in a large number (100s) of
contiguous spectral bands, which provide the spectral information needed to distinguish and
identify spectrally unique materials [97]. As shown in Figure 3.2, a hyperspectral image is
represented in 3 dimensions, two spatial (M x L) and one spectral (N); therefore, from a
computational viewpoint, a hyperspectral image is a matrix of three dimensions, M x L x N.
Figure 3.2: Hyperspectral Image
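From a computational viewpoint, the three-dimensional structure can be sketched with a NumPy array. The axis ordering (rows, columns, bands), the small sizes and the variable names below are assumptions made for illustration, not properties of any particular imager's data format.

```python
import numpy as np

M, L, N = 4, 5, 6          # illustrative sizes only: rows, columns, spectral bands
cube = np.arange(M * L * N, dtype=np.uint16).reshape(M, L, N)

band_image = cube[:, :, 2]         # one spectral band: an M x L spatial image
pixel_spectrum = cube[1, 3, :]     # one pixel: a vector of N spectral samples

# Spectral processing works on the N-vectors, so it is often convenient to
# view the cube as an (M*L) x N matrix with one pixel spectrum per row.
pixels = cube.reshape(M * L, N)

print(band_image.shape, pixel_spectrum.shape, pixels.shape)
```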
The difference between multispectral and hyperspectral imaging can be vague; while
some authors define images as hyperspectral when the number of their spectral bands is in the
range of hundreds [98] [99], others define images with more than ten bands as hyperspectral
and those with fewer as multispectral [100] [101]. In general, considering the reflectance of
each band, multispectral images can be seen as discrete signals and hyperspectral images as
continuous-like signals, as depicted in Figure 3.3.
[Figure panels: (a) horizontal axis: Band Number (1 to 5); (b) horizontal axis: Wavelength, nm
(400 to 2400)]
Figure 3.3: (a) Multispectral Image, and (b) Hyperspectral image [101]
3.3.1 Hyperspectral Imaging Applications
The detailed and accurate information that can be extracted from hyperspectral images makes
them of significant importance for a wide range of applications.
• In geology, hyperspectral imaging has been used for mineral mapping and to identify
soil properties such as moisture, organic content, and salinity [102] [103].
• In agriculture, hyperspectral imaging has been used to detect vegetation species,
investigate plant canopy chemistry and identify vegetation stress [102] [104] [105].
• In military and defence, the usage of hyperspectral imaging includes border
protection, reconnaissance, spectral tagging and targeting [92].
• In medicine, hyperspectral imaging has been used for microscopic analysis and for
non-invasive tissue diagnosis for different diseases [106] [107].
• In oil and gas exploration, hyperspectral imaging has been a valuable tool to evaluate
the potential of natural resources by detecting the spectral signatures of certain
materials [108].
Hyperspectral imaging can be performed either from a spacecraft (spaceborne) or from an
aircraft (airborne). The main advantage of airborne missions over spaceborne ones is the
flexibility in terms of time schedules, calibration measurements, flight arrangements, spectral
and spatial resolutions, and suitable weather conditions. However, since the spatial coverage
of airborne missions is limited, more flights might be needed to cover the study area; hence,
this type of mission is more expensive than the spaceborne one. In order to verify the
performance of a hyperspectral imager for a space mission, the imager is usually tested on an
airborne flight; it can then be launched on a space mission, which can provide
continuous and larger spatial coverage [109].
3.3.2 Overview of the Current Space-borne Hyperspectral Imagers
In the last two decades, there have been many space-borne and airborne hyperspectral missions
targeting different applications. Different imagers and on-board data processing payloads
were employed on these missions; in [110], a comprehensive survey of the hyperspectral
airborne and space-borne missions is presented. From a hardware perspective, which is one
of the main concerns of this work, two parameters are major factors in the design
considerations: the number of spectral bands and the data bit-length (the radiometric
resolution). Table 3.1 outlines different space-borne hyperspectral imagers with their
applications, on-board processing units and data specifications [110].
Table 3.1: List of Hyperspectral / Multispectral Space-borne Missions [110]
| Mission | Application | On-board Data Processing | Spectral Bands | Data bit-length |
|---|---|---|---|---|
| Midcourse Space Experiment (MSX) | Military | ADSP 2100 | 5 | 14 |
| Terra (MODIS) | Earth Observation | MIL-STD 1750A | 36 | 12 |
| Aqua (MODIS) | Earth Observation | MIL-STD 1750A | 36 | 12 |
| MightySat II | Military | 4 X TMS320C40 (TI DSP C40) | 256 | 8/12 |
| EO-1 (Hyperion) | Earth Observation | RISC Mongoose 5 | 220 | 12 |
| PROBA-1 | Earth Observation | ADSP 21020 | 18, 37, 62 | 12 |
| EnviSat | Atmosphere Monitoring | N/A | 15 | 12 |
| Mars Express | Mars Exploration | N/A | 352 | N/A |
| Aura | Climate and Ozone Monitoring | N/A | 740 | N/A |
| IMS-1 / TWSat | Resolution Imagery | N/A | 64 | 10 |
| TacSat-3 | Military | Xilinx FPGA | N/A | 10 |
| HERO | Vegetation Environment | Xilinx FPGA | 240 | 12 |
| ZASat-H | Regional Science | Dual Redundant computer | 200 | 10 |
| PRISMA | Atmospheric Monitoring and natural resources | SDAB (Scientific Data Acquisition Board) | 250 | 12 |
| TAIKI | Agriculture Monitoring | 32-bit RISC, MPU (SH4) | 61 | 10 |
| EnMAP | Agriculture and Forestry Monitoring | N/A | 218 | 14 |
| HyspIRI | Land surface composition | FPGA, FPPA | 210 | 14 |
3.3.3 The Test Hyperspectral Data
Not all hyperspectral data are available for public and research use; only a few missions have
made some of their hyperspectral data available to researchers.
Hyperspectral images from two imagers will be considered in this work:
• AVIRIS: The Airborne Visible / Infrared Imaging Spectrometer (AVIRIS)
hyperspectral images [111] have been widely used in different research works on
hyperspectral imaging; therefore, data sets of the AVIRIS Cuprite image, Figure 3.4, will
be considered in this work. The AVIRIS Cuprite image comprises 224 spectral bands and
its raw data consists of 14-bit unsigned integers.
• EO-1 Hyperion: Since this work targets space applications, different images from the
space-borne EO-1 Hyperion will also be considered. Therefore, data sets of the
Boston, the Greenland, the Portobago and the Edenton images with spectral
dimensions up to 196 bands will be used as test hyperspectral data in this work. Like
the AVIRIS data, the Hyperion data consists of 14-bit unsigned integers.
Figure 3.4: The AVIRIS Cuprite Hyperspectral Image [112]
Figure 3.5: The EO-1 Hyperion Hyperspectral Image of Boston [113]
3.4 Hyperspectral Image Compression
Hyperspectral images present serious design challenges with their memory and bandwidth
requirements; these challenges are more significant in space-borne missions (limited power,
hardware and memory resources). In order to tackle these challenges, data (image)
compression is usually applied on-board the satellite. The image compression process is
performed by eliminating the redundant components, hence compacting the information
delivered in that image. Hyperspectral images have different types of redundancies:
• Statistical redundancy: this can be reduced by using coding techniques such as the
Huffman algorithm, which gives high-probability symbols shorter code-words than
low-probability ones, hence reducing the overall memory bits needed for the data set.
• Human vision redundancy: this can be reduced by quantization to filter out the high
frequency components, to which the human eye is not sensitive.
• Spatial redundancy: this is reduced through prediction techniques such as
Differential Pulse Code Modulation (DPCM) or through transformations such as the
Discrete Wavelet Transform (DWT) and the Discrete Cosine Transform (DCT) [92]. By
performing the spatial decorrelation, a single band (2-D) compression is achieved.
• Spectral redundancy: this is reduced through inter-band prediction or
transformations like the DWT, the DCT and the Karhunen-Loève Transform (KLT). By
performing spectral decorrelation, the similarities between different bands can be
eliminated. This is only applicable to multispectral and hyperspectral images.
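The statistical redundancy item above can be illustrated with a small Huffman construction that assigns shorter code-words to more frequent symbols. The sketch below only computes code lengths, which is enough to compare the coded size against a fixed-length representation; the function name `huffman_code_lengths` and the sample distribution are assumptions made for illustration.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data):
    """Return {symbol: code length in bits} for a Huffman code built from symbol counts."""
    counts = Counter(data)
    if len(counts) == 1:                      # degenerate case: a single symbol gets 1 bit
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far})
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)        # the two least probable subtrees
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}   # one level deeper
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


samples = [0] * 50 + [1] * 25 + [2] * 15 + [3] * 10
lengths = huffman_code_lengths(samples)
coded_bits = sum(lengths[s] for s in samples)
fixed_bits = 2 * len(samples)                 # four symbols need 2 bits each without coding
print(lengths, coded_bits, fixed_bits)
```

For this distribution the most frequent symbol receives a one-bit code-word, so the coded size falls below the fixed-length size.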
3.4.1 Classification of Hyperspectral Image Compression Techniques
Various techniques have been developed to eliminate the above redundancies; these
techniques can be classified into three categories:
• Prediction based: These techniques decorrelate the raw data through prediction, and
then the prediction error is coded by an entropy coder. Various works have presented
prediction based techniques, such as [114], [115], [116], [117] and [118]. This
technique has also been considered in the Consultative Committee for Space Data
Systems (CCSDS) Recommended Standards [119], as explained in section 3.5.
Moreover, an adaptive prediction based technique named Fast Lossless was proposed by
the Jet Propulsion Laboratory (JPL), Caltech, which can offer effective compression
performance with low complexity. Since it lends itself to a high level of parallel
implementation, the processing performance of Fast Lossless can be suitable for
real-time implementation. Implementations of this technique on FPGA and GPU
platforms were proposed in [127] and [128], respectively.
• Vector Quantization (VQ): In these techniques, the data are partitioned into blocks;
these blocks are processed through two stages: first, the training stage, where similar
vectors are grouped and each group is assigned a single representative vector (code
vector); the second stage is the coding stage, where each vector is compressed by
substituting it with the nearest code vector [120]. Both the complexity of the algorithm
and the output compression rate grow with the size of the partitioned blocks;
therefore, the performance of the VQ techniques is limited [121]. Various works have
presented VQ based techniques, such as [120], [122], [123] and [124].
• Transform based: These techniques consist of two stages: the Transform stage and the
Coding stage. The first stage transforms the data in order to compact their energy by
making them less correlated. Since the spectral variations are much slower than the
spatial variations, more energy can be compacted by applying the transform to the
spectral dimension. Therefore, this stage is referred to as the spectral decorrelation,
where different techniques can be utilised such as the KLT and the Wavelet
Transforms. The second stage is responsible for the spatial decorrelation, where
techniques like JPEG 2000 [125] and JPEG-LS [126] can be utilised.
In this research a transform based technique is considered and, more specifically, the first
stage, the spectral decorrelation, is the main scope of the work. Therefore, this will be
discussed in section 3.6.
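For comparison with the transform based approach, the coding stage of the VQ category described above can be sketched as a nearest-code-vector search. The codebook below is a hypothetical stand-in for the output of a training stage, and the function names `vq_encode` and `vq_decode` are illustrative only.

```python
import numpy as np


def vq_encode(vectors, codebook):
    """Coding stage: index of the nearest code vector (squared Euclidean distance)."""
    # distances[i, j] = ||vectors[i] - codebook[j]||^2
    diff = vectors[:, None, :] - codebook[None, :, :]
    distances = np.sum(diff * diff, axis=2)
    return np.argmin(distances, axis=1)


def vq_decode(indices, codebook):
    """Reconstruction: replace each index by its code vector (lossy)."""
    return codebook[indices]


codebook = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])   # hypothetical trained codebook
vectors = np.array([[1.0, -1.0], [9.0, 11.0], [2.0, 8.0]])

indices = vq_encode(vectors, codebook)
reconstructed = vq_decode(indices, codebook)
print(indices, reconstructed)
```

Only the indices need to be stored or transmitted, which is where the compression comes from; the search cost grows with both the codebook size and the vector dimension.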
3.4.2 Lossy, Lossless and Near-Lossless Compression
In terms of performance, hyperspectral image compression can be classified into 3 different
categories:
• Lossy: In this category, the compression process introduces a certain level of
distortion to the compressed data; therefore, extracting the exact original data is not
possible. The level of this distortion is inversely proportional to the bitrate.
• Lossless: In this category, the compression process maintains the exact original data;
therefore, these data can be extracted with no loss of information. However, this
type can only offer a limited compression rate, approximately 3:1.
• Near-Lossless: In this category, as in the lossy one, some distortion is introduced into
the data; however, the level of the distortion is smaller than the sensor noise [133].
3.4.3 Evaluation Factors for Compression Performance
In order to evaluate the performance of compression systems, different factors are used [139].
The compression ratio (Eq 3.1) and the compression factor (Eq 3.2) are commonly used to
measure the output compression for all types of data compression systems.
$$\text{Compression Ratio} = \frac{\text{Size of the compressed data}}{\text{Size of the original data}} \tag{3.1}$$
$$\text{Compression Factor} = \frac{\text{Size of the original data}}{\text{Size of the compressed data}} \tag{3.2}$$
For image compression systems, Bit Per Pixel, bpp (Eq 3.3), is commonly used for
two-dimensional images, while Bit Per Pixel Per Band, bpppb (Eq 3.4), is used for
hyperspectral images.
$$bpp = \frac{p}{\text{Compression Factor}} \tag{3.3}$$
$$bpppb = \frac{p}{\text{Compression Factor}} \tag{3.4}$$
Where $p$ is the bit-length of the input data.
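For lossy compression, the performance evaluation includes the level of data distortion,
which is usually measured by the Mean Square Error (MSE) (Eq 3.5), the Signal to Noise
Ratio (SNR) (Eq 3.6) and the Peak Signal to Noise Ratio (PSNR) (Eq 3.7), where $x(i,j,k)$
and $\hat{x}(i,j,k)$ denote the original and the reconstructed samples of the M x L x N image.
$$MSE = \frac{1}{M L N} \sum_{i} \sum_{j} \sum_{k} \left( x(i,j,k) - \hat{x}(i,j,k) \right)^{2} \tag{3.5}$$
$$SNR = 10 \times \log_{10}\left( \frac{1}{M L N} \sum_{i} \sum_{j} \sum_{k} x(i,j,k)^{2} \right) - 10 \times \log_{10}(MSE) \tag{3.6}$$
$$PSNR = 10 \times \log_{10}\left( \frac{(2^{p}-1)^{2}}{MSE} \right) \tag{3.7}$$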
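The evaluation factors of this subsection can be sketched as a few small functions. The sketch assumes the conventional definitions: the compression factor is the original size divided by the compressed size, the MSE is averaged over all samples of the cube, the SNR compares the mean signal power with the MSE, and the PSNR uses the peak value 2^p - 1 for p-bit data. All function and variable names are hypothetical.

```python
import numpy as np


def compression_ratio(compressed_size, original_size):
    return compressed_size / original_size


def compression_factor(compressed_size, original_size):
    return original_size / compressed_size


def bits_per_pixel_per_band(p, factor):
    """bpppb for p-bit input data compressed by the given compression factor."""
    return p / factor


def mse(original, reconstructed):
    x = np.asarray(original, dtype=np.float64)
    y = np.asarray(reconstructed, dtype=np.float64)
    return np.mean((x - y) ** 2)


def snr_db(original, reconstructed):
    x = np.asarray(original, dtype=np.float64)
    return 10 * np.log10(np.mean(x ** 2)) - 10 * np.log10(mse(original, reconstructed))


def psnr_db(original, reconstructed, p):
    return 10 * np.log10((2 ** p - 1) ** 2 / mse(original, reconstructed))


original = np.array([[[100, 200], [300, 400]]], dtype=np.uint16)   # tiny 1 x 2 x 2 stand-in cube
reconstructed = original + 1                                        # every sample off by one

print(mse(original, reconstructed))
print(snr_db(original, reconstructed), psnr_db(original, reconstructed, 14))
print(bits_per_pixel_per_band(14, compression_factor(250, 1000)))
```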
3.5 Consultative Committee for Space Data Systems Standards
Established by the world's major space agencies in 1982, the Consultative Committee for
Space Data Systems (CCSDS) provides a forum for discussion of common issues in the
development and operation of space data systems. In May 2012, the CCSDS released the
Lossless Multispectral and Hyperspectral Image Compression Blue Book, CCSDS 123.0-B-1
[119]. This book outlines the recommended standards for lossless compression applied to
space multispectral and hyperspectral image data and the specifications of the compressed
data format.
The Universal Source Encoder for Space (USES) chip [129] has been commonly used for
on-board satellite hyperspectral image compression [128]. The USES utilises a lossless
compression technique based on the Rice algorithm and proposed by the CCSDS [130]. The
compression performance offered by the USES is limited compared to other state-of-the-art
techniques. However, since the USES chip is radiation hardened, it has been an advantageous
choice for space applications.
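The idea behind Rice coding can be sketched as follows: a non-negative integer is split by a parameter k into a quotient, sent in unary, and a k-bit remainder, sent in binary, so that small values receive short code-words. This is a generic illustration, not the adaptive scheme of the USES chip or of the CCSDS recommendation; the residual mapping, the unary convention (ones terminated by a zero) and the function names are assumptions.

```python
def map_residual(e: int) -> int:
    """Map a signed residual to a non-negative integer: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * e if e >= 0 else -2 * e - 1


def rice_encode(n: int, k: int) -> str:
    """Rice code with parameter k: unary quotient, a '0' terminator, then k remainder bits."""
    q, r = n >> k, n & ((1 << k) - 1)
    remainder = format(r, f"0{k}b") if k > 0 else ""
    return "1" * q + "0" + remainder


def rice_decode(bits: str, k: int) -> int:
    """Inverse of rice_encode for a single code-word."""
    q = bits.index("0")                        # number of leading ones
    r = int(bits[q + 1:q + 1 + k], 2) if k > 0 else 0
    return (q << k) | r


residuals = [0, -1, 3, -2, 1]
codewords = [rice_encode(map_residual(e), 1) for e in residuals]
print(codewords)
```

Choosing k close to the base-2 logarithm of the typical residual magnitude keeps the unary part short, which is why practical coders adapt k to the data.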
In accordance with the recent CCSDS Blue Book [119], the compression is carried out through
two stages: prediction and encoding, as shown in Figure 3.6.
[Figure: Raw Image -> Predictor -> mapped prediction residuals -> Encoder -> Compressed Image]
Figure 3.6: CCSDS Hyperspectral Image Compression Module 
3.5.1 CCSDS Predictor
Using an adaptive linear prediction algorithm, the predictor computes the predicted sample
value $\hat{s}_{z,y,x}$ and the mapped prediction residual $\delta_{z,y,x}$ of each image sample $s_{z,y,x}$. This
prediction is based on the values of the nearby pixels of the current and preceding bands.
Different parameters are involved in this prediction:
• Local Sum $\sigma_{z,y,x}$: The weighted sum of the pixels adjacent to the $s_{z,y,x}$ pixel; for a
neighbour-oriented prediction, the four adjacent pixels (West, North-West, North and
North-East) are considered; for a column-oriented prediction, only the North pixel,
multiplied by 4, is considered, as shown in Figure 3.7.
[Figure panels: neighbour-oriented (weight 1 on each of the four adjacent pixels) and
column-oriented (weight 4 on the North pixel)]
Figure 3.7: Local Sum Calculations [119]
• Local Differences (Central and Directional):
o The central local difference $d_{z,y,x}$ equals the difference between the local sum
$\sigma_{z,y,x}$ and four times the pixel value $s_{z,y,x}$.
o Three directional local differences are defined as follows:
$$d^{N}_{z,y,x} = \sigma_{z,y,x} - 4 \times s_{z,y-1,x}$$
$$d^{W}_{z,y,x} = \sigma_{z,y,x} - 4 \times s_{z,y,x-1}$$
$$d^{NW}_{z,y,x} = \sigma_{z,y,x} - 4 \times s_{z,y-1,x-1}$$
• P: The number of preceding spectral bands used for prediction.
• Weights: signed numbers with a predefined range of values; these weights are used
as coefficients for the local differences.
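The local sum and the local differences listed above can be sketched for an interior sample of a single band as follows. The function names are hypothetical, image-edge handling is omitted, and the differences follow the form written in this section (local sum minus four times the sample); the sign convention should be checked against [119], which may state the differences as four times the sample minus the local sum.

```python
import numpy as np


def local_sum(band, y, x, mode="neighbour"):
    """Local sum for an interior sample (y > 0 and 0 < x < width - 1)."""
    if mode == "neighbour":
        # West, North-West, North and North-East neighbours, each with weight 1.
        return (int(band[y, x - 1]) + int(band[y - 1, x - 1])
                + int(band[y - 1, x]) + int(band[y - 1, x + 1]))
    return 4 * int(band[y - 1, x])             # column-oriented: North neighbour times 4


def local_differences(band, y, x, mode="neighbour"):
    """Central and directional differences in the form used in this section."""
    sigma = local_sum(band, y, x, mode)
    return {
        "central": sigma - 4 * int(band[y, x]),
        "N": sigma - 4 * int(band[y - 1, x]),
        "W": sigma - 4 * int(band[y, x - 1]),
        "NW": sigma - 4 * int(band[y - 1, x - 1]),
    }


band = np.array([[10, 12, 14],
                 [11, 13, 15]], dtype=np.uint16)

print(local_sum(band, 1, 1), local_sum(band, 1, 1, mode="column"))
print(local_differences(band, 1, 1))
```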
3.5.2 CCSDS Encoder
The encoder takes the mapped prediction residuals generated by the predictor and 
encapsulates them into a compressed image structure; this structure consists of a header and 
an image body as shown in Figure 3.8.
[Figure: Header | Image Body]
Figure 3.8: The Structure of the Compressed Image
The Image Body is variable length data incorporating a sequence of losslessly encoded
mapped prediction residuals of the compressed image. The image header is variable length
data incorporating the compression parameters:
• The Image Metadata includes different image information, such as the image
dimensions, the image data type (signed or unsigned integers) and the dynamic range;
in addition, some encoding specifications are also included: encoding order
(band-interleaved or band-sequential), sub-frame interleaving depth, output word size, and
entropy coder type (sample-adaptive or block-adaptive).
• The Predictor Metadata includes the specifications of the parameters used in
the prediction, such as the number of prediction bands P, the prediction mode (full or
reduced prediction), the local sum type (neighbour- or column-oriented) and the
weight specifications.
• The Entropy Coder Metadata: the contents of this depend on the encoder scheme
(sample-adaptive or block-adaptive), and these contents specify the encoding parameters used.
3.6 Spectral Decorrelation Techniques
Spectral redundancy is the main concern of the spectral decorrelation process.
Figure 3.6 and Figure 3.7 illustrate the spectral range of 4 different pixels of the AVIRIS
Cuprite image before and after the spectral decorrelation process (KLT).
From Figure 3.6, it can be noticed that the pixels follow the same pattern; this behaviour
represents the spectral (inter-band) correlations. However, when the spectral
decorrelation is applied, the energy is concentrated into a limited number of bands and the
inter-band redundancies are eliminated, as shown in Figure 3.7. In this section, the compression
performances of different spectral decorrelation techniques are addressed; the
complexities of these techniques and the approaches to reduce these complexities are
investigated as well.
[Plot; x-axis: Spectral Band]
Figure 3.6: The Spectral Range of 4 Different Pixels before the Spectral Decorrelation
[Plot; x-axis: Spectral Band]
Figure 3.7: The Spectral Range of 4 Different Pixels after the Spectral Decorrelation
3.6.1 Techniques’ Performance Comparison
Different techniques can be employed for spectral decorrelation, such as Band Differential,
Inter-band Prediction, Discrete Wavelet Transform (DWT), Discrete Cosine Transform
(DCT), and Karhunen-Loève Transform (KLT). Since the KLT offers the highest energy
compaction [134], it offers the best compression performance among these techniques.
Various works have investigated the performance of the KLT compared with other
techniques for both lossy and lossless compression.
The lossy compression performances of different spectral decorrelation techniques have been
addressed in [135], [136] and [137], where the superior compression performance of the KLT
is shown for different AVIRIS images in [135] and [136], and for Hyperion images in [137].
Figure 3.8 depicts some of these compression results, where JPEG 2000, DWT-JPEG2000
and KLT-JPEG2000 are considered; the other test data showed similar behaviour. In [138],
the significant advantage in spectral decorrelation of the KLT over the DCT is shown, where
a sample test image (Airfield) of 16 spectral bands was considered.
[Plots; panels: Cuprite and Jasper Ridge; x-axis: bpppb; curves: JPEG2K, DWT-JPEG2K, KLT-JPEG2K]
Figure 3.8: Comparison of the KLT and the DWT Lossy Compression Performance
The performances of spectral decorrelation techniques for lossless compression have also
been investigated in different works [140], [141] and [142]. In terms of compression
performance, a reversible version of the KLT, named the Integer KLT, was compared with the
multi-component compression algorithm (Part II of the JPEG2000) and the Integer Wavelet
Transform (IWT) in [140] and [141], respectively. In [142], the Band Differential and Linear
Prediction, which exhibit very similar compression performance, were compared with the
Integer KLT. Different AVIRIS images were considered in these works; Table 3.2
summarises the improvements of the Integer KLT compression performance over other
techniques as extracted from [140], [141] and [142].
The lossless compression performance of the KLT has also been investigated on multispectral
images [131]. However, due to the small number of spectral bands, only a very limited
improvement could be noticed.
Table 3.2: The Improvements of the Integer KLT Lossless Compression Performance
Image          Technique   Improvement
Cuprite        MCP         8.94%
               BD & LP     7.7%
               IWT         8.86%
Jasper         MCP         15.04%
               BD & LP     11.84%
               IWT         14.46%
Low Latitude   MCP         15.08%
               BD & LP     13.15%
               IWT         14.64%
3.6.2 Techniques Complexity Comparison
The computational complexity of a technique affects the processing time and,
consequently, the latency. Therefore, a rational assessment of a technique's performance
should include its computational complexity. While the KLT outperforms other techniques in
terms of compression performance, its heavy computation is a major drawback. Table 3.3
outlines the silicon area required if the KLT were to be implemented as an
ASIC [143]; this will be thoroughly investigated from a computational perspective in Chapter
4. It can be noticed that the required silicon area grows roughly quadratically with the number
of spectral bands: doubling the number of bands from 64 to 128 approximately quadruples the area.
Table 3.3: The Required Silicon Surface (65 nm CMOS) for the KLT Transform
Number of Bands          16    64    128   256
Silicon Surface (mm^2)   6     99    400   1500
The complexity of the KLT computation was compared with that of the DWT in [136], where the
JPEG2000 computation was considered as the measurement unit. The computation of the Integer
KLT is investigated in Chapter 6 of this thesis and has also been included in this comparison,
as outlined in Table 3.4.
Table 3.4: Complexity Comparison between the DWT, KLT and the Integer KLT
Technique    JPEG2000   DWT-JPEG2000   KLT-JPEG2000   RKLT-JPEG2000
Complexity   1          1.78           4.12           5.3
3.6.3 Clustering and Tiling Techniques
The complexity of the KLT computations has been the main concern of different works; to
overcome this overhead, the reuse of the transform coefficients was suggested; however, this
can only work for multispectral images with few spectral bands [144]. Low complexity KLT
schemes were proposed in [145] and [146]; however, these can compromise the compression
performance and can only offer limited improvement to the processing time. Clustering and
tiling are common practices in image processing applications, as they reduce the on-chip
memory requirements; moreover, they can reduce the computational requirements of the KLT
process.
Clustering is performed by splitting the hyperspectral data into k subsets of data from a
spectral perspective, so an image of N spectral bands will be split into $k = N/n$ subsets of n
spectral bands. Clustering can significantly reduce the processing time for both the KLT and
the Integer KLT; this is mainly because of the reduction of the input data for the eigenvectors
computation and the matrix factorization, as will be shown in Chapter 4 and Chapter 6 of this thesis.
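As a rough illustration of why clustering reduces the input of the eigenvectors computation, the following Python sketch splits N bands into k = N/n clusters and compares the number of covariance entries. The function name `cluster_bands` and the example sizes (N = 256, n = 32) are assumptions chosen for this illustration only.

```python
def cluster_bands(num_bands, cluster_size):
    """Return k = N / n lists of band indices, each holding n bands."""
    if num_bands % cluster_size != 0:
        raise ValueError("N must be a multiple of n in this sketch")
    return [list(range(start, start + cluster_size))
            for start in range(0, num_bands, cluster_size)]


clusters = cluster_bands(256, 32)                 # k = 8 clusters of 32 bands
full_entries = 256 * 256                          # one N x N covariance matrix
clustered_entries = len(clusters) * 32 * 32       # k matrices of n x n
```

With these example sizes, the clustered covariance matrices hold 8 times fewer entries in total than the single full matrix.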
In addition to its benefit to the on-chip memory requirements, tiling has been shown to improve
the resilience of the algorithm to Single Event Upsets (SEUs) [148] [149]. However, the
compression performance can be affected if the tiles are too small. A comprehensive analysis
of the benefits introduced by clustering and tiling is presented in [150], where the
simulations showed optimal clustering sizes for different tiling sizes. These simulations
considered different AVIRIS data sets and showed that the optimal cluster size varies between 16
and 32 spectral bands.
3.7 Conclusion
In this chapter, hyperspectral imaging, its operational mechanism and its importance
for different applications were presented. An overview of hyperspectral image compression, as
one of the main concerns of this field, was presented and classifications of the techniques
used were outlined. This chapter also presented the recommended standards of the
Consultative Committee for Space Data Systems (CCSDS). A discussion of the spectral
decorrelation techniques was presented; this included the compression performances of these
techniques, their complexities and the approaches to reduce these complexities. This discussion
highlighted the significance of the KLT process for hyperspectral data compression and the
complexity of its computational process.
Chapter 4 
Investigation of the Karhunen-Loève
Transform Computational Process
4.1 Introduction
The Karhunen-Loève Transform (KLT) is an orthogonal linear transform developed by Kari
Karhunen [151] and Michel Loève [152]. This transform has been used in a wide range of
applications, including different image processing applications (feature extraction,
classification, segmentation and compression) [153] [154], biomedical applications [155] and
network security applications [156]. The transform is defined by projecting the zero-mean
(normalised) raw data onto the eigenvectors of its covariance matrix. This converts discrete
signals into uncorrelated coefficients; therefore, when applied to 3-D data sets, it can
decorrelate the data across the different bands. If the 3-D data set is a hyperspectral image,
the KLT removes the correlations between the image spectral bands, hence constructing a more
compressible data set (image) [157]. However, the heavy computation required for the KLT is a
major challenge when implementing this transform. Besides, unlike other transforms such as the
Discrete Cosine Transform (DCT) and the Discrete Fourier Transform (DFT), the KLT does not
have a fast computation scheme [157].
In this Chapter, a comprehensive analysis of the KLT computation process is presented.
An overview of the KLT computation is given in section 4.2. The
computational requirement of each individual computation process of the KLT is
determined in section 4.3. In the same section, different techniques for the computation of
the eigenvectors are investigated, analysed and compared. In section 4.4, an error analysis of
the fixed-point implementation of the KLT algorithm is presented.
4.2 Computations Overview
Assuming a hyperspectral image H of N spectral bands and M x L spatial dimensions, i.e. a
3-dimensional M x L x N matrix, the computation process of the KLT is illustrated in Figure 4.1.
In order to demonstrate the effects of the KLT on the spectral characteristics, the spectral data
of 5 different pixels of the Hyperion Boston image have also been included in Figure 4.1. The
process can be outlined as follows:
• Finding the mean of each spectral band, forming the 1 x N vector BandMean.
• Subtracting the corresponding BandMean element from the elements (pixels) of each
band; this results in a 3-dimensional matrix MeanSub, which is the zero-mean
(normalised) version of the input image H.
• Computing the covariance matrix of MeanSub; this matrix represents the strength
of the correlations across the spectral dimension and is therefore
an N x N real symmetric matrix.
• Computing the eigenvectors and eigenvalues of the covariance matrix.
• Multiplying the eigenvectors matrix by MeanSub; this compacts most of the
energy into the first spectral bands, as shown in Figure 4.1; therefore, a higher
compression rate can be achieved.
Before the transform is applied, the spectral pixels follow a similar pattern; this similarity
represents the spectral redundancies. As a result of the KLT, it can be noticed that
this pattern has been eliminated and most of the pixel energy has been compacted into the
first spectral bands.
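The five steps above can be traced on a deliberately tiny case. The sketch below assumes only N = 2 spectral bands, so that the eigenvectors of the 2 x 2 covariance matrix can be obtained in closed form as a plane rotation; this shortcut and the function name `klt_two_bands` are assumptions for illustration and do not represent the general N-band computation analysed in this chapter.

```python
import math


def klt_two_bands(band_a, band_b):
    """KLT of two flattened spectral bands (lists of M*L pixel values)."""
    count = len(band_a)
    # Step 1: BandMean
    mean_a = sum(band_a) / count
    mean_b = sum(band_b) / count
    # Step 2: MeanSub
    a = [v - mean_a for v in band_a]
    b = [v - mean_b for v in band_b]
    # Step 3: covariance matrix entries (2 x 2, symmetric)
    caa = sum(v * v for v in a) / (count - 1)
    cbb = sum(v * v for v in b) / (count - 1)
    cab = sum(p * q for p, q in zip(a, b)) / (count - 1)
    # Step 4: eigenvectors of a symmetric 2 x 2 matrix form a rotation by theta
    theta = 0.5 * math.atan2(2.0 * cab, caa - cbb)
    c, s = math.cos(theta), math.sin(theta)
    # Step 5: mapping; the rows (c, s) and (-s, c) are the two eigenvectors
    out_1 = [c * p + s * q for p, q in zip(a, b)]
    out_2 = [-s * p + c * q for p, q in zip(a, b)]
    return out_1, out_2


first, second = klt_two_bands([1, 2, 3, 4], [2, 4, 6, 9])
energy_first = sum(v * v for v in first)
energy_second = sum(v * v for v in second)
```

In this example the two output bands are uncorrelated and nearly all of the energy ends up in the first one, which mirrors the energy compaction described above.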
[Figure: flow of the KLT computation (BandMean, MeanSub, Covariance (MeanSub), Eigenvector matrix A, Eigenvectors x MeanSub), with plots of the pixel spectra before and after the transform]
Figure 4.1: The Computation Process of Karhunen-Loève Transform
4.3 Computational Requirements
The KLT computational processes will be investigated and analysed individually in this 
section; consequently, the computational requirements of the overall KLT process can be 
determined.
4.3.1 BandMean and MeanSub Computations
Computing the BandMean requires the estimation of the mean of each (M x L) spectral band;
therefore, each band requires (M x L - 1) additions and one division, and an M x L x N image
requires N x (M x L - 1) additions and N divisions. However, since M and L are typically in the
hundreds, adding (accumulating) such a large number of elements might cause data overflow,
especially if a fixed-point format is considered in the design. Therefore, to avoid the
overflow, each M x L band can be divided into K blocks. This will increase the
number of required division operations to N x K instead of N. However, by carefully selecting
K as a power of 2, the division operations can be performed at negligible cost through
bit-shifting. On the other hand, the computation of the MeanSub is simpler and only
requires (M x L x N) subtractions.
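The block-wise accumulation can be sketched as follows. The sketch assumes that both the number of pixels per band and the number of blocks K are powers of 2, so that every division becomes a right shift; the function name `band_mean_blocks` and the integer truncation behaviour are assumptions of this illustration.

```python
def band_mean_blocks(pixels, num_blocks):
    """Integer mean of one band, accumulated block by block.

    Assumes len(pixels) and num_blocks are powers of two, so each
    division reduces to a right shift.
    """
    block_len = len(pixels) // num_blocks
    block_shift = block_len.bit_length() - 1     # log2(block_len)
    blocks_shift = num_blocks.bit_length() - 1   # log2(num_blocks)
    total = 0
    for b in range(num_blocks):
        block = pixels[b * block_len:(b + 1) * block_len]
        # The accumulator only ever holds the sum of one block
        total += sum(block) >> block_shift       # block mean via shift
    return total >> blocks_shift                 # mean of the block means


mean_value = band_mean_blocks(list(range(16)), 4)  # truncated mean of 0..15
```

Each partial sum only spans one block, which bounds the accumulator width; the price is the truncation introduced by the integer shifts.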
4.3.2 Covariance Matrix
The covariance matrix is a real symmetric matrix that represents the data variance between the
spectral bands; therefore, its dimensions are equal to the number of these bands in the
hyperspectral image. The computation process of the covariance matrix is illustrated in
Figure 4.2; this requires a vector mean computation (Mean (H)), a matrix multiplication, a
vector multiplication, a matrix subtraction, and scalar matrix multiplications and divisions. The
computation of the Mean (H) overlaps with the previous process; therefore, it is already
available and does not need to be recomputed. As shown in Figure 4.2, the most
computationally intensive part of this process is the matrix multiplication $H \times H^T$, which is
the only process that involves (M x L x N) matrix multiplications, while all the other processes
involve (N x N) matrix operations. Table 4.1 outlines the computational requirements of the
covariance matrix process of Figure 4.2; it can be noticed that multiplication and addition
(multiply-accumulate) are the dominant operations of this process, which are the
operations required for the matrix multiplication $H \times H^T$.
Figure 4.2: The Computation Process of the Covariance Matrix
Table 4.1: The Computation Requirements for the Covariance Matrix
Operation        Count
Addition         0.5 x M x L x N x (N+1) + M x L x N
Subtraction      N^2
Multiplication   0.5 x M x L x (N^2+N) + 0.5 x (N^2+N) +
Division         N^2 + N
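The dominant 0.5 x M x L x N x (N+1) term follows from the symmetry of the covariance matrix: only the upper triangle needs to be accumulated. The Python sketch below illustrates this on already mean-subtracted data, which is a simplification compared with the procedure of Figure 4.2; the function name `covariance_matrix` and the flat-list layout of each band are assumptions.

```python
def covariance_matrix(mean_sub):
    """Covariance of N zero-mean bands, each a flat list of M*L pixels.

    Returns the N x N matrix and the number of multiply-accumulates used.
    """
    n = len(mean_sub)
    count = len(mean_sub[0])
    cov = [[0.0] * n for _ in range(n)]
    macs = 0
    for i in range(n):
        for j in range(i, n):                 # upper triangle only
            acc = 0.0
            for p, q in zip(mean_sub[i], mean_sub[j]):
                acc += p * q                  # one multiply-accumulate
                macs += 1
            cov[i][j] = cov[j][i] = acc / (count - 1)
    return cov, macs


cov, macs = covariance_matrix([[-1, 0, 1], [-2, 0, 2]])
# macs == 0.5 * (M*L) * N * (N + 1) == 0.5 * 3 * 2 * 3 == 9
```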
4.3.3 Eigenvectors and Eigenvalues Computations
The computation of eigenvalues and eigenvectors has been a main point of interest for
many researchers because of its significance for different applications. Many libraries have
been introduced for implementing linear algebra operations in software for general-purpose
processors and parallel processors. These libraries offer optimized processor performance for
different linear algebra operations, and this optimization is usually platform specific. The
Linear Algebra Package (LAPACK) [158] is one of these libraries. In addition to
eigenvalue problems, LAPACK provides routines for executing various linear algebra
operations such as singular value problems and matrix factorizations. On the other hand, for
parallel processor platforms, the scalable LAPACK library (ScaLAPACK) [159] is
commonly used. This library targets distributed memory MIMD systems and, like LAPACK,
it offers optimized solutions for many linear algebra problems, including the eigenvalue ones.
Since eigenvalue problems are of major significance for a wide range of applications,
different techniques have been developed for eigenvalues / eigenvectors computations. Some
of these techniques have been developed for specific types of matrices or applications. The
bisection and the divide-and-conquer methods are only applicable to tri-diagonal matrices
[160]; the power iteration, Lanczos and Arnoldi methods are developed for computing the
extreme eigenvalues [161]; the Jacobi algorithm has been developed for real symmetric
matrices [160]; finally, the QR algorithm is one of the most important techniques for
eigenvector computations [162]. Since the covariance matrix is a real symmetric matrix,
only the Jacobi and the QR algorithms will be considered in this work.
4.3.3.1 Jacobi Algorithm
The Jacobi algorithm is a well-known technique used for eigenvalues / eigenvectors
computations of real symmetric matrices. In this algorithm, the eigenvalues and eigenvectors
are evaluated by diagonalizing the input matrix through iterative transformations of
2x2 sub-matrices of the original NxN matrix [160]. Figure 4.3 illustrates the computation
process of a single Jacobi iteration, where the matrix B converges to the eigenvalues and the
matrix V converges to the eigenvectors.
[Figure: the rotation matrix $T^k$, equal to the identity matrix except for the elements $T_{i,i}$, $T_{i,j}$, $T_{j,i}$ and $T_{j,j}$]
$T_{i,i} = T_{j,j} = \cos\theta$, $T_{i,j} = -T_{j,i} = \sin\theta$
$B^{k+1} = (T^k)^T \times B^k \times T^k$, $V^{k+1} = V^k \times T^k$
Figure 4.3: The Jacobi Algorithm
From a computation viewpoint, each iteration involves:
• The computation of $\theta$, $\sin\theta$ and $\cos\theta$ to form the transform matrix $T^k$. From
Figure 4.3, this requires: one subtraction, one division, a multiplication by 2, a
division by 2 and the computations of $\sin\theta$, $\cos\theta$ and $\tan^{-1}$.
• The matrix multiplications:
$B^{k+1} = (T^k)^T \times B^k \times T^k$   (4.1)
$V^{k+1} = V^k \times T^k$   (4.2)
where k is the iteration index. Theoretically, for NxN matrices, each multiplication requires
$N^3$ element multiplications and $(N-1)N^2$ element additions. Nevertheless, taking into
consideration that $T^k$ differs from the identity matrix in only four elements, $T_{i,i}$, $T_{i,j}$,
$T_{j,i}$ and $T_{j,j}$, the matrix multiplication in (4.1) will alter only a pair of rows and a pair
of columns, while in (4.2) only a pair of columns will be altered. Therefore, (4.1) requires 8N
multiplications and 4N additions, while (4.2) requires 4N multiplications and 2N additions.
Equations (4.1) and (4.2) can thus be decomposed into (4.3) to (4.7), and (4.8) and (4.9),
respectively.
$B^{k+1}_{i,i} = B^k_{i,i} (\cos\theta^k)^2 + B^k_{j,j} (\sin\theta^k)^2 - 2 B^k_{i,j} \cos\theta^k \sin\theta^k$   (4.3)
$B^{k+1}_{j,j} = B^k_{i,i} (\sin\theta^k)^2 + B^k_{j,j} (\cos\theta^k)^2 + 2 B^k_{i,j} \cos\theta^k \sin\theta^k$   (4.4)
$B^{k+1}_{i,j} = B^{k+1}_{j,i} = (B^k_{i,i} - B^k_{j,j}) \cos\theta^k \sin\theta^k + B^k_{i,j} ((\cos\theta^k)^2 - (\sin\theta^k)^2)$   (4.5)
for $n \in [1, N]$ and $n \neq i, j$:
$B^{k+1}_{i,n} = B^k_{i,n} \cos\theta^k - B^k_{j,n} \sin\theta^k$   (4.6)
$B^{k+1}_{j,n} = B^k_{j,n} \cos\theta^k + B^k_{i,n} \sin\theta^k$   (4.7)
for $n \in [1, N]$:
$V^{k+1}_{n,i} = V^k_{n,i} \cos\theta^k - V^k_{n,j} \sin\theta^k$   (4.8)
$V^{k+1}_{n,j} = V^k_{n,j} \cos\theta^k + V^k_{n,i} \sin\theta^k$   (4.9)
Therefore, in each iteration:
• Equations (4.3), (4.4) and (4.5) are executed once
• Equations (4.6) and (4.7) are executed (N - 2) times
• Equations (4.8) and (4.9) are executed N times
Therefore, Equations (4.6), (4.7), (4.8) and (4.9) are the highly recurring operations (in the
order of N per iteration), whereas Equations (4.3), (4.4) and (4.5) are the less recurring
operations (once per iteration).
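A single iteration of this kind can be sketched in Python as follows. The closed form used for the rotation angle (half the arctangent of 2B_ij / (B_jj - B_ii)) is an assumption of this sketch, chosen so that the updated B_ij vanishes under the rotation convention of Figure 4.3; the function name `jacobi_rotate` and the list-of-lists storage are also assumptions.

```python
import math


def jacobi_rotate(B, V, i, j):
    """Apply one Jacobi rotation in place to symmetric B and accumulator V."""
    n = len(B)
    # Assumed angle formula: it drives the (i, j) element of B to zero
    theta = 0.5 * math.atan2(2.0 * B[i][j], B[j][j] - B[i][i])
    c, s = math.cos(theta), math.sin(theta)
    bii, bjj, bij = B[i][i], B[j][j], B[i][j]
    # Executed once per iteration: the 2 x 2 pivot block
    B[i][i] = bii * c * c + bjj * s * s - 2.0 * bij * c * s
    B[j][j] = bii * s * s + bjj * c * c + 2.0 * bij * c * s
    B[i][j] = B[j][i] = (bii - bjj) * c * s + bij * (c * c - s * s)
    # Executed N - 2 times: remaining elements of rows/columns i and j
    for m in range(n):
        if m != i and m != j:
            bim, bjm = B[i][m], B[j][m]
            B[i][m] = B[m][i] = bim * c - bjm * s
            B[j][m] = B[m][j] = bjm * c + bim * s
    # Executed N times: columns i and j of the eigenvector accumulator
    for m in range(n):
        vmi, vmj = V[m][i], V[m][j]
        V[m][i] = vmi * c - vmj * s
        V[m][j] = vmj * c + vmi * s


B = [[2.0, 1.0], [1.0, 2.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
jacobi_rotate(B, V, 0, 1)   # B becomes diagonal with eigenvalues 1 and 3
```

For a 2 x 2 matrix a single rotation completes the diagonalization; for larger matrices the rotation is repeated over all (i, j) pairs.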
In order to evaluate the number of iterations required to compute the eigenvectors, different
matrix sizes were considered in a MATLAB simulation of the Jacobi algorithm. In order to
make the analysis more comprehensive, different input data widths are considered in the
simulation. Table 4.2 outlines the number of iterations required for a certain output error
(Mean Square Error); Figures C.1, C.2, C.3 and C.4 of Appendix C illustrate the Jacobi
convergence of these matrices.
Table 4.2: Number of Iterations Required for a Certain Output Error (Mean Square Error, MSE)
Matrix Size   Data Width   MSE: 10^-2   10^-4    10^-6    10^-8    10^-10
8x8           8-bit        76           87       94       99       107
              16-bit       120          123      125      128      132
              24-bit       125          128      132      138      147
16x16         8-bit        390          480      488      555      561
              16-bit       545          564      582      630      630
              24-bit       556          604      655      656      656
32x32         8-bit        1820         2480     2600     2610     3000
              16-bit       2712         2824     3283     3284     3284
              24-bit       3222         3224     3224     3563     3564
64x64         8-bit        8000         10800    11070    12900    12950
              16-bit       12900        14240    14500    14950    14960
              24-bit       14530        14610    14710    16190    16380
As with any iterative method, the more iterations are performed, the more accurate the
obtained eigenvalues are. The number of required iterations grows rapidly with the size of the
processed matrix: in Table 4.2, doubling the matrix dimension increases the number of
iterations by a factor of roughly 4 to 6. Moreover, for a larger input bit-length, more
iterations are required to achieve a certain level of output accuracy. While the input image
data of the test images has a bit-length of 14 bits, the covariance matrix has a larger
bit-length. Assuming p is the input data bit-length of a hyperspectral image H (M x L x N), by
tracing the computation procedure of Figure 4.2, the theoretical maximum bit-length of the
covariance matrix a is shown in equation (4.10).
a  = 2 p  + log2  ( ^ x L - i )  ^  ^  (4-10)
The maximum bit-length presented in equation (4.10) is a theoretical estimation; however, 
when using hyperspectral images, the covariance data bit-length will be smaller. Table 4.3 
outlines the theoretical and the actual covariance matrix bit-length of AVIRIS and the 
Hyperion hyperspectral images.
             Theoretical   AVIRIS    Hyperion    Hyperion   Hyperion   Hyperion
                           Cuprite   Greenland   Boston     Edenton    Portobago
Bit-length   29            24        24          22         23         23
A sweep is defined as the number of iterations required to cover the whole matrix [160] and is
equal to $N(N-1)/2$ iterations. Therefore, using the number of sweeps as a measuring figure
instead of iterations makes the analysis more independent of the matrix size. The eigenvalues
and eigenvectors convergence offered by the Jacobi algorithm is ultimately quadratic [160],
where it is suggested that fewer than 10 sweeps are usually enough to compute the eigenvalues
and the eigenvectors of any matrix with dimensions of less than 1000. In order to investigate
the performance of the Jacobi algorithm on hyperspectral data, different data sets
(256 x 256 x 8) of the Hyperion and AVIRIS images are considered in a MATLAB simulation,
and the average of the results is shown in Figure 4.4. It can be noticed that, for both images,
it takes the Jacobi algorithm around 6 sweeps to completely evaluate the eigenvalues.
[Plot; curves: Hyperion Boston and AVIRIS Cuprite; x-axis: Sweeps]
Figure 4.4: The Jacobi Algorithm Convergence of Different Hyperspectral Data
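The relation between iterations and sweeps can be made concrete with a short Python sketch. The row-by-row (cyclic) ordering of the pairs and the helper names `sweep_pairs` and `iterations_to_sweeps` are assumptions of this illustration; the iteration count in the example is the 64x64, 16-bit entry for the smallest MSE in Table 4.2.

```python
def sweep_pairs(n):
    """All (i, j) pairs with i < j visited in one sweep of an n x n matrix."""
    return [(i, j) for i in range(n - 1) for j in range(i + 1, n)]


def iterations_to_sweeps(iterations, n):
    """Convert an iteration count into sweeps of N(N-1)/2 iterations each."""
    return iterations / (n * (n - 1) / 2)


pairs_8 = len(sweep_pairs(8))                  # 28 iterations per sweep
pairs_64 = len(sweep_pairs(64))                # 2016 iterations per sweep
sweeps_64 = iterations_to_sweeps(14960, 64)    # about 7.4 sweeps
```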
4.3.3.2 Matrix Reduction Technique for the Jacobi algorithm
The Jacobi algorithm converges the matrix to its eigenvalues through matrix
transformations. Throughout this convergence, the eigenvalues accumulate on the diagonal
elements of the matrix while the off-diagonal elements converge to zero. Therefore, after a
certain number of iterations, a given diagonal element is as close to one of the eigenvalues
as the off-diagonal elements on its corresponding row and column are close to zero
(e.g. the element $a_{i,i}$ is as close to one of the eigenvalues as the off-diagonal elements on
the i-th row and the i-th column are close to zero). The convergence of some diagonal elements
is faster than that of others; this varies from matrix to matrix. While the diagonal elements
converge to the eigenvalues, their corresponding eigenvectors also converge at the same rate.
Therefore, when a certain eigenvalue has completely converged, its corresponding eigenvector
has also converged. An online matrix reduction technique has been developed, as shown in
Figure 4.5. In the proposed technique, the convergence of each diagonal element is checked
after each sweep; when the required accuracy of any eigenvalues (diagonal elements) is
achieved, these elements are saved in a separate array and their corresponding rows and
columns are eliminated; therefore, a smaller matrix is processed in the next sweep, which
requires fewer iterations.
[Figure blocks: Sweep (Figure 4.3), Convergence Check, Row / Column Eliminations]
Figure 4.5: The Proposed Matrix Reduction Technique
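The convergence check and the row / column elimination can be sketched in Python as below. The convergence criterion used here (every off-diagonal element of the row below a tolerance `tol`) and the function name `split_converged` are assumptions of this sketch; the criterion actually used in the design may differ.

```python
def split_converged(B, tol):
    """Separate converged eigenvalues from the still-active sub-matrix.

    A diagonal element is treated as converged when every off-diagonal
    element of its row is within tol of zero (B is symmetric, so the
    column does not need a separate check).
    """
    n = len(B)
    done = [i for i in range(n)
            if all(abs(B[i][m]) <= tol for m in range(n) if m != i)]
    keep = [i for i in range(n) if i not in done]
    eigenvalues = [B[i][i] for i in done]              # saved separately
    reduced = [[B[r][c] for c in keep] for r in keep]  # rows/columns removed
    return eigenvalues, reduced, keep


B = [[5.0, 0.0, 0.0],
     [0.0, 2.0, 0.3],
     [0.0, 0.3, 1.0]]
eigenvalues, reduced, keep = split_converged(B, 1e-6)
# The next sweep works on a 2 x 2 matrix: 1 iteration instead of 3
```

The `keep` list records which original rows remain active, so that the matching eigenvector columns can still be updated in later sweeps.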
In order to assess the proposed technique, a sweep-based simulation of different matrix
sizes was undertaken to determine the number of converged eigenvalues (eliminated
rows/columns). In this simulation, different hyperspectral data sets (AVIRIS Cuprite and
Hyperion Boston) were considered to produce 5 covariance matrices of each size and the
average of their results was taken. The results of the simulation are outlined in Table 4.3 and
can be slightly different from matrix to matrix; however, they show that the eigenvalues
convergence is a gradual process, i.e. not all eigenvalues converge at the same sweep.
By applying the proposed matrix reduction technique, the number of iterations per sweep will
be reduced for the last sweeps. From Table 4.3, it can be noticed that after the 5th sweep, the
matrix reduction starts, where 6, 4, 3 and 1 eigenvalues are completely converged for 8x8,
16x16, 32x32 and 64x64 matrices, respectively. After the 7th sweep, the eigenvalues
computation of the 8x8, 16x16 and the 32x32 matrices is completed and that of the 64x64
matrices is completed after the 8th sweep.
Table 4.3: Eigenvalues Convergence throughout 10 Sweeps
Matrix Size   8x8                     16x16                   32x32                   64x64
              AVIRIS      Hyperion    AVIRIS      Hyperion    AVIRIS      Hyperion    AVIRIS      Hyperion
              Cuprite     Boston      Cuprite     Boston      Cuprite     Boston      Cuprite     Boston
Sweep 1       -           -           -           -           -           -           -           -
Sweep 2       -           -           -           -           -           -           -           -
Sweep 3       -           -           -           -           -           -           -           -
Sweep 4       1           2           1           1           1           1           -           -
Sweep 5       6           6           4           3           3           3           1           1
Sweep 6       Completed   Completed   13.4        12          19.4        14          4           3
Sweep 7                               15          14          29          25          30          26
Sweep 8                               Completed   Completed   Completed   Completed   60          53
Sweep 9                                                                               Completed   Completed
In this simulation, a maximum error of ±10'^ is considered; different accuracy requirements 
might give different results; nevertheless, a similar reduction in iterations will be exhibited.
The overall reduction in iterations achieved by the proposed matrix reduction technique is
summarized in Table 4.4. The number of iterations is reduced by between 20% and 31%; however,
this does not mean that the computation cost will be reduced by the same percentage. The
proposed technique requires the Convergence Check process, which adds some minor
computation overhead; this will be discussed in detail in section 5.5.
Table 4.4: Iterations Reduction by the Proposed Matrix Reduction Technique
Matrix Size          8x8                  16x16                32x32                64x64
                     AVIRIS    Hyperion   AVIRIS    Hyperion   AVIRIS    Hyperion   AVIRIS    Hyperion
                     Cuprite   Boston     Cuprite   Boston     Cuprite   Boston     Cuprite   Boston
Reduced Iterations   20%       23%        31%       30%        26%       24%        21%       20%
4.3.3.3 QR Algorithm
In the late 1950s, the QR factorization was developed based on the LR factorization; since the
latter exhibits slower convergence compared with the QR factorization, the QR has become
more common for different applications [164]. The computation of eigenvectors and
eigenvalues is one of these applications. Like the Jacobi algorithm, the QR algorithm for
computing the eigenvalues / eigenvectors is an iterative process, as shown in Figure 4.6.
Therefore, in order to achieve higher accuracy, more iterations are required.
[Figure: iterative loop built around the QR factorization of an n x n matrix; V converges to the eigenvectors and E converges to the eigenvalues over the iterations]
Figure 4.6: The QR Algorithm for Eigenvalues Computations
As illustrated in Figure 4.6, in addition to the QR factorization, each algorithm iteration
performs 2 matrix multiplications ($E^{n+1} = R \times Q$ and $V^{n+1} = V^n \times Q$). Theoretically, each of
these matrix multiplications requires $N^3$ element multiplications and $(N-1)N^2$ element additions.
However, since R is an upper triangular matrix, the $R \times Q$ multiplication will require $N^2(N+1)/2$
element multiplications and $N^2(N-1)/2$ element additions. Moreover, both Q and R matrices
will converge to sparse matrices and eventually into identity or diagonal matrices as more
iterations are executed. Therefore, at later iterations, more additions and multiplications by 0s
or 1s are performed; these operations require less processing time on embedded processors.
QR Factorizations
The QR factorization is employed to decompose a matrix A into an orthogonal matrix Q and
an upper triangular matrix R, so that $A = Q \times R$. Different techniques have been developed
for QR factorization: Gram Schmidt, Modified Gram Schmidt, Householder Transformation
and Givens Rotations. MATLAB simulations of these techniques were undertaken to assess
the computational requirements of each of them, as outlined in Table 4.5.
The Gram Schmidt technique was one of the first attempts at QR factorization; however,
this technique can exhibit propagating rounding errors and loss of the orthogonal
characteristic of Q [165]. Therefore, a more numerically stable algorithm, the Modified Gram
Schmidt, was proposed, which is achieved by rearranging the computations of the classical
Gram Schmidt. The Householder factorization is a numerically more stable technique than
the Gram Schmidt [160]; this technique adopts a sequence of unitary matrix operations that
lead to an orthogonal triangularization of the matrix. Finally, QR factorization can be
achieved through a series of Givens rotations; each of these rotations zeroes out a
sub-diagonal element, leading to the triangularization of the matrix. While the Givens Rotation
technique is more suitable for sparse matrices, the Householder transformation is more
suitable for dense matrices [166]. Therefore, since the covariance matrix of the target
application (spectral decorrelation through KLT) is a dense matrix, the Householder
transformation is a more suitable choice than the Givens Rotation technique.
Table 4.5: Computational Requirements of Different QR Factorization Techniques
                 Gram Schmidt   Modified Gram Schmidt   Householder
Additions        2N(N - 1)      2N(N - 1)               2N(N - 1)
Subtraction      N^2            0.5N(N - 1)             2N(N - 1)
Multiplication   2N^2           2N^2                    2N(N - 1)
Division         N^2            N^2                     N
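The iterative structure of the QR algorithm (factorize, then multiply R by Q and accumulate Q) can be sketched in Python as follows. For brevity the sketch uses the Modified Gram Schmidt factorization rather than the Householder transformation, assumes a full-rank input, and applies no shifts or convergence test; the function names `matmul`, `mgs_qr` and `qr_eigen` are assumptions of this illustration.

```python
import math


def matmul(X, Y):
    """Product of two n x n matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[r][k] * Y[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def mgs_qr(A):
    """Modified Gram Schmidt factorization A = Q x R (A assumed full rank)."""
    n = len(A)
    cols = [[A[r][c] for r in range(n)] for c in range(n)]
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for k in range(n):
        v = cols[k][:]
        for p in range(k):
            # Project the partially reduced vector, not the original column
            R[p][k] = sum(q_cols[p][r] * v[r] for r in range(n))
            v = [v[r] - R[p][k] * q_cols[p][r] for r in range(n)]
        R[k][k] = math.sqrt(sum(x * x for x in v))
        q_cols.append([x / R[k][k] for x in v])
    Q = [[q_cols[c][r] for c in range(n)] for r in range(n)]
    return Q, R


def qr_eigen(A, iterations):
    """Unshifted QR iteration: E tends to the eigenvalues, V to the eigenvectors."""
    n = len(A)
    E = [row[:] for row in A]
    V = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for _ in range(iterations):
        Q, R = mgs_qr(E)
        E = matmul(R, Q)
        V = matmul(V, Q)
    return [E[d][d] for d in range(n)], V


values, vectors = qr_eigen([[2.0, 1.0], [1.0, 2.0]], 60)
```

For this 2 x 2 example the diagonal of E approaches the eigenvalues 3 and 1; the fixed iteration count stands in for a proper stopping criterion.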
4.3.3.4 Jacobi versus QR Algorithm
While the QR algorithm is developed to solve general eigenvalues / eigenvectors problems,
the Jacobi algorithm is developed specifically for real symmetric matrices. Since the
covariance matrix of the KLT is a real symmetric matrix, both techniques can be used in
this work. However, in terms of accuracy, the Jacobi algorithm outperforms the QR algorithm
[167]; in addition, when implemented on a parallel platform, the Jacobi algorithm offers
a higher level of parallelism.
In the previous section, the convergence of the Jacobi algorithm in terms of iterations and
sweeps was presented; however, an iteration of the QR algorithm is completely different
from a Jacobi iteration. Therefore, in order to compare the convergence of both algorithms,
a unified measure needs to be used. FLOP (Floating Point Operations) is a measure used to
assess the computational cost of algorithms and the performance of computing systems; each
floating point operation (addition, subtraction, multiplication and division) is considered as
one flop [160]. However, the cost of complex mathematical functions, such as the trigonometric
functions, is highly dependent on the processor architecture and the math libraries used. A
comprehensive comparison of most of the mathematical functions, including the
trigonometric functions, is presented in [168]; however, only PC processor architectures
(Intel and AMD) were considered. Due to the wide diversity of embedded processor
architectures, it is very difficult to quantify the performance of the trigonometric functions in
terms of flops for all embedded processors. In this work, two different embedded processors
will be considered, the ARM Cortex-M3 and the Altera NIOS II. Therefore, a test bench was
developed on these platforms to quantify the performance of the trigonometric functions in
terms of flops. Each of the trigonometric functions (sine, cosine and arctangent) requires
around 12 flops on the Cortex-M3 and 16 flops on the NIOS II.
The QR factorisation (Householder) requires $3N^2$ flops [169] and the matrix multiplications
for each iteration of Figure 4.6 require $3N^3 + N^2$ flops; hence, each iteration of the QR
algorithm requires $3N^3 + 4N^2$ flops. On the other hand, each iteration of the Jacobi algorithm
(Figure 4.3, equations (4.1) and (4.2)) requires 12(N+4) + 36 flops on the Cortex-M3 and
12(N+4) + 48 flops on the NIOS II, including the trigonometric functions. A MATLAB simulation
has been undertaken to evaluate and compare the performance of the QR and the Jacobi
algorithms. In this simulation, hyperspectral data sets of different sizes from AVIRIS Cuprite
and Hyperion Boston are considered, as shown in Figures 4.7 and 4.8. For the target matrices
(dense real symmetric), the results of the simulation exhibit a noticeable advantage for the
Jacobi over the QR algorithm in terms of output accuracy and computational cost. Therefore,
only the Jacobi algorithm will be considered in this work.
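The per-iteration costs quoted above can be turned into numbers with the short Python sketch below. It assumes the readings $3N^3 + 4N^2$ flops per QR iteration and 12(N+4) flops plus three trigonometric evaluations per Jacobi iteration; the function names and the choice of N = 32 are assumptions for illustration, and the sketch says nothing about how many iterations each algorithm needs to converge.

```python
def qr_flops_per_iteration(n):
    """Assumed reading of the text: 3N^3 + 4N^2 flops per QR iteration."""
    return 3 * n**3 + 4 * n**2


def jacobi_flops_per_iteration(n, trig_flops):
    """12(N + 4) arithmetic flops plus three trigonometric evaluations."""
    return 12 * (n + 4) + 3 * trig_flops


def jacobi_flops_per_sweep(n, trig_flops):
    """One sweep is N(N - 1)/2 Jacobi iterations."""
    return jacobi_flops_per_iteration(n, trig_flops) * (n * (n - 1) // 2)


qr_iter = qr_flops_per_iteration(32)               # 102400
jacobi_iter_m3 = jacobi_flops_per_iteration(32, 12)  # 468 (12 flops per function)
jacobi_iter_nios = jacobi_flops_per_iteration(32, 16)  # 480 (16 flops per function)
jacobi_sweep_m3 = jacobi_flops_per_sweep(32, 12)     # 232128
```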
[Plot: QR versus Jacobi Eigenvalues Algorithm convergence; curves: Jacobi and QR; x-axis: FLOPS]
Figure 4.7: The QR versus the Jacobi Algorithm (AVIRIS Cuprite)
[Plot: QR versus Jacobi Eigenvalues Algorithm convergence; curves: Jacobi and QR; x-axis: FLOPS]
Figure 4.8: The QR versus the Jacobi Algorithm (Hyperion Boston)
73
4.3.4. Eigen Mapping
The Eigen mapping is the matrix multiplication of the eigenvectors (N x N) by the 
normalised data MeanSub (N x (M * L)), as shown in equation (4.11). The first matrix 
contains the eigenvectors, where each row is an eigenvector (i.e. the transpose of the 
eigenvector matrix); the second matrix contains the MeanSub, where each column contains the 
pixels at the same spatial coordinates across all the spectral bands. The output is an 
(N x (M * L)) matrix, where each element is the inner product of an eigenvector with a column 
of the MeanSub matrix. Therefore, the computation of each element of the output matrix requires N 
multiplications and (N - 1) additions; hence, since the output matrix contains N * M * L 
elements, equation (4.11) requires N^2 * M * L multiplications and N(N - 1) * M * L 
additions.
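$$
\underbrace{\begin{bmatrix} v_{1,1} & \cdots & v_{N,1} \\ \vdots & \ddots & \vdots \\ v_{1,N} & \cdots & v_{N,N} \end{bmatrix}}_{N \times N}
\times
\underbrace{\begin{bmatrix} x_{1,1} & \cdots & x_{1,M \times L} \\ \vdots & \ddots & \vdots \\ x_{N,1} & \cdots & x_{N,M \times L} \end{bmatrix}}_{N \times (M \times L)}
= \text{Output Data} \qquad (4.11)
$$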
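As an illustration of the operation counts stated for equation (4.11), the sketch below counts the multiplications and additions and performs the mapping on plain Python lists. The function names and the list-of-lists layout are assumptions made for this example only.

```python
def eigen_mapping_ops(n_bands, rows, cols):
    """Return (multiplications, additions) for the Eigen mapping.

    Each of the N * M * L output elements needs N multiplications
    and N - 1 additions.
    """
    pixels = rows * cols
    multiplications = n_bands * n_bands * pixels
    additions = n_bands * (n_bands - 1) * pixels
    return multiplications, additions


def eigen_mapping(eig_rows, mean_sub):
    """Multiply eigenvectors (one per row, N x N) by MeanSub (N x pixels)."""
    n_pixels = len(mean_sub[0])
    output = []
    for vector in eig_rows:
        row = []
        for p in range(n_pixels):
            # Inner product of one eigenvector with one pixel column.
            row.append(sum(v * band[p] for v, band in zip(vector, mean_sub)))
        output.append(row)
    return output


print(eigen_mapping_ops(32, 128, 128))
```

For a 128 x 128 x 32 data set this yields 16,777,216 multiplications and 16,252,928 additions.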
Table 4.6 outlines the individual and overall computational requirements of the KLT 
computation processes. It can be noticed that the computation of the 
eigenvectors depends only on the spectral dimension, while all the other processes depend 
on both the spectral and spatial dimensions. To depict these requirements visually, the 
number of required operations (addition, subtraction, multiplication and division) for the 
KLT computation of a 128 x 128 x 32 data set is shown in Figure 4.9. The computation of the 
covariance matrix and the Eigen mapping are clearly the most 
computationally intensive processes. On the other hand, since it depends only on the spectral 
dimension, which is in general smaller than the spatial ones, the computation of the 
eigenvectors requires the fewest operations. Nevertheless, the eigenvectors computation 
is far more complicated than the other processes. In addition, while the 
other processes offer a high level of parallelism, the iterative nature 
of the eigenvectors computation makes it more sequential and offers a much lower level of 
parallelism.
Figure 4.9: The Required Number of Operations for the KLT computation of 128 x 128 x 32 pixels
[Bar chart; categories: Band Mean, MeanSub, Covariance, Eigenvectors, Eigen Mapping; series: Multiplication, Addition/Subtraction; axis: 0 to 20,000,000 operations]
4.4. Fixed Point and Floating Point Error Analyses
One of the major design considerations when performing signal processing in hardware is the 
data type: floating-point or fixed-point. While floating-point offers higher precision, and thus 
smaller output errors, a fixed-point implementation usually offers higher processing speed and 
requires far fewer hardware resources, and thus less power [170]. Therefore, if 
the output accuracy offered by a fixed-point implementation is adequate for the target 
application, fixed-point is usually chosen; otherwise, a floating-point 
implementation is required.
The input data of the KLT computational process, Figure 4.1, are hyperspectral data 
(integers), and are therefore fixed-point. The first two processes, the BandMean and the 
MeanSub, are quite straightforward. The first involves accumulations and a single division 
for each spectral band; the latter involves a single subtraction for each pixel. Therefore, in 
both processes, there is no accumulation error; hence, the output error is limited and 
manageable when a fixed-point format is considered. The following three processes involve 
more complicated computations, so they will be discussed individually in the next 
subsections.
4.4.1 Covariance Matrix Computation Data Format
The computation process of the covariance matrix, Figure 4.2, involves different operations, 
as outlined in Table 4.7. The input data for these operations can be either integers or fractions. 
Moreover, these operations can be single, where each element of the output data is 
the result of a single operation, or cumulative, where each element of the output data is the result 
of a sequence of operations (e.g. an inner product of two vectors). While the output error of single 
operations is limited and manageable, the output error of cumulative operations 
accumulates. It can be noticed from Table 4.7 that all the cumulative operations have integer 
input data, and hence no output error. Therefore, a fixed-point implementation results in a limited 
output error and can be considered for the covariance matrix computation.
Table 4.7: The Data Format of the Covariance Matrix Computation Process
Operation                                Input Data   Operation Type
α = H × Hᵀ                               Integer      Cumulative
β = Mean(H) × Mean(Hᵀ)                   Fraction     Single
γ = M * L * β                            Fraction     Single
δ = α - γ                                Fraction     Single
Covariance Matrix = δ × 1/(M * L - 1)    Fraction     Single
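The sequence of operations in Table 4.7 can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this work: the function names are hypothetical, H is taken as a list of N bands with M * L integer samples each, and the direct mean-subtracted form is included only as a cross-check.

```python
def covariance_from_table(h):
    """Covariance of h (N bands x P pixels, P = M * L) via the Table 4.7 steps."""
    n, p = len(h), len(h[0])
    mean = [sum(band) / p for band in h]
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # alpha: cumulative operation on integer inputs (exact).
            alpha = sum(a * b for a, b in zip(h[i], h[j]))
            beta = mean[i] * mean[j]      # single operation on fractions
            gamma = p * beta              # single operation
            delta = alpha - gamma         # single operation
            cov[i][j] = delta / (p - 1)   # single operation
    return cov


def covariance_direct(h):
    """Reference form: subtract the band means first, then accumulate."""
    n, p = len(h), len(h[0])
    mean = [sum(band) / p for band in h]
    sub = [[x - m for x in band] for band, m in zip(h, mean)]
    return [[sum(a * b for a, b in zip(sub[i], sub[j])) / (p - 1)
             for j in range(n)] for i in range(n)]


data = [[1, 2, 3, 4], [2, 4, 6, 9]]
print(covariance_from_table(data))
print(covariance_direct(data))
```

Both forms agree up to rounding; the first one only accumulates over integer data, which is why the cumulative step introduces no fixed-point error.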
4.4.2 Eigenvectors Computation Data Format
The eigenvectors computation process, the Jacobi algorithm, is far more complicated than the 
other KLT computation processes. This process is iterative and involves several 
trigonometric computations; thus, the fixed-point output errors cannot be derived 
mathematically as for the covariance matrix computation. Therefore, these 
errors are estimated through simulation.
Fixed-Point Implementations
Different sets of hyperspectral data from both AVIRIS and Hyperion images are considered 
in a MATLAB simulation employing different fixed-point data formats. The input data for the 
eigenvectors process (covariance matrix) have a bit length of 22-24 bits, as shown in 
Table 4.2, which is the integer bit-length. For the fractional part, different bit-lengths are 
considered (10, 12, 14 and 16 bits). To simplify the simulation, only the simple mathematical 
operations were performed in fixed-point, while the complex functions (trigonometric) were 
performed in floating-point.
Integer (24-bit) Fraction (10-16 bits)
Figure 4.10: Fixed-Point Data Structure
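The effect of the fractional bit-length on a single value can be sketched as below. This assumes round-to-nearest quantisation, which may differ from the rounding mode of the simulation; the function names and the sample value are hypothetical.

```python
def to_fixed(value, frac_bits):
    """Quantise value to a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def max_quantisation_error(frac_bits):
    """Round-to-nearest is off by at most half of the least significant bit."""
    return 2.0 ** -(frac_bits + 1)


sample = 0.123456789  # arbitrary illustrative value
for bits in (10, 12, 14, 16):
    error = abs(to_fixed(sample, bits) - sample)
    print(bits, error, max_quantisation_error(bits))
```

Each additional fractional bit halves the bound on the error of a single operation; the accumulated error over many Jacobi iterations is what the simulation measures.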
Figures 4.11a and 4.11b show the maximum output error of the eigenvectors incurred by the 
fixed-point implementations of the Jacobi algorithm when 32 spectral bands are considered 
for the AVIRIS Cuprite and the Hyperion Boston images, respectively. The simulations for 8 and 16 
spectral bands of the same images are shown in Figures B.5, B.6, B.7 and B.8 of Appendix B.
Figure 4.11a: The Maximum Output Errors of the Eigenvectors for Fixed-point Computation (AVIRIS Cuprite 32 bands)
[Plot; x-axis: Sweep; curves: 16-bit, 20-bit, 24-bit]
Figure 4.11b: The Maximum Output Errors of the Eigenvectors for Fixed-point Computation (Hyperion Boston 32 bands)
[Plot; x-axis: Sweep; curves: 16-bit, 20-bit, 24-bit]
The processed matrices V and B of Equations (4.1) and (4.2) will eventually converge to 
the eigenvectors and the eigenvalues, respectively. When executing Equation (4.2), the 
elements of two columns of the V matrix, the converging eigenvectors, are altered (as a result 
of Equations (4.8) and (4.9)). In each sweep, every column is altered N times; hence, 
each element is altered N times. The more alterations applied to the matrix elements, the 
higher the resulting fixed-point output error. Therefore, larger matrices produce larger 
output errors, as presented in Figures 4.11a and 4.11b and the appendix Figures B.5, B.6, B.7 and B.8. 
This applies to both the eigenvectors and the eigenvalues.
Floating-Point Implementations
There are two formats of the floating-point representation: the 32-bit single precision and 
the 64-bit double precision. Figures 4.12 and 4.13 depict the maximum output errors of the 
eigenvectors computation when single precision is considered; this implementation 
resulted in a maximum output error in the range of 10'  ^to 10' .^
Figure 4.12: The Maximum Output Errors of the Eigenvectors for Single Precision Floating Point of the Hyperion Boston
[Plot; x-axis: Sweep; curves: 8 bands, 16 bands, 32 bands, 64 bands]
Figure 4.13: The Maximum Output Errors of the Eigenvectors for Single Precision Floating Point of the AVIRIS Cuprite
[Plot; x-axis: Sweep; curves: one per number of spectral bands]
Figures 4.14 and 4.15 depict the maximum output errors of the eigenvectors computation 
when double-precision floating point is considered; this implementation resulted in a 
maximum output error of less than 10'^ .^ The output accuracy offered by the double 
precision is far higher than that of the single-precision floating point. Therefore, for 
lossy compression, a fixed-point or single-precision floating-point implementation can be 
used; however, these will introduce some distortion to the output data. On the other hand, for 
lossless compression, a double-precision floating-point implementation is required so that no 
distortion is introduced to the output data.
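The gap between the two floating-point formats can be illustrated on a single value. The sketch below uses Python's `struct` module to round a double-precision number to single precision; the helper name `to_single` is hypothetical, and the example only shows the representation error of one value, not the accumulated error of the Jacobi iterations.

```python
import struct


def to_single(x):
    """Round a Python float (IEEE 754 double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


SINGLE_EPS = 2.0 ** -23  # spacing of single-precision values just above 1.0
DOUBLE_EPS = 2.0 ** -52  # spacing of double-precision values just above 1.0

x = 0.1
relative_error = abs(to_single(x) - x) / x
print(relative_error, SINGLE_EPS, DOUBLE_EPS)
```

The relative rounding error of a single-precision value is bounded by 2^-24, roughly nine orders of magnitude larger than the corresponding double-precision bound of 2^-53.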
Figure 4.14: The Maximum Output Errors of the Eigenvectors for Double Precision Floating Point of the Hyperion Boston
[Plot; x-axis: Sweep; curves: 8 bands, 16 bands, 32 bands, 64 bands]
Figure 4.15: The Maximum Output Errors of the Eigenvectors for Double Precision Floating Point of the AVIRIS Cuprite
[Plot; x-axis: Sweeps (1 to 10); curves: 8 bands, 16 bands, 32 bands, 64 bands]
4.4.3 Eigen Mapping Computation Data Format
The eigen mapping is the matrix multiplication of the eigenvectors (N x N) by the MeanSub 
(N x (M * L)), as shown in equation (4.11). Therefore, the input data are fractional numbers 
and the operation is cumulative. The fixed-point maximum output error is presented in 
equation (4.12), where N is the number of spectral bands, α is the number of fractional 
bits and β is the number of integer bits.
$$\text{Maximum Error} = N \times 2^{\beta - \alpha - 1} \qquad (4.12)$$
The output data will be rounded to the nearest integer; thus, in order to eliminate the output 
error, the maximum error of equation (4.12) should be less than 0.5. Therefore, $2^{\beta - \alpha}$ should 
be less than $1/N$, which leads to equation (4.13):
$$\alpha > \beta + \log_2 N \qquad (4.13)$$
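The condition of equation (4.13), that the number of fractional bits must exceed the number of integer bits plus log2 N, can be evaluated directly. The sketch below is illustrative only: the function name is hypothetical and the 16-bit integer length used in the example is an assumed value, not one taken from this work.

```python
import math


def min_fraction_bits(n_bands, integer_bits):
    """Smallest whole number of fractional bits strictly greater than
    integer_bits + log2(n_bands), following equation (4.13)."""
    return math.floor(integer_bits + math.log2(n_bands)) + 1


# Assumed example: 16 integer bits, different numbers of spectral bands.
for bands in (8, 32, 224):
    print(bands, min_fraction_bits(bands, 16))
```

Because the bound grows only with log2 N, doubling the number of spectral bands costs a single extra fractional bit.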
Therefore, when performing the KLT computation process, the fixed-point output errors of 
the BandMean, the MeanSub, the Covariance Matrix and the Eigen mapping processes are 
limited and can be eliminated. Nevertheless, the eigenvectors computation process is more 
complicated, and its iterative nature can make the fixed-point errors 
unmanageable; hence, these output errors can only be eliminated by using a floating-point 
implementation.
4.5 Conclusion
In this chapter, the computational process of the KLT algorithm was investigated thoroughly; 
the computational requirements of each individual process were outlined and the 
dependence of these requirements on the hyperspectral image dimensions was defined. 
The computation of the eigenvectors and the eigenvalues, which is the most complicated 
process in the KLT algorithm, was analysed and different techniques were compared in terms 
of output accuracy and computational requirements. A novel Matrix Reduction Technique 
based on the Jacobi algorithm was proposed; this technique reduces the number of required 
iterations, and the simulation of different data sets of the test data (AVIRIS and Hyperion) 
showed reductions of 20% to 30%. Moreover, a comprehensive error analysis of the fixed- 
and floating-point implementations of the KLT algorithm was presented. From this analysis, the 
required data formats were determined: a fixed-point implementation can result in no 
output error for all the processes except the eigenvectors computation. A simulation of the fixed- 
and floating-point output errors of the eigenvectors computation was also presented; this 
simulation used hyperspectral data from the AVIRIS and the Hyperion imagers.
Chapter 5
Acceleration of the Karhunen-Loève Transform
5.1 Introduction
The KLT algorithm discussed in the previous chapter has very high computational 
requirements. These requirements increase the utilised hardware resources and the power 
consumption when the algorithm is implemented on an embedded platform. Since the target application of 
this work is remote sensing for space applications, where both power and hardware resources 
are major constraints, there is significant demand to perform such intensive computation 
within a limited power and hardware budget. Moreover, the acceleration of the KLT algorithm 
will reduce its latency, which is the main drawback of the KLT algorithm.
In this chapter, the acceleration of the KLT algorithm in the context of hardware platforms will 
be addressed. Section 5.2 reviews previous works that addressed the KLT algorithm; the 
computation flow of the algorithm is presented in section 5.3. The acceleration of the 
eigenvectors computation is addressed in section 5.4, where the Jacobi 
algorithm is demonstrated on the on-chip processors (Cortex M-3 and NIOS II) and the proposed 
Matrix Reduction Technique is investigated in the context of embedded implementation. 
Sections 5.5 and 5.6 present the proposed hardware architecture on both platforms (the Flash-based 
SmartFusion and the SRAM-based Cyclone IV) and outline the required resources 
(hardware and power) and the execution time in detail, highlighting the main benefits over 
previous work. Finally, Section 5.7 presents the conclusion of this chapter.
5.2 Overview
The KLT offers optimal spectral decorrelation from a statistical perspective [171]; so, in terms 
of output compression rate, the KLT exhibits superior performance over other spectral 
decorrelation techniques [172] [173]. Therefore, the acceleration of the KLT algorithm has 
been addressed in various works.
In [174] and [175], a novel KLT hardware architecture was proposed for multispectral image 
compression. The architecture was presented on an FPGA platform, offering massively parallel 
processing of the algorithm that outperforms the implementation on a high-end 
microprocessor. However, since the target application in [174] was multispectral image 
compression, where the number of spectral bands is limited, the computation of the 
eigenvectors was not a major overhead and was performed sequentially on the on-chip 
processor. Hyperspectral images, by contrast, have many more spectral bands, so the 
computation of the eigenvectors requires far more intensive computation and becomes a 
major overhead for the processing time. Therefore, the proposed architecture is not a practical 
approach for hyperspectral image compression.
In [142], the computation of the KLT algorithm was investigated to improve the parallelisation 
of the algorithm at the top level. The work targeted an on-board satellite 
embedded implementation; however, only DSP and multiprocessor platforms were 
considered.
Other works have addressed the computations of the KLT as well, but not in the context of 
embedded implementation. A low-complexity version of the KLT algorithm was proposed in 
[177]; in the proposed version, sampling of the input signal is used to reduce the complexity of the 
covariance matrix computation, but it offers no advantage to the other computation stages. In 
[172], the performance of the KLT technique was compared with the DWT in terms of 
rate-distortion information preservation; the work shows that the KLT significantly outperforms 
the wavelet transforms. In [178], several parallel implementations of the KLT were proposed 
on a parallel computer (2048-processor cluster based on 2.6-GHz AMD Opteron 2218 
dual-core processors); the processing speed and compression performance were investigated. In 
[179], a multilevel clustering technique was proposed to reduce the computational cost and 
increase the scalability of the KLT algorithm.
5.3 Prototyping Platforms
Two hardware platforms will be considered in this work: the Flash-based FPGA System-on-a-Chip 
SmartFusion from Actel (the SmartFusion Evaluation Kit [180]) and the SRAM-based 
FPGA Cyclone IV from Altera (DE2-115 Development Board [181]). The full specifications 
of these boards are presented in Appendix C.
The SmartFusion FPGA fabric is flash based, which is reconfigurable and non-volatile; 
therefore, no external program memory is needed to save the configuration bits. SRAM 
memory blocks are also provided within this FPGA device to increase the operational 
memory. Various characteristics of the SmartFusion make it a suitable choice for this 
research. These characteristics can be outlined as follows:
• Since the target application for this research is satellite imaging, the non-volatile 
reconfigurable Flash-based FPGA, which is more immune to space single event 
radiation effects than SRAM-based FPGAs [22], makes the SmartFusion an attractive 
choice.
• The low power consumption that is offered by the SmartFusion SoC.
• The ARM Cortex M-3 MSS provides valuable computational support for the
FPGA. In other words, the MSS offers further parallel computing options at higher 
system levels. Moreover, the MSS can work as a Power Management Module for the 
entire system. This can be done through the current monitors of the ACE, which can
provide continuous surveillance of the overall power consumption.
• In addition to the power monitoring, the current monitors can help in detecting some 
of the space radiation effects such as SELs. This can be done by detecting any high 
current consumption on-board. In order to limit the damage, the struck part can be 
disabled by the MSS, which has current monitor interrupts.
• The Analogue Computing Engine can provide efficient analogue interfaces for closed 
loop control systems [180]. Thus, the ACE can be utilised in order to incorporate an 
Attitude Determination and Control System (ADCS) for Space applications. 
Therefore, the SmartFusion can be a potential candidate for Multi-functional on-board 
computers for space applications.
However, Flash FPGAs do not offer high densities like their SRAM counterparts. Moreover, 
since SRAM FPGAs account for over 80% of the market [24], more libraries and IP cores 
have been developed and optimised for them. The DE2-115 SRAM-based FPGA offers 
much larger hardware resources and larger on-chip and on-board memory resources. In
addition, this FPGA device offers hardware embedded multipliers [182], which boost the 
performance and reduce the power consumption and development cost. Table 5.1 outlines 
the main differences between these two boards [183] [184].
Table 5.1: The DE2-115 Development Board versus the SmartFusion Evaluation Kit.
SmartFusion Evaluation Kit       DE2-115
SmartFusion A2F200M3             Cyclone IV EP4CE115
FLASH                            SRAM
130-nm                           60-nm
200,000 system gates             114,480 LEs + 532 9-bit Embedded Multipliers
Up to 350 MHz                    Up to 437 MHz
Up to 100 MHz                    Up to 100 MHz
Medium                           Low
Only Static                      Partial and Dynamic
ARM Cortex M-3                   NIOS II
Intelligent Analogue computer    No
64 KBytes                        486 KBytes (On-Chip) + 2 MBytes (On-Board)
No                               128 MBytes (on-board)
256 KBytes (MSS)                 8 MBytes (on-board)
The embedded processor is a substantial part of most SoC platforms; while the SmartFusion 
platform incorporates the ARM Cortex M-3 hard processor, the DE2-115 Board incorporates 
the Altera NIOS II soft processor. The main differences between these processors were discussed in 
section 2.5 and detailed specifications are provided in Appendix C. The performance of these 
processors is comparable in terms of operating frequency and performance efficiency 
(MIPS/MHz). However, since the Cortex M-3 is hardwired (ASIC), it offers lower power 
consumption and faster processing time; on the other hand, the soft processor NIOS II can 
offer more flexibility in terms of available memory (both data and program). Furthermore, 
the NIOS II supports optional hardware Floating-Point Custom Instructions (FPU) to 
boost the performance of floating-point mathematics.
5.4 KLT Computation Flow
As shown in Figure 5.1, the computations of the KLT algorithm can be divided into 5 
processes: Mean Vector, Normalization, Covariance Matrix, Eigenvectors and Eigen 
mapping. When implemented in software (sequentially), the same order is usually followed; 
however, when implemented in hardware, this order can be altered to increase the level of 
parallelism [174]. In particular, the Normalization process can be postponed, so that the covariance 
matrix and the Mean Vector are computed in parallel, followed by the normalization and the 
eigenvectors, as shown in Figure 5.2.
[Flow diagram: input image H (L x M x N) -> Mean Vector (BandMean) -> Normalization (MeanSub) -> Covariance Matrix (cov, N x N) -> Eigenvectors (Eig, N x N) -> Eigen Mapping (Eigenvectors x MeanSub)]
Figure 5.1: The KLT Computation Flow 
[Flow diagram: Stage 1: Covariance Matrix and Mean Vector; Stage 2: Eigenvectors and Normalization; Stage 3: Eigen Mapping]
Figure 5.2: The KLT Computation Flow Proposed in [174]
As was shown in the simulation in section 4.3, the computational cost of the eigenvectors 
increases steeply as the number of spectral bands increases. Since only multispectral 
images were considered in [174], the eigenvectors computation took 
less time than the normalization process. However, since hyperspectral images are 
considered in this thesis, the execution time of the normalization is much less than the 
eigenvectors computation time. Therefore, partial computation of the eigenvectors can make 
the execution of the Eigen Mapping possible before the completion of Stage 2.
The simulation of the error analysis in section 4.4 showed that the fixed-point implementation 
of the eigenvectors computation results in output errors, which in turn can alter the output data 
(compressed data). On the other hand, a floating-point implementation requires larger hardware 
resources. Therefore, only a fixed-point implementation will be considered for the low-power 
SmartFusion platform, which, like Flash FPGAs in general, offers very limited hardware 
resources. The SRAM-based FPGAs have much higher density and offer 
much larger hardware resources; therefore, on the SRAM DE2-115 platform, only a floating-point 
implementation will be considered for the eigenvectors computation process. 
For all the other processes, as shown in section 4.4, the fixed-point output error can 
be eliminated; therefore, fixed-point will be used for these processes.
5.5 Acceleration of the Eigenvectors Computation
The computation of eigenvectors on embedded platforms has been addressed in various 
works, such as [185], [186] and [187]. In [185], a digital signal processor platform was 
considered, where the proposed solution is highly sequential. In [186] and [187], a systolic 
architecture is proposed for multi-processor platforms; however, this architecture is highly 
dependent on the size of the input matrix as it requires processors for an input matrix of N x N 
dimensions. This will require extremely large hardware resources, as the input matrices are 
rather large for the target application of this thesis (KLT for hyperspectral data).
On the other hand, different hardware architectures based on FPGA platforms for 
eigenvectors computation were proposed in [188], [189] and [190]. In [188], a non-systolic 
architecture with a reduced number of hardware resources is proposed; this architecture was 
implemented in [191], where a PCA algorithm is employed for the detection of moving objects. 
In [189] and [190], a systolic architecture based on the solution addressed in [187] was 
proposed. The proposed systems in all these works are based on the CORDIC and the Jacobi 
algorithms, where only fixed-point implementation is considered. The bit-shift-and-add nature 
of the CORDIC algorithm makes it suitable only for fixed-point implementations [192]. The 
utilisation of Look-Up Tables (LUT) for the computation of the eigenvalues was investigated 
in [223].
In this thesis, the eigenvectors computation within the context of hyperspectral image 
compression is considered. Therefore, large symmetric matrices are considered, with a 
particular interest in increasing the output accuracy. The simulation in section 4.4 shows that 
the fixed-point implementation of the eigenvectors computation results in output errors 
that can compromise the integrity of the compressed information, therefore making the 
compression more lossy. In this thesis, the fixed-point implementation of the eigenvectors 
computation will only be considered on the ultra-low-power SmartFusion, which does not 
offer enough FPGA resources for a floating-point implementation. On the Altera DE2-115, 
a floating-point implementation will be considered; this will eliminate or limit the propagation 
of the output errors. Therefore, lossy compression with limited output data loss can be 
achieved for the KLT computing system; and ultimately, lossless compression can be 
achieved for the Integer KLT computing system, as shown in Chapters 6 and 7.
5.5.1 Implementation of the Jacobi Algorithm on the embedded processors
The Jacobi algorithm was thoroughly investigated in section 4.3.3; however, that investigation was 
performed on a desktop PC. In this section, the algorithm is implemented on the embedded 
processors (Cortex M-3 and NIOS II), so that its performance can be assessed 
on embedded platforms. The computation process of the Jacobi algorithm is illustrated in 
Figure 4.3 and in equations (4.1) to (4.9) of Chapter 4. Table 5.2 outlines the occurrence of 
the mathematical operations required for each Jacobi iteration for different matrix sizes. It 
can be noticed that the required number of additions, subtractions and multiplications increases as 
the matrix size increases. On the other hand, the division and trigonometric operations are 
only required for the computation of θ, which is performed once for each iteration; hence, 
these operations occur only once per iteration regardless of the matrix size.
Table 5.2: Operations Occurrence of a Jacobi Iteration for Different Matrix Sizes.
Operation        8x8    16x16   32x32   64x64
Arctangent       1      1       1       1
Sine             1      1       1       1
Cosine           1      1       1       1
Addition         17     33      65      129
Subtraction      19     35      67      131
Multiplication   76     142     270     526
Division         1      1       1       1
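To make the structure of a Jacobi iteration concrete, the following is a minimal textbook-style sketch of the cyclic Jacobi method for a real symmetric matrix. It is not the implementation evaluated in this work: the function names are hypothetical, plain Python lists are used instead of the fixed- or floating-point formats discussed here, and `math.atan2` stands in for the separate division and arctangent listed in Table 5.2.

```python
import math


def jacobi_rotate(a, v, p, q):
    """One Jacobi iteration: rotate so that a[p][q] (and a[q][p]) becomes zero."""
    if a[p][q] == 0.0:
        return
    # Rotation angle: one arctangent, then one sine and one cosine.
    theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
    c, s = math.cos(theta), math.sin(theta)
    n = len(a)
    for k in range(n):  # update columns p and q of a and of the eigenvector matrix v
        a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
        v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
    for k in range(n):  # update rows p and q of a
        a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]


def jacobi_eigen(matrix, sweeps=10):
    """Return (eigenvalues, v); column i of v is the eigenvector of eigenvalue i."""
    n = len(matrix)
    a = [[float(x) for x in row] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        for p in range(n - 1):          # one sweep: every off-diagonal pair once
            for q in range(p + 1, n):
                jacobi_rotate(a, v, p, q)
    return [a[i][i] for i in range(n)], v


values, vectors = jacobi_eigen([[4, 1, 0], [1, 3, 1], [0, 1, 2]])
print(sorted(values))
```

Only the two loops inside `jacobi_rotate` grow with the matrix size, which matches the pattern of Table 5.2: the additions, subtractions and multiplications scale with N while the trigonometric operations stay at one each per iteration.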
Table 5.3 outlines the execution time of the Jacobi algorithm for different matrix sizes on the 
Cortex M-3 and the NIOS II processors; Figure 5.3 shows the execution time of a single 
Jacobi iteration. The NIOS II includes an optional hardware floating-point instruction unit 
(FPU). This unit boosts the floating-point performance of the NIOS processor by a factor 
of 17-20 for addition, subtraction and multiplication operations [193]. However, the 
FPU does not accelerate the trigonometric functions (sine, cosine and 
arctangent) [194]. The trigonometric functions are far more computationally expensive than the 
other operations and require much longer execution time. Therefore, for larger matrices, 
where more additions and multiplications are performed, the presence of the FPU yields a more 
noticeable performance improvement; e.g. for an 8x8 matrix the FPU offers a performance boost 
of 20%, for a 16x16 matrix it offers 35% and for a 32x32 matrix it offers 45%.
The Cortex M-3 does not have a hardware floating-point unit; however, it offers much faster 
computation of the trigonometric functions than the NIOS II. Therefore, the performance
difference between the Cortex and the NIOS with FPU is smaller for larger matrices, as 
shown in Table 5.3 and Figure 5.3.
Table 5.3: The Execution Time (milliseconds) of the Jacobi Algorithm for Different Matrix Sizes
                                    Spectral Bands
Processor          Measurement      8x8     16x16    32x32
Cortex M-3         Iteration        0.07    0.125    0.32
                   Sweep            2       17       152
                   Full Algorithm   16      140      1256
NIOS II with FPU   Iteration        0.19    0.3      0.51
                   Sweep            5.5     36.7     253
                   Full Algorithm   44      294.3    2024
NIOS II no FPU     Iteration        0.25    0.5      1.12
                   Sweep            7       60       560
                   Full Algorithm   57      480      4479
[Chart; matrix sizes: 8x8, 16x16, 32x32; series: Cortex M-3, NIOS With FPU, NIOS With No FPU; axis: 0 to 1200]
Figure 5.3: The Execution Time (microseconds) of a Single Jacobi Iteration 
5.5.2 Matrix Reduction Technique
In section 4.3.3.2, the matrix reduction technique was proposed to reduce the processing time 
of the Jacobi algorithm. However, this technique requires a continuous check for any rows or 
columns that have converged to zero. This can consume processing time and become an overhead 
if not implemented in a systematic and intelligent way. In the algorithm process, Figure 4.3, 
each iteration alters an off-diagonal element so that it eventually converges to 
zero; the amount of this alteration is represented by the value θ. Figure 5.4 shows the 
variation of the value of θ over 8 sweeps (960 iterations) of a 16x16 covariance matrix taken
from the Hyperion Greenland hyperspectral data; similar behaviour can be noticed when 
different hyperspectral data are considered. The fluctuation of θ represents the alterations 
applied to the off-diagonal elements. These alterations are relatively high during the first few hundred 
iterations, then their values decrease until they converge to zero during the last few hundred iterations, 
where all the off-diagonal elements have already converged to zero and no further alteration 
is required. Therefore, θ can be used as a measure of how close the off-diagonal 
elements are to zero and can be used for the zero-check process. The first (N - 1) iterations of 
each sweep are responsible for converging the first row of the matrix; hence, if any of these 
elements has already converged to zero, the corresponding column can be zero-checked. In this way, 
the zero-check process is applied in a systematic manner and only performed when there 
is the potential of a converged eigenvalue.
[Plot; x-axis: Iterations (0 to 900); y-axis: θ (-0.6 to 0.6)]
Figure 5.4: The Variation of θ over the Jacobi Algorithm (Hyperion Greenland Image)
In Figure 5.4, it can be noticed that the value of θ can be zero not only during the last 
iterations, but also at different points during the first few hundred iterations. When θ is zero, the 
iteration will not alter the corresponding element; therefore, these iterations can be bypassed
to reduce the processing time. In fact, since
$$\theta = 0.5 \times \operatorname{atan}\left(\frac{2 B_{i,j}}{B_{j,j} - B_{i,i}}\right)$$
even the computationally expensive atan can be bypassed when $\frac{2 B_{i,j}}{B_{j,j} - B_{i,i}}$ is too small; hence, saving the
computation time of the trigonometric functions.
Moreover, further processing acceleration can be achieved by approximating the arctangent 
and the sine functions when θ is very small, where both the arctangent and the sine are
approximately equal to the value of θ. Figure 5.5 shows the error resulting from this 
approximation for -0.005 < θ < 0.005. For this range, the approximation error is within 
±2 x 10^-8 and ±4 x 10^-8 for the sine and arctangent functions, respectively.
[Plot; x-axis: Value of Theta; y-axis: Approximation Error; curves: Arctangent, Sine]
Figure 5.5: The Approximation Error of Arctangent and Sine Functions for small θ
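The bypass and the small-angle approximation described above can be sketched as follows. The threshold of 0.005 simply mirrors the range plotted in Figure 5.5, and the function and parameter names (`rotation_angle`, `b_ij`, `b_ii`, `b_jj`, `sin_small`) are hypothetical; the thresholds used on the target processors may differ.

```python
import math

SMALL = 0.005  # illustrative threshold, matching the range of Figure 5.5


def rotation_angle(b_ij, b_ii, b_jj, small=SMALL):
    """Rotation angle for one iteration, skipping the atan when possible."""
    if b_ij == 0.0:
        return 0.0  # element already zero: the whole iteration can be bypassed
    diff = b_jj - b_ii
    if diff == 0.0:
        return math.copysign(math.pi / 4.0, b_ij)  # limit of 0.5 * atan(+/- infinity)
    ratio = 2.0 * b_ij / diff
    if abs(ratio) < small:
        return 0.5 * ratio  # atan(x) is approximately x for small x
    return 0.5 * math.atan(ratio)


def sin_small(theta, small=SMALL):
    """sin(theta) is approximately theta for small theta."""
    return theta if abs(theta) < small else math.sin(theta)


# Approximation errors at the edge of the range.
print(abs(math.sin(SMALL) - SMALL), abs(math.atan(SMALL) - SMALL))
```

At the edge of the range the errors follow the leading series terms, θ³/6 for the sine and θ³/3 for the arctangent, i.e. about 2 x 10^-8 and 4 x 10^-8.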
In order to evaluate the benefit of the proposed technique, hyperspectral data sets from the 
AVIRIS Cuprite and the Hyperion Boston images are processed on the NIOS II and the Cortex M-3 
processors, and the execution times are listed in Table 5.4. A noticeable improvement in the 
execution time is realised through the proposed Matrix Reduction Technique (MRT): 17-22% 
for the NIOS II and 24-28% for the Cortex M-3. Since the NIOS II is supported with a 
hardware FPU, it exhibits slightly less acceleration benefit than the Cortex M-3 when using 
the proposed MRT. On the other hand, since the computation of the trigonometric functions 
requires more processing time on the NIOS II, the approximation of θ results in more 
acceleration than on the Cortex M-3.
As shown in section 4.3.3, when all the off-diagonal elements of a certain row or column have
converged to zero, the diagonal element of that row or column has converged to its
eigenvalue, and the corresponding eigenvector has also completely converged. Therefore,
using the proposed MRT, some eigenvectors are computed before the completion 
of the whole eigenvectors computation; these early eigenvectors can be passed 
to the next stage (Eigen mapping) while the eigenvectors computation is still running, 
thereby improving the level of parallelism at the system level, as will be explained in sections
5.5.4 and 5.5.5. Hence, the acceleration of the eigenvectors computation is
not the only benefit of the proposed MRT; it also offers a higher level of parallelism 
when applied within the context of spectral decorrelation using the KLT algorithm.
Table 5.4: The Execution Time (milliseconds) using the Matrix Reduction Technique (MRT)
                                NIOS II with FPU             Cortex M-3
                                8x8     16x16    32x32       8x8     16x16    32x32
Jacobi                          44      294      2024        16      140      1256
Jacobi MRT                      36.7    229      1573        11.6    100      950
Improvement                     17%     22%      22%         27%     28%      24%
Jacobi MRT + θ approximation    24.6    175.6    1330        10.8    93       880
Improvement                     33%     23%      15%         7%      7%       7%
5.6 KLT SoC Architecture on the Altera SRAM FPGA
The system level architecture, Figure 5.6, consists of the NIOS II processor, 3 acceleration co-processor modules, a JTAG module, an SDRAM controller and 2 SDRAM memory chips; in addition, the Altera AVALON on-chip bus interconnects these modules. The NIOS II is responsible for the high level operation management and the scheduling of the computation stages. Since no heavy computations are executed by the NIOS processor, the standard NIOS II core (NIOS II/s) with no hardware floating point unit is employed to save hardware resources. The three accelerators are mainly involved in the computations of their corresponding stages; since both Stage 1 and Stage 3 are fixed point hardware, the two stages share most of their resources. The JTAG module is used during development for programming, testing and debugging. Since the FPGA device used in this research (Cyclone IV) does not have enough memory resources to hold a whole hyperspectral image, the image is stored in the off-chip SDRAM memories and the data are buffered in segments. Other FPGA devices, such as the Cyclone V and the Stratix, have on-chip memories large enough to hold a whole hyperspectral image.
5.6.1 Acceleration of Stage 1
In this stage, both the Mean Vector and the Covariance Matrix are computed in parallel; equations (5.1) and (5.2) outline the mathematical expressions of these processes, respectively. The Mean Vector computation is simply the accumulation of (M × L) elements followed by a division by (M × L); so, by appropriately selecting the spatial dimensions (M × L) as a power of 2, the division can be performed by wire shifting.
A = (1 / (M × L)) × Σ_{i=1, j=1}^{i=M, j=L} H_{i,j},   where A ∈ R^{N×1}   (5.1)
COV = (H × H^T) / (M × L) − A × A^T,   where COV ∈ R^{N×N}   (5.2)
In equation (5.2), the Covariance Matrix computation, A ∈ R^{N×1} and H ∈ R^{N×(M×L)}; therefore, the computational cost of the second term is much lower than that of the first. The result of H × H^T is a symmetric matrix (the product of a matrix and its transpose); hence, only K = N(N+1)/2 elements need to be computed. Each of these elements requires the multiplication and accumulation of (M × L) elements. Therefore, the hardware architecture of Stage 1, Figure 5.7, contains N FIFOs of (M × L) elements processed in parallel by N accumulators (ACC) for the Mean Vector process and up to K multiply-accumulate units (MAC) for the Covariance Matrix process.
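The two Stage 1 computations can be sketched in software as follows. This is a behavioural illustration only, not the hardware design: the function name stage1_sketch, the integer pixel representation and the truncating shift are assumptions, and the band data are hypothetical.

```python
def stage1_sketch(h, shift):
    """h: list of N bands, each a list of M*L integer pixels, with M*L == 2**shift.

    Returns (mean vector, covariance matrix, number of accumulated elements K).
    """
    n = len(h)
    count = len(h[0])
    if count != 1 << shift:
        raise ValueError("M*L must be a power of two for shift-based division")
    # Mean Vector: one accumulator per band, division by shifting.
    mean = [sum(band) >> shift for band in h]
    # Covariance: only the upper triangle is accumulated, then mirrored.
    cov = [[0] * n for _ in range(n)]
    elements = 0
    for i in range(n):
        for j in range(i, n):
            acc = 0
            for p in range(count):  # one multiply-accumulate per pixel
                acc += h[i][p] * h[j][p]
            cov[i][j] = cov[j][i] = (acc >> shift) - mean[i] * mean[j]
            elements += 1
    return mean, cov, elements

mean, cov, k = stage1_sketch([[1, 3, 5, 7], [2, 2, 2, 2]], shift=2)
print(mean, cov, k)
```

For N bands the loop accumulates K = N(N+1)/2 elements rather than N × N, which is the saving that the symmetry of H × H^T provides. Note that the integer shift truncates, so this sketch is exact only when the sums are multiples of M × L.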
[Figure 5.6 block diagram: DE2-115 board with two 64 MB SDRAM chips and a Cyclone IV FPGA containing the NIOS II, the Stage 1, Stage 2 and Stage 3 accelerators, the JTAG module and the SDRAM controller, interconnected by the AVALON bus]
Figure 5.6: The System-on-a-Chip Architecture for the KLT Computation
Since the data resolution of the AVIRIS and the EO-1 Hyperion is 14 bits, the FIFO data width is 14 bits. In concept, the optimal size of the FIFOs is (M × L), so that each FIFO contains a whole spectral band. However, due to the on-chip memory limitations of the FPGA used, the image is segmented and buffered sequentially. The FIFOs used can hold up to 4096 elements; hence, for spatial dimensions of 256 × 256, each spectral band is divided into 16 segments. Since the main objective of this work is the acceleration of the execution process, the input/output time is excluded. On the other hand, since the MAC units require more hardware resources than the other units, the available FPGA resources can be insufficient for K MAC units. Therefore, R units are utilised and the H × H^T computation is performed over K/R rounds. Running at 100 MHz, each accumulation operation takes a single clock cycle and each multiply-and-accumulate operation takes 2 clock cycles; since both run in parallel, they take 2 clock cycles together. Therefore, executing Stage 1 takes 2 × M × L × K/R clock cycles. The hardware resources required for each accumulator (ACC) and multiply-accumulate unit (MAC) are outlined in Table 5.5, where the accumulators utilise LEs (Altera Logic Elements) and the multipliers can utilise either the embedded 9-bit multipliers or only LEs.
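The segment count and the Stage 1 cycle count above can be reproduced with a short calculation. This is a sketch under stated assumptions: the function names are hypothetical, and a partial round is assumed to cost a full round when R does not divide K.

```python
import math

def segments_per_band(m, l, fifo_depth=4096):
    """Number of FIFO-sized segments needed to buffer one spectral band."""
    return math.ceil(m * l / fifo_depth)

def stage1_cycles(m, l, n, r, clock_hz=100e6):
    """Return (K, rounds, clock cycles, time in milliseconds) for Stage 1."""
    k = n * (n + 1) // 2          # elements of the symmetric covariance matrix
    rounds = math.ceil(k / r)     # assumption: a partial round costs a full one
    cycles = 2 * m * l * rounds   # 2 clock cycles per multiply-accumulate
    return k, rounds, cycles, 1e3 * cycles / clock_hz

print(segments_per_band(256, 256))
print(stage1_cycles(128, 128, 8, 9))
```

For a 128 × 128 × 8 data set with R = 9, this gives K = 36, 4 rounds and 131072 clock cycles, i.e. about 1.31 ms at 100 MHz, which is in line with the Stage 1 figure of 1.32 ms reported in Table 5.12.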
Table 5.5: Stage 1 Accelerator Hardware Usage
                         LE     9-bit Multiplier    SRAM bit / M9K
ACC                      32     -                   -
MAC                      32     2                   -
MAC (Only LE)            392    -                   -
FIFO (4096 x 14 bits)    36     -                   57344 / 7
[Figure 5.7 block diagram: N FIFOs (FIFO 1 to FIFO N), each of M × L elements, feeding N accumulators (ACC 1 to ACC N) and R multiply-accumulate units (MAC 1 to MAC R)]
Figure 5.7: The Hardware Architecture of Stage 1 Accelerator
5.6.2 Acceleration of Stage 2
The normalization and the eigenvectors computation are executed in this stage; the former consists of straightforward subtractions that can be performed in parallel, similarly to Stage 1. The eigenvectors computation is the most complicated process of the KLT algorithm, and its computation time increases steeply with the number of spectral bands. Therefore, the acceleration of this process is very significant, especially for hyperspectral data with a large number of spectral bands.
5.6.2.1 Hardware Acceleration of the Eigenvectors Computation
Altera offers ready-made and pre-tested intellectual property blocks named MegaFunctions [195]; these IP blocks are optimised (for speed or area) for Altera FPGA devices. The MegaFunctions library offers solutions for different applications; hence, it offers better optimised performance and significantly reduces development time. Moreover, different floating point functions are also supported [196]; these include basic arithmetic (addition and multiplication) and more complex mathematical functions such as trigonometric and exponential functions. Table 5.6 outlines the MegaFunctions required for the Jacobi algorithm. It can be noticed that the trigonometric and the division functions require more hardware resources and longer execution times. However, these functions are less recurring and the Jacobi algorithm employs them sequentially (once for each iteration); hence, only one module of each function is required. On the other hand, the proposed matrix reduction technique requires only one additional function, a comparator, which requires far fewer hardware resources and much less processing time. Therefore, the overhead introduced by the matrix reduction technique is insignificant in terms of hardware resources and execution time.
The Altera MegaFunctions support only single precision floating point for the trigonometric functions. Therefore, an error simulation, as in section 4.4, was performed in which single precision was used for the trigonometric functions and double precision for all the other functions. The simulation showed that the maximum output error was less than ±10’^ , which is still an acceptable figure according to equation 4.13. Moreover, conversion between single and double precision is supported by the MegaFunctions and requires only 3 clock cycles.
Table 5.6: The Required MegaFunctions for the Jacobi Algorithm Hardware Acceleration
                Execution Time     Hardware Resources
                (# clock cycles)   Single Precision                  Double Precision
IP Function                        Multipliers   Logic Elements      Multipliers   Logic Elements
Adder           14                 0             1100                0             2200
Subtracter      14                 0             1100                0             2200
Comparator      3                  0             94                  0             188
Multiplier      11                 7             325                 18            700
Divider         14                 16            400 + 1 M9K         44            1500 + 1 M9K
Arctangent      36                 36            7000 + 1 Kbits      -             -
Cosine          35                 31            5100                -             -
Sine            36                 31            5300                -             -
Since the Altera Cyclone IV FPGA offers a high volume of SRAM memory (approximately 3.9 Mbits), the eigenvectors and eigenvalues matrices can be held, while being processed, in the FPGA fabric rather than in the NIOS processor; this saves the on-chip bus communication time. Figure 5.8 illustrates the proposed hardware architecture of the Jacobi algorithm accelerator (Stage 2). The processed eigenvectors and eigenvalues are stored in two sets of FIFOs (N FIFOs of N 32-bit or 64-bit elements for single or double precision floating point, respectively); these FIFOs are connected to multiplexing logic that manages the fetching and loading of data to and from the computing logic. The computing logic is responsible for the main computations: the θ Computer estimates θ, sin θ and cos θ; the Eq(3,4,5) Computer computes equations (4.3), (4.4) and (4.5); and 8 multipliers, 2 adders and 2 subtractors compute equations (4.6), (4.7), (4.8) and (4.9). All three parts of the accelerator (FIFOs, Mux Logic and computing logic) are controlled and scheduled by the on-chip NIOS processor.
The computation of equations (4.6), (4.7), (4.8) and (4.9) requires multiplication followed by addition or subtraction. The floating point multiplication takes 11 clock cycles and the addition/subtraction takes 14 clock cycles. In order to reduce the processing time of the multiply-and-add operation, the multiplier processes the next data while the adder/subtractor is processing the current data; therefore, instead of taking 25 clock cycles, the multiply and addition/subtraction operation takes only 16 clock cycles, as shown in the timing simulation in Figure 5.9. For the first element, only the multipliers are busy, while the adder waits for their results; for the last element, the adder processes the multiplication results, while the multipliers have no more data to process. Therefore, the multipliers and the adder cannot work in parallel for the first and last elements, which take 25 clock cycles each, while all the other elements take 16 clock cycles. Consequently, the number of clock cycles required to process a FIFO of size N, that is, to perform equations (4.6), (4.7), (4.8) and (4.9), is given by equation (5.3).
[Figure 5.8 block diagram: two sets of N FIFOs (FIFO 1 to FIFO N) connected through the Mux Logic to the Computing Logic, which contains the θ Computer (producing sin θ and cos θ), the Eq(3,4,5) Computer and the multipliers, adders and subtractors whose results are written back to the FIFOs]
Figure 5.8: The Hardware Architecture for the Jacobi Algorithm
Number of Clock Cycles = 50 + 16 × (N − 2)   (5.3)
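Equation (5.3) can be written as a small helper that makes the two cases explicit: the first and last elements cost a full multiply plus add, and every other element costs the overlapped figure. The function and parameter names are hypothetical; the latencies are those quoted in the text.

```python
def fifo_cycles(n, mul=11, add=14, overlapped=16):
    """Clock cycles to multiply-and-add a FIFO of n elements (n >= 2)."""
    if n < 2:
        raise ValueError("the model assumes at least a first and a last element")
    edge = mul + add                      # first and last elements: 25 cycles each
    return 2 * edge + overlapped * (n - 2)

for n in (8, 16, 32):
    print(n, fifo_cycles(n), n * (11 + 14))   # overlapped versus fully sequential
```

For N = 8 the overlapped schedule needs 146 clock cycles instead of the 200 that a strictly sequential multiply-then-add would take.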
The θ Computer executes equation (4.1), θ = 0.5 × atan(2B_{i,j} / (B_{j,j} − B_{i,i})); this requires an arctangent unit, a subtractor, a divider, a multiplier by 2 and a divider by 2. The latter two do not necessarily require a multiplier or a divider, as one of the operands is always 2. Since the single precision representation of 2 is 2^1 × 1, equations (5.4) and (5.5) illustrate the multiplication and the division by 2, respectively. It can be noticed that floating point multiplication by 2 can be performed by incrementing the exponent by 1, and division by 2 by decrementing the exponent by 1.
[Figure 5.9 timing diagram: ModelSim waveforms of the CLOCK_100, EN, FIFO3, COS, FIFO4, SIN, FIFO10 and ADD1_1 signals; annotations: multiplying the (n+1)th elements takes 11 clock cycles, adding the nth elements takes 14 clock cycles]
Figure 5.9: The Multiply and Add / Sub Timing Diagram (ModelSim Simulation)
(sign × 2^exponent × mantissa) × (2^1 × 1) = sign × 2^(exponent+1) × mantissa   (5.4)
(sign × 2^exponent × mantissa) ÷ (2^1 × 1) = sign × 2^(exponent−1) × mantissa   (5.5)
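Equations (5.4) and (5.5) can be demonstrated on the IEEE 754 single precision bit pattern, where the 8-bit exponent field occupies bits 23 to 30. The sketch below is a software illustration only; the function name is hypothetical, and it deliberately handles only normal values whose result is also normal (zero, subnormals, infinities and NaN are rejected).

```python
import struct

def adjust_exponent(x, delta):
    """Multiply a float32 value by 2**delta by editing its exponent field."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    if exponent in (0, 0xFF) or not 0 < exponent + delta < 0xFF:
        raise ValueError("only normal values that stay normal are handled")
    # Adding delta to the exponent field leaves sign and mantissa untouched.
    return struct.unpack("<f", struct.pack("<I", bits + (delta << 23)))[0]

print(adjust_exponent(3.5, +1))   # multiplication by 2, as in equation (5.4)
print(adjust_exponent(3.5, -1))   # division by 2, as in equation (5.5)
```

Since only an 8-bit increment or decrement is needed, no floating point multiplier or divider is involved, which is the point made in the text.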
Therefore, the θ Computer, shown in Figure 5.10, comprises a subtractor, a divider, a comparator, an arctangent unit, a sine unit and a cosine unit (all floating point), and an increment unit and a decrement unit (8-bit fixed point). The processing is carried out in 5 sequential stages, each requiring a certain number of clock cycles; the dataflow is controlled via a frequency counter. The processed data are saved into intermediate registers to make sure that the input values of a given stage do not vary during processing. The overall processing time for the θ computation is 125 clock cycles if |θ| > 0; otherwise, it is 33 clock cycles and the iteration is bypassed. A ModelSim simulation of the θ Computer is presented in Appendix D.1; the hardware utilisation of the θ Computer is outlined in Appendix E.1.
Equations (4.3), (4.4) and (4.5) are performed by the Eq(3,4,5) Computer, which involves 
multiplications, additions and subtractions. It is possible to execute each of these equations in 
parallel; however, this will require more hardware resources; and, since these equations are of 
low occurrence (once per iteration), the benefit of the parallelism will be insignificant. 
Therefore, this process will be mainly executed sequentially.
[Figure 5.10 block diagram: sequential pipeline of the θ Computer with stage latencies of 14, 14, 3, 36, 1 and 36 clock cycles, including the ATAN unit, the 2θ path and the Bypass Iteration output]
Figure 5.10: The Hardware Architecture for the θ Computer
Figure 5.11 illustrates the hardware architecture of the Eq(3,4,5) Computer, which comprises a multiplier, an adder, a subtractor, control logic, input registers, processing registers and output registers. The processing diagram is depicted in Figure 5.12; it can be noticed that for approximately 50% of the computational process, more than one operation is executed in parallel. The overall processing time of the Eq(3,4,5) computation, including data loading and fetching, is 125 clock cycles.
When both indices i and j are less than N, the input data for the θ Computer and the Eq(3,4,5) Computer (B_{i,j}, B_{i,i} and B_{j,j}) are available before the completion of equations (4.6), (4.7), (4.8) and (4.9); hence, the θ Computer and the Eq(3,4,5) Computer can run in parallel with the computation of these equations. Otherwise, if either of the indices i or j is equal to N, the computation of equations (4.6), (4.7), (4.8) and (4.9) needs to be completed before the θ Computer and the Eq(3,4,5) Computer can start processing. Therefore, the processing time of each iteration varies according to the indices of that iteration: an iteration can take from 50 + 16 × (N − 2) to 300 + 16 × (N − 2) clock cycles.
While the Altera MegaFunction IP cores offer the same clock latency for both single and double precision floating point, the conversion of θ when double precision is used requires an additional 14 clock cycles for each iteration. Table 5.7 outlines the processing time of an iteration according to the values of its indices. In order to evaluate the performance of the proposed architecture, hyperspectral data sets from the AVIRIS Cuprite and the Hyperion Boston images (the same as in Table 5.4) were processed on the Cyclone IV FPGA (at 100 MHz); the execution times are listed in Table 5.8. The proposed Matrix Reduction Technique offers a further acceleration of 28-30% for the tested data. However, the approximation of θ shows insignificant improvement; this is because, when θ is approximated, only the arctangent computing time is saved, as the cosine and sine are executed in parallel, as shown in Figure 5.10.
Table 5.7: The Processing Time of a Single Jacobi Iteration on the Proposed Architecture
Indices range          Processing Time (clock cycles)
                       Single Precision         Double Precision
i or j equal to N      50 + 16 × (N − 2)        64 + 16 × (N − 2)
i and j < N            300 + 16 × (N − 2)       314 + 16 × (N − 2)
[Figure 5.11 block diagram: control logic, input registers, processing registers, output registers, a multiplier, an adder and a subtractor; the arithmetic results are fed back to the control logic]
Figure 5.11: The Hardware Architecture for the Eq(3,4,5) Computer
[Figure 5.12 processing flow diagram: the running multiplication, addition and subtraction operations with their clock cycles (11 per multiplication, 14 per addition or subtraction)]
Figure 5.12: The Processing Flow Diagram for the Eq(3,4,5) Computer
Table 5.8: The Execution Time (milliseconds) using the Matrix Reduction Technique (MRT)
                                8x8      16x16     32x32
Jacobi                          0.667    5         35
Jacobi MRT                      0.47     3.5       25.3
Improvement                     29%      30%       28%
Jacobi MRT + θ approximation    0.468    3.492     25.28
Improvement                     0.5%     0.025%    0.009%
Most of the logic required by the proposed architecture (Figure 5.8) is utilised by the computation components (arithmetic and trigonometric functions), while the other components (FIFOs and control logic) utilise much less logic. Moreover, increasing the size of the processed matrix only increases the logic requirements of the FIFOs and the control logic and does not increase the logic required for the computation components. Consequently, the proposed architecture is largely independent of the input matrix size (number of spectral bands). Table 5.9 outlines the FPGA logic required for the computation components of the proposed architecture. When implemented on the Cyclone IV, this architecture requires 34% of the hardware multipliers and 25% of the logic elements for a single precision implementation, and 57% of the hardware multipliers and 36% of the logic elements for a double precision implementation.
Table 5.9: The FPGA Resources for the Computation Components of the Proposed Architecture
Single Precision                               Double Precision
Multipliers   %    Logic Elements   %          Multipliers   %    Logic Elements   %
177           34   28725            25         304           57   40800            36
5.6.3 Acceleration of Stage 3
After the completion of Stage 2, the MeanSub and the eigenvectors are computed and ready for the Eigen mapping (Stage 3). The Eigen mapping is a matrix multiplication of the eigenvectors (N × N) by the MeanSub (N × (M × L)), as shown in equation (5.6). The main operation of a matrix multiplication is multiply-accumulate; these units have already been utilised in Stage 1 and can be utilised in this stage as well. Figure 5.13 illustrates the hardware architecture of the Stage 3 Accelerator. This architecture comprises R multiply-accumulate units, multiplexing logic and two sets of FIFOs. The first set consists of N FIFOs, each of size N, holding the eigenvectors (each FIFO holds one eigenvector); the second set consists of N FIFOs, each of size (M × L), holding the MeanSub matrix.
[v_1; v_2; ...; v_N] (N × N) × MeanSub (N × (M × L)) = Output Data   (5.6)
The first matrix of equation (5.6) contains the eigenvectors, where each row is an eigenvector; the second matrix contains the MeanSub, where each column contains all the pixels of the same spatial coordinates along the spectral bands. The output is an (N × (M × L)) matrix, where each element is the result of the vector multiplication of an eigenvector with a column of the MeanSub matrix. Therefore, the computation of each element of the output matrix requires N multiply-accumulate operations; hence, since the output matrix contains N × M × L elements, equation (5.6) requires N^2 × M × L multiply-accumulate operations. On the other hand, the throughput of a single MAC is 2N + 1 clock cycles per pixel (one additional clock cycle for saving and zeroing the accumulator). In order to fully map 2 sets of N FIFOs, N^2 multiply-accumulate units would be required. However, as only R multiply-accumulate units are available (from Stage 1), this stage is executed over more than one round, where each round takes M × L × (2N + 1) clock cycles and generates M × L × R pixels. Consequently, operating at 100 MHz, processing a data set of (N × M × L) in this stage requires M × L × N × (2N + 1)/R clock cycles.
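The operation count of the Eigen mapping can be illustrated with a plain software sketch of equation (5.6). This is a behavioural illustration, not the accelerator: the function name is hypothetical, the data are small made-up values, and the MAC units are modelled as a sequential inner loop.

```python
def eigen_mapping(eigvecs, mean_sub):
    """eigvecs: N rows, one eigenvector per row; mean_sub: N rows of M*L pixels.

    Returns (output matrix of N rows by M*L columns, multiply-accumulate count).
    """
    n = len(eigvecs)
    pixels = len(mean_sub[0])
    out = [[0.0] * pixels for _ in range(n)]
    macs = 0
    for r in range(n):             # one eigenvector per output row
        for p in range(pixels):    # one spatial position per output column
            acc = 0.0
            for b in range(n):     # N multiply-accumulate operations
                acc += eigvecs[r][b] * mean_sub[b][p]
                macs += 1
            out[r][p] = acc        # save the result and zero the accumulator
    return out, macs

out, macs = eigen_mapping([[1, 1], [1, -1]], [[1, 2, 3], [4, 5, 6]])
print(out, macs)
```

With N = 2 and M × L = 3 the sketch performs 12 multiply-accumulate operations, i.e. N^2 × M × L, as stated in the text.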
[Figure 5.13 block diagram: a set of N FIFOs of size N (eigenvectors) and a set of N FIFOs of size M × L (MeanSub), connected through multiplexing logic to R multiply-accumulate units (MAC 1 to MAC R)]
Figure 5.13: The Hardware Architecture of Stage 3 Accelerator
5.6.4 Hardware Utilisation and Processing Time
The processing time and the required hardware resources have been addressed for the individual components. The main design parameter in Stage 1 and Stage 3 is the number of utilised MAC units, R. In principle, utilising more MAC units results in faster computation at the cost of higher power consumption. Nevertheless, for each of the stages (1 and 3) there is an optimal R (R = N(N+1)/2 for Stage 1 and R = N^2 for Stage 3); using this optimal R allows the computation of that stage to be executed in a single round. However, since the optimal R is a function of N, the number of spectral bands, it can be too large when the input data are hyperspectral images. Therefore, selecting R as a divisor of the optimal R can be a more realistic approach to meet the hardware budget of the FPGA device. Table 5.11 outlines the hardware and power requirements of the proposed architecture when implemented on the target FPGA (Cyclone IV EP4CE115). Owing to the RAM limitations of the FPGA used, only a segment of 128 × 128 × 8 pixels is considered; this requires about 81% of the available RAM. The Altera Quartus II PowerPlay Analyser was used to estimate the power consumption. Since the interface with the hyperspectral sensor and any external memories is beyond the scope of this research, the input/output power consumption was excluded and only the dynamic and static power consumption of the core logic was considered.
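The optimal values of R and their divisors can be enumerated with a few lines of code. The function names below are hypothetical and the sketch only restates the arithmetic of the paragraph above.

```python
def optimal_r(n):
    """Return (optimal R for Stage 1, optimal R for Stage 3) for n spectral bands."""
    return n * (n + 1) // 2, n * n

def divisors(value):
    """All positive divisors of value, in ascending order."""
    return [d for d in range(1, value + 1) if value % d == 0]

for n in (8, 16, 32):
    stage1, stage3 = optimal_r(n)
    print(n, stage1, stage3, divisors(stage1))
```

For 8, 16 and 32 spectral bands the optimal Stage 1 values are 36, 136 and 528; the values of R used in Table 5.12 (for example 9, 18 and 36 for 8 bands) are divisors of these, so that the covariance computation completes in a whole number of rounds.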
The following can be noticed from Table 5.11:
• The maximum R that could be realised for single precision computations is 256, which is the optimal R for the Eigen mapping of 16 spectral bands.
• The maximum R that could be realised for double precision computations is 136, which is the optimal R for the covariance computation of 16 spectral bands.
• When the hardware multipliers are fully used and Logic Elements are utilised instead (R > 136), a significant increase in the power consumption can be seen: when R increases from 64 to 128, the power consumption increases by 9%, whereas when it increases from 128 to 256, the power consumption increases by more than 100%. This is because the hardware multipliers consume much less power (experimental results show that an LE based multiplier requires almost 9 times more power than a hardware multiplier). Therefore, avoiding LE based multipliers can significantly reduce the power consumption, which is a major requirement for space applications.
Table 5.11: The Hardware and Power Resources of the Proposed Architecture
       Single Precision                                    Double Precision
R      Multipliers  %     LEs      %    Power (mW)*        Multipliers  %     LEs     %    Power (mW)*
8      197          37    37000    32   608                324          61    49000   43   767
16     213          40    37300    32   616                340          64    49300   43   775
32     245          46    38000    33   631                372          70    50000   44   791
36     253          48    38000    33   634                380          71    50000   44   794
64     309          58    40000    35   670                436          82    51000   45   820
128    437          82    42000    37   728                532          100   64000   56   981
136    453          85    42000    37   733                532          100   71000   62   1053
256    532          100   114000   99   1494               -            -     -       -    -
(*) For the power estimation, a 1.2 V core supply voltage was considered at an ambient temperature of 25°C
In order to assess the performance of the proposed architecture, different data sets of the hyperspectral images AVIRIS Cuprite and Hyperion Boston (the same as in Table 4.3) were used as test data, as shown in Table 5.12. The processing time of each individual stage is stated for different values of R. The overlap time is the difference between the time at which the computation of all the eigenvectors is complete and the time at which the first eigenvectors are ready for the Eigen mapping; hence, the overlap represents the extra parallelism offered by the proposed Matrix Reduction Technique (MRT).
It can be noticed from Table 5.12:
• The processing times of Stage 1 and Stage 3 depend on both the spectral and the spatial dimensions, while that of Stage 2 depends only on the spectral dimension.
• The share of Stage 2 in the overall processing time is much more significant for a larger number of spectral bands (for 8 spectral bands: up to 33%; for 16 spectral bands: up to 68%; for 32 spectral bands: up to 91%).
• The overlap time is independent of the spatial dimensions and increases steeply with the spectral dimension; hence, it is much more significant for a larger number of spectral bands (for 8 spectral bands: up to 4.89%; for 16 spectral bands: up to 11.82%; for 32 spectral bands: up to 18.4%). Therefore, the benefit of the proposed MRT is greater for a larger number of spectral bands, that is, for hyperspectral rather than multispectral images.
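Table 5.12: The Execution Time (milliseconds) of the Proposed KLT Architecture
Image Size         R     Stage 1   Stage 2   Stage 3   Overlap Time   Overlap Percentage   Total Time
128 x 128 x 8      9     1.32      0.47      2.8       0.07           1.54                 4.52
128 x 128 x 8      18    0.66      0.47      1.4       0.07           2.84                 2.46
128 x 128 x 8      36    0.33      0.47      0.7       0.07           4.89                 1.43
256 x 256 x 8      9     5.28      0.47      11.2      0.07           0.41                 16.88
256 x 256 x 8      18    2.64      0.47      5.6       0.07           0.81                 8.64
256 x 256 x 8      36    1.32      0.47      2.8       0.07           1.55                 4.52
512 x 512 x 8      9     21.12     0.47      44.8      0.07           0.1                  66.32
512 x 512 x 8      18    10.56     0.47      22.4      0.07           0.21                 33.36
512 x 512 x 8      36    5.28      0.47      11.2      0.07           0.41                 16.88
128 x 128 x 16     17    2.64      3.5       5.6       0.5            4.45                 11.24
128 x 128 x 16     34    1.32      3.5       2.8       0.5            7.02                 7.12
128 x 128 x 16     68    0.66      3.5       1.4       0.5            9.8                  5.1
128 x 128 x 16     136   0.33      3.5       0.7       0.5            11.82                4.23
256 x 256 x 16     17    10.56     3.5       22        0.5            1.41                 35.56
256 x 256 x 16     34    5.28      3.5       11        0.5            2.59                 19.28
256 x 256 x 16     68    2.64      3.5       5.5       0.5            4.49                 11.14
256 x 256 x 16     136   1.32      3.5       2.75      0.5            7.07                 7.07
512 x 512 x 16     17    42.24     3.5       88        0.5            0.37                 133.24
512 x 512 x 16     34    21.12     3.5       44        0.5            0.73                 68.12
512 x 512 x 16     68    10.56     3.5       22        0.5            1.4                  35.56
512 x 512 x 16     136   5.28      3.5       11        0.5            2.59                 19.28
128 x 128 x 32     33    5.28      25.3      11        5.2            14.29                36.38
128 x 128 x 32     66    2.64      25.3      5.5       5.2            16.64                31.24
128 x 128 x 32     132   1.32      25.3      2.75      5.2            18.18                28.6
128 x 128 x 32     264   0.66      25.3      1.88      5.2            18.3                 27.8
256 x 256 x 32     33    21.12     25.3      44        5.2            6.1                  85.22
256 x 256 x 32     66    10.56     25.3      22        5.2            9.87                 52.66
256 x 256 x 32     132   5.28      25.3      11        5.2            14.29                36.38
256 x 256 x 32     264   2.64      25.3      5.5       5.2            18.41                28.24
512 x 512 x 32     33    84.48     25.3      176       5.2            1.85                 280.58
512 x 512 x 32     66    42.24     25.3      88        5.2            3.46                 150.34
512 x 512 x 32     132   21.12     25.3      44        5.2            6.1                  85.22
512 x 512 x 32     264   10.56     25.3      22        5.2            9.87                 52.66
(The Stage 2 and Overlap Time values are shared by all rows with the same number of spectral bands.)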
5.6.5 Advantage of the Proposed Architecture
In [174], a similar SoC architecture for the KLT computation was proposed; its computation process is shown in Figure 5.2. However, the architecture proposed in this thesis offers a further level of parallel computing, as shown in Figure 5.14. Using the Matrix Reduction Technique proposed in Stage 2, some eigenvectors can be computed and made ready for the next stage (Eigen mapping) before the overall eigenvectors computation is completed. Therefore, Stage 3 can commence before the eigenvectors process is finished, as shown in Figure 5.14.
[Figure 5.14 flow diagram, upper part: computation flow of [174], with Mean Vector and Covariance Matrix (Stage 1), Normalization and Eigenvectors (Stage 2) and Eigen Mapping (Stage 3) executed one after another; lower part: computation flow of the proposed architecture, in which the Eigen Mapping overlaps with the Eigenvectors computation]
Figure 5.14: The Computation Flow of the Proposed Architecture
The experimental results of the previous subsection have shown that the additional level of parallelism (overlap) is more significant for a larger number of spectral bands (hyperspectral rather than multispectral); hence, the benefit of the proposed architecture is more noticeable for hyperspectral data. Moreover, the proposed matrix reduction technique can also improve the level of parallelism in other applications in which floating point eigenvectors or eigenvalues are required and can be used partially in the following stage.
5.7 KLT SoC Architecture on the SmartFusion Flash FPGA
The Flash FPGA offers fewer hardware logic resources; in addition, the on-chip processor of the SmartFusion (Cortex M-3) is more computationally powerful than the NIOS II, as was shown earlier in this chapter (Table 5.4, Table 5.5 and Figure 5.3). Therefore, unlike in the proposed architecture on the Altera FPGA, the on-chip processor can take on more of the computations in the SmartFusion. The KLT algorithm is mapped onto the SmartFusion SoC by dividing the constituent computational processes between the embedded Cortex M-3 processor and the hardware accelerator on the FPGA fabric, using two approaches as detailed in the next subsections.
5.7.1 Approach 1
The computational requirements discussed in section 4.3 are taken into consideration when the hardware/software co-design is carried out. A hardware accelerator (co-processor) is built within the FPGA fabric. The highly recurring operations are performed in the FPGA fabric to accelerate the execution, while the less recurring ones, the high level management and the task scheduling are executed by the embedded Cortex M-3 processor. While the FPGA logic fabric runs at 50 MHz, the Cortex M-3 runs at 100 MHz. The flow chart of the computation process is shown in Figure 5.15. The co-processor's tasks (on the right-hand side, shown in red) are:
• Covariance: H^T × H
• Eigen: equations (4.6) and (4.7)
• Eigen x SubMean multiplication
Therefore, after the initialization of the system, the Cortex M-3 processor performs the BandMean and SubMean processes, which are much less computationally intensive. The covariance process (Figure 4.2) involves a multiplication of large matrices (H and H^T), a vector multiplication, two scalar multiplications and a matrix subtraction. In this process, the multiplication of H and H^T is far more computationally intensive than the other operations. Therefore, this matrix multiplication is executed by the hardware co-processor. Mathematically, the result of H^T × H is an N×N symmetric matrix; therefore, only N(N+1)/2 elements need to be computed rather than N×N elements.
For the eigenvectors computation, the highly recurring equations (4.6) and (4.7) are performed in the hardware co-processor, while all the other equations are executed on the Cortex M-3. Each of equations (4.6) and (4.7) requires two multipliers and an adder/subtractor. The on-chip processor computations are in floating point, while the hardware co-processor operates only in fixed point. Since equations (4.8) and (4.9) are responsible for the eigenvectors and equations (4.6) and (4.7) are responsible for the eigenvalues, the fixed point error has less effect on the computed eigenvectors.
Finally, the computationally intensive Eigen mapping (Eigen x SubMean matrix multiplication) is also performed within the FPGA co-processor. In order to use the FPGA resources efficiently, the multipliers are shared between these tasks.
[Figure 5.15 flow chart: System Initialization, BandMean, SubMean, Covariance, Eigenvectors and Eigen x SubMean, divided between the CORTEX-M3 and the FPGA]
Figure 5.15: The Proposed Computation Flow for the SmartFusion Architecture
Figure 5.16 illustrates the block diagram of the proposed system, which consists of:
• ARM Cortex-M3 processor
• AMBA on-chip bus for data communications between the Cortex and the FPGA fabric
• Four FIFOs to store the data being processed; the size of each FIFO equals the number of spectral bands
• 2 registers for holding sin θ and cos θ
• Two multipliers and two accumulators
• An adder/subtractor controlled by the GPIO of the Cortex
• Multiplexing logic controlled by the GPIO of the Cortex
• A control unit to manage the data traffic to the FIFOs
[Figure 5.16 block diagram: the Cortex M-3 connected to the hardware co-processor in the FPGA fabric]
Figure 5.16: The Block diagram of the Proposed Computation (Approach 1)
During the matrix multiplications, a pair of FIFOs is being filled in with data, while the 
contents of the other FIFOs are being processed. Thus, when the first pair of FIFOs is filled 
up, the multiply-accumulate result of the other FIFOs is readily available.
5.7.2 Approach 2
In approach 2, only the covariance and the matrix multiplication (Eigen x SubMean) are executed within the hardware co-processor. The rationale behind this is as follows:
• In approach 1, part of the eigenvectors computation was performed in a fixed point implementation; therefore, in order to eliminate the fixed point error completely, the computation of the eigenvectors should be executed entirely in floating point on the Cortex M-3, as the FPGA hardware does not offer sufficient resources for floating point.
• In approach 1, the hardware accelerator utilises a 32-bit data width; however, only the eigenvectors computations require 32-bit operations. Both the covariance and the matrix multiplication (Eigen x SubMean) are performed on a 14-bit wide data path. Therefore, excluding the eigenvectors computations from the hardware co-processor means that all the operations are performed on a 14-bit wide hardware data path, hence providing more hardware resources for further parallelism.
Figure 5.17 illustrates the block diagram of approach 2. Since the multiplier IP supported by 
Actel can only run at up to 50 MHz, a scheduling technique is utilized to provide a 
throughput equivalent to 100 MHz. Figure 5.18 depicts the scheduling technique 
used in this approach, in which a de-multiplexer passes the input data to one of the 
multipliers at each clock cycle. The multipliers run simultaneously at 50 MHz. The outputs of 
the multipliers are added by an adder controlled by a delayed 50 MHz clock to ensure that the 
correct data are available on the multipliers' outputs.
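The effect of this scheduling can be illustrated with a small behavioural model. The Python sketch below is only an illustration under stated assumptions: the function name `mac_cycles` and the cycle-counting model (each half-rate multiplier is busy for two fast clock cycles per operation) are hypothetical and are not taken from the actual design.

```python
from typing import Sequence, Tuple


def mac_cycles(a: Sequence[int], b: Sequence[int], num_multipliers: int) -> Tuple[int, int]:
    """Multiply-accumulate a and b, counting fast (100 MHz) clock cycles.

    Assumption: every multiplier runs at half the fast clock, so it is busy
    for two fast cycles per operation. A de-multiplexer hands operand pairs
    to the multipliers in round-robin order.
    """
    busy_until = [0] * num_multipliers   # fast cycle at which each multiplier is free
    acc = 0
    cycle = 0
    for i, (x, y) in enumerate(zip(a, b)):
        m = i % num_multipliers              # de-multiplexer select
        cycle = max(cycle, busy_until[m])    # stall if the selected multiplier is busy
        busy_until[m] = cycle + 2            # half-rate multiplier: two fast cycles
        acc += x * y                         # product is added to the accumulator
        cycle += 1
    return acc, max(busy_until)


acc2, cycles2 = mac_cycles([1, 2, 3, 4], [5, 6, 7, 8], num_multipliers=2)
acc1, cycles1 = mac_cycles([1, 2, 3, 4], [5, 6, 7, 8], num_multipliers=1)
print(acc2, cycles2, cycles1)
```

In this model, two interleaved multipliers accept one operand pair per fast cycle, whereas a single half-rate multiplier accepts one pair every two cycles.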
[Figure 5.18 block labels: Data A, Data B, DeMux, CLK Divider, MULT1, MULT2]
Figure 5.18: The Block diagram of the Multiplication Unit (Approach 2)
[Figure 5.17 block labels: Cortex M-3, AMBA, FIFO 1, FIFO 3, Accumulator]
Figure 5.17: The Block diagram of the Proposed Computation (Approach 2)
On a higher system level, when two FIFOs are filled up, their corresponding multiply-and-accumulate 
logic processes their contents while the other two FIFOs are being filled. 
Since the eigenvectors form an N×N matrix (N being the number of spectral bands), the 
(Eigen x SubMean) multiplication requires multiplying each N×1 eigenvector with all N×1 
vectors along the SubMean matrix. Therefore, to reduce the on-chip communication time 
between the Cortex processor and the FPGA, two N×1 eigenvectors are sent to FIFO 1 and 
FIFO 3 at each loop cycle, and then all the vectors of the SubMean matrix are sent two-by-two 
to FIFO 2 and FIFO 4.
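The loop ordering described above can be sketched in software. The following Python fragment is a minimal sketch, not the actual firmware: the names `dot` and `eigen_mapping_pairs` are hypothetical, and the exact assignment of vectors to FIFOs and accumulators is not modelled. It only shows that each pair of eigenvectors is held fixed while all SubMean vectors are streamed past it, so that every eigenvector is transferred once.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Multiply-accumulate of two equally sized vectors."""
    return sum(x * y for x, y in zip(u, v))


def eigen_mapping_pairs(eigvecs: List[List[float]],
                        pixels: List[List[float]]) -> List[List[float]]:
    """Eigen x SubMean with eigenvectors processed two at a time.

    eigvecs: N eigenvectors, each of length N.
    pixels:  SubMean spectral vectors, each of length N.
    Returns an N x len(pixels) matrix of mapped values.
    """
    n = len(eigvecs)
    out = [[0.0] * len(pixels) for _ in range(n)]
    for r in range(0, n, 2):
        pair = eigvecs[r:r + 2]            # loaded once per outer loop cycle
        for p, x in enumerate(pixels):     # all SubMean vectors are streamed
            for offset, e in enumerate(pair):
                out[r + offset][p] = dot(e, x)
    return out


print(eigen_mapping_pairs([[1, 0], [0, 1]], [[3, 4], [5, 6]]))
```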
5.7.3 Discussion of Experimental Results
Table 5.13 summarizes the hardware resources required to implement the co-processor 
using both proposed approaches (including the AMBA bus interface). Since Flash 
FPGA devices usually offer far fewer hardware resources than SRAM FPGAs, more than 
80% of the FPGA fabric was utilized in both approaches. The SmartPower tool, part 
of the Actel Libero design suite, was used to estimate the power consumption. The 
power consumption of both approaches is outlined in Table 5.14, which shows that it is less 
than 0.25 W. It can be noticed that the power consumption of this architecture is much lower 
than that of the architecture proposed for the Altera SRAM FPGA in the previous section.
Table 5.13: The Hardware Resources of the Proposed SmartFusion Architecture
Approach      Resource                      Used    Total   Percentage
Approach 1    FPGA Fabric (System Gates)    4248    4608    92%
Approach 1    Embedded SRAM (Blocks)        4       8       50%
Approach 2    FPGA Fabric (System Gates)    3780    4608    82%
Approach 2    Embedded SRAM (Blocks)        4       8       50%
Table 5.14: The Power Consumption of the Proposed SmartFusion Architecture
Approach      Static     Dynamic      Total
Approach 1    8.99 mW    215.4 mW     224.39 mW
Approach 2    8.99 mW    204.59 mW    213.58 mW
In order to assess the performance of the proposed system, the KLT algorithm was 
implemented on the embedded Cortex M-3 processor. Its performance was compared with the 
performance of the proposed SoC architectures, operating at 100 MHz. The performance, in 
terms of execution time, is outlined in Table 5.15, where the processing time of each 
process of the KLT algorithm is presented. The performance on the Cortex M-3 
illustrates the computational intensity of each process. It can be noticed that the Eigen mapping 
(Eigen x SubMean), eigenvectors and covariance computations are the most 
computationally intensive operations, as they consume more than 95% of the overall 
processing time.
The BandMean process requires only sequential addition and division operations on a very 
large set of data and, if implemented on the hardware co-processor, would lead to an intensive 
exchange of data between the Cortex M-3 and the FPGA fabric, which would consume a 
significant amount of time. Therefore, it cannot be efficiently implemented on the hardware 
co-processor. The same applies to the MeanSub process, where only subtraction operations are 
involved.
Table 5.15: The Execution Time (seconds) of the Proposed KLT Architecture (SmartFusion)
Data set         Process         Cortex    App 1     Improvement %   App 2     Improvement %
256 x 256 x 8    BandMean        0.0224    0.0056    -               0.0056    -
                 MeanSub         0.0224    0.0056    -               0.0056    -
                 Covariance      0.16      0.109     31.61%          0.086     46.3%
                 Eigenvectors    0.0108    0.0087    19.4%           0.0108    -
                 Eigen Mapping   1.39      0.913     34.31%          0.564     59.4%
                 Overall KLT     1.6       1.04      35%             0.67      58.1%
256 x 256 x 16   BandMean        0.0448    0.0112    -               0.0112    -
                 MeanSub         0.0448    0.0112    -               0.0112    -
                 Covariance      0.32      0.218     31.61%          0.172     46.3%
                 Eigenvectors    0.093     0.073     24%             0.093     -
                 Eigen Mapping   2.78      1.826     34.31%          1.128     59.4%
                 Overall KLT     3.28      2.14      34.7%           1.415     56.8%
256 x 256 x 32   BandMean        0.0896    0.0224    -               0.0224    -
                 MeanSub         0.0896    0.0224    -               0.0224    -
                 Covariance      0.64      0.436     31.61%          0.344     46.3%
                 Eigenvectors    0.88      0.67      24%             0.88      -
                 Eigen Mapping   5.56      3.65      34.34%          2.256     59.4%
                 Overall KLT     7.26      4.8       33.8%           3.525     51.4%
Approach 1: The iterative nature of the eigenvectors computations makes the hardware 
acceleration very efficient. As can be noticed from Table 5.15, the execution time is 
reduced by more than 50%. Accelerations of 31-34% could be achieved on the 
covariance and the matrix multiplication (Eigen x SubMean) processes. In conclusion, the 
novel SoC architecture of approach 1 increases the execution speed of the overall KLT 
algorithm by 33-35%.
Approach 2: The further parallelism offered in this approach exhibits a noticeable 
improvement in the performance. An acceleration of 46% in the covariance process and 59% 
in the matrix multiplication (Eigen x SubMean) is achieved, leading to an overall acceleration 
of 54% (saving more than half the processing time).
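As a worked check of how the improvement figures relate to the execution times, the short Python sketch below recomputes two entries of Table 5.15. It assumes that the improvement is defined as the relative reduction in execution time with respect to the Cortex M-3 implementation; the function name `improvement_percent` is hypothetical.

```python
def improvement_percent(t_reference: float, t_accelerated: float) -> float:
    """Relative reduction in execution time, in percent (assumed definition)."""
    return 100.0 * (t_reference - t_accelerated) / t_reference


# Overall KLT, 256 x 256 x 8 data set, values taken from Table 5.15
print(round(improvement_percent(1.6, 1.04), 1))   # Approach 1
print(round(improvement_percent(1.6, 0.67), 1))   # Approach 2
```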
Since the SmartFusion FPGA offers very limited SRAM memory resources, the processed 
data had to be passed through the on-chip bus between the Cortex M-3 and the FPGA. Since 
the processed data are very large, the on-chip bus communication is a major overhead in the 
processing time. Therefore, the acceleration of the proposed SmartFusion architecture would 
have been much more significant if there were enough on-chip SRAM memory to store the 
processed data.
5.8 Conclusion
In this chapter, the acceleration of the KLT computations on SoC FPGA platforms was 
addressed, and a novel hardware architecture for accelerating the KLT was proposed. 
Compared to the previously proposed hardware architectures [174] [175], the proposed 
architecture offers a further level of parallelism, which is more significant for hyperspectral 
data with a large number of spectral bands. When testing hyperspectral data on the proposed 
system, it shows an overall improvement of up to 4.9%, 11.8% and 18.4% for 8, 16 and 32 
spectral bands, respectively. This architecture was presented on both the Flash Actel SmartFusion 
and the SRAM Altera DE2-115 FPGA platforms for low power and high-performance 
applications, respectively. The performance of the proposed architecture was assessed 
thoroughly in terms of processing time, required hardware resources and power consumption. 
For the low power platform, the processing time of a hyperspectral data set of 256 x 256 x 
32 was approximately 3.5 seconds, with an acceleration of more than 50% and a power 
consumption of less than 0.25 Watt. For the high-performance platform, the processing time 
of the same data set was less than 37 milliseconds, with a power consumption of 
approximately 1.05 Watt.
Chapter 6
Investigation of the Integer Karhunen-Loève Transform
6.1 Introduction
The computation of the Karhunen-Loève Transform was investigated in Chapter 4 and the 
hardware acceleration was discussed in Chapter 5. However, the KLT results in rounding 
errors, which introduce distortion to the output data; hence, it leads to lossy 
compression. In order to avoid the data distortion, a modified KLT was proposed in [197]; the 
modified KLT, named Integer KLT or Reversible KLT, results in limited rounding errors 
that do not introduce any distortion to the output data, and thus provides lossless compression.
In this chapter the computation of the Integer KLT is addressed; an overview of the integer 
KLT will be presented in Section 6.2. The computational requirements of the Integer KLT 
will be discussed in Section 6.3. In Section 6.4, the fixed point and floating point 
implementation of the Integer KLT computation will be investigated so the design 
considerations can be defined for lossless compression. Section 6.5 outlines the differences in 
the computational requirements between the KLT and the Integer KLT algorithm.
6.2 Overview of the Integer KLT
In the KLT computation process, the output data is the result of Eigen mapping, which is the 
multiplication of the MeanSub matrix by the eigenvectors. Since both the MeanSub and the 
eigenvectors are fractional numbers, the output has to be rounded, which compromises the 
output accuracy and therefore makes the process lossy. In order to maintain the output accuracy, the 
Eigen mapping process needs to be performed on integer data rather than fractional data. For 
the MeanSub data, this is straightforward and can be done by rounding the mean vectors, so 
that the MeanSub will be integer data. However, for the eigenvectors this is a much more 
complicated issue. The eigenvectors matrix needs to be decomposed through matrix 
factorizations into a set of matrices, which can be applied to the MeanSub data through a 
lifting scheme.
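The rounding of the band mean can be illustrated as follows. This is a minimal sketch rather than the implementation used in this work: the function name `integer_mean_sub` is hypothetical, and round-half-up is assumed as the rounding rule.

```python
import math
from typing import List, Tuple


def integer_mean_sub(band: List[int]) -> Tuple[int, List[int]]:
    """Subtract the rounded band mean so that the MeanSub data stay integer.

    Assumption: round-half-up is used for the mean (floor(x + 0.5)).
    """
    mean = math.floor(sum(band) / len(band) + 0.5)
    return mean, [value - mean for value in band]


mean, mean_sub = integer_mean_sub([10, 11, 13])
print(mean, mean_sub)
```

Because the mean itself is an integer, every MeanSub value is an exact integer and the original pixel can be recovered by adding the mean back.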
By employing a reversible matrix factorization technique proposed in [198], P. Hao and Q. 
Shi proposed a reversible form of the KLT algorithm [197] (Integer KLT). In the 
technique proposed in [198], it is suggested that any non-singular N×N matrix with a determinant of 
±1, such as the eigenvectors matrix, can be decomposed into several Elementary Reversible 
Matrices (ERMs) using two types of linear transforms: Single-Row Elementary Reversible 
Matrices (SERM) and Triangular Elementary Reversible Matrices (TERM). However, in 
[199], it was shown that the pivoting technique suggested in [198] can result in output errors for 
a large number of spectral bands. Therefore, the TERM transform proposed in [198] was 
employed along with a quasi-complete pivoting technique in [199]. Consequently, the matrix 
factorization will decompose the eigenvectors matrix into four matrices:
• A permutation matrix P
• Two lower triangular matrices L and S
• An upper triangular matrix U
Therefore, this matrix factorization is referred to as the PLUS factorization, which will be 
illustrated in the next section.
Following the matrix factorization, the output matrices will be applied to the MeanSub data 
through a lifting scheme. Therefore, the Eigen mapping is applied through this lifting scheme, 
which is employed in a certain manner so that the rounding errors do not lead to any 
distortion to the output data; hence, maintaining lossless compression.
6.3 Computational Process of the Integer KLT
6.3.1 Overall Computation Process
The computation process of the Integer KLT is illustrated in Figure 6.1. The similarity 
between the computation of the KLT and the Integer KLT can easily be noticed. The differences are 
highlighted in the red boxes as follows:
• The rounding of the BandMean Vector
• The factorization of the eigenvectors
• The lifting scheme for Eigen mapping
The rounding of the BandMean Vector is a straightforward process. Moreover, from an 
FPGA point of view, the rounding is done through wire shifting and comparison; thus, its 
computational requirements are negligible. On the other hand, the PLUS factorization is 
executed through iterative algorithms, and the lifting scheme requires heavy computations. 
Therefore, the computation process and requirements of the PLUS factorization and the 
lifting scheme will be discussed in the next subsections.
[Figure 6.1 flow: input image H (L x M x N) → BandMean = Round(Mean 1 ... Mean N) → MeanSub → Cov → Eig → PLUS Factorization (Eig) → Final Data = Lifting (PLUS, MeanSub)]
Figure 6.1: The Computation Process of the Integer KLT Algorithm
6.3.2 PLUS Matrix Factorization
In linear algebra, matrix factorizations are usually employed to restate a certain problem in 
such a way that it can be solved more efficiently [200]. Moreover, these techniques are very 
useful in image processing and lossless transform coding. In the Integer KLT, the PLUS matrix 
factorization is used so that the eigenvectors matrix can be represented by four matrices rather 
than one. These four matrices (P, L, U and S) maintain the same information as the original 
eigenvectors matrix, so when applying the Eigen mapping process, these matrices are used as 
the eigenvectors. Therefore, by applying the Eigen mapping with these four matrices, the 
output error incurred from the rounding of the fractional numbers will be eliminated, hence 
maintaining a lossless process.
The input data of the matrix factorization process is the eigenvectors matrix A, as shown in 
Figure 6.2. Therefore, the dimensions of the input matrix are N×N, where N is the 
number of spectral bands, and the elements of this matrix are real fractional numbers of 
absolute values < 1. The matrix PLUS factorization is a sequentially iterative process. The 
required number of iterations is (N - 1), and in each of these iterations, the following 
processes are performed sequentially: Pivoting, LU and S factorisations.
Pivoting
Pivoting is the process of permuting the order of the rows of the matrix being processed 
[160]. While different techniques can be used for pivoting, the quasi-complete technique 
[199] exhibits an optimal trade-off between the complexity and the output errors [160]. 
Therefore, this technique will be considered in this work. The computational process of the 
quasi-complete pivoting is illustrated in Figure 6.3.
Where:
$Y_k = (U_{i,k} - 1) / U_{i,j}$ for $k \le i \le N$ and $(k+1) \le j \le N$    (6.1)
$U_0$ is the eigenvectors matrix A and $P_0$ is the identity matrix
It can be seen from Figure 6.3 that the pivoting process requires the computation of the 
intermediate vector Y, finding the minimum element of Y and swapping two rows of P.
[Figure 6.2 flowchart: starting from U(0) = Eigenvectors, each iteration (while k < N - 1) performs the Pivoting and updates the S, L and U matrices]
Figure 6.2: The PLUS Factorization
Finding the minimum of Y and swapping two rows are much less computationally 
intensive than computing the vector Y, in which each element requires a division 
and a subtraction of 1. For all the (N - 1) iterations, the required number of divisions and 
subtractions is outlined in Equation 6.2.
$\sum_{k=0}^{N-1} (N-k)(N-k-1)$    (6.2)
[Figure 6.3 flowchart, repeated at each iteration: Compute Y → Y(r) = minimum(Y) → Swap two rows of P]
Figure 6.3: Pivoting Process
S Matrix
Initially, the S matrix is assumed to be an identity matrix. However, during the (N - 1) 
iterations, (N - 1) elements of the S matrix will be altered. These elements are in the lower 
triangle of the S matrix and they are computed according to Equation 6.3 below.
$S_{c,k} = \left[ -(U_{k,k} - 1) / U_{k,c} \right]$    (6.3)
where c is the column index of the minimum element of Y, computed in the pivoting process, 
and k is the iteration number. Therefore, the computation of the S matrix requires (N - 1) 
subtractions and (N - 1) divisions.
L U Matrices
The L matrix is a lower triangular matrix, initially an identity matrix, and at each iteration k the 
k-th column is computed as shown in Equation 6.4.
= Skbf)  
.m .m
6.4
0 ...
Where is the permuted so 
c: the column index of the minimum element of Y (Pivoting)
Therefore, at the k-th iteration, (N - k) elements are computed, which will require (N - k) 
subtractions and (N - k) multiplications. For all the (N - 1) iterations, the required 
number of multiplications and subtractions is shown in Equation 6.5.
$\sum_{k=0}^{N-1} (N-k)$    (6.5)
The U matrix is an upper triangular matrix, which is computed by applying Equation 6.6 to 
the k-th column, followed by Gaussian elimination so that the lower triangular 
elements of the k-th column are zeroed out. Therefore, at each iteration, Equation 6.6 will 
require N subtractions and N multiplications. On the other hand, the Gaussian elimination 
will require $(N-k)^2$ additions and multiplications at the k-th iteration. Therefore, the total 
number of multiplications and additions or subtractions required for computing the matrix U is 
stated in Equation 6.7.
6.6
1 4 5  ■■■ “ l,iV
A(fc) = 0 « S  ■ ^2,N
. 0 4  •" % ,iv J
Where andE^^^ is the permuted so
$N + \sum_{k=1}^{N-1} (N-k)^2$    (6.7)
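At this point $A^{(k)} = L^{(k)} \times P^{(k)} A^{(k-1)} \times S^{(k)}$ and eventually, after $(N-1)$ iterations:
$A^{(N-1)} = L^{(N-1)} P^{(N-1)} \cdots L^{(2)} P^{(2)} L^{(1)} P^{(1)} A^{(0)} S^{(1)} S^{(2)} \cdots S^{(N-1)}$
Since $U = A^{(N-1)}$ and $S = (S^{(N-1)})^{-1} \cdots (S^{(2)})^{-1} (S^{(1)})^{-1}$, so that:
$US = L^{(N-1)} P^{(N-1)} \cdots L^{(2)} P^{(2)} L^{(1)} P^{(1)} A^{(0)}$
Taking $L'_k$ as an elementary Gauss matrix: $US = L'_{N-1} L'_{N-2} \cdots$
If $L = \cdots (L'_{N-2})^{-1} (L'_{N-1})^{-1}$, then $PA = LUS$
As P is a permutation matrix, $P = P^{-1}$; hence, A has been factorized as follows:
A = P L U S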
Therefore, after performing (N - 1) iterations, the inverse matrices of L and S need to be 
computed. The computation of the inverse matrix can be computationally expensive, 
especially for large matrices. However, since both L and S are lower triangular matrices, the 
computational requirements are significantly reduced; these computations are mainly multiply 
and accumulate operations and some sign changes (multiply by -1). The number of multiply 
and accumulate operations required for computing the inverse of a lower triangular matrix is 
stated in Equation 6.8. Moreover, the S matrix is not only a lower triangular matrix, it is also a 
sparse matrix, where most of the lower triangular elements are zeroes. Consequently, 
computing the inverse matrix of S will require far fewer operations than stated in Equation 6.8.
$\sum_{k=1}^{N-1} (N-k-1) \times k$    (6.8)
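The inversion of a lower triangular matrix by substitution can be sketched as follows. This is an illustrative Python sketch under the assumption that the matrix has a unit diagonal; the function name `invert_unit_lower` is hypothetical. The counter only tallies genuine multiply-accumulate operations (products by the unit diagonal are not counted), which for this sketch reproduces the count of Equation 6.8.

```python
from typing import List, Tuple


def invert_unit_lower(l: List[List[float]]) -> Tuple[List[List[float]], int]:
    """Invert a unit lower triangular matrix and count the MAC operations.

    Assumption: l[i][i] == 1 for all i and l[i][j] == 0 for j > i.
    """
    n = len(l)
    inv = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    macs = 0
    for j in range(n):                  # column of the inverse
        for i in range(j + 1, n):       # entries below the diagonal
            acc = l[i][j]               # term with inv[j][j] = 1, no multiply needed
            for m in range(j + 1, i):
                acc += l[i][m] * inv[m][j]
                macs += 1
            inv[i][j] = -acc            # sign change
    return inv, macs


inverse, mac_count = invert_unit_lower([[1, 0, 0], [2, 1, 0], [3, 4, 1]])
print(inverse, mac_count)
```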
The computational requirements of the PLUS factorization are stated in Equations 6.2, 6.5, 
6.7 and 6.8; these are summarized in Table 6.1. Like the computations of the eigenvectors, 
the computational requirements of the PLUS factorization depend only on the number of 
spectral bands and are totally independent of the spatial dimensions.
Matrix       Subtraction                                 Division                                    MAC
P            $\sum_{k=0}^{N-1} (N-k)(N-k-1)$             $\sum_{k=0}^{N-1} (N-k)(N-k-1)$             -
S            $(N-1)$                                     $(N-1)$                                     -
L            -                                           -                                           $\sum_{k=0}^{N-1} (N-k)$
U            -                                           -                                           $N + \sum_{k=1}^{N-1} (N-k)^2$
Inverse L    -                                           -                                           $\sum_{k=1}^{N-1} (N-k-1)k$
Inverse S    -                                           -                                           $\sum_{k=1}^{N-1} (N-k-1)k$
Total        $(N-1) + \sum_{k=0}^{N-1} (N-k)(N-k-1)$     $(N-1) + \sum_{k=0}^{N-1} (N-k)(N-k-1)$     $N + \sum_{k=1}^{N-1} (N-k)^2 + \sum_{k=0}^{N-1} (N-k) + 2\sum_{k=1}^{N-1} (N-k-1)k$
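To give a feel for the magnitude of these requirements, the totals of Table 6.1 can be evaluated numerically. The Python sketch below simply evaluates the sums of Equations 6.2, 6.5, 6.7 and 6.8 as read from the table; the function name `plus_operation_counts` is hypothetical.

```python
from typing import Dict


def plus_operation_counts(n: int) -> Dict[str, int]:
    """Evaluate the operation totals of the PLUS factorization for n bands."""
    pivot = sum((n - k) * (n - k - 1) for k in range(0, n))   # Equation 6.2
    l_mac = sum(n - k for k in range(0, n))                   # Equation 6.5
    u_mac = n + sum((n - k) ** 2 for k in range(1, n))        # Equation 6.7
    inv_mac = sum((n - k - 1) * k for k in range(1, n))       # Equation 6.8
    return {
        "subtraction": (n - 1) + pivot,
        "division": (n - 1) + pivot,
        "mac": u_mac + l_mac + 2 * inv_mac,   # inverse computed for both L and S
    }


for bands in (8, 16, 32):
    print(bands, plus_operation_counts(bands))
```

As the text notes, the counts depend only on the number of spectral bands, so the same figures apply to any spatial image size.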
6.3.3 Lifting Scheme
Instead of the matrix multiplication (MeanSub x Eigenvectors) in the KLT algorithm, a 
lifting technique is employed in the Integer KLT. Figure 6.4 illustrates this lifting technique, 
where the PLUS matrices are applied to the MeanSub data set according to the reversible 
lifting scheme presented in [201].
While the dimensions of the P, L, U and S matrices are usually large (16 or 32 or even larger), 
a dimension of 4 is considered in Figure 6.4 to simplify the illustration. In order to avoid any 
distortion of the output data, the output data are rounded after each stage of Figure 6.4. So, 
after multiplying the vector X (the image data) with the S matrix, the output vector is rounded 
before it is multiplied with the U matrix, and so on. Therefore, the lifting scheme can be 
mathematically represented as in Equation 6.9.
$Y = \mathrm{round}(\mathrm{round}(\mathrm{round}(X \times S) \times U) \times L) \times P$    (6.9)
Since P is a permutation matrix, multiplying by P is equivalent to row swapping, where 
multiplications are only performed by ones and zeroes. Therefore, the permuting is not 
computationally intensive, as it only requires a loop through the vector, swapping certain 
elements. The S matrix is a sparse lower triangular matrix; therefore, by applying a zero-check 
technique to the X × S multiplication, far fewer element multiplications are required. 
The most computationally intensive part of the lifting scheme is the multiplication with the 
upper triangular matrix U and the lower triangular matrix L. Therefore, each spectral pixel, a set
of N pixels of the same spatial coordinates, will require multiply and accumulate
operations and one rounding to perform the lifting by any of the matrices L, U or S.
Consequently, the lifting process for a spectral pixel will require multiply and
accumulate and 3 rounding operations plus the permuting.
Figure 6.4: The Lifting process of 4 Pixels [201]
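The reason why rounding inside a lifting stage does not prevent exact reconstruction can be illustrated with a single stage. The Python sketch below is a simplified illustration and not necessarily the exact scheme of [201]: it assumes a unit lower triangular matrix and applies the rounding element by element, so that the inverse can recompute and subtract exactly the same rounded term. The names `lift_unit_lower` and `unlift_unit_lower` are hypothetical.

```python
from typing import List


def lift_unit_lower(x: List[int], l: List[List[float]]) -> List[int]:
    """Forward lifting with a unit lower triangular matrix.

    y[i] = x[i] + round(sum_{j<i} l[i][j] * x[j]); the output stays integer.
    """
    return [x[i] + round(sum(l[i][j] * x[j] for j in range(i)))
            for i in range(len(x))]


def unlift_unit_lower(y: List[int], l: List[List[float]]) -> List[int]:
    """Inverse lifting: recover x element by element from the top."""
    x: List[int] = []
    for i in range(len(y)):
        # x[0..i-1] are already recovered, so the rounded term is identical
        # to the one added in the forward direction.
        x.append(y[i] - round(sum(l[i][j] * x[j] for j in range(i))))
    return x


l_matrix = [[1, 0, 0, 0],
            [0.37, 1, 0, 0],
            [-0.81, 0.25, 1, 0],
            [0.5, -0.66, 0.12, 1]]
pixel = [7, -3, 12, 5]
lifted = lift_unit_lower(pixel, l_matrix)
print(lifted, unlift_unit_lower(lifted, l_matrix) == pixel)
```

The same idea applies to an upper triangular stage, with the inverse proceeding from the last element upwards.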
6.4 Fixed-point Implementation Analysis
In order to evaluate the requirements of the Integer KLT hardware implementation, an error 
analysis of the fixed point implementation (as in Section 4.4) needs to be undertaken. This 
analysis will determine whether a fixed point implementation of the Integer KLT 
algorithm is possible and, if so, the required data-width of the fixed point hardware.
6.4.1 PLUS Matrix Factorization
Unlike the KLT process, the Integer KLT does not tolerate any output error because the main 
objective is lossless data compression. Since the PLUS matrix factorization is an iterative 
process, it is very difficult to limit the output error when a fixed point implementation is 
considered. Moreover, because of the comparison operation in the pivoting process, a fixed 
point implementation can exhibit output errors. Therefore, a fixed-point 
implementation will not be considered for the PLUS factorization. Moreover, as shown in 
Figure 6.2, the iterative PLUS factorization is mainly sequential and does not offer a high level of 
parallelism. Therefore, it will not be feasible to execute this process in hardware, 
which would also require large hardware resources for a floating point implementation.
6.4.2 The Lifting Scheme
The lifting process mainly involves vector multiplication (multiply and accumulate) and 
rounding. The rounding limits the accumulated output error; therefore, it is a helpful factor. 
The output error of the vector multiplication depends on the input data and the size of the 
vector, which is the number of accumulations. The input data are the MeanSub, integers, and 
the LUS matrices, which are formed of fractional numbers of absolute values < 1. Since they are 
integers, the MeanSub data present no input errors; the input error of the LUS matrices is $2^{-a}$, 
where a is the number of bits representing the fractional number. Therefore, when p is the data 
width of the MeanSub data, the output error of each element multiplication is shown in Equation 
6.10. The output error of the vector multiplication (multiply-accumulate) is shown in 
Equation 6.11, where N is the vector size (number of spectral bands). The multipliers must 
comply with Equation 6.10 and the accumulators must comply with Equation 6.11.
Multiplication Output Error < $2^{p-a}$    (6.10)
Multiply-Accumulate Output Error < $N \times 2^{p-a}$    (6.11)
Therefore, in order to eliminate the multiplication output errors, a should be greater than p; in 
other words, the number of bits representing the fractional numbers should be greater than the 
data width of the MeanSub data. While the input spectral image data are unsigned integers, the 
MeanSub data are signed integers; hence, the MeanSub data width requires an additional sign 
bit. Both the AVIRIS and the Hyperion images have a 14-bit data width, so the MeanSub data 
width is 15 bits. The LUS fractional numbers should be wider than 15 bits plus a sign bit, so 17 
bits. Therefore, the multiplier inputs should be 15-bit x 17-bit. From Equation 6.11, it can be 
noticed that the output error of the accumulation is also proportional to the number of 
processed spectral bands, so the data width should be greater than a + p + log2 N.
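The data-width reasoning above can be worked through numerically. The Python sketch below is only an illustration: the function name `lifting_data_widths` is hypothetical, the smallest admissible number of fractional bits (a = p + 1) is assumed, and the accumulator bound is taken as a + p + log2 N, where the p term is our reading of the expression.

```python
import math
from typing import Tuple


def lifting_data_widths(image_bits: int, n_bands: int) -> Tuple[int, int, int]:
    """Return (MeanSub width, LUS width, accumulator bound) in bits.

    Assumptions: MeanSub adds one sign bit to the image data width, the
    fractional part of the LUS values uses a = p + 1 bits plus a sign bit,
    and the accumulator bound is a + p + ceil(log2(N)).
    """
    p = image_bits + 1                   # signed MeanSub data
    a = p + 1                            # fractional bits must exceed p
    lus_bits = a + 1                     # plus one sign bit
    acc_bound = a + p + math.ceil(math.log2(n_bands))
    return p, lus_bits, acc_bound


print(lifting_data_widths(image_bits=14, n_bands=32))
```

For 14-bit image data this reproduces the 15-bit x 17-bit multiplier inputs stated above.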
6.5 KLT versus Integer KLT Computational Requirements
Table 6.2 outlines all the computational requirements for both the KLT and the Integer KLT; 
it also shows the differences between them. The red boxes in the table represent the floating 
point operations, while all the other boxes are fixed point operations. It can be noticed that 
the number of the floating point operations is only dependent on the number of the spectral 
bands; on the other hand, the fixed point computations are dependent on both the spectral and 
spatial dimensions.
In order to visually depict these computational requirements, a hyperspectral image of 
256x256 spatial dimensions and 32 spectral bands is considered. The number of operations 
required by both the KLT and the Integer KLT for this data set is shown in Figure 6.5, from 
which the following can be concluded:
• Since the number of floating point operations depends only on the spectral dimension, while 
the fixed point operations depend on both the spectral and spatial dimensions, the 
required number of fixed point operations is much larger than that of the floating point ones.
• For fixed point operations, the Integer KLT requires around 30% more multiplication and 
addition operations than the KLT.
• The difference in the number of floating point operations is much smaller than that of the fixed 
point ones; hence, the PLUS factorization is much less computationally intensive than the 
eigenvectors computations.
• Since the required number of division and trigonometric operations is independent of the 
spatial dimensions, these operations occur much less frequently than the other 
operations.
• The most dominant operations are the multiplications and the additions, which are 
mainly required for the matrix multiplication and the lifting process.
[Figure 6.5 bar chart: number of operations (axis 0 to 1,200,000) for the KLT and the Integer KLT; categories: Addition, Subtraction, Multiplication, Division, Trigonometric]
Figure 6.5: The Required Number of Floating Point Operations for the KLT / Integer KLT
[Figure 6.6 bar chart: number of operations (axis 0 to 160,000,000) for the KLT and the Integer KLT; categories: Addition, Subtraction, Multiplication, Division]
Figure 6.6: The Required Number of Fixed Point Operations for the KLT / Integer KLT
6.6 Conclusion
In this Chapter, the computational requirements of the Integer KLT process have been 
investigated. Since some of these requirements are the same as those of the KLT algorithm and have 
already been addressed in Chapter 4, only the requirements of the PLUS matrix factorization 
and the lifting scheme have been investigated in this Chapter. The number of the required 
operations for each process has been determined for different spectral and spatial dimensions. 
The required data format (floating or fixed point) for the computations has also been 
discussed, and it has been concluded that the PLUS factorization requires floating point 
computation, while the lifting scheme can be performed in a fixed point format, for which the 
required data width was determined. This Chapter also presented the differences in the 
computational requirements between the KLT and the Integer KLT algorithms. Therefore, this 
investigation has highlighted the potential constraints for the hardware implementation of 
the Integer KLT, which will be discussed in the next Chapter.
Chapter 7
Hardware Acceleration of the Integer 
Karhunen-Loève Transform
7.1 Introduction
The computation of the Integer KLT algorithm was discussed in the previous chapter. In this 
Chapter, the acceleration of this computation in the context of SoC hardware platforms will 
be addressed, with a specific emphasis on the processing time, hardware resources and power 
consumption. Section 7.2 reviews previous works that addressed the integer KLT algorithm; 
the computation flow of the algorithm is presented in section 7.3. Section 7.4 presents the 
proposed hardware architecture on both platforms (the Flash based Smartfusion and the 
SRAM based Cyclone IV). The same section outlines the required resources (hardware and 
power) and the execution time and defines the main constraints and challenges of 
accelerating the Integer KLT on a hardware platform. An adaptive KLT/ Integer KLT system 
for Lossy / lossless compression is presented in section 7.5. Finally, Section 7.6 will present 
the conclusion of this chapter.
7.2 Overview
In addition to being lossless, the Integer KLT offers a compression performance 
that outperforms other techniques [150]. This has intensified the interest of many researchers 
in this technique, and it has been addressed in different works.
• In [142], a comprehensive review of lossless hyperspectral image compression was 
presented, which specifically highlighted the works that presented the Integer KLT in the context 
of spectral decorrelation. The works of the same author investigated the Integer KLT and 
the implementation of the technique on DSP platforms [150], [142], [202] and [110].
• A reversible Integer KLT algorithm was proposed in [197], which outperforms other 
lossy and lossless compression algorithms. In this work, different spectral 
decorrelation algorithms were considered and compared on three test images, 
only one of which was hyperspectral. Moreover, the computational overhead and 
complexity were discussed and compared with the DWT algorithm.
• The performance of the Integer KLT was investigated in terms of the output error in 
[199], where a different pivoting technique than the one of [197] was proposed to 
obtain a better approximation of the linear transforms.
• Different pivoting techniques were investigated in [203], with a specific emphasis on 
the output error and the computational cost for hyperspectral data with a greater number 
of spectral bands.
• In [204], a clustered-multileveled technique for the Integer KLT was proposed and 
the algorithm performance (bit-rate) was investigated and compared against the normal 
clustering Integer KLT.
• A low complexity Integer KLT was proposed in [205], where the computation of the 
covariance matrix was reduced by sampling the input signal.
• In [206], a lossy-to-lossless hyperspectral image compression based on a multiplier-less 
Integer KLT was proposed, where the reduced complexity Integer KLT of [205] was 
adopted. The multiplications were performed using shift and add after decomposing 
the LUS matrices into fractions with power-of-2 denominators. The authors investigated 
the proposed technique on the airborne hyperspectral AVIRIS image. In [207] and 
[208], the same author investigated the proposed technique on medical imaging.
• In [209], the Integer KLT was employed for electroencephalogram (EEG) signal 
processing. The EEG signal had different channels, where each channel is an 
electrode. The inter-channel redundancies were eliminated using the Integer KLT.
In [142], the computation of the Integer KLT was proposed for DSP platforms; on the other 
hand, none of the other works addressed the computation of the Integer KLT on embedded 
platforms. Since the target application is satellite imaging, where both hardware and 
power resources are major constraints, the computation of the Integer KLT within the context 
of embedded platforms is the main scope of this thesis. Moreover, unlike the work presented 
in [142], this thesis addresses the Integer KLT on a hardware platform (FPGA) rather than a DSP 
processor. Therefore, addressing a novel hardware architecture with a specific emphasis on the 
computation acceleration within the required resources (hardware and power) is the main 
objective of this thesis.
7.3 Computational Flow
As shown in Figure 6.1 of the previous chapter, the computation process of the Integer KLT 
is very similar to the computation process of the KLT algorithm, where the only differences 
are the rounding of the MeanVector, the PLUS factorization and the lifting. Therefore, a 
similar computation flow will be adopted for the Integer KLT as shown in Figure 7.1. This 
computation flow is divided into 3 stages:
• Stage 1: Covariance matrix and the Mean Vector
• Stage 2: Normalization and eigenvectors computations
• Stage 3: PLUS factorization followed by the lifting (Eigen Mapping)
In the proposed KLT architecture in Section 5.6, the Eigen mapping (Stage 3) could be 
partially performed in parallel with the eigenvectors computations. However, this will not be 
possible for the Integer KLT because the PLUS factorizations require the whole eigenvectors
matrix at once. Moreover, the PLUS factorization computes the P, L, U and S matrices at
once (no possibility for computing these matrices individually); therefore, the lifting scheme 
cannot be performed in parallel with the PLUS factorization. All the other processes will be 
performed in a similar way as in the KLT algorithm.
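Stage 1 of the flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stage1_mean_and_covariance`, the round-half-up rule and the choice to return unnormalised covariance sums (leaving the normalization to Stage 2) are assumptions made for the example, not details confirmed by the thesis.

```python
import math

def stage1_mean_and_covariance(pixels):
    """pixels: list of spectral vectors (one per spatial position), each of length N.

    Returns the rounded mean vector and the unnormalised N x N covariance sums.
    """
    n_pix = len(pixels)
    n_bands = len(pixels[0])
    # Rounded mean vector: one integer per spectral band (round half up, assumed)
    mean = [math.floor(sum(p[b] for p in pixels) / n_pix + 0.5) for b in range(n_bands)]
    # Mean-subtracted data stays integer because the mean was rounded
    centred = [[p[b] - mean[b] for b in range(n_bands)] for p in pixels]
    # The covariance matrix is symmetric, so only the upper triangle is accumulated
    sums = [[0] * n_bands for _ in range(n_bands)]
    for i in range(n_bands):
        for j in range(i, n_bands):
            acc = sum(c[i] * c[j] for c in centred)
            sums[i][j] = sums[j][i] = acc
    return mean, sums
```

Only N(N + 1)/2 accumulations are needed because of the symmetry, which is why the covariance stage lends itself to a bank of parallel multiply-accumulate units.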
[Figure 7.1 block diagram: Stage 1 (Mean Vector, Covariance Matrix) → Stage 2 (Normalization, Eigenvectors) → Stage 3 (PLUS, Lifting)]
Figure 7.1: The Integer KLT Computation Flow
7.4 System Architecture
In this section, the hardware architecture for the computation of the Integer KLT will be 
addressed on both Flash and SRAM platforms, as in Chapter 5. Since many of the 
computation processes have already been addressed in the KLT architecture of Chapter 5, 
only the computation of the PLUS matrix factorization and the lifting scheme will be 
discussed in this chapter.
7.4.1 DE2-115 SRAM Altera Cyclone IV System Architecture
As shown in Figure 7.2, the proposed system architecture for the Integer KLT computation is 
very similar to the one proposed for the KLT computation in Section 5.6. However, for the 
computation of the Integer KLT, the PLUS matrix factorization will be executed on the 
NIOS II on-chip processor, while the lifting process will be executed within the Stage 3 
Accelerator. Therefore, unlike the KLT architecture, where the on-chip processor was only 
responsible for the high-level process management, the on-chip processor will execute some 
intensive computations (the PLUS factorization); hence, the fast version of the NIOS II will be 
used for the Integer KLT.
[Figure 7.2 block diagram: DE2-115 board with two 64 MB SDRAM devices and a Cyclone IV FPGA containing the NIOS II processor, the Stage 1, Stage 2 and Stage 3 accelerators, the SDRAM controller and the JTAG interface, connected through the Avalon bus]
Figure 7.2: The Proposed System Architecture for the Integer KLT 
7.4.1.1 PLUS Matrix Factorization
As discussed in Sections 6.3 and 6.4, the matrix factorization process requires floating point 
computations and has a mainly sequential computational procedure that offers an insignificant 
level of parallelism. Therefore, performing this process on the on-chip processor is 
more feasible. Since floating point computation is required, the NIOS II floating point unit 
is utilised. Table 7.1 outlines the execution time (in milliseconds) of the PLUS 
factorization for different matrix sizes on the Cortex M-3 and the NIOS II processors at a 
clock speed of 100 MHz. It can be noticed that the execution time increases steeply, by 
roughly an order of magnitude for each doubling of the size of the input matrix (the number 
of spectral bands). Moreover, the hardwired Cortex M-3 exhibits faster performance than the 
soft NIOS II, which is understandable as hard processors outperform soft processors in 
performance, area and power [210].
Table 7.1: The Execution Time (ms) of the PLUS Factorization on the Embedded Processors
Processor     8 × 8    16 × 16    32 × 32
NIOS II       6.7      71         880
Cortex M-3    1.8      15.2       142.8
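The growth rate implied by Table 7.1 can be checked with a short calculation. The sketch below uses the three NIOS II timings from the table and reports the base-2 logarithm of the ratio between consecutive sizes, which is the effective polynomial exponent over that doubling; the function name `growth_exponents` is an assumption made for the example.

```python
import math

# NIOS II execution times (ms) from Table 7.1 for 8x8, 16x16 and 32x32 matrices
nios_ms = [6.7, 71.0, 880.0]

def growth_exponents(times):
    """Return log2 of the time ratio for each doubling of the matrix dimension."""
    return [math.log2(after / before) for before, after in zip(times, times[1:])]

if __name__ == "__main__":
    for exponent in growth_exponents(nios_ms):
        print(f"effective exponent over this doubling: {exponent:.2f}")
```

Both exponents fall between 3 and 4, so over the measured range the cost grows slightly faster than the cube of the number of spectral bands.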
7.4.1.2 Lifting Scheme
As illustrated in Figure 6.4 and Equation 6.9 of the previous chapter, the lifting process 
comprises a sequential multiplication of the vector X (the pixels of the same spatial 
coordinates along all spectral bands) with the S, U, L and P matrices, respectively. Since P is 
a pivoting matrix, the multiplication by P only changes the order of the rows; in 
other words, when this is applied to hyperspectral data, it only alters the order of the 
spectral bands. Thus, to reduce the processing time, the multiplication by P can be performed 
after completing the LUS lifting of the whole hyperspectral data.
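The observation that multiplying by P is only a reordering can be made concrete with a small sketch. Both function names below are assumptions made for the example; the first performs the full matrix-vector product and the second applies the same permutation as an index lookup, which is what allows the step to be deferred and applied once to whole bands.

```python
def permute_by_matrix(P, x):
    """Multiply the permutation matrix P by the vector x (full matrix-vector product)."""
    size = len(x)
    return [sum(P[i][k] * x[k] for k in range(size)) for i in range(size)]

def permute_by_index(P, x):
    """Apply the same permutation as a pure reordering, with no arithmetic at all."""
    order = [row.index(1) for row in P]   # position of the single 1 in each row
    return [x[src] for src in order]
```

In the hyperspectral case each element of x would stand for an entire spectral band, so the reordering can be realised by changing band addresses rather than moving pixel data.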
In order to illustrate the computation of this process, a small pixel set X of 4 pixels of the 
same spatial coordinates is considered. Equations 7.1, 7.2 and 7.3 outline the lifting by S, U 
and L, respectively. From Equation 7.1, it can be noticed that each element $xs_n$ requires 
n operation cycles (multiply-accumulate); the same applies to Equation 7.3. On the other 
hand, in Equation 7.2, each element $xsu_n$ requires (N − n − 1) operation cycles. From the 
expanded form of Equation 7.1, it can be noticed that all the products involving the element 
$x_0$ can be executed at the same time; the same can be noticed for the elements $xs_3$ and 
$xsu_0$ of Equations 7.2 and 7.3, respectively. This makes the parallel execution of these 
equations more feasible.
Figure 7.3 illustrates the proposed hardware architecture for the lifting scheme for a set of 4 
pixels; this process is executed over 3 phases: multiplying by S, U and L. The output results 
of each phase are saved into intermediate FIFOs ([S × MeanSub] and [U × [S × MeanSub]]). 
While any of the phases executes a certain pixel set, the other phases can execute the next or 
the previous pixel sets. The execution of each phase requires N operation 
cycles (computing) plus N buffering cycles (filling the buffer of the next phase). Since the 
multiply-accumulate (MAC) units used in this design require 2 clock cycles per operation, the 
execution of each phase requires (2 × N + N = 3N) clock cycles. Since the order of each 
intermediate FIFO needs to be reversed, the computing operations cannot be performed in 
parallel with the buffering. Consequently, the processing throughput can reach 
approximately N pixels per 3N clock cycles for each unit of Figure 7.3. Therefore, for an 
(M × L × N) image, the execution time of the lifting scheme will be slightly more than 
3 × M × L × N clock cycles. The post-synthesis ModelSim simulation of the lifting 
hardware unit is presented in Appendix D.2.
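The cycle count above translates into a simple lower-bound timing model. The sketch below assumes 3 clock cycles per sample, a 100 MHz clock and an even split of the work across parallel lifting units; the function name `lifting_time_ms` and the `units` parameter are assumptions made for the example.

```python
def lifting_time_ms(rows, cols, bands, clock_hz=100e6, units=1):
    """Lower bound on the lifting time in milliseconds.

    Uses 3 clock cycles per sample (3 * M * L * N in total), shared evenly
    between `units` lifting units working in parallel (an assumption).
    """
    cycles = 3 * rows * cols * bands / units
    return cycles / clock_hz * 1e3

if __name__ == "__main__":
    print(f"{lifting_time_ms(128, 128, 16):.3f} ms")  # about 7.864 ms
```

For a 128 × 128 × 16 data set the bound is about 7.86 ms, slightly below the 8.125 ms reported for a single unit, which is consistent with the statement that the real execution time is slightly more than 3 × M × L × N cycles.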
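\begin{equation}
\begin{bmatrix} xs_0 \\ xs_1 \\ xs_2 \\ xs_3 \end{bmatrix} =
\begin{bmatrix} 1 & 0 & 0 & 0 \\ s_{1,0} & 1 & 0 & 0 \\ s_{2,0} & s_{2,1} & 1 & 0 \\ s_{3,0} & s_{3,1} & s_{3,2} & 1 \end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
\tag{7.1}
\end{equation}
So that:
\begin{align*}
xs_0 &= x_0 \\
xs_1 &= [s_{1,0} x_0 + x_1] \\
xs_2 &= [s_{2,0} x_0 + s_{2,1} x_1 + x_2] \\
xs_3 &= [s_{3,0} x_0 + s_{3,1} x_1 + s_{3,2} x_2 + x_3]
\end{align*}

\begin{equation}
\begin{bmatrix} xsu_0 \\ xsu_1 \\ xsu_2 \\ xsu_3 \end{bmatrix} =
\begin{bmatrix} 1 & u_{0,1} & u_{0,2} & u_{0,3} \\ 0 & 1 & u_{1,2} & u_{1,3} \\ 0 & 0 & 1 & u_{2,3} \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} xs_0 \\ xs_1 \\ xs_2 \\ xs_3 \end{bmatrix}
\tag{7.2}
\end{equation}
So that:
\begin{align*}
xsu_0 &= [xs_0 + u_{0,1} xs_1 + u_{0,2} xs_2 + u_{0,3} xs_3] \\
xsu_1 &= [xs_1 + u_{1,2} xs_2 + u_{1,3} xs_3] \\
xsu_2 &= [xs_2 + u_{2,3} xs_3] \\
xsu_3 &= xs_3
\end{align*}

\begin{equation}
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{bmatrix} =
\begin{bmatrix} 1 & 0 & 0 & 0 \\ l_{1,0} & 1 & 0 & 0 \\ l_{2,0} & l_{2,1} & 1 & 0 \\ l_{3,0} & l_{3,1} & l_{3,2} & 1 \end{bmatrix}
\begin{bmatrix} xsu_0 \\ xsu_1 \\ xsu_2 \\ xsu_3 \end{bmatrix}
\tag{7.3}
\end{equation}
So that:
\begin{align*}
y_0 &= xsu_0 \\
y_1 &= [l_{1,0} xsu_0 + xsu_1] \\
y_2 &= [l_{2,0} xsu_0 + l_{2,1} xsu_1 + xsu_2] \\
y_3 &= [l_{3,0} xsu_0 + l_{3,1} xsu_1 + l_{3,2} xsu_2 + xsu_3]
\end{align*}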
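The three lifting steps of Equations 7.1, 7.2 and 7.3 can be sketched in software as follows. This is a minimal illustration, assuming that the square brackets denote rounding to the nearest integer and that the diagonal elements are exactly 1; all function names are assumptions made for the example and the code does not model the hardware pipeline.

```python
import math

def _round(value):
    """Round half up; any deterministic rounding rule keeps the scheme reversible."""
    return math.floor(value + 0.5)

def lift_lower(mat, x):
    """out_n = x_n + round(sum over k < n of mat[n][k] * x_k), unit lower triangular mat."""
    return [x[n] + _round(sum(mat[n][k] * x[k] for k in range(n))) for n in range(len(x))]

def unlift_lower(mat, y):
    x = []
    for n in range(len(y)):            # first element first: only recovered values are used
        x.append(y[n] - _round(sum(mat[n][k] * x[k] for k in range(n))))
    return x

def lift_upper(mat, x):
    """out_n = x_n + round(sum over k > n of mat[n][k] * x_k), unit upper triangular mat."""
    size = len(x)
    return [x[n] + _round(sum(mat[n][k] * x[k] for k in range(n + 1, size)))
            for n in range(size)]

def unlift_upper(mat, y):
    size = len(y)
    x = [0] * size
    for n in range(size - 1, -1, -1):  # last element first
        x[n] = y[n] - _round(sum(mat[n][k] * x[k] for k in range(n + 1, size)))
    return x

def forward_lifting(S, U, L, x):
    """Apply S, then U, then L, as in Equations 7.1 to 7.3."""
    return lift_lower(L, lift_upper(U, lift_lower(S, x)))

def inverse_lifting(S, U, L, y):
    """Undo the three steps in reverse order, recovering the integer input exactly."""
    return unlift_lower(S, unlift_upper(U, unlift_lower(L, y)))
```

Because every step adds a rounded quantity that depends only on elements the inverse has already recovered, the integer input is reconstructed exactly, which is what makes the transform suitable for lossless compression.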
[Figure 7.3: hardware architecture of the lifting scheme for a set of 4 pixels, with MUX logic for the S, U and L phases]
Table 7.2 outlines the time (in milliseconds) required by a single unit to process data sets 
of different sizes when running at a clock speed of 100 MHz. More than one unit can be utilised 
at a time, so that the execution time is reduced by working in parallel.
Table 7.2: The Execution Time (ms) of the Lifting Scheme for Different Data Set Sizes
Data set size      Execution time (ms)
128 × 128 × 16     8.125
128 × 128 × 32     16.25
256 × 256 × 16     32.5
256 × 256 × 32     65
512 × 512 × 16     130
512 × 512 × 32     260
The hardware resources required for the lifting unit (Figure 7.3) are mainly the 
multiply-accumulate (MAC) units, the FIFOs and the multiplexing logic. These requirements are 
mainly proportional to the number of spectral bands N; so, in addition to the multiplexing 
logic, 3N FIFOs and 3(N − 1) MAC units are utilised for a lifting unit of size N. As shown 
in Section 5.6, the MAC units require more hardware logic than the other components. Therefore, 
it is important to maintain a high hardware utilisation of these units. From Figure 7.3 and 
Equations 7.1, 7.2 and 7.3, it can be noticed that at any given time some of the MAC units are 
in operation while others are multiplying by zeroes (idle). For example, when the lower 
MACs of the L and S sections are processing the coefficients of their last rows, the lower MAC 
of the U section is processing zeroes. Therefore, the lower elements of the U section can be 
computed by multiplexing the lower MAC of the L or the S section. In the case of a 4-pixel 
data set (Figure 7.3), this saves only one MAC (MAC5); however, for larger data sets, more 
MAC units can be multiplexed ($\frac{N}{2} - 1$ units for even N), which reduces the number of 
MAC units used and increases the utilisation of the remaining ones. Consequently, the required 
number of MAC units is $\frac{5N}{2} - 2$ when N is an even number and [...]/2 − 2 when N is an 
odd number. Table 7.3 outlines the required hardware resources for the lifting unit when 
implemented on the Altera Cyclone IV FPGA.
Table 7.3: The Hardware Resources Utilised for the Proposed Lifting Scheme Architecture
                                    LE                        9-bit Multiplier    SRAM bit
With the hardware multipliers       32(5N/2 − 2) + 400*       5N − 4              3N²α + MLTPX
Without the hardware multipliers    392(5N/2 − 2) + 400*      -                   3N²α + MLTPX
(*) 400 LEs are mainly for the multiplexing logic; this might vary slightly. This is an insignificant amount, 
knowing that the Cyclone IV used offers around 115000 LEs.
α: the data width
MLTPX: the RAM used by the multiplexing logic, which is approximately 3Nα, an insignificant amount 
knowing that the Cyclone IV used offers around 4 Mbit of SRAM.
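The entries of Table 7.3 can be evaluated for a given number of spectral bands with a short helper. The sketch reads the table as 32 or 392 LEs per MAC plus 400 LEs of multiplexing logic, two 9-bit multipliers per MAC, and 3N²α + 3Nα SRAM bits; this reading, the function name `lifting_unit_resources` and the restriction to even N are assumptions made for the example.

```python
def lifting_unit_resources(n_bands, data_width, hardware_multipliers=True):
    """Estimate the Cyclone IV resources of one lifting unit (even n_bands only)."""
    if n_bands % 2:
        raise ValueError("this sketch only covers an even number of spectral bands")
    macs = 5 * n_bands // 2 - 2                   # MAC units after multiplexing
    le_per_mac = 32 if hardware_multipliers else 392
    return {
        "macs": macs,
        "logic_elements": le_per_mac * macs + 400,            # 400 LEs of MUX logic
        "multipliers_9bit": 5 * n_bands - 4 if hardware_multipliers else 0,
        "sram_bits": 3 * n_bands**2 * data_width + 3 * n_bands * data_width,
    }

if __name__ == "__main__":
    print(lifting_unit_resources(16, 16))
```

For 8, 16 and 32 bands the MAC count evaluates to 18, 38 and 78, which matches the R values used for a single lifting unit elsewhere in this section.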
7.4.1.3 Hardware Utilisation and Processing Time
The processing time and the required hardware resources have been addressed for the lifting 
and the PLUS factorization, while the other components have been addressed in Section 5.6. 
While the design parameter of the covariance computation is R (defined in Section 5.6.4), the 
main design parameter for the lifting scheme is the number of lifting units, each of which 
uses $U = \frac{5N}{2} - 2$ MACs. In principle, utilising more MAC units results in faster 
computation but higher power consumption. Nevertheless, the optimal R for the covariance 
is $\frac{N(N+1)}{2}$, or a fraction of it if N is large and there is not enough hardware; on 
the other hand, the optimal R for the lifting is a multiple of U. Table 7.4 outlines the 
hardware and power requirements of the proposed architecture when implemented on the target 
FPGA (Cyclone IV EP4CE115). Owing to the RAM limitations of the FPGA used, only a segment of 
128 × 128 × 8 pixels is considered; this required about 81% of the available RAM. The Altera 
Quartus II PowerPlay Analyser was used to estimate the power consumption. Since the interface 
with the hyperspectral sensor and any external memories is beyond the scope of this research, 
the input / output power consumption was excluded and only the dynamic and static power 
consumption of the core logic was considered.
The following can be noticed from Table 7.4:
• The maximum R that could be realised on the target FPGA is 156; this offers 2 lifting 
units for 32 spectral bands and more than one fourth of the optimal R for the 
covariance computation, so the computation of the covariance matrix will be 
performed in 4 rounds when 32 spectral bands are processed.
• For 16 spectral bands, R = 152 offers 4 lifting units and more than the optimal R 
(136) for the covariance computation, which can be performed in a single round.
• For 8 spectral bands, the optimal R for the covariance computation is 36 MACs, 
which can offer 2 lifting units. The target FPGA can offer logic for more lifting units; 
however, this additional logic will be idle during the covariance matrix computation.
• Since the embedded multipliers consume much less power, a significant increase in the 
power consumption can be seen when the hardware multipliers are fully used and 
Logic Elements are utilised instead (R = 152).
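The observations above follow from two small formulas, sketched below. The sketch assumes 5N/2 − 2 MACs per lifting unit and an optimal covariance R of N(N + 1)/2, a value inferred here from the figures 36 and 136 quoted for 8 and 16 bands; the function name `plan_mac_budget` is an assumption made for the example.

```python
import math

def plan_mac_budget(n_bands, r):
    """Return how many lifting units and covariance rounds a budget of r MACs allows."""
    macs_per_unit = 5 * n_bands // 2 - 2           # MACs in one lifting unit (even N)
    optimal_cov_r = n_bands * (n_bands + 1) // 2   # one MAC per unique covariance entry
    return {
        "lifting_units": r // macs_per_unit,
        "optimal_covariance_r": optimal_cov_r,
        "covariance_rounds": math.ceil(optimal_cov_r / r),
    }

if __name__ == "__main__":
    for bands, r in ((8, 36), (16, 152), (32, 156)):
        print(bands, r, plan_mac_budget(bands, r))
```

With R = 156 and 32 bands the helper gives 2 lifting units and 4 covariance rounds, and with R = 152 and 16 bands it gives 4 lifting units and a single round, in line with the bullets above.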
Table 7.4: The Hardware and Power Resources of the Proposed Integer KLT Architecture
       Single Precision                                  Double Precision
R      Multipliers   %     LEs      %     Power (mW)*    Multipliers   %     LEs      %     Power (mW)*
18     224           42    39000    34    649            351           66    51000    45    809
36     260           49    40000    35    670            387           73    52000    46    829
38     264           50    40000    35    672            391           83    53000    46    841
76     340           64    43000    38    724            467           88    54000    47    873
152    492           92    46000    56    798            532           100   90000    79    1261
156    500           94    46000    56    801            532           100   92000    81    1282
(*) For the power estimation, a 1.2 V core supply voltage was considered at an ambient temperature of 25°C at 100 MHz
In order to assess the performance of the proposed architecture, different data sets of the 
hyperspectral images AVIRIS Cuprite and Hyperion Boston (the same as in Table 4.3) were used 
as test data, as shown in Table 7.5. The processing time of each individual stage is stated for 
different values of R. Since the hard Cortex M-3 showed significantly better 
computational performance than the soft NIOS II (Table 7.1), the Cortex M-3 is 
considered in the assessment as a conceptual approach, should the proposed architecture be 
implemented on an FPGA platform incorporating a hard processor.
It can be noticed from Table 7.5:
• The processing times of Stage 1 and the lifting are proportional to both the spectral and 
spatial dimensions, while those of the eigenvectors computation and the PLUS factorization 
depend only on the spectral dimension.
• The share of the eigenvectors computation in the overall processing time 
is much more significant for larger numbers of spectral bands (for 8 spectral bands: up 
to 8%; for 16 spectral bands: up to 26%; for 32 spectral bands: up to 63%).
• The share of the PLUS factorization in the overall processing time is more 
significant for larger numbers of spectral bands, as outlined in Table 7.6.
• While the computational requirements of the PLUS factorization (Section 6.5) are far 
lower than those of the other processes (eigenvectors, covariance and lifting), the execution 
time of this process can be an overhead, especially if implemented on a less powerful 
processor such as the soft NIOS II. Therefore, it would be valuable to 
have a factorization process similar to the PLUS but with a less sequential computation 
flow, so that it is more suitable for hardware implementation.
Table 7.5: The Execution Time (milliseconds) of the Proposed Integer KLT Architecture (*)
                  Lifting                        Stage 3: PLUS     Stage 3:   Total Time
Image Size        units    R     Stage 1  Stage 2  NIOS    Cortex   Lifting    NIOS      Cortex
128 × 128 × 8     1        18    0.66     0.47     6.7     1.8      4.1        11.93     7.03
128 × 128 × 8     2        36    0.33     0.47     6.7     1.8      2.1        9.6       4.7
256 × 256 × 8     1        18    2.64     0.47     6.7     1.8      16.25      26.06     21.16
256 × 256 × 8     2        36    1.32     0.47     6.7     1.8      8.125      16.615    11.715
512 × 512 × 8     1        18    10.56    0.47     6.7     1.8      65         82.73     77.83
512 × 512 × 8     2        36    5.28     0.47     6.7     1.8      32.5       44.95     40.05
128 × 128 × 16    1        38    1.32     3.5      71      15.2     8.125      83.945    28.145
128 × 128 × 16    2        76    0.66     3.5      71      15.2     4.1        79.26     23.46
128 × 128 × 16    4        152   0.33     3.5      71      15.2     2.1        76.93     21.13
256 × 256 × 16    1        38    5.28     3.5      71      15.2     32.5       112.28    56.48
256 × 256 × 16    2        76    2.64     3.5      71      15.2     16.25      93.39     37.59
256 × 256 × 16    4        152   1.32     3.5      71      15.2     8.125      83.945    28.145
512 × 512 × 16    1        38    21.12    3.5      71      15.2     130        225.62    169.82
512 × 512 × 16    2        76    10.56    3.5      71      15.2     65         150.06    94.26
512 × 512 × 16    4        152   5.28     3.5      71      15.2     32.5       112.28    56.48
128 × 128 × 32    1        78    2.64     25.3     880     142      16.25      999.55    261.55
128 × 128 × 32    2        156   1.32     25.3     880     142      8.125      1069.4    331.42
256 × 256 × 32    1        78    10.56    25.3     880     142      65         1048.3    310.3
256 × 256 × 32    2        156   5.28     25.3     880     142      32.5       1093.8    355.8
512 × 512 × 32    1        78    42.24    25.3     880     142      260        1243.3    505.3
512 × 512 × 32    2        156   21.12    25.3     880     142      130        1191.3    453.3
(*) A clock frequency of 100 MHz
Table 7.6: The Processing Time of the PLUS Factorization over the Overall Processing Time
Spectral Bands    8         16        32
NIOS II           15-70%    47-92%    70-88%
Cortex M-3        4-38%     9-72%     28-54%
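The share of the PLUS factorization in the overall time can be recomputed from the per-stage figures. The sketch below simply sums the four stage times of one configuration and returns the PLUS fraction; the function name `plus_share` is an assumption made for the example, and the sample values are those listed for the 128 × 128 × 8 data set with 2 lifting units on the NIOS II.

```python
def plus_share(stage1_ms, stage2_ms, plus_ms, lifting_ms):
    """Return (fraction of time spent in the PLUS factorization, total time in ms)."""
    total = stage1_ms + stage2_ms + plus_ms + lifting_ms
    return plus_ms / total, total

if __name__ == "__main__":
    share, total = plus_share(0.33, 0.47, 6.7, 2.1)
    print(f"total = {total:.2f} ms, PLUS share = {share:.1%}")
```

For this configuration the PLUS factorization accounts for roughly 70% of the total, which illustrates why a sequential step on the soft processor dominates once the parallel stages have been accelerated.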
7.4.2 Integer KLT SoC Architecture on the SmartFusion Flash FPGA
A KLT SoC architecture for the SmartFusion platform was proposed in Section 5.7. The 
hardware architecture of the Integer KLT is very similar; the differences are:
• The PLUS factorization (executed on the Cortex processor, so no hardware changes).
• The lifting, which consists of 3 matrix multiplications as explained in Equations 7.1, 7.2 
and 7.3. Hence, this is somewhat similar to the Eigen mapping process, but it requires more 
multiply-accumulate operations in addition to some extra data rescheduling and 
rounding, as explained in the previous section.
Since the on-chip processor of the SmartFusion (Cortex M-3) is more computationally 
powerful than the NIOS II, it can take on more of the computations. The Integer KLT algorithm 
is mapped onto the SmartFusion SoC by dividing the constituent computational processes between 
the embedded Cortex M-3 processor and the hardware accelerator on the FPGA fabric, as shown 
in Figure 7.4.
The block diagram of the proposed system consists of:
• ARM Cortex-M3 Processor
• AMBA on-chip bus for data communications between the Cortex and the FPGA 
fabric.
• 2 large FIFOs (FIFO 2 and FIFO 4) of size
• 2 small FIFOs (FIFO 1 and FIFO 3) of size N
• An accelerator to perform the vector multiplications (2 MAC units and some 
scheduling logic)
• A Control Unit to manage the data flow between the Cortex M3 processor and the 
FPGA fabric
Two operations are performed in the accelerator:
• The Covariance Matrix: The H x H multiplication of the covariance computation 
as in the KLT architecture of Section 5.7
• The lifting: The 2 big FIFOs hold the L and U matrices and the small FIFOs hold the
intermediate processed data (xs and xsu  of equations 7.2 and 7.3); the accelerator
incorporates 2 MAC units, so equations 7.2 and 7.3 are performed in the hardware.
On the other hand, the multiplication by S (equation 7.1) is performed in the Cortex
processor.
[Figure 7.4 block diagram: Cortex M-3, Accelerator, FIFO 1, FIFO 2, FIFO 3 and FIFO 4]
Figure 7.4: The Block Diagram of the SmartFusion Architecture 
7.4.2.1 Experimental Results
In order to assess the performance of the proposed system, different data sets of the 
hyperspectral test images (AVIRIS and Hyperion, the same as in Table 4.3) are processed by the 
SoC prototype. Table 7.7 outlines the required hardware resources and Table 7.8 shows the 
acceleration offered by the design. It can be seen that the proposed design offers an overall 
acceleration of 53%, 52% and 48% for 8, 16 and 32 spectral bands, respectively. This 
almost halves the processing time with an overall power consumption of less than 0.25 Watt, 
estimated using the SmartPower tool of Actel.
Table 7.7: The Required Resources (Hardware and Power) of the SmartFusion Architecture
Resource                        Used    Total    Percentage
FPGA Fabric (System Gates)      4260    4608     92%
Embedded SRAM (Blocks)          6       8        75%
Power (mW): Static 9, Dynamic 219, Total 228
Table 7.8: The Execution Time (seconds) of the Integer KLT Architecture (SmartFusion)
Data set          Process         Cortex    App 2     Improvement %
256 × 256 × 8     BandMean        0.0224    0.0224    -
                  MeanSub         0.0224    0.0224    -
                  Covariance      0.16      0.086     46.3%
                  Eigenvectors    0.0108    0.0108    -
                  PLUS            0.002     0.002     -
                  Lifting         2.18      0.98      55%
                  Overall         2.397     1.124     53%
256 × 256 × 16    BandMean        0.0448    0.0448    -
                  MeanSub         0.0448    0.0448    -
                  Covariance      0.32      0.172     46.3%
                  Eigenvectors    0.093     0.093     -
                  PLUS            0.015     0.015     -
                  Lifting         4.35      1.96      55%
                  Overall         4.867     2.33      52%
256 × 256 × 32    BandMean        0.0896    0.0896    -
                  MeanSub         0.0896    0.0896    -
                  Covariance      0.64      0.344     46.3%
                  Eigenvectors    0.88      0.88      -
                  PLUS            0.142     0.142     -
                  Lifting         8.7       3.9       55%
                  Overall         10.54     5.45      48.3%
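The improvement column of Table 7.8 is the relative reduction in execution time, which can be reproduced with a one-line helper. The function name `improvement_percent` is an assumption made for the example; the sample values are the overall timings from the table.

```python
def improvement_percent(baseline_s, accelerated_s):
    """Relative reduction in execution time, in percent of the baseline."""
    return (baseline_s - accelerated_s) / baseline_s * 100.0

if __name__ == "__main__":
    for bands, cortex, app2 in ((8, 2.397, 1.124), (16, 4.867, 2.33), (32, 10.54, 5.45)):
        print(f"{bands} bands: {improvement_percent(cortex, app2):.1f}%")
```

The overall figures evaluate to about 53.1%, 52.1% and 48.3%, so the accelerated design runs in roughly half the time of the processor-only version.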
7.5 Adaptive KLT / Integer KLT Computation
The KLT and Integer KLT algorithms lend themselves very well to a combined hardware 
implementation, in which the designs share most of the computational modules. Such a 
unified design, performing both functions adaptively, would take less hardware resources 
than the two individual ones. This is possible because the KLT and Integer KLT algorithms 
are never used simultaneously, due to the different types of the inter-band compression that 
they realize.
Since the hardware architectures addressed in this chapter and in Sections 5.6 and 5.7 have 
most components in common, building an adaptive KLT / Integer KLT system is 
straightforward and would not require much additional hardware. In terms of hardware, the 
only difference is the Eigen mapping process, which is performed through a lifting scheme 
for the Integer KLT and through the matrix multiplication (eigenvectors × MeanSub) for the KLT. 
Therefore, an adaptive system only requires demultiplexing between these two 
processes. In fact, both processes require similar mathematical operations (matrix 
multiplications), which mainly employ multiply-accumulate units. Therefore, the main 
hardware resources can be shared between these processes, while each of the processes will 
require its own multiplexing and scheduling logic, which is much smaller than the 
MAC units. On the other hand, in terms of processor firmware, the PLUS factorization only 
needs to be performed when the Integer KLT is required.
Figure 7.5 illustrates the computational flowchart of the proposed adaptive system. This 
system can offer the option of dynamically selecting between lossy or lossless compression, 
with minimal additional hardware resources.
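The selection between the two Eigen mapping paths can be sketched as a small dispatcher. This is a software illustration only, assuming a per-pixel interface, round-half-up rounding and unit-triangular S, U and L factors; the names `matvec`, `lift` and `eigen_map` and the mode strings are assumptions made for the example and do not describe the actual hardware multiplexing.

```python
import math

def matvec(mat, vec):
    """Plain matrix-vector product, as used by the lossy KLT path."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]

def lift(mat, vec):
    """Unit-triangular lifting step: keep the element, add the rounded off-diagonal sum."""
    out = []
    for n, row in enumerate(mat):
        acc = sum(row[k] * vec[k] for k in range(len(vec)) if k != n)
        out.append(vec[n] + math.floor(acc + 0.5))
    return out

def eigen_map(mode, centred, eigvecs=None, factors=None):
    """Select the mapping: 'lossy' uses the eigenvectors, 'lossless' uses S, U and L."""
    if mode == "lossy":
        return matvec(eigvecs, centred)
    if mode == "lossless":
        S, U, L = factors
        return lift(L, lift(U, lift(S, centred)))
    raise ValueError(f"unknown mode: {mode}")
```

Both branches reduce to sums of products, which is the property that lets the multiply-accumulate units be shared while only the scheduling differs between the two modes.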
Tables 7.9 and 7.10 outline the execution time of the adaptive KLT / Integer KLT computing 
system on the Altera DE2-115 and the Actel SmartFusion boards, respectively. The hardware 
and the power resources of this system are outlined in Table 7.11 and Table 7.12.
[Figure 7.5 flowchart: Initialization, BandMean, SubMean, Covariance and Eigenvectors, followed by either Lifting (RKLT path) or Matrix Multiplication (KLT path), with the steps divided between the Cortex M-3 and the FPGA]
Figure 7.5: The Computation Flow of the Adaptive KLT/ Integer KLT System
[Table 7.9: The execution time of the adaptive KLT / Integer KLT system on the Altera DE2-115 board]
[Table 7.10: The execution time of the adaptive KLT / Integer KLT system on the Actel SmartFusion board]
[Tables 7.11 and 7.12: The hardware and power resources of the adaptive KLT / Integer KLT system]
7.6 Conclusion
In this chapter, the acceleration of the Integer KLT on SoC FPGA platforms was addressed, 
and a novel hardware architecture for accelerating the Integer KLT was proposed. This 
architecture was presented on both Flash and SRAM FPGA platforms for low power and 
high-performance applications, respectively. The performance of the proposed architecture 
was assessed thoroughly in terms of processing time, required hardware resources and power 
consumption. For the low power platform, the processing time of a hyperspectral data set of 
256 × 256 × 32 was less than 5.5 seconds, with an acceleration of more than 48% and a power 
consumption of less than 0.25 Watt. For the high-performance platform, the processing time 
of the same data set was less than 1.1 seconds with a power consumption of less than 1.3 
Watt. On the same platform, if a hardwired Cortex M-3 were available, this conceptual approach 
would offer a processing time of less than 0.36 seconds and a power consumption far below 
1.3 Watt. The proposed novel hardware architecture targets hyperspectral image compression 
with a large number of spectral bands, which has not been addressed before in the context of 
embedded hardware.
Chapter 8 
Conclusions and Future Work
In this chapter, the conclusions of the research work are outlined along with its novelty, 
and directions for the future continuation of the research are suggested.
8.1 Conclusions
In Chapter 2, the literature review was presented, which covered the relevant theoretical 
background, challenges and trends for this research. This included space radiation effects and 
their mitigation techniques, FPGAs, reconfigurable computing and System-on-a-Chip 
technologies, and their suitability for space applications. This highlighted the need for a 
powerful acceleration of the KLT transform for hyperspectral image compression on 
embedded hardware, which has not been addressed before.
Chapter 3 presented an overview of hyperspectral satellite imaging, its significance for 
different applications and the mechanism of its operation. The compression process of the 
hyperspectral data was also discussed and the Consultative Committee for Space Data 
Systems (CCSDS) standards were addressed. A discussion of the spectral decorrelation 
techniques was presented; this included the compression performance of these techniques, 
their complexities and the approaches to reduce these complexities. This discussion highlighted 
the significance of the KLT process for hyperspectral data compression and the complexity of 
its computational process.
In Chapter 4, the computation of the KLT process was investigated thoroughly; the 
computations of the eigenvectors and the eigenvalues were analysed and different techniques 
were compared in terms of output accuracy and computational requirements. A novel Matrix 
Reduction Technique based on the Jacobi algorithm was proposed; this technique reduces the 
number of required iterations, and the simulation of different data sets of the test data (AVIRIS 
and Hyperion) showed reductions of 20% to 30%. Moreover, the computational requirements 
of each individual process were outlined and the dependencies of these requirements on the 
hyperspectral image dimensions were defined. In addition, a comprehensive error analysis of 
the fixed- and floating-point implementations of the KLT algorithm was presented. From this, 
the required data formats were determined, such that data distortion can be avoided in a fixed 
point implementation of all the processes except the eigenvectors computation. A simulation of 
the fixed- and floating-point output error of the eigenvectors computation was also presented; 
this simulation used the hyperspectral data from the AVIRIS and the Hyperion imagers.
In Chapter 5, the acceleration of the KLT computations on SoC FPGA platforms was 
addressed, and a novel hardware architecture for accelerating the KLT was proposed. 
Compared with previously proposed hardware architectures, the proposed architecture offers 
a further level of parallel computing, which is more significant for hyperspectral data with 
a large number of spectral bands. When hyperspectral data were tested on the proposed system, 
it showed an overall improvement of up to 4.9%, 11.8% and 18.4% for 8, 16 and 32 spectral 
bands, respectively. This architecture was presented on both Flash and SRAM FPGA 
platforms for low power and high-performance applications, respectively. The performance 
of the proposed architecture was assessed thoroughly in terms of processing time, required 
hardware resources and power consumption. For the low power platform, the processing time 
of a hyperspectral data set of 256 × 256 × 32 was approximately 3.5 seconds, with an 
acceleration of more than 50% and a power consumption of less than 0.25 Watt. For the 
high-performance platform, the processing time of the same data set was less than 37 
milliseconds with a power consumption of approximately 1.05 Watt.
In Chapter 6, the computational requirements of the Integer KLT process were investigated. 
The number of required operations for each process was determined for different 
spectral and spatial dimensions. The required data format (floating and fixed point) for the 
computations was also discussed, and it was concluded that the PLUS factorization 
requires floating point computation while the lifting scheme can be performed in a fixed 
point format, for which the required data width was determined. This chapter also presented the 
differences in the computational requirements between the KLT and the Integer KLT 
algorithms; this highlighted the potential constraints for the hardware implementation of the 
Integer KLT.
In Chapter 7, the acceleration of the Integer KLT on SoC FPGA platforms was addressed, 
and a novel hardware architecture for accelerating the Integer KLT was proposed. This 
architecture was presented on both Flash and SRAM FPGA platforms for low power and 
high-performance applications, respectively. The performance of the proposed architecture 
was assessed thoroughly in terms of processing time, required hardware resources and power 
consumption. For the low power platform, the processing time of a hyperspectral data set of 
256 × 256 × 32 was less than 5.5 seconds, with an acceleration of more than 48% and a power 
consumption of less than 0.25 Watt. For the high-performance platform, the processing time 
of the same data set was less than 1.1 seconds with a power consumption of less than 1.3 
Watt. On the same platform, if a hardwired Cortex M-3 were available, this conceptual approach 
would offer a processing time of less than 0.36 seconds and a power consumption far below 
1.3 Watt. The proposed novel hardware architecture targets hyperspectral image compression 
with a large number of spectral bands, which has not been addressed before in the context of 
embedded hardware.
8.2 Novelty Claims
The research outcomes of this thesis provide novel contributions to the state of the art 
as follows:
• A new architecture for the acceleration of the Integer Karhunen-Loève Transform 
computation on an FPGA-based System-on-a-Chip platform for lossless hyperspectral 
image compression is proposed. This includes comprehensive investigations of the 
power consumption, hardware resources and performance constraints. Moreover, 
since no hardware architecture has been addressed before, this work has highlighted 
the computational constraints of the algorithm from a hardware perspective.
• A novel architecture for the acceleration of the Karhunen-Loève Transform 
computation on an FPGA-based System-on-a-Chip platform for lossy hyperspectral 
image compression is proposed. Compared with previously proposed hardware 
architectures, the proposed architecture offers a further level of parallelism, which is 
more significant for hyperspectral data with a large number of spectral bands. The 
experiments with the proposed system on the AVIRIS and the Hyperion data showed an 
overall improvement in the level of parallelism of up to 4.9%, 11.8% and 18.4% for 
8, 16 and 32 spectral bands, respectively.
• A novel technique for eigenvalue/eigenvector computation based on the Jacobi 
algorithm is proposed. This technique reduces the number of iterations required for 
large symmetric matrices; simulations on different data sets from the test data 
(AVIRIS and Hyperion) showed reductions of 20% to 30%. Moreover, this technique 
allows partial computation of the eigenvectors and eigenvalues. This can 
improve the level of parallelism not only for the KLT computations but also for other 
applications in which some of the eigenvectors/eigenvalues can be utilised in the 
next computation stage.
• A novel hardware algorithm for computing the eigenvalues/eigenvectors of large 
symmetric matrices is proposed, which employs the proposed matrix reduction technique 
with a selectable level of output accuracy. In addition to KLT applications, this eigenvector 
computer can be used in other applications that employ eigenvector 
computations.
• A comprehensive analysis of both fixed- and floating-point implementations of the 
proposed system is presented. This analysis includes a detailed comparison 
between the two approaches, which considers different design and performance aspects 
such as power consumption, hardware resources, accuracy and processing time.
• A novel architecture for adaptive lossy/lossless spectral decorrelation for 
hyperspectral image compression utilising both the KLT and the Integer KLT 
computational processes is proposed. This architecture exploits the similarities 
between the KLT and Integer KLT computational processes to achieve minimal 
hardware overhead.
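To make the computational chain behind these contributions concrete, the following is a minimal software sketch of spectral decorrelation with the KLT: the covariance matrix of the spectral bands is formed, its eigenvectors are obtained with the classical cyclic Jacobi method, and the bands are projected onto them. It is an illustration only and assumes the textbook Jacobi iteration; it is not the reduced-iteration technique or the hardware architecture proposed in this thesis. The function names (covariance, jacobi_eigen, klt) and the tolerance values are hypothetical.

```python
import math

def covariance(bands):
    """Covariance matrix of the bands; bands[i] is the list of pixels of band i."""
    n, k = len(bands[0]), len(bands)
    means = [sum(b) / n for b in bands]
    centred = [[v - m for v in b] for b, m in zip(bands, means)]
    cov = [[sum(x * y for x, y in zip(centred[i], centred[j])) / n
            for j in range(k)] for i in range(k)]
    return cov, centred

def jacobi_eigen(matrix, tol=1e-12, max_sweeps=50):
    """Classical cyclic Jacobi method for a real symmetric matrix.
    Returns (eigenvalues, v) where the columns of v are the eigenvectors."""
    k = len(matrix)
    a = [row[:] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(k) for j in range(k) if i != j)
        if off < tol:  # off-diagonal energy is negligible: converged
            break
        for p in range(k - 1):
            for q in range(p + 1, k):
                if a[p][q] == 0.0:
                    continue
                # Rotation (c, s) that annihilates a[p][q], with |angle| <= pi/4.
                tau = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, tau) / (abs(tau) + math.sqrt(1.0 + tau * tau))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = t * c
                for i in range(k):  # A <- A J (columns p and q)
                    aip, aiq = a[i][p], a[i][q]
                    a[i][p], a[i][q] = c * aip - s * aiq, s * aip + c * aiq
                for j in range(k):  # A <- J^T A (rows p and q)
                    apj, aqj = a[p][j], a[q][j]
                    a[p][j], a[q][j] = c * apj - s * aqj, s * apj + c * aqj
                a[p][q] = a[q][p] = 0.0
                for i in range(k):  # accumulate eigenvectors: V <- V J
                    vip, viq = v[i][p], v[i][q]
                    v[i][p], v[i][q] = c * vip - s * viq, s * vip + c * viq
    return [a[i][i] for i in range(k)], v

def klt(bands):
    """Project the mean-removed bands onto the eigenvectors of their covariance
    matrix, ordered by decreasing eigenvalue (principal components first)."""
    cov, centred = covariance(bands)
    eigvals, v = jacobi_eigen(cov)
    order = sorted(range(len(eigvals)), key=lambda j: -eigvals[j])
    n, k = len(bands[0]), len(bands)
    comps = [[sum(v[i][j] * centred[i][t] for i in range(k)) for t in range(n)]
             for j in order]
    return [eigvals[j] for j in order], comps

if __name__ == "__main__":
    # Two perfectly correlated toy bands: all energy should move to one component.
    values, components = klt([[1, 2, 3, 4], [2, 4, 6, 8]])
    print(values)
    print(components[1])
```

In this sketch each Jacobi rotation touches only two rows and two columns, which is the property that makes the method attractive for parallel hardware; the sweep order and convergence test used here are the simplest possible choices.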
8.3 Publications
The results of this thesis are reported in six conference publications. A list of the published
papers related to this thesis is given below.
[1] C. Egho and T. Vladimirova, “Adaptive Hyperspectral Image Compression using the 
KLT and Integer KLT algorithm”, NASA/ESA Conference on Adaptive Hardware and 
Systems (AHS-2014), July 2014, Leicester, UK.
[2] C. Egho and T. Vladimirova, “Hardware Acceleration of the Integer Karhunen-Loève 
Transform Algorithm for Satellite Image Compression”, IEEE International 
Geoscience and Remote Sensing Symposium (IGARSS 2012), July 2012, Munich, 
Germany.
[3] C. Egho, T. Vladimirova and M. Sweeting, “Acceleration of the Karhunen-Loève 
Transform for System-on-a-Chip Platform”, NASA/ESA Conference on Adaptive 
Hardware and Systems (AHS-2012), June 2012, Nuremberg, Germany.
[4] C. Egho and T. Vladimirova, “Hardware Acceleration of the Karhunen-Loève 
Transform for Compression of Hyperspectral Satellite Imagery”, The 11th Australian 
Space Science Conference (ASSC 2011), September 2011, Canberra, Australia.
[5] C. Egho and T. Vladimirova, “Eigenvectors Computation on a System-on-Chip 
Platform for Satellite On-Board Use”, 7th Jordanian International Electrical and 
Electronics Engineering Conference (JIEEEC 2011), April 2011, Amman, Jordan.
[6] C. Egho and T. Vladimirova, “Design of Low-Power Multifunctional System-on-a-Chip 
Based On-Board Controllers”, Surrey Postgraduate Research Conference, 
September 2010, Guildford, UK.
8.4 Suggestions for Future Work
Based on the efforts initiated in this research work, three main areas are proposed as a logical 
continuation for future research.
First, the ultimate objective of this work is a hyperspectral image compression system, which 
also includes the spatial decorrelation process. Therefore, investigating an architecture that 
incorporates both the spatial and the spectral decorrelation processes is a logical next step. 
Different approaches can then be explored for accelerating the overall performance of 
the system. Moreover, this can define the constraints for such a system and highlight 
new issues that affect its overall performance.
Second, the computation of the PLUS matrix factorisation is not as intensive as the other processes 
(as was shown in Chapter 6). However, since it is a highly sequential process, it was executed 
on the on-chip processor rather than in hardware. Consequently, no acceleration was applied 
to this process and its execution time could be an overhead, especially if implemented on a less 
powerful processor such as the soft NIOS II (as was shown in Chapter 7). Therefore, it would 
be valuable to investigate a factorisation process similar to PLUS but with a 
less sequential computation, so that it is more suitable for hardware implementation.
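The sequential nature of such reversible transforms can be seen in a small example. The sketch below is not the PLUS factorisation itself; it is a 2 x 2 illustration, under the assumption that the reversible integer transform is applied as a chain of triangular (shear) steps with rounding, as is commonly described for integer transforms. A plane rotation is written as three shears, each followed by rounding, which maps integers to integers and can be undone exactly. The names lifting_coeffs, forward and inverse are hypothetical.

```python
import math

def lifting_coeffs(theta):
    """Shear coefficients (p, u) such that the rotation by theta equals
    [[1, p], [0, 1]] * [[1, 0], [u, 1]] * [[1, p], [0, 1]].
    Requires sin(theta) != 0."""
    s = math.sin(theta)
    return (math.cos(theta) - 1.0) / s, s

def forward(x, y, theta):
    """Integer-to-integer approximation of a rotation by theta."""
    p, u = lifting_coeffs(theta)
    x += round(p * y)  # step 1 uses the current y
    y += round(u * x)  # step 2 needs the result of step 1
    x += round(p * y)  # step 3 needs the result of step 2
    return x, y

def inverse(x, y, theta):
    """Exact inverse: the same steps in reverse order with opposite sign."""
    p, u = lifting_coeffs(theta)
    x -= round(p * y)
    y -= round(u * x)
    x -= round(p * y)
    return x, y

if __name__ == "__main__":
    angle = math.pi / 6
    a, b = forward(1000, 0, angle)
    print((a, b), inverse(a, b, angle))
```

Each step depends on the value produced by the previous one, so the steps cannot be evaluated in parallel. This data dependency is the kind of constraint that makes a less sequential factorisation attractive for hardware implementation.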
Third, different approaches can be investigated to reduce the complexity of the KLT 
computational process from a mathematical perspective. These include the estimation of the 
covariance matrix and the eigenvectors. The resulting trade-off in computation can be 
investigated along with the trade-off in the output compression rate. In addition, the 
estimation of the trigonometric functions using look-up tables can be investigated. The 
hardware utilisation, the output accuracy and the execution time can be scrutinised in this 
analysis.
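As an indication of what the look-up table investigation could involve, the following is a minimal sketch of a nearest-entry sine/cosine table, assuming a table of 1024 entries covering one full period. The table size and the names TABLE_SIZE, SIN_TABLE, lut_sin, lut_cos and max_abs_error are hypothetical choices made for illustration and are not taken from the implemented system.

```python
import math

TABLE_SIZE = 1024  # hypothetical table depth; larger tables trade memory for accuracy
SIN_TABLE = [math.sin(2.0 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]

def lut_sin(angle):
    """Approximate sin(angle) by the nearest table entry (angle in radians)."""
    index = round(angle / (2.0 * math.pi) * TABLE_SIZE) % TABLE_SIZE
    return SIN_TABLE[index]

def lut_cos(angle):
    """Approximate cos(angle) by reusing the sine table with a quarter-period shift."""
    return lut_sin(angle + math.pi / 2.0)

def max_abs_error(samples=10000):
    """Largest deviation from math.sin over one period, sampled uniformly."""
    return max(abs(lut_sin(2.0 * math.pi * k / samples)
                   - math.sin(2.0 * math.pi * k / samples))
               for k in range(samples))

if __name__ == "__main__":
    print(max_abs_error())
```

With nearest-entry lookup the angle is quantised by at most half a table step, so the error is bounded by pi / TABLE_SIZE. Doubling the table depth therefore halves the worst-case error, which is the memory versus accuracy trade-off that such an analysis would quantify alongside execution time.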
References
[1] R. Richter, “Hyperspectral Sensors for Military Applications”, DLR, German Aerospace 
Centre Report, Remote Sensing Data Centre, 2005
[2] Mather, P.M. (1999) Computer Processing of Remotely Sensed Images. An Introduction. 
John Wiley and Sons, Chichester, UK.
[3] T. M. Lillesand, R. W. Kiefer, and J. W. Chipman, Remote Sensing and Image 
Interpretation. New Jersey, USA: John Wiley & Sons, Inc., 2008.
[4] C. Blodgett,”What is Hyperspectral Imagery (HSI)?, Missouri Resource Assessment 
Partnership (MoRAP), Missouri department of natural resources
[5] G. Yu, “An On-Board Real-Time Image Compression System for Earth Observation 
Satellites”, University of Surrey PhD thesis 2009
[6] N. R. Mat Noor, T. Vladimirova, and M. Sweeting, "High Performance Lossless 
Compression for Hyperspectral Satellite Imagery," in UK Electronics Forum, Newcastle 
University, UK, 2010, pp. 78-83.
[7] Bormin Huang, Satellite Data Compression. Springer Science & Business Media, 2011, 
ISBN: 1461411831,9781461411833.
[8] Kamisetty Ramam Rao, Patrick C. Yip, K, The Transform and Data Compression 
Handbook, CRC Press, 2000, ISBN 1420037382, 9781420037388
[9] Yun Q. Shi, Huifang Sun , Image and Video Compression for Multimedia Engineering: 
Fundamentals, Algorithms, and Standards, CRC Press, 2000, ISBN 1420049798, 
9781420049794
[10] Allen Kent, James G. Williams, Encyclopaedia of Microcomputers: Volume 24 - 
Supplement 3: Characterization Hierarchy Containing Augmented Characterizations to Video 
Compression, CRC Press, 2000, ISBN 0824727223, 9780824727222
160
[11] M. Fleury, R. P. Self and A. C. Downton, Multi-Spectral Satellite Image Processing on 
a Platform FPGA Engine, Int. Conf. on Military and Aerospace Programmable Logic Device 
(MAPL D 2005), 2005
[12] Fleury, M.; Self, B.; Downton, A., "A fine-grained parallel pipelined Karhunen-Loeve 
transform," and Distributed Processing Symposium, 2003. Proceedings.
International, vol., no., pp. 11 pp.,, 22-26 April 2003
[13] Wertz, James R., Wiley J. Larson (1999). “Space Mission Analysis and Design”, 3rd Ed.
[14] C. Tantash "Mitigation of Cosmic Radiation Effects in High-Density SRAM-Based 
FPGAs" University of Surrey MSc Dissertation 2003.
[15] A. Smale, NASA Imagine the Universe “Cosmic Rays” 
http://imagine.gsfc.nasa.gov/docs/science/know_ll/cosmic_rays.html
[16] Dyer C. S. (1998) “Space Radiation Effects for Future Technologies and Missions”, 
QINETIQ KI SPACE TR0106901/1.1
[17] V. Bothmer, “Solar corona, solar wind structure and solar particle events,” in Proc. ESA 
Workshop Space Weather Nov. 1998, 1999, ESA WPP-155, pp. 117-126
[18] Altera Corporation www.altera.com
[19] Xilinx Inc www.xilinx.com
[20] E. Grayver, Implementing Software Defined Radio, Springer 2012
[21] Total System Power UNDERSTANDING THE POWER PROFILE OF FPGAs, 2006 
Actel Corporation.
[22] Rezgui, S.; WANG, J.J.; Sun, Y.; Cronquist, B.; McCollum, J., "New Reprogrammable 
and Non-Volatile Radiation Tolerant FPGA: RTA3P," Aerospace Conference, 2008 IE E E , 
vol., no., pp. 1,11, 1-8 March 2008
[23] I. Kuon, R. Tessier, and J. Rose. FPGA architecture: survey and challenges. Foundations 
and Trends in Electronic Design Automation, 2008.
161
[24] Kevin Morris, “Xilinx vs. Altera Calling the Action in the Greatest Semiconductor 
Rivalry”, EE Journal February 2014 http://www.eejoumal.eom/archives/articles/20140225- 
rivalry/
[25] Kenneth A. Label, “What’s All this Field Programmable Gate Array (FPGA) Stuff Have 
to Do With Space”, Single-Event Effects Symposium and Military and Aerospace 
Programmable Logic Devices (SEE-MAPLD), La Jolla, CA, April 9-12, 2013, and published 
on http://nepp.nasa.gov/.
[26] “Virtex-5 FPGA Configuration User Guide” UG191 (v3.11) October 19,2012
[27] “Logic Elements and Logic Array Blocks in Cyclone IV Devices”, Cyclone TV Device 
Handbook Chapter 2, CYIV-51002-1.0, Volume 1, November 2009
[28] “ProASIC3 Flash Family FPGAs”, Revision 13, January 2013 © 2013 Microsemi 
Corporation
[29] “FPGA Logic Cells Comparison”, 1-Core Technologies online library, http://l - 
core.com/library/digital/fpga-logic-cells/Q)ga-logic-cells.pdf
[30] “Implementing Multipliers in FPGA Devices”, Application Note 306, Altera 
Corporation, July 2004, ver. 3.0
[31] S. Vassiliadis, D. Soudris, “Fine- and Coarse-Grain Reconfigurable Computing”, 
Springer 2007.
[32] Kevin Morris , “Altera Partners with Intel for 14nm Tri-Gate FPGAs”, EE Journal, 
Febmary 2013
[33] “The Breakthrough Advantage for FPGAs with Tri-Gate Technology”, White Paper, 
WP-01201-1.0, June 2013 Altera Corporation
[34] K. DeHaven ’’Extensible Processing Platform Ideal Solution for a Wide Range of 
Embedded Systems”, Xilinx Inc White paper, WP369 (vl.O) April 27, 2010
[35] S. Habinc. “Suitability of reprogrammable FPGAs in space applications”. FPGA-002-01, 
Report ESA contract No. 15102/01/NL/FM(SC) CCN-3, September 2002
162
[36] Recent Progress in Field Programmable Logic, P. Alike, 6th Workshop on Electronics 
for LHC Experiments, Krakow, Poland, September 2000
[37] SEU Mitigation Techniques for Virtex FPGAs in Space Applications, C. Carmichael et 
ah, 1999 MAPLD, Johns Hopkins University, Laurel, Maryland, USA, September 1999
[38] Current Radiation Issues for Programmable Elements and Devices, R. Katz et ah, IEEE 
Transactions on Nuclear Science, Vol. 45, December 1998
[39] Radiation Test Results of the Virtex FPGA and ZBT SRAM for Space Based 
Reconfigurable Computing, E. Fuller et al., 1999 MAPLD, Johns Hopkins University, Laurel, 
Maryland, USA, September 1999
[40] Radiation Effects on Current Field Programmable Technologies, R. Katz et al., IEEE 
Transactions on Nuclear Science, Vol. 44, No. 6, December 1997
[41] The Impact of Software and CAE Tools on SEU in Field Programmable Gate Arrays, R. 
Katz et al., 1999 IEEE Nuclear Space Radiation Effects Conference, Norfolk, Virginia, USA, 
July 1999
[42] SEU Mitigation Techniques for Virtex FPGAs in Space Applications, C. Carmichael et 
al., 1999 MAPLD, Johns Hopkins University, Laurel, Maryland, USA, September 1999
[43] Xilinx. Inc. “Xilinx TMRTOOL Product Brief’, 2009
[44] ] C. Carmichael and C. W. Tseng, “Correcting Single-Event Upsets in Virtex-4 Platform 
FPGA Configuration Memory,” Xilinx Application Note, XAPP988 (vl.O), March 13 2008.
[45] T. Schultz, “Reconfigurable Application-Specific Computing User’s Guide”, SILICON 
GRAPHICS INC 2007 (007-4718-005).
[46] G. Estrin, C. R. Viswanathan: Organization of a "Fixed-Plus-Variable" Structure 
Computer for Computation of Eigenvalues and Eigenvectors of Real Symmetric Matrices
[47] C. Bobda, “Introduction to Reconfigurable Computing: Architectures”, Springer 2007
[48] http://www.algotronix.com
163
[49] P. Garcia, K. Compton,M. Schulte, E.Blem, and W.Fu, “An Overview of Reconfigurable 
Hardware in Embedded Systems”, EURASIP Journal on Embedded Systems, Hindawi 
Publishing Corporation, vol. 2006, pp. 1-19,2006.
[50] R. O. Reynolds, P. H. Smith, L. S. Bell, and H. U. Keller, “Design of Mars lander 
cameras for Mars Pathfinder, Mars Surveyor ’98 and Mars Surveyor ’01,” IEEE Transactions 
on Instrumentation and Measurement, vol. 50, no. 1, pp. 63-71,2001.
[51] M. Kifle, M. Andro, Q. K. Tran, G. Fujikawa, and P. P. Chu, “Toward a dynamically 
reconfigurable computing and communication system for small spacecraft,” in Proceedings 
o f the 21 St International Communication Satellite System Conference & Exhibit (ICSSC ’03), 
Yokohama, Japan, April 2003.
[52] T. Todman, G. Constantinides, S. Wilton, O. Mencer, W. Luk, and P. Cheung, 
“Reconfigurable computing: architectures and design methods,” in lEE Proceedings: 
Computer & Digital Techniques, vol. 152, no. 2, March 2005, pp. 193-208.
[53] Cadence Design Systems Inc, Palladium Datasheet, 2004.
[54] Mentor Graphics, Vstation Pro: High Performance System Verification, 2003
[55] K. Compton and S. Hauck, “Reconfigurable computing: a survey of systems and 
software”, ACM Computing Surveys, Vol. 34, No. 2, pp. 171-210, June 2002.
[56] Azarian, A.; Ahmadi, M: Reconfigurable Computing Architecture Survey and 
introduction. 2nd IEEE International Conference on Computer Science and Information 
Technology 2009
[57] N. S. VOROS and K. MASSELOS “System Level Design of Reconfigurable System-on- 
Chip,” ISBN-10 0-387-26103-6, Springer, 2005
[58] Daixun Zheng, Dr. Tanya Vladimirova, Hans Tiggeler, Prof. Martin Sweeting. 
Reconfigurable single-chip on-board computer for small satellite, European Space Agency. 
July, 2006, Proceedings of DASIA
[59] Bubenhagen, B. Fiethe, J. Ilstad, H. Michalik ,P. Norridge, B. Osterloh, W. Sullivan, C, 
"Enhanced Dynamic Reconfigurable Processing Module for Future Space applications,"
164
'mInternational SpaceWire Conference, (Saint Petersburg Russia,), pp. 475-482, 
International Space Wire Conference, June 2010
[60] Xilinx. Inc. “Space-Grade Virtex-4QV Family Overview” DS653(V2.0), April 2010
[61] Y Guillemenet, L Torres, G Sassatelli, I Hassoune, “A non-volatile run-time FPGA 
using thermally assisted switching MRAMs”, FPL-2008
[62] R. Reis, M. Lubaszewski, J. Gess, “ Design of Systems on a Chip: Design and Test: 
Design and Test” Springer 2007
[63] S. Chaudhury, “System on Chip”, online video lecture by The National Programme on 
Technology Enh
[64] Michael Keating and Pierre Bricaud. Reuse Methodology Manual for System-on-a-Chip 
Designs. Kluwer Academic Publishers, Norwell, Massachusetts, June 1999.
[65] R. Rajsuman, System-on-a-Chip Design and Test, Artech House, 2000
[66] B. Al-Hashimi, “ System-on-Chip: Next Generation Electronics”. lEE Press, May 2006,
[67] “Soft CPU Cores for FPGA”, 1-Core Technologies online library, http://l- 
core.com/library/digital/soft-cpu-cores/soft-cpu-cores.pdf
[68] Aeroftex Gaisler website: www.gaisler.com
[69] ATMEL AT7913E SpaceWire Remote Terminal Controller (RTC), DATASHEET
[70] Rad-Hard 32 bit SPARC V8 Processor AT697E, DATASHEET , Rev. 4226H-AERO- 
08/11
[71] LE0N3FT-RTAX Product Sheet: http://www.gaisler.com/doc/leon3ft-
rtaxproduct_sheet.pdf
[72] LE0N4 Product Sheet: http://www.gaisler.com/doc/LEON4_32-bit_processor_core.pdf
[73] “Nios II Processor Reference Handbook”, February 2014 Altera Corporation
[74] Embedded Processor Block in Virtex-5 FPGAs, Xilinx Reference Guide, UG200 (vl.8) 
February 24, 2010
165
[75] PowerPC Processor Reference Guide UGOl 1 (vl.3) January 11,2010
[76] www.actel.com
[77] Loring Wirbel, “Altera integrates ARM Cortex-A53 in Stratix Generation 10” The EDN 
Network, October, 2013
[78] AMBA Open Specefications www.arm.com
[79] CoreConnect Technology, Xilinx.com
[80] Avalon Interface Specifications, Altera Corporation, May 2013
[81] Stephen Rusu, “ISSCC 2013: High-Performance Digital Trends”, Intel, Feb 2013
[82] Yen-Kuang Chen , S. Y. Rung, “Trend and Challenge on System-on-a-Chip Designs”, 
Journal of Signal Processing Systems, v.53 n.1-2, p.217-229, November 2008
[83] Torres L, Benoit P, Sassatelli G, Robert M, Clermidy F, Puschini D (2011) An 
introduction to multi-core system on chip trends and challenges. In: Multiprocessor system- 
on-chip: hardware design and tool integration, pp 1-21
[84] Accelerating High-Performance Computing With FPGAs, Altera White Paper, WP- 
01029-1.1, Oct 2007
[85] Teerakittikul, P., Tempesti, G. and Tyrrell, A.M., The Application of Evolvable 
Hardware to Fault Tolerant Robot Control, IEEE Symposium Series on Computational 
Intelligence, Nashville, USA, March, 2009
[86] M. T. Heideman, D. H. Johnson, and C. S. Burrus, “Gauss and the history of the fast 
Fourier transform,” IEEE ASSP Mag., vol. 1, no. 4, pp. 14-21, Oct. 1984.
[87] On-board Data Handling: OBC 386 data sheet from SSTL, 
http://microsat.sm.bmstu.ru/e-library/SSTL/Subsys_OBC386.pdf
[88] On-board Data Handling: OBC750 data sheet from SSTL,
http://www.sstl.co.uk/assets_sstl/Downloads/OBC750%20Datasheet_0135084_vl03.pdf
166
[89] XSat of NTU (Nanyang Technological University), Singapore:
http://www.sarc.eee.ntu.edu.sg/CREST/AboutXSATProject/Pages/AboutXSATProject.aspx
[90] H. Tiggler, T. Vladimirova, D. Zheng and J. Gaisler, “Experiences Designing a System-
on-a-Chip for Small Satellite Data Processing and Control”, Proceedings of Military and
Aerospace Applications of Programmable Devices and Technologies International
Conference (MAPLD’2000), P20, September 2000, Laurel, Maryland US, NASA.
[91] H. Kramer, "Earth Observation History of Technology Introduction," in Observation
of the Earth and its Environment, 4th ed Berlin: Springer-Verlag, 2002.
[92] S. Aksoy C. H. Chen Signal and Image Processing for Remote Sensing, pp.489
2006, Taylor &amp; Francis
[93] G. Shaw and H. Burke, "Spectral imaging for remote sensing", Lincoln Lab. J., vol. 
14, no. 1, pp.3 -28 2003
[94] M. Borengasser, W. S. Hungate, and R. Watkins, "Imaging Spectrometers: 
Operational Considerations," in Hyperspectral Remote Sensing: Principles and 
Applications, Q. Weng, Ed. Florida, USA: Taylor & Francis Group, 2008,
[95] R. C. Olsen, Remote sensing from air and space. Bellingham, Washington USA: SPIE 
Press 2007.
[96] W. G. Rees, William Gareth Rees, “Physical Principles of Remote Sensing”, 
Cambridge University Press 2001
[97] P. Shippert. “Introduction to hyperspectral image analysis”. Online Journal of Space
Communication, 2003. http://spacejoumal.ohio.edu/pdf/shippert.pdf
[98] K. Navulur, Multispectral Image Analysis Using the Object-Oriented Paradigm. 
Florida, USA: Taylor & Francis Group, 2007.
[99] M. E. Schaepman, S. L. Ustin, A. J. Plaza, T. H. Painter, J. Verrelst, and S. Liang, 
"Earth System Science Related Imaging Spectroscopy - An Assessment," Remote Sensing 
of Environment, vol. In Press, Corrected Proof, 2009
167
[100] M. Borengasser, W. S. Hungate, and R. Watkins, "Imaging Spectrometers: 
Operational Considerations," in Hyperspectral Remote Sensing: Principles and 
Applications, Q. Weng, Ed. Florida, USA: Taylor & Francis Group, 2008, p. 17.
[101] N. R. Mat Noor, T. Vladimirova, and M. Sweeting, "High Performance Lossless 
Compression for Hyperspectral Satellite Imagery," in UK Electronics Forum, Newcastle 
University, UK, 2010, pp. 78-83.
[102] R. C. Olsen, Remote sensing from air and space. Bellingham, Washington USA: SPIE 
Press 2007.
[103] Clark, R. N., and Swayze, G. A., 1995, Mapping minerals, amorphous materials, 
environmental materials, vegetation, water, ice, and snow, and other materials: The USGS 
Tricorder Algorithm. In Summaries of the Fifth Annual JPL Airborne Earth Science 
Workshop, JPL Publication 95-1, v. 1, pp. 39 - 40
[104] Ben-Dor, E., Patin, K , Banin, A. and Kamieli, A., 2001, Mapping of several soil 
properties using DAIS-7915 hyperspectral scanner data. A case study over clayey soils in 
Israel. International Journal of Remote Sensing (in press).
[105] Aber, J. D., and Martin, M. E., 1995, High spectral resolution remote sensing of 
canopy chemistry. In Summaries of the Fifth JPL Airborne Earth Science Workshop, JPL 
Publication 95-1, v. 1, pp. 1-4.
[106] Merton, R. N., 1999, Multi-temporal analysis of community scale vegetation stress 
with imaging spectroscopy. Ph.D. Thesis, Geography Department, University of 
Auckland, New Zealand, 492p.
[107] Schultz R A, Nielsen T, Zavaleta J R, Ruch R, Wyatt R and Garber H R 2001 
Hyperspectral imaging: a novel approach for microscopic analysis Cytometry 43 239-4
[108] M.E. Martin, M.B. Wabuyele, K. Chen, P. Kasili, M. Panjehpour, M. Phan et al. 
Development of an advanced hyperspectral imaging (HSI) system with applications for 
cancer detection Ann Biomed Eng, 34 (2006), pp. 1061-1068
[109] PCI Geomatic, Focus On hyperspectral imagery, http://www.pcigeomatics.com
168
[110] N. R. Mat Noor and T. Vladimirova, “Investigation into Lossless Hyperspectral Image 
Compression for Satellite Remote Sensing”, International Journal o f Remote Sensing, 
2012
[111] Jet Propulsion Laboratory: AVfRIS free standard data product.
http://aviris.jpl.nasa.gov/html/ aviris.freedata.html
[112] https://eol.usgs.gov/sensors/hyperion
[113] Mielikainen, J., Toivanen, P.: Lossless compression of hyperspectral images using a 
quantized index to lookup tables. Geosci. Remote Sens. Lett. 5(3), 474-478 (2008)
[114] Huo, C., Zhang, R., Peng, T.: Lossless compression of hyperspectral images based on 
searching optimal multibands for prediction. Geosci. Remote Sens. Lett. 6(2), 339-343 
(2009)
[115] Magli, E.: Multiband lossless compression of hyperspectral images. IEEE Trans. 
Geosci. Remote Sens. 47(4), 1168-1178 (2009)
[116] Kiely, A.B., Klimesh, M.A.: Exploiting calibration-induced artifacts in lossless 
compression of hyperspectral imagery. IEEE Trans. Geosci. Remote Sens. 47(8), 2672- 
2678(2009)
[117] Zhang, J., Liu, G.: An efficient reordering prediction-based lossless compression 
algorithm for hyperspectral images. Geosci. Remote Sens. Lett. 4(2), 283-287 (2007)
[118] The Consultative Committee for Space Data Systems Recommendation for Space 
Data System Standards, “Lossless Multispectral and Hyperspectral Image Compression” 
Recommended Standard, CCSCS 123.0-B-l, Blue Book, May 2012
[119] S-E Qian, "Hyperspectral data compression using a fast vector quantization 
algorithm," IEEE Transactions on Geoscience and Remote Sensing, vol. 42, pp. 1791- 
1798, 2004.
[120] Cheng-Chen L, Yin-Tsung H. Lossless Compression of Hyperspectral Images Using 
Adaptive Prediction and Backward Search Schemes [J]. Journal of Information Science 
and Engineering, 2011 27(2): 419-435
169
[122] Qian, S.-E., Bergeron, M., Cunningham, I., Gagnon, L., Hollinger, A.: Near lossless 
data compression onboard a hyperspectral satellite. IEEE Trans. Aerospace Electron. Syst. 
42(3), 851-866 (2006)
[123] S. Gupta and A. Gersho, "Feature predictive vector quantization of multispectral 
images," IEEE Transactions on Geoscience and Remote Sensing, vol. 30, pp. 491-501, 1992.
[124] G. R. Canta and G. Poggi, "Compression of multispectral images by address-predictive 
vector quantization," Signal Processing: Image Communication, vol. 11, pp. 147-159, 1997.
[125] Information technology—JPEG 2000 image coding system: Core coding system, 
ISO/me Std. 15 444-1 (2002)
[126] Information technology -  Lossless and near-lossless compression of continuous-tone 
still images -  Baseline
[127] Aranki, N.; Keymeulen, D.; Bakhshi, A.; Klimesh, M., "Hardware implementation of 
Lossless Adaptive and Scalable Hyperspectral Data compression for Space," Adaptive 
Hardware and Systems, 2009. AHS 2009. NASA/ESA Conference o n , vol., no., pp.315,322, 
July 29 2009-Aug. 1 2009
[128] Keymeulen, D.; Aranki, N.; Hopson, B.; Kiely, A.; Klimesh, M.; Benkrid, K , "GPU 
lossless hyperspectral data compression system for space applications," Aerospace 
Conference, 2012 IEEE , vol., no., pp. 1,9, 3-10 March 2012
[129] J. Venbrux, J. Gambles, D. Wiseman, G. Zweigle, W. H. Miller, and P.-S. Yeh, 
“AVLSI Chip Set Development for Lossless Data Compression,” Ninth AIAA Computing in 
Aerospace Conference, San Diego, California, October 19-21, 1993.
[130] Lossless Data Compression, Recommendation for Space Data System Standards, 
CCSDS 121.0-B-l. Blue Book. Issue 1. 7 Washington, D.C., CCSDS, May 1997. 
(http://public.ccsds.org)
[131] G. Yu, “An On-Board Real-Time Image Compression System for Earth Observation 
Satellites”, University of Surrey PhD thesis 2009
170
[132] Christophe, E., Mailhes, C., Duhamel, P.: Hyperspectral image compression: adapting 
SPIHT and EZW to anisotropic 3D wavelet coding. IEEE Trans. Image Process. 17(12), 
2334-2346 (2008)
[133] Qian, S.-E., Bergeron, M., Cunningham, I., Gagnon, L., Hollinger, A.: Near lossless 
data compression onboard a hyperspectral satellite. IEEE Trans. Aerospace Electron. Syst. 
42(3), 851-866 (2006)
[134] E. Christophe, “Hyperspectral Data Compression Tradeoff’, Centre for Remote 
Imaging, Sensing and Processing, National University of Singapore, Singapore. Springer 
2011
[135] Qian Du; Fowler, J.E., "Hyperspectral Image Compression Using JPEG2000 and 
Principal Component Analysis," Geoscience and Remote Sensing Letters, IE E E , vol.4, no.2, 
pp.201,205, April 2007
[136] Y. Man, M. He, J. Wan, “Low-Complexity Compression Algorithm for Hyperspectral 
Images Based on Distributed Source Coding”, Mathematical Problems in Engineering 
Volume 2013 (2013), Article ID 825673, 7 pages
[137] Blanes, I.; Serra, J.; Marcellin, M.; Bartrina, J.. Divide-and-conquer strategies for 
hyperspectral image processing. IEEE Signal Processing Magazine ( 2012 ) DOI:
10.1109/MSP.2011.2179416.
[138] J. A. Saghri, A. G. Tescher, and J. T. Reagan, "Practical Transform Coding of 
Multispectral Imagery," IEEE Signal Processing Magazine, vol. 12, pp. 32-43, 1995.
[139] I. Blanes and J. Serra-Sagrist&agrave; "Quality evaluation of progressive lossy-to- 
lossless remote-sensing image coding", Proc. IEEE ICIP, pp.3665 -3668 2009
[140] N. R. Mat Noor and T. Vladimirova, “Investigation into Lossless Hyperspectral Image 
Compression for Satellite Remote Sensing”, International Journal o f Remote Sensing, 2012.
[141] Blanes, L; Serra-Sagrista, J., "Clustered Reversible-KLT for Progressive Lossy-to- 
Lossless 3d Image Coding," Data Compression Conference, 2009. DCC '09. , vol., no., 
pp.233,242, 16-18 March 2009
171
[142] N. R. Mat Noor and T. Vladimirova, “Investigation into Lossless Hyperspectral Image 
Compression for Satellite Remote Sensing”, International Journal o f Remote Sensing, 2012
[143] Implementation du décorellateur multispectral—R&T Compression. Alcatel Alenia 
Space, Tech. Rep. 100137101A, Nov (2006)
[144] Thiebaut, C., Christophe, E., Lebedeff, D., Latry, C.: CNES studies of on-board 
compression for multispectral and hyperspectral images. In: SPIE, Satellite Data 
Compression, Communications, and Archiving III, vol. 6683. SPIE, August (2007)
[145] Penna, B., Tillo, T., Magli, E., Olmo, G.: A new low complexity KLT for lossy 
hyperspectral data compression. In IEEE International Geoscience and Remote Sensing 
Symposium, IGARSS’06, August (2006), pp. 3525-3528
[146] Penna, B., Tillo, T., Magli, E., Olmo, G.: Transform coding techniques for lossy 
hyperspectral data compression. IEEE Trans. Geosci. Remote Sens. 45(5), 1408-1421 (2007)
[147] Q. Du , W. Zhu , H. Yang and J. E. Fowler "Segmented principal component analysis 
for parallel compression of hyperspectral imagery", IEEE Geosci. Remote Sens. Lett., vol. 
6, no. 4, pp.713 -717 2009
[148] T. Vladimirova and A. Steffens, "Compression of Multispectral Images On-Board 
Observation Satellites," in Proceedings o f the International Conference "Space, Ecology, 
Safety" (SES '05), Varna, Bulgaria, 2005, pp. 105-110.
[149] T. Vladimirova, M. Meerman, and A. Curiel, "On-Board Compression of Multispectral 
Images for Small Satellites," in IEEE International Conference on Geoscience and Remote 
Sensing Symposium, 2006 (IGARSS 2006), Denver, Colorado, USA, 2006, pp. 3533-3536.
[150] N. R. Mat Noor and T. Vladimirova, "Integer KLT Design Space Exploration for 
Hyperspectral Satellite Image Compression", Notes in Computer Science, 2011, 
Volume 6935, pp. 661-668, G. Lee, D. Howard, and D. Slçzak (Eds.), Springer-Verlag Berlin 
Heidelberg
[151] Karhunen, K , Über lineare methoden in der wahrscheinlich-keitsrechnung. Ann. Acad. 
Sci. Fennicea, Ser. A137, 1947. (Translated by Selin, I. in “On Linear Methods in Probability 
Theory,” Doc. T-131, The RAND Corp., Santa Monica, CA, 1960.)
172
[152] Loève, M., Fonctions Aléatoires de second order, In Lévy, P., Ed., Processus 
Stochastiques et Movement Brownien, Hermann, Paris, 1948.
[153] A. Singh, and L. Eklundh, “Comparative analysis of standardised and unstandardised 
principal components analysis in remote sensing”. Int. J. on Remote Sensing, V14, pp. 1359- 
1370, 1993.
[154] J. Devaux, P. Gouton, F. Truchetet, "Application of the Karhunen-Loeve transform to 
aerial color image segmentation", Proc. of the 4th Intern. Conf. on Knowledge-based 
Intelligent Engineering Systems and Allied Technologies (KES’OO), Vol.l, Btighton, UK, 
August, 2000, pp. 373-376.
[155] Greitans, M., Aristov, V., and Laimina, T., Application of the Karhunen-Loeve 
Transformation in Bio-Radiolocation: Breath S i m u l a t i o n , Cont. Comp. Sci., 2012, vol. 
46 ,pp. 18-24
[156] A. Das, N. David, Z. Jospeh, M. Gokhan, and C. Alok. An FPGA-based network 
intrusion detection architecture. Information Forensics and Security, 3(1): 118-132, Mar 2008
[157] K. R. Rao and P. C. Yip, The Transform and Data Compression Handbook, 2001
[158] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J.Dongarra, J.D. Croz, A. 
Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, “LAPACK User’s Guide Third 
Edition,” http://www.netlib.org/lapack/lug/lapack_lug.html, Aug. 1999.
[159] L.S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon, J. Dongarra, 
S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R.C. Whaley, ScaLAPACK 
Users’ Guide, SIAM, 1997.
[160] L. N. Trefethen, D. Bau III: Numerical Linear Algebra, 1997
[161] Arbenz, P., “Lecture Notes on Solving Large Scale Eigenvalue Problems”, 
http://people.inf.ethz.ch/arbenz/ewp/Lnotes/, 2012
[162 ] B. N. Parlett, The QR algorithm. Computing Sci. Eng., 2 (2000), pp. 38-42.
[163] Rutishauser, H.: The Jacobi method for real symmetric matrices. Numerical Math 
(1966)
173
[164] David S. Watkins, Understanding the QR algorithm, SIAM Review, 24 (1982), pp. 
427-440
[165] G. H. Golub and C. F. V. Loan, Matrix Computations, 3rd ed. Baltimore, MD: John 
Hopkins University Press, 1996.
[166] J. A. George and J. W. Liu: "Householder reflections versus Givens rotations in sparse 
orthogonal decomposition". Lin. Alg. A ppl. ,  88(1987), 223-238
[167] J. Demmel and K. Veseli6. Jacobi's method is more accurate than QR. Computer 
Science Dept. Technical Report 468, Courant Institute, New York, NY, September 1989. 
(LAPACK Working Note #15).
[168 ] Fog, A.: Instruction tables: Lists of instruction latencies, throughputs and micro­
operation breakdowns for Intel and AMD CPU’s 
(2012), http://www.agner.org/optimize/instruction_tables.pdf
[169] Arbenz, P., “Lecture Notes on Solving Large Scale Eigenvalue Problems”, 
http://people.inf.ethz.ch/arbenz/ewp/Lnotes/, 2012
[170] Steven W. Smith, The scientist and engineer's guide to digital signal processing, 
California Technical Publishing, San Diego, CA, 1997
[171] A. K. Jain, Fundamentals of Digital Image Processing. Englewood Cliffs, NJ: Prentice- 
Hall, 1989.
[172] Q. Du and J. E. Fowler, “Hyperspectral image compression using JPEG2000 and 
principal component analysis,” IEEE Geoscience and Remote Sensing Letters, vol. 4, no. 2, 
pp. 201-205, April 2007.
[173] Tamhankar, H.; Fowler, J.E., "Spectral-decorrelation strategies for the compression of 
hyperspectral imagery," Geoscience and Remote Sensing Symposium, 2007. IGARSS 2007. 
IEEE International, vol., no., pp. 1041,1044, 23-28 July 2007
[174] M. Fleury, R. P. Self and A. C. Downton, Multi-Spectral Satellite Image Processing on 
a Platform FPGA Engine, Int. Conf. on Military and Aerospace Programmable Logic Device 
(MAPLD 2005), 2005
[175] Fleury, M.; Self, B.; Downton, A., "A fine-grained parallel pipelined Karhunen-Loeve 
transform," Parallel and Distributed Processing Symposium, 2003. Proceedings. 
International, 11 pp., 22-26 April 2003
[177] Penna, B.; Tillo, T.; Magli, E.; Olmo, G., "A New Low Complexity KLT for Lossy 
Hyperspectral Data Compression," Geoscience and Remote Sensing Symposium, 2006. 
IGARSS 2006. IEEE International Conference on , vol., no., pp.3525,3528, July 31 2006- 
Aug. 4 2006
[178] H. Yang, D. Qian, W. Zhu, J. E. Fowler, and I. Banicescu, "Parallel Data Compression 
for Hyperspectral Imagery," in IEEE International Conference on Geoscience and Remote 
Sensing Symposium, 2008 (IGARSS 2008), Boston, USA, 2008, pp. 986-989.
[179] Blanes, I.; Serra-Sagrista, J., "Cost and Scalability Improvements to the 
Karhunen-Loêve Transform for Remote-Sensing Image Coding," Geoscience and Remote Sensing, 
IEEE Transactions on, vol. 48, no. 7, pp. 2854-2863, July 2010.
[180] Actel SmartFusion Evaluation Kit, 
http://www.microsemi.com/products/fpga-soc/design-resources/dev-kits/smartfusion/smartfusion-evaluation-kit
[181] Altera DE2-115 Development and Education Board,
http://www.altera.com/education/univ/materials/boards/de2-115/unv-de2-115-board.html
[182] Embedded Multipliers in Cyclone IV Devices, Cyclone IV Device Handbook,
Volume 1, February 2010
[183] Cyclone IV Device Handbook, Volume 1, April 2014 Altera Corporation
[184] SmartFusion Customizable System-on-Chip (cSoC) Data Sheet, Microsemi 
Corporation, October 2013
[185] T. J. Herron, K. M. Reddy, R. Garg, and K. Devanahalli, “Eigen decompositions of 
covariance matrices on a fixed point DSP,” presented at the Nat. Conf. Commun., Indian Inst. 
Technol., Madras, India, 2003.
[186] R. P. Brent, F. T. Luk, and C. Van Loan, “Computation of the singular value 
decomposition using mesh-connected processors,” J. VLSI Comput. Syst., vol. 1, no. 3, pp. 
242-270, 1983.
[187] Brent RP, Luk FT. The solution of singular-value and symmetric eigenvalue problems 
on multiprocessor arrays. SIAM J. Sci. Stat. Comput. 1985;6:69-84.
[188] I. Bravo et al., "Novel HW Architecture Based on FPGAs Oriented to Solve the Eigen 
Problem", IEEE Trans. VLSI Systems, vol. 16, no. 12, Dec, 2008
[189] T. Wang, P. Wei, “Hardware Efficient Architectures of Improved Jacobi Method to 
Solve the Eigen Problem”, Computer Engineering and Technology (ICCET), Apr, 2010
[190] Ma, Weiwei., Kaye, M., Luke, D. and Doraiswami, R. 2006. An FPGA-Based Singular 
Value Decomposition Processor. In Proc. of Canadian Conference on Electrical and 
Computer Engineering.
[191] Bravo, I.; Mazo, M.; Lazaro, J.L.; Gardel, A.; Jimenez, P.; Pizarro, D. Intelligent 
Architecture Based on FPGAs Designed to Detect Moving Objects by Using PCA. Sensors 
2010, 10, 9232-9251.
[192] H. Dawid and H. Meyr, “CORDIC algorithms and architectures,” chapter 24, 1999.
[193] Using Nios II Floating-Point Custom Instructions, February 2010 Altera Corporation
[194] Nios II Processor Reference Handbook, Chapter 2 “Processor Architecture”, May 2011 
Altera Corporation
[195] Introduction to Megafunctions IP Cores, User Guide, May 2013 Altera Corporation
[196] Floating-Point Megafunctions, User Guide, November 2013 Altera Corporation
[197] P. Hao and Q. Shi, "Reversible Integer KLT for Progressive-to-Lossless Compression 
of Multiple Component Images," in Proceedings of the International Conference on Image 
Processing, 2003 (ICIP 2003), Barcelona, Spain, 2003, pp. I-633-6.
[198] P. Hao and Q. Shi, "Matrix Factorizations for Reversible Integer Mapping," IEEE 
Transactions on Signal Processing, vol. 49, pp. 2314-2324, 2001.
[199] L. Galli and S. Salzo, "Lossless Hyperspectral Compression Using KLT," in IEEE 
International Conference on Geoscience and Remote Sensing Symposium, 2004 (IGARSS 
2004), Anchorage, Alaska, USA, 2004, pp. 313-316.
[200] Pengwei Hao, "Customizable Triangular Factorizations of Matrices", Linear Algebra 
and Its Applications, Vol. 382, pp. 135-154, May 2004
[201] L. Wang, J. Wu, L. Jiao, G. Shi, Lossy-to-lossless hyperspectral image compression 
based on multiplierless reversible integer TDLT/KLT, IEEE Geoscience and Remote Sensing 
Letters 6 (3) (2009) 587-591
[202] N. R. Mat Noor and T. Vladimirova, “Parallel Implementation of Lossless Clustered 
Integer KLT Using OpenMP”, Proc. of 7th NASA/ESA Conference on the Adaptive 
Hardware and Systems (AHS-2012), 25-28 June 2012, Nuremberg, Germany.
[203] L. Xin, G. Lei, and L. Zhen, "Reversible Integer Principal Component Transform for 
Hyperspectral Imagery Lossless Compression," in IEEE International Conference on Control 
and Automation, 2007 (ICCA 2007), Guangzhou, China, 2007, pp. 2968-2972.
[204] I. Blanes and J. Serra-Sagrista, "Clustered Reversible-KLT for Progressive Lossy-to- 
Lossless 3D Image Coding," in Data Compression Conference, 2009 (DCC '09), Snowbird, 
Utah, 2009, pp. 233-242.
[205] B. Penna, T. Tillo, E. Magli, and G. Olmo, “Transform Coding Techniques for Lossy 
Hyperspectral Data Compression,” IEEE Geosci. Remote Sens., vol. 45, no. 5, pp. 1408-1421, 
May. 2007.
[206] L. Wang, J. Wu, L. Jiao and G. Shi "Lossy-to-lossless hyperspectral image 
compression based on multiplierless reversible integer TDLT/KLT", IEEE Geosci. Remote 
Sens. Lett., vol. 6, no. 3, pp. 587-591, 2009
[207] L. Wang , J. Wu , L. Jiao and G. Shi "3D medical image compression based on 
multiplierless low-complexity RKLT and shape-adaptive wavelet transform", Proc. 
ICIP, pp.2521 -2524 2009
[208] L. Wang , J. Wu , L. Jiao and Zhang, S. “Three-Dimensional Medical Image 
Compression Based on Low-Complexity RKLT”, Electronics Letters (Volume:46, Issue: 6 ) 
March 2010
[209] W. Yodchanan, S. Oraintara, T. Tanaka, and K. R. Rao, "Lossless Multi-Channel EEG 
Compression," in IEEE International Symposium on Circuits and Systems, 2006 (ISCAS 
2006), Island of Kos, Greece, 2006, pp. 4 pp.-1614.
[210] Peter Yiannacouras, Jonathan Rose, and J. Gregory Steffan. The microarchitecture of 
FPGA-based soft processors. In CASES '05: Proceedings of the 2005 International 
Conference on Compilers, Architectures and Synthesis for Embedded Systems, pages 202- 
212, New York, NY, USA, 2005. ACM Press
[211] Dyer C. S. (1998) “Space Radiation Effects for Future Technologies and Missions”, 
QINETIQ KI SPACE TR0106901/1.1
[212] S. Maqbool, “A system-level supervisory approach to mitigate single event functional 
interrupts in data handling architectures.” University of Surrey PhD thesis 2006
[213] R H. Maurer, M E. Fraeman, M N. Martin, and D R. Roth "Harsh Environments: Space 
Radiation Environment, Effects, and Mitigation" Johns Hopkins APL Technical Digest, 
Volume 28, Number 1 (2008)
[214] Allenspach M, Mouret I, Titus JL, Wheatley Jr CF, Pease RL, Brews JR, Schrimpf RD, 
Galloway KF. Single-event gate-rupture in power MOSFETs: prediction of breakdown biases 
and evaluation of oxide thickness dependence. IEEE Trans Nucl Sci 1995;42:1922-7
[215] Liu, S.; Boden, M.; Girdhar, D.A.; Titus, J.L., "Single-Event Burnout and Avalanche 
Characteristics of Power DMOSFETs," Nuclear Science, IEEE Transactions, vol.53, no.6, 
pp.3379,3385, Dec. 2006
[216] D. Barnhart, “Very small satellite design for space sensor networks”. University of 
Surrey PhD thesis 2008
[217] A. Pavlov and M. Sachdev. “CMOS SRAM Circuit Design and Parametric Test in 
Nano-Scaled Technologies”. Springer US, 233 Spring Street New York, NY 10013, 2008
[218] P. Geremia “Cyclic Redundancy Check Computation: An Implementation Using the 
TMS320C54x”, Texas Instruments Application Report SPRA530, April 1999.
[219] Charles B. Cameron, “Hamming Codes”, United States Naval Academy, 
Microprocessor-Based Digital Design Module Materials, Spring 2008
[220] M.S. Hodgart and H. Tiggeler, "A (16,8) Error Correcting Code (t=2) for Critical 
Memory Applications", Proceedings of Data Systems Aerospace 'DASIA2000', Montreal, 
Canada, 22-26 May 2000.
[221] B. Sklar, “Reed-Solomon Codes”, Prentice Hall, 2001
[222] S. Habinc, “Functional Triple Modular Redundancy (FTMR) VHDL Design 
Methodology for Redundancy in Combinatorial and Sequential Logic”, Gaisler Research, 
FPGA-003-01, December 2002
[223] C. Egho and T. Vladimirova. “Eigenvectors Computation on a System-on-Chip 
Platform for Satellite On-Board Use”, 7th Jordanian International Electrical and Electronics 
Engineering Conference, (JIEEEC 2011). April 2011, Amman, Jordan.
Appendix A
Radiation Effects on Microelectronics
The effects of space radiation can be categorized into three types: Single Event 
Effects (SEE), Multiple Bit Upset (MBU) and Total Ionising Dose (TID) Effects [211].
A.1 Total Ionising Dose Effects (TID)
Total Ionising Dose is the cumulative amount of radiation that the device receives. 
Therefore, for any space mission the TID can be estimated from the mission’s lifetime 
and orbit [212].
A.2 Single Event Effect
A Single Event Effect (SEE) is the interaction between a single particle (proton, neutron 
or heavy ion) and a semiconductor device that causes a transient or permanent effect. SEEs 
can be classified into two categories: non-destructive SEE and destructive SEE.
A.2.1 Non-destructive Single Event
There are three main types of non-destructive single event:
1. Single Event Upset (SEU)
The impact of a charged particle (heavy ion, proton or electron) on any sensitive 
bi-stable element, such as a flip-flop, can change the logical state of the element; this 
change is defined as a single event upset (SEU). In the space environment, many 
electronic components are vulnerable to SEU, such as microprocessors, power 
transistors and memory cells. Two main factors determine the vulnerability to 
SEU: the first is the minimum amount of energy needed to produce the upset, and the 
second is a function of the surface area of the SEU-sensitive nodes. SEU is 
considered a frequency-independent, non-destructive soft error; when a 
memory element is struck by an SEU, rewriting the data solves the problem.
2. Single Event Transient (SET)
A Single Event Transient (SET) is a frequency-dependent, non-destructive soft error that 
occurs when a charged particle strikes a sensitive node within the combinational 
logic, leading to a voltage disturbance which might propagate to other parts [212]. 
If the voltage disturbance affects the clock or an asynchronous reset, incorrect 
data will be loaded, causing serious consequences. Moreover, when this occurs 
on the data line at the clock edge, incorrect data will be stored in the memory, as in 
the case of SEU.
3. Single Event Functional Interrupt SEFI
When SEE occurs in some complex devices such as microprocessors or flash 
memories, these devices might show some unpredictable behaviour, which is referred 
to as Single Event Functional Interrupt SEFI [212].
A.2.2 Destructive Single Event
1. Single Event Latch-up (SEL)
A CMOS structure can contain parasitic bipolar junction transistors (BJTs) formed by the 
closely located CMOS structures. When a charged particle strikes the device, a 
small current might be introduced into the base region, causing the positive feedback 
loop (the base and the collector) to severely increase the current; this high-current 
state is known as Single Event Latch-up (SEL). SEL may or may not permanently 
damage the device, so it is considered a destructive error. Permanent SEL damage 
can be prevented by limiting the current or by switching the power off as soon as 
the high current is detected [213].
2. Single Event Hard Error
Memory devices can experience Single Event Hard Errors, which can be defined as a 
permanent mechanism failure caused by a heavy ion, neutron or proton [212].
3. Single Event Gate Rupture
In a space environment, when a heavy ion strikes the neck region of a MOSFET cell, a 
Single Event Gate Rupture will occur leading to the failure of the MOSFET [214].
4. Single Event Burnout
When a heavy ion passes through a MOSFET, it can generate a transient current, 
which can turn on the parasitic bipolar transistor inherent in the MOSFET structure. 
This can eventually lead to a complete device failure, which is characterised as a single 
event burnout [215].
A.3 Multiple Bit Upset (MBU)
A Multiple Bit Upset occurs when a single energetic particle strikes the device and 
energises two or more adjacent cells; this causes multiple SEUs or SETs as the particle 
passes through the device [212].
A.4 The Mitigation Techniques
In the past decades, various mitigation techniques have been developed to handle or rectify 
the problems caused by space radiation. Many of these techniques were developed 
from ones already used in communication systems; others were proposed, developed 
and implemented specifically for space applications. However, some techniques 
are only efficient for certain radiation effects, so several techniques are usually 
combined in a given satellite system.
A.4.1 Mitigation Techniques for TID
Total Ionizing Dose in space causes gradual system degradation, which in turn 
increases the power consumption [8]. The techniques below have been developed to reduce the 
total ionizing dose absorbed by the system; however, some of them can also 
play an important role in reducing other radiation effects. Three main techniques are 
commonly used to reduce the effects of the TID:
• Shielding: Tantalum and tungsten are commonly used as shielding materials; they 
are high-Z materials with a much higher density than aluminium, so thinner 
shielding can be used to reduce the weight penalty of this technique. Shielding 
efficiently lessens the impact of the electron and low-energy proton dose; however, the 
single event effects of high-energy cosmic rays cannot be prevented by this 
technique, and thick shielding can even increase these effects [8].
• De-rating and conservative circuit design: This technique is based on the 
concept of power and functional de-rating of the circuit, which has been proven 
to reduce the effects of total ionizing dose. However, this technique is limited by the 
mission’s requirements and specifications [8].
• Radiation Hardening by Design (RHBD): This technique is implemented by using 
unconventional layout techniques at the transistor and the circuit level. It has been 
used for applications in high-radiation environments such as nuclear power plants 
and space applications and has been proven effective in terms of radiation protection. 
In addition to reducing the TID effects, this technique has been proven to increase 
the immunity of the system against SEE. However, implementing RHBD 
requires more area and, more importantly, increases the power 
consumption. As asynchronous logic can decrease the power 
consumption by replacing the high-speed clock with a hand-shaking mechanism, 
combining this mechanism with RHBD can compensate for the increased power 
consumption. A recent study showed that by combining these two techniques, 
effective radiation hardening could be achieved with a smaller power consumption 
penalty [216].
A.4.2 Mitigation Techniques for SEE and MBU
This category can be classified into three types depending on the targeted hardware: 
memories, microprocessors and FPGAs. As the last one will be investigated later in this 
chapter, only the first two will be discussed in this report.
A.4.2.1 Memory applications
There are many techniques for memory applications; some can only detect errors, while 
others can detect and correct them. All these techniques can be implemented in hardware or 
in software. A hardware implementation is faster, but it 
increases the power, weight and area penalties. A software implementation, on the other 
hand, is slower and reduces the computer availability for 
other computational tasks.
Error Detection Codes:
• Parity Check: This is one of the simplest techniques and it is commonly used in 
many communication protocols. It is implemented by appending an extra bit 
(parity bit) to each word; this bit represents the parity of the word (odd or even 
number of 1’s). Therefore, only words with an odd number of bit errors can be detected 
[217].
• Cyclic Redundancy Check (CRC): This technique is based on polynomial arithmetic 
and is implemented by appending the remainder of the division of the code-word by a 
predetermined polynomial (key polynomial). Designing this technique is more 
complicated than the parity check, but it is more robust; in order to maintain high 
efficiency in error detection, a higher degree is required for the key polynomial, which 
means more complexity in the design [218].
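The two error-detection schemes above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation described in [217] or [218]: the function names are hypothetical, and the key polynomial x^8 + x^2 + x + 1 (0x07) is an assumed example rather than one taken from this work. The sketch shows that a single parity bit flags an odd number of flipped bits but misses an even number, while the CRC appends the remainder of a polynomial division.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 when the word holds an odd number of 1s."""
    return bin(word).count("1") % 2


def parity_ok(word: int, stored_parity: int) -> bool:
    """True when the recomputed parity matches the stored parity bit."""
    return parity_bit(word) == stored_parity


def crc8(data: bytes, poly: int = 0x07) -> int:
    """Remainder of data(x) * x^8 divided by the key polynomial (MSB first).

    poly omits the leading x^8 term; 0x07 is an assumed, illustrative key
    polynomial. The register starts at zero and no final XOR is applied.
    """
    reg = 0
    for byte in data:
        reg ^= byte
        for _ in range(8):
            if reg & 0x80:
                reg = ((reg << 1) ^ poly) & 0xFF
            else:
                reg = (reg << 1) & 0xFF
    return reg


if __name__ == "__main__":
    word = 0b10110010
    p = parity_bit(word)
    print("1-bit error detected:", not parity_ok(word ^ 0b100, p))
    print("2-bit error detected:", not parity_ok(word ^ 0b110, p))
    message = b"SEU"
    print("CRC-8 remainder:", hex(crc8(message)))
```

Because the register starts at zero, dividing the message with its remainder appended leaves a zero remainder, which is how a receiver would check the word.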
Error Detection and Correction Codes:
• EDAC Hamming Code (12, 8): A linear error correction technique, which appends 4 
parity bits to each 8-bit byte. When this byte is called by the CPU, the EDAC 
block processes the 12-bit code word, corrects the 8-bit byte and sends it to the 
CPU. The EDAC block does not write the corrected byte back into the memory, 
so in order to correct the memory data, all data should be read and written by the 
EDAC block periodically; this process is called memory washing or scrubbing. As it 
can detect and correct any single-bit error in each word, this technique is highly 
efficient for SEU, but MBU and SHE can defeat the code [219].
• Modified Hamming-like (16, 8): In order to increase the efficiency of the 
Hamming code (12, 8), SSC has developed a modified Hamming-like (16, 8) code, 
which is capable of detecting and correcting 2 errors in a word, and has been 
adequately robust for Surrey’s current satellite program memory systems [220].
• Block error codes such as Reed-Solomon (RS) codes: This technique is more powerful 
than the Hamming code; it generates L + K words for each K-word code, where L is 
the number of parity words. Using this technique, L/2 errors for each K-word code 
can be corrected. NASA has developed the RS (255,223) code on a single IC, which can 
correct 16 consecutive bytes in error in any 223-byte block [221].
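The Hamming (12, 8) scheme and the memory washing it relies on can be sketched as follows. This is a minimal sketch that assumes the textbook bit layout (parity bits at code-word positions 1, 2, 4 and 8); the layout actually used by the EDAC block in [219] is not given here, and all names are hypothetical.

```python
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]  # 1-based positions holding data bits
PARITY_POS = [1, 2, 4, 8]               # 1-based positions holding parity bits


def encode(byte: int) -> list:
    """Return a 12-bit code word as a list (index 0 is unused)."""
    bits = [0] * 13
    for i, pos in enumerate(DATA_POS):
        bits[pos] = (byte >> i) & 1
    for p in PARITY_POS:
        # Parity bit p covers every position whose index has bit p set.
        bits[p] = sum(bits[k] for k in range(1, 13) if k & p and k != p) % 2
    return bits


def syndrome(bits: list) -> int:
    """0 for a clean word; otherwise the position of a single flipped bit."""
    s = 0
    for p in PARITY_POS:
        if sum(bits[k] for k in range(1, 13) if k & p) % 2:
            s |= p
    return s


def decode(bits: list) -> tuple:
    """Correct at most one flipped bit and return (byte, syndrome)."""
    bits = list(bits)
    s = syndrome(bits)
    if 1 <= s <= 12:
        bits[s] ^= 1
    byte = sum(bits[pos] << i for i, pos in enumerate(DATA_POS))
    return byte, s


def scrub(memory: list) -> None:
    """Memory washing: read, correct and rewrite every stored code word."""
    for addr, word in enumerate(memory):
        byte, _ = decode(word)
        memory[addr] = encode(byte)


if __name__ == "__main__":
    stored = encode(0xA5)
    stored[6] ^= 1                      # emulate a single event upset
    print(decode(stored))               # the byte is recovered, syndrome is 6
```

Flipping two bits of the same word gives a syndrome that points at a third position, so the decoder miscorrects; this is the sense in which a multiple bit upset defeats the code.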
A.4.2.2 Microprocessors
Fault Detection:
Fault detection techniques can be performed online or offline. When performed while the 
system is running, they are online detection techniques; offline 
techniques, on the other hand, require halting the functional operations of the system to 
perform the detection procedure.
• Watchdog timer: This technique is used in most microprocessors; it is utilised to 
prevent the processor from entering infinite loops caused by a software or hardware 
bug. The processor generates a signal periodically to acknowledge its normal operation; 
if this signal is not generated for longer than the predefined period, a recovery 
action is taken by resetting the program counter. When an external device 
supervises this procedure, it is considered an active watchdog; otherwise, it is 
considered a passive watchdog [212].
• Lockstep: Lockstep is implemented by duplicating the logic: two identical 
processors are used, both initialized at the start-up of the system and 
fed exactly the same inputs. Thus, both processors work in precise concurrency 
during normal operation, and by XORing their outputs an error flag can be asserted 
in case of any difference in the outputs. The main drawback of this technique 
is the clock skew resulting from the TID, which can lead to a relative lack of 
concurrency in the responses [212].
• Built-in Testing (BIT): When active BIT is performed, the system halts its normal 
operation and a test pattern is executed; the result is then compared with the expected one. 
Passive BIT, on the other hand, monitors the system performance while it is in 
operation [212].
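The watchdog and lockstep mechanisms described above can be illustrated with a small software model. This is only a sketch under simplifying assumptions: time is counted in abstract ticks, the class and function names are hypothetical, and the recovery action is reduced to counting resets.

```python
class Watchdog:
    """Tick-based watchdog: fires when no kick arrives within the timeout."""

    def __init__(self, timeout_ticks: int):
        self.timeout = timeout_ticks
        self.counter = 0
        self.resets = 0

    def kick(self) -> None:
        """Acknowledge signal generated periodically by a healthy processor."""
        self.counter = 0

    def tick(self) -> bool:
        """Advance one tick; return True when a recovery reset is triggered."""
        self.counter += 1
        if self.counter > self.timeout:
            self.resets += 1
            self.counter = 0
            return True
        return False


def lockstep_error(out_a: int, out_b: int) -> bool:
    """XOR the outputs of two lockstepped processors; non-zero means a fault."""
    return (out_a ^ out_b) != 0


if __name__ == "__main__":
    wd = Watchdog(timeout_ticks=3)
    wd.kick()
    print([wd.tick() for _ in range(4)])   # the processor has stopped kicking
    print(lockstep_error(0x5A, 0x5B))      # outputs differ in one bit
```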
Fault Handling:
Fault handling can be implemented in hardware or in software.
Hardware:
• N-Modular Redundancy (NMR): This technique is also used in some communications 
protocols, and it is based on a voting scheme. Three or more units execute the same 
operation and the decision is made based on the majority of the outputs [222]. NMR 
can handle transient and permanent faults; however, the major drawback of this 
technique is the hardware and power consumption overhead.
• Dynamic Redundancy: While two processors are used in this technique, only one 
is operating at a given time. When a fault is detected, the other processor 
takes over. Therefore, this technique requires a fault detection mechanism to decide 
which processor is operating [212].
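The voting scheme behind N-modular redundancy can be sketched as below. The bitwise form for three units is the usual majority function; the general N-unit voter and both function names are illustrative assumptions, not the FTMR design of [222].

```python
from collections import Counter


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant outputs (triple modular redundancy)."""
    return (a & b) | (a & c) | (b & c)


def nmr_vote(outputs: list):
    """Return the value produced by a strict majority of units, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


if __name__ == "__main__":
    # One of three units suffers an upset; the voter masks it.
    print(bin(tmr_vote(0b1010, 0b1010, 0b0110)))
    print(nmr_vote([7, 7, 9, 7, 3]))
```

The bitwise voter also masks upsets that hit different bits in different units, because each bit position is voted on independently.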
Software:
• Multi-version Software: This technique is based on a voting scheme among N versions of 
the software, each of which is based on a completely different methodology and has 
been through a different development cycle. When multi-version software is 
used with the N-modular hardware technique, a highly robust architecture can be 
obtained [212].
• Recovery Blocks: This technique is implemented by breaking down the software 
into many blocks. Each of these blocks has a test routine in addition to two 
functional routines performing the same task using different methodologies [212].
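A recovery block can be sketched as a test routine guarding two functional routines. The example below is hypothetical throughout: the task (integer square root), the two routines and the emulated fault are chosen only to make the control flow visible.

```python
import math


def recovery_block(x, routines, acceptance_test):
    """Run the routines in order until one passes the acceptance test."""
    for routine in routines:
        try:
            result = routine(x)
        except Exception:
            continue                      # a crash counts as a failed routine
        if acceptance_test(x, result):
            return result
    raise RuntimeError("no routine passed the acceptance test")


def isqrt_faulty(n: int) -> int:
    """Primary routine with an emulated upset: the result has a flipped bit."""
    return math.isqrt(n) ^ 1


def isqrt_bisect(n: int) -> int:
    """Alternate routine using a different methodology (bisection)."""
    lo, hi = 0, n + 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid
    return lo


def is_isqrt(n: int, r: int) -> bool:
    """Test routine: r is the integer square root of n."""
    return r * r <= n < (r + 1) * (r + 1)


if __name__ == "__main__":
    print(recovery_block(49, [isqrt_faulty, isqrt_bisect], is_isqrt))
```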
Appendix B
MATLAB Simulations
B.1 Jacobi Algorithm Convergence
In order to assess the convergence of the Jacobi algorithm, different matrix sizes and 
different input data bit-lengths were considered in MATLAB simulations, as shown in 
Figures B.1, B.2, B.3 and B.4.
Figure B.1: The Jacobi Convergence of different data-width of 8x8 matrix
(curves: 8-bit, 16-bit and 24-bit; horizontal axis: iterations, up to 200)
Figure B.2: The Jacobi Convergence of different data-width of 16x16 matrix
(curves: 8-bit, 16-bit and 24-bit; horizontal axis: iterations, up to 1200)
Figure B.3: The Jacobi Convergence of different data-width of 32x32 matrix
(curves: 8-bit, 16-bit and 24-bit; horizontal axis: iterations, up to 4000)
Figure B.4: The Jacobi Convergence of different data-width of 64x64 matrix
(curves: 8-bit, 16-bit and 24-bit; horizontal axis: iterations, up to 18000)
As shown in the above figures, the Jacobi algorithm converges faster for smaller 
matrices; in addition, the convergence becomes slower as the input data bit-length increases.
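For readers who want to reproduce this kind of convergence experiment, the following is a minimal floating-point sketch of the cyclic Jacobi eigenvalue algorithm in Python. It is not the MATLAB code used for the figures: the stopping tolerance, the sweep limit and all names are assumptions, and it counts plane rotations as iterations.

```python
import math


def off_norm(a):
    """Frobenius norm of the off-diagonal part of a square matrix."""
    n = len(a)
    return math.sqrt(sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j))


def jacobi_eigen(matrix, tol=1e-9, max_sweeps=50):
    """Cyclic Jacobi for a real symmetric matrix.

    Returns (eigenvalues, eigenvector columns, number of rotations).
    """
    n = len(matrix)
    a = [list(map(float, row)) for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    rotations = 0
    for _ in range(max_sweeps):
        if off_norm(a) < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle theta with tan(2*theta) = 2*a_pq / (a_qq - a_pp)
                if diff == 0.0:
                    theta = math.copysign(math.pi / 4, a[p][q])
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):          # A <- A * J (columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):          # A <- J^T * A (rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):          # accumulate eigenvectors V <- V * J
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
                rotations += 1
    return [a[i][i] for i in range(n)], v, rotations


if __name__ == "__main__":
    values, vectors, count = jacobi_eigen([[4, 1, 2], [1, 3, 0], [2, 0, 5]])
    print(sorted(values), count)
```

Each sweep of an n x n matrix performs up to n(n-1)/2 rotations, which is one reason the iteration counts in the figures grow quickly with the matrix size.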
B.2 Fixed Point Error Simulation
In order to evaluate the output error of the fixed-point implementation of the Jacobi 
algorithm, different data sets of 8 and 16 spectral bands from the AVIRIS Cuprite and the 
Hyperion Boston hyperspectral images were considered in MATLAB simulations, as shown 
in Figures B.5, B.6, B.7 and B.8.
Figure B.5: The Maximum Output Errors of the Eigenvectors for Fixed-point Computation 
(AVIRIS Cuprite 8 bands)
(curves: 12-bit, 16-bit, 20-bit and 24-bit; horizontal axis: sweeps)
Figure B.6: The Maximum Output Errors of the Eigenvectors for Fixed-point Computation 
(AVIRIS Cuprite 16 bands)
(curves: 12-bit, 16-bit, 20-bit and 24-bit; horizontal axis: sweeps)
Figure B.7: The Maximum Output Errors of the Eigenvectors for Fixed-point Computation 
(Hyperion Boston 8 bands)
(curves: 12-bit, 16-bit, 20-bit and 24-bit; horizontal axis: sweeps)
Figure B.8: The Maximum Output Errors of the Eigenvectors for Fixed-point Computation 
(Hyperion Boston 16 bands)
(curves: 12-bit, 16-bit, 20-bit and 24-bit; horizontal axis: sweeps)
Different fractional bit-lengths were considered in the simulations. As shown in Figures B.5, 
B.6, B.7 and B.8, the output error is larger for a larger number of spectral bands. This can be 
explained by the iterative nature of the Jacobi algorithm, where larger matrices require more 
iterations and hence a larger number of operations.
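The effect of the fractional bit-length on a single stored value can be illustrated with a short Python sketch. It only shows the rounding error of one quantisation step, which is bounded by half of the least significant bit; the errors in the figures also include the accumulation over many Jacobi operations, which this sketch does not model. The helper names and the sample values are illustrative assumptions.

```python
def quantize(x: float, frac_bits: int) -> float:
    """Round x to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def max_quantization_error(values, frac_bits: int) -> float:
    """Largest absolute rounding error over a set of values."""
    return max(abs(quantize(v, frac_bits) - v) for v in values)


if __name__ == "__main__":
    sample = [0.2100, 0.3691, -0.1896, 0.5040, -0.9485]
    for bits in (12, 16, 20, 24):
        err = max_quantization_error(sample, bits)
        bound = 2.0 ** -(bits + 1)
        print(f"{bits} fractional bits: max error {err:.2e}, bound {bound:.2e}")
```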
Appendix C
Hardware Prototyping Platforms
C.1 Altera CYCLONE IV DE2-115 Board
The Altera DE2-115 Development and Education board was designed by professors, for 
professors. It is an ideal vehicle for learning about digital logic, computer organization, and 
FPGAs. Featuring an Altera Cyclone® IV 4CE115 FPGA, the DE2-115 board is designed for 
university and college laboratory use. It is suitable for a wide range of exercises in courses on 
digital logic and computer organization, from simple tasks that illustrate fundamental 
concepts to advanced designs.
Figure C.1: The Altera DE2-115 Board
Board Specifications
FPGA: Cyclone IV EP4CE115F29C7 with EPCS64 64-Mbit serial configuration device
I/O Interfaces:
• Built-in USB-Blaster for FPGA configuration
• Line In/Out, Microphone In (24-bit Audio CODEC)
• Video Out (VGA 8-bit DAC)
• Video In (NTSC/PAL/Multi-format)
• RS232
• Infrared input port
• PS/2 mouse or keyboard port
• Two 10/100/1000 Ethernet
• USB 2.0 (type A and type B)
• Expansion headers (one 40-pin header)
• HSMC high-speed header
Memory:
• 128 MB SDRAM, 2 MB SRAM, 8 MB Flash
• SD memory card slot
Clocks:
• 50 MHz clock
• External SMA clock input
• External SMA clock output
FPGA Specifications
Device | Logic Elements | M9K memory blocks | Embedded memory (Kbits) | 9-bit x 9-bit multipliers | PLLs
EP4CE115 | 114480 | 432 | 3888 | 532 | 4
C.2 Actel Smartfusion Kit
SmartFusion System on Chip (SoC) FPGAs are the only devices that integrate an FPGA 
fabric, ARM Cortex-M3 Processor, and programmable analog circuitry, offering the benefits 
of full customization and IP protection, while still being easy to use. Based on a proprietary 
flash process, SmartFusion SoC FPGAs are ideal for hardware and embedded designers who 
need a true system-on-chip (SoC) solution that gives more flexibility than traditional fixed- 
function microcontrollers without the excessive cost of soft processor cores on traditional 
FPGAs.
Figure C.2: The Smartfusion Kit
Specifications
FPGA Fabric
Based on Microsemi’s proven ProASIC3 architecture
60,000 to 500,000 system gates with 350 MHz system performance
Embedded SRAMs and FIFOs
Variable aspect ratio 4,608-bit SRAM blocks 
x1, x2, x4, x9 and x18 organizations 
True dual-port SRAM (including x18)
Up to 128 FPGA I/Os supporting LVDS, PCI, PCI-X and LVTTL/LVCMOS 
standards
Microcontroller Subsystem (MSS)
Hardware industry-standard 100 MHz, 32-bit ARM Cortex-M3 CPU
Multi-layer AHB communication matrix with up to 16 Gbps throughput
10/100 Ethernet MAC with RMII interface
Two of each: SPI, I2C, UART, 32-bit timers
Up to 512 KB flash and 64 KB of SRAM
External memory controller (EMC)
8-channel DMA controller 
Up to 41 MSS I/Os with Schmitt trigger inputs 
25 I/Os can be used as FPGA I/Os 
Programmable Analog
High-performance analog signal conditioning blocks (SCB) with voltage, current and 
temperature monitors
Analog compute engine (ACE) offloads CPU from analog initialization and 
processing of analog-to-digital conversion (ADC), digital-to-analog conversion 
(DAC) and SCBs
Integrated ADCs and DACs with 1 percent accuracy 
12-/10-/8-bit mode ADCs with 500/550/600 Ksps sampling rate 
Up to ten 15 ns high-speed comparators 
Up to 32 analog inputs and 3 outputs
Appendix D
ModelSim Simulations
D.1 Simulation of the θ Computation
In order to verify the operation and the execution time of the θ Computation, a post-synthesis 
ModelSim simulation was undertaken as shown on the next page. The input data used in this 
simulation are taken from processed AVIRIS Cuprite hyperspectral data.
[Figure: post-synthesis ModelSim simulation waveform]
D.2 Simulation of the Lifting Process
A data set of 8 spectral bands from the AVIRIS Cuprite image was considered in this ModelSim 
simulation. The eigenvector matrix of this data set is decomposed into the L, U and S matrices as 
shown below:
\[
L = \begin{bmatrix}
1.0000 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0.2100 & 1.0000 & 0 & 0 & 0 & 0 & 0 & 0 \\
0.3691 & 0.3606 & 1.0000 & 0 & 0 & 0 & 0 & 0 \\
0.1825 & -0.1896 & -0.1123 & 1.0000 & 0 & 0 & 0 & 0 \\
0.0358 & 0.0152 & 0.1493 & -0.0148 & 1.0000 & 0 & 0 & 0 \\
-0.0931 & -0.2877 & -0.0859 & 0.2729 & 0.3437 & 1.0000 & 0 & 0 \\
0.184 & 0.2233 & 0.3588 & 0.2267 & 0.2044 & 0.4281 & 1.0000 & 0 \\
0.0293 & 0.0103 & 0.5040 & -0.2264 & -0.3031 & 0.2879 & -0.1906 & 1.0000
\end{bmatrix}
\]
\[
U = \begin{bmatrix}
1.0000 & -0.1045 & -0.5672 & -0.2570 & -0.0135 & 0.1418 & 0.1189 & -0.0272 \\
0 & 1.0000 & -0.6289 & 0.1936 & 0.0431 & 0.1169 & -0.0607 & -0.1257 \\
0 & 0 & 1.0000 & 0.1324 & -0.0462 & -0.2666 & -0.6324 & -0.2984 \\
0 & 0 & 0 & 1.0000 & -0.0592 & -0.0597 & -0.1362 & 0.1408 \\
0 & 0 & 0 & 0 & 1.0000 & -0.4902 & -0.0597 & 0.3211 \\
0 & 0 & 0 & 0 & 0 & 1.0000 & -0.4902 & -0.9485 \\
0 & 0 & 0 & 0 & 0 & 0 & 1.0000 & 0.7929 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1.0000
\end{bmatrix}
\]
\[
S = \begin{bmatrix}
1.0000 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1.0000 & 0 & 0 & 0 & 0 & 0 & 0 \\
0.5083 & 0.4836 & 1.0000 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1.0000 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1.0000 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.2508 & 0.4786 & 1.0000 & 0 & 0 \\
0.3554 & 0.3381 & 0.6992 & 0 & 0 & 0 & 1.0000 & 0 \\
-0.1901 & -0.1809 & -0.3740 & 0.1262 & 0.2409 & 0.5033 & -0.5349 & 1.0000
\end{bmatrix}
\]
The fixed-point representation of the S matrix, in hexadecimal 20-bit words where 40000 
represents 1.0, is
S =
40000 00000 00000 00000 00000 00000 00000 00000
00000 40000 00000 00000 00000 00000 00000 00000
20874 1ef34 40000 00000 00000 00000 00000 00000
00000 00000 00000 40000 00000 00000 00000 00000
00000 00000 00000 00000 40000 00000 00000 00000
00000 00000 00000 100cb 1ea1f 40000 00000 00000
16beb 15a43 2cc03 00000 00000 00000 40000 00000
f3d56 f46c8 e80ff 08140 0f6b0 20367 ddc40 40000
Taking a spectral pixel T from the same data set
T = [593 624 563 590 603 608 612 671]
where the hexadecimal representation of T is
T = [251 270 233 24E 25B 260 264 29F]
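The relation between the decimal and the fixed-point entries of S, and the way a unit lower-triangular factor can be applied to an integer pixel reversibly, can be sketched in Python. The word format (20-bit two's complement with 18 fractional bits) is inferred from 40000 representing 1.0. The lifting step shown is the generic rounded triangular update used in reversible integer mappings such as [198]; the rounding mode, the order of operations and all names are assumptions, not a description of the hardware simulated here.

```python
FRAC_BITS = 18   # 0x40000 represents 1.0
WORD_BITS = 20   # five hexadecimal digits per entry

S_FIXED = [
    [0x40000, 0x00000, 0x00000, 0x00000, 0x00000, 0x00000, 0x00000, 0x00000],
    [0x00000, 0x40000, 0x00000, 0x00000, 0x00000, 0x00000, 0x00000, 0x00000],
    [0x20874, 0x1EF34, 0x40000, 0x00000, 0x00000, 0x00000, 0x00000, 0x00000],
    [0x00000, 0x00000, 0x00000, 0x40000, 0x00000, 0x00000, 0x00000, 0x00000],
    [0x00000, 0x00000, 0x00000, 0x00000, 0x40000, 0x00000, 0x00000, 0x00000],
    [0x00000, 0x00000, 0x00000, 0x100CB, 0x1EA1F, 0x40000, 0x00000, 0x00000],
    [0x16BEB, 0x15A43, 0x2CC03, 0x00000, 0x00000, 0x00000, 0x40000, 0x00000],
    [0xF3D56, 0xF46C8, 0xE80FF, 0x08140, 0x0F6B0, 0x20367, 0xDDC40, 0x40000],
]
T = [0x251, 0x270, 0x233, 0x24E, 0x25B, 0x260, 0x264, 0x29F]


def signed(word: int) -> int:
    """Interpret a 20-bit word as a two's-complement integer."""
    return word - (1 << WORD_BITS) if word & (1 << (WORD_BITS - 1)) else word


def from_fixed(word: int) -> float:
    """Convert a fixed-point word to its real value."""
    return signed(word) / (1 << FRAC_BITS)


def rounded_dot(row, x, upto: int) -> int:
    """Round-to-nearest integer of sum(row[j] * x[j]) for j < upto."""
    acc = sum(signed(row[j]) * x[j] for j in range(upto))
    return (acc + (1 << (FRAC_BITS - 1))) >> FRAC_BITS


def lift_forward(s, x):
    """y[i] = x[i] + round(sum over j < i of s[i][j] * x[j])."""
    return [x[i] + rounded_dot(s[i], x, i) for i in range(len(x))]


def lift_inverse(s, y):
    """Undo lift_forward exactly, recovering x[0], x[1], ... in order."""
    x = []
    for i in range(len(y)):
        x.append(y[i] - rounded_dot(s[i], x, i))
    return x


if __name__ == "__main__":
    print([round(from_fixed(w), 4) for w in S_FIXED[7]])
    y = lift_forward(S_FIXED, T)
    print(y, lift_inverse(S_FIXED, y) == T)
```

Because the inverse recomputes exactly the same rounded sums from already recovered values, the integer pixel is restored without loss even though the matrix entries are not integers.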
[Figure: ModelSim simulation waveform of the lifting process]
Appendix E
Summary of the Hardware Utilisation
E.1. θ Computer
Flow Status: Successful - Sun Mar 16 02:50:35 2014
Quartus II 32-bit Version: 12.0 Build 263 08/02/2012 SP 2 SJ Web Edition
Revision Name: ATANTST
Top-level Entity Name: Block1
Family: Cyclone IV E
Device: EP4CE115F29C7
Timing Models: Final
Total logic elements: 19,539 / 114,480 (17 %)
Total combinational functions: 18,224 / 114,480 (16 %)
Dedicated logic registers: 6,841 / 114,480 (6 %)
Total registers: 6841
Total pins: 171 / 529 (32 %)
Total virtual pins: 0
Total memory bits: 8,193 / 3,981,312 (< 1 %)
Embedded Multiplier 9-bit elements: 114 / 532 (21 %)
Total PLLs: 1 / 4 (25 %)
E.2. Eq(3,4,5) Computer
Flow Status: Successful - Sun Mar 16 02:50:35 2014
Quartus II 32-bit Version: 12.0 Build 263 08/02/2012 SP 2 SJ Web Edition
Revision Name: ATANTST
Top-level Entity Name: Block1
Family: Cyclone IV E
Device: EP4CE115F29C7
Timing Models: Final
Total logic elements: 19,539 / 114,480 (17 %)
Total combinational functions: 18,224 / 114,480 (16 %)
Dedicated logic registers: 6,841 / 114,480 (6 %)
Total registers: 6841
Total pins: 171 / 529 (32 %)
Total virtual pins: 0
Total memory bits: 8,193 / 3,981,312 (< 1 %)
Embedded Multiplier 9-bit elements: 114 / 532 (21 %)
Total PLLs: 1 / 4 (25 %)
E.3. The Lifting Unit Computer
Flow Status: Successful - Sun Mar 16 02:50:35 2014
Quartus II 32-bit Version: 12.0 Build 263 08/02/2012 SP 2 SJ Web Edition
Revision Name: ATANTST
Top-level Entity Name: Block1
Family: Cyclone IV E
Device: EP4CE115F29C7
Timing Models: Final
Total logic elements: 19,539 / 114,480 (17 %)
Total combinational functions: 18,224 / 114,480 (16 %)
Dedicated logic registers: 6,841 / 114,480 (6 %)
Total registers: 6841
Total pins: 171 / 529 (32 %)
Total virtual pins: 0
Total memory bits: 8,193 / 3,981,312 (< 1 %)
Embedded Multiplier 9-bit elements: 114 / 532 (21 %)
Total PLLs: 1 / 4 (25 %)
