GPU accelerated onboard data processing for downlink optimisation. by Davidson, Rebecca
 
 
GPU Accelerated Onboard Data 
Processing For Downlink Optimisation 
Rebecca L. Davidson 
Submitted for the Degree of 
Doctor of Philosophy 
from the 
University of Surrey 
 
 
Surrey Space Centre 
Faculty of Engineering and Physical Sciences 
University of Surrey 
Guildford, Surrey, GU2 7XH, UK 
 
September 2018 
Supervised by: Dr C. P. Bridges 
 
© R. L. Davidson 2018
Rebecca L. Davidson                   GPU Accelerated Onboard Data Processing For Downlink Optimisation
- ii - 
STATEMENT OF ORIGINALITY 
 
This thesis and the work to which it refers are the results of my own efforts. Any ideas, 
data, images, or text resulting from the work of others (whether published or unpublished) 
are fully identified as such within the work and attributed to their originator in the text, 
bibliography, or in footnotes. This thesis has not been submitted in whole or in part for any 
other academic degree or professional qualification. I agree that the University has the right 
to submit my work to the plagiarism detection service TurnitinUK for originality checks. 
Whether or not drafts have been so assessed, the University reserves the right to require an 
electronic version of the final document (as submitted) for assessment as above.  
 
Rebecca L. Davidson 
September 2018 
ABSTRACT 
 
The dimensionality and volume of raw payload data generated onboard Earth Observation
(EO) satellites have increased beyond the capabilities of satellite downlink technologies;
as a result, a bottleneck in the data delivery chain has developed. This data bottleneck must
be alleviated in order for EO satellites to efficiently deliver the quality and quantity of
payload data now expected by the applications that rely on them. In this thesis, hardware
architectures, processing algorithms and software design are explored towards a solution.
As a result, a new onboard satellite data processing architecture is proposed. The key
novelties of the proposed system are the use of a Graphics Processing Unit (GPU) to
facilitate state-of-the-art image processing, and the highly flexible nature of the
architecture, enabling an adaptive processing chain that can be deployed across numerous
platforms and missions. In addition, the research documented in this thesis aims to
demonstrate the viability and evaluate the advantages of using low-power GPUs in an
onboard data processing system.
GPU-optimised software development approaches suitable for onboard use are proposed and
practically assessed, using the state-of-the-art image compression algorithm CCSDS-123
as a case study. Firstly, application development for maximised processing
throughput is investigated using hyperspectral and multispectral EO data sets. The
processing throughput, compression ratio and power consumption of the new CCSDS-123
image compression GPU application are assessed and characterised for a desktop GPU and
for the low-power NVIDIA Jetson TX1 GPU platform, which is representative of onboard
hardware. Secondly, software-based error injection experiments are used to investigate the
error resilience of the CCSDS-123 GPU application. This is a vital area of research, required
to facilitate the wider acceptance and use of GPU devices in space and safety-critical
applications, where errors are possible and cannot be tolerated. Using these results, new
error mitigation techniques are also proposed and evaluated.
 
 
 
Keywords: Earth observation, onboard, image compression, image processing, FPGA, GPU 
ACKNOWLEDGMENTS 
I would like to thank all the staff and fellow students at the Surrey Space Centre, my 
colleagues at SSTL and the NVIDIA Corporation for producing the Jetson Platform and the 
SASSIFI error injection framework with immaculate timing and for awarding this research 
with a Jetson TX1 developer kit grant.  
I am amazingly grateful to my incredible family, Mel, Colin and Liv, for their unre-
lenting support, mostly in the form of mockery, their encouragement, despite not really 
having a clue what my PhD is about, and for each individually inspiring me every single 
day to be and do the very best I can. 
I would also like to thank my supervisor, Dr Christopher Bridges. Firstly, for introduc-
ing me to HBO’s Silicon Valley. Secondly, for supervising me on my undergraduate level 
dissertation project and convincing me, mostly with the lure of being paid to remain a stu-
dent, to leave my intended Master’s pathway, graduate a year early and embark on the 
journey that has been this PhD. A PhD is something I would never have thought myself 
capable of achieving without his encouragement and belief. I hope that the research we 
have conducted, including our publications and this thesis, stands as a suitable testament to
Chris’s mentorship. Despite going over to the dark side, I look forward to continuing to
work together on new and exciting projects, and one day maybe we’ll solve all those big
onboard data problems; if not, I’ll settle for another trip to IEEE Aerospace, MT.
Lastly, I would like to express my appreciation, gratitude, and every other acronym for 
thanks, to James, who has been my absolute pebble through the entire process. Day one he 
helped me to decide if I should do a PhD and he has been there for me every single day 
since, imparting on me his relentless belief, encouragement and stubbornness. I want to 
especially thank him for helping me to get home when I got stranded in Crete, on my first 
conference trip, despite being a few pints down at the time of my SOS. For hand delivering
a takeaway pizza to my office when I was working up to the wire for my IEEE
Aerospace paper. For giving me, and taking away, my PlayStation when I needed it the 
most. And lastly but definitely not least, for reading 63,261 out of 63,6843 words in this 
thesis and every single word of every previous version, including all the first cuts, typos 
and inhumanly long sentences.  
 
So long and thanks for all the fish.   
TABLE OF CONTENTS 
  
Statement of Originality ................................................................................................................... ii 
Abstract ........................................................................................................................................... iii
Acknowledgments .......................................................................................................................... iv
Table of Contents ............................................................................................................................ v
List of Figures ............................................................................................................................... viii
List of Tables .................................................................................................................................. xi
List of Abbreviations ..................................................................................................................... xiii 
Chapter 1, Introduction .................................................................................................................. 16
1.1 Research Motivation ......................................................................................................... 18 
1.2 Research Scope .................................................................................................................. 18 
1.3 PhD Aim and Objectives ................................................................................................... 18 
1.4 Research Contributions .................................................................................................... 19 
1.5 Publications ........................................................................................................................ 19 
1.6 Overview of Thesis ............................................................................................................ 20 
Chapter 2, Literature Review ........................................................................................................ 21
2.1 Space borne Earth Observation Imaging ........................................................................ 21 
2.2 Onboard Data Processing ................................................................................................. 25 
2.2.1 The Space Environment ................................................................................................................ 25 
2.2.2 Error Resilient Space System Design ........................................................................................... 27 
2.2.3 Traditional Onboard Data Processing Hardware .......................................................................... 29 
2.2.4 Error Resilient Software & Firmware Design ............................................................................... 31 
2.2.5 Payload Data Compression ........................................................................................................... 34 
2.3 Terrestrial Computing System Design ............................................................................ 35 
2.3.1 Heterogeneous Computing ............................................................................................................ 35 
2.3.2 Cluster Computing ........................................................................................................................ 36 
2.3.3 Hardware-Software Co-design ...................................................................................................... 37 
2.4 Terrestrial Processors ....................................................................................................... 38 
2.4.1 GPU Hardware Architecture ......................................................................................................... 40 
2.4.2 GPU Software Programming Model ............................................................................................. 43 
2.4.3 GPU Beam Testing Experiments .................................................................................................. 54 
2.4.4 Software Based GPU Error Injection Testing ............................................................................... 56 
2.4.5 Error Resilient GPU Application Development ............................................................................ 58 
2.5 Data Processing and Compression Algorithms .............................................................. 60 
2.5.1 Lossless Image Compression ........................................................................................................ 60 
2.5.2 Additional Data Processing ........................................................................................................... 75 
2.6 CCSDS-123 Implementation Research ........................................................................... 81 
2.7 Chapter 2 Summary .......................................................................................................... 85 
Chapter 3, New Onboard Data Processing Architecture ............................................................ 87
3.1 New Behavioural System Design ........................................................................................ 88 
3.2 New Structural System Design ........................................................................................... 89 
3.3 New System Architecture .................................................................................................... 91 
3.3.1 Backplane ......................................................................................................................................... 91 
3.3.2 FPGA ................................................................................................................................................ 92 
3.3.3 CPU .................................................................................................................................................. 92 
3.3.4 GPU .................................................................................................................................................. 93 
3.4 Additional Research Areas ................................................................................................. 94 
3.5 Chapter 3 Summary ............................................................................................................ 96 
Chapter 4, GPU Accelerated CCSDS-123 Compression ............................................................. 97
4.1 CCSDS-123 Algorithm Overview ....................................................................................... 97 
4.2 CCSDS-123 Parallelisation Approaches .......................................................................... 101 
4.2.1 Full DLP Approach ..................................................................................................................... 102 
4.2.2 Limited DLP Approach ............................................................................................................... 103 
4.2.3 Hybrid Approach ......................................................................................................................... 104 
4.3 New Parallel GPU CCSDS-123 Application .................................................................... 105 
4.3.1 Input Data Organisation ................................................................................................................. 107 
4.3.2 CCSDS-123 Kernel Organisation – Memory Hierarchy ................................................................ 108 
4.3.3 CCSDS-123 Kernel Organisation – Kernel Occupancy ................................................................ 109 
4.3.4 CCSDS-123 Kernel Organisation – Concurrent Tasks .................................................................. 113 
4.3.5 Bit Packer Kernel Organisation – Configuration Optimisation ..................................................... 114 
4.4 Initial Application Evaluation .......................................................................................... 115 
4.4.1 Literature Performance Comparison ........................................................................................... 116 
4.4.2 Image Tiling, Processing Throughput and Compression Ratio .................................................. 118 
4.4.3 Performance Trade-off ................................................................................................................ 123 
4.5 Low Power GPU Application Performance .................................................................... 125 
4.6 Chapter 4 Summary .......................................................................................................... 131 
Chapter 5, Multispectral Optimised GPU Accelerated CCSDS-123 Compression ................ 133
5.1 Initial Multispectral Imagery Performance Evaluation ................................................. 133 
5.2 Low Power GPU Performance Evaluation ...................................................................... 139 
5.3 New Multispectral Imagery Optimised GPU Application ............................................. 140 
5.3.1 CCSDS-123 Kernel Organisation – Nested Parallelism ............................................................. 141 
5.3.2 CCSDS-123 Kernel Configuration – Warp Efficiency ............................................................... 141
5.4 Multispectral Optimised Application Performance Evaluation .................................... 145 
5.5 Low Power GPU Application Performance .................................................................... 150 
5.6 Chapter 5 Summary .......................................................................................................... 152 
Chapter 6, Error Resilient GPU Accelerated CCSDS-123 Compression ................................ 155
6.1 CCSDS-123 Algorithm Assessment .................................................................................. 155 
6.2 GPU Application Error Injection Testing ....................................................................... 156 
6.2.1 Register File Error Injection Results ........................................................................................... 158 
6.2.2 Application Level Error Injection Results .................................................................................. 160 
6.2.3 Kernel Level Error Injection Results .......................................................................................... 161 
6.2.4 Instruction Level Error Injection Results .................................................................................... 163 
6.3 New GPU Error Mitigation Approaches ......................................................................... 166 
6.3.1 Error Resilient GPU Application Development .......................................................................... 167 
6.3.2 Software Based Error Injection Testing ...................................................................................... 170 
6.3.3 Throughput Performance Evaluation .......................................................................................... 172 
6.4 Chapter 6 Summary .......................................................................................................... 175 
Chapter 7, Conclusion & Future Work ...................................................................................... 178
7.1 Conclusion .......................................................................................................................... 178 
7.2 Future Work ....................................................................................................................... 182 
7.2.1 GPU Beam Testing ..................................................................................................................... 182 
7.2.2 GPU Application FI Protection ................................................................................................... 182 
7.2.3 State-of-the-art Low Power GPU Testing ................................................................................... 182 
7.2.4 Commercial Exploitation ............................................................................................................ 183 
References ..................................................................................................................................... 184
Appendices .................................................................................................................................... 200
 
 
  
LIST OF FIGURES 
Figure 2-1 EO imaging techniques ...................................................................................... 21 
Figure 2-2 Data volume evolution over time [11] ............................................................... 23 
Figure 2-3 Van Allen belts and Earth orbits, simplified from [15] ..................................... 25 
Figure 2-4 Architectural error resilience design .................................................................. 27 
Figure 2-5 Example space data processing architectures .................................................... 28 
Figure 2-6 Cluster computer architecture ............................................................................ 36 
Figure 2-7 Hardware-software co-design double roof model [47] ...................................... 37 
Figure 2-8 Forty five years of processors trend data [48] .................................................... 38 
Figure 2-9 CPU-GPU heterogeneous execution model ....................................................... 39 
Figure 2-10 CPU versus GPU hardware architectures ........................................................ 40 
Figure 2-11 Block diagram of the hardware architecture of an NVIDIA GPU ................... 41 
Figure 2-12 GPU thread divergence .................................................................................... 45 
Figure 2-13 GPU hardware hierarchy showing kernels, blocks, warps and threads ........... 46 
Figure 2-14 GPU hiding latency .......................................................................................... 47 
Figure 2-15 Instruction level parallelism on a GPU ............................................................ 49 
Figure 2-16 NVIDIA GPU memory hierarchy .................................................................... 51 
Figure 2-17 Shared memory bank conflicts ......................................................................... 52 
Figure 2-18 Coalesced and uncoalesced memory transactions ............................................ 53 
Figure 2-19 Lossless image compression algorithm classification methodology ............... 63 
Figure 2-20 Lossless image compression algorithm compression ratio analysis ................. 65
Figure 2-21 Example 1-level and 2-level 2D-DWT ............................................................ 66 
Figure 2-22 Adaptive image tiling [115] ............................................................................. 76 
Figure 2-23 CCSDS-123 previously published throughput results ..................................... 83 
Figure 3-1 New behavioural system design ......................................................................... 88 
Figure 3-2 Structural system design .................................................................................... 90 
Figure 3-3 New onboard data processing system architecture ............................................ 91 
Figure 4-1 FL causal template ............................................................................................. 97
Figure 4-2 CCSDS-123 local sum modes .......................................................................... 100 
Figure 4-3 CCSDS-123 local difference modes ................................................................ 100 
Figure 4-4 CCSDS-123 functional block diagram ............................................................. 101 
Figure 4-5 SSC CCSDS-123 GPU application design summary ....................................... 106 
Figure 4-6 Data ordering for coalesced memory operations .............................................. 108 
Figure 4-7 Instruction occurrence in SSC CCSDS-123 kernel based on operation ........... 110
Figure 4-8 Impact of varying compiled registers per thread ............................................. 112
Figure 4-9 NVIDIA CUDA Streams .................................................................................. 114 
Figure 4-10 CCSDS-123 implementation throughput performance comparison ............... 117 
Figure 4-11 AVIRIS Hawaii compression ratio and throughput (GTX750Ti) .................. 120 
Figure 4-12 AVIRIS Maine compression ratio and throughput (GTX750Ti) ................... 120 
Figure 4-13 AVIRIS Yellowstone 00 compression ratio and throughput (GTX750Ti) .... 121
Figure 4-14 AVIRIS Yellowstone 03 compression ratio and throughput (GTX750Ti) .... 121 
Figure 4-15 AVIRIS GTX750Ti throughput Weissman scores with tile size (α = 1) ....... 124
Figure 4-16 AVIRIS imagery GTX750Ti and Jetson TX1 tiled performance ................... 128 
Figure 4-17 AVIRIS Jetson TX1 and GTX750Ti throughput Weissman Scores (α = 1) ... 129
Figure 4-18 AVIRIS Hawaii CCSDS-123 compression throughput comparison .............. 130 
Figure 5-1 Landsat Agriculture compression ratio and throughput (GTX750Ti) .............. 135 
Figure 5-2 Landsat Coast compression ratio and throughput (GTX750Ti) ....................... 135 
Figure 5-3 Landsat Mountain compression ratio and throughput (GTX750Ti) ................. 136 
Figure 5-4 GTX750Ti throughput Weissman Score comparison for all images (α = 1) ... 138
Figure 5-5 Tiled Landsat imagery GTX750Ti and Jetson TX1 GPU comparison results . 140 
Figure 5-6 Leveraging image tiling to increase warp execution efficiency ....................... 142 
Figure 5-7 Landsat Agriculture CCSDS-123 kernel warp stall reasons ............................ 144 
Figure 5-8 Landsat Agriculture TPB throughput results (GTX750Ti) .............................. 145 
Figure 5-9 Landsat Coast TPB throughput results (GTX750Ti) ........................................ 146 
Figure 5-10 Landsat Mountain TPB throughput results (GTX750Ti) ............................... 146 
Figure 5-11 Landsat Agriculture TPB throughput with number of blocks (GTX750Ti) ... 148 
Figure 5-12 Landsat images throughput Weissman score for GTX750Ti ......................... 149 
Figure 5-13 Landsat Agriculture TPB testing for GTX750Ti and Jetson TX1 .................. 151 
Figure 5-14 Landsat Coast TPB testing for GTX750Ti and Jetson TX1 ........................... 151 
Figure 5-15 Landsat Mountain TPB testing for GTX750Ti and Jetson TX1 .................... 152 
Figure 6-1 RF error injection results .................................................................................. 159 
Figure 6-2 CCSDS-123 application level error injection results ....................................... 160 
Figure 6-3 Kernel level error injection results ................................................................... 161 
Figure 6-4 Application level view of the kernel error injection results ............................. 162 
Figure 6-5 Instruction level weighted error injection results ............................................. 164 
Figure 6-6 Weighted error injection results for GPR instructions in IOV mode ............... 166 
Figure 6-7 TMR implementation comparison ................................................................... 168 
Figure 6-8 Error injection results for original, K-TMR and T-TMR applications ............ 170 
Figure 6-9 Error injection results original, K-TMR and T-TMR applications .................. 171 
Figure 6-10 TMR execution time overhead comparison for Landsat Agriculture ............ 173 
Figure 6-11 GTX750Ti Landsat throughput results with and without TMR protection ... 174 
Figure 6-12 Jetson TX1 Landsat throughput results with and without TMR protection ... 175 
Figure 7-1 New error resilient GPU accelerated application development framework ..... 181 
  
LIST OF TABLES 
Table 2-1 Future onboard processing mission attributes [2] ................................................ 31 
Table 2-2 Recommended onboard payload data and image compression algorithms ......... 35 
Table 2-3 Software and hardware constructs used for exploiting parallelism types ............ 44 
Table 2-4 Causes of low measured occupancy [56] ............................................................. 48 
Table 2-5 Instruction stall reasons [53] ................................................................................ 50 
Table 2-6 Onboard EO image processing algorithm desired characteristics ....................... 61 
Table 2-7 Average compression ratios of surveyed algorithms * ........................................ 63 
Table 2-8 Sunset algorithm key concepts ............................................................................ 67
Table 2-9 Additional image processing and their advantages .............................................. 75 
Table 2-10 Image analysis examples .................................................................................... 81 
Table 2-11 Previous research limitations ............................................................................. 84 
Table 3-1 Ideal new onboard data processing system attributes .......................................... 87 
Table 3-2 Structural system design summary ...................................................................... 90 
Table 3-3 CCSDS-123 case study research questions .......................................................... 95 
Table 4-1 Broad CCSDS-123 application requirements .................................................... 106 
Table 4-2 AVIRIS Hawaii test image characteristics [143] ............................................... 109 
Table 4-3 SSC CCSDS-123 naturally compiled characteristics ........................................ 110 
Table 4-4 SSC CCSDS-123 optimised compiled characteristics ....................................... 112 
Table 4-5 Bit Packer kernel configuration and performance comparison .......................... 115 
Table 4-6 SSC Bit Packer kernel characteristics and theoretical occupancy ..................... 115 
Table 4-7 Desktop application evaluation platform ........................................................... 116 
Table 4-8 AVIRIS Hawaii test image characteristics ........................................................ 117 
Table 4-9 AVIRIS Hawaii test configuration and new compression results ..................... 117 
Table 4-10 GPU platform comparison ............................................................................... 117 
Table 4-11 AVIRIS test image characteristics ................................................................... 118 
Table 4-12 AVIRIS Hawaii throughput Weissman Score reference data .......................... 124 
Table 4-13 AVIRIS imagery, kernel and GTX750Ti hardware characteristics ................. 125 
Table 4-14 GPU test platform comparison ......................................................................... 126 
Table 4-15 AVIRIS imagery, kernel and Jetson TX1 hardware characteristics ................ 127 
Table 4-16 GPU platform comparison ............................................................................... 130 
Table 4-17 SSC CCSDS-123 case study Chapter 4 findings ............................................. 132 
Table 5-1 Landsat multispectral imagery test data characteristics [143][144] .................. 133 
Table 5-2 Landsat imagery, kernel and GTX750Ti hardware characteristics ................... 134 
Table 5-3 Landsat imagery, kernel and Jetson TX1 hardware characteristics .................. 139 
Table 5-4 Multispectral TPB kernel testing and profiling results ...................................... 144 
Table 5-5 Calculated maximum number of concurrent blocks and tiles ........................... 149 
Table 5-6 Jetson TX1 calculated maximum concurrent tiles for Landsat images .............. 150
Table 5-7 SSC CCSDS-123 case study Chapter 5 findings summary ............................... 154 
Table 6-1 SASSIFI error injection framework operational modes description ................. 157 
Table 6-2 Kernel register usage and calculated AVFs ....................................................... 159
Table 6-3 Occurrence of instruction types in the SSC CCSDS-123 GPU application ...... 164
Table 6-4 Instruction occurrence for the SSC CCSDS-123 GPU application ................... 165 
Table 6-5 SSC CCSDS-123 error resilience case study Chapter 6 findings ...................... 177 
LIST OF ABBREVIATIONS 
ABFT Algorithm Based Fault Tolerance 
ACAP Adaptive Combination of Adaptive Predictors 
ADAS Advanced Driver Assistance Systems 
APC Adaptive Predictor Combination 
APC-WLS Adaptive Predictor Combination - Weighted Least Squares 
API Application Programming Interface 
APL Adaptive Predictor Length 
ASAP Adaptive Selection of Adaptive Predictor 
ASIC Application Specific Integrated Circuit 
AVF Architectural Vulnerability Factor 
AVIRIS Airborne Visible Infrared Imaging Spectrometer 
BIP Band Interleaved by Pixel 
BLCR Berkeley Lab Check-Point/Restart 
BSQ Band Sequential 
C-DPCM Clustered Differential Pulse Code Modulation 
CALIC Context Adaptive Lossless Image Codec 
CC Conditional Code 
CCD Charge-Coupled-Device 
CCSDS Consultative Committee for Space Data Systems 
CEOI Centre for Earth Observation Instrumentation 
CNES National Centre for Space Studies 
COTS Commercial Off The Shelf 
CPR Check-Pointing and Rollback 
CPU Central Processing Unit 
CR Compression Ratio 
CRC Cyclic Redundancy Check 
CREW Compression with Reversible Embedded Wavelet 
CWRD Codeword 
D-JPEG-LS Differential-JPEG-LS 
D-JPEG2000 Differential-JPEG2000 
DBF Double Bit Flip 
DCT Discrete Cosine Transform 
DIHU Data Interfacing and Handling Unit 
DLP Data Level Parallelism 
DMA Direct Memory Access 
DMR Dual Modular Redundancy 
DPU Double Precision Units 
DRAM Dynamic Random Access Memory 
DSC Distributed Source Coding 
DSP Digital Signal Processor 
DWC Duplication With Comparison 
DWT Discrete Wavelet Transform 
ECC Error Correcting Codes 
EDAC Error Detection and Correction Codes 
EO Earth Observation 
EPHI Edge-based Prediction for Hyperspectral Imagery 
ESA European Space Agency 
FFT Fast Fourier Transform 
FI Functional Interrupt 
FIT Failure-In-Time 
FL Fast Lossless 
FLOPS Floating Point Operations per Second 
FPGA Field Programmable Gate Array 
GAP Gradient Adjusted Predictor 
GB GigaBytes 
GPR General Purpose Registers 
GPU Graphics Processing Unit 
GSD Ground Sampling Distance 
HPC High Performance Computing 
HS-CD Hardware-Software Co-Design 
HVF Hardware Vulnerability Factor 
IADD_IMUL Integer Addition and Multiplication 
IB-CALIC Inter-Band CALIC 
IC Integrated Circuit 
ILP Instruction Level Parallelism 
IOA Instruction Output Address 
IOV Instruction Output Value 
IR Intermediate Representation 
ISA Instruction Set Architecture 
JPEG Joint Photographic Experts Group 
JPL Jet Propulsion Laboratory 
K-TMR Kernel TMR 
LAIS-LUT Locally Averaged Inter-band Scaling LUT 
LAIS-QLUT Locally Averaged Inter-band Scaling Quantised LUT 
LD Local Difference 
LD/ST Load and Store Units 
LDPC Low-Density Parity Check 
LDS Load and Store Operations 
LEN Length 
LOCO-A Low Complexity Lossless Compression - Arithmetic coding 
LOCO-I Low Complexity Lossless Compression for Images 
LPVQ Locally optimised Partitioning Vector Quantisation 
LS Local Sum 
LUT Look-Up Table 
MBU Multiple Bit Upset 
MC-CPU Multicore-CPU 
MED Median Edge Detector 
MMSE Minimum Mean Square Error 
MMU Mass Memory Unit 
MPSoC Multiprocessor System-on-Chip 
MSS Multispectral Scanner 
NASA National Aeronautics and Space Administration 
NVCC NVIDIA CUDA Compiler 
NVCR NVIDIA CUDA Checkpoint and Restart Library 
OBC Onboard Computer 
PCIe Peripheral Component Interconnect Express 
PDPU Payload Data Processing Unit 
PR Predicate Registers 
RF Register File 
RH Radiation Hardened 
RISC Reduced Instruction Set Computer 
ROI Region Of Interest 
SBF Single Bit Flip 
SBFT Software Based Fault Tolerance 
SBU Single Bit Upset 
SDC Silent Data Corruption 
SEE Single Event Effect 
SEU Single Event Upset 
SFU Special Function Units 
sgn sign 
SHUF_LOP Shuffle and Logic Operations 
SIFT Scale Invariant Feature Transform 
SIMD Single Instruction Multiple Data 
SLSQ Spectrum Orientated Least Squares 
SM Streaming Multiprocessors 
SMA SubMiniature version A connector 
SoC System-on-Chip 
SOI Silicon On Insulator 
SSC Surrey Space Centre 
SSTL Surrey Satellite Technology Limited 
STP Single Thread Performances 
T-TMR Thread TMR 
TB TeraBytes 
TDKZW Tuned Degree-K Zerotree Wavelet 
TDP Thermal Design Power 
TID Total Ionising Dose 
TLP Task Level Parallelism 
TMR Triple Modular Redundancy 
TPB Tiles Per Block 
TRL Technology Readiness Level 
USA United States of America 
USES Universal Source Encoder for Space 
Rebecca L. Davidson                Chapter 1, Introduction 
CHAPTER 1, INTRODUCTION 
Earth Observation (EO) is the acquisition and interpretation of information about the 
Earth and its atmosphere using instruments positioned on remote sensing platforms. 
Today EO data is utilised by a growing number of diverse applications, including those in 
the environmental, humanitarian and industrial sectors. The growth in the number of user 
applications and in the demand for EO data has driven significant technological advances in 
remote sensing instruments. Consequently, the achievable spatial, spectral, temporal and 
radiometric resolutions of remote sensing data have increased significantly in recent 
years. The increasing data dimensionality, and the resulting increase in data volume, are 
creating new big data challenges in the field of remote sensing. Space borne remote sensing 
platforms are particularly vulnerable to these big data challenges due to constraints imposed 
by the platform’s environment. 
EO satellites are typically operated in a store and forward mode, whereby data is captured 
and then stored onboard until a downlink can be established with an appropriate 
ground station. Whilst payload data rates and volumes have continued to grow, the 
level of advancement in downlink technologies required to handle such increases has not 
occurred [1][2]. Space based downlink technologies today are typically inhibited by limitations 
on antenna size, transmission power and pointing abilities [3]. This, in 
combination with the restricted availability of, and increasing demand for, transmission 
frequencies, has led to a saturation in the performance capabilities of satellite downlink 
systems. As such, the growing disparity between payload and downlink capabilities has 
resulted in the formation of a data bottleneck in the data-delivery chain. For EO 
satellite platforms to continue to provide the level of data required by EO science and its 
reliant applications, this data bottleneck must be alleviated. 
This research investigates the exploitation of the latest advancements in data processing 
to help alleviate the growing data bottleneck for remote sensing platforms. The 
field of data processing covers an extensive range of topics, from system, hardware and 
software design to data handling, processing and analysis algorithms. Image compression 
is a key example of a data processing algorithm which can be implemented onboard to 
effectively decrease the volume of data to be downlinked, increasing the data-delivery 
throughput of the platform and reducing the data bottleneck. The first known satellite to 
implement onboard image compression was SPOT-1, launched in 1986; since then, 
many satellites have featured onboard compression capabilities [4][5]. Historically, only 
simple software-based compression was performed onboard due to the limited capabilities 
of the main onboard computer. Whilst satellites today feature dedicated payload data 
processing systems, environment-induced constraints, including minimised size, mass and 
power consumption and tolerance to radiation effects, are often prioritised over computational 
capability. However, as onboard data volumes continue to increase, scaling 
traditional space proven technologies whilst continuing to adhere to strict power, area or 
mass constraints will become a significant challenge. This issue has been specifically 
identified by the European Space Agency (ESA), which has stated that the “challenging 
requirements for future onboard payload data processing systems cannot be met with 
space qualified processors available today” [2]. 
In the terrestrial computing industry, there has been a growing demand for high performance 
and power efficient mobile computing devices across a wide range of 
applications. As a result, processor manufacturers are producing new smaller, lighter and more 
energy efficient, yet computationally powerful, mobile processing devices. The 
requirements imposed upon the space and terrestrial mobile industries are therefore becoming more 
closely aligned. A move to utilise more terrestrial mobile technologies can be seen as the 
next stage of the existing trend towards utilising more commercial off the shelf (COTS) components 
in the space community. This research will assess COTS computing systems and 
processing devices towards providing high computational performance within the bounds 
of space system constraints, to facilitate the deployment of state-of-the-art processing algorithms 
to alleviate the onboard data bottleneck. 
  
1.1    Research Motivation  
The dimensionality and volume of raw EO payload data have been steadily increasing 
in recent years, and this trend is expected to continue and even accelerate in the coming 
years. Comparable advances in downlink technologies, however, have not occurred. This 
has resulted in a growing onboard data bottleneck in the delivery chain. This is a clear 
and present problem which has been identified by many space organisations as requiring 
significant research to provide a novel solution [2]. 
1.2    Research Scope 
This research will be focussed upon the proposal of a new state-of-the-art onboard 
system architecture for the advanced processing and compression of EO payload imagery. 
The new architecture should be scalable to meet future payload and mission data requirements, 
achieve that scalability without extensive hardware or software modification, 
and provide increased computing performance towards real-time or advanced 
onboard data processing. Where low technology readiness level (TRL) 
approaches are proposed, further research should be conducted to provide contributions 
towards proving the suitability and feasibility of leveraging such technologies in the space 
environment. 
1.3    PhD Aim and Objectives 
The initial aim of this work is to analyse data processing practices and contribute towards 
the formulation of an effective strategy to reduce the strain placed on satellite 
downlink systems by rapidly advancing EO imaging payloads. 
The main objectives of this research are as follows: 
1. Review the state-of-the-art in image compression algorithms and processing architectures 
for multispectral and hyperspectral EO data in an onboard environment. 
2. Propose a new future approach to onboard data processing system design with 
priority placed on enabling scalable state-of-the-art high data throughput processing 
capabilities, and ultimately detail a new next generation architecture. 
3. Develop new techniques to support future high throughput and error resilient state-of-the-art 
processing algorithms on the proposed next generation onboard data 
processing architecture. 
 
1.4    Research Contributions 
The contributions of this research to the scientific communities in data processing, 
reliability, parallel image processing, and satellite EO algorithms can be summarised as: 
1. An extensive literature collection and analysis of lossless image compression algorithms, 
which shows in a snapshot how multidimensional algorithms continue to exploit 
additional image spectral redundancies and should be leveraged to increase the 
achievable onboard compression ratio, 
2. A proposed new heterogeneous onboard data processing architecture, featuring 
Graphics Processing Unit (GPU) hardware, to facilitate new onboard data processing 
functionality that is highly flexible and scalable, enabling new and wide ranging EO 
missions, 
3. A new CCSDS-123 GPU accelerated and error resilient image compression application 
which achieves state-of-the-art processing throughput and is published in detail 
for reproducibility, 
4. New insights on the relationships between image tiling, algorithm structure and application 
implementation parameters specific to new GPU architectures, and 
5. Extensive collation of information on GPU application design and optimisation, leading 
to new design rules and a development framework to aid future GPU accelerated 
and error resilient image processing applications. 
 
1.5    Publications 
The work described in this thesis has been published in the following peer-reviewed 
journals and conference proceedings:  
- R. L. Davidson, C. P. Bridges, “Error Resilient GPU Accelerated Image Processing for 
Space Applications”, IEEE Transactions on Parallel and Distributed Systems, 1 Sept. 
2018, Volume:29, Issue:9, Pages:1990–2003, DOI: 10.1109/TPDS.2018.2812853 
 
- R. L. Davidson, C. P. Bridges, “GPU Accelerated Multispectral EO Imagery Optimised 
CCSDS-123 Lossless Compression Implementation”, Proc. of 2017 IEEE Aerospace 
Conference, Date: 4-11 March 2017, DOI: 10.1109/AERO.2017.7943817 
 
- R. L. Davidson, C. P. Bridges, “Adaptive Multispectral GPU Accelerated Architecture 
for Earth Observation Satellites”, Proc. of 2016 IEEE Int. Conf. on Imaging Systems 
and Techniques, Date: 4-6 Oct. 2016, DOI: 10.1109/IST.2016.7738208 
1.6    Overview of Thesis 
Chapter 2 details the review of previous literature in the fields of satellite onboard 
data processing, terrestrial computing and processor devices, and image compression and 
processing algorithms. This has enabled us to assess the current state-of-the-art in each of 
the relevant disciplines and establish the key areas where new or additional research is 
required. 
Chapter 3 discusses the design and proposal of a new GPU accelerated onboard data 
processing architecture. This new architecture leverages aspects from state-of-the-art terrestrial 
computing fields to provide an inherently scalable, flexible and high performance 
platform, to facilitate the deployment of new state-of-the-art image compression and processing 
algorithms onboard. 
Chapter 4 introduces the initial design methodologies for throughput optimised parallel 
GPU application development. These are then demonstrated through the design of a new 
CCSDS-123 image compression GPU application. To evaluate the approach and hardware 
capabilities, experimental compression testing on a set of hyperspectral images was 
performed. As a result of these experiments, new discoveries with regards to the performance 
metrics, the application parameters, and the relationships between them and the 
underlying GPU architecture are presented. 
In Chapter 5, the influence of the characteristics of the input data on key performance 
metrics is assessed by testing the new GPU accelerated CCSDS-123 application with 
multispectral imagery. Subsequently, a new multispectral specific optimisation approach 
is proposed to further increase the processing throughput performance, and new performance 
metric and parameter relationships are presented. 
Chapter 6 details the research towards the assessment and mitigation of radiation induced 
errors for GPU applications. A state-of-the-art software based error injection 
framework is utilised to perform an error resilience evaluation of the new GPU accelerated 
CCSDS-123 application. Subsequently, two new generic error mitigation approaches 
are proposed; these leverage aspects of the GPU architecture and software model to detect 
and correct data corruption errors whilst minimising the induced execution overhead. 
Chapter 7 concludes the thesis by summarising the key findings and contributions 
and proposing future areas of research.   
Rebecca L. Davidson                   Chapter 2 , Literature Review 
CHAPTER 2, LITERATURE REVIEW 
2.1    Space borne Earth Observation Imaging 
A wide variety of imaging techniques and sensors are utilised in space borne remote 
sensing. Consequently, there is significant variation in the characteristics of the data produced 
by different instruments [6]. Currently, the most widely deployed EO imaging 
technologies are passive optical instruments. Many of the passive optical imagers deployed 
today are based on solid-state digital imaging and charge-coupled devices (CCDs). 
A CCD is a photosensitive silicon-based detector containing many individual cells that 
produce an electrical charge proportional to the amount of electromagnetic radiation they 
accumulate in a certain period. 
To create a two-dimensional image, cells can be arranged as a single detector, 
in a single line to form a linear array detector, or as a two-dimensional grid called an area 
array detector. Single detector systems use the whisk-broom optomechanical scanning 
technique to create the two-dimensional image, where an oscillating mirror is used to serially 
image cell by cell in the cross-track direction and the forward motion of the 
spacecraft forms the second image dimension. Linear array detectors image in the cross-track 
direction in parallel and then utilise the push-broom scanning technique, which leverages 
the forward motion of the spacecraft to acquire data in the second image 
dimension. Area array imagers capture spatial data in both cross-track and along-track 
directions simultaneously. These imaging techniques are shown in Figure 2-1. 
 
 
Figure 2-1 EO imaging techniques: A) Whisk-broom [7], B) Push-broom [7], C) Area Array [8] 
In addition to the imaging technique, optical imaging instruments are often classified 
based on the spatial and spectral resolutions they are designed to achieve. For CCD based 
imagers, the target area which contributes to a single cell’s radiation measurement determines 
the spatial resolution of the image. Image spatial resolution is often measured as 
the ground sampling distance (GSD), which is equal to the ground distance represented by a single 
picture element (pixel). The spectral resolution is determined by the filtering method deployed. 
Using a filter system allows the incoming electromagnetic radiation to be 
separated based on wavelength, providing a means of imaging in different regions of the 
electromagnetic spectrum, which are often referred to as spectral bands. The most common 
filtering systems include grating, prism and liquid crystal tuneable filters [7]. Based 
on a combination of these two parameters, there are three main classifications of optical 
imagers, namely panchromatic, multispectral and hyperspectral. 
Multispectral and hyperspectral imagers provide the capability to capture radiance 
values for multiple spectral bands. The two classes are ultimately 
differentiated by their achievable spectral resolutions. Hyperspectral imagers 
capture data in a large number of contiguous narrow bands, typically from 20 to over 200. 
Multispectral imagers, on the other hand, capture data in a lower number of 
discrete wide spectral bands, up to around 15, and thus at a lower spectral resolution. 
The increased spectral resolution of hyperspectral imagers is often 
achieved at a cost in spatial resolution. Typically, multispectral imagers obtain a much 
smaller GSD, and therefore a much higher spatial resolution, than hyperspectral imagers. 
However, the highest spatial resolutions are most often achieved by panchromatic imagers, 
which capture data across a wide range of wavelengths, mostly in the visible portion of 
the spectrum [8]. 
As multispectral imagers provide a well-balanced trade-off between spatial and spectral 
resolutions, they are the most popular today for commercial satellite EO missions. 
They are often flown in combination with a panchromatic imager, so that the panchromatic 
and multispectral imagery can be combined on the ground using a post-processing 
technique called pan-sharpening to achieve high spatial resolution multi-band 
imagery [9]. 
The USA’s Landsat program was an early pioneer for multispectral satellite imagery. 
The first Landsat satellite, launched in 1972, flew the Multispectral Scanner (MSS) imager, 
capable of capturing data in up to 4 spectral bands covering 0.5 – 1.1 µm with a 
ground resolution of 80m [10]. The Landsat series of satellites has successfully highlighted 
the potential terrestrial applications for such data. As adoption of satellite EO 
imagery has grown, many user applications have demanded greater spatial resolution 
products. A milestone in spatial resolution capabilities was achieved in 1999 by the DigitalGlobe 
Inc. IKONOS satellite, which was the first commercial satellite to deliver sub-metre 
panchromatic imagery. Prior to this, the Landsat-7 and SPOT-4 satellites were considered 
state-of-the-art, providing 30m and 20m multispectral and 15m and 10m 
panchromatic spatial resolutions respectively. Today a GSD of one metre or less is the 
standard for high-resolution EO imagery missions. This has resulted in an exponential 
increase in captured data volumes; today many satellites can generate upwards of several 
TeraBytes (TB) of data a day [11]. Figure 2-2 demonstrates this trend by plotting the cumulative 
data produced by the 10 MODIS and Landsat missions and comparing this to the 
more recent Sentinel 1-2-3 satellites. 
  
 
 
Figure 2-2 Data volume evolution over time [11] 
 
EO satellites are typically operated in a store and forward mode, whereby data is captured 
and then stored onboard until a downlink can be established with an appropriate 
ground station. Whilst technological advances have enabled the efficient storage of these 
growing data volumes onboard, the same level of advancement in downlink technologies 
has not occurred [1][2]. As a result, the satellite downlink system is currently the bottleneck 
in the timely delivery of captured data to ground for exploitation. Space-based 
downlink technologies are primarily limited by bandwidth. Limitations on antenna size, 
pointing abilities and transmission power, together with the restricted availability of, and 
increasing demand for, transmission frequencies, also contribute to the challenges in 
advancing downlink systems [3]. Whilst next generation downlink approaches including optical 
and inter-satellite links have been proposed in recent years, they are not yet commercially proven 
as a cost-effective method to sufficiently alleviate the data bottleneck [12]. 
In recent years, constellations of satellites have also been explored to increase the 
timeliness of space borne EO data delivery. This has been possible due to the sustained 
miniaturisation and improved proficiency of nano (1-10kg), micro (10-100kg) 
and small (100-500kg) EO satellites [13]. According to the 2017 SpaceWorks report, the 
1-50kg satellite range has experienced the most growth in recent years; in 2009 – 2013, 
fewer than 200 of these satellites were launched and only 12% were for remote sensing 
applications [13]. In 2016, around 400 additional 1-50kg satellites were launched, and the 
percentage utilised for remote sensing applications increased to 43% in that year. 
Almost half of the nano and microsatellites launched in 2016 were for commercial EO 
constellations such as Planet’s Flock satellites [14]. However, these satellites still suffer 
from the downlink induced data bottleneck. In 2017 the Flock satellites achieved on average 
a downlink rate of 160 Mbits/sec, downloading approximately 15 GigaBytes (GB) per 
7-10 minute ground station pass [14]. In order to downlink the several TB of data 
captured each day, Planet have employed a network of 8 ground stations. However, this 
approach can be extremely costly and is often impractical for many satellite operators, 
due to the political difficulties associated with providing the required broad geographical 
coverage. 
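The per-pass figures quoted above can be related through simple arithmetic. The following is a minimal sketch, assuming a constant downlink rate for the whole pass, decimal units (1 GB = 10^9 bytes) and no protocol overhead; the function names are hypothetical and are not taken from [14]. Under these assumptions a 160 Mbits/sec link delivers roughly 8 to 12 GB in a 7-10 minute pass, which is the same order as the approximately 15 GB quoted, and each TB of captured data requires between roughly 67 passes (at 15 GB per pass) and 119 passes (at 8.4 GB per pass). 

```python
# Illustrative downlink budget using the figures quoted in the text.
# Assumptions: constant rate during the pass, decimal units, no overhead.
# Function names are hypothetical.

def gigabytes_per_pass(rate_mbit_s: float, pass_minutes: float) -> float:
    """Data volume in GB downlinked during one ground station pass."""
    bits = rate_mbit_s * 1e6 * pass_minutes * 60
    return bits / 8 / 1e9


def passes_per_terabyte(gb_per_pass: float) -> float:
    """Number of passes needed to downlink 1 TB (1000 GB) of data."""
    return 1000.0 / gb_per_pass


shortest = gigabytes_per_pass(160, 7)   # 7 minute pass
longest = gigabytes_per_pass(160, 10)   # 10 minute pass
print(f"{shortest:.1f} to {longest:.1f} GB per pass at 160 Mbits/sec")
print(f"{passes_per_terabyte(15.0):.0f} passes per TB at 15 GB per pass")
print(f"{passes_per_terabyte(shortest):.0f} passes per TB at {shortest:.1f} GB per pass")
```

The number of passes scales linearly with the daily data volume, which is why a network of several ground stations is needed once volumes reach the TB range. 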
The increase in the number and percentage of launched EO satellites was predicted to 
continue for the period 2017-2019. SpaceWorks predicted that by the end of 2019, 600 
satellites would be launched, with EO remote sensing applications having a 64% market 
share [13]. In particular, low mass EO satellite constellations are becoming increasingly 
popular as they provide a unique solution for more rapid revisit times. Therefore, 
there is increasing and immediate pressure on satellite designers to alleviate the onboard 
data bottleneck to ensure the increasing volume of data being generated can be efficiently 
transmitted and distributed to users in a timely and cost-effective manner. 
  
2.2    Onboard Data Processing 
Onboard data processing is a wide-ranging field which encompasses aspects of space 
environment science, hardware design, software implementation, and processing algorithm 
selection. Today onboard data processing systems are employed to reduce the data 
volume to be downlinked, with varying success due to the complex trade-offs associated 
with these different aspects of the field. 
 
2.2.1  The Space Environment  
Outside of the Earth’s protective atmosphere, the physical characteristics of space are 
distinctly different to the environment on Earth. For modern computing technologies, the 
most relevant and challenging difference between the terrestrial environment and space 
is the presence of radiation. This radiation is in the form of high-energy particles 
which are ejected from our sun and from supernova explosions outside of our solar system, or 
which are trapped and accelerated in the near-Earth environment by the Earth’s 
magnetosphere (the Van Allen belts), as shown in Figure 2-3. 
 
Figure 2-3 Van Allen belts and Earth orbits, simplified from [15] 
[Figure labels: Outer Belt, 19,000 - 40,000 km; Medium Earth Orbit (MEO), 2,000 - 20,200 km; GPS Satellites, 20,000 km; Inner Belt, 1,600 - 13,000 km; Low Earth Orbit (LEO), 160 - 2,000 km; International Space Station, 390 km; NASA Van Allen Probe A; NASA Van Allen Probe B; Geostationary Orbit (GEO), 35,786 km; Telecommunications Satellites] 
 
Radiation can cause many different types of effects, of varying severity, on a wide 
range of digital electronics and semiconductor devices used in modern computing systems. 
There are two major types of radiation effects: short-term single event effects 
(SEEs) and long-term total ionising dose (TID) [16]. Long-term TID effects are experienced 
when ionisation is induced in the semiconductor or insulator layers, leading to the 
formation of interface states at the semiconductor-insulator boundary or trapped 
charges. This results in undesirable changes in component behaviour and eventually leads 
to total device failure. Imaging EO satellites are most commonly deployed in a Low Earth 
Orbit (LEO); here TID due to trapped electrons can be effectively mitigated by ensuring 
the device is sufficiently shielded to limit the exposure to and accumulation of radiation. 
However, shielding is significantly less effective against high energy protons. These 
particles commonly induce short term SEEs as they directly or 
indirectly penetrate and ionise a device. SEEs have many error type sub-classifications, 
which can be either temporary or permanent. As a result, SEEs are much more complex 
to model and more difficult to mitigate than TID effects. 
Single event upsets (SEUs) are the most common SEE type and are of specific concern 
in the space industry. SEUs are typically temporary soft errors which often result in the 
instantaneous changing of a device’s underlying transistor state, also known as a single bit 
upset (SBU). However, with shrinking transistor dimensions, the probability of multiple 
bit upsets (MBUs) from a single particle strike has increased [17]. At the application level, 
SEU effects can be broadly classified as having one of three outcomes: a functional interrupt 
(FI), a silent data corruption (SDC) or a masked error [18]. 
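The distinction between an SBU and an MBU can be illustrated at the level of a single stored word. The following is a minimal sketch, assuming a 16-bit unsigned sample value and a hypothetical helper named flip_bits; it models only the resulting change in stored state, not the physical mechanism of a particle strike. 

```python
from typing import Iterable


def flip_bits(value: int, positions: Iterable[int], width: int = 16) -> int:
    """Return value with the bits at the given positions inverted.

    Position 0 is the least significant bit. The helper is hypothetical
    and stands in for the effect of an upset on a stored word.
    """
    mask = 0
    for position in positions:
        if not 0 <= position < width:
            raise ValueError(f"bit position {position} outside a {width}-bit word")
        mask |= 1 << position
    return (value ^ mask) & ((1 << width) - 1)


sample = 500                     # assumed 16-bit sample value
sbu = flip_bits(sample, [3])     # single bit upset: one bit changes
mbu = flip_bits(sample, [3, 4])  # multiple bit upset: two adjacent bits change
print(f"original={sample:016b} sbu={sbu:016b} mbu={mbu:016b}")
```

Whether such a corrupted word leads to an FI, an SDC or a masked error depends on how the application subsequently uses it. 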
An FI is defined as an error that causes an application to hang or malfunction, so that it 
does not successfully complete. FIs can be identified by either an undesirable application 
exit status or the timeout of a watchdog timer. Due to the often strict and deterministic timing 
requirements of an onboard data processing system, the occurrence of an FI can be critical. 
An SDC is defined as occurring when the application successfully completes but the application 
output is incorrect. Whilst an SDC can be functionally tolerated by the data 
processing chain, data errors have the potential to propagate extensively if not detected or 
corrected. A masked error occurs when an interaction between a device and a radiation particle 
takes place but there is no observable error outcome (FI or SDC). The rate of masked errors 
equates to the system’s error resilience, which is its ability to withstand errors should they 
occur. 
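The three outcome classes defined above map naturally onto a simple decision rule. The following is a minimal sketch, assuming that a reference ("golden") output from an error-free run is available, that an exit status of zero denotes successful completion, and that a watchdog timeout is reported as a missing exit status; the names Outcome and classify_run are hypothetical. 

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    FUNCTIONAL_INTERRUPT = "FI"
    SILENT_DATA_CORRUPTION = "SDC"
    MASKED = "masked"


def classify_run(exit_status: Optional[int],
                 output: Optional[bytes],
                 golden: bytes) -> Outcome:
    """Classify one application run that experienced an upset.

    exit_status is None when the watchdog timer expired (assumed convention).
    golden is the output of an error-free reference run (assumed available).
    """
    # Hang (timeout) or abnormal exit: the application did not complete.
    if exit_status is None or exit_status != 0:
        return Outcome.FUNCTIONAL_INTERRUPT
    # Completed normally but the output differs from the reference.
    if output != golden:
        return Outcome.SILENT_DATA_CORRUPTION
    # Completed normally with correct output: no observable error.
    return Outcome.MASKED


golden = b"\x01\x02\x03"
print(classify_run(None, None, golden).value)          # watchdog timeout
print(classify_run(0, b"\x01\x02\x07", golden).value)  # wrong output
print(classify_run(0, b"\x01\x02\x03", golden).value)  # no visible effect
```

The FI check is placed first because an application that did not complete has no meaningful output to compare. 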
Error masking can occur at many different system levels. For example, an interaction 
between a device and a radiation particle may not result in a transistor state change at the 
hardware level, and a hardware error may not have an impact on the output of the system. 
Error masking can also be inherent or purposefully designed to increase the error resilience of 
the hardware, software or algorithms used in a system. To reduce the probability of an 
observable error event occurring, the traditional approach in the space industry is to deploy 
a combination of proven space system designs, specially manufactured hardware 
components and error resilient software and firmware design techniques. 
 
2.2.2  Error Resilient Space System Design 
The methodology often taken in the small satellite market is a bottom-up approach, 
whereby the system is initially defined by selecting hardware to maximise performance 
and minimise constrained characteristics. The system behaviour is then defined, 
with software used as a means to implement functionality which cannot be 
realised in hardware. Due to requirements for high reliability, the system design often employs 
techniques to increase the error resilience. The majority of these techniques involve 
the introduction of redundant functional blocks, to reduce the number of single points of 
failure and mitigate errors and their effects. Several of these concepts are illustrated in 
Figure 2-4, where the system on the left-hand side depicts an unprotected processing system 
with multiple single points of failure, whilst the right-hand side depicts a more 
error resilient system design. 
 
 
 
Figure 2-4 Architectural error resilience design 
 
The error resilient system design in Figure 2-4 features a redundant communication 
bus and redundant functional blocks. This reduces the number of single points of failure 
in the system, which is important to avoid loss of system functionality or data and enable 
graceful degradation. Error resilience of functional blocks A and B is also increased when 
the redundant blocks are used to implement techniques such as dual modular redundancy 
(DMR) and triple modular redundancy (TMR). In DMR, one additional copy of the 
functional block is deployed, as shown for functional block A in Figure 2-4. Using DMR, 
errors can be detected by comparing the outputs of the two blocks, but it does not provide 
correction capabilities [19]. For data processing applications, additional software or 
firmware can be deployed to allow calculations to be restarted upon an error being detect-
ed. Alternatively, functional block B in Figure 2-4 is depicted as being protected by 
TMR. TMR deploys three copies of the same functional block and the addition of voting 
logic enables errors to be both detected and corrected [20].  
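
As a minimal sketch of the comparison and voting behaviour described above, DMR and TMR can be expressed in software as follows. The function names and the toy functional block (`healthy`, `faulty`) are hypothetical and chosen purely for illustration; they are not taken from [19] or [20].

```python
from collections import Counter


def dmr(block_a, block_b, x):
    """Dual modular redundancy: run two copies and compare.

    Returns (value, error_detected). A mismatch can be detected but not
    corrected, because there is no way to tell which copy is wrong.
    """
    out_a, out_b = block_a(x), block_b(x)
    if out_a == out_b:
        return out_a, False
    return None, True


def tmr(block_1, block_2, block_3, x):
    """Triple modular redundancy: run three copies and majority-vote.

    Returns (value, error_detected). A single faulty copy is out-voted,
    so the error is both detected and corrected.
    """
    outputs = [block_1(x), block_2(x), block_3(x)]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes == 3:
        return value, False
    if votes == 2:
        return value, True
    return None, True  # all three disagree: detected but uncorrectable


def healthy(x):
    """Hypothetical functional block."""
    return x * x


def faulty(x):
    """Same block with a hypothetical upset: bit 0 of the result flipped."""
    return (x * x) ^ 1


dmr_result = dmr(healthy, faulty, 7)           # mismatch is detected only
tmr_result = tmr(healthy, faulty, healthy, 7)  # faulty copy is out-voted
print(dmr_result, tmr_result)
```

In the DMR case the caller would typically have to restart the calculation, whereas in the TMR case the voted value can be used directly.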
These redundancy-based techniques are very versatile and due to their generic ap-
proach, they can be applied to many different types of functional blocks, such as memory 
or computation blocks, and can be implemented either in hardware or software and at a 
number of architectural levels. The trade-off of these redundancy-based techniques is that 
often the power consumption, mass and volume are increased, or the processing capabili-
ties are reduced. Figure 2-5 provides a simple illustration of how the traditional space 
system design approach is commonly applied to onboard payload data processing for two 
different mission profiles consisting of a push-broom and an area array payload. The ar-
chitecture features cross-strapped hard redundant main processing devices, dual-
redundant memory structure and a simple open-loop payload data processing flow im-
plemented using point-to-point communications. 
 
 
Figure 2-5 Example space data processing architectures 
[Figure 2-5 shows two payload data processing systems, one for a push-broom payload (line 
arrays 0 to N, each with bands 0 to N) and one for an area array payload (bands 0 to N, 
images over time). Each system comprises primary and redundant FPGAs (labelled as 
radiation hardened in one of the two systems) performing data interface and buffering, 
compression, and downlink interface and data formatting, together with primary and 
redundant mass memory, a platform interface and a downlink. 
Example 1 characteristics: payload with high spatial and spectral resolution; platform of 
300 kg - 500 kg; subsystem with medium power consumption (< 50 W). 
Example 2 characteristics: payload with medium spatial and low spectral resolution; 
platform of 50 kg - 100 kg; subsystem with low power consumption (< 25 W).] 
 Variations in platform and mission requirements often influence the physical con-
straints of the system, such as mass, volume and power consumption. Payload 
characteristics can have a wide impact on interfacing, memory and processing require-
ments. Therefore, to meet mission specific requirements, often changes to both hardware 
and software of the system are required. Consequently, whilst both systems, pictured in 
Figure 2-5, are based on the same fundamental architecture, it will often transpire that 
each of the data processing systems will be implemented as a bespoke solution. 
 
2.2.3  Traditional Onboard Data Processing Hardware 
Radiation Hardened (RH) devices are often manufactured on insulating substrates, such as 
Silicon On Insulator (SOI) and specifically Silicon On Sapphire (SOS), instead of common 
semiconductor wafers [21]. These technologies provide resistance to latch-ups by 
lowering the parasitic capacitance due to the insulation from the bulk silicon and the iso-
lation of the n- and p-well structures. Due to the electrical insulating characteristics of 
sapphire, SOS technologies provide both high TID and SEU immunity. The primary bar-
rier to RH implementation is the drastic increase in monetary cost, due to increases in 
expenses regarding the materials and extensive development and testing required to de-
sign a radiation tolerant device. As a result, RH components tend to lag behind with 
regards to cutting-edge performance and features.  
In early satellite missions, data compression was performed in software on the main 
onboard computer (OBC) of the satellite [5]. However, as computational demands increased, 
the industry experienced a shift toward providing compression capabilities via 
dedicated hardware solutions. Payload data processing and compression are commonly 
implemented onboard EO satellites using dedicated hardware in a bespoke system, designed 
specifically to meet the mission requirements and conform to the constraints imposed by 
the space environment. The traditional approach to implementing onboard data processing, 
often employed by major space agencies and large-scale satellite companies, is to 
utilise only RH and space qualified processors.  
The USES (Universal Source Encoder for Space) ASIC (Application Specific Integrated 
Circuit) is the earliest example of RH hardware designed specifically for data 
compression [22]. USES is a NASA ASIC implementation of a simple Rice based algorithm 
with a multispectral image compression mode [23]. The USES chip has flown many 
times and therefore provides reliability confidence to users. However, during the mid to 
late 1990s, alternative solutions which leveraged more widely available hardware were 
explored, such as digital signal processor (DSP) based implementations. DSPs are 
microprocessors designed and optimised to perform a specific task in a time efficient manner. 
Due to the sequential nature of the underlying architecture, DSP functionality is not easily 
scaled with increasing input data rates. By the early 2000s, the inherently parallel and 
flexible field programmable gate array (FPGA) device became the established solution 
for onboard data processing [5].  
An FPGA is a semiconductor device that comprises configurable logic blocks connected 
by programmable interconnects. The FPGA architecture is clock cycle based and 
parallel in nature, which allows data and computational pipelines to be fully defined by the 
user. FPGAs are relatively small and low in mass, and some can be reprogrammed many 
times. In the space industry this has enabled satellite manufacturers to perform 
post-launch reconfiguration to extend the lifetime of missions and has increased the 
possibilities for design reuse across multiple missions. FPGA manufacturers also market RH 
versions of their devices to specifically target the aerospace market [24][25].   
Whilst advances in processing devices due to Moore’s law alone have been sufficient 
for previous generations of onboard computing devices to meet requirements, this is 
no longer the case [26]. Today, space processors require longer and more expensive 
development cycles, which frequently renders them obsolete in terms of processing 
performance by the time of launch, and they now lag several generations behind their 
terrestrial counterparts [2]. The impact of this is compounded by the fact that the 
underlying system architecture itself lacks sufficient flexibility to allow for effective 
scaling to meet variations in requirements without resulting in unacceptable increases in 
mass, volume and power consumption. Therefore, the payload data processing system is 
at risk of becoming an additional bottleneck in the data delivery chain.  
In 2007, ESA held a round table event inviting attendees from the European space 
community, inclusive of industry, space agencies and research groups, to discuss the 
challenges associated with next generation onboard payload data processing [2]. The overall 
conclusion drawn from the event was that current hardware developments cannot fulfil 
the requirements of either current or future high-performance data processing 
applications. The event included discussions to establish broad future mission requirements for 
any new onboard data processing system; the key requirements discussed are summarised 
in Table 2-1. These findings support suggestions that the limited capabilities of current 
space qualified hardware are in fact holding back the development of new high-performance 
onboard data processing systems. Onboard processing capabilities will need to continue 
to increase in priority to help alleviate the growing onboard data bottleneck, and ESA’s 
findings even suggest that this is becoming more essential than ensuring low electrical 
power consumption. 
Table 2-1 Future onboard processing mission attributes [2] 

Organisation              Future Mission Attributes 
ESA                       - Onboard compression shall be re-programmable and lossless 
                          - Onboard autonomy, data selection and downlink encryption 
                          - Efficient data compression is essential 
CNES                      - Flexibility and re-programmability in-flight are important factors 
Airbus Defence & Space    - High performance is needed for onboard autonomy 
(Formerly Astrium)        - More memory and processing power is needed 
                          - New technology should enforce reusable solutions such as IP 

Increasingly, COTS components are being deployed in avionic systems in the commercial 
space sector. This fundamental shift in the approach to hardware selection has 
occurred for several reasons. These include improvements in commercial manufacturing 
and screening techniques, and developments in environmental testing 
methodologies to accurately assess components in a representative space 
environment [27]. Additionally, software and firmware based error mitigation techniques 
have been developed, and deploying COTS processors with error resilient architectural and 
software techniques has been demonstrated as a feasible and cost-effective approach to 
help advance satellite technologies [28].  
 
2.2.4  Error Resilient Software & Firmware Design 
There is very little that can be done by a third-party to increase error resilience of the 
underlying hardware of COTS devices. Therefore, architectural system design and 
software-based fault tolerance (SBFT) techniques have been developed to increase the error 
resilience of COTS based systems in the space environment. In the space industry, 
software-based error resilience techniques are often employed in addition to error resilient 
system design. There are two major types of software-based techniques deployed today: 
generic error resilient techniques and algorithm-based fault tolerance (ABFT).  
Generic techniques include DMR, TMR, error detection and correction (EDAC) codes, 
memory scrubbing, and check-pointing and rollback (CPR). EDAC allows for 
both the detection and correction of errors in a data sequence by the addition of redundant 
data bits and implementation of a detection and correction algorithm [29]. EDAC codes 
can be used to protect both memory elements and computational elements by the encoding 
of state machine logic. Two of the most popular codes for these purposes 
are the Cyclic Redundancy Check (CRC), which on its own is typically used for error 
detection, and Hamming encoding [30][31]. Memory scrubbing 
is the technique used to prevent the build-up of persistent errors in configuration 
memory by refreshing and restoring the memory to a known error-free state [32]. The known 
state is often provided by a golden reference stored in protected radiation tolerant memory. 
CPR is a technique often implemented to address control-flow errors, FIs and data 
corruption [33]. CPR periodically stores a known error-free state of operation, 
known as a snapshot. If an error is detected during application execution, the system can 
roll back to the last snapshot and continue execution, allowing the system to recover from 
the error. CPR introduces a significant time overhead and makes an application 
non-deterministic in time, and therefore may not be suitable for space applications 
which have strict timing requirements.  
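
To illustrate how redundant bits allow a single error to be both detected and corrected, the following is a minimal sketch of the textbook Hamming(7,4) code. It is offered only as an example of the EDAC principle; the function names are hypothetical and the code is not the specific implementation discussed in [30][31].

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Codeword layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Correct up to one flipped bit and return (data_bits, error_position).

    error_position is 0 when no error was found, otherwise the 1-based
    position of the bit that was corrected.
    """
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # equals the position of a single error
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


data = [1, 0, 1, 1]
codeword = hamming74_encode(data)
upset = list(codeword)
upset[4] ^= 1                         # hypothetical single-bit upset at position 5
recovered, position = hamming74_decode(upset)
print(recovered, position)
```

Three redundant bits protect four data bits here, which shows the storage overhead that such codes trade for single-error correction.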
ABFT techniques are distinguished from generic error resilient processing methods 
by three fundamental principles. The first is that the input data used by the algorithm is 
encoded, secondly the algorithm needs to be redesigned to work directly with the encoded 
data and thirdly that the computational operations of the algorithm are distributed 
amongst available resources [34]. Due to these principles, ABFT cannot be easily applied 
to all data processing algorithms, and significant research and development resource is 
required in order to successfully apply ABFT to a specific application. In addition, ABFT 
protected applications often experience a considerable increase in design and implementation 
complexity. As a result, these techniques are rarely practically deployed in commercial 
space systems. 
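
The first two ABFT principles can be illustrated with the classic checksum-encoded matrix multiplication. The sketch below assumes integer matrices so that checksums can be compared exactly (floating-point data would need a tolerance), runs on a single processor rather than distributing the work, and uses hypothetical function names. It is an illustrative example of the general idea rather than the formulation given in [34].

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_column_checksums(a):
    """Encode A: append a row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksums(b):
    """Encode B: append to each row the sum of that row."""
    return [row + [sum(row)] for row in b]


def locate_single_error(c_full):
    """Return (row, col) of one corrupted data element, or None if consistent."""
    n_rows, n_cols = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if sum(c_full[i][:n_cols]) != c_full[i][n_cols]]
    bad_cols = [j for j in range(n_cols)
                if sum(c_full[i][j] for i in range(n_rows)) != c_full[n_rows][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern not handled by this sketch")


def correct_single_error(c_full, row, col):
    """Rebuild one element from its row checksum and the other row elements."""
    n_cols = len(c_full[0]) - 1
    others = sum(c_full[row][j] for j in range(n_cols) if j != col)
    c_full[row][col] = c_full[row][n_cols] - others


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
# The multiplication operates directly on the encoded operands.
c_full = matmul(with_column_checksums(a), with_row_checksums(b))
c_full[1][0] += 8                        # hypothetical upset in one result element
position = locate_single_error(c_full)
correct_single_error(c_full, *position)
product = [row[:-1] for row in c_full[:-1]]
print(position, product)
```

Even in this small case the algorithm, its data layout and its checking logic all had to be redesigned around the encoding, which reflects the added design complexity noted above.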
In order to make an informed selection between error protection techniques and 
meet a given requirements trade-off, it is key to understand the inherent error 
probabilities and the increased reliability provided by the selected techniques. One of the 
most common and comprehensive methodologies for characterising the error resilience of a 
system is practical beam testing. Beam testing uses a radiation source, such as protons or 
heavy ions, to irradiate a physical device or system under test. This aims to provide a 
realistic demonstration of the error resilience of the system in a certain radiation 
environment. Beam testing is often used to determine the probability that a gate-level error 
in hardware propagates to the software application and its output, which is often referred 
to as the architectural vulnerability factor (AVF) [35]. This is an important metric because 
not all impacts between a radiation particle and a device will result in an observed event, 
SDC or FI, at the application output, due to various masking effects.  
Hardware level error masking occurs when a fault does not propagate from the phys-
ical hardware level to software application level. The probability that a physical error is 
not masked and propagates to the software layer is termed the hardware vulnerability fac-
tor (HVF) [36]. COTS processor manufacturers and developers very rarely disclose the 
reliability or specifically the HVF of their devices. This often needs to be determined 
through experimental testing at the expense of the user. In addition, modern processors 
often feature dedicated hardware resources for different responsibilities such as control, 
memory and computational logic, making the testing process complex. Therefore, care-
fully designed test applications are required for beam testing to ensure the different 
underlying hardware blocks are adequately exposed and correctly characterised. Testing 
computational logic is often performed by implementing simple and repetitive algorithms 
such as matrix summation and multiplication, which test adder and multiplier circuitry 
respectively. 
Whilst beam testing can be effectively used to represent many different radiation 
environments, controlled testing is extremely difficult, costly and time consuming. This 
often makes it infeasible to test all possible system configurations and applications 
for a given hardware device. For practicality, the error resilience of the 
hardware and software are often evaluated separately. In this respect, software-based error 
injection provides a cheap and time efficient framework to replicate the effects of physical 
hardware faults, by intentionally altering instructions or data in a controlled manner, to 
assess the error resilience of a software application. Software-based error injection is a 
useful technique to initially classify the error resilience of a software application [37]. 
The results can be collected frequently and in a timely manner to help with the selection 
and development of efficient error mitigation techniques.  
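
A software-based error injection campaign of this kind can be sketched in a few lines. The example below flips one random bit in one random input sample per trial and classifies the outcome against a fault-free reference run. The application under test, the outcome labels and all function names are hypothetical, and a real campaign would also target instructions and intermediate state rather than input data alone.

```python
import random
import struct


def flip_bit(value, bit):
    """Return a 32-bit float with one bit of its IEEE 754 encoding flipped."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def application(samples):
    """Hypothetical application under test: 8-bit quantised mean of the input."""
    mean = sum(samples) / len(samples)
    return max(0, min(255, int(mean)))


def run_campaign(samples, trials, seed=0):
    """Inject one single-bit data fault per trial and classify each outcome."""
    rng = random.Random(seed)            # fixed seed keeps the campaign repeatable
    golden = application(samples)        # fault-free reference output
    counts = {"masked": 0, "sdc": 0, "crash": 0}
    for _ in range(trials):
        corrupted = list(samples)
        index = rng.randrange(len(corrupted))
        corrupted[index] = flip_bit(corrupted[index], rng.randrange(32))
        try:
            outcome = application(corrupted)
        except (ValueError, OverflowError):
            counts["crash"] += 1         # e.g. NaN or infinity reached int()
            continue
        counts["masked" if outcome == golden else "sdc"] += 1
    return counts


print(run_campaign([100.25, 101.5, 99.75, 100.5], trials=1000))
```

Flips in the low-order mantissa bits are masked by the quantisation step, whereas flips in the sign or exponent bits tend to corrupt the output, which is the kind of masking behaviour such a campaign is intended to expose.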
 
2.2.5  Payload Data Compression 
Surveys of data compression algorithms used onboard early EO satellites and other 
space systems, such as that presented by G. Yu et al., provide an important insight into 
the historical trends and mindset of the space industry regarding algorithm selection for 
onboard deployment [5]. Typically, a major factor towards successful and widespread 
deployment of an algorithm in the space sector has been computational complexity. 
Specifically, standardised algorithms, which include those proposed by the Joint Photographic 
Experts Group (JPEG) and the Consultative Committee for Space Data Systems (CCSDS), 
have been widely deployed in the space sector. JPEG was formed in 1986 and is a 
collaborative committee of several major standardisation organisations devoted to the 
discussion and development of state-of-the-art still image coding standards [38]. Whilst 
JPEG do not specifically address the major requirements of the space industry, several of 
their algorithms are widely deployed in the industry today.  
Conversely, CCSDS was founded in 1982 by the leading global space agencies of the 
time to provide a platform for the discussion and development of recommendations and 
standards specifically for space data and information systems. Whilst CCSDS also have 
several standardised and recommended algorithms, they are typically less commonly de-
ployed in onboard EO image data processing; this is due to the influence of data 
customers who more widely recognise the coding standards provided by JPEG.  
Due to the increased competition in the EO satellite market it is extremely difficult to 
get up-to-date details on the latest onboard data processing and compression techniques. 
However, some insight can be gained from reviewing the CCSDS green paper report 
which provides recommendations based on the expertise of the CCSDS members, reflect-
ing the current state of the industry [39]. The main algorithms discussed in these reports, 
which are commonly implemented onboard satellites today, are summarised in Table 2-2. 
Research into the current state-of-the-art in the wider image compression field is required 
to assess if there are alternative algorithms which could provide a significant advantage 
for next generation satellites. The research conducted to address this is detailed in        
Section 2.5.   
Table 2-2 Recommended onboard payload data and image compression algorithms  

Algorithm         Theoretical Basis             Key Advantage 
JPEG-LS [40]      Predictive                    Very low complexity 
JPEG2000 [41]     Discrete Wavelet Transform    Lossless and Lossy Compression 
CCSDS-121 [42]    Entropy Encoding              Suits wide range of data types 
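
As an indication of why entropy coders in the Rice family are regarded as low in complexity, the following is a minimal sketch of a Golomb-Rice code with a fixed parameter k. It is a generic illustration only: it is not the adaptive coder specified by CCSDS-121, nor the USES implementation, and the function names are hypothetical.

```python
def rice_encode(value, k):
    """Encode a non-negative integer as a Rice code with parameter k.

    The quotient value >> k is written in unary (that many '1's and a
    terminating '0'), followed by the k low-order bits in binary.
    """
    if value < 0 or k < 0:
        raise ValueError("value and k must be non-negative")
    quotient = value >> k
    remainder = value & ((1 << k) - 1)
    tail = format(remainder, "b").zfill(k) if k else ""
    return "1" * quotient + "0" + tail


def rice_decode(bits, k):
    """Decode one Rice codeword from the front of a bit string.

    Returns (value, remaining_bits).
    """
    quotient = bits.index("0")           # length of the unary prefix
    start = quotient + 1
    remainder = int(bits[start:start + k], 2) if k else 0
    return (quotient << k) | remainder, bits[start + k:]


residuals = [3, 0, 5, 2]                 # hypothetical small prediction residuals
stream = "".join(rice_encode(r, 2) for r in residuals)

decoded, rest = [], stream
while rest:
    value, rest = rice_decode(rest, 2)
    decoded.append(value)
print(stream, decoded)
```

Encoding needs only shifts, masks and a unary count, so small values such as prediction residuals are represented in a few bits without multiplication or table look-ups.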
 
2.3    Terrestrial Computing System Design 
To satisfy future computational mission requirements, there is a requirement for re-
search into new alternative hardware devices and system architectures. Due to the growth 
and diversity of terrestrial computing and its application areas, new techniques developed 
in these areas could provide inspiration for new approaches for the space community. 
Whilst the traditional design approach for space systems is well proven for providing 
high error resilience, this is often traded off against power consumption and computational 
capability. As power consumption and computational capabilities are becoming increasingly 
critical, there is a need for a new approach to onboard data processing system design 
which prioritises these areas. Therefore, state-of-the-art computing paradigms from 
terrestrial applications have been investigated and those that present the highest potential 
advantage for the space community are discussed in the following subsections.  
 
2.3.1  Heterogeneous Computing 
Heterogeneous computing was originally a high-performance computing architecture 
approach [43]. However, many terrestrial applications today employ the principles of 
heterogeneous computing towards increased system performance and high energy efficiency. 
Heterogeneous systems employ multiple nodes which are fundamentally different. These 
nodes can differ in a number of ways, including microarchitecture, 
instruction set architecture (ISA), memory hierarchy and performance. Applications which 
are constructed from multiple functions with different memory access patterns or 
computational characteristics benefit the most from employing a heterogeneous system. The 
nature of heterogeneity often means that certain system nodes have increased performance 
or efficiency for certain tasks.  
The key to exploiting heterogeneous computing effectively is to comprehensively 
understand the application, select the appropriate resources and effectively map 
the desired tasks to the most suitable resources. Onboard data processing systems often 
feature a number of different data handling, data processing and control functions 
which are characteristically different. This makes heterogeneous computing a 
viable approach for next generation onboard data processing.  
 
2.3.2  Cluster Computing 
One of the most prolific state-of-the-art computing approaches today is cluster com-
puting, which has found success in HPC, large-scale and also small-scale commercial 
computing sectors [44]. Cluster computing architectures are characterised by the use of a 
number of individual configurable nodes which, under centralised control, perform tasks 
cooperatively and act as a single system. Many cluster computers utilise multiple low-
cost COTS computers in a networked configuration to achieve increased computing per-
formance and greater reliability, as depicted in Figure 2-6. This mirrors both the current 
trends and requirements of the space community; therefore, there is a significant oppor-
tunity to deploy key principles from this approach in a space application.  
 
Figure 2-6 Cluster computer architecture 
 
A key enabler for the large flexibility and scalability of the cluster computing archi-
tecture, which would be well suited to next generation onboard data processing systems, 
is the backplane. Originally, backplanes were deployed in terrestrial computing to provide 
increased reliability compared to cable-based connection solutions. Today they are also 
used to improve system maintenance, help to reduce the system mass, volume, complexi-
ty and to increase system scalability. Using a backplane, system scalability is often 
simply achieved by physically connecting additional computational or memory nodes to 
the backplane to expand the capability of the overall system. The system controller is 
subsequently responsible for the utilisation and allocation of work to all resources in the 
system. This minimises the need for changes to both hardware and software. 
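
The scalability argument above can be sketched as follows: a controller splits the input into tiles and allocates them to however many nodes are attached, so that adding nodes requires no change to the task code. In this illustrative example worker threads stand in for physical nodes on the backplane, and the function names and the per-tile task are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_tile(tile):
    """Hypothetical per-node task: here, simply sum the samples in a tile."""
    return sum(tile)


def master(data, tile_size, n_nodes):
    """Split the input into tiles and farm them out to n_nodes workers.

    Each worker thread stands in for one compute node. Results are gathered
    in input order, so the output does not depend on how many nodes exist.
    """
    tiles = [data[i:i + tile_size] for i in range(0, len(data), tile_size)]
    with ThreadPoolExecutor(max_workers=n_nodes) as nodes:
        return list(nodes.map(process_tile, tiles))


data = list(range(16))
two_nodes = master(data, tile_size=4, n_nodes=2)
four_nodes = master(data, tile_size=4, n_nodes=4)   # scaled up, same task code
print(two_nodes, four_nodes)
```

Only the `n_nodes` argument changes between the two calls, which mirrors the way additional nodes are attached to a backplane without modifying the application.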
[Figure 2-6 labels: Input Data, Output Data, Master Node, Slave Nodes 1 to n, Backplane] 
2.3.3  Hardware-Software Co-design 
Several terrestrial computing sectors have experienced difficulties similar to those 
currently faced in the space sector regarding efficient system design [45][46]. As hardware 
and software have both become more complex, the industry found that the traditional 
bottom-up system design approach often led to sub-optimum designs which lacked the 
capabilities of a unified solution. This is often caused by overlooked incompatibilities 
between hardware and software, which cause difficulties in meeting requirements and 
extended development times. Today, a system design approach called hardware-software 
co-design (HS-CD) is being increasingly utilised in terrestrial applications [47]. Device 
selection on quoted performance alone can result in a sub-optimum solution, as the 
performance can vary depending upon the intended application. With regards to software 
design, algorithmic performance can be heavily influenced by the software’s ability to 
effectively utilise the hardware resources available, as well as the inherent performance 
characteristics of the chosen algorithm.  
A system can be defined either in the context of its behaviour or of its structure. Tra-
ditionally, behaviour has been defined in software and structure has been defined by 
hardware. However, HS-CD utilises these two types of system definition to help refine 
both hardware and software design concurrently. The principles can be summarised by a 
double roof model which is pictured in Figure 2-7 [47].  
 
Figure 2-7 Hardware-software co-design double roof model [47] 
 
The design approach starts with an initial top-level system definition which does not 
distinguish between hardware and software. Then through successive iterations the be-
havioural and structural design is refined by allocating resources and scheduling tasks to 
these resources. The advantage of the HS-CD approach is that it has been shown to re-
duce the risk associated with several design areas which include extended development 
time and resources, requirement non-conformity and system over-design.  
 
2.4    Terrestrial Processors 
When compared with traditional RH processors, terrestrial COTS processors can 
provide up to several orders of magnitude increase in computational performance, which 
is a key requirement for future onboard data processing. Figure 2-8, sourced from [48], 
highlights key trends in terrestrial microprocessor technology in the last forty-five years. 
Between 1970 and 2015, the key trends in processor technology were dominated by the 
increase in transistor density, following Moore’s Law [26], and the accompanying 
increase in clock frequencies towards achieving increased single-thread performance. A 
defining point in terrestrial processor development occurred in the mid-2000s when, due 
to the increasing growth in power consumption, practical clock frequency limits were 
reached and diminishing returns in single-thread performance were experienced. Up until 
this point processors were based on a serial execution model, where a single stream of 
instructions is executed sequentially. 
 
Figure 2-8 Forty five years of processors trend data [48] 
 
With the advent of multicore-CPUs (MC-CPU) and use of a parallel instruction and 
task execution model, the industry was able to continue to meet the raw performance 
growth expected with Moore’s law scaling without surpassing the feasible power con-
sumption limits. The MC-CPU is a single component which features multiple core 
processing units that can read and execute program instructions independently, meaning 
multiple instructions can be executed in parallel, thus enabling increases in the processing 
throughput per clock cycle and performance per watt of modern processors. The MC-
CPU can be typically classified as a multiple instruction multiple data (MIMD) architec-
ture whereby each core can independently execute different operations on a clock cycle.  
In addition to MC-CPUs, in recent years there has been a significant growth in other 
parallel processor devices, with one of the most successful of these architectures being the 
GPU. GPUs represent an alternative realisation of a multi-core based parallel processing 
architecture and are architecturally very different to CPUs. The CPU architecture is 
optimised to minimise operational latency, which requires a large amount of resources 
dedicated to the control of the instruction pipelines and data movement. In contrast, 
GPUs are based on a single instruction multiple data (SIMD) architecture which is 
primarily optimised for computational throughput as opposed to operation latency. The logic 
pipelines of a GPU are much simpler: all cores execute the same instruction on every 
cycle, but on different data. This reduces the control logic requirements but places a greater 
responsibility on the software developer to explicitly declare and exploit low-level 
parallelism.  
The GPU was originally devised as an accelerator for graphical processing and 
rendering for computer graphics display, and as a result it is almost exclusively deployed as 
part of a CPU-GPU heterogeneous computing system. This system architecture allows 
algorithms that exhibit a high degree of parallelism to be offloaded from the CPU for 
acceleration on the GPU, whilst tasks which do not exhibit the appropriate level of 
parallelism, or which have large complex data dependencies, can remain on the CPU where 
they can be executed most effectively. This execution model is demonstrated in Figure 2-9.  
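
The offload pattern can be sketched with a sequential Python emulation of a kernel launch. This is not GPU code: the loops below run one after another, whereas on a GPU the kernel invocations execute in parallel. The names `launch` and `scale_kernel` are hypothetical, and the block and thread indexing follows the convention commonly used in GPU programming models.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a kernel launch: call the kernel once per (block, thread) pair."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


def scale_kernel(block_idx, thread_idx, block_dim, src, dst, gain):
    """Each thread handles one element, selected by its global index."""
    i = block_idx * block_dim + thread_idx
    if i < len(src):                     # guard threads that fall past the end
        dst[i] = gain * src[i]


# Serial host code: prepare the data and choose the launch dimensions.
pixels = list(range(10))
scaled = [0] * len(pixels)
block_dim = 4
grid_dim = (len(pixels) + block_dim - 1) // block_dim   # round up to whole blocks

# Data-parallel section offloaded to the (emulated) GPU.
launch(scale_kernel, grid_dim, block_dim, pixels, scaled, 2)

# Serial host code resumes with the results.
total = sum(scaled)
print(scaled, total)
```

The element-wise scaling has no dependencies between elements and so suits offloading, whereas the final summation is kept on the host side in this sketch.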
 
Figure 2-9 CPU-GPU heterogeneous execution model 
[Figure 2-9 labels: CPU, GPU, Serial Code, Kernel, Block, Thread. Serial code sections are 
shown on the CPU and kernels on the GPU, with each kernel made up of blocks and each 
block made up of threads.] 
The theoretical speedup which can be achieved by the parallel implementation of an 
algorithm is determined by Amdahl’s law, first proposed in 1967 [49]. It states that the 
maximum speedup is determined by both the parallelisable and non-parallelisable parts of 
an algorithm. Therefore, parallel application performance is not just a function of the raw 
performance of the processor, but also depends on how well the software design and 
algorithm can exploit the underlying parallel hardware architecture.  
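
In its usual form, Amdahl's law gives the speedup on n processors as S(n) = 1 / ((1 - p) + p / n), where p is the fraction of the execution time that can be parallelised. The short sketch below evaluates this expression; the function names and the example values of p and n are assumptions chosen for illustration.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Theoretical speedup S(n) = 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


def amdahl_limit(parallel_fraction):
    """Upper bound on the speedup as the processor count tends to infinity."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


for n in (8, 256, 4096):
    print(n, round(amdahl_speedup(0.9, n), 2))
print("limit", round(amdahl_limit(0.9), 2))
```

With 90% of the work parallelisable the speedup can never exceed a factor of ten regardless of the number of cores, which is why the serial portion of an algorithm matters as much as the raw parallel throughput of the device.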
Both MC-CPUs and GPUs can typically be described as featuring control hardware, 
cache memory and logical cores; however, the organisation and proportions of each of 
these types of hardware vary between the two platforms. The key architectural differences 
are depicted in a simple graphical representation in Figure 2-10. A detailed review of 
the hardware architecture and the software programming models for GPU 
devices has been conducted and the key findings are detailed in the following subsections. 
 
Figure 2-10 CPU versus GPU hardware architectures 
  
2.4.1  GPU Hardware Architecture  
Whilst many principles of GPU computing are non-vendor specific, this research 
specifically explores NVIDIA GPU hardware, due to its wide popularity in general purpose 
computing applications. NVIDIA is a leading hardware design company that 
develops state-of-the-art GPUs, tailored specifically to several application groups including 
gaming, high-end graphical processing, data processing and, crucially, mobile 
computing. NVIDIA change aspects of the underlying hardware topology with each new 
GPU architecture generation, usually occurring every few years. However, the major 
hardware blocks and hierarchy of the SIMD structure have remained constant. 
Figure 2-11 provides an in-depth view of the underlying structure of a typical GPU.  
 
Figure 2-11 Block diagram of the hardware architecture of an NVIDIA GPU  
 
The top-level contains several control blocks, including the host interface, memory
controllers and GigaThread Engine [50]. These dedicated control blocks provide the
parallelism management capabilities and determine the runtime behaviour of the other
functional blocks. Specifically, the host interface provides a direct communication
channel between the GPU and its host. The host is often a CPU and is responsible for
invoking work on the GPU. The memory controllers allow the GPU to access off-chip global
DRAM for large volume data storage. The GigaThread Engine is responsible for scheduling
and distributing blocks of independent work to Streaming Multiprocessors (SMs).
SMs contain all the hardware blocks necessary for the low-level execution of the in-
voked workloads. NVIDIA GPUs are made up of an array of SMs, an approach which 
provides scalable parallelism at multiple architectural levels. The top-level also features 
L2 hardware-controlled cache memory which is shared by all SMs on the GPU. 
The GPU SM is responsible for controlling and implementing the low-level SIMD 
execution model. This is achieved using several dedicated control, memory and computa-
tional functional blocks, as depicted in Figure 2-11. To achieve SIMD execution, 
instructions are always issued in a group called a warp. In NVIDIA hardware the size of 
the warp, and hence the width of the SIMD pipeline, is equal to 32 instructions.  
The warp scheduler and dispatch units are responsible for managing and distributing 
warps of instructions to the execution units. The important fundamental execution units of 
the GPU are CUDA cores, Double Precision Units (DPU), Special Function Units (SFU) 
and Load and Store units (LD/ST).  
CUDA cores are the basic execution unit responsible for the majority of GPU in-
structions, including integer, single precision arithmetic, logical and branching 
operations. DPUs are optimised for the execution of double precision floating point in-
structions. SFUs perform complex mathematical operations such as square root, 
reciprocal and trigonometric functions, but can also perform floating point instructions. 
LD/ST units are responsible for issuing memory operations to the appropriate memory 
structures.  
Each individual execution unit in the SM represents a single lane in the SIMD pipeline.
The exact number of each of the execution units in hardware varies between specific
GPU architectures. It is common for there to be fewer DPU, SFU and LD/ST units than the
warp size; where this is the case, instructions are replayed to the execution units until
all instructions in the warp are complete. Thus, some operations can incur higher
execution latencies than others.
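The effect of having fewer execution units than the warp size can be put into numbers with a small host-side C++ sketch. It assumes, as a simplification, that each unit serves one lane of the warp per pass, so the number of passes is a ceiling division. The function name and the unit counts used in the note below are illustrative and do not describe any specific GPU.

```cpp
#include <cassert>

// Number of issue passes needed to push one warp-wide instruction through
// a group of `units` identical execution units, assuming each unit serves
// one lane per pass (a simplifying assumption).
int issuePasses(int warpSize, int units) {
    assert(warpSize > 0 && units > 0);
    return (warpSize + units - 1) / units;  // ceiling division
}
```

Under this model a 32-wide warp completes in one pass on 32 units, but needs four passes on a group of eight units, which is one reason some operations incur higher latency than others.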
In addition to the memories at the GPU top-level, there are also several SM level
memory structures. The instruction cache is shared by all warps allocated to the SM.
Instructions are cached in a separate structure from data due to differences in access
patterns. It is significantly more common to read instructions than to write them;
therefore, the instruction cache is specifically optimised for read operations.
Additionally, it is often more efficient to cache certain instructions in groups, as
instructions often exhibit common recurring sequential access patterns. The lowest level
data storage available on the GPU is the register file. The register file is used to store
intermediate results between individual instructions.
As GPUs issue instructions in warps, the register file needs to be as wide as the
instruction pipeline. For NVIDIA GPUs this equates to a 1024-bit wide register file,
providing 32-bit registers to all 32 instruction streams in the warp. Additionally, the
registers often need to allow multiple concurrent accesses and thus are implemented as
multi-port registers. Registers are local to each instruction stream within the warp, i.e. a
register belonging to one instruction stream cannot be read or written to by another
instruction stream.
To facilitate cooperation and data re-use within a warp, L1 cache and shared memory
are also present in each SM. L1 cache and shared memory have higher latencies than the
register file but provide much higher bandwidth and lower latency than L2 cache and
global off-chip DRAM. L1 cache is hardware-controlled, whilst shared memory is user
managed. Shared memory can be explicitly leveraged to allow data to be shared between
threads in the same block; this facilitates cooperation and memory reuse, and reduces
off-chip memory traffic.
 
2.4.2  GPU Software Programming Model 
To accompany their hardware, NVIDIA have also developed the CUDA Application
Programming Interface (API) [51]. CUDA is a relatively mature parallel programming API
for NVIDIA GPUs and is based on the C programming language. The CUDA programming model
provides a hierarchical abstraction model, which reflects the structure of the underlying
hardware. In CUDA, at the top level the developer organises the work functionally into
kernels. A kernel is a description of GPU work as a single sequential execution path.
Kernels are invoked on the GPU by the host, and parallelism is exposed by describing
kernels in terms of two parameters: blocks and threads.
A kernel can be constructed of multiple blocks, which are themselves constructed of
multiple threads. A thread represents a single instruction stream and is the lowest
execution level exposed in the CUDA API; it is also the unit in terms of which a kernel is
written. The CUDA API is often described as a Single Instruction Multiple Thread (SIMT)
model. This reflects the SIMD nature of the hardware architecture, which achieves
data-level execution parallelism, but adds a layer of abstraction for ease of programming.
Rather than the programmer needing to consider the physical layer, the issuing of warps of
instructions and writing programs in terms of an instruction pipeline with a width of 32,
the software can be written in terms of a single instruction stream and single data
elements.
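The idea of writing a kernel for a single thread and a single data element can be sketched in host-side C++ without any GPU. The sketch below is an emulation under stated assumptions, not CUDA device code: the ThreadContext fields mirror the roles of CUDA's blockIdx.x, blockDim.x and threadIdx.x, while scaleKernel and launchScale are hypothetical names introduced for illustration. On a GPU the two loops would not exist; the hardware would run the per-thread body concurrently in warps.

```cpp
#include <cassert>
#include <vector>

// Per-thread identifiers, mirroring the roles of CUDA's blockIdx.x,
// blockDim.x and threadIdx.x (field names are illustrative).
struct ThreadContext {
    int blockIdx;
    int blockDim;
    int threadIdx;
};

// The "kernel": written for ONE thread operating on ONE data element.
void scaleKernel(const ThreadContext& t, std::vector<float>& data, float factor) {
    int i = t.blockIdx * t.blockDim + t.threadIdx;  // global element index
    if (i < static_cast<int>(data.size())) {        // guard the padded tail
        data[i] *= factor;
    }
}

// Host-side stand-in for a kernel launch. The iterations are executed one
// after another here; a GPU would execute them concurrently.
// Returns the number of blocks that were "launched".
int launchScale(std::vector<float>& data, float factor, int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    int n = static_cast<int>(data.size());
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    for (int b = 0; b < numBlocks; ++b) {
        for (int th = 0; th < threadsPerBlock; ++th) {
            scaleKernel({b, threadsPerBlock, th}, data, factor);
        }
    }
    return numBlocks;
}
```

With 1000 elements and 256 threads per block, four blocks are required and the final 24 threads of the last block fall outside the data, which is why the bounds check inside the kernel body is needed.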
Parallelisation is achieved in CUDA by simply declaring the number of concurrent
execution streams required by defining the number of threads. This allows the
programming model to be more applicable to general purpose computing applications. A block
is a group of threads; blocks are used to define both data sharing and execution isolation
within a kernel. Only threads within a block have access to the same shared memory
resources, facilitating collaborative execution and data re-use. Threads in the same block
can also be explicitly synchronised. This helps to manage any execution or data
dependencies within an application.
Using these characteristics of the SIMT programming model and the SIMD architec-
ture of the GPU hardware, there are several types of parallelism which can be exploited 
by NVIDIA GPU application developers towards high data processing throughput. These 
include data level parallelism (DLP), task level parallelism (TLP) and instruction level 
parallelism (ILP) [52]. These are summarised in Table 2-3; this table lists the aspect of 
the software model which can be used to exploit each type of parallelism and the underly-
ing hardware which is leveraged in doing so.  
  
Table 2-3 Software and hardware constructs used for exploiting parallelism types  
 
Type: DLP
  Description:
    - The same instruction is executed concurrently on different data
    - Created and managed by the programmer
    - Limited by data volume, non-regular data manipulation patterns & memory bandwidth
  Software:
    - Multiple threads
  Hardware:
    - Warp schedulers
    - Multiple execution units

Type: TLP
  Description:
    - Multiple tasks or multiple instruction sequences from the same or different
      applications are executed concurrently
    - Created by the programmer, managed by compiler and hardware
    - Limited by communication or synchronisation overheads and by algorithm
      characteristics
  Software:
    - #Threads > #execution units
    - Multiple blocks
    - Concurrent kernels
  Hardware:
    - GigaThread scheduler
    - SMs
    - Context switching
    - Task specific execution units

Type: ILP
  Description:
    - Multiple instructions from the same instruction stream are executed concurrently
    - Generated and managed by compiler or hardware
    - Limited by data and control dependences
  Software:
    - N/A
  Hardware:
    - Dual warp schedulers
    - Context switching
    - Task specific execution units
DLP is very closely coupled to the SIMD hardware architecture of the GPU and is 
often the preliminary method used for parallelisation in GPU computing. In order to effi-
ciently exploit DLP the problem must exhibit low data dependencies so that the 
calculations on one data element are not dependent on another. The amount of DLP ex-
ploited can be specifically controlled by the programmer in the CUDA programming API 
by defining the number of individual threads in a kernel. The efficiency of a GPU appli-
cation to exploit DLP is measured by the warp execution efficiency [53].  
Warp execution efficiency is the average percentage of active threads in each execut-
ed warp. There are several reasons a thread may be marked as inactive in a warp. Firstly, 
if the number of threads declared in each block is not equal to a multiple of the warp size, 
then the last warp of the block will have the additional threads marked as inactive in the 
execution stream. Secondly, threads can be marked as inactive due to intra-warp diver-
gence. 
This occurs when not all threads in a warp take the same execution path due to
branching instructions, such as an if-else statement. When this occurs, the instructions
for every branch taken are executed serially, with only the appropriate threads marked as
active in each case. This behaviour is pictured in Figure 2-12. When a thread is marked as
inactive, a full warp of instructions is still issued but the results of the inactive
threads are masked, resulting in wasted instruction execution. Therefore, ensuring an
application has a high warp execution efficiency is key to achieving high data processing
throughput.
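The cost of intra-warp divergence can be made concrete with a host-side C++ emulation of a single 32-lane warp. The sketch below is an illustrative model, not a description of the actual hardware: it executes an if-else in which lanes below a split point take one branch and the remaining lanes take the other, issues each taken branch serially, and counts how many lane slots were issued against how many produced a kept result. The type and function names are hypothetical.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;

struct WarpStats {
    int issuedSlots;  // lane slots consumed by issued instructions
    int activeSlots;  // lane slots whose results are kept
};

// Record one warp-wide instruction issue with `active` unmasked lanes.
// A branch that no lane takes is skipped entirely and costs nothing.
static void recordIssue(WarpStats& s, int active) {
    if (active == 0) return;
    s.issuedSlots += kWarpSize;
    s.activeSlots += active;
}

// Emulates: A = A + B; if (lane < split) A = A + C; else A = A * D; X = A - Y;
WarpStats runDivergentWarp(std::array<int, kWarpSize>& A,
                           std::array<int, kWarpSize>& X,
                           int B, int C, int D, int Y, int split) {
    assert(split >= 0 && split <= kWarpSize);
    WarpStats s = {0, 0};

    for (int lane = 0; lane < kWarpSize; ++lane) A[lane] += B;      // all lanes active
    recordIssue(s, kWarpSize);

    for (int lane = 0; lane < split; ++lane) A[lane] += C;          // "if" path
    recordIssue(s, split);

    for (int lane = split; lane < kWarpSize; ++lane) A[lane] *= D;  // "else" path
    recordIssue(s, kWarpSize - split);

    for (int lane = 0; lane < kWarpSize; ++lane) X[lane] = A[lane] - Y;  // reconverged
    recordIssue(s, kWarpSize);
    return s;
}

// Fraction of issued lane slots that did useful work.
double warpExecutionEfficiency(const WarpStats& s) {
    return s.issuedSlots == 0 ? 0.0
                              : static_cast<double>(s.activeSlots) / s.issuedSlots;
}
```

With the split at lane 16, four warp-wide instructions are issued (128 lane slots) but only 96 slots are active, giving an efficiency of 75% in this model. With the split at lane 32 no divergence occurs, the else path is never issued and the efficiency is 100%.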
 
Figure 2-12 GPU thread divergence 
 
[Figure 2-12 content: the software listing A = A + B; if (threadID < 16) { A = A + C; } else { A = A * D; } X = A - Y; is drawn above the hardware execution of one warp over time. Threads 0-15 and 16-31 are marked as active or inactive at each step between the DIVERGE and CONVERGE points.]
TLP is most commonly exploited in GPU computing through partitioning and scheduling
large numbers of threads, multiple blocks of threads and multiple concurrent kernels. As
the number of threads per block, the number of blocks and the number and construction of
kernels are defined by the programmer, TLP can be user controlled. In comparison to DLP,
TLP is exposed in the GPU to hide execution latency in addition to increasing execution
throughput. TLP can be exploited at multiple hierarchical levels. At the highest level,
TLP can be achieved simply by invoking multiple kernels or multiple blocks of threads
within the same kernel. These are subsequently executed in parallel by hardware across
multiple SMs, as resources allow. In the latest NVIDIA GPUs, kernels can even be invoked
from within kernels to enable dynamic and nested parallel execution [54]. The hierarchy of
kernels, blocks and threads and how these can be mapped to the GPU hardware is shown in
Figure 2-13.
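How a block of threads is partitioned into warps, and how many lanes of the final warp are left inactive, follows from simple arithmetic. The host-side C++ sketch below uses a warp size of 32 and takes block sizes of 160 and 67 threads, the sizes drawn for Kernel A and Kernel B in Figure 2-13, as worked examples. The function names are illustrative assumptions.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;

// Warps needed to hold one block; the last warp may be only partly filled.
int warpsPerBlock(int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Lanes of the last warp that are permanently marked inactive.
int inactiveLanesInLastWarp(int threadsPerBlock) {
    return warpsPerBlock(threadsPerBlock) * kWarpSize - threadsPerBlock;
}

// Upper bound on warp execution efficiency imposed by the block size alone
// (divergence inside the kernel can only lower it further).
double blockSizeEfficiency(int threadsPerBlock) {
    return static_cast<double>(threadsPerBlock) /
           (warpsPerBlock(threadsPerBlock) * kWarpSize);
}
```

A block of 160 threads fills exactly five warps, whereas a block of 67 threads occupies three warps with 29 inactive lanes in the last one, capping its warp execution efficiency at about 70%.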
 
Figure 2-13 GPU hardware hierarchy showing kernels, blocks, warps and threads 
 
In addition to top-level TLP, TLP can also be exposed within each SM. Firstly, 
NVIDIA GPUs feature at least two dual-issue hardware schedulers per SM. This means 
that multiple warps can be scheduled at any one time per SM. Each dual-issue warp 
scheduler can issue instructions to up to two different warps each issue cycle [55]. These 
warps can be from either the same or different blocks or kernels due to hardware imple-
mented context switching. This means that if an active warp takes more than a cycle to 
complete its instruction, it can be labelled as stalled. Then on the next cycle, the GPU can 
exploit TLP and select alternative eligible warps for concurrent instruction execution. As 
context switching is handled by hardware there is no latency penalty. Context switching is 
possible because each SM features multiple different execution units which can
concurrently execute instructions from different warps. The hardware characteristics and
resulting warp scheduling behaviour are shown in Figure 2-14.
 
Figure 2-14 GPU hiding latency 
 
In Figure 2-14, the stall blocks represent an issue cycle in which it was not possible
for at least one dispatch unit to issue an instruction to a free execution unit, because
there were no eligible warps to be scheduled. Maintaining high TLP helps to ensure there
is a high number of eligible warps per SM to hide latency and avoid execution stalls. In
GPU computing, TLP can be described and measured using the notion of GPU occupancy. GPU
occupancy is defined by equation (2-1) and is the ratio of active warps on an SM to the
maximum number of active warps supported by the SM [56].
  
\[
\text{Theoretical Occupancy} = \frac{\text{Theoretical number of active warps}}{\text{Physical maximum number of warps}} \tag{2-1}
\]
 
The physical maximum number of active warps is a constant set by the manufacturer.
Each SM has a physical maximum number of warps and of blocks which can be concurrently
active. If the number of threads per block and number of blocks are too small, this limits
the number of active warps per SM and reduces the theoretical occupancy. Additionally,
there is a finite number of registers and a finite amount of shared memory available per
SM. If the register or shared memory usage per warp is too high, it can also limit the
number of concurrent active warps on the SM. To tune theoretical occupancy, the threads
per block, number of blocks and shared memory usage can be managed and adjusted by the
software designer. Whilst the number of registers is predominantly
controlled by the software compiler, some software design techniques and compiler op-
tions allow the designer to influence the number of registers used. NVIDIA provide an 
Excel spreadsheet which contains data on all NVIDIA GPUs, to help the programmer cal-
culate the theoretical occupancy for their kernel with different configurations [57]. 
Theoretical occupancy represents the maximum achievable occupancy for a certain kernel
configuration; however, the actual achieved occupancy for a kernel needs to be measured
by profiling the application. NVIDIA provide a GPU specific profiler which can deploy
counters on each warp scheduler, to count the number of active warps every issue cycle
and determine the actual measured occupancy as per equation (2-2) [56]. Achieved
occupancy varies over time as warps begin and end, and it can also be different for each
SM in the GPU; therefore, it is often expressed as an average. There are several
situations which can cause the achieved occupancy to be lower than the theoretical
maximum; these are summarised in Table 2-4 [56].
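The theoretical occupancy of equation (2-1), together with the four per-SM limits discussed above (warps, blocks, registers and shared memory), can be sketched as a small host-side C++ calculation. This is a simplified model under stated assumptions: it ignores the allocation granularities that real hardware applies to registers and shared memory, and the struct and function names are hypothetical. The limits must be supplied by the caller; the values in the note below are invented for illustration and do not describe any specific NVIDIA device.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM resource limits, supplied by the caller (illustrative field names).
struct SmLimits {
    int maxWarps;        // physical maximum number of resident warps
    int maxBlocks;       // physical maximum number of resident blocks
    int registers;       // 32-bit registers in the register file
    int sharedMemBytes;  // shared memory capacity in bytes
};

// Resource usage of one kernel configuration.
struct KernelConfig {
    int threadsPerBlock;
    int registersPerThread;
    int sharedMemPerBlock;  // bytes
};

constexpr int kWarpSize = 32;

static int warpsInBlock(const KernelConfig& k) {
    assert(k.threadsPerBlock > 0);
    return (k.threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Resident blocks per SM: the tightest of the four limits wins.
int activeBlocksPerSm(const SmLimits& sm, const KernelConfig& k) {
    int warps = warpsInBlock(k);
    int byWarps = sm.maxWarps / warps;
    int byBlocks = sm.maxBlocks;
    int regsPerBlock = k.registersPerThread * warps * kWarpSize;
    int byRegs = regsPerBlock > 0 ? sm.registers / regsPerBlock : sm.maxBlocks;
    int byShared = k.sharedMemPerBlock > 0
                       ? sm.sharedMemBytes / k.sharedMemPerBlock
                       : sm.maxBlocks;
    return std::min(std::min(byWarps, byBlocks), std::min(byRegs, byShared));
}

// Equation (2-1): active warps over the physical maximum number of warps.
double theoreticalOccupancy(const SmLimits& sm, const KernelConfig& k) {
    int activeWarps = activeBlocksPerSm(sm, k) * warpsInBlock(k);
    return static_cast<double>(activeWarps) / sm.maxWarps;
}
```

Taking a hypothetical SM with 64 warps, 16 blocks, 65536 registers and 49152 bytes of shared memory, a kernel with 256 threads per block and 32 registers per thread reaches an occupancy of 1.0 in this model. Doubling the register usage to 64 per thread halves it to 0.5, and shrinking the block to 32 threads leaves it limited by the block count at 0.25.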
 
\[
\text{Achieved Occupancy} = \frac{\text{Number of active warps} \div \text{Number of active cycles}}{\text{Physical maximum number of warps}} \tag{2-2}
\]

\[
\text{GPU Full Wave} = \text{Number of SMs} \times \text{Maximum active blocks per SM} \tag{2-3}
\]
 
Table 2-4 Causes of low measured occupancy [56] 
Reason                           Description
Unbalanced block workload        If the warps within a block do not all execute for the same amount of time, there will be fewer active warps at the end of the kernel resulting in an unbalanced workload.
Unbalanced kernel workload       If all blocks in a kernel do not execute for the same amount of time, the workload will be unbalanced.
Blocks not launched in full wave Theoretical occupancy assumes the number of blocks launched is equal to a “full wave”, as per equation (2-3). When the actual number of blocks launched is less than a full wave, the achieved occupancy will be lower than the theoretical occupancy.
Partial last wave                There is a maximum number of active warps per SM, when the total number of warps is not a multiple of this parameter, the last wave will be less than this maximum and occupancy will be reduced.
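The full wave of equation (2-3) and the achieved occupancy of equation (2-2) both reduce to short calculations, sketched below in host-side C++. The function names and the counter values in the note are illustrative assumptions. For simplicity the wave arithmetic is expressed in blocks, whereas the partial last wave entry of Table 2-4 is phrased in terms of warps.

```cpp
#include <cassert>

// Equation (2-3): blocks that can be resident on the whole GPU at once.
int fullWave(int numSms, int maxActiveBlocksPerSm) {
    assert(numSms > 0 && maxActiveBlocksPerSm > 0);
    return numSms * maxActiveBlocksPerSm;
}

// Number of waves needed to run a launch; the last one may be partial.
int waveCount(int blocksLaunched, int wave) {
    assert(blocksLaunched > 0 && wave > 0);
    return (blocksLaunched + wave - 1) / wave;
}

// Blocks in the final wave (equals `wave` when the launch divides evenly).
int lastWaveBlocks(int blocksLaunched, int wave) {
    assert(blocksLaunched > 0 && wave > 0);
    int remainder = blocksLaunched % wave;
    return remainder == 0 ? wave : remainder;
}

// Equation (2-2): average active warps per active cycle, divided by the
// physical maximum number of warps per SM.
double achievedOccupancy(long long activeWarpsSum, long long activeCycles,
                         int maxWarpsPerSm) {
    assert(activeCycles > 0 && maxWarpsPerSm > 0);
    double averageActiveWarps = static_cast<double>(activeWarpsSum) / activeCycles;
    return averageActiveWarps / maxWarpsPerSm;
}
```

For a hypothetical GPU with 4 SMs and 8 active blocks per SM, a full wave is 32 blocks. A launch of 100 blocks then needs four waves, the last of which contains only 4 blocks, so most of the GPU is idle while it completes.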
  
There are several techniques to increase achieved occupancy. Firstly, the source code
can be rearranged to reduce workload imbalances, creating a greater number of smaller
kernels with lower shared memory or local variable usage. Secondly, compiler flags can
be used to force the compiler to reduce the number of registers used in a kernel. However,
these approaches can often hurt overall performance due to increased overheads associated
with kernel launches and additional global memory transactions. Research has shown that
the high effort often required to increase the achieved occupancy beyond 66% could be
better spent on increasing ILP to increase kernel throughput [58].
ILP can also be leveraged to hide latency on the GPU by concurrently executing
different instructions from the same warp. Algorithms can be designed specifically for
high ILP; however, it has traditionally been extremely difficult to achieve high ILP, and
difficult to exploit it explicitly when implementing an existing algorithm, as ILP is
typically extracted and exploited by the compiler. The GPU is an in-order processor,
meaning ILP is only possible when consecutive instructions are independent of each other
and have no register dependencies. Another condition is that the appropriate execution
units and instruction specific hardware are available for scheduling. Figure 2-15 details
how ILP can be leveraged by the GPU to increase throughput; it shows how the previously
stalled instruction cycles in Figure 2-14 are eliminated using ILP.
 
Figure 2-15 Instruction level parallelism on a GPU 
 
The restrictions on ILP for an application can be characterised by examining the exe-
cution stall reasons. The NVIDIA profiling tool also provides the capabilities for the 
software designer to investigate execution stall reasons for their application. The most 
common stall reasons are summarised in Table 2-5. 
Table 2-5 Instruction stall reasons [53] 
Stall Reason           Description
Pipeline Busy          The compute resources required by the instruction are not yet available.
Instruction Fetch      The next assembly instruction has not yet been fetched.
Execution Dependency   An input required by the instruction is not yet available, these stalls can potentially be reduced by increasing ILP.
Synchronization        The warp is blocked at a synchronisation API call.
Memory Dependency      The required LD/ST resources are fully utilised, these stalls can potentially be reduced by optimising memory alignment and access patterns.
Memory Throttle        A large number of pending memory operations prevent progress, these stalls can be reduced by ensuring operations are coalesced.
Constant               A constant load is blocked due to a miss in the constants cache.
Texture                The texture cache is fully utilised or has too many outstanding requests.
 
In addition to the NVIDIA CUDA API providing concepts for computational execution,
several different memory structures are also exposed. There are three types of memory
exposed in CUDA: private, shared and global. Private memory is used to hold values that
are local to each thread and can only be read or written to by the respective owner
thread. Private memory generally equates to on-chip registers; however, if all register
resources have been utilised, private variables are stored in high latency off-chip DRAM,
a behaviour known as register spilling. Shared memory can be accessed from all threads
that are in the same block. Shared memory is explicitly managed by the programmer and its
usage is limited by the physical size of this resource. Shared memory can be statically
allocated at compile time or dynamically sized at run-time. One of the key advantages of
shared memory is its ability to facilitate thread cooperation and data sharing.
However, the CUDA API provides no guarantee of warp execution order; therefore,
thread synchronisation needs to be considered when using this memory space. If thread,
block and kernel level synchronisation is not considered, undesirable race conditions or
incorrect results could occur. Specific synchronisation functions and atomic memory
operations are included in the programming API for this purpose [59]. Atomic operations
serialise contentious accesses, allowing only a single thread to read from or write to a
specific memory location until the operation is complete.
Global memory is accessible by all threads, blocks and kernels which are active on
the GPU. Global memory physically translates to off-chip DRAM and therefore has the
largest associated latency. GPU global memory is abstracted from the host memory, and
for traditional GPUs it is also physically separated. For low power and embedded GPUs,
by contrast, host and global memory are unified and share the same address space. Due
to this abstraction, within the current API the GPU has no means to declare data or to
transfer it between GPU memory and host memory; this functionality is handled by the
host device.
To improve the high latencies associated with global off-chip memory, several cache 
structures are also present on NVIDIA GPUs. A hardware managed L2 cache resides at 
the top-level of the GPU and services all off-chip memory transactions, regardless of 
type. At the individual SM level there are also several additional cache structures. The 
hardware managed L1 cache services read and write transactions from its associated SM. 
Constant and texture caches are user or compiler managed and are read-only, differentiat-
ed from each other by different caching policies. The constant cache is optimised for 
broadcasting read-only operations, whereby all threads in a warp access the same address 
in memory. The texture cache is optimised for 2D spatial access patterns; these are pat-
terns which are not contiguous physical memory locations but are still regular. The 
described GPU memory hierarchy is pictured in Figure 2-16. 
 
 
Figure 2-16 NVIDIA GPU memory hierarchy 
 
There are several important characteristics of the physical memories and transaction 
types provided in the CUDA API to maximise throughput. Firstly, the most efficient type 
of memory to match the configuration of the computational workload needs to be selected.
The lowest latency memory type available on the GPU is the register file. However, as
discussed, high register usage can negatively impact the occupancy and instruction
throughput of the GPU. Therefore, a trade-off between register usage and other memory
types often needs to be made, particularly when register spilling occurs.
In this case utilising alternative on-chip memories can help optimise memory transac-
tion throughput. Shared memory has a much larger bandwidth and lower latency than 
global memory structures, therefore where possible utilising shared memory over global 
memory is recommended. To achieve high bandwidth, the shared memory structure is 
partitioned into banks for parallel accesses. However, if two or more threads request to 
access data from different addresses within the same bank, a bank conflict occurs, and the 
transaction is serialised. Therefore, it is important to optimise shared memory accesses to 
minimise bank conflicts. Bank access examples are given in Figure 2-17. 
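The degree of a bank conflict for a given access pattern can be estimated with a host-side C++ sketch. It rests on assumptions that are not stated in the text above: 32 banks, successive 32-bit words mapped to successive banks, and threads that request the same word being served together rather than conflicting. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Degree of the worst bank conflict for one warp-wide shared memory access.
// wordIndex[t] is the index of the 32-bit word requested by thread t.
// Assumes bank = word % numBanks, and that threads requesting the SAME word
// do not conflict with each other.
int bankConflictDegree(const std::vector<int>& wordIndex, int numBanks = 32) {
    assert(numBanks > 0);
    std::vector<std::set<int>> distinctWords(numBanks);
    for (int w : wordIndex) {
        assert(w >= 0);
        distinctWords[w % numBanks].insert(w);
    }
    std::size_t worst = 0;
    for (const auto& words : distinctWords) {
        worst = std::max(worst, words.size());
    }
    return static_cast<int>(worst);  // 1 = conflict-free, n = n-way conflict
}

// Helper: thread t accesses word t * stride.
std::vector<int> stridedAccess(int stride, int threads = 32) {
    assert(stride >= 0 && threads > 0);
    std::vector<int> idx(threads);
    for (int t = 0; t < threads; ++t) idx[t] = t * stride;
    return idx;
}
```

In this model a stride of one word is conflict-free, strides of two and four words give 2-way and 4-way conflicts respectively, and a stride of 32 words serialises the whole warp onto one bank. An odd stride such as 33 is conflict-free again, which is why padding an array by one element per row is a common remedy.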
 
Figure 2-17 Shared memory bank conflicts 
 
Where it is not possible to utilise on-chip shared memory, or access patterns result in
large numbers of bank conflicts, explicitly cached constant or texture global memories
should be explored. Where these caches offer no benefit for an application and
hardware cached global memory must be used, optimising these transactions is extremely
important. GPU DRAM accesses are unusually wide in order to mitigate the high latency
of these operations. To exploit this, all global memory transactions are automatically
coalesced into as few high bandwidth transactions as possible. Device memory is allocated
and aligned at 256-byte segments; therefore, to ensure global memory accesses can be
coalesced into as few transactions as possible, each thread in a warp should access data
adjacent and contiguously aligned to global memory boundaries, allowing all memory 
required by all threads in a warp to be serviced by a single memory transaction [61].  
Examples of coalesced and uncoalesced memory transactions are pictured in Figure 2-18.
In the worst case of a random memory scatter or gather with addresses spread right across
device main memory, 32 separate memory transactions will have to take place for each warp
and performance drops dramatically. When global memory operations do not naturally result
in fully coalesced transactions, coalescing can often be achieved by rearranging the
thread organisation to modify the access patterns, by reorganising the data in memory to
match the thread access patterns, or by adding padding to ensure the data aligns with
memory boundaries.
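The difference between coalesced and uncoalesced access can be approximated by counting how many aligned memory segments one warp touches. The host-side C++ sketch below does this under a simplifying assumption: each distinct aligned segment costs one transaction, with the segment size passed in as a parameter. The 128-byte segment used in the note is an assumed value for illustration, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Number of aligned memory segments touched by one warp-wide access, used
// here as a stand-in for the number of memory transactions issued.
// segmentBytes is an assumed transaction size supplied by the caller.
int transactionsForWarp(const std::vector<std::uint64_t>& byteAddress,
                        std::uint64_t accessBytes, std::uint64_t segmentBytes) {
    assert(accessBytes > 0 && segmentBytes > 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t a : byteAddress) {
        // An access may straddle a boundary, so cover its first to last byte.
        std::uint64_t first = a / segmentBytes;
        std::uint64_t last = (a + accessBytes - 1) / segmentBytes;
        for (std::uint64_t s = first; s <= last; ++s) segments.insert(s);
    }
    return static_cast<int>(segments.size());
}

// Helper: thread t accesses the address base + t * strideBytes.
std::vector<std::uint64_t> warpAddresses(std::uint64_t base,
                                         std::uint64_t strideBytes,
                                         int threads = 32) {
    assert(threads > 0);
    std::vector<std::uint64_t> addr(threads);
    for (int t = 0; t < threads; ++t) {
        addr[t] = base + static_cast<std::uint64_t>(t) * strideBytes;
    }
    return addr;
}
```

With 4-byte accesses and an assumed 128-byte segment, 32 adjacent and aligned accesses fall into a single segment. Shifting the base address by 4 bytes spills the warp into a second segment, and a stride of 128 bytes produces the worst case of 32 separate segments.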
 
Figure 2-18 Coalesced and uncoalesced memory transactions 
 
Due to the growing adoption of GPUs for general purpose computing, which has been
enabled by the maturing programming model and tools, GPU developers have started to
address the requirements of mobile and embedded computing applications. Many of the
latest smartphones, including phones by Samsung, Apple and Google, feature GPU
accelerators [60]. These GPUs are typically not employed as a standalone integrated
circuit (IC); they are often implemented as part of a heterogeneous system-on-chip (SoC).
A SoC is a processing device that implements multiple elements of an electronic system on
a single IC, often featuring different cores or hardware architectures designed and
specialised for specific tasks. The underlying hardware architecture of a SoC can vary
significantly to meet specific system requirements, but SoCs are often based on
microprocessor, micro-controller, DSP, FPGA or GPU hardware.
Demands on mobile computing products have risen significantly in recent years.
Utilising a traditional homogeneous processor would entail large transistor counts for
large raw computing resources and many complex layers of software to enable full
utilisation of the processing capabilities. The heterogeneous architecture provided by
SoCs allows specialised hardware to be leveraged for certain tasks, reducing power
consumption and requiring simpler software models for task execution. In terms of power
consumption in particular, these embedded GPUs represent suitable platforms for
deployment in next generation onboard data processing architectures.
However, GPU devices are not currently manufactured to provide radiation tolerance 
at a hardware design level. This is because in their traditional computing applications, the 
probability of experiencing a radiation induced error is comparatively low and the conse-
quence of such an error is often even lower. With the ongoing trend toward reduced 
volume, mass, power consumption and increases in performance efficiency, the radiation 
and error tolerance will likely be the last remaining obstacle preventing COTS GPU and 
SoC devices gaining widespread utilisation in safety critical applications, including the 
space sector. Therefore, demonstrating the successful deployment of SBFT techniques for 
parallel processing architectures, will be critical towards the adoption of state-of-the-art 
parallel COTS processors in safety critical and space industries.  
 
2.4.3  GPU Beam Testing Experiments 
Unlike the programmable logic paradigm of FPGA hardware, COTS parallel proces-
sors such as GPUs have more distinctive hardware and software abstraction layers. This 
makes determining the probability that a gate-level error in hardware propagates to the 
output extremely difficult as error masking can occur at each of these abstraction layers. 
Whilst there is a large amount of material available on optimisation and programming 
techniques for high data processing throughput using GPUs, little research currently ex-
ists on the assessment of GPUs in an error prone environment and the proposal and 
testing of SBFT strategies for parallel or GPU applications. Designing new and effective 
SBFT approaches requires an in-depth understanding of the inherent error resilience of 
the hardware and software architectures.  
P. Rech et al. have recently published results from a series of beam testing experiments
conducted to assess the error resilience of NVIDIA GPU architectures [62]. Key findings of
these experiments showed that the error resilience of logical execution blocks is dependent
on both the operation type and the type of data being processed, whereby the error
resilience is closely linked to complexity [63]. GPUs feature hardware-based task
schedulers, which cannot be protected as easily as the software-based task schedulers of
most CPUs; the error resilience of these critical control structures was therefore also
assessed [64][65]. Different GPU applications and different parallel configurations were
targeted to understand the impact of resource usage and scheduling requirements on
reliability.
The experimental results from P. Rech et al. demonstrate that it is more reliable to
schedule a small number of blocks each containing a large number of threads than to
schedule a large number of blocks each containing a small number of threads. This is
attributed to the inherent differences between block- and warp-scheduler hardware, whereby
the warp-schedulers are more efficient at handling a large number of threads, whilst block
schedulers are more efficient at handling a smaller workload of blocks. The key finding is
that the error resilience of the NVIDIA GPU scheduler hardware is related to the workload,
the degree of application parallelism and how the workload is distributed on the GPU. This
needs to be addressed when designing an application for use in an error prone environment.
In addition to testing control and computational logic blocks, P. Rech et al. have also
published results on practical environmental testing which targets other areas such as the
memory structures of GPU devices [66][67]. The key result of their investigation is that
the shared memory structure on the GPU is less sensitive to radiation induced error effects
than the L1 and L2 cache structures. Whilst the difference between shared memory and L1
cache is marginal, L2 cache error probabilities were shown to be approximately double those
of the shared memory and L1 cache. It is hypothesised that this is because shared memory
and L1 cache perform similar roles in the GPU architecture and are therefore likely based
on similar silicon designs. The L2 caches, however, typically have a greater area
constraint and are therefore based on a more compact silicon architecture which will be
inherently more sensitive to radiation upsets. They also found that a corruption in cache
access tags only turned cache hits into erroneous cache misses, and not vice versa. As a
result, an error in the cache tags only resulted in a degradation in performance and no
erroneous memory accesses.
Due to the growing adoption of these devices in large-scale terrestrial High Performance
Computing (HPC) applications and the associated increase in requirements for operational
correctness in these applications, device manufacturers have added Error Correcting Codes
(ECC) to GPU memory structures. The performance of these protection schemes in a radiation
environment has also been assessed experimentally [68]. The investigation found that whilst
ECC protects against the majority (approximately 80%) of Silent Data Corruptions (SDCs),
there is a significant increase, of over 245%, in the occurrence of Functional Interrupts
(FIs). This is attributed to the fact that logical resources and schedulers are left
unprotected and errors detected by ECC are often incorrectly handled by the GPU, leading to
an FI. Therefore, in applications where deterministic processing or minimal downtime is
required this may not be a suitable protection mechanism.
An additional experimental testing campaign conducted on several benchmark applications has
also shown that there can be significant variation in SDC and FI probabilities between
applications which cannot be attributed to variations in hardware characteristics [69].
This therefore provides evidence that algorithmic and software-based implementation
characteristics also influence the error resilience of a GPU application.
Error masking effects are discussed in this work as they can often occur naturally in 
algorithms or software design approaches. For example, an error in a memory resource of 
a processor might not result in an SDC in the application output if the application does 
not read from the affected memory element or data is rewritten to that location before it is 
read. This suggests that it should be possible to develop specific software development 
and algorithmic design guidelines which aim to increase the error resilience of an applica-
tion irrespective of the exact hardware architecture used for implementation.  
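
The two masking cases described above (a corrupted location that is never read, and one
that is rewritten before it is read) can be illustrated with a toy model. This is a minimal
sketch and is not taken from the cited work: the four-cell memory, the names `run_app` and
`classify`, and the (step, cell, bit) fault format are all hypothetical and chosen only for
illustration.

```python
def run_app(fault=None):
    """Toy application operating on a four-cell memory.

    fault is None or a hypothetical (step, cell, bit) tuple: the given bit of
    the given cell is flipped immediately before the given step executes.
    """
    mem = [1, 2, 3, 4]

    def inject(step):
        if fault is not None and fault[0] == step:
            mem[fault[1]] ^= 1 << fault[2]

    inject(0)
    mem[2] = 10  # step 0: cell 2 is rewritten before it is ever read
    inject(1)
    return mem[0] + mem[1] + mem[2]  # step 1: cell 3 is never read


GOLDEN = run_app()  # fault-free reference output


def classify(fault):
    """Label a fault 'masked' or 'SDC' by comparison with the golden output."""
    return "masked" if run_app(fault) == GOLDEN else "SDC"
```

In this model a fault in cell 2 before step 0, or in cell 3 at any time, is masked, whereas
a fault in a cell that is subsequently read without being rewritten produces an SDC.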
 
2.4.4  Software Based GPU Error Injection Testing 
Software based error injection experiments can also be used to replicate the effects of 
physical hardware faults by altering values and addresses of memory and instructions. 
They provide a cheap and time efficient approach compared to beam testing for the char-
acterisation of different applications or implementations. A challenge of error injection 
testing can be finding an appropriate existing software tool that can achieve representative 
coverage to encompass all possible fault effects. Manufacturers and device developers 
rarely produce or make available error injection software tools for their devices; as a re-
sult, the development of suitable error injection software tools has become an interesting 
area of research. Whilst there are numerous publications regarding the development of 
software-based error injectors for CPUs there are currently only a few which address er-
ror injections for GPUs [70][71].  
One of the earliest proposed solutions for error injection on GPU devices was GPU-Qin [70].
In this framework fault injections are performed at the assembly language level of a GPU
application by leveraging a GPU-based debugger. The advantage of this approach is that it
allows code to be compiled and executed natively on GPU hardware and also provides low
level injection granularity at the instruction level. The disadvantage is that the
application needs to be paused for each error injection and must run in a debug
configuration, which can lead to extremely long execution times.
To overcome the testing duration downside of GPU-Qin, a compiler based solution based on
the CPU error injection tool LLFI was proposed [71]. Like the CPU version, it leverages the
LLVM compiler to instrument the GPU program to inject faults. A dynamic library is attached
to the NVIDIA CUDA compiler (NVCC) to intercept its call to the LLVM compilation module. At
this point the instrumentation pass of LLFI-GPU is invoked to return an augmented
intermediate representation (IR), which is passed back to NVCC to proceed with the
remaining GPU application compilation process. LLFI-GPU instrumentation has been designed
to dynamically select a random instruction executed by a random thread in a random kernel
call. Once an instruction is selected, a single bit is flipped in the result value to
simulate an SBU. LLFI-GPU provides less granularity than GPU-Qin, but its execution and
testing time was shown to be significantly faster, by up to 1000 times.
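
The single bit flip fault model described above can be sketched independently of any
particular injection tool. The following is a minimal illustration and does not use or
reproduce the LLFI-GPU interface: the function names are hypothetical, and a list of
32-bit floating point values stands in for the result values produced by instrumented
instructions.

```python
import random
import struct


def flip_bit_float32(value, bit):
    """Return value with one bit of its IEEE 754 float32 encoding flipped.

    bit 0 is the least significant mantissa bit and bit 31 is the sign bit.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    (word,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", word ^ (1 << bit)))
    return flipped


def inject_single_fault(results, rng):
    """Flip one randomly chosen bit in one randomly chosen result value.

    Returns the faulty copy together with the chosen index and bit position.
    """
    index = rng.randrange(len(results))
    bit = rng.randrange(32)
    faulty = list(results)
    faulty[index] = flip_bit_float32(faulty[index], bit)
    return faulty, index, bit
```

For example, `flip_bit_float32(1.0, 31)` flips the sign bit and `flip_bit_float32(1.0, 23)`
flips the least significant exponent bit, which shows how strongly the impact of a single
upset depends on the bit position affected.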
Recently the state-of-the-art in this area has advanced significantly due to the publi-
cation of two key research projects, called SASSI and SASSIFI by NVIDIA [72][73]. 
SASSI and SASSIFI are NVIDIA research prototypes. Whilst not an official part of the
software toolkit for NVIDIA GPUs, they have been made publicly available via GitHub [74][75].
SASSI is an instrumentation framework which provides the capabilities
for instructions to be inserted into the NVIDIA native ISA known as SASS. The frame-
work also allows call-backs to be made to arbitrary user-level functions which can be 
executed before or after instrumented instructions. SASSIFI is an additional project which 
is based on the SASSI framework and provides error injection capabilities to GPU appli-
cation developers. The SASSIFI tool includes functions which inject transient errors into 
architecturally visible states including general purpose registers, store values, predicate 
registers and conditional registers [76]. Since this approach injects errors only into live 
GPU states the results will not model how errors propagate from the physical microarchi-
tecture to an application but will provide a study into fault propagation from the 
application level to the output.  
 
2.4.5  Error Resilient GPU Application Development 
Published literature which discusses error resilient parallel or GPU application design is
very limited. One of the first publications to discuss this topic specifically with respect
to GPU devices is research conducted by H. Takizawa et al., which details the first
proposed CPR mechanism for CUDA GPU applications [77]. Their methodology, called CheCUDA,
is implemented as an add-on package to the existing Berkeley Lab Checkpoint/Restart (BLCR)
framework [78]. CheCUDA leverages the driver API version of CUDA, overriding official CUDA
resources with its own C++ class types and replacing the CUDA API functions with its own
wrappers to intercept data in order to perform the CPR. This means that an application
needs to be compiled using this specific add-on package to enable the functionality and,
critically, that CPR can only be provided to applications developed using the driver API.
CUDA can be utilised via either a driver API or a runtime API; however, the runtime API is
the most commonly used for CUDA application development. Thus, the CheCUDA package cannot
be used by a large number of existing applications.
To combat these disadvantages, in 2011 A. Nukada et al. proposed an alternative CPR
mechanism called the NVIDIA CUDA Checkpoint-Restart Library (NVCR), a transparent library
which can be dynamically linked to provide CPR capabilities to applications developed using
either the driver or runtime API [79]. Another advantage of their implementation is that it
can be utilised without the need for source code recompilation. There is however a key
unresolved issue with their implementation: NVCR relies on techniques which leverage CUDA
characteristics which, whilst observable, are not behaviours guaranteed by NVIDIA. Thus,
they could change or become inconsistent with future software and hardware architecture
releases. A currently unresolved disadvantage of CPR techniques for GPU devices is the
overhead introduced in both check-pointing and restart. Whilst traditional HPC GPU
applications have characteristically long execution times and exhibit a large degree of
parallelism, future embedded data processing applications might not exhibit these same
characteristics and execution time could be dominated by CPR overheads.
Another area of research, first published in 2009, is the investigation of software based
algorithms for increased GPU computational reliability by M. Dimitrov et al. [80]. In this
publication they propose and investigate three methodologies, named R-Naïve, R-Scatter and
R-Thread. The R-Naïve approach simply duplicates each kernel and compares the outputs for
errors, whilst R-Scatter and R-Thread aim to utilise unused ILP or TLP by interleaving
redundant and original operations to reduce the overhead of the duplication methodology.
Their approaches are implemented on several fundamental algorithms including matrix
multiplication, convolution and 1D Fast Fourier Transform (FFT).
However, no error resilience testing of the techniques is explored by the authors and
therefore they are unable to provide evidence that their proposed techniques do protect
against SDC events or how the protection schemes impact the FI probability. The only
analysis performed is with regard to the execution time and overheads of the methodologies
that are investigated. With respect to the impact on processing time, it was found that the
performance overheads for R-Scatter and R-Thread varied with the algorithmic
characteristics of the benchmark applications used. Whilst algorithms which exhibit a low
degree of ILP and low register usage may benefit from the R-Scatter method, applications
which have low TLP, high register usage, utilise shared memory or have complex caching
requirements may suffer from large performance overheads and are better suited to the
R-Thread method. These findings have been attributed to the intricate trade-off between
TLP, ILP and data reuse in determining an application's throughput performance. Their
findings highlight the importance of understanding the algorithmic characteristics of the
application which is to be protected in order to select an overhead efficient SBFT
technique.
P. Rech et al have also explored several SDC protection focused approaches. Their 
first publication investigates and compares TMR and ABFT strategies [81]. The publica-
tion details the results from an experimental beam testing campaign, which are utilised 
towards the design and optimisation of an ABFT strategy for matrix multiplication, pro-
tecting against MBU errors. The experimental results are then also used to determine the 
probabilities and distributions for software-based error injection testing of three SBFT 
techniques, which includes an application protected using TMR, a specific ABFT and 
their newly developed and optimised extABFT algorithm.  
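
The checksum principle underlying ABFT for matrix multiplication can be sketched as
follows. This is a minimal illustration of the classic row and column checksum scheme on
small integer matrices, not the extABFT algorithm of [81]: the function names, the `faults`
argument used to emulate corrupted results and the three status strings are assumptions
made for illustration.

```python
def matmul(a, b):
    """Plain integer matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def abft_matmul(a, b, faults=()):
    """Checksum-protected product of integer matrices a (n x k) and b (k x m).

    faults is a hypothetical sequence of (row, col, delta) tuples added to the
    full checksum product to emulate corrupted results.
    Returns (c, status) where status is 'ok', 'corrected' or 'recompute'.
    """
    n, m = len(a), len(b[0])
    a_c = a + [[sum(col) for col in zip(*a)]]  # append a column-checksum row
    b_r = [row + [sum(row)] for row in b]      # append a row-checksum column
    full = matmul(a_c, b_r)                    # (n + 1) x (m + 1) product
    for r, c, delta in faults:
        full[r][c] += delta

    # A data row or column is inconsistent if it no longer sums to its checksum.
    bad_rows = [i for i in range(n) if sum(full[i][:m]) != full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(full[i][j] for i in range(n)) != full[n][j]]

    if not bad_rows and not bad_cols:
        status = "ok"
    elif len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        full[i][j] -= sum(full[i][:m]) - full[i][m]  # remove the discrepancy
        status = "corrected"
    else:
        status = "recompute"  # conservative fallback, e.g. for multiple errors
    return [row[:m] for row in full[:n]], status
```

A single corrupted element is located at the intersection of the inconsistent row and
column and corrected in place, whereas this basic scheme falls back to re-computation when
several rows or columns are inconsistent.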
The results from this publication showed that traditional ABFT for matrix multiplication
succeeded in correcting single errors but often required re-computation for multiple
errors. The proposed extABFT and TMR techniques, however, were able to correct all errors,
including multiple randomly distributed errors, without re-computation. The computational
costs of these strategies were also assessed: the TMR strategy had the largest performance
overhead of the three strategies, but this overhead was constant irrespective of error
distribution. This shows that a TMR based scheme is better suited to applications, such as
onboard data processing, for which deterministic processing is an important factor.
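
The constant overhead of TMR follows from its structure: three replicas are always executed
and a voter is always applied, whether or not an error occurred. A minimal sketch of an
element-wise majority voter is given below; the function name `tmr_vote` and the use of
Python lists in place of GPU output buffers are assumptions made for illustration.

```python
from collections import Counter


def tmr_vote(replica_0, replica_1, replica_2):
    """Element-wise majority vote over the outputs of three redundant executions."""
    voted = []
    for triple in zip(replica_0, replica_1, replica_2):
        value, count = Counter(triple).most_common(1)[0]
        if count < 2:
            raise RuntimeError("no majority: all three replicas disagree")
        voted.append(value)
    return voted
```

For example, `tmr_vote([1, 2, 3], [1, 9, 3], [1, 2, 7])` recovers `[1, 2, 3]` even though
two different replicas each contain one corrupted element.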
In addition to exploring TMR protection, P. Rech et al. have also studied several
duplication with comparison (DWC) approaches for GPU architectures [82]. For all DWC
approaches, if an error is detected then re-execution occurs until no error is detected or
a maximum number of re-computations is reached. However, DWC is not well suited to
implementation in an onboard data processing application, because the re-calculation
required to correct errors could significantly reduce the time determinism of the
application execution.
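
The loss of time determinism can be seen in a minimal sketch of the DWC control flow. The
names `dwc_execute`, `compute` and `max_retries` are hypothetical, and an ordinary Python
callable stands in for a GPU kernel.

```python
def dwc_execute(compute, data, max_retries=3):
    """Duplication with comparison with bounded re-execution.

    Runs compute twice and compares the outputs; on a mismatch the pair of
    executions is repeated, up to max_retries additional times.
    Returns (result, attempts) so that the variable execution cost is visible.
    """
    for attempt in range(1, max_retries + 2):
        first = compute(data)
        second = compute(data)
        if first == second:
            return first, attempt
    raise RuntimeError("outputs still disagree after the maximum number of retries")
```

The number of attempts, and therefore the execution time, depends on when errors occur,
which is the property that makes DWC unattractive where deterministic timing is required.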
 
2.5    Data Processing and Compression Algorithms 
The primary aim of this research is to investigate new approaches in onboard data 
processing which can be deployed to help optimise the payload data downlink efficiency. 
One of the key fields of research concerns suitable data processing and compression algo-
rithms which can help to more efficiently represent the data prior to transmission to the 
ground. This subsection details the research into this field specifically.  
 
2.5.1  Lossless Image Compression 
Digital data compression is a processing technique that aims to reduce the total vol-
ume of data by decreasing the number of bits used to represent the information. This is 
achieved by reducing the amount of information redundancy by exploiting data correla-
tions. Imagery data contains several distinct types of redundancy that are often utilised for 
compression, including statistical, spatial and spectral redundancy. Statistical redundancy 
is associated only with the digital representation of the data, not the content of the image 
itself. To exploit this type of redundancy, algorithms often utilise techniques from infor-
mation theory to more efficiently encode the digital representation of the image data. 
Spatial redundancy is due to the correlations between spatially adjacent pixels. An image 
can often be segmented into several discrete regions such as foreground, background or 
image objects. The pixels within these segments will exhibit relationships which can be 
decorrelated to achieve a compressed representation of the image. Spectral redundancy is 
due to similarities or correlations in values between co-located pixels of multiple image 
bands. Many types of imagery, including multispectral and hyperspectral EO imagery, are 
composed of multiple spectral bands. Whilst the pixels in each band represent intensity 
values for a different region of the electromagnetic spectrum, they often represent the 
same spatial information and exhibit relationships which can be exploited for compres-
sion.  
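
As a small illustration of spatial redundancy, the sketch below compares the first-order
entropy of a synthetic row of pixels with that of its neighbour differences. The smooth
ramp used as input and the helper name `entropy_bits` are assumptions made for
illustration, and simple previous-pixel differencing stands in for the more elaborate
decorrelation techniques used by real compression algorithms.

```python
import math
from collections import Counter
from itertools import accumulate


def entropy_bits(symbols):
    """First-order (zeroth-context) entropy in bits per symbol."""
    total = len(symbols)
    counts = Counter(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Synthetic, spatially correlated row: 64 pixels forming a smooth ramp.
row = list(range(100, 164))

# Decorrelate by replacing each pixel with its difference from the previous one.
residuals = [row[0]] + [b - a for a, b in zip(row, row[1:])]

# The mapping is reversible, so no information is lost.
restored = list(accumulate(residuals))
```

Every pixel value in `row` is distinct, so its entropy is high, whereas the residuals are
almost all identical and can be represented by an entropy coder with far fewer bits.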
To date a vast number of techniques have been proposed to exploit these different types of
data redundancy. Different techniques are often better suited to different requirements; it
is therefore important that system requirements are established and considered when
assessing compression algorithms. For onboard EO satellite image processing, there are
several key requirements which an algorithm will need to meet. The requirements considered
in this research are given in Table 2-6. They address all aspects of the data life cycle,
encompassing characteristics of the imaging source and processing platform, as well as the
requirements of the user applications of the data.
 
Table 2-6 Onboard EO image processing algorithm desired characteristics

User induced characteristics        - Lossless compression
Platform induced characteristics    - High compression ratio
                                    - Minimised computational complexity
                                    - Minimised memory usage
Imager induced characteristics      - High data processing throughput
                                    - Handle high data dimensionality
 
Image compression algorithms can be classified based on the degree of information
permanently removed to achieve a compressed representation. Lossless algorithms achieve
compression using a reversible mechanism whereby no information is lost in the compression
process. In lossy compression some image information is permanently removed. This often
achieves a significantly higher compression ratio than lossless algorithms, but a level of
distortion and inaccuracy is introduced to the data. To ensure high-level information
extracted from EO imagery is precise, the image needs to be an accurate representation of
reality; lossless data processing is therefore often a key requirement from EO data
customers.
Another major requirement induced by the satellite platform is the demand for high 
compression ratio performance to help alleviate the onboard data bottleneck. Additional 
physical platform constraints on key design criteria such as available power, mass and 
volume in turn restrict the amount of computational and memory resources available 
onboard. The characteristics of current and future space borne EO imaging systems, such 
as the increasing data dimensionality and increasing imager data rates also need to be 
considered. Algorithms which are well suited to the processing of large volume and di-
mensionality data whilst achieving high, ideally real-time, data processing throughput 
rates will need to be identified. Considering these requirements, an extensive literature 
review in the field of lossless image compression has been conducted.  
A primary aim of this research is to maximise the achievable onboard data compression. The
surveyed lossless image compression algorithms have therefore initially been compared with
regard to compression ratio performance, where compression ratio is defined by (2-4). As it
was not possible to gather first-hand experimental results for all algorithms proposed in
the literature, a survey was conducted to gather quoted compression performance results for
individual algorithms. As a wide variety of approaches
have been employed by the surveyed algorithms, in order to assess trends in compression 
ratio with the underlying algorithm characteristics, each algorithm has been given a two-
tier classification. The first classification tier is based on the types of redundancies the 
algorithm exploits, whereby algorithms are referred to in this work as either traditional or 
multidimensional. As shown in Figure 2-19, traditional algorithms are those that utilise 
spatial and statistical redundancy reduction techniques, and multidimensional algorithms 
are those that additionally exploit spectral redundancies.  
\[ \text{compression ratio (CR)} = \frac{\text{original image size (bits)}}{\text{compressed image size (bits)}} \tag{2-4} \]
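
The definition in (2-4) can be evaluated directly, as in the short sketch below. The
function names are hypothetical, and the general purpose `zlib` compressor from the Python
standard library is used purely as a stand-in lossless coder; it is not one of the
surveyed image compression algorithms.

```python
import zlib


def compression_ratio(original_bits, compressed_bits):
    """Compression ratio as defined in (2-4)."""
    if original_bits <= 0 or compressed_bits <= 0:
        raise ValueError("sizes must be positive")
    return original_bits / compressed_bits


def zlib_compression_ratio(raw):
    """Compression ratio achieved by zlib on a byte string (stand-in coder)."""
    compressed = zlib.compress(raw, 9)
    if zlib.decompress(compressed) != raw:  # lossless: exact reconstruction
        raise RuntimeError("round trip failed")
    return compression_ratio(8 * len(raw), 8 * len(compressed))
```

A compression ratio of 2.0, for example, means the compressed image occupies half the
number of bits of the original and therefore requires half the downlink volume.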
 
 
Figure 2-19 Lossless image compression algorithm classification methodology 
 
An additional classification used in this research organises algorithms based on the 
theoretical techniques used for compression. The algorithms surveyed in this work utilise 
seven different approaches as shown in Table 2-7. This table provides the occurrence per-
centage and combined average compression ratio for each classification of algorithm in 
this research. The results in Table 2-7 highlight the performance variation between differ-
ent compression techniques and between traditional and multidimensional algorithms. 
Predictive based algorithms are the most prominent in both traditional and multidimen-
sional algorithms and also obtain the overall highest average compression ratios. 
 
Table 2-7 Average compression ratios of surveyed algorithms *

                               Traditional Algorithms       Multidimensional Algorithms
Theoretical Basis              Occurrence    Average CR     Occurrence    Average CR
Entropy Encoding Only          6%            1.88           -             -
Discrete Cosine Transform      2%            1.75           -             -
Discrete Wavelet Transform     25%           1.90           3%            2.84
Predictive Coding              67%           2.05           66%           3.15
Look-Up Table                  -             -              9%            3.05
Distributed Source Coding      -             -              19%           2.65
Vector Quantisation            -             -              3%            3.0

* References provided in Appendix B and Appendix C
 
 
The key results from the conducted survey for each individual algorithm are provided
in Figure 2-20 (see footnotes 1 and 2). The following subsections review in detail key traditional predictive and
discrete wavelet transform algorithms and multidimensional predictive, look-up table and 
vector quantisation algorithms. Further details on the other classification algorithms can 
be found in Appendix A. 
Footnote 1: Whilst care has been taken to seek data from reputable publications and to
cross-reference survey results where possible, the results were gathered from a review of
publications and therefore may not have been obtained using identical test images. For
instance, publications which tested multidimensional algorithms typically used remote
sensing imagery, whilst photographic and natural image types were used for traditional
algorithm testing. However, we were able to compare compression results for the JPEG-LS
algorithm from two independent tests on both remote sensing and natural imagery data, and
they produced a negligible difference of 2%.
Footnote 2: The average compression ratio results for multidimensional algorithms shown in
Figure 2-20 used only processed, band registered data. Band registration corrects for any
misalignment between the spectral bands of an image; this ensures co-located pixels refer
to the same physical location, increasing spectral correlation. As band registration is not
currently performed onboard, these compression ratio results may be higher than what is
achievable for raw data produced onboard a satellite.
 
Figure 2-20 Lossless image compression algorithm compression ratio analysis (see footnotes 1 and 2)
 
 
References for each algorithm can be found in Appendix B and Appendix C.  
* Denotes compression performance data was only available for calibrated multidimensional test images 
[Figure 2-20 plot: average compression ratio (y-axis, 1.6 to 3.6) against year of algorithm
publication (x-axis, 1990 to 2013).
Legend: Traditional Algorithm-Predictive; Traditional Algorithm-Transform; Traditional
Algorithm-Entropy Encoding Only; Multidimensional Algorithm-Predictive; Multidimensional
Algorithm-Transform; Multidimensional Algorithm-Entropy Encoding Only; Multidimensional
Algorithm-Look Up Table; Multidimensional Algorithm-Vector Quantisation; Multidimensional
Algorithm-Distributed Source Coding; Traditional Trendline; Multidimensional Trendline.
Plotted algorithms: Sunset, Lossless JPEG, FELICS, CALIC, LOCO-I, LOCO-A, UCM, HBB, TMW,
JPEG-LS, APC, EDP, ALPC, FBS, RALP, VBS, APT, BPNN, TS-FNN, FLIC, APC-MAP, APC-WLS, MRP,
MALCM, CREW, SPIHT, JPEG2000, ALCA, MINT-UCA, EZBC, PDF, PPBWC, TDKZW, JPEG-HD, PNG,
CCSDS-121, IB-CALIC*, ACAP*, D-JPEG-LS, ASAP*, C-DPCM, M-CALIC, SLSQ*, SLSQ-OPT*, SLSQ-HEU*,
BH, BG, CCAP*, FL, NPHI*, EPHI*, S-FMP, S-RLP, ABPCNEF*, C-DPCM-APL, CCSDS-123, D-JPEG2000,
USES*, LUT, LAIS-LUT, LAIS-QLUT*, LPVQ*, s-DSC*, DSC-CALIC*, v-DSC*, A1, A2, A3.]
2.5.1.1 Traditional Discrete Wavelet Transform Algorithms  
To date the most successful transform based approaches in lossless image compression
leverage the Discrete Wavelet Transform (DWT) [83][84]. The DWT behaves similarly to a
band-pass filter, allowing the image spectrum to be divided into its constituent low-pass
and high-pass components [85]. In image compression, a two-dimensional DWT is applied to
both the rows and columns of the image. Once the data is transformed, each of the low-pass
and high-pass filtered outputs contains frequency information for half of the original
spectrum. Therefore, by Shannon's sampling theorem, the data can be down sampled by a
factor of two, meaning the filtered representations will be a factor of two smaller in each
dimension [86].
Figure 2-21 was created using the built-in MATLAB DWT filter and depicts a visual example
of the filtering and down sampling process of the DWT applied to an image. This example
shows how the DWT separates high and low frequency image components and highlights that the
majority of visual data is present in the low frequency components. To use the DWT for
lossless image compression, an approach called the lifting scheme is applied; this allows
the signals to be mapped to integer wavelet coefficients, enabling perfect reconstruction
to be performed [83][84].
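
The integer-to-integer property of the lifting scheme can be illustrated with the simplest
wavelet. The sketch below applies one level of an integer Haar transform (often called the
S-transform) to a one-dimensional signal; the function names are hypothetical, and the Haar
filter is chosen here only for brevity rather than being the filter used by any particular
algorithm discussed in this section.

```python
def haar_lift_forward(signal):
    """One level of the integer Haar transform computed with lifting steps.

    signal must be an even-length list of integers.
    Returns (approx, detail), each half the length of the input.
    """
    evens, odds = signal[0::2], signal[1::2]
    detail = [o - e for e, o in zip(evens, odds)]           # predict step
    approx = [e + (d >> 1) for e, d in zip(evens, detail)]  # update step
    return approx, detail


def haar_lift_inverse(approx, detail):
    """Exactly invert haar_lift_forward by undoing the lifting steps in reverse."""
    evens = [s - (d >> 1) for s, d in zip(approx, detail)]
    odds = [d + e for e, d in zip(evens, detail)]
    signal = [0] * (2 * len(approx))
    signal[0::2], signal[1::2] = evens, odds
    return signal
```

Because each lifting step adds an integer quantity that the inverse can recompute and
subtract, reconstruction is exact despite the rounding introduced by the shift. A 2D
transform would apply the same steps to the rows and then the columns of the image.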
 
Figure 2-21 Example 1-level and 2-level 2D-DWT  
 
Compression with Reversible Embedded Wavelets (CREW) is one of the early integer-to-integer
DWT based algorithms [87]. A major advantage of this wavelet-based image compression
algorithm was its ability to provide additional progressive coding functionality, whereby
the compressed data is encoded and stored in order of visual importance in an embedded
bitstream. Progressive coding therefore allows for both lossy and lossless compression
within a single algorithm. This algorithm was proposed in direct response to a call for
contributions towards a new standard lossless compression algorithm by the JPEG committee
in 1994. Due to the lower compression ratio performance of CREW compared to other
state-of-the-art algorithms proposed, it was not selected as the basis for the new JPEG
lossless algorithm. However, the novelty of the additional progressive coding functionality
triggered the development and standardisation of another new JPEG standard, known today as
JPEG2000 [41].
The final JPEG2000 standard incorporated state-of-the-art wavelet and entropy encoding
techniques to provide joint lossless and lossy image compression capabilities. The
computational complexity of JPEG2000 is also not excessive; as a result, it has been used
for onboard EO image compression in several historical missions [88]. The average
compression ratio performance of JPEG2000 is 1.88. Whilst this is a 6% improvement on the
original CREW algorithm, the performance is significantly lower than that of alternative
prediction-based schemes.
Looking at current and future EO mission requirements, the compression achieved by JPEG2000
will be insufficient to minimise the onboard data bottleneck for future high data rate EO
missions. Additionally, due to the nature of the DWT technique, processing is often
performed on large blocks of an image. The memory requirements are therefore often greater
than those of alternative predictive based compression algorithms, which can reduce the
achievable processing throughput, itself an important factor for onboard implementation.
 
2.5.1.2 Traditional Predictive Algorithms  
Sunset was the first algorithm to define many principles which are used widely in 
modern predictive based lossless compression algorithms [90]. Three of these key con-
cepts are noted and explained in Table 2-8. 
 
Table 2-8: Sunset algorithm key concepts

Concept                   Description
Causal Template           A causal template is a subset of previously encoded pixels that
                          spatially neighbour a selected pixel.
Pixel Context             A pixel's context is a measure of the correlation between a pixel
                          and those in its causal template which spatially surround it.
Algorithm Conditioning    The pixel context is used to select and modify the specific
                          prediction equation and entropy encoding parameters at run time.
 
The greatest contribution to the area of traditional compression algorithm development
occurred between the years 1996 and 2000 (Figure 2-20). In this period, over 40% of the
total algorithms surveyed were proposed. Of these algorithms, over half achieve an average
compression ratio greater than 2.0. Many of these algorithms were proposed in direct
response to the 1994 JPEG call for contributions towards a new lossless image compression
standard [92]. This call for contributions was partially prompted by the relatively low
compression ratio performance and limited adoption of the first 1992 lossless JPEG
standard [93].
The first lossless JPEG standard specified two algorithm versions which differ by the 
entropy coding stages: one utilising arithmetic coding and the other Huffman based cod-
ing [91]. The arithmetic coding scheme follows the Sunset algorithm closely using the 
context values to condition the arithmetic entropy coding stage. However, the Huffman 
coding version is a simplified version that does not implement context modelling. Due to
patent issues at the time, the Huffman version of lossless JPEG gained wider adoption
despite the fact that its achievable compression ratio was significantly lower than that
provided by the arithmetic coding-based algorithm [91]. This shows that, despite the
performance advantages provided by the arithmetic coding scheme, ensuring free, open source
standards is a significant factor in achieving widespread adoption.
Two key algorithms proposed in response to the call for a new lossless JPEG standard were
Context Adaptive Lossless Image Codec (CALIC) [94] and Low Complexity Lossless 
Compression for Images (LOCO-I) [95]. Despite being developed separately, both algo-
rithms are very similar in structure, both adopting the novel implementation of context 
adaptive prediction. In context adaptive prediction, the prediction function is adapted 
based on the individual context of each pixel. This increased level of adaption provided in 
LOCO-I and CALIC is what enabled them to achieve state-of-the-art levels of compres-
sion performance. Despite the similarities in structure the two algorithms were designed 
to achieve different goals; CALIC was designed as a practical, high-performance image 
codec which implemented and tuned principles developed from previous research areas 
on adaptive predictive modelling optimisation [96].  
In contrast, LOCO-I was specifically designed to achieve competitive performance
within a low complexity scheme. J. Shukla et al. [97] provide detailed descriptions of
the specific context adaptive predictors, Median Edge Detector (MED) and Gradient
Adjusted Predictor (GAP), utilised by LOCO-I and CALIC respectively. At the time of
proposal, CALIC became the state-of-the-art lossless compression algorithm in terms of
achievable compression ratio. Despite this, LOCO-I was eventually chosen as the basis for
the new lossless JPEG standard, JPEG-LS, due to the implementation advantages provided
by the low complexity of the algorithm [40]. To improve the performance of LOCO-I
up to that of CALIC, an arithmetic coding version of LOCO-I (LOCO-A) was developed
and incorporated into the JPEG-LS standard as an extension.
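
As an illustration of context adaptive prediction, the MED predictor used by LOCO-I can be sketched in a few lines. The sketch assumes the usual neighbour naming (a = left, b = above, c = above-left). The function names are hypothetical, and neighbours outside the image are assumed to be zero, which is a simplification of the boundary handling in JPEG-LS.

```python
def med_predict(a, b, c):
    """Median Edge Detector: a = left, b = above, c = above-left neighbour."""
    if c >= max(a, b):
        return min(a, b)   # edge detected: pick the smaller neighbour
    if c <= min(a, b):
        return max(a, b)   # edge detected: pick the larger neighbour
    return a + b - c       # smooth region: planar interpolation


def med_residuals(image):
    """Prediction residuals for a 2-D list of integer pixels (raster order)."""
    rows, cols = len(image), len(image[0])
    residuals = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            # Assumption: out-of-image neighbours are treated as zero.
            a = image[y][x - 1] if x > 0 else 0
            b = image[y - 1][x] if y > 0 else 0
            c = image[y - 1][x - 1] if x > 0 and y > 0 else 0
            residuals[y][x] = image[y][x] - med_predict(a, b, c)
    return residuals
```

In a complete codec the residuals would then be passed to the context-conditioned entropy coder, which is not shown here.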
Since the development of context adaptive compression principles, first employed by
CALIC and LOCO-I, very few algorithms have proposed further advancements that
significantly exceed the performance benchmarks of these algorithms, as seen in Figure 2-20.
Over the period under review, only a single algorithm, Adaptive Predictor Combination – 
Weighted Least Squares (APC-WLS) [99], has exceeded the compression ratio of the 
1996 LOCO-A algorithm. APC-WLS was published in 2003 as an adaptation of an earlier 
Adaptive Predictor Combination (APC) framework first proposed in 1999 [100]. The 
original APC framework uses complex content filtering schemes and multiple predictor 
functions which are adaptively combined towards improved prediction accuracy and pre-
cise error mapping. The APC-WLS algorithm specifically utilises a weighted least 
squares method to adaptively select and combine the prediction equations. The weighted 
least squares approach is well suited to this application because there is often significant 
variability in the prediction error values due to image features such as edges and content
changes across the image. APC-WLS can be considered the current state-of-the-art
in traditional lossless image compression, with the highest average compression ratio of
2.3, 10% greater than CALIC. However, the computational complexity of this algorithm is
estimated to be in the region of an order of magnitude greater than the CALIC algorithm [101].
Taking the requirements for onboard data processing into account, JPEG-LS is the 
highest performing and most suitable traditional lossless image compression algorithm, 
providing a suitable trade-off between resource usage, computational complexity and 
compression ratio. Another advantage is that, due to its standardisation, it provides
increased ease of use for both satellite manufacturers and data end users. JPEG-LS is,
however, already a commonly utilised image compression algorithm in the satellite industry.
Therefore, for most in the space community, there is no significant benefit to utilising
an alternative traditional lossless algorithm in place of JPEG-LS.
 
2.5.1.3 Multidimensional Predictive Algorithms 
An area relatively unexplored in the satellite industry is the use of multidimensional
image compression algorithms which, as shown in Figure 2-20 and Table 2-7, can on average
achieve significantly higher compression ratios than their traditional counterparts.
In early multidimensional algorithm development, several simple
adaptations of traditional algorithms were proposed. One of the simplest techniques that
can be used to adapt traditional algorithms to additionally exploit spectral correlation is
differential coding.
Differential coding is implemented by calculating the differences between the pixel 
values in a current image band and a reference band. The remaining residual difference 
values between the bands are then encoded. Two examples of this are the Differential-
JPEG-LS (D-JPEG-LS) and Differential-JPEG2000 (D-JPEG2000) [102]. D-JPEG-LS 
and D-JPEG2000 achieve an average compression ratio of approximately 2.85 and 2.84 
respectively. When comparing their performance with traditional versions of the same 
algorithm, there are considerable increases in performance of approximately 51% and 
37% for D-JPEG2000 and D-JPEG-LS respectively. This is particularly significant com-
pared to only a 9.5% performance increase from JPEG-LS to the most advanced 
traditional algorithm APC-WLS. 
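
The differencing step described above can be sketched as follows. This is a minimal illustration, assuming bands are stored as 2-D lists of integers of equal size; the function names are hypothetical, and the subsequent encoding of the residual band by a traditional coder (JPEG-LS or JPEG2000) is not shown.

```python
def band_difference(current_band, reference_band):
    """Residual band: pixel-wise difference between current and reference bands."""
    return [[cur - ref for cur, ref in zip(cur_row, ref_row)]
            for cur_row, ref_row in zip(current_band, reference_band)]


def restore_band(residual_band, reference_band):
    """Inverse step at the decoder: add the reference band back to the residuals."""
    return [[res + ref for res, ref in zip(res_row, ref_row)]
            for res_row, ref_row in zip(residual_band, reference_band)]
```

Because the step is a simple integer subtraction it is exactly reversible, so the overall scheme remains lossless. When adjacent bands are strongly correlated, the residual values are small and can be coded more compactly than the original pixel values.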
Another example of the adaptation of an existing traditional algorithm into the
multidimensional domain is the Inter-Band-CALIC (IB-CALIC) algorithm proposed in
1998 [103]. IB-CALIC is an extension of the traditional CALIC algorithm that includes an
additional mode to enable inter-band coding. In the inter-band mode, a correlation value
is first calculated between the causal template pixels of the current pixel and the co-located
pixels in the previously encoded band. If this correlation is considered strong, i.e. above a
pre-determined threshold, then the previous band information is used in a
three-dimensional prediction. In cases of weak correlation, the mode is switched to the
traditional intra-band coding mode. IB-CALIC achieves an average compression ratio 41%
greater than the original traditional CALIC algorithm, highlighting the potential increase
in achievable compression from exploiting spectral redundancy and further demonstrating the
advantages of multidimensional compression.
In addition to adaptations of existing traditional algorithms, new predictive based 
multi-dimensional algorithms have been proposed in literature, several of which have 
been designed specifically for low resource and computational complexity. The 2004 
Spectrum Orientated Least Squares (SLSQ) algorithm was designed to perform within a 
low complexity framework for use specifically in environments with real-time processing 
or low resource requirements [104]. In SLSQ, the predictor is optimised for each pixel 
and each band using an adaptive least squares optimisation technique. The algorithm only 
targets spectral decorrelation and does not exploit any spatial redundancy. SLSQ achieves 
a competitive average compression ratio of 3.11, slightly greater than IB-CALIC, but the 
compression ratio achieved will be highly dependent upon the imagery being compressed 
and relies on strong spectral correlation.  
An alternative algorithm designed specifically for onboard satellite image compression
is the 2005 Fast Lossless (FL) algorithm [105]. The algorithm uses a three-dimensional
causal template of neighbouring previously encoded pixels and a user-defined
number of preceding spectral bands to perform spatial and spectral decorrelation
for adaptive prediction. The predictor is based on the low complexity sign (sgn) function,
which determines the sign component of a real number; this is used to update and optimise
the weighting of each spectral band. The original authors of the algorithm found that
a predictor based solely on the sign algorithm yields poor convergence speeds. Therefore,
a local mean subtraction method is also employed. For this, the mean of the causal template
for each band is calculated and then subtracted from the current or co-located pixel
value, which is then multiplied by the weighting factor for the corresponding band. The
approach used in the FL algorithm was designed specifically to achieve competitive levels
of compression ratio whilst minimising the computational and memory requirements.
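
The combination of local mean subtraction and a sign-based weight update can be illustrated with the heavily simplified sketch below. It is not the FL or CCSDS-123 predictor: the neighbourhood (left, above, above-left), the floating-point weights, the fixed step size and all function names are assumptions made for illustration, and only preceding bands contribute to the prediction.

```python
def sign(value):
    """Return -1, 0 or +1 according to the sign of value."""
    return (value > 0) - (value < 0)


def local_mean(band, y, x):
    """Mean of the previously encoded neighbours (left, above, above-left)."""
    neighbours = []
    if x > 0:
        neighbours.append(band[y][x - 1])
    if y > 0:
        neighbours.append(band[y - 1][x])
    if x > 0 and y > 0:
        neighbours.append(band[y - 1][x - 1])
    return sum(neighbours) / len(neighbours) if neighbours else 0.0


def sign_predict_band(cube, z, num_prev_bands=2, step=0.01):
    """Residuals for band z of cube[z][y][x], predicted from preceding bands."""
    prev_bands = list(range(max(0, z - num_prev_bands), z))
    weights = [0.0] * len(prev_bands)  # one weight per preceding band
    residuals = []
    for y in range(len(cube[z])):
        row = []
        for x in range(len(cube[z][y])):
            # Local-mean-subtracted co-located samples in the preceding bands.
            diffs = [cube[p][y][x] - local_mean(cube[p], y, x) for p in prev_bands]
            estimate = local_mean(cube[z], y, x) + sum(
                w * d for w, d in zip(weights, diffs))
            error = cube[z][y][x] - round(estimate)
            row.append(error)
            # Sign algorithm: only the sign of the error steers the update.
            weights = [w + step * sign(error) * d for w, d in zip(weights, diffs)]
        residuals.append(row)
    return residuals
```

Every quantity used to form a prediction depends only on samples that have already been encoded, so a decoder could in principle mirror the same weight updates and recover the data exactly.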
Due to its success, it has recently been adopted by the CCSDS as the basis of the
CCSDS-123 standard, titled “Lossless Multispectral and Hyperspectral Data Compression” [106].
During standardisation as CCSDS-123, several modifications were made to
the original FL algorithm. These include the addition of a prediction mode specifically
designed to suit the data patterns produced by push-broom sensors. Push-broom sensors
acquire data in spatial-spectral slices, where each detector element corresponds to a
specific spectral and cross-track position. As the characteristics of individual detector
elements can vary, the correlation between spectral bands also varies with cross-track
position. In the original FL algorithm, a local mean is computed for each sample using a
three-dimensional causal neighbourhood. However, for push-broom data it was found that
letting the local mean be equal to the previous sample in the same cross-track position
results in a significantly better compression ratio. Therefore, additional push-broom
orientated modes for local-mean calculation and prediction were added to the standard. The FL
and CCSDS-123 algorithms achieve an impressive average compression ratio of 3.24
(Figure 2-20), 56% better than JPEG-LS, 5.2% better than IB-CALIC and 4.2% better than
the low complexity algorithm SLSQ.
In addition to the development of low complexity predictive compression algorithms,
several advanced pre-prediction functions have been incorporated into predictive
compression schemes to increase the achievable compression ratio. One popular
pre-prediction function which has been introduced to several independent algorithms is image
clustering or segmentation. The aim of this function is to determine groups or areas of
homogeneous pixels with relatively similar properties to increase the accuracy of the
prediction stage. Algorithms which perform this additional pre-prediction processing
typically achieve state-of-the-art levels of compression; however, the resulting substantial
increase in computational complexity leads to increased computational resource
requirements or increased processing time. This makes these algorithms currently
unsuitable for onboard implementation. Further details on the algorithms surveyed can be
found in Appendix A.
 
2.5.1.4 Multidimensional Look-Up Table Algorithms 
Several Look-Up Table (LUT) algorithms have been proposed for multidimensional
image compression. The prediction technique used by these algorithms consists of three
main steps. First, the value of the pixel in the previously encoded band which is
co-located with the current pixel to be encoded is determined. Second, the location in the
previously encoded band of the spatially closest pixel that is equal to the value determined in
step one is established. Third, the pixel in the current band which matches the location
established in step two is used as the prediction value. To minimise the computational
complexity of this technique, the search logic required can be replaced by a simple LUT,
where the co-located pixel value is used as an index and the LUT returns the nearest
matching pixel value in the current band [107].
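
The three steps above collapse into a single table look-up and update per pixel, as the following sketch shows. It assumes raster-scan processing, so "nearest" means the most recently visited match, and it assumes the co-located previous-band value is used as the prediction when the table has no entry yet. Both choices, and the function name, are illustrative assumptions rather than details taken from [107].

```python
def lut_predict_band(previous_band, current_band):
    """Residuals for current_band using a LUT indexed by previous-band values."""
    lut = {}  # previous-band value -> most recent co-located current-band value
    residuals = []
    for prev_row, cur_row in zip(previous_band, current_band):
        row = []
        for prev_value, cur_value in zip(prev_row, cur_row):
            # Assumption: fall back to the co-located value when no match exists.
            prediction = lut.get(prev_value, prev_value)
            row.append(cur_value - prediction)
            # Update after predicting, so only causal information is used.
            lut[prev_value] = cur_value
        residuals.append(row)
    return residuals
```

The table needs one entry per possible pixel value, which is why quantising the index, as in LAIS-QLUT, reduces the memory footprint.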
There are however several shortcomings of this method: the nearest matched pixel 
value could be located far from the current pixel and the nearest match might not provide 
the closest prediction. These issues have been addressed in the extended Locally Aver-
aged Inter-band Scaling (LAIS-LUT) algorithm [108]. First, a multi-band local average 
calculation is performed, which is used to perform outlier rejection. In addition, a second 
LUT is implemented. This acts as a memory for past pixel values, meaning a selection 
can be made to find the most accurate prediction, not just the closest in time history. If no 
match is located, then the LAIS estimate can be used as the prediction. A quantised version
of the inter-band predictor, LAIS-QLUT, is a further modification of the algorithm
that reduced the memory requirements of the technique [109]. This was achieved by
uniformly quantising pixel values prior to indexing, thus reducing the size of the LUT.
When initially studying the multidimensional algorithm compression ratio results in 
Figure 2-20, LAIS-QLUT could be considered the highest performing algorithm to date. 
However, the compression ratio data for this algorithm was only available for a calibrated 
data set. In 2009, A. Kiely et al. [110] published a paper investigating the phenomenon that
LUT based algorithms achieve performance which surprisingly exceeds that of considerably
more complex algorithms, such as C-DPCM [111]. Performance testing of
algorithms at this time was typically only conducted on a single set of radiometrically
calibrated 1997 Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) data. It was found
that the LUT based algorithms were able to exploit artificial regularities introduced to this
data during the calibration process conducted on the ground, and that the LUT algorithms are
not able to achieve the same high level of compression performance on uncalibrated data
or even on alternative calibrated data sets. The performance reduction of the LUT based
algorithms for alternative datasets can be estimated to be in the region of 15 to 20% from the
results shown in Figure 2-20. Specifically, an 18% and 16% reduction in performance
was seen for the LUT and LAIS-LUT algorithms respectively for an alternative uncalibrated
dataset.
The identification of this phenomenon is important, especially when performing algorithm
selection for onboard implementation, where data is most likely in an uncalibrated
raw format. Although only LUT based algorithms have been proven to have reduced performance
for uncalibrated data, to ensure data clarity, all performance results based solely
on calibrated data have been marked with an asterisk (*) in Figure 2-20.
 
2.5.1.5 Multidimensional Vector Quantisation Algorithms 
In 2003, the Locally optimised Partitioning Vector Quantisation (LPVQ) algorithm
was proposed, utilising vector quantisation techniques for image compression [112]. The
first stage in vector quantisation is to decompose the image into a set of vectors; in LPVQ
the vectors are adaptively partitioned into variable sized sub-vectors. These sub-vectors are
then independently vector quantised using an appropriate codebook. The codebook indices
and residual errors are then entropy coded. LPVQ achieves a competitive average
compression ratio just 3% lower than IB-CALIC, but still 14% below the best performing
C-DPCM-APL algorithm. Additionally, the original paper states that vector quantisation
encoders are computationally complex when compared to prediction techniques; LPVQ
employs a smaller independent codebook to reduce encoding and decoding time. However,
the original paper also states that the complexity exceeds levels currently suitable for
onboard implementation.
 
2.5.1.6 Survey Summary 
There are several conclusions which can be drawn from the conducted review of 
lossless image compression algorithms. Regarding traditional lossless algorithms, it ap-
pears that since the proposal of CALIC and LOCO-I algorithms in 1996, a point of 
diminishing returns has been reached. Very few algorithms have demonstrated new ap-
proaches that have resulted in significant increases in compression performance. 
However, the number and compression performance of multidimensional lossless image 
compression algorithms proposed in recent years has significantly exceeded traditional 
algorithms. From Table 2-7 and Figure 2-20, on average multidimensional algorithms
achieve a 53% higher compression ratio compared to traditional algorithms. This suggests
that researchers are successfully exploring new techniques and exploiting additional spec-
tral image redundancies to achieve state-of-the-art image compression. Whilst several 
multidimensional algorithms explore newer alternative techniques, such as LUTs and 
Distributed Source Coding (DSC), no new technique to date has been able to provide sig-
nificant advantages, in terms of both minimised complexity and compression ratio 
performance, over the more established predictive schemes.  
For the application of onboard data processing specifically, a selected algorithm
should, in addition to providing competitive compression, be able to handle raw data and
minimise the required computational resources for low complexity and high throughput
processing. The algorithms surveyed in this work have also been reviewed against these
criteria. In doing so, several types of multidimensional algorithm were found not to be
currently appropriate for onboard implementation, such as LUT based, DSC, and complex
predictive algorithms. Of all the algorithms analysed in this review, the CCSDS-123
algorithm emerges as the highest performing algorithm suitable for onboard application.
CCSDS-123 achieves a compression ratio approximately 54% greater than JPEG-LS, 
which has been traditionally utilised for onboard image compression. These high levels of 
compression, which are only 7% less than state-of-the-art C-DPCM-APL, are achieved 
whilst also having relatively low computational complexity and memory requirements. 
 
2.5.2  Additional Data Processing 
Onboard operational satellites today, commonly only image compression, and no
additional data processing, is performed prior to data downlink. This is due to the limited
availability of the processing resources required to perform more advanced processing
algorithms. There are, however, many data processing techniques that could be performed
onboard which would provide a range of benefits to EO satellite missions. Promising
onboard payload data processing techniques include algorithms that could help increase
the error resilience of data transmission, increase processing throughput and potentially
increase the compression ratios achieved onboard. Several algorithms that can help
achieve these objectives are summarised in Table 2-9 and discussed further in the following
subsections.
 
Table 2-9 Additional image processing and their advantages 
Processing Algorithms | Key Advantages
Image Tiling | Reduced error propagation; additional dimension for parallel data processing
Band Registration | Increase spectral correlation for higher compression
Radiometric Calibration | Increase spatial correlation for higher compression
Image Analysis | Create new onboard data products; increase image knowledge for higher compression; autonomous intelligent onboard control of processing chain
 
 
2.5.2.1 Image Tiling 
Image tiling is the process of splitting an image into smaller non-overlapping blocks 
or sub-images. Conducting tiling prior to image processing can provide significant ad-
vantages at many different data processing stages. Tiling can increase the amount of 
system level parallelism available for exploitation when each image tile is subsequently 
processed independently and simultaneously. When performed onboard, image tiling can 
increase the amount of parallelism available, which can be leveraged by a suitable parallel 
processing system for increased data processing throughput.  
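
A minimal sketch of uniform tiling followed by independent per-tile processing is given below. The function names are hypothetical, the tile size is fixed by the caller, edge tiles are simply allowed to be smaller, and a thread pool stands in for whatever parallel hardware would be used onboard.

```python
from concurrent.futures import ThreadPoolExecutor


def tile_image(image, tile_rows, tile_cols):
    """Split a 2-D list into non-overlapping tiles, in raster order.

    Returns a list of ((top, left), tile) pairs; edge tiles may be smaller.
    """
    tiles = []
    for top in range(0, len(image), tile_rows):
        for left in range(0, len(image[0]), tile_cols):
            tile = [row[left:left + tile_cols]
                    for row in image[top:top + tile_rows]]
            tiles.append(((top, left), tile))
    return tiles


def process_tiles(tiles, worker, max_workers=4):
    """Apply worker to every tile independently; results keep tile order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, [tile for _, tile in tiles]))
```

Because no tile depends on any other, the worker (for example, a compressor) can be run on all tiles simultaneously, and a transmission error in one compressed tile cannot propagate into its neighbours.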
Tile size is an important parameter to consider in the implementation as it has been
shown to have an impact on the data processing throughput, image compression performance
and memory requirements [113][114]. Tile size can have either a positive or a
negative impact on compression performance; the effect depends upon the underlying
characteristics of the imagery data. The ideal tile size often varies
across the image. To achieve a high compression ratio, a single tile should represent an
area with homogeneous characteristics. Ideally, adaptive tiling would allow the tile size to
be adapted at runtime to suit the image being processed, but this requires prior
knowledge and statistics of the image content. The concept of adaptive tiling has been
demonstrated in coordination with a lossy image compression scheme by Z. Zhihua et al.,
and their approach is visually demonstrated in Figure 2-22 [115].
 
 
 
Figure 2-22 Adaptive image tiling [115]. Panels: A) Uniform tiling of McMurdo Sound image; B) Adaptive tiling of McMurdo Sound image; C) Uniform tiling of Antarctic image; D) Adaptive tiling of Antarctic image
 
A fully automated and adaptive image tiling scheme could prove to be computationally
complex as it requires image segmentation to be conducted as an additional
pre-processing stage to determine tile sizes adaptively. Image segmentation is a data
processing technique used to segregate an image into non-overlapping regions of similar
statistics. This fundamental technique can also be used to perform object or region detec-
tion. Current segmentation techniques are however highly complex often requiring high 
performance computing architectures and large training data sets for efficient operation. 
Therefore, a simple tiling or partially adaptive tiling scheme is currently better suited for
onboard use. A partially adaptive scheme could provide enough flexibility to allow tile
sizes to be adapted from image to image but not within a single image. This could help
improve compression performance, data error resilience, data processing throughput or
memory usage whilst minimising the additional resources required.
 
2.5.2.2 Band Registration 
Multispectral and hyperspectral imagery are very rarely interpreted band inde-
pendently; in EO science applications, measurements and data interpretation are often 
calculated using the relationships between multiple bands. For accurate image analysis, it 
is therefore imperative that each pixel represents the same spatial location on the ground. 
Raw image data often does not exhibit accurate band-to-band correlation properties for
several reasons. Firstly, there are static contributions to band misalignment; these tend to
have minimal variation over time and are often inherent to the imager design. For exam-
ple, for push-broom imagers, detectors are placed in parallel on the focal plane with offset 
distances between band detectors, thus the same scene is captured by each band at a dif-
ferent time with a certain lag factor. Static misalignment factors can normally be 
characterised and accounted for by conducting pre-launch or inflight acceptance tests.  
Additionally, dynamic disturbances are typically induced by the environment, space-
craft movement or due to vibrational noise sources onboard the spacecraft [116]. These 
disturbances can vary significantly with time and are extremely difficult to measure and 
detect onboard the satellite platform. Dynamic disturbances can cause both severe visible 
image distortion and induce noise on a sub-pixel level; the impacts can also differ
depending on the imaging technique used [117]. The problem of dynamic disturbances is
becoming an increasing concern in the field of satellite EO, as the effects of satellite jitter 
and micro-vibrations are manifesting more strongly in imagery, particularly as the radio-
metric and spatial capabilities of imagers increase [118].  
The literature review of lossless image compression algorithms, discussed in Section 
2.5, has shown that by exploiting the spectral dimension of EO imagery, increased com-
pression performance can be obtained. By performing band registration onboard prior to 
compression, the correlation between spectral bands would be increased, which would likely
increase the achievable compression ratio for multidimensional compression algorithms.
Traditionally, however, band registration is performed solely on the ground as part of a
wider registration and calibration processing chain. Supervised image registration
techniques, which require user level input, have historically been the most common in remote
sensing imagery processing. However, due to the increases in data volumes, sophisticated
unsupervised algorithms are gaining in popularity for ground-based processing, which is
also more aligned with the requirements of an onboard band registration system.
An onboard suitable band registration algorithm should be fully autonomous, its
registration accuracy should be characterised or reversible, and it should adhere to limited
computational resource and processing time constraints. Image registration is a complex
and challenging research area which has grown in recent years and now encompasses
many different approaches and solutions to the variety of registration problems. Unlike
image compression, no algorithms or specific techniques have been formalised or officially
adopted for recommendation by any standardisation organisation, leaving the
community to select, adapt and implement algorithms from the large pool of published
research or commercial tools [119].
B. Zitova et al. provide a very thorough and detailed review of a wide range of
registration methods under research in recent years [120]. One of the most widely used
techniques is the scale invariant feature transform (SIFT) algorithm, which defines an
approach for detecting and matching features across a wide image variation space. Several
modifications of the original algorithm have appeared in research which are designed
to tackle specific optimisations of the original methodology, several of which are aimed
specifically at the multispectral remote sensing image registration problem [121]-[123].
However, the literature very rarely considers computational complexity which is a key 
requirement for onboard implementation. An experimental study to objectively compare 
techniques with regards to resource usage, achieved accuracy rating and how much they 
increase band correlation for increased compression performance would need to be con-
ducted to make an informed algorithm selection for onboard application. 
 
2.5.2.3 Radiometric Calibration 
Satellite image sensors observe and measure the reflectance and emission of radiation 
by the Earth’s surface or atmosphere. This is achieved through the measurement of sensor 
voltage as a response to a certain physical radiance value. Radiometric calibration is the 
processing technique used to convert the recorded pixel values back to spectral radiance 
values. In scientific applications the raw pixel values collected by the sensors have little 
physical meaning and calibration is therefore required prior to data utilisation [126].  
Absolute radiometric calibration is commonly performed on the ground using sensor
calibration models. This process converts pixel values to radiance whilst also correcting for
non-ideal sensor characteristics and attenuation due to absorption and scattering induced
by the Earth’s atmosphere. The relationship between the digital sensor measurement and
the physical radiance can be derived through careful calibration and testing prior to
launch.
However, once launched, a number of factors, including the thermal and radiation
environment and mechanical or electrical system properties, can change or degrade
the imager’s characteristics [127]. Therefore, the calibration equations need to be
updated over a satellite’s operational lifetime. Calibration coefficients are
updated regularly, typically every few weeks or months, via targeted calibration imaging
campaigns of specific locations on the Earth or of the Moon [124].
There are potential advantages to performing radiometric calibration onboard the 
platform prior to compression and downlink. Upon the input of a uniform radiance scene 
to an ideal imaging system, a uniform response at the output should be witnessed, free 
from any radiance variations or artefacts. However, due to non-ideal sensors and sensor 
degradation this is not always the case. A push-broom imager, for example, is composed 
of an array of individual CCD sensors, positioned along the cross-track direction. Each 
individual sensor will often have different physical properties causing varying radiance 
responses across the sensor array. As a result, striping artefacts are often visible in an 
output image, which can negatively affect subsequent processing functions, including 
compression [128].  
Therefore, conducting onboard radiometric calibration prior to compression, to remove
artefacts and increase image correlation, could enable higher rates of compression
to be achieved. To achieve this, a full radiometric model is not necessarily required;
a simplified sensor and calibration model could instead be deployed onboard. Although
this would not provide absolute calibration of the data, it would provide a relative
radiometric calibration, removing any non-ideal sensor response artefacts.
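
One very simple form of relative calibration is a per-detector gain correction, sketched below for a push-broom image in which each column corresponds to one cross-track detector. This is an illustrative assumption rather than a method taken from the cited literature: gains are estimated from a nominally uniform scene, offsets (dark current) are ignored, the function names are hypothetical, and a detector with a zero mean response is not handled.

```python
def column_gains(uniform_scene):
    """Relative gain per cross-track detector from a nominally uniform scene."""
    num_rows, num_cols = len(uniform_scene), len(uniform_scene[0])
    col_means = [sum(row[c] for row in uniform_scene) / num_rows
                 for c in range(num_cols)]
    overall_mean = sum(col_means) / num_cols
    # Assumption: no column mean is zero (i.e. no dead detectors).
    return [overall_mean / mean for mean in col_means]


def destripe(image, gains):
    """Scale each column by its relative gain to equalise detector responses."""
    return [[pixel * gain for pixel, gain in zip(row, gains)] for row in image]
```

The corrected values are relative rather than absolute radiances, and they are generally non-integer. For lossless downlink they would need to be re-quantised or the gains sent alongside the data, a design choice this sketch leaves open.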
2.5.2.4 Image Analysis 
In addition to the processing algorithms previously discussed, which can
be used to directly help increase achievable onboard compression, there are many other
types of image processing algorithm which offer different advantages. Typically,
these are algorithms which can be used to obtain a higher level of information abstracted
from image data. The most relevant algorithms for the application of satellite
remote sensing are summarised in Table 2-10.
There are three main uses and advantages of performing image analysis algorithms, 
such as those discussed in Table 2-10, onboard prior to data downlink. Firstly, the infor-
mation output from the techniques can help to determine if it is worth downlinking 
images or parts of images in the first place. For instance, using cloud detection infor-
mation, only imagery which meets a threshold in terms of cloud coverage could be 
selected for downlink.  
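
The downlink selection described above can be sketched with a single-channel brightness threshold. The threshold values, the use of one spectral channel and the function names are all assumptions for illustration; operational cloud detectors are considerably more involved.

```python
def cloud_fraction(band, brightness_threshold):
    """Fraction of pixels in one spectral channel classed as cloud."""
    total = sum(len(row) for row in band)
    cloudy = sum(1 for row in band for pixel in row
                 if pixel > brightness_threshold)
    return cloudy / total


def select_for_downlink(band, brightness_threshold, max_cloud_fraction):
    """Keep the image only if its cloud coverage is within the allowed limit."""
    return cloud_fraction(band, brightness_threshold) <= max_cloud_fraction
```

For example, with a brightness threshold of 150 the band `[[10, 200], [220, 30]]` has a cloud fraction of 0.5, so it would be downlinked under a 60% coverage limit but discarded under a 40% limit.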
Secondly, the information generated could be used to guide more intelligent autono-
mous onboard data processing. For example, the results from target detection algorithms 
could help identify a specific region-of-interest (ROI) in an image. Subsequently all pix-
els not within the ROI could be masked prior to compression or compressed using an 
alternative high compression lossy algorithm.  
Thirdly, the abstracted information could be downlinked directly as an additional 
onboard generated data product with its own value in addition or alternatively to full im-
agery datasets. For example, change maps or motion vectors created by change detection 
algorithms could be downlinked and sold as a data product or used to help inform future 
mission operations and management. 
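The simplest form of the third use, a change map produced by pixel differencing, can be sketched as follows. This is an illustrative example only: it assumes two co-registered single-band images of equal size, and the function names and the minimum difference value are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]


def change_map(before: Image, after: Image, min_delta: int) -> List[List[bool]]:
    """Mark pixels whose absolute difference between acquisitions reaches min_delta."""
    if len(before) != len(after) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(before, after)
    ):
        raise ValueError("images must have identical dimensions")
    return [
        [abs(b - a) >= min_delta for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(before, after)
    ]


def changed_pixels(mask: List[List[bool]]) -> List[Tuple[int, int]]:
    """List (row, column) coordinates of changed pixels, a compact downlink product."""
    return [(r, c) for r, row in enumerate(mask) for c, flag in enumerate(row) if flag]
```

Downlinking only the coordinate list, rather than both full images, is one way the abstracted information could serve as a data product in its own right.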
  
Table 2-10 Image analysis examples

Analysis Method: Cloud Detection [129] [130] [131]
Description: Clouds typically cover around 60% of the Earth's surface, thus there can often
be circumstances where large portions of an imaged scene are covered by clouds, which can
render the data useless for a wide range of applications.
Cloud detection algorithms are commonly based on thresholding of pixel values in specific
spectral channels to determine if a pixel is cloud or not. More complex approaches also
utilise segmentation and region growing approaches to detect all pixels which represent
cloud only data.

Analysis Method: Target Detection [132] [133] [134]
Description: Often satellites are operated to gather data and imagery on specific targets,
either based on geographical location, characterisation such as land use, or specific
content such as ships or cars.
Target detection, or region-of-interest identification, algorithms are often based on image
segmentation or edge-detection techniques to identify regions in an image. Subsequently,
image classification and feature extraction methods are employed to determine if a region
meets target requirements.

Analysis Method: Change Detection [135] [136] [137]
Description: With the growth of constellations and video payloads, the temporal resolution
of EO imagery is growing. Often this type of data is used specifically for monitoring and
detection of change over time.
Simple change detection methods implement basic differential techniques to determine when
pixel values have changed. More sophisticated algorithms utilise machine learning and fuzzy
clustering to generate change maps or movement vectors which more accurately describe
differences between images.
 
 
2.6    CCSDS-123 Implementation Research 
Since its initial proposal, the CCSDS-123 algorithm has been the direct subject of a
number of additional publications which detail implementation approaches. Of key
note is the series of NASA JPL published works detailing several implementation
investigations of the algorithm on different hardware platforms. The series covers
the progression of implementations on FPGA, GPU and multicore CPU devices.
The first paper, published in 2009, provides details of an implementation of the
algorithm on a high-density Xilinx Virtex-4 FPGA [138]. The paper describes in detail how
the algorithm was broken down into several key constituent computational blocks and
how these were pipelined together to achieve the overall implementation. As a further
progression of this work, in 2014 the FPGA implementation was adapted and integrated
into two data processing modules for an airborne hyperspectral instrument. The work
demonstrated the performance achievable in a practical real-time compression
system [139].
In 2012, the first paper detailing research into a GPU implementation of the CCSDS-
123 algorithm was published [140]. The authors discuss the specific elements of the algo-
rithm and alternative forms of parallelism that are suitable for the GPU architecture. The 
proposed GPU implementation utilises buffering to remove serial data dependencies to 
increase algorithmic parallelism. The implementation deploys a full greedy parallelism 
approach, whereby the largest amount of parallelism present at each stage of the algo-
rithm is exploited.  
Following on from this research, B. Hopson et al., part of a collaboration between The
University of Edinburgh and NASA JPL, published a progression of this work specifically
targeting mobile GPU and multicore CPU systems [141]. This implementation limited the
data level parallelism leveraged and exposed additional task level parallelism by using
image tiling to compress multiple tiles concurrently.
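The idea of exposing task level parallelism through image tiling can be sketched as follows. This is not the implementation from [141]: it assumes an image held as a list of byte rows, uses CPU threads in place of GPU kernels, and uses zlib purely as a stand-in compressor rather than CCSDS-123. All function names are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_tiles(rows: List[bytes], tile_height: int) -> List[bytes]:
    """Group consecutive image rows into independent tiles."""
    if tile_height <= 0:
        raise ValueError("tile_height must be positive")
    return [b"".join(rows[i:i + tile_height]) for i in range(0, len(rows), tile_height)]


def compress_tiles(tiles: List[bytes], workers: int) -> List[bytes]:
    """Compress tiles concurrently; each tile is an independent task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so tile order is preserved.
        return list(pool.map(zlib.compress, tiles))


def decompress_tiles(compressed: List[bytes]) -> bytes:
    """Reassemble the original image bytes from the compressed tiles."""
    return b"".join(zlib.decompress(tile) for tile in compressed)
```

Because each tile is compressed independently, tiles share no prediction or entropy coding state, which is why tiling can be expected to trade some compression ratio for concurrency.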
The performance results for all CCSDS-123 implementations proposed in the literature
to date are given in Figure 2-23. The measured throughput of all these implementations
is given for the same AVIRIS Hawaii hyperspectral test image. Also marked on the figure
is the raw data rate of the AVIRIS imager, which is equal to 0.8 Gigabits per second
(Gb/s).
Figure 2-23 CCSDS-123 previously published throughput results
[Figure: horizontal bar chart. Axis: Throughput (Gb/s), 0.0 to 4.5. Bars grouped by source
publication [141], [140] and [138]. Legend: FPGA, Desktop GPU, Mobile GPU, Desktop CPU,
Mobile CPU, AVIRIS Data Rate. Platform labels as extracted, top to bottom: 2x NVIDIA GTX
560M; NVIDIA GTX 560M; Intel i7 2760QM 4 Cores; Intel i7 2760QM 1 Core; Intel i7 2760QM
4 Cores; Intel Xeon X5690 1 Core; Intel Xeon X5690 4 Cores; Intel Xeon X5690 8 Cores;
Intel Xeon X5690 12 Cores; NVIDIA GTX 560M; NVIDIA Tesla C2070; NVIDIA GTX 580; Virtex-4
LX25. Bar values as extracted (Gb/s): 4.28, 3.86, 1.53, 0.07, 0.12, 0.09, 0.19, 0.20,
0.23, 0.44, 0.36, 0.54, 0.70.]

A key characteristic identified from the results given in Figure 2-23 is the importance
of optimising the implementation to best exploit the underlying hardware architecture.
All the results from publication [140], for desktop and mobile MC-CPUs and GPUs, show a
processing throughput which is less than that of the previous FPGA implementation [138].
Additionally, the differences in throughput reported in [140] between mobile and desktop
platforms are relatively minor. For MC-CPUs the difference varied between a 1.56 and
1.87 times increase, and for the GPU platforms it varied between a 1.23 times decrease
and a 1.15 times increase in throughput. However, the results from publication [141],
for the same mobile MC-CPU and GPU (single GPU configuration), show a significant
increase in processing throughput of 11.52 and 8.06 times respectively. This highlights
that exploiting maximum DLP alone is not sufficient to achieve high data processing
throughput on either a MC-CPU or GPU platform, and that alternative sources of TLP and
the memory model of the architecture need to be considered carefully for a platform
optimised implementation.
Towards the use of a GPU platform for onboard satellite lossless image compression,
the previously conducted research has several limitations. These are summarised in
Table 2-11.
 
Table 2-11 Previous research limitations
1. The results given in [141] provide performance results for the use of an alternative
limited DLP single kernel approach with the additional axis of TLP from image tiling.
However, the individual contributions of DLP and TLP towards the throughput have not
been assessed, and performance results have been provided for only a single tile
configuration.
2. The previous publications focussed upon data processing throughput; however,
compression ratio and memory requirements are also key characteristics for an onboard
suitable implementation. Specifically, image tiling and algorithmic parameters can both
impact compression ratio, memory requirements and throughput, therefore it will be
necessary to perform a trade-off analysis with these parameters.
3. The development, testing and throughput performance assessment have only been
published for a single hyperspectral image. Therefore, the reader is unable to determine
how the GPU implementation throughput performance scales with different image datasets.
This is particularly important in the case of multispectral imagery, which will exhibit
less dimensionality in the spectral axis, the main axis for parallelisation in previous
implementations.
4. The throughput performance comparisons are conducted on desktop and mobile GPUs, but
neither of these GPU platforms is suitable for onboard implementation. The desktop GPUs
used have a maximum power consumption of between 238W and 244W, and the mobile GPU has a
TDP of 75W. To be feasible for most onboard satellite data processing systems, the GPU
should have a maximum power consumption in the region of 10W.
5. None of the previous publications consider the radiation environment in space and the
impact this could have on the application and the underlying GPU hardware. It will likely
be necessary to practically assess the inherent resilience of the application and GPU
hardware architecture, in addition to the implementation of mitigation strategies, before
a GPU platform will be actively utilised in the space environment.
 
  
2.7    Chapter 2 Summary 
The research detailed in this literature review has been conducted to refine the scope
and project objectives and to identify specific areas which will require further
investigation. As shown in Section 2.1, the majority of interest in the EO and remote
sensing sector is currently focussed on high resolution optical instruments. Consequently,
the scope of this research is limited to onboard data processing approaches which will
alleviate the data bottleneck specifically for these types of payloads. Section 2.2
presents the review of key aspects of the onboard data processing field, with a focus on
assessing the significant aspects which differentiate it from the terrestrial data
processing field. Assessing the current state of onboard data processing has highlighted
that current processing system design and underlying hardware devices are holding back
onboard data processing capabilities, whereby only relatively low complexity compression
algorithms are currently deployed.
Therefore, in Section 2.3 the fields of terrestrial computing system design approaches 
and state-of-the-art processing devices which present significant advantages for the appli-
cation of onboard data processing are explored. In particular elements from cluster 
computing systems and GPU devices present significant advantages for flexible and ac-
celerated advanced onboard data processing. Due to the growth of the terrestrial mobile 
computing market, reduced volume, mass and power consumption GPU devices are be-
coming increasingly accessible. Currently, there is very little research towards the use of 
these terrestrial low power devices in fault intolerant applications. As a result, radiation 
and error tolerance will be the last remaining major obstacle preventing widespread utili-
sation of these devices in safety critical and space applications.  
Towards the main research goal of helping to reduce onboard data volumes, image
compression is a vital data processing technique. Section 2.5 details an extensive review
of literature in the area of lossless image compression. The review includes a wide range
of lossless image compression algorithms which have been assessed directly against the
requirements for the application of onboard image processing. The findings clearly
demonstrate the advantages, in terms of compression performance, of multidimensional
algorithms, which on average achieve a 53% greater compression ratio compared to
traditional algorithms. Of all the algorithms analysed in this review, the CCSDS-123
algorithm emerges as the highest performing algorithm most suitable for onboard
implementation. CCSDS-123 achieves a compression ratio approximately 54% greater than
JPEG-LS, which is often used in space applications. In addition to the compression
algorithms, several advanced data processing techniques which could be implemented
onboard to aid compression performance further and provide other advantages have been
researched in Section 2.5.2. However, typical onboard data processing systems today are
not suitable for the deployment of the majority of these advanced data processing
algorithms.
This further highlights the need for a new flexible onboard data processing architecture
designed for high data throughput processing, to facilitate further developments in
onboard compression and advanced processing algorithms. The key challenge is achieving
increased levels of computational resources and throughput whilst also ensuring key
onboard requirements such as error resilience and power consumption are still met.
Rebecca L. Davidson          Chapter 3, New Onboard Data Processing Architecture 
- 87 - 
CHAPTER 3, NEW ONBOARD DATA PROCESSING ARCHITECTURE 
The performance of any data processing system is dependent upon the underlying archi-
tecture, the hardware deployed, processing algorithms used and also the effectiveness of the 
software to harness system capabilities. As space proven systems struggle to keep up with the 
evolving requirements, it is increasingly important to consider software and algorithm im-
plementation strategies in the early stages of system design to ensure all aspects of the system 
are compatible and collectively the requirements are met.  
Traditionally, technologies originally developed in the space industry have subsequently 
found beneficial use in terrestrial applications. However, in recent years the opposite trend, of 
increased utilisation of terrestrial devices within the space industry, has become common. 
Whilst COTS hardware devices are increasingly being utilised in the space sector, they are 
often deployed in a traditional space architecture and fail to exploit the latest terrestrial de-
velopments in system design and computing models.  
Therefore, in this work terrestrial computing architectures have been researched in addi-
tion to appropriate hardware devices, for adaptation in a space application. The terrestrial HS-
CD methodology has influenced the approach taken in this research towards the proposal of a 
new onboard data processing system. Firstly, idealistic top-level attributes have been pro-
posed and are summarised in Table 3-1.  
Table 3-1 Ideal new onboard data processing system attributes 
1. Scalable and flexible system architecture 
 a. In-orbit reconfigurable functionality  
 b. Scalable computational and memory resources 
 c. Multiple payload interfacing solutions 
2. High performance digital data processing  
 a. Execute complex concurrent and sequential digital data processing applications 
 b. Achieve real-time data processing to match high data rate payloads 
 c. New data products and autonomous and intelligent control 
3. Space environment suitable 
 a. Minimised mass, volume and power consumption  
 b.  Minimise data corruption and exposure to error susceptible environment 
 c. Minimal operational downtime 
Secondly, new hardware and software agnostic behavioural and structural system designs
have been researched. These target key system attributes which are not addressed by
current onboard data processing systems. Finally, a new hardware orientated onboard data
processing architecture is proposed, and current state-of-the-art devices which could be
leveraged to construct the proposed architecture are discussed.
 
3.1    New Behavioural System Design  
To meet the new attributes given in Table 3-1, a runtime adaptive system behaviour is
proposed. This will allow new data processing techniques, not currently implemented in
onboard satellite data processing systems, to maximise the achievable payload data
downlink throughput. The proposed new behavioural system design is described in
Figure 3-1. The behavioural design defines the new adaptive processing chain for payload
data, how the system will be internally controlled and how the system can be monitored
externally using telemetry. The payload data flow has been designed to be adaptive, with
two behavioural modes of operation defined.
 
Figure 3-1 New behavioural system design
[Figure: block diagram of the Payload Data Processing System. Blocks: Payload Interface,
Advanced Processing, Compression, Memory, Decompression, Downlink Interface. External
connections: Payload, Platform, Downlink. Numbered steps: 1, 2A, 2B, 3, 4, 5A, 5B, 6A, 6B.
Control signals: Image Capture Trigger, Advanced Processing Done, Compression Done,
Offline Processing Trigger, Decompression Done, Downlink Trigger. Key: Payload Data Mode A
(Streamed Processing), Payload Data Mode B (Offline Processing), Control Signals,
Telemetry Data.]

Payload data processing mode A, depicted by the black block arrows in Figure 3-1,
represents the idealistic streamed real-time processing behaviour. In this mode, the data
passes through the payload interface, advanced processing and compression stages and into
memory in real-time, at a throughput equal to that of the payload data rate. By meeting
payload induced processing constraints, the latency of data transfer from the payload to
memory is minimised, which reduces the time exposed to potential environmental effects and
decreases onboard data buffering requirements. Data buffering is dependent on the specific
algorithms employed and is directly related to the data dependencies and parallelism of
each processing stage.
However, many advanced processing algorithms which pose the greatest advantages to-
wards increasing the achievable onboard data compression ratio, are not suitable for real-time 
data processing. Performing these algorithms in the same payload data processing flow would 
lead to the formation of new onboard data bottlenecks within the payload data processing sys-
tem. To mitigate against this, a second offline processing mode of operation for the system 
has been devised. This mode is depicted in Figure 3-1 as mode B using shaded and black out-
line arrows. It is based on the same fundamental behavioural blocks as the streamed real-time 
mode but provides an alternative behavioural model to achieve the same system objectives.  
In the offline processing approach, processing stages which cannot meet real-time pro-
cessing constraints or have large buffering requirements, can be performed after the payload 
data has been compressed and stored in memory, but prior to the data being downlinked. As 
the data will still need to be compressed for onboard storage, additional onboard decompres-
sion capabilities will be required. Offline processing could be scheduled in non-imaging 
operational periods, avoiding inducing an increased strain on the processing system during 
imaging periods. This will result in more complex control and operational profiles; however, 
it will enable high onboard compression performance and advanced data products to be gen-
erated onboard to increase the downlink efficiency without restricting the system to only real-
time processing. 
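The allocation of processing stages between the two modes can be illustrated with a short sketch. It assumes each stage is characterised by a sustained throughput and a buffering requirement; the class, field and function names, and all numeric values, are hypothetical and chosen only to show the decision rule described above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Stage:
    name: str
    throughput_gbps: float  # sustained processing rate of the stage
    buffer_mb: float        # data the stage must hold before it can produce output


def assign_modes(
    stages: List[Stage], payload_rate_gbps: float, buffer_limit_mb: float
) -> Dict[str, List[str]]:
    """Place a stage in streamed mode A only if it keeps pace with the payload
    and fits the buffer budget; otherwise defer it to offline mode B."""
    modes: Dict[str, List[str]] = {"A": [], "B": []}
    for stage in stages:
        real_time = stage.throughput_gbps >= payload_rate_gbps
        fits_buffer = stage.buffer_mb <= buffer_limit_mb
        modes["A" if real_time and fits_buffer else "B"].append(stage.name)
    return modes


# Hypothetical stage figures for illustration only.
example = assign_modes(
    [
        Stage("compression", throughput_gbps=1.2, buffer_mb=16.0),
        Stage("registration", throughput_gbps=0.3, buffer_mb=64.0),
        Stage("cloud detection", throughput_gbps=1.0, buffer_mb=512.0),
    ],
    payload_rate_gbps=0.8,
    buffer_limit_mb=128.0,
)
```

In this example only compression stays in mode A: registration is too slow for the payload rate and cloud detection exceeds the buffer budget, so both would be scheduled for a non-imaging period.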
 
3.2    New Structural System Design 
Based on the new behavioural system definition, a new structural system design has been 
researched. The structural architecture defines the organisation of the system whilst remain-
ing hardware and software agnostic and has been organised based on the requirements of 
each behavioural block and the system requirements proposed in Table 3-1. The resulting 
proposed structural system design is given in Figure 3-2. It is split into four distinct blocks, 
namely system controller, interfaces and communications, payload data handling and pro-
cessing and the mass memory. The key responsibilities of each system block and the 
advantages these bring over traditional onboard payload data processing systems are summa-
rised in Table 3-2. 
 
 Figure 3-2 Structural system design 
 
Table 3-2 Structural system design summary

Structural System Block: System Controller
Key Responsibilities: Autonomous triggering of events, configuring processing parameters,
monitoring system health and managing any error events.
Main Advantages: Enables new platform independent autonomous and intelligent onboard data
processing, increasing the range of compatible missions and platforms.

Structural System Block: Interfaces and Communications
Key Responsibilities: Provide different interfaces and communication channels to external
systems and for various internal data types.
Main Advantages: Configurable interface blocks increase the system flexibility and enable
greater design re-use, increased power efficiency, throughput and scalability.

Structural System Block: Payload Data Handling and Processing
Key Responsibilities: Provide processing resources for both data handling and
computationally demanding algorithms.
Main Advantages: Multiple hardware devices and software models increase task specific
throughput and power efficiency.

Structural System Block: Mass Memory
Key Responsibilities: Store and protect data.
Main Advantages: Scalable data storage capacity.
[Figure 3-2 content: block diagram of the Payload Data Processing System. Structural
blocks: System Controller (Processing Control, System Monitoring); Interfaces &
Communications (Payload Interface, Platform Interface, Downlink Interface); Payload Data
Handling & Processing (Handling: Data Formatting, Data Buffering; Processing: Advanced
Processing, Compression, Decompression); Mass Memory (multiple units). External
connections: Payload, Platform, Downlink. Key: Structural Block, Behavioural Block,
Control Signals, Payload Data, Telemetry Data.]
3.3    New System Architecture 
Taking inspiration from the research conducted into the terrestrial computing principles
defined in cluster and heterogeneous computing, and from the new behavioural and
structural system designs, a new GPU accelerated onboard data processing architecture is
proposed, as shown in Figure 3-3. Each major system block is discussed further in the
following subsections. Additional details on the selection of specific suitable devices
for the practical realisation of this architecture can also be found in Appendix D.
 
Figure 3-3 New onboard data processing system architecture
[Figure: block diagram of the Payload Data Processing System built around a backplane.
Hardware shown: CPU, multiple GPUs each with memory, FPGA, data and configuration
memories, and mass memory comprising multiple memory cards. Firmware/software functions
shown: Processing Control, Platform Comms, System Monitor, Advanced Processing,
Compression, Decompression, Interfacing, Data Formatting, Data Buffering. External
connections: Platform, Payload, Downlink. Key: Internal Connection, External Connection,
Firmware/Software, Hardware.]

3.3.1    Backplane
Employing cluster computing principles increases the scalability of the design, to help
ensure that a single system could be deployed across multiple different missions with
minimal impact to the hardware, software and behaviour of the system.
In the application of onboard data processing, utilising a backplane would allow the
amount of onboard processing resources and memory to be easily adjusted to suit varying
satellite platforms and mission requirements. This would help provide resilience against
future changes to payload data types, volumes and processing tasks. Additionally, a
backplane based system has fewer single points of failure and allows for graceful
degradation. When memory or processing resources fail, other processing and memory nodes
can take on additional load. Therefore, functionality can be gradually reduced as hardware
failures occur and the probability of a complete system failure is lowered. The backplane
can also be designed to facilitate the connection of different hardware devices which
might not traditionally be found in a single system. Backplanes can be implemented in a
number of ways; electrically they can be passive or active, and in terms of networking
they can provide serial or parallel communication channels.
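The graceful degradation behaviour can be illustrated with a minimal sketch in which the load of failed nodes is shared equally among the remaining healthy nodes and any excess above a per-node capacity is shed. The node names, load units, capacity figure and the equal-share policy are all assumptions made for the example, not part of the proposed architecture.

```python
from typing import Dict, Set


def redistribute_load(load_by_node: Dict[str, float], failed: Set[str]) -> Dict[str, float]:
    """Share the load of failed nodes equally among the healthy nodes."""
    healthy = [node for node in load_by_node if node not in failed]
    if not healthy:
        raise RuntimeError("no healthy nodes remain")
    orphaned = sum(load for node, load in load_by_node.items() if node in failed)
    share = orphaned / len(healthy)
    return {node: load_by_node[node] + share for node in healthy}


def served_load(load_by_node: Dict[str, float], capacity: float) -> float:
    """Total load actually served when each node is capped at its capacity."""
    return sum(min(load, capacity) for load in load_by_node.values())


# Hypothetical example: three processing nodes, each carrying 30 units of load.
nominal = {"gpu0": 30.0, "gpu1": 30.0, "gpu2": 30.0}
after_one_failure = redistribute_load(nominal, {"gpu1"})
```

With a per-node capacity of 40 units, the system in this example serves 80 of the original 90 units after one failure and 40 after two, rather than failing completely.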
 
3.3.2    FPGA 
Whilst traditionally a single FPGA is leveraged to achieve a suitable trade-off in perfor-
mance for all system functionality, advantages can be gained from the use of multiple 
hardware devices and software models to achieve task specific optimised system perfor-
mance, leveraging the principles of heterogeneous computing. As defined in the behavioural 
and structural system models all behavioural blocks that manipulate data can be classified as 
either control, payload data handling or payload data processing tasks. This classification is 
based on the type of data being used and the computational characteristics of the tasks.  
As shown in Figure 3-3, in the new proposed architecture, the FPGA will be responsible 
for the low-level data manipulation and handling tasks which require relatively low computa-
tional resources and need to be performed deterministically. The FPGA and the underlying 
programmable flexible logic cells will be leveraged to provide a fully customisable data path 
for efficient implementation of the payload and downlink streaming data interfaces and low-
level data handling tasks such as image buffering, tiling and pixel reordering.  
These are combinational logic tasks that would be very costly to implement in alternative
hardware. FPGAs also provide flexibility, meaning that changes can be made to the
FPGA design, thus negating the need for other system tasks to be modified to accommodate
differences between different payloads or missions. This allows for greater design reuse of
more complex functional blocks. The in-orbit reconfigurability also allows for new
modifications to be uploaded from the ground throughout a mission, potentially enabling a
satellite's operational lifetime to be increased and its performance to keep pace with
terrestrial developments.
 
3.3.3    CPU 
The proposed architecture, shown in Figure 3-3, also features a centralised controller for
the management of the system behaviour and for performing supplementary processing of
telemetry and command data. The overall behaviour of the proposed control model will be
largely sequential in nature, often performing tasks in response to other events occurring
in the system. This is characteristically well suited to implementation on low-latency CPU
devices. Additionally, control behaviour is well suited to being defined in software. The
advantage of software defined system control is that it can be configured at either
compile-time or run-time for modified functionality. This could include the dynamic
allocation or reallocation of tasks to individual computing nodes in the system based on
external factors, such as node health status, to change the characteristics of the system
for prioritised reliability or computational performance.
 
3.3.4    GPU 
A key novelty of the proposed new architecture is the inclusion of a GPU. There are two
major motivations for utilising GPUs in onboard data processing applications. Firstly, the
increased computational resources they provide allow advanced state-of-the-art algorithms
to be implemented, towards an increased achievable onboard compression ratio. Secondly,
they can be leveraged to increase the achievable onboard processing throughput, reducing
the time during which the data is not stored in protected mass memory, with the ultimate
goal of achieving a complete real-time data processing system.
These devices have been selected to accelerate the mathematical and compute intensive 
payload data processing algorithms. These are the behavioural tasks which are the most chal-
lenging to perform in real-time on current space proven hardware. GPUs provide a hardware 
architecture which is ideal to exploit the parallelism induced by increasing data volumes and 
dimensionality which can be leveraged for high processing throughput. Additionally, the in-
creased amount of computational resources provided by the GPU will help facilitate research 
towards new state-of-the-art high throughput advanced processing algorithms not currently 
deployed onboard. This could include advanced processing algorithms, such as calibration 
and registration techniques, or image quality and analysis algorithms to further increase the 
achievable onboard compression performance. 
In combination with the utilisation of a backplane, the new system also allows the
deployment of a variable number of GPU devices, providing a high level of scalability in
the computational resources of the system. This is a feature which is already widely
supported at a hardware, driver and software level by GPU device manufacturers.
Additionally, the dynamic loading and execution of individual programs by the GPU enables
the use of a highly configurable IP core-based software system, in which different
applications can be executed dynamically by the system controller. The decision to execute
different applications can be made either onboard or on the ground. This allows for a high
level of design re-use across different platforms and missions.
 
3.4    Additional Research Areas 
A key novelty of the proposed new onboard data processing architecture is its utilisation
of GPU hardware. However, no space proven GPU devices or applications currently exist,
and no published literature has yet demonstrated real-time or error resilient processing
on a low power GPU platform. To address this gap, new research is required into how to
effectively leverage the GPU hardware architecture and software model towards the
development of a real-time throughput and error resilient image processing application
suitable for a low power GPU platform.
The remainder of this thesis will discuss the research conducted to address these areas, 
leveraging the state-of-the-art CCSDS-123 lossless multidimensional image compression al-
gorithm as a case study. CCSDS-123 has been chosen because it represents the state-of-the-
art in onboard lossless image compression, as discussed in Chapter 2. Additionally, it is char-
acteristically representative of other typical onboard data processing algorithms, featuring 
elements which are highly sequential in nature, which can be challenging to implement effi-
ciently on GPU hardware.  
Whilst several publications have detailed GPU implementation strategies for the CCSDS-123
algorithm specifically [140][141], the implementations from this previous research have
not been made publicly available and the publications do not provide in-depth details of
the GPU specific design and development process. This makes it difficult for other
researchers to directly assess the design approach for alternative image processing
algorithms. Additionally, the previous research fails to address several key areas which
are pertinent to onboard use, including the characterisation of the impact of configurable
algorithm parameters on performance, and the investigation of the algorithm performance on
low power GPU platforms in an error prone environment.
Taking the shortcomings of previous literature into account, three key research areas
have been identified, summarised in Table 3-3, which will provide vital information to the
community to help facilitate the practical deployment of GPUs and the CCSDS-123 algorithm
towards a new state-of-the-art in onboard data processing.
 
Table 3-3 CCSDS-123 case study research questions 
1. GPU accelerated real-time application: 
 a. How do application parameters impact the algorithm and application performance? 
 b. Can these relationships be leveraged to predict optimum configurations in advance? 
 c. How does the application performance relate to the GPU architecture?  
 d. How do data characteristics impact the algorithm and application performance?  
2. Error resilient GPU application: 
 a. Which elements of the algorithm are most susceptible to radiation induced errors? 
 b. Can aspects of the GPU architecture or software model be leveraged to mitigate 
against observed error effects whilst minimising the induced overheads? 
 
3. From the findings of this research, can a new combined high throughput and error 
resilient GPU accelerated application development framework be proposed? 
 
 
Previous literature has shown that real-time CCSDS-123 image compression is possible 
on high-power GPUs. However, there are numerous user definable tiling, imagery and algo-
rithmic input parameters and the impacts of changing these are currently not well understood. 
Additionally, it is important to understand whether relationships between the parameters exist
and whether these can be related to the underlying GPU architecture. Previously, only a single
hyperspectral image with fixed tiling and algorithm parameters has been tested, and only on
high power consumption GPUs. Therefore, to address the first set of research questions, a
practical study into the algorithm and application performance variation with changing
imagery, tiling and algorithm parameters is required.
The other major onboard requirement which has not been significantly researched to date is
how a GPU accelerated image compression algorithm behaves in an error prone environment, and
whether there are aspects of the highly parallel GPU architecture and software model which
can be leveraged to increase error resilience. Whilst recent research has started to
practically assess the behaviour of GPU hardware devices in radiation prone environments
[62]-[67], research into how image processing algorithms behave and into effective
mitigation strategies for these algorithms on a GPU architecture is very immature. This area 
of research requires significant advancement in order to facilitate the wider adoption of both 
the CCSDS-123 algorithm and GPU devices in an onboard environment. 
 
3.5    Chapter 3 Summary 
Currently satellite onboard data processing systems typically employ RH processors. 
However, these devices are not able to provide the level of computational resources required 
by either current or future satellite EO missions. The requirement and the priority for onboard 
computational resources is increasing due to the need to alleviate the growing onboard data 
bottleneck and additionally to facilitate state-of-the-art image processing for a more intelli-
gent and automated data delivery chain. 
To provide a suitable solution, terrestrial computing devices and high-performance com-
puting architectures have been researched. Incorporating and adapting the principles from 
terrestrial solutions has resulted in the proposal of a new scalable heterogeneous onboard data 
processing architecture, Figure 3-3. The newly proposed onboard data processing system is 
based around a scalable and robust backplane architecture to provide flexibility and reliability 
for space missions. The main processing devices used are FPGAs and GPUs, creating a
heterogeneous computing solution. In the proposed new architecture, the FPGA is responsible
for the payload data interfacing, buffering and formatting functionality, whilst the
computationally intense image processing and compression are offloaded onto the GPU for
high throughput, ideally real-time, state-of-the-art data processing. This proposed architecture 
aims to facilitate the implementation of a processing pipeline that can enable the generation 
of new types of onboard data products and enable onboard data analysis for autonomous in-
telligent satellite operations whilst also providing flexibility and resilience against future 
technology advances. 
However, state-of-the-art low power GPU platforms have yet to be practically demon-
strated as being able to provide real-time error resilient advanced image processing. 
Advancing these fields will be key to facilitating the adoption of a new GPU accelerated 
onboard processing architecture and new GPU accelerated advanced image processing algo-
rithms to further help alleviate the onboard data bottleneck.  
CHAPTER 4, GPU ACCELERATED CCSDS-123 COMPRESSION 
To date GPU devices have not been deployed in the application of onboard data pro-
cessing in space. Whilst traditional FPGAs provide a flexible architecture suitable for 
concurrent processing, the degree of parallelism required to efficiently exploit the highly 
parallel nature of the underlying GPU architecture is considerably greater. As a result, 
alternative fundamental parallelisation approaches and application design techniques not 
common in onboard data processing will be required. In this Chapter parallelisation and 
application development techniques appropriate for GPUs are explored using the CCSDS-
123 lossless multidimensional image compression algorithm as a case study. The resulting
new SSC GPU accelerated CCSDS-123 image compression application is then used
to investigate the group of research questions identified in Table 3-3.
 
4.1    CCSDS-123 Algorithm Overview 
CCSDS-123 is a lossless predictive image compression algorithm based on the FL 
algorithm which was specifically designed for low complexity processing of multispectral 
and hyperspectral data sets. Due to its ability to achieve competitive compression perfor-
mance whilst minimising computational requirements, it was subsequently adopted and 
standardised by CCSDS [106]. The CCSDS-123 and FL algorithms exploit spectral data 
redundancies in addition to traditional spatial redundancies by utilising information from 
a small 3D neighbourhood of previously encoded pixels, from up to 15 adjacent image 
bands, to influence the prediction; see Figure 4-1, where p is the selected number of 
bands used to influence the prediction.  
 
Figure 4-1 FL causal template (panels: Band n, Band n-1, Band n-p; X marks the current sample)
 
The FL and CCSDS-123 algorithms are composed of several computational functions, such as
multiplication, addition and vector dot products, which are common elements of many data and
image processing algorithms. The algorithm is composed of five main computational stages.
A key feature is the use of the sign algorithm.
The sign algorithm is a low complexity variation of the least mean square algorithm used 
to produce optimised predictor weightings. In the development of the FL algorithm it was 
found that a predictor based solely on the sign algorithm yields poor convergence speeds 
and performance; therefore, in FL and CCSDS-123 it is used in combination with a local 
difference subtraction method.  
The five major stages of the algorithm are described below; each of these steps is
performed sequentially for each pixel in the image. In these equations, n is the index of
the current image band, p is the selected number of prediction bands and the sample
references (S#) correspond with Figure 4-1. First, within each spectral band, a local sum
of neighbouring pixels is calculated as shown in equation (4-1).
 
1) Local Sum Calculations
Band n:    $\sigma_0 = S_1 + S_2 + S_3 + S_4$
Band n-1:  $\sigma_5 = S_6 + S_7 + S_8 + S_9$
Band n-p:  $\sigma_{5p} = S_{5p+1} + S_{5p+2} + S_{5p+3} + S_{5p+4}$
(4-1)
 
 
 
Then the local difference values are calculated by subtracting the local sum from 4 
times the pixel value. There are two types of local differences used to construct the final 
local difference vector, shown in (4-4): the directional local differences, which use previ-
ously encoded pixels from the current band, calculated using equation (4-2), and the 
central local differences, calculated using equation (4-3), which are based on pixels from 
previously encoded bands.  
 
2) Local Difference Calculation

Directional Local Differences:
  $LD_N = 4 S_3 - \sigma_0$
  $LD_W = 4 S_1 - \sigma_0$
  $LD_{NW} = 4 S_2 - \sigma_0$
(4-2)

Central Local Differences:
  $LD_{n-1} = 4 S_5 - \sigma_5$
  $LD_{n-p} = 4 S_{5p} - \sigma_{5p}$
(4-3)

Local Difference Vector:
  $U_t = [LD_N, LD_W, LD_{NW}, LD_{n-1}, \ldots, LD_{n-p}]^{T}$
(4-4)
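
The first two stages can be illustrated with a short Python sketch, which is not taken from the SSC implementation. It follows equations (4-1) to (4-4) as written; the function names and the example sample values are hypothetical, and the samples are keyed by their S labels from Figure 4-1 (S3, S1 and S2 being the N, W and NW neighbours used in equation (4-2)).

```python
def local_sum(neighbours):
    """Equation (4-1): sum of the four neighbouring samples of one band."""
    if len(neighbours) != 4:
        raise ValueError("expected four neighbouring samples")
    return sum(neighbours)


def local_difference(sample, sigma):
    """Equations (4-2) and (4-3): four times a sample minus a local sum."""
    return 4 * sample - sigma


def local_difference_vector(current, previous_bands):
    """Equation (4-4): [LD_N, LD_W, LD_NW, LD_{n-1}, ..., LD_{n-p}].

    current: dict holding S1..S4 of the current band (hypothetical layout).
    previous_bands: list of (centre, [four neighbours]) tuples, one per
        prediction band, ordered from band n-1 to band n-p.
    Returns the current band's local sum and the local difference vector.
    """
    sigma_0 = local_sum([current["S1"], current["S2"], current["S3"], current["S4"]])
    # Directional differences use N (S3), W (S1) and NW (S2), in that order.
    directional = [local_difference(current[key], sigma_0) for key in ("S3", "S1", "S2")]
    # Central differences use the centre sample and local sum of each earlier band.
    central = [local_difference(centre, local_sum(nbrs)) for centre, nbrs in previous_bands]
    return sigma_0, directional + central


# Made-up sample values, with one prediction band (p = 1).
sigma_0, diffs = local_difference_vector(
    {"S1": 10, "S2": 12, "S3": 11, "S4": 13},
    [(9, [8, 10, 9, 11])],
)
print(sigma_0, diffs)
```

Each additional prediction band adds one entry to the end of the vector, so its length is p + 3 when all three directional differences are used.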
 
The final predictor is then equal to the local sum for the current band plus a weighted
sum of the local difference vector. The weighted sum is calculated as the dot product of
the local difference vector $U_t$ and the weightings vector $W_t$: the two vectors are
multiplied element by element and the resulting products are summed, as shown in
equation (4-5).

3) Final Weighted Predictor  $\tilde{s}_t = \sigma_0 + W_t^{T} \cdot U_t$  (4-5)
 
Following this, the prediction residual is calculated as a simple error value, equation
(4-6), where $s_t$ is the actual value of the current sample (X in Figure 4-1). The
prediction residual is then mapped to an unsigned integer value, which is output by the
CCSDS-123 and FL algorithms. The mapped residuals are then passed to an entropy encoding
scheme which is responsible for generating the final compressed bitstream.

4) Mapped Prediction Residual Calculation  $e_t = s_t - \tilde{s}_t$  (4-6)
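
As a minimal sketch of stages 3 and 4, the Python below evaluates equations (4-5) and (4-6) in the simplified form given here. The function names and numbers are hypothetical, and the scaling, clipping and unsigned mapping applied by the CCSDS-123 standard are deliberately left out.

```python
def predict(sigma_current, weights, differences):
    """Equation (4-5): local sum plus the dot product of weights and differences."""
    if len(weights) != len(differences):
        raise ValueError("weights and differences must have equal length")
    # Dot product: multiply element by element, then sum the products.
    weighted_sum = sum(w * d for w, d in zip(weights, differences))
    return sigma_current + weighted_sum


def prediction_residual(actual, predicted):
    """Equation (4-6): actual sample value minus predicted value."""
    return actual - predicted


# Hypothetical values chosen only to make the arithmetic easy to follow.
predicted = predict(40, [2, 0, 1, 0], [-2, -6, 2, -2])
error = prediction_residual(41, predicted)
print(predicted, error)
```

The dot product is the only step whose cost grows with the number of prediction bands, since the weight and local difference vectors both gain one element per band.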
 
The final stage of the compression algorithm is to update the predictor weightings
vector. The updated values are calculated using the sign algorithm, equation (4-7), where
ρ is a user defined scaling parameter.

5) Update Prediction Weightings Using Sign Algorithm  $W_{t+1} = W_t - \rho \cdot U_t \cdot \mathrm{sgn}(e_t)$  (4-7)
 
Algorithm 1 provides a simplified structural overview of CCSDS-123, highlighting
key functional blocks and showing the overall data flow. It omits control blocks and user
defined parameter usage for simplicity. Ultimately, CCSDS-123 operates in an iterative
manner over each pixel sequentially, as represented by the nested loops in lines 2-4, and
follows the same functional structure as the previous equations.
 
 
 
Algorithm 1 – Serial Simplified CCSDS-123
 1.  Procedure CCSDS-123 (*IN, *OUT)
 2.    for (z = 1; z <= IM_BANDS; z++)
 3.        for (y = 1; y <= IM_HEIGHT; y++)
 4.          for (x = 1; x <= IM_WIDTH; x++)
 5.                if (y==1) && (x==1) then
 6.                    Initialise_Weights(*W)
 7.                end
 8.                Calc_Local_Sums(*IN, *LS)
 9.                Calc_Local_Differences(*LS, *IN, *LD)
10.                PRED_LD = Dot_Product(*W, *LD)
11.                PRED = Calc_Prediction(PRED_LD, LS)
12.                Update_Weights(*W, *LD, PRED, *IN)
13.                RESIDUAL = Map_Residual(PRED, *IN)
14.                *OUT = Entropy_Encode(RESIDUAL)
15.            end
16.        end
17.    end
 
In standardisation to CCSDS-123, several modifications were made to the original 
algorithm. This includes a prediction mode specifically designed to suit the data patterns 
produced by push-broom sensors. As the characteristics of individual detector elements in 
push-broom sensors can vary, the correlation between spectral bands can also vary with 
cross-track position. Therefore, for push-broom data it was found that letting the local 
mean and local difference be related to only the previous sample in the same cross-track 
position, results in a significantly better compression ratio than the original method. 
These new methods are shown in Figure 4-2 and Figure 4-3 respectively. In addition to
these operational modes, there are several key user definable parameters for fine-tuning
the CCSDS-123 algorithm operation; a summary of these is given in Appendix E.
 
 
 
Figure 4-2 CCSDS-123 local sum modes: A) Neighbour-orientated local mean; B) Column-orientated local mean

Figure 4-3 CCSDS-123 local difference modes: A) Full prediction mode; B) Reduced prediction mode
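
The two local sum modes and two prediction modes can be sketched in Python as follows. This is an illustration rather than the SSC code: the function names are hypothetical, the factor of four in the column-orientated mode follows the CCSDS-123 standard rather than the text above, and the reduced mode is assumed to drop the three directional differences, consistent with Figure 4-3 B.

```python
def local_sum(s1, s2, s3, s4, mode="neighbour"):
    """Local sum of the current band for either CCSDS-123 local sum mode."""
    if mode == "neighbour":
        # Neighbour-orientated: all four previously encoded neighbours.
        return s1 + s2 + s3 + s4
    if mode == "column":
        # Column-orientated: only S3, the previous sample in the same
        # cross-track position, scaled by four (assumption from the standard).
        return 4 * s3
    raise ValueError(f"unknown local sum mode: {mode}")


def difference_vector(directional, central, mode="full"):
    """Assemble the local difference vector for either prediction mode."""
    if mode == "full":
        return list(directional) + list(central)
    if mode == "reduced":
        # Reduced mode keeps only the central differences of earlier bands.
        return list(central)
    raise ValueError(f"unknown prediction mode: {mode}")


print(local_sum(10, 12, 11, 13, mode="neighbour"))
print(local_sum(10, 12, 11, 13, mode="column"))
print(difference_vector([-2, -6, 2], [-2, 5], mode="reduced"))
```

In reduced mode the weight vector shrinks by three entries, which also reduces the per-band storage needed for the weights.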
4.2    CCSDS-123 Parallelisation Approaches  
The CCSDS-123 algorithm was not specifically designed to provide high inherent 
parallelism, as it was originally conceived with the goal of low resource utilisation within 
an FPGA. To assess potential parallelisation, it is important to understand the functional 
and data dependencies. Figure 4-4 shows a functional block diagram of the CCSDS-123 
algorithm, where the arrows indicate functional dependencies and blocks are highlighted 
to indicate the different data dependencies within each functional block.  
 
Figure 4-4 CCSDS-123 functional block diagram  
 
Several functional blocks in this algorithm contain no internal data dependencies
(blocks with no background shading in Figure 4-4). These blocks could be fully
parallelised, exposing DLP equal to the number of pixels in the image. However, several
blocks, shaded in grey in Figure 4-4, contain spatial dependencies, limiting the
parallelism that can be exploited to the number of spectral bands. This dependency is
induced by the spatial serial dependency of the weight update feedback loop and the
subsequent dependencies of the weight vector itself. Only two blocks, weight update and
residual mapping, exhibit TLP, which highlights the highly sequential nature of the
underlying algorithm. In order to increase the amount of TLP, pre-processing such as image
tiling can be leveraged. Performing image tiling increases the amount of TLP for the
algorithm, as each independent image tile can now be compressed in parallel, increasing the
overall data processing throughput.
 
Figure 4-4 block labels: Input Sample; Local Sum; Local Difference Vector; Weight Vector; Weight Update; Predictor; Prediction; Prediction Error; Residual Mapping; Sample-Adaptive Entropy Encoder; Counter & Accumulator Update; VLC Parameter; Binary Codeword; Compressed Bitstream. Data dependency key: Spatially & Spectrally Independent; Spectrally Independent.
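
The effect of tiling on the available parallelism can be sketched in Python. The example assumes tiles are formed by splitting the image along Y only, and uses the 224-band, 512-row image tiled into 32-row strips that appears later in Table 4-2; the function names are hypothetical.

```python
def tile_row_ranges(image_height, tile_height):
    """Split the image along Y into tiles of at most tile_height rows.

    Returns one (first_row, end_row) pair per tile, end_row exclusive.
    """
    if image_height <= 0 or tile_height <= 0:
        raise ValueError("heights must be positive")
    return [
        (start, min(start + tile_height, image_height))
        for start in range(0, image_height, tile_height)
    ]


def independent_streams(bands, tiles):
    """Band-level parallelism multiplied by the number of independent tiles."""
    return bands * tiles


tiles = tile_row_ranges(512, 32)
print(len(tiles), independent_streams(224, len(tiles)))
```

Without tiling, the spatially dependent blocks expose one independent stream per band; each tile multiplies that figure, at the cost of restarting the predictor weights at every tile boundary.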
4.2.1  Full DLP Approach 
An approach often taken in the parallelisation of a problem is to separate the algorithm
based on the parallelism available at each stage. This strategy allows the maximum
level of parallelism to be exploited at each stage; in CUDA this can be implemented using
separate kernels to represent differences in the underlying parallelism. An example of this
approach applied to the CCSDS-123 algorithm is shown in Algorithm 2.
 
Algorithm 2 – Full DLP CCSDS-123
 1.  Kernel LOCAL_SUM_DIFF (*IN, *LS, *LD)
 2.    Pix = threadID
 3.    LS(Pix) = Calc_Local_Sums(*IN)
 4.    Calc_Local_Differences(*LS, *IN, *LD)

 5.  Kernel CCSDS-123 (*IN, *LS, *LD, *LEN, *CWRD)
 6.    z = threadID
 7.    for (y = 1; y <= IM_HEIGHT; y++)
 8.        for (x = 1; x <= IM_WIDTH; x++)
 9.            if (y==1) && (x==1) then
10.                Initialise_Weights(*W)
11.            end
12.            PRED_LD = Dot_Product(*W, *LD)
13.            PRED = Calc_Prediction(PRED_LD, *LS)
14.            Update_Weights(*W, *LD, PRED, *IN)
15.            RESIDUAL = Map_Residual(PRED, *IN)
16.            Calc_CWRD(RESIDUAL, *CWRD, *LEN)
17.        end
18.    end

19.  Kernel INCLUSIVE_SUM (*LEN)
20.    Inclusive_Sum(*LEN)

21.  Kernel BIT_PACKER (*CWRD, *LEN, *OUT)
22.    Pix = threadID
23.    *OUT = Bit_Packer(CWRD(Pix), LEN(Pix))

     Host Code
24.  Initialise_H_mem
25.  Initialise_G_mem
26.  Copy_H2G_mem(*H_IN, *G_IN)
27.  threads = IM_BANDS x IM_HEIGHT x IM_WIDTH
28.  LOCAL_SUM_DIFF<<<threads>>>(*G_IN, *G_LS, *G_LD)
29.  threads = IM_BANDS
30.  CCSDS-123<<<threads>>>(*G_IN, *G_LS, *G_LD, *G_LEN, *G_CWRD)
31.  threads = IM_BANDS(log(IM_BANDS))
32.  INCLUSIVE_SUM<<<threads>>>(*G_LEN)
33.  threads = IM_BANDS x IM_HEIGHT x IM_WIDTH
34.  BIT_PACKER<<<threads>>>(*G_CWRD, *G_LEN, *G_OUT)
35.  Copy_G2H_mem(*H_OUT, *G_OUT)
This implementation and parallelisation approach is likely similar to the methodology
used by D. Keymeulen et al. in the first GPU implementation of the CCSDS-123
algorithm [140]. The first kernel, LOCAL_SUM_DIFF, lines 1-4, is responsible for
calculating all local sum and local difference values for the whole image. It has no data
dependencies; therefore a maximum number of parallel threads equal to the number of
spectral bands in the image times the height and width of the image in pixels can be
declared, line 27. The second kernel, CCSDS-123, lines 5-18, is responsible for the majority
of the computational work of the algorithm. It takes in the input image pixel information
and the previously calculated local sums (LS), local differences (LD) and weights (*W),
and outputs the calculated variable length binary codewords (CWRDS) and codeword length
(LEN) information to global memory.
The constituent functions of this kernel have the greatest data dependencies and can
be considered the parallel bottleneck of the algorithm. The maximum degree of DLP that
can be exposed is only equal to the number of bands in the image, line 29. The third
kernel, INCLUSIVE_SUM, lines 19-20, is responsible for calculating the cumulative sum of
the lengths of the binary codewords. This gives the offset location for each codeword in
the final compressed bitstream. The final kernel is BIT_PACKER, lines 21-23, which
takes in the CWRDS and cumulatively summed LEN data to generate a single bitstream
of consecutive variable length binary codewords (*OUT). By pre-calculating the offset
locations with INCLUSIVE_SUM, the data dependencies of the final compressed bitstream
construction functional block are eliminated, thus exposing the maximum amount of DLP,
line 33.
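
The way an inclusive sum removes the serial dependency from bit packing can be shown with a small Python sketch. It is a host-side illustration only, not the GPU kernel: the function names are hypothetical, and the output is modelled as a list of individual bits rather than the packed words a real kernel would write.

```python
from itertools import accumulate


def codeword_offsets(lengths):
    """Start bit of every codeword, derived from an inclusive sum of lengths.

    Returns the list of start offsets and the total stream length in bits.
    """
    inclusive = list(accumulate(lengths))
    starts = [end - n for end, n in zip(inclusive, lengths)]
    total = inclusive[-1] if inclusive else 0
    return starts, total


def pack_bits(codewords, lengths):
    """Place each variable length codeword at its pre-computed offset."""
    starts, total = codeword_offsets(lengths)
    stream = [0] * total
    # Every iteration writes a disjoint bit range, so the iterations are
    # independent and could be assigned to one thread per codeword.
    for word, n, start in zip(codewords, lengths, starts):
        for i in range(n):
            stream[start + i] = (word >> (n - 1 - i)) & 1
    return stream


packed = pack_bits([0b1, 0b010, 0b11], [1, 3, 2])
print(packed)
```

Because each codeword's position depends only on the summed lengths, the outer loop can be executed in any order and still produces the same stream as sequential concatenation.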
The disadvantage of this approach is that all intermediate results (the LS, LD, CWRDS
and LEN arrays) must be stored in global off-chip DRAM, as this is the only memory
which facilitates communication between different kernels. This memory has large
associated latencies, causing the implementation to become memory bandwidth bound. This
is a poor match for the key strength of the GPU, which is its ability to hide the
instruction latencies associated with computationally bound problems.
4.2.2  Limited DLP Approach 
An alternative parallelisation approach, which is typically better suited for memory
bandwidth bound problems, is to limit the degree of parallelism exploited and implement 
fewer kernels. By reducing the number of kernels, the kernel and memory initialisation 
overhead is reduced, and the opportunity for the GPU to exploit low-level ILP and utilise 
low latency on-chip memory is increased. This alternative approach applied to the 
CCSDS-123 algorithm is demonstrated in Algorithm 3. This implementation is likely 
similar to the methodology used by B. Hopson et al in their GPU implementation of the 
CCSDS-123 algorithm [140]. 
 
4.2.3  Hybrid Approach 
Considering the previous approaches and conducting new research, a new hybrid
DLP approach is proposed and summarised in Algorithm 4. Due to the size
of the input and output arrays, *LEN, *CWRD and *OUT, each of these will always need
to be stored in global memory. Therefore, by implementing INCLUSIVE_SUM and
BIT_PACKER as separate kernels, lines 17-21, additional DLP can be exploited
without relegating any data to higher latency memory structures.
Algorithm 3 – Limited DLP CCSDS-123
 1.  Kernel CCSDS-123 (*IN, *OUT)
 2.    z = threadID
 3.    for (y = 1; y <= IM_HEIGHT; y++)
 4.       for (x = 1; x <= IM_WIDTH; x++)
 5.             if (y==1) && (x==1) then
 6.                 Initialise_Weights(*W)
 7.             end
 8.             LS = Calc_Local_Sum(*IN)
 9.             Calc_Local_Differences(LS, *IN, *LD)
10.             PRED_LD = Dot_Product(*W, *LD)
11.             PRED = Calc_Prediction(PRED_LD, LS)
12.             Update_Weights(*W, *LD, PRED, *IN)
13.             RESIDUAL = Map_Residual(PRED, *IN)
14.             Calc_CWRD(RESIDUAL, CWRD, LEN)
15.        end
16.    end
17.    OFFSET(z) += LEN
18.    *OUT_Z = Bit_Packer(CWRD, OFFSET(z))
19.    if (threadID == 1) then
20.        *OUT = Combine_Bands(*OUT_Z, *OFFSET)
21.    end

     Host Code
22.  Initialise_H_mem
23.  Initialise_G_mem
24.  Copy_H2G_mem(*H_IN, *G_IN)
25.  threads = IM_BANDS
26.  CCSDS-123<<<threads>>>(*G_IN, *G_OUT)
27.  Copy_G2H_mem(*H_OUT, *G_OUT)
 
4.3    New Parallel GPU CCSDS-123 Application 
Before starting the development of a new application, it is important to understand 
and define any requirements and constraints that will impact application design decisions. 
The broad requirements defined for the new CCSDS-123 application development are 
given in Table 4-1. They have been defined in line with the proposed GPU accelerated 
onboard data processing architecture discussed in Chapter 3. 
Algorithm 4 – Hybrid DLP CCSDS-123
 1.  Kernel CCSDS-123 (*IN, *LEN, *CWRD)
 2.    z = threadID
 3.    for (y = 1; y <= IM_HEIGHT; y++)
 4.       for (x = 1; x <= IM_WIDTH; x++)
 5.             if (y==1) && (x==1) then
 6.                 Initialise_Weights(*W)
 7.             end
 8.             LS = Calc_Local_Sum(*IN)
 9.             Calc_Local_Differences(LS, *IN, *LD)
10.             PRED_LD = Dot_Product(*W, *LD)
11.             PRED = Calc_Prediction(PRED_LD, LS)
12.             Update_Weights(*W, *LD, PRED, *IN)
13.             RESIDUAL = Map_Residual(PRED, *IN)
14.             Calc_CWRD(RESIDUAL, *CWRD, *LEN)
15.        end
16.    end

17.  Kernel INCLUSIVE_SUM (*LEN)
18.    Inclusive_Sum(*LEN)

19.  Kernel BIT_PACKER (*CWRD, *LEN, *OUT)
20.    Pix = threadID
21.    OUT(Pix) = Bit_Packer(CWRD(Pix), LEN(Pix))

     Host Code
22.  Initialise_H_mem
23.  Initialise_G_mem
24.  Copy_H2G_mem(*H_IN, *G_IN)
25.  threads = IM_BANDS
26.  blocks = IM_TILES
27.  CCSDS-123<<<blocks, threads>>>(*G_IN, *G_LEN, *G_CWRD)
28.  threads = IM_BANDS(log(IM_BANDS))
29.  INCLUSIVE_SUM<<<threads>>>(*G_LEN)
30.  threads = IM_BANDS x IM_HEIGHT x IM_WIDTH
31.  BIT_PACKER<<<threads>>>(*G_CWRD, *G_LEN, *G_OUT)
32.  Copy_G2H_mem(*H_OUT, *G_OUT)
33.  Free G_mem
Table 4-1 Broad CCSDS-123 application requirements
Input Data Characteristics | Process multispectral and hyperspectral EO imagery
Hardware Characteristics | Use low power (<20W TDP) GPU platform
Algorithm Performance | Maximise compression ratio (>2:1)
Implementation Performance | Maximise throughput (>imager data rate); minimise SDC & FI rates
 
 
 
Figure 4-5 provides more information on the GPU specific implementation of this
hybrid parallelisation approach, including the memory allocation requirements and kernel
configuration parameters. The resulting SSC GPU implementation adheres to the
CCSDS-123 standard; a summary of the capabilities and compliance it provides can be
found in Appendix E.
 
Figure 4-5 SSC CCSDS-123 GPU application design summary  
 
 
Hardware | Operations | Memory Allocation (Bytes)
Host | Load Compression Parameters | 21
Host | Load Image Samples | 2 * X Size * Y Size * Z Size
Device | I. CCSDS-123 Kernel: Registers/Thread = 72; Grid Size = # Tiles; Block Size = Z Size; Inputs = Parameters, Samples; Outputs = Codewords, Lengths | Global Memory = (14 * X Size * Y Size * Z Size) + 21; Shared Memory = 8 * Z Size * Weights_Length (note 1)
Device | II. THRUST Inclusive Sum: Input = Lengths; Output = Lengths | NA
Device | III. Bit Packer Kernel: Registers/Thread = 20; Grid Size = 1024; Block Size = Ceil{(X Size * Y Size * Z Size) / 1024}; Inputs = Codewords, Lengths; Outputs = Compressed Stream | NA (note 2)
Host | Store Compressed Stream | Variable, but ≤ 2 * X Size * Y Size * Z Size

Note 1: Weights_Length = #Prediction Bands + 3 if Prediction Mode = FULL, otherwise Weights_Length = #Prediction Bands (both are user defined parameters, see note 3)
Note 2: No new memory is allocated; input sample memory is reused for the compressed stream
Note 3: User defined algorithm parameters, see Appendix E for details
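
The memory allocation expressions in Figure 4-5 can be evaluated with a short Python sketch. It is a worked calculation under stated assumptions: X Size, Y Size and Z Size are treated as the dimensions of the whole image, the example uses the 614 x 512 x 224 image of Table 4-2, and the choice of three prediction bands in full prediction mode is hypothetical. The function names are illustrative.

```python
def weights_length(prediction_bands, full_mode):
    """Note 1 of Figure 4-5: prediction bands, plus 3 in full prediction mode."""
    return prediction_bands + (3 if full_mode else 0)


def memory_budget(x_size, y_size, z_size, prediction_bands, full_mode):
    """Byte counts from Figure 4-5 for one image configuration."""
    samples = x_size * y_size * z_size
    return {
        "host_parameters": 21,
        "host_samples": 2 * samples,
        "device_global": 14 * samples + 21,
        "device_shared": 8 * z_size * weights_length(prediction_bands, full_mode),
        "max_compressed": 2 * samples,
    }


# Assumed configuration: whole image dimensions, P = 3, full prediction mode.
budget = memory_budget(614, 512, 224, prediction_bands=3, full_mode=True)
for name, size in budget.items():
    print(f"{name}: {size} bytes")
```

Under these assumptions the global memory requirement is seven times the size of the input samples, which is relevant when targeting embedded GPU platforms with limited DRAM.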
 
As per the approach detailed in Algorithm 4, the application is made of three kernels:
the new CCSDS-123 and Bit Packer kernels, and a THRUST Inclusive Sum kernel. THRUST
is an open-source parallel algorithms library which provides a flexible high-level interface
for GPU optimised routines [142]. It features an abundant collection of data parallel
primitives such as scan, sort, and reduce, which are key building blocks of many complex
algorithms. The library is included with the CUDA toolkit and extensive documentation
is available online. THRUST algorithms can be called from both host and device code,
and can be executed in either location, with different parallelisation policies provided
for each. In this implementation, the THRUST Inclusive Sum algorithm was utilised
to calculate the offset locations in the bitstream for the variable length codewords
generated by the CCSDS-123 algorithm. By calculating this offset as an inclusive sum of the
individual codeword lengths, the packing of the codewords into the final bitstream can be
performed with no serial dependencies.
Towards the design and optimisation of the new CCSDS-123 and Bit Packer kernels,
fundamental knowledge of the GPU architecture and software model, discussed in
Sections 2.4.1 and 2.4.2, was leveraged to investigate and trade off a number
of key application design and optimisation approaches. To summarise the main features 
of the new CCSDS-123 and Bit Packer kernels, the following subsections discuss the key 
design decisions researched and the resulting design rules which have been established 
and deployed. Whilst the proposed design rules are based on the findings from the 
CCSDS-123 specific application development, they have been expanded to be generic to 
help the design of future new GPU accelerated image processing algorithms.  
4.3.1  Input Data Organisation 
As shown in Figure 4-5, the new CCSDS-123 kernel has been primarily designed to 
exploit DLP by declaring the number of threads per block equal to the number of spectral 
bands in the image. In this case, each GPU thread is responsible for the processing of all 
the pixels in a single z plane of the image; therefore threadID = z. A key factor for high GPU
processing throughput is ensuring that global memory operations are fully coalesced. As the 
input pixel values will need to be stored in global memory, it is important to ensure that 
all pixel values are read contiguously in memory.  
Traditionally, image data is stored in a Band Sequential (BSQ) format. In one
dimensional memory this equates to pixels being stored firstly by incrementing the x
dimension index, then y and finally z. This format creates large strides between pixel
values which only differ in the z index, and because in the new application the threadID
equates to the z index, this results in a large number of uncoalesced global memory
transactions, as pictured in Figure 4-6. Therefore, a more GPU optimised data ordering
is for the pixels to be ordered by incrementing z, then x and finally y. This is
called the Band Interleaved by Pixel (BIP) order; if the data is reordered into BIP
format prior to GPU processing, the memory transactions are coalesced for
improved processing throughput, as shown in Figure 4-6.
 
Figure 4-6 Data ordering for coalesced memory operations (panels: BSQ Memory Order, BIP Memory Order; instruction shown: R1 = LD[IN(0,0,Z)])
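
The difference between the two orderings can be made concrete by computing the linear memory index for each. The Python sketch below uses hypothetical function names and the tiled AVIRIS dimensions from Table 4-2 (614 x 32 x 224); it models one warp in which lane number equals z and every lane reads pixel (0, 0, z).

```python
def bsq_index(x, y, z, width, height):
    """Band Sequential: x varies fastest, then y, then z."""
    return (z * height + y) * width + x


def bip_index(x, y, z, width, bands):
    """Band Interleaved by Pixel: z varies fastest, then x, then y."""
    return (y * width + x) * bands + z


def warp_stride(index_of, lanes=32):
    """Largest gap between the addresses read by consecutive lanes of a warp."""
    addresses = [index_of(z) for z in range(lanes)]
    return max(b - a for a, b in zip(addresses, addresses[1:]))


WIDTH, HEIGHT, BANDS = 614, 32, 224
bsq_stride = warp_stride(lambda z: bsq_index(0, 0, z, WIDTH, HEIGHT))
bip_stride = warp_stride(lambda z: bip_index(0, 0, z, WIDTH, BANDS))
print(bsq_stride, bip_stride)
```

A stride of one element means the 32 lanes of a warp touch consecutive addresses, which is the access pattern that allows the hardware to combine them into a small number of memory transactions.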
 
 
Design Rule A: Ensure input data is organised to match the underlying DLP and the 
thread organisation which exploits it. 
 
 
In addition to the DLP induced by the image spectral bands the new application also 
leverages image tiling pre-processing to expose additional TLP to the GPU architecture. 
In order to maintain high GPU memory access efficiency, the input data needs to be reor-
dered so that data is firstly in BIP format and secondly that the tiles are stored 
sequentially in memory.  
 
Design Rule B: Additional data induced TLP can increase parallelism where inherent
DLP is low; ensure data ordering reflects TLP to maintain coalesced memory accesses.
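
A host-side reordering that satisfies both design rules can be sketched in Python. This is an illustration with hypothetical names, not the SSC pre-processing code; it additionally allows tiles to be narrower than the image, which is an assumption beyond the full-width tiles used in this chapter.

```python
def bsq_to_tiled_bip(bsq, width, height, bands, tile_width, tile_height):
    """Reorder a flat BSQ image so that tiles are contiguous and BIP inside."""
    if len(bsq) != width * height * bands:
        raise ValueError("input length does not match the image dimensions")
    out = []
    for tile_y in range(0, height, tile_height):
        for tile_x in range(0, width, tile_width):
            # Within a tile: z varies fastest, then x, then y (BIP order).
            for y in range(tile_y, min(tile_y + tile_height, height)):
                for x in range(tile_x, min(tile_x + tile_width, width)):
                    for z in range(bands):
                        out.append(bsq[(z * height + y) * width + x])
    return out


# Tiny 4 x 2 image with 2 bands; each value equals its own BSQ index.
image = list(range(4 * 2 * 2))
reordered = bsq_to_tiled_bip(image, 4, 2, 2, tile_width=2, tile_height=2)
print(reordered)
```

When the tiles span the full image width, as in Table 4-2, this ordering is identical to plain BIP, because BIP already stores complete rows one after another.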
 
4.3.2  CCSDS-123 Kernel Organisation – Memory Hierarchy 
The new CCSDS-123 kernel is the main computational block of the GPU application.
To ensure high data processing throughput, the computational work has been
organised to take advantage of the lowest latency constructs of the memory hierarchy
wherever possible. As a result, all individual thread local variables, such as configuration 
parameters and intermediate results, are stored in registers, whilst whole array variables,
which include the weight and local difference vectors, are stored in the user managed shared
memory. Shared memory facilitates thread cooperation and data re-use, in addition to
being lower latency than the alternative, global memory, for the storage of array variables.
However, the use of shared memory requires careful thread synchronisation to ensure that
all threads complete the appropriate operations in the correct order where instruction
dependencies occur. In the CCSDS-123 kernel only a few synchronisation barriers are
required; thus there are no significant increases in operational stalls, and overall the
processing throughput is increased by utilising shared memory.
 
Design Rule C: Maximise the use of low latency on-chip memory including registers and 
shared memory. 
 
4.3.3  CCSDS-123 Kernel Organisation – Kernel Occupancy 
In GPU computing a common approach to ensure maximum kernel processing 
throughput is to assess and maximise the theoretical kernel occupancy. An initial occu-
pancy and profiling assessment of the kernel was performed using an application 
configuration suitable for the AVIRIS Hawaii hyperspectral test image, details of which 
are given in Table 4-2. The image and tiling configurations used for this test were
selected specifically due to their use in previous literature [140][141].
 
Table 4-2 AVIRIS Hawaii test image characteristics [143] 
Number of Bands (Z) | Dynamic Range | Original Width (X) | Original Height (Y) | Tiled Width (X) | Tiled Height (Y)
224 | 12 bpp | 614 Pixels | 512 Pixels | 614 Pixels | 32 Pixels
 
The key kernel characteristics and calculated theoretical occupancy of the new SSC 
CCSDS-123 kernel, configured for the AVIRIS Hawaii tiled image and compiled for the 
NVIDIA Maxwell architecture with no optimisation compiler flags, are given in Table 
4-3. Examining the profiling results, the key limiting factor for kernel occupancy was the 
relatively high register usage. The targeted Maxwell architecture provides up to 65536 
registers for each SM, and in this configuration the new CCSDS-123 kernel used 78 reg-
isters per thread and had 224 threads per block. This results in the usage of 17472 
registers per block, which limits each SM to simultaneously executing 3 blocks. When 
compared to the theoretical maximum number of blocks, equal to 9, this equates to a 
maximum theoretical occupancy of 33%. To try and improve the theoretical occupancy 
approaches to reduce the register usage of the application are explored. 
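
The register arithmetic above can be reproduced with a short calculation. The following is a minimal Python sketch rather than the profiler's own method: the function name is hypothetical, the limit of 64 resident warps per SM is taken from the 100% occupancy case reported later in Table 4-6, and register allocation granularity is ignored for simplicity.

```python
def register_limited_occupancy(regs_per_thread, threads_per_block,
                               regs_per_sm=65536, max_warps_per_sm=64,
                               warp_size=32):
    """Return (blocks per SM, active warps per SM, occupancy) for a
    kernel whose occupancy is limited by register usage."""
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    regs_per_block = regs_per_thread * threads_per_block
    # Blocks that fit in the register file, capped by the warp limit of the SM.
    blocks_per_sm = min(regs_per_sm // regs_per_block,
                        max_warps_per_sm // warps_per_block)
    active_warps = blocks_per_sm * warps_per_block
    return blocks_per_sm, active_warps, active_warps / max_warps_per_sm


# 78 registers per thread and 224 threads per block, as quoted in the text.
blocks, warps, occupancy = register_limited_occupancy(78, 224)
print(blocks, warps, round(100 * occupancy))  # 3 21 33
```

With 224 threads per block each block holds 7 warps, so 3 resident blocks give 21 active warps out of 64, which is the 33% figure quoted above.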
 
Table 4-3 SSC CCSDS-123 naturally compiled characteristics
Kernel Characteristics                      | Register Usage Limited Occupancy Statistics
Number of   Number of threads   Registers   | Active Warps   Active Thread   Occupancy
blocks      per block           per Thread  | per SM         Blocks per SM   of each SM
16          224                 78          | 21             3               33%
 
 
To understand the cause of the high register usage, the NVIDIA Nsight platform and 
NVIDIA nvprof application were used to perform in-depth profiling analysis of the 
CCSDS-123 kernel. Figure 4-7 gives the breakdown of the percentage occurrence of each
instruction, grouped by instruction type.
 
Figure 4-7 Instruction occurrence in SSC CCSDS-123 kernel based on operation
[Bar chart. Y axis: Instruction Occurrence (%), 0 to 50. X axis: individual instructions,
grouped as IADD_IMUL, SHUF_LOP, PRED, LD, CONV, ST, FL and MISC.]

From this figure it is clear that computationally the CCSDS-123 kernel is heavily
based upon integer addition and multiplication (IADD_IMUL) operations, which together
equate to almost 50% of all instructions. The kernel has also been designed to leverage
low latency shuffle and logical (SHUF_LOP) operations where possible, which account
for over 10% of all instructions. Both of these groups of instructions exclusively output to
general purpose registers. Whilst this minimises instruction latencies and reduces higher
latency memory structure accesses, it increases the register usage of the kernel.
Additionally, because the CCSDS-123 kernel contains the bulk of the computational work, it
exhibits a high dynamic execution count. Whilst this allows ILP to minimise GPU stalls, it
also contributes to the overall high register usage.
One approach to reduce register usage is to reduce the amount of work conducted in 
the kernel by implementing multiple separate smaller kernels. However, as shown by the 
full-DLP approach discussed in Section 4.2.1, this would not be beneficial for the new 
application as it requires intermediate values to be passed between the kernels using high 
latency off-chip memory.  
Another approach to reduce register usage is to leverage available compiler
constructs. When the NVIDIA CUDA compiler, nvcc, turns the source code into GPU
machine code it decides upon the number of registers used for each kernel. It makes this
decision to strike a balance that performs well across a wide range of kernel launch
parameters. As a result, the choice made by the compiler is effective for different
numbers of threads per block and blocks per SM, however this does not always equate to
the best performance for a specific kernel configuration. The first construct which can be
used to advise the compiler is
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM). This
construct is placed directly in the kernel definition source code, and the maximum
number of threads per block for the kernel and the minimum number of blocks per SM are
defined at compile time.
This additional information, on the bounds of the kernel launch parameters, helps the
compiler to reach optimised register usage specific to the specified kernel launch
parameters. If a kernel is launched with a number of threads per block greater than
MAX_THREADS_PER_BLOCK the kernel will not run and the launch returns an error.
MIN_BLOCKS_PER_SM, in contrast, is a hint for the compiler's register allocation rather
than a constraint checked at launch. Often off-chip local memory usage or the number of
instructions is traded off in order to reach lower register usage. The second construct
which can be used is the compiler flag maxrregcount, which is placed in the make file for
the project. This compiler flag places a simple hard limit on the number of registers per
thread that the compiler can use. When the compiler cannot stay below the hard limit,
register spilling into off-chip local memory occurs. The impact of both of these
constructs on register usage, occupancy and throughput was practically assessed and the
results are given in Figure 4-8.
 
 
Figure 4-8 Impact on varying compiled registers per thread 
 
In Figure 4-8, 78 registers per thread is the native number of registers allocated by
the compiler with no constructs used. 72 registers per thread is the number of registers
used when using the __launch_bounds__ construct with a MAX_THREADS_PER_BLOCK
value of 224 and a MIN_BLOCKS_PER_SM value of 16. The 56 and 32 registers per thread
results were achieved by manually forcing the compiler using the maxrregcount
construct. Figure 4-8 demonstrates that the __launch_bounds__ construct can be used to
minimise execution time by aiding the compiler in making the optimum trade-off between
register usage and occupancy. This configuration is specific to the AVIRIS test image
characteristics; the occupancy results for this configuration are
given for reference in Table 4-4.
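
To relate the register counts above to block residency, the sketch below computes the largest per-thread register count that still lets a chosen number of 224-thread blocks share one SM. It is an illustration only and does not describe how nvcc actually selects register counts: the function name is hypothetical, and the rounding of register counts down to a multiple of 8 per thread is an assumption about allocation granularity.

```python
def register_budget(target_blocks_per_sm, threads_per_block,
                    regs_per_sm=65536, granularity=8):
    """Largest registers-per-thread value that lets `target_blocks_per_sm`
    blocks fit in one SM register file (granularity is an assumption)."""
    raw = regs_per_sm // (target_blocks_per_sm * threads_per_block)
    return (raw // granularity) * granularity  # round down to the granularity


for target in (4, 5, 9):
    print(target, register_budget(target, 224))
# 4 72
# 5 56
# 9 32
```

Under these assumptions, 72, 56 and 32 registers per thread correspond to 4, 5 and 9 resident blocks per SM respectively.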
 
Table 4-4 SSC CCSDS-123 optimised compiled characteristics
Image Characteristics – AVIRIS Hawaii
Kernel Characteristics                      | Register Usage Limited Occupancy Statistics
Number of   Number of threads   Registers/  | Active Warps   Active Thread   Occupancy
blocks      per block           Thread      | per SM         Blocks per SM   of each SM
16          224                 72          | 28             4               44%
 
[Figure 4-8 plotted values. X axis: Registers/Thread. Y axes: Time (ms), 60 to 95, and
Occupancy (%), 0 to 100.]
Registers/Thread   Time (ms)   Theoretical Occupancy (%)   Achieved Occupancy (%)
78                 95          33                          30
72                 79          44                          35
56                 83          55                          35
32                 82          98                          35

The results shown in Figure 4-8 also show that whilst manually restricting the
compiler register usage increases the theoretical occupancy, this is not translated to an
increase in achieved occupancy or throughput. This is due to the increased overheads
associated with register spilling into off-chip local memory. NVIDIA regularly updates
its compilers; each new version often improves the compiler's optimisation capability,
leading to more highly optimised implementations. Therefore, with future improvements
to the compiler technology the approach discussed here may not always be necessary to
help the compiler achieve the best trade-off between occupancy and execution time.
However, understanding kernel limiting factors will remain necessary to understand
compiler implementation characteristics.
 
 
 
4.3.4  CCSDS-123 Kernel Organisation – Concurrent Tasks 
In addition to exploiting DLP through independent spectral data processing, the new
CCSDS-123 kernel leverages image tiling to introduce an additional axis for
parallelisation. The additional parallelism introduced from image tiling could be exploited
on the GPU in multiple ways. The first is to leverage the concurrent programming
functionality called CUDA streams, the use of which was simplified in CUDA 7.0
(March 2015) with the addition of a per-thread default stream option [9].
This concept allows the execution of asynchronous GPU commands including host-GPU
memory operations and kernel launches. When no stream is specified, GPU commands are
allocated to the default stream, in which they execute in issue order and, under the
legacy behaviour, implicitly synchronise with commands in other streams. With custom
streams, it is possible to execute multiple memory commands and kernel launches
concurrently from the host, as GPU resources allow, as illustrated in Figure 4-9. This
concept could allow us to create an implementation in which multiple streams can be
utilised to launch multiple CCSDS-123 kernels concurrently for different image tiles.
Design Rule D: Determine the kernel occupancy limiting factor. For register limited ker-
nels, compiler constructs can be used to optimise register usage for increased occupancy. 

Figure 4-9 NVIDIA CUDA Streams
[Timeline diagram. Horizontal axis: Time. Rows: Stream 0 to Stream 3. Key: sequential
kernels (Kernel A, Kernel B, Kernel C); independent concurrent A, B & C kernels.]
 
The alternative approach is to implement the independent compression of multiple
image tiles within a single kernel, where each block of threads is tasked with the
compression of one image tile. This approach will achieve a higher processing throughput
for the application because it increases the parallel workload more effectively, hiding
stalls and instruction latencies without inducing any additional kernel start-up
overheads.
 
Design Rule E: TLP can be exploited using concurrent kernels where an individual kernel 
has sufficient DLP or can increase intra-kernel parallelism where DLP is insufficient. 
 
4.3.5  Bit Packer Kernel Organisation – Configuration Optimisation 
The new Bit Packer kernel has been designed to take in the two output arrays from
the CCSDS-123 kernel. The first contains the binary variable length codeword data,
represented in integer form, to be packed into an output bit stream. The second contains
the lengths of the codewords to be packed, cumulatively summed by the THRUST
inclusive sum kernel.
The bit packer operation can be performed across the full number of pixels in the
image as there are no data dependencies. Therefore, each thread is responsible for
the packing of a single pixel’s compressed codeword into the final bitstream. As each
thread works independently it is important to ensure no two threads attempt to
write to the same memory location at the same time, thus the atomic OR operation is
used. The compressed sizes for each individual tile also need to be stored with the
bitstream so that the decoder can identify the start and end of each tile for subsequent
decompression. Additionally, as the compression algorithm stipulates that the
output bitstream is guaranteed to be no larger than the original image size, the
memory allocated to store the input pixel values can be re-used for the storage of the
output bitstream, which eliminates an additional large memory allocation operation.
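
The packing scheme described above can be sketched sequentially in Python. This is an illustrative stand-in, not the thesis kernel: the function name, the 32-bit output word size and the most-significant-bit-first ordering are assumptions, and each iteration of the outer loop plays the role of one GPU thread, with the `|=` update standing in for the atomic OR.

```python
from itertools import accumulate


def pack_codewords(codewords, lengths, word_bits=32):
    """Pack variable length codewords into fixed-size words.

    `codewords` holds each codeword as an integer, `lengths` its bit length.
    """
    ends = list(accumulate(lengths))        # inclusive cumulative sum of lengths
    total_bits = ends[-1] if ends else 0
    words = [0] * (-(-total_bits // word_bits))  # ceiling division
    for value, length, end in zip(codewords, lengths, ends):
        start = end - length                # first bit position of this codeword
        for i in range(length):             # place bits, most significant first
            bit = (value >> (length - 1 - i)) & 1
            pos = start + i
            # OR the bit into place; on the GPU this would be an atomic OR
            words[pos // word_bits] |= bit << (word_bits - 1 - pos % word_bits)
    return words, total_bits


words, nbits = pack_codewords([0b101, 0b1, 0b0011], [3, 1, 4])
print([hex(w) for w in words], nbits)  # ['0xb3000000'] 8
```

Because every codeword's start position follows directly from the cumulative sum, no iteration depends on another, which is what allows one thread per codeword. Codewords that straddle a word boundary are the reason two threads can touch the same word.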
Because the kernel functionality has no data dependencies, the kernel configuration 
in terms of number of blocks and number of threads, is best configured to reflect the un-
derlying GPU hardware to maximise GPU utilisation. Performance results demonstrating 
this are given in Table 4-5 and the resulting final theoretical occupancy for the new Bit 
Packer kernel is given in Table 4-6. As the achieved theoretical occupancy is 100% and 
the execution time is marginal compared to the CCSDS-123 kernel no further optimisa-
tions were pursued.  
 
Table 4-5 Bit Packer kernel configuration and performance comparison
Kernel Configuration Type               Threads   Blocks    Time (ms)
Image characteristics induced           224       314,368   33.4
GPU hardware characteristics induced    1024      68,768    32.7
 
 
Table 4-6 SSC Bit Packer kernel characteristics and theoretical occupancy
Kernel Characteristics                      | Occupancy Statistics
Number of   Number of threads   Registers/  | Active Warps   Active Thread   Occupancy
blocks      per block           Thread      | per SM         Blocks per SM   of each SM
68,768      1024                18          | 64             2               100%
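
The block counts in Table 4-5 follow from dividing the number of pixels by the chosen threads per block. A minimal Python sketch of that sizing, assuming one thread per pixel as described above (the helper name is hypothetical):

```python
def grid_size(num_elements, threads_per_block):
    """Number of blocks needed so that every element gets one thread."""
    return -(-num_elements // threads_per_block)  # ceiling division


pixels = 614 * 512 * 224  # AVIRIS Hawaii: width x height x bands
print(grid_size(pixels, 224), grid_size(pixels, 1024))  # 314368 68768
```

The same workload is covered in both cases; only the way it is divided into blocks changes.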
 
 
4.4    Initial Application Evaluation 
The initial application evaluation experiments discussed in this subsection, for the
new SSC CUDA CCSDS-123 application, utilise a desktop development platform, the
details of which are given in Table 4-7. These experiments have been conducted to assess
the overall performance of the new application and to address the research questions
identified in Table 3-3.
  
Design Rule F: When the operations of a kernel have no data dependencies, for maximum 
data processing throughput the configuration should reflect the underlying hardware ar-
chitecture rather than the input data dimensionality. 

Table 4-7 Desktop application evaluation platform
Desktop PC     Dell OPTIPLEX 7010 (SMF)
CPU & Clock    Intel Core i5-3470 @ 3.20GHz
CPU RAM        4GB DDR3
GPU & Clock    NVIDIA GeForce GTX750Ti @ 1.02GHz
 
There are two key metrics which are used in this work to evaluate the algorithm and 
application performance, namely compression ratio and processing throughput. The com-
pression ratio is calculated using equation (2-4) by measuring the original file size and 
compressed file size in bits. Processing throughput is calculated using equation (4-8) by 
measuring the total amount of time taken to execute the full compression application and 
is often expressed as Gigabits per second (Gb/s) in this research. The correctness of the 
algorithm implementation has been verified by using a developed decompression applica-
tion which performs the reverse of the algorithm process, taking in the compressed 
bitstream and outputting the original uncompressed image, where each compressed image 
test case has been verified for correctness.  
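
The two metrics can be written as short Python helpers. This is a sketch under stated assumptions: the compression ratio is taken as original size divided by compressed size, 1 Gb is taken as 10^9 bits, the image size assumes 12 bits per sample, and the 0.2 s execution time is an illustrative placeholder rather than a measured result.

```python
def compression_ratio(original_bits, compressed_bits):
    """Original size divided by compressed size (assumed form of equation 2-4)."""
    return original_bits / compressed_bits


def throughput_gbps(original_bits, execution_seconds):
    """Processing throughput in Gb/s, taking 1 Gb as 1e9 bits."""
    return original_bits / execution_seconds / 1e9


# AVIRIS Hawaii dimensions at 12 bits per sample (assumed storage size).
original_bits = 614 * 512 * 224 * 12
print(original_bits)                        # 845021184
print(throughput_gbps(original_bits, 0.2))  # illustrative 0.2 s run time
print(compression_ratio(original_bits, original_bits // 4))  # 4.0
```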
 
4.4.1  Literature Performance Comparison 
This subsection details performance testing of the new SSC CCSDS-123 application 
on the hyperspectral AVIRIS Hawaii test image, which was used by previous publications 
in their own documented testing. Testing using this image has been performed
specifically to allow a comparison to be made between the new SSC application and those
implementations previously proposed in literature.
The performance results quoted here cannot be compared directly like for like with 
previous published works [138][140][141] as different GPU devices have been used and 
the necessary details on the compression parameters for the CCSDS-123 algorithm were 
not published. The parameters used by default, unless otherwise stated, for all
experiments conducted in this research can be found in Appendix F. The results and
configurations used for comparison are provided in Table 4-9; this data is then presented
in comparison with previous results from literature in Figure 4-10.
  
\[
\text{Processing throughput} = \frac{\text{original image size (bits)}}{\text{total execution time (seconds)}} \tag{4-8}
\]
 
Table 4-8 AVIRIS Hawaii test image characteristics
AVIRIS Image   Width        Height       Number of   Dynamic Range
               (X pixels)   (Y pixels)   Bands (Z)   (bits per pixel)
Hawaii         614          512          224         12
 
 
Table 4-9 AVIRIS Hawaii test configuration and new compression results
Number of   Number of   Width        Height       Compression   Throughput
Tiles       Bands (Z)   (X Pixels)   (Y Pixels)   Ratio         (Gb/s)
16          224         614          32           4.03          5.28
  
 
 
 
Figure 4-10 CCSDS-123 implementation throughput performance comparison 
 
Table 4-10 GPU platform comparison
Device        Peak GFLOP/s   Peak Memory        CUDA Cores   Max Power
                             Bandwidth (GB/s)                Consumption (W)
GTX 580       1581.1         192.4              512          244
Tesla C2070   1030.0         144.0              448          238
GTX 560 M     595.2          60.0               192          75
GTX 750 Ti    1305.6         86.4               640          60

[Figure 4-10 plotted values. Horizontal bar chart. X axis: Throughput (Gb/s), 0.0 to 6.0.
Legend: FPGA, Desktop GPU, Mobile GPU, Desktop CPU, Mobile CPU, AVIRIS Data Rate.
Bars grouped by source: SSC, [141], [140], [138].]
Platform (in plotted order)    Throughput (Gb/s)
NVIDIA GTX 750Ti               5.28
2x NVIDIA GTX 560M             4.28
NVIDIA GTX 560M                3.86
Intel i7 2760QM 4 Cores        1.53
Intel i7 2760QM 1 Core         0.07
Intel i7 2760QM 4 Cores        0.12
Intel Xeon X5690 1 Core        0.09
Intel Xeon X5690 4 Cores       0.19
Intel Xeon X5690 8 Cores       0.20
Intel Xeon X5690 12 Cores      0.23
NVIDIA GTX 560M                0.44
NVIDIA Tesla C2070             0.36
NVIDIA GTX 580                 0.54
Virtex-4 LX25                  0.70

These results show that the new SSC CCSDS-123 application in this thesis is able to
achieve a throughput performance in excess of the previous state-of-the-art. These
results are for the GTX750Ti GPU, which does have higher peak GFLOP/s and
memory bandwidth than the previously used GTX 560M. However, previous results from
literature have shown that hardware characteristics have had a relatively low impact on
achieved throughput: in the results from [140], moving from the GTX 560M to the GTX 580,
which has greater peak GFLOP/s and memory bandwidth than the GTX 750Ti used in
this research, only improves throughput by 0.1 Gb/s.
4.4.2  Image Tiling, Processing Throughput and Compression Ratio 
The remainder of this Chapter details the further research and practical experiments 
conducted to assess the CCSDS-123 algorithm and the new SSC CCSDS-123 GPU appli-
cation, with regards to throughput and compression ratio performance and the impact of 
image tiling parameters. The data set used for the testing detailed in this Chapter is a sub-
set of images sourced from the official CCSDS satellite data corpus which is compiled 
specifically for the testing of image compression algorithms, available from [143]. The 
subset of images used and discussed in this Chapter is uncalibrated imagery from the 
AVIRIS hyperspectral imager. Uncalibrated imagery is used as this is most representative 
of the raw data which will need to be compressed onboard the satellite. The dimensionali-
ty of each image used in this study is given in Table 4-11 and thumbnails for reference of 
image content can be found in Appendix G [143]. 
 
Table 4-11 AVIRIS test image characteristics
AVIRIS Image     Width        Height       Number of   Dynamic Range
                 (X pixels)   (Y pixels)   Bands (Z)   (bits per pixel)
Hawaii           614          512          224         12
Maine            680          512          224         12
Yellowstone 00   680          512          224         16
Yellowstone 03   680          512          224         16
 
The following experiments have been designed to gather new information and enable
us to propose initial answers to research questions 1.a-c in Table 3-3 on Page 95.
Figure 4-11 - Figure 4-14 present the compression ratio and processing throughput results
for the SSC GPU accelerated compression of each tested AVIRIS image over a range of
different image tiling scenarios. The X axis of each plot shows the variation in the overall
number of image tiles, and in order to be able to assess the influence of tiling specifically
in the input image’s X or Y dimension, the results for tiling scenarios for different X
Sizes are presented as separate series on the plots.
All the compression ratio results presented in these figures account for the additional 
overhead information induced by performing tiling. This is the overhead necessary to 
store the tile boundary locations in the compressed bitstream for successful decompres-
sion and is equal to 16 bits per tile. In comparison to the overall compressed image file 
sizes, this has a negligible impact on the overall achieved compression ratio and the 
trends seen with varying numbers of image tiles.  
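
The effect of the 16 bits per tile overhead can be seen in a small Python sketch. The function name and the tile sizes used in the example are hypothetical and chosen only to show the scale of the overhead; they are not results from this work.

```python
def tiled_compression_ratio(original_bits, tile_payload_bits,
                            overhead_bits_per_tile=16):
    """Compression ratio including the per-tile boundary overhead.

    `tile_payload_bits` is a list with the compressed size of each tile.
    """
    overhead = overhead_bits_per_tile * len(tile_payload_bits)
    return original_bits / (sum(tile_payload_bits) + overhead)


# Hypothetical example: 1,000,000 input bits split into 16 equal tiles.
payloads = [15625] * 16
print(tiled_compression_ratio(1_000_000, payloads, overhead_bits_per_tile=0))  # 4.0
print(round(tiled_compression_ratio(1_000_000, payloads), 4))                  # 3.9959
```

For real images the compressed payload is many orders of magnitude larger than 16 bits per tile, so the overhead term is even less visible than in this toy case.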
Directly analysing the results for the tested AVIRIS images, Figure 4-11 – Figure 4-14,
there are several clear consistent trends across all images.
Firstly, with overall increasing numbers of tiles, X axis, the achieved compression ratio 
decreases. In addition, these plots also show that the tile configuration, tile X and Y sizes, 
also has increasing influence on the compression ratio performance with increasing num-
bers of tiles. This is seen in the Figures by the increasing variation in achieved 
compression ratio between the different series for the same numbers of tiles. Whilst for 
the AVIRIS Hawaii image a larger X tile size achieves the higher compression ratio, for 
the other AVIRIS images at higher numbers of tiles the smallest tile X size achieves the 
highest compression ratio. It is postulated that these trends are dependent upon the indi-
vidual image content characteristics, specifically if pixel correlation is greater across the 
X or Y dimension of the image. This makes predicting ideal tile dimensions for maximum 
compression ratio extremely difficult in advance of the data being available. However, the 
variation between the different tile configurations for the same number of overall tiles is 
minimal; for all AVIRIS images the largest variation is approximately 0.02, which can be 
considered negligible. In contrast, selecting the overall number of image tiles can have a 
much larger influence on the compression ratio.  
 

Figure 4-11 AVIRIS Hawaii compression ratio and throughput (GTX750Ti)
[Line plot, AVIRIS Hawaii. X axis: Number of Tiles (1, 2, 4, ... 128). Y axes: Throughput
(Gb/s), 0.8 to 5.8, and Compression Ratio, 3.9 to 4.1. Series: throughput and compression
ratio for X Size=614 and X Size=307.]

Figure 4-12 AVIRIS Maine compression ratio and throughput (GTX750Ti)
[Line plot, AVIRIS Maine. X axis: Number of Tiles (1, 2, 4, ... 512). Y axes: Throughput
(Gb/s), 0.8 to 5.8, and Compression Ratio, 3.6 to 4. Series: throughput and compression
ratio for X Size=680, 340, 170 and 85.]

 
Figure 4-13 AVIRIS Yellowstone 00 compression ratio and throughput (GTX750Ti)
[Line plot, AVIRIS Yellowstone 00. X axis: Number of Tiles (1, 2, 4, ... 512). Y axes:
Throughput (Gb/s), 1 to 8, and Compression Ratio, 2.28 to 2.46. Series: throughput and
compression ratio for X Size=680, 340, 170 and 85.]

Figure 4-14 AVIRIS Yellowstone 03 compression ratio and throughput (GTX750Ti)
[Line plot, AVIRIS Yellowstone 03. X axis: Number of Tiles (1, 2, 4, ... 512). Y axes:
Throughput (Gb/s), 1 to 8, and Compression Ratio, 2.34 to 2.52. Series: throughput and
compression ratio for X Size=680, 340, 170 and 85.]

Assessing the throughput performance results in Figure 4-11 - Figure 4-14, there are
several interesting trends. Firstly, converse to the trend in compression ratio, specific tile
dimensions have very little impact on the achieved processing throughput, whereby the
overall number of image tiles is the main influencing parameter. Secondly, for all images
and between 1 and 16 image tiles there is an exponentially increasing trend in processing
throughput. However, beyond 16 tiles the increase in throughput with further increased
numbers of tiles is significantly reduced; for all tested images there are minimal
increases in processing throughput beyond 16 image tiles.
Using an understanding of the software design and specific underlying GPU
architecture it is possible to relate the trend in processing throughput with the variation in
numbers of tiles to several GPU characteristics. The CCSDS-123 kernel represents the
majority of computational time and is the bottleneck kernel of the application. From the
occupancy analysis of this kernel it is shown that, due to the register usage limitation, the
kernel is restricted to executing up to 4 concurrent thread blocks per SM. As the
GTX750Ti features 5 SMs, this equates to up to 20 concurrent thread blocks.
Equations (4-9) and (4-10) are proposed to mathematically define this relationship and
allow the ideal number of tiles to be calculated using known GPU and application
parameters.

\[
\text{Concurrent blocks per SM (register limited kernel)} = \left\lfloor \frac{\text{Max registers per block}}{\text{Warps per block} \times \text{Registers per warp}} \right\rfloor \tag{4-9}
\]

\[
\text{SSC CCSDS-123 concurrent blocks and tiles per GPU} = \left\lfloor \frac{\text{Max registers per block}}{\text{Warps per block} \times \text{Registers per warp}} \right\rfloor \times \text{SMs per GPU} \tag{4-10}
\]

In the GPU application proposed in this thesis, the number of image tiles is equal to
the number of thread blocks, therefore, it is postulated that for the AVIRIS imagery 20
image tiles will represent the theoretical peak of the exponential throughput trend. For
greater than 20 image tiles the additionally required thread blocks will not be executed on
the GPU concurrently; as a result there may be a small reduction or minimal increase in
processing throughput.
In Figure 4-11 – Figure 4-14, only powers of two number of tiles have been tested;
this is for simplified spatial subdivision of the imagery to suit their specific dimensions.
The graphs show that 16 tiles result in the peak of the exponential throughput trend for
the numbers of tiles tested in these experiments. This is the case because it is the closest
value to the theoretical maximum number of concurrent blocks for the CCSDS-123 kernel
on the GTX750Ti GPU, i.e. 20. Considering these trends and the parameters which
influence them, GPUs with different numbers of SMs and imagery with different
dimensionality would be expected to impact the occupancy and thus the maximum
number of concurrent blocks and optimum number of image tiles for peak processing
throughput.
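
Equations (4-9) and (4-10) can be evaluated directly from the quantities named in the text. The Python sketch below uses hypothetical function names; the final helper, which picks the largest power of two not exceeding the concurrent block count, is one possible reading of how a tile count is chosen when only power-of-two tilings are considered, and is an assumption rather than a rule stated in this work.

```python
def concurrent_blocks_per_sm(max_regs_per_block, warps_per_block, regs_per_warp):
    """Equation (4-9): register-limited concurrent blocks per SM."""
    return max_regs_per_block // (warps_per_block * regs_per_warp)


def concurrent_tiles_per_gpu(max_regs_per_block, warps_per_block,
                             regs_per_warp, sms_per_gpu):
    """Equation (4-10): concurrent blocks (and tiles) across the whole GPU."""
    per_sm = concurrent_blocks_per_sm(max_regs_per_block, warps_per_block,
                                      regs_per_warp)
    return per_sm * sms_per_gpu


def largest_power_of_two_at_most(n):
    """Largest power of two that does not exceed n (n must be >= 1)."""
    return 1 << (n.bit_length() - 1)


regs_per_warp = 72 * 32  # 72 registers per thread, 32 threads per warp = 2304
per_sm = concurrent_blocks_per_sm(65536, 7, regs_per_warp)
per_gpu = concurrent_tiles_per_gpu(65536, 7, regs_per_warp, sms_per_gpu=5)
print(per_sm, per_gpu, largest_power_of_two_at_most(per_gpu))  # 4 20 16
```

Changing `sms_per_gpu` or the register usage shows how the expected peak tile count would shift on a different device or kernel build.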
In addition to investigating the influence of image tiling parameters on compression
ratio and throughput, the impact of varying a number of CCSDS-123 algorithm parameters
has also been researched. The results from these tests can be found in Appendix G
[143][144]. Overall, it was found that these algorithm parameters have a less significant
impact on the algorithm and application performance when compared to the discussed
tiling parameters.
4.4.3  Performance Trade-off  
Due to the opposing trends in throughput and compression ratio with increased image
tiles, maximising both concurrently is not possible. As a result, a methodology to
determine the number of tiles to obtain a required trade-off between data processing
throughput and compression ratio is required. A methodology based on the Weissman
score comparison metric, given in equation (4-11), is proposed [144]. The Weissman score
is a comparative measure that factors in both compression time and compression ratio for
the comparison of different compression schemes, whereby the greater the Weissman
score the better the combined throughput and compression ratio performance.
\[
W = \alpha \, \frac{r}{\bar{r}} \, \frac{\log_{10} \bar{T}}{\log_{10} T} \tag{4-11}
\]

Where: $r$ is the compression ratio for comparison
            $T$ is the time to compress for comparison
            $\bar{r}$ is the compression ratio for the reference data
            $\bar{T}$ is the time to compress for the reference data
            $\alpha$ is a scaling constant
 
For EO and remote sensing data, the dynamic range, which is the number of bits used
to represent each pixel in the original data, can vary between data sets. Therefore, it is
proposed that a modified Weissman Score, which utilises the compression throughput
rather than compression time, is used. This accounts for the differing data volumes
across a wider range of imagery, giving a fair comparison metric. The modified
Weissman score equation used in this research is given in (4-12), where the
throughput used is measured in Mb/s.
\[
W_{\text{modified}} = \alpha \, \frac{r}{\bar{r}} \, \frac{\log_{10} TP}{\log_{10} \overline{TP}} \tag{4-12}
\]

Where: $r$ is the compression ratio for comparison
            $TP$ is the achieved throughput (Mb/s) for comparison
            $\bar{r}$ is the compression ratio for the reference data
            $\overline{TP}$ is the achieved throughput (Mb/s) for the reference data
            $\alpha$ is a scaling constant
 
As the AVIRIS Hawaii image has been widely used in many publications in the area of
remote sensing image compression, the throughput and compression ratio results
from the non-tiled AVIRIS Hawaii image have been used as the reference
in the calculation of the throughput modified Weissman scores quoted throughout this
research. For reference the throughput and compression ratio results used are given in
Table 4-12. In this research a Weissman Score scaling constant of 1 is used; this gives a
balanced trade-off between compression ratio and throughput. However, to suit different
mission requirements this scaling constant can simply be altered to give a differently
balanced trade-off between the two metrics as required. The throughput Weissman Score
results for the tiling size investigation for AVIRIS imagery are given in Figure 4-15.
 
Table 4-12 AVIRIS Hawaii throughput Weissman Score reference data 
Image Throughput (Mb/s) Compression Ratio 
AVIRIS Hawaii – No tiling  955 4.0794 
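
As a worked example of the throughput modified Weissman score, the Python sketch below evaluates equation (4-12) for the 16-tile AVIRIS Hawaii result in Table 4-9 (compression ratio 4.03, throughput 5.28 Gb/s, i.e. 5280 Mb/s) against the non-tiled reference in Table 4-12. The function and parameter names are hypothetical.

```python
import math


def modified_weissman(ratio, throughput_mbps, ref_ratio, ref_throughput_mbps,
                      alpha=1.0):
    """Throughput modified Weissman score, equation (4-12)."""
    return (alpha * (ratio / ref_ratio)
            * (math.log10(throughput_mbps) / math.log10(ref_throughput_mbps)))


# Reference: non-tiled AVIRIS Hawaii (Table 4-12).
REF_RATIO, REF_THROUGHPUT = 4.0794, 955.0

score = modified_weissman(4.03, 5280.0, REF_RATIO, REF_THROUGHPUT)
print(round(score, 2))  # 1.23
```

The reference configuration scores exactly 1 by construction, so values above 1 indicate a better combined result than the non-tiled Hawaii case.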
 
 
Figure 4-15 AVIRIS GTX750Ti throughput Weissman scores with tile size (α = 1)
[Line plot. X axis: Number of Tiles (1, 2, 4, ... 512). Y axis: Weissman Score, 0.5 to 1.3.
Series: AVIRIS Hawaii, AVIRIS Maine, AVIRIS Yellowstone 00, AVIRIS Yellowstone 03.
Labelled points: Hawaii, 16x614x32: 1.23; Maine, 16x680x32: 1.19;
Yellowstone 00, 16x85x256: 0.77; Yellowstone 03, 16x85x256: 0.79.]

Plotting the throughput Weissman scores in Figure 4-15 helps to visualise and
determine the optimum number of tiles which achieves, in this case, an equally balanced
trade-off between throughput and compression ratio. A clear commonality shown by
these results is that 16 image tiles achieves the highest modified throughput Weissman
score for all the tested AVIRIS images. For these AVIRIS test images this correlates with
the peak number of concurrent tiles executed on the GPU. Therefore, it is proposed that
the peak number of concurrent tiles also represents an appropriate trade-off between
maximising both processing throughput and compression ratio. Additionally, as all the
parameters used in equations (4-9) and (4-10) can be determined in advance, it is possible
to theoretically calculate the number of tiles which results in an appropriate trade-off
between throughput and compression ratio in advance of performing any compression, as
demonstrated in Table 4-13.
 
Table 4-13 AVIRIS imagery, kernel and GTX750Ti hardware characteristics
Source                                        Characteristic                             Value
Imagery induced characteristics               Warps per block                            7
CCSDS-123 kernel induced characteristics      Registers per (thread) & warp              (72) 2304
Hardware induced characteristics (GTX750Ti)   Max registers per block                    65536
Hardware induced characteristics (GTX750Ti)   SMs per GPU                                5
Result                                        Concurrent blocks per SM (4-9)             4
Result                                        Concurrent blocks & tiles per GPU (4-10)   20

This could be helpful in deciding in advance the suitable number of image tiles
required for a certain mission prior to launch. However, it is also important to consider the
raw input data rate: the AVIRIS imager which generated the data tested in this Section
has a data rate of 800 Mb/s [141]. Therefore, in this mission scenario even when no image
tiling is performed the application is able to achieve a real-time data processing
throughput. Due to the reduction in compression ratio induced by image tiling, it will be
specific to the onboard data processing system employed if beyond real-time processing
provides any significant advantages.

4.5    Low Power GPU Application Performance
To date the development and testing of all published CCSDS-123 CUDA GPU
applications have occurred on GPU platforms which are not feasible for use in low power
applications such as the onboard space environment. In terms of power consumption
particularly, the NVIDIA Jetson range of GPU platforms are considered suitable for onboard
implementation with a maximum TDP of 15W [145]. The NVIDIA Jetson TX1 device is
based on the NVIDIA Tegra X1 SoC device, featuring four ARM Cortex-A57 CPU cores
and four ARM Cortex-A53 CPU cores in addition to an NVIDIA Maxwell based GPU on
the single silicon device.
Whilst a key advantage of the Jetson TX1 is a power consumption suitable for onboard use,
the Jetson TX1 module is also mechanically relatively small, at 50 mm x 87 mm. Another
important factor is that the Jetson TX1 is available as a developer kit. This allows
individuals and industry to purchase the device for a reasonable price in a package
enabling quick set-up and application development.
In this Section new performance results for the newly developed SSC CCSDS-123 application
on the low power embedded Jetson TX1 GPU platform are presented. This was the latest
Jetson device available at the time this research was conducted, and the author would
like to gratefully acknowledge the support of NVIDIA Corporation with the award of a
grant which provided the Jetson TX1 developer kit used in this research.
Whilst the previously used GTX750Ti and the Jetson TX1 GPUs are based on the same NVIDIA
Maxwell architecture generation, there are a number of characteristics which differ
between the two GPU devices, summarised in Table 4-14. A key difference is the lower
number of SMs per GPU, and therefore the lower overall number of CUDA cores, for the
Jetson TX1 device. As a result, lower data processing throughput results are expected for
the Jetson TX1 platform.
 
Table 4-14 GPU test platform comparison
              GTX750Ti                                    Jetson TX1
GPU Config.   640 Maxwell CUDA Cores (5 SMs) @ 1020MHz    256 Maxwell CUDA Cores (2 SMs) @ 1000MHz
GPU RAM       2GB 128 bit GDDR5 @ 86.4 GB/s               4 GB 64 bit LPDDR4 @ 25.6 GB/s
GPU TDP       60W                                         6.5 - 15W
 
 
In order to experimentally examine the measurable differences in processing throughput of
the Jetson TX1 GPU platform, several of the compression experiments conducted in
Section 4.4 have been repeated on the Jetson TX1 platform. Additionally, these results
can be used to demonstrate the generality of the previously proposed equations for the
relationship between the number of tiles and the throughput. For the AVIRIS images, using
equations (4-9) and (4-10), it is predicted that on the Jetson TX1 the peak of the
exponential throughput trend and the peak throughput Weissman score will occur for 8
image tiles, as shown in Table 4-15.
 
Table 4-15 AVIRIS imagery, kernel and Jetson TX1 hardware characteristics
Imagery induced    CCSDS-123 kernel induced        Hardware induced characteristics
characteristics    characteristics                 (Jetson TX1)
Warps per block    Registers per (thread) & warp   Max registers per block   SMs per GPU
7                  (72) 2304                       65536                     2

Concurrent blocks per SM (4-9)    Maximum concurrent tiles per GPU (4-10)
4                                 8
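
The values in Tables 4-13 and 4-15 can be reproduced with a few lines of arithmetic. The sketch below is an assumed reading of equations (4-9) and (4-10), inferred only from the tabulated values (a register-limited block count per SM, multiplied by the number of SMs); the function names are illustrative and are not taken from the SSC application. The last helper assumes, as in the tiling experiments, that the tested tile counts are powers of two.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def concurrent_blocks_per_sm(warps_per_block, regs_per_thread, max_regs):
    """Register-limited blocks per SM (assumed reading of equation (4-9))."""
    regs_per_warp = regs_per_thread * WARP_SIZE       # 72 * 32 = 2304
    regs_per_block = warps_per_block * regs_per_warp  # 7 * 2304 = 16128
    return max_regs // regs_per_block                 # floor division


def max_concurrent_tiles(blocks_per_sm, sms_per_gpu):
    """One tile per block over all SMs (assumed reading of equation (4-10))."""
    return blocks_per_sm * sms_per_gpu


def largest_tested_tile_count(max_tiles):
    """Largest power-of-two tile count that does not exceed max_tiles."""
    return 1 << (max_tiles.bit_length() - 1)


if __name__ == "__main__":
    blocks = concurrent_blocks_per_sm(warps_per_block=7, regs_per_thread=72,
                                      max_regs=65536)
    for name, sms in (("GTX750Ti", 5), ("Jetson TX1", 2)):
        tiles = max_concurrent_tiles(blocks, sms)
        print(name, blocks, tiles, largest_tested_tile_count(tiles))
```

Under these assumptions the GTX750Ti gives 4 blocks per SM and 20 concurrent tiles (nearest tested count 16), and the Jetson TX1 gives 4 blocks per SM and 8 concurrent tiles, matching the tables.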
 
 
Figure 4-16 gives the comparative image tiling test results of the CCSDS-123 GPU
application from this thesis on both the desktop GTX750Ti and embedded Jetson TX1
platforms for all the AVIRIS images.
The processing throughput trends are similar across the two GPU platforms. As predicted,
on the Jetson TX1 the peak of the exponential throughput increase occurs at 8 tiles
across all tested AVIRIS images, which corresponds to the maximum number of concurrently
processed image tiles. In contrast to the trends seen on the GTX750Ti, however, beyond 8
tiles the throughput plateaus with no significant further increases in the processing
throughput. This could indicate that beyond 8 tiles for these images all GPU resources on
the Jetson TX1 are being fully utilised, whereas on the GTX750Ti further image tiling
continues to exploit underutilised GPU resources. Considering the 800 Mb/s data rate of
the AVIRIS imager, on the Jetson TX1 platform the application developed in this research
achieves a real-time data processing throughput for all images by leveraging image tiling
with at least two image tiles.
  
Figure 4-16 AVIRIS imagery GTX750Ti and Jetson TX1 tiled performance (panels A to D)
 
Additionally, for each experiment the power consumption of the Jetson TX1 module and of
the GPU was measured. The approach used to measure these characteristics and the results
obtained are given in Appendix I. Overall these findings show that under application load
the peak power consumption of the Jetson TX1 module is around 13W and the average power
consumption is approximately 6W.
[Figure 4-16 plot content: one panel each for AVIRIS Hawaii, AVIRIS Maine, AVIRIS
Yellowstone 00 and AVIRIS Yellowstone 03; x-axis: Number of Tiles (1 to 128 for Hawaii,
1 to 512 for the other images); y-axes: Throughput (Gb/s) and Compression Ratio; series:
GTX750Ti and Jetson TX1 throughput, and compression ratio, for each tile X size (614 and
307 for Hawaii; 680, 340, 170 and 85 for the other images).]
As performed previously, the throughput Weissman scores for the Jetson TX1 results 
have also been calculated. These new Jetson TX1 throughput Weissman scores are given 
in Figure 4-17 along with the previous Weissman scores for the GTX750Ti for direct 
comparison.  
 
Figure 4-17 AVIRIS Jetson TX1 and GTX750Ti throughput Weissman Scores (α = 1)
 
The GTX750Ti and Jetson TX1 throughput Weissman Scores follow similar trends. This is
because the compression ratio is unaffected by the GPU platform used, thus the
differences in Weissman scores for the two platforms are based solely on the change in
processing throughput. This is reflected in the change in the number of tiles which
equates to the peak throughput Weissman Score, which is 8 for the Jetson TX1 platform, in
comparison with 16 for the GTX750Ti, as predicted by Equation (4-10).
For the AVIRIS Hawaii test image the throughput results for the Jetson TX1 and GTX750Ti
platforms can also be compared with the results quoted in literature for CCSDS-123
implementations on alternative hardware. This comparison is given in Figure 4-18.
[Figure 4-17 plot content: x-axis: Number of Tiles (1 to 512); y-axis: Weissman Score;
series: AVIRIS Hawaii, AVIRIS Maine, AVIRIS Yellowstone 00 and AVIRIS Yellowstone 03,
each on the GTX750Ti and the Jetson TX1.]
  
Figure 4-18 AVIRIS Hawaii CCSDS-123 compression throughput comparison 
 
 
Table 4-16 GPU platform comparison
Device        Peak GFLOP/s   Peak Memory Bandwidth (GB/s)   CUDA Cores   Max Power Consumption (W)
GTX 580       1581.1         192.4                          512          244
Tesla C2070   1030.0         144.0                          448          238
GTX 560 M     595.2          60.0                           192          75
GTX 750 Ti    1305.6         86.4                           640          60
Jetson TX1    1024.0         25.6                           256          15
 
[Figure 4-18 plot content: horizontal bar chart; x-axis: Throughput (Gb/s), 0.0 to 6.0;
bars: NVIDIA Jetson TX1 and NVIDIA GTX 750Ti (SSC) and the literature results of [141],
[140] and [138] on NVIDIA GTX 560M (one and two devices), Intel i7 2760QM (1 and 4
cores), Intel Xeon X5690 (1, 4, 8 and 12 cores), NVIDIA Tesla C2070, NVIDIA GTX 580 and
Virtex-4 LX25; bar categories: FPGA, desktop GPU, mobile GPU, embedded GPU, desktop CPU
and mobile CPU; the AVIRIS data rate is marked for reference.]

The results from this work are given for 16 tiles, to match the configurations used in
the other publications. The results show that whilst the throughput performance for the
Jetson TX1 is significantly reduced compared to the results achieved for the GTX750Ti
platform, it is comparable with the throughput achieved by [140] on the quad core Intel
i7 CPU and approximately double the performance of the only other onboard suitable
device, the Virtex-4 FPGA [139]. Using image tiling on the Jetson TX1 it is possible to
exceed the 800 Mb/s real-time data rate of the AVIRIS imager. This demonstrates the
suitability, in terms of power consumption and processing throughput, of the Jetson TX1
GPU platform for onboard data processing of hyperspectral imagery.
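
The power argument can be illustrated from Table 4-16 by dividing peak compute by maximum power consumption. The sketch below uses only the tabulated peak figures; it is a rough indicator rather than a measured efficiency, since neither quantity was measured under the CCSDS-123 workload. The function name is illustrative.

```python
# (peak GFLOP/s, max power in W), copied from Table 4-16
PLATFORMS = {
    "GTX 580": (1581.1, 244),
    "Tesla C2070": (1030.0, 238),
    "GTX 560 M": (595.2, 75),
    "GTX 750 Ti": (1305.6, 60),
    "Jetson TX1": (1024.0, 15),
}


def gflops_per_watt(peak_gflops, max_power_w):
    """Peak compute per watt of maximum power (specification figures only)."""
    return peak_gflops / max_power_w


if __name__ == "__main__":
    # Print platforms from most to least efficient by this metric.
    ranked = sorted(PLATFORMS.items(),
                    key=lambda item: gflops_per_watt(*item[1]), reverse=True)
    for name, (gflops, watts) in ranked:
        print(f"{name:12s} {gflops_per_watt(gflops, watts):6.2f} GFLOP/s per W")
```

On these specification figures the Jetson TX1 ranks first and the GTX 750 Ti second, which is consistent with the choice of these two platforms in this work.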
 
4.6    Chapter 4 Summary 
This Chapter details the research conducted into appropriate parallelisation approaches
and application development techniques for GPU architectures, using the CCSDS-123
lossless multidimensional image compression algorithm as a case study. Details of the
selected parallelisation approach and design techniques employed in the development of
the new SSC CCSDS-123 GPU application are discussed, providing key insights not explored
in previous literature. Each of the techniques employed is based on the fundamental
principles of the GPU hardware architecture and software model presented in Sections
2.4.1 and 2.4.2, and the techniques have been generalised and summarised into a number of
key GPU design rules (A - F) to aid future GPU accelerated image processing application
developments.
Using the new SSC CCSDS-123 GPU application an experimental testing campaign has been
carried out on a set of AVIRIS hyperspectral EO images. Using these results, it is
possible to compare the newly proposed application performance with previous results in
the literature. Figure 4-18 provides the final comparison results for the new SSC
CCSDS-123 GPU application, showing that for the desktop (GTX750Ti) GPU a new
state-of-the-art processing throughput is achieved. Additionally, on a low power onboard
representative GPU platform (Jetson TX1), a processing throughput approximately twice
that of the other onboard suitable hardware platform, the Virtex-4 FPGA, is achieved.
This throughput also exceeds the real-time processing requirement of the 800 Mb/s AVIRIS
imager. These experiments were designed to specifically address the research questions
detailed in Table 3-3, which were posed based on the limitations and gaps of previous
literature. The key findings from these experiments are summarised in Table 4-17.
  
Table 4-17 SSC CCSDS-123 case study Chapter 4 findings
1.a  How do application parameters impact the algorithm and application performance?
 - Image tiling parameters have a greater influence than algorithm parameters on both
   compression ratio and processing throughput.
 - The greater the number of image tiles, the lower the compression ratio. The decrease
   in compression ratio between no image tiling and 512 image tiles was on average 0.17.
 - The greater the number of image tiles, the greater the processing throughput. On
   average, the throughput increase from no tiling to 512 image tiles was 8.8 times
   greater for the GTX750Ti and 2.5 times greater for the Jetson TX1.
1.b  How does the application performance relate to the GPU architecture?
 - Increasing the number of tiles results in a significant increase in throughput until a
   certain point, beyond which there are diminishing returns in the throughput increase.
   The number of tiles resulting in diminishing returns in processing throughput is
   directly related to the maximum number of concurrently executed image tiles.
 - Proposed a new equation (4-10), based on data, kernel and hardware characteristics, to
   calculate the maximum number of concurrent image tiles per GPU in advance.
 - The warps per block and SMs per GPU are the key variable characteristics which
   influence the peak processing throughput and maximum number of concurrent thread
   blocks.
1.c  Can these relationships be leveraged to predict an optimum configuration in advance?
 - A trade-off will need to be made between compression ratio and processing throughput
   in order to select appropriate image tiling parameters. The throughput Weissman Score
   has been proposed (Equation (4-12)) as an appropriate trade-off metric to compare
   different configurations and facilitate optimum parameter selection.
 - The resulting maximum number of concurrent tiles value has been shown to correlate
   well with the peak processing throughput and peak throughput Weissman Score
   configuration, therefore Equation (4-10) can be leveraged to help determine in advance
   the optimum tiling configuration for a specific application and GPU hardware.
1.d  How do data characteristics impact the algorithm and application performance?
 - Image content characteristics have been shown to have a small impact on the achieved
   compression ratio but no significant impact on the achieved processing throughput.
 - Image bit depth does not have a significant impact on the compression ratio but does
   significantly influence the achieved processing throughput.
 
Rebecca L. Davidson        Chapter 5, Multispectral Optimised GPU Accelerated CCSDS-123 Compression 
- 133 - 
CHAPTER 5, MULTISPECTRAL OPTIMISED GPU ACCELERATED 
CCSDS-123 COMPRESSION 
There are two major classes of multiband optical EO imagery data. Hyperspectral data
sets, such as the previously tested AVIRIS imagery, can be considered as an almost
continuous representation of electromagnetic reflectance, as the data is gathered for
several hundred narrow spectral bands. Multispectral EO imagery, on the other hand,
provides electromagnetic reflectance data in a small number of discrete spectral bands,
in the order of tens, but multispectral imagers often provide significantly greater
spatial resolution than hyperspectral imagers.
Whilst hyperspectral data sets pose an interesting big data processing problem for the
community, they are not currently widely used onboard satellites. As such the majority of
EO data gathered from space borne platforms are currently multispectral images. The
intrinsic differences between these two types of data sets can have significant impacts
on both the achievable processing throughput and the achievable compression ratio of an
onboard image compression system. When compared with hyperspectral imagery data,
multispectral data has inherently reduced compressibility due to reduced correlations
between spectral image bands. Additionally, as the major axis for parallelisation for the
CCSDS-123 algorithm is the number of bands, it is postulated that the achievable
compression processing throughput will also be reduced for multispectral imagery.
 
5.1    Initial Multispectral Imagery Performance Evaluation  
To examine the differences, an experimental study into the throughput and compression
ratio performance of the SSC CCSDS-123 implementation on a multispectral data set has
been conducted. Three multispectral Landsat images, which are openly available from the
CCSDS image corpus [143][144], have been used in these tests. All of these images have
the same key parameters, which are detailed in Table 5-1. These images are referred to in
this thesis as Landsat Agriculture, Landsat Coast and Landsat Mountain. Thumbnails of
these images are included for reference in Appendix G [143][144].
 
Table 5-1 Landsat multispectral imagery test data characteristics [143][144]
Data Set Name   Width (X pixels)   Height (Y pixels)   Bands (Z)   Dynamic Range (bits)
Landsat         1024               1024                6           8
Performing a throughput and compression study, including the influence of the number of
image tiles, for alternative multispectral data sets allows a number of areas to be
investigated. Firstly, it is possible to evaluate the impact of the differences in
imagery characteristics between hyperspectral and multispectral data on both compression
ratio and processing throughput. Secondly, this research assesses the accuracy of the
previously proposed theory that the throughput trends should be predictable given a
number of imagery and hardware induced kernel execution characteristics. With respect to
this, details on the relevant parameters and the predicted number of tiles for the
Landsat imagery specifically are given in Table 5-2. These values were calculated using
equations (4-9) and (4-10) on page 122. The predicted number of tiles for the
multispectral Landsat imagery is considerably higher than for the equivalent
hyperspectral AVIRIS imagery. This is due to the reduced inherent DLP, as seen by the
lower number of warps per block.
 
 
Table 5-2 Landsat imagery, kernel and GTX750Ti hardware characteristics
Imagery induced    CCSDS-123 kernel induced        Hardware induced characteristics
characteristics    characteristics                 (GTX750Ti)
Warps per block    Registers per (thread) & warp   Max registers per block   SMs per GPU
1                  (72) 2304                       65536                     5

Concurrent blocks per SM (4-9)    Maximum concurrent tiles per GPU (4-10)
28                                140
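
The drop from 7 warps per block for AVIRIS to 1 for Landsat can be illustrated with a short calculation. The sketch below assumes one GPU thread per spectral band, which is suggested by the band axis being the major axis of parallelisation but is not stated explicitly in this section, and assumes 224 bands for AVIRIS, which is consistent with the 7 warps per block in Table 4-13. The lane utilisation figure is an illustrative measure of the reduced DLP, not a quantity reported in this thesis.

```python
from math import ceil

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warps_per_block(bands):
    """Warps needed per block, assuming one thread per spectral band."""
    return ceil(bands / WARP_SIZE)


def lane_utilisation(bands):
    """Fraction of the launched warp lanes that hold a spectral band."""
    return bands / (warps_per_block(bands) * WARP_SIZE)


if __name__ == "__main__":
    # 224 bands for AVIRIS is an assumption; 6 bands for Landsat is from Table 5-1.
    for name, bands in (("AVIRIS", 224), ("Landsat", 6)):
        print(name, warps_per_block(bands), f"{lane_utilisation(bands):.4f}")
```

With a single warp per block, the register limit of 65536 divided by 2304 registers per warp allows 28 blocks per SM, as listed in Table 5-2, but under this assumption only 6 of the 32 lanes in each warp would carry a band.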
 
Figures 5-1 to 5-3 give the practical compression ratio and processing throughput results
for the new SSC CCSDS-123 compression application for three multispectral Landsat images,
namely Agriculture, Coast and Mountain, on the GTX750Ti desktop GPU.
 
 
Figure 5-1 Landsat Agriculture compression ratio and throughput (GTX750Ti) 
 
 
Figure 5-2 Landsat Coast compression ratio and throughput (GTX750Ti) 
[Figure 5-1 and Figure 5-2 plot content: panel titles: Landsat Agriculture and Landsat
Coast; x-axis: Number of Tiles (1 to 16384); y-axes: Throughput (Gb/s) and Compression
Ratio; series: throughput and compression ratio for tile X sizes of 1024, 512, 256, 128,
64, 32, 16 and 8 pixels.]
 
Figure 5-3 Landsat Mountain compression ratio and throughput (GTX750Ti)
[Figure 5-3 plot content: panel title: Landsat Mountain; x-axis: Number of Tiles (1 to
16384); y-axes: Throughput (Gb/s) and Compression Ratio; series: throughput and
compression ratio for tile X sizes of 1024, 512, 256, 128, 64, 32, 16 and 8 pixels.]

These figures show that all three Landsat images exhibit similar overall trends for
compression ratio and processing throughput; however, several of these observed trends
differ significantly from the previous results for the hyperspectral AVIRIS imagery. The
first key difference observed is that for the Landsat images the peak compression ratio
is consistently achieved when the image is tiled so that the tile X size is equal to 16
pixels. This is in considerable contrast to the hyperspectral imagery, where the peak
compression ratio was achieved when no tiling was performed and increased numbers of
tiles had a negative impact on compression ratio. It is postulated that this is due to
the increased spatial resolution and spatial correlation of the Landsat images, making
them better suited to image tiling in terms of compressibility. This is an important
trend for onboard processing as it shows that for certain imagers, if used optimally,
image tiling can be leveraged to increase both the achieved compression ratio and
processing throughput.
The throughput trends more closely resemble those seen for the previous AVIRIS image
results. However, there are again several key differences between the throughput achieved
for the multispectral Landsat and hyperspectral AVIRIS imagery. Firstly, as predicted
from equation (4-10) on page 122, the number of tiles at which the throughput performance
no longer exponentially increases differs for the Landsat imagery, occurring at 128
tiles, which is the tested tile count closest to the 140 maximum concurrent tiles per GPU
given in Table 5-2. This shows that the proposed equation is scalable to different
imagery characteristics and that there is a clear relationship between the theoretical
occupancy, the maximum number of concurrent blocks for the CCSDS-123 kernel and the
achieved processing throughput. Additionally, the multispectral Landsat images exhibit a
significant reduction in peak throughput, which is approximately 1.3 Gb/s for the Landsat
images, compared to approximately 8 Gb/s for the AVIRIS images.
The trade-off between these two performance areas can be directly compared using the
modified throughput Weissman Score comparison metric. The following metrics have been
calculated using equation (4-12) and the AVIRIS Hawaii results from Table 4-12 as the
reference data. The throughput Weissman score results, shown in Figure 5-4, highlight the
increased benefit that leveraging a larger number of image tiles has for the compression
of the multispectral Landsat imagery compared to the hyperspectral AVIRIS images. It is
postulated that the high combined throughput and compression ratio performance for the
Landsat images at higher tile numbers is due to several factors. Firstly, for each image
band the two-dimensional multispectral image dimensions are approximately twice those of
the hyperspectral images tested, resulting in around four times the number of pixels per
band.
As a result, the reduction in compression ratio caused by over segmentation is reduced,
and the compression ratio performance can be maintained for a larger number of tiles.
This is then combined with the fact that leveraging a larger number of tiles, for the
multispectral images, helps to achieve higher processing throughput because it exposes
greater TLP. This additionally helps to maximise the GPU resource utilisation, which is
inherently reduced due to the significantly lower number of spectral bands and DLP.
Overall, this results in a higher relative throughput Weissman score for higher numbers
of tiles.
 
Figure 5-4 GTX750Ti throughput Weissman Score comparison for all images (α = 1)
 
Another key insight shown in Figure 5-4 is the relationship between the number of tiles
and the peak throughput Weissman score for each image. For the hyperspectral AVIRIS
images the number of tiles for the peak throughput Weissman score corresponded to the
number of tiles at the peak of the exponential throughput trend. For the Landsat images
the peak throughput Weissman score would therefore be predicted to occur at 128 tiles.
However, for Landsat Agriculture and Mountain this relationship does not hold.
It is postulated that this is due to the compression ratio characteristics for these
images and the influence of compression ratio on the Weissman score. Correlating these
results with the compression ratio trends shown in Figure 5-1, Figure 5-2 and Figure 5-3,
Landsat Coast exhibits a much larger drop in compression ratio performance as the number
of tiles increases, whilst conversely for the Landsat Agriculture and Mountain images the
reduction in compression ratio for larger numbers of tiles is much smaller. As a result,
the calculated throughput Weissman scores for these images remain higher for larger
numbers of tiles. Since image compressibility is unpredictable without prior knowledge of
the image content and characteristics, it will not be possible to exactly predict the
number of tiles which results in the peak Weissman score in advance of image capture.
[Figure 5-4 plot content: x-axis: Number of Tiles (1 to 16384); y-axis: Weissman Score;
series: Landsat Agriculture, Landsat Coast, Landsat Mountain, AVIRIS Hawaii, AVIRIS
Maine, AVIRIS Yellowstone 00 and AVIRIS Yellowstone 03. Peak annotations:
Landsat Agriculture, 512x16x128: 0.52
Landsat Coast, 128x16x512: 0.69
Landsat Mountain, 512x8x256: 0.49
AVIRIS Hawaii, 16x614x32: 1.23
AVIRIS Maine, 16x680x32: 1.19
AVIRIS Yellowstone 00, 16x85x256: 0.77
AVIRIS Yellowstone 03, 16x85x256: 0.79]
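
The peak annotations in Figure 5-4 appear to follow the pattern "number of tiles x tile X size x tile Y size"; this reading is an inference from the numbers and is not stated in the text. Under that assumption, the sketch below derives the tile grid for a given image size, tile count and tile X size, using the 1024 x 1024 Landsat dimensions from Table 5-1. The function name is illustrative.

```python
def tile_grid(width, height, n_tiles, tile_x):
    """Return (tiles_across, tiles_down, tile_y) for equal rectangular tiles.

    Assumes the tiles exactly cover the image with no remainder.
    """
    if width % tile_x:
        raise ValueError("tile X size must divide the image width")
    tiles_across = width // tile_x
    if n_tiles % tiles_across:
        raise ValueError("tile count must be a multiple of the tiles per row")
    tiles_down = n_tiles // tiles_across
    if height % tiles_down:
        raise ValueError("image height must divide evenly into tile rows")
    return tiles_across, tiles_down, height // tiles_down


if __name__ == "__main__":
    # Landsat images are 1024 x 1024 pixels (Table 5-1).
    for label, n_tiles, tile_x in (("Agriculture", 512, 16),
                                   ("Coast", 128, 16),
                                   ("Mountain", 512, 8)):
        print(label, tile_grid(1024, 1024, n_tiles, tile_x))
```

For the three Landsat annotations this gives tile Y sizes of 128, 512 and 256 pixels respectively, which agrees with the labels in the figure.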
5.2    Low Power GPU Performance Evaluation 
In addition to evaluating the performance of the new SSC CCSDS-123 application on
multispectral imagery using the desktop GTX750Ti GPU, this research also examines and
compares the throughput achieved on the low power Jetson TX1 platform. Prior to examining
the experimental results it is possible to leverage the previously proposed equations
(4-9) and (4-10) on page 122 to examine the maximum number of concurrent tiles which can
be compressed on the GPU. As shown in Table 5-3, the calculated maximum number of
concurrent tiles for the Landsat imagery on the Jetson TX1 is 56, considerably less than
the 140 concurrent tiles able to be processed on the GTX750Ti platform; therefore there
is also an expected associated reduction in peak processing throughput for the Jetson TX1
platform.
 
Table 5-3 Landsat imagery, kernel and Jetson TX1 hardware characteristics
Imagery induced    CCSDS-123 kernel induced        Hardware induced characteristics
characteristics    characteristics                 (Jetson TX1)
Warps per block    Registers per (thread) & warp   Max registers per block   SMs per GPU
1                  (72) 2304                       65536                     2

Concurrent blocks per SM (4-9)    Maximum concurrent tiles per GPU (4-10)
28                                56
 
The experimental results for the compression throughput of the Landsat imagery on the
Jetson TX1 platform are given in Figure 5-5. These results exhibit similar trends to
those seen previously for the AVIRIS hyperspectral imagery, whereby there is a
significant reduction in the peak throughput achieved on the Jetson TX1 platform,
especially for high numbers of image tiles.
Unlike the AVIRIS images, the throughput for the Jetson TX1 remains close to that
achieved on the GTX750Ti, within 0.2 Gb/s, for fewer than 32 tiles. However, for more
than 32 tiles the throughput does not significantly increase, which concurs with the
prediction that the peak in exponential throughput is related to the maximum number of
concurrent tiles which can be processed on the GPU, where 32 tiles is the closest tested
number of tiles less than or equal to this value of 56.
Figure 5-5 Tiled Landsat imagery GTX750Ti and Jetson TX1 GPU comparison results (panels A to C)
 
The data rate of the Landsat imager which produced this imagery is 440 Mb/s [146],
therefore a processing throughput of at least this would be required on the Jetson TX1 to
demonstrate the feasibility of real-time onboard multispectral GPU data processing. From
the results presented in Figure 5-5, a minimum of 256 image tiles would need to be
leveraged to achieve this level of processing throughput for the Landsat imager. The data
rate of future state-of-the-art multispectral imagers will likely exceed this level, and
the results in Figure 5-5 show that exploiting image tiling alone may not be sufficient
to achieve real-time data processing throughput for such imagers.
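
The real-time check described above amounts to finding the smallest tile count whose measured throughput meets the imager data rate. The sketch below shows that selection; the throughput values in it are made-up placeholders chosen only to exercise the function and are not read from Figure 5-5. The function and variable names are illustrative.

```python
def min_tiles_for_real_time(throughput_by_tiles, data_rate_gbps):
    """Smallest tested tile count whose throughput meets the data rate.

    Returns None when no tested configuration reaches real-time throughput.
    """
    for n_tiles in sorted(throughput_by_tiles):
        if throughput_by_tiles[n_tiles] >= data_rate_gbps:
            return n_tiles
    return None


if __name__ == "__main__":
    landsat_rate_gbps = 440 / 1000  # 440 Mb/s imager data rate [146]
    # PLACEHOLDER throughputs in Gb/s per tile count (hypothetical, not measured).
    example_throughput = {64: 0.30, 128: 0.40, 256: 0.50}
    print(min_tiles_for_real_time(example_throughput, landsat_rate_gbps))
```

The same function returns None when no tested tile count reaches the required rate, which is the situation anticipated above for imagers with higher data rates.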
 
5.3    New Multispectral Imagery Optimised GPU Application 
The throughput results shown in Sections 5.1 and 5.2 highlight the significantly reduced
performance for the multispectral data compared with the previous hyperspectral data.
This is due to the inherently reduced DLP that the implementation is reliant upon for
parallelised processing. As the majority of EO data produced today is multispectral,
typically with between 3 and 10 bands, this section will investigate processing
throughput optimisation approaches specifically for multispectral imagery.
[Figure 5-5 plot content: panel titles: Landsat Agriculture, Landsat Coast and Landsat
Mountain; x-axis: Number of Tiles (1 to 16384); y-axes: Throughput (Gb/s) and Compression
Ratio; series: GTX750Ti and Jetson TX1 throughput, and compression ratio, for tile X
sizes of 1024, 512, 256, 128, 64, 32, 16 and 8 pixels.]
5.3.1  CCSDS-123 Kernel Organisation – Nested Parallelism 
In CUDA 5.0 (October 2012) the concept of dynamic parallelism was introduced to 
the CUDA programming model. This allows nested parallelism, where present in an algo-
rithm, to be exploited as dynamically launched kernels. Kernels are now able to launch 
new kernels and spawn new threads on the GPU independently from the host. Previously, 
programmers were restricted to a flat programming model, but using dynamic parallelism, 
nested and varying degrees of parallelism can be exploited to allow for a more run-time 
adaptive parallel implementation.  
There is a small degree of nested parallelism available in the implementation of the 
CCSDS-123 algorithm. Therefore, the potential advantages of applying the new dynamic 
parallelism CUDA programming model were investigated. The specific nested parallel-
ism exploited was present in the calculation of the local difference vector, the update of 
the weight vector and the dot product calculation of these two vectors. However, in the 
implementation proposed in this thesis, the weight and local difference vectors are stored 
in the low latency on-chip shared memory. This memory is local to a single kernel block 
of threads and, as launching a new kernel creates a new thread block, shared memory 
cannot be used to pass data between nested kernels. As a result, the dynamically 
parallelised version of the CCSDS-123 implementation achieved a significantly lower 
throughput than the previous version. Using evidence from the profiling results, it was concluded that 
the amount of nested parallelism is too small to overcome both the additional kernel 
launch overhead and the performance loss from not being able to utilise the low latency 
shared memory. 
 
Design Rule G: Nested parallelism should be exploited using dynamic kernels when the 
amount of nested parallelism is sufficient to overcome the kernel launch overhead.  
 
5.3.2  CCSDS-123 Kernel Configuration – Warp Efficiency 
For multispectral imagery, the number of spectral bands provides insufficient DLP for 
high throughput processing, and although image tiling has also been exploited, the 
processing throughput for tiled multispectral imagery is still significantly reduced when 
compared to hyperspectral results. For multispectral images, the restriction in inherent 
parallelism is related to the number of image bands, which for the GPU hardware 
implementation translates to the number of threads per block. As the GPU hardware 
architecture is based on a SIMT execution model, instructions are always issued with a 
granularity equal to the warp size of 32. For the Landsat agriculture test case, the image 
has 6 bands, which means that the kernel will be launched with 6 threads per block. As a 
result, the warp execution efficiency will be low, as only 6 out of a possible 32 concurrent 
threads will execute valid instructions. Therefore, an optimisation would be to increase 
the number of threads issued per block.  
Due to the introduction of image tiling there are now two axes of parallelism, equal 
to the number of spectral bands and the number of image tiles, which are consistent for 
the whole CCSDS-123 algorithm. The most efficient methodology for leveraging paral-
lelism on the GPU is firstly by invoking sufficient parallel execution threads and secondly 
through the declaration of multiple concurrent blocks of threads.  
In the initial SSC CCSDS-123 kernel, image tiling is leveraged at the thread block 
level. However, for multispectral imagery it is proposed that utilising image tiling to in-
crease the number of threads per block rather than the number of blocks per kernel could 
have a greater positive impact on the achieved processing throughput. For multispectral 
imagery, this approach would increase the number of active threads per warp and number 
of active warps for increased GPU utilisation. This approach is demonstrated using three 
configuration cases, A, B and C shown in Figure 5-6. 
 
 
Figure 5-6 Leveraging image tiling to increase warp execution efficiency 
[Figure 5-6 diagram: for each configuration, thread blocks 0 to N are drawn as warps of 32 
lanes (0 to 31), with a key marking each lane as an active or inactive thread. With 1 tile 
per block the blocks run from 0 to (#Tiles - 1) and each block holds one warp with 6 active 
threads; with 2 tiles per block the blocks run from 0 to (#Tiles/Tiles per Block) - 1 and 
each block holds one warp with 12 active threads; with 8 tiles per block the blocks run 
from 0 to (#Tiles/Tiles per Block) - 1 and each block holds two warps with 48 active 
threads in total.] 
Configuration A: 1 Tile per Block, 6 Image Bands - Warp Execution Efficiency = (6/32)x100 = 18.75%
Configuration B: 2 Tiles per Block, 6 Image Bands - Warp Execution Efficiency = (12/32)x100 = 37.5%
Configuration C: 8 Tiles per Block, 6 Image Bands - Warp Execution Efficiency = (48/64)x100 = 75%
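
The percentages quoted for configurations A, B and C can be reproduced with a short 
calculation. The following Python sketch is illustrative only and is not part of the SSC 
CCSDS-123 application; the function name is hypothetical, and it assumes that every 
launched thread executes valid instructions, so it gives an upper bound and ignores the 
branch divergence which a profiler would also capture. 

```python
import math

WARP_SIZE = 32  # SIMT issue granularity stated in the text


def warp_execution_efficiency(tiles_per_block, image_bands, warp_size=WARP_SIZE):
    """Percentage of warp lanes holding an active thread within one block.

    Hypothetical helper: one thread is assumed per (tile, band) pair, and the
    block is padded up to a whole number of warps.
    """
    active_threads = tiles_per_block * image_bands
    warps_per_block = math.ceil(active_threads / warp_size)
    return 100.0 * active_threads / (warps_per_block * warp_size)


if __name__ == "__main__":
    # Configurations A, B and C from Figure 5-6 (6 image bands).
    for label, tpb in (("A", 1), ("B", 2), ("C", 8)):
        print(f"Configuration {label}: {warp_execution_efficiency(tpb, 6):.2f}%")
```

For 6 image bands this model returns 18.75%, 37.5% and 75% for 1, 2 and 8 tiles per block, 
and reaches 100% once the threads per block is a multiple of 32, for example 16 tiles per 
block (96 threads). 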
This new approach has been implemented as an additional adaptation of the new SSC 
CCSDS-123 GPU application by modifying the CCSDS-123 kernel to allow the compression 
of independent image tiles within a single block of threads. It is described using 
pseudo code in Algorithm 5, where lines 4 – 6 specifically demonstrate how the tiles per 
block (TPB) value impacts the block and thread configuration and also the amount of 
shared memory required by the CCSDS-123 kernel. 
 
 
To practically demonstrate the impact of increasing the number of TPB, and hence 
the active threads per warp, an initial experiment using the Landsat Agriculture image 
with a tiled configuration of 128 tiles (16x512) has been conducted. This configuration 
was chosen because it achieved the highest Weissman score in Figure 5-4. The profiling 
and execution time results for several different TPB configurations have been compiled 
for the modified SSC CCSDS-123 kernel and are given in Table 5-4. These results show 
that there is a trade-off to be made between simultaneously maximising the warp execu-
tion efficiency, GPU occupancy and the number of blocks to achieve peak performance.  
  
Algorithm 5 – SSC CCSDS-123 GPU Application for Tiled Imagery and TPB parameter  
 
Host Code 
1.   Initialise_H_mem 
2.   Initialise_G_mem 
3.   Copy_H2G_mem(*H_IN, *G_IN) 
 
4.   s_mem = (2 * sizeof(int) * IM_BANDS * weights_len * TPB) 
5.   blocks = num_tiles / TPB 
6.   threads = IM_BANDS * TPB 
7.   CCSDS-123<<<blocks, threads, s_mem>>>(*G_IN, *G_LEN, *G_CWRD) 
 
8.   threads = IM_BANDS(log(IM_BANDS)) 
9.   INCLUSIVE_SUM<<<threads>>>(*G_LEN) 
 
10.  threads = IM_BANDS x IM_HEIGHT x IM_WIDTH 
11.  BIT_PACKER<<<threads>>>(*G_CWRD, *G_LEN, *G_OUT) 
12.  Copy_G2H_mem(*H_OUT, *G_OUT) 
13.  Free G_mem 
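
Lines 4 – 6 of Algorithm 5 can be checked numerically with a small host-side model. The 
Python sketch below is illustrative rather than the application code: the names 
LaunchConfig and ccsds123_launch_config are hypothetical, sizeof(int) is assumed to be 
4 bytes, and the requirement that the number of tiles is a multiple of TPB is an added 
assumption so that the integer division on line 5 does not drop any tiles. 

```python
from dataclasses import dataclass

SIZEOF_INT = 4  # bytes; assumes a 32-bit int


@dataclass(frozen=True)
class LaunchConfig:
    blocks: int
    threads_per_block: int
    shared_mem_bytes: int


def ccsds123_launch_config(num_tiles, im_bands, weights_len, tpb):
    """Mirror lines 4 to 6 of Algorithm 5 for a given tiles per block value."""
    if tpb < 1 or num_tiles % tpb != 0:
        # Assumption of this sketch, not stated in Algorithm 5.
        raise ValueError("num_tiles must be a positive multiple of TPB")
    return LaunchConfig(
        blocks=num_tiles // tpb,                       # line 5
        threads_per_block=im_bands * tpb,              # line 6
        shared_mem_bytes=2 * SIZEOF_INT * im_bands * weights_len * tpb,  # line 4
    )


if __name__ == "__main__":
    # 128 tiles and 6 bands as in the Landsat Agriculture experiment;
    # weights_len=4 is a placeholder value chosen only for illustration.
    for tpb in (1, 2, 4, 8, 16):
        print(tpb, ccsds123_launch_config(128, 6, weights_len=4, tpb=tpb))
```

With 128 tiles and 6 bands, a TPB value of 4 gives 32 blocks of 24 threads, and the shared 
memory requested per block grows linearly with TPB. 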
Table 5-4 Multispectral TPB kernel testing and profiling results 
Tiles /   Threads /   Number of   Warp execution   Achieved    Kernel Execution 
Block     Block       Blocks      efficiency       Occupancy   Time (ms) 
1         6           128         18.5%            38.1%       29.051 
2         12          64          37.5%            19.8%       22.246 
4         24          32          75.0%            9.8%        17.755 
8         48          16          75%              9.8%        17.907 
16        96          8           100%             7.6%        20.265 
32        192         4           100%             9.4%        20.856 
64        384         2           100%             18.7%       26.364 
128       768         1           100%             37.5%       41.322 
 
Assessing each of these test cases in more depth using the profiling tools shows that, 
in order to maintain an increase in processing throughput, it is key to ensure that warp 
execution efficiency is maximised whilst also maintaining a sufficient number of 
concurrent blocks. The stall reasons shown in Figure 5-7 show that around two thirds of 
the time a warp is stalled, it is because of an execution dependency rather than GPU 
resource utilisation. Therefore, maintaining a number of blocks allows the GPU to leverage 
context switching, so that when one warp is stalled an alternative warp from a separate 
block can be executed on the available resources to hide the execution latencies. 
 
 
 
Figure 5-7 Landsat Agriculture CCSDS-123 kernel warp stall reasons 
 
 
As a result of these kernel characteristics, it is likely that as the value of the TPB pa-
rameter is increased to increase the warp execution efficiency, it will also be essential to 
increase the number of tiles. This is to maintain a large enough pool of warps which the 
GPU can select from to hide stall latencies.  
Therefore, the number of tiles deemed the appropriate trade-off between throughput 
and compression ratio previously shown in Figure 5-4 will change when increasing the 
new TPB parameter.  
 
Design Rule H: Ensure the block workload and number of threads is defined to maximise 
warp execution efficiency. TLP can be leveraged to increase warp efficiency when DLP is 
inherently limited, by modifying the kernel organisation. 
 
5.4    Multispectral Optimised Application Performance Evaluation 
To evaluate the proposed multispectral specific optimisation of the CCSDS-123 ker-
nel and its impact on the achievable processing throughput for the application as a whole, 
several compression experiments, with varying tiling parameters and tiles per block pa-
rameter values, have been conducted. These results can be used to further assess how the 
new TPB parameter impacts the achieved throughput and assess the relationship between 
the parameter and the underlying GPU architecture. The measured performance results 
for the three Landsat images previously used, from the CCSDS image corpus [143][144] 
(thumbnails of these images can be found in Appendix G), are given in Figure 5-8, 
Figure 5-9 and Figure 5-10.   
 
Figure 5-8 Landsat Agriculture TPB throughput results (GTX750Ti) 
[Figure 5-8 plot: Landsat Agriculture; x-axis: Number of Tiles (1 to 16384); y-axis: 
Throughput (Gb/s); series: Tiles per Block (Threads per Block) of 1 (6), 2 (12), 4 (24), 
8 (48), 16 (96) and 32 (192).] 
  
Figure 5-9 Landsat Coast TPB throughput results (GTX750Ti) 
 
 
Figure 5-10 Landsat Mountain TPB throughput results (GTX750Ti) 
 
[Figure 5-9 and Figure 5-10 plots: Landsat Coast and Landsat Mountain; x-axis: Number of 
Tiles (1 to 8192); y-axis: Throughput (Gb/s); series: Tiles per Block (Threads per 
Block).] 
 
These three figures highlight the influence the new TPB parameter value has with 
varying numbers of image tiles. For all images, up to 64 image tiles, varying the TPB 
value (represented by the different series on the plots) has a minimal impact on the 
achieved processing throughput of the algorithm. The major improvement in throughput 
performance from leveraging an increased TPB value occurs when 128 or more image tiles 
are leveraged. 
Examining the trends presented in these figures further, there appears to be an indirect 
relationship between increased TPB value and processing throughput, whereby between 64 
and 1024 tiles a TPB value of 4 consistently achieves the peak processing throughput, and 
for 2048 tiles upwards a TPB value of 16 achieves the peak throughput. Relating these 
trends to the underlying hardware architecture, it can be concluded that the TPB value 
which achieves the peak throughput performance represents the optimum trade-off between 
exploiting TLP at either the thread or block level. Additionally, it was found that TPB 
values of 32 or higher do not provide any processing throughput advantage for the number 
of tiles exposed for the Landsat test imagery. 
An additional trend which can be observed in Figure 5-8 - Figure 5-10 is the shift, with 
increasing TPB value, in the number of tiles which results in the peak of the initial 
exponential throughput trend. For example, for a TPB value of 1 this trend point occurs at 
128 tiles, whilst for a TPB value of 2 this occurs at 256 tiles. This shows that increasing 
the TPB value allows us to efficiently leverage a greater number of tiles before reaching 
the peak of the exponential portion of the throughput trend. This relationship is due to the 
connection between TPB value and number of thread blocks. Namely, for a TPB value of 
1, the number of tiles is equal to the number of GPU thread blocks; however, when the 
TPB value is increased, the number of thread blocks is equal to the number of tiles 
divided by the number of tiles per block. To demonstrate this relationship, Figure 5-11 
plots the same throughput data from Figure 5-8 but against the number of blocks on the X 
axis.  
 
Figure 5-11 Landsat Agriculture TPB throughput with number of blocks (GTX750Ti) 
 
Figure 5-11 shows that for TPB values of 2 to 4, the throughput follows a similar 
trend as the original 1 TPB implementation, with the peak of the exponential increase oc-
curring at 128 blocks. However, for TPB values of 8 and 16 this value occurs at 64 and 32 
blocks respectively. This is because for these TPB values the number of warps is in-
creased, and as shown in equations (4-9) and (4-10) on page 122 this directly impacts the 
number of concurrent blocks per GPU. As the number of warps and TPB value are con-
figurable and therefore known at run-time, they can be taken into account to pre-
determine the number of blocks and number of tiles which will result in the peak of the 
exponential throughput trend, based on previous equations (4-9) and (4-10).  
The new proposed equation to calculate this trend, taking into account the new TPB 
parameter, is given in equation (5-1). Additionally, equation (5-2) can be used to 
calculate the number of tiles specific to the power of two image tiling configuration used 
for the Landsat images. To demonstrate their use, Table 5-5 gives the theoretically 
calculated numbers of blocks and tiles which should correspond to the end of the 
exponential throughput trend for the Landsat imagery for different TPB values. Comparing 
these calculated values with the results given in Figure 5-8 – Figure 5-10, these equations 
provide a good fit to the actual trends observed experimentally.  
[Figure 5-11 plot: Landsat Agriculture; x-axis: Number of Blocks (1 to 16384); y-axis: 
Throughput (Gb/s); series: Tiles per Block (Threads per Block) of 1 (6), 2 (12), 4 (24), 
8 (48), 16 (96) and 32 (192).] 
 
\[ \text{Maximum number of concurrent tiles per GPU} = \mathrm{TPB} \times \text{Concurrent blocks per GPU} \tag{5-1} \]

\[ \text{Maximum number of concurrent Landsat tiles} = \mathrm{TPB} \times 2^{\lfloor \log_2(\text{Concurrent blocks per GPU}) \rfloor} \tag{5-2} \]
 
Table 5-5 Calculated maximum number of concurrent blocks and tiles 
TPB   Concurrent blocks   Concurrent Tiles   Concurrent Landsat Tiles 
      equation (4-9)      equation (5-1)     equation (5-2) 
1     140                 140                128 
2     140                 280                256 
4     140                 560                512 
8     70                  560                512 
16    45                  560                512 
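
As a worked illustration of equations (5-1) and (5-2), the Python sketch below evaluates 
both for TPB values of 1 to 8. It is a minimal sketch under stated assumptions: the 
function names are hypothetical, and the concurrent block counts are simply copied from 
the equation (4-9) column of Table 5-5 rather than recomputed from the hardware and kernel 
characteristics. 

```python
def max_concurrent_tiles(tpb, concurrent_blocks):
    """Equation (5-1): TPB multiplied by the concurrent blocks per GPU."""
    return tpb * concurrent_blocks


def max_concurrent_landsat_tiles(tpb, concurrent_blocks):
    """Equation (5-2): as (5-1), with the block count rounded down to a power of two.

    For a positive integer n, 1 << (n.bit_length() - 1) equals 2**floor(log2(n)).
    """
    if concurrent_blocks < 1:
        raise ValueError("concurrent_blocks must be at least 1")
    return tpb * (1 << (concurrent_blocks.bit_length() - 1))


if __name__ == "__main__":
    # Concurrent blocks per TPB value, taken from Table 5-5 (GTX750Ti).
    gtx750ti_blocks = {1: 140, 2: 140, 4: 140, 8: 70}
    for tpb, blocks in gtx750ti_blocks.items():
        print(tpb, blocks,
              max_concurrent_tiles(tpb, blocks),
              max_concurrent_landsat_tiles(tpb, blocks))
```

The rounding step reflects the power of two tiling used for the Landsat images: 140 
concurrent blocks round down to 128 and 70 round down to 64, so both a TPB value of 4 and 
a TPB value of 8 give 512 concurrent Landsat tiles. 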
 
As performed for the previous image test experiments, the Weissman scores for the 
different compression configurations have been calculated for the Landsat images across 
different TPB values and number of tiles and are shown in Figure 5-12. These results 
show the significant advantage of increasing the TPB value from 1 to 2 and also 2 to 4, 
and the subsequent decreasing advantage of increasing TPB value beyond a value of 4.  
 
 
 
Figure 5-12 Landsat images throughput Weissman score for GTX750Ti (panels A to C) 
[Figure 5-12 plots: Landsat Agriculture, Landsat Coast and Landsat Mountain; x-axis: 
Number of Tiles (64 upwards); y-axis: Weissman Score; series: Tiles per Block (Threads 
per Block) of 1 (6), 2 (12), 4 (24), 8 (48) and 16 (96).] 
 
Overall, the Weissman score results for the Landsat images and different tiles per 
block configurations highlight the increase in overall performance which can be achieved 
by ensuring that the configuration of number of threads and blocks suits the underlying 
GPU hardware topology. The experimental Weissman score results can be compared with 
the calculated values for the number of tiles which results in the end of the exponential 
throughput period in Table 5-5. This comparison shows that as the TPB value increases, 
so does the number of tiles which results in the peak Weissman score. This highlights that 
the increased processing throughput performance obtained by increasing the TPB value is 
greater than the reduction in compression ratio with increased numbers of tiles. 
 
5.5    Low Power GPU Application Performance 
The compression of the multispectral Landsat images on the low power Jetson TX1 
GPU platform has also been performed. This allows us to assess the scalability of the 
previously proposed equations, which calculate the number of concurrent blocks and tiles 
that can be compressed on the GPU, a value which has been shown to correspond to the peak 
of the exponential throughput trend. The calculated values for the Landsat imagery and 
varying TPB values on the Jetson TX1 GPU are presented in Table 5-6 and have been 
calculated using equation (5-2). Then in Figure 5-13, Figure 5-14 and Figure 5-15, the 
experimental results for the compression of the three Landsat test images on the Jetson 
TX1 platform and for varying TPB values are presented.  
 
Table 5-6 Jetson TX1 calculated maximum concurrent tiles for Landsat images  
TPB   Concurrent blocks   Concurrent tiles   Concurrent Landsat tiles 
      equation (4-9)      equation (5-1)     equation (5-2) 
1     56                  56                 32 
2     56                  112                64 
4     56                  224                128 
8     28                  224                128 
16    18                  224                128 
 
 
 
Figure 5-13 Landsat Agriculture TPB testing for GTX750Ti and Jetson TX1  
 
 
Figure 5-14 Landsat Coast TPB testing for GTX750Ti and Jetson TX1  
 
[Figure 5-13 and Figure 5-14 plots: Landsat Agriculture and Landsat Coast; x-axis: Number 
of Tiles (1 to 8192); y-axis: Throughput (Gb/s); series: GTX750Ti and Jetson TX1, each 
for Tiles per Block (Threads per Block) of 1 (6), 2 (12), 4 (24), 8 (48) and 16 (96).] 
 
Figure 5-15 Landsat Mountain TPB testing for GTX750Ti and Jetson TX1  
[Figure 5-15 plot: Landsat Mountain; x-axis: Number of Tiles (1 to 8192); y-axis: 
Throughput (Gb/s); series: GTX750Ti and Jetson TX1, each for Tiles per Block (Threads per 
Block) of 1 (6), 2 (12), 4 (24), 8 (48) and 16 (96).] 
 
Figure 5-13 - Figure 5-15 show the same trends in throughput for the Jetson TX1 
with both varying numbers of tiles and TPB value for all tested images. This highlights 
that the relationship between throughput and these parameters is mostly influenced by 
hardware characteristics, as also observed in testing on the GTX750Ti. The trends in 
throughput for each TPB value follow those predicted in Table 5-6, whereby no significant 
increases in throughput are observed beyond the use of 256 tiles for each TPB value. 
Beyond this number of tiles, increasing the TPB value has a much greater impact on 
throughput, typically in the region of a two times speed up when comparing a TPB value 
of 1 and a TPB value of 16. 
 
5.6    Chapter 5 Summary 
Firstly, in this Chapter, the compression ratio and processing throughput performance 
of the new SSC CCSDS-123 GPU accelerated image compression application for an 
additional multispectral data set has been investigated. This research has shown that due 
to the differences in the underlying characteristics between the multispectral and 
previously used hyperspectral data sets, the trends and relationships between the 
compression ratio, processing throughput and image tiling parameters can differ. For the 
multispectral imagery tested it was found that, in an opposite trend to the tested 
hyperspectral imagery, increased image tiling resulted in an increase in compression ratio; 
however, this was on average 0.06, which is of very low significance. The achieved 
processing throughput increased considerably, up to 66 times, with increasing image tiles. 
However, despite this relative increase in throughput from image tiling, the absolute 
processing throughput achieved peaked at approximately 1.3 Gb/s, which was considerably 
lower than for the hyperspectral imagery previously tested, which achieved a peak 
throughput of 8 Gb/s. This is due to the inherently reduced DLP of the multispectral 
imagery. On the low power Jetson TX1 platform this performance translated to being only 
40 Mb/s greater than the data rate of the Landsat imager. As a result, achieving real-time 
data rates for state-of-the-art future payloads would be challenging.  
In order to address this, several alternative parallelisation techniques toward a 
multispectral specific optimised approach for the new SSC CCSDS-123 application were 
researched. Details on the new adapted application, which leverages the TLP induced by 
image tiling to increase the warp efficiency of the main kernel, and two additional generic 
application Design Rules G and H, were presented and discussed. Following on from this, 
further experimental testing was conducted to evaluate the new multispectral optimised 
approach and to assess the relationships between the application performance, new TPB 
parameter and GPU architecture for several image tiling cases. The new multispectral 
optimised approach was demonstrated to be able to increase the achievable processing 
throughput on the Jetson TX1 platform to over 2.5 times the data rate of the Landsat 
imager. The overall trends and findings from this study and how they relate back to the 
previously defined research questions are summarised in Table 5-7.  
Overall, the results presented in this Chapter demonstrate the advantages of follow-
ing an iterative design and optimisation process whereby the parallelisation approach is 
updated to suit both the underlying hardware architecture and input data. New experi-
ments on the Jetson TX1 platform have also allowed us to demonstrate the capabilities of 
onboard representative GPU hardware, image processing algorithm and input data for re-
al-time state-of-the-art onboard image processing.  
Table 5-7 SSC CCSDS-123 case study Chapter 5 findings summary 
1.a  How do application parameters impact the algorithm and application performance? 
- Image tiling has a greater influence than algorithm parameters on both compression 
ratio and processing throughput. 
- The impact of image tiling on the achieved compression ratio is dependent on the data 
characteristics – see 1.d.  
- The greater the number of image tiles, the greater the processing throughput. On 
average, the throughput increase from no tiling to 512 image tiles was 66 times for 
the GTX750Ti and 26 times for the Jetson TX1. 
- The new TPB parameter can be leveraged to increase the achievable data processing 
throughput depending on the data characteristics – see 1.d. This parameter has no impact 
on the compression ratio. 
1.b  How does the application performance relate to the GPU architecture?  
- Increasing the number of tiles results in a significant increase in throughput until a 
certain point, beyond which there are diminishing returns in the throughput increase. The 
number of tiles resulting in diminishing returns in processing throughput is directly 
related to the maximum number of concurrently executed image tiles. 
- New equation (5-1) was proposed, based on data, kernel, hardware and TPB 
characteristics, to calculate the maximum number of concurrent image tiles per GPU.  
- The warps per block and SMs per GPU are key characteristics which influence the 
peak processing throughput and maximum number of concurrent thread blocks.  
1.c  Can these relationships be leveraged to predict an optimum configuration in advance? 
- The throughput Weissman Score has been proposed (equation (4-12)) as an appropriate 
trade-off metric to compare different configurations and facilitate parameter 
selection. 
- The resulting maximum number of concurrent tiles value has been shown to correlate 
well with the peak processing throughput and peak throughput Weissman Score 
configuration; therefore equation (5-1) can be leveraged to help determine in advance the 
optimum tiling configuration for a specific application and GPU hardware. 
1.d  How do data characteristics impact the algorithm and application performance?  
- Image content differences have been shown to have a small impact on the achieved 
compression ratio but no significant impact on the achieved processing throughput. 
- Image bit depth does not have a significant impact on the compression ratio but does 
significantly influence the achieved processing throughput.  
- For low – medium spatial resolution imagery (e.g. AVIRIS) image tiling reduces the 
achieved compression ratio. The decrease in compression ratio between no image tiling 
and 512 image tiles was on average 0.17. 
- For high – very high spatial resolution imagery (e.g. Landsat) image tiling can be 
leveraged to increase the achieved compression ratio. The increase in compression ratio 
between no image tiling and the peak at 128 image tiles was on average 0.06. 
- Data with a low number of spectral bands (multispectral) exhibit lower DLP and 
therefore lower achieved processing throughput. The new TPB parameter can be leveraged 
with image tiling to further increase the processing throughput for this type of 
imagery. 
Rebecca L. Davidson                Chapter 6 , Error Resilient GPU Accelerated CCSDS-123 Compression  
- 155 - 
CHAPTER 6, ERROR RESILIENT GPU ACCELERATED CCSDS-123 COMPRESSION  
Ionising radiation particles, and the effects of the collisions of these particles with 
electronic devices, are one of the most challenging and important concerns for electronics 
design in the space environment. Traditional space systems feature either devices 
manufactured using error resilient processes or devices deployed in conjunction with 
proven software based error mitigation techniques, such as TMR or EDAC algorithms. 
However, GPUs are not currently manufactured to be radiation hardened at a hardware 
level, and due to the relative immaturity of GPU computing in error intolerant 
applications, such as onboard payload data processing, the behaviour of GPUs in a radiation 
prone environment is currently largely undocumented. Additionally, very little research 
exists on the practical demonstration of radiation mitigation techniques for GPU 
architectures or multithreaded software applications. This presents a critical gap in 
research which needs to be better understood in order for GPUs to achieve widespread 
adoption in onboard satellite or safety critical data processing. 
 
6.1    CCSDS-123 Algorithm Assessment 
To develop suitable radiation mitigation techniques, the inherent error resilience of a 
system needs to be classified. For GPU applications, error resilience is a function of both 
the hardware (how a radiated particle interacts with the physical device) and the software 
(how an error propagates within a parallel application to its output). Within the software, 
error resilience is closely associated with the underlying application being 
implemented. To date there have not been any ABFT CCSDS-123 algorithm specific error 
protection schemes proposed. Whilst an ABFT approach often provides error 
resilience and minimised memory and processing throughput overheads, the techniques 
often cannot be easily applied to different algorithms. Therefore, this research leverages 
appropriate generic error protection techniques and investigates how specific aspects of 
the highly parallel GPU architecture and software model can be leveraged to minimise the 
induced overheads. To practically assess the inherent error resilience of both the 
algorithm and the new GPU application, software based error injection testing is conducted. 
  
6.2    GPU Application Error Injection Testing 
As discussed in the literature review, Sections 2.4.3, 2.4.4 and 2.4.5, there are two 
major methods for inducing error events to characterise error resilience: beam testing and 
software based error injection. As software based error injection provides a cheap and 
time efficient framework to replicate the effects of physical hardware faults, this approach 
is leveraged in this research to assess the inherent error resilience of the GPU application 
to help inform the design of appropriate error mitigation techniques. The limitation of 
software-based error injection is that it is only able to manipulate architecturally visible 
states and software accessible blocks. Therefore, it is important to understand that the re-
sults from this testing may not provide a full view of all possible error cases or complete 
error probabilities. 
This section presents and analyses the results from the software based error injection 
study conducted using the CCSDS-123 GPU application described in Chapter 4 and 
Chapter 5. The testing simulates how the software application behaves when an error 
occurs that alters instructions or memory. The errors induced are representative of the 
anticipated behaviour of a GPU due to a SEE induced by the space environment. The 
results from the error injection experiments help increase the understanding of error 
resilience in terms of SDC and FI errors.  
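
The three outcome categories used throughout this chapter can be illustrated with a minimal Python sketch. It assumes that a run which fails to complete (crash or hang) is counted as a FI, that a completed run whose output differs from an error-free "golden" output is counted as an SDC, and that everything else is masked. The function names and the example byte strings are hypothetical and are not part of the SASSIFI framework.

```python
from collections import Counter

def classify(completed, output, golden):
    """Classify one injection run (assumed definitions, see lead-in)."""
    if not completed:
        return "FI"        # run crashed or hung before producing output
    if output != golden:
        return "SDC"       # run finished but the output is silently wrong
    return "Masked"        # injected error had no visible effect

def outcome_rates(runs, golden):
    """Return the percentage of SDC, FI and masked outcomes."""
    counts = Counter(classify(done, out, golden) for done, out in runs)
    return {k: 100.0 * counts[k] / len(runs) for k in ("SDC", "FI", "Masked")}

# Hypothetical results of four injection runs against a golden output.
golden = b"\x01\x02\x03"
runs = [
    (True, b"\x01\x02\x03"),   # masked
    (True, b"\x01\x02\x07"),   # silent data corruption
    (False, None),             # functional interrupt
    (True, b"\x01\x02\x03"),   # masked
]
rates = outcome_rates(runs, golden)
print(rates)
```

In practice each entry in `runs` would come from one application execution with a single injected error, as is the case for the experiments described in this section.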
In this research, the SASSIFI framework is utilised to perform software based error 
injections. The error injection tests have been performed on the GTX750Ti GPU; however, 
because SASSIFI injects errors at the architectural level, the results are valid for 
the compiled binary and are therefore specific to the GPU architecture rather than to 
the individual GPU device. Further advantages of this framework are that it is open 
source, has proven functionality and detailed documentation, and provides a flexible, 
wide coverage error model. SASSIFI provides three modes of operation, which inject 
errors into the instruction output value (IOV), instruction output address (IOA) and 
register file (RF). These three injection modes enable the error resilience, in terms 
of SDC, FI and masked error rates, to be determined for different instruction types and 
for the register file memory structure. The results can be used to identify high-level 
trends in error resilience between different kernels and between run-time and 
compile-time configurations.  
The modes can be used to inject errors into many different types of instructions, 
which can initially be broadly classified by where they output data to: general 
purpose registers (GPR), predicate registers (PR), conditional code (CC), or store (ST) 
instructions, which output data to memory structures such as shared memory and global 
memory. The type of error injected in each mode can also be varied. Compatible error 
models include single bit flip (SBF), double bit flip (DBF), random value and zero 
value errors. The modes and their possible error models are summarised in Table 6-1. 
 
Table 6-1 SASSIFI error injection framework operational modes description 
Testing Mode | Error Injection Location | Error Models 
Instruction Output Value (IOV) | Output value of a randomly selected instruction, i.e. incorrect value, correct memory location | All instruction groups: SBF, DBF, Random Value, Zero Value (single thread; all threads in a warp) 
Instruction Output Address (IOA) | Output address of a randomly selected instruction, i.e. correct value, incorrect memory location | GPR instructions: Random Value (single thread). Store instructions: SBF, DBF (single thread) 
Register File (RF) | Randomly selected Register File location and random in time | Single allocated register: SBF, DBF 
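
The four error models listed in Table 6-1 can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes a 32-bit value, and the choice of two distinct random bit positions for the DBF model is an assumption rather than a description of how SASSIFI selects bits. All function names are hypothetical.

```python
import random

WORD_BITS = 32                       # assumed width of the corrupted value
WORD_MASK = (1 << WORD_BITS) - 1

def single_bit_flip(value, rng):
    """SBF: invert one randomly chosen bit."""
    return (value ^ (1 << rng.randrange(WORD_BITS))) & WORD_MASK

def double_bit_flip(value, rng):
    """DBF: invert two distinct randomly chosen bits (assumed selection)."""
    first, second = rng.sample(range(WORD_BITS), 2)
    return (value ^ (1 << first) ^ (1 << second)) & WORD_MASK

def random_value(value, rng):
    """Random value: replace the value with a random 32-bit word."""
    return rng.getrandbits(WORD_BITS)

def zero_value(value, rng):
    """Zero value: replace the value with zero."""
    return 0

rng = random.Random(0)               # fixed seed so the sketch is repeatable
original = 0xDEADBEEF
for model in (single_bit_flip, double_bit_flip, random_value, zero_value):
    print(f"{model.__name__}: {model(original, rng):#010x}")
```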
 
 
 
The experimental testing in this thesis utilises all three injection modes. For IOV and 
IOA modes, the SBF error model for a single thread per warp is used, and for the RF 
mode a SBF error is injected into a single register. Because the SASSIFI framework sets 
the error rate to one error injection per run, the number of experimental repetitions 
needs to be chosen to achieve a given confidence level and confidence interval. For 
each of the experiments discussed in this thesis at least 370 injections were performed 
per test, to give a 95% confidence level and a confidence interval of 5%. This minimum 
number of error injections has been calculated using equation (6-1) [147].  
 
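Minimum number of injections (for a 95% confidence level and 5% confidence interval): 

\[
\text{Minimum number of injections} =
\frac{\dfrac{Z^{2} \times 0.25}{C^{2}}}
     {1 + \dfrac{\dfrac{Z^{2} \times 0.25}{C^{2}} - 1}{\text{Instructions}}}
\]

(6-1) [147] 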
 
Where:  
Z = Z score, 1.96 used to give a 95% confidence level 
C = Confidence interval expressed as a decimal 
Instructions = The total number of dynamic instructions in the application 
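
Equation (6-1) can be evaluated directly, as in the short Python sketch below. The function name and the two dynamic instruction counts in the usage lines are hypothetical; the instruction count of the actual application is not restated here. The result is rounded up, since a whole number of injection runs is required.

```python
import math

def min_injections(instructions, z=1.96, c=0.05):
    """Evaluate equation (6-1) and round up to a whole number of runs.

    instructions : total number of dynamic instructions in the application
    z            : Z score (1.96 for a 95% confidence level)
    c            : confidence interval expressed as a decimal
    """
    base = (z ** 2) * 0.25 / (c ** 2)          # numerator of equation (6-1)
    return math.ceil(base / (1 + (base - 1) / instructions))

# Hypothetical dynamic instruction counts, for illustration only.
for count in (10_000, 1_000_000_000):
    print(count, min_injections(count))
```

The denominator acts as a correction for a finite number of candidate injection sites, so the required number of injections grows with the instruction count but quickly levels off.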
 
 
The first stage of the SASSIFI error injection framework is full instruction level 
profiling of the application, which allows the tool to stochastically select the error 
injection sites. For more details on how the SASSIFI framework selects error injection 
sites to achieve accurate error distributions, the reader is referred to the 
publications from the developers of the SASSIFI framework [73][76].   
The experiments have been repeated for two test images, one hyperspectral and one 
multispectral. Because the error injection rate is fixed at one injection per 
application run while the application throughput varies between images, the error rates 
in absolute time simulated by the error injection experiments differ: 5.7 and 62.5 
errors per second for the Landsat Agriculture and AVIRIS Hawaii test images 
respectively. 
 
6.2.1  Register File Error Injection Results 
Using the RF specific error injection mode, the error resilience of the GPU register 
file memory structure can be assessed for a specific application usage. Figure 6-1 
details the register file mode error injection results for each kernel and the two 
image test cases. The results show a relatively small probability, between 15% and 20%, 
that a single bit-flip in an allocated register causes an SDC. Meanwhile, between 40% 
and 50% of error injections result in a FI of the kernel execution. These results 
demonstrate the importance of developing and introducing appropriate mitigating 
techniques to deal with FI error effects.  
  
 
 
Figure 6-1 RF error injection results  
 
RF error injections are performed randomly with respect to both run-time and the 
physical location of the allocated registers in each kernel. As a result, using 
knowledge of the number of allocated registers and the size of the register file, the 
SDC and FI AVFs can be calculated. The formulas used are given in equations (6-2) and 
(6-3). The resulting SDC and FI AVF values for both kernels are given in Table 6-2.  
Table 6-2 Kernel register usage and calculated AVFs 
Kernel | Register Usage Per Thread | SDC AVF | FI AVF 
CCSDS-123 | 72 | 0.31 % | 6.12 % 
Bit Packer | 19 | 0.06 % | 0.19 % 
  
These AVF results highlight the very low overall vulnerability of the Bit Packer kernel 
to both SDC and FI effects, which is due to its proportionally low register usage. The 
FI AVF for the CCSDS-123 kernel poses the greatest risk for the register file; however, 
it is still below 10%. The low SDC AVF for both kernels indicates that adding ECC to 
the register file is unlikely to achieve a good trade-off between error mitigation 
impact and processing throughput overhead. Due to the low probability of SDC effects 
occurring, low overhead detection and re-computation techniques may provide a better 
trade-off between error resilience and execution time overhead for the register file.   
[Figure 6-1 plot: bars for the CCSDS-123 and Bit Packer kernels, grouped by test image (Landsat Agriculture, AVIRIS Hawaii); x-axis: % of kernel error injections, 0 to 100; legend: SDC, FI, Mask] 
\[
\text{Average number of allocated registers} =
\text{Register Usage Per Thread} \times \text{Threads Per SM}
\]
(6-2) 

\[
\text{Average fraction of allocated registers} =
\frac{\text{Average number of allocated registers}}{\text{Registers per SM}}
\]
(6-3) 
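
A minimal Python sketch of equations (6-2) and (6-3) follows. The register usage of 72 per thread is taken from Table 6-2; the threads per SM and registers per SM values are hypothetical placeholders. The final step, scaling an outcome probability by the allocated fraction to obtain an AVF, is an assumption about how the two equations are combined and is not a formula stated in the text, so the printed numbers are not expected to reproduce Table 6-2.

```python
def allocated_fraction(regs_per_thread, threads_per_sm, regs_per_sm):
    """Average fraction of the register file that is allocated."""
    avg_allocated = regs_per_thread * threads_per_sm   # equation (6-2)
    return avg_allocated / regs_per_sm                 # equation (6-3)

def avf(outcome_probability, fraction_allocated):
    """Assumed AVF: probability of the outcome given a hit on an allocated
    register, scaled by the chance that a random hit lands on one."""
    return outcome_probability * fraction_allocated

# 72 registers per thread is from Table 6-2; the other inputs are hypothetical.
fraction = allocated_fraction(regs_per_thread=72,
                              threads_per_sm=256,
                              regs_per_sm=65536)
print(f"allocated fraction: {fraction:.5f}")
print(f"example AVF for a 50% outcome probability: {avf(0.5, fraction):.5f}")
```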
6.2.2  Application Level Error Injection Results 
In addition to assessing the error probabilities of the GPU register file memory 
region, the error resilience of the instructions which make up the software application 
can be assessed using the SASSIFI IOV and IOA error injection modes. Figure 6-2 gives 
the error injection results for the two instruction injection modes, where each stacked 
bar represents the proportion of SDC, FI and masked error effects as a percentage of 
the total number of errors injected into the CCSDS-123 CUDA application as a whole.  
 
Figure 6-2 CCSDS-123 application level error injection results 
 
For IOV mode, there is around a 40% probability that a single bit flip in the 
instruction output is masked, whilst for IOA mode this probability is lower at around 
25%. This shows that the address components of GPU instructions are significantly less 
resilient to errors than the output value components. The SDC probabilities for both 
modes are fairly consistent, in the region of 45% to 55%; the main disparity is between 
the FI probabilities for the two modes. For IOA mode the FI probability is 
approximately 25%, whilst for IOV mode it is much lower at around 10%. It is postulated 
that this is likely due to the inherent characteristics of the two instruction 
components and their relationship to the underlying GPU operation. For example, an 
error in the output address of an instruction can corrupt system or control memory and 
result in an execution failure in the GPU. Overall these results show that there are 
significant probabilities of both SDC and FI events for errors in either instruction 
values or instruction addresses; therefore complete computational instructions will 
need to be protected to enable the reliable use of GPUs in error prone environments. 
[Figure 6-2 plot: stacked bars per test image (Landsat Agriculture, AVIRIS Hawaii), grouped by injection mode (Instruction Output Value, Instruction Output Address, Register File); x-axis: % of application error injections, 0 to 100; legend: SDC, FI, Mask] 
These results, however, provide little insight into the underlying causes, their 
relationship to the algorithmic properties of the studied application, or how error 
resilience can best be increased. Therefore, further testing and analysis is performed 
to gain a greater insight into the inherent error resilience of each kernel, in order 
to evaluate how the application design influences the error resilience and to identify 
aspects of the GPU which can be leveraged to efficiently increase it. 
 
6.2.3  Kernel Level Error Injection Results 
There are two user kernels in the new application described in this thesis, namely 
CCSDS-123 and Bit Packer. The SASSIFI framework allows error injections to be inde-
pendently evaluated at the kernel level as well as from the top-level application 
perspective. The results from kernel level error injection testing for IOV and IOA modes 
are given in Figure 6-3. In Figure 6-3 each stacked bar represents the proportion of SDC, 
FI and masked error effects as a percentage of the errors injected into the respective GPU 
kernel independently.  
 
 
Figure 6-3 Kernel level error injection results 
 
 
[Figure 6-3 plot: stacked bars for the CCSDS-123 and Bit Packer kernels, grouped by test image (Landsat Agriculture, AVIRIS Hawaii) and injection mode (Instruction Output Value, Instruction Output Address); x-axis: % of kernel error injections, 0 to 100] 
Looking at the SDC error probabilities across both IOV and IOA tests, the results are 
relatively comparable across both kernels, showing that the SDC error probability 
appears to remain consistent despite the algorithmic differences between the kernels. 
The FI error probabilities are also consistent between the two kernels in IOV mode; 
however, for IOA error locations, the probability that a FI occurs is approximately 
three times greater in the CCSDS-123 kernel than in the Bit Packer kernel. It is 
postulated that this is likely due to a combination of differences in computational 
complexity, load and algorithmic characteristics. The Bit Packer kernel performs 
several fully parallelised operations in a relatively short computational pipeline and, 
as shown in Table 6-2, uses significantly fewer registers than the CCSDS-123 kernel. As 
a result, the active address space that could be corrupted by an erroneous instruction 
output address to cause a FI is likely to be significantly smaller, accounting for the 
lower FI error probability.  
Overall these results show that there is a probability of at least 55% that either a 
SDC or FI error occurs in each kernel. However, it is important to look at these 
results with respect to the overall application; therefore they have been weighted by 
the instruction occurrence of each kernel and are given in Figure 6-4.  
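
The weighting step can be sketched as follows, assuming that each kernel-level percentage is simply scaled by that kernel's share of the dynamically executed instructions. The 10:1 instruction ratio between the CCSDS-123 and Bit Packer kernels is the one quoted in this section; the kernel-level outcome percentages and all names in the code are hypothetical.

```python
def application_view(kernel_results, instruction_counts):
    """Scale per-kernel outcome percentages by each kernel's instruction share."""
    total = sum(instruction_counts.values())
    view = {}
    for kernel, outcomes in kernel_results.items():
        share = instruction_counts[kernel] / total
        for outcome, percent in outcomes.items():
            view[(kernel, outcome)] = percent * share
    return view

# Relative dynamic instruction counts (10:1 ratio quoted in the text).
instruction_counts = {"CCSDS-123": 10, "Bit Packer": 1}

# Hypothetical kernel-level results; each row sums to 100%.
kernel_results = {
    "CCSDS-123":  {"SDC": 50.0, "FI": 25.0, "Mask": 25.0},
    "Bit Packer": {"SDC": 50.0, "FI": 10.0, "Mask": 40.0},
}

view = application_view(kernel_results, instruction_counts)
for (kernel, outcome), percent in view.items():
    print(f"{kernel:10s} {outcome:4s} {percent:5.1f} % of application injections")
```

With a 10:1 ratio the CCSDS-123 kernel receives roughly 91% of the injections, which is in line with the approximately 90% impact probability discussed for Figure 6-4.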
 
Figure 6-4 Application level view of the kernel error injection results 
 
[Figure 6-4 plot: stacked bars per test image (Landsat Agriculture, AVIRIS Hawaii), grouped by injection mode (Instruction Output Value, Instruction Output Address); x-axis: % of application error injections, 0 to 100; legend: CCSDS SDC, Bit Packer SDC, CCSDS FI, Bit Packer FI, CCSDS Mask, Bit Packer Mask] 

This application level view of the kernel error injection results shows that the 
CCSDS-123 kernel has the greatest influence on the overall error resilience of the 
application: there is around a 90% probability that an error will impact this kernel, 
and the probability that an error will cause an SDC or FI error in this kernel is 
around 70%. The large influence of this kernel on the error resilience of the 
application as a whole is likely due to the fact that the absolute computational 
workload, in terms of total dynamically executed instructions, is 10 times greater in 
the CCSDS-123 kernel than in the Bit Packer kernel. Identifying trends in workloads 
between kernels is important when identifying and selecting error mitigation schemes; 
doing so on a kernel-by-kernel basis can help to mitigate errors effectively while 
minimising the incurred overheads.  
Implementing appropriate error mitigation techniques for the Bit Packer kernel can 
only increase the overall application error resilience by less than 10%, whilst implement-
ing error protection techniques to mitigate against both SDC and FI errors in the CCSDS-
123 kernel could improve the application error resilience by up to 60% for IOV errors and 
up to 70% for IOA errors. These results show that prioritising the protection of the 
CCSDS-123 kernel will have the most significant impact on improving the error resili-
ence of the whole application.  
 
6.2.4  Instruction Level Error Injection Results 
In addition to kernel-level error injection analysis, the injection results have also 
been classified based on the individual instruction type. This allows an assessment to 
be made of how different instructions contribute to the overall error resilience of an 
application. This level of insight could be used in future developments of instruction 
targeted or compiler level error mitigation techniques.  
Firstly, the dynamic run-time instructions in the case study application have been 
classified based on where results are stored: GPR, PR, CC, or ST. The composition of 
the application with respect to these instruction groupings is given in Table 6-3. For 
IOA injection, CC and PR instructions are not targeted as they do not have address 
components. Therefore, the occurrence percentages used for result normalisation have 
been adjusted as shown in Table 6-3. The corresponding error injection results for the 
instruction groupings are given in Figure 6-5, where each bar in the graph represents 
the percentage of SDC, FI and masked errors weighted by the dynamic instruction group 
occurrence rate detailed in Table 6-3.  
  
Table 6-3 Occurrence of instruction types in the SSC CCSDS-123 GPU application  
Instruction type | IOV Occurrence (%) | IOA Occurrence (%) 
GPR: General Purpose Registers | 86.6 | 97.1 
PR: Predicate Register | 6.8 | 0 
CC: Conditional Code | 4.0 | 0 
ST: Memory Store | 2.6 | 2.9 
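
The IOA column of Table 6-3 appears consistent with renormalising the IOV occurrences over the two groups that have address components (GPR and ST). The short Python sketch below checks this interpretation; it is an inference from the tabulated values rather than a procedure stated in the text, and the function name is hypothetical.

```python
def renormalise(occurrence, keep):
    """Rescale the kept groups to sum to 100% and zero the excluded groups."""
    total = sum(occurrence[group] for group in keep)
    return {
        group: (100.0 * value / total if group in keep else 0.0)
        for group, value in occurrence.items()
    }

# IOV occurrence percentages from Table 6-3.
iov = {"GPR": 86.6, "PR": 6.8, "CC": 4.0, "ST": 2.6}

# PR and CC instructions have no address component, so they are excluded.
ioa = renormalise(iov, keep={"GPR", "ST"})
for group, percent in ioa.items():
    print(f"{group}: {percent:.1f} %")
```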
 
 
 
 
Figure 6-5 Instruction level weighted error injection results  
 
Table 6-3 shows that for both test cases the majority of instructions output data to 
GPRs. Consistent with this, the results in Figure 6-5 show that GPR instructions, in 
both IOV and IOA modes, have the greatest influence on the error resilience of the 
application as a whole, accounting for 85% of the combined masked, SDC and FI events. 
For IOV errors in GPR instructions there is a similar probability that the error is 
masked (approximately 35%) or causes a SDC (approximately 40%). However, for IOA error 
injections in GPRs, both the SDC and FI probabilities increase and the probability of 
an error being masked falls to approximately 25%. This behavioural difference between 
modes mirrors the results seen at the kernel level, as shown in Figure 6-4. 
[Figure 6-5 plot: stacked bars per test image (Landsat Agriculture, AVIRIS Hawaii), grouped by instruction type (GPR, PR, CC, STORE for Instruction Output Value; GPR, STORE for Instruction Output Address); x-axis: % of application error injections, 0 to 100; legend: CCSDS-123 SDC, Bit Packer SDC, CCSDS-123 FI, Bit Packer FI, CCSDS-123 Mask, Bit Packer Mask] 
Another interesting characteristic highlighted in these results is that, despite the 
low occurrence of memory store instructions, they make a greater contribution to the 
SDC probability of the application than the more common CC instructions. It is 
postulated that this is likely due to algorithmic characteristics. In the implemented 
application, non-register memory is only utilised for the storage of shared 
intermediate result vectors and for the storage of the final results of each kernel. 
Therefore, very little algorithmic masking will occur for these operations. These 
results highlight the importance of protecting memory instructions which, whilst having 
the lowest occurrence rate, can have significantly lower error resilience than other 
instruction types.  
The open source nature of the SASSIFI framework also allows users to define their own 
custom instruction groupings for low-level instruction error resilience analysis. To 
further examine trends for GPR and STORE instructions in IOV injection mode, additional 
results were gathered for the key instructions of these types in the application. The 
specific instructions, classifications and occurrence rates for this investigation are 
given in Table 6-4, where IADD_IMUL and MAD represent common integer addition and 
multiplication instructions, SHUFF_LOP includes the shift and logical operations, and 
LD and LDS are the global and shared memory load instructions. The corresponding error 
injection results for these instruction groupings are given in Figure 6-6, where the 
bars represent the percentage of SDC, FI and masked errors weighted against the 
instruction occurrence as per Table 6-4.  
 
Table 6-4 Instruction occurrence for the SSC CCSDS-123 GPU application
Type    Group       Opcodes                                Occurrence (%)
GPR     IADD_IMUL   IMNMX, ISCADD, IADD, IADD3, IADD32I    26.4
GPR     MAD         XMAD                                   22.7
GPR     SHUFF_LOP   SHF, SHL, SHR, LOP                     12.2
STORE   LD          LD, LDC, LDG, LDL                      4.8
STORE   LDS         LDS                                    3.7
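
The weighting of injection results against instruction occurrence can be illustrated with a minimal sketch. It assumes that the weighted contribution of a group is its occurrence share from Table 6-4 multiplied by the outcome probability measured for that group. The per-group SDC probabilities used below are hypothetical placeholders rather than measured results, and the function name is illustrative only.

```python
# Occurrence share (%) of each instruction group, taken from Table 6-4.
OCCURRENCE = {
    "IADD_IMUL": 26.4,
    "MAD": 22.7,
    "SHUFF_LOP": 12.2,
    "LD": 4.8,
    "LDS": 3.7,
}


def weighted_contribution(occurrence_pct, outcome_prob):
    """Weight a per-group outcome probability (0..1) by the group's occurrence (%).

    Assumed weighting: contribution (%) = occurrence (%) * P(outcome | injection).
    """
    return {group: occurrence_pct[group] * outcome_prob[group]
            for group in occurrence_pct}


# HYPOTHETICAL conditional SDC probabilities, for illustration only.
example_sdc_prob = {"IADD_IMUL": 0.5, "MAD": 0.5, "SHUFF_LOP": 0.5,
                    "LD": 0.5, "LDS": 0.5}

contributions = weighted_contribution(OCCURRENCE, example_sdc_prob)
for group, pct in contributions.items():
    print(f"{group}: {pct:.2f}% of application injections end in an SDC")
```

Under this weighting, a group with a low occurrence rate can only contribute strongly if its conditional error probability is high.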
 
 
Figure 6-6 Weighted error injection results for GPR instructions in IOV mode  
 
The results given in Figure 6-6 highlight that the three computational instruction
groups (IADD_IMUL, MAD and SHUFF_LOP) have the largest contribution to the overall
application SDC vulnerability, accounting for 40% of SDCs. The results also show that,
despite the relatively low occurrence rate of the LD and LDS instructions shown in
Table 6-4, their error resilience is relatively low. This mirrors the results observed for
store operations in Figure 6-5. These results also highlight the potential advantages of
an approach to error resilient application development that targets specific instructions
or instruction types. By targeting protection at the instruction level, error probabilities
and occurrence rates could be assessed automatically to achieve a suitable trade-off
between error resilience and induced overhead, in terms of both data processing
throughput and memory usage.
 
6.3    New GPU Error Mitigation Approaches 
As an initial step towards the research and development of more error resilient GPU 
software applications, new versions of the SSC CCSDS-123 GPU application have been 
developed which aim to increase the application’s error resilience. The results from    
Section 6.2 have been used to inform the development of these new error resilient
versions of the CCSDS-123 application. Subsequently, software based error injection
testing has been repeated on these new applications; the results from these experiments
are presented and discussed in the following subsections.
 
6.3.1  Error Resilient GPU Application Development 
The results in Section 6.2 show that the CCSDS-123 kernel has the greatest influence
on the overall application resilience, with 60% of the total application errors occurring
in this kernel. Therefore, a targeted scheme to specifically protect the CCSDS-123 kernel
against SDC errors using software based TMR is investigated. TMR is a generic approach
which protects all instruction and memory pipelines. This is important because the
previous results, shown in Figure 6-5, indicate that all instruction types are vulnerable
to SDC errors and therefore need to be protected. A TMR approach allows errors to be both
detected and corrected should they occur, without requiring re-computation; this matters
because processing throughput and deterministic timing can be critical for onboard and
safety critical applications.
TMR is a flexible, generic error mitigation algorithm which can be implemented in a
number of ways. On traditional hardware devices TMR can be implemented either
spatially, using additional device resources for a concurrent approach, or temporally,
exploiting time as an execution axis. The key advantage of a GPU is the relatively large
amount of concurrent computing resources and the different architectural levels which
can be exploited for different levels of hardware sharing, allowing for greater control of
the spatial configuration.
In this research two different TMR kernel approaches are investigated. They implement
TMR protection using either high level kernels or low level threads to replicate
workloads. These two approaches are referred to henceforth as K-TMR (kernel TMR) and
T-TMR (thread TMR). Both approaches execute the complete instruction pipeline of the
CCSDS-123 kernel three times, resulting in three copies of the output data. An additional
TMR comparator kernel is then required in the application to compare the outputs from
the CCSDS-123 kernel(s), detecting and correcting an SDC should one occur. This approach
is summarised in Figure 6-7.
 
Figure 6-7 TMR implementation comparison 
 
The key difference between the K-TMR and T-TMR versions of the application is
how the instruction pipeline is triplicated and at which hierarchical level of the GPU the
additional resources are managed. In the K-TMR version, the two additional redundant
executions of the kernel are created using CUDA streams and asynchronous kernel
execution. This allows the top-level block schedulers to manage the concurrent execution
of blocks from multiple kernels when the registers, shared memory and computational
cores at the SM level are underutilised.
Alternatively, for the T-TMR version a single kernel executes the original and
redundant operations, whereby the redundant operations are executed within the same
block as the original by declaring three times the number of threads used by the original
kernel. For this implementation the warp scheduler is responsible for managing the GPU
workload and thus, where present, underutilisation within a warp or block can be
leveraged. Both K-TMR and T-TMR approaches aim to exploit idle GPU resources to hide
instruction latencies and reduce the execution overhead of the TMR protection.
Both approaches are practically implemented in CUDA using a generic coding
methodology, so that the techniques can easily be reapplied to alternative algorithms.
Specifically, redundancy in the K-TMR implementation is achieved using the CUDA
streams construct (available in CUDA 7.0). This allows the asynchronous and concurrent
execution of multiple CCSDS-123 GPU kernels.
 
  
[Figure 6-7 diagram: kernel pipelines for the Original, K-TMR and T-TMR applications,
built from the CCSDS-123, TMR Comparator, Thrust: Inclusive_sum and Bit_Packer kernels.]
 
The T-TMR implementation is achieved simply by declaring three times the number
of required threads, whereby the same input data is used for all three duplicate execution
threads. This is demonstrated using pseudo code in Algorithm 6, where lines 17 to 20
show how the K-TMR approach is set up in host code and lines 21 to 23 show the T-TMR
approach. For both K-TMR and T-TMR the output arrays from the CCSDS-123 kernel
(*G_LEN3 and *G_CWRD3) need to be three times the original size in order to
accommodate the duplicate output data. These output arrays are then passed into a new
TMR comparator kernel, as shown in lines 28 and 29. This simple comparator kernel
compares the three arrays to detect errors and uses majority voting logic to correct data
discrepancies where found. The approach taken for this is demonstrated in lines 1 to 9
of Algorithm 6.

Algorithm 6 – SSC CCSDS-123 K-TMR and T-TMR GPU kernel and host code
     Kernel TMR_COMPARE(*array_A, *array_B, *array_C)
 1.    index = threadId
 2.    if (array_A[index] == array_B[index])
 3.      array_C[index] = array_A[index]
 4.    else if (array_A[index] == array_C[index])
 5.      array_B[index] = array_A[index]
 6.    else
 7.      array_A[index] = array_B[index]
 8.      array_C[index] = array_B[index]
 9.    end

     Host Code
10.    Initialise_H_mem
11.    Initialise_G_mem
12.    Copy_H2G_mem(*H_IN, *G_IN)
13.    IM_SIZE = IM_BANDS x IM_HEIGHT x IM_WIDTH
14.    s_mem = (2 * sizeof(int) * IM_BANDS * weights_len * TPB)
15.    blocks = num_tiles / TPB
16.    threads = IM_BANDS * TPB
17.    if (TMR == K_TMR)
18.      for (i = 1; i <= 3; i++)
19.        cudaStreamCreate(&S[i])
20.        CCSDS123<<<blocks, threads, s_mem, S[i]>>>(*G_IN, *G_LEN3, *G_CWRD3)
21.    else if (TMR == T_TMR)
22.      threads = (IM_BANDS * TPB) * 3
23.      CCSDS123<<<blocks, threads, s_mem>>>(*G_IN, *G_LEN3, *G_CWRD3)
24.    else
25.      threads = (IM_BANDS * TPB)
26.      CCSDS123<<<blocks, threads, s_mem>>>(*G_IN, *G_LEN, *G_CWRD)
27.    threads = IM_SIZE
28.    TMR_COMPARE<<<threads>>>(*G_LEN)
29.    TMR_COMPARE<<<threads>>>(*G_CWRD)
30.    threads = IM_BANDS(log(IM_BANDS))
31.    INCLUSIVE_SUM<<<threads>>>(*G_LEN)
32.    threads = IM_SIZE
33.    BIT_PACKER<<<threads>>>(*G_CWRD, *G_LEN, *G_OUT)
34.    Copy_G2H_mem(*H_OUT, *G_OUT)
35.    Free G_mem
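
The voting logic of the comparator can be sketched on the host side as follows. This is a minimal Python illustration of the same branch order as the TMR_COMPARE pseudo code, not the CUDA kernel itself; the loop stands in for one GPU thread per index, and the returned list of mismatching indices is an addition made here for illustration.

```python
def tmr_compare(a, b, c):
    """Element-wise majority vote over three replica arrays, repairing in place.

    Follows the branch order of the comparator pseudo code and returns the
    indices at which a disagreement between replicas was detected.
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("replica arrays must have equal length")
    mismatches = []
    for i in range(len(a)):          # one GPU thread per index in the kernel
        if a[i] == b[i]:
            if c[i] != a[i]:
                mismatches.append(i)
            c[i] = a[i]              # A and B agree: C is overwritten
        elif a[i] == c[i]:
            mismatches.append(i)
            b[i] = a[i]              # A and C agree: B is repaired
        else:
            mismatches.append(i)
            a[i] = b[i]              # A is the odd one out (or all differ):
            c[i] = b[i]              # fall back to B, as in the pseudo code
    return mismatches


golden = [3, 1, 4, 1, 5]
rep_a, rep_b, rep_c = list(golden), list(golden), list(golden)
rep_a[1] = 99    # simulated SDC in replica A
rep_c[4] = -7    # simulated SDC in replica C
flagged = tmr_compare(rep_a, rep_b, rep_c)
print(flagged, rep_a == golden, rep_b == golden, rep_c == golden)
```

Note that when all three replicas disagree the vote cannot identify the correct value; this sketch, like the pseudo code, then falls back to the second replica.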
 
6.3.2  Software Based Error Injection Testing  
To quantify the impact of the newly implemented TMR on the error resilience of the
application, the SASSIFI based error injection testing from Section 6.2 has been repeated.
The same testing parameters are utilised so that a direct comparison can be made between
the two new applications and the original application. A top-level view of the results for
the IOV, IOA and RF injection modes is given in Figure 6-8. The bars in Figure 6-8
represent the percentage of SDC, SDC corrected, FI and masked errors which occurred for
all kernels in the original, K-TMR and T-TMR GPU applications.
 
Figure 6-8 Error injection results for original, K-TMR and T-TMR applications 
 
The results in Figure 6-8 show the significant reduction in SDC error effects
achieved by both K-TMR and T-TMR versions. For IOV and IOA modes SDCs were
reduced by at least 40%, achieving SDC probabilities below 2%. For the register file
injection mode, the SDC outcome was reduced by over 15%, achieving an SDC probability
of 0.5% or below. Figure 6-8 also shows how the overall FI rate has been affected by the
new TMR implementations. For both the IOV and IOA modes there is no statistically
significant change in the observed FI rates. This behaviour occurs because the TMR
technique does not protect against FI error events; it only detects and corrects output
data corruptions.
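
The outcome categories used in these results can be tallied as in the following sketch. The record fields and the sample campaign are hypothetical and chosen only for illustration; they are not SASSIFI's output format, and FI is taken here to mean a run that crashed or hung.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class InjectionRun:
    crashed_or_hung: bool         # application crash or hang (counted as FI)
    output_matches_golden: bool   # final output identical to the fault-free run
    comparator_repaired: bool     # TMR comparator found and fixed a mismatch


def classify(run):
    """Map one injection run to an outcome category."""
    if run.crashed_or_hung:
        return "FI"               # voting cannot help: there is no output to vote on
    if not run.output_matches_golden:
        return "SDC"
    return "SDC corrected" if run.comparator_repaired else "Masked"


def outcome_percentages(runs):
    """Percentage of runs in each outcome category."""
    counts = Counter(classify(r) for r in runs)
    return {name: 100.0 * n / len(runs) for name, n in counts.items()}


# HYPOTHETICAL campaign of four runs, one per category.
runs = [
    InjectionRun(False, True, False),
    InjectionRun(False, True, True),
    InjectionRun(True, False, False),
    InjectionRun(False, False, False),
]
print(outcome_percentages(runs))
```

The first branch makes the limitation explicit: a run that never produces output is counted as an FI regardless of how many replicas were executed.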
To gain a deeper insight into the impact of SDC errors on the modified application 
version, Figure 6-9 shows the same experimental results as Figure 6-8, but with the error 
probabilities broken down at the kernel level.  
 
Figure 6-9 Error injection results for original, K-TMR and T-TMR applications
The kernel analysis in Figure 6-9 shows that the K-TMR version eliminated all SDC
errors in the protected CCSDS-123 kernel; the only remaining SDCs occurred in the new
comparator or unprotected Bit Packer kernels. However, the T-TMR version did not
eliminate all SDCs: a very small proportion, 0.5%, still occurred in the CCSDS-123
kernel. It is postulated that this is due to the corruption of shared resources such as
registers. In this implementation, all three replicas of the instruction pipeline are
executed within the same thread block. Therefore, errors in shared resources may
propagate to all three of the TMR replicas and thus will not be detected or corrected. An
alternative approach could be to place the redundant instructions in separate blocks,
eliminating the use of shared resources by the replicated instruction pipelines.
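
The limitation postulated above can be illustrated with a small host-side simulation. The workload `process` and the single bit flip fault model are hypothetical stand-ins, not the CCSDS-123 kernel; the sketch only shows why a fault in a resource read by all three replicas escapes majority voting while a fault confined to one replica is outvoted.

```python
def majority(a, b, c):
    """Return the value held by at least two replicas (b if all three differ)."""
    return a if a == b or a == c else b


def process(x):
    """Stand-in for the protected computation (hypothetical toy workload)."""
    return 2 * x + 1


def run_tmr(value, replica_fault=None, shared_fault=False):
    """Run three replicas of process() and vote on the result.

    replica_fault: index (0..2) of a replica whose private result is flipped.
    shared_fault:  corrupt the input once, before all replicas read it,
                   modelling an upset in a resource shared by the replicas.
    """
    if shared_fault:
        value ^= 0x4                      # single bit flip seen by every replica
    results = [process(value) for _ in range(3)]
    if replica_fault is not None:
        results[replica_fault] ^= 0x4     # bit flip confined to one replica
    return majority(*results)


expected = process(10)
print(run_tmr(10, replica_fault=1) == expected)    # independent fault is outvoted
print(run_tmr(10, shared_fault=True) == expected)  # common-mode fault goes undetected
```

In the second case all three replicas agree on the same wrong value, so the comparator has nothing to detect.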
 
6.3.3  Throughput Performance Evaluation 
In addition to assessing the error resilience of the K-TMR and T-TMR applications,
performance testing and analysis of the memory usage of these applications is conducted
to establish the overhead requirements. The results are given in Figure 6-10, which
shows the execution time and execution overhead for the original and modified
applications under several kernel configurations for the Landsat Agriculture test image.
The kernel configurations tested (A-D) represent the same degree of parallelism but
increasing levels of TLP for the CCSDS-123 kernel; as the number of threads is increased
the number of blocks declared decreases. The execution overhead for each TMR
implementation is calculated in comparison with the equivalent configuration of the
original (no TMR) version.
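
As a worked illustration of this comparison, the sketch below assumes that the overhead is the relative increase in kernel execution time over the matching no-TMR configuration. The exact formula is an assumption made here, and the timings are hypothetical values rather than measurements.

```python
def execution_overhead_pct(t_tmr_ms, t_original_ms):
    """Overhead of a TMR kernel relative to the unprotected kernel, in percent."""
    if t_original_ms <= 0:
        raise ValueError("baseline execution time must be positive")
    return 100.0 * (t_tmr_ms - t_original_ms) / t_original_ms


# HYPOTHETICAL timings (ms) for one kernel configuration.
baseline_ms = 10.0
tmr_ms = 23.0
print(f"{execution_overhead_pct(tmr_ms, baseline_ms):.0f}%")
```

Under this definition, running the kernel three times back to back with no overlap would give an overhead of 200%, so any value below that indicates that part of the redundant work was hidden.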
The execution time and overhead results given in Figure 6-10 show that, for the
original application, execution time decreases with diminishing returns as the TLP
increases. This is mirrored by the K-TMR version of the application, whose overhead
remains relatively constant across all kernel configurations, with a slight decrease in
overhead as the level of TLP exploited increases. However, this trend is not observed for
the T-TMR implementation, whose overhead increases significantly with increasing TLP.
This is attributed to the relationship between TLP and warp execution efficiency. For
T-TMR configuration A, representing the lowest level of TLP, an almost negligible
overhead is introduced. In this case, the warp execution efficiency is increased without
increasing the number of required warps, and the majority of the additional GPU workload
is hidden.
 
Figure 6-10 TMR execution time overhead comparison for Landsat Agriculture 
 
However, as the explicit TLP of the kernel increases (configurations B to D),
additional warps are required to implement the TMR strategy and a larger overhead is
induced. When comparing the configurations that achieve the minimum execution time,
the K-TMR and T-TMR implementations incur significant but similar kernel overheads of
130% and 126% respectively. This shows that when the explicit application configuration
is highly optimised for low execution time there is less opportunity to effectively hide
the execution overhead of implementing TMR protection.
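
The relationship between TPB and the number of warps can be checked with a short calculation. The sketch assumes six threads per tile and takes configurations A to D as TPB = 1, 2, 4 and 8, as the figure labels suggest; the warp size of 32 is the standard value for NVIDIA GPUs. It is a first-order count only and does not model the scheduler.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warps_per_block(threads_per_block):
    """Number of warps needed to hold the threads of one block."""
    return math.ceil(threads_per_block / WARP_SIZE)


def lane_utilisation(threads_per_block):
    """Fraction of warp lanes that hold an active thread."""
    return threads_per_block / (warps_per_block(threads_per_block) * WARP_SIZE)


BANDS = 6  # ASSUMPTION: six threads per tile (threads = IM_BANDS * TPB)

for label, tpb in zip("ABCD", (1, 2, 4, 8)):
    base = BANDS * tpb      # threads per block without TMR
    ttmr = 3 * base         # T-TMR declares three times as many threads
    print(f"({label}) TPB={tpb}: no TMR {base} threads -> {warps_per_block(base)} warp(s); "
          f"T-TMR {ttmr} threads -> {warps_per_block(ttmr)} warp(s), "
          f"lane utilisation {lane_utilisation(ttmr):.0%}")
```

Under these assumptions configuration A fits both the original 6 threads and the 18 T-TMR threads in a single warp, whereas configurations B to D need two to three times as many warps as the unprotected kernel.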
Equations (6-4) and (6-5) provide the relationships for shared and global memory
usage. Shared memory usage is closely related to the number of threads initialised per
kernel. Due to this relationship, the shared memory usage overhead is increased for the
T-TMR implementation when compared to the K-TMR and original applications. The global
memory overhead introduced by the K-TMR and T-TMR applications is the same, at
approximately 170%. The global memory overhead is minimised by only applying the TMR
techniques to the CCSDS-123 kernel rather than the whole application and by utilising a
single input data source for all three of the kernel execution pipelines.
 
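\[ \text{Shared Memory (Bytes)} = 8 \times \#\text{Bands} \times W_{len} \times \left( \#\text{Threads} / \#\text{Bands} \right) \qquad (6\text{-}4) \]

\[ \text{Global Memory (Bytes)} \approx \begin{cases} 38 \times \#\text{Pixels} & \text{if TMR} \\ 14 \times \#\text{Pixels} & \text{otherwise} \end{cases} \qquad (6\text{-}5) \]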
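
Equations (6-4) and (6-5) can be evaluated directly, as in the sketch below. The band count, weight vector length and TPB used in the example are hypothetical values, and the function names are illustrative; only the constants 8, 38 and 14 come from the equations.

```python
def shared_memory_bytes(n_bands, w_len, n_threads):
    """Equation (6-4): 8 * #Bands * Wlen * (#Threads / #Bands).

    Assumes n_threads is a whole multiple of n_bands.
    """
    return 8 * n_bands * w_len * (n_threads // n_bands)


def global_memory_bytes(n_pixels, tmr):
    """Equation (6-5): approx. 38 bytes per pixel with TMR, 14 without."""
    return (38 if tmr else 14) * n_pixels


def global_overhead_pct(n_pixels):
    """Relative increase in global memory when TMR is enabled, in percent."""
    base = global_memory_bytes(n_pixels, tmr=False)
    return 100.0 * (global_memory_bytes(n_pixels, tmr=True) - base) / base


# HYPOTHETICAL parameters, for illustration only.
bands, w_len, tpb = 6, 3, 4
threads = bands * tpb
print(shared_memory_bytes(bands, w_len, threads))       # original or K-TMR block
print(shared_memory_bytes(bands, w_len, 3 * threads))   # T-TMR block, 3x the threads
print(round(global_overhead_pct(1_000_000), 1))
```

The ratio (38 - 14) / 14 is about 171%, independent of image size, which agrees with the overhead of approximately 170% quoted in the text; tripling the thread count triples the shared memory of Equation (6-4).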
[Figure 6-10 plot: CCSDS-123 kernel execution time (ms) and overhead (%) compared to the
No TMR kernel, shown for No TMR, K-TMR and T-TMR under configurations (A) TPB = 1,
(B) TPB = 2, (C) TPB = 4 and (D) TPB = 8.]
To conclude the investigation into the impact of the TMR protection mechanisms on
throughput performance, Figure 6-11 provides the measured throughput for the three
Landsat test images with no TMR, K-TMR and T-TMR enabled across varying numbers of
tiles and TPB values on the GTX750Ti GPU, and Figure 6-12 provides the equivalent
results on the Jetson TX1 platform.
 
 
 
 
 
Figure 6-11 GTX750Ti Landsat throughput results with and without TMR protection 
 
All tested image results for both GPU platforms, in Figure 6-11 and Figure 6-12,
exhibit the same overall trends in throughput with both changing tile size and TPB value.
Comparing the K-TMR and T-TMR results, for lower TPB values the T-TMR approach is
able to achieve significantly greater processing throughput by leveraging the
underutilisation of threads within a warp. The peak throughputs achieved by the K-TMR
and T-TMR applications are approximately equal to each other and to the processing
throughput of the application with no TMR protection and a TPB value of 1.
[Figure 6-11 plots: throughput (Gb/s) against number of tiles (1 to 8192) for the Landsat
Agriculture, Landsat Coast and Landsat Mountain images on the GTX750Ti. Series are
labelled by tiles per block (threads per block): No TMR and K-TMR 1 (6), 2 (12), 4 (24),
8 (48), 16 (96); T-TMR 1 (18), 2 (36), 4 (72), 8 (144), 16 (288).]
 
 
 
 
 
 
 
 
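Figure 6-12 Jetson TX1 Landsat throughput results with and without TMR protection
[Figure 6-12 plots: throughput (Gb/s) against number of tiles (1 to 8192) for the Landsat
Agriculture, Landsat Coast and Landsat Mountain images on the Jetson TX1. Series are
labelled by tiles per block (threads per block): No TMR and K-TMR 1 (6), 2 (12), 4 (24),
8 (48), 16 (96); T-TMR 1 (18), 2 (36), 4 (72), 8 (144).]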
 
Overall these are very promising results, particularly for the Jetson TX1 low power
GPU platform, as they show that both the unprotected and TMR protected versions of the
CCSDS-123 application are able to achieve a processing throughput in the region of the
440 Mb/s payload data generation rate of the multispectral Landsat imager. This
highlights the feasibility of utilising a GPU device for accelerated state-of-the-art
compression in a real-time onboard data processing system.
 
6.4    Chapter 6 Summary 
In this Chapter, research into the error resilience of the NVIDIA GPU architecture,
software model and the new SSC CCSDS-123 application is presented. The state-of-the-art
software based NVIDIA SASSIFI error injection framework is used to conduct this
research, which demonstrates how the framework can provide insights into error
resilience that are not available using the standard CUDA developer tools or alternative
error injection frameworks. In addition to investigating inherent error resilience, two
new generic TMR based protection approaches for GPU applications, K-TMR and T-TMR,
are developed and practically assessed. The key characteristics of the new error
protection approaches are that they maintain time deterministic processing, minimise
induced execution time and memory overheads, and are generic, meaning they can be
applied to multiple kernels, applications and algorithms.
The major goal of the research detailed in this Chapter was to directly address the
research questions posed in Table 3-3 on page 91. Therefore, a summary of the key
findings presented in this Chapter in relation to these research questions is given in
Table 6-5.
Overall the research detailed in this Chapter has demonstrated that SDC errors in a
protected kernel can be eliminated using software TMR. However, this alone is not
sufficient to prove that utilising GPUs in error prone environments or safety critical
applications is truly feasible. To achieve this, further research with regard to FI errors
is required. Firstly, a greater understanding of the hardware level upset rates and the
probability of an FI propagating into software needs to be established. This will require
practical radiation testing, due to the limited architectural visibility associated with
software error injection testing. Secondly, research into suitable protection and
mitigation techniques to reduce the probability or impact of an FI on the application is
required. Addressing FIs is particularly critical to both space and safety critical
terrestrial applications, as they are often operated in a heavily time deterministic
manner, whereby multiple application restarts may not be tolerable. Both of these areas
are also discussed in Section 7.2, which details future work.
 
  
Table 6-5 SSC CCSDS-123 error resilience case study Chapter 6 findings 
2.a  Which elements of the algorithm and GPU are most susceptible to errors? 

- The CCSDS-123 kernel has the greatest influence on the overall error resilience of the 
CCSDS-123 application: with no error protection, there is around a 45% probability that an 
SDC, and a 10-25% probability that an FI, occurs in this kernel. The comparable 
probabilities for the Bit Packer kernel are less than 10% for an SDC and less than 1% for 
an FI.  

- The Register File has a low observed error probability and a low overall AVF. FIs were 
the highest probability observed error effect for this structure; therefore ECC presents 
little benefit. 

- Whilst memory store instructions had the lowest occurrence rates, they have a significant 
influence on the error resilience, showing that functional and algorithmic characteristics, 
not just occurrence rate, are important to error resilience. 

2.b  Can the GPU architecture and software model be leveraged to mitigate error effects 
whilst minimising the induced overheads? 

- New K-TMR and T-TMR approaches are proposed and then used to protect the most vulnerable 
CCSDS-123 kernel.  

- K-TMR uses the CUDA streams software model construct for overlapped redundant kernel 
execution, reducing the induced execution time overhead to less than 150% and minimising 
the use of shared memory resources.  

- K-TMR eliminates all SDC errors in the CCSDS-123 kernel, equating to around a 40% 
reduction in SDCs in the overall application.  

- T-TMR leverages additional execution threads for concurrent redundant executions to 
significantly reduce the induced execution time overhead, especially where the GPU is 
underutilised, resulting in an overhead of between 15% and 130%. 

- T-TMR eliminates the majority of SDC errors in the CCSDS-123 kernel (~0.5% remain), 
equating to around a 40% reduction in SDCs in the overall application. 
 
Rebecca L. Davidson          Chapter 7 , Conclusion & Future Work 
- 178 - 
CHAPTER 7, CONCLUSION & FUTURE WORK 
7.1    Conclusion 
EO payload data rates and volumes generated onboard satellite platforms have grown, and 
continue to grow, substantially. However, the required level of advancement in downlink 
technologies to handle such increases has not occurred. The growing disparity between 
payload and downlink capabilities has resulted in the formation of an onboard data 
bottleneck in the data-delivery chain. For EO satellite platforms to continue to provide the 
level of data required by EO science and its reliant applications, this data bottleneck must 
be alleviated. The approach studied in this research is to leverage onboard processing to 
compress the data prior to transmission to ground, effectively optimising the achieved 
data-delivery rate. Currently, satellite onboard data processing systems typically employ 
RH processors in a rigid architectural design with simple image compression. These devices 
and system architectures are not able to provide the flexibility to deal with the increased 
diversity in payload types required by current or future satellite EO missions, nor the 
level of computational resources and compression required to alleviate the onboard data 
bottleneck. 
In Chapter 2 the current state-of-the-art in terrestrial computing and in image compression 
and processing is assessed. The findings showed that the requirements of low power 
terrestrial and onboard space processing applications are becoming increasingly aligned. 
This presents new opportunities to deploy and adapt terrestrial system design principles, 
from fields such as cluster computing, and devices, including state-of-the-art GPUs, in the 
onboard data processing application. Additionally, an extensive review of lossless image 
compression algorithms is conducted, and the benefits of leveraging multidimensional 
algorithms in combination with additional image processing algorithms, such as image tiling, 
band registration and radiometric calibration, are discussed. Specifically, the CCSDS-123 
algorithm emerges as the most suitable, highest performing algorithm for onboard 
implementation, achieving an average compression ratio 54% greater than the JPEG-LS 
algorithm typically used onboard today.  
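
The 54% figure can be translated into downlink volume with simple arithmetic. The sketch below assumes the common definition of compression ratio as original size divided by compressed size, which may differ in detail from the definition used elsewhere in this thesis. The function names are illustrative only.

```python
def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Compression ratio as original size divided by compressed size."""
    if original_bytes <= 0 or compressed_bytes <= 0:
        raise ValueError("sizes must be positive")
    return original_bytes / compressed_bytes


def volume_saving(ratio_gain: float) -> float:
    """Fractional cut in compressed volume when the ratio grows by ratio_gain.

    ratio_gain = 0.54 means the new ratio is 1.54 times the old one.
    """
    if ratio_gain <= -1.0:
        raise ValueError("ratio_gain must be greater than -1")
    return 1.0 - 1.0 / (1.0 + ratio_gain)


# A 54% higher ratio means the same image compresses to 1/1.54 of the
# size it had under the lower-ratio algorithm.
saving = volume_saving(0.54)
```

Under this assumption, a compression ratio 54% greater corresponds to roughly 35% less compressed data to downlink for the same image.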
Based on these findings, Chapter 3 details the design and proposal of a new scalable 
heterogeneous onboard data processing architecture, shown in Figure 3-3. The proposed 
new system is designed, behaviourally and structurally, to alleviate the existing onboard 
data bottleneck by facilitating the deployment of autonomous advanced processing and 
compression in an inherently flexible system design. This enables high compression ratio 
algorithms to be implemented and new onboard data products to be generated. In this new 
architecture, the computationally intense image processing and compression is offloaded 
onto GPU devices for high throughput processing. Whilst GPUs are architecturally well 
suited to processing the large dimensionality and volumes of data common in EO 
applications, there are several research areas which present significant challenges with 
respect to the deployment of these devices in an onboard environment, namely low power 
consumption, high data processing throughput and error resilience.  
Due to the unique nature of the GPU hardware and software model, design and optimisation 
approaches new to the field of onboard data processing are required to achieve high data 
processing throughput. Additionally, new error mitigation approaches, not commonly employed 
in GPU computing, are necessary to ensure the data processing is also error resilient. 
Research into combining the design approaches from both the GPU and onboard computing 
fields was conducted using the state-of-the-art CCSDS-123 lossless image compression 
algorithm as a case study.  
Chapter 4 details the initial investigation of appropriate design and optimisation 
approaches for parallelised, high throughput hyperspectral CCSDS-123 image compression. 
In-depth details of the design process have been made available and new generic design 
rules have been proposed to aid the design of future GPU accelerated image processing 
applications by the community. The capabilities of the new SSC CCSDS-123 GPU accelerated 
application are demonstrated: when directly compared against previous implementations from 
the literature, new state-of-the-art processing throughput performance is shown. For the 
first time, the capabilities on a low power, onboard representative GPU platform are also 
demonstrated. The results from this investigation show that, using the new SSC CCSDS-123 
application, a processing throughput over twice that of the AVIRIS imager data rate is 
possible. Additionally, new relationships between the algorithm, application and image 
tiling parameters are discussed, a new modified comparison metric to easily assess these 
relationships is demonstrated, and new equations relating these parameters to the 
underlying GPU architecture, to aid optimum parameter and hardware selection, are proposed.  
In Chapter 5 the impact of input data characteristics on compression ratio and throughput 
performance is investigated. For data which exhibits lower DLP, such as multispectral 
imagery, it is shown that the processing throughput can be significantly inhibited. To 
address this, appropriate optimisation techniques are researched and, as a result, 
a new multispectral optimised SSC CCSDS-123 application and accompanying new application 
design rules are presented. The capabilities of the optimisation techniques are 
demonstrated through experimental testing of the application: an increase in processing 
throughput of up to 2.5 times over the previous application design is observed, while on 
the low power Jetson TX1 GPU a processing throughput of up to twice the Landsat imager’s 
raw data rate is achieved.  
In addition to low power consumption, the other major constraint critical to the onboard 
application is error resilience. In Chapter 6 the error resilience requirements are 
specifically addressed by leveraging the new SSC CCSDS-123 GPU accelerated application and 
a state-of-the-art software based error injection framework to evaluate inherent error 
resilience. Using these results, and by investigating the fusion of traditional onboard 
error mitigation schemes and GPU computing principles, two new TMR based mitigation 
schemes, which leverage the GPU architecture to reduce SDC error events and minimise the 
induced execution overheads, are proposed. Using the SASSIFI software error injection 
framework it is shown that, when applied to the new CCSDS-123 kernel, the new K-TMR 
approach is able to eliminate all SDC errors and the T-TMR approach reduces SDC effects to 
less than a 0.5% probability. The induced overheads of both techniques are also 
investigated; these equate to a peak 140% increase in execution time. On the Jetson TX1 
platform it is demonstrated that it is still possible to meet the real-time throughput 
constraint for the Landsat imager while enabling either the K-TMR or the T-TMR protection 
mechanism.  
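
The relationship between an execution time overhead and the real-time constraint can be checked with a short calculation. The sketch below assumes that throughput scales inversely with execution time for a fixed amount of data. The function names and the numerical values (a 1.0 Gb/s unprotected throughput and a 0.4 Gb/s sensor rate) are hypothetical and are not measurements from this thesis.

```python
def protected_throughput(baseline_gbps: float, time_overhead: float) -> float:
    """Throughput once execution time grows by time_overhead (1.4 = +140%)."""
    if baseline_gbps <= 0 or time_overhead < 0:
        raise ValueError("baseline must be positive and overhead non-negative")
    return baseline_gbps / (1.0 + time_overhead)


def meets_real_time(baseline_gbps: float, time_overhead: float,
                    sensor_gbps: float) -> bool:
    """True if the protected throughput still keeps up with the sensor."""
    return protected_throughput(baseline_gbps, time_overhead) >= sensor_gbps


# Hypothetical values: a 140% overhead divides throughput by 2.4.
rate = protected_throughput(1.0, 1.4)
ok = meets_real_time(1.0, 1.4, 0.4)
```

With these illustrative numbers the protected throughput is about 0.42 Gb/s, so a 0.4 Gb/s sensor rate would still be met, whereas a 0.5 Gb/s rate would not.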
Drawing together the key findings from the research explored in this thesis, a new 
development and testing framework for GPU accelerated error resilient applications is 
proposed and presented in Figure 7-1. This Figure, in conjunction with the findings 
detailed in this thesis, is designed to help facilitate the development of new image 
processing applications by the community. Figure 7-1 specifically aims to highlight the 
iterative nature of the design and optimisation approach required to maximise processing 
throughput and error resilience performance, the key generic design rules which should be 
considered, and points to specific examples of the detailed approaches in this thesis. The 
Figure also identifies key areas for future research and how these would fit into the 
proposed framework, as shown by the shaded blocks. The future work is also discussed in 
further detail in the following subsections. 
 
Figure 7-1 New error resilient GPU accelerated application development framework 
[Figure: flowchart of the iterative framework (Assess, Implement, Evaluate, Iterate). Block labels recovered from the figure:]

Define Application Requirements: Section 4.3, Table 4-1

Assess:
- Algorithm Data & Task Dependencies: Section 4.2, Figure 4-4
- Algorithm & Application Characteristics: Section 6.1
- Hardware Characteristics
- Input Data Characteristics

Implement, Parallelisation, Input Data Organisation:
- Data Reordering, Coalesced Memory Operations: Section 4.3.1 - Design Rule A
- Pre-processing, Image Tiling for increased TLP: Section 4.3.1 - Design Rule B

Implement, Parallelisation, Kernel Organisation:
- Memory Hierarchy, Register & Shared Memory Usage: Section 4.3.2 - Design Rule C
- Kernel Occupancy, Limiting Factors & Compiler Flags: Section 4.3.3 - Design Rule D
- Concurrent Tasks, CUDA Streams & Thread Organisation: Section 4.3.4 - Design Rule E
- Configuration Optimisation, Architecture Specific Block Sizing: Section 4.3.5 - Design Rule F
- Nested Parallelism, Dynamic Kernel Execution: Section 5.2.1 - Design Rule G
- Warp Efficiency, Thread Reorganisation: Section 5.2.2 - Design Rule H

Evaluate:
- Profiling Testing: NVIDIA NVprof
- Performance Testing: Custom application shell scripts
- Error Injection Testing: Software based NVIDIA SASSIFI

Implementation Performance (Algorithm Performance, Application Performance, Performance Trade-offs):
- Comparison Metrics, Throughput Weissman Score: Equation (4-12)
- Measured & Calculated Metrics, File Size → Compression Ratio: Equation (2-5)
- Measured & Calculated Metrics, Execution Time → Processing Throughput: Equation (4-8)
- Variable Parameters, Image Tiling, TPB, TMR: Section 4.4.2, 5.4 & 6.3
- Relationships: Equations (5-1) & (5-2)
- Variable Parameters, Prediction Bands, Image Tiling: Appendix H, Section 4.4.2 & 5.1

Error Resilience, Define Error Models: SBF: Section 6.2, Table 6-2

Error Resilience, Computational Levels:
- Application, IOV & IOA Modes: Section 6.2.1 & Section 6.3.2
- Kernel, IOV & IOA Modes: Section 6.2.2 & Section 6.3.2
- Instruction, IOV & IOA Modes: Section 6.2.4 & Section 6.3.2

Error Resilience, Memory Structures:
- Register File, Direct Error Injections: Section 6.2.3 & Section 6.3.2
- Caches & Shared Memory: Not assessed in this research
- Global Memory: Not assessed in this research

Error Resilience, Control Structures:
- GigaThread Scheduler: Not assessed in this research
- Warp Scheduler: Not assessed in this research

Implement, Kernel Organisation (SDC Protection and FI Protection):
- Generic, K-TMR & T-TMR: Section 6.3, Figure 6-6
- ABFT: Not assessed in this research
- CPR: Not assessed in this research
- Watchdog: Not assessed in this research
7.2    Future Work 
7.2.1  GPU Beam Testing  
In this research the error resilience of GPU hardware and the newly proposed software 
application has been assessed using a state-of-the-art software based error injection 
framework, enabling the discovery of key insights in a cost and time effective manner. 
However, there are several error resilience aspects which cannot currently be assessed 
using this software framework, namely the hardware work schedulers and the cache or shared 
memory structures. This is because many key low level details of the hardware architecture 
and software API are obfuscated by the manufacturers, due to the highly competitive nature 
of the market and the highly sensitive nature of the IP deployed. In addition to assessing 
these physical structures, assessing the impact of changing memory access patterns, or the 
degree of exposed parallelism via the configuration of software based kernel or image 
tiling parameters, would also be a valuable contribution to the community. These aspects 
would need to be explored using beam testing experiments, which can be costly and time 
consuming but would enable a complete assessment of the GPU architecture in an environment 
indicative of space. The results from such a research programme could also be used to 
verify and correlate software based error injection results such as those presented in 
this research.  
7.2.2  GPU Application FI Protection 
This research has concentrated upon the mitigation of SDC error events. However, in space 
and safety critical applications, mitigating against FI events will also be a high 
priority. As shown in Figure 7-1, approaches such as CPR or a simple watchdog controller 
could be implemented to achieve this. However, research into new approaches specifically 
designed for the GPU architecture and leveraging the latest software model features could 
have a significant impact and aid the widespread adoption of GPU devices in both space and 
safety critical fields. 
7.2.3  State-of-the-art Low Power GPU Testing 
In 2017 NVIDIA released a new low power GPU platform, the Jetson TX2, which is based on a 
more advanced GPU architecture, boasting twice the raw computational performance and 
increased power efficiency [148]. Additionally, NVIDIA have released a version of the 
device, the Jetson TX2i, specifically designed for industrial environments. It features ECC 
protected memory, voltage monitoring circuitry, a wider operating temperature range and a 
longer operating lifetime [149]. It would be valuable to reassess both the throughput and 
error resilience performance of the newly developed SSC CCSDS-123 application on this new 
platform and directly compare the results against those presented for the Jetson TX1 
device in this thesis, to evaluate which is the more suitable device for onboard use.  
7.2.4  Commercial Exploitation  
To demonstrate the potential impact of the newly proposed GPU accelerated onboard data 
processing architecture, work towards the design and road-mapping of a commercially viable 
system would be beneficial. Such work should focus on the outstanding system level 
challenges in areas such as data interfacing, operational power characteristics and 
thermal engineering. 
Recently a UK consortium, which includes the pioneering small satellite design and 
manufacture company Surrey Satellite Technology Limited (SSTL), has won a Centre for EO 
Instrumentation (CEOI) funded project to implement, test and demonstrate innovative 
software techniques for ultra-high-resolution optical image processing on dedicated GPU 
hardware [150]. The knowledge gained from this research programme and the work detailed in 
this thesis will be directly used within SSTL in their role in this project and to help 
shape their future onboard EO platforms.  
  
Rebecca L. Davidson                                    References 
- 184 - 
REFERENCES 
[1]  S. Lopez, T. Vladimirova, C. Gonzalez, J. Resano, D. Mozos and A. Plaza, “The Promise of 
Reconfigurable Computing for Hyperspectral Imaging Onboard Systems: A Review and 
Trends” In Proceedings of the IEEE, vol. 101, no. 3, pp. 698-722, March 2013. DOI: 
10.1109/JPROC.2012.2231391 
[2]  R. Trautner. “ESA’s roadmap for next generation payload data processors.” In Proceedings 
of DASIA Conference, 2011.  ISBN: 978-92-9092-258-2, ESA-SP Vol. 694, 2011, id.79 
[3]  D. Giggenbach, B. Epple, J. Horwath, and F. Moll. “Optical satellite downlinks to optical 
ground stations and high-altitude platforms.” In Advances in Mobile and Wireless Commu-
nications, 2008. DOI: 10.1109/ISTMWC.2007.4299318 
[4]  C. Thiebaut, E. Christophe, D. Lebedeff and C. Latry. “CNES studies of on-board compres-
sion for multispectral and hyperspectral images.” In Proc. SPIE, vol. 6683, 2007. DOI: 
10.1117/12.734186  
[5]  G. Yu, T. Vladimirova and M. N. Sweeting. “Image compression systems on board satellites.” 
Acta Astronautica, 2009, 64(9-10), pp. 988-1005. DOI: 10.1016/j.actaastro.2008.12.006 
[6]  W. G. Rees. “Physical Principles of Remote Sensing”. Cambridge University Press, 2013.   
ISBN: 9780521181167 
[7]  F. F. Sabins. “Remote sensing: principles and applications.” Waveland Press, 2007.  ISBN-
13: 978-1577665076 
[8]  J. R. Jensen. “Remote sensing of the environment: An earth resource perspective.” Pearson 
Education India, 2009.  ISBN-13: 978-0131889507 
[9]  S. Rahmani, M. Strait, D. Merkurjev, M. Moeller and T. Wittman, "An Adaptive IHS Pan-
Sharpening Method," in IEEE Geoscience and Remote Sensing Letters, vol. 7, no. 4, pp. 
746-750, Oct. 2010. doi: 10.1109/LGRS.2010.2046715 
[10]  B. L. Markham, J. C. Storey, D. L. Williams and J. R. Irons, "Landsat sensor performance: 
history and current status," in IEEE Transactions on Geoscience and Remote Sensing, vol. 
42, no. 12, pp. 2691-2694, Dec. 2004. doi: 10.1109/TGRS.2004.840720 
[11]  P. Soille, A. Burger, D. Rodriguez, V. Syrris, and V. Vasilev. "Towards a JRC earth observa-
tion data and processing platform." In Proceedings of the Conference on Big Data from 
Space (BiDS’16), Santa Cruz de Tenerife, pp. 15-17. 2016.  doi: 10.2788/854791 
[12]  L. Waranon, S. Nakamura, and W. Vongsantivanich. "Inter-satellite Link Assessment For 
Thaichote EOS Satellite Communication Enhancement.” Geo-Informatics and Space Tech-
nology Development Agency (GISTDA), Bangkok, Thailand  doi: 10.1.1.740.6559 
[13]  Spaceworks, “2017 Nano/Microsatellite market forecast.” Spaceworks Enterprises, Atlanta, 
2016. [Online] Accessed: September 2017. Available: 
http://spaceworksforecast.com/docs/SpaceWorks_Nano_Microsatellite_Market_Forecast_2017.pdf  
[14]  P. d'Angelo, G. Kuschk and P. Reinartz, “Evaluation of skybox video and still image prod-
ucts.” The International Archives of Photogrammetry, Remote Sensing and Spatial 
Information Sciences, 2014, 40(1), p.95. DOI:10.5194/isprsarchives-XL-1-95-2014 
[15]  NASA, “Radiation Belts with Satellites” Feb. 28, 2013 [Online] Accessed September 2018 
Available: https://www.nasa.gov/mission_pages/sunearth/news/gallery/20130228-radiationbelts.html 
[16]  A. Holmes-Siedle and L. Adams. "Handbook of radiation effects." Second Edition OUP Ox-
ford, 2002,  ISBN-13: 978-0198507338 
[17]  C. I. Underwood. “Single Event Effects in Commercial Memory Devices in the Space 
Radiation Environment.” Ph.D. Thesis, Centre for Satellite Engineering Research, University 
of Surrey, August 1996. 
[18]  F. Kastensmidt and P. Rech. “FPGAs and parallel architectures for aerospace applications: 
soft errors and fault-tolerant design”. Springer, 2015.  ISBN 978-3-319-14352-1 
[19]  A.E Cooper, W.T Chow “Development of on-board space computer systems.”  in IBM Jour-
nal of Research and Development, vol. 20, no. 1, pp. 5-19, Jan. 1976. doi: 
10.1147/rd.201.0005 
[20]  P. K. Samudrala, J. Ramos, S. Katkoori, "Selective Triple Modular Redundancy (STMR) 
Based Single-Event Upset (SEU) Tolerant Synthesis for FPGAs," in IEEE Transactions on 
Nuclear Science, vol. 51, no. 5, pp. 2957-2969, Oct. 2004. doi: 10.1109/TNS.2004.834955 
[21]  J. R. Schwank, V. Ferlet-Cavrois, M. R. Shaneyfelt, P. Paillet and P. E. Dodd, "Radiation 
effects in SOI technologies," in IEEE Transactions on Nuclear Science, vol. 50, no. 3, pp. 
522-538, June 2003. doi: 10.1109/TNS.2003.812930 
[22]  P. Yeh, and H.M Warner "Application guide for universal source encoding for space." 
NASA Technical Paper 3441, 1993,  Scientific and Technical Information Branch NASA-
TP-3441 19940017310 
[23]  R. Rice and J. Plaunt. “Adaptive variable-length coding for efficient compression of space-
craft television data.” In IEEE Transactions On Communication Technology, pp. 889-897, 
1971. DOI: 10.1109/TCOM.1971.1090789. 
[24]  Xilinx “Xilinx Aerospace & Defense Solutions, Space Grade FPGA offerings”, V2.3 
[Online] Accessed May 2019. Available:  
https://www.xilinx.com/publications/prod_mktg/xmp077.pdf 
[25]  Microsemi, “RTG4 Radiation-Tolerant FPGAs: High-Speed RT FPGAs for Signal Pro-
cessing Applications” [Online] Accessed May 2019. Available: 
https://www.microsemi.com/product-directory/rad-tolerant-fpgas/3576-rtg4 
[26]  G. E. Moore “Cramming more components onto integrated circuits,” Intel Electronics, 1965, 
pp 38: 8. [Online] Accessed September 2018. Available: 
ftp://download.intel.com/research/silicon/moorespaper.pdf 
 
[27]  R. L. Pease, A. H. Johnston and J. L. Azarewicz, "Radiation testing of semiconductor devic-
es for space electronics," in Proceedings of the IEEE, vol. 76, no. 11, pp. 1510-1526, Nov. 
1988. doi: 10.1109/5.90110 
[28]  C. I. Underwood, A. da Silva Curiel, and M. N. Sweeting, “In-orbit monitoring of ‘space 
weather’ and its effects on commercial-off-the- shelf (COTS) electronics—A decade of re-
search using micro-satellites,” presented at the 53rd Int. Astronautical Congress, Houston, 
TX, Oct. 10–19, 2002, doi. IAC-02-IAA.6.3.04. 
[29]  Y. Bentoutou, "A Real Time EDAC System for Applications Onboard Earth Observation 
Small Satellites," in IEEE Transactions on Aerospace and Electronic Systems, vol. 48, no. 1, 
pp. 648-657, Jan. 2012. doi: 10.1109/TAES.2012.6129661 
[30]  P. P. Shirvani, N. R. Saxena and E. J. McCluskey, "Software-implemented EDAC protection 
against SEUs," in IEEE Transactions on Reliability, vol. 49, no. 3, pp. 273-284, Sept 2000. 
doi: 10.1109/24.914544 
[31]  H. J. Tausch, "Simplified Birthday Statistics and Hamming EDAC," in IEEE Transactions on 
Nuclear Science, vol. 56, no. 2, pp. 474-478, April 2009. doi: 10.1109/TNS.2009.2012710 
[32]  A. M. Saleh, J. J. Serrano and J. H. Patel, "Reliability of scrubbing recovery-techniques for 
memory systems," in IEEE Transactions on Reliability, vol. 39, no. 1, pp. 114-122, April 
1990. doi: 10.1109/24.52622 
[33]  H. Akkary, R. Rajwar and S. T. Srinivasan, "Checkpoint processing and recovery: an effi-
cient, scalable alternative to reorder buffers," in IEEE Micro, vol. 23, no. 6, pp. 11-19, Nov.-
Dec. 2003. doi: 10.1109/MM.2003.1261382 
[34]  K. Huang, "Algorithm-Based Fault Tolerance for Matrix Operations," in IEEE Transactions 
on Computers, vol. C-33, no. 6, pp. 518-528, June 1984. doi: 10.1109/TC.1984.1676475 
[35]  S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt and T. Austin, "A systematic method-
ology to compute the architectural vulnerability factors for a high-performance 
microprocessor," Proceedings. 36th Annual IEEE/ACM International Symposium on Micro-
architecture, 2003. MICRO-36., San Diego, CA, USA, 2003, pp. 29-40. doi: 
10.1109/MICRO.2003.1253181 
[36]  V. Sridharan, D. R. Kaeli, "Using hardware vulnerability factors to enhance AVF analysis",  
Proceedings of the 37th annual international symposium on Computer architecture, pp. 461-
472, 2010.  Doi: 10.1145/1816038.1816023 
[37]  H. Madeira, D. Costa and M. Vieira, "On the emulation of software faults by software fault 
injection," Proceeding International Conference on Dependable Systems and Networks. DSN 
2000, New York, NY, USA, 2000, pp. 417-426. doi: 10.1109/ICDSN.2000.857571 
[38]  Joint Photographic Experts Group, “About Joint Photographic Experts Group - JPEG.” 
[Online] Accessed: September 2018, Available: http://www.jpeg.org/about.html.  
[39]  CCSDS, "Consultative Committee for Space Data Systems CCSDS 120.0-G-1 Green Book."  
Lossless Data Compression, 2013. [Online] Accessed September 2018. Available: 
https://public.ccsds.org/Pubs/120x0g3.pdf   
[40]  Joint Photographic Experts Group, “Lossless and near-lossless compression of continuous-
tone still images.” ISO/IEC 14 pp. 495-491. 1999.  ITU-T Recommendation T.87, ISO/IEC 
International Standard 14495-1 [Online] Accessed September 2018. Available: 
https://www.ic.tu-berlin.de/fileadmin/fg121/Source-Coding_WS12/selected-readings/19_t87.pdf 
[41]  Joint Photographic Experts Group. “JPEG 2000 image coding system.” ISO/IEC FCD pp. 
15444-15441. 2000. ITU-T Recommendation T.805,  ISO/IEC 15444-6 [Online] Accessed 
September 2018. Available: 
https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.805-201201-I!!PDF-E&type=items. 
[42]  CCSDS, "Recommendation for space data system standards." Lossless Data Compression, 
Technical report, CCSDS 121.0-B-1. Blue Book, 2005. [Online]  Accessed September 2018. 
Available:  https://public.ccsds.org/Pubs/121x0b2ec1.pdf 
[43]  R. F. Freund, and H. J Siegel. "Guest Editor's Introduction: Heterogeneous Pro-
cessing."  IEEE Computer Society Press, 1993, no. 6 pp. 13-17.  ISSN: 0018-9162 
[44]  R. Buyya. "High performance cluster computing: Architectures and systems (volume 1)." 
Prentice Hall, Upper SaddleRiver, NJ, USA 1, 1999. pp 999.  ISBN-13: 978-0130137845 
[45]  M. J. Mišić, Đ. M. Đurđević and M. V. Tomašević, "Evolution and trends in GPU compu-
ting," 2012 Proceedings of the 35th International Convention MIPRO, Opatija, 2012, pp. 
289-294. ISBN: 978-953-233-068-7 
[46]  M. Halpern, Y. Zhu and V. J. Reddi, "Mobile CPU's rise to power: Quantifying the impact of 
generational mobile CPU design trends on performance, energy, and user satisfaction," 2016 
IEEE International Symposium on High Performance Computer Architecture (HPCA), 
Barcelona, 2016, pp. 64-76. doi: 10.1109/HPCA.2016.7446054 
[47]  J. Teich, "Hardware/Software Codesign: The Past, the Present, and Predicting the Future," in 
Proceedings of the IEEE, vol. 100, no. Special Centennial Issue, pp. 1411-1430, 13 May 
2012. doi: 10.1109/JPROC.2011.2182009 
[48]  K. Rupp, “40 Years of Microprocessor Trend Data” [Online] Accessed: September 2018 
Available: https://www.karlrupp.net/2015/06/40-years-of-microprocessor-trend-data/ 
[49]  G.M Amdahl, “Validity of the single processor approach to achieving large scale computing 
capabilities.” In Proceedings of the spring joint computer conference ACM, April 18-20, 
1967, (pp. 483-485).  
[50]  NVIDIA “GEFORCE GTX 980: Featuring Maxwell, the most advanced GPU ever made.” 
White paper, NVIDIA Corporation. 2014. [Online] Accessed September 2018 Available: 
https://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF 
[51]  NVIDIA, “NVIDIA CUDA C programming guide. v10.0.130” NVIDIA Corporation,  Sep-
tember 19, 2018 [Online] Accessed September 2018 Available:  
https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf 
 
[52]  M. I. Soliman and F. S. Ahmed, "Exploiting ILP, DLP, TLP, and MPI to accelerate matrix 
multiplication on Xeon processors," 2014 International Conference on Engineering and 
Technology (ICET), Cairo, 2014, pp. 1-6. doi: 10.1109/ICEngTechnol.2014.7016779 
[53]  NVIDIA, “Issue Efficiency” NVIDIA Corporation 2015 [Online] Accessed September 2018, 
Available: 
https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/issueefficiency.htm 
[54]  A. Adinets, NVIDIA “CUDA Dynamic Parallelism API and Principles” NVIDIA Corporation, 
May 20, 2014. [Online] Accessed September 2018, Available: 
https://devblogs.nvidia.com/cuda-dynamic-parallelism-api-principles/ 
[55]  NVIDIA “Whitepaper, NVIDIA GeForce GTX 750 Ti, Featuring First-Generation Maxwell 
GPU Technology, Designed for Extreme Performance per Watt”, NVIDIA Corporation 2014 
[Online] Accessed September 2018, Available: 
http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf 
[56]  NVIDIA, “Achieved Occupancy” NVIDIA Corporation 2015 [Online] Accessed September 
2018, Available: 
https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm 
[57]  NVIDIA “CUDA Occupancy Calculator”, NVIDIA Corporation [Online]  Accessed Sep-
tember 2018, Available : 
https://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls 
[58]  J. Luitjens, S. Rennich “CUDA Warps and Occupancy, GPU Computing Webinar” GTC 
GPU Technology Conference 2011, NVIDIA Corporation. [Online] Accessed September 
2018, Available: 
http://on-demand.gputechconf.com/gtc-express/2011/presentations/cuda_webinars_WarpsAndOccupancy.pdf 
[59]  J. Balfour “CUDA Threads and Atomics” NVIDIA Corporation, 2011. [Online]  Accessed 
September 2018, Available: 
https://mc.stanford.edu/cgi-bin/images/3/34/Darve_cme343_cuda_3.pdf 
[60]  Jkielty, Device Atlas “The most used smartphone GPU – 2019” [Online] Accessed May 
2019. Available https://deviceatlas.com/blog/most-used-smartphone-gpu 
[61]  M. Harris “How to Access Global Memory Efficiently in CUDA C/C++ Kernels” NVIDIA 
Corporation, 2013. [Online]  Accessed September 2018, Available: 
https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-c-kernels/ 
[62]  P. Rech, C. Aguiar, C. Frost, L. Carro, "Neutron radiation test of graphic processing 
units",  2012 IEEE 18th International On-Line Testing Symposium (IOLTS) DOI: 
10.1109/IOLTS.2012.6313841 
[63]  P. Rech, C. Aguiar, C. Frost and L. Carro, "Neutron sensitivity of integer and floating point 
operations executed in GPUs," 2013 14th Latin American Test Workshop - LATW, Cordoba, 
2013, pp. 1-6. doi: 10.1109/LATW.2013.6562683 
 
[64]  P. Rech, C. Aguiar, C. Frost and L. Carro, "Experimental evaluation of thread distribution 
effects on multiple output errors in GPUs," 2013 18th IEEE European Test Symposium 
(ETS), Avignon, 2013, pp. 1-6. doi: 10.1109/ETS.2013.6569352 
[65]  P. Rech, T. D. Fairbanks, H. M. Quinn and L. Carro, "Threads Distribution Effects on 
Graphics Processing Units Neutron Sensitivity," in IEEE Transactions on Nuclear Science, 
vol. 60, no. 6, pp. 4220-4225, Dec. 2013. doi: 10.1109/TNS.2013.2286970 
[66]  P. Rech, L. Carro, N. Wang, T. Tsai, S. Hari, and S. W. Keckler “Measuring the radiation 
reliability of SRAM structures in GPUs designed for HPC.” In IEEE 10th Workshop on 
Silicon Errors in Logic-System Effects (SELSE), 2014. doi: 10.1.1.696.8896 
[67]  D. Sabena, M. S. Reorda, L. Sterpone, P. Rech, L. Carro “Evaluating the radiation sensitivity 
of GPGPU caches: New algorithms and experimental results” Microelectronics Reliability, 
Volume 54, Issue 11, November 2014, Pages 2621-2628. doi: 
10.1016/j.microrel.2014.05.001 
[68]  D. A. G. Oliveira, P. Rech, L. L. Pilla, P. O. A. Navaux and L. Carro, "GPGPUs ECC 
efficiency and efficacy," 2014 IEEE International Symposium on Defect and Fault Tolerance in 
VLSI and Nanotechnology Systems (DFT), Amsterdam, 2014, pp. 209-215. doi: 
10.1109/DFT.2014.6962085 
[69]  D. Tiwari, S. Gupta, J. Rogers, D. Maxwell, P. Rech et al., "Understanding GPU errors on 
large-scale HPC systems and the implications for system design and operation," 2015 IEEE 
21st International Symposium on High Performance Computer Architecture (HPCA), 
Burlingame, CA, 2015, pp. 331-342. doi: 10.1109/HPCA.2015.7056044 
[70]  B. Fang, K. Pattabiraman, M. Ripeanu and S. Gurumurthi, "GPU-Qin: A methodology for 
evaluating the error resilience of GPGPU applications," 2014 IEEE International Symposium 
on Performance Analysis of Systems and Software (ISPASS), Monterey, CA, 2014, 
pp. 221-230. doi: 10.1109/ISPASS.2014.6844486 
[71]  Q. Lu, M. Farahani, J. Wei, A. Thomas and K. Pattabiraman, "LLFI: An Intermediate 
Code-Level Fault Injection Tool for Hardware Faults," 2015 IEEE International Conference on 
Software Quality, Reliability and Security, Vancouver, BC, 2015, pp. 11-16. doi: 
10.1109/QRS.2015.13 
[72]  M. Stephenson, S. K. S. Hari, Y. Lee, E. Ebrahimi, D. R. Johnson, D. Nellans, M. O'Connor, 
S. W. Keckler,  "Flexible software profiling of GPU architectures," 2015 ACM/IEEE 42nd 
Annual International Symposium on Computer Architecture (ISCA), Portland, OR, 2015, pp. 
185-197. doi: 10.1145/2749469.2750375 
[73]  S. Hari, T. Tsai, M. Stephenson, S. W. Keckler, and J. Emer. "Sassifi: Evaluating resilience 
of GPU applications." In Proceedings of the Workshop on Silicon Errors in Logic-System 
Effects (SELSE). 2015. DOI :10.1145/3075564.3075598 
[74]  NVIDIA, “Flexible GPGPU Instrumentation, NVlabs,” NVIDIA Corporation [Online] 
Accessed September 2018, Available: https://github.com/NVlabs/SASSI 
 
[75]  NVIDIA, “An architecture-level fault injection tool for GPU application resilience 
evaluations, NVlabs,” NVIDIA Corporation [Online] Accessed September 2018, Available: 
https://github.com/NVlabs/sassifi  
[76]  S. K. S. Hari, T. Tsai, M. Stephenson, S. W. Keckler and J. Emer, "SASSIFI: An 
architecture-level fault injection tool for GPU application resilience evaluation," 2017 IEEE 
International Symposium on Performance Analysis of Systems and Software (ISPASS), 
Santa Rosa, CA, 2017, pp. 249-258. doi: 10.1109/ISPASS.2017.7975296  
[77]  H. Takizawa, K. Sato, K. Komatsu and H. Kobayashi, "CheCUDA: A Checkpoint/Restart 
Tool for CUDA Applications," 2009 International Conference on Parallel and Distributed 
Computing, Applications and Technologies, Higashi Hiroshima, 2009, pp. 408-413. doi: 
10.1109/PDCAT.2009.78 
[78]  P. H. Hargrove and J. C. Duell “Berkeley lab checkpoint/restart (BLCR) for Linux clusters”, 
IOP Publishing Ltd, Journal of Physics: Conference Series, Volume 46, 2006 
DOI: 10.1088/1742-6596/46/1/067 
[79]  A. Nukada, H. Takizawa and S. Matsuoka, "NVCR: A Transparent Checkpoint-Restart 
Library for NVIDIA CUDA," 2011 IEEE International Symposium on Parallel and Distributed 
Processing Workshops and Phd Forum, Shanghai, 2011, pp. 104-113. doi: 
10.1109/IPDPS.2011.131 
[80]  M. Dimitrov, M. Mantor, H. Zhou “Understanding software approaches for GPGPU 
reliability” Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing 
Units, pp. 94-104, Washington, D.C., USA, March 8, 2009. DOI: 
10.1145/1513895.1513907 
[81]  P. Rech, C. Aguiar, C. Frost and L. Carro, "An Efficient and Experimentally Tuned 
Software-Based Hardening Strategy for Matrix Multiplication on GPUs," in IEEE Transactions 
on Nuclear Science, vol. 60, no. 4, pp. 2797-2804, Aug. 2013. doi: 
10.1109/TNS.2013.2252625 
[82]  D. A. G. Oliveira, P. Rech, H. M. Quinn, T. D. Fairbanks, L. Monroe et al., "Modern GPUs 
Radiation Sensitivity Evaluation and Mitigation Through Duplication With Comparison," in 
IEEE Transactions on Nuclear Science, vol. 61, no. 6, pp. 3115-3122, Dec. 2014. doi: 
10.1109/TNS.2014.2362014 
[83]  I. Daubechies and W. Sweldens. “Factoring wavelet transforms into lifting steps.” In Journal 
of Fourier Analysis and Applications, pp. 247-269. 1998. DOI: 10.1007/BF02476026 
[84]  W. Sweldens. “The lifting scheme: A construction of second generation wavelets.” In  SIAM 
Journal on Mathematical Analysis, pp. 511-546. 1998. DOI: 10.1137/S0036141095289051 
[85]  A. Jensen and A. la Cour-Harbo. “Ripples in Mathematics: The Discrete Wavelet 
Transform” Springer, 2001. ISBN-10: 3540416625 
[86]  C. E. Shannon. "A mathematical theory of communication." In The Bell System Technical 
Journal pp. 379-423, 1948. DOI: 10.1002/j.1538-7305.1948.tb01338.x 
 
[87]  A. Zandi, J. D. Allen, E. L. Schwartz and M. Boliek. “CREW: Compression with reversible 
embedded wavelets.” In Data Compression Conference Proceedings, 1995. DOI: 
10.1109/DCC.1995.515511. 
[88]  T. Trenschel, T. Bretschneider and G. Leedham, "Using JPEG2000 on-board mini-satellites 
for image-driven compression," IGARSS 2003. 2003 IEEE International Geoscience and 
Remote Sensing Symposium. Proceedings (IEEE Cat. No.03CH37477), Toulouse, 2003, pp. 
2033-2035. doi: 10.1109/IGARSS.2003.1294330 
[89]  L. W. Chew, L. Ang and K. P. Seng. “Lossless image compression using tuned degree-k 
zerotree wavelet coding.” In Proceedings of the International MultiConference of Engineers 
and Computer Scientists, 2009. DOI: https://doi.org/10.1007/978-90-481-3517-2_12 
[90]  J. Langdon and G. Glen. “Sunset: A hardware-oriented algorithm for lossless compression of 
gray-scale images.” In Proc. SPIE, Vol. 1444, pp. 272-282, 1991. DOI: 10.1117/12.45179 
[91]  D. A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes," in 
Proceedings of the IRE, vol. 40, no. 9, pp. 1098-1101, Sept. 1952. doi: 
10.1109/JRPROC.1952.273898 
[92]  ISO/IEC JTC 1/SC 29/WG 1 (1994) Call for contributions lossless compression of 
continuous-tone still pictures. ISO Working Document ISO/IEC JTC1/SC29/WG1 N41. 
[93]  ITU, Rec. T.81 (09/92) Terminal Equipment and Protocols for Telematic Services - 
Information Technology - Digital Compression and Coding of Continuous-Tone Still Images - 
Requirements and Guidelines, 1993. 
[94]  X. Wu and N. Memon. “CALIC-a context based adaptive lossless image codec.” In 
Proceedings Of IEEE International Conference On Acoustics, Speech, and Signal Processing, 1996. 
DOI: 10.1109/ICASSP.1996.544819. 
[95]  M. J. Weinberger, G. Seroussi and G. Sapiro. “LOCO-I: A low complexity, context-based, 
lossless image compression algorithm.” In Proceedings Of Data Compression Conference, 
1996. DOI: 10.1109/DCC.1996.488319. 
[96]  M. J. Weinberger, J. J. Rissanen and R. B. Arps. “Applications of universal context modeling 
to lossless compression of gray-scale images.” In Proceedings Of Asilomar Conference on 
Signals, Systems and Computers, 1995. DOI: 10.1109/ACSSC.1995.540546. 
[97]  J. Shukla, M. Alwani and A. K. Tiwari. “A survey on lossless image compression methods.” 
In Proceedings of the IEEE International Conference on Computer Engineering and 
Technology, 2010. DOI: 10.1109/ICCET.2010.5486344 
[98]  M. Weinberger, G. Seroussi and G. Sapiro. “LOCO-A : An arithmetic coding extension of 
LOCO-I.” ISO. Vol. 342. IEC JTC1/SC29/WG1 document, 1996. 
[99]  H. Ye, G. Deng and J. C. Devlin. “A weighted least squares method for adaptive prediction 
in lossless image compression.”  In Proc. Picture Coding Symp, pp. 489-493. 2003. DOI 
10.1.1.137.1927 
 
[100]  G. Deng and H. Ye. “Lossless image compression using adaptive predictor symbol mapping 
and context filtering.” In Proceedings of the International Conference On Image Processing, 
1999. DOI: 10.1109/ICIP.1999.819520. 
[101]  A. Martchenko and Guang Deng. “Bayesian predictor combination for lossless image 
compression.” In IEEE Transactions On Image Processing, pp. 5263-5270, 2013. DOI: 
10.1109/TIP.2013.2284067. 
[102]  F. Rizzo, G. Motta, B. Carpentieri and J. A. Storer. “Lossless compression of hyperspectral 
imagery: A real-time approach.” In Proc. of SPIE vol. 5573. 2004. DOI: 10.1117/12.565407 
[103]  X. Wu, W. Choi and N. Memon. “Lossless interframe image compression via context 
modeling.” In Proceedings of the Data Compression Conference, 1998. DOI: 
10.1109/DCC.1998.672169 
[104]  F. Rizzo, B. Carpentieri, G. Motta and J. A. Storer. “High performance compression of 
hyperspectral imagery with reduced search complexity in the compressed domain.” In 
Proceedings of the Data Compression Conference, 2004. DOI: 10.1109/DCC.2004.1281493 
[105]  M. Klimesh. “Low-Complexity Lossless Compression of Hyperspectral Imagery Via 
Adaptive Filtering” Jet Propulsion Lab., California Inst. of Tech., United States, 2005. [Online] 
Accessed September 2018, Available: 
https://ipnpr.jpl.nasa.gov/progress_report/42-163/163H.pdf 
[106]  CCSDS, “Lossless Multispectral & Hyperspectral Image Compression”, Recommendation 
for Space Data System Standards, Recommended Standard CCSDS 123.0-B-1, Blue Book 
May 2012 [Online] Accessed September 2018, Available: 
https://public.ccsds.org/Pubs/123x0b1ec1.pdf 
[107]  J. Mielikainen. “Lossless compression of hyperspectral images using lookup tables.” In IEEE 
Signal Processing Letters, pp. 157-160. 2006. DOI: 10.1109/LSP.2005.862604 
[108]  B. Huang and Y. Sriraja. “Lossless compression of hyperspectral imagery via lookup tables 
with predictor selection.” In Proc. SPIE. vol. 6365, 2006. DOI: 10.1117/12.690659 
[109]  J. Mielikainen and P. Toivanen. “Lossless compression of hyperspectral images using a 
quantized index to lookup tables.” In IEEE Geoscience and Remote Sensing Letters, 
pp. 474-478. 2008. DOI: 10.1109/LGRS.2008.917598 
[110]  A. B. Kiely and M. A. Klimesh. “Exploiting calibration-induced artifacts in lossless 
compression of hyperspectral imagery.” In IEEE Transactions On Geoscience and Remote 
Sensing, pp. 2672-2678. 2009. DOI: 10.1109/TGRS.2009.2015291 
[111]  J. Mielikainen and P. Toivanen. “Clustered DPCM for the lossless compression of 
hyperspectral images.” In IEEE Transactions On Geoscience and Remote Sensing, pp. 2943-2946. 
2003. DOI: 10.1109/TGRS.2003.820885 
[112]  G. Motta, F. Rizzo and J. A. Storer. “Compression of hyperspectral imagery.” In Proceedings 
of the Data Compression Conference, 2003. DOI: 10.1109/DCC.2003.1194024 
 
[113]  V. Rehna and M. J. Kumar. “Effect of tiling on the performance of GW algorithm for image 
coding.” In Asian Journal of Scientific Research, pp. 418. 2014. DOI: 
10.3923/ajsr.2014.418.433 
[114]  J. Yu and D. J. DeWitt “Processing satellite images on tertiary storage: A study of the impact 
of tile size on performance.” 1996. [Online] Accessed September 2018 Available: 
https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19960052752.pdf  
[115]  Z. Zhihua, and N. Saito. "PHLST with adaptive tiling and its application to Antarctic remote 
sensing image approximation." In Inverse Problems and Imaging, 2014. DOI: 
https://doi.org/10.3934/ipi.2014.8.321 
[116]  G. S. Aglietti, Z. Zhang, G. Richardson, B. le Page, A. Haslehurst, "Disturbance sources 
modeling for analysis of structure-borne micro-vibration", Proceedings of III ECCOMAS Thematic 
Conference on computational methods in structural dynamics and earthquake engineering, 
May 2011. 
[117]  Surrey Satellite Technology Limited “SSTL Blog: Reducing Camera Shake by Measuring 
Microvibration.” [Online] Accessed: September 2017 Available: 
http://www.sstl.co.uk/Blog/March-2016/Reducing-camera-shake-by-measuring-microvibratio  
[118]  Y. Zhu, M. Wang, Q. Zhu, & J. Pan. “Detection and Compensation of Band-to-band 
Registration Error for Multi-spectral Imagery Caused By Satellite Jitter.” In ISPRS Annals of the 
Photogrammetry, Remote Sensing and Spatial Information Sciences, 2014. DOI: 
10.5194/isprsannals-II-1-69-2014 
[119]  P. E. Anuta. "Spatial registration of multispectral and multitemporal digital imagery using 
fast Fourier transform techniques." In IEEE transactions on Geoscience Electronics, 
pp. 353-368, 1970. DOI: 10.1109/TGE.1970.271435 
[120]  B. Zitova and J. Flusser, “Image Registration Methods: A Survey”, In Elsevier Image and 
Vision Computing, 2003. DOI: https://doi.org/10.1016/S0262-8856(03)00137-9 
[121]  Z. Yi, C. Zhiguo and X. Yang, “Multi-spectral remote image registration based on SIFT”, In 
IET Electronics Letters, vol. 44, 2008. DOI: 10.1049/el:20082477 
[122]  J. Chen and J. Tian, “Rapid Multi-modality pre-registration based on SIFT descriptor”, In 
Proceedings of IEEE International Conference of Engineering in Medicine and Biology 
Society, 2006. DOI: 10.1109/IEMBS.2006.260599 
[123]  M. Vural, Y. Yasemin, and A. Temizel. "Registration of multispectral satellite images with 
orientation-restricted SIFT." In IEEE International Geoscience and Remote Sensing 
Symposium, 2009. DOI: 10.1109/IGARSS.2009.5417801 
[124]  J. Tansock, D. Bancroft, J. Butler et al. “Guidelines for Radiometric Calibration of 
Electro-Optical Instruments for Remote Sensing” U.S. Department of Commerce, National Institute 
of Standards and Technology, April 2015. DOI: 10.6028/NIST.HB.157 [Online] Accessed 
May 2019. Available http://dx.doi.org/10.6028/NIST.HB.157  
 
[125]  Y. Bentoutou, N. Taleb, K. Kpalma & J. Ronsin. “An automatic image registration 
for applications in remote sensing.” In IEEE Transactions on Geoscience and Remote 
Sensing, pp. 2127-2137, 2005. DOI: 10.1109/TGRS.2005.853187 
[126]  P. Teillet. “Image correction for radiometric effects in remote sensing.” In International 
Journal of Remote Sensing, pp. 1637-1651. 1986. DOI: 10.1080/01431168608948958 
[127]  A. Meygret, D.Leger, "In-flight refocusing of the SPOT-1 HRV cameras," Proc. SPIE 2758, 
Algorithms for Multispectral and Hyperspectral Imagery II, 17 June 1996 doi: 
10.1117/12.243225 
[128]  Qiang Wu, Yaobin Chi and Zhiyong Wang. “CCD noise effect on data transmission 
efficiency of onboard lossless-compressed remote sensing images.” In Proceedings of International 
Conference On Information Engineering and Computer Science, 2009. DOI: 
10.1109/ICIECS.2009.5365161. 
[129]  B. Han, L. Kang and H. Song. "A fast cloud detection approach by integration of image 
segmentation and support vector machine," in , J. Wang, Z. Yi, J. Zurada, B. Lu and H. Yin, 
Eds. 2006, Available: http://dx.doi.org/10.1007/11760191_176. DOI: 
10.1007/11760191_176. 
[130]  M. Griffin, H. Burke, D. Mandl and J. Miller, "Cloud cover detection algorithm for EO-1 
Hyperion imagery," IGARSS 2003. 2003 IEEE International Geoscience and Remote 
Sensing Symposium. Proceedings (IEEE Cat. No.03CH37477), Toulouse, 2003, pp. 86-89 vol.1. 
doi: 10.1109/IGARSS.2003.1293687 
[131]  C. M. Hartzell and S. R. Cheng, "A feasibility study of on-board cloud detection and 
compression," 2010 IEEE Aerospace Conference, Big Sky, MT, 2010, pp. 1-11. doi: 
10.1109/AERO.2010.5446709 
[132]  Men Long and Heng-Ming Tai. “Region of interest coding for image compression.” 
Presented at Circuits and Systems, 2002. MWSCAS-2002. the 2002 45th Midwest Symposium On. 
2002. DOI: 10.1109/MWSCAS.2002.1186825. 
[133]  M. Ergul, A. Aydın Alatan, "An automatic geo-spatial object recognition algorithm for high 
resolution satellite images", Proc. SPIE 8897, Electro-Optical Remote Sensing, Photonic 
Technologies, and Applications VII; and Military Applications in Hyperspectral Imaging and 
High Spatial Resolution Sensing, 88970J (15 October 2013); doi: 10.1117/12.2029136; 
[134]  T. Blaschke, S. Lang, and G. Hay, “Object-based image analysis: spatial concepts for 
knowledge-driven remote sensing applications”. Springer Science & Business Media, 2008.  
ISBN-10: 3540770577 
[135]  A. Grivei, A. Radoi and M. Datcu, "Land cover change detection in Satellite Image Time 
Series using an active learning method," 2017 9th International Workshop on the Analysis of 
Multitemporal Remote Sensing Images (MultiTemp), Brugge, 2017, pp. 1-4. doi: 
10.1109/Multi-Temp.2017.8035213 
[136]  P. Chen, Y. Zhang, Z. Jia, J. Yang, & N. Kasabov “Remote Sensing Image Change Detection 
Based on NSCT-HMT Model and Its Application.” Sensors (Basel, Switzerland), 2017, 
17(6), 1295. http://doi.org/10.3390/s17061295 
[137]  T. Costăchioiu, R. Constantinescu and M. Datcu, "Multitemporal Satellite Image Time 
Series analysis of urban development in Bucharest and Ilfov areas," 2014 10th International 
Conference on Communications (COMM), Bucharest, 2014, pp. 1-4. doi: 
10.1109/ICComm.2014.6866702 
 
[138]  N. Aranki, D. Keymeulen, A. Bakhshi and M. Klimesh, "Hardware Implementation of 
Lossless Adaptive and Scalable Hyperspectral Data Compression for Space," 2009 NASA/ESA 
Conference on Adaptive Hardware and Systems, San Francisco, CA, 2009, pp. 315-322. doi: 
10.1109/AHS.2009.66 
[139]  D. Keymeulen, N. Aranki, A. Bakhshi, H. Luong, C. Sarture and D. Dolman. “Airborne 
demonstration of FPGA implementation of fast lossless hyperspectral data compression 
system.” Presented at Adaptive Hardware and Systems (AHS), 2014 NASA/ESA Conference 
On. 2014. DOI: 10.1109/AHS.2014.6880188. 
[140]  D. Keymeulen, N. Aranki, B. Hopson, A. Kiely, M. Klimesh and K. Benkrid, "GPU lossless 
hyperspectral data compression system for space applications," 2012 IEEE Aerospace 
Conference, Big Sky, MT, 2012, pp. 1-9. doi: 10.1109/AERO.2012.6187255 
[141]  B. Hopson, K. Benkrid, D. Keymeulen and N. Aranki, "Real-time CCSDS lossless adaptive 
hyperspectral image compression on parallel GPGPU & multicore processor systems," 2012 
NASA/ESA Conference on Adaptive Hardware and Systems (AHS), Erlangen, 2012, pp. 
107-114. doi: 10.1109/AHS.2012.6268637 
[142]  NVIDIA, “Thrust: GPU Accelerated Compute Libraries”, NVIDIA Corporation [Online] 
Accessed September 2018, Available:  https://developer.nvidia.com/thrust 
[143]  I. Blanes, CCSDS Space Link Services (SLS) Data Compression (DC), “123.0-B-Info, Test 
Data Corpus” [Online] Accessed May 2019. Available  
https://cwe.ccsds.org/sls/docs/SLS-DC/123.0-B-Info/TestData/ 
[144]  T. S. Perry “A Fictional Compression Metric Moves Into the Real World” IEEE Spectrum 
[Online] Accessed September 2018, Available: 
https://spectrum.ieee.org/view-from-the-valley/computing/software/a-madefortv-compression-metric-moves-to-the-real-world 
[145]  NVIDIA, “Tegra X1 Whitepaper” NVIDIA Corporation [Online] Accessed September 2018 
Available: 
https://international.download.nvidia.com/pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf 
[146]  European Space Agency, “Landsat-8 / LDCM (Landsat Data Continuity Mission)” ESA EO 
portal Directory [Online] Accessed September 2018 Available: 
https://earth.esa.int/web/eoportal/satellite-missions/content/-/article/landsat-8-ldcm 
[147]  Creative Research Systems, “Sample Size Formula” [Online] Accessed May 2019. Available 
https://www.surveysystem.com/sample-size-formula.htm  
[148]  NVIDIA, “Jetson TX2”, NVIDIA Corporation [Online] Accessed September 2018, 
Available: https://developer.nvidia.com/embedded/buy/jetson-tx2 
 
[149]  NVIDIA, “Jetson TX2i”, NVIDIA Corporation [Online] Accessed September 2018, 
Available: 
https://devtalk.nvidia.com/default/topic/1030837/nvidia-jetson-tx2i-module-for-industrial-environments/ 
[150]  SSTL, “CEOI grant to enable on board video processing”, Surrey Satellite Technology 
Limited [Online] Accessed September 2018, Available: 
https://www.sstl.co.uk/media-hub/latest-news/2018/ceoi-grant-to-enable-on-board-video-processing 
[151]  S. Golomb, "Run-length encodings" in IEEE Transactions on Information Theory, vol. 12, 
no. 3, pp. 399-401, July 1966. doi: 10.1109/TIT.1966.1053907 
[152]  I. H. Witten, R. M. Neal and J. G. Cleary. “Arithmetic coding for data compression.” In 
Communications of the ACM, pp. 520-540. 1987. DOI: 10.1145/214762.214771 
[153]  K. Rao and P. Yip. “Discrete cosine transform: Algorithms, advantages, applications.” 
Academic Press Inc. 1990. ISBN: 0-12-580203-X 
[154]  K. Cabeen and P. Gent. "Image compression and the discrete cosine transform." College of 
the Redwoods, 1998. [Online] Accessed September 2018 Available:  
https://www.math.cuhk.edu.hk/~lmlui/dct.pdf 
[155]  MathWorks, “MathWorks, Inc - 2-D discrete cosine transform.” MathWorks, [Online] 
Accessed September 2018 Available: 
http://uk.mathworks.com/help/images/ref/dct2.html?refresh=true. 
[156]  B. Aiazzi, P. Alba, L. Alparone and S. Baronti, "Lossless compression of 
multi/hyper-spectral imagery based on a 3-D fuzzy prediction," in IEEE Transactions on Geoscience and 
Remote Sensing, vol. 37, no. 5, pp. 2287-2294, Sept. 1999. doi: 10.1109/36.789625 
[157]  B. Aiazzi, L. Alparone and S. Baronti. “Near-lossless compression of 3-D optical data.” In 
IEEE Transactions On Geoscience and Remote Sensing, pp. 2547-2557. 2001. DOI: 
10.1109/36.964993 
[158]  Website “Wolfram Mathematica Fuzzy Logic Documentation - Fuzzy Clustering.” 
Available: http://reference.wolfram.com/applications/fuzzylogic/Manual/12.html. Accessed: 
September 2017 
[159]  J. Mielikainen and B. Huang. “Lossless compression of hyperspectral images using clustered 
linear prediction with adaptive prediction length.” In IEEE Geoscience and Remote Sensing 
Letters, pp. 1118-1121. 2012. DOI: 10.1109/LGRS.2012.2191531 
[160]  S. K. Jain and D. A. Adjeroh. “Edge-based prediction for lossless compression of 
hyperspectral images.” In Proceedings of Data Compression Conference, 2007. DOI: 
10.1109/DCC.2007.36 
[161]  H. Guogang, and C. Chen. "Distributed source coding in wireless sensor networks." In 
Proceedings of IEEE International Conference on Quality of Service in Heterogeneous 
Wired/Wireless Networks, 2005. DOI: 10.1109/QSHINE.2005.19 
[162]  E. Magli, M. Barni, A. Abrardo and M. Grangetto. “Distributed source coding techniques for 
lossless compression of hyperspectral images.” In EURASIP Journal on Applied Signal 
Processing, pp. 24-24. 2007. DOI: 10.1155/2007/45493 
[163]  A. Abrardo, M. Barni, E. Magli and F. Nencini. “Error-resilient and low-complexity onboard 
lossless compression of hyperspectral images by means of distributed source coding.” In 
IEEE Transactions On Geoscience and Remote Sensing, pp. 1892-1904. 2010. DOI: 
10.1109/TGRS.2009.2033470 
[164]  P. G. Howard and J. S. Vitter, "Fast and efficient lossless image compression," [Proceedings] 
DCC `93: Data Compression Conference, Snowbird, UT, USA, 1993, pp. 351-360. doi: 
10.1109/DCC.1993.253114 
[165]  M. J. Weinberger, J. J. Rissanen and R. B. Arps. Applications of universal context modeling 
to lossless compression of gray-scale images. Presented at Signals, Systems and Computers, 
1995. 1995 Conference Record of the Twenty-Ninth Asilomar Conference On. 1995, . DOI: 
10.1109/ACSSC.1995.540546. 
[166]  A. Said and W. A. Pearlman. “A new, fast, and efficient image codec based on set partition-
ing in hierarchical trees.” In IEEE Transactions On Circuits and Systems for Video 
Technology, pp. 243-250. 1996. DOI: 10.1109/76.499834. 
[167]  ISO/IEC “Specification information technology—Computer graphics and image 
processing—Portable network graphics (PNG): Functional specification.” ISO/IEC 15948:2003 
[168]  T. Seemann, P. Tischer and B. Meyer. “History-based blending of image sub-predictors.” 
Presented at Proc. Picture Coding Symposium. 1997, doi=10.1.1.34.5981 
[169]  B. Meyer and P. Tischer. TMW - a new method for lossless image compression. Presented at 
Proc. of the 1997 International Picture Coding Symposium (PCS97). 1997, 
doi=10.1.1.116.3891 
[170]  Guang Deng and Hua Ye. Lossless image compression using adaptive predictor symbol 
mapping and context filtering. Presented at Image Processing, 1999. ICIP 99. Proceedings. 
1999 International Conference On. 1999. DOI: 10.1109/ICIP.1999.819520. 
[171]  Xin Li and M. T. Orchard. Edge directed prediction for lossless compression of natural im-
ages. Presented at Image Processing, 1999. ICIP 99. Proceedings. 1999 International 
Conference On. 1999, . DOI: 10.1109/ICIP.1999.819519. 
[172]  G. Motta, J. A. Storer and B. Carpentieri. Lossless image coding via adaptive linear 
prediction and classification. Proceedings of the IEEE 88(11), pp. 1790-1796. 2000. DOI: 
10.1109/5.892714. 
[173]  I. Matsuda, H. Mori and S. Itoh. Lossless coding of still images using minimum-rate 
predictors. Presented at Image Processing, 2000. Proceedings. 2000 International Conference On. 
2000. DOI: 10.1109/ICIP.2000.900912. 
[174]  N. V. Boulgouris, D. Tzovaras and M. G. Strintzis. Lossless image compression based on 
optimal prediction, adaptive lifting, and conditional arithmetic coding. Image Processing, 
IEEE Transactions On 10(1), pp. 1-14. 2001. DOI: 10.1109/83.892438. 
[175]  N. V. Boulgouris, D. Tzovaras and M. G. Strintzis. Lossless image compression based on 
optimal prediction, adaptive lifting, and conditional arithmetic coding. Image Processing, 
IEEE Transactions On 10(1), pp. 1-14. 2001. DOI: 10.1109/83.892438. 
[176]  Shih-Ta Hsiang. Embedded image coding using zeroblocks of subband/wavelet coefficients 
and context modeling. Presented at Data Compression Conference, 2001. Proceedings. DCC 
2001. 2001, . DOI: 10.1109/DCC.2001.917139. 
[177]  I. Matsuda, N. Shirai and S. Itoh. "Lossless coding using predictors and arithmetic code 
optimized for each image," in Visual Content Processing and Representation, 2003. 
[178]  H. Ye, G. Deng and J. C. Devlin. A weighted least squares method for adaptive prediction in 
lossless image compression.  
[179]  T. Haijiang, K. Sei-ichiro, T. Kazuyuki and K. Masa-aki. Lossless image compression via 
multi-scanning and adaptive linear prediction. Presented at Circuits and Systems, 2004. 
Proceedings. the 2004 IEEE Asia-Pacific Conference On. 2004. DOI: 
10.1109/APCCAS.2004.1412696. 
[180]  I. Avcibas, N. Memon, B. Sankur and K. Sayood. A successively refinable lossless image-
coding algorithm. Communications, IEEE Transactions On 53(3), pp. 445-452. 2005. . DOI: 
10.1109/TCOMM.2005.843421. 
[181]  Lih-Jen Kau and Yuan-Pei Lin. Adaptive lossless image coding using least squares 
optimization with edge-look-ahead. Circuits and Systems II: Express Briefs, IEEE Transactions On 
52(11), pp. 751-755. 2005. DOI: 10.1109/TCSII.2005.852194. 
[182]  I. Matsuda, N. Ozaki, Y. Umezu and S. Itoh. Lossless coding using variable blocksize 
adaptive prediction optimized for each image. Presented at In Proceedings of the 13th Euro. 2005, 
[183]  J. A. Robinson. Adaptive prediction trees for image compression. Image Processing, IEEE 
Transactions On 15(8), pp. 2131-2145. 2006. . DOI: 10.1109/TIP.2006.875196. 
[184]  H. Pan, W. C. Siu and N. F. Law. Lossless image compression using binary wavelet 
transform. Image Processing, IET 1(4), pp. 353-362. 2007. DOI: 10.1049/iet-ipr:20060195. 
[185]  L. W. Chew, L. Ang and K. P. Seng. Lossless image compression using tuned degree-k 
zerotree wavelet coding. Presented at Proceedings of the International MultiConference of 
Engineers and Computer Scientists. 2009 
[186]  F. Dufaux, G. Sullivan and T. Ebrahimi. The JPEG XR image coding standard. IEEE Signal 
Process. Mag. 26(MMSPL-ARTICLE-2009-004), pp. 195-199, 204-204. 2009. 
[187]  E. Puthooran, R. S. Anand and S. Mukherjee. Lossless image compression using BPNN 
predictor with contextual error feedback. Presented at Multimedia, Signal Processing and 
Communication Technologies (IMPACT), 2011 International Conference On. 2011. DOI: 
10.1109/MSPCT.2011.6150459. 
[188]  C. Lee and L. Kau. Enhancing the predictive coding efficiency with control technologies for 
lossless compression of images. Image Processing, IET 6(3), pp. 251-263. 2012. 
[189]  Z. Wang, M. Klaiber, Y. Gera, S. Simon and T. Richter. Fast lossless image compression 
with 2D golomb parameter adaptation based on JPEG-LS. Presented at Signal Processing 
Conference (EUSIPCO), 2012 Proceedings of the 20th European. 2012, . 
 
 
[190]  A. Martchenko and Guang Deng. Bayesian predictor combination for lossless image 
compression. Image Processing, IEEE Transactions On 22(12), pp. 5263-5270. 2013. DOI: 
10.1109/TIP.2013.2284067. 
[191]  G. Motta, F. Rizzo and J. A. Storer. Compression of hyperspectral imagery. Presented at 
Data Compression Conference, 2003. Proceedings. DCC 2003. 2003. 
[192]  E. Magli, G. Olmo and E. Quacchio. Optimized onboard lossless and near-lossless 
compression of hyperspectral data using CALIC. Geoscience and Remote Sensing Letters, IEEE 
1(1), pp. 21-25. 2004. 
[193]  M. Slyz and D. Zhang. A block-based inter-band lossless hyperspectral image compressor. 
Presented at Data Compression Conference, 2005. Proceedings. DCC 2005. 2005. 
[194]  H. Wang, S. D. Babacan and K. Sayood. Lossless hyperspectral-image compression using 
context-based conditional average. IEEE Trans. Geosci. Remote Sens. 45(12), 
pp. 4187-4193. 2007. 
[195]  S. K. Jain and D. A. Adjeroh. Edge-based prediction for lossless compression of 
hyperspectral images. Presented at Data Compression Conference, 2007. DCC'07. 2007. 
[196]  B. Aiazzi, L. Alparone, S. Baronti and C. Lastri. Crisp and fuzzy adaptive spectral 
predictions for lossless and near-lossless compression of hyperspectral imagery. Geoscience and 
Remote Sensing Letters, IEEE 4(4), pp. 532-536. 2007. 
[197]  J. Zhang and G. Liu. An efficient reordering prediction-based lossless compression algorithm 
for hyperspectral images. Geoscience and Remote Sensing Letters, IEEE 4(2), pp. 283-287. 
2007. 
[198]  A. Abrardo, M. Barni, E. Magli and F. Nencini. Error-resilient and low-complexity onboard 
lossless compression of hyperspectral images by means of distributed source coding. 
Geoscience and Remote Sensing, IEEE Transactions On 48(4), pp. 1892-1904. 2010. 
[199]  D. M. Hiemstra and V. Kirischian, "Single Event Upset Characterization of the Zynq-7000 
ARM® Cortex™-A9 Processor Unit Using Proton Irradiation," 2015 IEEE Radiation Effects 
Data Workshop (REDW), Boston, MA, 2015, pp. 1-3. doi: 10.1109/REDW.2015.7336735  
Rebecca L. Davidson                                             Appendices 
- 200 - 
APPENDICES 
 
Appendix A - Additional Image Compression Algorithms 
Appendix B - Traditional Image Compression Algorithms 
Appendix C - Multidimensional Image Compression Algorithms 
Appendix D - Example Architecture Part Selection 
Appendix E - CCSDS-123 User Definable Parameters 
Appendix F - Default Compression Parameters 
Appendix G - Compression Testing Image Thumbnails 
Appendix H - CCSDS-123 Algorithm Parameter Testing 
Appendix I - Low Power GPU Power Draw Analysis 
 
 
APPENDIX A - ADDITIONAL IMAGE COMPRESSION ALGORITHMS  
A.1 Entropy Encoding Only Algorithms  
Entropy encoding is a statistical redundancy reduction technique in which fixed 
length binary codewords are replaced with variable length codewords. The length of the 
codeword is determined based on the symbol probability, whereby shorter codewords are 
used to represent symbols of higher source probability. Three of the most popular entropy 
encoding schemes used in lossless image compression are Huffman, Golomb-Rice and 
arithmetic coding. Huffman coding, which was first proposed by David A. Huffman and 
published in 1952, is a variable-length prefix entropy encoding scheme [91]. Prefix cod-
ing refers to the fact that the codewords used are designed so that no codeword is the 
start, or prefix, of any other codeword. Huffman’s major contribution was a frequency 
sorted binary tree algorithm used to determine optimum codewords. A Golomb coder 
encodes a non-negative integer n as the concatenation of two parts, using a user-defined 
parameter m [151]. The first part is the unary representation of the quotient (n div m) 
and the second part is the binary representation of the remainder (n mod m). Rice 
discovered that for the case that m is a power of 2, m = 2^k, the quotient is given by 
shifting the binary representation of n to the right by k bits and the remainder is equal 
to the k least significant bits of n [23]. Arithmetic coding, unlike most other entropy coding schemes, encodes 
the data as a stream of symbols as opposed to treating symbols individually [152]. In do-
ing so the scheme achieves near optimal encoding output and greater compression 
performance than Huffman and Golomb-Rice entropy coding techniques.  
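The Golomb-Rice construction described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the unary convention (q ones followed by a terminating zero) is an assumption made for illustration; other conventions exist and the cited references should be consulted for the exact bit ordering used by a particular standard.

```python
def rice_encode(n: int, k: int) -> str:
    """Golomb-Rice codeword for a non-negative integer n with m = 2**k, as a bit string."""
    if n < 0 or k < 0:
        raise ValueError("n and k must be non-negative")
    quotient = n >> k                      # n div 2**k, obtained by a right shift of k bits
    remainder = n & ((1 << k) - 1)         # n mod 2**k, the k least significant bits of n
    unary = "1" * quotient + "0"           # assumed convention: q ones then a terminating zero
    binary = format(remainder, "b").zfill(k) if k else ""
    return unary + binary


def rice_decode(bits: str, k: int) -> int:
    """Inverse of rice_encode for a single codeword."""
    quotient = bits.index("0")             # number of leading ones
    remainder = int(bits[quotient + 1:quotient + 1 + k], 2) if k else 0
    return (quotient << k) | remainder


# Example: n = 9, k = 2 (m = 4) gives quotient 2 and remainder 1.
codeword = rice_encode(9, 2)
```

Small values of n produce short codewords, which is why the scheme suits sources in which small values are the most probable.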
Several traditional lossless image compression algorithms that utilise entropy encod-
ing alone have been proposed in literature. One such algorithm is CCSDS-121 published 
in 1997 [42]. CCSDS-121 is a standardised recommendation of an adaptive Golomb-Rice 
entropy coding scheme for onboard data coding in the space industry. It has been de-
signed for widespread use with many types of science data produced from both imaging 
and non-imaging instruments and to be suitable for implementation in systems where 
computational resources are limited. From the survey results given in Figure 2-20, 
CCSDS-121 achieves an average compression ratio of approximately 1.9. Whilst the low 
computational complexity and memory requirements of the algorithm have made it a 
popular choice for historical space missions, its relatively low compression performance 
does not make it a suitable candidate to handle the growing data volumes for future 
missions. The reduced performance of this algorithm is due to the fact that only statistical 
redundancy is exploited, whilst both spatial and spectral image redundancy are left unexploited. 
 
A.2 Transform Based Algorithms 
Mathematical transforms have found a growing application in image compression 
due to their ability to provide a spatially decorrelated representation of 2D image data. An 
image can be interpreted as a spatial intensity map of information in which the intensity 
information, represented by a pixel, has a random distribution. Mathematical transforms 
provide an alternative representation of this information in the frequency domain, in the 
form of a map of transformed coefficient values. One of the earliest transforms to be 
utilised for compression is the discrete cosine transform (DCT). There is a vast amount of 
published literature on the DCT and its application in image compression; two good 
examples are [153], which provides detailed mathematical definitions of the DCT, and [154], 
which provides further information related specifically to image compression. Figure A-1 
visually demonstrates a two-dimensional DCT applied to imagery data and was con-
structed using the built-in functions from MATLAB [155]. The Figure highlights the 
energy compaction capabilities of the transform; pre-transform energy is randomly dis-
tributed across the image block whilst post-transform, energy is compacted to a few 
coefficients in the upper left-hand corner. However, the DCT is naturally a lossy 
procedure due to the floating-point mathematical calculations required. Therefore, to 
be utilised successfully in a lossless compression scheme, the transform must be modified 
to utilise integer-only arithmetic. Research and development of a modified integer DCT 
has occurred in recent years. However, integer-DCT based algorithms have not been 
widely adopted, largely because they do not achieve a compression ratio competitive with 
alternative approaches, as shown in Table 2-7, where the surveyed traditional DCT based 
algorithms achieve the lowest average compression ratio of 1.75.  
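The energy compaction property can be reproduced with a short pure-Python sketch. This is not the MATLAB code used to construct Figure A-1; the orthonormal DCT-II definition, the function name and the synthetic 8x8 test block (a smooth ramp rather than real imagery) are all assumptions made for illustration.

```python
import math


def dct_2d(block):
    """Orthonormal 2D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def alpha(u):
        # Normalisation that makes the transform orthonormal (energy preserving).
        return math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)

    # basis[u][x] = alpha(u) * cos((2x + 1) * u * pi / (2n))
    basis = [[alpha(u) * math.cos((2 * x + 1) * u * math.pi / (2 * n))
              for x in range(n)] for u in range(n)]
    # Separable transform: apply the 1D basis along one axis, then the other.
    tmp = [[sum(basis[u][x] * block[x][y] for x in range(n))
            for y in range(n)] for u in range(n)]
    return [[sum(tmp[u][y] * basis[v][y] for y in range(n))
             for v in range(n)] for u in range(n)]


# Hypothetical smooth 8x8 block: intensity rises gently across rows and columns.
block = [[100 + 10 * x + 5 * y for y in range(8)] for x in range(8)]
coeffs = dct_2d(block)

energy_in = sum(v * v for row in block for v in row)
energy_out = sum(c * c for row in coeffs for c in row)
# Share of the total energy held by the four upper left-hand coefficients.
top_left_share = sum(coeffs[u][v] ** 2 for u in range(2) for v in range(2)) / energy_out
```

Because floating-point arithmetic is used throughout, the coefficients are not integers, which is the source of the loss discussed above.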
 
Figure A-1 Example 2D-DCT 
A.3 Complex Prediction Algorithms  
Two example algorithms that employ this technique are Adaptive Combination of 
Adaptive Predictors (ACAP) [156] and Adaptive Selection of Adaptive Predictor (ASAP) 
[157]. ACAP utilises a fuzzy-c-means clustering algorithm to classify the causal 
neighbourhood of each pixel. Fuzzy clustering enables data to be classified into more than 
one cluster; additional membership functions are associated with the data, which 
represent the degree of association of the data to each cluster. The c-means algorithm is 
one of the most common clustering algorithms; it iteratively minimises an objective 
function until a change sensitivity threshold is reached, indicating that the cluster 
coefficients have converged [158]. For each classification cluster, an optimal predictor is 
then calculated; the final predictor is a weighted sum of the optimum predictors of each 
cluster. In the second algorithm, ASAP, the image bands are partitioned into blocks and a 
Minimum Mean Square Error (MMSE) optimum predictor is calculated for each block. 
Each image block is then classified using the fuzzy-c-means algorithm. As statistically 
similar blocks exhibit similar predictors, the number of overall predictors is reduced via 
iterative predictor assignment and an optimisation procedure to establish a single 
predictor per class. ACAP and ASAP achieve the same average compression ratio of 3.36, 
which is 9% greater than IB-CALIC.  
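The fuzzy c-means iteration described above can be sketched for one-dimensional data as follows. This is a generic textbook form of the algorithm, not the ACAP or ASAP implementation; the function name, the fuzzifier m = 2, the tolerance and the sample data are assumptions chosen for illustration.

```python
def fuzzy_c_means(data, centres, m=2.0, tol=1e-6, max_iter=100):
    """1-D fuzzy c-means. Returns (centres, memberships).

    memberships[i][j] is the degree to which data[i] belongs to cluster j,
    computed against the centres used in the final iteration.
    """
    centres = list(centres)
    memberships = []
    for _ in range(max_iter):
        # Membership update: points close to a centre get a membership near 1.
        memberships = []
        for x in data:
            dists = [abs(x - c) + 1e-12 for c in centres]   # small offset avoids division by zero
            memberships.append(
                [1.0 / sum((d_j / d_k) ** (2.0 / (m - 1.0)) for d_k in dists) for d_j in dists]
            )
        # Centre update: mean of the data weighted by membership**m.
        new_centres = []
        for j in range(len(centres)):
            weights = [row[j] ** m for row in memberships]
            new_centres.append(sum(w * x for w, x in zip(weights, data)) / sum(weights))
        # Stop once the centres move less than the change sensitivity threshold.
        shift = max(abs(a - b) for a, b in zip(new_centres, centres))
        centres = new_centres
        if shift < tol:
            break
    return centres, memberships


samples = [0.8, 1.0, 1.2, 4.9, 5.0, 5.1]       # hypothetical data with two obvious groups
final_centres, final_memberships = fuzzy_c_means(samples, centres=[0.0, 6.0])
```

Each row of the membership matrix sums to one, so a sample can be shared between clusters rather than being assigned to exactly one of them.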
Clustered Differential Pulse Code Modulation (C-DPCM) is another algorithm that 
employs a pre-prediction clustering algorithm to determine classes of homogeneous pixels 
[111]. C-DPCM performs clustering on the whole image spectra, then calculates error 
optimised linear predictors for each cluster utilising collocated pixels from previously 
encoded bands. C-DPCM-APL (Adaptive Predictor Length) is a variation of this 
algorithm. C-DPCM-APL uses a brute force approach to determine the optimum number of 
previously encoded bands to be utilised in the linear predictor calculation [159]. All 
possible combinations are calculated (from 10 to 200 previous bands, in steps of 10); the 
combination which yields the lowest error value is then selected. C-DPCM-APL achieves 
an average compression ratio of 3.47, 2% greater than C-DPCM and the highest 
compression ratio of all the algorithms surveyed, as shown in Figure 2-20.  
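The brute force search over predictor lengths can be sketched as below. Only the search loop reflects the description above; the predictor itself is a deliberately simple stand-in (the mean of the previous bands) rather than the error optimised linear predictor used by C-DPCM-APL, and the function name and synthetic spectrum are hypothetical.

```python
def best_predictor_length(spectrum, target_band, candidates=range(10, 201, 10)):
    """Try each candidate number of previous bands and keep the one with the lowest error."""
    best_n, best_err = None, float("inf")
    for n in candidates:
        if n > target_band:
            break                                   # not enough previously encoded bands
        previous = spectrum[target_band - n:target_band]
        prediction = sum(previous) / n              # stand-in predictor (assumption, see lead-in)
        err = abs(spectrum[target_band] - prediction)
        if err < best_err:                          # strict comparison keeps the shortest length on ties
            best_n, best_err = n, err
    return best_n, best_err


# Hypothetical pixel spectrum: 200 bands at level 0 followed by 31 bands at level 100.
spectrum = [0.0] * 200 + [100.0] * 31
length, error = best_predictor_length(spectrum, target_band=230)
```

The cost of this approach grows with the number of candidate lengths, since a separate predictor is evaluated for each one.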
Another pre-prediction block implemented to improve algorithm compression ratio is 
edge detection. The Edge-based Prediction for Hyperspectral Imagery (EPHI) algorithm 
applies edge-based prediction principles within a multidimensional algorithm framework 
to achieve high compression ratios [160]. In EPHI, edge threshold values are assigned for 
each spectral band and to reduce the overhead threshold values are predicted from the 
preceding band. The algorithm features an additional modified prediction mode for edge 
identified pixels, so that an edge pixel can only be predicted from other connected edge 
classified pixels. EPHI achieves an average compression ratio of 3.27, 6% less than C-
DPCM-APL but 6% more than IB-CALIC. 
 
A.4 Distributed Source Coding Based Algorithms 
Another major new technique introduced in multidimensional compression is the 
Distributed Source Coding (DSC) approach. DSC is an information coding concept in 
which independent encoders can be used to encode multiple correlated sources at a re-
duced entropy. By utilising independent encoders, it has been found to be possible to 
encode a first source at its entropy level and perform conditional encoding of a second 
source at a rate lower than its entropy. DSC coders are typically implemented using 
channel coding techniques; H. Chang et al provide a comprehensive introduction and 
overview of DSC and the techniques used [161].  
Several DSC based algorithms were proposed in 2007 specifically to evaluate the po-
tential of applying the DSC paradigm to the real-world application of hyperspectral image 
compression. Two different principles of DSC were explored by E. Magli et al to assess 
the potential advantages. DSC techniques are most commonly based on channel codes 
such as Low-Density Parity Check (LDPC) codes and turbo codes, as they achieve per-
formance reasonably close to the conditional entropy. However, they are generally 
employed on large blocks of data, thus making them computationally and memory de-
manding, potentially outweighing any compression performance benefits. In addition to 
channel code-based techniques scalar codes can also be used in DSC; these operate sam-
ple-by-sample, making them considerably easier to design. E. Magli et al proposed three 
DSC based compression algorithms to demonstrate the approach: DSC-CALIC, an 
adaptation of the original CALIC algorithm; s-DSC, a scalar code-based algorithm; and 
v-DSC, a vector extension of s-DSC [162]. The conclusion drawn from their investigation was that 
a considerable amount of further research is needed before DSC image compression is 
expected to achieve competitive results with the more mature predictive coding 
techniques. This is backed up by the compression ratio results, which show the DSC-CALIC 
algorithm achieves a compression ratio 19% greater than the original CALIC algorithm 
but 22% less than the early multidimensional predictive version IB-CALIC. 
In 2010, A. Abrardo et al proposed a further three DSC based algorithms [163]. This 
work concentrated on improving the performance of DSC compression whilst also 
improving its error resilience. The first of the three algorithms proposed, named A1, 
focuses purely on coding efficiency. The proposed A3 algorithm focuses on error 
resilience, whilst the A2 algorithm aims to provide a trade-off between these two areas. All the 
algorithms proposed were based on the same CRC (Cyclic Redundancy Check) based 
three-dimensional linear prediction DSC coding scheme. However, the performance of 
the DSC based algorithms remains relatively low when compared to state-of-the-art 
predictive algorithms. The best performing DSC algorithm to date, A1, has an average 
compression performance of 2.75, 10% less than the early IB-CALIC algorithm and 20% 
less than the state-of-the-art algorithm C-DPCM-APL. 
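The relative figures quoted for A1 can be checked directly from the average compression ratios listed in Appendix C. The short sketch below does this; the helper name is hypothetical and the ratios are copied from the Appendix C table.

```python
def percent_difference(ratio: float, baseline: float) -> float:
    """Relative difference of a compression ratio against a baseline, in percent."""
    return 100.0 * (ratio - baseline) / baseline


# Average compression ratios as listed in Appendix C.
ratios = {"IB-CALIC": 3.08, "A1": 2.75, "C-DPCM-APL": 3.47}

a1_vs_ib_calic = percent_difference(ratios["A1"], ratios["IB-CALIC"])       # about -10.7
a1_vs_c_dpcm_apl = percent_difference(ratios["A1"], ratios["C-DPCM-APL"])   # about -20.7
```

Both values are consistent with the approximately 10% and 20% reductions stated above once the fractional part is dropped.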
 
APPENDIX B - TRADITIONAL IMAGE COMPRESSION ALGORITHMS 
Algorithm Name & Reference   Theoretical Basis   Year Published   Average Compression Ratio 
Sunset [90] Predictive 1991 2.00 
Lossless JPEG [93] Predictive 1992 1.87 
FELICS [164] Predictive 1993 1.84 
CREW [87] DWT 1995 1.77 
UCM [165] Predictive 1995 2.07 
SPIHT [166] DWT 1996 1.88 
PNG [167] Entropy Encoding 1996 1.87 
CALIC [94] Predictive 1996 2.18 
LOCO-I [95] Predictive 1996 2.10 
LOCO-A [98] Predictive 1996 2.21 
HBB [168] Predictive 1997 2.14 
CCSDS-121 [42] Entropy Encoding 1997 1.89 
TMW [169] Predictive 1997 2.19 
JPEG-LS [40] Predictive 1999 2.08 
APC [170] Predictive 1999 2.17 
EDP [171] Predictive 1999 1.84 
JPEG2000 [41] DWT 2000 1.88 
ALPC [172] Predictive 2000 2.05 
MRP [173] Predictive 2000 2.10 
ALCA [174] DWT 2001 2.00 
MINT-UCA [175] DWT 2001 1.95 
EZBC [176] DWT 2001 1.85 
FBS [177] Predictive 2003 2.13 
APC-WLS [178] Predictive 2003 2.26 
MALCM [179] Predictive 2004 1.9 
PDF [180] DWT 2005 1.85 
RALP [181] Predictive 2005 2.08 
VBS [182] Predictive 2005 2.15 
APT [183] Predictive 2006 1.75 
PPBWC [184] DWT 2007 1.85 
TDKZW [185] DWT 2009 2.04 
JPEG-HD [186] DCT 2009 1.75 
BPNN [187] Predictive 2011 2.01 
TS-FNN [188] Predictive 2012 2.12 
FLIC [189] Predictive 2012 1.86 
APC-MAP [190] Predictive 2013 2.14 
APPENDIX C - MULTIDIMENSIONAL IMAGE COMPRESSION ALGORITHMS 
Algorithm Name & Reference   Theoretical Basis   Year Published   Average Compression Ratio 
USES* [22] Predictive 1993 2.60 
IB-CALIC* [103] Predictive 1998 3.08 
ACAP* [156] Predictive 1999 3.36 
D-JPEG-LS [102] Predictive 1999 2.85 
D-JPEG2000 [102] DWT 2000 2.84 
ASAP* [157] Predictive 2001 3.36 
C-DPCM [111] Predictive 2003 3.39 
LPVQ* [191] VQ 2003 3.00 
M-CALIC [192] Predictive 2004 3.28 
SLSQ* [104] Predictive 2004 3.11 
SLSQ-OPT* [104] Predictive 2004 3.19 
SLSQ-HEU* [104] Predictive 2004 3.18 
BH [193] Predictive 2005 3.05 
BG [193] Predictive 2005 2.99 
CCAP* [194] Predictive 2005 2.94 
FL [105] Predictive 2005 3.24 
LUT* [107] LUT 2006 3.31 
LUT [107] LUT 2006 2.70 
LAIS-LUT* [108] LUT 2006 3.42 
LAIS-LUT [108] LUT 2006 2.86 
NPHI* [195] Predictive 2007 3.24 
EPHI* [195] Predictive 2007 3.27 
s-DSC* [161] DSC 2007 2.59 
v-DSC* [161] DSC 2007 2.65 
DSC-CALIC* [161] DSC 2007 2.59 
S-FMP [196] Predictive 2007 3.13 
S-RLP [196] Predictive 2007 3.11 
ABPCNEF* [197] Predictive 2007 3.07 
LAIS-QLUT* [109] LUT 2008 3.59 
A1 [198] DSC 2010 2.75 
A2 [198] DSC 2010 2.70 
A3 [198] DSC 2010 2.61 
C-DPCM-APL [159] Predictive 2012 3.47 
CCSDS-123 [106] Predictive 2012 3.24 
 
* Denotes compression performance data was only available for calibrated images 
 
APPENDIX D - EXAMPLE ARCHITECTURE PART SELECTION  
There are many different candidate hardware platforms for the implementation of the 
newly proposed heterogeneous onboard data processing architecture shown in Figure 3-3. 
A preliminary study and example of potential hardware selection decisions has been con-
ducted. The resulting configuration and initial proposed realisation of the new onboard 
payload data processing architecture is shown in Figure D-1. This system is constructed 
of four distinct types of unit: the backplane unit, the data interfacing and handling unit 
(DIHU), the mass memory units (MMUs) and the payload data processing units (PDPUs).  
 
Figure D-1 Example system design  
 
 
D.1 Mass Memory Units (MMU) 
The combination of a number of MMUs will be leveraged in the system to provide 
the level of data storage volume required for a particular mission profile. Each unit will 
provide a medium amount of storage, in the order of several GB, to allow for easy 
scalability and error mitigation. Specifically, NAND Flash technology will be leveraged for 
implementing the mass payload data storage as it is the most suitable non-volatile, high 
density and inherently error resilient memory architecture available.  
 
 
[Figure D-1 block labels: Open VPX Backplane with BP connectors; three Mass Memory Units (NAND Flash); two Payload Data Processing Units (NVIDIA Jetson TX2i); Data Interfacing & Handling Unit (Xilinx ZYNQ FPGA)] 
D.2 Open VPX Backplane  
The proposed system design given in Figure D-1 is centred around an Open VPX 
backplane. The Open VPX standard covers both the physical and electrical specifications 
for a high-speed passive serial backplane. It has been widely adopted in embedded 
computing architectures in aerospace and military applications. Physically, two sizes 
of connector are available, 3U and 6U, for high system flexibility. In the configuration 
shown in Figure D-1, this has been leveraged to provide different IO configurations for the 
different units. Specifically, it features three 3U connectors for three MMUs, one 6U 
connector for the DIHU and two 6U connectors for two PDPUs. Utilising the Open VPX 
standard allows for full customisation of the number and size of physical connectors and 
also the electrical protocols deployed. Electrically, the Open VPX standard accommodates 
power supply, differential paired high-speed serial buses of 10 Gbps and above, and 
single ended point-to-point connections. The high-speed serial buses can be used to 
implement a number of state-of-the-art protocols, including PCIe and Gigabit Ethernet, 
for internal system communications and payload data transfer. 
The backplane unit also features a high-density micro D-type connector and coaxial 
SMA connectors for the purpose of external connections to the satellite platform, 
downlink and payload. Specifically, the coaxial interface allows the use of the CoaXPress 
protocol, which is a next generation vision interface designed for high speed and 
scalability. CoaXPress offers 6.25 Gbps per cable, and the protocol can be scaled across 
any number of parallel cables. Another key feature of CoaXPress is the ability to provide 
power to the payload directly from the system over the same coaxial cables used for data 
transfer. The protocol is also well supported, with a number of IP cores available for 
various FPGA vendors. High-density micro D-type connectors could be used for any 
additional external interfacing, such as to the platform and downlink systems. The 
physical connectors and electrical protocols for the system backplane have been selected 
with flexibility and scalability specifically in mind. Changes to the specific configuration 
of the system could therefore be addressed simply by modifying the backplane unit, 
changing the number of each interface, without requiring any fundamental changes to 
any of the other system units.  
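As a rough sizing aid, the number of parallel CoaXPress cables needed for a given payload data rate follows directly from the 6.25 Gbps per-cable figure quoted above. The sketch below is illustrative only: the function name and example data rates are hypothetical, and protocol and line coding overheads are ignored, so a real link budget would need more margin.

```python
import math

CXP_GBPS_PER_CABLE = 6.25   # per-cable rate quoted in the text


def cables_required(payload_gbps: float, per_cable_gbps: float = CXP_GBPS_PER_CABLE) -> int:
    """Minimum number of parallel cables for a payload data rate, ignoring overheads."""
    if payload_gbps <= 0:
        raise ValueError("payload data rate must be positive")
    return math.ceil(payload_gbps / per_cable_gbps)


# Hypothetical payload producing 20 Gbps of raw data.
cables_for_20_gbps = cables_required(20.0)
```

Under this simplification, a change in payload data rate only alters the number of coaxial interfaces on the backplane unit, which matches the scalability argument made above.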
 
 
D.3 Data Interfacing & Handling Unit (DIHU) 
In Figure D-1, the DIHU is considered the main functional unit of the system. It pro-
vides key functionalities such as interfacing with the payload, platform and downlink 
systems and overall control of the system, including activity and health monitoring. A 
Xilinx ZYNQ FPGA has been selected to provide the underlying capabilities required to 
implement these functionalities and achieve the system behavioural design laid out in 
Figure 3-1. The Xilinx ZYNQ is a family of FPGA based SoC and multi-processor SoC, 
(MPSoC) devices. The Xilinx ZYNQ product range covers low-cost to high-end markets, 
providing different processor and additional application specific embedded hardware con-
figurations.  
This wide range of capabilities and resources, provided by this single device family, 
is a key advantage providing an additional level of flexibility as it is often relatively easy 
to replace FPGA devices with others from the same manufacturer and product range in 
terms of both hardware design and firmware and software resources. Xilinx also claims 
that the ZYNQ devices offer industry-leading resilience to SEUs. Specifically, the 
UltraScale+ branded devices have been shown to exhibit inherent radiation tolerance, 
with up to 3X lower SEU failure-in-time (FIT) per Mb and 2X faster detection and 
correction of soft errors than prior-generation Xilinx devices. This is attributed in part to 
the 16 nm FinFET process used. This process is similar to the process used in commercial 
parts and therefore potentially reduces the cost of implementing additional soft-error 
mitigation solutions [199].  
CPUs can be categorised on a number of factors; in this application, two different 
types of CPU are compared based on their Instruction Set Architecture (ISA). Typically, 
Reduced Instruction Set Computer (RISC) CPUs are better suited for onboard 
applications, because RISC device operations are determined by a small, highly optimised 
set of instructions. The key characteristic of a RISC CPU architecture is that instructions 
are designed to be performed uniformly, typically in a single clock cycle, thus enabling a 
deterministic and low latency execution model. In recent years, RISC CPUs have become 
integral to the embedded and mobile computing markets. This is because the hardware 
architectures of these devices are heavily optimised for the RISC specific instructions, 
typically using far fewer transistors and achieving greater power efficiency than the 
alternatives. Although the CPU in the architectural diagram, Figure 3-3, is pictured as a 
separate entity, an embedded CPU presents a solution better suited to onboard use. Today, 
embedded CPUs can be commonly found in both FPGA and GPU devices, providing the 
desired computing architecture with potentially minimal impact to area, mass or power 
consumption. 
 
D.4 Payload Data Processing Units (PDPU) 
The PDPU is currently the unit with the lowest Technology Readiness Level (TRL) in 
the system. This is because no space proven low power GPU devices, around which the 
unit is based, currently exist. Additionally, the heterogeneous architecture proposed is an 
FPGA-GPU system, which is not a traditional terrestrial device combination. 
GPU and FPGA hardware are often discussed and compared as hardware accelerators in a 
CPU orientated heterogeneous computing system. However, interest in GPU-FPGA 
heterogeneous computing is growing. A major bottleneck in all heterogeneous computing 
systems resides with the data throughput of communication lines. Y. Thoma et al propose 
a novel framework for low latency, high data rate (up to 150MB/s and 200MB/s) direct 
communication between GPU and FPGA platforms, to aid developments in GPU-FPGA 
heterogeneous computing systems. The framework is shown to provide a communication 
speedup over traditional communication via a CPU. The PCIe (Peripheral Component 
Interconnect Express) interface is identified as a common interface between GPU and 
FPGA devices and is utilised to provide direct memory access (DMA) transfers between 
devices.  
Whilst no space proven low power GPU platform currently exists, there are several 
different terrestrial devices which could be initially selected. Towards a first flight solu-
tion an appropriate mitigation strategy would likely need to be developed to ensure high 
probability of success. This will likely need to include ground-based testing to help de-
termine prior to launch the behavioural characteristics of the design in the specific space 
environment. Towards the initial selection of a suitable device there are many comparison 
criteria to consider.  
In the last couple of years, a wider range of manufacturers have been releasing new 
embedded and low power GPU platforms which could be utilised in space applications. 
There are many inherent differences between each of the GPU platforms, but the one 
characteristic they all have in common is the fact they are all implemented as a SoC 
device featuring at least one other CPU processor and, in some cases, other dedicated DPU 
hardware.  
For the PDPU, the Jetson TX2i currently represents the best trade-off between com-
putational performance, power consumption, ease of usability and suitability for the space 
environment. At the start of this research, a similar analysis of current low power and 
embedded GPU devices was made to allow for the purchase of a platform toward practi-
cal experimental work. At the time the Jetson TX2 was not yet available and the most 
suitable platform was the Jetson TX1. Therefore, all practical experiments performed in 
the remainder of this research were performed using the Jetson TX1 platform. Due to the 
scalability and backwards compatibility provided by the NVIDIA hardware and software 
API, all developments in the work towards the Jetson TX1 platform will also be applica-
ble to the Jetson TX2. In addition, NVIDIA has recently released a modified version of 
the Jetson TX2, called the Jetson TX2i, which has been specifically modified on a hard-
ware level for industrial applications. This includes ECC supported memory, longer 
operating life, wider operating temperature range, higher vibrational tolerance, extended 
warranties and greater guaranteed sales lifecycle. The launch of the TX2i within a year of 
the launch of the original TX2 demonstrates the willingness of the manufacturer, 
NVIDIA, to both listen to its customer base and meet user demands. 
  
D.5 GPU Part Assessment 
The assessment criteria given in Table D-1 have been used to perform an initial 
assessment of the latest embedded GPU platforms available today. The results of this 
assessment are then given in Table D-2. 
Table D-1 GPU platform assessment criteria 
# | Assessment Criteria | Description 
1 Hardware Characteristics 
1 | Quoted GPU Performance | Computational performance for the GPU hardware, as quoted or provided by the platform manufacturer, often measured in FLoating Point Operations per Second (FLOPS) 
2 | Quoted Power Consumption | Typical power consumption as quoted and provided by the platform manufacturer, often provided as Thermal Design Power (TDP) 
3 | Computational Configuration | Internal device computational architecture; such as number of cores, operational frequency and generation 
4 | Memory Configuration | Internal device memory architecture; such as number, level and types of available caches or amount of on-chip memory 
5 | Interfaces | Incorporated device interfaces 
2 Environmental Characteristics 
1 | Form Factor | Physical size, mass, volume and fabrication process characteristics 
2 | Operating Conditions | Manufacturer quoted device operating conditions; such as temperature, vibration and operating life 
3 | Product Lifecycle | Time the product is available for purchase from the manufacturer 
5 | Hardware Level Radiation Mitigation | Any manufacturer introduced error mitigation techniques 
3 Software Characteristics 
1 | Compatible Software Models | Compatible programming models and languages 
2 | Software Tools | Manufacturer provided or open-source software development tools 
3 | Software Support | Manufacturer or community provided support mechanisms 
 
  
Table D-2 Preliminary GPU hardware platform assessment 
Hardware Platforms & Release Year 
# | NVIDIA Jetson TX1 (2015) | NVIDIA Jetson TX2 (i) (2017) (2018) | AMD Embedded G-series J (2016) | AMD Embedded G-series LX (2016) | Qualcomm Snapdragon 845 (2017) | Samsung Exynos 9 Octa 8895 (2017) 
1.1 | 365 GFLOPS | 1.024 TFLOPS | 345 GFLOPS | 17 GFLOPS | ~ 500 GFLOPS | 375 GFLOPS 
1.2 | 10 W | 7.5 W, (i): 10 W | 15 W | 15 W | 5 ~ 10 W | 5 ~ 10 W 
1.3 | CPU: Cortex-A57; GPU: 256 Core Maxwell | CPU: Cortex-A57 + NVIDIA Denver 2; GPU: 256 Core Pascal | CPU: 2x Puma x86; GPU: 192 Core Radeon R4E | CPU: 2x Excavator x86; GPU: 64 Core Radeon R1E | CPU: Kryo (ARMv8); GPU: 256 Core Adreno 630 | CPU: Exynos M2, Cortex A53; GPU: ARM Mali-G71 MP20 
1.4 | Embedded 4GB 64-bit LPDDR4 | Embedded (Industrial) 8GB 128-bit LPDDR4 | Supports 64-bit DDR3/4 | Supports 64-bit LPDDR4 | Supports 64-bit LPDDR4X 
1.5 | PCIe, USB, Ethernet, CSI-2, UART, I2C, GPIO | PCIe, USB, Ethernet, CSI-2, UART, I2C, GPIO, CAN | PCIe, USB, SATA | - | USB, Ethernet 
2.1 | SoC & Module; 50mm x 87mm | SoC BGA | SoC BGA | SoC BGA 
2.1 (process) | 28 nm | 16 nm | 28 nm | 10 nm | 10 nm 
2.2 | -25-80°C, 5 years' life; (i): -40-85°C, 10 years' life | 0-90°C | - | - 
2.3 | 2021 | 2022, (i): 2028 | 2028 | - | - 
2.5 | None except TX2i; TX2i: Industrial grade components, protected power, ECC DRAM | Supports DDR with ECC | None | None | None 
3.1 | CUDA, OpenGL, OpenGL ES, Caffe, TensorRT, OpenCV | OpenGL, OpenCL | OpenGL ES, OpenCL | OpenGL ES, OpenCL 
3.2 | NVIDIA Nsight Tools & open-source tools | AMD SDK | Qualcomm Tools | Open-source tools 
3.3 | NVIDIA support & community support | AMD support, limited community support | Minimal support | Community support 
 
From the hardware characteristics detailed in Table D-2, the energy efficiency of 
each device in terms of peak GFLOPS/Watt is extracted and the resulting comparison for 
these devices is given in Figure D-2.  
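The extraction of peak GFLOPS/Watt from the Table D-2 entries is a simple division, sketched below. The peak performance and TDP values are copied from Table D-2; the dictionary keys are shortened labels chosen for this sketch, the "~ 500 GFLOPS" entry is taken as 500, and the "5 ~ 10 W" entries are evaluated at both 10 W and 5 W. The resulting numbers have not been checked against the bars plotted in Figure D-2.

```python
# (peak GFLOPS, TDP values in watts) taken from rows 1.1 and 1.2 of Table D-2.
PLATFORMS = {
    "Jetson TX1": (365.0, [10.0]),
    "Jetson TX2": (1024.0, [7.5]),
    "Jetson TX2i": (1024.0, [10.0]),
    "AMD G-series J": (345.0, [15.0]),
    "AMD G-series LX": (17.0, [15.0]),
    "Snapdragon 845": (500.0, [10.0, 5.0]),        # worst case and best case TDP
    "Exynos 9 Octa 8895": (375.0, [10.0, 5.0]),    # worst case and best case TDP
}


def gflops_per_watt(peak_gflops: float, tdp_watts: float) -> float:
    """Peak computational energy efficiency for one TDP operating point."""
    return peak_gflops / tdp_watts


efficiency = {
    name: [gflops_per_watt(peak, tdp) for tdp in tdps]
    for name, (peak, tdps) in PLATFORMS.items()
}

for name, values in efficiency.items():
    print(name, ", ".join(f"{v:.1f}" for v in values))
```

Because both inputs are manufacturer quoted peak figures, the result indicates an upper bound on efficiency rather than a measured value.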
 
 
(1: GFLOPS/Watt based on worst case 10 W TDP; 2: GFLOPS/Watt based on best case 5 W TDP) 
Figure D-2 GFLOPS/Watt TDP computational power efficiency comparison 
 
A key insight this provides is the apparent relatively low energy efficiency of the 
AMD embedded G-series devices. These devices feature CISC x86 CPUs; whilst CISC 
devices can be leveraged for high absolute computational performance, they often achieve 
low computational energy efficiency. In space applications, RISC CPUs are also often 
preferred, due to the greater level of deterministic processing. The Qualcomm Snapdragon 
and Samsung Exynos platforms are most commonly utilised in the smartphone industry, 
which is reflected in the lack of direct support for many external interfaces. This, in 
combination with the relative lack of open information on the devices and little 
manufacturer or community support, is the reason that they have not been widely adopted 
for general purpose computing. In comparison, the NVIDIA Jetson devices are well 
utilised in a wide range of applications, from commercial mobile devices such as tablets, 
games consoles and advanced driver assistance systems (ADAS) to bespoke robotics, 
machine learning and educational applications. This is due to the combination of the 
availability of specific platform developer kits, relatively large official and community 
support, mature development tools, wide ranging programming API compatibility and 
informative educational material. These features make the NVIDIA Jetson platform a 
very good candidate for deployment in an initial conception of a GPU accelerated 
onboard data processing architecture. 
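[Figure D-2 plot: bar chart of GFLOPS/Watt (axis 0 to 110) with bars labelled TX1 and TX2 (i) 
(NVIDIA Jetson), J and LX (AMD Embedded G-Series), Qualcomm Snapdragon 845 and Samsung 
Exynos 9 Octa 8895; bar annotations 1 and 2 refer to the TDP notes given with the caption.] 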
APPENDIX E - CCSDS-123 USER DEFINABLE PARAMETERS  
Parameter            Description 
Prediction Bands     The number of previously encoded bands which can be used for prediction. 
                     Possible Values: [0 – 15] 
Local Sum Mode       The pixels used for local sum calculations. 
                     Possible Values: [Neighbour or Column] 
Prediction Mode      The local differences used for prediction. 
                     Possible Values: [Full or Reduced] 
Register Size        The size of the register used to store the predicted sample value. 
                     Increasing the register size R reduces the chance of an overflow occurring 
                     in the calculation of a scaled predicted sample value, thereby increasing 
                     prediction accuracy. 
                     Possible Values: [max{32, (Dynamic Range + Weight Resolution + 2)} ≤ R ≤ 64] 
Weight Resolution    Each weight value is a signed integer quantity that can be represented 
                     using the weight resolution value + 3 bits; thus the weight resolution 
                     determines the possible minimum and maximum weight values. 
                     Increased weight resolution increases the accuracy of the prediction 
                     calculation. 
                     Possible Values: [4 ≤ W ≤ 19] 
Weight Interval      Contributes to the control of the rate at which weights adapt to image data 
                     statistics, determining the interval at which the weight update variable is 
                     incremented. 
                     Smaller values produce larger weight increments, yielding faster adaptation 
                     to source statistics but worse steady-state compression performance. 
                     Possible Values: [2^4, 2^5, …, 2^11] 
Weight Initial       Contributes to the control of the rate at which weights adapt to image data 
                     statistics, determining the initial value of the weight update scaling 
                     variable. 
                     Possible Values: [-6 – Weight Final] 
Weight Final         Contributes to the control of the rate at which weights adapt to image data 
                     statistics, determining the final value of the weight update scaling 
                     variable. 
                     Possible Values: [Weight Initial – 9] 
Entropy Encoder Type Mapped prediction residuals are encoded using either a sample-adaptive or a 
                     block-adaptive entropy coding approach. As the sample-adaptive method has 
                     been shown to provide much greater compression performance whilst still 
                     meeting low computational resource requirements, only this method is 
                     discussed in this work. 
U Max                The sample-adaptive entropy coding procedure ensures that the variable 
                     length codeword is no longer than Umax + the original sample dynamic range 
                     bit length. 
                     Possible Values: [8 ≤ Umax ≤ 32] 
Y Star (γ*)          This parameter determines the interval at which the statistics counter, 
                     used to determine the length of the binary codeword, is rescaled. 
                     Possible Values: [max{4, (γ0 + 1)} ≤ γ* ≤ 9] 
Y Zero (γ0)          This parameter determines the initial value of the statistics counter. 
                     Possible Values: [1 ≤ γ0 ≤ 8] 
K                    This parameter is also used to determine the initial values of the 
                     statistics counter. 
                     Possible Values: [0 ≤ k ≤ (Dynamic Range – 2)] 
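
The numeric ranges in the table above can be checked mechanically before a compression run. 
The following is a minimal sketch, not the SSC application's own validation code: the function 
name and dictionary keys are hypothetical, the mode selections are left out, and the weight 
interval is assumed to be supplied as its base-2 exponent (so 2^4 to 2^11 becomes 4 to 11, 
which is how the value 6 in Appendix F is interpreted here).

```python
def check_ccsds123_parameters(p: dict, dynamic_range: int) -> list:
    """Return a list of range violations (an empty list means every check passed)."""
    errors = []

    def require(ok: bool, message: str) -> None:
        if not ok:
            errors.append(message)

    require(0 <= p["prediction_bands"] <= 15, "prediction_bands outside [0, 15]")
    require(4 <= p["weight_resolution"] <= 19, "weight_resolution outside [4, 19]")
    # Lower bound on the register size depends on dynamic range and weight resolution.
    r_min = max(32, dynamic_range + p["weight_resolution"] + 2)
    require(r_min <= p["register_size"] <= 64, f"register_size outside [{r_min}, 64]")
    # Assumption: the weight interval is stored as the exponent of a power of two.
    require(4 <= p["weight_interval_log2"] <= 11, "weight_interval_log2 outside [4, 11]")
    require(-6 <= p["weight_initial"] <= p["weight_final"] <= 9,
            "need -6 <= weight_initial <= weight_final <= 9")
    require(8 <= p["u_max"] <= 32, "u_max outside [8, 32]")
    require(1 <= p["y_zero"] <= 8, "y_zero outside [1, 8]")
    require(max(4, p["y_zero"] + 1) <= p["y_star"] <= 9,
            "y_star outside [max(4, y_zero + 1), 9]")
    require(0 <= p["k"] <= dynamic_range - 2, "k outside [0, dynamic_range - 2]")
    return errors


# Values as listed in Appendix F, checked here against 16-bit samples.
defaults = {
    "prediction_bands": 3, "register_size": 64, "weight_resolution": 19,
    "weight_interval_log2": 6, "weight_initial": -1, "weight_final": 3,
    "u_max": 18, "y_star": 6, "y_zero": 2, "k": 2,
}
violations = check_ccsds123_parameters(defaults, dynamic_range=16)
```

Returning a list of messages rather than raising on the first failure lets a caller report 
every out-of-range parameter at once.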
  
APPENDIX F - DEFAULT COMPRESSION PARAMETERS 
 
Parameter Value 
Prediction Bands 3 
Prediction Mode Reduced 
Local Sum Mode Column-Orientated 
Register Size 64 
Weight Resolution 19 
Weight Interval 6 
Weight Initial -1 
Weight Final 3 
U Max 18 
Y Star 6 
Y Zero 2 
K 2 
 
  
APPENDIX G – COMPRESSION TESTING IMAGE THUMBNAILS 
 
aviris_hawaii_f011020t01p03r05_sc01.uncal-u16be-224x512x614-band_128.jpg [144] 
 
 
aviris_maine_f030828t01p00r05_sc10.uncal-u16be-224x512x680-band_128.jpg [144] 
 
aviris_yellowstone_f060925t01p00r12_sc00.cal-s16be-224x512x677-band_128.jpg [144] 
 
 
aviris_yellowstone_f060925t01p00r12_sc03.cal-s16be-224x512x677-band_128.jpg [144] 
 
 
 
Landsat_agriculture-u16be-6x1024x1024-band_1.jpg [144] 
 
 
Landsat_coast-u16be-6x1024x1024-band_1.jpg [144] 
 
Landsat_mountain-u16be-6x1024x1024-band_1.jpg [144] 
 
 
APPENDIX H - CCSDS-123 ALGORITHM PARAMETER TESTING  
In addition to the investigation into the impact of image tiling on the processing 
throughput and compression ratio, an experiment to assess the tuning of the prediction 
band parameter has also been performed. This parameter has previously been shown to be 
the user definable parameter with the largest impact on the algorithm’s compression 
performance. Comparable previously published results describe three regimes in the change 
of compression performance with the number of prediction bands (P): an initial increase in 
compression performance for P = 0 – 2, followed by a plateau in compression ratio for 
P = 3 – 6, and lastly a period of decreasing compression ratio for P = 7 – 15. With regards 
to the execution time ratio, previous implementations have seen an increasing linear trend 
with increasing value of P.  
The results from the new prediction band experiments, for all previously tested AVIRIS 
images, are given in Figure H-1. In this experiment, the tile size which achieved the 
highest modified throughput Weissman score, given in Figure 4-15, has been used for the 
testing of each image respectively, and the number of prediction bands has been varied 
from 0 to 15, the maximum allowed by the algorithm. These results firstly show that each 
AVIRIS image exhibits the same overall trends for both compression ratio and processing 
time. The individual graphs for the change in compression ratio and processing time allow 
a comparison with previously published literature to assess any changes in trends for the 
new GPU accelerated implementation. As the compression ratio, the inverse of the compressed 
rate, is wholly determined by the algorithm itself and the values of its other parameters, 
these trends closely follow those seen previously in literature, as expected. The trends in 
processing time for the application, however, do not closely follow the strictly linear 
trends seen in literature. Whilst the trend in processing time is typically linear between 
0 and 9 prediction bands, there is a sharp increase in processing time between 9 and 10 
prediction bands, and a P value of between 10 and 15 results in a significant drop in 
processing throughput performance. An increase in the value of P directly increases the 
number of instructions required for the construction of the local difference and weights 
vectors and for the dot product of these two vectors, as well as the amount of shared 
memory required for the storage of these vectors. It is postulated that the significant 
increase in processing time for the AVIRIS data set for a value of P greater than 9 is due 
to the GPU being unable to effectively leverage the available parallelism to mitigate the 
increase in shared memory accesses and executed instructions. 
These results show that it would not be advantageous toward either compression ratio or 
processing throughput performance to utilise a P value greater than 9.  
 
Figure H-1 AVIRIS images prediction bands, compression ratio & execution time  
 
In order to assess the optimum P value for a balanced trade-off between compression 
ratio and throughput performance, the modified throughput Weissman scores have been 
calculated for each test case and are shown in Figure H-2, using the AVIRIS Hawaii data 
given in Table 4-12 as the reference. The throughput Weissman score results in Figure H-2 
exhibit an initial sharp increase in overall performance, with the peak throughput 
Weissman score occurring at a P value of between 2 and 4 for the tested images. Beyond 
these values there are diminishing returns in performance for increased values of P, and 
past a value of 10 the performance decreases. Interestingly, the P values of the peak 
throughput Weissman scores for each image correlate with neither the P values for peak 
compression ratio nor those for peak processing throughput, and therefore provide an 
additional insight into what is mathematically the optimum balanced trade-off between the 
two application characteristics.  
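
Selecting the balanced operating point described above amounts to taking the P value with 
the highest score. The following is a minimal sketch of that selection, assuming the scores 
for one image have already been computed; the function name and the example scores are 
illustrative only and are not measured values from this work.

```python
def best_prediction_bands(scores_by_p: dict) -> tuple:
    """Return (P, score) for the highest score; ties resolve to the smallest P."""
    best_p = min(scores_by_p, key=lambda p: (-scores_by_p[p], p))
    return best_p, scores_by_p[best_p]


# Illustrative scores only, keyed by the number of prediction bands P.
example_scores = {0: 0.80, 1: 1.05, 2: 1.20, 3: 1.24, 4: 1.22, 5: 1.18}
best_p, best_score = best_prediction_bands(example_scores)
```

Resolving ties towards the smaller P reflects a design preference for the configuration that 
requires fewer instructions and less shared memory.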
[Figure H-1 plots: Time (ms) and Compression Ratio against Number of Prediction Bands 
(0 to 15) for the Hawaii, AVIRIS Maine, AVIRIS Yellowstone 00 and AVIRIS Yellowstone 03 
images.] 
[Figure H-2 plot: Weissman Score against Number of Prediction Bands (0 to 15) for the same 
four images; labelled values: AVIRIS Hawaii 1.24, AVIRIS Maine 1.19, AVIRIS Yellowstone 00 
0.77, AVIRIS Yellowstone 03 0.79.] 
 
Figure H-2 AVIRIS throughput Weissman scores for prediction band experiment testing 
 
As conducted with the hyperspectral AVIRIS data set, a study to assess the impact of 
prediction bands on compression ratio and throughput has been performed for the 
multispectral Landsat images; these results are given in Figure H-3. For the 6 band Landsat 
imagery it is possible to have a prediction band value of between 0 and 5. 
 
Figure H-3 Throughput Weissman Score prediction bands testing results 
 
[Figure H-3 plot: Weissman Score against Number of Prediction Bands (0 to 5) for the 
Landsat Agriculture, Landsat Coast and Landsat Mountain images; labelled values: Landsat 
Agriculture 0.52, Landsat Coast 0.67, Landsat Mountain 0.49.] 

A key result shown in Figure H-3 is that all tested Landsat images achieve the highest 
throughput Weissman score for the case of 2 prediction bands. Additionally, it is also 
clear that for the multispectral Landsat imagery the number of prediction bands has a 
reduced impact on the overall compression algorithm performance when compared to the 
trends for the hyperspectral images. This effect has been quantitatively measured by 
assessing the difference between the minimum and maximum throughput Weissman score values 
for each image test case, and is shown in Figure H-4. The minimum throughput Weissman 
score for all tested imagery occurs when 0 prediction bands are used; the difference 
between the minimum and maximum throughput Weissman scores therefore represents the 
positive impact of leveraging spectral redundancy between image bands on the overall 
performance. As expected, due to the higher spectral correlation of hyperspectral imagery, 
these datasets are able to increase performance by a larger degree than multispectral 
imagery, which exhibits lower spectral correlations between co-located pixels.  
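
The minimum to maximum difference used for this comparison can be sketched as follows. 
This is a minimal illustration, assuming one dictionary of scores per image keyed by the 
number of prediction bands; the function name and both sets of scores are hypothetical and 
are not measured values from this work.

```python
def min_to_max_difference(scores_by_p: dict) -> float:
    """Spread between the lowest and highest score over all tested P values."""
    values = list(scores_by_p.values())
    if not values:
        raise ValueError("at least one score is required")
    return max(values) - min(values)


# Illustrative scores only: a strongly band-correlated image and a weakly correlated one.
hyperspectral_like = {0: 0.75, 1: 1.00, 2: 1.25, 3: 1.125}
multispectral_like = {0: 0.50, 1: 0.5625, 2: 0.625, 3: 0.59375}

spread_hyper = min_to_max_difference(hyperspectral_like)
spread_multi = min_to_max_difference(multispectral_like)
```

When the minimum occurs at 0 prediction bands, as reported above, this spread is the gain 
obtained from exploiting spectral redundancy.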
 
 
Figure H-4 Difference in throughput Weissman scores for prediction band tests 
 
The results correlate well with similar results published in literature in terms of 
compression ratio but, as the hardware characteristics differ, the results present new 
trend information specific to a GPU accelerated application. The key difference is that the 
previous CPU implementation shows an increasing linear trend in execution time with 
increased prediction bands value, whereas the SSC GPU application experiences minimal 
increase in execution time for a prediction band value of between 0 and 9, then a sharp 
increase from 9 to 10. However, as there is very little compression ratio benefit for a 
prediction band value greater than 5, the GPU application will not experience a significant 
reduction in processing throughput when optimising the prediction band parameter within 
the recommended operating range of 1 to 5. 
[Figure H-4 plot: MIN to MAX difference in Weissman Score (axis 0 to 0.5) for the Landsat 
images (Agriculture, Coast, Mountain) and the AVIRIS images (Hawaii, Maine, Yellowstone 00, 
Yellowstone 03).] 
APPENDIX I – LOW POWER GPU POWER DRAW ANALYSIS 
In addition to investigating the achievable processing throughput for the SSC 
CCSDS-123 compression application on the Jetson TX1 platform, the onboard power 
monitors have been utilised to experimentally measure the instantaneous power consumption 
of the whole Jetson TX1 module and also of the GPU specifically. Rather than relying upon 
the device manufacturer’s generic and approximate TDP statistics, it is therefore possible 
to assess the actual device power draw under a specific application, and also if and how 
the power draw changes with different application configurations or different input data. 
The following subsections provide the measured results for the instantaneous power draw 
every nanosecond for each experimental test detailed in Chapter 4 and Chapter 5. The 
nominal condition, where there is no GPU load, is also measured and provided as a 
reference.  
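
The average and maximum values reported in the following subsections are summaries of a 
trace of instantaneous readings. The following is a minimal sketch of that reduction, 
assuming the readings have already been collected into a list and are expressed in 
milliwatts; the function name, the unit and the sample values are assumptions made for 
illustration, not details of the measurement set-up used in this work.

```python
from statistics import mean


def summarise_power(samples_mw: list) -> tuple:
    """Return (average, maximum) power in watts from samples given in milliwatts."""
    if not samples_mw:
        raise ValueError("at least one sample is required")
    return mean(samples_mw) / 1000.0, max(samples_mw) / 1000.0


# Hypothetical trace of module level readings in milliwatts (illustrative only).
trace_mw = [5600, 5700, 5800, 6100]
average_w, maximum_w = summarise_power(trace_mw)
```

Subtracting the same summary for the nominal, no GPU load condition would isolate the 
power attributable to the application.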
 
I.1 Tiled Hyperspectral and Multispectral Test Cases 
Figures I-1 and I-2 plot the average and maximum power draw values for the Jetson TX1 
module as a whole and for the GPU part of the device, for all the hyperspectral AVIRIS 
and multispectral Landsat images.  
 
                      
                                                     
 
[Plots a) and b): Module Power Draw (W) and GPU Power Draw (W) against Number of Tiles 
(1 to 512) for the AVIRIS images Hawaii, Maine, Yellowstone 00 and Yellowstone 03, showing 
average and maximum power draw with the Jetson TX1 nominal level as reference.] 
Figure I-1 Jetson TX1 module and GPU power draw for AVIRIS images 
 
 
[Plots a) to f): Module Power Draw (W) and GPU Power Draw (W) against Number of Tiles 
(1 to 16384) for the Landsat Agriculture, Landsat Coast and Landsat Mountain images, 
showing average and maximum power draw for 1, 2, 4, 8 and 16 tiles per block with the 
Jetson TX1 nominal level as reference.] 
Figure I-2 Jetson TX1 module and GPU power draw for Landsat images 
A key observation from the power consumption results given in Figure I-1 and Figure I-2 
is that the trends in average power draw for both the TX1 module and GPU are consistent 
across all tested images and image test cases. The Jetson TX1 module level power 
consumption results, given in graphs a), show that the average power draw for the SSC 
CCSDS-123 application is typically below 6 W for all the conducted tests. Importantly, this 
is at the lower end of the range for the device’s TDP and in a region which is suitable for 
use in an onboard environment. Looking at the GPU level power consumption results given in 
graphs b), the average power draw is typically in the region of 2 – 2.5 W. It can therefore 
be deduced that on average the GPU contributes approximately 40% of the overall module 
power draw. This is an interesting finding, which shows that the non-GPU components of the 
Jetson TX1 contribute slightly more to the overall power draw than the GPU whilst 
compression is taking place.  
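
The approximate 40% figure follows from dividing the GPU level average by the module level 
average. The following is a minimal sketch of that arithmetic, assuming a GPU draw at the 
midpoint of the 2 – 2.5 W range and a module draw of 5.7 W; the 5.7 W value and the function 
name are illustrative assumptions, chosen only to be consistent with the "typically below 
6 W" observation.

```python
def gpu_share(gpu_watts: float, module_watts: float) -> float:
    """Fraction of the module level power draw attributable to the GPU."""
    if module_watts <= 0:
        raise ValueError("module power must be positive")
    return gpu_watts / module_watts


# Illustrative inputs: 2.25 W for the GPU, 5.7 W assumed for the whole module.
share = gpu_share(2.25, 5.7)
non_gpu_share = 1.0 - share
```

With these inputs the GPU share is a little under 40%, leaving the larger part of the 
module draw to the non-GPU components.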
Whilst the application has been designed to only utilise GPU resources, the operating 
system and background tasks, whilst not explicitly used, appear to make a significant 
contribution to the power consumption of the module. Additionally, looking at the maximum 
power draw results at the GPU level, there appears to be no significant relationship 
between the number of tiles or TPB value used and the power draw. However, at the module 
level there does appear to be a relationship between the number of image tiles and the 
maximum power draw. This hints that there could be a greater demand on CPU based 
instructions for the initialisation and set-up of input parameters to the GPU application. 
Whilst there does appear to be variation in maximum power draw with the changing number of 
tiles and TPB value used, the overall peak power draw of the application at the module 
level is typically below 13 W.  
 
I.2 TMR Protected Application Power Draw Analysis 
The power consumption results for the TMR protected SSC CCSDS-123 GPU application 
have also been gathered and are given in Figure I-3 and Figure I-4. Firstly, assessing 
the results for the K-TMR protected GPU application in Figure I-3, the average power draw 
of the Jetson TX1 module for all images appears to be slightly increased, by approximately 
0.5 W, compared to the non-TMR results. However, the GPU specific power draw appears to be 
unchanged, with an overall average power draw of approximately 2.25 W. The module level 
increase is likely caused by the extra CPU load required to set up CUDA streams.  
 
 
[Plots a) to f): Module Power Draw (W) and GPU Power Draw (W) against Number of Tiles 
(1 to 16384) for the Landsat Agriculture, Landsat Coast and Landsat Mountain images under 
K-TMR protection, showing average and maximum power draw for 1, 2, 4, 8 and 16 tiles per 
block with the Jetson TX1 nominal level as reference.] 
Figure I-3 K-TMR SSC CCSDS-123 Jetson TX1 power draw results for Landsat images 
 
 
[Plots a) to f): Module Power Draw (W) and GPU Power Draw (W) against Number of Tiles 
(1 to 16384) for the Landsat Agriculture, Landsat Coast and Landsat Mountain images under 
T-TMR protection, showing average and maximum power draw for 1, 2, 4, 8 and 16 tiles per 
block with the Jetson TX1 nominal level as reference.] 
Figure I-4 SSC T-TMR CCSDS-123 Jetson TX1 power draw results for Landsat images  
Figure I-4 provides the power draw results for the compression of tiled Landsat test 
images under the T-TMR protected application. Comparing these results to those for the 
application without TMR protection, the average and maximum power draw for both the Jetson 
TX1 module and GPU are extremely similar, with no significant differences either overall or 
in the trends with tiling parameters. This shows that for the T-TMR approach, in terms of 
instantaneous power draw, there are no significant trade-offs to consider.  
Overall, these results show that there are minimal changes to the instantaneous power 
draw of the application when either the K-TMR or T-TMR SEU protection approach is enabled. 
However, when TMR protection is enabled the processing throughput achieved can be reduced, 
depending on the tiling configuration used, and the application run-time may increase; the 
overall energy consumed by the application will therefore also increase compared to running 
without TMR enabled. As these characteristics depend upon the tiling configuration used, 
this highlights the importance of selecting appropriate compression parameters in line with 
the system level requirements and constraints.  
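
The distinction between instantaneous power draw and the total consumed over a run can be 
made concrete with a short calculation: at a near constant average power, the energy used 
scales with the run-time. The following is a minimal sketch under that assumption; the 
function name and all of the power and run-time values are hypothetical and are not 
measurements from this work.

```python
def energy_joules(average_power_w: float, run_time_s: float) -> float:
    """Energy consumed over a run, approximated as average power multiplied by time."""
    if average_power_w < 0 or run_time_s < 0:
        raise ValueError("power and run-time must be non-negative")
    return average_power_w * run_time_s


# Hypothetical runs (illustrative only): similar power draw, longer run-time with TMR.
unprotected = energy_joules(5.7, 0.20)
tmr_protected = energy_joules(5.8, 0.30)
extra_energy = tmr_protected - unprotected
```

Even with an almost unchanged power draw, the longer protected run consumes more energy, 
which is why the tiling configuration matters at system level.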
 
