GPU Accelerated Multispectral EO Imagery Optimised CCSDS-123 Lossless Compression Implementation by Davidson, Rebecca & Bridges, Christopher
 978-1-5090-1613-6/17//$31.00 ©2017 IEEE 
 1
GPU Accelerated Multispectral EO Imagery Optimised 
CCSDS-123 Lossless Compression Implementation  
R. L. Davidson, C. P. Bridges
Surrey Space Centre 
University of Surrey 
Guildford, Surrey, GU2 7XH, UK  
rebecca.davidson@surrey.ac.uk, c.p.bridges@surrey.ac.uk 
 
 
Abstract— Continual advancements in Earth Observation (EO) 
optical imager payloads has led to a significant increase in the 
volume of multispectral data generated onboard EO satellites. 
As a result, a growing onboard data bottleneck need to be 
alleviated. One technique commonly used is onboard image 
compression. However, the performance of traditional space 
qualified processors, such as radiation hardened FPGAs, are 
not able to meet current nor future onboard data processing 
requirements. Therefore, a new high capability hardware 
architecture is required. In previous work a new GPU 
accelerated scalable heterogeneous hardware architecture for 
onboard data processing was proposed. In this paper, two new 
CUDA GPU implementations of the state-of-the-art lossless 
multidimensional image compression algorithm CCSDS-123, 
are discussed.  
The first implementation is a generic CUDA 
implementation of the CCSDS-123 algorithm whilst the second 
is optimised specifically for multispectral EO imagery. Both 
implementations utilise image tiling to leverage an additional 
axis for algorithm parallelisation to increase processing 
throughput. The CUDA implementation and optimisation 
techniques deployed are discussed in the paper. In addition, 
compression ratio and throughput performance results are 
presented for each implementation. Further experimental 
studies into the relationships between algorithm user definable 
compression parameters, tile sizes, tile dimensions and the 
achieved compression ratio and throughput, were performed.  
 
TABLE OF CONTENTS 
1. INTRODUCTION ....................................................... 1 
2. GPU ACCELERATED ONBOARD DATA 
PROCESSING ARCHITECTURE ................................... 2 
3. PREVIOUS CUDA CCSDS-123 GPU 
IMPLEMENTATIONS .................................................... 3 
4. NEW CUDA CCSDS-123 IMPLEMENTATION ....... 4 
5. NEW MULTISPECTRAL OPTIMISED CUDA 
CCSDS-123 IMPLEMENTATION ................................ 6 
6. EXPERIMENTAL STUDIES ....................................... 7 
7. FUTURE WORK ....................................................... 9 
8. SUMMARY ............................................................. 10 
APPENDICES .............................................................. 10 
A. CCSDS-123 IMPLEMENTATION FEATURES ....... 10 
B. TEST PLATFORM SPECIFICATION ....................... 11 
C. TEST IMAGES & DEFAULT TESTING 
PARAMETERS ............................................................ 11 
REFERENCES............................................................. 11 
BIOGRAPHY .............................................................. 12 
1. INTRODUCTION 
Earth Observation (EO) satellites are experiencing 
continual advancements in optical imager payload 
technologies, leading to significant increases in the 
resolutions of images gathered from spaceborne platforms. 
Due to the absence of equivalent developments in satellite 
downlink technologies, the exponential increases in the 
generated data volume are forming a significant bottleneck 
onboard the platform. Image compression is a data encoding 
technique which can be used to reduce the amount of data 
required to represent the imagery information and is being 
increasingly explored to help effectively increase the data 
delivery throughput of EO satellites.  
Specifically, lossless image compression algorithms are 
predominantly used in the field. This is due to the importance 
of data fidelity to many EO data applications, particularly as 
computer based image processing becomes increasingly 
prominent for information extraction. Currently most 
operational satellites implement standardised algorithms 
such as JPEG-LS or JPEG-2000 where both these algorithms 
typically achieve compression ratios of approximately 2.0 
[1]. However, in order to help alleviate the increasing 
onboard data bottleneck, new state-of-the-art compression 
and processing algorithms will need to be deployed to 
achieve increased compression ratios onboard spaceborne 
platforms. Due to the inherent restrictions induced by the 
space environment; time, memory and computational 
resources are often heavily restricted when compared with 
terrestrial systems. Therefore, for space applications, 
algorithms that minimise computational complexity and 
resource requirements are preferred. 
A literature review of an extensive range of lossless 
image compression algorithms has been conducted in 
previous work [3]. The survey was conducted with the aim of 
identifying suitable algorithms specifically for the 
implementation on spaceborne platforms for multispectral 
EO image compression. In the review, average compression 
ratio values were calculated and used as a quantitative 
comparison metric. The comparison of the lossless image 
compression algorithms surveyed highlighted the significant 
compression advantages that can often be gained from 
utilising multidimensional compression algorithms. 
Multidimensional image compression algorithms are those 
that exploit additional spectral redundancies between image 
bands, as shown in Figure 1.  
  2
 
Figure 1: Lossless image compression classification [1] 
Additionally, in the previous research the complexity of 
a number of algorithms were qualitatively assessed for their 
suitability for implementation in the space environment. The 
CCSDS-123 algorithm was identified as a suitable algorithm 
that provides state-of-the-art compression performance in 
low complexity domain. The CCSDS-123 algorithm was 
standardised by the Consultative Committee for Space Data 
Systems (CCSDS) in 2012 [2] and is based on the Fast 
Lossless (FL) algorithm developed by the NASA Jet 
Propulsion Laboratory (JPL) in 2005 [3]. The algorithm 
employs a predictive based scheme incorporating 
information from a 3D neighbourhood of pixels to exploit 
both spatial and spectral redundancies. It also features a low 
complexity version of the least means squares algorithm and 
local mean subtraction to control the weightings of the 
prediction from each of the preceding spectral bands. As the 
algorithm, has been designed specifically for reduced 
computational complexity it is well suited for onboard 
implementation in a restricted hardware environment. 
Another characteristic of the algorithm that makes it 
attractive for onboard implementation is the possibility for 
some parallelisation for increased processing throughput. 
This is due to the data dependencies of the different 
algorithms stages, as shown in Figure 2. By exploiting the 
inherent parallelism of the algorithm using a suitable parallel 
processing hardware architecture, high data processing 
throughput could be achieved. In an ideal scenario, this would 
be equal to the data rate of the payload thus allowing the 
processing to be conducted in real-time.  
 
Figure 2: CCSDS-123 algorithm & data dependencies  
Traditionally in the space community radiation 
hardened FPGAs are used for onboard data processing. 
However, these processors often lag several generations 
behind their terrestrial counterparts in terms of computational 
performance. With the projected ever-increasing imager 
payload data volumes, computational resources are now a 
growing priority to help alleviate the onboard data bottleneck. 
Traditional space qualified processors are no longer able to 
meet the demands of current nor future onboard data 
processing system requirements [4]. Therefore, as part of 
previous research a new onboard data processing architecture 
has been proposed [1]. The new GPU accelerated scalable 
heterogeneous hardware architecture aims to facilitate the 
implementation of state-of-the-art parallel image processing 
and compression algorithm to help alleviate the growing 
onboard data bottleneck of EO satellite platforms. 
The remainder of this paper is organised as follows. 
Section 2 provides an overview of the previously proposed 
GPU accelerated onboard data processing hardware 
architecture. Section 3 discusses the previous work on GPU 
implementations of the CCSDS-123 lossless image 
compression algorithm. Section 4 provides details on the 
implementation of our new CUDA CCSDS-123 software, 
including performance results. Section 5 discusses the 
optimisation of the new CUDA CCSDS-123 implementation, 
specifically for multispectral EO imagery. This section also 
includes profiling analysis, discussions on optimisation 
techniques and presents the final performance results. Section 
6 provides the results of several new experimental studies 
conducted with our new CUDA implementation. It 
specifically addresses the relationships between input 
parameters such as compression modes, prediction bands and 
tiling dimensions and the achieved compression ratio and 
processing throughput of the software. Sections 7 and 8 
provide an overview of the future work which is under 
development and an overall summary of the findings 
presented in this paper. 
2. GPU ACCELERATED ONBOARD DATA 
PROCESSING ARCHITECTURE 
Traditional space qualified processors are unable to 
meet the growing computational demands for onboard data 
processing. The relative scaling of these traditional system 
architectures, to try to meet the growing requirements will 
likely lead to unacceptable and exponential increases in 
power consumption, area and mass. As a result, a new 
heterogeneous onboard data processing architecture has been 
proposed in [1]. This architecture is also given in Figure 3, it 
has been designed to provide scalable high performance 
onboard data processing to address the growing data 
bottleneck onboard EO satellites. The architecture is 
discussed in detail in the previous publication [1], therefore 
this section will only provide a brief overview of its key 
features for image processing software development.  
The heterogeneous nature of the architecture, which 
features high performance GPU devices, allows for the 
computationally intense image processing and compression 
to be offloaded from the less suitable traditional FPGA 
processing hardware. This facilitates the development of new 
high-throughput parallel processing GPU algorithms, whilst 
also enables the low-level, typically software intense, data 
manipulation and interfacing to remain on FPGA hardware, 
maximising the overall system performance. The backplane 
nature of the architecture permits both scalability of memory 
and processing resource but also graceful degradation with 
minimised single points of failure. It also allows a plug and 
play, IP core, based software model to be deployed, where 
Traditional 
Algorithms
Multidimensional 
Algorithms
Spatial 
Redundancy 
Statistical 
Redundancy 
Spatial 
Redundancy 
Statistical 
Redundancy 
Spectral 
Redundancy 
 
  3
new advances in image compression and processing 
algorithms can easily be incorporated into existing software, 
with minimal impact on hardware design. This also allows 
for greater design re-use across multiple platforms and 
missions. To demonstrate the advantages of the new GPU 
accelerated data processing architecture, a newly parallelised 
implementation of the CCSDS-123 image compression 
algorithm has been developed. 
 
Figure 3: New onboard data processing architecture [1] 
3. PREVIOUS CUDA CCSDS-123 GPU 
IMPLEMENTATIONS 
The technical details of the CCSDS-123 algorithm have 
been covered extensively in several publications in addition 
to the CCSDS standard document and the original FL 
algorithm publication [2,3,5,6,7]. Therefore, the specific 
details of the algorithm will not be covered in this paper, but 
an overview of the algorithm stages and their relative data 
dependences can be found in Figure 2. 
Two recent publications by NASA JPL have discussed 
their work on the development of two GPU implementations 
of CCSDS-123. These works are referred to in this paper as 
fixed platform and mobile platform implementations 
respectively [6,7]. The first proposed, fixed platform, 
implementation deployed a “greedy style” optimisation for 
parallel GPU implementation, whereby the largest amount of 
parallelism present at each stage of the algorithm was 
exploited, and buffers were used between stages to store 
intermediate results. This approach implies that each stage of 
the algorithm would need to be implemented as individual 
kernels. The CUDA GPU memory model means that the 
intermediate results from each stage would need to be stored 
in off-chip, high latency memory. GPUs are typically 
composed of hundreds of computational cores that can 
execute thousands of threads simultaneously for fast parallel 
processing, making it a highly efficient architecture for the 
compute intensive algorithm stages. However, the 
implementations reliance on high latency, off-chip memory 
means that there is little opportunity for data re-use or 
efficient utilisation of low latency, on-chip memory 
structures [6]. 
The proceeding, mobile platform, implementation 
addressed these issues by taking an alternative parallelisation 
approach, whereby the parallelism exploited was restricted to 
that of the least parallel stage of the algorithm. As a result, 
the large off-chip data buffers where no longer required to 
store intermediate results and whilst the data flow was more 
akin to a serial implementation this allowed the low latency 
on-chip memories to be fully exploited [7]. The throughput 
performance achieved by these previous works are given in 
Table 1 for reference.  
Table 1. Previously published implementation 
performance results – GPU used GeForce GTX 560M, 
Test image used Hyperspectral AVIRIS Hawaii  
Implementation Time (ms) Throughput (Mb/s) 
Fixed Platform [6] 2346 372.14 
Mobile Platform 
(1 GPU) [7] 226 3862.97 
Mobile Platform 
(2 GPUs) [7] 204 4279.56 
 
The performance of these implementations were tested 
using a hyperspectral optical test image taken by the AVIRIS 
instrument. Currently there are two classes of multiband 
optical EO imagery data that are gathered from satellites, 
multispectral and hyperspectral. Hyperspectral data sets can 
be considered as an almost continuous representation of 
electromagnetic reflectance, as the data is gathered for a large 
number, in the order of several hundred, narrow spectral 
bands. Multispectral EO imagery on the other hand, provides 
electromagnetic reflectance data in a small number of 
discrete spectral bands, in the order of tens, but often provide 
significantly greater spatial resolution than hyperspectral 
imagers. Whilst hyperspectral data sets pose an interesting 
big data processing problem for the community, they are not 
currently widely used onboard satellites. As such the majority 
of EO data gathered from spaceborne platforms are currently 
multispectral images.  The intrinsic differences between these 
two types of data sets can have significant impacts on both 
the achievable processing throughput and achievable 
compression ratio of an onboard image compression system. 
When compared with hyperspectral imagery data 
multispectral data has inherently reduced compressibility due 
to reduced correlations between spectral image bands. As the 
major axis for parallelisation for the CCSDS-123 algorithm 
is also the number of bands, it is therefore postulated that the 
achievable compression processing throughput will also be 
reduced for multispectral imagery. The focus of this research 
is to assess the performance of a GPU implementation of the 
CCSDS-123 algorithm specifically for multispectral imagery 
data.  
Payload Data Handling Backplane
Imaging 
Payload
Memory CardMemory Card
Memory Card
FPGA
Data Interfacing
B
a
c
k
p
l
a
n
e
Memory
Memory
Data Formatting
Data Buffering
Key:
Functional Block
Hardware Block
Memory CardMemory Card
Mass Memory 
Card
Downlink
Platform 
Interfaces
MemoryMemory
GPU
GPUGPU
Image Compression
Advanced Processing
Memory
  4
4. NEW CUDA CCSDS-123 IMPLEMENTATION  
The operation our new CUDA CCSDS-123 
implementation is shown in Figure 4, it adheres to the 
CCSDS-123 standard, a summary of the capabilities it 
provides can be found in appendix A. Figure 4 also provides 
information on the kernel configuration parameters and 
memory requirements. We have considered the lessons 
learned from the previous implementations [6,7] to develop 
this new implementation. This sections aims to provide 
details of new lessons learned, where the following 
subsections provide further details of each of the CUDA 
kernels developed for our CCSDS-123 implementation.  
  
Hardware Operations 
Memory  
Allocation  
(Bytes) 
Host Load Compression Parameters 21 
Host Load Image Samples 2 * X Size *  Y Size * Z Size  
Device  
 
I. CCSDS-123 Kernel 
Registers/Thread = 72 Global Memory  = (14 * X Size *  
Y Size * Z Size)  
+ 21  
 
Shared Memory  
= 8* Z Size * 
Weights_Length1  
  
Grid Size # Tiles 
  
Block Size Z Size 
Inputs  - Parameters  - Samples 
Outputs  - Codewords  - Lengths 
Device 
II. THRUST – Inclusive Sum  
NA   
Input  - Lengths 
 
Output  - Lengths 
Device 
III. Bit Packer Kernel 
Registers/Thread = 20 
 
NA2 
  
Grid Size 1024 
 
Block Size 
Ceil{(X Size *  
Y Size * Z Size) 
/1024} 
Inputs  - Codewords  - Lengths 
Outputs  - Compressed    Stream 
Host Store Compressed Stream Variable  
 
1 Weights_Length= #Prediction Bands + {if Prediction Mode = 
FULL} 3 {ELSE} 0 
 
2 No new memory needs to be allocated as the memory allocated for 
the original image samples can be reused for the Compressed 
Stream. 
 
 
Figure 4: CCSDS-123 GPU implementation stages 
I. CCSDS-123 Kernel  
Tiling 
As shown in Figure 2 the maximum level of parallelism 
available at the least parallel stages of the algorithm are 
across the spectral dimension of the input image data. For 
multispectral imagery the number of bands are commonly 
restricted to less than 10. Therefore, to help further increase 
the amount of parallelism available, and hence increase 
processing throughput, images can be segmented into 
individual image tiles. These tiles can then be independently 
compressed, providing another additional axis for 
parallelisation. Our CUDA implementation allows for the 
number of tiles and tile dimensions to be adaptive at run time, 
where specific GPU hardware characteristics dictate the 
maximum image size and maximum number of tiles which 
can be compressed by a single kernel invocation. Parallelised 
tiled imagery compression can be implemented in CUDA at 
several different levels of the CUDA programming hierarchy, 
kernel or block level, as shown in Figure 5.  
 
 Figure 5: NVIDIA CUDA programming hierarchy [8] 
CUDA 7.0 (March 2015) introduced a new concurrent 
programming functionality called CUDA streams [9]. This 
concept allows the execution of asynchronous GPU 
commands including host-GPU memory operations and 
kernel launches. Previous to CUDA 7.0 all GPU commands 
were allocated to the default stream and were executed 
synchronously with host. With the introduction of custom 
streams, it is now possible to execute multiple memory 
commands and kernel launches concurrently from the host, 
as GPU resources allow. This concept allows us to create an 
implementation in which multiple streams can be utilised to 
launch multiple CCSDS-123 kernels concurrently for 
different image tiles. The alternative approach is to 
implement the independent compression of multiple image 
tiles within a single kernel where each block of threads is 
tasked with the compression of each image tile. It was found 
  5
that the compression of tiled imagery within a single kernel 
leveraging the highly parallel block of threads was more 
efficient. This is likely due to the reduced overhead of kernel 
launch and multiple host-GPU memory transfers as this is 
only required once in the later approach.  
Memory 
The CUDA programming model allows programmers 
access to several different types of memory, as shown in 
Figure 6. The input image pixel sample values and output 
codeword and length values are stored in the global device 
memory. Global device memory has the largest latency of all 
available memory types therefore; to ensure efficient 
memory transactions it is important to ensure all memory 
operations are coalesced. Shared memory is local to each 
block of threads so facilitates thread cooperation and data 
sharing. In our CCSDS-123 kernel we utilise the on-chip low 
latency shared memory for the storage of the local difference 
and weight vector values. Registers are the fastest memory 
type available on the GPU, they are local to each thread and 
have the same lifetime as the thread. All intermediate 
calculation values are stored in the processor registers. The 
number of registers used per thread used by the kernel is 72 
when compiled for SM 5.0 GPU architecture. When profiling 
analysis is performed on our implementation it was found that 
this relatively high number of registers used is the limiting 
factor for the performance of the kernel. As the number of 
register per GPU are fixed, the number of registers used limit 
the number of threads and the number of blocks that can be 
run in parallel. 
 
Figure 6: NVIDIA GPU memory hierarchy [8] 
 
II. THRUST Inclusive Sum 
THRUST is an open-source parallel algorithms library 
and provides a flexible high level interface for GPU 
optimised routines [10]. It features an abundant collection of 
data parallel primitives such as scan, sort, and reduce, which 
are key building blocks of many complex algorithms. The 
library is included with the CUDA toolkit and extensive 
documentation is available online. The THRUST algorithm 
can be called from both host and device code and can 
additionally be executed in either location where different 
parallelisation policies are provided. In this implementation, 
the THRUST Inclusive Sum algorithm was utilised to 
calculate the offset locations in the bitstream for the variable 
codewords generated from the CCSDS-123 algorithm. By 
calculating this offset as an inclusive sum of the individual 
codeword lengths, we are then able to perform the packing of 
the codewords into the final bitstream with no serial 
dependencies. 
III. Bit Packer Kernel 
The bit packer has been designed to take in the two 
output arrays from the CCSDS-123 kernel. The first will 
contain the binary variable length codeword data, represented 
in integer form, to be packed into an output bit stream. The 
second will contain the cumulatively summed, by the 
THRUST inclusive sum kernel, lengths of the codewords to 
be packed. The bit packer operation can be performed across 
the full number of pixels in the image as there will be no data 
dependencies. Therefore, each thread will be responsible for 
the packing of a single pixel’s compressed codeword into the 
final bitstream. As each thread, will be working 
independently, it is important to ensure no two threads try and 
write to the same memory location at the same time therefore 
the atomic OR operation is used. The compressed sizes for 
each individual tile will also need to be stored with the 
bitstream to enable the decoded to identify the start and end 
of each tile to enable subsequent decompression. 
It was found, through experimentation, that the kernel 
configuration of the grid of block and blocks of threads is best 
configured to be dependent on the GPU hardware to 
maximise GPU resources rather than dependent on the input 
image dimensions. As there are no data dependencies in the 
function, all data can be treated independently for maximum 
throughput. The compression algorithm stipulates that the 
output bitstream will be guaranteed to be no larger than the 
original image size therefore we can re-use the memory 
allocated to store the input pixel values for the storage of the 
output bitstream which eliminates an additional large 
memory allocation operation.  
Performance 
The details of the testing platform used for all original 
tests discussed in the paper can be found in Appendix B. The 
testing performed in this paper uses two images readily 
available online via the CCSDS image corpus [11]. The first 
image is the Hawaii image captured from the AVIRIS 
  6
hyperspectral instrument and the second is the Agriculture 
image captured from the Landsat imager, further details on 
the test image attributes and the default CCSDS-123 
algorithm testing parameters used can be found in Appendix 
C.  Testing is performed for both non-tiled and tiled version 
of each of these images. Unless otherwise stated, the tile 
dimensions used for testing purposes are shown in Table 2. 
The performance results for the previous implementations, 
[6,7], also used the AVIRIS Hawaii hyperspectral image. 
Unfortunately, not all the necessary information to facilitate 
a definitive comparison was provided in previous NASA 
publications. In addition, different GPU hardware 
architectures have been used, therefore a definitive 
comparison cannot be made between this work and 
previously published GPU implementations. 
Table 2. Tiling dimensions 
Image Original Dimensions 
Tile 
Dimensions 
# 
Tiles
AVIRIS Hawaii 224*614*512 224*614*32 16 
Landsat Agriculture 6*1024*1024 6*32*32 1024
Table 3 provides a summary of performance of the 
discussed implementation. Where the time and throughput 
measurements are for the whole CCSDS-123 compression 
chain and include device memory allocation and freeing, host 
to device and device to host memory transfers and the 
CCSDS-123, THRUST inclusive sum and bit packer kernel 
execution.  
Table 3. New CUDA CCSDS-123 implementation 
performance 
Image Compression Ratio 
Time 
(ms) 
Throughput 
(Mb/s) Speedup
AVIRIS 
Hawaii 4.08 972.50 868.92 = 
AVIRIS 
Hawaii 
Tiled 
4.03 175.39 4817.87 5.5 
Landsat 
Agriculture 2.01 2233.93 22.53 = 
Landsat 
Agriculture 
Tiled 
2.02 39.60 1271.03 56.4 
Although testing has not currently been conducted on 
an extensive range of test images the initial results, shown in 
Table 3, confirm the expected inherent reduced 
compressibility and achievable processing throughput for the 
multispectral test image when compared to the hyperspectral 
test image. The results also show the significant processing 
throughput increase that can be achieved from performing 
image tiling. For the hyperspectral test image there is a speed 
up of approximately 5.5 times, from non-tiled to tiled tests 
and for the multispectral image there is an approximate speed 
up of 56.4 times. Despite the speed up in multispectral 
compression throughput achieved from performing tiling it is 
still approximately 3.8 times slower than the throughput 
achieved for the hyperspectral imagery. Therefore, the 
following section explores some of optimisation approaches 
specifically to facilitate the increased compression 
throughput for multispectral images.  
5. NEW MULTISPECTRAL OPTIMISED CUDA 
CCSDS-123 IMPLEMENTATION  
Profiling  
To establish the current bottleneck of our CCSDS-123 
CUDA implementation profiling analysis on the GPU 
implementation was carried out. The main bottleneck of the 
implementation was identified as an excessive use of 
registers. We have been currently unable to reduce register 
usage through manual code manipulation and optimisation. 
The NVCC CUDA compiler includes an option, called 
maxrregcount, to force the compiler to utilise a maximum 
number of registers. Various optimisations using this compile 
time parameter was attempted, but no increase in 
compression throughput was achieved.  
For multispectral images, the restriction in inherent 
parallelism is related to the number of image bands, which 
for our GPU hardware implementation translates to the 
number of threads per block. The GPU hardware architecture 
is based around a SIMD or SIMT execution model whereby 
for current NVIDIA GPUS instructions are issued in groups 
of 32 threads, also known as a warp. For the Landsat 
agriculture test case, the image has 6 bands, this means the 
kernel will be launched with 6 threads per block. As 
instructions on the GPU are always issued in groups of 32, 
whilst 32 threads will be launched only 6 will be active in the 
kernel. Therefore, an optimisation would to try to increase the 
number of threads issued per block. Due to the introduction 
of image tiling there are now two axes of parallelism which 
are consistent along the whole CCSDS-123 algorithm, bands 
and tiles. Therefore, we can increase the number of threads 
issued per block by executed the compression of multiple 
tiles with a single thread block. The necessary modification 
to our CUDA implementation was made to allow multiple 
tiles to be compressed with a single block of threads.  
To demonstrate the impact of increasing the number of 
active threads per block, by compressing multiple image tiles 
within a single block of threads, profiling and throughput 
information for the CCSDS-123 kernel has been gathered and 
used for performance comparison. The warp execution 
efficiency information has been gathered from profiling 
testing and processing throughput is quoted for the 
independent CCSDS-123 kernel. This throughput measures 
differs from those quoted in Table 3 as it is only for the kernel 
execution and does not include global device memory 
allocations, host and GPU memory transfers or that of the 
  7
other THRUST inclusive sum or bit packer kernel. A 
summary of these results can be found in Table 4. 
Table 4. New CCSDS-123 kernel efficiency 
Image Tiles / Block 
Threads / 
Block 
Warp 
execution 
efficiency 
TP 
(Mb/s) 
AVIRIS 
Hawaii Tiled 1 224 99.9% 6010.95
Landsat 
Agriculture 
Tiled 
1 6 18.5% 1378.90
2 12 37.1% 2368.26
4 24 74.1% 3728.27
Nested Parallelism 
In CUDA 5.0 (October 2012) the dynamic parallelism 
was introduced to the CUDA programming model [12]. This 
allows nested parallelism, present in an algorithm, to be 
exploited as kernels are now able to launch new kernels and 
spawn new threads on the GPU, independently from the host. 
Whilst previously, programmers were restricted to a flat 
programming model, now nested parallelism can be exploited 
to allow adaptive parallel implementation at run-time.  
There is a small degree of nested parallelism available 
in our implementation of the CCSDS-123 algorithm. 
Therefore, we investigated the potential advantages of 
applying the new dynamic parallelism CUDA programming 
model. The nested parallelism we tried to exploit was present 
in the calculation of the local difference vector, the update of 
the weight vector and the dot product calculation of these two 
vectors, the impact of these functions on the number of 
instructions is shown in Table 5. However, in our existing 
implementation the weight and local difference vectors are 
stored in the low latency on-chip shared memory. This 
memory is local to a single kernel block of threads and as 
launching a new kernel creates a new thread block, shared 
memory cannot be used to pass data between nested kernels. 
The final dynamically parallelised version of the CUDA 
CCSSD-123 implementation achieved a significantly lower 
throughput than the previous version. We postulate that this 
is because the amount of nested parallelism is too small to 
overcome the both the additional kernel launch overhead and 
the performance loss from not being able to utilise the low 
latency shared memory. 
Performance 
Our final multispectral optimised CUDA CCSDS-123 
implementation only varies from our original CUDA 
implementation by the introduced ability to be able to 
compress multiple tiles within a single thread block. The 
results for this multispectral optimised solution are shown in 
Table 6. These results show that whilst compressing multiple 
tiles with a thread block is slightly detrimental to the 
processing throughput for the hyperspectral image, we can 
achieve up to a 2.41 speed up for the multispectral image. 
Table 5. Instruction count of weight & local difference 
vector operations  
 
Table 6. New multispectral optimised CCSDS-123 CUDA 
implementation performance   
 6. EXPERIMENTAL STUDIES 
 Input Parameter Experiments  
There are 12 tuneable user imputable parameters to the 
standard CCSDS-123 algorithm, three of these have been 
shown to have the greatest impact on execution time and 
compression ability in [7]. These are the number of prediction 
band and the prediction and local sum modes. The number of 
prediction bands used by the prediction algorithm is a user 
definable parameter which can be adjusted at runtime. It 
determines the number of preceding spectral bands which 
used in the local neighbourhood calculations in the prediction 
stage of the CCSDS-123 algorithm.  
Function 
Instruction 
Execution 
Count 
% Max 
Execution 
Count 
Dot Product of Weight 
& Local Difference 
Vectors 
125426448 100 
Weight Vector Update 125426448 100 
Local Difference 
Vector Calculations 52811136 42 
Image Tiles / Block
Threads 
/ Block 
Time 
(ms) 
Throughput 
(Mb/s)
Speed 
up 
AVIRIS 
Hawaii 
Tiled 
1 224 175.39 4817.87 = 
2 448 178.65 4730.15 - 1.02
Landsat 
Agriculture 
Tiled 
1 6 39.60 1271.03 = 
2 12 24.00 2097.00 1.65 
4 24 16.44 3061.18 2.41 
8 48 16.76 3002.26 2.36 
16 96 17.15 2934.88 2.31 
      
  8
The different prediction and local sum modes allow the 
compression algorithm to be easily adapted to suit different 
image type. It is common that many raw uncalibrated push-
broom imager data often exhibit streaking artefacts the 
reduced prediction mode and column-orientated local sum 
mode allow the algorithm to only utilise pixel from the same 
along-track position in the image increasing the compression 
ratio performance. In addition, these mode’s feature less 
computation therefore likely achieved increased processing 
throughput. For calibrated or images that do not exhibit 
streaking artefacts the full prediction mode and neighbour-
orientated local sum mode will often provide an increase in 
achieved compression ratio.  
The results from our study into the impact of changing 
number of prediction bands and the influence of the 
prediction and local sum modes on compression ratio and 
processing throughput are given for both AVIRIS Hawaii and 
Landsat Agriculture test images in Figure 7 and Figure 8 
respectively. In these tests, the term reduced mode is used to 
refer to the use of both the reduced prediction mode and 
column-orientated local sum mode. Whilst full mode refers 
to the use of both full prediction mode and neighbour-
orientated local sum mode. 
It is clear from the results of both test images the impact 
the number of prediction band has on both the compression 
ratio and throughput. As the number of prediction bands has 
a positive influence on the number of executed instructions, 
as the number of prediction bands increases the processing 
throughput decreases. However, increasing the number of 
preceding band also increases the amount of spectral 
redundancy available for exploitation, resulting in an increase 
in compression ratio. Therefore, there is a trade-off between 
throughput and compression ratio which needs to be made in 
advance. For both images selecting between 1 - 3 prediction 
bands provide the best trade-off between maximising both 
compression ratio and throughput.  
For the study of the impact of the prediction and local 
sum modes the results show that for both test image the full 
modes provide the best compression ratio performance, this 
would suggest that both test images do not feature any 
streaking artefacts that require these specialised modes. With 
regards to the mode choices impact on the processing 
throughput for between 1 -3 prediction band we could expect 
the throughput reduction from reduced to full modes to be the 
approximately 500 Mb/s for the AVIRIS image and 
approximately 200 Mb/s for the Landsat image. Testing 
should be completed with a more extensive set of test 
imagery to gain a full picture of the trends and influences of 
these parameters on both compression ratio and throughput 
performance.  
 
 
Figure 7: AVIRIS Hawaii input parameter testing results  
 
Figure 8: Landsat Agriculture input parameter testing 
results  
Tile Dimensionality Testing 
A study into the impact of the choice of size and 
dimensions for image tiling has also been conducted. This 
study will help to establish trends in the relationship between 
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Prediction Bands
2
2.5
3
3.5
4
4.5
5
C
om
pr
es
si
on
 R
at
io
0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
5500
T
hr
ou
gh
pu
t (
M
b/
s)
No Tiling - Reduced - CR
Tiled - Reduced - CR
No Tiling - Full - CR
Tiled - Full - CR
No Tiling - Reduced - TP
Tiled - Reduced - TP
No Tiling - Full - TP
Tiled - Full - TP
Tiling - Modes - Data
0 1 2 3 4 5
Prediction Bands
1.7
1.8
1.9
2
2.1
2.2
2.3
C
om
pr
es
si
on
 R
at
io
0
200
400
600
800
1000
1200
1400
1600
T
hr
ou
gh
pu
t (
M
b/
s)
Not Tiled - Reduced - CR
Tiled - Reduced - CR
Not Tiled - Full - CR
Tiled - Full - CR
Not Tiled - Reduced - TP
Tiled - Reduced - TP
Not Tiled - Full - TP
Tiled - Full - TP
Tiling - Modes - Data
  9
tile dimensions and compression ratio and throughput. 
Understanding these relationships is key to ensure that the 
selection of tile size for image compression helps to achieve 
a suitable trade-off between compression ratio and 
throughput. Due to the nature of the original image AVIRIS 
Hawaii dimensions the test image has been tiled only in the 
Y dimension to gives tiles as a number of image lines, these 
results are given in Figure 9. Whilst the Landsat test image 
has been tiled into square blocks and these results are given 
in Figure 10.  
 
 
Figure 9: AVIRIS Hawaii tile size testing 
 
 
 
Figure 10: Landsat Agriculture tile size testing  
Both the tiling studies conducted show similar trends 
between the decrease in tile size and subsequent increase in 
compression throughput, where the margin of throughput 
gain reduces as tile size gets smaller. What differs between 
the two test images is the slight difference in impact on 
reducing tile size on the compression ratio, this gives more 
unique selection of tile size to provide the best trade-off 
between compression ratio and throughput. For the tile size 
selection for both maximised compression and throughput for 
the AVIRIS image this would be a size of 32 in the Y 
dimension regardless of compression mode. Whilst for the 
Landsat image this would a size of 128 for the reduced 
compression modes and either 64 or 32 for full compression 
modes. 
The final experimentally results presented in this paper 
show the impact of the tuneable tiles per block parameter of 
the new multispectral imagery optimised CUDA CCSDS-123 
implementation, and are given in Figure 11. This plot shows 
that for the multispectral Landsat Agriculture test image 
implementing the compression of at least 2 tiles per block 
always achieves greater compression than the case of the un-
optimised case where only 1 tile is compressed per block. 
Although the overall best number of tiles per block number 
for greatest throughput is dependent on tile size.  
 
Figure 11: Landsat Agriculture tile size and multispectral 
optimisation parameter testing 
7. FUTURE WORK 
Onboard Hardware Architecture Demonstration 
Work is underway to develop a test setup which is 
representative of the proposed new heterogeneous onboard 
hardware architecture as shown in Figure 3. This will then be 
used to demonstrate and perform performance analysis of the 
CUDA GPU implementation of the CCSDS-123 algorithm 
discussed in this work and the FPGA streaming data 
interfaces, buffering and data formatting tasks. 
Radiation Hardening By Software Design 
Currently, GPU’s are not manufactured to provide 
radiation tolerance at the silicon gate level. This poses an 
obstacle in the practical deployment of a GPU device in an 
onboard system as it conflicts with the traditional ethos 
163264128256512
Tile Size (#Pixels in Y Dimension)
4
4.1
4.2
4.3
4.4
4.5
4.6
C
om
pr
es
si
on
 R
at
io
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
T
hr
ou
gh
pu
t (
M
b/
s)
Reduced Modes - CR
Full Modes - CR
Reduced Modes - TP
Full Modes - TP
Compression Modes - Data
1632641282565121024
Tile Sizes (# Pixels in X & Y Dimensions)
2
2.1
2.2
2.3
C
om
pr
es
si
on
 R
at
io
0
200
400
600
800
1000
1200
1400
Th
ro
ug
hp
ut
 (M
b/
s)
Reduced Modes - CR
Full Modes - CR
Reduced Modes - TP
Full Modes - TP
Compression Modes - Data
1024 512 256 128 64 32 16
Tile Sizes (# Pixel X & Y Dimensions)
0
500
1000
1500
2000
2500
3000
3500
T
hr
ou
gh
pu
t (
M
b/
s)
1
2
4
8
16
32
64
128
Tiles Per Block
  10
within the space industry, to strictly utilise only radiation 
hardened processors. Therefore, Radiation Hardening by 
Software Design (RHBSW) principles will be adapted and 
implemented in the proposed system to ensure radiation 
effects are mitigated [13,14,15]. Some RHBSW techniques 
which have been proposed in literature to date include: 
heartbeat monitoring, watchdog timers, control flow 
assertions, progress monitoring and checkpoint and rollback 
functionality. The highly parallel nature of the GPU could 
also allow for the adaption of traditional space modular 
redundancy techniques, such as triple modular redundancy 
(TMR), for GPU software. 
8. SUMMARY 
In this paper two new CUDA GPU implementations of 
the multidimensional lossless image compression algorithm 
CCSDS-123 has been presented. The first implementation 
leverages the low-latency on-chip shared memory for the 
manual caching of several key intermediate compression 
variables. This and the exploitation of additional parallelism 
introduced by image tiling facilitates the high processing 
throughput achieved for both hyperspectral and multispectral 
test images. The results given in Table 3 gives an overview 
of the performance of our first implementation. The confirm 
our initial thoughts that both compression ratio and 
throughput would be reduced for the multispectral test image 
when compared to the hyperspectral test image due to 
underlying image characteristics. The results also highlight 
the significant throughput increase that can be achieved from 
performing image tiling. For the hyperspectral image a 
speedup of approximately 5.5 times was achieved for the tiled 
image whilst for the multispectral image there was a 
considerable speedup of approximately 56.4 times.  
However, despite the considerable speedup achieved 
from the tiling of the multispectral image, the compression 
throughput achieved is still approximately 3.8 times slower 
than the throughput achieved for the hyperspectral imagery. 
This provided the motivation to explore the optimisation of 
the implementation specifically for multispectral data sets. 
The second implementation increases the utilisation of the 
underlying GPU hardware architecture for multispectral EO 
imagery, as shown in Table 4. As a result, we are able to 
achieve a peak processing throughput of 2934.88 Mb/s a 2.3 
times speedup on the first implementation for the same image 
and tiling parameters. This makes the multispectral optimised 
CUDA implementation now only approximately 1.6 times 
slower than the throughput achieved for the hyperspectral test 
image.  
APPENDICES  
A. CCSDS-123 IMPLEMENTATION FEATURES 
Description  Status Supported?
Signed Samples  O.1 N 
Unsigned Samples  O.1 Y 
X Size, NX  M Y 
Y Size, NY  M Y 
Z Size, NZ  M Y 
Dynamic Range, D  M Y 
Number of Prediction Bands, 
P  M Y 
Full Prediction Mode  O.2 Y 
Reduced Prediction Mode  O.2 Y 
Neighbour-Oriented Local 
Sums  O.3 Y 
Column-Oriented Local Sums  O.3 Y 
Weight Component 
Resolution, Ω  M Y 
Default Weight Initialization  O.4 Y 
Custom Weight Initialization  O.4 N 
Weight Initialization 
Resolution, Q  C.21 N 
Register Size, R  M Y 
Weight Update Scaling 
Exponent Initial Parameter, 
νmin  
M Y 
Weight Update Scaling 
Exponent Final Parameter, 
νmax  
M Y 
Weight Update Scaling 
Exponent Change Interval, 
tinc  
M Y 
Compressed Image  M Y 
Output Word Size, B  M Y 
Header  M Y 
Weight Initialization Table 
Encoding  C.31 N 
Compressed Image Body  M Y 
BI Encoding Order  O.5 Y 
BSQ Encoding Order  O.5 N 
Sub-frame Interleaving 
Depth, M  C.32 Y 
Sample-Adaptive Entropy 
Coder  O.6 Y 
Block-Adaptive Entropy 
Coder  O.6 N 
Accumulator Initialization 
Table Encoding  O N 
Initial Count Exponent, γ0  M Y 
Accumulator Initialization 
Constant, K  O Y 
Rescaling Counter Size,  
γ∗  M Y 
Unary Length Limit, U max  M Y 
 
O.1: It is mandatory to support at least one of these items.  
O.2: When NX =1, support is mandatory for Reduced 
Prediction Mode and not applicable for Full Prediction Mode. 
Otherwise, it is mandatory to support at least one of these 
items.  
O.3: When NX=1, support is mandatory for Column-
Oriented Local Sums and not applicable for Neighbour-
Oriented Local Sums. Otherwise, it is mandatory to support 
  11
at least one of these items.  
O.4: It is mandatory to support at least one of these items.  
C.21: If Custom Weight Initialization supported – then 
mandatory  
C.31: If Custom Weight Initialization supported – then 
optional  
C.32: If BI Encoding Order supported – then mandatory  
O.6: It is mandatory to support at least one of these items. 
 
B. TEST PLATFORM SPECIFICATION 
 
C. TEST IMAGES & DEFAULT TESTING 
PARAMETERS 
Image Name AVIRIS Hawaii Landsat Agriculture 
Height 512 1024 
Width 614 1024 
Bands 224 6 
Bit Depth 12 8 
File Size 
(Bytes) 
105,627,648 6,291,456  
Prediction 
Bands 
3 2 
Prediction 
Mode 
Reduced Reduced 
Local Sum 
Mode 
Column-
Orientated 
Column-Orientated 
Register 
Size 
64 64 
Weight 
Resolution 
19 19 
Weight 
Interval 
6 6 
Weight 
Initial 
-1 -1 
Weight 
Final 
3 3 
U Max 18 18 
Y Star 6 6 
Y Zero 2 2 
K 2 2 
 
REFERENCES  
[1] R. L. Davidson, C. P. Bridges, “Adaptive 
Multispectral GPU Accelerated Architecture for Earth 
Observation Satellites”, Proc. of 2016 IEEE Int. Conf. on 
Imaging Systems and Techniques 
[2] Consultative Committee for Space Data Systems, 
Lossless Multispectral & Hyperspectral Image Compression 
CCSDS 123.0-B-1, Blue Book, 
http://public.ccsds.org/publications/archive/123x0b1ec1.pdf 
, 1  June 2016, (Accessed October 2016) 
[3] M. Klimesh, "Low-complexity lossless compression of 
hyperspectral imagery via adaptive filtering ", IPN Progress 
Report, vol. 42–163, pp. 1-10, 2005 
[4] R. Trautner, "ESA's roadmap for next generation payload 
data processors", DASIA Conf., 2011. 
[5] E. Augé, J. Sánchez, A. Kiely, I. Blanes, J. Serra-Sagristà, 
"Performance impact of parameter tuning on the CCSDS-123 
lossless multi- and hyperspectral image compression 
standard", J. Appl. Remote Sens., vol. 7, no. 1, pp. 074594, 
2013. 
[6] D. Keymeulen , N. Aranki , B. Hopson , A. Kiely , M. 
Klimesh and K. Benkrid, "GPU lossless hyperspectral data 
compression system for space applications", Proc. 2012 IEEE 
Aerospace Conf., pp. 1-9, 2012 
[7] B. Hopson, K. Benkrid, D. Keymeulen, N. Aranki, "Real-
time CCSDS lossless adaptive hyperspectral image 
compression on parallel GPGPU & multicore processor 
systems", Proc. 2012 NASA/ESA Conf. Adaptive Hardware 
and Systems (AHS), pp. 107-114, 2012-Jun. 
[8] C. Zeller, “Tutorial CUDA”, NVIDIA Developer 
Technology, NVIDIA Corporation, April 2008, (Accessed 
October 2016) 
[9] M. Harris, “GPU Pro Tip: CUDA 7 Streams Simplify 
Concurrency”,  
https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-
7-streams-simplify-concurrency/, January 22 2015, 
(Accessed October 2016) 
[10] NVIDIA CUDA Toolkit Documentation V7.0. 
http://docs.nvidia.com/cuda/index.html. (Accessed October 
2016) 
[11]  CCSDS MHDC - Space Link Services Image 
Compression Testing Corpus - 
http://cwe.ccsds.org/sls/docs/Forms/AllItems.aspx?RootFol
der=%2Fsls%2Fdocs%2FSLS%2DDC%2F123%2E0%2DB
%2DInfo%2FTestData&View=%7B16ACDA38%2DFFA3
%2D4657%2D8F27%2DB166C23C24A2%7D& , June 
2016 (Accessed October 2016) 
[12] A. Adinetz, “Adaptive Parallel Computation with 
CUDA Dynamic Paralllelism”, 
Desktop Dell OPTIPLEX 7010 (SMF) 
CPU & Clock Intel Core i5-3470 @ 3.20GHz 
CPU RAM 4GB DDR3  
GPU & Clock NVIDIA GeForce GTX 750Ti 
@33MHz 
GPU RAM 2GB GDDR5  
  12
https://devblogs.nvidia.com/parallelforall/introduction-cuda-
dynamic-parallelism/, May 6 2014, (Accessed October 2016) 
 [13] A. G. Schmidt, J. P. Walters, K. M. Zick, M. French, D. 
Keymeulen, N. Aranki, M. Klimesh, and A. Kiely, "Applying 
radiation hardening by software to fast lossless compression 
prediction on FPGAs," in Proceedings of the 2012 IEEE 
Aerospace Conference, March 2012. 
[14] A. Schmidt and M. French, "Fast lossless image 
compression with radiation hardening by hardware/software 
co-design on platform FPGAs", Application-Specific 
Systems, Architectures and Processors (ASAP), 2013 IEEE 
24th International Conference on, pp. 103-106 
[15] H. Takizawa, K. Sato, K. Komatsu, and H. 
Kobayashi, "CheCUDA: A Checkpoint/Restart Tool for 
CUDA Applications," In Proceedings of International 
Conference on Parallel and Distributed Computing, 
Applications, and Technologies (PDCAT), pp. 408-413, 
2009. 
BIOGRAPHY  
Rebecca L. Davidson received a 
BEng in Electronic Engineering from 
the University of Surrey, in 2014. She 
is currently a PhD student at Surrey 
Space Centre, University of Surrey. 
Her current research is in onboard 
data processing for downlink 
optimisation of multispectral Earth 
observation satellites. Her current 
focus is upon efficient hardware-
software co-design utilising commercial-off-the-shelf 
processors for space. 
 
Dr Christopher P. Bridges (BEng, 
2005; PhD 2009) leads the On-Board 
Data Handling (OBDH) research 
group within Surrey Space Centre 
(SSC). He researches software 
defined radios, real-time embedded 
systems, agent computing, Java 
processing, multi-core processing in 
FPGAs, and astrodynamics 
computing methods in many 
spaceflight payloads. In 2013, he designed, built and still 
operates the UK’s first CubeSat (STRaND-1) with SSTL and 
now contributes towards computing hardware and software 
in missions with SSTL, on ESA’s ESEO mission and also the 
NASA-JPL/CalTech AAReST mission. 
