Abstract - To make better investment decisions for future space processing, we are equipping an architectures laboratory to investigate the power and computing performance of candidate computing architectures for future space applications. The picture for future space processing is increasingly complicated by ever-increasing data rates and sizes and by limited communications bandwidth, both of which will require more data processing, in the form of either data reduction or compression, to be performed on orbit rather than on the ground.
INTRODUCTION
The problem of predicting requirements for future space processors is a challenging one. The long lead times required for hardening microprocessors against the space radiation environment and the power constraints imposed by operating on a satellite mean that currently available space qualified processors lag modern terrestrial processing hardware by several generations in performance. In terrestrial systems, this performance improvement over space systems is due to many factors. Owing to traditional Moore's Law scaling of processor feature size and clock rates, terrestrial processors currently enjoy a roughly 100X clock speed advantage. [1, 2] Further complicating matters is the greatly increased diversity of processor features and architectures. Single processor cores now feature more execution units, e.g., integer units, floating point units, and Single Instruction, Multiple Data (SIMD) units. Other advancements such as improved pipelining of instructions wring out greater performance per clock cycle. Multiple cores are now combined in a single processor. These multiple cores can be either homogeneous or heterogeneous, as in the case of modern implementations of the ARM architecture. Going beyond "conventional" microprocessors, a number of newer architectures that attempt to exploit parallelism at a variety of levels have found application in high performance computing. These range from field programmable gate arrays (FPGAs), which can exploit very fine-grained parallelism, to graphics processing units (GPUs). The difficulty in determining which of these solutions will work best over a wide range of space processing applications is that their performance is highly dependent on how well the algorithms employed map to those architectures. This has led us to set up a laboratory to study the performance of applications of importance to the Air Force on a variety of processor architectures.
In this paper, we report on recent efforts at benchmarking the performance of data intensive image processing applications, namely Synthetic Aperture Radar (SAR) and Hyper-Temporal Imaging (HTI), on a range of low power ARM-based computing platforms. These ARM platforms include a number of architectural features that we hope to investigate, including 32-bit and 64-bit implementations, homogeneous and heterogeneous computing features (such as big.LITTLE architectures), and GPU capability. Our reference applications have been implemented using the best available numerical libraries for each platform, and parallelized using threaded and message passing implementations. We report parallel speedups and relative performance for these applications, and compare performance with conventional multi-core architectures.
In section 2, the processors included in this study are described in detail.
In section 3, the applications included in this study are described. The parallelization strategies and the libraries used and their configurations are also described.
In section 4, the parallel performances of these applications are analyzed using generalizations of Amdahl's law.
PROCESSORS STUDIED
In this paper, we report results obtained using the ODROID-XU3 development board, the NVidia Jetson TK1 development board, and for reference, a DELL T7500 workstation. The development boards are examples of heterogeneous and homogeneous computing architectures based on the ARM processor.
The ODROID-XU3 development board is based on the Samsung Exynos5422 System on a Chip. This chip includes a quad core ARM Cortex-A15 cluster operating at 2.1 GHz and a quad core ARM Cortex-A7 cluster operating at 1.5 GHz. The A7 and A15 cores are binary compatible. They differ in clock rate and in the structure of the instruction processing pipeline. In normal low power operation, tasks run on the A7 cores, while for computing-intensive tasks the faster, more power-hungry A15 cores are used. The board contains 2 GB of 32-bit dual channel LPDDR3 RAM operating at 933 MHz (14.9 GB/s memory bandwidth). Storage includes 64 GB of eMMC 5.0 HS400 flash storage pre-installed with Lubuntu 14.04LTS.
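The quoted memory bandwidth can be checked with simple arithmetic. The sketch below assumes that the 933 MHz figure is the memory clock, that data is transferred twice per clock (double data rate), and that each of the two channels is 32 bits wide. The function name and its parameters are illustrative and do not come from any vendor tool.

```python
def peak_bandwidth_gb_s(clock_mhz, bus_bits, channels, transfers_per_clock=2):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_bits // 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bytes_per_transfer * channels / 1e9


# ODROID-XU3 figures taken from the text; the DDR factor of 2 is an assumption.
odroid_bw = peak_bandwidth_gb_s(clock_mhz=933, bus_bits=32, channels=2)
print(f"ODROID-XU3 peak bandwidth: {odroid_bw:.1f} GB/s")
```

Under these assumptions the result rounds to the 14.9 GB/s stated above. This is a theoretical peak; sustained bandwidth seen by an application will be lower.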
The NVidia Jetson TK1 development board is based on the NVidia Tegra K1 System on a Chip. This chip includes a quad core ARM Cortex-A15 and a Tegra K1 GPU. The GPU contains 192 SM3.2 CUDA cores delivering 326 GFlops. This board includes 2 GB of DDR3L 933 MHz EMC x16 DRAM using a 64-bit data width. Storage includes 16 GB of eMMC 4.51 flash memory and a 16 GB SD card slot. The flash memory comes with Ubuntu Linux 14.04LTS installed.
The performance of these boards is compared against a DELL T7500 workstation containing dual quad-core Intel Xeon X5677 CPUs operating at 3.47 GHz and 48 GB of Samsung 1333 MHz DDR3 ECC Registered DIMM memory. The workstation initially ran Ubuntu Linux version 12.04LTS. Prior to the tests shown in Figure 7, it was upgraded to Ubuntu Linux version 14.04LTS.
The GCC compiler version included with Ubuntu Linux version 14.04LTS is version 4.8.4 which supports the OpenMP 3.1 specification. Ubuntu 12.04LTS includes the 4.6.3 GCC compiler which supports version 3.0 of the OpenMP specification.
Several other development boards were not received in time for inclusion in this paper. These include a 64-bit Juno ARM development board and a Xilinx Zynq-7000 based Mini-ITX development kit. Likewise, the CUDA GPU (graphics processing unit) compute capability of the NVidia Jetson TK1 is not studied in this paper as we did not have time to rewrite and validate either of our benchmark codes. We expect to include these boards and capabilities in future work.
APPLICATIONS STUDIED
A SAR application implementing the basic range compression, azimuth transformation, range migration and azimuth compression steps of the SAR algorithm was written by Chris Conger and Adam Jacobs of the National Science Foundation (NSF) Center for High-Performance Reconfigurable Computing (CHREC). [3] [4] [5] [6] The CHREC consortium includes partners from academia, industry and government. The original codes included a serial version and an MPI-based (MPI=Message Passing Interface) version implementing a crude master-slave domain decomposition algorithm on top of the SAR algorithm used in the serial code. Fast Fourier Transforms (FFTs) from the GNU Scientific Library (GSL) are used. [7] The codes were based on a sequential SAR application developed by David Sandwell and Evelyn Price of the Scripps Institution of Oceanography. [8] In the original CHREC MPI implementation, the master processor just distributed data to the slave processors; we changed this to allow the master processor to share in the computational workload. This was critical because of the small number of processors on our test boards. Figure 1 illustrates, for the case of a quad-core environment, how the data distribution and SAR processing are performed in our parallel benchmark. The distributed memory message passing implementation was converted to a shared memory implementation using OpenMP threading. On our reference platform the MPI and OpenMP implementations were found to perform identically. In this paper, we compare the performance of the OpenMP implementations across our test platforms. This SAR code reads in a 326MB file of raw radar echoes and an associated parameter file and generates a raw SAR image file. No additional input decks were available. Conversion of the raw image to a standard Windows BMP image is performed with a separate application that is not part of the benchmark and not included in the timing results.
An HTI application that uses state-of-the-art algorithms for the extraction of temporal signals from synthetic high frame rate (>100 Hz) data was developed within our laboratory using MATLAB. The code was subsequently converted to FORTRAN 90 and C. In this conversion process, heavy use was made of the industry standard LAPACK¹ and BLAS linear algebra libraries, as heavily optimized versions of these libraries are usually available in either vendor-supplied or open source implementations that are tuned to take advantage of architectural features such as cache size and structure.
These libraries are typically available in both serial and thread parallel implementations, so some level of parallelization is obtainable simply by swapping a serial library for a parallel one. For the platforms studied in this paper, the libraries used consisted of versions 3.4.2 (Xeon) and 3.5.0 (ODROID, Jetson) of the LAPACK library and version 3.10.1 of the Automatically Tuned Linear Algebra Software (ATLAS) BLAS library. The parallel implementation of the ATLAS library makes use of pthreads. By default, the parallel ATLAS library makes use of all of the processors on the platform: 8 cores for our Xeon reference platform, 8 cores (4 A15 cores and 4 A7 cores) for the ODROID XU3, and 4 A15 cores for the NVidia Jetson TK1. The configuration options of the ATLAS library also allowed us to build an ATLAS implementation that only uses the A15 cores on the ODROID board. This capability was unique to ATLAS.
The HTI processing chain can be broken into several standard image processing and data manipulation steps that can be expressed in terms of elementary operations such as convolution, matrix multiplication, eigenvector decomposition, conversion of data to double precision floating point, local averaging, and covariance matrix calculations. Many of these operations can be expressed as BLAS Level 1 (vector), 2 (matrix-vector) and 3 (matrix-matrix) linear algebra operations. Other operations within the code were threaded using OpenMP. The application reads and processes 500 256x256 16-bit pixel images from a synthetic 420 MB data file.
¹ LAPACK is a reimplementation of the LINPACK and EISPACK numerical libraries. Its routines are written using high level block operations that can be optimized for each architecture.
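To illustrate how one of these steps maps onto a Level 3 BLAS operation, the sketch below forms a sample covariance matrix as a single matrix-matrix product, which is the kind of work a DGEMM call performs. It is plain Python written for readability, not the benchmark code, and the function names and the tiny data set are hypothetical.

```python
def matmul_transpose_a(a, b):
    """Return A^T * B for row-major lists of lists.

    This is the product DGEMM computes when its first operand is
    transposed, with alpha = 1 and beta = 0.
    """
    rows, cols_a, cols_b = len(a), len(a[0]), len(b[0])
    return [
        [sum(a[k][i] * b[k][j] for k in range(rows)) for j in range(cols_b)]
        for i in range(cols_a)
    ]


def covariance(samples):
    """Sample covariance; one observation per row, one variable per column."""
    n = len(samples)
    dims = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in samples]
    product = matmul_transpose_a(centered, centered)
    return [[value / (n - 1) for value in row] for row in product]


# Hypothetical data: three observations of two perfectly correlated variables.
frames = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
cov = covariance(frames)
print(cov)
```

In a tuned implementation the inner triple loop is exactly what an optimized BLAS replaces, which is why swapping the library can change performance without touching the calling code.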
The HTI code makes use of explicit thread parallelization via OpenMP directives and indirect parallelization through the use of parallel linear algebra libraries. This is possible because no linear algebra library call appears in any parallel region used by OpenMP. In this application we make use of a Basic Linear Algebra Subroutine (BLAS) library from the Automatically Tuned Linear Algebra Software (ATLAS) project [9] and LAPACK 3.5.0 [10]. When the ATLAS library is compiled, both serial and thread parallel versions of the library are built. The serial version is optimized for a single processor core, including the use of SIMD instructions such as the SSE2 instruction set in the Xeon or NEON instructions in an ARM processor. The default thread support used in the ATLAS library is pthreads². Also, the number of threads used by the ATLAS library is fixed at compile time and defaults to the number of available cores on the target machine. The LAPACK library is a serial library, but benefits from any parallelization inherited from the BLAS library calls it makes. Since the OpenMP directives can be enabled or disabled via a compiler flag, four different versions of the benchmark can be compiled: a purely serial implementation, an implementation in which only the OpenMP directives are enabled, an implementation in which only the parallel ATLAS library is used, and an implementation in which both are used.
² POSIX threads.
OpenMP's parallel constructs, typically omp parallel, omp for and omp parallel for, were used extensively to speed up loops and iterative operations within the convolution, local averaging and data conversion blocks of code. Blocks of code that could be implemented using standardized linear algebra library calls, including BLAS and LAPACK, are made parallel using parallel implementations of numerical libraries such as ATLAS. BLAS and LAPACK calls include DGEMM³, DGER⁴, DSCAL⁵ and DSYEV⁶.
RESULTS AND ANALYSIS

SAR Benchmark
In this implementation of the SAR benchmark, there are two parameters under user control: (1) the size of a patch (or number of patches) and (2) the number of OpenMP threads employed. As the data set is of fixed size and the patch dimension is a power of 2, varying the size of a patch changes the size of the image produced. The amount of data processed increases as the number of patches is increased, so comparing run times for different numbers of patches is not valid. However, varying the patch size allows us to see the impact of swapping to disk when the memory required exceeds the amount of RAM available. This can happen depending on the combination of patch size and number of processors chosen.
³ DGEMM is the double precision BLAS Level 3 matrix-matrix multiply.
⁴ DGER is the double precision BLAS Level 2 rank-1 matrix update operation.
⁵ DSCAL is the double precision BLAS Level 1 routine for scaling a vector by a constant.
⁶ DSYEV is the double precision LAPACK library routine for computing eigenvalues and eigenvectors of a real symmetric matrix.
Figure 2 illustrates the parallel speedup of the SAR application as a function of the number of OpenMP threads for different numbers of patches on our Xeon reference platform. Each speedup curve is computed using the serial run time appropriate to that number of patches. Available memory on this platform greatly exceeds that needed by the code, so the observed performance is not influenced by memory constraints. Instead, the observed performance is easily understood in terms of the number of iterations of the parallel loop body; that is, one iteration of the block of operations shown in Figure 1. The number of iterations performed for a given combination of number of patches and number of OpenMP threads is shown in Figure 3. Note that when the number of patches is much greater than the number of available threads (processors), the number of parallel iterations becomes smoothly varying. When the number of patches becomes comparable to the number of available threads, there are ranges where adding a thread/processor does not change the number of parallel iterations. In these situations the speedup is found to worsen slightly, as there is little benefit to adding a processor. Clear speedup improvements are seen when the number of parallel iterations decreases. As expected, utilization of processors is optimal when the number of patches is evenly divisible by the number of threads.
Figure 4 shows the parallel speedup of the SAR application on the NVidia Jetson TK1 for 9 and 35 patches. For 35 patches, reasonably good scaling is observed. For 9 patches, good scaling is observed up to 3 threads/processors. Addition of a 4th thread/processor does not reduce the number of iterations, so the 9 patch speedup decreases with the addition of a 4th thread.
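The dependence on the number of parallel iterations can be made concrete with a small calculation. The sketch below assumes that patches are shared as evenly as possible among the threads, that every patch takes the same time, and that serial overheads are ignored, so the speedup it reports is an upper bound rather than a prediction. The function names are illustrative.

```python
def parallel_iterations(patches, threads):
    """Iterations of the parallel loop body: ceiling of patches / threads."""
    return (patches + threads - 1) // threads


def ideal_speedup(patches, threads):
    """Upper bound on speedup when every patch costs the same."""
    return patches / parallel_iterations(patches, threads)


for patches in (9, 35):
    for threads in range(1, 5):
        iters = parallel_iterations(patches, threads)
        bound = ideal_speedup(patches, threads)
        print(f"patches={patches:2d} threads={threads} "
              f"iterations={iters:2d} ideal speedup={bound:.2f}")
```

For 9 patches the iteration count is 3 with either 3 or 4 threads, so the fourth thread cannot help in this model. For 35 patches the count keeps falling as threads are added up to 4.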
The situation becomes more complex for the ODROID XU3 platform. Figure 5 shows the parallel speedup of the SAR application on the ODROID XU3. Recall that the ODROID has a cluster of 4 fast ARM A15 cores and a cluster of 4 slower ARM A7 cores. Initially, the process scheduler assigns tasks to the fast cores, so the speedup curves for 4 or fewer threads look very similar to those of the NVidia Jetson TK1. For more than 4 threads, speedup suffers dramatically as slower A7 cores are incorporated into the processing. For the case of 35 patches, there remains a performance benefit to using more threads because the gain from reducing the number of parallel iterations outweighs the cost of adding slower processor cores to the calculation. Peak performance in this case occurs at about 6 cores. In a power constrained environment, such as aboard a spacecraft, the minimal performance gains (20% higher speedup at 6 processors than at 3 processors) may not justify the use of additional processors.
HTI Benchmark
The analysis of the performance of the HTI benchmark is challenging because the number of pthreads used by the parallel ATLAS library is fixed when the library is compiled whereas the number of OpenMP threads can be varied at run time. For our Xeon reference platform, the number of pthreads used by the parallel ATLAS library is fixed at 8. For the NVidia Jetson TK1, the number of pthreads used by the parallel ATLAS library is fixed at 4. Since the ODROID XU3 is a heterogeneous architecture in the sense that its cores may be divided into two classes of cores that are binary compatible but run at different speeds, we studied the performance of the ODROID using two builds of the ATLAS library. In the first build, the number of pthreads is the default (8), i.e., the four fast ARM A15 cores are pooled with the four slower ARM A7 cores. In this case, scheduling of tasks on cores is left to the operating system. In the second case, we take advantage of build options in the ATLAS library to restrict the pthreads to run exclusively on the faster A15 cores. Figure 6 shows the performance of our HTI implementation on our reference Xeon platform. Two curves are shown. One curve shows the impact of OpenMP thread parallelization when the code is linked with serial linear algebra libraries (ATLAS BLAS + LAPACK), the other includes the effect of using parallel linear algebra (parallel ATLAS) libraries. The difference in the two curves reflects the amount of parallelizable code distributed between the main body of the code and the numerical libraries. Since the run time associated with the parallel numerical library, which is using all 8 cores in this case, scales as well as or better than the OpenMP code which is using 1 to 8 cores, the code scales reasonably well with the number of OpenMP threads on this platform. The parallel efficiency of the code when using all 8 cores for both OpenMP (main code) and pthreads (ATLAS) is approximately 70%.
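The efficiency figure can be converted to a speedup and back. The sketch below assumes the conventional definition of parallel efficiency as speedup divided by the number of cores; the function names are illustrative, and only the 70% and 8-core values come from the text.

```python
def parallel_efficiency(speedup, cores):
    """Conventional definition: fraction of ideal linear speedup achieved."""
    return speedup / cores


def implied_speedup(efficiency, cores):
    """Speedup implied by a given efficiency on a given number of cores."""
    return efficiency * cores


xeon_speedup = implied_speedup(0.70, 8)
print(f"Xeon: implied speedup on 8 cores is about {xeon_speedup:.1f}")
```

Under this definition, 70% efficiency on 8 cores corresponds to a speedup of roughly 5.6 over the serial run.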
We also studied the performance impact on the code of the various loop schedulers available within OpenMP by adding the schedule(runtime) clause to all of the OpenMP parallelizable loops in the code. This allowed the loop scheduler to be varied at run time by assigning a value to the OMP_SCHEDULE environment variable. The environment variable allows both a schedule type and an optional "chunk size" value, which controls the size of blocks of loop iterations distributed to threads, to be set.
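The value assigned to OMP_SCHEDULE is a schedule type optionally followed by a comma and a chunk size. The sketch below is a small helper for building and checking such strings when scripting a series of runs. It handles only the basic type-plus-chunk form described here, it mirrors rather than replaces the parsing done by the OpenMP runtime, and its function names are hypothetical.

```python
VALID_KINDS = {"static", "dynamic", "guided", "auto"}


def parse_omp_schedule(value):
    """Split 'kind[,chunk]' into (kind, chunk), with chunk None if absent."""
    kind, _, chunk = value.strip().partition(",")
    kind = kind.strip().lower()
    if kind not in VALID_KINDS:
        raise ValueError(f"unknown schedule type: {kind!r}")
    chunk = chunk.strip()
    if not chunk:
        return kind, None
    chunk_size = int(chunk)
    if chunk_size < 1:
        raise ValueError("chunk size must be a positive integer")
    return kind, chunk_size


def format_omp_schedule(kind, chunk_size=None):
    """Build a string suitable for the OMP_SCHEDULE environment variable."""
    return kind if chunk_size is None else f"{kind},{chunk_size}"


print(parse_omp_schedule("dynamic,32"))
print(format_omp_schedule("guided", 4))
```

A driver script could place the formatted string in the environment of each benchmark run, so that the scheduler is changed without recompiling.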
The available schedule types include auto, dynamic, guided and static. [11] The meaning of the optional "chunk size" value varies with the schedule type selected. When the schedule type is auto, the decision regarding scheduling is delegated to the compiler and/or runtime system. When the schedule type is dynamic, the loop iterations are distributed to threads in blocks of size specified by the "chunk size" parameter as the threads request them. The default value of "chunk size" for the dynamic scheduling option is 1. When the schedule type is static, iterations are divided into blocks of size "chunk size" and the blocks are assigned to the threads in round-robin fashion in order of thread number. If "chunk size" is not specified for static, the iterations are instead divided into blocks of approximately equal size, with at most one block assigned to each thread. When the schedule type is guided, iterations are again distributed to the threads as threads request them, in chunks whose size decreases as the loop proceeds. The "chunk size" value, with the possible exception of the final chunk, is a lower bound on the size of a chunk. For the guided schedule type, the default value of the chunk size is 1.
Figure 7 shows the parallel speedup of the HTI application on the Xeon platform for various combinations of schedule type and chunk size. Also included is the speedup when the schedule(runtime) clause is not included in the loop parallelization specification, i.e. the default OpenMP scheduling. This is labeled as "default" in the figure. Chunk sizes were varied from 1 to 32 in powers of 2, but for clarity, curves for only a few representative values are shown. We observed that dynamic scheduling yielded very poor performance for small chunk sizes, but became comparable to other methods when the chunk size reached 32. The "default" scheduling performed as well as the best run time scheduling options.
Figure 8 shows the performance of the application on the NVidia Jetson TK1. Qualitatively, the behavior is similar to that of the Xeon. The parallel efficiency achieved when using parallel libraries with all 4 cores was about 78%.
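The chunking rules for the schedule types can be illustrated with a small simulation. The sketch below shows the round-robin assignment used by the static schedule when a chunk size is given, and counts how many chunks a loop is cut into, which for the dynamic schedule is the number of times threads must request work. Linking that count to the poor performance seen for small dynamic chunk sizes is our assumption, not something measured here, and the function names and loop sizes are illustrative.

```python
def static_assignment(n_iterations, n_threads, chunk):
    """Iterations owned by each thread under a static schedule with a chunk size."""
    assignment = {thread: [] for thread in range(n_threads)}
    for chunk_index, start in enumerate(range(0, n_iterations, chunk)):
        thread = chunk_index % n_threads  # chunks are dealt out round-robin
        stop = min(start + chunk, n_iterations)
        assignment[thread].extend(range(start, stop))
    return assignment


def chunk_count(n_iterations, chunk):
    """Number of chunks a loop is divided into for a given chunk size."""
    return (n_iterations + chunk - 1) // chunk


print(static_assignment(10, 3, 2))
for chunk in (1, 2, 4, 8, 16, 32):
    print(f"chunk size {chunk:2d}: {chunk_count(256, chunk):3d} chunks for 256 iterations")
```

With 256 iterations, a chunk size of 1 produces 256 separate work requests, while a chunk size of 32 produces only 8.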
Figure 9 shows the performance of the code on the ODROID XU3 using a version of the ATLAS BLAS library compiled to use all 8 cores (4 A15 cores and 4 A7 cores). The parallel performance is quite poor. We suspect that because of the mix of slow and fast cores in the pthreads portion of the code, the pthreads execution is a larger portion of the run time than on the other boards tested. This acts to limit the scalability of the code with respect to OpenMP threads. We conducted experiments varying the OpenMP scheduler type and chunk size as well as the number and types of cores used by the ATLAS library to determine the likely culprit. Figure 10 shows the performance of the code on the ODROID XU3 when the parallel linear algebra library is restricted to run only on the four faster ARM A15 cores. The parallel performance is improved by about 45% by restricting the BLAS operations to the ARM A15 cores.
The effect of the OpenMP scheduler options on the performance of the code is shown in the next two figures. Figure 11 shows the effect of OpenMP scheduler options on the HTI application when the parallel ATLAS library is compiled to use both ARM A15 and A7 cores. Figure 12 shows the effect of scheduler options on the HTI application when the ATLAS library is compiled to use only the A15 cores. In both cases, the best performing run time scheduling option offers the same performance as the default OpenMP scheduler.
We have found that the performance of the HTI application can be well characterized by a slight generalization of Amdahl's law. [12] The parallel speedup can be written as
$$S = \frac{t_{\mathrm{serial}} + t_{\mathrm{OpenMP}} + t_{\mathrm{pthreads}}}{t_{\mathrm{serial}} + t_{\mathrm{OpenMP}}/N_{\mathrm{OpenMP}} + t_{\mathrm{pthreads}}/N_{\mathrm{pthreads}}}$$
where $t_{\mathrm{serial}}$ is the execution time for the serial portion of the code, $t_{\mathrm{OpenMP}}$ is the serial execution time for the portion of the code using OpenMP threads, $t_{\mathrm{pthreads}}$ is the serial execution time for the portion of the code parallelized using pthreads, and $N_{\mathrm{OpenMP}}$ and $N_{\mathrm{pthreads}}$ are the numbers of OpenMP threads and pthreads used. The dashed lines in Figures 6, 8, 9 and 10 are least squares fits of the data to the expression above using the ratios $t_{\mathrm{OpenMP}}/t_{\mathrm{serial}}$ and $t_{\mathrm{pthreads}}/t_{\mathrm{serial}}$ as the fitting parameters. For homogeneous architectures, the fit is excellent, while for the heterogeneous architecture the fit is still reasonable. For this problem size, extrapolation of the curves suggests that at most 16 processors can be used for this application. The fitted ratios are platform dependent, reflecting compiler differences and differences in SIMD units.
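The behavior of such a model can be explored numerically. The sketch below assumes the speedup takes the usual Amdahl form, with the OpenMP portion divided by the number of OpenMP threads and the pthreads portion divided by the number of pthreads. The timing values are hypothetical and chosen only for illustration; they are not the fitted ratios from this study.

```python
def hybrid_speedup(t_serial, t_openmp, t_pthreads, n_openmp, n_pthreads):
    """Speedup with two independently threaded portions (assumed Amdahl form)."""
    serial_total = t_serial + t_openmp + t_pthreads
    parallel_total = t_serial + t_openmp / n_openmp + t_pthreads / n_pthreads
    return serial_total / parallel_total


def openmp_limit(t_serial, t_openmp, t_pthreads, n_pthreads):
    """Speedup approached as the number of OpenMP threads grows without bound."""
    serial_total = t_serial + t_openmp + t_pthreads
    return serial_total / (t_serial + t_pthreads / n_pthreads)


# Hypothetical timings in arbitrary units; the pthreads count is fixed at 8.
T_SERIAL, T_OPENMP, T_PTHREADS, N_PTHREADS = 1.0, 12.0, 3.0, 8

for n_openmp in (1, 2, 4, 8, 16):
    s = hybrid_speedup(T_SERIAL, T_OPENMP, T_PTHREADS, n_openmp, N_PTHREADS)
    print(f"OpenMP threads={n_openmp:2d} speedup={s:.2f}")

limit = openmp_limit(T_SERIAL, T_OPENMP, T_PTHREADS, N_PTHREADS)
print(f"limit for many OpenMP threads: {limit:.2f}")
```

Because the serial and pthreads terms do not shrink as OpenMP threads are added, the curve flattens toward the limit, which is the saturation behavior that the extrapolation in the text refers to.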
Finally, in Figure 13 we compare the relative performance of the codes on the various platforms as a function of the number of OpenMP threads used. The curve for the NVidia Jetson is flatter, reflecting differences that are solely due to clock speed and differences in performance of SIMD units on the Xeon and Jetson. The slight downward trend in this curve reflects the fact that the ATLAS library used with the Jetson was built to use 4 cores whereas the ATLAS library used with the Xeon platform uses 8 cores. The curves for the ODROID reflect differences in clock speed and the impact of using slower ARM A7 cores in the computation.
SUMMARY
In this paper, we have examined the performance of two space applications on low power COTS processors based on the ARM architecture. The processors included examples of both homogeneous and heterogeneous (big.LITTLE) architectures. The application examples included two image processing applications. The heterogeneous architecture fared poorly, as the slower LITTLE cores were a performance bottleneck.
The new laboratory provides us the capability to study how next generation mission applications map to current and future space processing platforms. We hope to work with other groups to develop other benchmarks in the future. 
