This paper evaluates the potential of the embedded Graphics Processing Unit (GPU) in Nvidia's Tegra K1 for onboard processing. Its performance is compared to a general-purpose multi-core CPU and a full-fledged GPU accelerator. The study uses two algorithms: Wavelet Spectral Dimension Reduction of Hyperspectral Imagery and the Automated Cloud-Cover Assessment (ACCA) algorithm. The Tegra K1 achieved 51% of the performance of a high-end 8-core Intel Xeon server CPU for the ACCA algorithm and 20% for the dimension reduction algorithm, while the CPU consumes 13.5 times more power.
INTRODUCTION
This paper evaluates the suitability of the new embedded Graphics Processing Unit (GPU) in Nvidia's Tegra K1 (K1) System-on-Chip (SoC), with a Thermal Design Power (TDP) under 7W [1], for onboard processing. The performance of this SoC is compared to two modern High Performance Computing (HPC) architectures: (1) a general-purpose multi-core CPU (8-core Sandy Bridge E5-2470, 2.3GHz, TDP 95W [2]) and (2) a GPU accelerator (Nvidia Tesla K20 (K20), TDP 225W [3]). For this study, we selected two algorithms:
1. Wavelet Spectral Dimension Reduction of Hyperspectral Imagery: The principle of this method is to apply a discrete wavelet transform to hyperspectral data in the spectral domain at each pixel location. The optimal level of wavelet decomposition is computed adaptively for each pixel. See [4] for more details.
2. Automated Cloud-Cover Assessment (ACCA) Algorithm: The ACCA algorithm determines and rates the overall cloud cover of an image in two steps: Pass-One isolates clouds from non-clouds using eight threshold-based filters, then Pass-Two resolves the detection ambiguities from Pass-One by computing global statistics over the image. See [5] for more details.
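The per-pixel adaptive decomposition in the first algorithm can be illustrated with a minimal sketch. It is not the implementation from [4]: the unnormalized Haar filter (pairwise averaging), the reconstruction by coefficient repetition, the Pearson correlation criterion, the `min_corr` threshold, and all function names are assumptions chosen for brevity. The sketch also assumes the number of spectral bands is divisible by `2**max_level`.

```python
from math import sqrt

def haar_approx(spectrum):
    """One low-pass Haar step (assumed filter): average adjacent band pairs."""
    return [(spectrum[i] + spectrum[i + 1]) / 2.0
            for i in range(0, len(spectrum) - 1, 2)]

def reconstruct(approx, length):
    """Expand the approximation back to `length` by repeating each coefficient."""
    factor = length // len(approx)
    return [a for a in approx for _ in range(factor)]

def correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        # Constant signals: treat two constants as perfectly correlated.
        return 1.0 if vx == vy else 0.0
    return cov / sqrt(vx * vy)

def reduce_pixel(spectrum, max_level=4, min_corr=0.99):
    """Return (level, reduced spectrum) for the deepest decomposition level
    whose reconstruction still correlates with the original spectrum.
    `min_corr` is a hypothetical quality threshold."""
    best_level, best = 0, list(spectrum)
    current = list(spectrum)
    for level in range(1, max_level + 1):
        if len(current) < 2:
            break
        current = haar_approx(current)
        if correlation(spectrum, reconstruct(current, len(spectrum))) < min_corr:
            break
        best_level, best = level, current
    return best_level, best
```

Each pixel is handled independently of all others, which is where the pixel-level parallelism of this algorithm comes from.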
This paper shows that the performance achieved using this new SoC designed for battery powered devices is comparable to HPC hardware with significantly higher power consumption.
HARDWARE ARCHITECTURES
The Intel Xeon Sandy Bridge CPU is a general-purpose processor designed to handle a wide variety of workloads. It has a small number (up to 8) of high-performance cores with 256-bit wide SIMD (Single Instruction Multiple Data) units and large on-chip caches (~22 MB) designed to minimize the effect of limited memory bandwidth (38.4 GB/s). Both GPUs are built from Kepler streaming multiprocessors (SMX); each SMX contains 192 single-precision (SP) cores, which can be seen as 6 SIMD units. The K20 has an additional 64 double-precision (DP) cores per SMX, but the K1 does not. The K20 contains 13 SMXs, while the K1 has only 1. In terms of SP peak performance, the K1 is 10.7 times slower than the K20 but 2.2 times faster than the CPU. In terms of memory bandwidth, the K1 is 13.9 times slower than the K20 and 2.5 times slower than the CPU. However, the K1 is an SoC designed for mobile and embedded systems with low power consumption: the entire Jetson TK1 development board [6] consumes ~12.5W (SoC + DRAM ~7W) under full load.
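The ratios quoted above can be combined into a rough balance figure: how much peak compute each chip has per unit of memory bandwidth, relative to the CPU. The sketch below uses only the rounded ratios from the text, so the results are approximate; the function and constant names are illustrative and not taken from the paper.

```python
# Ratios quoted in the text (rounded values, so results are approximate).
K1_PEAK_VS_CPU = 2.2    # K1 SP peak is 2.2x the CPU's
K20_PEAK_VS_K1 = 10.7   # K20 SP peak is 10.7x the K1's
CPU_BW_VS_K1 = 2.5      # CPU memory bandwidth is 2.5x the K1's
K20_BW_VS_K1 = 13.9     # K20 memory bandwidth is 13.9x the K1's

def relative_profile():
    """Peak SP performance and memory bandwidth, normalized to CPU = 1.0."""
    k1_peak = K1_PEAK_VS_CPU
    k1_bw = 1.0 / CPU_BW_VS_K1
    return {
        "CPU": {"peak": 1.0, "bandwidth": 1.0},
        "K1": {"peak": k1_peak, "bandwidth": k1_bw},
        "K20": {"peak": k1_peak * K20_PEAK_VS_K1,
                "bandwidth": k1_bw * K20_BW_VS_K1},
    }

def compute_per_bandwidth(profile):
    """Peak compute per unit of bandwidth, relative to the CPU."""
    return {name: v["peak"] / v["bandwidth"] for name, v in profile.items()}
```

Under these rounded ratios, the K1 has about 5.5 times, and the K20 about 4.2 times, the CPU's peak compute per unit of bandwidth. A kernel that is limited by memory traffic rather than arithmetic would therefore be expected to use a smaller fraction of the K1's peak than of the CPU's.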
IMPLEMENTATION AND OPTIMIZATION
The main focus of the optimization is on techniques that allow efficient utilization of the parallel hardware. Although the architectures differ, they all use SIMD units: data is processed as short vectors, and identical operations are executed on all elements. The number of elements per vector, or SIMD width, is 4 DP or 8 SP values for the CPU and 32 SP/DP values for the GPUs (see Table 1). The efficiency of vector processing is significantly reduced by branching caused by conditional statements in the code.
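Why branching hurts SIMD efficiency can be shown with a simplified cost model, sketched below. It assumes that a SIMD unit handles an if/else by executing each side that at least one lane needs, with the other lanes masked off; real hardware behavior is more nuanced, and the function names and cycle costs are hypothetical. The mask is assumed to be non-empty.

```python
def simd_branch_cost(mask, cost_then, cost_else):
    """Cost of an if/else on one SIMD vector under masked execution.

    `mask[i]` is True when lane i takes the 'then' side. A side is
    executed if any lane needs it, so a mixed mask pays for both sides.
    """
    cost = 0
    if any(mask):
        cost += cost_then
    if not all(mask):
        cost += cost_else
    return cost

def lane_efficiency(mask, cost_then, cost_else):
    """Fraction of issued lane-cycles that do useful work (mask non-empty)."""
    useful = sum(cost_then if m else cost_else for m in mask)
    issued = simd_branch_cost(mask, cost_then, cost_else) * len(mask)
    return useful / issued
```

With equal costs on both sides, a vector whose lanes all agree keeps full efficiency, while a vector with mixed lanes drops to one half, because every lane sits idle during the side it does not take.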
Both algorithms have a very high degree of parallelism, defined by the number of pixels, and can be efficiently vectorized. In the dimension reduction algorithm, vectorization is applied across the spectral bands within a pixel when computing wavelet coefficients, while pixels are processed independently. The vectorized section of the code contains no conditional statements, and therefore no branching, that could reduce vectorization efficiency. For each pixel, the algorithm requires fast access to the original data, the reduced data and the reconstructed data, because these are accessed multiple times. It is therefore very efficient to keep them inside the on-chip caches. This can be achieved on the CPU, but the K20 and K1 do not have enough on-chip storage, which results in a performance penalty. This algorithm was chosen to evaluate the effect of the K1's small caches on performance.
On the other hand, the ACCA algorithm contains a large number of conditional statements (one for each threshold-based filter) controlled by the input data. This results in significant processing time variation depending on cloud coverage (see Figure 3, left), which is a problem for onboard real-time processing systems. To minimize this effect and to maximize the performance of the SIMD units, two new versions of the ACCA algorithm were developed: (1) Vectorized without Branching (VNB) and (2) Vectorized with Branching (VWB). The VWB algorithm can utilize SIMD units but still exhibits processing time variation. In the VNB algorithm, all threshold-based filters are redesigned to avoid branching by setting specific bits of a register; at the end, the register contains a value describing whether the pixel is a cloud or not. All filters are therefore executed for every pixel, which is not the case in the original ACCA algorithm and creates more work. But since this workload can be processed much more efficiently by the SIMD units, the VNB algorithm delivers faster processing. On the Tegra K1, the processing time variation as a function of cloud coverage is reduced from ~9% to 0.2%, and the processing time itself is reduced by 8.3% compared to the VWB version. The performance of the proposed algorithms also depends on the length of the vector processed per SIMD unit. On the CPU this parameter is called the vector length; in the CUDA model for GPUs, the same parameter is the number of threads per block. See Figures 1 and 2 for the optimal configuration on all three architectures; these figures also show the variation across different vector lengths.
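The idea of replacing data-dependent branches with bit-setting can be sketched as follows. The filters here are hypothetical placeholders (a band index and a threshold each), not the eight ACCA filters, and plain Python only illustrates the control flow rather than actual SIMD execution. The branching version stops at the first failed filter, so its work depends on the data; the branchless version always evaluates every filter and records each outcome as one bit.

```python
# Hypothetical filters: (band index, threshold); a filter passes when
# the pixel value in that band exceeds the threshold.
FILTERS = [(0, 0.08), (1, 0.30), (2, 0.50)]
ALL_PASSED = (1 << len(FILTERS)) - 1

def is_cloud_branching(pixel):
    """Early-exit version: the amount of work depends on the pixel values."""
    for band, threshold in FILTERS:
        if pixel[band] <= threshold:
            return False
    return True

def is_cloud_branchless(pixel):
    """Runs every filter and stores each result as one bit of `flags`.

    The loop has a fixed trip count and no data-dependent branch, so
    every pixel costs the same number of operations.
    """
    flags = 0
    for bit, (band, threshold) in enumerate(FILTERS):
        flags |= int(pixel[band] > threshold) << bit
    return flags == ALL_PASSED
```

Both functions return the same classification for any pixel; the second simply trades extra comparisons for uniform control flow, which is the trade-off described above.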
RESULTS
The proposed VNB version of the ACCA algorithm brings three major improvements: (1) it enables the execution of the algorithm on the K20 and K1 GPUs; (2) it significantly improves performance, with a speedup of up to 5.7 on the CPU (see Figure 3); and (3) it reduces the processing time variation across different scenes, from 15.2% for the original algorithm to 1.0% on the CPU (see Figure 4), 0.1% on the K20 and 0.2% on the K1 (see Figure 5). In all figures, the 6 different colors represent datasets with cloud coverage varying between 1% and 66%.
The performance of the VNB version of the ACCA algorithm on all three architectures and all datasets is shown in Figure 6. The CPU is used as the baseline, with a speedup equal to 1, and the K1 and K20 are compared to it. As expected, the high-performance K20 is on average 5.8 times faster, but the K1, with 13 times lower power consumption, achieved 51% of the performance of the high-end 8-core server CPU.
The performance of the wavelet spectral dimension reduction algorithm on all three architectures is shown in Table 2. It was tested in a data-independent fashion, so that all pixels are reduced by 4 wavelet decomposition levels. For each level, the data is reconstructed to its original size and evaluated using a cross-correlation function. Unlike ACCA, this algorithm efficiently utilizes the large CPU caches, while the throughput-oriented GPU architecture is less efficient. As a result, the K1 reaches about 20% of the CPU's performance. Taking into account the 13 times lower power budget, the new Tegra K1 SoC has great potential for onboard processing of complex algorithms.
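The relative performance figures and the power ratio can be combined into a rough performance-per-watt comparison. The sketch below uses the 13.5 times power ratio from the abstract and treats it as a stand-in for actual energy use, which is an assumption: the quoted figures are based on TDP, not on measured power during these runs. The function name is illustrative.

```python
def perf_per_watt_gain(relative_performance, power_ratio):
    """Performance per watt of device A relative to device B.

    relative_performance: A's throughput as a fraction of B's.
    power_ratio: B's power draw divided by A's.
    """
    return relative_performance * power_ratio

POWER_RATIO = 13.5  # CPU power relative to the K1, as quoted in the abstract

acca_gain = perf_per_watt_gain(0.51, POWER_RATIO)
wavelet_gain = perf_per_watt_gain(0.20, POWER_RATIO)
```

Under these assumptions, the K1 delivers roughly 6.9 times the CPU's performance per watt for ACCA and roughly 2.7 times for the dimension reduction algorithm.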
CONCLUSIONS
This paper evaluates the feasibility of a new mobile manycore architecture, the 192-core GPU of the Tegra K1 SoC, for onboard processing, using two remote sensing algorithms. To obtain optimal performance, we had to redesign the original algorithms to support SIMD processing. Compared to the performance of the high-end 8-core Intel Xeon server CPU, the Tegra K1 achieved (1) 51% for the ACCA algorithm and (2) 20% for the dimension reduction algorithm. Both algorithms use only the GPU part of the SoC, leaving the 4+1 ARM Cortex A15 general-purpose cores available for other tasks.
Table 2. Chip-to-chip performance comparison of the Wavelet Spectral Dimension Reduction algorithm for an image size of 100,000 pixels.
