Optimization of Selected Remote Sensing Algorithms for Many-Core Architectures
Lubomir Riha, Member, IEEE, Jacqueline Le Moigne, Member, IEEE, and Tarek El-Ghazawi, Fellow, IEEE
Abstract-This paper evaluates the potential of the embedded graphics processing unit (GPU) in Nvidia's Tegra K1 for onboard processing. The performance is compared to a general purpose multicore central processing unit (CPU), a full-fledged GPU accelerator, and an Intel Xeon Phi coprocessor, for two representative potential applications: wavelet spectral dimension reduction of hyperspectral imagery and automated cloud-cover assessment (ACCA). For these applications, the Tegra K1 achieved 51% of the performance of a high-end eight-core server Intel Xeon CPU for the ACCA algorithm and 20% for the dimension reduction algorithm, although that CPU has a 13.5 times higher power consumption. This paper also shows the potential of modern high-performance computing accelerators for algorithms such as the ones for which the paper presents an optimized parallel implementation. The two algorithms that were tested mostly contain spatially localized computations, and one can assume that all image processing algorithms containing localized computations would exhibit similar speed-ups when implemented on these parallel architectures.
Index Terms-Cloud detection, dimension reduction, Intel Xeon Phi, Kepler GPU, onboard processing, remote sensing, Tegra K1.
I. INTRODUCTION
Remote sensing satellite missions are evolving toward smaller spacecraft (e.g., CubeSats or MiniSats), with lower cost and lower power, as well as toward distributed spacecraft missions in which units work collaboratively to reach one or more common goals. In all of these cases, being able to perform basic processing onboard, such as dimension reduction or cloud detection, is the first step toward content-based data compression (reducing communication bandwidth) and toward optimizing the acquisition of content-rich datasets. Additionally, onboard processing allows autonomous decisions to be taken and, therefore, a quick reaction to unexpected science events or phenomena.
For this "intelligent" processing to be performed onboard, enhanced computing capabilities need to be considered.
L. Riha is with IT4Innovations National Supercomputing Center, VSB-Technical University of Ostrava, Ostrava 70833, Czech Republic (e-mail: lubomir.riha@vsb.cz).
J. Le Moigne is with NASA Goddard Space Flight Center, Software Engineering Division, Greenbelt, MD 20771 USA (e-mail: jacqueline.j.lemoignestewart@nasa.gov).
T. El-Ghazawi is with The High-Performance Computing Laboratory, The George Washington University, Ashburn, VA 20052 USA (e-mail: tarek@gwu.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSTARS.2016.2558492
Previously, field programmable gate arrays (FPGAs) and reconfigurable computers combining the flexibility of traditional microprocessors with the power of FPGAs have been investigated [1]-[4]. With the latest developments in smaller spacecraft such as CubeSats or U-Class spacecraft, lower power solutions need to be investigated. This paper evaluates the suitability of the new embedded graphics processing unit (GPU) in Nvidia's Tegra K1 (K1) system-on-chip (SoC), with a thermal design power (TDP) under 7 W [5], for onboard processing. The performance of this SoC is compared with three modern high-performance computing (HPC) architectures: 1) a general purpose multicore central processing unit (CPU) (8-core Sandy Bridge E5-2470, 2.3 GHz, TDP 95 W [6]); 2) a many-core accelerator (Intel Xeon Phi 5110p, TDP 225 W [7]); and 3) a GPU accelerator (Nvidia Tesla K20m (K20m), TDP 225 W [8]).
The GPU acceleration of hyperspectral imaging algorithms has been presented in the literature multiple times [10]-[12]. However, the proposed algorithms are mainly developed and evaluated on HPC accelerators, which are not suitable for onboard processing due to their high power consumption. This paper shows that the performance achieved using this new SoC designed for battery powered devices is comparable to HPC hardware with significantly higher power consumption. Additionally, we can envision such low power processors being integrated in space flight hybrid data processing systems such as the SpaceCube processor family being developed at NASA Goddard Space Flight Center [9]. The main contributions of this paper are to 1) provide an in-depth assessment of two different types of GPU processors as compared with more traditional HPC processors, for two representative image processing algorithms; and 2) investigate and demonstrate the feasibility of using a low power consumption GPU for onboard processing. The two algorithms that were tested mostly contain spatially localized computations, and one can assume that all image processing algorithms containing localized computations, e.g., local filtering such as median and enhancement filtering, edge detection, and wavelet decomposition and compression, would exhibit similar speed-ups when implemented on these parallel architectures.
A. Algorithms
For this study, we selected two algorithms.
1) Wavelet Spectral Dimension Reduction of Hyperspectral Imagery: More effective data processing techniques are needed to deal with the very rich information offered by hyperspectral imagery, which poses the challenge of processing and analyzing very large amounts of data. In particular, dimension reduction (i.e., the transformation that brings data from a high-order dimension to a low-order dimension) is needed for several hyperspectral applications. For example, when performing image classification, dimension reduction can be used to overcome the curse of dimensionality by ensuring a minimum ratio of training pixels to the number of spectral bands [13], [14] (and therefore a reliable estimate of class statistics). Another application of hyperspectral dimension reduction is data compression, when fast data downlink is needed and lossless compression is not required; this is the case for disaster management applications, for which very recent content-rich datasets are quickly needed to assess damage and prepare rescue operations [15].
The algorithm, which applies a discrete wavelet transform to hyperspectral data in the spectral domain and at each pixel location [16], is described as follows:
a) For each pixel in the hyperspectral datacube, the spectral signal (or signature) is decomposed using a 1-D Daubechies orthonormal wavelet (with a filter of size 4, called DAU4).
b) The optimal level of wavelet decomposition for each pixel is computed adaptively by performing a 1-D wavelet reconstruction (using the inverse DAU4 filter), and by computing the correlation between the reconstructed signal and the original signal. The optimal level of decomposition is chosen as the lowest level producing a correlation above a given threshold.
c) Combining the results from each pixel, the optimal level L of wavelet decomposition for the image is chosen as the lowest level after discarding outlier pixels.
d) Using the level L computed in step c), the output "dimension reduced" image is made of all pixels decomposed to level L.
This method not only reduces the data volume, but can also preserve the characteristics of the spectral signatures, i.e., the image content of each dataset. This is due to the intrinsic property of wavelet transforms that preserves the general structure of each pixel spectrum (peaks and valleys), by preserving high- and low-frequency features during the signal decomposition. The algorithm is represented graphically in Fig. 1.
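Steps a) and b) can be illustrated with a minimal pure-Python sketch. It is not the optimized implementation evaluated in this paper. It assumes periodic wrap-around at the ends of the spectrum, assumes that detail coefficients are discarded (treated as zero) during reconstruction, and reads "lowest level" as the deepest decomposition whose correlation still meets the threshold. The function names `decimate`, `reconstruct`, `correlation`, `level_correlations`, and `pick_level` are hypothetical.

```python
import math

SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
# DAU4 low-pass (H) and high-pass (G) filter taps.
H = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM, (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]
G = [H[3], -H[2], H[1], -H[0]]


def decimate(signal):
    """One DAU4 level with periodic wrap-around: returns (smooth, detail) halves."""
    n = len(signal)
    smooth, detail = [], []
    for i in range(n // 2):
        taps = [signal[(2 * i + k) % n] for k in range(4)]
        smooth.append(sum(h * t for h, t in zip(H, taps)))
        detail.append(sum(g * t for g, t in zip(G, taps)))
    return smooth, detail


def reconstruct(smooth, detail=None):
    """Inverse of decimate; missing detail coefficients are treated as zero."""
    half = len(smooth)
    if detail is None:
        detail = [0.0] * half
    n = 2 * half
    out = [0.0] * n
    for i in range(half):
        for k in range(4):
            out[(2 * i + k) % n] += H[k] * smooth[i] + G[k] * detail[i]
    return out


def correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def level_correlations(spectrum, max_levels=5):
    """Correlation of the smooth-only reconstruction with the original, per level."""
    result = []
    smooth = list(spectrum)
    for level in range(1, max_levels + 1):
        smooth, _ = decimate(smooth)
        approx = smooth
        for _ in range(level):  # level k needs k inverse passes
            approx = reconstruct(approx)
        result.append(correlation(spectrum, approx))
    return result


def pick_level(correlations, threshold):
    """Deepest level whose correlation is still >= threshold (0 if none)."""
    best = 0
    for level, corr in enumerate(correlations, start=1):
        if corr >= threshold:
            best = level
    return best
```

The spectrum length should be divisible by two at every level and stay at or above four samples before each decimation, which holds for 128, 256, or 512 bands and five levels.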
2) Automated Cloud-Cover Assessment (ACCA) Algorithm
The ACCA algorithm was designed to assess the overall cloud cover of Landsat-7 image data [17] , [18] . The algorithm utilizes two steps: Pass-1 utilizes eight threshold-based filters to isolate clouds from nonclouds, then Pass-2 computes global image statistics to resolve detection ambiguities (e.g., snow versus clouds). In the following, we will only be focusing on Pass-1.
If a Landsat image is defined by seven normalized multispectral bands {B1, B2, ..., B7}, the eight filters used in Pass-1 are defined as follows: a) Dark pixels elimination through a brightness threshold:
b) Snow elimination through a normalized difference snow index:
c) Warm image features elimination through a temperature threshold:
d) Ice and others elimination through a Band 5/6 composite:
e) Bright vegetation and soil elimination through a Band 4/3 ratio:
f) Ambiguous features elimination through a Band 4/2 ratio:
g) Rocks and desert elimination through a Band 4/5 ratio:
h) Warm clouds from cold clouds distinction through a Band 5/6 Composite:
In Pass-1, pixels are classified into three categories: Clouds, Non-Clouds, and an Ambiguous Group that is analyzed further in Pass-2. The original algorithm is represented graphically in Fig. 2.
B. Hardware Architectures
We have selected three different types of hardware architectures to compare the parallel performance of the selected algorithms. Type 1 is a general purpose processor for professional data processing environments (for instance, workstations or HPC systems). Type 2 is a heterogeneous many-core HPC accelerator. We have selected the two architectures that have the highest share in the supercomputing market [19]: 1) Nvidia Kepler GPU, and 2) Intel Xeon Phi. Type 3 is a low power SoC with an embedded GPU accelerator.
This selection allows us to evaluate the potential of modern parallel hardware for mobile systems, such as the Nvidia Tegra K1, against state-of-the-art HPC processing hardware.
Tegra K1 is a mobile SoC which consists of a CPU component with 4+1 ARM Cortex A15 CPU cores (the "+1" stands for a low-power core used for running the operating system under very mild load) and a GPU component. The entire chip plus RAM requires only 7 W under heavy load. The GPU component is based on the Nvidia Kepler architecture. It contains only one Streaming Multiprocessor (SMX) with six groups of 32 stream processor (SP) cores (192 cores total). The main difference from the HPC Kepler GPU is that the K1 does not contain any double precision (DP) cores. This means that the chip has high potential for image processing, which requires mostly single precision, but is of no use for applications that rely on DP arithmetic.
The main hardware parameters of all platforms are shown in Table II .
1) Comparison of the Architectures: Peak Performance and Memory Bandwidth.
In terms of peak performance in SP, the K1 is 10.7 times slower than the K20m and 6.2 times slower than the Xeon Phi 5110p, but 2.2 times faster than the Sandy Bridge E5-2470. In terms of memory bandwidth, the K1 is 13.9 times slower than the K20m, 21 times slower than the Xeon Phi 5110p, and 2.5 times slower than the Sandy Bridge CPU.
Parallelism Expected from the Algorithms. Intel Xeon Phi and Nvidia Kepler architectures use a slightly different "language" to describe the number of cores. Nvidia calls one SP core a "CUDA core" and the K20m has 2496 of them. A CUDA core is in fact a scalar core, which can perform only one single precision (SP) operation per clock cycle. On the other hand, the Intel Xeon Phi 5110p has only 60 cores, but these cores contain single instruction multiple data (SIMD) units of width equal to 16 for SP operations. This is equivalent to 960 CUDA cores. This means that both architectures require a similar amount of parallelism.
In addition, the CUDA cores in the Kepler architecture are also organized in entities similar to SIMD lanes. These are called warps and their SIMD width is 32. Warps are groups of threads that have to execute an identical instruction in order to maintain optimal performance. Therefore, vectorization could be used in both architectures.
Power Consumption and Performance per Watt. The K1 is an SoC designed for low-power mobile and embedded systems. The entire Jetson TK1 development board [20] consumes ∼12.5 W (SoC + DRAM ∼7 W) under full load. Its comparison with the HPC hardware is shown in Table I, where the K1 has the highest performance per watt. In SP, it can perform 46.7 operations per watt, while the Tesla K20m can execute only 15.6 and the Intel Xeon Phi 5110p 8.9. The least efficient is the Intel Sandy Bridge general purpose processor. For double precision, the K20m is the most power efficient, since the K1's performance in DP is very low. More details on the power consumption of the Tegra K1 platform are shown in Table III.
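As a consistency check, the performance-per-watt figures above can be combined with the power figures quoted in this paper to recover the peak-performance ratios given earlier. The sketch below assumes that the per-watt values refer to 7 W for the K1 and to the 225 W TDP for both accelerators; the unit of the per-watt values is left as in the text. The names `PLATFORMS` and `implied_peak` are hypothetical.

```python
# (SP performance per watt, power in watts); all numbers are taken from the text.
PLATFORMS = {
    "Tegra K1": (46.7, 7.0),
    "Tesla K20m": (15.6, 225.0),
    "Xeon Phi 5110p": (8.9, 225.0),
}


def implied_peak(name):
    """Peak SP performance implied by performance per watt times power."""
    perf_per_watt, watts = PLATFORMS[name]
    return perf_per_watt * watts


k1 = implied_peak("Tegra K1")
for name in ("Tesla K20m", "Xeon Phi 5110p"):
    ratio = implied_peak(name) / k1
    print(f"{name}: {ratio:.1f} times the K1 peak SP performance")
```

Under these assumptions the ratios land close to the 10.7 and 6.2 quoted above; the small remaining difference is consistent with rounding of the per-watt figures.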
C. Test Data
For the wavelet spectral dimension reduction algorithm, we are not using any particular dataset for testing. Our implementation is tested in a data independent fashion, which means that it performs the wavelet decomposition and reconstruction by a certain number of levels for all pixels. This corresponds to the evaluation of the critical path of the algorithm.
For the ACCA algorithm, the tests were performed using the Landsat 7 (Landsat7-ETM instrument), Chesapeake Bay, Apr 2001 scene, see Fig. 3 , retrieved from NASA Imageseer [21] .
II. IMPLEMENTATION AND OPTIMIZATION
The main focus of the optimization part is to explore techniques that allow efficient utilization of the parallel hardware. Even though the architectures are different, they all use SIMD units. This means that the data are processed as short vectors where identical operations are executed on all elements. The number of elements per vector, or SIMD width, is 4 DP or 8 SP values for CPU, 8 DP or 16 SP for Xeon Phi, and 32 SP/DP for GPUs, see Table II . Both algorithms selected for this study have a very high degree of parallelism defined by the number of pixels and can be inherently vectorized. 
A. Conditional Statements and Vectorization
The efficiency of SIMD units is significantly reduced by conditional statements. Depending on the input data, conditional statements cause different code execution paths. In a parallel environment, where each thread processes different data points, this leads to situations in which different threads execute different code paths. In the case of nested conditional statements, each conditional statement can increase the number of different code execution paths by one. We will call the total number of different execution paths of a program the code divergence degree (CDD). In the text, we will also use the term "branching", which describes the phenomenon that code executes different code paths in parallel.
Different execution paths cause a significant performance penalty for vectorized code executed by SIMD units. These units execute the same instruction for all data elements of an SIMD vector, i.e., a vector of length equal to the width of an SIMD unit (for instance, the Sandy Bridge processor has an SIMD width of 4 for DP and 8 for SP). If a code is executed by an SIMD unit and its CDD is 1, then part of a vector is processed by one instruction and the other part of the vector is processed by a different instruction. These two instructions are executed sequentially. This means that the execution time is doubled and the vectorization efficiency is reduced to 50%. If the CDD is 3, the vectorization efficiency is 25%. For the Sandy Bridge processor, if the divergence degree is equal to 7, there is no performance gained by the vectorization.
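The arithmetic in this paragraph can be written as a small model. The sketch below is an idealization, not a measurement: it assumes that a code with a given CDD executes CDD + 1 instruction streams back to back, which is the reading that reproduces the 50% and 25% figures above. The function names are hypothetical.

```python
def vectorization_efficiency(cdd):
    """Fraction of SIMD work that is useful when cdd + 1 paths run sequentially."""
    return 1.0 / (cdd + 1)


def simd_speedup(simd_width, cdd):
    """Idealized speed-up over scalar code for a given SIMD width and CDD."""
    return simd_width * vectorization_efficiency(cdd)


for cdd in (0, 1, 3, 7):
    eff = vectorization_efficiency(cdd)
    print(f"CDD={cdd}: efficiency {eff:.1%}, SP speed-up on 8 lanes {simd_speedup(8, cdd):.1f}")
```

With eight SP lanes, a CDD of 7 gives an idealized speed-up of 1, which corresponds to the statement that vectorization brings no gain on the Sandy Bridge processor in that case.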
B. Wavelet Spectral Dimension Reduction of Hyperspectral Imagery
There are three compute intensive kernels that are used by the algorithm: 1) the decimation kernel, which reduces the pixel size by one half; 2) the reconstruction kernel, which increases the pixel size by a factor of 2; and 3) the correlation kernel, which compares the reconstructed pixel with the original one. The way these kernels are used to build the entire algorithm is shown in Fig. 1. The remaining part of the code just controls how many levels of dimension reduction are executed based on the output of the cross-correlation function. This part has very little overhead.
1) Parallelization and Vectorization Approach:
For the Tesla K20m and the Tegra K1, the code is developed using the compute unified device architecture (CUDA) [22]. It describes the parallelism using the single instruction multiple threads (SIMT) programming model. For Intel hardware, the selected programming model is the Intel Cilk Plus language extension [23], [24], in particular its array notation, in combination with the OpenMP threading model implemented in the Intel compiler suite. These have been used to develop the decimation and the reconstruction kernels.
The cross-correlation function in the correlation kernel contains several summation functions, so the performance of this kernel relies on an efficient implementation of reduction operations. In the case of CUDA, the CUDA Unbound (CUB) library [25] has been used. This library contains block reduction functions developed particularly for the Kepler architecture and provides very high performance for both the Tesla K20m and the Tegra K1. Cilk Plus contains highly optimized reduction primitives for both the Sandy Bridge CPU and the Xeon Phi accelerator. As their performance is very high, there is no need to use any external library of reduction primitives.
In the case of the dimension reduction algorithm, the vectorization is used across the spectral bands within a pixel, when computing wavelet coefficients. The pixels are processed independently by threads for the CPU or the Xeon Phi. For CUDA, the spectral bands within a pixel are processed by CUDA threads and pixels are processed by CUDA blocks. There are no conditional statements within the vectorized section of the code that could reduce vectorization efficiency.
Table IV shows how much time is spent in each kernel if the dimension reduction algorithm is executing all five levels as shown in Fig. 1. The number of bands per pixel is 256. Table V shows that for the Sandy Bridge CPU and the Xeon Phi accelerator, most of the execution time is spent in the reconstruction kernel, 56% and 58%, respectively. On the other hand, in the case of the Tesla K20m and the Tegra K1, the correlation kernel is the most time consuming, 62% and 64%, respectively. This reflects the differences in the tested architectures, in this particular case, the amount of fast on-chip memory. For GPU hardware and the CUDA programming language extension, the data transfers between the off-chip and on-chip memories are directly controlled by the programmer. This is not possible in the case of the CPU or Intel Xeon Phi architectures, which use transparent caches, where these transfers are controlled by the hardware.
For the GPU, we can perform the following analysis. The entire pixel, i.e., all N bands, is transferred from off-chip to on-chip memory by the decimation kernel during the first level of the wavelet decomposition. Any further executions of this kernel, i.e., wavelet decomposition levels 2 to 5, read input data from the on-chip memory. This makes the first execution of the decimation kernel the most expensive. This kernel also stores output data back to off-chip memory at every level: N/2 bands are stored on level one, N/4 bands on level two, N/8 bands on level three, etc. For the analysis shown in Fig. 4, we have moved all data transfer operations out of the decimation kernel. The reconstruction and correlation kernels use data from on-chip memory only. The amount of time spent in each kernel with separated data transfer operations for 128, 256, and 512 bands per pixel for the Tegra K1 is shown in Fig. 4. Fig. 4 shows that with a growing number of levels, the decimation kernel consumes less time, while the reconstruction kernel consumes more time. The reconstruction kernel requires more computations because for five levels it is executed 15 times (once for level 1, twice for level 2, etc.). The decimation kernel, like the correlation kernel, is executed only once per level.
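The counts quoted in this analysis follow directly from the level structure. A minimal sketch of the bookkeeping is given below; the function names are hypothetical and the code only counts kernel executions and stored bands, it does not model execution time.

```python
def kernel_invocations(levels):
    """Kernel executions per pixel when every level is reconstructed and checked."""
    decimation = levels                         # once per level
    reconstruction = sum(range(1, levels + 1))  # level k needs k inverse passes
    correlation = levels                        # once per level
    return decimation, reconstruction, correlation


def bands_stored_per_level(num_bands, levels):
    """Bands written back to off-chip memory at each level: N/2, N/4, ..."""
    return [num_bands >> level for level in range(1, levels + 1)]


print(kernel_invocations(5))
print(bands_stored_per_level(256, 5))
```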
For every pixel that is processed, a GPU chip has to allocate three buffers of size equal to the number of spectral bands N inside the on-chip memory. These buffers store 1) the original spectral bands, used to load data from off-chip memory; 2) the decimated spectral bands for all five levels, used to store output data; and 3) the fully reconstructed spectral signature that is used by the correlation function, i.e., temporary storage for the correlation. Therefore, it is very efficient if the processing unit is able to keep this data inside the fast on-chip memory. This can be achieved in the case of the CPU or the Intel Xeon Phi, but the K20m and K1 GPUs do not have enough on-chip storage, which results in a performance penalty. This penalty takes the form of a reduced number of pixels that can be processed in parallel due to the limited storage available for the three buffers.
A similar behavior can be observed in the results shown in Table V, where the Tesla K20m performs better if pixels have a lower number of spectral bands, while the Intel Xeon Phi outperforms it for the cases with a higher number of spectral bands per pixel.
Unlike the ACCA algorithm described in the following section, this algorithm does not have parameters that can be tuned to achieve better vectorization efficiency.
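The effect of the three per-pixel buffers on parallelism can be estimated with a short calculation. The sketch below assumes 4-byte single precision values and, purely for illustration, 48 KiB of on-chip shared memory per multiprocessor; that capacity is an assumption of this example and is not a figure taken from this paper. The function name `pixels_resident` is hypothetical.

```python
def pixels_resident(num_bands, on_chip_bytes, bytes_per_value=4, buffers=3):
    """Number of pixels whose three N-band buffers fit in on-chip memory at once."""
    per_pixel_bytes = buffers * num_bands * bytes_per_value
    return on_chip_bytes // per_pixel_bytes


ASSUMED_ON_CHIP_BYTES = 48 * 1024  # illustrative capacity, not from the paper

for bands in (128, 256, 512):
    print(bands, "bands:", pixels_resident(bands, ASSUMED_ON_CHIP_BYTES), "pixels in flight")
```

Doubling the number of bands halves the number of pixels that can be resident at the same time, which is the mechanism behind the penalty described above.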
C. ACCA Algorithm
The ACCA algorithm was selected because it is difficult to vectorize. The main reason is that it contains a large number of nested conditional statements (one for each threshold-based filter). The original version of the ACCA algorithm, as shown in Fig. 2, contains 11 filters which can generate a high number of different code execution paths. This results in 1) poor performance of SIMD units, and 2) a significant processing time variation, depending on the input data or, in other words, on the cloud coverage. As an example, see Fig. 5, which shows that the execution time variation of the original algorithm can be up to 15%. The variable execution time is another negative effect of a code with a high divergence degree. It is also a significant problem for onboard real-time processing systems, which is addressed in the next section.
1) Parallelization and Vectorization Approach:
To minimize the execution time variation of the original ACCA algorithm and to maximize the efficiency of the SIMD units, a new version of the ACCA algorithm, called Vectorized without Branching (VNB), was developed. In this version, the code execution path divergence is completely eliminated which significantly reduces the execution time variability.
In the VNB algorithm, all threshold-based filters are redesigned to avoid code execution path divergence by setting or resetting specific bits of a register, see Fig. 6. At the end, the register contains the value describing whether the pixel is a cloud or not. This means that all filters are executed for every pixel, which generates more work. But since this workload can be processed much more efficiently by the SIMD units, the VNB algorithm delivers a high processing rate for the selected architectures.
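The difference between a branching cascade and the bit-register approach can be sketched in a few lines. The example below is an illustration only: the threshold values, the mapping of failed filters to the Non-Cloud and Ambiguous categories, and all names are assumptions made for this sketch and are not taken from this paper or from Fig. 6. Python itself is not vectorized, so the code shows the logic rather than the performance.

```python
NON_CLOUD, AMBIGUOUS, WARM_CLOUD, COLD_CLOUD = 0, 1, 2, 3
EPS = 1e-12  # guards the ratios against division by zero


def filter_tests(b2, b3, b4, b5, b6):
    """All eight Pass-1 style tests, evaluated unconditionally (assumed thresholds)."""
    return (
        b3 > 0.08,                          # a) brightness
        (b2 - b5) / (b2 + b5 + EPS) < 0.7,  # b) snow index
        b6 < 300.0,                         # c) temperature
        (1.0 - b5) * b6 < 225.0,            # d) band 5/6 composite
        b4 / (b3 + EPS) < 2.0,              # e) band 4/3 ratio
        b4 / (b2 + EPS) < 2.0,              # f) band 4/2 ratio
        b4 / (b5 + EPS) > 1.0,              # g) band 4/5 ratio
        (1.0 - b5) * b6 < 210.0,            # h) cold versus warm clouds
    )


def classify_branching(b2, b3, b4, b5, b6):
    """Cascade with early exits: the path taken depends on the pixel."""
    t = filter_tests(b2, b3, b4, b5, b6)
    for i in range(3):
        if not t[i]:
            return NON_CLOUD
    for i in range(3, 7):
        if not t[i]:
            return AMBIGUOUS
    return COLD_CLOUD if t[7] else WARM_CLOUD


def classify_branch_free(b2, b3, b4, b5, b6):
    """Same result from a bit register, with no data-dependent control flow."""
    mask = 0
    for i, passed in enumerate(filter_tests(b2, b3, b4, b5, b6)):
        mask |= int(passed) << i            # set bit i when filter i passes
    hard = int((mask & 0b00000111) == 0b00000111)  # filters a-c all passed
    soft = int((mask & 0b01111000) == 0b01111000)  # filters d-g all passed
    cold = (mask >> 7) & 1
    return hard * (AMBIGUOUS + soft * (1 + cold))  # arithmetic select
```

Both functions return the same category for every input; the second one always performs the same sequence of operations, which is what keeps SIMD lanes or SIMT threads in lockstep.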
We have also implemented a Vectorized with Branching (VWB) version of the ACCA algorithm. The branching is executed per element, which is natural for the SIMT model used by CUDA, as well as for the CPU or Xeon Phi, if parallelization is expressed using Cilk Plus elemental functions [24]. This version can still exhibit a high CDD and, therefore, variations in the execution time.
The evaluation of the VNB and VWB versions for the Intel Sandy Bridge CPU and the Intel Xeon Phi is shown in Fig. 7. The results clearly show that the most promising method is VNB, since it has 1) the lowest time variation, ∼1.0% for the CPU and ∼3.7% for the Xeon Phi; and 2) in all but one case also the highest performance.
In the case of the Kepler architecture, the time variation is eliminated even more efficiently. For Tegra K1, the processing time variation as a function of cloud coverage is reduced from 8.7% to 0.2%, and the overall processing time is also reduced by up to 8.3% when compared to the VWB version. For K20m, the variation is reduced from 10% to 0.1%. For more details see Fig. 8 .
The vectorization efficiency can be evaluated only for the Intel Sandy Bridge and Intel Xeon Phi architectures. The GPU does not have a programming model that contains threads and SIMD lanes, but only SIMT threads.
The efficiency of the new vectorized ACCA algorithm for the Sandy Bridge processor is shown in Fig. 9, where the achieved speed-up is between 4.7 and 5.8 depending on the input data. The processing is done in SP, where the SIMD lane width is equal to 8; therefore, the efficiency is up to 73%. The main reason for the lower efficiency is the parallel overhead: in order to vectorize the algorithm, the vectorized version has to do more work than the original nonvectorized version.
A similar behavior can be observed for the Xeon Phi architecture, see Fig. 10 . In this case, the speedup achieved by vectorization is up to 7.6. Since the Xeon Phi SIMD width for SP is 16, the efficiency achieved is up to 47%.
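The efficiency figures quoted for both Intel architectures are simply the measured speed-up divided by the SIMD width. A minimal sketch of that calculation, using only the numbers given above and a hypothetical function name, is:

```python
def measured_efficiency(speedup, simd_width):
    """Achieved fraction of the ideal SIMD speed-up."""
    return speedup / simd_width


# (architecture, speed-up from vectorization, SP SIMD width)
cases = [
    ("Sandy Bridge, lowest", 4.7, 8),
    ("Sandy Bridge, highest", 5.8, 8),
    ("Xeon Phi, highest", 7.6, 16),
]
for label, speedup, width in cases:
    print(f"{label}: {measured_efficiency(speedup, width):.1%}")
```

The wider SIMD unit of the Xeon Phi yields a higher absolute speed-up but a lower fraction of its ideal value.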
III. RESULTS
To summarize, the proposed VNB version of the ACCA algorithm brings three major improvements: 1) it enables the execution of the algorithm on the K20m and K1 GPUs; 2) it significantly improves the performance, i.e., it provides a speedup of up to 5.7 for the CPU (see Fig. 9, scene with 26% cloud coverage); and 3) it reduces the processing time variation for different scenes, from 15.2% for the original algorithm (see Fig. 5, vectorized w/o branching) to 1.0% for the CPU (see Fig. 7 top, vectorized w/o branching), 0.1% for the K20m, and 0.2% for the K1 (see Fig. 8, vectorized w/o branching).
The overall performance comparison of the vectorized version of the ACCA algorithm for all four architectures and all datasets is shown in Fig. 11. The single Sandy Bridge CPU is used as a baseline, with a speedup equal to 1, and the Xeon Phi 5110p, Tegra K1, and Tesla K20m are compared to it. The high performance K20m accelerator is on average 6.1 times faster than the CPU, while the Xeon Phi 5110p is only 3.4 times faster. Using a dual processor system speeds up the execution by 1.85 times.
Very promising results are achieved by the Tegra K1. This chip, while consuming 13 times less power than the eight-core Sandy Bridge CPU, achieved 51% of the performance of this high-end processor. Table VI shows the performance per watt of each processor for the ACCA algorithm. We can see that the Tegra K1 is significantly more efficient in this metric.
The performance of the wavelet spectral dimension reduction algorithm for all four architectures and different numbers of wavelet decomposition levels is shown in Table VII. The number of spectral bands per pixel is 256. We can see that the performance of the Tegra K1 is 52% of the performance of one Sandy Bridge CPU for one level of decomposition and decreases with an increasing number of levels, down to 21% for five levels. For all levels, the Tesla K20m delivers the highest performance. Table VIII shows more detailed results for various numbers of spectral bands per pixel. It also includes power efficiencies for all scenarios. The implementation was tested in a data independent fashion, so that all pixels are reduced by five wavelet decomposition levels. For each level, the reconstruction to the original size and its evaluation using a cross-correlation function are performed.
The best results are highlighted in bold.
Unlike ACCA, this algorithm utilizes large CPU caches for high numbers of bands per pixel. This translates into the performance of the K1 being only 14% of that of the Intel Sandy Bridge CPU for 512 bands per pixel. For 256 bands per pixel, which would be the case if data from the AVIRIS sensor were used, the performance is 21% of the CPU. Taking into account the 13 times lower power budget, the new Tegra K1 SoC is 2.9 times more power efficient and shows great potential for onboard processing of complex algorithms. The Intel Xeon Phi performs less well for data with a small number of spectral bands per pixel (128 and 256), where it outperforms the Sandy Bridge CPU by only 1.12 times. Its potential is shown for 512 spectral bands per pixel, where it is the fastest of all compared architectures.
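The power-efficiency factor quoted here can be reproduced from the relative performance and the TDP values given in the Introduction (95 W for the Sandy Bridge CPU, 7 W for the Tegra K1). The sketch below assumes that TDP is the power figure behind the comparison; the function name is hypothetical.

```python
CPU_TDP_W = 95.0  # Sandy Bridge E5-2470, from the Introduction
K1_TDP_W = 7.0    # Tegra K1 SoC, from the Introduction


def relative_power_efficiency(perf_fraction_of_cpu):
    """Performance per watt of the K1 relative to the CPU."""
    return perf_fraction_of_cpu * CPU_TDP_W / K1_TDP_W


for label, fraction in (("512 bands", 0.14), ("256 bands", 0.21), ("ACCA", 0.51)):
    print(f"{label}: {relative_power_efficiency(fraction):.2f} times the CPU performance per watt")
```

For 256 bands per pixel this gives about 2.85, which rounds to the factor of 2.9 stated above.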
A. Comparison With FPGA Accelerator
In the case of the ACCA algorithm, we compared the performance of the tested hardware platforms with the FPGA implementation presented in [1]. These results were measured in 2005 on an SRC-6E machine equipped with two Xilinx Virtex II-6000-4 FPGAs. This hardware was able to process 800 Mpix/s with a power consumption close to 200 W. Please note that the SRC-6E is a supercomputer with FPGAs used as accelerators, similarly to the Nvidia Tesla or the Intel Xeon Phi; it is not an embedded device.
In terms of absolute values, the Tesla K20m can process 4700 Mpix/s, Tegra K1 400 Mpix/s. Please note that the FPGA results were measured over 10 years ago. But we still believe this is a valuable comparison.
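The absolute throughput figures can be turned into throughput per watt for a side-by-side view. The sketch below uses only numbers quoted in this paper and assumes the TDP values (225 W and 7 W) for the two GPUs and the approximate 200 W consumption for the SRC-6E; the names are hypothetical.

```python
# (throughput in Mpix/s, power in watts); values as quoted in the text
SYSTEMS = {
    "Tesla K20m": (4700.0, 225.0),
    "Tegra K1": (400.0, 7.0),
    "SRC-6E (two FPGAs)": (800.0, 200.0),
}


def mpix_per_second_per_watt(name):
    """ACCA throughput divided by the power figure assumed for the system."""
    throughput, watts = SYSTEMS[name]
    return throughput / watts


for name in SYSTEMS:
    print(f"{name}: {mpix_per_second_per_watt(name):.1f} Mpix/s per watt")
```

Under these assumptions the Tegra K1 processes roughly 57 Mpix/s per watt, about 2.7 times the value of the Tesla K20m, while the SRC-6E delivers twice the absolute throughput of the Tegra K1.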
IV. CONCLUSION
This paper evaluates the feasibility of a new mobile many-core architecture, the 192-core GPU of the Tegra K1 SoC, for onboard processing, using two remote sensing algorithms. In order to gain optimal performance, we had to redesign the original algorithms to support SIMD processing. The Tegra K1 achieved a performance of 1) 51% for the ACCA algorithm and 2) up to 24% for the dimension reduction algorithm, as compared to the performance of the high-end 8-core server Intel Xeon CPU. Both algorithms use only the GPU part of the SoC, leaving the 4+1 ARM Cortex A15 general-purpose cores available for other tasks.
This paper also presents the performance evaluation of the state-of-the-art heterogeneous accelerators Nvidia Tesla K20m and Intel Xeon Phi 5110p. In both tests, the Tesla K20m achieved better performance, as both algorithms rely on SP processing and the peak performance of the Tesla K20m is 1.5 times higher than that of the Xeon Phi 5110p.
We have also evaluated the power efficiency of all four architectures based on their TDP. For both algorithms, the highest performance per watt was delivered by the Tegra K1. In the case of the ACCA algorithm, the Tegra K1 was 2.7 times more efficient than the Tesla K20m and 4.9 times more efficient than the Xeon Phi 5110p. In addition, we have compared these performances with the FPGA accelerated HPC system SRC-6E. This comparison shows that this 10-year-old high-end HPC machine accelerated with two FPGA coprocessors was approximately two times faster than the Tegra K1 and provided similar performance to today's single high-end general purpose Xeon Sandy Bridge processor.
Lubomir Riha is a Senior Research Scientist at IT4Innovations National Supercomputing Center in Ostrava, Czech Republic (IT4I). Previously, he was a Research Scientist in the High Performance Computing Lab, Department of Electrical and Computer Engineering, George Washington University, Washington, DC, USA. His research interests include acceleration of scientific applications using multi- and many-core architectures, e.g., GPU and Intel Xeon Phi. He is a Principal Developer of the Intel Parallel Computing Center, where his team works on the acceleration of finite element tearing and interconnect (FETI) sparse solvers using the latest Intel many-core accelerators, and is developing an interface for community codes. He is also an investigator of the FP7 EXA2CT and HORIZON 2020 READEX projects. The goal of the READEX project is to improve the energy efficiency of applications in the field of high-performance computing. The EXA2CT project supported the development of the massively parallel multilevel FETI solver for Exascale machines. The current version, which scales to tens of thousands of compute nodes, is now implemented in the ESPRESO library.
