Vision-based navigation has become increasingly important in a variety of space applications for enhancing autonomy and dependability. Future missions, such as active debris removal for remediating the low Earth orbit environment, will rely on novel high-performance avionics to support advanced image processing algorithms with substantial workloads. However, when designing new avionics architectures, constraints relating to the use of electronics in space present great challenges, further exacerbated by the need for significantly faster processing compared to conventional space-grade central processing units. With the long-term goal of designing high-performance embedded computers for space, in this paper, an extended study and trade-off analysis of a diverse set of computing platforms and architectures (i.e., central processing units, multicore digital signal processors, graphics processing units, and field-programmable gate arrays) are performed with radiation-hardened and commercial off-the-shelf technology. Overall, the study involves more than 30 devices and 10 benchmarks, which are selected after exploring the algorithms and specifications required for vision-based navigation. The present analysis combines literature survey and in-house development/testing to derive a sizable consistent picture of all possible solutions. Among others, the results show that certain 28nm system-on-chip devices perform faster than space-grade and embedded CPUs by 1−3 orders of magnitude, while consuming less than 10 Watts. FPGA platforms provide the highest performance per Watt ratio.
plex", and "autonomous", conceal increased processing workloads and decreased time budgets. Put into perspective, conventional space-grade processors, such as LEON3 and RAD750, can provide only a very limited level of performance (e.g., 50−400 DMIPS or Dhrystone Million Instructions per Second), whereas the COTS desktop central processing units (CPUs) used by algorithm designers during the early development stages of VBN techniques provide 10−100x more speed when processing images. Even with the latest space-grade devices, such as LEON4 and RAD5545 that achieve 10x more speed compared to their predecessors, or even with embedded COTS processors, the current on-board CPU performance seems to be one order of magnitude less than that necessary to execute computationally expensive vision algorithms at adequate frame rates. Therefore, to support reliable autonomous VBN, the community should consider advancing the space avionics architectures by including hardware accelerators, such as field-programmable gate arrays (FPGAs), graphics processing units (GPUs), or multi-core very long instruction word (VLIW) digital signal processors (DSPs). In short, VLIW DSPs and GPUs consist of a number of cores that execute larger or smaller instructions and facilitate data parallelization. In contrast, FPGAs consist of a network of much smaller processing/storage elements (such as flip-flops), which can be combined to create larger components with very deep pipelining and parallelization at multiple levels.
Designing new high-performance avionics architectures for embedded computing becomes particularly challenging when one is also required to meet stringent constraints related to space. The spacecraft resources available to electronic systems are restricted [2] , with the three most critical limiting factors being mass, power and volume. Power availability in space is low due to solar powering and power dissipation capabilities, whereas volume and mass have enormous costs per unit. Furthermore, the engineers must factor in reconfigurability, application-design complexity, fault/radiation tolerance, vibration tolerance, component pricing, etc. Last but not least, one of the most crucial factors is the achieved processing power. All of the aforementioned parameters are competing against each other in a mission-specific trade-off manner, which becomes even more complex when the types of the aforementioned hardware accelerators are included in the architecture.
Paper Contribution and Outline
In this paper, we perform a trade-off analysis for an extended number of processing units, both COTS and rad-hard. We consider image processing and computer vision algorithms for VBN, i.e., computationally intensive benchmarks, while we pay particular attention to the performance and power parameters. We define a set of representative benchmarks by studying the application scenario of VBN and collecting generic specifications and commonly used algorithms. Subsequently, we survey, test in-house, and compare a wide range of microchips falling into seven main categories:
space-grade CPUs, embedded CPUs, desktop CPUs, mobile GPUs, desktop GPUs, single- and multicore DSPs, and FPGAs. Notice that, besides selecting a specific device, it is of utmost importance to determine which of these architectural categories is the most promising for future mission scenarios and thus should be further supported/exploited. Put into context, our preliminary work at ESA has shown that although multi-core DSPs are gaining the attention of the space industry, with 2−3 platforms currently being developed/considered, they still have insufficient processing power for high-performance applications. For example, ESA internally considers the space-oriented scalable sensor data processor (SSDP) platform based on a LEON3-FT together with Xentium DSPs [12], but this does not meet the high-performance requirements of VBN. ESA also considers the latest space-grade GR740 CPU, which is likewise not sufficient for VBN, as parallelism cannot be exploited and the projected clock speed is still too low. In the past, the most frequently considered type of hardware for intensive applications was the FPGA, e.g., Xilinx Virtex-5QV and Microsemi RTG4, or the very recent European NanoXplore NX-eFPGA (dubbed BRAVE or NG-MEDIUM) [13]. Today, besides space-grade devices, cutting-edge technology includes system-on-chip (SoC) off-the-shelf devices, which constitute a very attractive solution for certain space missions, such as ESA's e.Deorbit ADR [14].
The survey and analysis performed in the current paper will be useful for a variety of missions. It provides an in-depth exploration of a vast solution space, which can then be given as input to a focused trade-off for the selection of avionics based on the custom-weighted criteria of a particular mission. For simplicity and in order to allow the consideration of many devices, we assume here generic mission specifications with demanding VBN, like in the ADR scenario: low Earth orbit, short-term life and increased VBN autonomy. Such specifications call for increased accuracy/flexibility of VBN, high-resolution images and high frame rates, all of which contribute to increased computational workload. The assumed high-level architecture is preliminary, based on previous implementations/experience, state-of-the-art image processing and available technologies, as well as the future roadmaps.
The paper is organized as follows: Section II outlines the computer vision algorithms commonly employed for VBN with an emphasis on ADR, Section III surveys the computing platforms available for use in space, Section IV performs the benchmarking and comparison of the devices, and Section V highlights the most important results and concludes the paper.
II. Algorithms for Vision-based Navigation in Orbit
Keeping track of the position of an orbiting spacecraft with respect to its surroundings is one of the key ingredients of reliable spacecraft navigation. Accurate localization information is necessary to ensure that the spacecraft follows its intended path and operates in proximity with other objects while avoiding collisions [15]. The present section reviews computer vision techniques which estimate the relative pose between a chaser and a target space resident spacecraft in the context of autonomous rendezvous, inspection or docking scenarios involving orbiting spacecraft [16]. Although this review focuses on the ADR scenario, other applications of vision-based navigation in space rely on analogous principles and employ similar visual primitives. Such applications include, for example, obstacle avoidance, mapping and visual odometry for planetary rovers [17, 18] or approach velocity estimation and localization for precision landing [18] [19] [20]. As a result, the choice of a processing platform suitable for ADR is also highly relevant to visual navigation applications for planetary exploration rovers and landers.
Fig. 2 serve as features for estimating the satellite's pose relative to the camera [21].
Relative pose refers to the 6 DoF position and attitude parameters defining the geometrical relationship between the two spacecraft. The pose estimates can be used in a closed control loop to guide the motion of the chaser spacecraft in order to safely approach the target or collect it as debris. Relevant approaches can be characterized by the type of sensor they employ, their operating range and by whether the target spacecraft is "cooperative", i.e. carries markers or reflectors that are installed on it with the aim of facilitating pose estimation by the chaser spacecraft. The desired operating distance relates to the choice of sensor, sensory processing method and the type of information which can be extracted from the input [22]. The established consensus is that a single sensor will not perform adequately across the entire operating range; therefore, autonomous rendezvous and docking requires the deployment of a suite of multiple sensor systems, each with a different range and potentially a different principle of operation [1]. The most commonly used types of sensors are passive cameras and laser range finders.
In this review, we limit ourselves to passive camera sensors applied to pose estimation at distances below 100 m. The choice of cameras is motivated by the fact that they are attractive for use in ADR due to their compact size, low cost, simple hardware, low power consumption and potential for dual use as navigation and visual inspection instruments. Furthermore, we focus on non-cooperative targets since ADR will almost certainly concern not specially prepared or partially unknown targets.
Estimating the pose of a target in these settings is quite challenging, because no strong prior information regarding its attitude and position is available. The situation is aggravated due to the varying illumination conditions, largely varying chaser-target distances that result in dramatic aspect changes, self-occlusions, background clutter, unstabilized target motion and measurement noise. Such challenges call for elaborate and, therefore, computationally demanding processing of images. A variety of algorithmic approaches to visual pose estimation with the purpose of docking have been proposed. We note that some of these rely on prior knowledge for pose initialization, i.e. cannot handle lost-in-space configurations. These approaches as well as the majority of working systems account for the temporal continuity of a target's motion by using tracking [23] . In this respect, by integrating past estimates, tracking can deliver a better estimate compared to that directly determined with single frame pose estimation methods. At the same time, it can provide a prediction for the next frame which typically accelerates pose estimation since it bounds considerably the pose search space. Vision-based pose estimation approaches can be broadly classified into two categories, namely monocular and stereo-based, depending on whether they are based on images from a single or several cameras. A comprehensive discussion is offered by Petit [24] and a brief overview follows next.
A. Monocular approaches
Monocular approaches to pose estimation employ a single camera and typically rely on the assumption that a 3D model of the target is known beforehand. Depending on the approach used, the 3D model can be in the form of an unordered set of points (i.e., point cloud) and their image descriptors, a wireframe, or a polygon mesh. Use of a properly scaled model helps to resolve the well-known depth/scale ambiguity [25]. Monocular approaches infer pose by matching features detected in an image with the model. Before providing more details, a short introduction to image features is given first.
Specific structures in an image give rise to features. For example, sparse distinctive points, also called "interest points", "keypoints" or "corners" in the vision literature, refer to salient image locations with sufficient local variation [26] . Keypoints can be accurately localized in an image and their local descriptors can be robustly matched from moderately different viewpoints. Edges are another type of feature occurring when intensity changes sharply. Edges correspond to maxima of the gradient magnitude in the gradient direction [27] and are moderately robust against noise, illumination, and viewpoint changes. In general, point and edge detection is computed by using 2D convolutions with derivative filters. Popular algorithms for the detection of keypoint and edge features are those by Harris [28] and Canny [27] , respectively.
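As a rough illustration of how keypoint detection reduces to 2D convolutions with derivative filters, the following minimal sketch computes a Harris-style corner response with NumPy. The Sobel derivative filters, the 5x5 binomial smoothing window, the constant k = 0.04 and all function names are illustrative assumptions; this is not the implementation benchmarked in this paper.

```python
import numpy as np


def convolve2d_same(img, kernel):
    """Plain 2D convolution with zero padding; output has the size of img."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode="constant")
    flipped = kernel[::-1, ::-1]  # a true convolution flips the kernel
    out = np.zeros(img.shape, dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            out += flipped[dy, dx] * padded[dy:dy + img.shape[0],
                                            dx:dx + img.shape[1]]
    return out


def harris_response(img, k=0.04):
    """Harris corner response R = det(M) - k * trace(M)^2 for every pixel."""
    img = img.astype(float)
    # Two derivative convolutions (Sobel filters, assumed here).
    sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ix = convolve2d_same(img, sobel_x)
    iy = convolve2d_same(img, sobel_x.T)
    # Three smoothing convolutions build the local structure tensor M.
    g = np.array([1, 4, 6, 4, 1], dtype=float)
    window = np.outer(g, g) / 256.0
    sxx = convolve2d_same(ix * ix, window)
    syy = convolve2d_same(iy * iy, window)
    sxy = convolve2d_same(ix * iy, window)
    det = sxx * syy - sxy ** 2
    trace = sxx + syy
    return det - k * trace ** 2
```

In this sketch the response is positive near corners, negative along straight edges and zero in flat regions, so corners can be selected by thresholding and non-maximum suppression, which are omitted here.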
Feature-based approaches to monocular pose estimation operate by matching the detected image features with certain locations from the target's known 3D model. Pose is then estimated so that it accounts for the observed arrangement of the detected features in an acquired image. Proposed approaches differ according to whether the image-to-model correspondences are explicitly or implicitly established. Explicit methods strive to establish (usually one-to-one) correspondences of keypoint locations, relying on their feature descriptors or their spatial arrangement [29] [30] [31] [32] . Implicit methods attempt the establishment of correspondences through the matching of less localized features, such as edges, contours, or line segments [21, [33] [34] [35] [36] . Aiming to further increase robustness, approaches such as [37] [38] [39] which combine edge and point features have also been proposed. We note that despite the impressive results obtained in various application domains with state of the art keypoint detectors and descriptors, their use in space is hindered by the fact that the target objects of interest lack strong texture and are often subject to illumination changes and specular reflections. Therefore, approaches based on edges are more frequently employed for ADR.
A few approaches such as [40] [41] [42] track features over time and employ simultaneous localization and mapping (SLAM) or structure from motion (SfM) techniques to estimate the 3D structure and dynamics of the target object. Pose is again computed by matching observed features to the estimated structure. Such approaches do not require any prior information about the appearance of the target, therefore are most suitable when the state of the latter is uncertain or unknown.
However, their increased flexibility comes at the price of higher complexity and risk of failure.
B. Stereo approaches
Stereo vision employs triangulation to provide estimates of the distances from the camera of the target's visible surfaces. The primary challenge in order to achieve this concerns the determination of correspondences among the pixels appearing in a binocular image pair [43] . Correspondence establishment involves local searches for identifying the minimum of a matching cost function that measures the affinity of pixels in the two images. For efficiency, these searches are restricted in one dimension along image scanlines, after the input images have been rectified. Commonly used stereo matching costs are the sum of squared differences (SSD), the sum of absolute differences (SAD) or the normalized cross correlation (NCC) [43] . Knowledge of the baseline, i.e. the distance between the cameras, via calibration, permits scale to be determined.
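The scanline search described above can be sketched as follows. This is a minimal, unoptimized illustration that uses the SAD cost on already rectified images. The window size, the disparity convention (a left pixel at column x matches the right pixel at column x − d) and the function names are assumptions made for this example only.

```python
import numpy as np


def sad_disparity(left, right, max_disp, half_win=2):
    """Brute-force SAD block matching along scanlines of rectified images."""
    left = left.astype(float)
    right = right.astype(float)
    h, w = left.shape
    disp = np.zeros((h, w), dtype=int)
    for y in range(half_win, h - half_win):
        for x in range(half_win + max_disp, w - half_win):
            ref = left[y - half_win:y + half_win + 1,
                       x - half_win:x + half_win + 1]
            best_cost, best_d = np.inf, 0
            # 1D search: candidates lie on the same scanline of the right image.
            for d in range(max_disp + 1):
                cand = right[y - half_win:y + half_win + 1,
                             x - d - half_win:x - d + half_win + 1]
                cost = np.abs(ref - cand).sum()  # sum of absolute differences
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulated depth Z = f * B / d for a rectified stereo pair."""
    return focal_px * baseline_m / disparity_px
```

Replacing the SAD line with squared differences or a normalized correlation yields the SSD and NCC variants. The nested loops make the cost of the search explicit: every pixel requires one window comparison per disparity level.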
To estimate pose from stereo data, either the actual surface or a set of 3D features extracted from it are matched with a 3D model of the target (see [44] and references therein). Stereo sensing is sensitive to the lack of texture and possible illumination artifacts, such as shadows and specular reflections, which are common in space environments. Furthermore, the camera resolution limits the depth resolution, i.e. the smallest change in the depth of a surface that is discernible by the stereo system. Depth resolution is also proportional to the square of the depth and is inversely proportional to the baseline. For these reasons, stereo measurements are more accurate at fairly short distances from a target. This is in contrast to monocular approaches presented in Section II A, which are also usable at the far end of the distance range.
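The dependence of depth resolution on depth and baseline follows from differentiating the triangulation relation Z = f·B/d with respect to the disparity d, which gives |ΔZ| ≈ Z²·Δd/(f·B). A minimal sketch of this first-order approximation is given below; the function and parameter names are hypothetical and the numbers in the usage note are purely illustrative.

```python
def depth_resolution(depth_m, focal_px, baseline_m, disparity_step_px=1.0):
    """First-order depth resolution |dZ| = Z^2 * dd / (f * B).

    depth_m           -- distance Z to the surface, in metres
    focal_px          -- focal length f expressed in pixels
    baseline_m        -- distance B between the two cameras, in metres
    disparity_step_px -- smallest measurable disparity change, in pixels
    """
    return depth_m ** 2 * disparity_step_px / (focal_px * baseline_m)
```

For example, with an assumed focal length of 1000 pixels and a 0.5 m baseline, doubling the distance from 10 m to 20 m quadruples the smallest discernible depth change, whereas doubling the baseline halves it.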
Indicative stereoscopic approaches to pose estimation are briefly discussed next. Stereo images are used in [45] to reconstruct 3D data which are then matched with an object model using the iterative closest point (ICP) algorithm to estimate pose. Sparse 3D point clouds recovered with stereo have been employed to estimate pose, for example in [46] where the 3D points originate from natural keypoints and [47] which determines image points as the intersections of detected line segments. To achieve more dense reconstruction, [48] suggests the combination of conventional feature-based stereo with shape-from-shading. It is noted that all stereo approaches still need to solve the problem of establishing correspondences between the 3D data and the object model.
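To make the model-matching step more concrete, the following is a minimal sketch of point-to-point ICP, assuming small point sets, brute-force nearest-neighbour search and a reasonable initial alignment. It is a textbook variant written for illustration (all names are hypothetical) and is not the implementation used in [45].

```python
import numpy as np


def best_rigid_transform(src, dst):
    """Least-squares R, t such that R @ src[i] + t ~ dst[i] (SVD-based)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    h = (src - src_c).T @ (dst - dst_c)        # 3x3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))     # guard against reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = dst_c - r @ src_c
    return r, t


def icp(data, model, iterations=20):
    """Align reconstructed 3D points (data) to model points; returns R, t."""
    r_total, t_total = np.eye(3), np.zeros(3)
    current = data.copy()
    for _ in range(iterations):
        # Correspondence step: closest model point for every data point.
        d2 = ((current[:, None, :] - model[None, :, :]) ** 2).sum(axis=2)
        matches = model[d2.argmin(axis=1)]
        # Alignment step: best rigid motion for the current correspondences.
        r, t = best_rigid_transform(current, matches)
        current = current @ r.T + t
        # Accumulate the overall pose estimate.
        r_total, t_total = r @ r_total, r @ t_total + t
    return r_total, t_total
```

The sketch makes the two alternating steps explicit: establishing correspondences between the 3D data and the model, and solving for the rigid motion. The quadratic cost of the brute-force correspondence search is what practical implementations reduce with spatial data structures.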
Approaches which estimate pose by using stereo to detect particular structures on the target satellite have also been proposed, e.g. [49]
based processor is LEON4-FT, e.g., the GR740 at 65nm, which includes a quad-core operating at MIPS even with single-core execution. Processors from ARM are now being considered in multiple European space projects, e.g., ARM4Space (H2020), ASCOT (CNES), as well as for the on-board computer of the Ariane 6 launch vehicle. ARM's Cortex-A9 processor is suitable for low-power applications and has been used in multiple mobile phones, as well as in CubeSats. NASA [55] has exposed the Snapdragon APQ8064 IFC6410 board to Ar at LET = 7 MeV·cm²/mg. However, up to now, only the ARM Cortex-M0 has been implemented as radiation tolerant by triplication, to provide no more than 50 DMIPS. In more powerful CPU-based scenarios, the P4080 board is based on an octo-core PowerPC plus a Xilinx Virtex-5 FPGA for voting and provides up to 27.6 DGIPS, however, at the cost of more than 25 Watts power dissipation. At a similar power budget, the Intel Atom TMR board provides up to 3.3 GIPS, with radiation tolerance being provided through triplication. Finally, we mention that ESA's GAIA mission uses a video processing unit with a SCS750 board (commercial PPC750FX) providing 1,800 DMIPS at 800 MHz, but consuming 33 Watts and requiring mitigation via triple redundancy of the PPC plus voting with an Actel FPGA to withstand radiation.
Beyond the CPU-based approach, the most common general-purpose accelerators are FPGAs, VLIW DSP multi-core processors, as well as GPUs, which, however, are not widely considered for space applications due to their increased power dissipation. The VLIW DSP solution is being examined through European projects such as MACSPACE (with 64 X1643 DSP cores from CEVA) and MPPB/SSDP [56] (with Xentium VLIW DSP cores from Recore), or devices such as the 8-core C66x DSP Texas Instruments 66AK2H12, as summarized in Table 2. The conventional single-core VLIW space-grade DSP devices (e.g., TI SMV320C6727B) deliver inadequate processing power to be considered for high-performance applications. Instead, TI 66AK2H12 offers two orders of magnitude more processing power, but is not yet space-qualified. Similarly, at only 10 Watts, the European MACSPACE 64-core DSP will also provide one order of magnitude higher performance than single-DSP solutions.
Regarding space-grade FPGAs ( Additionally, the authors of [57] propose to combine certain functionality of the COTS Zynq such as internal watchdogs with mitigation techniques on the FPGA side in order to increase the resilience of the system to space radiation. Overall, the volume of this avionics module is 17 × 17 × 5 cm³, whereas its power dissipation is around 5 Watts. The CHREC space processor [58] is also based on Xilinx Zynq-7000. The developed board includes rad-hard flash memory and has a form factor of 1U.
This means that it fits in a 10 × 10 × 11.35 cm³ CubeSat, weighing less than 0.1 kg and consuming around 2 Watts of electrical power. NASA's Goddard centre proposes using SpaceCube [59, 60], which is an architecture based on space-grade FPGAs (e.g., Virtex-5QV
Following the preceding survey, we summarize a number of high-level choices for designing a high-performance architecture with: 1) many-core CPU, 2) CPU plus multi-core DSP, 3) CPU plus FPGA, 4) multi-FPGA including soft-core CPU, and 5) System-on-Chip. Choices 2−4 assume distinct devices interconnected with, e.g., a high-speed network like SpaceWire at 100 Mbps. The drawback of such an approach would be the increased size, mass and power dissipation due to the multiple chips involved. Moreover, co-processing bottlenecks could arise in cases 2−4 due to congestion or slow interconnects among the devices, which would result in limited communication bandwidth between intermediate steps of an algorithm. When considering multicores running SW, delivering a hard real-time system implies functional isolation together with time analysis against single-core chips. This is difficult due to state interruption between processes/applications, constraints of high-criticality over low-criticality applications, as well as the safety and worst-case execution time requirements of SW that are common practice in the space industry. Instead, FPGAs can avoid such complicated SW analysis, although they require increased development time. Soft-core CPUs (e.g., LEON synthesized in an FPGA) provide one order of magnitude lower performance than hard-core IPs integrated in a SoC device (e.g., ARM Cortex A9 in Zynq), which can execute more demanding functions in a HW/SW co-design scenario. By taking into account the above factors, we identify the SoC approach as the most promising for high-performance embedded computing. SoC devices offer a wide range of advantages owing to the coupling of acceleration resources with general purpose processors and peripherals. They can lead to single-device architectures with low power, small size/weight and rapid intra-chip communication able to support efficient co-processing.
IV. Comparison and Evaluation of Processing Platforms
This section compares candidate platforms, primarily, in terms of throughput, power, performance per Watt, and secondarily, in terms of connectivity, size/mass, and error mitigation opportunities. We evaluate their capabilities in relation to image processing and computer vision tasks, with the intention of selecting the most suitable platform for the future space missions under consideration. Towards a fair/accurate comparison, we combine various literature results and in-house development/testing, while we utilize a number of benchmarks common to all platforms. Table 4 summarizes the devices and algorithms used in our study. Specifically, we examine 30 representatives from all four processor categories (namely CPU, FPGA, GPU and DSP) by utilizing more than 10 well-known benchmarks applied to images of size 512x384 and 1024x1024 pixels. These experience and code optimization level of each benchmark, c) the application/algorithmic domain targeted by each architecture, and d) the technology node and size of each chip. To tackle these issues, along with the plethora of chip−benchmark combinations that emerge, we follow a two-stage pruning method tailored to our goals. First, we perform a general "CPU vs. GPU vs. DSP vs. FPGA" evaluation, and then we proceed to a more in-depth comparison of the 4−5 most promising chips. To address d) in the second stage, we focus on the 28nm technology node and, in particular, SoC devices. To address c) in both stages, we focus on image processing. To address b), we select published works that devote equal effort to their examined platforms, avoiding comparisons of optimized CUDA vs. generic High-Level-Synthesis implementations, and we assign distinct, experienced developers to each platform tested in-house. To address a), we rely on the multitude of results and their convergence to consistent conclusions by experienced developers. Finally, we note that our primary goal is to derive the relative performance of the devices rather than absolute numbers for specific algorithms/datasets. Thus, we avoid running the exact same test on all devices, which would become impractical for 10 benchmarks, 2 image sizes and 30 chips. Instead, we proceed by extracting multiple individual performance ratios for key pairs of devices and then, as explained in section IV C, we combine all ratios/comparisons in a unified table.
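One simple way to picture the combination of pairwise ratios is to chain them through a common reference device. The sketch below is a hypothetical illustration of that idea rather than the exact procedure of section IV C: each entry states how many times faster one device is than another, and a breadth-first traversal expresses every reachable device relative to the chosen reference. The two example ratios are the DSP vs. Cortex A15 and Cortex A15 vs. Cortex A9 figures discussed in section IV A, with 4x assumed as the midpoint of the reported 3−5x range.

```python
from collections import deque


def speedups_vs_reference(pairwise, reference):
    """Chain pairwise speedups into speedups relative to one reference device.

    pairwise maps (faster_device, slower_device) -> speedup factor.
    When several paths reach a device, this sketch keeps the first one found;
    reconciling inconsistent paths is outside its scope.
    """
    graph = {}
    for (faster, slower), ratio in pairwise.items():
        graph.setdefault(slower, []).append((faster, ratio))
        graph.setdefault(faster, []).append((slower, 1.0 / ratio))
    speed = {reference: 1.0}
    queue = deque([reference])
    while queue:
        node = queue.popleft()
        for neighbour, factor in graph.get(node, []):
            if neighbour not in speed:
                speed[neighbour] = speed[node] * factor
                queue.append(neighbour)
    return speed


# Illustrative input taken from the literature discussion in section IV A.
literature = {
    ("Cortex-A15", "Cortex-A9"): 4.0,        # assumed midpoint of 3-5x
    ("C66x 8-core DSP", "Cortex-A15"): 9.0,  # reported average
}
table = speedups_vs_reference(literature, "Cortex-A9")
```

With these two inputs the chained estimate for the 8-core DSP relative to the Cortex A9 is 9 × 4 = 36, which matches the factor deduced in section IV A.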
For clarity of presentation, sections IV A and IV B report separately the most important results derived from the literature and the measurements performed in-house for both stages of the aforementioned comparison strategy. These sections focus on performance and power consumption, while combining absolute numbers and extrapolations/estimations to facilitate straightforward comparisons between the devices. Subsequently, section IV C merges all results to provide a unified perspective. Additionally, section IV C considers factors like size/mass of the COTS board, development time, COTS cost, and discusses suitability to space missions.
A. Literature results
The literature includes a multitude of implementations on various devices, which we use here to construct our first set of performance comparisons. Primarily, we collect results regarding platforms not available in-house and, secondarily, we include papers to enrich and verify our own tests.
Regarding mobile GPUs, the authors in [ Cortex-A9 (reduced at 667MHz), while for a more complex benchmark such as edge detection, the i5 CPU is 21−32x faster than the embedded Cortex-A9. The authors in [67] use N-body simulation as a benchmark and show that, for single-threaded execution, the Intel i5-2467M is 12.8x faster than the embedded Cortex-A9 (at 667MHz). Furthermore, they employ optimized CUDA coding (Compute Unified Device Architecture) on GPUs to show that the mobile Tegra K1 is 61.8x faster than Cortex-A9, whereas high-end desktop GPUs achieve a huge gain over A9, on the order of 1000x (e.g., 2032x vs. Nvidia Tesla K80 with nominal 300 Watts). We note that, in a broader sense [67, 68], the throughput of embedded CPUs could increase even by 10x when we consider higher clock rates, e.g., 2.3 GHz, and/or bigger micro-architectures, e.g., ARM Cortex-A57. However, for the sake of categorization, we focus here on more conservative scenarios, such as Cortex A9/A15 operating below 1 GHz.
Regarding high-end many-core DSPs, such as TI's TMS320C6678 and 66AK2H which are both utilizing the C66x core, the authors in [69] and [70] compare the 8-core C66x to ARM Cortex A15 and get ∼9x faster execution for various benchmarks (4−10x). By reducing MIPS and MHz, as well as by considering the results of [67], we derive that the A15 processor at 1.2GHz is 3−5x faster than the Cortex A9 of Zynq (at 667MHz), which we use hereafter as reference. Hence, we deduce that 66AK2H14 is ∼36x faster than A9. For hyperspectral unmixing, when comparing the 8-core C6678 at 1 GHz to ARM Cortex A9 at 1 GHz, the authors in [71] derive slightly smaller factors than the above averaged ∼36x. If we adjust their ARM A9 to 667 MHz and their 8-core C66x to 1.2 GHz as in 66AK2H, we conclude that the DSP is up to ∼25x faster than the 1-core ARM CPU. The authors in [72] use a stereo matching benchmark, which in terms of complexity is roughly similar to our own "Disparity" that is detailed in [73], on a 1-core C66x to achieve 14 fps for small images. They also show in [74] that the speedup with 8 cores is ∼4x, and hence, they would achieve ∼56 fps on 66AK2H14. cvCornerHarris on the C674x core [75], we can roughly estimate that 66AK2H14 will complete a 1 Megapixel image in ∼50msec, and hence, it is ∼4x slower than our Zynq implementation (see section IV B 2). The authors in [76] test the 8-core TMS320C6678 and show that, for frequency-domain FIR, it achieves 1/3 of the throughput of Nvidia Tesla C2050, whereas for a 25x25 convolution kernel, the TMS320C6678 achieves similar throughput to Nvidia GTX 295, or 1/3 of the throughput of the FPGA Stratix III E260, or 1/9 of that of our own Zynq-7000 implementations reported in section IV B 2 of the current paper (90msec vs. 10msec for 1080p images). It is also shown in [76] that code optimization is important, as it boosts the DSP performance by 2−3x. Finally, based on the results of [63] for matrix multiplication benchmarks, we conclude that 66AK2H14 achieves similar throughput to the Tegra K1 GPU and the single-threaded Intel Core i7-3770K. Regarding the 64-core DSP computer RC64 "MACSPACE" [77], in lieu of relative benchmarking, we consider the nominal values: 10 Watts for 20 single-precision GFLOPS (75GMACS/150GOPS) when operating at 300MHz. This FLOPS value is half that of 66AK2H14 (GMACS are similar) when also operating at 300MHz. Hence, for mixed fixed/floating point arithmetic benchmarks, we expect that RC64 will achieve at most similar throughput to 66AK2H14.
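The clock adjustments and core-count extrapolations used above can be sketched as simple proportional scaling. The helper below assumes that throughput scales linearly with clock frequency, which is a simplifying assumption of this example and only a first-order approximation in practice; the function names are hypothetical.

```python
def rescale_speedup(speedup, fast_clock_old, fast_clock_new,
                    slow_clock_old, slow_clock_new):
    """Rescale a 'fast device vs. slow device' speedup to different clocks.

    Assumes throughput is proportional to clock frequency on both devices.
    """
    return (speedup
            * (fast_clock_new / fast_clock_old)
            * (slow_clock_old / slow_clock_new))


def scale_frame_rate(single_core_fps, multicore_speedup):
    """Extrapolate a single-core frame rate with a measured multi-core speedup."""
    return single_core_fps * multicore_speedup


# DSP moved from 1.0 GHz to 1.2 GHz, Cortex A9 moved from 1.0 GHz to 0.667 GHz:
# any speedup measured at equal 1 GHz clocks is multiplied by this factor.
clock_multiplier = rescale_speedup(1.0, 1.0, 1.2, 1.0, 0.667)

# 14 fps on one C66x core with a ~4x speedup on 8 cores.
fps_8_cores = scale_frame_rate(14.0, 4.0)
```

Under this linear model, the clock adjustment multiplies a DSP vs. A9 factor measured at 1 GHz by roughly 1.8, and the stereo matching estimate becomes 14 × 4 = 56 fps, in line with the figure quoted above.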
Regarding direct comparisons of GPUs and FPGAs, the authors in [78] perform an extensive literature survey of such platforms for stereo matching benchmarks. They examine dozens of algorithms and implementations, but their results concern devices which are now 5−10 years old, e.g., Intel Core2duo, Nvidia GTX 280, and TI C64x+. When looking at the cloud of results collected in their study, as a general conclusion one can remark that GPUs and FPGAs are difficult to distinguish with respect to throughput: using the paper's metrics for the fastest 50% of the implementations in each category, GPUs achieve 0.1−7 Gde/sec, whereas FPGAs achieve 1−6 Gde/sec. This picture is consistent with our own findings. Similarly, for older devices and various image processing benchmarks such as stereo matching, face detection, object detection and beamforming, some authors like [79] show ±10% throughput differences when comparing specific devices, whereas others show differences around 2−4x, either favoring GPUs [80] or FPGAs [81]. The survey in [78] also includes DSP processors, which however are 1−2 orders of magnitude slower than GPUs and FPGAs. This gap seems to be narrowing due to the latest high-end many-core DSPs considered in the current paper (see section IV C). Considering vision algorithms for space [82], the desktop GPU achieved one order of magnitude higher performance/Watt than the desktop CPU, i.e., up to 12x with respect to GFLOPS and 49x with respect to GFLOPS/Watt. Finally, we note that for GPU-GPU comparisons, e.g., to combine results among papers and derive indirect comparisons, or to extend/support our own conclusions, we occasionally consider the variety of GPU benchmarks available online [62].
Harris corner detection (employing five 2D convolutions [17, 28]), hyperspectral searching (of prestored spectral signatures in a pixel-by-pixel fashion [83] by using metrics such as χ², L1, L2, cf. [84]), Canny edge detection [27] and binocular visual odometry [17] (including distinct algorithms for feature detection, feature description, temporal and spatial matching, pose estimation [85], outlier removal, etc.). The code used is fairly optimized and common among devices, with a reasonable level of compiler optimization (e.g., gcc version 4.6.3 with the -O3 flag), and the execution is single-threaded in all cases, so that more straightforward comparisons can be made.
Harris was implemented on the embedded Intel Quark X1000 using a Galileo board [86] [...]

For the space-grade CPUs, to evaluate the performance gain of the latest LEON generation, Table 5 analyzes the execution time of our visual odometry benchmark [17] when processing stereo pairs of 512x384 pixels. We remark that the first three columns/functions are executed twice and that the entire pipeline completes in 0.25 sec on an Intel i5-4590. On average, with limited deviation among functions, the LEON4 achieves a 9.2x gain due to its 4x higher clock rate and improved architecture.
If we assume a further clock increase to 600 MHz and very effective parallelization across 4 cores, then we could anticipate future gains of up to 100x. Even so, we note that such an impressive performance improvement will not be sufficient for high-definition image processing at high frame rates. For example, with the given algorithm, we would process only ∼1 frame per second at 512x384 pixels, which is one order of magnitude less than, e.g., 1 Mega-pixel at 5 fps. That is, the space-grade CPUs are still 1−2 orders of magnitude slower than necessary for computationally demanding vision-based navigation, and hence, some form of acceleration is essential.
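The size of this gap can be checked with a short pixel-rate calculation. The sketch below compares the anticipated ∼1 fps at 512x384 against the example target of 1 Mega-pixel at 5 fps. Treating "1 Mega-pixel" as exactly 10^6 pixels is an assumption, and the function name is illustrative.

```python
def pixel_rate(width, height, fps):
    """Pixels processed per second for a given frame size and frame rate."""
    return width * height * fps


# Anticipated capability: about 1 frame per second at 512x384 pixels.
achievable = pixel_rate(512, 384, 1.0)

# Example target: 1 Mega-pixel frames at 5 fps (assuming 1 Mega-pixel = 10^6 pixels).
target = 1_000_000 * 5.0

gap = target / achievable
print(f"achievable: {achievable:.0f} px/s, target: {target:.0f} px/s, gap: {gap:.1f}x")
```

Under these assumptions the target pixel rate is roughly 25x higher than the achievable one, i.e., between one and two orders of magnitude.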
FPGA
To perform in-house evaluation of high-performance FPGA platforms, we accelerate multiple benchmarks on Xilinx Virtex6 and Zynq-7000 devices with hand-coded hardware description language (VHDL). The majority of these accelerators were originally developed and carefully optimized within completed ESA projects targeting vision-based navigation for planetary rovers [88]. [...] Therefore, Zynq-7045 is 233x faster than Cortex A9 (2.8 sec per image) and 5−9x faster than desktop CPUs.
Furthermore, we also test stereo matching, which is a very intensive benchmark involving highly repetitive calculations. Specifically, our stereo matcher performs bi-directional disparity map generation with 7x7 Gauss-weighted cost aggregation; its architecture is detailed in [73]. For one pair of 1120x1120 8-bit pixel images with 200 disparity levels, a 4-engine parallelization on Virtex6 VLX240T requires 0.54 sec. Therefore, the FPGA is 29x faster than Intel i5 (see [90] for a detailed comparison) and approximately 7700x faster than the space-grade LEON3; the latter factor was derived by extrapolation from smaller images due to memory limitations on LEON3. Such huge factors can also be derived for Harris detection and SIFT description. For the former, comparing the aforementioned 12 msec total time of Zynq-7045 to the almost 27 sec of LEON3, which is an extrapolation based on image size and Table 5, leads to a factor of ∼2250x. Even when we estimate the performance on a space-grade FPGA like Virtex-5QV (52% LUT and 55% RAM utilization, with 14.7 msec processing time due to the 120 MHz clock), the total speedup is still ∼1500x. For the latter, we optimized our SIFT descriptor architecture, which is presented in [17], to use fixed-point arithmetic and a smaller description window (33x33), and hence, we improved the execution time by 30% on Virtex6, to 28 msec when processing 1200 features. When porting to Zynq-7045 (200 MHz) and assuming 4-engine parallelization, even after adding the 3.2 msec overhead of PS-PL communication, we obtain less than 10 msec total execution time, i.e., an estimated ∼3000x speedup versus LEON3 (vs. SIFT in Table 5).
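The speedup factors quoted above are plain ratios of execution times, as the following sketch reproduces from the stated figures. The variable names are illustrative. The last Harris line rests on an assumption that is not stated in the text: it adds a 3.2 msec communication overhead (the PS-PL figure quoted for SIFT) to the Virtex-5QV processing time, which happens to give a value close to the reported ∼1500x.

```python
def speedup(t_baseline_s, t_accelerated_s):
    """Ratio of baseline execution time to accelerated execution time."""
    return t_baseline_s / t_accelerated_s


# Harris: almost 27 sec on LEON3 (extrapolated) vs. 12 msec total on Zynq-7045.
harris_zynq = speedup(27.0, 0.012)

# Virtex-5QV estimate: 14.7 msec processing time at 120 MHz.
harris_v5qv_processing_only = speedup(27.0, 0.0147)

# ASSUMPTION: include a 3.2 msec communication overhead in the total time.
harris_v5qv_with_comm = speedup(27.0, 0.0147 + 0.0032)

# Stereo matching: baseline times implied by the stated factors and 0.54 sec on Virtex6.
implied_i5_stereo_s = 0.54 * 29
implied_leon3_stereo_s = 0.54 * 7700

print(f"Harris, Zynq-7045 vs. LEON3: {harris_zynq:.0f}x")
print(f"Harris, Virtex-5QV (processing only): {harris_v5qv_processing_only:.0f}x")
print(f"Harris, Virtex-5QV (with assumed overhead): {harris_v5qv_with_comm:.0f}x")
print(f"Implied stereo time on LEON3: {implied_leon3_stereo_s:.0f} sec")
```

The implied LEON3 stereo time of more than one hour per image pair illustrates why that factor had to be extrapolated rather than measured directly.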
To test more streaming applications relying on continuous pixel I/O, we assume a faster PS-PL communication scheme than the one implemented above using Xillybus [89]. We assume the 4 AXI VDMA scheme developed by Xilinx [91], which achieves a 12 Gbps transmission rate and another 12 Gbps for reception. In our hyperspectral benchmark, which involves matching incoming images to known spectral signatures, we develop in VHDL a large systolic network of processing elements to parallelize the pixel subtraction and metric accumulation (e.g., for $L_1$, the sum $\sum_i |x_i - y_i|$), as well as the comparison to distinct pixel-signatures. When we assume ordinary application scenarios and up to 12 Gbps PS-PL communication, the Zynq-7045 PL operates at up to 428 MHz and accelerates its own ARM Cortex A9 by 340−630x. For more complex scenarios, e.g., more elaborate metrics like $\chi^2$, or hundreds of examined signatures, or 100 Gbps bandwidth, the FPGA acceleration would approach $10^4$x; however, in the current paper we focus on the more general case for fair comparisons.

[...] bandwidth), therefore, the GPU is faster than Zynq by a factor of 1.8x. In terms of performance per Watt, the Zynq FPGA outperforms the GPU by a factor of more than 10x. The GTX680 is 10−20x faster than desktop CPUs and up to 500x faster than ARM A9. Compared to high-end mobile GPUs, we expect/estimate the GTX680 to be 6−10x faster than Tegra X1 due to its 7x bigger memory bandwidth and 6x more CUDA cores.
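As a software reference for the matching step of the hyperspectral benchmark, the sketch below compares one pixel spectrum against a set of prestored signatures and returns the index of the closest one. It is a sequential illustration of the arithmetic that the systolic network parallelizes, not the VHDL design. The function names and sample spectra are hypothetical, and the $\chi^2$ form used here is one common definition chosen as an assumption, since the text does not spell it out.

```python
def l1(x, y):
    """L1 distance: sum of absolute differences across spectral bands."""
    return sum(abs(a - b) for a, b in zip(x, y))


def l2(x, y):
    """L2 distance: Euclidean distance across spectral bands."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def chi2(x, y):
    """Chi-square distance (assumed form): sum of (a - b)^2 / (a + b)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)


def best_match(pixel, signatures, metric=l1):
    """Index of the signature with the smallest distance to the pixel."""
    distances = [metric(pixel, s) for s in signatures]
    return min(range(len(signatures)), key=distances.__getitem__)


# Hypothetical 4-band signatures and one incoming pixel spectrum.
signatures = [[10, 20, 30, 40], [40, 30, 20, 10], [25, 25, 25, 25]]
pixel = [12, 19, 33, 38]
print(best_match(pixel, signatures, metric=l1))
```

Each pixel requires one subtraction and one accumulation per band and per signature, so the workload grows with image size, band count, and signature count, and all three dimensions can be processed in parallel in hardware.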
Besides Harris, we also use image super resolution (SR) and stereo matching ("Disparity" [73]) to compare desktop GPUs to FPGAs. We tested a variety of devices and performed in-depth optimization of the implemented benchmarks on both platforms; a detailed analysis has been published separately [90]. Here, we consider Nvidia GTX 670/960, whereas for the FPGA (Zynq7000 Artix) we consider both processing and communication time, in contrast to the study in [90] that focuses on processing.
DSP
To perform in-house evaluation of high-end many-core DSP-oriented platforms, we implemented Harris corner detection on the 12-core VLIW Movidius Myriad2 [92]. By applying a number of source code modifications to exploit the underlying architecture, we efficiently utilized the limited local memory and exploited the long instructions of Myriad2. In particular, we first identified the most frequently accessed data structures and placed them in the local memory. Second, we identified and removed storage redundancies. Third, we partitioned the image into 12 slices and assigned them to distinct cores. Fourth, we applied platform-oriented modifications, such as employing one 2D [...]

Myriad2 is 20−25x faster than such an embedded CPU. This factor increases to around 30x for a single 11x11 convolution, when Myriad2's time becomes similar even to that of Zynq, which suffers from its PS-PL communication overhead when sending small workloads to PS. Finally, we note that the clock rate on Myriad2 is programmable, e.g., 200−600 MHz, with the overall performance decreasing proportionally and the power consumption decreasing down to 0.5 W.
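The third step above, partitioning the image into 12 slices for distinct cores, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the Myriad2 code: the slices are taken to be horizontal strips, and the optional halo of extra rows, which a core would need to read so that a convolution near a slice boundary sees its neighbors, is an addition not described in the text. The function name is hypothetical.

```python
def slice_rows(height, n_slices, halo=0):
    """Split image rows into contiguous slices, one per core.

    Returns a list of (start, stop, read_start, read_stop) tuples:
    rows [start, stop) are written by the core, while rows
    [read_start, read_stop) are read, including the halo rows.
    """
    base, extra = divmod(height, n_slices)
    slices = []
    start = 0
    for i in range(n_slices):
        rows = base + (1 if i < extra else 0)  # spread the remainder evenly
        stop = start + rows
        read_start = max(0, start - halo)
        read_stop = min(height, stop + halo)
        slices.append((start, stop, read_start, read_stop))
        start = stop
    return slices


# Example: a 1024-row image over 12 cores, with 2 halo rows for a 5x5 kernel.
for core, s in enumerate(slice_rows(1024, 12, halo=2)):
    print(core, s)
```

With 1024 rows and 12 cores, four slices receive 86 rows and eight receive 85, so the per-core workload differs by at most one row.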
In addition to high-end DSP, we also test the space-oriented massive parallel processor board (MPPB) [56], which integrates 2 VLIW Xentium DSPs running at 50 MHz. We ran simulations using the Xentium SDK environment to develop/profile 4 optimized code versions (specifically fixed-/floating-point, C and assembly) for 2D convolutions. Notice that these simulations serve only as a best-case scenario, because they discard communication delays, as well as LEON, AMBA and NoC latency. The final results show that MPPB requires 96−223 msec for one convolution with separable kernels of size 5x5 or 7x7 on a 1024x1024 8-bit image. Such performance is one order of magnitude lower than that of FPGAs, which are in the area of 10 msec or less. Furthermore, given that our Harris benchmark requires two 5x5 convolutions and three 7x7 convolutions, we estimate that MPPB requires almost 1 sec for a 1 Mega-pixel image. Hence, currently, it outperforms LEON4 by only 3x, whereas if we make use of multi-threading on LEON4, their performance will probably be similar. We note here that MPPB is expected to improve in future versions, e.g., with a 100 MHz clock. However, this is not sufficient to bridge the gap with Myriad2 or FPGAs.

[...] Table 6 (with rounding/simplifications where necessary). Therefore, all results are unified in a linear scale such that the ratio of any two values in Table 6 equals the performance ratio of the selected platforms. The second row reports nominal/measured power and the third row reports performance per Watt. This is the ratio of the first two rows; however, it is calculated separately for each device, because the range limits in each category do not necessarily correspond to the same device, e.g., in rad-hard CPUs, throughput=1.7 represents E698PM, but power=18 represents RAD5545.
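The reason for computing performance per Watt separately for each device can be illustrated with a small sketch. The three devices and their (throughput, power) values below are hypothetical placeholders and are not taken from Table 6. They only show that dividing the range limits of one row by those of the other gives a different, and misleading, range.

```python
# HYPOTHETICAL devices in one category: name -> (throughput, power in Watts).
devices = {
    "device_a": (1.7, 3.0),
    "device_b": (1.0, 18.0),
    "device_c": (0.5, 1.5),
}


def per_watt_range(devs):
    """Range of performance per Watt, computed per device (the correct way)."""
    ratios = [t / p for t, p in devs.values()]
    return min(ratios), max(ratios)


def naive_range(devs):
    """Ratio of range limits, which mixes values from different devices."""
    throughputs = [t for t, _ in devs.values()]
    powers = [p for _, p in devs.values()]
    return min(throughputs) / min(powers), max(throughputs) / max(powers)


print("per-device range:", per_watt_range(devices))
print("naive range:     ", naive_range(devices))
```

In this example the highest throughput and the highest power belong to different devices, so the naive upper limit pairs the throughput of one device with the power of another and describes no actual device.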
Notice that the purpose of Table 6 is to provide a succinct, coarse overview of the vast solution space by focusing mainly on the high-performance image processing capabilities of each category. It is intended neither as an exhaustive evaluation of all industrial chips nor as a rigorous comparison between categories. A more accurate comparison is provided in Table 7, for five selected devices, as a result of the second stage of our evaluation methodology.
C. Combination and Summary of Results
As shown in Table 6, for single-threaded execution, the space-grade CPUs achieve performance that is comparable to that of the embedded CPUs, albeit consuming more power. Altogether, space-grade and embedded CPUs are 1−2 orders of magnitude slower than desktop CPUs, which however suffer from size limitations and high power consumption and become inappropriate for space applications. Overall, the CPUs achieve the lowest, by 1−3 orders of magnitude, performance per Watt. [...] PS-PL communication), they achieve almost the highest throughput, which is close to, or in some cases even better than, that of desktop GPUs. Therefore, given the overview of Table 6, we now focus on FPGAs and high-end multi-core DSP processors as being the most suitable choices for the space avionics architecture. Table 7 focuses on one representative FPGA (Xilinx Zynq7045) and two multi-core DSP processors (namely the 12-core Movidius Myriad2 and the 8-core TI 66AK2H14), which for fairness are all SoC devices at the 28nm technology node. In addition, for reference, Table 7 includes one of the latest [...]

[...] Table 7, the FPGA achieves 10x better performance per Watt than all platforms except Myriad2, which is however almost 10x slower than the FPGA when considering high-definition images and increased FPGA parallelization/resources. If we assume a power budget of 10 Watts, then the FPGA would be the fastest choice, with 66AK2H14 being 4−6x slower. If we assume an even more limited power budget, e.g., 1−3 Watts, then Myriad2 improves the CPU throughput by ∼3x (or even by one order of magnitude considering the results of Table 5 instead of the aforementioned extrapolation). Nevertheless, the FPGA could further improve this throughput by 10x by paying only a small penalty in power dissipation. That is, the FPGA can achieve 1−2 orders of magnitude faster execution than any space-grade CPU with comparable power consumption.
On the other hand, even with throughput=20, the CPU is unlikely to meet the increased demands of future vision-based navigation. Altogether, today's multi-core DSP architectures approach the FPGA figures in terms of performance and performance/Watt, but not simultaneously for the same device. For instance, we must choose either Myriad2 or 66AK2H14 depending on whether power or speed is the preferred criterion. Moreover, we note that the DSP values of Table 7 are derived with the maximum clock frequency, i.e., 1.2 GHz for 66AK2H14. When/if this rate is decreased during space flight, e.g., for reliability reasons, the speed gap between DSPs and FPGA will widen proportionally; we observe here that the relatively low clock rate is an advantage of FPGAs. Therefore, when both power and performance are important, the FPGA is the most effective architecture/solution currently available.
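The proportional widening of the speed gap under clock derating can be sketched as follows. The calculation assumes that DSP throughput scales linearly with its clock and that the FPGA clock stays unchanged. The derated frequency of 0.8 GHz is a hypothetical value chosen only for illustration, and the function name is likewise illustrative.

```python
def derated_gap(gap_at_max_clock, f_max_ghz, f_derated_ghz):
    """Speed gap to the FPGA after lowering the DSP clock.

    Assumes DSP throughput is proportional to its clock frequency and
    that the FPGA operates at its original clock.
    """
    return gap_at_max_clock * (f_max_ghz / f_derated_ghz)


# 66AK2H14: 4-6x slower than the FPGA at its maximum clock of 1.2 GHz.
# HYPOTHETICAL derating to 0.8 GHz for reliability reasons.
low = derated_gap(4.0, 1.2, 0.8)
high = derated_gap(6.0, 1.2, 0.8)
print(f"gap after derating: {low:.1f}x to {high:.1f}x")
```

Under these assumptions, a derating from 1.2 GHz to 0.8 GHz would widen the 4−6x gap to about 6−9x.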
Beyond performance and power, we consider factors such as the connectivity of the COTS board, radiation tolerance, size/mass, re-programmability, development effort, and cost. The majority of available COTS boards provide multiple interfaces like Ethernet, USB, PCIe, etc. However, the FPGA boards offer the advantage of extra GPIO and FMC pins, allowing us to communicate with custom connectors and daughter-boards. A second advantage of FPGAs relates to the various mitigation techniques that can be applied for correcting radiation-induced errors [93]; these mitigation techniques can be applied almost at gate level within the same FPGA, and hence, more efficiently than in the case of CPU/core-based processors applying, e.g., triple core lock-step. Considering programmability, the dynamic reconfiguration of SRAM FPGAs renders them almost as useful as the remaining many-core platforms, even for remote updating of their firmware. Considering size/mass, the FPGAs are among the most competitive COTS boards (e.g., 5.72 × 10.16 cm² for the Zynq MMP Zedboard, or 7.8 × 4.3 × 1.9 cm³ for the Zynq Xiphos Q7, with a mass of only 24−65 g).
On the downside, as already mentioned, the development effort is increased for FPGAs by a factor of ∼4x compared to SW platforms. Still, efficient programming of many-core chips with multiple levels of parallelization also requires an increased amount of effort compared to conventional, serial SW coding.
V. Conclusion
The current paper performed a detailed survey and trade-off analysis involving an extended number of diverse processing units competing as platforms for space avionics. The set of candidate platforms included both rad-hard and COTS devices, of older and recent technology, such as CPUs, GPUs, multi-core DSPs, and FPGAs. Gradually, our analysis focused on high-performance embedded platforms, which were also compared to more conventional devices for the sake of perspective. The application scenario considered was vision-based navigation, for which we performed a distinct exploration to collect generic specifications and widely used algorithms, i.e., to define a set of representative benchmarks. Overall, the profiling of each device depended both on the benchmark complexity and the underlying HW architecture, and hence, the relative performance of the devices varied greatly among groups and created overlapping clouds of results (with size expanding even by 10x per group). The challenge was tackled in our comparative study by combining numerous literature results with in-house development/testing, which ultimately led to a large, consistent picture.
The results show that new generation space-grade CPUs are 10x faster than their predecessors; however, they are still one order of magnitude slower than what will be needed for reliable autonomous VBN. Therefore, we must design high-performance avionics architectures utilizing HW accelerators. In particular, instead of utilizing multiple chips, it is preferable to utilize SoC devices, which integrate general purpose processors and HW accelerators towards size/mass minimization, power reduction, fast intra-communication, re-programmability, and increased connectivity. Separated by orders of magnitude, the FPGA accelerators provide the highest performance per Watt across all platforms, whereas the CPUs provide the lowest, irrespective of their type. In terms of speed alone, high-end desktop GPUs and FPGAs are difficult to distinguish (as groups, they provide similar clouds of results). Likewise, high-end mobile GPUs and many-core DSPs are difficult to distinguish, although the latter have the potential for slightly better performance and power. In terms of speed alone, desktop GPUs and FPGAs are clearly better than mobile GPUs and many-core DSPs. In terms of power, desktop CPUs and GPUs seem prohibitive for space avionics; however, a 10 Watt budget is enough to allow many-core DSPs, mobile GPUs, or FPGAs to accelerate a conventional space-grade CPU by 1−3 orders of magnitude. In such a high-performance embedded computing scenario, with relaxed constraints on radiation tolerance due to mission specifications, it would be preferable to utilize COTS 28nm SoC FPGAs, which provide 2−29x faster execution than 28nm DSP processors.
