Abstract: Researchers, corporations, and government entities are seeking to deploy increasingly compute-intensive workloads on space platforms. This need is driving the development of two new radiation-hardened, multi-core space processors, the BAE Systems RAD5545™ processor and the Boeing High Performance Spaceflight Computing (HPSC) processor. Because these systems remained in the development phase, the Freescale P5020DS and P5040DS systems, based on the same PowerPC e5500 architecture as the RAD5545 processor, and the Hardkernel ODROID-C2, sharing the same ARM Cortex-A53 core as the HPSC processor, were selected as facsimiles for evaluation.
INTRODUCTION
Due to the harsh environment of space, the employment of radiation-hardened processors is essential to ensure success of many missions. Two of these processors currently in development, the BAE Systems RAD5545™ processor and the Boeing High Performance Spaceflight Computing (HPSC) processor, are the focus of this research. These new devices drastically improve performance compared to their predecessors and bring multi-core processor architectures to space-computing platforms. This additional performance enables computational loads that were previously deemed infeasible on radiation-hardened space platforms, including advanced sensor-data analysis, computer-vision applications, and autonomous spacecraft operations. These capabilities will equip a new generation of space systems to perform complex analysis on-board, effectively communicate key data, and make autonomous decisions for navigation and critical operations.
Many members of the scientific and aerospace research communities aim to employ sophisticated algorithms at larger scales for big-data processing in space. The increasing complexity and scope of systems and sensors push and often exceed the limits of current space-grade processors. The latest experiments often require larger data sets with long compute times and high resource requirements. If space is to continue to serve as a valuable domain for gathering scientific knowledge, the systems and tools employed must continue to evolve to allow for greater computational capability.
With radiation-hardened space processors crossing the boundary into multi-core architectures, shared-memory multiprocessing becomes a source of parallelism to exploit. The distribution of compute-intensive workloads across multiple processing cores can significantly reduce the impact of a lower clock frequency and achieve speedup over single-core execution. This approach enables the application of more advanced algorithms on larger data sets through on-board processing performance. This research seeks to investigate and compare parallel performance of the RAD5545 and HPSC processors through application benchmarking. This exploration will provide insight into the advantages and disadvantages of each platform and elucidate the new capabilities emerging for on-board processing.
This paper is organized as follows. Section 2 provides background information on radiation-hardened processors, shared-memory parallelism, and the platforms and applications employed in this study. Section 3 outlines the methods and procedures that were followed to realize a comparative analysis between the two competing platforms. Section 4 presents the results collected and incorporates observations and discussion. Section 5 presents conclusions and future work.
BACKGROUND
This section presents a cursory overview critical to the goals and motivations of this research. The fundamentals of radiation-hardened, space-grade processors are considered. Methods of enabling shared-memory parallelism through OpenMP are noted. Details on the platforms and applications investigated in this study are also shared in this section.
Radiation-Hardened, Space-Grade Processors
The latest space platforms for observation and science host a plethora of unique, sophisticated sensors. Some of these modern sensors can generate terabytes of raw data per day [1]. Transferring such large amounts of data would saturate even the highest bandwidth communication channels. This dilemma is compounded as the missions in need of the most radiation-hardened systems are typically probes or rovers with the farthest distance to travel and thus the lowest bandwidths over which to transmit. Previous research has considered the need for and benefit of on-board processing. Spaceborne high-performance computer systems facilitate applications of high computational complexity, such as sensor-data processing [2] or machine learning [3], which enable more innovative missions. For some distant missions, on-board processing and decision-making will become essential for even basic levels of operation [4].
Unfortunately, the harsh environment of space can be a difficult place for traditional computing devices to function. Impacts from particles like protons and heavy ions cause several types of single-event effects (SEEs). Temporary upsets or functional interrupts affect data or system integrity. More permanent and often destructive latch-ups, burnouts, and gate-ruptures can cause significant damage to the device [5]. Additionally, for long-term missions, the functional degradation of devices due to total ionizing dose (TID) of radiation becomes a serious issue. Typical radiation doses vary from as little as 0.1 krad per year in some low-Earth orbits to as much as 100 Mrad per pass for some Jupiter transfer orbits [6].
Some of the only computing systems capable of withstanding such harsh conditions are radiation-hardened, space-grade processors. The RAD6000™ radiation-hardened space processor was designed to handle a TID of greater than 1.0 Mrad(Si) with fewer than 7.4×10⁻¹⁰ upsets per bit per day. Unfortunately, it was only capable of up to 35 DMIPS (million Dhrystone 2.1 instructions per second) at 33 MHz [7], which is paltry compared to over 100,000 DMIPS for modern high-end processors [8]. Despite this lower performance, it achieved success in the Spirit and Opportunity Mars rovers as well as many other landers and probes [7]. The RAD750™, a predecessor to the RAD5545, can withstand a TID of up to 1.0 Mrad(Si) while delivering consistent computation with fewer than 1.6×10⁻¹⁰ upsets per bit per day. However, it is limited to approximately 400 DMIPS at 200 MHz [9] [10]. This processor has been employed in the Lunar Reconnaissance Orbiter, the GPS III modernization effort, and the Curiosity Mars rover [11]. Some missions, though, require several of these processors to meet computational needs, adding to the expense and complexity of the designed system [12]. The modern radiation-hardened processors explored in this study can withstand similar conditions while providing significantly higher computational capacity across multiple cores.
Previous research has been conducted to evaluate the performance of radiation-hardened processors. The study in [13] investigated the capabilities of the RAD5545 and several other CPU- and FPGA-based computing systems via performance metrics analysis. These metrics provide insight into performance characteristics such as computational density and memory bandwidth without requiring the device for analysis. That study is expanded in [14] to include kernel benchmarks on the same platforms. The research presented here focuses on application benchmarks to provide a more representative real-world assessment of the capability of these platforms.
Shared-Memory Parallelism with OpenMP
Parallel computing, once a niche discipline, is now ever expanding into a world of multi-core processors, massively parallel graphics processing units, and a myriad of hardware accelerators. This parallelization has allowed engineers to overcome the barriers that slowed performance gains in the processors of the past. Some complex algorithms and applications are now only realizable in given time constraints with sufficient parallelization [15]. The next generation of space processors has been equipped with immense capacity for multi-core data processing. Due to communication overhead and architectural limitations, performance does not necessarily scale with the increased number of cores. This variability in parallel performance presents the need for deeper study and analysis of different applications employed on these architectures, a need that this research is intended to help address.
There are many practical methods for parallelizing software across multiple processing units. The most commonly applied are the message-passing and shared-memory models [16]. Shared-memory models are used when compute nodes possess a common memory space, allowing operations to be conducted without the need for data transmission to and from separate nodes. The most widely used variant of this model is Open Multi-Processing (OpenMP), which allows for parallelism via compiler directives and multithreading [17]. The techniques involved in this study's approach will be based primarily in OpenMP due to the multi-core, single-node architecture of the examined space processors.
Scheduling is another factor in parallelization that affects how a problem is divided and how well a parallel program performs. OpenMP's default scheduling methodology is the static division of computation evenly across all cores at compile time. Dynamic scheduling refers to the process of OpenMP assigning small segments of the job to each core during run-time as pieces are completed. The dynamic approach allows processing to be split more evenly across time at the cost of some run-time scheduling overhead.
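The difference between the two scheduling modes can be sketched in C with OpenMP directives. This is a minimal illustration, not the benchmark code from this study: the `row_work` function, the chunk size of 4, and the function names are hypothetical. When compiled without OpenMP support (for example, without GCC's `-fopenmp` flag), the pragmas are ignored and both functions run serially with identical results.

```c
/* Hypothetical per-row workload: cost grows with the row index, so rows are
   deliberately uneven (an assumption made only for illustration). */
static long row_work(int row) {
    long acc = 0;
    for (int i = 0; i <= row * 100; ++i) {
        acc += i % 7;
    }
    return acc;
}

/* Static scheduling: rows are split into roughly equal contiguous blocks,
   one block per thread, before any row is processed. */
long sum_rows_static(int rows) {
    long total = 0;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int r = 0; r < rows; ++r) {
        total += row_work(r);
    }
    return total;
}

/* Dynamic scheduling: each idle thread requests the next chunk of 4 rows
   at run time, which balances uneven rows at some scheduling cost. */
long sum_rows_dynamic(int rows) {
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
    for (int r = 0; r < rows; ++r) {
        total += row_work(r);
    }
    return total;
}
```

Both functions return the same sum; only the assignment of rows to threads, and therefore the wall-clock time on uneven workloads, differs.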
Platforms
To assess performance of the BAE Systems RAD5545 and Boeing HPSC processors during their development phases, platforms of similar architecture were selected as facsimiles upon which to perform comparative application benchmarking. For the RAD5545 processor, the PowerPC e5500-based Freescale P5020DS and P5040DS systems were selected. For the HPSC processor, the ARM Cortex-A53-based Hardkernel ODROID-C2 was employed. Applications were also run on a standard x86-64-based Intel Core i7 desktop workstation for a baseline performance comparison. Specifications of these platforms can be referenced in Table 1.
The RAD5545 is a radiation-hardened-by-design, space-grade processor. The device is designed for extreme reliability, with fewer than 2×10⁻⁹ upsets per bit per day, a TID rating of 1 Mrad(Si), and immunity to latch-up. This system is also specifically designed for on-board processing applications, equipped with four RAD5500™ Power Architecture processor cores to conduct computations in an efficiently parallel manner. This processor is capable of 5.6 GOPS (billions of operations per second), 3.7 GFLOPS (billions of floating-point operations per second), and up to 1398 DMIPS per core at 466 MHz [13], for a total of 5592 DMIPS. Its capability is aided by three levels of cache as well as the ability to interface with other devices via Serial RapidIO for high-speed communication [18]. As this device was not yet available at the time of this study, its performance was approximated using commercially available processors.
The P5020 and P5040 systems served as useful facsimiles for the RAD5545 processor because, combined, they employ all the components of interest present in the RAD5545 processor. Although nearly identical, the P5020 and P5040 systems differ in both the number of processing cores available and the nature of their interconnects. The P5020 system only has two e5500 processor cores but possesses Serial RapidIO interconnects [19]. The P5040 system lacks Serial RapidIO but contains four e5500 processor cores [20].
Boeing's HPSC processor is a similar radiation-hardened-by-design, space-grade processor currently in development. The HPSC processor was originally conceived to meet the mutual needs of the National Aeronautics and Space Administration (NASA) and the United States Air Force Research Laboratory (AFRL) for next-generation space processing capabilities. The device's requirements aim for eight cores per "chiplet." Overall, the processor is capable of a performance of up to 15 GOPS and attaining 1840 DMIPS per core at 800 MHz, 7360 DMIPS per ARM Cortex-A53 cluster, or 14,720 DMIPS per chiplet. Use of ARM's single-instruction, multiple-data (SIMD) NEON accelerators can yield up to 100 GOPS. The system is intended to perform at 1×10⁻¹⁰ upsets per bit per day or fewer, exhibit a TID of 1 Mrad(Si), and incorporate latch-up immunity. HPSC chiplets are designed to be scalable via interconnection through several high-speed interfaces, including Ethernet, PCIe, and Serial RapidIO [21]. Integrated fault-tolerance will enable error detection and correction, checkpoint and rollback functionality, and n-modular redundancy [22]. The HPSC processor's performance must also be approximated by commercial devices, as it is in even earlier development phases than the RAD5545 processor.
Many current devices employ a system-on-chip (SoC) containing the ARM Cortex-A53 processor architecture upon which the HPSC processor is based. The most accessible Cortex-A53 derivative to this research group is the Hardkernel ODROID-C2 platform, which boasts a quad-core ARM Cortex-A53 processor equivalent to half of a chiplet in the HPSC processor. Space-processing solutions based on the HPSC processor are projected to scale to multiple chiplets, each including two ARM Cortex-A53 quad-core processor clusters coupled with an advanced microcontroller bus architecture (AMBA) interconnect for symmetric multiprocessing operation. The ODROID-C2 only serves as a facsimile for a portion of a single chiplet. This approach is considered valid within the scope of this research due to the fair comparison it permits with the quad-core RAD5545 processor. Further, existing studies, such as [23] , note that substantial overhead is incurred by parallelization over the AMBA interconnect. Simple applications parallelized across the HPSC processor's two Cortex-A53 clusters can experience significant reductions in speedup. It may be more effective to confine some applications to a single quad-core region to maximize parallel efficiency. For example, attaining speedups of up to 3.9 for two applications on each quad-core processor at once may be a significantly better use of resources than speedups that remain in the range of four to five across all eight cores.
Notable differences to highlight between the P5020 and P5040 systems and ODROID-C2 include the employment of an L3 cache in the P5020 and P5040 systems and differences in external and internal memory bandwidth. The ODROID-C2 excels with respect to internal memory bandwidth, with 416 GB/s compared to 119 and 270 GB/s for the P5020 and P5040 systems, respectively. The P5020 and P5040 systems are superior in external memory bandwidth, with 21.3 and 25.6 GB/s, respectively, compared to 1.5 GB/s for the ODROID-C2. The significant difference in thermal design power (TDP) between the P5020 and P5040 systems in comparison to the ODROID-C2 should also be noted, with the latter consuming significantly less than the P5040 system's energy needs.
Effectively comparing commercial-off-the-shelf (COTS) platforms and radiation-hardened derivatives is nontrivial. Despite the common architectures shared by the P5020, P5040, and RAD5545 processors or the ODROID-C2 and HPSC processors, the process of radiation hardening yields a device with substantial differences in performance and power characteristics. These discrepancies make final performance of the device difficult to predict. Due to architectural similarities, the performance data garnered from the facsimiles in this study is considered the best available basis for forecasting the performance of these radiation-hardened devices.
Applications
This research evaluated five applications in comparative benchmarking, including color search, hyperspectral imaging (HSI) linearly-constrained minimum variance (LCMV) beamforming, Mandelbrot set generation, Sobel filter, and image thumbnailer applications. Many of these applications were selected due to their relevance for numerous space mission scenarios. The test image used for most applications as well as output images from each of the applications are visible in Figure 1 .
The image thumbnailer application performs bilinear interpolation to resample an input image to an output of lower resolution, creating a thumbnail. These thumbnails are useful in space use cases for creating low-resolution versions of images for verification before downloading the full-resolution version, which takes much longer. A demonstration thumbnail for the previously presented input image can be referenced in Figure 1(b). The task of image thumbnailing was parallelized simply by the horizontal lines of the image. Although the load for the image thumbnailer was evenly balanced, better performance was observed with the use of dynamic scheduling, and thus this modification was included in the employed thumbnailer application.
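A minimal sketch of row-parallel bilinear downsampling in C follows. It is an illustration under stated assumptions, not the thumbnailer used in this study: the function name `thumbnail_bilinear`, the single-channel 8-bit image layout, and the corner-aligned coordinate mapping are all hypothetical choices made to keep the example short.

```c
#include <stdint.h>

/* Resample a grayscale image (src_w x src_h) to (dst_w x dst_h) using
   bilinear interpolation. Output rows are distributed with dynamic
   scheduling, mirroring the choice described in the text. */
void thumbnail_bilinear(const uint8_t *src, int src_w, int src_h,
                        uint8_t *dst, int dst_w, int dst_h) {
    /* Corner-aligned mapping: the first and last output pixels land on
       the first and last input pixels. */
    const float x_ratio = (dst_w > 1) ? (float)(src_w - 1) / (float)(dst_w - 1) : 0.0f;
    const float y_ratio = (dst_h > 1) ? (float)(src_h - 1) / (float)(dst_h - 1) : 0.0f;

    #pragma omp parallel for schedule(dynamic)
    for (int y = 0; y < dst_h; ++y) {
        const float fy = (float)y * y_ratio;
        const int y0 = (int)fy;
        const int y1 = (y0 + 1 < src_h) ? y0 + 1 : y0;
        const float wy = fy - (float)y0;

        for (int x = 0; x < dst_w; ++x) {
            const float fx = (float)x * x_ratio;
            const int x0 = (int)fx;
            const int x1 = (x0 + 1 < src_w) ? x0 + 1 : x0;
            const float wx = fx - (float)x0;

            /* Blend horizontally on the two neighboring rows, then vertically. */
            const float top = src[y0 * src_w + x0] * (1.0f - wx) + src[y0 * src_w + x1] * wx;
            const float bot = src[y1 * src_w + x0] * (1.0f - wx) + src[y1 * src_w + x1] * wx;
            dst[y * dst_w + x] = (uint8_t)(top * (1.0f - wy) + bot * wy + 0.5f);
        }
    }
}
```

Because each output row reads only from the input and writes only its own pixels, the rows can be processed independently without synchronization.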
The color search application employed is a simple image-processing program that performs an exhaustive search of an image for a specified color value. The Euclidean distance between the color of each pixel in the image and a desired search pixel is calculated by the method described in [24]. If any pixel's distance is within a preset threshold, that pixel is highlighted in the output image to indicate a match. An example of the color search, a search for clouds in Earth-observing imagery, is depicted in Figure 1(c). It should be noted that five, ten, and fifteen percent thresholds are denoted in this test as red, yellow, and green highlighting, respectively. The color search was parallelized via OpenMP with the image being evenly divided statically across the cores by horizontal lines.
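The thresholded distance test can be sketched as follows. This is a hedged illustration only: it assumes a plain Euclidean distance in 8-bit RGB space and an absolute (not percentage) threshold, which may differ from the method of [24], and the names `color_matches` and `color_search` are hypothetical. Comparing squared distances avoids a square root per pixel without changing the result.

```c
#include <stdint.h>

/* Returns 1 when the Euclidean RGB distance between px and target is at
   most `threshold`, using squared values to avoid a square root. */
int color_matches(const uint8_t px[3], const uint8_t target[3], int threshold) {
    const int dr = (int)px[0] - (int)target[0];
    const int dg = (int)px[1] - (int)target[1];
    const int db = (int)px[2] - (int)target[2];
    return (dr * dr + dg * dg + db * db) <= threshold * threshold;
}

/* Exhaustively searches an interleaved RGB image, writing 1 into mask for
   each matching pixel and 0 otherwise. Rows are divided statically across
   threads. Returns the number of matching pixels. */
long color_search(const uint8_t *rgb, int width, int height,
                  const uint8_t target[3], int threshold, uint8_t *mask) {
    long matches = 0;
    #pragma omp parallel for schedule(static) reduction(+:matches)
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const int idx = y * width + x;
            const int hit = color_matches(&rgb[3 * idx], target, threshold);
            mask[idx] = (uint8_t)hit;
            matches += hit;
        }
    }
    return matches;
}
```

Every pixel costs the same amount of work, which is why an even static division by rows is a reasonable fit for this kind of search.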
The Sobel filter application performs edge detection on an image, which is computed in this case by performing a pair of two-dimensional convolutions with a window size of 3×3. Calculations are performed on the intensity of each of the pixels within the window to determine a gradient for change in intensity in the x and y directions for each channel. The magnitude of these gradients highlights areas corresponding to edges, as observed in Figure 1(d). Parallelization of the Sobel filter also divides processing statically by horizontal lines in the image.
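A minimal sketch of the pair of 3×3 convolutions is shown below. Several simplifications are assumptions of this example rather than details of the benchmarked application: the image is single-channel 8-bit, border pixels are set to zero, and the gradient magnitude is approximated by |gx| + |gy| (a common substitute for the Euclidean magnitude) and clamped to 255. The function name `sobel_gray` is hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Applies horizontal and vertical Sobel kernels to a grayscale image and
   writes an approximate gradient magnitude to dst. Rows are divided
   statically across threads. */
void sobel_gray(const uint8_t *src, uint8_t *dst, int width, int height) {
    static const int kx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static const int ky[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};

    #pragma omp parallel for schedule(static)
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            /* The 3x3 window does not fit on the border; emit zero there. */
            if (y == 0 || y == height - 1 || x == 0 || x == width - 1) {
                dst[y * width + x] = 0;
                continue;
            }
            int gx = 0;
            int gy = 0;
            for (int j = -1; j <= 1; ++j) {
                for (int i = -1; i <= 1; ++i) {
                    const int p = src[(y + j) * width + (x + i)];
                    gx += kx[j + 1][i + 1] * p;
                    gy += ky[j + 1][i + 1] * p;
                }
            }
            const int mag = abs(gx) + abs(gy);
            dst[y * width + x] = (uint8_t)(mag > 255 ? 255 : mag);
        }
    }
}
```

A flat region yields zero in both gradients, while a sharp vertical edge saturates the output, which matches the edge-highlighting behavior described above.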
Hyperspectral imaging is the process of capturing images concurrently from many different spectral bands. The spectral profiles of the image are then used to identify objects and/or classify which materials are present at certain locations. This process can be used to build terrain maps or measure the advance of urbanization, deforestation, or glacial melt, among other object-sensing applications. The LCMV beamforming algorithm is a supervised-classification method that only requires spectral information for the targets to be detected. This application was developed as a benchmark in [25] and later parallelized. Most of the execution time consists of matrix multiplications, which are easily parallelized to provide a significant performance increase. A processed output image from the data set used in this study, colorized via the MATLAB imagesc function, is pictured in Figure 1 (e).
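Since matrix multiplication dominates the execution time, a row-parallel multiply gives a sense of how that portion can be distributed across cores. The sketch below is a generic illustration and not the benchmark's implementation (which relies on GSL); the function name `matmul`, the row-major layout, and the plain triple loop are assumptions.

```c
/* Computes c = a * b for row-major matrices, where a is n x k, b is k x m,
   and c is n x m. Each output row is independent, so rows are divided
   across threads. */
void matmul(const double *a, const double *b, double *c, int n, int k, int m) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < m; ++j) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p) {
                sum += a[i * k + p] * b[p * m + j];
            }
            c[i * m + j] = sum;
        }
    }
}
```

Each thread writes a disjoint set of output rows, so no reduction or locking is needed.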
The Mandelbrot set fractal generator application was included in this study due to its embarrassingly parallel nature and its use of intensive double-precision floating-point computations. Construction of a Mandelbrot set consists of checking points in a complex plane under the iteration $z_{n+1} = z_n^2 + c$ with $z_0 = 0$. If a point c yields a bounded sequence, that point is a part of the set. As the inclusion of one point is separate and does not depend on the inclusion of others, the problem can be easily parallelized and is considered embarrassingly parallel [26]. A fractal generated by this software can be referenced in Figure 1(f). The Mandelbrot set application was developed from examples accessible at [27] with the OpenMP parallelization verified by [26]. With its processing also divided by horizontal lines of the image, the Mandelbrot set demonstrated uneven load distribution. Greater computational density near the center "bulb" of the fractal was accounted for using dynamic scheduling.
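The membership test and its row-wise parallelization can be sketched as follows. This is an illustrative example rather than the code from [27]: the function names, the viewing region of [-2.5, 1.0] × [-1.25, 1.25], and the escape radius of 2 (a standard choice) are assumptions.

```c
/* Iterates z = z^2 + c from z = 0 and returns the number of iterations
   completed before |z| exceeds 2, capped at max_iter. A return value of
   max_iter means the point is treated as belonging to the set. */
int mandelbrot_iters(double cr, double ci, int max_iter) {
    double zr = 0.0;
    double zi = 0.0;
    int n = 0;
    while (n < max_iter && zr * zr + zi * zi <= 4.0) {
        const double next_zr = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = next_zr;
        ++n;
    }
    return n;
}

/* Fills out[y * width + x] with iteration counts. Rows near the set take
   far more iterations than rows far from it, so rows are handed out
   dynamically to balance the load. */
void mandelbrot_image(int *out, int width, int height, int max_iter) {
    #pragma omp parallel for schedule(dynamic)
    for (int y = 0; y < height; ++y) {
        const double ci = -1.25 + 2.5 * ((double)y + 0.5) / (double)height;
        for (int x = 0; x < width; ++x) {
            const double cr = -2.5 + 3.5 * ((double)x + 0.5) / (double)width;
            out[y * width + x] = mandelbrot_iters(cr, ci, max_iter);
        }
    }
}
```

Points inside the set always run the full `max_iter` iterations, which is the source of the uneven per-row cost noted above.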
METHODOLOGY
This methodology section conveys the steps performed in the realization of the goals of this research. The preparation of platforms, the methods of measurement and calculation, and the approach to transforming these results into an accurate prediction of performance for the RAD5545 and HPSC processors are conveyed.
Platform Preparation
Each of the platforms employed was prepared for application benchmarking by installing a lightweight operating system (OS) and the relevant libraries for program execution. PC benchmarks were conducted on an Ubuntu 16.04.4 LTS desktop installation. ODROID-C2 benchmarks were conducted within an Ubuntu MATE 16.04.4 installation. The P5020 and P5040 systems were both equipped with custom lightweight Linux images prepared via the Linux SDK for QorIQ processors. The GNU Scientific Library (GSL) packages required for execution of the hyperspectral imaging application were installed on each of these platforms.
Application Preparation and Input
Applications were garnered from their respective sources and parallelized for shared-memory multiprocessing using OpenMP. A goal of this study was to ensure optimal performance with consistent program code across platforms. Both the serial baseline and parallel variants of each application were optimized to -O2 levels during the compilation process. Program optimizations for the NEON SIMD accelerators of the ARM Cortex-A53 architecture are not used in this research.
For input to the color search and Sobel filter applications, a terrestrial image thumbnail acquired from the NSF SHREC Space Test Program - Houston 5 - CHREC Space Processor (STP-H5/CSP) experiment aboard the International Space Station (ISS) was scaled up to the standard pixel dimensions of a full-size image, 2448×2050 pixels [28]. For input to the thumbnailer, a 4256×2832-pixel image of the Earth taken by an astronaut aboard the ISS was scaled down to typical "full HD" resolution (1920×1080). A different, larger thumbnail was created to ensure execution times of the thumbnailer reached within the same order of magnitude as most other applications. Finally, for the HSI LCMV beamforming application, the URBAN data set of HYDICE sensor imagery provided by the United States Army Corps of Engineering Geospatial Research Laboratory served as input [29].
Performance Measures
All applications recorded and output OpenMP wall-clock timing for the execution of the primary operation of the program, such as the convolutions of the Sobel filter or the fractal generation of the Mandelbrot set. Execution times for serial baseline and parallel variants of most applications were averaged over 1000 runs. The HSI LCMV beamforming application was run for only 100 runs due to its roughly two orders-of-magnitude longer execution time. Each parallel run collected execution times for one, two, three, and four cores for every platform except the P5020 system, which was limited to two cores.
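The derived measures reported later, speedup and parallel efficiency, follow from these averaged times by their standard definitions: speedup is the serial time divided by the parallel time, and parallel efficiency is speedup divided by the number of cores. A minimal sketch in C is given below; the function names are hypothetical, and the individual run times are assumed to have been captured beforehand (for example, as differences of OpenMP's `omp_get_wtime()` readings).

```c
/* Arithmetic mean of `runs` recorded execution times (seconds). */
double mean_time(const double *times, int runs) {
    double sum = 0.0;
    for (int i = 0; i < runs; ++i) {
        sum += times[i];
    }
    return sum / (double)runs;
}

/* Speedup: serial execution time divided by parallel execution time. */
double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}

/* Parallel efficiency: speedup divided by the number of cores used. */
double parallel_efficiency(double t_serial, double t_parallel, int cores) {
    return speedup(t_serial, t_parallel) / (double)cores;
}
```

As a worked example, a speedup of 3.9 on four cores corresponds to a parallel efficiency of 3.9 / 4 = 0.975, or 97.5%.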
Energy Consumption Measures
System power measurements were collected for each application and platform combination using a power meter. Measurements were taken at idle, serial load, and parallel load for one through four cores. Idle was defined as the platform being fully booted and ready for application execution but with no foreground applications running. Serial load was determined as the peak power consumption while running the serial-baseline scripts. Parallel load was considered the peak power consumption for a certain number of cores while running the parallel energy-evaluation scripts. Calculations were performed for energy consumption of each combination of application, platform, and number of cores by multiplying the power consumption by the execution time.
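The energy calculation described above reduces to a single product, sketched here in C. The function name and units (watts, seconds, joules) are assumptions for illustration; because peak power is used, the result is an upper estimate of the energy actually drawn during the run.

```c
/* Energy in joules, estimated as measured power (watts) multiplied by
   execution time (seconds). */
double energy_joules(double power_watts, double exec_time_s) {
    return power_watts * exec_time_s;
}
```

For instance, a hypothetical platform drawing 10 W for 2.5 s would be assigned 25 J by this estimate.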
Underclocking and Frequency Scaling
The RAD5545 and HPSC processors, being radiation-hardened, have a significantly lower clock frequency than the facsimile platforms assessed. Previous performance studies for radiation-hardened processors have employed frequency scaling as in [13]. This research proposes a hybrid approach to isolate and minimize scaling error by unifying two methods: underclocking and frequency scaling. Underclocking implies collecting actual results at a reduced device clock frequency. Frequency scaling implies projecting acquired results to a lower device frequency. Applying underclocking where feasible and frequency scaling where necessary generates an effective representation of the performance of these radiation-hardened space processors.
The P5020 and P5040 RAD5545 facsimiles employed could not be effectively underclocked without hardware reconfiguration and the generation of new boot images. It was therefore decided to apply frequency scaling to the projected RAD5545 frequency of 466 MHz, consistent with [13]. The ODROID-C2 HPSC facsimile was much more straightforward to underclock via a software frequency governor integrated as a component of the processor driver. For thorough HPSC processor analysis, two frequency values were targeted: 466 MHz and 800 MHz. The 800 MHz target is noted in [21] as the planned maximum frequency for the HPSC processor. The 466 MHz target is meant to equate to the RAD5545 processor frequency and allow more direct architecture comparison without variance in clock speed. Minimum and maximum frequencies were desired for the HPSC processor as further design and fabrication may reduce the target frequency. However, the ODROID-C2 was bound by hardware limitations to 500 MHz and 1000 MHz. In order to remedy this discrepancy from the target frequencies, a hybrid-scaling approach was adopted. The ODROID-C2 was underclocked to 500 MHz and 1000 MHz and results were gathered. These results were then scaled to the target frequencies of 466 MHz and 800 MHz, respectively, using Eq. 1.
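The scaling equation itself is not reproduced in this text. A common linear model, assumed here purely for illustration, treats execution time as inversely proportional to clock frequency, so a time measured at one frequency is multiplied by the ratio of the measured frequency to the target frequency. The function name `scale_time` is hypothetical, and this sketch should not be read as the exact form of Eq. 1.

```c
/* Projects an execution time measured at f_measured_mhz to the time
   expected at f_target_mhz, assuming time is inversely proportional to
   clock frequency (an assumed linear model). */
double scale_time(double t_measured, double f_measured_mhz, double f_target_mhz) {
    return t_measured * (f_measured_mhz / f_target_mhz);
}
```

Under this assumption, a run taking 2.0 s at 1000 MHz projects to 2.5 s at 800 MHz, and a run at 500 MHz is lengthened by a factor of about 1.073 when projected to 466 MHz.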
While frequency scaling is a common and accepted practice for benchmarking predicted performance, parallel performance measures vary between frequencies, especially when non-static scheduling methods are employed, in manners not precisely predictable with scaling alone. The hybrid approach employed in this study allows a better representation of parallel performance and scaling behavior by relying less on the frequency-scaling model and more on real data. In order to ensure that scaling was accurate in the cases where it was necessary, results acquired at 500 MHz and 1000 MHz frequencies were compared with the full-speed results scaled to those frequencies, directly comparing real and scaled versions of results.
RESULTS
This section displays all performance and energy consumption results and offers some discussion on the trends these results represent. The results include the execution times, speedups, parallel efficiencies, and energy consumptions for each combination of application, platform, and number of cores. A comparison of the validity of scaled versus underclocked benchmark results cements the legitimacy of the usage of frequency scaling where necessary.
Performance and Energy Consumption Results
The charts in Figure 2 and segments of discussion that follow reflect the execution time of each application on each platform. For the RAD5545 and HPSC processors, projected minimum and maximum execution times are shown. The front bars of typical coloration denote predicted minimum execution time while the back, darker bars denote predicted maximum execution time. For the RAD5545 processor, these were determined by taking the minimum- and maximum-scaled values from the P5020 and P5040 results. For the HPSC processor, these denote results at 800 MHz as the fastest and 466 MHz as the slowest. The color search in Figure 2(a) and the Sobel filter in Figure 2(d) are the quickest of the five applications, an important consideration for later inspections of speedup and efficiency. The HSI LCMV beamforming application in Figure 2(b) is the slowest. Tabulated execution times for each application and platform averaged across runs can be referenced in Table 2 in Appendix A.
The P5040 system depicts execution times that are, on average, 10% faster than the P5020 system across all applications. This improvement is primarily due to its faster clock speed, 2266 MHz for the P5040 system compared to 2000 MHz for the P5020 system, a 13% increase. The P5020 system may reconcile some of this difference through more advanced scheduling and memory management schemes, as noted in [19] . These same advanced features are likely to be employed in the RAD5545 processor, improving its overall performance.
The ODROID-C2 depicts significantly faster execution times than the P5020 and P5040 systems. On average, it performs roughly 37% faster than the P5040 system. In some isolated cases, particularly involving memory-bound applications like the thumbnailer in Figure 2(e), the ODROID-C2 executed more than twice as fast. These faster times are notable considering its 1540 MHz clock speed; the P5040 system's 2266 MHz clock is 47% faster. This difference in performance is attributed to architectural advantages as well as a higher internal memory bandwidth: 416 GB/s on the ODROID-C2 compared to 119 and 270 GB/s on the P5020 and P5040 systems, respectively. For the Mandelbrot set in Figure 2(c), a compute-bound application, the P5040 system outpaces the ODROID-C2 by up to 15%.
The RAD5545 and HPSC processors are projected to be capable of performance within the same order of magnitude for most applications. While the HPSC processor is consistently faster, much of this speed relies on the attainment of 800 MHz performance in a radiation-hardened package. The lower 466 MHz assessment, as visualized by the darker back bars, is still faster but less competitive for a few applications tested. Even so, the observed advantages of the ARM Cortex-A53 architecture will apply regardless of frequency.
The charts in Figure 3 and segments of discussion that follow reflect the speedups and parallel efficiencies of each application on each platform. For the RAD5545 and HPSC processors, projected minimum and maximum speedups and parallel efficiencies are shown. A common legend for these speedup and parallel efficiency charts is included. The circle and square markers denote the maximum speedups and parallel efficiencies, respectively. The triangle and diamond markers denote the minimum speedups and parallel efficiencies, respectively. For the RAD5545 processor, minima and maxima were determined by taking the minimum- and maximum-scaled values from the P5020 and P5040 results. For the HPSC processor, minima and maxima denote the minimum and maximum speedups and parallel efficiencies from results at 800 MHz and 466 MHz for each application. Tabulated speedups and parallel efficiencies for each application and platform averaged across runs can be referenced in Table 3 and Table 4 in Appendix B.
For applications that execute more quickly, the color search in Figure 3(a) and Sobel filter in Figure 3(d), a higher overhead is experienced. Preparing shared and private data for parallelization and forking or joining threads takes a larger portion of the overall execution time of the program. This overhead results in lower speedups and parallel efficiencies for those applications. The Mandelbrot set in Figure 3(c) presents some of the most ideal trends, reaching speedups of up to 3.9 and an average efficiency of 98% across all platforms. To improve the performance of the Mandelbrot set and thumbnailer in Figure 3(e), dynamic scheduling is employed. These are two visible cases of trends where dynamic scheduling is more effective than statically dividing the workload at compile time.
Regardless of the scheduling methodology, the trends indicate significantly better speedups and parallel efficiencies from the P5020 and P5040 systems. These platforms average 5% improvement in speedup and parallel efficiency compared to the ODROID-C2. This average can be misleading, however, as consistency for larger numbers of cores is most critical. In some cases, such as the quad-core Sobel filter in Figure 3(d), the P5020 and P5040 systems boast more than 50% better speedup and parallel efficiency than the ODROID-C2. Despite falloff in efficiencies for the HSI application visible in Figure 3(b), likely due to the overhead of preparing large data structures for parallelism, efficiencies remain consistently high thereafter. The authors reason that the Data Path Acceleration Architecture (DPAA) and associated hardware accelerators present in the P5020 and P5040 systems for buffer, queue, and frame management allow significantly better performance in these cases.
In comparison, the ODROID-C2 exhibits large drops in speedup and parallel efficiency, occasionally even before using all four cores. Particularly large falloffs are experienced for four-core parallelization. However, OS overhead likely contributes to this lackluster quad-core performance, as one or more cores running the application must bear the overhead of running OS tasks. The P5020 and P5040 systems boast much lighter operating systems by comparison. Effectively parallelized applications, such as the HSI or Mandelbrot set, fare better on the ODROID-C2 and thus show more promise for further scaling across devices.
The charts in Figure 4 and segments of discussion that follow reflect the energy consumption of each application on each platform. While platform hardware contributes more to the energy consumption than the application, patterns of energy consumption relate highly to the patterns observed previously in execution time. The applications with the shortest execution times, the color search in Figure 4 (a) and Sobel filter in Figure 4 (d), consumed less energy on all platforms. Due primarily to its significantly longer execution time, the HSI application in Figure 4 (b) consumes the most energy, measured in kilojoules. Tabulated energy consumptions for each application and platform averaged across runs can be referenced in Table 5 in Appendix C.
The ODROID-C2, as a single-board computer, significantly bests the other platforms in energy consumption. In most tests, the P5020 and P5040 systems consumed more energy than the desktop workstation. System energy comparison with the ODROID-C2 is biased as the P5020 and P5040 systems include additional interfaces and peripherals, such as optical and hard disk drives as well as higher-rated power supplies, that are not present on the ODROID-C2 and will not be present on the RAD5545 or HPSC processors. It should be noted that the 17.7-watt power consumption documented for the RAD5545 processor in [18] would significantly reduce its energy consumption in comparison to the P5020 and P5040 systems. However, the HPSC processor aims for significantly lower power consumption, below seven watts per chiplet [21], which makes it favorable for many low-energy applications. In applications with larger energy budgets, room remains for additional devices, increasing the potential for scalability to higher computational capabilities. Due to the early development stages of the HPSC processor, no direct 
Another key insight is the effect of parallelization on energy consumption. For applications that parallelize poorly, such as the color search and the Sobel filter with their short execution times, the increased dynamic power of another core partaking in the workload results in higher energy consumption. For applications that parallelize well, especially visible in HSI, the reduction in processing time more than offsets the additional dynamic-power overhead of another core, and energy consumption is significantly reduced. While HSI biases this assessment due to its long execution time, the Mandelbrot set in Figure 4 (c) and thumbnailer in Figure 4 (e) also produce consistently reduced energy consumption. This energy-consumption trend is especially revealing considering many of the most critical applications for space-grade processors involve complex calculations and long execution times. The knowledge that effective parallelization can further reduce energy consumption is significant motivation for the adoption of these multi-core platforms.
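This trade-off can be made concrete with a simple energy model. The sketch below assumes that system power is a fixed static term plus an equal dynamic term per active core, and that energy is power multiplied by execution time. The model, the function names, and every wattage and timing value are illustrative assumptions, not measurements from this study.

```python
def energy_joules(static_w, dynamic_w_per_core, cores, serial_s, speedup):
    """Energy = (static power + per-core dynamic power * cores) * run time."""
    power_w = static_w + dynamic_w_per_core * cores
    return power_w * (serial_s / speedup)


def break_even_speedup(static_w, dynamic_w_per_core, cores):
    """Speedup above which running on `cores` uses less energy than one core."""
    return (static_w + dynamic_w_per_core * cores) / (static_w + dynamic_w_per_core)


# Hypothetical platform: 10 W static, 2 W per active core, 100 s serial run.
single = energy_joules(10.0, 2.0, 1, 100.0, 1.0)
poor = energy_joules(10.0, 2.0, 4, 100.0, 1.2)   # poorly parallelized
good = energy_joules(10.0, 2.0, 4, 100.0, 3.9)   # well parallelized

print(f"1 core: {single:.0f} J, 4 cores (S=1.2): {poor:.0f} J, "
      f"4 cores (S=3.9): {good:.0f} J")
print("break-even speedup on 4 cores:", break_even_speedup(10.0, 2.0, 4))
```

Under these assumptions, four cores reduce energy only when the speedup exceeds the ratio of quad-core power to single-core power, which matches the qualitative pattern described above.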
Frequency Scaling Versus Underclocking
To confirm that frequency scaling could be applied to these application and platform combinations with limited error, a validation method was devised. Full-speed results on the ODROID-C2 were scaled to 500 MHz and 1000 MHz, and then compared to results collected at those frequencies. Ratios were derived between the scaled and actual results and then averaged. This average was compared with the expected ratio of the device frequency to the target frequency, allowing a determination of the average scaling error for this set of applications on the ODROID-C2.
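One possible reading of this validation procedure is sketched below. It assumes that execution time is inversely proportional to clock frequency, and it measures error as the relative deviation of the mean observed time ratio from the expected frequency ratio. The function names, the 1500 MHz full-speed clock, and the timing values are hypothetical placeholders.

```python
def scale_time(measured_s, measured_mhz, target_mhz):
    """Project a time to target_mhz, assuming time is proportional to 1/clock."""
    return measured_s * measured_mhz / target_mhz


def scaling_error(full_speed_s, underclocked_s, full_mhz, target_mhz):
    """Relative deviation of the mean observed ratio from the expected ratio."""
    expected = full_mhz / target_mhz
    ratios = [u / f for f, u in zip(full_speed_s, underclocked_s)]
    mean_ratio = sum(ratios) / len(ratios)
    return abs(mean_ratio - expected) / expected


# Hypothetical timings (seconds) for two applications at an assumed
# full-speed clock of 1500 MHz and after underclocking to 500 MHz.
full_speed_s = [1.0, 2.0]
underclocked_s = [3.0, 6.3]

print("scaled time for app 2:", scale_time(2.0, 1500, 500))
print(f"scaling error: {scaling_error(full_speed_s, underclocked_s, 1500, 500):.2%}")
```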
As expected, scaling error was minimal, averaging less than 2.70%. However, this scaling error does not merely result from subtle variations in underclocking the device. Much more impact is derived from how well the considered applications parallelize and what scheduling methodologies are used. This conclusion is supported by the tendency of faster applications, such as the color search and Sobel filter, to stray from the expected ratio because parallel overhead makes up a relatively large share of their total execution time. This validation of scaling accuracy is also limited to one platform, the ODROID-C2, as underclocking the P5020 and P5040 systems would have required modifications to their underlying operating system images that were infeasible in the scope of this study. Thus, the full 2.70% scaling error applies to the projected RAD5545 processor measures. However, scaling was only applied to the HPSC processor measures for the transitions from 500 MHz to 466 MHz, a 6.8% change, and 1000 MHz to 800 MHz, a 20% change. Scaling error applied in this manner results in 0.18% and 0.54% scaling errors for 466 MHz and 800 MHz measures, respectively, averaging a 0.36% scaling error for predicted HPSC processor times. 
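The reported HPSC figures are consistent with multiplying the 2.70% scaling error by the fractional frequency change that is covered by scaling rather than underclocking. The sketch below reconstructs that arithmetic under this assumption; the function name is hypothetical.

```python
BASE_SCALING_ERROR = 0.0270  # average scaling error measured on the ODROID-C2


def residual_error(underclock_mhz, target_mhz, base_error=BASE_SCALING_ERROR):
    """Scaling error attributed to the portion of the change that is scaled."""
    fractional_change = (underclock_mhz - target_mhz) / underclock_mhz
    return base_error * fractional_change


low_clock = residual_error(500, 466)    # 6.8% frequency change
high_clock = residual_error(1000, 800)  # 20% frequency change
average = (low_clock + high_clock) / 2

print(f"466 MHz: {low_clock:.2%}, 800 MHz: {high_clock:.2%}, "
      f"average: {average:.2%}")
```

Rounded to two decimal places, these values reproduce the 0.18%, 0.54%, and 0.36% figures stated above.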
CONCLUSIONS
The primary focus of this study was the assessment of parallel application performance to predict and compare the capabilities of two next-generation space processors, BAE's RAD5545 processor and Boeing's HPSC processor. The primary platforms of consideration were the Freescale QorIQ P5020DS and P5040DS systems, which feature PowerPC e5500 architecture dual- and quad-core processors, respectively, and the Hardkernel ODROID-C2, which features a quad-core ARM Cortex-A53 processor. These platforms serve as COTS facsimiles for the space-grade RAD5545 and HPSC processors, respectively, currently in development. Several applications of relevance to space missions were benchmarked on these platforms to determine the expected performance as well as strengths and weaknesses of the facsimiles' space-grade counterparts. Facsimile power measures allowed comparison on the dimensions of system energy consumption.
Summary of Results
The high parallel efficiencies boasted by the PowerPC e5500-based facsimiles indicate substantial scalability for parallel applications. Considering the RAD5545 processor's capacity for high-speed interconnect via Serial RapidIO, this high efficiency maximizes the effectiveness of parallelization over a network of interconnected processors. By comparison, the HPSC facsimile achieves significantly better performance at lower clock speeds and much lower energy consumption. This level of performance aids in ensuring the HPSC processor will remain competitive even after the decrease in clock speed and increase in power consumption inherent in the radiation-hardening process. Despite the HPSC processor's performance, efficiencies are projected to diminish for some applications even before parallelizing over all four cores. Performance may decrease further for parallelization over the AMBA interconnect for all eight cores of the chiplet as well as over multiple chiplets [23], even with the high-speed interfaces employed.
These results depict an effective forecast for the performance of the BAE Systems RAD5545 and Boeing HPSC next-generation space processors. Comparison of scaled versus underclocked performance results for the applications tested indicates validity of frequency scaling within 2.70% error. This scaling error applies in full to the expected RAD5545 results, as frequency scaling alone was used for the P5020 and P5040 results. The error is further reduced to 0.36% for the HPSC results by using a hybrid approach, underclocking the ODROID-C2 to the nearest supported frequency and scaling the remainder of the way to the target frequency of the final device.
Future Work
This research is easily extended to many additional applications and platforms for further analyses. With regard to the HPSC processor, the tests in this study only relate to the performance of one quad-core cluster of a single chiplet. Further exploration may investigate the performance of these applications extended with parallelism across both quad-core clusters of a chiplet or with support of the SIMD NEON accelerators. For both the RAD5545 and HPSC processors, studies in scaling these applications across interconnects, such as Serial RapidIO, between chiplets or processors will yield intriguing results with respect to their large-scale employment on future space-computing platforms.
APPENDICES
The following appendices are included to better convey numerical data represented in the presented charts. The appendices are divided into execution time, speedup and parallel efficiency, and energy consumption categories. 
A. EXECUTION TIME

