Abstract: In high-performance computing (HPC), end-to-end workflows are typically used to gain insights from scientific simulations. An end-to-end workflow consists of scientific simulation and data analysis, and can be executed in-situ, in-transit, or offline. Existing studies on end-to-end workflows have largely focused on high-performance execution approaches. However, emerging heterogeneous architectures and energy concerns call for rethinking workflow execution approaches. As a guide to this rethinking, this paper evaluates how to run end-to-end HPC workflows efficiently in terms of performance, energy, and error resilience. The evaluation covers emerging heterogeneous processor architectures, processor power capping techniques, and heterogeneous-reliability memory.
I. INTRODUCTION
End-to-end high-performance computing (HPC) workflows consist of large-scale simulations followed by a series of data analysis tasks that extract scientific insights. Simulations run iteratively and generate raw data during each iteration. Traditionally, in an offline workflow, the simulation part runs on the simulation machine while the post-processing tasks run on data analysis clusters, and the two share data via a storage system. However, the offline workflow incurs redundant data movement, because the analysis must read the simulation output back from the storage system for post-processing. Moreover, a large-scale simulation, such as Fusion, running on Titan can produce over a petabyte of output data in a 24-hour run. Since the energy cost of data movement to the storage system can be 1000 times higher than that of data movement over the interconnect, re-reading such large data for analysis is extremely energy-consuming. By comparison, online workflows (in-situ and in-transit) are gaining popularity because they transfer simulation output directly to analysis without touching the storage system. In an in-situ workflow, the analysis part always shares the same compute nodes as the simulation part. There is no extra data movement energy cost, but either the simulation or the analysis can sit idle and waste energy. In an in-transit workflow, the analysis part usually runs on different compute nodes than the simulation part, and simulation output data are staged to the analysis nodes through the interconnect. Only data movement over the interconnect consumes extra energy, but it is difficult to completely avoid idle energy cost.
In-situ, in-transit, and offline workflows each have their own pros and cons. Existing studies have largely focused on the performance of end-to-end workflow execution approaches rather than on energy efficiency. However, in recent years, power and energy concerns have drawn increasing attention in HPC. How to run an end-to-end HPC workflow in an energy-efficient manner on emerging heterogeneous processor and memory architectures remains unclear. On the processor side, there are multiple platform choices for running HPC workflows, such as high-performance, power-hungry CPUs (brawny processors), low-performance, low-power CPUs (wimpy processors), and coprocessors. On the memory side, heterogeneous-reliability memory [1] provides the opportunity to run HPC workflows on various types of memory devices.
In order to resolve the above uncertainties and guide the rethinking of workflow execution approaches, we make the following contributions: 1) Comprehensive performance and energy efficiency evaluations for HPC workflows on emerging heterogeneous processor architectures and under processor power capping techniques; 2) An error resilience evaluation of HPC workflows for heterogeneous-reliability memory.
II. RELATED WORK
Recent studies have investigated the performance and energy efficiency of end-to-end HPC workflows. Li et al. [2] and Zhang et al. [3] proposed to run both simulation and analysis workloads using separate cores on a chip. Bennett et al. [4] proposed to combine in-situ and in-transit processing to achieve the trade-off between performance and data movement cost. Haldeman et al. [5] and Rodero et al. [6] evaluated the performance and power/energy trade-offs of different data movement strategies for in-situ processing. However, executing simulation and analysis programs on homogeneous processors may cause energy inefficiency since they have different computing requirements and execution characteristics.
Heterogeneous processor design has also recently become popular in HPC, e.g., utilizing coprocessors to boost performance and improve energy efficiency. An emerging trend is to use wimpy processors to reduce the high power consumption and improve the energy efficiency of HPC workloads. Ou et al. [7] evaluated the energy and cost efficiency of ARM clusters. They conclude that ARM clusters are advantageous for lightweight computation, but relatively inefficient at executing compute-bound applications. Rajovic et al. [8] [9] illustrated that the performance gap between commodity processors and mobile processors is slowly narrowing. SoCs running at a larger scale could provide sufficiently high performance with low energy consumption. Therefore, they conclude that SoCs are almost ready for HPC. However, this also requires the programs to scale well. In contrast to previous findings, we find that GPUs are more energy-efficient than SoCs for simulation programs, while SoCs are more energy-efficient for analysis programs.
III. METHODOLOGY
Performing power-related experiments with HPC workflows directly on production supercomputers is not viable, because power measurement and power capping are currently not supported on those platforms. To overcome this limitation, on the hardware side, we choose platforms that support power capping and belong to the same generation as the Titan supercomputer [10], and use them to compare the characteristics of simulation and analysis programs across heterogeneous architectures. Mainstream processors from this generation are listed in Figure 1 and are clustered into three categories: brawny processors, wimpy processors, and coprocessors. Intel Xeon E5 2670 (XeonH) and Intel Xeon E5 2603 (XeonL) are chosen for evaluation in the category of brawny processors. In the category of wimpy processors, Nvidia Tegra 3 (ARM) is evaluated. In the category of coprocessors, Nvidia Tesla K20 (GPU) is evaluated. On the software side, these platforms cannot host large-scale HPC workflows, which depend on platform-specific libraries. Therefore, we employ a wide variety of scientific simulation and analysis applications from the Rodinia and NPB benchmark suites to mimic HPC workflows and to study their characteristics, as listed in Table I. The TAU profiling tool [11] and PAPI counters [12] are used to profile the runtime hardware counters. Librapl [13] and NVML [14] are used to profile CPU and GPU power consumption, respectively. Intel Power Governor is used to cap the package power consumption. ARM power consumption is measured using a power meter that is calibrated against the processor power consumption data in the Nvidia whitepaper [15]. To compare the error resilience of simulation and analysis under hardware errors, the software fault injector PINFI [16] is used to inject bit flips into programs.
IV. PERFORMANCE AND ENERGY EFFICIENCY
In this section, we characterize the computation behaviors of simulation and analysis on heterogeneous processors and under power capping techniques.
A. Heterogeneous Processors
In this subsection, we first evaluate the performance and energy efficiency of simulation and analysis programs across heterogeneous processors. Then, we revisit whether SoCs are ready for HPC based on our evaluation. Figure 2a shows the execution time of all benchmarks. Analysis programs perform better on Xeon, and simulation programs perform better on the GPU. Figure 2b shows the energy consumption of all benchmarks. All analysis programs are more energy-efficient on ARM. However, 60% of simulation programs are more energy-efficient on ARM than on XeonH, and only one simulation benchmark is more energy-efficient on ARM than on the GPU. In terms of energy efficiency, analysis programs behave well on ARM, but simulation programs do not. The reason is that simulation programs have either more floating-point operations (e.g., CFD) or more memory accesses (e.g., LUD) than analysis programs, as shown in Figure 3. Wimpy processors (e.g., ARM) are known to have fewer floating-point units and lower memory bandwidth. Therefore, wimpy processors are more efficient at executing analysis programs, which are less floating-point intensive and less memory intensive.
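The energy comparison above rests on turning power readings into energy per benchmark run. The paper does not state how this was done, so the following is only a minimal sketch, assuming power is sampled periodically (for example by a RAPL or NVML reader) and integrated with the trapezoidal rule. The function names `energy_joules` and `most_efficient_platform` are hypothetical and do not come from the paper or from any of the tools it cites.

```python
from typing import Mapping, Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (watts) over time (seconds) with the trapezoidal rule.

    Assumption: samples are ordered by time; how they were collected is out of scope.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Area of one trapezoid: mean of two adjacent power samples times the interval.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def most_efficient_platform(energy_by_platform: Mapping[str, float]) -> str:
    """Return the platform label with the lowest energy for the same benchmark."""
    if not energy_by_platform:
        raise ValueError("no platforms to compare")
    return min(energy_by_platform, key=lambda name: energy_by_platform[name])
```

Comparing energy rather than average power matters here: a low-power platform only wins if its longer execution time does not outweigh its lower power draw.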
Previous research [8] showed that SoCs are ready for HPC based on performance and energy consumption. However, the above energy efficiency evaluation shows that this is not entirely the case. The previous research did not involve GPU platforms or analysis programs. Our results reveal that the GPU is more energy-efficient than ARM for simulation programs, as shown in Figure 2. Moreover, scalability becomes more critical when running HPC applications on SoCs [8] [9]. However, CORAL (the Department of Energy's consortium of labs for O(100) petaflop machine acquisitions) scalability data [17] show that applications such as QBOX, NAMD, SNAP, and CAM-SE achieve only 44% of ideal speedup on average at large scale.
Finding: In terms of energy efficiency, GPUs perform much better than SoCs for simulation programs, but SoCs are more energy-efficient for analysis programs. 
B. Processor Power Capping
As supercomputers are moving towards exascale, facility power and cooling resources are becoming potential bottlenecks [18] [19] . Efficiently applying power capping to supercomputers can alleviate the burden on those critical resources. However, a power cap which is too high will have no effect at all. Moreover, a power cap that is too low will lead to significant performance degradation. We study the computation behaviors of simulation and analysis programs under power capping in this subsection.
First, we evaluate the performance and energy efficiency of simulation and analysis programs under different power caps on XeonH. The results show that the most energy-efficient point on this platform is reached when the package power consumption is constrained to around 40 watts, which is much lower than the power consumption under no power cap, as shown in Figure 4. Simulation and analysis programs exhibit similar behaviors under power capping. We carry out the same experiments on XeonL and observe similar results.
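Locating the most energy-efficient cap amounts to sweeping the cap, recording average package power and execution time at each setting, and selecting the setting with the lowest energy. The sketch below illustrates that selection step only. It assumes energy is approximated as average power times execution time, and the function name `most_efficient_cap` and its input layout are hypothetical rather than taken from the paper.

```python
from typing import Dict, Mapping, Optional, Tuple


def most_efficient_cap(
    measurements: Mapping[Optional[float], Tuple[float, float]],
) -> Tuple[Optional[float], Dict[Optional[float], float]]:
    """Pick the power cap with the lowest energy.

    measurements maps a cap in watts (None means "no cap") to a pair
    (average package power in watts, execution time in seconds).
    Returns the best cap and the energy in joules for every cap.
    """
    if not measurements:
        raise ValueError("need at least one measurement")
    # Energy is approximated as average power multiplied by execution time.
    energies = {cap: power * time for cap, (power, time) in measurements.items()}
    best = min(energies, key=lambda cap: energies[cap])
    return best, energies
```

Lowering the cap reduces power but stretches execution time, so the product of the two typically has a minimum between the uncapped setting and the lowest cap, which is consistent with the behavior described above.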
Finding: The most energy-efficient power caps for both simulation and analysis programs cluster around a platform-dependent, application-independent point.
Finally, we evaluate the efficiency of power capping for all 20 benchmarks in terms of performance and energy efficiency. The most energy-efficient point of XeonL is under no power cap. Therefore, we compare XeonL under no power cap with XeonH under no power cap, under 40 watts (the most energy-efficient cap), and under 30 watts. Figure 5 shows the percentage of simulation and analysis benchmarks that do better on each platform. Both simulation and analysis programs perform better in terms of execution time on XeonH near a power cap of 40 watts. When power consumption is capped at 30 watts on XeonH, simulation programs perform equally on both platforms, while analysis programs perform better on XeonL in terms of execution time. XeonH loses its performance advantage over XeonL as the power cap drops toward the minimum.
For energy efficiency, analysis programs favor XeonL when XeonH runs with no power cap or under a 30-watt cap, while simulation programs show no strong preference. Under a power cap of 40 watts on XeonH, analysis programs are slightly more energy-efficient and simulation programs are much more energy-efficient on XeonH than on XeonL.
Finding: When energy efficiency becomes a major concern, it is always advisable to set a power cap at the most energy-efficient point. However, when the power constraint turns critical and the power cap needs to be set below the most energy-efficient point, it is advisable to set a lower power cap for simulation programs instead of analysis programs.
V. ERROR RESILIENCE
Memory errors are especially lethal to HPC applications, since they happen more frequently at larger scale. To improve memory reliability, error detection and correction features have been added to memory devices. However, these features require extra circuitry, which incurs additional access latency, energy consumption, and cost [1] [20] [21]. To lower datacenter cost, Luo et al. [1] quantified application error resilience and proposed new heterogeneous-reliability memory system designs. In this section, we examine the feasibility of relaxing the memory error resilience requirement by characterizing the resilience behaviors of simulation and analysis programs, in order to lower memory access latency and energy consumption.
Application error resilience depends on the program structure and algorithmic characteristics [22] [23]. We test the error resilience of simulation and analysis programs with the PINFI fault injector, using the Rodinia and NPB benchmark suites to compare the two classes of programs. In the experiments, each benchmark runs on Xeon for 1,000 iterations under PINFI instrumentation, with one bit-flip fault injected into a random assembly instruction per iteration. Benchmark problem sizes are reduced so that the PINFI-instrumented runs finish within a reasonable amount of time (less than 80 hours per 1,000 iterations). Comparing the percentages of correct executions, silent data corruptions (SDC), and crashes reveals the error resilience of each benchmark. Correct execution means that a program finishes normally and its results are correct. SDC means that the program finishes normally but generates incorrect results. A crash is an event in which a program either quits with an error or is terminated by the operating system before it finishes. The results are ordered by SDC and are shown in Figure 6. Most analysis programs have 60% to 80% correct execution and less than 8% SDC. By comparison, most simulation programs have 16% to 45% correct execution and 30% to 55% SDC. However, BFS behaves similarly to simulation programs, while FT behaves similarly to analysis programs. In addition, CFD and lavaMD have less than 7% crashes and more than 50% correct execution. Luo et al. [1] found that error resilience varies across applications, which can be quantified and explained by safe data regions.
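The three outcome classes defined above can be tallied mechanically once each injected run is compared against a fault-free ("golden") run. The following is a minimal sketch of that bookkeeping, not PINFI's implementation. The helpers `flip_bit`, `classify_run`, and `outcome_percentages` are hypothetical names, and modeling a fault as an XOR with a one-bit mask on an integer is an assumption made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def flip_bit(value: int, bit: int, width: int = 64) -> int:
    """Model a single bit flip in a `width`-bit unsigned value (illustrative only)."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask


def classify_run(finished_ok: bool, output: Optional[bytes], golden: bytes) -> str:
    """Classify one injected run as 'crash', 'correct', or 'sdc'."""
    if not finished_ok:
        # The program quit with an error or was terminated before finishing.
        return "crash"
    # The program finished; compare its output against the fault-free output.
    return "correct" if output == golden else "sdc"


def outcome_percentages(outcomes: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of correct, SDC, and crash outcomes."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no outcomes recorded")
    return {kind: 100.0 * counts[kind] / total for kind in ("correct", "sdc", "crash")}
```

Exact byte comparison against the golden output is the strictest possible SDC criterion. Applications that tolerate small numerical deviations would need a tolerance-based comparison instead, which this sketch does not attempt.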
Finding: In general, analysis programs are more resilient to SDC, and have more correct execution when compared to simulations. However, resilience to crashes cannot be clearly distinguished between simulation and analysis programs.
Therefore, it is feasible to run analysis programs on memory devices with lower error resilience capability. For example, memory errors could be detected but not corrected, either in hardware using a parity code or in software by computing checksums [24]. This can potentially reduce the memory access latency and energy consumption for analysis programs.
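To make the two detection-only options concrete, the sketch below shows a single even-parity bit per word and a software checksum over a buffer. It is an illustration under stated assumptions, not the scheme of [24]: CRC-32 via Python's standard `zlib.crc32` is chosen here only as a convenient checksum, and the helper names are hypothetical.

```python
import zlib


def even_parity_bit(word: int) -> int:
    """Return the parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") & 1


def parity_ok(word: int, stored_parity: int) -> bool:
    """Detect (but not correct) an odd number of flipped bits in `word`."""
    return even_parity_bit(word) == stored_parity


def checksum(data: bytes) -> int:
    """Software checksum over a buffer; CRC-32 is an assumed choice."""
    return zlib.crc32(data)


def buffer_ok(data: bytes, stored_checksum: int) -> bool:
    """Return True if the buffer still matches the checksum taken earlier."""
    return checksum(data) == stored_checksum
```

A single parity bit catches any odd number of bit flips in a word but misses an even number, and neither mechanism can repair the data. That limitation is why detection-only protection suits programs that are already resilient to errors.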
VI. CONCLUSION AND ACKNOWLEDGEMENT
In this paper, we evaluate how to run end-to-end HPC workflows efficiently with respect to heterogeneous processor architectures, power capping techniques, and heterogeneous-reliability memory. The evaluation provides insights into how to choose and configure the platforms for HPC workflows in an energy-efficient manner.
The work conducted at Temple University is partially sponsored by the U.S. National Science Foundation (NSF) under grants CNS-1702474, CNS-1700719, and CCF-1547804. This research also used the resources of OLCF, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
