Soft errors in exascale applications are a challenging problem in modern High Performance Computing (HPC). To quantify an application's resilience and vulnerability, HPC users widely adopt application-level fault injection. However, this is not easy, since users need to inject a large number of faults to ensure statistical significance, especially for parallel programs. Normally, a parallel execution is more complex and requires more hardware resources than the corresponding serial execution. Therefore, it is essential to be able to predict the error rate of a parallel application based on its serial version. In this poster, we characterize fault patterns in serial and parallel executions. We find, first, that serial and parallel executions share common fault sources. Second, parallel execution also has some unique fault sources compared with serial execution. Those unique fault sources are important for understanding the difference in fault patterns between serial and parallel executions.
INTRODUCTION
Making systems resilient to hardware and software faults is a critical design goal for future extreme-scale systems. To implement resilient HPC, we must have a good understanding of application resilience in the presence of faults. Currently, application-level fault injection is the major method for understanding application resilience.
Application-level fault injection triggers a random bit flip in the operand or result of a random instruction. Typically, the statistical results of many fault injection tests, e.g., the percentage of fault injection tests that have a successful application outcome, are used to evaluate application resilience.
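The single-bit flip described above can be sketched in a few lines of Python. This is only an illustration of the fault model, not the injection tool's implementation: the helper names `flip_bit` and `inject_random_fault` are hypothetical, and the sketch assumes the corrupted value is a 64-bit IEEE 754 double.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of a 64-bit double's encoding.

    Assumption: the operand or result is an IEEE 754 double.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range [0, 64)")
    # Reinterpret the double as an unsigned 64-bit integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Flip the chosen bit and reinterpret the result as a double again.
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


def inject_random_fault(value: float, rng: random.Random) -> float:
    """Flip one uniformly chosen bit of value (hypothetical helper)."""
    return flip_bit(value, rng.randrange(64))
```

The effect of a flip depends strongly on the bit position: under this encoding, flipping bit 63 of 1.0 changes the sign and gives -1.0, and flipping bit 52 changes the exponent and gives 0.5, while a flip in a low mantissa bit changes the value only slightly.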
However, application-level fault injection can be very expensive, because HPC users need to inject a large number of faults to ensure statistical significance. Moreover, compared with fault injection for the serial execution, fault injection for the parallel execution can be even more expensive. First, the parallel version needs more hardware resources than the serial version to deploy fault injection tests. Second, injecting faults into the parallel execution can be more difficult, because there is a larger exploration space for fault injection.
In this poster, we explore the correlation between the parallel and serial executions regarding their resilience. Our ultimate goal is that, by studying the resilience of the serial execution, we can derive the resilience of the parallel execution without using expensive fault injection. We aim to answer two fundamental questions. First, does the application resilience remain the same across the serial and parallel executions? Second, if the application resilience differs between the two executions, what code structure causes the difference? We use an application-level fault injection tool named PFSEFI [2] to randomly choose a dynamic instruction and then randomly flip one bit in the instruction result. After enough fault injection tests, we characterize and compare the serial and parallel execution codes based on the fault injection results. We hope that our work can lay the foundation for building a model that predicts the resilience of the parallel execution based only on fault injection results from the serial execution.
* LA-UR-17-26470
EVALUATION METHODOLOGY
We employ a fault injection tool, PFSEFI, to study three NAS benchmarks (CG, FT, BT) with the input problem S. For serial execution fault injection, we run only one MPI process; for parallel execution fault injection, we run four MPI processes and then randomly choose one MPI process for fault injection. We inject faults into the whole application and focus on two types of instructions, i.e., floating-point addition (fadd) and floating-point multiplication (fmul), because they are the most common ones in HPC applications. To ensure statistical significance for fault injection, we gradually increase the number of fault injection tests until the fault injection result becomes stable.
The fault injection results are classified into three types:
(1) Benign: the computation results of the benchmark pass the benchmark's verification phase, which means the computation results are acceptable, although they may differ from those obtained without fault injection.
(2) Silent data corruption (SDC): the computation results of the benchmark do not pass the benchmark's verification phase.
(3) Crash: the benchmark cannot run to completion.
Since fault injection is based on the random selection of a dynamic instruction, we do not directly know where in the application code the fault happens. However, we do know the instruction address in the EIP register when the fault happens. We map the instruction address to the application code via PYELFTOOLS [1]. Based on the EIP information for all random fault injection points, we can determine the occurrence frequency of each faulty instruction; we can also analyze the code and understand the differences or similarities in application resilience between serial and parallel executions.
Figure 1 shows the fault injection results (i.e., the SDC rate of the fault injection tests). We collect 10,000 fault injection test results for each benchmark and calculate the SDC rate every 1,000 fault injection tests. The fault injection results become stable after the first 6,000 tests.
Figure 2 shows the faulty instruction distribution for the fault injection tests on floating-point add instructions. In particular, we find that no crashes occurred in the tests of the three benchmarks; thus, we only show how frequently each instruction is selected when the fault injection results are benign or SDC. Figure 2(a)-(d) shows that for FT, the randomly selected faulty instructions in the fault injection tests for the serial and parallel executions are the same, which explains why the fault injection results for the two executions are almost the same. The faulty instructions are the same mainly because of the code similarity between the serial and parallel codes.
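A per-outcome distribution of faulty instructions, of the kind plotted in Figure 2, can be tallied from the recorded instruction addresses. The sketch below is illustrative only: the record format (pairs of instruction address and outcome label) and the function names are assumptions, not the tool's actual log format.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def instruction_histograms(
    records: Iterable[Tuple[int, str]]
) -> Dict[str, Counter]:
    """Count how often each instruction address was selected, per outcome.

    `records` holds (instruction address, outcome) pairs; this format is
    an assumption made for illustration.
    """
    histograms: Dict[str, Counter] = defaultdict(Counter)
    for address, outcome in records:
        histograms[outcome][address] += 1
    return dict(histograms)


def same_faulty_instructions(a: Counter, b: Counter) -> bool:
    """True when two histograms cover exactly the same set of addresses."""
    return set(a) == set(b)
```

Comparing the address sets of the serial and parallel histograms for the same outcome gives a quick check of whether the two executions hit the same faulty instructions, as reported for FT.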
EXPERIMENT RESULTS
For BT (see Figure 2(e)-(h)), we find that faulty instructions are widely spread in both the parallel and serial executions. There is almost no similarity between the faulty instructions of the serial and parallel executions. This is because BT has complicated computation: there is no dominant computation phase in which the faulty instructions can repeatedly occur.
For CG (see Figure 2(i)-(l)), we find that faulty instructions are limited to a few instructions, which is very different from the case of BT. Also, the fault injection results for the serial and parallel executions are quite different. To understand the reason for this difference, we map the faulty instructions to the source code of CG and make the following observations.
Observation 1: The instruction at 0x0804A03F is the most frequently selected instruction for fault injection. This instruction appears in all four cases: (serial+benign), (serial+SDC), (parallel+benign), and (parallel+SDC). It is used so often in the benchmark that most faults are injected into it. Also, the corruption of this instruction seems to easily cause SDC.
Observation 2: Some instructions appear only in (serial+benign), (parallel+benign), and (parallel+SDC), but do not appear in (serial+SDC). These instructions include those at 0x0804A2B7, 0x0804A3BA, 0x0804A402, and 0x0804A502. They cause the difference in fault injection results between the serial and parallel executions. Figure 3 shows the related code segment for 0x0804A2B7. In particular, the serial and parallel executions have different values for the variable l2npcols, which leads to a different code structure (particularly the MPI synchronization) for the serial and parallel executions. This difference in code structure makes the fault injection at 0x0804A2B7 behave differently in the serial and parallel executions.
Observation 3: The instruction at 0x0804A163 appears only in (parallel+benign) and (parallel+SDC). This instruction exists only in the parallel execution because the variable l2npcols has a different value in the serial and parallel executions; hence, the two executions behave differently (Figure 4).
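The three observations amount to grouping instruction addresses by the set of cases in which they appear. The sketch below illustrates that grouping. The `observed` sets encode only the case memberships stated in Observations 1 to 3 for the named addresses; they are not the measured distribution, and other instructions that may appear in Figure 2 are omitted. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Set

# Case membership taken from Observations 1-3 only (illustrative, not
# measured frequencies; other faulty instructions are omitted).
_SHARED = {0x0804A2B7, 0x0804A3BA, 0x0804A402, 0x0804A502}
observed: Dict[str, Set[int]] = {
    "serial+benign": {0x0804A03F} | _SHARED,
    "serial+SDC": {0x0804A03F},
    "parallel+benign": {0x0804A03F, 0x0804A163} | _SHARED,
    "parallel+SDC": {0x0804A03F, 0x0804A163} | _SHARED,
}


def group_by_presence(
    cases: Dict[str, Set[int]]
) -> Dict[FrozenSet[str], Set[int]]:
    """Group addresses by the exact set of cases in which they appear."""
    groups: Dict[FrozenSet[str], Set[int]] = defaultdict(set)
    for address in set().union(*cases.values()):
        present = frozenset(
            name for name, addresses in cases.items() if address in addresses
        )
        groups[present].add(address)
    return dict(groups)


groups = group_by_presence(observed)
```

Addresses that fall into a group containing only parallel cases, or into a group that lacks (serial+SDC), are the candidates for explaining why the serial and parallel results differ.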
CONCLUSIONS AND FUTURE WORK
This work is a preliminary study to explain the reasons for similar or different application resilience between serial and parallel executions. In future work, we will investigate more benchmarks and establish a model to predict application resilience for the parallel execution based on the fault injection results for the serial execution.
