2,486 research outputs found
Online Fault Classification in HPC Systems through Machine Learning
As High-Performance Computing (HPC) systems strive towards the exascale goal,
studies suggest that they will experience excessive failure rates. For this
reason, detecting and classifying faults in HPC systems as they occur and
initiating corrective actions before they can transform into failures will be
essential for continued operation. In this paper, we propose a fault
classification method for HPC systems based on machine learning that has been
designed specifically to operate with live streamed data. We cast the problem
and its solution within realistic operating constraints of online use. Our
results show that almost perfect classification accuracy can be reached for
different fault types with low computational overhead and minimal delay. We
have based our study on a local dataset, which we make publicly available, that
was acquired by injecting faults to an in-house experimental HPC system.Comment: Accepted for publication at the Euro-Par 2019 conferenc
An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration
We empirically evaluate an undervolting technique, i.e., underscaling the
circuit supply voltage below the nominal level, to improve the power-efficiency
of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable
Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing
faults due to excessive circuit latency increase. We evaluate the
reliability-power trade-off for such accelerators. Specifically, we
experimentally study the reduced-voltage operation of multiple components of
real FPGAs, characterize the corresponding reliability behavior of CNN
accelerators, propose techniques to minimize the drawbacks of reduced-voltage
operation, and combine undervolting with architectural CNN optimization
techniques, i.e., quantization and pruning. We investigate the effect of
environmental temperature on the reliability-power trade-off of such
accelerators. We perform experiments on three identical samples of modern
Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification
CNN benchmarks. This approach allows us to study the effects of our
undervolting technique for both software and hardware variability. We achieve
more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain
is the result of eliminating the voltage guardband region, i.e., the safe
voltage region below the nominal level that is set by FPGA vendor to ensure
correct functionality in worst-case environmental and circuit conditions. 43%
of the power-efficiency gain is due to further undervolting below the
guardband, which comes at the cost of accuracy loss in the CNN accelerator. We
evaluate an effective frequency underscaling technique that prevents this
accuracy loss, and find that it reduces the power-efficiency gain from 43% to
25%.Comment: To appear at the DSN 2020 conferenc
Transient fault behavior in a microprocessor: A case study
An experimental analysis is described which studies the susceptibility of a microprocessor based jet engine controller to upsets caused by current and voltage transients. A design automation environment which allows the run time injection of transients and the tracing from their impact device to the pin level is described. The resulting error data are categorized by the charge levels of the injected transients by location and by their potential to cause logic upsets, latched errors, and pin errors. The results show a 3 picoCouloumb threshold, below which the transients have little impact. An Arithmetic and Logic Unit transient is most likely to result in logic upsets and pin errors (i.e., impact the external environment). The transients in the countdown unit are potentially serious since they can result in latched errors, thus causing latent faults. Suggestions to protect the processor against these errors, by incorporating internal error detection and transient suppression techniques, are also made
Fault-tolerant quantum computation against biased noise
We formulate a scheme for fault-tolerant quantum computation that works effectively against highly biased noise, where dephasing is far stronger than all other types of noise. In our scheme, the fundamental operations performed by the quantum computer are single-qubit preparations, single-qubit measurements, and conditional-phase (CPHASE) gates, where the noise in the CPHASE gates is biased. We show that the accuracy threshold for quantum computation can be improved by exploiting this noise asymmetry; e.g., if dephasing dominates all other types of noise in the CPHASE gates by four orders of magnitude, we find a rigorous lower bound on the accuracy threshold higher by a factor of 5 than for the case of unbiased noise
Large-Scale Application of Fault Injection into PyTorch Models -- an Extension to PyTorchFI for Validation Efficiency
Transient or permanent faults in hardware can render the output of Neural
Networks (NN) incorrect without user-specific traces of the error, i.e. silent
data errors (SDE). On the other hand, modern NNs also possess an inherent
redundancy that can tolerate specific faults. To establish a safety case, it is
necessary to distinguish and quantify both types of corruptions. To study the
effects of hardware (HW) faults on software (SW) in general and NN models in
particular, several fault injection (FI) methods have been established in
recent years. Current FI methods focus on the methodology of injecting faults
but often fall short of accounting for large-scale FI tests, where many fault
locations based on a particular fault model need to be analyzed in a short
time. Results need to be concise, repeatable, and comparable. To address these
requirements and enable fault injection as the default component in a machine
learning development cycle, we introduce a novel fault injection framework
called PyTorchALFI (Application Level Fault Injection for PyTorch) based on
PyTorchFI. PyTorchALFI provides an efficient way to define randomly generated
and reusable sets of faults to inject into PyTorch models, defines complex test
scenarios, enhances data sets, and generates test KPIs while tightly coupling
fault-free, faulty, and modified NN. In this paper, we provide details about
the definition of test scenarios, software architecture, and several examples
of how to use the new framework to apply iterative changes in fault location
and number, compare different model modifications, and analyze test results.Comment: accepted in DSN202
- …