8,729 research outputs found
What does fault tolerant Deep Learning need from MPI?
Deep Learning (DL) algorithms have become the de facto Machine Learning (ML)
algorithm for large scale data analysis. DL algorithms are computationally
expensive - even distributed DL implementations which use MPI require days of
training (model learning) time on commonly studied datasets. Long running DL
applications become susceptible to faults - requiring development of a fault
tolerant system infrastructure, in addition to fault tolerant DL algorithms.
This raises an important question: What is needed from MPI for de- signing
fault tolerant DL implementations? In this paper, we address this problem for
permanent faults. We motivate the need for a fault tolerant MPI specification
by an in-depth consideration of recent innovations in DL algorithms and their
properties, which drive the need for specific fault tolerance features. We
present an in-depth discussion on the suitability of different parallelism
types (model, data and hybrid); a need (or lack thereof) for check-pointing of
any critical data structures; and most importantly, consideration for several
fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI
and their applicability to fault tolerant DL implementations. We leverage a
distributed memory implementation of Caffe, currently available under the
Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches
by ex- tending MaTEx-Caffe for using ULFM-based implementation. Our evaluation
using the ImageNet dataset and AlexNet, and GoogLeNet neural network topologies
demonstrates the effectiveness of the proposed fault tolerant DL implementation
using OpenMPI based ULFM
Fault detection, identification and accommodation techniques for unmanned airborne vehicles
Unmanned Airborne Vehicles (UAV) are assuming prominent roles in both the commercial and military aerospace industries. The promise of reduced costs and reduced risk to human life is one of their major attractions, however these low-cost systems are yet to gain acceptance as a safe alternate to manned solutions. The absence of a thinking, observing, reacting and decision making pilot reduces the UAVs capability of managing adverse situations such as faults and failures. This paper presents a review of techniques that can be used to track the system health onboard a UAV. The review is based on a year long literature review aimed at identifying approaches suitable for combating the low reliability and high attrition rates of today’s UAV. This research primarily focuses on real-time, onboard implementations for generating accurate estimations of aircraft health for fault accommodation and mission management (change of mission objectives due to deterioration in aircraft health). The major task of such systems is the process of detection, identification and accommodation of faults and failures (FDIA). A number of approaches exist, of which model-based techniques show particular promise. Model-based approaches use analytical redundancy to generate residuals for the aircraft parameters that can be used to indicate the occurrence of a fault or failure. Actions such as switching between redundant components or modifying control laws can then be taken to accommodate the fault. The paper further describes recent work in evaluating neural-network approaches to sensor failure detection and identification (SFDI). The results of simulations with a variety of sensor failures, based on a Matlab non-linear aircraft model are presented and discussed. Suggestions for improvements are made based on the limitations of this neural network approach with the aim of including a broader range of failures, while still maintaining an accurate model in the presence of these failures
On The Robustness of a Neural Network
With the development of neural networks based machine learning and their
usage in mission critical applications, voices are rising against the
\textit{black box} aspect of neural networks as it becomes crucial to
understand their limits and capabilities. With the rise of neuromorphic
hardware, it is even more critical to understand how a neural network, as a
distributed system, tolerates the failures of its computing nodes, neurons, and
its communication channels, synapses. Experimentally assessing the robustness
of neural networks involves the quixotic venture of testing all the possible
failures, on all the possible inputs, which ultimately hits a combinatorial
explosion for the first, and the impossibility to gather all the possible
inputs for the second.
In this paper, we prove an upper bound on the expected error of the output
when a subset of neurons crashes. This bound involves dependencies on the
network parameters that can be seen as being too pessimistic in the average
case. It involves a polynomial dependency on the Lipschitz coefficient of the
neurons activation function, and an exponential dependency on the depth of the
layer where a failure occurs. We back up our theoretical results with
experiments illustrating the extent to which our prediction matches the
dependencies between the network parameters and robustness. Our results show
that the robustness of neural networks to the average crash can be estimated
without the need to neither test the network on all failure configurations, nor
access the training set used to train the network, both of which are
practically impossible requirements.Comment: 36th IEEE International Symposium on Reliable Distributed Systems 26
- 29 September 2017. Hong Kong, Chin
Parallel Architectures for Planetary Exploration Requirements (PAPER)
The Parallel Architectures for Planetary Exploration Requirements (PAPER) project is essentially research oriented towards technology insertion issues for NASA's unmanned planetary probes. It was initiated to complement and augment the long-term efforts for space exploration with particular reference to NASA/LaRC's (NASA Langley Research Center) research needs for planetary exploration missions of the mid and late 1990s. The requirements for space missions as given in the somewhat dated Advanced Information Processing Systems (AIPS) requirements document are contrasted with the new requirements from JPL/Caltech involving sensor data capture and scene analysis. It is shown that more stringent requirements have arisen as a result of technological advancements. Two possible architectures, the AIPS Proof of Concept (POC) configuration and the MAX Fault-tolerant dataflow multiprocessor, were evaluated. The main observation was that the AIPS design is biased towards fault tolerance and may not be an ideal architecture for planetary and deep space probes due to high cost and complexity. The MAX concepts appears to be a promising candidate, except that more detailed information is required. The feasibility for adding neural computation capability to this architecture needs to be studied. Key impact issues for architectural design of computing systems meant for planetary missions were also identified
A Review of Fault Diagnosing Methods in Power Transmission Systems
Transient stability is important in power systems. Disturbances like faults need to be segregated to restore transient stability. A comprehensive review of fault diagnosing methods in the power transmission system is presented in this paper. Typically, voltage and current samples are deployed for analysis. Three tasks/topics; fault detection, classification, and location are presented separately to convey a more logical and comprehensive understanding of the concepts. Feature extractions, transformations with dimensionality reduction methods are discussed. Fault classification and location techniques largely use artificial intelligence (AI) and signal processing methods. After the discussion of overall methods and concepts, advancements and future aspects are discussed. Generalized strengths and weaknesses of different AI and machine learning-based algorithms are assessed. A comparison of different fault detection, classification, and location methods is also presented considering features, inputs, complexity, system used and results. This paper may serve as a guideline for the researchers to understand different methods and techniques in this field
An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration
We empirically evaluate an undervolting technique, i.e., underscaling the
circuit supply voltage below the nominal level, to improve the power-efficiency
of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable
Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing
faults due to excessive circuit latency increase. We evaluate the
reliability-power trade-off for such accelerators. Specifically, we
experimentally study the reduced-voltage operation of multiple components of
real FPGAs, characterize the corresponding reliability behavior of CNN
accelerators, propose techniques to minimize the drawbacks of reduced-voltage
operation, and combine undervolting with architectural CNN optimization
techniques, i.e., quantization and pruning. We investigate the effect of
environmental temperature on the reliability-power trade-off of such
accelerators. We perform experiments on three identical samples of modern
Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification
CNN benchmarks. This approach allows us to study the effects of our
undervolting technique for both software and hardware variability. We achieve
more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain
is the result of eliminating the voltage guardband region, i.e., the safe
voltage region below the nominal level that is set by FPGA vendor to ensure
correct functionality in worst-case environmental and circuit conditions. 43%
of the power-efficiency gain is due to further undervolting below the
guardband, which comes at the cost of accuracy loss in the CNN accelerator. We
evaluate an effective frequency underscaling technique that prevents this
accuracy loss, and find that it reduces the power-efficiency gain from 43% to
25%.Comment: To appear at the DSN 2020 conferenc
- …