Online Fault Classification in HPC Systems through Machine Learning
As High-Performance Computing (HPC) systems strive towards the exascale goal,
studies suggest that they will experience excessive failure rates. For this
reason, detecting and classifying faults in HPC systems as they occur and
initiating corrective actions before they can transform into failures will be
essential for continued operation. In this paper, we propose a fault
classification method for HPC systems based on machine learning that has been
designed specifically to operate with live streamed data. We cast the problem
and its solution within realistic operating constraints of online use. Our
results show that almost perfect classification accuracy can be reached for
different fault types with low computational overhead and minimal delay. We
have based our study on a local dataset, which we make publicly available,
acquired by injecting faults into an in-house experimental HPC system.
Comment: Accepted for publication at the Euro-Par 2019 conference
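As a concrete illustration of such a pipeline, here is a minimal sketch in Python, assuming fixed-width windows of per-node monitoring metrics summarized into feature vectors and a random-forest classifier trained offline; the feature dimensionality, model choice, and fault labels are illustrative stand-ins, not the authors' actual design:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
FAULTS = ["none", "cpu_hog", "mem_leak", "io_stall"]   # hypothetical fault classes

# Offline phase: train on labelled feature windows (random stand-ins here
# for statistics computed over windows of per-node monitoring metrics).
X_train = rng.normal(size=(2000, 16))
y_train = rng.integers(0, len(FAULTS), size=2000)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Online phase: each newly streamed window is classified as it arrives,
# so corrective action can start before the fault becomes a failure.
def classify_window(window_features: np.ndarray) -> str:
    return FAULTS[clf.predict(window_features.reshape(1, -1))[0]]

print(classify_window(rng.normal(size=16)))

A single tree-ensemble prediction per window is what keeps the online overhead and delay low in this kind of setup.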
Development and analysis of the Software Implemented Fault-Tolerance (SIFT) computer
SIFT (Software Implemented Fault Tolerance) is an experimental, fault-tolerant computer system designed to meet the extreme reliability requirements for safety-critical functions in advanced aircraft. Errors are masked by performing a majority voting operation over the results of identical computations, and faulty processors are removed from service by reassigning computations to the nonfaulty processors. This scheme has been implemented in a special architecture using a set of standard Bendix BDX930 processors, augmented by a special asynchronous-broadcast communication interface that provides direct, processor-to-processor communication among all processors. Fault isolation is accomplished in hardware; all other fault-tolerance functions, together with scheduling and synchronization, are implemented exclusively by executive system software. The system reliability is predicted by a Markov model. Mathematical consistency of the system software with respect to the reliability model has been partially verified, using recently developed tools for machine-aided proof of program correctness.
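The masking step itself is simple to sketch. A toy illustration, assuming each computation is replicated on several processors and a plain majority vote masks a minority of faulty results; SIFT's real voter, scheduler, and reconfiguration logic are considerably more involved and live in its executive software:

from collections import Counter

def vote(results):
    """Majority-vote over redundant results from replicated processors.
    Returns the voted value and the set of dissenting processors."""
    value, _ = Counter(results.values()).most_common(1)[0]
    dissenters = {p for p, r in results.items() if r != value}
    return value, dissenters

active = {"P1", "P2", "P3"}
results = {"P1": 42, "P2": 42, "P3": 41}   # P3 returned a faulty result
value, faulty = vote(results)
active -= faulty       # faulty processors are removed from service;
                       # their computations are reassigned to the rest
print(value, sorted(active))               # 42 ['P1', 'P2']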
Distributed Anomaly Detection using Autoencoder Neural Networks in WSN for IoT
Wireless sensor networks (WSN) are fundamental to the Internet of Things
(IoT) by bridging the gap between the physical and the cyber worlds. Anomaly
detection is a critical task in this context as it is responsible for
identifying various events of interest such as equipment faults and
undiscovered phenomena. However, this task is challenging because of the
elusive nature of anomalies and the volatility of the ambient environments. In
a resource-scarce setting like a WSN, this challenge is further amplified,
weakening the suitability of many existing solutions. In this paper, for the
first time, we introduce autoencoder neural networks into WSN to solve the
anomaly detection problem. We design a two-part algorithm that resides on
sensors and the IoT cloud respectively, such that (i) anomalies can be detected
at sensors in a fully distributed manner without the need for communicating
with any other sensors or the cloud, and (ii) the relatively more
computation-intensive learning task can be handled by the cloud with a much
lower (and configurable) frequency. In addition to the minimal communication
overhead, the computational load on sensors is also very low (of polynomial
complexity) and readily affordable by most COTS sensors. Using a real WSN
indoor testbed and sensor data collected over 4 consecutive months, we
demonstrate via experiments that our proposed autoencoder-based anomaly
detection mechanism achieves high detection accuracy and low false alarm rate.
It is also able to adapt to unforeseen and new changes in a non-stationary
environment, thanks to the unsupervised learning nature of our chosen
autoencoder neural networks.
Comment: 6 pages, 7 figures, IEEE ICC 2018
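A rough sketch of the two-part split, using scikit-learn's MLPRegressor as a stand-in autoencoder; the paper's actual architecture, features, and threshold rule differ, and the synthetic data here is purely illustrative:

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

# Cloud side: (re)train the autoencoder on recent readings at a low,
# configurable frequency, then push the model and threshold to sensors.
def cloud_train(readings):
    ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=3000, random_state=0)
    ae.fit(readings, readings)        # learn to reconstruct "normal" data
    errors = np.mean((ae.predict(readings) - readings) ** 2, axis=1)
    return ae, errors.mean() + 3 * errors.std()    # illustrative threshold

# Sensor side: one cheap forward pass plus a comparison, fully local,
# with no need to talk to other sensors or the cloud per reading.
def sensor_is_anomaly(ae, threshold, reading):
    err = np.mean((ae.predict(reading.reshape(1, -1)) - reading) ** 2)
    return err > threshold

normal = rng.normal(0.0, 1.0, size=(500, 8))   # stand-in for indoor sensor data
ae, thr = cloud_train(normal)
print(sensor_is_anomaly(ae, thr, rng.normal(0.0, 1.0, size=8)))  # likely False
print(sensor_is_anomaly(ae, thr, rng.normal(6.0, 1.0, size=8)))  # likely True

The split keeps the per-reading cost on the sensor to a single forward pass and a comparison, while the expensive training runs only as often as the cloud chooses.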
Randomized protocols for asynchronous consensus
The famous Fischer, Lynch, and Paterson impossibility proof shows that it is
impossible to solve the consensus problem in a natural model of an asynchronous
distributed system if even a single process can fail. Since its publication,
two decades of work on fault-tolerant asynchronous consensus algorithms have
evaded this impossibility result by using extended models that provide (a)
randomization, (b) additional timing assumptions, (c) failure detectors, or (d)
stronger synchronization mechanisms than are available in the basic model.
Concentrating on the first of these approaches, we illustrate the history and
structure of randomized asynchronous consensus protocols by giving detailed
descriptions of several such protocols.
Comment: 29 pages; survey paper written for the PODC 20th anniversary issue of
Distributed Computing
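For a flavour of approach (a), the sketch below is a lock-step, failure-free simulation in the spirit of Ben-Or's classic randomized binary consensus protocol, in which an undecided process flips a local coin whenever the votes are inconclusive; a faithful implementation would be asynchronous and message-passing, and the thresholds here follow the usual crash-fault formulation:

import random

def ben_or_simulation(prefs, f):
    """Failure-free, lock-step simulation of Ben-Or-style randomized
    binary consensus. prefs: one 0/1 preference per process; f: the
    number of crash faults the decision threshold is sized for."""
    n = len(prefs)
    decided = [None] * n
    while any(d is None for d in decided):
        # Round 1: everyone reports its preference; a process proposes its
        # value only if a strict majority of reports carried that value.
        counts = {v: prefs.count(v) for v in (0, 1)}
        proposals = [v for v in prefs if counts[v] * 2 > n]
        pcounts = {v: proposals.count(v) for v in (0, 1)}
        # Round 2: decide on a value proposed more than f times; adopt a
        # proposed value if one was seen; otherwise flip a local coin.
        for i in range(n):
            if decided[i] is not None:
                continue
            if pcounts[0] > f or pcounts[1] > f:
                decided[i] = prefs[i] = 0 if pcounts[0] > f else 1
            elif proposals:
                prefs[i] = proposals[0]
            else:
                prefs[i] = random.randint(0, 1)
    return decided

print(ben_or_simulation([0, 1, 1, 0, 1], f=1))   # all processes agree

Because only one value can ever attain a strict majority, no two processes can propose different values, and the coin flips ensure the loop terminates with probability 1, which is exactly how randomization sidesteps the FLP result.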