17,760 research outputs found
Parallel Architectures for Planetary Exploration Requirements (PAPER)
The Parallel Architectures for Planetary Exploration Requirements (PAPER) project is essentially research oriented towards technology insertion issues for NASA's unmanned planetary probes. It was initiated to complement and augment the long-term efforts for space exploration with particular reference to NASA/LaRC's (NASA Langley Research Center) research needs for planetary exploration missions of the mid and late 1990s. The requirements for space missions as given in the somewhat dated Advanced Information Processing Systems (AIPS) requirements document are contrasted with the new requirements from JPL/Caltech involving sensor data capture and scene analysis. It is shown that more stringent requirements have arisen as a result of technological advancements. Two possible architectures, the AIPS Proof of Concept (POC) configuration and the MAX Fault-tolerant dataflow multiprocessor, were evaluated. The main observation was that the AIPS design is biased towards fault tolerance and may not be an ideal architecture for planetary and deep space probes due to high cost and complexity. The MAX concepts appears to be a promising candidate, except that more detailed information is required. The feasibility for adding neural computation capability to this architecture needs to be studied. Key impact issues for architectural design of computing systems meant for planetary missions were also identified
Improving Performance of Iterative Methods by Lossy Checkponting
Iterative methods are commonly used approaches to solve large, sparse linear
systems, which are fundamental operations for many modern scientific
simulations. When the large-scale iterative methods are running with a large
number of ranks in parallel, they have to checkpoint the dynamic variables
periodically in case of unavoidable fail-stop errors, requiring fast I/O
systems and large storage space. To this end, significantly reducing the
checkpointing overhead is critical to improving the overall performance of
iterative methods. Our contribution is fourfold. (1) We propose a novel lossy
checkpointing scheme that can significantly improve the checkpointing
performance of iterative methods by leveraging lossy compressors. (2) We
formulate a lossy checkpointing performance model and derive theoretically an
upper bound for the extra number of iterations caused by the distortion of data
in lossy checkpoints, in order to guarantee the performance improvement under
the lossy checkpointing scheme. (3) We analyze the impact of lossy
checkpointing (i.e., extra number of iterations caused by lossy checkpointing
files) for multiple types of iterative methods. (4)We evaluate the lossy
checkpointing scheme with optimal checkpointing intervals on a high-performance
computing environment with 2,048 cores, using a well-known scientific
computation package PETSc and a state-of-the-art checkpoint/restart toolkit.
Experiments show that our optimized lossy checkpointing scheme can
significantly reduce the fault tolerance overhead for iterative methods by
23%~70% compared with traditional checkpointing and 20%~58% compared with
lossless-compressed checkpointing, in the presence of system failures.Comment: 14 pages, 10 figures, HPDC'1
A methodology for the generation of efficient error detection mechanisms
A dependable software system must contain error detection mechanisms and error recovery mechanisms. Software components for the detection of errors are typically designed based on a system specification or the experience of software engineers, with their efficiency typically being measured using fault injection and metrics such as coverage and latency. In this paper, we introduce a methodology for the design of highly efficient error detection mechanisms. The proposed methodology combines fault injection analysis and data mining techniques in order to generate predicates for efficient error detection mechanisms. The results presented demonstrate the viability of the methodology as an approach for the development of efficient error detection mechanisms, as the predicates generated yield a true positive rate of almost 100% and a false positive rate very close to 0% for the detection of failure-inducing states. The main advantage of the proposed methodology over current state-of-the-art approaches is that efficient detectors are obtained by design, rather than by using specification-based detector design or the experience of software engineers
Study of fault-tolerant software technology
Presented is an overview of the current state of the art of fault-tolerant software and an analysis of quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the computer architecture and design implications on hardware, operating systems and programming languages (including Ada) of using fault-tolerant software in real-time aerospace applications. It concludes that fault-tolerant software has progressed beyond the pure research state. The paper also finds that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to effectively and efficiently implement software fault-tolerance
Reliable Linear, Sesquilinear and Bijective Operations On Integer Data Streams Via Numerical Entanglement
A new technique is proposed for fault-tolerant linear, sesquilinear and
bijective (LSB) operations on integer data streams (), such as:
scaling, additions/subtractions, inner or outer vector products, permutations
and convolutions. In the proposed method, the input integer data streams
are linearly superimposed to form numerically-entangled integer data
streams that are stored in-place of the original inputs. A series of LSB
operations can then be performed directly using these entangled data streams.
The results are extracted from the entangled output streams by additions
and arithmetic shifts. Any soft errors affecting any single disentangled output
stream are guaranteed to be detectable via a specific post-computation
reliability check. In addition, when utilizing a separate processor core for
each of the streams, the proposed approach can recover all outputs after
any single fail-stop failure. Importantly, unlike algorithm-based fault
tolerance (ABFT) methods, the number of operations required for the
entanglement, extraction and validation of the results is linearly related to
the number of the inputs and does not depend on the complexity of the performed
LSB operations. We have validated our proposal in an Intel processor (Haswell
architecture with AVX2 support) via fast Fourier transforms, circular
convolutions, and matrix multiplication operations. Our analysis and
experiments reveal that the proposed approach incurs between to
reduction in processing throughput for a wide variety of LSB operations. This
overhead is 5 to 1000 times smaller than that of the equivalent ABFT method
that uses a checksum stream. Thus, our proposal can be used in fault-generating
processor hardware or safety-critical applications, where high reliability is
required without the cost of ABFT or modular redundancy.Comment: to appear in IEEE Trans. on Signal Processing, 201
- …