Survivable algorithms and redundancy management in NASA's distributed computing systems
The design of survivable algorithms requires a solid foundation for executing them. While hardware techniques for fault-tolerant computing are relatively well understood, fault-tolerant operating systems, as well as fault-tolerant applications (survivable algorithms), are, by contrast, little understood, and much more work in this field is required. We outline some of our work that contributes to the foundation of ultrareliable operating systems and fault-tolerant algorithm design. We introduce our consensus-based framework for fault-tolerant system design. This is followed by a description of a hierarchical partitioning method for efficient consensus. A scheduler for redundancy management is introduced, and application-specific fault tolerance is described. We give an overview of our hybrid algorithm technique, which is an alternative to the formal approach given
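The abstract stays at the architectural level; purely as an illustrative sketch (not the paper's design), the Python snippet below shows the basic idea behind redundancy management by majority voting over replicated computations. The triple-replication setup and all names are chosen here for illustration only.

```python
from collections import Counter

def majority_vote(results):
    """Return the value agreed on by a strict majority of replicas, or None if there is none."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None

def run_redundant(replicas, x):
    """Run the same computation on several replicas (some possibly faulty) and vote on the result."""
    return majority_vote([replica(x) for replica in replicas])

# Example: a single faulty replica out of three is masked by the majority vote.
correct = lambda x: x * x
faulty = lambda x: x * x + 1      # models a replica whose result was corrupted by a fault
assert run_redundant([correct, correct, faulty], 4) == 16
```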
Formal verification of a software countermeasure against instruction skip attacks
Fault attacks against embedded circuits have opened up many new attack paths against secure circuits. Every attack path relies on a specific fault model, which defines the type of faults the attacker can perform. On embedded processors, a fault model consisting of the skip of an assembly instruction can be very useful to an attacker and has been obtained with several fault injection means. To counter this threat, countermeasure schemes that rely on temporal redundancy have been proposed. Nevertheless, a double fault injection within a long enough time interval is practical and can bypass those countermeasure schemes. Fine-grained countermeasure schemes have also been proposed for specific instructions. However, to the best of our knowledge, no approach that secures a generic assembly program so as to make it fault-tolerant to instruction skip attacks has been formally proven yet. In this paper, we provide a fault-tolerant replacement sequence for almost all the instructions of the Thumb-2 instruction set and formally verify this fault tolerance. This simple transformation adds a reasonably good security level to an embedded program and makes practical fault injection attacks much harder to achieve
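The replacement sequences in the paper are given in ARM Thumb-2 assembly; the Python sketch below only simulates the underlying principle, under semantics assumed here for illustration: when every instruction is written in an idempotent form and then duplicated, skipping any single instruction leaves the final result unchanged.

```python
def execute(program, skip=None):
    """Execute a tiny register-transfer program, optionally skipping one instruction."""
    regs = {"r0": 7, "r1": 5, "r2": 0}
    for i, (dst, src_a, src_b) in enumerate(program):
        if i == skip:
            continue                             # models a fault skipping this instruction
        regs[dst] = regs[src_a] + regs[src_b]    # e.g. ADD dst, src_a, src_b
    return regs

# Idempotent form: the destination register never appears as a source,
# so executing the same instruction twice has no additional effect.
hardened = [("r2", "r0", "r1"),                  # ADD r2, r0, r1
            ("r2", "r0", "r1")]                  # duplicated copy of the same instruction

reference = execute(hardened)                    # fault-free run
for skipped in range(len(hardened)):
    assert execute(hardened, skip=skipped) == reference   # any single skip is tolerated
```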
Parallelizing Deadlock Resolution in Symbolic Synthesis of Distributed Programs
Previous work has shown that there are two major complexity barriers in the synthesis of fault-tolerant distributed programs: (1) generation of the fault-span, the set of states reachable in the presence of faults, and (2) resolution of deadlock states, from which the program has no outgoing transitions. Of these, the former closely resembles model checking and, hence, techniques for efficient verification are directly applicable to it. We therefore focus on expediting the latter with the use of multi-core technology. We present two approaches to parallelization based on different design choices. The first approach is based on the computation of equivalence classes of program transitions (called group computation), which are needed due to the issue of distribution (i.e., the inability of processes to atomically read and write all program variables). We show that in most cases the speedup of this approach is close to the ideal speedup, and in some cases it is superlinear. The second approach uses the traditional technique of partitioning the deadlock states among multiple threads. However, our experiments show that the speedup of this approach is small. Consequently, our analysis demonstrates that the simple approach of parallelizing the group computation is likely to be the more effective way of using multi-core computing in the context of deadlock resolution
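The implementation in the paper is symbolic; purely as a structural sketch of the second approach's design choice, the snippet below partitions a set of deadlock states among worker threads, with the `resolve` routine standing in for the actual symbolic deadlock-resolution step (all names and the thread-pool setup are assumptions for illustration).

```python
from concurrent.futures import ThreadPoolExecutor

def resolve(deadlock_states):
    """Placeholder for resolving one partition of deadlock states
    (in the paper this is a symbolic computation, not an explicit one)."""
    return {s: "recovery-added" for s in deadlock_states}

def partitioned_resolution(deadlock_states, workers=4):
    """Split the deadlock states into equal partitions and resolve them in parallel."""
    states = list(deadlock_states)
    chunks = [states[i::workers] for i in range(workers)]
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(resolve, chunks):
            merged.update(partial)
    return merged

print(partitioned_resolution(range(16)))
```

As the abstract notes, this partitioning approach yielded only a small speedup in the authors' experiments, compared with parallelizing the group computation.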
Integration of tools for the Design and Assessment of High-Performance, Highly Reliable Computing Systems (DAHPHRS), phase 1
Systems for Strategic Defense Initiative (SDI) space applications typically require both high performance and very high reliability. These requirements present the systems engineer evaluating such systems with the extremely difficult problem of conducting performance and reliability trade-off studies over large design spaces. A controlled development process supported by appropriate automated tools must be used to ensure that the system will meet its design objectives. This report describes an investigation of the methods, tools, and techniques necessary to support performance and reliability modeling for SDI systems development. Models of the JPL Hypercubes, the Encore Multimax, and the C.S. Draper Lab Fault-Tolerant Parallel Processor (FTPP) parallel-computing architectures, using candidate SDI weapons-to-target assignment algorithms as workloads, were built and analyzed as a means of identifying which system models are needed, how the models interact, and what experiments and analyses should be performed. This effort revealed weaknesses in the existing methods and tools and identified the capabilities that will be required of both individual tools and an integrated toolset
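The report does not reproduce its models here; as a generic illustration of the kind of reliability modeling such a toolset has to support, the sketch below compares a simplex processor with triple modular redundancy under the standard constant-failure-rate model (textbook formulas, not DAHPHRS results; the numeric parameters are made up).

```python
import math

def simplex_reliability(lam, t):
    """Reliability of a single processor with constant failure rate lam over mission time t."""
    return math.exp(-lam * t)

def tmr_reliability(lam, t):
    """Reliability of triple modular redundancy with a perfect voter:
    the system survives as long as at least 2 of the 3 modules are fault-free."""
    r = simplex_reliability(lam, t)
    return 3 * r**2 - 2 * r**3

lam, t = 1e-4, 1000.0          # failures per hour, mission hours (illustrative values only)
print(f"simplex: {simplex_reliability(lam, t):.4f}  TMR: {tmr_reliability(lam, t):.4f}")
```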
Real-time and fault tolerance in distributed control software
Closed-loop control systems typically contain a multitude of spatially distributed sensors and actuators operated simultaneously, so such systems are parallel and distributed in their essence. Mapping this parallelism onto a given distributed hardware architecture, however, brings in additional requirements: safe multithreading, optimal process allocation, and real-time scheduling of bus and network resources. Nowadays, fault tolerance methods and fast, even online, reconfiguration are becoming increasingly important. All of these often conflicting requirements make the design and implementation of real-time distributed control systems an extremely difficult task that requires substantial knowledge in several areas of control and computer science. Although many design methods have been proposed so far, none of them has succeeded in covering all important aspects of the problem at hand [1]. The continuous increase of production in the embedded market makes a simple and natural design methodology for real-time systems needed more than ever
Optimizing Lossy Compression Rate-Distortion from Automatic Online Selection between SZ and ZFP
With ever-increasing volumes of scientific data produced by HPC applications, significantly reducing data size is critical because of the limited capacity of storage space and potential bottlenecks on I/O or networks when writing/reading or transferring data. SZ and ZFP are the two leading lossy compressors available for compressing scientific data sets. However, their performance is not consistent across different data sets, or even across different fields of the same data set: some fields are compressed better by SZ, while others are better compressed with ZFP. This situation raises the need for an automatic online (i.e., during compression) selection between SZ and ZFP with minimal overhead. In this paper, the automatic selection optimizes the rate-distortion, an important statistical quality metric based on the signal-to-noise ratio. To optimize for rate-distortion, we investigate the principles of SZ and ZFP. We then propose an efficient online, low-overhead selection algorithm that accurately predicts the compression quality of the two compressors in the early processing stages and selects the best-fit compressor for each data field. We implement the selection algorithm in an open-source library, and we evaluate the effectiveness of our proposed solution against plain SZ and ZFP in a parallel environment with 1,024 cores. Evaluation results on three data sets representing about 100 fields show that our selection algorithm improves the compression ratio by up to 70% at the same level of data distortion, thanks to very accurate selection (around 99%) of the best-fit compressor and with little overhead (less than 7% in the experiments).
Comment: 14 pages, 9 figures, first revision