39,854 research outputs found
Resilience in Numerical Methods: A Position on Fault Models and Methodologies
Future extreme-scale computer systems may expose silent data corruption (SDC)
to applications, in order to save energy or increase performance. However,
resilience research struggles to come up with useful abstract programming
models for reasoning about SDC. Existing work randomly flips bits in running
applications, but this only shows average-case behavior for a low-level,
artificial hardware model. Algorithm developers need to understand worst-case
behavior with the higher-level data types they actually use, in order to make
their algorithms more resilient. Also, we know so little about how SDC may
manifest in future hardware, that it seems premature to draw conclusions about
the average case. We argue instead that numerical algorithms can benefit from a
numerical unreliability fault model, where faults manifest as unbounded
perturbations to floating-point data. Algorithms can use inexpensive "sanity"
checks that bound or exclude error in the results of computations. Given a
selective reliability programming model that requires reliability only when and
where needed, such checks can make algorithms reliable despite unbounded
faults. Sanity checks, and in general a healthy skepticism about the
correctness of subroutines, are wise even if hardware is perfectly reliable.Comment: Position Pape
Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators
In this paper, we evaluate the error criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as long as imprecise computing is concerned, the simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications’ output correlating the number of corrupted elements with their spatial locality. Also, we provide the mean relative error (dataset-wise) to evaluate radiation-induced error magnitude.
We apply the selected metrics to experimental results obtained in various radiation test campaigns for a total of more than 400 hours of beam time per device. The amount of data we gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw generic reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical for the K40, while Xeon Phi is more reliable when executing particles interactions solved through Finite Difference Methods. Finally, iterative stencil operations seem the most reliable on both architectures.This work was supported by the STIC-AmSud/CAPES scientific cooperation program under the EnergySFE research
project grant 99999.007556/2015-02, EU H2020 Programme, and MCTI/RNP-Brazil under the HPC4E Project, grant agreement
n° 689772. Tested K40 boards were donated thanks to Steve Keckler, Timothy Tsai, and Siva Hari from NVIDIA.Postprint (author's final draft
Fault-Tolerant Adaptive Parallel and Distributed Simulation
Discrete Event Simulation is a widely used technique that is used to model
and analyze complex systems in many fields of science and engineering. The
increasingly large size of simulation models poses a serious computational
challenge, since the time needed to run a simulation can be prohibitively
large. For this reason, Parallel and Distributes Simulation techniques have
been proposed to take advantage of multiple execution units which are found in
multicore processors, cluster of workstations or HPC systems. The current
generation of HPC systems includes hundreds of thousands of computing nodes and
a vast amount of ancillary components. Despite improvements in manufacturing
processes, failures of some components are frequent, and the situation will get
worse as larger systems are built. In this paper we describe FT-GAIA, a
software-based fault-tolerant extension of the GAIA/ART\`IS parallel simulation
middleware. FT-GAIA transparently replicates simulation entities and
distributes them on multiple execution nodes. This allows the simulation to
tolerate crash-failures of computing nodes; furthermore, FT-GAIA offers some
protection against byzantine failures since synchronization messages are
replicated as well, so that the receiving entity can identify and discard
corrupted messages. We provide an experimental evaluation of FT-GAIA on a
running prototype. Results show that a high degree of fault tolerance can be
achieved, at the cost of a moderate increase in the computational load of the
execution units.Comment: Proceedings of the IEEE/ACM International Symposium on Distributed
Simulation and Real Time Applications (DS-RT 2016
A Study of Concurrency Bugs and Advanced Development Support for Actor-based Programs
The actor model is an attractive foundation for developing concurrent
applications because actors are isolated concurrent entities that communicate
through asynchronous messages and do not share state. Thereby, they avoid
concurrency bugs such as data races, but are not immune to concurrency bugs in
general. This study taxonomizes concurrency bugs in actor-based programs
reported in literature. Furthermore, it analyzes the bugs to identify the
patterns causing them as well as their observable behavior. Based on this
taxonomy, we further analyze the literature and find that current approaches to
static analysis and testing focus on communication deadlocks and message
protocol violations. However, they do not provide solutions to identify
livelocks and behavioral deadlocks. The insights obtained in this study can be
used to improve debugging support for actor-based programs with new debugging
techniques to identify the root cause of complex concurrency bugs.Comment: - Submitted for review - Removed section 6 "Research Roadmap for
Debuggers", its content was summarized in the Future Work section - Added
references for section 1, section 3, section 4.3 and section 5.1 - Updated
citation
Considering Human Aspects on Strategies for Designing and Managing Distributed Human Computation
A human computation system can be viewed as a distributed system in which the
processors are humans, called workers. Such systems harness the cognitive power
of a group of workers connected to the Internet to execute relatively simple
tasks, whose solutions, once grouped, solve a problem that systems equipped
with only machines could not solve satisfactorily. Examples of such systems are
Amazon Mechanical Turk and the Zooniverse platform. A human computation
application comprises a group of tasks, each of them can be performed by one
worker. Tasks might have dependencies among each other. In this study, we
propose a theoretical framework to analyze such type of application from a
distributed systems point of view. Our framework is established on three
dimensions that represent different perspectives in which human computation
applications can be approached: quality-of-service requirements, design and
management strategies, and human aspects. By using this framework, we review
human computation in the perspective of programmers seeking to improve the
design of human computation applications and managers seeking to increase the
effectiveness of human computation infrastructures in running such
applications. In doing so, besides integrating and organizing what has been
done in this direction, we also put into perspective the fact that the human
aspects of the workers in such systems introduce new challenges in terms of,
for example, task assignment, dependency management, and fault prevention and
tolerance. We discuss how they are related to distributed systems and other
areas of knowledge.Comment: 3 figures, 1 tabl
Producing Scheduling that Causes Concurrent Programs to Fail
A noise maker is a tool that seeds a concurrent program with conditional synchronization primitives (such as yield()) for the purpose of increasing the likelihood that a bug manifest itself. This work explores the theory and practice of choosing where in the program to induce such thread switches at runtime. We introduce a novel fault model that classifies locations as .good., .neutral., or .bad,. based on the effect of a thread switch at the location. Using the model we explore the terms in which efficient search for real-life concurrent bugs can be carried out. We accordingly justify the use of probabilistic algorithms for this search and gain a deeper insight of the work done so far on noise-making. We validate our approach by experimenting with a set of programs taken from publicly available multi-threaded benchmark. Our empirical evidence demonstrates that real-life behavior is similar to what our model predicts
- …