7,976 research outputs found
Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators
In this paper, we evaluate the error criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as long as imprecise computing is concerned, the simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applicationsâ output correlating the number of corrupted elements with their spatial locality. Also, we provide the mean relative error (dataset-wise) to evaluate radiation-induced error magnitude.
We apply the selected metrics to experimental results obtained in various radiation test campaigns for a total of more than 400 hours of beam time per device. The amount of data we gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw generic reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical for the K40, while Xeon Phi is more reliable when executing particles interactions solved through Finite Difference Methods. Finally, iterative stencil operations seem the most reliable on both architectures.This work was supported by the STIC-AmSud/CAPES scientific cooperation program under the EnergySFE research
project grant 99999.007556/2015-02, EU H2020 Programme, and MCTI/RNP-Brazil under the HPC4E Project, grant agreement
n° 689772. Tested K40 boards were donated thanks to Steve Keckler, Timothy Tsai, and Siva Hari from NVIDIA.Postprint (author's final draft
Survey and Systematization of Secure Device Pairing
Secure Device Pairing (SDP) schemes have been developed to facilitate secure
communications among smart devices, both personal mobile devices and Internet
of Things (IoT) devices. Comparison and assessment of SDP schemes is
troublesome, because each scheme makes different assumptions about out-of-band
channels and adversary models, and are driven by their particular use-cases. A
conceptual model that facilitates meaningful comparison among SDP schemes is
missing. We provide such a model. In this article, we survey and analyze a wide
range of SDP schemes that are described in the literature, including a number
that have been adopted as standards. A system model and consistent terminology
for SDP schemes are built on the foundation of this survey, which are then used
to classify existing SDP schemes into a taxonomy that, for the first time,
enables their meaningful comparison and analysis.The existing SDP schemes are
analyzed using this model, revealing common systemic security weaknesses among
the surveyed SDP schemes that should become priority areas for future SDP
research, such as improving the integration of privacy requirements into the
design of SDP schemes. Our results allow SDP scheme designers to create schemes
that are more easily comparable with one another, and to assist the prevention
of persisting the weaknesses common to the current generation of SDP schemes.Comment: 34 pages, 5 figures, 3 tables, accepted at IEEE Communications
Surveys & Tutorials 2017 (Volume: PP, Issue: 99
A Pattern Language for High-Performance Computing Resilience
High-performance computing systems (HPC) provide powerful capabilities for
modeling, simulation, and data analytics for a broad class of computational
problems. They enable extreme performance of the order of quadrillion
floating-point arithmetic calculations per second by aggregating the power of
millions of compute, memory, networking and storage components. With the
rapidly growing scale and complexity of HPC systems for achieving even greater
performance, ensuring their reliable operation in the face of system
degradations and failures is a critical challenge. System fault events often
lead the scientific applications to produce incorrect results, or may even
cause their untimely termination. The sheer number of components in modern
extreme-scale HPC systems and the complex interactions and dependencies among
the hardware and software components, the applications, and the physical
environment makes the design of practical solutions that support fault
resilience a complex undertaking. To manage this complexity, we developed a
methodology for designing HPC resilience solutions using design patterns. We
codified the well-known techniques for handling faults, errors and failures
that have been devised, applied and improved upon over the past three decades
in the form of design patterns. In this paper, we present a pattern language to
enable a structured approach to the development of HPC resilience solutions.
The pattern language reveals the relations among the resilience patterns and
provides the means to explore alternative techniques for handling a specific
fault model that may have different efficiency and complexity characteristics.
Using the pattern language enables the design and implementation of
comprehensive resilience solutions as a set of interconnected resilience
patterns that can be instantiated across layers of the system stack.Comment: Proceedings of the 22nd European Conference on Pattern Languages of
Program
Model Based Mission Assurance: NASA's Assurance Future
Model Based Systems Engineering (MBSE) is seeing increased application in planning and design of NASAs missions. This suggests the question: what will be the corresponding practice of Model Based Mission Assurance (MBMA)? Contemporaneously, NASAs Office of Safety and Mission Assurance (OSMA) is evaluating a new objectives based approach to standards to ensure that the Safety and Mission Assurance disciplines and programs are addressing the challenges of NASAs changing missions, acquisition and engineering practices, and technology. MBSE is a prominent example of a changing engineering practice. We use NASAs objectives-based strategy for Reliability and Maintainability as a means to examine how MBSE will affect assurance. We surveyed MBSE literature to look specifically for these affects, and find a variety of them discussed (some are anticipated, some are reported from applications to date). Predominantly these apply to the early stages of design, although there are also extrapolations of how MBSE practices will have benefits for testing phases. As the effort to develop MBMA continues, it will need to clearly and unambiguously establish the roles of uncertainty and risk in the system model. This will enable a variety of uncertainty-based analyses to be performed much more rapidly than ever before and has the promise to increase the integration of CRM (Continuous Risk Management) and PRA (Probabilistic Risk Analyses) even more fully into the project development life cycle. Various views and viewpoints will be required for assurance disciplines, and an over-arching viewpoint will then be able to more completely characterize the state of the project/program as well as (possibly) enabling the safety case approach for overall risk awareness and communication
- âŠ