Understanding the propagation of hard errors to software and implications for resilient system design

Author: Chris
Daniel
David
Edward
Fred
Jayanth
Jun
Man-Lap Li
Milos
Nicholas
Pradeep Ramachandran
R. Rodriguez
Rajesh
Rotenberg Eric
Sarita V. Adve
Srinivasan M.
Swarup Kumar Sahoo
Todd
V. Reddy
Vikram S. Adve
Weining
Yuanyuan Zhou
Publication venue: 'Association for Computing Machinery (ACM)'

Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators

Author: Carro Luigi
Cela Jose M.
Fernandes Fernando
Fratin Vinicius
Hanzich Mauricio
Lunardi Caio
Navaux Philippe
Oliveira Daniel
Pilla Laercio
Rech Paolo
Publication date: 01/03/2016

In this paper, we evaluate the criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as far as imprecise computing is concerned, simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications' output, correlating the number of corrupted elements with their spatial locality. We also provide the mean relative error (dataset-wise) to evaluate the magnitude of radiation-induced errors. We apply the selected metrics to experimental results obtained in various radiation test campaigns, for a total of more than 400 hours of beam time per device. The amount of data we gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw generic reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical for the K40, while the Xeon Phi is more reliable when executing particle interactions solved through Finite Difference Methods. Finally, iterative stencil operations seem the most reliable on both architectures.

This work was supported by the STIC-AmSud/CAPES scientific cooperation program under the EnergySFE research project grant 99999.007556/2015-02, the EU H2020 Programme, and MCTI/RNP-Brazil under the HPC4E Project, grant agreement n° 689772. Tested K40 boards were donated thanks to Steve Keckler, Timothy Tsai, and Siva Hari from NVIDIA.

Postprint (author's final draft)

UPCommons. Portal del coneixement obert de la UPC

Survey and Systematization of Secure Device Pairing

Author: Fomichev Mikhail
Gardner-Stephen Paul
Hollick Matthias
Steinmetzer Daniel
Álvarez Flor
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/09/2017

Secure Device Pairing (SDP) schemes have been developed to facilitate secure communications among smart devices, both personal mobile devices and Internet of Things (IoT) devices. Comparing and assessing SDP schemes is troublesome, because each scheme makes different assumptions about out-of-band channels and adversary models, and is driven by its particular use cases. A conceptual model that facilitates meaningful comparison among SDP schemes is missing. We provide such a model. In this article, we survey and analyze a wide range of SDP schemes that are described in the literature, including a number that have been adopted as standards. On the foundation of this survey, we build a system model and consistent terminology for SDP schemes, which are then used to classify existing SDP schemes into a taxonomy that, for the first time, enables their meaningful comparison and analysis. The existing SDP schemes are analyzed using this model, revealing common systemic security weaknesses among the surveyed SDP schemes that should become priority areas for future SDP research, such as improving the integration of privacy requirements into the design of SDP schemes. Our results allow SDP scheme designers to create schemes that are more easily comparable with one another, and help prevent the weaknesses common to the current generation of SDP schemes from persisting.

Comment: 34 pages, 5 figures, 3 tables, accepted at IEEE Communications Surveys & Tutorials 2017 (Volume: PP, Issue: 99)

arXiv.org e-Print Archive

TUbiblio

A Pattern Language for High-Performance Computing Resilience

Author: Chung Jinsuk
Mohror Kathryn
Saridakis Titos
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 30/10/2017

High-performance computing (HPC) systems provide powerful capabilities for modeling, simulation, and data analytics for a broad class of computational problems. They enable extreme performance, on the order of a quadrillion floating-point arithmetic calculations per second, by aggregating the power of millions of compute, memory, networking, and storage components. With the rapidly growing scale and complexity of HPC systems for achieving even greater performance, ensuring their reliable operation in the face of system degradations and failures is a critical challenge. System fault events often lead scientific applications to produce incorrect results, or may even cause their untimely termination. The sheer number of components in modern extreme-scale HPC systems, together with the complex interactions and dependencies among the hardware and software components, the applications, and the physical environment, makes the design of practical solutions that support fault resilience a complex undertaking. To manage this complexity, we developed a methodology for designing HPC resilience solutions using design patterns. We codified, in the form of design patterns, the well-known techniques for handling faults, errors, and failures that have been devised, applied, and improved upon over the past three decades. In this paper, we present a pattern language to enable a structured approach to the development of HPC resilience solutions. The pattern language reveals the relations among the resilience patterns and provides the means to explore alternative techniques for handling a specific fault model that may have different efficiency and complexity characteristics. Using the pattern language enables the design and implementation of comprehensive resilience solutions as a set of interconnected resilience patterns that can be instantiated across layers of the system stack.

Comment: Proceedings of the 22nd European Conference on Pattern Languages of Program

arXiv.org e-Print Archive

Crossref

Model Based Mission Assurance: NASA's Assurance Future

Author: Cornford Steven
Evans John
Feather Martin S.

Model Based Systems Engineering (MBSE) is seeing increased application in the planning and design of NASA's missions. This suggests the question: what will be the corresponding practice of Model Based Mission Assurance (MBMA)? Contemporaneously, NASA's Office of Safety and Mission Assurance (OSMA) is evaluating a new objectives-based approach to standards to ensure that the Safety and Mission Assurance disciplines and programs are addressing the challenges of NASA's changing missions, acquisition and engineering practices, and technology. MBSE is a prominent example of a changing engineering practice. We use NASA's objectives-based strategy for Reliability and Maintainability as a means to examine how MBSE will affect assurance. We surveyed the MBSE literature to look specifically for these effects, and find a variety of them discussed (some are anticipated, some are reported from applications to date). Predominantly these apply to the early stages of design, although there are also extrapolations of how MBSE practices will benefit testing phases. As the effort to develop MBMA continues, it will need to clearly and unambiguously establish the roles of uncertainty and risk in the system model. This will enable a variety of uncertainty-based analyses to be performed much more rapidly than ever before and has the promise to integrate CRM (Continuous Risk Management) and PRA (Probabilistic Risk Analyses) even more fully into the project development life cycle. Various views and viewpoints will be required for assurance disciplines, and an over-arching viewpoint will then be able to more completely characterize the state of the project/program as well as (possibly) enabling the safety case approach for overall risk awareness and communication.

NASA Technical Reports Server