Modeling RTL Fault Models Behavior to Increase the Confidence on TSIM-based Fault Injection
Future high-performance safety-relevant applications require microcontrollers that deliver higher performance than the existing certified ones. However, means for assessing their dependability are needed so that they can be certified against safety-critical certification standards (e.g., ISO 26262). Dependability assessment analyses performed at a high level of abstraction inject single faults to investigate the effects these have on the system. In this work we show that single faults do not capture the whole picture, due to fault multiplicities and reactivations. We then show that injecting complex fault models that consider multiplicities and reactivations at higher levels of abstraction yields substantially different results, indicating that a change in the methodology is needed. The research leading to these results has received funding from the Ministry of Science and Technology of Spain under contract TIN2015-65316-P and the HiPEAC Network of Excellence.
Carles Hernández is jointly funded by the Spanish Ministry of Economy and Competitiveness (MINECO) and FEDER funds through grant TIN2014-60404-JIN. Jaume Abella has been partially supported by the MINECO under Ramón y Cajal postdoctoral fellowship number RYC-2013-14717. Postprint (author's final draft)
Building on Quicksand
Reliable systems have always been built out of unreliable components. Early
on, the reliable components were small, such as mirrored disks or ECC (Error
Correcting Codes) in core memory. These systems were designed such that
failures of these small components were transparent to the application. Later,
the size of the unreliable components grew larger and semantic challenges crept
into the application when failures occurred.
As the granularity of the unreliable component grows, the latency to
communicate with a backup becomes unpalatable. This leads to a more relaxed
model for fault tolerance. The primary system will acknowledge the work request
and its actions without waiting to ensure that the backup is notified of the
work. This improves the responsiveness of the system.
There are two implications of asynchronous state capture: 1) Everything
promised by the primary is probabilistic. There is always a chance that an
untimely failure shortly after the promise results in a backup proceeding
without knowledge of the commitment. Hence, nothing is guaranteed! 2)
Applications must ensure eventual consistency. Since work may be stuck in the
primary after a failure and reappear later, the processing order for work
cannot be guaranteed.
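The two implications above can be illustrated with a minimal sketch. It is not taken from the paper: the `Replica` and `Primary` classes, the `request` and `ship` methods, and the work items are all hypothetical names chosen for illustration, and the failure is simulated simply by never calling `ship` for the last request.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical node that records work items in the order it learns of them."""
    log: list = field(default_factory=list)

    def apply(self, item):
        # Idempotent apply: the same item may arrive more than once.
        if item not in self.log:
            self.log.append(item)


@dataclass
class Primary(Replica):
    """Hypothetical primary that acknowledges work before notifying the backup."""
    unsent: list = field(default_factory=list)

    def request(self, item):
        self.apply(item)
        self.unsent.append(item)   # queued for the backup, not yet shipped
        return "ack"               # the promise is made here

    def ship(self, backup):
        # Asynchronous state capture: runs some time after the ack.
        while self.unsent:
            backup.apply(self.unsent.pop(0))


primary, backup = Primary(), Replica()
primary.request("debit A")
primary.ship(backup)               # the backup learns of "debit A"
primary.request("credit B")        # acknowledged, but never shipped...
stranded = list(primary.unsent)    # ...because the primary fails at this point

backup.apply("credit C")           # the backup takes over and accepts new work
for item in stranded:              # work stuck in the old primary reappears later
    backup.apply(item)

print(primary.log)  # ['debit A', 'credit B']
print(backup.log)   # ['debit A', 'credit C', 'credit B']
```

In this sketch the backup proceeds without knowing about the acknowledged "credit B", and when that item reappears it is applied after "credit C". The two logs end up in different orders, which is why the application itself has to tolerate reordering.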
Platform designers are struggling to make this easier for their applications.
Emerging patterns of eventual consistency and probabilistic execution may soon
yield a way for applications to express requirements for a "looser" form of
consistency while providing availability in the face of ever larger failures.
This paper recounts portions of the evolution of these trends, attempts to
show the patterns that span these changes, and talks about future directions as
we continue to "build on quicksand". Comment: CIDR 200
Enabling portable I/O analysis of commercially sensitive HPC applications through workload replication
Benchmarking and analyzing I/O performance across high-performance computing (HPC) platforms is necessary to identify performance bottlenecks and guide effective use of new and existing storage systems. Doing this with large production applications, which are often commercially sensitive and lack portability, is not straightforward, and a representative proxy for I/O workloads can help to provide a solution. We use Darshan I/O characterization and the MACSio proxy application to replicate five production workloads, showing how these can be used effectively to investigate I/O performance when migrating between HPC systems ranging from small local clusters to leadership-scale machines. Preliminary results indicate that it is possible to generate datasets that match the target application with a good degree of accuracy. This enables a predictive performance analysis study of a representative workload to be conducted on five different systems. The results of this analysis are used to identify how workloads exhibit different I/O footprints on a file system and what effect file system configuration can have on performance.