Search CORE

17 research outputs found

Improving redundant multithreading performance for soft-error detection in HPC applications

Author: Pérez-Arroyo Diego Simón
Publication venue: 'Instituto Tecnologico de Costa Rica'
Publication date: 01/01/2018
Field of study

Tesis de Graduación (Maestría en Computación) Instituto Tecnológico de Costa Rica, Escuela de Computación, 2018As HPC systems move towards extreme scale, soft errors leading to silent data corruptions become a major concern. In this thesis, we propose a set of three optimizations to the classical Redundant Multithreading (RMT) approach to allow faster soft error detection. First, we leverage the use of Simultaneous Multithreading (SMT) to collocate sibling replicated threads on the same physical core to efficiently exchange data to expose errors. Some HPC applications cannot fully exploit SMT for performance improvement and instead, we propose to use these additional resources for fault tolerance. Second, we present variable aggregation to group several values together and use this merged value to speed up detection of soft errors. Third, we introduce selective checking to decrease the number of checked values to a minimum. The last two techniques reduce the overall performance overhead by relaxing the soft error detection scope. Our experimental evaluation, executed on recent multicore processors with representative HPC benchmarks, proves that the use of SMT for fault tolerance can enhance RMT performance. It also shows that, at constant computing power budget, with optimizations applied, the overhead of the technique can be significantly lower than the classical RMT replicated execution. Furthermore, these results show that RMT can be a viable solution for soft-error detection at extreme scale

Repositorio Institucional del Instituto Tecnologico de Costa Rica

BatchQueue : file producteur / consommateur optimisée pour les multi-cœurs

Author: Folliot Bertil
Preud'Homme Thomas
Sopena Julien
Thomas Gaël
Publication venue: HAL CCSD
Publication date: 10/05/2011
Field of study

National audienceLes applications séquentielles peuvent tirer partie des systèmes multi-cœurs en utilisant le parallélisme pipeline pour accroître leur performance. Dans un tel schéma de parallélisme, l'accélération possible est limitée par le surcoût dû à la communication cœur à cœur. Ce papier présente l'algorithme BatchQueue, un système de communication rapide conçu pour optimiser l'utilisation du cache matériel, notamment au regard du pré-chargement. BatchQueue propose des performances améliorées d'un facteur 2 : il est capable d'envoyer un mot de données en 3,5 nanosecondes sur un système 64 bits, représentant un débit de 2 Gio/s

INRIA a CCSD electronic archive server

Using proxy design pattern for transparent redundant execution

Author: Öz Dündar
Öz Işıl
Öz Sinan
Publication venue: CEUR Workshop Proceedings
Publication date: 01/01/2018
Field of study

12th Turkish National Software Engineering Symposium, UYMS 2018; Istanbul; Turkey; 10 September 2018 through 12 September 2018In this study, we propose a transparent model for reliable execution of object-oriented software. We design a generic object-oriented programming tool for redundant software execution to provide the desired level of reliability against transient hardware faults. To achieve this, we utilize the Proxy design pattern which is one of the well-known GoF design patterns that are formed to make software systems exible and easy to maintain. Proxy design pattern provides a controlled access and a transparent mechanism for adding new functionalities to an existing object when accessing it. Combining the instruments of dynamic proxy and annotations in Java programming language, we present, Redundant- Caller, a generic, transparent, and con gurable tool for redundant execution and majority voting. Our tool takes any object and creates a dynamic proxy for it which executes the methods of the object multiple times in separate threads, and performs majority voting on the background, requiring minimum amount of change in the original user code. Thanks to annotations, users can con gure the redundant execution scheme methodwise. Our experiments demonstrate that our tool provides a signi cant level of reliability to any object-oriented software with a reasonable amount of performance degradation through multithreaded execution.Bu çalışsmada, nesneye yönelik programların güvenilir bir şekilde çalıştırılması için saydam bir model önermekteyiz. Geçici donanım hatalarıa karşı istenen seviyede güvenilirliği sağlayabilmek amacıyla artıklı (redundant) program çalıştıması için genel bir nesneye yönelik programlama araç tasarladık. Bunun için yazılım sistemlerini esnek ve kolay sürdürülebilir yapabilmek için oluşturulmuş ve yaygınca kullanılan GoF tasarım örüntülerinden biri olan vekil tasarım örünüsünü kullandık. Vekil tasarım örüntüsü, var olan bir nesneye erişirken ona yeni fonksiyonellikler eklemeye yarayan saydam bir düzenek ve kontrollü bir erişim sağlamaktadır. Java programlama dilindeki dinamik vekil ve annotation araçlarını birleştirerek, artıklı çalıştırma ve çoğunluk oylaması için genel, saydam ve yapılandırılabilir bir araç olan RedundantCaller'ı sunmaktayız. Aracımız, herhangi bir nesneyi alır ve özgün kullanıcı koduna en az miktarda değişiklik gerektirerek nesnenin metotlarını farklı iş parçacıkların da çoklu miktarda çalıştıran ve arka planda çoğunluk oylaması yapan bir dinamik vekil yaratır. annotationlar sayesinde, kullanıcılar artıklı çalıştırmayı metot seviyesinde yapılandırabilirler. Deneylerimiz göstermektedir ki; aracımız herhangi bir nesneye yönelik program için çok iş parçacıklı çalıştırma sayesinde makul bir performans düşüşüyle kayda değer bir güvenilirlik seviyesi sağlamaktadır.Ulusal Yüksek Başarılı Hesaplama Merkezi'nin (UHeM), (1005202018

Recommended from our members

Runtime asynchronous fault tolerance via speculation

Author: August David I
Ghosh Soumyadeep
Huang Jialu
Lee Jae W
Mahlke Scott A
Zhang Yun
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2012
Field of study

Transient faults are emerging as a critical reliability concern in modern microprocessors. Redundant hardware solutions are commonly deployed to detect transient faults, but they are less flexible and cost-effective than software solutions. However, software solutions are rendered impractical because of high performance overheads. To address this problem, this paper presents Runtime Asynchronous Fault Tolerance via Speculation (RAFT), the fastest transient fault detection technique known to date. Serving as a layer between the application and the underlying platform, RAFT automatically generates two symmetric program instances from a program binary. It detects transient faults in a non-invasive way and exploits high-confidence value speculation to achieve low runtime overhead. Evaluation on a commodity multicore system demonstrates that RAFT delivers a geomean performance overhead of 2.83% on a set of 30 SPEC CPU benchmarks and STAMP benchmarks. Compared with existing transient fault detection techniques, RAFT exhibits the best performance and fault coverage, without requiring any change to the hardware or the software applications

Princeton University Open Access Repository

Crossref

Parallel error detection using heterogeneous cores

Author: Ainsworth S
Jones TM
Publication venue: Proceedings - 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018
Publication date: 01/01/2018
Field of study

Microprocessor error detection is increasingly important, as the number of transistors in modern systems heightens their vulnerability. In addition, many modern workloads in domains such as the automotive and health industries are increasingly error intolerant, due to strict safety standards. However, current detection techniques require duplication of all hardware structures, causing a considerable increase in power consumption and chip area. Solutions in the literature involve running the code multiple times on the same hardware, which reduces performance significantly and cannot capture all errors. We have designed a novel hardware-only solution for error detection, that exploits parallelism in checking code which may not exist in the original execution. We pair a high-performance out-of-order core with a set of small low-power cores, each of which checks a portion of the out-of-order core's execution. Our system enables the detection of both hard and soft errors, with low area, power and performance overheads.This work was supported by the Engineering and Physical Sciences Research Council (EPSRC), through grant references EP/K026399/1 and EP/M506485/1, and Arm Ltd

Crossref

Apollo (Cambridge)

ParaMedic: Heterogeneous Parallel Error Correction

Author: Ainsworth S
Jones TM
Publication venue: Proceedings - 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2019
Publication date: 01/01/2019
Field of study

Processor error detection can be reduced in cost significantly by exploiting the parallelism that exists in a repeated copy of an execution, which may not exist in the original code, to split up the redundant work on a large number of small, highly efficient cores. However, such schemes don't provide a method for automatic error recovery. We develop ParaMedic, an architecture to allow efficient automatic correction of errors detected in a system by using parallel heterogeneous cores, to provide a full fail-safe system that does not propagate errors to other systems, and can recover without manual intervention. This uses logging to roll back any computation that occurred after a detected error, along with a set of techniques to provide error-checking parallelism while still preventing the escape of incorrect processor values in multicore environments, where ordering of individual processors' logs is not enough to be able to roll back execution. Across a set of single and multi-threaded benchmarks, we achieve 3.1\% and 1.5\% overhead respectively, compared with 1.9\% and 1\% for error detection alone.Arm Lt

Crossref

Apollo (Cambridge)

ESoftCheck: Removal of Non-vital Checks for Fault Tolerance

Author
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

Crossref

Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection

Author
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

Crossref