17 research outputs found

    Improving redundant multithreading performance for soft-error detection in HPC applications

    Graduate thesis (Master's in Computer Science), Instituto Tecnológico de Costa Rica, Escuela de Computación, 2018. As HPC systems move towards extreme scale, soft errors leading to silent data corruptions become a major concern. In this thesis, we propose a set of three optimizations to the classical Redundant Multithreading (RMT) approach to allow faster soft-error detection. First, we leverage Simultaneous Multithreading (SMT) to collocate sibling replicated threads on the same physical core so that they can efficiently exchange data to expose errors. Some HPC applications cannot fully exploit SMT for performance improvement; we propose to use these additional resources for fault tolerance instead. Second, we present variable aggregation, which groups several values together and uses the merged value to speed up detection of soft errors. Third, we introduce selective checking to decrease the number of checked values to a minimum. The last two techniques reduce the overall performance overhead by relaxing the soft-error detection scope. Our experimental evaluation, executed on recent multicore processors with representative HPC benchmarks, demonstrates that the use of SMT for fault tolerance can enhance RMT performance. It also shows that, at a constant computing power budget and with the optimizations applied, the overhead of the technique can be significantly lower than that of classical RMT replicated execution. Furthermore, these results show that RMT can be a viable solution for soft-error detection at extreme scale.
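The variable aggregation idea in the abstract above can be illustrated with a minimal sketch. It is not the thesis's implementation: the `kernel` function, the use of a plain sum as the merged value, and the sequential execution of the two replicas (instead of sibling SMT threads) are all assumptions made for illustration.

```python
from typing import Iterable, List


def kernel(xs: Iterable[int]) -> List[int]:
    """Hypothetical computation that both replicas execute."""
    return [x * x + 1 for x in xs]


def check_aggregated(lead: List[int], trail: List[int], group: int) -> List[int]:
    """Compare two replicas group by group; return indices of mismatching groups.

    Each group of `group` values is merged into one value (here: a sum, an
    assumption), so only one comparison is made per group instead of one
    per value.
    """
    mismatches = []
    for g, start in enumerate(range(0, len(lead), group)):
        if sum(lead[start:start + group]) != sum(trail[start:start + group]):
            mismatches.append(g)
    return mismatches


if __name__ == "__main__":
    data = list(range(8))
    leading = kernel(data)    # replica 1
    trailing = kernel(data)   # replica 2 (run sequentially in this sketch)
    trailing[5] ^= 1 << 3     # simulate a single bit flip in one value
    print(check_aggregated(leading, trailing, group=4))
```

Merging values trades coverage for speed: two errors in the same group that cancel out in the sum would go unnoticed, which is one way to read the abstract's remark that the technique relaxes the detection scope.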

    BatchQueue : file producteur / consommateur optimisée pour les multi-cœurs

    National audience. Sequential applications can take advantage of multi-core systems by using pipeline parallelism to increase their performance. In such a parallelism scheme, the achievable speedup is limited by the overhead of core-to-core communication. This paper presents the BatchQueue algorithm, a fast communication system designed to optimize the use of the hardware cache, notably with respect to prefetching. BatchQueue improves performance by a factor of 2: it can send one data word in 3.5 nanoseconds on a 64-bit system, which represents a throughput of 2 GiB/s.

    Using proxy design pattern for transparent redundant execution

    12th Turkish National Software Engineering Symposium, UYMS 2018; Istanbul; Turkey; 10 September 2018 through 12 September 2018. In this study, we propose a transparent model for reliable execution of object-oriented software. We design a generic object-oriented programming tool for redundant software execution to provide the desired level of reliability against transient hardware faults. To achieve this, we utilize the Proxy design pattern, one of the well-known GoF design patterns, which were formulated to make software systems flexible and easy to maintain. The Proxy design pattern provides controlled access and a transparent mechanism for adding new functionality to an existing object when accessing it. Combining dynamic proxies and annotations in the Java programming language, we present RedundantCaller, a generic, transparent, and configurable tool for redundant execution and majority voting. Our tool takes any object and creates a dynamic proxy for it, which executes the methods of the object multiple times in separate threads and performs majority voting in the background, requiring a minimal amount of change in the original user code. Thanks to annotations, users can configure the redundant execution scheme per method. Our experiments demonstrate that our tool provides a significant level of reliability to any object-oriented software with a reasonable amount of performance degradation through multithreaded execution. Ulusal Yüksek Başarılı Hesaplama Merkezi (UHeM, National Center for High Performance Computing), (1005202018
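The proxy-plus-voting mechanism described in this abstract can be sketched as follows. This is a Python analogue, not the authors' Java tool: the class name `RedundantProxy`, the `copies` parameter, and the use of `__getattr__` in place of Java's dynamic proxies and annotations are assumptions for illustration. The sketch also assumes that the wrapped methods are safe to call repeatedly and return hashable values.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class RedundantProxy:
    """Hypothetical proxy that runs each method call several times and votes."""

    def __init__(self, target, copies=3):
        self._target = target
        self._copies = copies

    def __getattr__(self, name):
        # Only reached for names not found on the proxy itself,
        # so every such access is forwarded to the wrapped object.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def voted(*args, **kwargs):
            # Execute the same call `copies` times in separate threads.
            with ThreadPoolExecutor(max_workers=self._copies) as pool:
                futures = [pool.submit(attr, *args, **kwargs)
                           for _ in range(self._copies)]
                results = [f.result() for f in futures]
            # Majority voting: accept a value only if more than half agree.
            value, votes = Counter(results).most_common(1)[0]
            if votes <= self._copies // 2:
                raise RuntimeError("no majority among redundant executions")
            return value

        return voted
```

Calling code keeps its shape: an object `obj` is wrapped once as `RedundantProxy(obj)` and its methods are then invoked as before, which mirrors the transparency goal stated in the abstract.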

    Parallel error detection using heterogeneous cores

    Microprocessor error detection is increasingly important, as the number of transistors in modern systems heightens their vulnerability. In addition, many modern workloads in domains such as the automotive and health industries are increasingly error intolerant, due to strict safety standards. However, current detection techniques require duplication of all hardware structures, causing a considerable increase in power consumption and chip area. Solutions in the literature involve running the code multiple times on the same hardware, which reduces performance significantly and cannot capture all errors. We have designed a novel hardware-only solution for error detection that exploits parallelism in the checking code which may not exist in the original execution. We pair a high-performance out-of-order core with a set of small low-power cores, each of which checks a portion of the out-of-order core's execution. Our system enables the detection of both hard and soft errors, with low area, power and performance overheads. This work was supported by the Engineering and Physical Sciences Research Council (EPSRC), through grant references EP/K026399/1 and EP/M506485/1, and Arm Ltd.
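The abstract describes a hardware-only design, so the following is only a software analogy of its central observation: once the main execution has recorded its state at segment boundaries, each segment can be re-executed and checked independently of the others. The `step` function, the checkpoint layout, and the thread pool standing in for the small checker cores are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def step(state: int) -> int:
    """Hypothetical deterministic unit of work."""
    return (state * 1103515245 + 12345) % (2 ** 31)


def run_main(start: int, steps: int, segment: int) -> List[int]:
    """Main execution: run sequentially, recording state at segment boundaries.

    Assumes `steps` is a multiple of `segment`.
    """
    checkpoints = [start]
    state = start
    for i in range(1, steps + 1):
        state = step(state)
        if i % segment == 0:
            checkpoints.append(state)
    return checkpoints


def check_segment(begin: int, expected_end: int, segment: int) -> bool:
    """Checker: re-execute one segment from its start checkpoint and compare."""
    state = begin
    for _ in range(segment):
        state = step(state)
    return state == expected_end


def check_all(checkpoints: List[int], segment: int, checkers: int = 4) -> List[int]:
    """Check every segment independently; return indices of failing segments."""
    pairs = list(zip(checkpoints, checkpoints[1:]))
    with ThreadPoolExecutor(max_workers=checkers) as pool:
        ok = list(pool.map(lambda p: check_segment(p[0], p[1], segment), pairs))
    return [i for i, good in enumerate(ok) if not good]
```

The main run is inherently sequential, but the checks depend only on the recorded boundary states, which is the parallelism "which may not exist in the original execution" that the abstract refers to.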

    ParaMedic: Heterogeneous Parallel Error Correction

    The cost of processor error detection can be reduced significantly by exploiting the parallelism that exists in a repeated copy of an execution, which may not exist in the original code, to split up the redundant work across a large number of small, highly efficient cores. However, such schemes do not provide a method for automatic error recovery. We develop ParaMedic, an architecture that allows efficient automatic correction of detected errors by using parallel heterogeneous cores, to provide a full fail-safe system that does not propagate errors to other systems and can recover without manual intervention. ParaMedic uses logging to roll back any computation that occurred after a detected error, along with a set of techniques that provide error-checking parallelism while still preventing the escape of incorrect processor values in multicore environments, where ordering of individual processors' logs is not enough to roll back execution. Across a set of single-threaded and multi-threaded benchmarks, we achieve 3.1% and 1.5% overhead respectively, compared with 1.9% and 1% for error detection alone. Arm Lt

    ESoftCheck: Removal of Non-vital Checks for Fault Tolerance


    Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection
