357 research outputs found

    Towards resilient EU HPC systems: A blueprint

    This document aims to spearhead a Europe-wide discussion on HPC system resilience and to help the European HPC community define best practices for resilience. We analyse a wide range of state-of-the-art resilience mechanisms and recommend the most effective approaches to employ in large-scale HPC systems. Our guidelines will be useful in the allocation of available resources, as well as in guiding researchers and research funding towards the enhancement of the resilience approaches with the highest priority and utility. Although our work focuses on the needs of next-generation HPC systems in Europe, the principles and evaluations are applicable globally. This work has received funding from the European Union's Horizon 2020 research and innovation programme under the projects ECOSCALE (grant agreement No 671632), EPI (grant agreement No 826647), EuroEXA (grant agreement No 754337), Eurolab4HPC (grant agreement No 800962), EVOLVE (grant agreement No 825061), EXA2PRO (grant agreement No 801015), ExaNest (grant agreement No 671553), ExaNoDe (grant agreement No 671578), EXDCI-2 (grant agreement No 800957), LEGaTO (grant agreement No 780681), MB2020 (grant agreement No 779877), RECIPE (grant agreement No 801137) and SDK4ED (grant agreement No 780572). The work was also supported by the European Commission's Seventh Framework Programme under the project CLERECO (grant agreement No 611404), the NCSA-Inria-ANL-BSC-JSC-Riken-UTK Joint Laboratory for Extreme Scale Computing (JLESC, https://jlesc.github.io/), the OMPI-X project (No ECP-2.3.1.17) and the Spanish Government through the Severo Ochoa programme (SEV-2015-0493). This work was sponsored in part by the U.S. Department of Energy's Office of Advanced Scientific Computing Research, program managers Robinson Pino and Lucy Nowell. This manuscript has been authored by UT-Battelle, LLC under Contract No DE-AC05-00OR22725 with the U.S. Department of Energy. Preprint

    GPU devices for safety-critical systems: a survey

    Graphics Processing Unit (GPU) devices and their associated software programming languages and frameworks can deliver the computing performance required to facilitate the development of next-generation high-performance safety-critical systems, such as autonomous driving systems. However, integrating complex, parallel, and computationally demanding software functions with different safety-criticality levels on GPU devices with shared hardware resources raises several safety certification challenges. This survey categorizes and provides an overview of research contributions that address GPU devices' random hardware failures, systematic failures, and independence of execution. This work has been partially supported by the European Research Council within Horizon 2020 (grant agreements No. 772773 and 871465), the Spanish Ministry of Science and Innovation under grant PID2019-107255GB, the HiPEAC Network of Excellence, and the Basque Government under grant KK-2019-00035. The Spanish Ministry of Economy and Competitiveness has also partially supported Leonidas Kosmidis with a Juan de la Cierva Incorporación postdoctoral fellowship (FJCI-2020-045931-I). Peer reviewed. Postprint (author's final draft)

    Understanding Soft Errors in Uncore Components

    The effects of soft errors in processor cores have been widely studied. However, little has been published about soft errors in uncore components, such as the memory subsystem and I/O controllers, of a System-on-a-Chip (SoC). In this work, we study how soft errors in uncore components affect system-level behaviors. We have created a new mixed-mode simulation platform that combines simulators at two different levels of abstraction and achieves a 20,000x speedup over RTL-only simulation. Using this platform, we present the first study of the system-level impact of soft errors inside various uncore components of a large-scale, multi-core SoC, using the industrial-grade, open-source OpenSPARC T2 SoC design. Our results show that soft errors in uncore components can significantly impact system-level reliability. We also demonstrate that uncore soft errors can create major challenges for traditional system-level checkpoint recovery techniques. To overcome such recovery challenges, we present a new replay recovery technique for uncore components belonging to the memory subsystem. For the L2 cache controller and the DRAM controller components of OpenSPARC T2, our new technique reduces the probability that an application run fails to produce correct results due to soft errors by more than 100x, with 3.32% and 6.09% chip-level area and power impact, respectively. Comment: to be published in Proceedings of the 52nd Annual Design Automation Conference
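    The replay idea can be illustrated with a minimal, hypothetical sketch. The class and parameter names below are invented for illustration and greatly simplify the paper's mechanism: a memory controller keeps its unretired writes in a small log, and when an error is detected in controller-held state, it re-executes the log instead of rolling the whole system back to a checkpoint.

```python
class ReplayDRAMController:
    """Toy sketch of replay recovery for a memory-subsystem component.

    Unretired writes sit in a small replay log. When a soft error is
    detected in controller-held state before those writes retire, the
    controller re-executes the log instead of rolling the whole system
    back to a checkpoint.
    """

    def __init__(self, log_depth=8):
        self.dram = {}        # address -> data (the "device" state)
        self.log = []         # unretired (addr, data) writes
        self.log_depth = log_depth

    def write(self, addr, data):
        if len(self.log) == self.log_depth:
            self.log.pop(0)   # oldest entry is assumed safely retired
        self.log.append((addr, data))
        self.dram[addr] = data

    def inject_soft_error(self, addr, bit):
        self.dram[addr] ^= 1 << bit   # single-bit flip in held state

    def recover(self):
        # Replay the logged writes; any corrupted, unretired state
        # is simply rewritten from the log.
        for addr, data in self.log:
            self.dram[addr] = data

ctrl = ReplayDRAMController()
ctrl.write(0x40, 0xDEAD)
ctrl.inject_soft_error(0x40, 3)   # simulated particle strike
assert ctrl.dram[0x40] != 0xDEAD
ctrl.recover()                    # detection assumed; replay fixes state
assert ctrl.dram[0x40] == 0xDEAD
```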

    Variation Analysis, Fault Modeling and Yield Improvement of Emerging Spintronic Memories


    Row-Hammering Prevention and Main-Memory Performance Improvement Using Time Window Counters

    Doctoral dissertation (Ph.D.), Seoul National University, Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Intelligent Convergence Systems), August 2020. Advisor: Jung Ho Ahn. Computer systems using DRAM are exposed to row-hammer (RH) attacks, which can flip data in a DRAM row without directly accessing it, simply by frequently activating its adjacent rows. A number of proposals to prevent RH exist, spanning both probabilistic and deterministic solutions. However, the probabilistic solutions cannot detect attacks and have a non-zero probability of failing to protect, while existing counter-based deterministic solutions either incur large area overhead or suffer noticeable performance drops on adversarial memory access patterns. To overcome these challenges, we propose a new counter-based RH prevention solution named Time Window Counter (TWiCe) based row refresh, which accurately detects potential RH attacks using only a small number of counters and with minimal performance impact. We first make the key observation that the number of rows that can cause RH is limited by the maximum row activation frequency and the DRAM cell retention time. From this, we calculate the maximum number of counter entries required per DRAM bank, with which TWiCe prevents RH with a strong deterministic guarantee. TWiCe incurs no performance overhead on normal DRAM operations and less than 0.7% area and energy overheads over contemporary DRAM devices. Our evaluation shows that TWiCe adds no more than 0.006% additional DRAM row activations for adversarial memory access patterns, including RH attack scenarios. To reduce the area and energy overhead further, we propose threshold-adjusted rank-level TWiCe. We first introduce pseudo-associative TWiCe (pa-TWiCe), which can search hundreds of TWiCe table entries energy-efficiently. Building on the pa-TWiCe structure, we then propose rank-level TWiCe, which reduces the number of required entries further by managing the table at rank granularity. We also adjust the thresholds of TWiCe to shrink the table without increasing false-positive detections on general workloads. Finally, we propose extending TWiCe into a hot-page detector to improve main-memory performance. The TWiCe table already holds the addresses of recently and frequently activated rows, and temporal locality makes those rows likely to be activated again. We show how hot-page detection in TWiCe can be combined with a DRAM page-swap methodology to reduce DRAM latency for hot pages. Our evaluation shows that a low-latency DRAM using TWiCe achieves up to 12.2% IPC improvement over a baseline DDR4 device for multi-threaded workloads.
    Contents:
    1 Introduction
      1.1 Time Window Counter Based Row Refresh to Prevent Row-hammering
      1.2 Optimizing Time Window Counter
      1.3 Using Time Window Counters to Improve Main Memory Performance
      1.4 Outline
    2 Background of DRAM and Row-hammering
      2.1 DRAM Device Organization
      2.2 Sparing DRAM Rows to Combat Reliability Challenges
      2.3 Main Memory Subsystem Organization and Operation
      2.4 Row-hammering (RH)
      2.5 Previous RH Prevention Solutions
      2.6 Limitations of the Previous RH Solutions
    3 TWiCe: Time Window Counter based RH Prevention
      3.1 TWiCe: Time Window Counter
      3.2 Proof of RH Prevention
      3.3 Counter Table Size
      3.4 Architecting TWiCe
        3.4.1 Location of TWiCe Table
        3.4.2 Augmenting DRAM Interface with a New Adjacent Row Refresh (ARR) Command
      3.5 Analysis
      3.6 Evaluation
    4 Optimizing TWiCe to Reduce Implementation Cost
      4.1 Pseudo-associative TWiCe
      4.2 Rank-level TWiCe
      4.3 Adjusting Threshold to Reduce Table Size
      4.4 Analysis
      4.5 Evaluation
    5 Augmenting TWiCe for Hot-page Detection
      5.1 Necessity of Counters for Detecting Hot Pages
      5.2 Previous Studies on Migration for Asymmetric Low-latency DRAM
      5.3 Extending TWiCe for Dynamic Hot-page Detection
      5.4 Additional Components and Methodology
      5.5 Analysis and Evaluation
        5.5.1 Overhead Analysis
        5.5.2 Evaluation
    6 Conclusion
      6.1 Future work
    Bibliography
    Abstract in Korean
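    The core bookkeeping can be sketched in a few lines. The sketch below is a simplification under illustrative assumptions: the threshold values are invented (the dissertation derives them from DRAM timing parameters and cell retention time), and the real TWiCe tracks each entry's age across windows and scales its pruning threshold accordingly, which is collapsed here into a single-window check.

```python
class TWiCeTable:
    """Minimal sketch of a time-window activation counter table.

    Not the full TWiCe logic: thresholds are illustrative, and per-entry
    life counters across windows are omitted.
    """

    def __init__(self, rh_threshold=32768, prune_threshold=4, window_acts=1024):
        self.rh_threshold = rh_threshold        # activations that trigger a neighbor refresh
        self.prune_threshold = prune_threshold  # minimum per-window count to stay tracked
        self.window_acts = window_acts          # activations between pruning checkpoints
        self.counters = {}                      # row -> activation count
        self.acts_in_window = 0

    def on_activate(self, row):
        self.counters[row] = self.counters.get(row, 0) + 1
        arr_needed = self.counters[row] >= self.rh_threshold
        if arr_needed:
            self.counters[row] = 0              # neighbors refreshed, restart the count
        self.acts_in_window += 1
        if self.acts_in_window == self.window_acts:
            self._prune()
        return arr_needed                       # caller issues an Adjacent Row Refresh (ARR)

    def _prune(self):
        # Rows activated too rarely in a window can never reach the RH
        # threshold before their victims are refreshed, so stop tracking
        # them; this is what keeps the counter table small.
        self.counters = {r: c for r, c in self.counters.items()
                         if c >= self.prune_threshold}
        self.acts_in_window = 0
```

    Pruning is the reason only a small number of counters suffice: any row whose activation rate stays below the pruning threshold is provably unable to reach the RH threshold within a retention window, so it never needs an entry.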

    Efficient scrub mechanisms for error-prone emerging memories

    Many memory cell technologies are being considered as possible replacements for DRAM and Flash, both of which are nearing their scaling limits. While these new cells (PCM, STT-RAM, FeRAM, etc.) promise high density, better scaling, and non-volatility, they introduce new challenges. Solutions at the architecture level can help address some of these problems; e.g., prior research has proposed wear-leveling and hard-error tolerance mechanisms to overcome the limited write endurance of PCM cells. In this paper, we focus on the soft error problem in PCM, a topic that has received little attention in the architecture community. Soft errors in DRAM memories are typically addressed by SECDED support and a scrub mechanism. The scrub mechanism scans the memory looking for single-bit errors and corrects them before a line experiences a second, uncorrectable error. However, PCM (and other emerging memories) is prone to new sources of soft errors. In particular, multi-level cell (MLC) PCM devices suffer from resistance drift, which increases the soft error rate and incurs high overheads for the scrub mechanism. This paper is the first to study the design of architectural scrub mechanisms, especially ones tailored to the drift phenomenon in MLC PCM. Many of our solutions also apply to other soft-error-prone emerging memories. We first show that scrub overheads can be reduced with support for strong ECC codes and a lightweight error detection operation. We then design different scrub algorithms that can adaptively trade off soft and hard errors. Using an approach that combines all proposed solutions, our scrub mechanism yields a 96.5% reduction in uncorrectable errors, a 24.4x decrease in scrub-related writes, and a 37.8% reduction in scrub energy, relative to a basic scrub algorithm used in modern DRAM systems.
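    To make the scrub mechanics concrete, here is a minimal, self-contained sketch of one scrub pass over codewords protected by a toy Hamming(7,4) single-error-correcting code. This is purely illustrative: the paper evaluates SECDED and stronger ECC on real memory line sizes, and the write counter below just mirrors the kind of metric (scrub-related writes) the paper optimizes.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit SEC codeword (parity at positions 1, 2, 4)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p4 = d[1] ^ d[2] ^ d[3]
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]   # codeword positions 1..7
    return sum(b << i for i, b in enumerate(bits))

def hamming74_scrub(cw):
    """Return (corrected codeword, had_error); the syndrome names the flipped position."""
    b = [(cw >> i) & 1 for i in range(7)]
    syndrome = ((b[0] ^ b[2] ^ b[4] ^ b[6])
                | (b[1] ^ b[2] ^ b[5] ^ b[6]) << 1
                | (b[3] ^ b[4] ^ b[5] ^ b[6]) << 2)
    if syndrome:
        cw ^= 1 << (syndrome - 1)                 # flip the erroneous bit back
    return cw, bool(syndrome)

# One scrub pass: scan every line, correct single-bit upsets, and write back
# only on error -- writes are what scrub algorithms for PCM try to minimize.
memory = [hamming74_encode(n) for n in (0b1010, 0b0110, 0b1111)]
memory[1] ^= 1 << 2                               # inject one soft error
scrub_writes = 0
for i, cw in enumerate(memory):
    fixed, had_error = hamming74_scrub(cw)
    if had_error:
        memory[i] = fixed
        scrub_writes += 1
print(scrub_writes)                               # -> 1
```

    Counting scrub-induced writes matters because in PCM every write costs energy and endurance, which is why adaptive scrub algorithms aim to correct drift-induced errors with as few write-backs as possible.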

    Exploiting Natural On-chip Redundancy for Energy Efficient Memory and Computing

    Power density is currently the primary design constraint across most computing segments and the main performance-limiting factor. For years, industry kept power density constant while increasing frequency and lowering transistor supply (Vdd) and threshold (Vth) voltages. However, Vth scaling has stopped because leakage current depends exponentially on it. Transistor count and integration density keep doubling every process generation (Moore's Law), but the power budget caps the amount of hardware that can be active at the same time, leading to dark silicon. With each new generation there are more resources available, but we cannot fully exploit their performance potential. In recent years, different research trends have explored how to cope with dark silicon and unlock the energy efficiency of chips, including Near-Threshold voltage Computing (NTC) and approximate computing. NTC aggressively lowers Vdd to values near Vth. This allows a substantial reduction in power, as dynamic power scales quadratically with supply voltage. The resulting power reduction could be used to activate more chip resources and potentially achieve performance improvements. Unfortunately, Vdd scaling is limited by the tight functionality margins of on-chip SRAM transistors: when scaling Vdd down to near-threshold values, manufacture-induced parameter variations affect the functionality of SRAM cells, which eventually become unreliable. Many emerging applications, on the other hand, are intrinsically error-resilient and tolerate a certain amount of noise. In this context, approximate computing exploits the gap between the level of accuracy required by the application and the level of accuracy delivered by the computation, provided that reducing the accuracy translates into an energy gain. However, deciding which instructions and data, and which techniques, are best suited for approximation still poses a major challenge. This dissertation contributes in these two directions. First, it proposes a new approach to mitigate the impact of SRAM failures due to parameter variation for effective operation at ultra-low voltages. We identify two levels of natural on-chip redundancy: cache level and content level. The former arises from the replication of blocks across multi-level cache hierarchies. We exploit this redundancy with a cache management policy that allocates blocks to entries taking into account the nature of the cache entry and the use pattern of the block. This policy obtains performance improvements between 2% and 34% over block disabling, a technique of similar complexity, while incurring no additional storage overhead. The latter, content-level redundancy, arises from the redundancy of data in real-world applications. We exploit it by compressing cache blocks so that they fit into partially functional cache entries. At the cost of a slight overhead increase, we obtain performance within 2% of a cache built with fault-free cells, even when more than 90% of the cache entries have at least one faulty cell. Then, we analyze how the intrinsic noise tolerance of emerging applications can be exploited to design an approximate Instruction Set Architecture (ISA). Exploiting this redundancy, we explore a set of techniques to approximate the execution of instructions across a set of emerging applications, pointing out the potential for reducing the complexity of the ISA and the trade-offs of the approach. In a proof-of-concept implementation, the ISA is shrunk in two dimensions: breadth (i.e., simplifying instructions) and depth (i.e., dropping instructions). This proof of concept shows that energy can be reduced by 20.6% on average, at around 14.9% accuracy loss.
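    The content-level idea can be sketched as a simple placement check. The sketch below assumes, purely for illustration, that a compressed block is stored contiguously from byte 0 of an entry and that each entry carries a per-byte fault map; the dissertation's actual policy also weighs the cache level and the block's reuse pattern.

```python
def fits(fault_map, compressed_size):
    """A compressed block fits an entry if its bytes land only on functional cells.

    fault_map is a per-byte fault bitmap for one cache entry; here we require
    the first `compressed_size` bytes to be fault-free (a simplification of
    the placement rule).
    """
    return all(not fault_map[i] for i in range(compressed_size))

def allocate(block_compressed_size, ways_fault_maps):
    """Pick the first partially faulty entry the compressed block fits in;
    returning None corresponds to the block-disabling fallback."""
    for way, fault_map in enumerate(ways_fault_maps):
        if fits(fault_map, block_compressed_size):
            return way
    return None

# Entry 0 has a fault in byte 12, entry 1 in byte 2 (64-byte lines).
ways = [[i == 12 for i in range(64)], [i == 2 for i in range(64)]]
print(allocate(10, ways))   # -> 0: 10 compressed bytes fit before the faulty byte
print(allocate(20, ways))   # -> None: no entry offers 20 leading fault-free bytes
```

    The design point this illustrates is that compression turns a faulty entry from unusable (block disabling) into usable for most blocks, which is why performance stays within 2% of a fault-free cache even at high fault rates.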