9 research outputs found

    Designing Low Cost Error Correction Schemes for Improving Memory Reliability

    Memory systems are becoming increasingly error-prone, so guaranteeing their reliability is a major challenge. This dissertation presents new techniques to improve the reliability of both 2D and 3D dynamic random access memory (DRAM) systems. The proposed schemes achieve higher reliability than current systems with lower power, better performance, and lower hardware cost. First, a low-overhead solution that improves the reliability of commodity DRAM systems with no change to the existing memory architecture is presented. Specifically, five erasure and error correction (E-ECC) schemes are proposed that provide at least Chipkill-Correct protection for x4 (Schemes 1, 2 and 3), x8 (Scheme 4) and x16 (Scheme 5) DRAM systems. All schemes have superior error correction performance due to the use of strong symbol-based codes, and the use of erasure codes extends the lifetime of 2D DRAM systems. Next, two error correction schemes are presented for 3D DRAM systems. The first, a rate-adaptive, two-tiered error correction scheme (RATT-ECC), provides strong reliability (a 10^10x reduction in raw FIT rate) for an HBM-like 3D DRAM system that services CPU applications. The rate-adaptive feature of RATT-ECC enables permanent bank failures to be handled through sparing; it can also be used to significantly reduce refresh power consumption without degrading reliability or timing performance. The second, a two-tiered error correction scheme (Config-ECC), supports different-sized accesses in GPU applications with strong reliability. It addresses the mismatch between data access size and a fixed-size ECC scheme with a flexible, product-code-based design: Config-ECC is built around a core unit designed for 32B accesses, with a simple extension to support 64B and 128B accesses. Compared to fixed 32B and 64B ECC schemes, Config-ECC reduces the failure-in-time (FIT) rate by 200x and 20x, respectively. It also reduces memory energy by 17% (in the dynamic mode) and 21% (in the static mode) compared to a state-of-the-art fixed 64B ECC scheme.
    Doctoral Dissertation, Electrical Engineering, 201
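
    As a hedged illustration of why marking known-bad chips as erasures extends lifetime, the property the E-ECC schemes above exploit, the sketch below checks the standard symbol-code budget: with n - k redundant symbols, a mix of e symbol errors and f symbol erasures is correctable when 2e + f <= n - k. The parameters are hypothetical and do not reproduce the dissertation's actual constructions.

        # Minimal sketch of a symbol-based (Reed-Solomon-like) correction
        # budget; parameters are hypothetical, for illustration only.
        def correctable(n: int, k: int, errors: int, erasures: int) -> bool:
            """True if the error/erasure mix fits in n - k redundant symbols."""
            return 2 * errors + erasures <= n - k

        # A hypothetical codeword with n=18 symbols, k=14 of them data:
        print(correctable(18, 14, errors=2, erasures=0))  # True
        print(correctable(18, 14, errors=2, erasures=1))  # False: over budget
        # A known-failed chip marked as an erasure costs half as much, so the
        # same code keeps working as chips wear out and are flagged:
        print(correctable(18, 14, errors=1, erasures=2))  # True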

    Dvé: Improving DRAM reliability and performance on-demand via coherent replication


    Exploring Error Bits for Memory Failure Prediction: An In-Depth Correlative Study

    In large-scale datacenters, memory failure is a common cause of server crashes, with Uncorrectable Errors (UEs) being a major indicator of Dual Inline Memory Module (DIMM) defects. Existing approaches primarily focus on predicting UEs from Correctable Errors (CEs) without fully exploiting the information carried by individual error bits, even though error bit patterns correlate strongly with the occurrence of UEs. In this paper, we present a comprehensive study of the correlation between CEs and UEs, emphasizing the importance of spatio-temporal error bit information. Our analysis reveals a strong correlation between spatio-temporal error bits and UE occurrence. Through evaluations on real-world datasets, we demonstrate that our approach significantly improves prediction performance, by 15% in F1-score, compared to state-of-the-art algorithms. Overall, our approach reduces the number of virtual machine interruptions caused by UEs by approximately 59%.
    Comment: Published at ICCAD 202
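
    As a hedged sketch of the kind of pipeline the paper evaluates, not the authors' actual model, the snippet below trains a standard classifier on hypothetical spatio-temporal error-bit features extracted per DIMM from CE logs and scores it with F1. The feature names, data, and model choice are all assumptions.

        # Hypothetical UE-prediction sketch: a classifier over invented
        # spatio-temporal error-bit features, reported with F1.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import f1_score
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        # Assumed per-DIMM features: [distinct error bits seen, max error
        # bits in one CE, bits recurring over time (temporal), bits sharing
        # a DQ pin (spatial)] -- stand-ins for the paper's real features.
        X = rng.random((1000, 4))
        y = (X[:, 2] + X[:, 3] > 1.2).astype(int)  # toy "UE occurred" label

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
        print("F1:", f1_score(y_te, model.predict(X_te)))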

    Architectural Techniques to Enable Reliable and Scalable Memory Systems

    High-capacity, scalable memory systems play a vital role in enabling our desktops, smartphones, and pervasive technologies like the Internet of Things (IoT). Unfortunately, memory systems are becoming increasingly prone to faults: we rely on technology scaling to improve memory density, and at small feature sizes memory cells tend to break easily. Today, memory reliability is seen as the key impediment to using high-density devices, adopting new technologies, and even building the next Exascale supercomputer. To ensure even a bare-minimum level of reliability, present-day solutions tend to have high performance, power, and area overheads. Ideally, memory systems should remain robust, scalable, and implementable while keeping these overheads to a minimum. This dissertation describes how simple cross-layer architectural techniques can provide orders-of-magnitude higher reliability and enable seamless scalability for memory systems while incurring negligible overheads.
    Comment: PhD thesis, Georgia Institute of Technology (May 2017)

    Enabling Recovery of Secure Non-Volatile Memories

    Emerging non-volatile memories (NVMs), such as phase change memory (PCM), spin-transfer torque RAM (STT-RAM) and resistive RAM (ReRAM), have dual memory-storage characteristics and are therefore strong candidates to replace or augment current DRAM and secondary storage devices. The newly released Intel 3D XPoint persistent memory and Optane SSD series have shown promising features. However, when these new devices are exposed to events such as power loss, many issues arise when data recovery is expected. In this dissertation, I devised multiple schemes to enable secure data recovery for emerging NVM technologies when memory encryption is used. The data remanence of NVMs makes physical attacks easier; hence, emerging NVMs are typically paired with encryption. In particular, counter-mode encryption is commonly used due to its performance and security advantages over other schemes (e.g., electronic codebook encryption). However, enabling data recovery after power failure requires recovering the security metadata associated with data blocks. Naively writing security metadata updates along with data for each operation exacerbates the write endurance problem of NVMs, which have limited write endurance and very slow write operations. Therefore, it is necessary to enable the recovery of data and security metadata (encryption counters) without incurring a significant number of extra writes. The first work of this dissertation presents Osiris, a novel mechanism that repurposes the error correcting code (ECC) co-located with data to enable recovery of encryption counters by additionally serving as a sanity check for the encryption counter used. By using a stop-loss mechanism with a limited number of trials, the ECC can identify which encryption counter was used most recently to encrypt the data and hence allow correct decryption and recovery. This work also explores how different stop-loss parameters, along with optimizations of Osiris, can reduce the number of writes. Overall, Osiris enables the recovery of encryption counters while achieving better performance and fewer writes than a conventional write-back caching scheme for encryption counters, which lacks the ability to recover them. In the second work, the Osiris implementation is extended to different counter-mode memory encryption schemes through an epoch-based approach that periodically persists updated counters. When a crash occurs, counters are recovered through test-and-verification, which identifies the correct counter within one epoch. The resulting scheme, Osiris-Global, incurs minimal performance and write overheads while enabling the recovery of encryption counters. In summary, the findings of this PhD work enable the recovery of secure NVM systems, allowing persistent applications to leverage the persistency features of NVMs while minimizing the number of writes required to meet the crash-consistency requirements of secure NVM systems.
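
    Below is a minimal sketch, with all cryptography and ECC stubbed out, of the test-and-verify idea described above: starting from the last persisted counter, candidate counters within one epoch are tried, and the ECC co-located with the data serves as the sanity check that identifies the counter actually used. All helper names are hypothetical, not Osiris's real interfaces.

        # Hypothetical sketch of Osiris-style counter recovery: try candidate
        # counters within one epoch; the per-block ECC doubles as a sanity
        # check so only the correct counter yields a decodable plaintext.
        import hashlib

        def pad(key: bytes, addr: int, counter: int) -> bytes:
            """Stand-in for the AES counter-mode pad (a hash, for illustration)."""
            return hashlib.sha256(key + addr.to_bytes(8, "big")
                                  + counter.to_bytes(8, "big")).digest()

        def ecc_ok(data: bytes, ecc: bytes) -> bool:
            """Stand-in for the real ECC decoder's pass/fail check."""
            return hashlib.sha256(data).digest()[:4] == ecc

        def recover_counter(key, addr, ciphertext, ecc, last_persisted, epoch):
            # Stop-loss: at most `epoch` trials before giving up.
            for candidate in range(last_persisted, last_persisted + epoch):
                data = bytes(c ^ p for c, p in zip(ciphertext, pad(key, addr, candidate)))
                if ecc_ok(data, ecc):
                    return candidate, data  # counter found, decryption verified
            return None, None

        # Toy usage: encrypt with counter 7, persist only counter 4, recover.
        key, addr, ctr = b"k" * 16, 0x1000, 7
        plain = b"persistent data!"
        ct = bytes(a ^ b for a, b in zip(plain, pad(key, addr, ctr)))
        ecc = hashlib.sha256(plain).digest()[:4]
        print(recover_counter(key, addr, ct, ecc, last_persisted=4, epoch=8))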

    Co-designing reliability and performance for datacenter memory

    Memory is one of the key components that affect the reliability and performance of datacenter servers. Memory in today's servers is organized and shared in several ways to provide the most performant and efficient access to data: cache hierarchies in multi-core chips reduce access latency, non-uniform memory access (NUMA) in multi-socket servers improves scalability, and disaggregation increases memory capacity. In all these organizations, hardware coherence protocols maintain the consistency of shared memory and implicitly move data to the requesting cores. This thesis aims to provide fault tolerance against newer failure models in the organization of datacenter-server memory, exploring solutions that also enhance application performance; the solutions build on modern coherence protocols to achieve both properties. First, we observe that DRAM memory system failure rates have increased, demanding stronger forms of memory reliability. To combat this, the thesis proposes Dvé, a hardware-driven replication mechanism in which data blocks are replicated across two different memory controllers in a cache-coherent NUMA system. Data blocks are accompanied by a code with strong error-detection capabilities, so that when an error is detected, correction is performed using the replica. Dvé's organization offers two independent points of access to data, which enables (a) strong error correction that can recover from a range of faults affecting any of the components in the memory, and (b) higher performance by providing another, nearer point of memory access. Dvé's coherent replication keeps the replicas in sync for reliability and also provides coherent access to read replicas during fault-free operation for improved performance; these benefits can be provided flexibly, on demand, at runtime. Next, we observe that the coherence protocol itself must be hardened against failures. Memory in datacenter servers is being disaggregated from the compute servers into dedicated memory servers, driven by standards like CXL. CXL specifies the coherence protocol semantics for compute servers to access and cache data from a shared region in the disaggregated memory. However, the CXL specification lacks the fault tolerance necessary to operate at an inter-server scale within the datacenter: compute servers can fail or become unresponsive, so the coherence protocol must remain available in the presence of such failures. The thesis proposes Āpta, a CXL-based shared disaggregated memory system that keeps cached data consistent without compromising availability in the face of compute-server failures. Āpta architects a high-performance, fault-tolerant, object-granular memory server that significantly improves performance for stateless function-as-a-service (FaaS) datacenter applications.
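
    A toy sketch of the read path that Dvé's coherent replication implies: a block is fetched from the nearer replica, verified with a strong error-detection code, and corrected from the second replica when a fault is detected. The class layout and the CRC stand-in are assumptions for illustration, not Dvé's actual hardware design.

        # Hypothetical sketch of a Dvé-style read: two coherent replicas
        # behind different memory controllers, a strong detection code per
        # block, and correction by falling back to the surviving replica.
        import zlib

        class Replica:
            def __init__(self, data: bytes):
                self.data = bytearray(data)
                self.check = zlib.crc32(data)  # stand-in detection code

            def read(self):
                return bytes(self.data), self.check

        def dve_read(primary: Replica, secondary: Replica) -> bytes:
            data, check = primary.read()       # nearer point of access
            if zlib.crc32(data) == check:
                return data                    # fault-free fast path
            data, check = secondary.read()     # error detected: use the replica
            assert zlib.crc32(data) == check, "both replicas faulty"
            # Repair the primary so future reads take the fast path again.
            primary.data, primary.check = bytearray(data), check
            return data

        block = b"cache line 0123456789abcdef"
        prim, repl = Replica(block), Replica(block)
        prim.data[3] ^= 0xFF                   # inject a fault in the primary
        print(dve_read(prim, repl) == block)   # True: corrected via replica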

    Adaptive reliability chipkill correct (ARCC)

    Chipkill correct is an advanced type of memory error correction that is popular among servers. Large field studies of memories have shown that chipkill correct reduces the uncorrectable error rate by 4x to 36x compared to the weaker Single Error Correct Double Error Detect (SECDED). Currently, there is a strong tradeoff between power and reliability among chipkill correct solutions. For example, commercially available chipkill correct solutions that can detect up to two failed devices and correct one failed device require accessing 36 memory devices per memory request, while a weaker single-chipkill-correct, single-chipkill-detect solution requires accessing only 18 devices per request and therefore consumes much less memory power. In this research, we present Adaptive Reliability Chipkill Correct (ARCC), an optimization applied to existing chipkill correct solutions that lets them incur the low power consumption of a lower-strength chipkill correct solution while maintaining reliability similar to that of a stronger one. ARCC is based on the observation that, on average, only a tiny fraction of memory experiences any type of fault during the typical operational lifespan of a server. It therefore relaxes the strength of chipkill correct initially and adaptively increases the strength as needed, on a page-by-page basis, to reap the benefit of lower power consumption during the majority of a memory system's lifetime. Our evaluation shows that ARCC reduces memory power consumption by 36%, on average, when applied to commercial chipkill correct, while keeping the storage overhead the same and maintaining similar reliability.
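
    A hedged sketch of the page-by-page policy described above: every page starts in the lower-strength, lower-power mode and is promoted to the stronger chipkill configuration only once it exhibits a fault. The 18- and 36-device counts follow the example in the abstract; the class and method names are hypothetical.

        # Hypothetical sketch of ARCC's policy: pages start with weak
        # (low-power) chipkill and are upgraded to strong chipkill only
        # after a fault is observed on that page.
        WEAK, STRONG = "weak", "strong"
        DEVICES = {WEAK: 18, STRONG: 36}  # devices touched per memory request

        class ArccMemory:
            def __init__(self):
                self.mode = {}  # page number -> protection level

            def devices_for_access(self, page: int) -> int:
                return DEVICES[self.mode.get(page, WEAK)]

            def on_fault_detected(self, page: int):
                # A fault exhausts the weak code's margin: promote the page.
                self.mode[page] = STRONG

        mem = ArccMemory()
        print(mem.devices_for_access(42))  # 18: low-power default
        mem.on_fault_detected(42)
        print(mem.devices_for_access(42))  # 36: strong chipkill after a fault
        print(mem.devices_for_access(7))   # 7 is still healthy: stays at 18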