
    RAMPART: RowHammer Mitigation and Repair for Server Memory Systems

    RowHammer attacks are a growing security and reliability concern for DRAMs and computer systems, as they can induce many bit errors that overwhelm error detection and correction capabilities. System-level solutions are needed, as process technology and circuit improvements alone are unlikely to provide complete protection against RowHammer attacks in the future. This paper introduces RAMPART, a novel approach to mitigating RowHammer attacks and improving server memory system reliability by remapping addresses in each DRAM so that RowHammer bit flips are confined to a single device for any victim row address. When RAMPART is paired with Single Device Data Correction (SDDC) and patrol scrub, error detection and correction methods in use today, the system can detect and correct bit flips from a successful attack, allowing the memory system to heal itself. RAMPART is compatible with DDR5 RowHammer mitigation features, as well as a wide variety of algorithmic and probabilistic tracking methods. We also introduce BRC-VL, a variation of DDR5 Bounded Refresh Configuration (BRC) that improves system performance by reducing mitigation overhead, and show that it works well with probabilistic sampling methods to combat traditional and victim-focused mitigation attacks like Half-Double. The combination of RAMPART, SDDC, and scrubbing enables stronger RowHammer resistance by correcting bit flips from one successful attack; uncorrectable errors are much less likely, as they require two successful attacks before the memory system is scrubbed.
    Comment: 16 pages, 13 figures. A version of this paper will appear in the Proceedings of MEMSYS2
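The core idea, confining flips to one device by giving each DRAM a different address map, can be illustrated with a toy sketch. The actual RAMPART remap function is more involved; the 10-bit row width and per-device bit rotation below are assumptions chosen only to show how a given aggressor row ends up with different logical victims in each device:

```python
BITS = 10              # row-address width: an assumption for illustration
ROWS = 1 << BITS

def rotl(x, k, bits=BITS):
    """Rotate the bits of x left by k within a fixed width."""
    k %= bits
    return ((x << k) | (x >> (bits - k))) & ((1 << bits) - 1)

def rotr(x, k, bits=BITS):
    return rotl(x, -k % bits, bits)

def to_physical(logical_row, dev):
    # per-device remap: rotate the row-address bits by the device index
    return rotl(logical_row, dev)

def to_logical(physical_row, dev):
    return rotr(physical_row, dev)

def logical_victims(aggressor_logical, dev):
    """Physical neighbours of the aggressor, mapped back to logical rows."""
    p = to_physical(aggressor_logical, dev)
    return {to_logical((p - 1) % ROWS, dev), to_logical((p + 1) % ROWS, dev)}

# Hammering logical row 5 disturbs different logical victims in each device,
# so any single victim row sees flips in at most one device.
for dev in range(4):
    print(dev, sorted(logical_victims(5, dev)))
```

With this kind of per-device mapping, SDDC sees at most one faulty device per victim cache line, which is exactly the failure mode it can correct.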

    Reliability Models for Double Chipkill Detect/Correct Memory Systems

    Chipkill correct is an advanced type of error correction used in memory subsystems. Existing analytical approaches for modeling the reliability of memory subsystems with chipkill correct are limited to chipkill solutions that can only guarantee correction of errors in a single DRAM device. However, chipkill solutions capable of guaranteeing the detection, and even correction, of errors in up to two DRAM devices have become common in HPC systems, and analytical reliability models are needed for such memory subsystems. This paper proposes analytical models for the reliability of double chipkill detect and/or correct. Validation against Monte Carlo simulation shows that the outputs of our analytical models are within 3.9% of Monte Carlo simulation, on average.
    Department of Energy / DE-FC02-06ER25750
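The kind of model being validated can be sketched as follows, assuming independent per-device fault probabilities within an interval and a code that corrects up to two faulty devices. The device count and fault probability below are illustrative, not the paper's parameters:

```python
import math
import random

def p_uncorrectable(n_devices, p_dev, correctable=2):
    """Analytical model: probability that more devices fault in an interval
    than the code can correct, assuming independent device faults."""
    return 1.0 - sum(
        math.comb(n_devices, k) * p_dev**k * (1 - p_dev)**(n_devices - k)
        for k in range(correctable + 1)
    )

def monte_carlo(n_devices, p_dev, correctable=2, trials=200_000, seed=0):
    """Monte Carlo estimate of the same quantity, for validation."""
    rng = random.Random(seed)
    bad = sum(
        1 for _ in range(trials)
        if sum(rng.random() < p_dev for _ in range(n_devices)) > correctable
    )
    return bad / trials

# Example: an 18-chip rank with a (made-up) 1% per-device fault probability.
print(p_uncorrectable(18, 0.01))
print(monte_carlo(18, 0.01))
```

The paper's models cover more failure modes than this binomial toy, but the validation methodology, comparing a closed-form expression against simulation, has this shape.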

    Evaluation of Algorithm-Based Fault Tolerance for Machine Learning and Computer Vision under Neutron Radiation

    In the past decade, there has been a push to deploy commercial-off-the-shelf (COTS) avionics, due in part to lower costs and the desire for more performance. Traditional radiation-hardened processors are expensive and provide only limited processing power. With smaller mission budgets and a need for more computational power, low-cost, high-performance COTS solutions become attractive for these missions. Thanks to the computational capacity of COTS technology, machine-learning and computer-vision applications are now being deployed on modern space missions. However, COTS electronics are highly susceptible to radiation, so the reliability of the underlying computations becomes a concern. Matrix multiplication is the main computation behind decisions in machine-learning and computer-vision applications, making it a critical part of those applications, and its large time and memory footprint makes them especially susceptible to single-event upsets. In this thesis, algorithm-based fault tolerance (ABFT) is investigated to mitigate silent data errors in machine learning and computer vision. ABFT detects and corrects data errors using information redundancy stored in data structures separate from the primary data. For matrix multiplication, ABFT stores checksum data in vectors separate from the matrix and uses them for error detection and correction. Fault injection into a matrix-multiplication kernel was performed prior to irradiation; the kernel was then irradiated with wide-spectrum neutrons at the Los Alamos Neutron Science Center to observe the mitigating effect of ABFT. Fault injections targeting the general-purpose registers show a 48× reduction in data errors when using ABFT, with a negligible change in run time.
    Cross-section results from irradiation show a 5.3× improvement in reliability using ABFT as opposed to no mitigation, at a >99.9999% confidence level. These results demonstrate that ABFT is a viable solution for run-time error correction in matrix multiplication for machine-learning and computer-vision applications in future spacecraft.
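The checksum scheme described above can be sketched for a small matrix product: append a column-checksum row to A and a row-checksum column to B, so the product carries checksums that locate and repair a single corrupted element. This is a minimal illustration of the ABFT idea, not the irradiated kernel:

```python
def matmul(A, B):
    """Plain matrix product on lists of lists."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def abft_matmul(A, B):
    """ABFT multiply: the extra row/column of the (n+1) x (p+1) result
    holds column and row checksums of the true product."""
    Ac = A + [[sum(col) for col in zip(*A)]]   # append column-checksum row
    Br = [row + [sum(row)] for row in B]       # append row-checksum column
    return matmul(Ac, Br)

def correct_single_error(C):
    """Locate one corrupted element via the mismatched row and column
    checksums, recompute it, and return the corrected n x p product."""
    n, p = len(C) - 1, len(C[0]) - 1
    bad_rows = [i for i in range(n) if sum(C[i][:p]) != C[i][p]]
    bad_cols = [j for j in range(p)
                if sum(C[i][j] for i in range(n)) != C[n][j]]
    if bad_rows and bad_cols:
        i, j = bad_rows[0], bad_cols[0]
        C[i][j] = C[i][p] - sum(C[i][k] for k in range(p) if k != j)
    return [row[:p] for row in C[:n]]

A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [10, 11, 12]]
C = abft_matmul(A, B)
C[1][2] += 7                       # simulate an SEU flipping one element
print(correct_single_error(C))     # matches matmul(A, B) again
```

In a real kernel the checksum comparisons use a tolerance for floating-point rounding; integer arithmetic keeps this sketch exact.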

    Compiler extensions towards reliable multicore processors

    The current trend in commercial processors is toward multi-core architectures, which pose both an opportunity and a challenge for future space-based processing. The opportunity is to leverage multi-core processors for compute-intensive applications and thus provide an order-of-magnitude increase in onboard processing capability with less size, mass, and power. The challenge is to provide the requisite safety and reliability in an extremely harsh radiation environment. The objective is to advance from the multiple single-processor systems typically flown to a fault-tolerant multi-core system. We investigate software-based methods for making multi-core processors fault tolerant to single-event effects (SEEs) that cause interrupts or bit flips, and propose to utilize additional cores and memory resources together with newly developed software protection techniques. This work also assesses the trade space between reliability and performance. Our work is based on the modern compiler framework LLVM, which is ported to many architectures; we implement optimization passes that automatically add protection techniques, including N-modular redundancy (NMR) and error detection and correction (EDAC), at the assembly/instruction level. Because the passes modify the intermediate representation of the source code, they can be applied to any high-level language and any processor architecture supported by the LLVM framework. In our initial experiments, we separately implement triple modular redundancy (TMR) and error detection and correction codes (Hamming and BCH) at the instruction level. For critical applications we combine the two methods: we first apply TMR to the instructions, then use EDAC as a further measure when TMR cannot correct the errors originating from the SEE.
    Our initial experiments show good performance (about 10% overhead) when protecting code memory with a single-error-correcting, double-error-detecting Hamming code plus TMR; further work is needed to improve performance when protecting code memory with the BCH code. This work would be highly valuable not only for satellites and space but also in general computing, such as aircraft, automotive, server farms, and medical equipment (anywhere that needs safety-critical performance), as hardware gets smaller and more susceptible.
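The single-error-correcting, double-error-detecting Hamming code mentioned above can be illustrated with a textbook extended Hamming(8,4) code: three parity bits locate any single flipped bit, and an overall parity bit distinguishes single from double errors. This is a standalone sketch of the code, not the authors' instruction-level implementation:

```python
def secded_encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SEC-DED) word."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    code = [p1, p2, d[0], p3, d[1], d[2], d[3]]   # positions 1..7
    code.append(code[0] ^ code[1] ^ code[2] ^ code[3]
                ^ code[4] ^ code[5] ^ code[6])     # overall parity (DED)
    return code

def secded_decode(code):
    """Return (nibble, status): correct any single-bit error, detect two."""
    b = [None] + code[:7]                  # 1-indexed positions 1..7
    s1 = b[1] ^ b[3] ^ b[5] ^ b[7]
    s2 = b[2] ^ b[3] ^ b[6] ^ b[7]
    s3 = b[4] ^ b[5] ^ b[6] ^ b[7]
    syndrome = s1 | (s2 << 1) | (s3 << 2)  # position of a single error
    parity_ok = sum(code) % 2 == 0
    if syndrome and parity_ok:
        return None, "double-bit error detected"
    if syndrome:                           # single error: flip it back
        code = code[:]
        code[syndrome - 1] ^= 1
        status = "corrected"
    else:
        status = "ok" if parity_ok else "corrected"  # overall-parity-bit error
    d = [code[2], code[4], code[5], code[6]]
    return sum(bit << i for i, bit in enumerate(d)), status
```

A combined scheme like the one in the abstract would vote three TMR copies first and fall back to a decoder of this shape when no two copies agree.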

    Rethinking DRAM design and organization for energy-constrained multicores

    DRAM vendors have traditionally optimized for low cost and high performance, often making design decisions that incur energy penalties. For example, a single conventional DRAM access activates thousands of bitlines across many chips to return a single cache line to the CPU. The other bits may be used if there is high locality in the access stream; otherwise, significant energy is wasted, especially as memory access locality drops with increasing core, thread, and socket counts. We instead propose and analyze two optimizations that activate as little of the DRAM circuitry as possible while incurring only modest performance penalties. The first approach, Selective Bitline Activation (SBA), is compatible with existing DRAM standards: SBA waits for both the RAS and CAS signals to arrive before activating exactly those bitlines that provide the requested cache line. Our second proposal, Single Subarray Access (SSA), reorganizes the layout of DRAM subarrays and the mapping of data to subarrays so that an entire cache line is fetched from a single subarray. Since SSA reads an entire cache line from a single DRAM, we also examine adding DRAM checksums to increase error detection and correction capabilities, similar to existing methods used on disk drives. We then provide chipkill-level reliability through RAID techniques. The SSA design yields memory energy savings of 6× while incurring an area cost of 4.5%, and even improves performance by up to 15%. Our chipkill solution has significantly lower capacity and energy overheads than other known chipkill solutions, while incurring only an 11% performance penalty compared to an SSA memory system without chipkill.
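The RAID-style chipkill layer can be sketched as byte-wise XOR parity across chips: each chip's checksum detects that its cache line is bad, and the line is then rebuilt from the surviving chips plus a parity chip. This is an illustrative RAID-4-style sketch under those assumptions, not the paper's exact scheme:

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(chip_lines):
    """The parity 'chip' holds the byte-wise XOR of the data chips' lines."""
    parity = bytes(len(chip_lines[0]))
    for line in chip_lines:
        parity = xor_bytes(parity, line)
    return parity

def reconstruct(chip_lines, parity, failed):
    """Rebuild the failed chip's cache line from the survivors and parity:
    XOR-ing everything except the failed line cancels all other terms."""
    rebuilt = parity
    for i, line in enumerate(chip_lines):
        if i != failed:
            rebuilt = xor_bytes(rebuilt, line)
    return rebuilt

# Four data chips each holding one 8-byte cache line (toy sizes).
chips = [bytes([v]) * 8 for v in (3, 5, 9, 17)]
par = make_parity(chips)
print(reconstruct(chips, par, 2) == chips[2])
```

The checksum, not the parity, is what localizes the failure; parity alone can repair an erasure only once the bad chip is known, which is why SSA pairs the two.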

    Operating System Support for Redundant Multithreading

    Failing hardware is a fact and trends in microprocessor design indicate that the fraction of hardware suffering from permanent and transient faults will continue to increase in future chip generations. Researchers proposed various solutions to this issue with different downsides: Specialized hardware components make hardware more expensive in production and consume additional energy at runtime. Fault-tolerant algorithms and libraries enforce specific programming models on the developer. Compiler-based fault tolerance requires the source code for all applications to be available for recompilation. In this thesis I present ASTEROID, an operating system architecture that integrates applications with different reliability needs. ASTEROID is built on top of the L4/Fiasco.OC microkernel and extends the system with Romain, an operating system service that transparently replicates user applications. Romain supports single- and multi-threaded applications without requiring access to the application's source code. Romain replicates applications and their resources completely and thereby does not rely on hardware extensions, such as ECC-protected memory. In my thesis I describe how to efficiently implement replication as a form of redundant multithreading in software. I develop mechanisms to manage replica resources and to make multi-threaded programs behave deterministically for replication. I furthermore present an approach to handle applications that use shared-memory channels with other programs. My evaluation shows that Romain provides 100% error detection and more than 99.6% error correction for single-bit flips in memory and general-purpose registers. At the same time, Romain's execution time overhead is below 14% for single-threaded applications running in triple-modular redundant mode. 
    The last part of my thesis acknowledges that software-implemented fault tolerance methods often rely on the correct functioning of a certain set of hardware and software components, the Reliable Computing Base (RCB). I introduce the concept of the RCB and discuss what constitutes the RCB of the ASTEROID system and other fault tolerance mechanisms. Thereafter, I present three case studies that evaluate approaches to protecting RCB components, aiming for a software stack that is fully protected against hardware errors.
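The replication idea behind Romain can be shown in miniature: run N replicas of a deterministic computation and majority-vote their results at the point where output would leave the sphere of replication. This is a toy analogue in Python threads under that assumption, not the Romain implementation, which intercepts at the OS level:

```python
import queue
import threading

def replicated(f, arg, n_replicas=3):
    """Run n_replicas copies of a deterministic computation in separate
    threads and majority-vote the result at the externalization point."""
    results = queue.Queue()

    def worker():
        results.put(f(arg))

    threads = [threading.Thread(target=worker) for _ in range(n_replicas)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    outs = [results.get() for _ in range(n_replicas)]
    winner = max(set(outs), key=outs.count)
    if 2 * outs.count(winner) <= n_replicas:
        raise RuntimeError("no replica majority; cannot recover")
    return winner

print(replicated(lambda xs: sum(x * x for x in xs), [1, 2, 3]))
```

The hard problems the thesis addresses, making multi-threaded replicas deterministic and handling shared memory, are exactly what this toy sidesteps by using a pure function.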