5 research outputs found

    LOT-ECC: LOcalized and tiered reliability mechanisms for commodity memory systems

    Get PDF
    pre-printMemory system reliability is a serious and growing concern in modern servers. Existing chipkill-level mem- ory protection mechanisms suffer from several draw- backs. They activate a large number of chips on ev- ery memory access - this increases energy consump- tion, and reduces performance due to the reduction in rank-level parallelism. Additionally, they increase ac- cess granularity, resulting in wasted bandwidth in the absence of sufficient access locality. They also restrict systems to use narrow-I/O x4 devices, which are known to be less energy-efficient than the wider x8 DRAM de- vices. In this paper, we present LOT-ECC, a local- ized and multi-tiered protection scheme that attempts to solve these problems. We separate error detection and error correction functionality, and employ simple checksum and parity codes effectively to provide strong fault-tolerance, while simultaneously simplifying imple- mentation. Data and codes are localized to the same DRAM row to improve access efficiency. We use sys- tem firmware to store correction codes in DRAM data memory and modify the memory controller to handle data mapping. We thus build an effective fault-tolerance mechanism that provides strong reliability guarantees, activates as few chips as possible (reducing power con- sumption by up to 44.8% and reducing latency by up to 46.9%), and reduces circuit complexity, all while work- ing with commodity DRAMs and operating systems. Fi- nally, we propose the novel concept of a heterogeneous DIMM that enables the extension of LOT-ECC to x16 and wider DRAM parts

    LOT-ECC: LOcalized and tiered reliability mechanisms for commodity memory systems

    Get PDF
    pre-printMemory system reliability is a serious and growing concern in modern servers. Existing chipkill-level mem- ory protection mechanisms suffer from several draw- backs. They activate a large number of chips on ev- ery memory access - this increases energy consump- tion, and reduces performance due to the reduction in rank-level parallelism. Additionally, they increase ac- cess granularity, resulting in wasted bandwidth in the absence of sufficient access locality. They also restrict systems to use narrow-I/O x4 devices, which are known to be less energy-efficient than the wider x8 DRAM de- vices. In this paper, we present LOT-ECC, a local- ized and multi-tiered protection scheme that attempts to solve these problems. We separate error detection and error correction functionality, and employ simple checksum and parity codes effectively to provide strong fault-tolerance, while simultaneously simplifying imple- mentation. Data and codes are localized to the same DRAM row to improve access efficiency. We use sys- tem firmware to store correction codes in DRAM data memory and modify the memory controller to handle data mapping. We thus build an effective fault-tolerance mechanism that provides strong reliability guarantees, activates as few chips as possible (reducing power con- sumption by up to 44.8% and reducing latency by up to 46.9%), and reduces circuit complexity, all while work- ing with commodity DRAMs and operating systems. Fi- nally, we propose the novel concept of a heterogeneous DIMM that enables the extension of LOT-ECC to x16 and wider DRAM parts

    Enhancing Performance and Energy Consumption of HER Caches by Adding Associativity

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-54420-0_45Unlike other previous techniques, the recently proposed Hard Error Recovery (HER) fault-tolerant cache provides 100% fault-coverage in L1 data caches. This full coverage makes the HER cache appropiate for fault-dominated future technology nodes. An n-way set-associative HER cache implements one cache way with fast SRAM banks and the remaining ways with eDRAM banks to address power and area. Since the number of eDRAM cache blocks used in a specific HER cache organization depends on the cache associativity (i.e., the implemented number of ways), we expect that the performance and energy consumption provided by a given HER cache design strongly depends on the cache geometry. In this work we study the behavior of the HER cache design when applied to a highly associative L1 cache like those found in some modern microprocessors. In particular this work explores a 32KB 8-way associative L1 data cache such as the one used in Intel Haswell microarchitecture. Experimental results show that, at low-power modes compared to a conventional cache with the same storage capacity and number of ways, area, leakage power, and dynamic energy savings of a 4-way HER cache are by 25%, 85%, and 62%, respectively. These percentages are further improved (by 40%, 89%, and 68%, respectively) when the cache associativity is increased to 8 ways, while the performance loss with respect to both an 8-way conventional cache and the 4-way HER cache is minimal.This work was supponed by Generalitat de Catalunya (200950R1250), by FP7 program of the European Commission (TRAMS-248789), by Spanish Ministerio de Economía y Competitividad (MINECO) and by FEDER funds under Grant TlN2012-38341-C04-01 and TIN2010-18368.Lorente Garcés, VJ.; Valero Bresó, A.; Canal, R. (2014). Enhancing Performance and Energy Consumption of HER Caches by Adding Associativity. En Euro-Par 2013: Parallel Processing Workshops. Springer. 454-464. https://doi.org/10.1007/978-3-642-54420-0_45S454464Bhavnagarwala, A.J., et al.: The Impact of Intrinsic Device Fluctuations on CMOS SRAM Cell Stability. IEEE Journal of Solid-State Circuits 36(4), 658–665 (2001)Mukhopadhyay, S., et al.: Modeling of Failure Probability and Statistical Design of SRAM Array for Yield Enhancement in Nanoscaled CMOS. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 24(12), 1859–1880 (2005)Shirvani, P.P., McCluskey, E.J.: PADded Cache: A New Fault-Tolerance Technique for Cache Memories. In: Proceedings of the 17th IEEE VLSI Test Symposium, pp. 440–445 (1999)Wilkerson, C., et al.: Trading off Cache Capacity for Reliability to Enable Low Voltage Operation. In: Proceedings of the 35th Annual International Symposium on Computer Architecture, pp. 203–214 (2008)Agarwal, A., et al.: Process Variation in Embedded Memories: Failure Analysis and Variation Aware Architecture. IEEE Journal of Solid-State Circuits 40(9), 1804–1814 (2005)Ansari, A., et al.: Archipelago: A Polymorphic Cache Design for Enabling Robust Near-Threshold Operation. In: Proceedings of the 17th International Symposium on High Performance Computer Architecture, pp. 539–550 (2011)Nomura, S., et al.: Sampling + DMR: Practical and Low-overhead Permanent Fault Detection. In: Proceedings of the 38th Annual International Symposium on Computer Architecture, pp. 201–212 (2011)Sinharoy, B., et al.: IBM POWER7 multicore server processor. IBM Journal of Research and Development 55(3) (2011)Lorente, V., et al.: Combining RAM technologies for hard-error recovery in L1 data caches working at very-low power modes. In: Proceedings of the Design, Automation, and Test in Europe Conference, pp. 83–88 (2013)Kanter, D.: Intel’s Haswell CPU Microarchitecture, ”Real World Technologies” (November 13, 2012), http://www.realworldtech.com/haswell-cpu/Paul, S., et al.: Reliability-Driven ECC Allocation for Multiple Bit Error Resilience in Processor Cache. IEEE Transactions on Computers 60(1), 20–34 (2011)Alameldeen, A.R., et al.: Adaptive Cache Design to Enable Reliable Low-Voltage Operation. IEEE Transactions on Computers 60, 50–63 (2011)Dreslinski, R.G., et al.: Reconfigurable Energy Efficient Near Threshold Cache Architectures. In: Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 459–470 (2008)Wilkerson, C., et al.: Reducing Cache Power with Low-Cost, Multi-bit Error-Correcting Codes. In: Proceedings of the 37th Annual International Symposium on Computer Architecture, pp. 83–93 (2010)Burger, D., Austin, T.M.: The SimpleScalar Tool Set, Version 2.0. ACM SIGARCH Computer Architecture News 25(3), 13–25 (1997)Thoziyoor, S., et al.: CACTI 5.1. Hewlett-Packard Laboratories, Palo Alto, Technical Report (2008)spec2000: Standard Performance Evaluation Corporation, http://www.spec.org/cpu2000Kulkarni, J.P., et al.: A 160 mV, Fully Differential, Robust Schmitt Trigger Based Sub-threshold SRAM. In: Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 171–176 (2007)Keeth, B., et al.: DRAM Circuit Design. Fundamental and High-Speed Topics. John Wiley and Sons, Inc., Hoboken (2008)Mueller, W., et al.: Challenges for the DRAM Cell Scaling to 40nm. In: IEEE International Electron Devices Meeting 4, pp. 336–339 (2005

    Doctor of Philosophy in Computing

    Get PDF
    dissertatio

    Resilience of an embedded architecture using hardware redundancy

    Get PDF
    In the last decade the dominance of the general computing systems market has being replaced by embedded systems with billions of units manufactured every year. Embedded systems appear in contexts where continuous operation is of utmost importance and failure can be profound. Nowadays, radiation poses a serious threat to the reliable operation of safety-critical systems. Fault avoidance techniques, such as radiation hardening, have been commonly used in space applications. However, these components are expensive, lag behind commercial components with regards to performance and do not provide 100% fault elimination. Without fault tolerant mechanisms, many of these faults can become errors at the application or system level, which in turn, can result in catastrophic failures. In this work we study the concepts of fault tolerance and dependability and extend these concepts providing our own definition of resilience. We analyse the physics of radiation-induced faults, the damage mechanisms of particles and the process that leads to computing failures. We provide extensive taxonomies of 1) existing fault tolerant techniques and of 2) the effects of radiation in state-of-the-art electronics, analysing and comparing their characteristics. We propose a detailed model of faults and provide a classification of the different types of faults at various levels. We introduce an algorithm of fault tolerance and define the system states and actions necessary to implement it. We introduce novel hardware and system software techniques that provide a more efficient combination of reliability, performance and power consumption than existing techniques. We propose a new element of the system called syndrome that is the core of a resilient architecture whose software and hardware can adapt to reliable and unreliable environments. We implement a software simulator and disassembler and introduce a testing framework in combination with ERA’s assembler and commercial hardware simulators
    corecore