23 research outputs found

    A Real-Time Error Detection (RTD) architecture and its use for reliability and post-silicon validation for F/F based memory arrays

    Get PDF
    This work proposes in-situ Real-Time Error Detection (RTD): embedding hardware in a memory array for detecting a fault in the array when it occurs, rather than when it is read. RTD breaks the serialization between data access and error-detection and, thus, it can speed-up the access-time of arrays that use in-line error-correction. The approach can also reduce the time needed to root-cause array related bugs during post-silicon validation and product testing. The paper introduces a two-dimensional error-correction scheme based on RTD and, also, presents a proactive error-correction method that combines RTD with demand-scrubbing. The work describes how to build RTD into a memory array with flip-flops to track in real-time the column-parity. A comparison of the proposed two-dimensional ECC scheme, as compared to single-error-correction-double-error-detection, shows that the RTD design has comparable error-detection-and-correction strength and, depending on the array dimensions and configuration, RTD reduces access time by 4% to 26% at an area and power overhead (negative is a reduction) between -7% to 33% and -42% to 86% respectively.Peer ReviewedPostprint (author's final draft

    Reliability-aware memory design using advanced reconfiguration mechanisms

    Get PDF
    Fast and Complex Data Memory systems has become a necessity in modern computational units in today's integrated circuits. These memory systems are integrated in form of large embedded memory for data manipulation and storage. This goal has been achieved by the aggressive scaling of transistor dimensions to few nanometer (nm) sizes, though; such a progress comes with a drawback, making it critical to obtain high yields of the chips. Process variability, due to manufacturing imperfections, along with temporal aging, mainly induced by higher electric fields and temperature, are two of the more significant threats that can no longer be ignored in nano-scale embedded memory circuits, and can have high impact on their robustness. Static Random Access Memory (SRAM) is one of the most used embedded memories; generally implemented with the smallest device dimensions and therefore its robustness can be highly important in nanometer domain design paradigm. Their reliable operation needs to be considered and achieved both in cell and also in architectural SRAM array design. Recently, and with the approach to near/below 10nm design generations, novel non-FET devices such as Memristors are attracting high attention as a possible candidate to replace the conventional memory technologies. In spite of their favorable characteristics such as being low power and highly scalable, they also suffer with reliability challenges, such as process variability and endurance degradation, which needs to be mitigated at device and architectural level. This thesis work tackles such problem of reliability concerns in memories by utilizing advanced reconfiguration techniques. In both SRAM arrays and Memristive crossbar memories novel reconfiguration strategies are considered and analyzed, which can extend the memory lifetime. These techniques include monitoring circuits to check the reliability status of the memory units, and architectural implementations in order to reconfigure the memory system to a more reliable configuration before a fail happens.Actualmente, el diseño de sistemas de memoria en circuitos integrados busca continuamente que sean más rápidos y complejos, lo cual se ha vuelto de gran necesidad para las unidades de computación modernas. Estos sistemas de memoria están integrados en forma de memoria embebida para una mejor manipulación de los datos y de su almacenamiento. Dicho objetivo ha sido conseguido gracias al agresivo escalado de las dimensiones del transistor, el cual está llegando a las dimensiones nanométricas. Ahora bien, tal progreso ha conllevado el inconveniente de una menor fiabilidad, dado que ha sido altamente difícil obtener elevados rendimientos de los chips. La variabilidad de proceso - debido a las imperfecciones de fabricación - junto con la degradación de los dispositivos - principalmente inducido por el elevado campo eléctrico y altas temperaturas - son dos de las más relevantes amenazas que no pueden ni deben ser ignoradas por más tiempo en los circuitos embebidos de memoria, echo que puede tener un elevado impacto en su robusteza final. Static Random Access Memory (SRAM) es una de las celdas de memoria más utilizadas en la actualidad. Generalmente, estas celdas son implementadas con las menores dimensiones de dispositivos, lo que conlleva que el estudio de su robusteza es de gran relevancia en el actual paradigma de diseño en el rango nanométrico. La fiabilidad de sus operaciones necesita ser considerada y conseguida tanto a nivel de celda de memoria como en el diseño de arquitecturas complejas basadas en celdas de memoria SRAM. Actualmente, con el diseño de sistemas basados en dispositivos de 10nm, dispositivos nuevos no-FET tales como los memristores están atrayendo una elevada atención como posibles candidatos para reemplazar las actuales tecnologías de memorias convencionales. A pesar de sus características favorables, tales como el bajo consumo como la alta escabilidad, ellos también padecen de relevantes retos de fiabilidad, como son la variabilidad de proceso y la degradación de la resistencia, la cual necesita ser mitigada tanto a nivel de dispositivo como a nivel arquitectural. Con todo esto, esta tesis doctoral afronta tales problemas de fiabilidad en memorias mediante la utilización de técnicas de reconfiguración avanzada. La consideración de nuevas estrategias de reconfiguración han resultado ser validas tanto para las memorias basadas en celdas SRAM como en `memristive crossbar¿, donde se ha observado una mejora significativa del tiempo de vida en ambos casos. Estas técnicas incluyen circuitos de monitorización para comprobar la fiabilidad de las unidades de memoria, y la implementación arquitectural con el objetivo de reconfigurar los sistemas de memoria hacia una configuración mucho más fiables antes de que el fallo suced

    Soft-Error Resilience Framework For Reliable and Energy-Efficient CMOS Logic and Spintronic Memory Architectures

    Get PDF
    The revolution in chip manufacturing processes spanning five decades has proliferated high performance and energy-efficient nano-electronic devices across all aspects of daily life. In recent years, CMOS technology scaling has realized billions of transistors within large-scale VLSI chips to elevate performance. However, these advancements have also continually augmented the impact of Single-Event Transient (SET) and Single-Event Upset (SEU) occurrences which precipitate a range of Soft-Error (SE) dependability issues. Consequently, soft-error mitigation techniques have become essential to improve systems\u27 reliability. Herein, first, we proposed optimized soft-error resilience designs to improve robustness of sub-micron computing systems. The proposed approaches were developed to deliver energy-efficiency and tolerate double/multiple errors simultaneously while incurring acceptable speed performance degradation compared to the prior work. Secondly, the impact of Process Variation (PV) at the Near-Threshold Voltage (NTV) region on redundancy-based SE-mitigation approaches for High-Performance Computing (HPC) systems was investigated to highlight the approach that can realize favorable attributes, such as reduced critical datapath delay variation and low speed degradation. Finally, recently, spin-based devices have been widely used to design Non-Volatile (NV) elements such as NV latches and flip-flops, which can be leveraged in normally-off computing architectures for Internet-of-Things (IoT) and energy-harvesting-powered applications. Thus, in the last portion of this dissertation, we design and evaluate for soft-error resilience NV-latching circuits that can achieve intriguing features, such as low energy consumption, high computing performance, and superior soft errors tolerance, i.e., concurrently able to tolerate Multiple Node Upset (MNU), to potentially become a mainstream solution for the aerospace and avionic nanoelectronics. Together, these objectives cooperate to increase energy-efficiency and soft errors mitigation resiliency of larger-scale emerging NV latching circuits within iso-energy constraints. In summary, addressing these reliability concerns is paramount to successful deployment of future reliable and energy-efficient CMOS logic and spintronic memory architectures with deeply-scaled devices operating at low-voltages

    Cost-Efficient Soft-Error Resiliency for ASIP-based Embedded Systems

    Full text link
    Recent decades have witnessed the rapid growth of embedded systems. At present, embedded systems are widely applied in a broad range of critical applications including automotive electronics, telecommunication, healthcare, industrial electronics, consumer electronics military and aerospace. Human society will continue to be greatly transformed by the pervasive deployment of embedded systems. Consequently, substantial amount of efforts from both industry and academic communities have contributed to the research and development of embedded systems. Application-specific instruction-set processor (ASIP) is one of the key advances in embedded processor technology, and a crucial component in some embedded systems. Soft errors have been directly observed since the 1970s. As devices scale, the exponential increase in the integration of computing systems occurs, which leads to correspondingly decrease in the reliability of computing systems. Today, major research forums state that soft errors are one of the major design technology challenges at and beyond the 22 nm technology node. Therefore, a large number of soft-error solutions, including error detection and recovery, have been proposed from differing perspectives. Nonetheless, most of the existing solutions are designed for general or high-performance systems which are different to embedded systems. For embedded systems, the soft-error solutions must be cost-efficient, which requires the tailoring of the processor architecture with respect to the feature of the target application. This thesis embodies a series of explorations for cost-efficient soft-error solutions for ASIP-based embedded systems. In this exploration, five major solutions are proposed. The first proposed solution realizes checkpoint recovery in ASIPs. By generating customized instructions, ASIP-implemented checkpoint recovery can perform at a finer granularity than what was previously possible. The fault-free performance overhead of this solution is only 1.45% on average. The recovery delay is only 62 cycles at the worst case. The area and leakage power overheads are 44.4% and 45.6% on average. The second solution explores utilizing two primitive error recovery techniques jointly. This solution includes three application-specific optimization methodologies. This solution generates the optimized error-resilient ASIPs, based on the characteristics of primitive error recovery techniques, static reliability analysis and design constraints. The resultant ASIP can be configured to perform at runtime according to the optimized recovery scheme. This solution can strategically enhance cost-efficiency for error recovery. In order to guarantee cost-efficiency in unpredictable runtime situations, the third solution explores runtime adaptation for error recovery. This solution aims to budget and adapt the error recovery operations, so as to spend the resources intelligently and to tolerate adverse influences of runtime variations. The resultant ASIP can make runtime decisions to determine the activation of spatial and temporal redundancies, according to the runtime situations. At the best case, this solution can achieve almost 50x reliability gain over the state of the art solutions. Given the increasing demand for multi-core computing systems, the last two proposed solutions target error recovery in multi-core ASIPs. The first solution of these two explores ASIP-implemented fine-grained process migration. This solution is a key infrastructure, which allows cost-efficient task management, for realizing cost-efficient soft-error recovery in multi-core ASIPs. The average time cost is only 289 machine cycles to perform process migration. The last solution explores using dynamic and adaptive mapping to assign heterogeneous recovery operations to the tasks in the multi-core context. This solution allows each individual ASIP-based processing core to dynamically adapt its specific error recovery functionality according to the corresponding task's characteristics, in terms of soft error vulnerability and execution time deadline. This solution can significantly improve the reliability of the system by almost two times, with graceful constraint penalty, in comparison to the state-of-the-art counterparts

    耐ソフトエラーラッチにおける欠陥の分析、検出及び評価に関する研究

    Get PDF
    The development of modern integrated circuits (ICs) has greatly changed the life of humankind. Nowadays, IC s are also indispensable to mission-critical applications, such as medical devices, autonomous cars, aircraft navigating systems, and satellites. The reliability of these mission-critical applications is a major concern. A soft-error occurring in an IC is a severe threat to its reliability, especially for mission-critical applications. The continuous trend of shrinking technology feature sizes makes modern ICs more and more vulnerable to soft errors. Soft-errors are caused by radiation particles striking an IC and generating current pulses to disturb its functionality. A soft-error can cause data corruption and may eventually lead to system failure s If a soft-error occurs in an operational medical device during surgery, it may cause a malfunction of this device and interrupt the surgery process. A soft-error may change the control data of an autonomous car which may lead to an accident. A soft-error may corrupt the aircraft navigating systems. No one would take the chance to let it happen even though malfunction s caused by soft errors can be solved by resetting these devices. Because reset takes time and severe results may happen during the resetting. If a soft-error causes a malfunction in the control system of a satellite, it may not be able to maintain its height and eventually burn up as it falls into the Earth’s atmosphere. Hence, it is important to protect ICs from soft errors. Many soft-error tolerance methods have been proposed to protect ICs against soft-errors. In an IC, memory elements and storage elements (e.g., latches and flip flops) are the most vulnerable to soft-errors, and data stored in them are crucial to the operation of a circuit. Error correction codes (ECCs) can be u sed to protect memories. Register-level soft-error tolerance methods can be used to detect soft-errors in latches by using parity checking and correct them by resetting. Hardened designs protect latches against soft-errors by using redundant feedback loops to store the same input data and using a voter to select the correct output. The advantage of using hardened designs is that they can prevent soft-errors from reaching outputs while ECCs and register-level soft-error tolerance methods must detect soft-errors and then correct them by restoring the data. For protecting storage elements in mission-critical applications, hardened latch design is the best option because it has high reliability and can save the resetting time. Many state-of-the-art hardened latch designs have been proposed to tolerate soft errors and they are believed to have good soft-error tolerability. Defects (physical flaws due to imperfect production (production defects) and physical changes caused by aging effects after a long operation time (aging-related defects) can also cause a malfunction of a circuit and cause a system failure eventually. Different from the temporal state change of a circuit caused by soft errors, defects are permanent damages to a circuit and can disturb the behavior of a circuit from its desired manner. Defects in storage elements should be detected to make sure a system/device operating correctly and stably. Scan test is a commonly used defect detection method, which connects reconfigured storage elements to form a shift register with external access and the internal states of these storage elements can be easily controlled and checked. However, the impact of defects on existing state of the art hardened latch design has not been considered. This impact requires consideration because added redundancy in hardened latch designs can not only mask soft-errors but also mask the effects of defects and it can lead to two serious problems: Problem-1 (Low Testability): Production defects in hardened latch designs are difficult to detect with conventional scan tests, in which the observability (an important metric to evaluate a circuit’s testability) of defects in hardened latch designs can be greatly reduced. Therefore, existing state-of-the-art hardened latches have low observability and thus low testability. Furthermore, defects that escaped the production test (undetected defects) may become more and more serious and cause a system failure eventually. Problem-2 (Low Soft-Error Tolerability): Undetected defects and aging-related defects can make hardened latch designs vulnerable to soft-errors while defect-free ones do not. The soft-error tolerability of hardened latch designs may be compromise d by undetected defects or aging related defects. This research is the first to consider Problem-1 of low testability of hardened latches and Problem-2 of defects reducing the reliability of hardened latches. Furthermore, this research is the first to pro pose a comprehensive solution to solve these two problems with the following five major contributions: Contribution-1: A first of its kind metric for quantifying the impact of defects on hardened latches, called Post-Test Vulnerability Factor (PTVF). It is used to analyze the residual soft-error tolerability of hardened latches after testing. Problem-2 is solved by this first major contribution. Contribution-2: A novel design called Scan-Test-Aware Hardened Latch (STAHL) that provides the highest defect coverage in comparison with all existing hardened latches. Problem-1 is solved by using STAHL to build a scan c ell to perform a scan test. Contribution-3: A novel scan test procedure is proposed to solve Problem-1 by fully testing the STAHL based scan cell. Contribution-4: A novel High-Performance Scan-Test-Aware Hardened Latch (HP-STAHL) design can also solve Problem-1 and has similar defect coverage as STAHL but has lower power consumption and higher propagation speed. Contribution-5: A novel scan test procedure is proposed to fully test the HP STAHL-based scan cell to solve Problem-1. Comprehensive simulation results demonstrate the accuracy of the PTVF metric and the effectiveness of the STAHL-based scan test and HP-STAHL-based scan test. As the first comprehensive study bridging the gap between hardened latch design s and IC testing, the findings of this research are expected to significantly improve the soft-error-related reliability of IC designs for mission-critical applications. Furthermore, the two proposed hardened latches and the scan test procedures can not only be use d to detect defects after production but also can be applied to detect aging related defects in the field through performing built-in self-test (BIST). In Chapter 1, an example is introduced to indicate Problem-1 and Problem-2. Chapter 2 shows the background information of soft-errors and defects. Chapter 3 shows some typical soft-error mitigation methods and details of a scan test. Chapter 4 describes the detailed information of PTVF Contribution-1). Chapter 5 shows the structure of STAHL (Contribution-2) and Chapter 6 shows the scan test procedure of testing the STAHL-based scan cell (Contribution-3). Chapter 7 shows the structure of HP-STAHL (Contribution-4) and Chapter 8 shows the scan test procedure of testing the HP-STAHL based scan cell (Contribution-5). Chapter 9 shows the experimental results of comparing STAHL and HP-STAHL with state-of-the-art hardened latch designs. Chapter 10 concludes this thesis.九州工業大学博士学位論文 学位記番号:情工博甲第371号 学位授与年月日:令和4年9月26日1. Introduction|2. Background|3. Related Works|4. Post-Test Vulnerability Factor (PTVF)|5. Scan-Test Aware Hardened Latch (STAHL)|6. Scan Test Based on STAHL|7. High Performance Scan-Test-Aware Hardened Latch (HP STAHL)|8. Scan Test Based on HP STAHL|9. Experimental Evaluation|10. Conclusions and Future Works九州工業大学令和4年

    AI/ML Algorithms and Applications in VLSI Design and Technology

    Full text link
    An evident challenge ahead for the integrated circuit (IC) industry in the nanometer regime is the investigation and development of methods that can reduce the design complexity ensuing from growing process variations and curtail the turnaround time of chip manufacturing. Conventional methodologies employed for such tasks are largely manual; thus, time-consuming and resource-intensive. In contrast, the unique learning strategies of artificial intelligence (AI) provide numerous exciting automated approaches for handling complex and data-intensive tasks in very-large-scale integration (VLSI) design and testing. Employing AI and machine learning (ML) algorithms in VLSI design and manufacturing reduces the time and effort for understanding and processing the data within and across different abstraction levels via automated learning algorithms. It, in turn, improves the IC yield and reduces the manufacturing turnaround time. This paper thoroughly reviews the AI/ML automated approaches introduced in the past towards VLSI design and manufacturing. Moreover, we discuss the scope of AI/ML applications in the future at various abstraction levels to revolutionize the field of VLSI design, aiming for high-speed, highly intelligent, and efficient implementations

    Cross-layer Soft Error Analysis and Mitigation at Nanoscale Technologies

    Get PDF
    This thesis addresses the challenge of soft error modeling and mitigation in nansoscale technology nodes and pushes the state-of-the-art forward by proposing novel modeling, analyze and mitigation techniques. The proposed soft error sensitivity analysis platform accurately models both error generation and propagation starting from a technology dependent device level simulations all the way to workload dependent application level analysis

    Predictive Reliability and Fault Management in Exascale Systems: State of the Art and Perspectives

    Get PDF
    © ACM, 2020. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Computing Surveys, Vol. 53, No. 5, Article 95. Publication date: September 2020. https://doi.org/10.1145/3403956[EN] Performance and power constraints come together with Complementary Metal Oxide Semiconductor technology scaling in future Exascale systems. Technology scaling makes each individual transistor more prone to faults and, due to the exponential increase in the number of devices per chip, to higher system fault rates. Consequently, High-performance Computing (HPC) systems need to integrate prediction, detection, and recovery mechanisms to cope with faults efficiently. This article reviews fault detection, fault prediction, and recovery techniques in HPC systems, from electronics to system level. We analyze their strengths and limitations. Finally, we identify the promising paths to meet the reliability levels of Exascale systems.This work has received funding from the European Union's Horizon 2020 (H2020) research and innovation program under the FET-HPC Grant Agreement No. 801137 (RECIPE). Jaume Abella was also partially supported by the Ministry of Economy and Competitiveness of Spain under Contract No. TIN2015-65316-P and under Ramon y Cajal Postdoctoral Fellowship No. RYC-2013-14717, as well as by the HiPEAC Network of Excellence. Ramon Canal is partially supported by the Generalitat de Catalunya under Contract No. 2017SGR0962.Canal, R.; Hernández Luz, C.; Tornero-Gavilá, R.; Cilardo, A.; Massari, G.; Reghenzani, F.; Fornaciari, W.... (2020). Predictive Reliability and Fault Management in Exascale Systems: State of the Art and Perspectives. ACM Computing Surveys. 53(5):1-32. https://doi.org/10.1145/3403956S132535Abella, J., Hernandez, C., Quinones, E., Cazorla, F. J., Conmy, P. R., Azkarate-askasua, M., … Vardanega, T. (2015). WCET analysis methods: Pitfalls and challenges on their trustworthiness. 10th IEEE International Symposium on Industrial Embedded Systems (SIES). doi:10.1109/sies.2015.7185039E. Agullo L. Giraud A. Guermouche J. Roman and M. Zounon. 2013. Towards resilient parallel linear Krylov solvers: Recover-restart strategies. INRIA Research Report RR-8324. E. Agullo L. Giraud A. Guermouche J. Roman and M. Zounon. 2013. Towards resilient parallel linear Krylov solvers: Recover-restart strategies. INRIA Research Report RR-8324.Agullo, E., Giraud, L., Salas, P., & Zounon, M. (2016). Interpolation-Restart Strategies for Resilient Eigensolvers. SIAM Journal on Scientific Computing, 38(5), C560-C583. doi:10.1137/15m1042115Al-Qawasmeh, A. M., Pasricha, S., Maciejewski, A. A., & Siegel, H. J. (2015). Power and Thermal-Aware Workload Allocation in Heterogeneous Data Centers. IEEE Transactions on Computers, 64(2), 477-491. doi:10.1109/tc.2013.116ARM. 2017. ARM Reliability Availability and Serviceability (RAS) Specification—ARMv8 for the ARMv8-A Architecture Profile. White paper. Retrieved from https://developer.arm.com/docs/ddi0587/latest. ARM. 2017. ARM Reliability Availability and Serviceability (RAS) Specification—ARMv8 for the ARMv8-A Architecture Profile. White paper. Retrieved from https://developer.arm.com/docs/ddi0587/latest.Avizienis, A., Laprie, J.-C., Randell, B., & Landwehr, C. (2004). Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11-33. doi:10.1109/tdsc.2004.2Bautista-Gomez, L., Zyulkyarov, F., Unsal, O., & McIntosh-Smith, S. (2016). Unprotected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer. SC16: International Conference for High Performance Computing, Networking, Storage and Analysis. doi:10.1109/sc.2016.54Berrocal, E., Bautista-Gomez, L., Di, S., Lan, Z., & Cappello, F. (2017). Toward General Software Level Silent Data Corruption Detection for Parallel Applications. IEEE Transactions on Parallel and Distributed Systems, 28(12), 3642-3655. doi:10.1109/tpds.2017.2735971M.-A. Breuer and A. D. Friedman. 1976. Diagnosis 8 Reliable Design of Digital Systems. Springer. M.-A. Breuer and A. D. Friedman. 1976. Diagnosis 8 Reliable Design of Digital Systems. Springer.P. Bridges K. Ferreira M. Heroux and M. Hoemmen. 2012. Fault-tolerant linear solvers via selective reliability. ArXiv e-prints June 2012. arXiv:1206.1390 [math.NA]. P. Bridges K. Ferreira M. Heroux and M. Hoemmen. 2012. Fault-tolerant linear solvers via selective reliability. ArXiv e-prints June 2012. arXiv:1206.1390 [math.NA].F. Cappello A. Geist W. Gropp S. Kale B. Kramer and M. Snir. 2014. Toward exascale resilience: 2014 update. Supercomput. Front. Innovat. 1 1 (2014). http://superfri.org/superfri/article/view/14. F. Cappello A. Geist W. Gropp S. Kale B. Kramer and M. Snir. 2014. Toward exascale resilience: 2014 update. Supercomput. Front. Innovat. 1 1 (2014). http://superfri.org/superfri/article/view/14.F. J. Cazorla L. Kosmidis E. Mezzetti C. Hernandez J. Abella and T. Vardanega. 2019. Probabilistic worst-case timing analysis: Taxonomy and comprehensive survey. ACM Comput. Surv. 52 1 Article 14 (Feb. 2019) 35 pages. DOI:https://doi.org/10.1145/3301283 F. J. Cazorla L. Kosmidis E. Mezzetti C. Hernandez J. Abella and T. Vardanega. 2019. Probabilistic worst-case timing analysis: Taxonomy and comprehensive survey. ACM Comput. Surv. 52 1 Article 14 (Feb. 2019) 35 pages. DOI:https://doi.org/10.1145/3301283Chan, C. S., Pan, B., Gross, K., Vaidyanathan, K., & Rosing, T. Š. (2014). Correcting vibration-induced performance degradation in enterprise servers. ACM SIGMETRICS Performance Evaluation Review, 41(3), 83-88. doi:10.1145/2567529.2567555Chantem, T., Hu, X. S., & Dick, R. P. (2011). Temperature-Aware Scheduling and Assignment for Hard Real-Time Applications on MPSoCs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 19(10), 1884-1897. doi:10.1109/tvlsi.2010.2058873Chen, M. Y., Kiciman, E., Fratkin, E., Fox, A., & Brewer, E. (s. f.). Pinpoint: problem determination in large, dynamic Internet services. Proceedings International Conference on Dependable Systems and Networks. doi:10.1109/dsn.2002.1029005Chen, Z. (2011). Algorithm-based recovery for iterative methods without checkpointing. Proceedings of the 20th international symposium on High performance distributed computing - HPDC ’11. doi:10.1145/1996130.1996142Chen, Z. (2013). Online-ABFT. Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP ’13. doi:10.1145/2442516.2442533Coskun, A. K., Rosing, T. S., Mihic, K., De Micheli, G., & Leblebici, Y. (2006). Analysis and Optimization of MPSoC Reliability. Journal of Low Power Electronics, 2(1), 56-69. doi:10.1166/jolpe.2006.007G. Da Costa A. Oleksiak W. Piatek J. Salom and L. Sisó. 2015. Minimization of costs and energy consumption in a data center by a workload-based capacity management. In Energy Efficient Data Centers S. Klingert M. Chinnici and M. Rey Porto (Eds.). Springer International Publishing Cham 102--119. G. Da Costa A. Oleksiak W. Piatek J. Salom and L. Sisó. 2015. Minimization of costs and energy consumption in a data center by a workload-based capacity management. In Energy Efficient Data Centers S. Klingert M. Chinnici and M. Rey Porto (Eds.). Springer International Publishing Cham 102--119.Cupertino, L., Da Costa, G., Oleksiak, A., Pia¸tek, W., Pierson, J.-M., Salom, J., … Zilio, T. (2015). Energy-efficient, thermal-aware modeling and simulation of data centers: The CoolEmAll approach and evaluation results. Ad Hoc Networks, 25, 535-553. doi:10.1016/j.adhoc.2014.11.002Dally, W. J. (1991). Express cubes: improving the performance of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 40(9), 1016-1023. doi:10.1109/12.83652Dauwe, D., Pasricha, S., Maciejewski, A. A., & Siegel, H. J. (2018). Resilience-Aware Resource Management for Exascale Computing Systems. IEEE Transactions on Sustainable Computing, 3(4), 332-345. doi:10.1109/tsusc.2018.2797890R. I. Davis and A. Burns. 2011. A survey of hard real-time scheduling for multiprocessor systems. ACM Comput. Surv. 43 4 Article 35 (Oct. 2011) 44 pages. DOI:https://doi.org/10.1145/1978802.1978814 R. I. Davis and A. Burns. 2011. A survey of hard real-time scheduling for multiprocessor systems. ACM Comput. Surv. 43 4 Article 35 (Oct. 2011) 44 pages. DOI:https://doi.org/10.1145/1978802.1978814Di, S., & Cappello, F. (2016). Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications. IEEE Transactions on Parallel and Distributed Systems, 27(10), 2809-2823. doi:10.1109/tpds.2016.2517639Di, S., Guo, H., Gupta, R., Pershey, E. R., Snir, M., & Cappello, F. (2019). Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System. IEEE Transactions on Parallel and Distributed Systems, 30(2), 361-374. doi:10.1109/tpds.2018.2864184Di, S., Robert, Y., Vivien, F., & Cappello, F. (2017). Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model. IEEE Transactions on Parallel and Distributed Systems, 28(1), 244-259. doi:10.1109/tpds.2016.2546248J. Dongarra T. Herault and Y. Robert. 2015. Fault Tolerance Techniques for High-Performance Computing. Springer. J. Dongarra T. Herault and Y. Robert. 2015. Fault Tolerance Techniques for High-Performance Computing. Springer.DOWNING, S., & SOCIE, D. (1982). Simple rainflow counting algorithms. International Journal of Fatigue, 4(1), 31-40. doi:10.1016/0142-1123(82)90018-4Eghbalkhah, B., Kamal, M., Afzali-Kusha, H., Afzali-Kusha, A., Ghaznavi-Ghoushchi, M. B., & Pedram, M. (2015). Workload and temperature dependent evaluation of BTI-induced lifetime degradation in digital circuits. Microelectronics Reliability, 55(8), 1152-1162. doi:10.1016/j.microrel.2015.06.004Gottscho, M., Shoaib, M., Govindan, S., Sharma, B., Wang, D., & Gupta, P. (2017). Measuring the Impact of Memory Errors on Application  Performance. IEEE Computer Architecture Letters, 16(1), 51-55. doi:10.1109/lca.2016.2599513Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., … Sengupta, S. (2011). VL2. Communications of the ACM, 54(3), 95-104. doi:10.1145/1897852.1897877Heroux, M. A., Bartlett, R. A., Howle, V. E., Hoekstra, R. J., Hu, J. J., Kolda, T. G., … Stanley, K. S. (2005). An overview of the Trilinos project. ACM Transactions on Mathematical Software, 31(3), 397-423. doi:10.1145/1089014.1089021Hoffmann, G. A., Trivedi, K. S., & Malek, M. (2007). A Best Practice Guide to Resource Forecasting for Computing Systems. IEEE Transactions on Reliability, 56(4), 615-628. doi:10.1109/tr.2007.909764Hsiao, M. Y., Carter, W. C., Thomas, J. W., & Stringfellow, W. R. (1981). Reliability, Availability, and Serviceability of IBM Computer Systems: A Quarter Century of Progress. IBM Journal of Research and Development, 25(5), 453-468. doi:10.1147/rd.255.0453Hughes, G. F., Murray, J. F., Kreutz-Delgado, K., & Elkan, C. (2002). Improved disk-drive failure warnings. IEEE Transactions on Reliability, 51(3), 350-357. doi:10.1109/tr.2002.802886S. Hukerikar and C. Engelmann. 2017. Resilience design patterns: A structured approach to resilience at extreme scale. Supercomput. Front. Innov. 4 3 (2017). DOI:https://doi.org/10.14529/jsfi170301 S. Hukerikar and C. Engelmann. 2017. Resilience design patterns: A structured approach to resilience at extreme scale. Supercomput. Front. Innov. 4 3 (2017). DOI:https://doi.org/10.14529/jsfi170301Hussain, H., Malik, S. U. R., Hameed, A., Khan, S. U., Bickler, G., Min-Allah, N., … Rayes, A. (2013). A survey on resource allocation in high performance distributed computing systems. Parallel Computing, 39(11), 709-736. doi:10.1016/j.parco.2013.09.009Intel Corporation. [n.d.]. Intel Xeon Processor E7 Family: Reliability Availability and Serviceability. White paper. https://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-family-ras-server-paper.html. Intel Corporation. [n.d.]. Intel Xeon Processor E7 Family: Reliability Availability and Serviceability. White paper. https://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-family-ras-server-paper.html.Jha, S., Formicola, V., Martino, C. D., Dalton, M., Kramer, W. T., Kalbarczyk, Z., & Iyer, R. K. (2018). Resiliency of HPC Interconnects: A Case Study of Interconnect Failures and Recovery in Blue Waters. IEEE Transactions on Dependable and Secure Computing, 15(6), 915-930. doi:10.1109/tdsc.2017.2737537Kiciman, E., & Fox, A. (2005). Detecting Application-Level Failures in Component-Based Internet Services. IEEE Transactions on Neural Networks, 16(5), 1027-1041. doi:10.1109/tnn.2005.853411Kim, T., Sun, Z., Cook, C., Zhao, H., Li, R., Wong, D., & Tan, S. X.-D. (2016). Invited - Cross-layer modeling and optimization for electromigration induced reliability. Proceedings of the 53rd Annual Design Automation Conference. doi:10.1145/2897937.2905010Kurowski, K., Oleksiak, A., Piątek, W., Piontek, T., Przybyszewski, A., & Węglarz, J. (2013). DCworms – A tool for simulation of energy efficiency in distributed computing infrastructures. Simulation Modelling Practice and Theory, 39, 135-151. doi:10.1016/j.simpat.2013.08.007Langou, J., Chen, Z., Bosilca, G., & Dongarra, J. (2008). Recovery Patterns for Iterative Methods in a Parallel Unstable Environment. SIAM Journal on Scientific Computing, 30(1), 102-116. doi:10.1137/040620394J. C. Laprie (Ed.). 1995. Dependability—Its Attributes Impairments and Means. Springer-Verlag Berlin. J. C. Laprie (Ed.). 1995. Dependability—Its Attributes Impairments and Means. Springer-Verlag Berlin.Laprie, J.-C. (s. f.). DEPENDABLE COMPUTING AND FAULT TOLERANCE : CONCEPTS AND TERMINOLOGY. Twenty-Fifth International Symposium on Fault-Tolerant Computing, 1995, ’ Highlights from Twenty-Five Years’. doi:10.1109/ftcsh.1995.532603Lasance, C. J. M. (2003). Thermally driven reliability issues in microelectronic systems: status-quo and challenges. Microelectronics Reliability, 43(12), 1969-1974. doi:10.1016/s0026-2714(03)00183-5Yinglung Liang, Yanyong Zhang, Sivasubramaniam, A., Jette, M., & Sahoo, R. (s. f.). BlueGene/L Failure Analysis and Prediction Models. International Conference on Dependable Systems and Networks (DSN’06). doi:10.1109/dsn.2006.18Lin, T.-T. Y., & Siewiorek, D. P. (1990). Error log analysis: statistical modeling and heuristic trend analysis. IEEE Transactions on Reliability, 39(4), 419-432. doi:10.1109/24.58720Losada, N., González, P., Martín, M. J., Bosilca, G., Bouteiller, A., & Teranishi, K. (2020). Fault tolerance of MPI applications in exascale systems: The ULFM solution. Future Generation Computer Systems, 106, 467-481. doi:10.1016/j.future.2020.01.026Lyons, R. E., & Vanderkulk, W. (1962). The Use of Triple-Modular Redundancy to Improve Computer Reliability. IBM Journal of Research and Development, 6(2), 200-209. doi:10.1147/rd.62.0200M. Médard and S. S. Lumetta. 2003. Network Reliability and Fault Tolerance. American Cancer Society. Retrieved from arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/0471219282.eot281. M. Médard and S. S. Lumetta. 2003. Network Reliability and Fault Tolerance. American Cancer Society. Retrieved from arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/0471219282.eot281.Moody, A., Bronevetsky, G., Mohror, K., & de Supinski, B. (2010). Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System. doi:10.2172/984082Moor Insights 8 Strategy. 2017. AMD EPYC Brings New RAS Capability. White paper. Retrieved from https://www.amd.com/system/files/2017-06/AMD-EPYC-Brings-New-RAS-Capability.pdf. Moor Insights 8 Strategy. 2017. AMD EPYC Brings New RAS Capability. White paper. Retrieved from https://www.amd.com/system/files/2017-06/AMD-EPYC-Brings-New-RAS-Capability.pdf.Mulas, F., Atienza, D., Acquaviva, A., Carta, S., Benini, L., & De Micheli, G. (2009). Thermal Balancing Policy for Multiprocessor Stream Computing Platforms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 28(12), 1870-1882. doi:10.1109/tcad.2009.2032372Oleksiak, A., Kierzynka, M., Piatek, W., Agosta, G., Barenghi, A., Brandolese, C., … Janssen, U. (2017). M2DC – Modular Microserver DataCentre with heterogeneous hardware. Microprocessors and Microsystems, 52, 117-130. doi:10.1016/j.micpro.2017.05.019Oxley, M. A., Jonardi, E., Pasricha, S., Maciejewski, A. A., Siegel, H. J., Burns, P. J., & Koenig, G. A. (2018). Rate-based thermal, power, and co-location aware resource management for heterogeneous data centers. Journal of Parallel and Distributed Computing, 112, 126-139. doi:10.1016/j.jpdc.2017.04.015K. O’brien I. Pietri R. Reddy A. Lastovetsky and R. Sakellariou. 2017. A survey of power and energy predictive models in HPC systems and applications. ACM Comput. Surv. 50 3 Article 37 (June 2017) 38 pages. DOI:https://doi.org/10.1145/3078811 K. O’brien I. Pietri R. Reddy A. Lastovetsky and R. Sakellariou. 2017. A survey of power and energy predictive models in HPC systems and applications. ACM Comput. Surv. 50 3 Article 37 (June 2017) 38 pages. DOI:https://doi.org/10.1145/3078811Park, S.-M., & Humphrey, M. (2011). Predictable High-Performance Computing Using Feedback Control and Admission Control. IEEE Transactions on Parallel and Distributed Systems, 22(3), 396-411. doi:10.1109/tpds.2010.100Pfefferman, J. D., & Cernuschi-Frias, B. (2002). A nonparametric nonstationary procedure for failure prediction. IEEE Transactions on Reliability, 51(4), 434-442. doi:10.1109/tr.2002.804733Rangan, K. K., Wei, G.-Y., & Brooks, D. (2009). Thread motion. ACM SIGARCH Computer Architecture News, 37(3), 302-313. doi:10.1145/1555815.1555793Paolo Rech. [n.d.]. Reliability Issues in Current and Future Supercomputers. Retrieved from http://energysfe.ufsc.br/slides/Paolo-Rech-260917.pdf. Paolo Rech. [n.d.]. Reliability Issues in Current and Future Supercomputers. Retrieved from http://energysfe.ufsc.br/slides/Paolo-Rech-260917.pdf.F. Reghenzani G. Massari and W. Fornaciari. 2019. The real-time Linux kernel: A survey on PREEMPT_RT. Comput. Surveys 52 1 Article 18 (Feb. 2019) 36 pages. DOI:https://doi.org/10.1145/3297714 F. Reghenzani G. Massari and W. Fornaciari. 2019. The real-time Linux kernel: A survey on PREEMPT_RT. Comput. Surveys 52 1 Article 18 (Feb. 2019) 36 pages. DOI:https://doi.org/10.1145/3297714F. Salfner M. Lenk and M. Malek. 2010. A survey of online failure prediction methods. ACM Comput. Surv. 42 3 Article 10 (March 2010) 42 pages. DOI:https://doi.org/10.1145/1670679.1670680 F. Salfner M. Lenk and M. Malek. 2010. A survey of online failure prediction methods. ACM Comput. Surv. 42 3 Article 10 (March 2010) 42 pages. DOI:https://doi.org/10.1145/1670679.1670680Salfner, F., Schieschke, M., & Malek, M. (2006). Predicting failures of computer systems: a case study for a telecommunication system. Proceedings 20th IEEE International Parallel & Distributed Processing Symposium. doi:10.1109/ipdps.2006.1639672Shi, L., Chen, H., Sun, J., & Li, K. (2012). vCUDA: GPU-Accelerated High-Performance Computing in Virtual Machines. IEEE Transactions on Computers, 61(6), 804-816. doi:10.1109/tc.2011.112D. P. Siewiorek and R. S. Swarz. 1998. Reliable Computer Systems 3rd ed. A. K. Peters Ltd. D. P. Siewiorek and R. S. Swarz. 1998. Reliable Computer Systems 3rd ed. A. K. Peters Ltd.Singh, S., & Chana, I. (2016). A Survey on Resource Scheduling in Cloud Computing: Issues and Challenges. Journal of Grid Computing, 14(2), 217-264. doi:10.1007/s10723-015-9359-2Slegel, T. J., Averill, R. M., Check, M. A., Giamei, B. C., Krumm, B. W., Krygowski, C. A., … Webb, C. F. (1999). IBM’s S/390 G5 microprocessor design. IEEE Micro, 19(2), 12-23. doi:10.1109/40.755464Sridhar, A., Sabry, M. M., & Atienza, D. (2014). A Semi-Analytical Thermal Modeling Framework for Liquid-Cooled ICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 33(8), 1145-1158. doi:10.1109/tcad.2014.2323194Sridharan, V., DeBardeleben, N., Blanchard, S., Ferreira, K. B., Stearley, J., Shalf, J., & Gurumurthi, S. (2015). Memory Errors in Modern Systems. ACM SIGARCH Computer Architecture News, 43(1), 297-310. doi:10.1145/2786763.2694348Stathis, J. H. (2018). The physics of NBTI: What do we really know? 2018 IEEE International Reliability Physics Symposium (IRPS). doi:10.1109/irps.2018.8353539Stellner, G. (s. f.). CoCheck: checkpointing and process migration for MPI. Proceedings of International Conference on Parallel Processing. doi:10.1109/ipps.1996.508106Stone, J. E., Gohara, D., & Shi, G. (2010). OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. Computing in Science & Engineering, 12(3), 66-73. doi:10.1109/mcse.2010.69Subasi, O., Di, S., Bautista-Gomez, L., Balaprakash, P., Unsal, O., Labarta, J., … Cappello, F. (2018). Exploring the capabilities of support vector machines in detecting silent data corruptions. Sustainable Computing: Informatics and Systems, 19, 277-290. doi:10.1016/j.suscom.2018.01.004Tang, D., & Iyer, R. K. (1993). Dependability measurement and modeling of a multicomputer system. IEEE Transactions on Computers, 42(1), 62-75. doi:10.1109/12.192214D. Turnbull and N. Alldrin. 2003. Failure Prediction in Hardware Systems. Tech. rep. University of California San Diego CA. Retrieved from http://www.cs.ucsd.edu/ dturnbul/Papers/ServerPrediction.pdf. D. Turnbull and N. Alldrin. 2003. Failure Prediction in Hardware Systems. Tech. rep. University of California San Diego CA. Retrieved from http://www.cs.ucsd.edu/ dturnbul/Papers/ServerPrediction.pdf.Vilalta, R., Apte, C. V., Hellerstein, J. L., Ma, S., & Weiss, S. M. (2002). Predictive algorithms in the management of computer systems. IBM Systems Journal, 41(3), 461-474. doi:10.1147/sj.413.0461Vinoski, S. (2007). Reliability with Erlang. IEEE Internet Com
    corecore