485 research outputs found

    An Aging-Aware GPU Register File Design Based on Data Redundancy

    Get PDF
    "© 2019 IEEE. Personal use of this material is permitted. Permissíon from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertisíng or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works."[EN] Nowadays, GPUs sit at the forefront of high-performance computing thanks to their massive computational capabilities. Internally, thousands of functional units, architected to be fed by large register files, fuel such a performance. At deep nanometer technologies, the SRAM memory cells that implement GPU register files are very sensitive to the Negative Bias Temperature Instability (NBTI) effect. NBTI ages cell transistors by degrading their threshold voltage Vth over the lifetime of the GPU. This degradation, which manifests when a cell keeps the same logic value for a relatively long period of time, compromises the cell read stability and increases the transistor switching delay, which can lead to wrong read values and eventually exceed the processor cycle time, respectively, so resulting in faulty operation. Thiswork proposes architectural mechanisms leveraging the redundancy of the data stored in GPU register files to attack NBTI aging. The proposed mechanisms are based on data compression, power gating, and register address rotation techniques. All these mechanismsworking together balance the distribution of logic values stored in the cells along the execution time, reducing both the overall Vth degradation and the increase in the transistor switching delays. Experimental results show that a conventional GPU register file suffers the worst case for NBTI, since a significant fraction of the cells maintain the same logic value during the entire application execution (i.e., a 100 percent '0' and '1' duty cycle distributions). On average, the proposal reduces these distributions by 58 and 68 percent, respectively, which translates into Vth degradation savings by 54 and 62 percent, respectively.This work was supported by the Gobierno de Aragon and the European ESF (gaZ: T58_17R research group), and by the Ministerio de Economia y Competitividad (MINECO) and AEI/FEDER (EU) funds under Grants TIN2016-76635-C2-1-R and TIN2015-66972-C5-1-R.Valero Bresó, A.; Candel-Margaix, F.; Suárez-Gracia, D.; Petit Martí, SV.; Sahuquillo Borrás, J. (2019). An Aging-Aware GPU Register File Design Based on Data Redundancy. IEEE Transactions on Computers. 68(1):4-20. https://doi.org/10.1109/TC.2018.2849376S42068

    DC-Patch: A Microarchitectural Fault Patching Technique for GPU Register Files

    Get PDF
    The ever-increasing parallelism demand of General-Purpose Graphics Processing Unit (GPGPU) applications pushes toward larger and more energy-hungry register files in successive GPU generations. Reducing the supply voltage beyond its safe limit is an effective way to improve the energy efficiency of register files. However, at these operating voltages, the reliability of the circuit is compromised. This work aims to tolerate permanent faults from process variations in large GPU register files operating below the safe supply voltage limit. To do so, this paper proposes a microarchitectural patching technique, DC-Patch, exploiting the inherent data redundancy of applications to compress registers at run-time with neither compiler assistance nor instruction set modifications. Instead of disabling an entire faulty register file entry, DC-Patch leverages the reliable cells within a faulty entry to store compressed register values. Experimental results show that, with more than a third of faulty register entries, DC-Patch ensures a reliable operation of the register file and reduces the energy consumption by 47% with respect to a conventional register file working at nominal supply voltage. The energy savings are 21% compared to a voltage noise smoothing scheme operating at the safe supply voltage limit. These benefits are obtained with less than 2 and 6% impact on the system performance and area, respectively

    Improving the Robustness of Redundant Execution with Register File Randomization

    Full text link
    [EN] Staggered Redundant execution (SRE) is a fault-tolerance mechanism that has been widely deployed in the context of safety-critical applications. SRE not only protects the system in the presence of faults but also helps relaxing safety requirements of individual elements. However, in this paper, we show that SRE does not effectively protect the system against a wide range of faults and thus, new mechanisms to increase the diversity of homogeneous cores are needed. In this paper, we propose Register File Randomization (RFR), a low-cost diversity mechanism that significantly increases the robustness of homogeneous multicores in front of common-cause faults (CCFs) and register file wearout. Our results show that RFR completely removes the failure rate for register file CCFs for certain workloads and reduces by a factor of 5X the impact of stress related register file aging for the workloads analysed. Our implementation requires less than 50 RTL lines of code and the area (FPGA logic) overhead of RFR is less than 0.2% of a 64-bit RISC-V core FPGA implementation.This work has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 877056 and the Agencia Estatal de Investigacion from Spain under grant agreement no. PCI2020-112092, and from the the European Unions Horizon 2020 research and innovation programme under grant agreement no. 871467.Tuzov, I.; Andreu, P.; Medina, L.; Picornell-Sanjuan, T.; Robles MartĂ­nez, A.; LĂłpez RodrĂ­guez, PJ.; Flich Cardo, J.... (2021). Improving the Robustness of Redundant Execution with Register File Randomization. IEEE. 1-9. https://doi.org/10.1109/ICCAD51958.2021.96434661

    Cross layer reliability estimation for digital systems

    Get PDF
    Forthcoming manufacturing technologies hold the promise to increase multifuctional computing systems performance and functionality thanks to a remarkable growth of the device integration density. Despite the benefits introduced by this technology improvements, reliability is becoming a key challenge for the semiconductor industry. With transistor size reaching the atomic dimensions, vulnerability to unavoidable fluctuations in the manufacturing process and environmental stress rise dramatically. Failing to meet a reliability requirement may add excessive re-design cost to recover and may have severe consequences on the success of a product. %Worst-case design with large margins to guarantee reliable operation has been employed for long time. However, it is reaching a limit that makes it economically unsustainable due to its performance, area, and power cost. One of the open challenges for future technologies is building ``dependable'' systems on top of unreliable components, which will degrade and even fail during normal lifetime of the chip. Conventional design techniques are highly inefficient. They expend significant amount of energy to tolerate the device unpredictability by adding safety margins to a circuit's operating voltage, clock frequency or charge stored per bit. Unfortunately, the additional cost introduced to compensate unreliability are rapidly becoming unacceptable in today's environment where power consumption is often the limiting factor for integrated circuit performance, and energy efficiency is a top concern. Attention should be payed to tailor techniques to improve the reliability of a system on the basis of its requirements, ending up with cost-effective solutions favoring the success of the product on the market. Cross-layer reliability is one of the most promising approaches to achieve this goal. Cross-layer reliability techniques take into account the interactions between the layers composing a complex system (i.e., technology, hardware and software layers) to implement efficient cross-layer fault mitigation mechanisms. Fault tolerance mechanism are carefully implemented at different layers starting from the technology up to the software layer to carefully optimize the system by exploiting the inner capability of each layer to mask lower level faults. For this purpose, cross-layer reliability design techniques need to be complemented with cross-layer reliability evaluation tools, able to precisely assess the reliability level of a selected design early in the design cycle. Accurate and early reliability estimates would enable the exploration of the system design space and the optimization of multiple constraints such as performance, power consumption, cost and reliability. This Ph.D. thesis is devoted to the development of new methodologies and tools to evaluate and optimize the reliability of complex digital systems during the early design stages. More specifically, techniques addressing hardware accelerators (i.e., FPGAs and GPUs), microprocessors and full systems are discussed. All developed methodologies are presented in conjunction with their application to real-world use cases belonging to different computational domains

    Predictive Reliability and Fault Management in Exascale Systems: State of the Art and Perspectives

    Get PDF
    © ACM, 2020. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Computing Surveys, Vol. 53, No. 5, Article 95. Publication date: September 2020. https://doi.org/10.1145/3403956[EN] Performance and power constraints come together with Complementary Metal Oxide Semiconductor technology scaling in future Exascale systems. Technology scaling makes each individual transistor more prone to faults and, due to the exponential increase in the number of devices per chip, to higher system fault rates. Consequently, High-performance Computing (HPC) systems need to integrate prediction, detection, and recovery mechanisms to cope with faults efficiently. This article reviews fault detection, fault prediction, and recovery techniques in HPC systems, from electronics to system level. We analyze their strengths and limitations. Finally, we identify the promising paths to meet the reliability levels of Exascale systems.This work has received funding from the European Union's Horizon 2020 (H2020) research and innovation program under the FET-HPC Grant Agreement No. 801137 (RECIPE). Jaume Abella was also partially supported by the Ministry of Economy and Competitiveness of Spain under Contract No. TIN2015-65316-P and under Ramon y Cajal Postdoctoral Fellowship No. RYC-2013-14717, as well as by the HiPEAC Network of Excellence. Ramon Canal is partially supported by the Generalitat de Catalunya under Contract No. 2017SGR0962.Canal, R.; Hernández Luz, C.; Tornero-Gavilá, R.; Cilardo, A.; Massari, G.; Reghenzani, F.; Fornaciari, W.... (2020). Predictive Reliability and Fault Management in Exascale Systems: State of the Art and Perspectives. ACM Computing Surveys. 53(5):1-32. https://doi.org/10.1145/3403956S132535Abella, J., Hernandez, C., Quinones, E., Cazorla, F. J., Conmy, P. R., Azkarate-askasua, M., … Vardanega, T. (2015). WCET analysis methods: Pitfalls and challenges on their trustworthiness. 10th IEEE International Symposium on Industrial Embedded Systems (SIES). doi:10.1109/sies.2015.7185039E. Agullo L. Giraud A. Guermouche J. Roman and M. Zounon. 2013. Towards resilient parallel linear Krylov solvers: Recover-restart strategies. INRIA Research Report RR-8324. E. Agullo L. Giraud A. Guermouche J. Roman and M. Zounon. 2013. Towards resilient parallel linear Krylov solvers: Recover-restart strategies. INRIA Research Report RR-8324.Agullo, E., Giraud, L., Salas, P., & Zounon, M. (2016). Interpolation-Restart Strategies for Resilient Eigensolvers. SIAM Journal on Scientific Computing, 38(5), C560-C583. doi:10.1137/15m1042115Al-Qawasmeh, A. M., Pasricha, S., Maciejewski, A. A., & Siegel, H. J. (2015). Power and Thermal-Aware Workload Allocation in Heterogeneous Data Centers. IEEE Transactions on Computers, 64(2), 477-491. doi:10.1109/tc.2013.116ARM. 2017. ARM Reliability Availability and Serviceability (RAS) Specification—ARMv8 for the ARMv8-A Architecture Profile. White paper. Retrieved from https://developer.arm.com/docs/ddi0587/latest. ARM. 2017. ARM Reliability Availability and Serviceability (RAS) Specification—ARMv8 for the ARMv8-A Architecture Profile. White paper. Retrieved from https://developer.arm.com/docs/ddi0587/latest.Avizienis, A., Laprie, J.-C., Randell, B., & Landwehr, C. (2004). Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11-33. doi:10.1109/tdsc.2004.2Bautista-Gomez, L., Zyulkyarov, F., Unsal, O., & McIntosh-Smith, S. (2016). Unprotected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer. SC16: International Conference for High Performance Computing, Networking, Storage and Analysis. doi:10.1109/sc.2016.54Berrocal, E., Bautista-Gomez, L., Di, S., Lan, Z., & Cappello, F. (2017). Toward General Software Level Silent Data Corruption Detection for Parallel Applications. IEEE Transactions on Parallel and Distributed Systems, 28(12), 3642-3655. doi:10.1109/tpds.2017.2735971M.-A. Breuer and A. D. Friedman. 1976. Diagnosis 8 Reliable Design of Digital Systems. Springer. M.-A. Breuer and A. D. Friedman. 1976. Diagnosis 8 Reliable Design of Digital Systems. Springer.P. Bridges K. Ferreira M. Heroux and M. Hoemmen. 2012. Fault-tolerant linear solvers via selective reliability. ArXiv e-prints June 2012. arXiv:1206.1390 [math.NA]. P. Bridges K. Ferreira M. Heroux and M. Hoemmen. 2012. Fault-tolerant linear solvers via selective reliability. ArXiv e-prints June 2012. arXiv:1206.1390 [math.NA].F. Cappello A. Geist W. Gropp S. Kale B. Kramer and M. Snir. 2014. Toward exascale resilience: 2014 update. Supercomput. Front. Innovat. 1 1 (2014). http://superfri.org/superfri/article/view/14. F. Cappello A. Geist W. Gropp S. Kale B. Kramer and M. Snir. 2014. Toward exascale resilience: 2014 update. Supercomput. Front. Innovat. 1 1 (2014). http://superfri.org/superfri/article/view/14.F. J. Cazorla L. Kosmidis E. Mezzetti C. Hernandez J. Abella and T. Vardanega. 2019. Probabilistic worst-case timing analysis: Taxonomy and comprehensive survey. ACM Comput. Surv. 52 1 Article 14 (Feb. 2019) 35 pages. DOI:https://doi.org/10.1145/3301283 F. J. Cazorla L. Kosmidis E. Mezzetti C. Hernandez J. Abella and T. Vardanega. 2019. Probabilistic worst-case timing analysis: Taxonomy and comprehensive survey. ACM Comput. Surv. 52 1 Article 14 (Feb. 2019) 35 pages. DOI:https://doi.org/10.1145/3301283Chan, C. S., Pan, B., Gross, K., Vaidyanathan, K., & Rosing, T. Š. (2014). Correcting vibration-induced performance degradation in enterprise servers. ACM SIGMETRICS Performance Evaluation Review, 41(3), 83-88. doi:10.1145/2567529.2567555Chantem, T., Hu, X. S., & Dick, R. P. (2011). Temperature-Aware Scheduling and Assignment for Hard Real-Time Applications on MPSoCs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 19(10), 1884-1897. doi:10.1109/tvlsi.2010.2058873Chen, M. Y., Kiciman, E., Fratkin, E., Fox, A., & Brewer, E. (s. f.). Pinpoint: problem determination in large, dynamic Internet services. Proceedings International Conference on Dependable Systems and Networks. doi:10.1109/dsn.2002.1029005Chen, Z. (2011). Algorithm-based recovery for iterative methods without checkpointing. Proceedings of the 20th international symposium on High performance distributed computing - HPDC ’11. doi:10.1145/1996130.1996142Chen, Z. (2013). Online-ABFT. Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP ’13. doi:10.1145/2442516.2442533Coskun, A. K., Rosing, T. S., Mihic, K., De Micheli, G., & Leblebici, Y. (2006). Analysis and Optimization of MPSoC Reliability. Journal of Low Power Electronics, 2(1), 56-69. doi:10.1166/jolpe.2006.007G. Da Costa A. Oleksiak W. Piatek J. Salom and L. Sisó. 2015. Minimization of costs and energy consumption in a data center by a workload-based capacity management. In Energy Efficient Data Centers S. Klingert M. Chinnici and M. Rey Porto (Eds.). Springer International Publishing Cham 102--119. G. Da Costa A. Oleksiak W. Piatek J. Salom and L. Sisó. 2015. Minimization of costs and energy consumption in a data center by a workload-based capacity management. In Energy Efficient Data Centers S. Klingert M. Chinnici and M. Rey Porto (Eds.). Springer International Publishing Cham 102--119.Cupertino, L., Da Costa, G., Oleksiak, A., Pia¸tek, W., Pierson, J.-M., Salom, J., … Zilio, T. (2015). Energy-efficient, thermal-aware modeling and simulation of data centers: The CoolEmAll approach and evaluation results. Ad Hoc Networks, 25, 535-553. doi:10.1016/j.adhoc.2014.11.002Dally, W. J. (1991). Express cubes: improving the performance of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 40(9), 1016-1023. doi:10.1109/12.83652Dauwe, D., Pasricha, S., Maciejewski, A. A., & Siegel, H. J. (2018). Resilience-Aware Resource Management for Exascale Computing Systems. IEEE Transactions on Sustainable Computing, 3(4), 332-345. doi:10.1109/tsusc.2018.2797890R. I. Davis and A. Burns. 2011. A survey of hard real-time scheduling for multiprocessor systems. ACM Comput. Surv. 43 4 Article 35 (Oct. 2011) 44 pages. DOI:https://doi.org/10.1145/1978802.1978814 R. I. Davis and A. Burns. 2011. A survey of hard real-time scheduling for multiprocessor systems. ACM Comput. Surv. 43 4 Article 35 (Oct. 2011) 44 pages. DOI:https://doi.org/10.1145/1978802.1978814Di, S., & Cappello, F. (2016). Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications. IEEE Transactions on Parallel and Distributed Systems, 27(10), 2809-2823. doi:10.1109/tpds.2016.2517639Di, S., Guo, H., Gupta, R., Pershey, E. R., Snir, M., & Cappello, F. (2019). Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System. IEEE Transactions on Parallel and Distributed Systems, 30(2), 361-374. doi:10.1109/tpds.2018.2864184Di, S., Robert, Y., Vivien, F., & Cappello, F. (2017). Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model. IEEE Transactions on Parallel and Distributed Systems, 28(1), 244-259. doi:10.1109/tpds.2016.2546248J. Dongarra T. Herault and Y. Robert. 2015. Fault Tolerance Techniques for High-Performance Computing. Springer. J. Dongarra T. Herault and Y. Robert. 2015. Fault Tolerance Techniques for High-Performance Computing. Springer.DOWNING, S., & SOCIE, D. (1982). Simple rainflow counting algorithms. International Journal of Fatigue, 4(1), 31-40. doi:10.1016/0142-1123(82)90018-4Eghbalkhah, B., Kamal, M., Afzali-Kusha, H., Afzali-Kusha, A., Ghaznavi-Ghoushchi, M. B., & Pedram, M. (2015). Workload and temperature dependent evaluation of BTI-induced lifetime degradation in digital circuits. Microelectronics Reliability, 55(8), 1152-1162. doi:10.1016/j.microrel.2015.06.004Gottscho, M., Shoaib, M., Govindan, S., Sharma, B., Wang, D., & Gupta, P. (2017). Measuring the Impact of Memory Errors on Application  Performance. IEEE Computer Architecture Letters, 16(1), 51-55. doi:10.1109/lca.2016.2599513Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., … Sengupta, S. (2011). VL2. Communications of the ACM, 54(3), 95-104. doi:10.1145/1897852.1897877Heroux, M. A., Bartlett, R. A., Howle, V. E., Hoekstra, R. J., Hu, J. J., Kolda, T. G., … Stanley, K. S. (2005). An overview of the Trilinos project. ACM Transactions on Mathematical Software, 31(3), 397-423. doi:10.1145/1089014.1089021Hoffmann, G. A., Trivedi, K. S., & Malek, M. (2007). A Best Practice Guide to Resource Forecasting for Computing Systems. IEEE Transactions on Reliability, 56(4), 615-628. doi:10.1109/tr.2007.909764Hsiao, M. Y., Carter, W. C., Thomas, J. W., & Stringfellow, W. R. (1981). Reliability, Availability, and Serviceability of IBM Computer Systems: A Quarter Century of Progress. IBM Journal of Research and Development, 25(5), 453-468. doi:10.1147/rd.255.0453Hughes, G. F., Murray, J. F., Kreutz-Delgado, K., & Elkan, C. (2002). Improved disk-drive failure warnings. IEEE Transactions on Reliability, 51(3), 350-357. doi:10.1109/tr.2002.802886S. Hukerikar and C. Engelmann. 2017. Resilience design patterns: A structured approach to resilience at extreme scale. Supercomput. Front. Innov. 4 3 (2017). DOI:https://doi.org/10.14529/jsfi170301 S. Hukerikar and C. Engelmann. 2017. Resilience design patterns: A structured approach to resilience at extreme scale. Supercomput. Front. Innov. 4 3 (2017). DOI:https://doi.org/10.14529/jsfi170301Hussain, H., Malik, S. U. R., Hameed, A., Khan, S. U., Bickler, G., Min-Allah, N., … Rayes, A. (2013). A survey on resource allocation in high performance distributed computing systems. Parallel Computing, 39(11), 709-736. doi:10.1016/j.parco.2013.09.009Intel Corporation. [n.d.]. Intel Xeon Processor E7 Family: Reliability Availability and Serviceability. White paper. https://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-family-ras-server-paper.html. Intel Corporation. [n.d.]. Intel Xeon Processor E7 Family: Reliability Availability and Serviceability. White paper. https://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-family-ras-server-paper.html.Jha, S., Formicola, V., Martino, C. D., Dalton, M., Kramer, W. T., Kalbarczyk, Z., & Iyer, R. K. (2018). Resiliency of HPC Interconnects: A Case Study of Interconnect Failures and Recovery in Blue Waters. IEEE Transactions on Dependable and Secure Computing, 15(6), 915-930. doi:10.1109/tdsc.2017.2737537Kiciman, E., & Fox, A. (2005). Detecting Application-Level Failures in Component-Based Internet Services. IEEE Transactions on Neural Networks, 16(5), 1027-1041. doi:10.1109/tnn.2005.853411Kim, T., Sun, Z., Cook, C., Zhao, H., Li, R., Wong, D., & Tan, S. X.-D. (2016). Invited - Cross-layer modeling and optimization for electromigration induced reliability. Proceedings of the 53rd Annual Design Automation Conference. doi:10.1145/2897937.2905010Kurowski, K., Oleksiak, A., Piątek, W., Piontek, T., Przybyszewski, A., & Węglarz, J. (2013). DCworms – A tool for simulation of energy efficiency in distributed computing infrastructures. Simulation Modelling Practice and Theory, 39, 135-151. doi:10.1016/j.simpat.2013.08.007Langou, J., Chen, Z., Bosilca, G., & Dongarra, J. (2008). Recovery Patterns for Iterative Methods in a Parallel Unstable Environment. SIAM Journal on Scientific Computing, 30(1), 102-116. doi:10.1137/040620394J. C. Laprie (Ed.). 1995. Dependability—Its Attributes Impairments and Means. Springer-Verlag Berlin. J. C. Laprie (Ed.). 1995. Dependability—Its Attributes Impairments and Means. Springer-Verlag Berlin.Laprie, J.-C. (s. f.). DEPENDABLE COMPUTING AND FAULT TOLERANCE : CONCEPTS AND TERMINOLOGY. Twenty-Fifth International Symposium on Fault-Tolerant Computing, 1995, ’ Highlights from Twenty-Five Years’. doi:10.1109/ftcsh.1995.532603Lasance, C. J. M. (2003). Thermally driven reliability issues in microelectronic systems: status-quo and challenges. Microelectronics Reliability, 43(12), 1969-1974. doi:10.1016/s0026-2714(03)00183-5Yinglung Liang, Yanyong Zhang, Sivasubramaniam, A., Jette, M., & Sahoo, R. (s. f.). BlueGene/L Failure Analysis and Prediction Models. International Conference on Dependable Systems and Networks (DSN’06). doi:10.1109/dsn.2006.18Lin, T.-T. Y., & Siewiorek, D. P. (1990). Error log analysis: statistical modeling and heuristic trend analysis. IEEE Transactions on Reliability, 39(4), 419-432. doi:10.1109/24.58720Losada, N., González, P., Martín, M. J., Bosilca, G., Bouteiller, A., & Teranishi, K. (2020). Fault tolerance of MPI applications in exascale systems: The ULFM solution. Future Generation Computer Systems, 106, 467-481. doi:10.1016/j.future.2020.01.026Lyons, R. E., & Vanderkulk, W. (1962). The Use of Triple-Modular Redundancy to Improve Computer Reliability. IBM Journal of Research and Development, 6(2), 200-209. doi:10.1147/rd.62.0200M. Médard and S. S. Lumetta. 2003. Network Reliability and Fault Tolerance. American Cancer Society. Retrieved from arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/0471219282.eot281. M. Médard and S. S. Lumetta. 2003. Network Reliability and Fault Tolerance. American Cancer Society. Retrieved from arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/0471219282.eot281.Moody, A., Bronevetsky, G., Mohror, K., & de Supinski, B. (2010). Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System. doi:10.2172/984082Moor Insights 8 Strategy. 2017. AMD EPYC Brings New RAS Capability. White paper. Retrieved from https://www.amd.com/system/files/2017-06/AMD-EPYC-Brings-New-RAS-Capability.pdf. Moor Insights 8 Strategy. 2017. AMD EPYC Brings New RAS Capability. White paper. Retrieved from https://www.amd.com/system/files/2017-06/AMD-EPYC-Brings-New-RAS-Capability.pdf.Mulas, F., Atienza, D., Acquaviva, A., Carta, S., Benini, L., & De Micheli, G. (2009). Thermal Balancing Policy for Multiprocessor Stream Computing Platforms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 28(12), 1870-1882. doi:10.1109/tcad.2009.2032372Oleksiak, A., Kierzynka, M., Piatek, W., Agosta, G., Barenghi, A., Brandolese, C., … Janssen, U. (2017). M2DC – Modular Microserver DataCentre with heterogeneous hardware. Microprocessors and Microsystems, 52, 117-130. doi:10.1016/j.micpro.2017.05.019Oxley, M. A., Jonardi, E., Pasricha, S., Maciejewski, A. A., Siegel, H. J., Burns, P. J., & Koenig, G. A. (2018). Rate-based thermal, power, and co-location aware resource management for heterogeneous data centers. Journal of Parallel and Distributed Computing, 112, 126-139. doi:10.1016/j.jpdc.2017.04.015K. O’brien I. Pietri R. Reddy A. Lastovetsky and R. Sakellariou. 2017. A survey of power and energy predictive models in HPC systems and applications. ACM Comput. Surv. 50 3 Article 37 (June 2017) 38 pages. DOI:https://doi.org/10.1145/3078811 K. O’brien I. Pietri R. Reddy A. Lastovetsky and R. Sakellariou. 2017. A survey of power and energy predictive models in HPC systems and applications. ACM Comput. Surv. 50 3 Article 37 (June 2017) 38 pages. DOI:https://doi.org/10.1145/3078811Park, S.-M., & Humphrey, M. (2011). Predictable High-Performance Computing Using Feedback Control and Admission Control. IEEE Transactions on Parallel and Distributed Systems, 22(3), 396-411. doi:10.1109/tpds.2010.100Pfefferman, J. D., & Cernuschi-Frias, B. (2002). A nonparametric nonstationary procedure for failure prediction. IEEE Transactions on Reliability, 51(4), 434-442. doi:10.1109/tr.2002.804733Rangan, K. K., Wei, G.-Y., & Brooks, D. (2009). Thread motion. ACM SIGARCH Computer Architecture News, 37(3), 302-313. doi:10.1145/1555815.1555793Paolo Rech. [n.d.]. Reliability Issues in Current and Future Supercomputers. Retrieved from http://energysfe.ufsc.br/slides/Paolo-Rech-260917.pdf. Paolo Rech. [n.d.]. Reliability Issues in Current and Future Supercomputers. Retrieved from http://energysfe.ufsc.br/slides/Paolo-Rech-260917.pdf.F. Reghenzani G. Massari and W. Fornaciari. 2019. The real-time Linux kernel: A survey on PREEMPT_RT. Comput. Surveys 52 1 Article 18 (Feb. 2019) 36 pages. DOI:https://doi.org/10.1145/3297714 F. Reghenzani G. Massari and W. Fornaciari. 2019. The real-time Linux kernel: A survey on PREEMPT_RT. Comput. Surveys 52 1 Article 18 (Feb. 2019) 36 pages. DOI:https://doi.org/10.1145/3297714F. Salfner M. Lenk and M. Malek. 2010. A survey of online failure prediction methods. ACM Comput. Surv. 42 3 Article 10 (March 2010) 42 pages. DOI:https://doi.org/10.1145/1670679.1670680 F. Salfner M. Lenk and M. Malek. 2010. A survey of online failure prediction methods. ACM Comput. Surv. 42 3 Article 10 (March 2010) 42 pages. DOI:https://doi.org/10.1145/1670679.1670680Salfner, F., Schieschke, M., & Malek, M. (2006). Predicting failures of computer systems: a case study for a telecommunication system. Proceedings 20th IEEE International Parallel & Distributed Processing Symposium. doi:10.1109/ipdps.2006.1639672Shi, L., Chen, H., Sun, J., & Li, K. (2012). vCUDA: GPU-Accelerated High-Performance Computing in Virtual Machines. IEEE Transactions on Computers, 61(6), 804-816. doi:10.1109/tc.2011.112D. P. Siewiorek and R. S. Swarz. 1998. Reliable Computer Systems 3rd ed. A. K. Peters Ltd. D. P. Siewiorek and R. S. Swarz. 1998. Reliable Computer Systems 3rd ed. A. K. Peters Ltd.Singh, S., & Chana, I. (2016). A Survey on Resource Scheduling in Cloud Computing: Issues and Challenges. Journal of Grid Computing, 14(2), 217-264. doi:10.1007/s10723-015-9359-2Slegel, T. J., Averill, R. M., Check, M. A., Giamei, B. C., Krumm, B. W., Krygowski, C. A., … Webb, C. F. (1999). IBM’s S/390 G5 microprocessor design. IEEE Micro, 19(2), 12-23. doi:10.1109/40.755464Sridhar, A., Sabry, M. M., & Atienza, D. (2014). A Semi-Analytical Thermal Modeling Framework for Liquid-Cooled ICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 33(8), 1145-1158. doi:10.1109/tcad.2014.2323194Sridharan, V., DeBardeleben, N., Blanchard, S., Ferreira, K. B., Stearley, J., Shalf, J., & Gurumurthi, S. (2015). Memory Errors in Modern Systems. ACM SIGARCH Computer Architecture News, 43(1), 297-310. doi:10.1145/2786763.2694348Stathis, J. H. (2018). The physics of NBTI: What do we really know? 2018 IEEE International Reliability Physics Symposium (IRPS). doi:10.1109/irps.2018.8353539Stellner, G. (s. f.). CoCheck: checkpointing and process migration for MPI. Proceedings of International Conference on Parallel Processing. doi:10.1109/ipps.1996.508106Stone, J. E., Gohara, D., & Shi, G. (2010). OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. Computing in Science & Engineering, 12(3), 66-73. doi:10.1109/mcse.2010.69Subasi, O., Di, S., Bautista-Gomez, L., Balaprakash, P., Unsal, O., Labarta, J., … Cappello, F. (2018). Exploring the capabilities of support vector machines in detecting silent data corruptions. Sustainable Computing: Informatics and Systems, 19, 277-290. doi:10.1016/j.suscom.2018.01.004Tang, D., & Iyer, R. K. (1993). Dependability measurement and modeling of a multicomputer system. IEEE Transactions on Computers, 42(1), 62-75. doi:10.1109/12.192214D. Turnbull and N. Alldrin. 2003. Failure Prediction in Hardware Systems. Tech. rep. University of California San Diego CA. Retrieved from http://www.cs.ucsd.edu/ dturnbul/Papers/ServerPrediction.pdf. D. Turnbull and N. Alldrin. 2003. Failure Prediction in Hardware Systems. Tech. rep. University of California San Diego CA. Retrieved from http://www.cs.ucsd.edu/ dturnbul/Papers/ServerPrediction.pdf.Vilalta, R., Apte, C. V., Hellerstein, J. L., Ma, S., & Weiss, S. M. (2002). Predictive algorithms in the management of computer systems. IBM Systems Journal, 41(3), 461-474. doi:10.1147/sj.413.0461Vinoski, S. (2007). Reliability with Erlang. IEEE Internet Com

    DYRE: a DYnamic REconfigurable solution to increase GPGPU's reliability

    Get PDF
    General-purpose graphics processing units (GPGPUs) are extensively used in high-performance computing. However, it is well known that these devices’ reliability may be limited by the rising of faults at the hardware level. This work introduces a flexible solution to detect and mitigate permanent faults affecting the execution units in these parallel devices. The proposed solution is based on adding some spare modules to perform two in-field operations: detecting and mitigating faults. The solution takes advantage of the regularity of the execution units in the device to avoid significant design changes and reduce the overhead. The proposed solution was evaluated in terms of reliability improvement and area, performance, and power overhead costs. For this purpose, we resorted to a micro-architectural open-source GPGPU model (FlexGripPlus). Experimental results show that the proposed solution can extend the reliability by up to 57%, with overhead costs lower than 2% and 8% in area and power, respectively

    GPU devices for safety-critical systems: a survey

    Get PDF
    Graphics Processing Unit (GPU) devices and their associated software programming languages and frameworks can deliver the computing performance required to facilitate the development of next-generation high-performance safety-critical systems such as autonomous driving systems. However, the integration of complex, parallel, and computationally demanding software functions with different safety-criticality levels on GPU devices with shared hardware resources contributes to several safety certification challenges. This survey categorizes and provides an overview of research contributions that address GPU devices’ random hardware failures, systematic failures, and independence of execution.This work has been partially supported by the European Research Council with Horizon 2020 (grant agreements No. 772773 and 871465), the Spanish Ministry of Science and Innovation under grant PID2019-107255GB, the HiPEAC Network of Excellence and the Basque Government under grant KK-2019-00035. The Spanish Ministry of Economy and Competitiveness has also partially supported Leonidas Kosmidis with a Juan de la Cierva Incorporación postdoctoral fellowship (FJCI-2020- 045931-I).Peer ReviewedPostprint (author's final draft
    • …
    corecore