373 research outputs found

    Improving Performance of Iterative Methods by Lossy Checkponting

    Get PDF
    Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks in parallel, they have to checkpoint the dynamic variables periodically in case of unavoidable fail-stop errors, requiring fast I/O systems and large storage space. To this end, significantly reducing the checkpointing overhead is critical to improving the overall performance of iterative methods. Our contribution is fourfold. (1) We propose a novel lossy checkpointing scheme that can significantly improve the checkpointing performance of iterative methods by leveraging lossy compressors. (2) We formulate a lossy checkpointing performance model and derive theoretically an upper bound for the extra number of iterations caused by the distortion of data in lossy checkpoints, in order to guarantee the performance improvement under the lossy checkpointing scheme. (3) We analyze the impact of lossy checkpointing (i.e., extra number of iterations caused by lossy checkpointing files) for multiple types of iterative methods. (4)We evaluate the lossy checkpointing scheme with optimal checkpointing intervals on a high-performance computing environment with 2,048 cores, using a well-known scientific computation package PETSc and a state-of-the-art checkpoint/restart toolkit. Experiments show that our optimized lossy checkpointing scheme can significantly reduce the fault tolerance overhead for iterative methods by 23%~70% compared with traditional checkpointing and 20%~58% compared with lossless-compressed checkpointing, in the presence of system failures.Comment: 14 pages, 10 figures, HPDC'1

    Resource management for extreme scale high performance computing systems in the presence of failures

    Get PDF
    2018 Summer.Includes bibliographical references.High performance computing (HPC) systems, such as data centers and supercomputers, coordinate the execution of large-scale computation of applications over tens or hundreds of thousands of multicore processors. Unfortunately, as the size of HPC systems continues to grow towards exascale complexities, these systems experience an exponential growth in the number of failures occurring in the system. These failures reduce performance and increase energy use, reducing the efficiency and effectiveness of emerging extreme-scale HPC systems. Applications executing in parallel on individual multicore processors also suffer from decreased performance and increased energy use as a result of applications being forced to share resources, in particular, the contention from multiple application threads sharing the last-level cache causes performance degradation. These challenges make it increasingly important to characterize and optimize the performance and behavior of applications that execute in these systems. To address these challenges, in this dissertation we propose a framework for intelligently characterizing and managing extreme-scale HPC system resources. We devise various techniques to mitigate the negative effects of failures and resource contention in HPC systems. In particular, we develop new HPC resource management techniques for intelligently utilizing system resources through the (a) optimal scheduling of applications to HPC nodes and (b) the optimal configuration of fault resilience protocols. These resource management techniques employ information obtained from historical analysis as well as theoretical and machine learning methods for predictions. We use these data to characterize system performance, energy use, and application behavior when operating under the uncertainty of performance degradation from both system failures and resource contention. We investigate how to better characterize and model the negative effects from system failures as well as application co-location on large-scale HPC computing systems. Our analysis of application and system behavior also investigates: the interrelated effects of network usage of applications and fault resilience protocols; checkpoint interval selection and its sensitivity to system parameters for various checkpoint-based fault resilience protocols; and performance comparisons of various promising strategies for fault resilience in exascale-sized systems

    Reliability -aware optimal checkpoint /restart model in high performance computing

    Get PDF
    Computational power demand for large challenging problems has increasingly driven the physical size of High Performance Computing (HPC) systems. As the system gets larger, it requires more and more components (processor, memory, disk, switch, power supply and so on). Thus, challenges arise in handling reliability of such large-scale systems. In order to minimize the performance loss due to unexpected failures, fault tolerant mechanisms are vital to sustain computational power in such environment. Checkpoint/restart is a common fault tolerant technique which has been widely applied in the single computer system. However, checkpointing in a large-scale HPC environment is much more challenging due to complexity, coordination, and timing issues. In this dissertation, we present a reliability-aware method for an optimal checkpoint/restart strategy. Our scheme aims to address the fault tolerance challenge, especially in a large-scale HPC system, by providing optimal checkpoint placement techniques derived from the actual system reliability. Unlike existing checkpoint models, which can only handle Poisson failure and a constant checkpoint interval, our model can perform a varying checkpoint interval and deal with different failure distributions. In addition, the approach considers optimality for both checkpoint overhead and rollback time. Our validation results suggest a significant improvement over existing techniques

    Reliability-oriented resource management for High-Performance Computing

    Get PDF
    Reliability is an increasingly pressing issue for High-Performance Computing systems, as failures are a threat to large-scale applications, for which an even single run may incur significant energy and billing costs. Currently, application developers need to address reliability explicitly, by integrating application-specific checkpoint/restore mechanisms. However, the application alone cannot exploit system knowledge, which is not the case for system-wide resource management systems. In this paper, we propose a reliability-oriented policy that can increase significantly component reliability by combining checkpoint/restore mechanisms exploitation and proactive resource management policies

    Predictive Reliability and Fault Management in Exascale Systems: State of the Art and Perspectives

    Get PDF
    © ACM, 2020. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Computing Surveys, Vol. 53, No. 5, Article 95. Publication date: September 2020. https://doi.org/10.1145/3403956[EN] Performance and power constraints come together with Complementary Metal Oxide Semiconductor technology scaling in future Exascale systems. Technology scaling makes each individual transistor more prone to faults and, due to the exponential increase in the number of devices per chip, to higher system fault rates. Consequently, High-performance Computing (HPC) systems need to integrate prediction, detection, and recovery mechanisms to cope with faults efficiently. This article reviews fault detection, fault prediction, and recovery techniques in HPC systems, from electronics to system level. We analyze their strengths and limitations. Finally, we identify the promising paths to meet the reliability levels of Exascale systems.This work has received funding from the European Union's Horizon 2020 (H2020) research and innovation program under the FET-HPC Grant Agreement No. 801137 (RECIPE). Jaume Abella was also partially supported by the Ministry of Economy and Competitiveness of Spain under Contract No. TIN2015-65316-P and under Ramon y Cajal Postdoctoral Fellowship No. RYC-2013-14717, as well as by the HiPEAC Network of Excellence. Ramon Canal is partially supported by the Generalitat de Catalunya under Contract No. 2017SGR0962.Canal, R.; Hernández Luz, C.; Tornero-Gavilá, R.; Cilardo, A.; Massari, G.; Reghenzani, F.; Fornaciari, W.... (2020). Predictive Reliability and Fault Management in Exascale Systems: State of the Art and Perspectives. ACM Computing Surveys. 53(5):1-32. https://doi.org/10.1145/3403956S132535Abella, J., Hernandez, C., Quinones, E., Cazorla, F. J., Conmy, P. R., Azkarate-askasua, M., … Vardanega, T. (2015). WCET analysis methods: Pitfalls and challenges on their trustworthiness. 10th IEEE International Symposium on Industrial Embedded Systems (SIES). doi:10.1109/sies.2015.7185039E. Agullo L. Giraud A. Guermouche J. Roman and M. Zounon. 2013. Towards resilient parallel linear Krylov solvers: Recover-restart strategies. INRIA Research Report RR-8324. E. Agullo L. Giraud A. Guermouche J. Roman and M. Zounon. 2013. Towards resilient parallel linear Krylov solvers: Recover-restart strategies. INRIA Research Report RR-8324.Agullo, E., Giraud, L., Salas, P., & Zounon, M. (2016). Interpolation-Restart Strategies for Resilient Eigensolvers. SIAM Journal on Scientific Computing, 38(5), C560-C583. doi:10.1137/15m1042115Al-Qawasmeh, A. M., Pasricha, S., Maciejewski, A. A., & Siegel, H. J. (2015). Power and Thermal-Aware Workload Allocation in Heterogeneous Data Centers. IEEE Transactions on Computers, 64(2), 477-491. doi:10.1109/tc.2013.116ARM. 2017. ARM Reliability Availability and Serviceability (RAS) Specification—ARMv8 for the ARMv8-A Architecture Profile. White paper. Retrieved from https://developer.arm.com/docs/ddi0587/latest. ARM. 2017. ARM Reliability Availability and Serviceability (RAS) Specification—ARMv8 for the ARMv8-A Architecture Profile. White paper. Retrieved from https://developer.arm.com/docs/ddi0587/latest.Avizienis, A., Laprie, J.-C., Randell, B., & Landwehr, C. (2004). Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11-33. doi:10.1109/tdsc.2004.2Bautista-Gomez, L., Zyulkyarov, F., Unsal, O., & McIntosh-Smith, S. (2016). Unprotected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer. SC16: International Conference for High Performance Computing, Networking, Storage and Analysis. doi:10.1109/sc.2016.54Berrocal, E., Bautista-Gomez, L., Di, S., Lan, Z., & Cappello, F. (2017). Toward General Software Level Silent Data Corruption Detection for Parallel Applications. IEEE Transactions on Parallel and Distributed Systems, 28(12), 3642-3655. doi:10.1109/tpds.2017.2735971M.-A. Breuer and A. D. Friedman. 1976. Diagnosis 8 Reliable Design of Digital Systems. Springer. M.-A. Breuer and A. D. Friedman. 1976. Diagnosis 8 Reliable Design of Digital Systems. Springer.P. Bridges K. Ferreira M. Heroux and M. Hoemmen. 2012. Fault-tolerant linear solvers via selective reliability. ArXiv e-prints June 2012. arXiv:1206.1390 [math.NA]. P. Bridges K. Ferreira M. Heroux and M. Hoemmen. 2012. Fault-tolerant linear solvers via selective reliability. ArXiv e-prints June 2012. arXiv:1206.1390 [math.NA].F. Cappello A. Geist W. Gropp S. Kale B. Kramer and M. Snir. 2014. Toward exascale resilience: 2014 update. Supercomput. Front. Innovat. 1 1 (2014). http://superfri.org/superfri/article/view/14. F. Cappello A. Geist W. Gropp S. Kale B. Kramer and M. Snir. 2014. Toward exascale resilience: 2014 update. Supercomput. Front. Innovat. 1 1 (2014). http://superfri.org/superfri/article/view/14.F. J. Cazorla L. Kosmidis E. Mezzetti C. Hernandez J. Abella and T. Vardanega. 2019. Probabilistic worst-case timing analysis: Taxonomy and comprehensive survey. ACM Comput. Surv. 52 1 Article 14 (Feb. 2019) 35 pages. DOI:https://doi.org/10.1145/3301283 F. J. Cazorla L. Kosmidis E. Mezzetti C. Hernandez J. Abella and T. Vardanega. 2019. Probabilistic worst-case timing analysis: Taxonomy and comprehensive survey. ACM Comput. Surv. 52 1 Article 14 (Feb. 2019) 35 pages. DOI:https://doi.org/10.1145/3301283Chan, C. S., Pan, B., Gross, K., Vaidyanathan, K., & Rosing, T. Š. (2014). Correcting vibration-induced performance degradation in enterprise servers. ACM SIGMETRICS Performance Evaluation Review, 41(3), 83-88. doi:10.1145/2567529.2567555Chantem, T., Hu, X. S., & Dick, R. P. (2011). Temperature-Aware Scheduling and Assignment for Hard Real-Time Applications on MPSoCs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 19(10), 1884-1897. doi:10.1109/tvlsi.2010.2058873Chen, M. Y., Kiciman, E., Fratkin, E., Fox, A., & Brewer, E. (s. f.). Pinpoint: problem determination in large, dynamic Internet services. Proceedings International Conference on Dependable Systems and Networks. doi:10.1109/dsn.2002.1029005Chen, Z. (2011). Algorithm-based recovery for iterative methods without checkpointing. Proceedings of the 20th international symposium on High performance distributed computing - HPDC ’11. doi:10.1145/1996130.1996142Chen, Z. (2013). Online-ABFT. Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP ’13. doi:10.1145/2442516.2442533Coskun, A. K., Rosing, T. S., Mihic, K., De Micheli, G., & Leblebici, Y. (2006). Analysis and Optimization of MPSoC Reliability. Journal of Low Power Electronics, 2(1), 56-69. doi:10.1166/jolpe.2006.007G. Da Costa A. Oleksiak W. Piatek J. Salom and L. Sisó. 2015. Minimization of costs and energy consumption in a data center by a workload-based capacity management. In Energy Efficient Data Centers S. Klingert M. Chinnici and M. Rey Porto (Eds.). Springer International Publishing Cham 102--119. G. Da Costa A. Oleksiak W. Piatek J. Salom and L. Sisó. 2015. Minimization of costs and energy consumption in a data center by a workload-based capacity management. In Energy Efficient Data Centers S. Klingert M. Chinnici and M. Rey Porto (Eds.). Springer International Publishing Cham 102--119.Cupertino, L., Da Costa, G., Oleksiak, A., Pia¸tek, W., Pierson, J.-M., Salom, J., … Zilio, T. (2015). Energy-efficient, thermal-aware modeling and simulation of data centers: The CoolEmAll approach and evaluation results. Ad Hoc Networks, 25, 535-553. doi:10.1016/j.adhoc.2014.11.002Dally, W. J. (1991). Express cubes: improving the performance of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 40(9), 1016-1023. doi:10.1109/12.83652Dauwe, D., Pasricha, S., Maciejewski, A. A., & Siegel, H. J. (2018). Resilience-Aware Resource Management for Exascale Computing Systems. IEEE Transactions on Sustainable Computing, 3(4), 332-345. doi:10.1109/tsusc.2018.2797890R. I. Davis and A. Burns. 2011. A survey of hard real-time scheduling for multiprocessor systems. ACM Comput. Surv. 43 4 Article 35 (Oct. 2011) 44 pages. DOI:https://doi.org/10.1145/1978802.1978814 R. I. Davis and A. Burns. 2011. A survey of hard real-time scheduling for multiprocessor systems. ACM Comput. Surv. 43 4 Article 35 (Oct. 2011) 44 pages. DOI:https://doi.org/10.1145/1978802.1978814Di, S., & Cappello, F. (2016). Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications. IEEE Transactions on Parallel and Distributed Systems, 27(10), 2809-2823. doi:10.1109/tpds.2016.2517639Di, S., Guo, H., Gupta, R., Pershey, E. R., Snir, M., & Cappello, F. (2019). Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System. IEEE Transactions on Parallel and Distributed Systems, 30(2), 361-374. doi:10.1109/tpds.2018.2864184Di, S., Robert, Y., Vivien, F., & Cappello, F. (2017). Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model. IEEE Transactions on Parallel and Distributed Systems, 28(1), 244-259. doi:10.1109/tpds.2016.2546248J. Dongarra T. Herault and Y. Robert. 2015. Fault Tolerance Techniques for High-Performance Computing. Springer. J. Dongarra T. Herault and Y. Robert. 2015. Fault Tolerance Techniques for High-Performance Computing. Springer.DOWNING, S., & SOCIE, D. (1982). Simple rainflow counting algorithms. International Journal of Fatigue, 4(1), 31-40. doi:10.1016/0142-1123(82)90018-4Eghbalkhah, B., Kamal, M., Afzali-Kusha, H., Afzali-Kusha, A., Ghaznavi-Ghoushchi, M. B., & Pedram, M. (2015). Workload and temperature dependent evaluation of BTI-induced lifetime degradation in digital circuits. Microelectronics Reliability, 55(8), 1152-1162. doi:10.1016/j.microrel.2015.06.004Gottscho, M., Shoaib, M., Govindan, S., Sharma, B., Wang, D., & Gupta, P. (2017). Measuring the Impact of Memory Errors on Application  Performance. IEEE Computer Architecture Letters, 16(1), 51-55. doi:10.1109/lca.2016.2599513Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., … Sengupta, S. (2011). VL2. Communications of the ACM, 54(3), 95-104. doi:10.1145/1897852.1897877Heroux, M. A., Bartlett, R. A., Howle, V. E., Hoekstra, R. J., Hu, J. J., Kolda, T. G., … Stanley, K. S. (2005). An overview of the Trilinos project. ACM Transactions on Mathematical Software, 31(3), 397-423. doi:10.1145/1089014.1089021Hoffmann, G. A., Trivedi, K. S., & Malek, M. (2007). A Best Practice Guide to Resource Forecasting for Computing Systems. IEEE Transactions on Reliability, 56(4), 615-628. doi:10.1109/tr.2007.909764Hsiao, M. Y., Carter, W. C., Thomas, J. W., & Stringfellow, W. R. (1981). Reliability, Availability, and Serviceability of IBM Computer Systems: A Quarter Century of Progress. IBM Journal of Research and Development, 25(5), 453-468. doi:10.1147/rd.255.0453Hughes, G. F., Murray, J. F., Kreutz-Delgado, K., & Elkan, C. (2002). Improved disk-drive failure warnings. IEEE Transactions on Reliability, 51(3), 350-357. doi:10.1109/tr.2002.802886S. Hukerikar and C. Engelmann. 2017. Resilience design patterns: A structured approach to resilience at extreme scale. Supercomput. Front. Innov. 4 3 (2017). DOI:https://doi.org/10.14529/jsfi170301 S. Hukerikar and C. Engelmann. 2017. Resilience design patterns: A structured approach to resilience at extreme scale. Supercomput. Front. Innov. 4 3 (2017). DOI:https://doi.org/10.14529/jsfi170301Hussain, H., Malik, S. U. R., Hameed, A., Khan, S. U., Bickler, G., Min-Allah, N., … Rayes, A. (2013). A survey on resource allocation in high performance distributed computing systems. Parallel Computing, 39(11), 709-736. doi:10.1016/j.parco.2013.09.009Intel Corporation. [n.d.]. Intel Xeon Processor E7 Family: Reliability Availability and Serviceability. White paper. https://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-family-ras-server-paper.html. Intel Corporation. [n.d.]. Intel Xeon Processor E7 Family: Reliability Availability and Serviceability. White paper. https://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-family-ras-server-paper.html.Jha, S., Formicola, V., Martino, C. D., Dalton, M., Kramer, W. T., Kalbarczyk, Z., & Iyer, R. K. (2018). Resiliency of HPC Interconnects: A Case Study of Interconnect Failures and Recovery in Blue Waters. IEEE Transactions on Dependable and Secure Computing, 15(6), 915-930. doi:10.1109/tdsc.2017.2737537Kiciman, E., & Fox, A. (2005). Detecting Application-Level Failures in Component-Based Internet Services. IEEE Transactions on Neural Networks, 16(5), 1027-1041. doi:10.1109/tnn.2005.853411Kim, T., Sun, Z., Cook, C., Zhao, H., Li, R., Wong, D., & Tan, S. X.-D. (2016). Invited - Cross-layer modeling and optimization for electromigration induced reliability. Proceedings of the 53rd Annual Design Automation Conference. doi:10.1145/2897937.2905010Kurowski, K., Oleksiak, A., Piątek, W., Piontek, T., Przybyszewski, A., & Węglarz, J. (2013). DCworms – A tool for simulation of energy efficiency in distributed computing infrastructures. Simulation Modelling Practice and Theory, 39, 135-151. doi:10.1016/j.simpat.2013.08.007Langou, J., Chen, Z., Bosilca, G., & Dongarra, J. (2008). Recovery Patterns for Iterative Methods in a Parallel Unstable Environment. SIAM Journal on Scientific Computing, 30(1), 102-116. doi:10.1137/040620394J. C. Laprie (Ed.). 1995. Dependability—Its Attributes Impairments and Means. Springer-Verlag Berlin. J. C. Laprie (Ed.). 1995. Dependability—Its Attributes Impairments and Means. Springer-Verlag Berlin.Laprie, J.-C. (s. f.). DEPENDABLE COMPUTING AND FAULT TOLERANCE : CONCEPTS AND TERMINOLOGY. Twenty-Fifth International Symposium on Fault-Tolerant Computing, 1995, ’ Highlights from Twenty-Five Years’. doi:10.1109/ftcsh.1995.532603Lasance, C. J. M. (2003). Thermally driven reliability issues in microelectronic systems: status-quo and challenges. Microelectronics Reliability, 43(12), 1969-1974. doi:10.1016/s0026-2714(03)00183-5Yinglung Liang, Yanyong Zhang, Sivasubramaniam, A., Jette, M., & Sahoo, R. (s. f.). BlueGene/L Failure Analysis and Prediction Models. International Conference on Dependable Systems and Networks (DSN’06). doi:10.1109/dsn.2006.18Lin, T.-T. Y., & Siewiorek, D. P. (1990). Error log analysis: statistical modeling and heuristic trend analysis. IEEE Transactions on Reliability, 39(4), 419-432. doi:10.1109/24.58720Losada, N., González, P., Martín, M. J., Bosilca, G., Bouteiller, A., & Teranishi, K. (2020). Fault tolerance of MPI applications in exascale systems: The ULFM solution. Future Generation Computer Systems, 106, 467-481. doi:10.1016/j.future.2020.01.026Lyons, R. E., & Vanderkulk, W. (1962). The Use of Triple-Modular Redundancy to Improve Computer Reliability. IBM Journal of Research and Development, 6(2), 200-209. doi:10.1147/rd.62.0200M. Médard and S. S. Lumetta. 2003. Network Reliability and Fault Tolerance. American Cancer Society. Retrieved from arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/0471219282.eot281. M. Médard and S. S. Lumetta. 2003. Network Reliability and Fault Tolerance. American Cancer Society. Retrieved from arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/0471219282.eot281.Moody, A., Bronevetsky, G., Mohror, K., & de Supinski, B. (2010). Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System. doi:10.2172/984082Moor Insights 8 Strategy. 2017. AMD EPYC Brings New RAS Capability. White paper. Retrieved from https://www.amd.com/system/files/2017-06/AMD-EPYC-Brings-New-RAS-Capability.pdf. Moor Insights 8 Strategy. 2017. AMD EPYC Brings New RAS Capability. White paper. Retrieved from https://www.amd.com/system/files/2017-06/AMD-EPYC-Brings-New-RAS-Capability.pdf.Mulas, F., Atienza, D., Acquaviva, A., Carta, S., Benini, L., & De Micheli, G. (2009). Thermal Balancing Policy for Multiprocessor Stream Computing Platforms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 28(12), 1870-1882. doi:10.1109/tcad.2009.2032372Oleksiak, A., Kierzynka, M., Piatek, W., Agosta, G., Barenghi, A., Brandolese, C., … Janssen, U. (2017). M2DC – Modular Microserver DataCentre with heterogeneous hardware. Microprocessors and Microsystems, 52, 117-130. doi:10.1016/j.micpro.2017.05.019Oxley, M. A., Jonardi, E., Pasricha, S., Maciejewski, A. A., Siegel, H. J., Burns, P. J., & Koenig, G. A. (2018). Rate-based thermal, power, and co-location aware resource management for heterogeneous data centers. Journal of Parallel and Distributed Computing, 112, 126-139. doi:10.1016/j.jpdc.2017.04.015K. O’brien I. Pietri R. Reddy A. Lastovetsky and R. Sakellariou. 2017. A survey of power and energy predictive models in HPC systems and applications. ACM Comput. Surv. 50 3 Article 37 (June 2017) 38 pages. DOI:https://doi.org/10.1145/3078811 K. O’brien I. Pietri R. Reddy A. Lastovetsky and R. Sakellariou. 2017. A survey of power and energy predictive models in HPC systems and applications. ACM Comput. Surv. 50 3 Article 37 (June 2017) 38 pages. DOI:https://doi.org/10.1145/3078811Park, S.-M., & Humphrey, M. (2011). Predictable High-Performance Computing Using Feedback Control and Admission Control. IEEE Transactions on Parallel and Distributed Systems, 22(3), 396-411. doi:10.1109/tpds.2010.100Pfefferman, J. D., & Cernuschi-Frias, B. (2002). A nonparametric nonstationary procedure for failure prediction. IEEE Transactions on Reliability, 51(4), 434-442. doi:10.1109/tr.2002.804733Rangan, K. K., Wei, G.-Y., & Brooks, D. (2009). Thread motion. ACM SIGARCH Computer Architecture News, 37(3), 302-313. doi:10.1145/1555815.1555793Paolo Rech. [n.d.]. Reliability Issues in Current and Future Supercomputers. Retrieved from http://energysfe.ufsc.br/slides/Paolo-Rech-260917.pdf. Paolo Rech. [n.d.]. Reliability Issues in Current and Future Supercomputers. Retrieved from http://energysfe.ufsc.br/slides/Paolo-Rech-260917.pdf.F. Reghenzani G. Massari and W. Fornaciari. 2019. The real-time Linux kernel: A survey on PREEMPT_RT. Comput. Surveys 52 1 Article 18 (Feb. 2019) 36 pages. DOI:https://doi.org/10.1145/3297714 F. Reghenzani G. Massari and W. Fornaciari. 2019. The real-time Linux kernel: A survey on PREEMPT_RT. Comput. Surveys 52 1 Article 18 (Feb. 2019) 36 pages. DOI:https://doi.org/10.1145/3297714F. Salfner M. Lenk and M. Malek. 2010. A survey of online failure prediction methods. ACM Comput. Surv. 42 3 Article 10 (March 2010) 42 pages. DOI:https://doi.org/10.1145/1670679.1670680 F. Salfner M. Lenk and M. Malek. 2010. A survey of online failure prediction methods. ACM Comput. Surv. 42 3 Article 10 (March 2010) 42 pages. DOI:https://doi.org/10.1145/1670679.1670680Salfner, F., Schieschke, M., & Malek, M. (2006). Predicting failures of computer systems: a case study for a telecommunication system. Proceedings 20th IEEE International Parallel & Distributed Processing Symposium. doi:10.1109/ipdps.2006.1639672Shi, L., Chen, H., Sun, J., & Li, K. (2012). vCUDA: GPU-Accelerated High-Performance Computing in Virtual Machines. IEEE Transactions on Computers, 61(6), 804-816. doi:10.1109/tc.2011.112D. P. Siewiorek and R. S. Swarz. 1998. Reliable Computer Systems 3rd ed. A. K. Peters Ltd. D. P. Siewiorek and R. S. Swarz. 1998. Reliable Computer Systems 3rd ed. A. K. Peters Ltd.Singh, S., & Chana, I. (2016). A Survey on Resource Scheduling in Cloud Computing: Issues and Challenges. Journal of Grid Computing, 14(2), 217-264. doi:10.1007/s10723-015-9359-2Slegel, T. J., Averill, R. M., Check, M. A., Giamei, B. C., Krumm, B. W., Krygowski, C. A., … Webb, C. F. (1999). IBM’s S/390 G5 microprocessor design. IEEE Micro, 19(2), 12-23. doi:10.1109/40.755464Sridhar, A., Sabry, M. M., & Atienza, D. (2014). A Semi-Analytical Thermal Modeling Framework for Liquid-Cooled ICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 33(8), 1145-1158. doi:10.1109/tcad.2014.2323194Sridharan, V., DeBardeleben, N., Blanchard, S., Ferreira, K. B., Stearley, J., Shalf, J., & Gurumurthi, S. (2015). Memory Errors in Modern Systems. ACM SIGARCH Computer Architecture News, 43(1), 297-310. doi:10.1145/2786763.2694348Stathis, J. H. (2018). The physics of NBTI: What do we really know? 2018 IEEE International Reliability Physics Symposium (IRPS). doi:10.1109/irps.2018.8353539Stellner, G. (s. f.). CoCheck: checkpointing and process migration for MPI. Proceedings of International Conference on Parallel Processing. doi:10.1109/ipps.1996.508106Stone, J. E., Gohara, D., & Shi, G. (2010). OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. Computing in Science & Engineering, 12(3), 66-73. doi:10.1109/mcse.2010.69Subasi, O., Di, S., Bautista-Gomez, L., Balaprakash, P., Unsal, O., Labarta, J., … Cappello, F. (2018). Exploring the capabilities of support vector machines in detecting silent data corruptions. Sustainable Computing: Informatics and Systems, 19, 277-290. doi:10.1016/j.suscom.2018.01.004Tang, D., & Iyer, R. K. (1993). Dependability measurement and modeling of a multicomputer system. IEEE Transactions on Computers, 42(1), 62-75. doi:10.1109/12.192214D. Turnbull and N. Alldrin. 2003. Failure Prediction in Hardware Systems. Tech. rep. University of California San Diego CA. Retrieved from http://www.cs.ucsd.edu/ dturnbul/Papers/ServerPrediction.pdf. D. Turnbull and N. Alldrin. 2003. Failure Prediction in Hardware Systems. Tech. rep. University of California San Diego CA. Retrieved from http://www.cs.ucsd.edu/ dturnbul/Papers/ServerPrediction.pdf.Vilalta, R., Apte, C. V., Hellerstein, J. L., Ma, S., & Weiss, S. M. (2002). Predictive algorithms in the management of computer systems. IBM Systems Journal, 41(3), 461-474. doi:10.1147/sj.413.0461Vinoski, S. (2007). Reliability with Erlang. IEEE Internet Com

    Keeping checkpoint/restart viable for exascale systems

    Get PDF
    Next-generation exascale systems, those capable of performing a quintillion operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques to ensure progress across faults like checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches on a wide range of parameters. These results, which cover of number of high-performance computing capability workloads, different failure distributions, hardware mean time to failures, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms

    Achieving High Reliability and Efficiency in Maintaining Large-Scale Storage Systems through Optimal Resource Provisioning and Data Placement

    Get PDF
    With the explosive increase in the amount of data being generated by various applications, large-scale distributed and parallel storage systems have become common data storage solutions and been widely deployed and utilized in both industry and academia. While these high performance storage systems significantly accelerate the data storage and retrieval, they also bring some critical issues in system maintenance and management. In this dissertation, I propose three methodologies to address three of these critical issues. First, I develop an optimal resource management and spare provisioning model to minimize the impact brought by component failures and ensure a highly operational experience in maintaining large-scale storage systems. Second, in order to cost-effectively integrate solid-state drives (SSD) into large-scale storage systems, I design a holistic algorithm which can adaptively predict the popularity of data objects by leveraging temporal locality in their access pattern and adjust their placement among solid-state drives and regular hard disk drives so that the data access throughput as well as the storage space efficiency of the large-scale heterogeneous storage systems can be improved. Finally, I propose a new checkpoint placement optimization model which can maximize the computation efficiency of large-scale scientific applications while guarantee the endurance requirements of the SSD-based burst buffer in high performance hierarchical storage systems. All these models and algorithms are validated through extensive evaluation using data collected from deployed large-scale storage systems and the evaluation results demonstrate our models and algorithms can significantly improve the reliability and efficiency of large-scale distributed and parallel storage systems

    Resiliency in numerical algorithm design for extreme scale simulations

    Get PDF
    This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, that was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of involving an enormous amount of resources and costs. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 1023 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.Peer Reviewed"Article signat per 36 autors/es: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik G ̈oddeke, Marco Heisig, Fabienne Jezequel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Ortiz, Francesco Rizzi, Ulrich Rude, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thonnes, Andreas Wagner and Barbara Wohlmuth"Postprint (author's final draft

    Near-optimal scheduling and decision-making models for reactive and proactive fault tolerance mechanisms

    Get PDF
    As High Performance Computing (HPC) systems increase in size to fulfill computational power demand, the chance of failure occurrences dramatically increases, resulting in potentially large amounts of lost computing time. Fault Tolerance (FT) mechanisms aim to mitigate the impact of failure occurrences to the running applications. However, the overhead of FT mechanisms increases proportionally to the HPC systems\u27 size. Therefore, challenges arise in handling the expensive overhead of FT mechanisms while minimizing the large amount of lost computing time due to failure occurrences. In this dissertation, a near-optimal scheduling model is built to determine when to invoke a hybrid checkpoint mechanism, by means of stochastic processes and calculus of variations. The obtained schedule minimizes the waste time caused by checkpoint mechanism and failure occurrences. Generally, the checkpoint/restart mechanisms periodically save application states and load the saved state, upon failure occurrences. Furthermore, to handle various FT mechanisms, an adaptive decision-making model has been developed to determine the best FT strategy to invoke at each decision point. The best mechanism at each decision point is selected among considered FT mechanisms to globally minimize the total waste time for an application execution by means of a dynamic programming approach. In addition, the model is adaptive to deal with changes in failure rate over time
    • …
    corecore