14 research outputs found

    Asynchronous and Exact Forward Recovery for Detected Errors in Iterative Solvers

    Get PDF
    © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Current trends and projections show that faults in computer systems become increasingly common. Such errors may be detected, and possibly corrected transparently, e.g. by Error Correcting Codes (ECC). For a program to be fault-tolerant, it needs to also handle the Errors that are Detected and Uncorrected (DUE), such as an ECC encountering too many bit flips in a codeword. While correcting an error has an overhead in itself, it can also affect the progress of a program. The most generic technique, rolling back the program state to a previously taken checkpoint, sets back any progress done since then. Alternately, application specific techniques exist, such as restarting an iterative program with its latest iteration's values as initial guess.This manuscript is the journal extension of a previously published conference paper [25]. This work has been partially supported by the European Research Council under the European Union’s 7th FP, ERC Advanced Grant 321253, and by the Spanish Ministry of Science and Innovation under grant TIN2015-65316-P. L. Jaulmes has been partially supported by the Spanish Ministry of Education, Culture and Sports under grant FPU2013/06982. M. Moretó has been partially supported by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship JCI-2012-15047. M. Casas has been partially supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Co-fund programme of the Marie Curie Actions of the European Union’s 7th FP (contract 2013 BP B 00243). We would like to thank Nicolas Vidal for his contribution on using huge pages natively.Peer ReviewedPostprint (author's final draft

    Exploiting task-based programming models for resilience

    Get PDF
    Hardware errors become more common as silicon technologies shrink and become more vulnerable, especially in memory cells, which are the most exposed to errors. Permanent and intermittent faults are caused by manufacturing variability and circuits ageing. While these can be mitigated once they are identified, their continuous rate of appearance throughout the lifetime of memory devices will always cause unexpected errors. In addition, transient faults are caused by effects such as radiation or small voltage/frequency margins, and there is no efficient way to shield against these events. Other constraints related to the diminishing sizes of transistors, such as power consumption and memory latency have caused the microprocessor industry to turn to increasingly complex processor architectures. To solve the difficulties arising from programming such architectures, programming models have emerged that rely on runtime systems. These systems form a new intermediate layer on the hardware-software abstraction stack, that performs tasks such as distributing work across computing resources: processor cores, accelerators, etc. These runtime systems dispose of a lot of information, both from the hardware and the applications, and offer thus many possibilities for optimisations. This thesis proposes solutions to the increasing fault rates in memory, across multiple resilience disciplines, from algorithm-based fault tolerance to hardware error correcting codes, through OS reliability strategies. These solutions rely for their efficiency on the opportunities presented by runtime systems. The first contribution of this thesis is an algorithmic-based resilience technique, allowing to tolerate detected errors in memory. This technique allows to recover data that is lost by performing computations that rely on simple redundancy relations identified in the program. The recovery is demonstrated for a family of iterative solvers, the Krylov subspace methods, and evaluated for the conjugate gradient solver. The runtime can transparently overlap the recovery with the computations of the algorithm, which allows to mask the already low overheads of this technique. The second part of this thesis proposes a metric to characterise the impact of faults in memory, which outperforms state-of-the-art metrics in precision and assurances on the error rate. This metric reveals a key insight into data that is not relevant to the program, and we propose an OS-level strategy to ignore errors in such data, by delaying the reporting of detected errors. This allows to reduce failure rates of running programs, by ignoring errors that have no impact. The architectural-level contribution of this thesis is a dynamically adaptable Error Correcting Code (ECC) scheme, that can increase protection of memory regions where the impact of errors is highest. A runtime methodology is presented to estimate the fault rate at runtime using our metric, through performance monitoring tools of current commodity processors. Guiding the dynamic ECC scheme online using the methodology's vulnerability estimates allows to decrease error rates of programs at a fraction of the redundancy cost required for a uniformly stronger ECC. This provides a useful and wide range of trade-offs between redundancy and error rates. The work presented in this thesis demonstrates that runtime systems allow to make the most of redundancy stored in memory, to help tackle increasing error rates in DRAM. This exploited redundancy can be an inherent part of algorithms that allows to tolerate higher fault rates, or in the form of dead data stored in memory. Redundancy can also be added to a program, in the form of ECC. In all cases, the runtime allows to decrease failure rates efficiently, by diminishing recovery costs, identifying redundant data, or targeting critical data. It is thus a very valuable tool for the future computing systems, as it can perform optimisations across different layers of abstractions.Los errores en memoria se vuelven más comunes a medida que las tecnologías de silicio reducen su tamaño. La variabilidad de fabricación y el envejecimiento de los circuitos causan fallos permanentes e intermitentes. Aunque se pueden mitigar una vez identificados, su continua tasa de aparición siempre causa errores inesperados. Además, la memoria también sufre de fallos transitorios contra los cuales no se puede proteger eficientemente. Estos fallos están causados por efectos como la radiación o los reducidos márgenes de voltaje y frecuencia. Otras restricciones coetáneas, como el consumo de energía y la latencia de la memoria, obligaron a las arquitecturas de computadores a volverse cada vez más complejas. Para programar tales procesadores, se desarrollaron modelos de programación basados en entornos de ejecución. Estos sistemas forman una nueva abstracción entre hardware y software, realizando tareas como la distribución del trabajo entre recursos informáticos: núcleos de procesadores, aceleradores, etc. Estos entornos de ejecución disponen de mucha información tanto sobre el hardware como sobre las aplicaciones, y ofrecen así muchas posibilidades de optimización. Esta tesis propone soluciones a los fallos en memoria entre múltiples disciplinas de resiliencia, desde la tolerancia a fallos basada en algoritmos, hasta los códigos de corrección de errores en hardware, incluyendo estrategias de resiliencia del sistema operativo. La eficiencia de estas soluciones depende de las oportunidades que presentan los entornos de ejecución. La primera contribución de esta tesis es una técnica a nivel algorítmico que permite corregir fallos encontrados mientras el programa su ejecuta. Para corregir fallos se han identificado redundancias simples en los datos del programa para toda una clase de algoritmos, los métodos del subespacio de Krylov (gradiente conjugado, GMRES, etc). La estrategia de recuperación de datos desarrollada permite corregir errores sin tener que reinicializar el algoritmo, y aprovecha el modelo de programación para superponer las computaciones del algoritmo y de la recuperación de datos. La segunda parte de esta tesis propone una métrica para caracterizar el impacto de los fallos en la memoria. Esta métrica supera en precisión a las métricas de vanguardia y permite identificar datos que son menos relevantes para el programa. Se propone una estrategia a nivel del sistema operativo retrasando la notificación de los errores detectados, que permite ignorar fallos en estos datos y reducir la tasa de fracaso del programa. Por último, la contribución a nivel arquitectónico de esta tesis es un esquema de Código de Corrección de Errores (ECC por sus siglas en inglés) adaptable dinámicamente. Este esquema puede aumentar la protección de las regiones de memoria donde el impacto de los errores es mayor. Se presenta una metodología para estimar el riesgo de fallo en tiempo de ejecución utilizando nuestra métrica, a través de las herramientas de monitorización del rendimiento disponibles en los procesadores actuales. El esquema de ECC guiado dinámicamente con estas estimaciones de vulnerabilidad permite disminuir la tasa de fracaso de los programas a una fracción del coste de redundancia requerido para un ECC uniformemente más fuerte. El trabajo presentado en esta tesis demuestra que los entornos de ejecución permiten aprovechar al máximo la redundancia contenida en la memoria, para contener el aumento de los errores en ella. Esta redundancia explotada puede ser una parte inherente de los algoritmos que permite tolerar más fallos, en forma de datos inutilizados almacenados en la memoria, o agregada a la memoria de un programa en forma de ECC. En todos los casos, el entorno de ejecución permite disminuir los efectos de los fallos de manera eficiente, disminuyendo los costes de recuperación, identificando datos redundantes, o focalizando esfuerzos de protección en los datos críticos

    Predictive Reliability and Fault Management in Exascale Systems: State of the Art and Perspectives

    Get PDF
    © ACM, 2020. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Computing Surveys, Vol. 53, No. 5, Article 95. Publication date: September 2020. https://doi.org/10.1145/3403956[EN] Performance and power constraints come together with Complementary Metal Oxide Semiconductor technology scaling in future Exascale systems. Technology scaling makes each individual transistor more prone to faults and, due to the exponential increase in the number of devices per chip, to higher system fault rates. Consequently, High-performance Computing (HPC) systems need to integrate prediction, detection, and recovery mechanisms to cope with faults efficiently. This article reviews fault detection, fault prediction, and recovery techniques in HPC systems, from electronics to system level. We analyze their strengths and limitations. Finally, we identify the promising paths to meet the reliability levels of Exascale systems.This work has received funding from the European Union's Horizon 2020 (H2020) research and innovation program under the FET-HPC Grant Agreement No. 801137 (RECIPE). Jaume Abella was also partially supported by the Ministry of Economy and Competitiveness of Spain under Contract No. TIN2015-65316-P and under Ramon y Cajal Postdoctoral Fellowship No. RYC-2013-14717, as well as by the HiPEAC Network of Excellence. Ramon Canal is partially supported by the Generalitat de Catalunya under Contract No. 2017SGR0962.Canal, R.; Hernández Luz, C.; Tornero-Gavilá, R.; Cilardo, A.; Massari, G.; Reghenzani, F.; Fornaciari, W.... (2020). Predictive Reliability and Fault Management in Exascale Systems: State of the Art and Perspectives. ACM Computing Surveys. 53(5):1-32. https://doi.org/10.1145/3403956S132535Abella, J., Hernandez, C., Quinones, E., Cazorla, F. J., Conmy, P. R., Azkarate-askasua, M., … Vardanega, T. (2015). WCET analysis methods: Pitfalls and challenges on their trustworthiness. 10th IEEE International Symposium on Industrial Embedded Systems (SIES). doi:10.1109/sies.2015.7185039E. Agullo L. Giraud A. Guermouche J. Roman and M. Zounon. 2013. Towards resilient parallel linear Krylov solvers: Recover-restart strategies. INRIA Research Report RR-8324. E. Agullo L. Giraud A. Guermouche J. Roman and M. Zounon. 2013. Towards resilient parallel linear Krylov solvers: Recover-restart strategies. INRIA Research Report RR-8324.Agullo, E., Giraud, L., Salas, P., & Zounon, M. (2016). Interpolation-Restart Strategies for Resilient Eigensolvers. SIAM Journal on Scientific Computing, 38(5), C560-C583. doi:10.1137/15m1042115Al-Qawasmeh, A. M., Pasricha, S., Maciejewski, A. A., & Siegel, H. J. (2015). Power and Thermal-Aware Workload Allocation in Heterogeneous Data Centers. IEEE Transactions on Computers, 64(2), 477-491. doi:10.1109/tc.2013.116ARM. 2017. ARM Reliability Availability and Serviceability (RAS) Specification—ARMv8 for the ARMv8-A Architecture Profile. White paper. Retrieved from https://developer.arm.com/docs/ddi0587/latest. ARM. 2017. ARM Reliability Availability and Serviceability (RAS) Specification—ARMv8 for the ARMv8-A Architecture Profile. White paper. Retrieved from https://developer.arm.com/docs/ddi0587/latest.Avizienis, A., Laprie, J.-C., Randell, B., & Landwehr, C. (2004). Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11-33. doi:10.1109/tdsc.2004.2Bautista-Gomez, L., Zyulkyarov, F., Unsal, O., & McIntosh-Smith, S. (2016). Unprotected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer. SC16: International Conference for High Performance Computing, Networking, Storage and Analysis. doi:10.1109/sc.2016.54Berrocal, E., Bautista-Gomez, L., Di, S., Lan, Z., & Cappello, F. (2017). Toward General Software Level Silent Data Corruption Detection for Parallel Applications. IEEE Transactions on Parallel and Distributed Systems, 28(12), 3642-3655. doi:10.1109/tpds.2017.2735971M.-A. Breuer and A. D. Friedman. 1976. Diagnosis 8 Reliable Design of Digital Systems. Springer. M.-A. Breuer and A. D. Friedman. 1976. Diagnosis 8 Reliable Design of Digital Systems. Springer.P. Bridges K. Ferreira M. Heroux and M. Hoemmen. 2012. Fault-tolerant linear solvers via selective reliability. ArXiv e-prints June 2012. arXiv:1206.1390 [math.NA]. P. Bridges K. Ferreira M. Heroux and M. Hoemmen. 2012. Fault-tolerant linear solvers via selective reliability. ArXiv e-prints June 2012. arXiv:1206.1390 [math.NA].F. Cappello A. Geist W. Gropp S. Kale B. Kramer and M. Snir. 2014. Toward exascale resilience: 2014 update. Supercomput. Front. Innovat. 1 1 (2014). http://superfri.org/superfri/article/view/14. F. Cappello A. Geist W. Gropp S. Kale B. Kramer and M. Snir. 2014. Toward exascale resilience: 2014 update. Supercomput. Front. Innovat. 1 1 (2014). http://superfri.org/superfri/article/view/14.F. J. Cazorla L. Kosmidis E. Mezzetti C. Hernandez J. Abella and T. Vardanega. 2019. Probabilistic worst-case timing analysis: Taxonomy and comprehensive survey. ACM Comput. Surv. 52 1 Article 14 (Feb. 2019) 35 pages. DOI:https://doi.org/10.1145/3301283 F. J. Cazorla L. Kosmidis E. Mezzetti C. Hernandez J. Abella and T. Vardanega. 2019. Probabilistic worst-case timing analysis: Taxonomy and comprehensive survey. ACM Comput. Surv. 52 1 Article 14 (Feb. 2019) 35 pages. DOI:https://doi.org/10.1145/3301283Chan, C. S., Pan, B., Gross, K., Vaidyanathan, K., & Rosing, T. Š. (2014). Correcting vibration-induced performance degradation in enterprise servers. ACM SIGMETRICS Performance Evaluation Review, 41(3), 83-88. doi:10.1145/2567529.2567555Chantem, T., Hu, X. S., & Dick, R. P. (2011). Temperature-Aware Scheduling and Assignment for Hard Real-Time Applications on MPSoCs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 19(10), 1884-1897. doi:10.1109/tvlsi.2010.2058873Chen, M. Y., Kiciman, E., Fratkin, E., Fox, A., & Brewer, E. (s. f.). Pinpoint: problem determination in large, dynamic Internet services. Proceedings International Conference on Dependable Systems and Networks. doi:10.1109/dsn.2002.1029005Chen, Z. (2011). Algorithm-based recovery for iterative methods without checkpointing. Proceedings of the 20th international symposium on High performance distributed computing - HPDC ’11. doi:10.1145/1996130.1996142Chen, Z. (2013). Online-ABFT. Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP ’13. doi:10.1145/2442516.2442533Coskun, A. K., Rosing, T. S., Mihic, K., De Micheli, G., & Leblebici, Y. (2006). Analysis and Optimization of MPSoC Reliability. Journal of Low Power Electronics, 2(1), 56-69. doi:10.1166/jolpe.2006.007G. Da Costa A. Oleksiak W. Piatek J. Salom and L. Sisó. 2015. Minimization of costs and energy consumption in a data center by a workload-based capacity management. In Energy Efficient Data Centers S. Klingert M. Chinnici and M. Rey Porto (Eds.). Springer International Publishing Cham 102--119. G. Da Costa A. Oleksiak W. Piatek J. Salom and L. Sisó. 2015. Minimization of costs and energy consumption in a data center by a workload-based capacity management. In Energy Efficient Data Centers S. Klingert M. Chinnici and M. Rey Porto (Eds.). Springer International Publishing Cham 102--119.Cupertino, L., Da Costa, G., Oleksiak, A., Pia¸tek, W., Pierson, J.-M., Salom, J., … Zilio, T. (2015). Energy-efficient, thermal-aware modeling and simulation of data centers: The CoolEmAll approach and evaluation results. Ad Hoc Networks, 25, 535-553. doi:10.1016/j.adhoc.2014.11.002Dally, W. J. (1991). Express cubes: improving the performance of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 40(9), 1016-1023. doi:10.1109/12.83652Dauwe, D., Pasricha, S., Maciejewski, A. A., & Siegel, H. J. (2018). Resilience-Aware Resource Management for Exascale Computing Systems. IEEE Transactions on Sustainable Computing, 3(4), 332-345. doi:10.1109/tsusc.2018.2797890R. I. Davis and A. Burns. 2011. A survey of hard real-time scheduling for multiprocessor systems. ACM Comput. Surv. 43 4 Article 35 (Oct. 2011) 44 pages. DOI:https://doi.org/10.1145/1978802.1978814 R. I. Davis and A. Burns. 2011. A survey of hard real-time scheduling for multiprocessor systems. ACM Comput. Surv. 43 4 Article 35 (Oct. 2011) 44 pages. DOI:https://doi.org/10.1145/1978802.1978814Di, S., & Cappello, F. (2016). Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications. IEEE Transactions on Parallel and Distributed Systems, 27(10), 2809-2823. doi:10.1109/tpds.2016.2517639Di, S., Guo, H., Gupta, R., Pershey, E. R., Snir, M., & Cappello, F. (2019). Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System. IEEE Transactions on Parallel and Distributed Systems, 30(2), 361-374. doi:10.1109/tpds.2018.2864184Di, S., Robert, Y., Vivien, F., & Cappello, F. (2017). Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model. IEEE Transactions on Parallel and Distributed Systems, 28(1), 244-259. doi:10.1109/tpds.2016.2546248J. Dongarra T. Herault and Y. Robert. 2015. Fault Tolerance Techniques for High-Performance Computing. Springer. J. Dongarra T. Herault and Y. Robert. 2015. Fault Tolerance Techniques for High-Performance Computing. Springer.DOWNING, S., & SOCIE, D. (1982). Simple rainflow counting algorithms. International Journal of Fatigue, 4(1), 31-40. doi:10.1016/0142-1123(82)90018-4Eghbalkhah, B., Kamal, M., Afzali-Kusha, H., Afzali-Kusha, A., Ghaznavi-Ghoushchi, M. B., & Pedram, M. (2015). Workload and temperature dependent evaluation of BTI-induced lifetime degradation in digital circuits. Microelectronics Reliability, 55(8), 1152-1162. doi:10.1016/j.microrel.2015.06.004Gottscho, M., Shoaib, M., Govindan, S., Sharma, B., Wang, D., & Gupta, P. (2017). Measuring the Impact of Memory Errors on Application  Performance. IEEE Computer Architecture Letters, 16(1), 51-55. doi:10.1109/lca.2016.2599513Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., … Sengupta, S. (2011). VL2. Communications of the ACM, 54(3), 95-104. doi:10.1145/1897852.1897877Heroux, M. A., Bartlett, R. A., Howle, V. E., Hoekstra, R. J., Hu, J. J., Kolda, T. G., … Stanley, K. S. (2005). An overview of the Trilinos project. ACM Transactions on Mathematical Software, 31(3), 397-423. doi:10.1145/1089014.1089021Hoffmann, G. A., Trivedi, K. S., & Malek, M. (2007). A Best Practice Guide to Resource Forecasting for Computing Systems. IEEE Transactions on Reliability, 56(4), 615-628. doi:10.1109/tr.2007.909764Hsiao, M. Y., Carter, W. C., Thomas, J. W., & Stringfellow, W. R. (1981). Reliability, Availability, and Serviceability of IBM Computer Systems: A Quarter Century of Progress. IBM Journal of Research and Development, 25(5), 453-468. doi:10.1147/rd.255.0453Hughes, G. F., Murray, J. F., Kreutz-Delgado, K., & Elkan, C. (2002). Improved disk-drive failure warnings. IEEE Transactions on Reliability, 51(3), 350-357. doi:10.1109/tr.2002.802886S. Hukerikar and C. Engelmann. 2017. Resilience design patterns: A structured approach to resilience at extreme scale. Supercomput. Front. Innov. 4 3 (2017). DOI:https://doi.org/10.14529/jsfi170301 S. Hukerikar and C. Engelmann. 2017. Resilience design patterns: A structured approach to resilience at extreme scale. Supercomput. Front. Innov. 4 3 (2017). DOI:https://doi.org/10.14529/jsfi170301Hussain, H., Malik, S. U. R., Hameed, A., Khan, S. U., Bickler, G., Min-Allah, N., … Rayes, A. (2013). A survey on resource allocation in high performance distributed computing systems. Parallel Computing, 39(11), 709-736. doi:10.1016/j.parco.2013.09.009Intel Corporation. [n.d.]. Intel Xeon Processor E7 Family: Reliability Availability and Serviceability. White paper. https://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-family-ras-server-paper.html. Intel Corporation. [n.d.]. Intel Xeon Processor E7 Family: Reliability Availability and Serviceability. White paper. https://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-family-ras-server-paper.html.Jha, S., Formicola, V., Martino, C. D., Dalton, M., Kramer, W. T., Kalbarczyk, Z., & Iyer, R. K. (2018). Resiliency of HPC Interconnects: A Case Study of Interconnect Failures and Recovery in Blue Waters. IEEE Transactions on Dependable and Secure Computing, 15(6), 915-930. doi:10.1109/tdsc.2017.2737537Kiciman, E., & Fox, A. (2005). Detecting Application-Level Failures in Component-Based Internet Services. IEEE Transactions on Neural Networks, 16(5), 1027-1041. doi:10.1109/tnn.2005.853411Kim, T., Sun, Z., Cook, C., Zhao, H., Li, R., Wong, D., & Tan, S. X.-D. (2016). Invited - Cross-layer modeling and optimization for electromigration induced reliability. Proceedings of the 53rd Annual Design Automation Conference. doi:10.1145/2897937.2905010Kurowski, K., Oleksiak, A., Piątek, W., Piontek, T., Przybyszewski, A., & Węglarz, J. (2013). DCworms – A tool for simulation of energy efficiency in distributed computing infrastructures. Simulation Modelling Practice and Theory, 39, 135-151. doi:10.1016/j.simpat.2013.08.007Langou, J., Chen, Z., Bosilca, G., & Dongarra, J. (2008). Recovery Patterns for Iterative Methods in a Parallel Unstable Environment. SIAM Journal on Scientific Computing, 30(1), 102-116. doi:10.1137/040620394J. C. Laprie (Ed.). 1995. Dependability—Its Attributes Impairments and Means. Springer-Verlag Berlin. J. C. Laprie (Ed.). 1995. Dependability—Its Attributes Impairments and Means. Springer-Verlag Berlin.Laprie, J.-C. (s. f.). DEPENDABLE COMPUTING AND FAULT TOLERANCE : CONCEPTS AND TERMINOLOGY. Twenty-Fifth International Symposium on Fault-Tolerant Computing, 1995, ’ Highlights from Twenty-Five Years’. doi:10.1109/ftcsh.1995.532603Lasance, C. J. M. (2003). Thermally driven reliability issues in microelectronic systems: status-quo and challenges. Microelectronics Reliability, 43(12), 1969-1974. doi:10.1016/s0026-2714(03)00183-5Yinglung Liang, Yanyong Zhang, Sivasubramaniam, A., Jette, M., & Sahoo, R. (s. f.). BlueGene/L Failure Analysis and Prediction Models. International Conference on Dependable Systems and Networks (DSN’06). doi:10.1109/dsn.2006.18Lin, T.-T. Y., & Siewiorek, D. P. (1990). Error log analysis: statistical modeling and heuristic trend analysis. IEEE Transactions on Reliability, 39(4), 419-432. doi:10.1109/24.58720Losada, N., González, P., Martín, M. J., Bosilca, G., Bouteiller, A., & Teranishi, K. (2020). Fault tolerance of MPI applications in exascale systems: The ULFM solution. Future Generation Computer Systems, 106, 467-481. doi:10.1016/j.future.2020.01.026Lyons, R. E., & Vanderkulk, W. (1962). The Use of Triple-Modular Redundancy to Improve Computer Reliability. IBM Journal of Research and Development, 6(2), 200-209. doi:10.1147/rd.62.0200M. Médard and S. S. Lumetta. 2003. Network Reliability and Fault Tolerance. American Cancer Society. Retrieved from arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/0471219282.eot281. M. Médard and S. S. Lumetta. 2003. Network Reliability and Fault Tolerance. American Cancer Society. Retrieved from arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/0471219282.eot281.Moody, A., Bronevetsky, G., Mohror, K., & de Supinski, B. (2010). Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System. doi:10.2172/984082Moor Insights 8 Strategy. 2017. AMD EPYC Brings New RAS Capability. White paper. Retrieved from https://www.amd.com/system/files/2017-06/AMD-EPYC-Brings-New-RAS-Capability.pdf. Moor Insights 8 Strategy. 2017. AMD EPYC Brings New RAS Capability. White paper. Retrieved from https://www.amd.com/system/files/2017-06/AMD-EPYC-Brings-New-RAS-Capability.pdf.Mulas, F., Atienza, D., Acquaviva, A., Carta, S., Benini, L., & De Micheli, G. (2009). Thermal Balancing Policy for Multiprocessor Stream Computing Platforms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 28(12), 1870-1882. doi:10.1109/tcad.2009.2032372Oleksiak, A., Kierzynka, M., Piatek, W., Agosta, G., Barenghi, A., Brandolese, C., … Janssen, U. (2017). M2DC – Modular Microserver DataCentre with heterogeneous hardware. Microprocessors and Microsystems, 52, 117-130. doi:10.1016/j.micpro.2017.05.019Oxley, M. A., Jonardi, E., Pasricha, S., Maciejewski, A. A., Siegel, H. J., Burns, P. J., & Koenig, G. A. (2018). Rate-based thermal, power, and co-location aware resource management for heterogeneous data centers. Journal of Parallel and Distributed Computing, 112, 126-139. doi:10.1016/j.jpdc.2017.04.015K. O’brien I. Pietri R. Reddy A. Lastovetsky and R. Sakellariou. 2017. A survey of power and energy predictive models in HPC systems and applications. ACM Comput. Surv. 50 3 Article 37 (June 2017) 38 pages. DOI:https://doi.org/10.1145/3078811 K. O’brien I. Pietri R. Reddy A. Lastovetsky and R. Sakellariou. 2017. A survey of power and energy predictive models in HPC systems and applications. ACM Comput. Surv. 50 3 Article 37 (June 2017) 38 pages. DOI:https://doi.org/10.1145/3078811Park, S.-M., & Humphrey, M. (2011). Predictable High-Performance Computing Using Feedback Control and Admission Control. IEEE Transactions on Parallel and Distributed Systems, 22(3), 396-411. doi:10.1109/tpds.2010.100Pfefferman, J. D., & Cernuschi-Frias, B. (2002). A nonparametric nonstationary procedure for failure prediction. IEEE Transactions on Reliability, 51(4), 434-442. doi:10.1109/tr.2002.804733Rangan, K. K., Wei, G.-Y., & Brooks, D. (2009). Thread motion. ACM SIGARCH Computer Architecture News, 37(3), 302-313. doi:10.1145/1555815.1555793Paolo Rech. [n.d.]. Reliability Issues in Current and Future Supercomputers. Retrieved from http://energysfe.ufsc.br/slides/Paolo-Rech-260917.pdf. Paolo Rech. [n.d.]. Reliability Issues in Current and Future Supercomputers. Retrieved from http://energysfe.ufsc.br/slides/Paolo-Rech-260917.pdf.F. Reghenzani G. Massari and W. Fornaciari. 2019. The real-time Linux kernel: A survey on PREEMPT_RT. Comput. Surveys 52 1 Article 18 (Feb. 2019) 36 pages. DOI:https://doi.org/10.1145/3297714 F. Reghenzani G. Massari and W. Fornaciari. 2019. The real-time Linux kernel: A survey on PREEMPT_RT. Comput. Surveys 52 1 Article 18 (Feb. 2019) 36 pages. DOI:https://doi.org/10.1145/3297714F. Salfner M. Lenk and M. Malek. 2010. A survey of online failure prediction methods. ACM Comput. Surv. 42 3 Article 10 (March 2010) 42 pages. DOI:https://doi.org/10.1145/1670679.1670680 F. Salfner M. Lenk and M. Malek. 2010. A survey of online failure prediction methods. ACM Comput. Surv. 42 3 Article 10 (March 2010) 42 pages. DOI:https://doi.org/10.1145/1670679.1670680Salfner, F., Schieschke, M., & Malek, M. (2006). Predicting failures of computer systems: a case study for a telecommunication system. Proceedings 20th IEEE International Parallel & Distributed Processing Symposium. doi:10.1109/ipdps.2006.1639672Shi, L., Chen, H., Sun, J., & Li, K. (2012). vCUDA: GPU-Accelerated High-Performance Computing in Virtual Machines. IEEE Transactions on Computers, 61(6), 804-816. doi:10.1109/tc.2011.112D. P. Siewiorek and R. S. Swarz. 1998. Reliable Computer Systems 3rd ed. A. K. Peters Ltd. D. P. Siewiorek and R. S. Swarz. 1998. Reliable Computer Systems 3rd ed. A. K. Peters Ltd.Singh, S., & Chana, I. (2016). A Survey on Resource Scheduling in Cloud Computing: Issues and Challenges. Journal of Grid Computing, 14(2), 217-264. doi:10.1007/s10723-015-9359-2Slegel, T. J., Averill, R. M., Check, M. A., Giamei, B. C., Krumm, B. W., Krygowski, C. A., … Webb, C. F. (1999). IBM’s S/390 G5 microprocessor design. IEEE Micro, 19(2), 12-23. doi:10.1109/40.755464Sridhar, A., Sabry, M. M., & Atienza, D. (2014). A Semi-Analytical Thermal Modeling Framework for Liquid-Cooled ICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 33(8), 1145-1158. doi:10.1109/tcad.2014.2323194Sridharan, V., DeBardeleben, N., Blanchard, S., Ferreira, K. B., Stearley, J., Shalf, J., & Gurumurthi, S. (2015). Memory Errors in Modern Systems. ACM SIGARCH Computer Architecture News, 43(1), 297-310. doi:10.1145/2786763.2694348Stathis, J. H. (2018). The physics of NBTI: What do we really know? 2018 IEEE International Reliability Physics Symposium (IRPS). doi:10.1109/irps.2018.8353539Stellner, G. (s. f.). CoCheck: checkpointing and process migration for MPI. Proceedings of International Conference on Parallel Processing. doi:10.1109/ipps.1996.508106Stone, J. E., Gohara, D., & Shi, G. (2010). OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. Computing in Science & Engineering, 12(3), 66-73. doi:10.1109/mcse.2010.69Subasi, O., Di, S., Bautista-Gomez, L., Balaprakash, P., Unsal, O., Labarta, J., … Cappello, F. (2018). Exploring the capabilities of support vector machines in detecting silent data corruptions. Sustainable Computing: Informatics and Systems, 19, 277-290. doi:10.1016/j.suscom.2018.01.004Tang, D., & Iyer, R. K. (1993). Dependability measurement and modeling of a multicomputer system. IEEE Transactions on Computers, 42(1), 62-75. doi:10.1109/12.192214D. Turnbull and N. Alldrin. 2003. Failure Prediction in Hardware Systems. Tech. rep. University of California San Diego CA. Retrieved from http://www.cs.ucsd.edu/ dturnbul/Papers/ServerPrediction.pdf. D. Turnbull and N. Alldrin. 2003. Failure Prediction in Hardware Systems. Tech. rep. University of California San Diego CA. Retrieved from http://www.cs.ucsd.edu/ dturnbul/Papers/ServerPrediction.pdf.Vilalta, R., Apte, C. V., Hellerstein, J. L., Ma, S., & Weiss, S. M. (2002). Predictive algorithms in the management of computer systems. IBM Systems Journal, 41(3), 461-474. doi:10.1147/sj.413.0461Vinoski, S. (2007). Reliability with Erlang. IEEE Internet Com
    corecore