
    Memory systems for high-performance computing: the capacity and reliability implications

    Memory systems are significant contributors to the overall power requirements, energy consumption, and the operational cost of large high-performance computing (HPC) systems. Limitations of main memory systems in terms of latency, bandwidth and capacity can significantly affect the performance of HPC applications and can have a strong negative impact on system scalability. In addition, errors in the main memory system can have a strong impact on the reliability, availability and serviceability of large-scale clusters. This thesis studies capacity and reliability issues in modern memory systems for high-performance computing. The choice of main memory capacity is an important aspect of high-performance computing memory system design. This choice becomes increasingly important now that 3D-stacked memories are entering the market. Compared with conventional DIMMs, 3D memory chiplets provide better performance and energy efficiency but lower memory capacities. Therefore, the adoption of 3D-stacked memories in the HPC domain depends on whether we can find use cases that require much less memory than is available now. We analyze the memory capacity requirements of important HPC benchmarks and applications. The study identifies the HPC applications and use cases with memory footprints that could be provided by 3D-stacked memory chiplets, making a first step towards the adoption of this novel technology in the HPC domain. For HPC domains where large memory capacities are required, we propose scaling-in of HPC applications to reduce energy consumption and the running time of a batch of jobs. We also propose upgrading the per-node memory capacity, which enables a greater degree of scaling-in and additional energy savings. The memory system is one of the main causes of hardware failures. In each generation, the DRAM chip density and the amount of memory in systems increase, while the DRAM technology process is constantly shrinking. We can therefore expect DRAM failures to have a serious impact on the reliability of future systems. This thesis studies DRAM errors observed on a production HPC system during a period of two years. We clearly distinguish between two different approaches to DRAM error analysis: categorical analysis and the analysis of error rates. The first approach compares errors at the DIMM level and partitions the DIMMs into various categories, e.g., based on whether they did or did not experience an error. The second approach is to analyze error rates, i.e., to present the total number of errors relative to other statistics, typically the number of MB-hours or the duration of the observation period. We show that although DRAM error analysis may be performed with both approaches, they are not interchangeable and can lead to completely different conclusions. We further demonstrate the importance of providing statistical significance and presenting results that have practical value and real-life use. We show that various widely accepted approaches to DRAM error analysis may provide data that appear to support an interesting conclusion but are not statistically significant, meaning that they could merely be the result of chance. We hope the study of methods for DRAM error analysis presented in this thesis will become a standard for any future analysis of DRAM errors in the field.
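    The contrast between the two approaches can be made concrete with a small sketch. The per-DIMM error counts and exposure below are invented for illustration only, not the thesis data; they simply show how a categorical view and a rate view of the same log can disagree.

```python
# Illustrative sketch: two ways to analyze the same synthetic DRAM error log.
# All numbers are invented; they only show how the two approaches can diverge.

# Synthetic per-DIMM corrected-error counts for two hypothetical suppliers.
errors_a = [0, 0, 0, 1, 2, 0, 0, 3, 0, 0]   # supplier A: few affected DIMMs
errors_b = [0, 1, 1, 1, 0, 1, 1, 0, 1, 0]   # supplier B: many affected DIMMs
MB_HOURS_PER_DIMM = 16_000 * 17_520         # 16 GB DIMM observed for ~2 years

def categorical(errors):
    """Fraction of DIMMs that experienced at least one error."""
    return sum(e > 0 for e in errors) / len(errors)

def rate(errors):
    """Errors per MB-hour, aggregated over all DIMMs."""
    return sum(errors) / (len(errors) * MB_HOURS_PER_DIMM)

for name, errs in (("A", errors_a), ("B", errors_b)):
    print(f"supplier {name}: {categorical(errs):.0%} of DIMMs affected, "
          f"{rate(errs):.2e} errors per MB-hour")

# Supplier A looks better in the categorical view (fewer affected DIMMs), yet
# both suppliers have the same total error count, so their error rates are
# identical -- the two approaches are not interchangeable.
```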

    Main memory in HPC: do we need more or could we live with less?

    This study analyzes the memory capacity requirements of important HPC benchmarks and applications. We find that the High-Performance Conjugate Gradients benchmark could be an important success story for 3D-stacked memories in HPC, but High-Performance Linpack is likely to be constrained by 3D memory capacity.

    Main memory in HPC: do we need more, or could we live with less?

    An important aspect of High-Performance Computing (HPC) system design is the choice of main memory capacity. This choice becomes increasingly important now that 3D-stacked memories are entering the market. Compared with conventional Dual In-line Memory Modules (DIMMs), 3D memory chiplets provide better performance and energy efficiency but lower memory capacities. Therefore, the adoption of 3D-stacked memories in the HPC domain depends on whether we can find use cases that require much less memory than is available now. This study analyzes the memory capacity requirements of important HPC benchmarks and applications. We find that the High-Performance Conjugate Gradients (HPCG) benchmark could be an important success story for 3D-stacked memories in HPC, but High-Performance Linpack (HPL) is likely to be constrained by 3D memory capacity. The study also emphasizes that the analysis of memory footprints of production HPC applications is complex and that it requires an understanding of application scalability and target category, i.e., whether the users target capability or capacity computing. The results show that most of the HPC applications under study have per-core memory footprints in the range of hundreds of megabytes, but we also detect applications and use cases that require gigabytes per core. Overall, the study identifies the HPC applications and use cases with memory footprints that could be provided by 3D-stacked memory chiplets, making a first step toward adoption of this novel technology in the HPC domain. This work was supported by the Collaboration Agreement between Samsung Electronics Co., Ltd. and BSC, Spanish Government through Severo Ochoa programme (SEV-2015-0493), by the Spanish Ministry of Science and Technology through TIN2015-65316-P project and by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). This work has also received funding from the European Union’s Horizon 2020 research and innovation programme under ExaNoDe project (grant agreement No 671578). Darko Zivanovic holds the Severo Ochoa grant (SVP-2014-068501) of the Ministry of Economy and Competitiveness of Spain. The authors thank Harald Servat from BSC and Vladimir Marjanović from High Performance Computing Center Stuttgart for their technical support.
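    As a rough illustration of the footprint argument, the sketch below checks whether an assumed per-core footprint fits within an assumed 3D-stacked node capacity. The 16 GB capacity, core count and footprints are placeholders, not figures from the paper.

```python
# Back-of-the-envelope check: does an application's memory footprint fit in a
# node whose main memory is a 3D-stacked chiplet? Capacity, core count and
# per-core footprints are assumed values for illustration only.

NODE_3D_CAPACITY_GB = 16      # assumed per-node 3D-stacked memory capacity
CORES_PER_NODE = 16           # assumed cores per node

workloads = {
    "hundreds_of_MB_per_core": 0.3,   # GB per core
    "gigabytes_per_core": 2.0,        # GB per core
}

for name, gb_per_core in workloads.items():
    required_gb = gb_per_core * CORES_PER_NODE
    verdict = "fits in" if required_gb <= NODE_3D_CAPACITY_GB else "exceeds"
    print(f"{name}: needs {required_gb:.1f} GB per node, "
          f"{verdict} the {NODE_3D_CAPACITY_GB} GB of 3D-stacked memory")
```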

    DRAM errors in the field: a statistical approach

    This paper summarizes our two-year study of corrected and uncorrected errors on the MareNostrum 3 supercomputer, covering 2000 billion MB-hours of DRAM in the field. The study analyzes 4.5 million corrected and 71 uncorrected DRAM errors, and it compares the reliability of DIMMs from all three major memory manufacturers, built in three different technologies. Our work has two sets of contributions. First, we illustrate the complexity of in-field DRAM error analysis and demonstrate the limitations of various widely-used methods and metrics. For example, we show that average error rates, errors per MB-hour and mean time between failures can provide volatile and unreliable results even after long periods of error logging, leading to incorrect conclusions about DRAM reliability. Second, we present formal statistical methods that overcome many of the limitations of the current approaches. The methods that we present are simple to understand and implement, reliable and widely accepted in the statistical community. Overall, our study alerts the community about the need to, firstly, question the current practice in quantifying DRAM reliability and, secondly, select a proper analysis approach for future studies. Our strong recommendations are to focus on metrics with a practical value that could be easily related to system reliability, and to select methods that provide stable results, ideally supported with statistical significance. This work was supported by the Collaboration Agreement between Samsung Electronics Co., Ltd. and BSC, Spanish Government through Severo Ochoa programme (SEV-2015-0493), by the Spanish Ministry of Science and Technology through TIN2015-65316-P project and by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). This work has also received funding from the European Union’s Horizon 2020 research and innovation programme under EuroEXA project (grant agreement No 754337). Darko Zivanovic holds the Severo Ochoa grant (SVP-2014-068501) of the Ministry of Economy and Competitiveness of Spain.
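    A minimal sketch of the kind of reporting the paper argues for: quoting an error rate together with a measure of statistical uncertainty rather than a bare average. The error count and exposure below are assumed values, not the MareNostrum 3 measurements.

```python
# Sketch: report errors per MB-hour with a 95% confidence interval instead of
# a bare rate. The count and exposure are assumed values for illustration.
import math

errors = 450          # assumed number of observed errors
mb_hours = 2.0e11     # assumed exposure in MB-hours

rate = errors / mb_hours
# Normal approximation to the Poisson confidence interval for the count,
# scaled by the exposure; adequate when the count is reasonably large.
half_width = 1.96 * math.sqrt(errors) / mb_hours
print(f"rate = {rate:.3e} errors/MB-hour "
      f"(95% CI {rate - half_width:.3e} to {rate + half_width:.3e})")
```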


    Large-Memory Nodes for Energy Efficient High-Performance Computing

    Energy consumption is by far the most important contributor to HPC cluster operational costs, and it accounts for a significant share of the total cost of ownership. Advanced energy-saving techniques in HPC components have received significant research and development effort, but a simple measure that can dramatically reduce energy consumption is often overlooked. We show that, in capacity computing, where many small to medium-sized jobs have to be solved at the lowest cost, a practical energy-saving approach is to scale-in the application on large-memory nodes. We evaluate scaling-in, i.e., decreasing the number of application processes and compute nodes (servers) to solve a fixed-sized problem, using a set of HPC applications running in a production system. Using standard-memory nodes, we obtain average energy savings of 36%, already a huge figure. We show that the main source of these energy savings is a decrease in the node-hours (node_hours = #nodes x exe_time), which is a consequence of the more efficient use of hardware resources. Scaling-in is limited by the per-node memory capacity. We therefore consider using large-memory nodes to enable a greater degree of scaling-in. We show that the additional energy savings, of up to 52%, mean that in many cases the investment in upgrading the hardware would be recovered in a typical system lifetime of less than five years.
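    The node-hours argument can be illustrated with a short sketch; the node counts, run times and the 350 W per-node power draw are assumptions chosen for illustration, not measurements from the paper.

```python
# Sketch of the scaling-in trade-off: fewer nodes run longer, but if the
# node-hours drop, so does the energy. All inputs are assumed values.

def node_hours(nodes, exe_time_h):
    # node_hours = #nodes x exe_time
    return nodes * exe_time_h

def energy_kwh(nodes, exe_time_h, node_power_w=350.0):
    # Assumes a constant per-node power draw for simplicity.
    return node_hours(nodes, exe_time_h) * node_power_w / 1000.0

scaled_out = energy_kwh(nodes=16, exe_time_h=1.0)  # many nodes, short run
scaled_in = energy_kwh(nodes=4, exe_time_h=3.0)    # fewer nodes, longer run

saving = 1.0 - scaled_in / scaled_out
print(f"scaled-out: {scaled_out:.1f} kWh, scaled-in: {scaled_in:.1f} kWh, "
      f"energy saving: {saving:.0%}")
```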

    Cost-aware prediction of uncorrected DRAM errors in the field

    This paper presents and evaluates a method to predict DRAM uncorrected errors, a leading cause of hardware failures in large-scale HPC clusters. The method uses a random forest classifier, which was trained and evaluated using error logs from two years of production of the MareNostrum 3 supercomputer. By enabling the system to take measures to mitigate node failures, our method reduces lost compute time by up to 57%, a net saving of 21,000 node-hours per year. We release all source code as open source. We also discuss and clarify aspects of methodology that are essential for a DRAM prediction method to be useful in practice. We explain why standard evaluation metrics, such as precision and recall, are insufficient, and base the evaluation on a cost–benefit analysis. This methodology can help ensure that any DRAM error predictor is free from training bias and has a clear cost–benefit calculation. This work was supported by the Spanish Ministry of Science and Technology (project PID2019-107255GB), Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and the European Union’s Horizon 2020 research and innovation programme and EuroEXA project (grant agreement No 754337). Paul Carpenter and Marc Casas hold the Ramon y Cajal fellowship under contracts RYC2018-025628-I and RYC2017-23269, respectively, of the Ministry of Economy and Competitiveness of Spain.
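    A minimal sketch of a cost-benefit evaluation in the spirit of the paper's argument that precision and recall alone are insufficient. All costs, counts and the mitigation-effectiveness factor are assumed for illustration and are not the paper's figures.

```python
# Sketch: value a predictor by net node-hours saved, not by precision/recall.
# Every prediction (true or false positive) incurs a mitigation cost; only
# correctly predicted failures yield a benefit. All numbers are assumed.

def net_saving_node_hours(true_pos, false_pos,
                          cost_per_failure=100.0,    # node-hours lost per unmitigated failure
                          cost_per_mitigation=5.0,   # node-hours spent acting on one prediction
                          mitigation_effectiveness=0.8):
    benefit = true_pos * mitigation_effectiveness * cost_per_failure
    cost = (true_pos + false_pos) * cost_per_mitigation
    return benefit - cost

# Two hypothetical predictors that catch the same failures (equal recall) but
# differ in precision: only the first one is actually worth deploying.
print(net_saving_node_hours(true_pos=40, false_pos=20))     # precise predictor
print(net_saving_node_hours(true_pos=40, false_pos=2000))   # noisy predictor
```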

    Mainstream vs. emerging HPC: metrics, trade-offs and lessons learned

    ©2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Various servers with different characteristics and architectures are hitting the market, and their evaluation and comparison in terms of HPC features is complex and multidimensional. In this paper, we share our experience of evaluating a diverse set of HPC systems, consisting of three mainstream and five emerging architectures. We evaluate performance and power efficiency using prominent HPC benchmarks, High-Performance Linpack (HPL) and High-Performance Conjugate Gradients (HPCG), and expand our analysis using publicly available specialized kernel benchmarks targeting specific system components. In addition to a large body of quantitative results, we emphasize six usually overlooked aspects of HPC platform evaluation, and share our conclusions and lessons learned. Overall, we believe that this paper will improve the evaluation and comparison of HPC platforms, making a first step towards a more reliable and uniform methodology. This work was supported by the Spanish Ministry of Science and Technology (project TIN2015-65316-P), Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), Severo Ochoa Programme (SEV-2015-0493) of the Spanish Government, and the European Union’s Horizon 2020 research and innovation programme under ExaNoDe project (grant agreement No 671578).
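    As a small illustration of the kind of efficiency metric involved in such comparisons, the sketch below computes sustained performance per watt from assumed HPL and HPCG measurements; the platform names and numbers are placeholders, not the paper's results.

```python
# Sketch: performance-per-watt comparison across platforms. The platform
# names and the HPL/HPCG/power figures are invented placeholders.

measurements = {
    # platform: (HPL GFLOP/s, HPCG GFLOP/s, average power in W)
    "mainstream_x86": (950.0, 25.0, 380.0),
    "emerging_arch": (600.0, 18.0, 160.0),
}

for platform, (hpl, hpcg, power) in measurements.items():
    print(f"{platform}: HPL {hpl / power:.2f} GFLOP/s/W, "
          f"HPCG {hpcg / power:.3f} GFLOP/s/W")
```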