Achieving High Reliability and Efficiency in Maintaining Large-Scale Storage Systems through Optimal Resource Provisioning and Data Placement
With the explosive growth in the amount of data generated by various applications, large-scale distributed and parallel storage systems have become common data storage solutions and have been widely deployed in both industry and academia. While these high-performance storage systems significantly accelerate data storage and retrieval, they also raise critical issues in system maintenance and management. In this dissertation, I propose three methodologies to address three of these critical issues.
First, I develop an optimal resource management and spare provisioning model to minimize the impact of component failures and keep large-scale storage systems highly operational. Second, to cost-effectively integrate solid-state drives (SSDs) into large-scale storage systems, I design a holistic algorithm that adaptively predicts the popularity of data objects by leveraging temporal locality in their access patterns and adjusts their placement among solid-state drives and regular hard disk drives, improving both the data access throughput and the storage space efficiency of large-scale heterogeneous storage systems. Finally, I propose a new checkpoint placement optimization model that maximizes the computation efficiency of large-scale scientific applications while guaranteeing the endurance requirements of the SSD-based burst buffer in high-performance hierarchical storage systems. All of these models and algorithms are validated through extensive evaluation using data collected from deployed large-scale storage systems, and the evaluation results demonstrate that they can significantly improve the reliability and efficiency of large-scale distributed and parallel storage systems.
GPGPU Reliability Analysis: From Applications to Large Scale Systems
Over the past decade, GPUs have become an integral part of mainstream high-performance computing (HPC) facilities. Since applications running on HPC systems are usually long-running, any error or failure could result in a significant loss of scientific productivity and system resources. Even worse, since HPC systems face severe resilience challenges as they progress toward exascale computing, it is imperative to develop a better understanding of the reliability of GPUs. This dissertation fills this gap by providing an understanding of the effects of soft errors on the entire system and on specific applications. To understand system-level reliability, a large-scale field study of GPU soft errors is conducted. The occurrences of GPU soft errors are linked to several temporal and spatial features, such as specific workloads, node location, temperature, and power consumption. Further, machine learning models are proposed to predict error occurrences on GPU nodes, so that costly error protection mechanisms can be turned on or off proactively and dynamically based on the prediction results. To understand the effects of soft errors at the application level, an effective fault-injection framework is designed to characterize the reliability and resilience of GPGPU applications. This framework reduces the tremendous number of fault-injection locations to a manageable size while still preserving high accuracy. It is validated with both single-bit and multi-bit fault models for various GPGPU benchmarks. Lastly, taking advantage of the proposed fault-injection framework, this dissertation develops a hierarchical approach to understanding the error resilience characteristics of GPGPU applications at the kernel, CTA, and warp levels.
In addition, given that some application outputs corrupted by soft errors may be acceptable, we present a use case showing how to enable low-overhead yet reliable GPU computing for GPGPU applications.
High-Throughput Computing on High-Performance Platforms: A Case Study
The computing systems used by LHC experiments have historically consisted of
the federation of hundreds to thousands of distributed resources, ranging from
small to mid-size resources. In spite of the impressive scale of the existing
distributed computing solutions, the federation of small to mid-size resources
will be insufficient to meet projected future demands. This paper is a case
study of how the ATLAS experiment has embraced Titan, a DOE leadership
facility, in conjunction with traditional distributed high-throughput computing
to reach sustained production scales of approximately 52M core-hours a year.
The three main contributions of this paper are: (i) a critical evaluation of
design and operational considerations to support the sustained, scalable and
production usage of Titan; (ii) a preliminary characterization of a next
generation executor for PanDA to support new workloads and advanced execution
modes; and (iii) early lessons for how current and future experimental and
observational systems can be integrated with production supercomputers and
other platforms in a general and extensible manner.
A Spatially Correlated Competing Risks Time-to-Event Model for Supercomputer GPU Failure Data
Graphics processing units (GPUs) are widely used in many high-performance
computing (HPC) applications such as imaging/video processing and training
deep-learning models in artificial intelligence. GPUs installed in HPC systems
are often heavily used, and GPU failures occur during HPC system operations.
Thus, the reliability of GPUs is of interest for the overall reliability of HPC
systems. The Cray XK7 Titan supercomputer was one of the top ten supercomputers
in the world. The failure event times of more than 30,000 GPUs in Titan were
recorded and previous data analysis suggested that the failure time of a GPU
may be affected by the GPU's connectivity location inside the supercomputer
among other factors. In this paper, we conduct in-depth statistical modeling of
GPU failure times to study the effect of location on GPU failures under
competing risks with covariates and spatially correlated random effects. In
particular, two major failure types of GPUs in Titan are considered. The
connectivity locations of cabinets are modeled as spatially correlated random
effects, and the positions of GPUs inside each cabinet are treated as
covariates. A Bayesian framework is used for statistical inference. We also
compare different estimation methods, such as maximum likelihood, which
is implemented via an expectation-maximization algorithm. Our results provide
interesting insights into GPU failures in HPC systems. Comment: 45 pages, 25 figures
Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication
This paper presents FT-GAIA, a software-based fault-tolerant parallel and
distributed simulation middleware. FT-GAIA has been designed to reliably
handle Parallel And Distributed Simulation (PADS) models, which are needed to
properly simulate and analyze complex systems arising in any kind of scientific
or engineering field. PADS takes advantage of multiple execution units running on
multicore processors, clusters of workstations, or HPC systems. However, large
computing systems, such as HPC systems that include hundreds of thousands of
computing nodes, have to handle frequent failures of some components. To cope
with this issue, FT-GAIA transparently replicates simulation entities and
distributes them on multiple execution nodes. This allows the simulation to
tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some
protection against Byzantine failures, since interaction messages among the
simulated entities are replicated as well, so that the receiving entity can
identify and discard corrupted messages. Results from an analytical model and
from an experimental evaluation show that FT-GAIA provides a high degree of
fault tolerance, at the cost of a moderate increase in the computational load
of the execution units. Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731
Analyzing Resource Utilization in an HPC System: A Case Study of NERSC Perlmutter
Resource demands of HPC applications vary significantly. However, it is
common for HPC systems to primarily assign resources on a per-node basis to
prevent interference from co-located workloads. This gap between the
coarse-grained resource allocation and the varying resource demands can lead to
HPC resources not being fully utilized. In this study, we analyze the resource
usage and application behavior of NERSC's Perlmutter, a state-of-the-art
open-science HPC system with both CPU-only and GPU-accelerated nodes. Our
one-month usage analysis reveals that CPUs are commonly not fully utilized,
especially for GPU-enabled jobs. Also, around 64% of both CPU and GPU-enabled
jobs used 50% or less of the available host memory capacity. Additionally,
about 50% of GPU-enabled jobs used up to 25% of the GPU memory, and the memory
capacity was not fully utilized in some ways for all jobs. While our study
comes early in Perlmutter's lifetime, so policies and application workloads may
change, it provides valuable insights into performance characterization and
application behavior, and motivates systems with more fine-grained resource
allocation.
Power Bounded Computing on Current & Emerging HPC Systems
Power has become a critical constraint for the evolution of large scale High Performance Computing (HPC) systems and commercial data centers. This constraint spans almost every level of computing technologies, from IC chips all the way up to data centers due to physical, technical, and economic reasons. To cope with this reality, it is necessary to understand how available or permissible power impacts the design and performance of emergent computer systems. For this reason, we propose power bounded computing and corresponding technologies to optimize performance on HPC systems with limited power budgets.
This dissertation has multiple research objectives, centered on understanding the interaction between performance, power bounds, and a hierarchical power management strategy. First, we develop heuristics and application-aware power allocation methods to improve application performance on a single node. Second, we develop algorithms to coordinate power across nodes and components on a cluster, based on application characteristics and the power budget. Third, we investigate performance interference induced by hardware and power contention, and propose contention-aware job scheduling to maximize system throughput under given power budgets for node-sharing systems. Fourth, we extend this work to GPU-accelerated systems and workloads and develop an online dynamic performance and power approach that meets both performance requirements and power-efficiency goals.
Power bounded computing improves performance scalability and power efficiency and decreases the operating costs of HPC systems and data centers. This dissertation opens up several new avenues for research in power bounded computing to address the power challenges in HPC systems. The proposed power and resource management techniques provide new directions and guidelines for green exascale computing and other computing systems.
A differentiated proposal of three dimension i/o performance characterization model focusing on storage environments
The I/O bottleneck remains a central issue in high-performance environments. Cloud
computing, high-performance computing (HPC), and big data environments share many
underlying difficulties in delivering data at the rate requested by high-performance
applications. This increases the possibility of bottlenecks arising throughout the
application feeding process at the lower-level hardware devices located in the storage system layer.
In recent years, many researchers have proposed solutions to improve the I/O
architecture using different approaches. Some of them take advantage of hardware
devices, while others focus on a sophisticated software approach. However, due to the
complexity of dealing with high-performance environments, creating solutions to improve
I/O performance in both software and hardware is challenging and gives researchers many
opportunities. Classifying these improvements along different dimensions allows researchers
to understand how they have been built over the years and how they progress. In addition,
it allows future efforts to be directed to research topics that
have developed at a lower rate, balancing the general development process. This research
presents a three-dimension characterization model for classifying research works on I/O
performance improvements for large-scale storage computing facilities. This classification
model can also be used as a guideline framework for summarizing research, providing an
overview of the current scenario. We also used the proposed model to perform a systematic
literature mapping that covered ten years of research on I/O performance improvements
in storage environments. This study classified hundreds of distinct research works, identifying
which hardware, software, and storage systems received the most attention over
the years, which proposal elements were researched most, and where these elements
were evaluated. To justify the importance of this model and of the development
of solutions that target I/O performance improvements, we evaluated a subset of these
improvements using a real and complete experimentation environment, Grid5000.
Analysis of different scenarios using a synthetic I/O benchmark demonstrates how the
throughput and latency parameters behave when performing different I/O operations
using distinct storage technologies and approaches.
Heterogeneity aware fault tolerance for extreme scale computing
Upcoming Extreme Scale, or Exascale, Computing Systems are expected to deliver a peak performance of at least 10^18 floating point operations per second (FLOPS), primarily through significant expansion in scale. A major concern for such large-scale systems, however, is how to deal with failures in the system. This is because the impact of failures on system efficiency, when existing fault tolerance techniques are used, generally also increases with scale. Hence, current research effort in this area has been directed at optimizing various aspects of fault tolerance techniques to reduce their overhead at scale. One characteristic that has been overlooked so far, however, is heterogeneity, specifically in the rate at which individual components of the underlying system fail, and in the execution profile of a parallel application running on such a system. In this thesis, we investigate the implications of these types of heterogeneity for fault tolerance in large-scale high-performance computing (HPC) systems. To that end, we 1) study how knowledge of heterogeneity in system failure likelihoods can be utilized to make current fault tolerance schemes more efficient, 2) assess the feasibility of utilizing application imbalance for improved fault tolerance at scale, and 3) propose and evaluate changes to system-level resource managers in order to achieve reliable job placement over resources with unequal failure likelihoods. The results in this thesis, taken together, demonstrate that heterogeneity in failure likelihoods significantly changes the landscape of fault tolerance for large-scale HPC systems.
DRAM errors in the field: a statistical approach
This paper summarizes our two-year study of corrected and uncorrected errors on the MareNostrum 3 supercomputer, covering 2000 billion MB-hours of DRAM in the field. The study analyzes 4.5 million corrected and 71 uncorrected DRAM errors, and it compares the reliability of DIMMs from all three major memory manufacturers, built in three different technologies.
Our work has two sets of contributions. First, we illustrate the complexity of in-field DRAM error analysis and demonstrate the limitations of various widely-used methods and metrics. For example, we show that average error rates, errors per MB-hour and mean time between failures can provide volatile and unreliable results even after long periods of error logging, leading to incorrect conclusions about DRAM reliability. Second, we present formal statistical methods that overcome many of the limitations of the current approaches. The methods that we present are simple to understand and implement, reliable and widely accepted in the statistical community.
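As a rough illustration of one of the metrics named above, the following sketch computes errors per MB-hour from the totals quoted in this abstract. The function names are hypothetical, and the "one extra error" calculation is only an assumed, simplified way to show why a rate built on a small count can be unstable; it is not the statistical method proposed in the paper.

```python
# Totals quoted in the abstract above (MareNostrum 3, two-year study).
CORRECTED_ERRORS = 4_500_000
UNCORRECTED_ERRORS = 71
MB_HOURS = 2000e9  # "2000 billion MB-hours" of DRAM


def errors_per_mb_hour(errors: float, mb_hours: float) -> float:
    """Average error rate: errors observed per MB-hour of DRAM."""
    return errors / mb_hours


def relative_shift_from_one_more_error(errors: int) -> float:
    """Fractional change in a count-proportional rate if one more error is logged.

    Hypothetical helper for illustration only (not from the paper).
    """
    return 1 / errors


corrected_rate = errors_per_mb_hour(CORRECTED_ERRORS, MB_HOURS)
uncorrected_rate = errors_per_mb_hour(UNCORRECTED_ERRORS, MB_HOURS)

print(f"corrected errors per MB-hour:   {corrected_rate:.3e}")
print(f"uncorrected errors per MB-hour: {uncorrected_rate:.3e}")
print("shift from one extra uncorrected error: "
      f"{relative_shift_from_one_more_error(UNCORRECTED_ERRORS):.1%}")
```

A single average of this kind hides how the errors are spread over time and over DIMMs, which is consistent with the abstract's caution about relying on average rates alone.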
Overall, our study alerts the community to the need, firstly, to question the current practice in quantifying DRAM reliability and, secondly, to select a proper analysis approach for future studies. Our strong recommendations are to focus on metrics with practical value that can be easily related to system reliability, and to select methods that provide stable results, ideally supported by statistical significance. This work was supported by the Collaboration Agreement between Samsung Electronics Co., Ltd. and BSC, Spanish Government through Severo Ochoa programme (SEV-2015-0493), by the Spanish Ministry of Science and Technology through TIN2015-65316-P project and by the Generalitat de Catalunya (contracts 2014-SGR1051 and 2014-SGR-1272). This work has also received funding from the European Union’s Horizon 2020 research and innovation programme under EuroEXA project (grant agreement No 754337). Darko Zivanovic holds the Severo Ochoa grant (SVP-2014-068501) of the Ministry of Economy and Competitiveness of Spain. Postprint (author's final draft)