What does fault tolerant Deep Learning need from MPI?
Deep Learning (DL) algorithms have become the de facto Machine Learning (ML)
algorithm for large scale data analysis. DL algorithms are computationally
expensive - even distributed DL implementations which use MPI require days of
training (model learning) time on commonly studied datasets. Long running DL
applications become susceptible to faults - requiring development of a fault
tolerant system infrastructure, in addition to fault tolerant DL algorithms.
This raises an important question: What is needed from MPI for designing
fault tolerant DL implementations? In this paper, we address this problem for
permanent faults. We motivate the need for a fault tolerant MPI specification
by an in-depth consideration of recent innovations in DL algorithms and their
properties, which drive the need for specific fault tolerance features. We
present an in-depth discussion on the suitability of different parallelism
types (model, data and hybrid); a need (or lack thereof) for check-pointing of
any critical data structures; and most importantly, consideration for several
fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI
and their applicability to fault tolerant DL implementations. We leverage a
distributed memory implementation of Caffe, currently available under the
Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches
by extending MaTEx-Caffe to use a ULFM-based implementation. Our evaluation
using the ImageNet dataset and the AlexNet and GoogLeNet neural network topologies
demonstrates the effectiveness of the proposed fault tolerant DL implementation
using Open MPI-based ULFM.
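The data-parallel case mentioned above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: it uses no MPI and no ULFM calls, the "ranks" are plain list indices, and every name (`local_gradient`, `allreduce_mean`, `train`, `fail_rank`, `fail_step`) is hypothetical rather than taken from MaTEx-Caffe. It shows one way surviving workers might continue after a permanent fault: each rank holds a full replica of the model, so dropping a failed rank changes only which gradients are averaged.

```python
# Toy simulation of data-parallel SGD that continues after a permanent
# worker failure. No MPI is involved; all names here are hypothetical.

def local_gradient(w, shard):
    """Gradient of mean squared error for the 1-D model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an all-reduce that averages one value per surviving rank."""
    return sum(values) / len(values)

def train(shards, steps, lr, fail_rank=None, fail_step=None):
    w = 0.0                           # model weight, replicated on every rank
    alive = list(range(len(shards)))  # ranks that still participate
    for step in range(steps):
        if step == fail_step and fail_rank in alive:
            # Permanent fault: drop the rank and carry on with the survivors,
            # loosely analogous to shrinking a communicator.
            alive.remove(fail_rank)
        grads = [local_gradient(w, shards[r]) for r in alive]
        w -= lr * allreduce_mean(grads)
    return w, alive

# Synthetic data for y = 3x, split round-robin across four simulated ranks.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
w, alive = train(shards, steps=200, lr=0.01, fail_rank=2, fail_step=50)
```

In this toy the failed rank's data shard is simply lost; whether to redistribute it, and whether any state needs checkpointing, are design questions the sketch does not settle.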
HPC-Colony: Services and Interfaces to Support Systems With Very Large Numbers of Processors
The HPC-Colony Project, a collaboration with Lawrence Livermore National Laboratory, the University of Illinois at Urbana-Champaign and IBM, is focused on services and interfaces for very large numbers of processors. Advances in parallel systems in the last decade have delivered phenomenal progress in the overall capability available to a single parallel application. Several systems with peak capability of over 100TF are already available and systems are expected to exceed 1PF within a few years. Despite these impressive advances in peak performance capability, the sustained performance of these systems continues to fall as a percentage of the peak capability. Initial analysis suggests that key architectural bottlenecks (in hardware and software) are responsible for the lower sustained performance and some architectural change of direction may be necessary to address the declining sustained performance. In this proposal we focus on addressing software architectural bottlenecks, in the areas of operating system and runtime systems. While the trend towards larger processor counts benefits application developers through more processing power, it also challenges application developers to harness ever-increasing numbers of processors for productive work. Much of the burden falls to operating systems and runtime systems that were originally designed for much smaller processor counts. Under the Colony project, we are researching and developing system software to enable general purpose operating and runtime systems for tens of thousands of processors. Difficulties in achieving a balanced partitioning and dynamically scheduling workloads can limit scaling for complex problems on large machines. Scientific simulations that span components of large machines require common operating system services, such as process scheduling, event notification, and job management to scale to large machines. Today, application programmers must explicitly manage these resources. 
We address scaling issues and porting issues by delegating resource management tasks to a sophisticated parallel OS. Our definition of "managing resources" includes balancing CPU time, network utilization, and memory usage across the entire machine. We believe a consistent environment that provides newly necessary technology (such as fault tolerance) will also provide important efficiencies in system administration. The primary objective of the Colony Project is to develop technologies that enable application scientists to easily scale applications to computing platforms comprised of tens of thousands to hundreds of thousands of compute cores. This will be accomplished by addressing several problem areas that are known to be key factors when scaling applications to tens of thousands of processors. First, by providing a smart runtime system to quickly and dynamically make CPU, memory, and interconnect resource management adjustments, we remove the burden of achieving applications that are highly tuned and load-balanced for a particular execution instance (i.e., a particular input dataset and machine platform combination). Second, by providing a full complement of system services including the entire Linux system call set, we ease the challenge of developing portable applications since lightweight kernels frequently incorporate only a small subset of the POSIX calls prevalent in typical large scientific applications. Third, by providing fundamental changes to the Linux kernel that reduce variability in context switch times and provide for parallel-aware scheduling across the entire machine, we remove the negative impact of synchronizing collectives on bulk-synchronous applications. Fourth, by providing fault tolerance mechanisms that utilize our unique migration abilities in conjunction with in-memory techniques for minimal overhead, we eliminate the necessity for costly frequent application-driven check-points.
Our research utilizes full implementations of these technologies on systems consisting of tens of thousands of processors
A failure index for high performance computing applications
This dissertation introduces a new metric in the area of High Performance Computing (HPC) application reliability and performance modeling. Derived via the time-dependent implementation of an existing inequality measure, the Failure index (FI) generates a coefficient representing the level of volatility for the failures incurred by an application running on a given HPC system in a given time interval. This coefficient presents a normalized cross-system representation of the failure volatility of applications running on failure-rich HPC platforms. Further, the origin and ramifications of application failures are investigated, from which certain mathematical conclusions yield greater insight into the behavior of these applications in failure-rich system environments.
This work also includes background information on the problems facing HPC applications at the highest scale, the lack of standardized application-specific metrics within this arena, and a means of generating such metrics in a low latency manner. A case study containing detailed analysis showcasing the benefits of the FI is also included
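As a rough illustration of applying an inequality measure to failures over time, the sketch below buckets failure timestamps into equal windows and scores how unevenly they are spread. The abstract does not name the inequality measure it builds on, so the use of the Gini coefficient here is purely an assumption, and the function names (`gini`, `failure_volatility`) are hypothetical. This is not the dissertation's Failure index.

```python
def gini(values):
    """Gini coefficient of non-negative values; 0.0 means perfectly even."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending data:
    #   G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,  with i = 1..n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n

def failure_volatility(failure_times, start, end, bins):
    """Bucket failure timestamps into equal windows and score their unevenness.

    ASSUMPTION: Gini over per-window failure counts stands in for the
    unnamed inequality measure mentioned in the abstract.
    """
    width = (end - start) / bins
    counts = [0] * bins
    for t in failure_times:
        if start <= t < end:
            index = min(int((t - start) / width), bins - 1)
            counts[index] += 1
    return gini(counts)

evenly_spread = failure_volatility([1.0, 3.0, 5.0, 7.0], start=0.0, end=8.0, bins=4)
bursty = failure_volatility([6.1, 6.5, 7.0, 7.9], start=0.0, end=8.0, bins=4)
```

With four windows, failures spread one per window score 0.0, while four failures packed into a single window score 0.75, the maximum this formula can reach for four bins.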
Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems
Nanoscale technology nodes bring reliability concerns back to the center stage of digital system design. A systematic classification of approaches that increase system resilience in the presence of functional hardware (HW)-induced errors is presented, dealing with higher system abstractions, such as the (micro) architecture, the mapping, and platform software (SW). The field is surveyed in a systematic way based on nonoverlapping categories, which add insight into the ongoing work by exposing similarities and differences. HW and SW solutions are discussed in a similar fashion so that interrelationships become apparent. Representative examples from the literature illustrate the properties of each category. Moreover, it is demonstrated how hybrid schemes can be decomposed into their primitive components
Proactive fault tolerance in MPI applications via task migration
Failures are likely to be more frequent in systems with thousands of processors. Therefore, schemes for dealing with faults become increasingly important. In this paper, we present a fault tolerance solution for parallel applications that proactively migrates execution from processors where failure is imminent. Our approach assumes that some failures are predictable, and leverages the features in current hardware devices supporting early indication of faults. We use the concepts of processor virtualization and dynamic task migration, provided by Charm++ and Adaptive MPI (AMPI), to implement a mechanism that migrates tasks away from processors which are expected to fail. To demonstrate the feasibility of our approach, we present performance data from experiments with existing MPI applications. Our results show that proactive task migration is an effective technique to tolerate faults in MPI applications.
Keeping checkpoint/restart viable for exascale systems
Next-generation exascale systems, those capable of performing a quintillion operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques to ensure progress across faults like checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches on a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms
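The idea behind hash-based incremental checkpointing can be sketched in a few lines: hash fixed-size blocks of application state and save only the blocks whose hash changed since the previous checkpoint. The code below is a minimal CPU-only illustration, not the GPU implementation evaluated in the work; the block size, the choice of SHA-256, the assumption that the state keeps a fixed length, and all function names are assumptions made for this example. The approach is probabilistic in the sense that a hash collision could cause a changed block to be missed.

```python
import hashlib

BLOCK_SIZE = 4  # bytes per block; deliberately tiny so the example is easy to follow

def block_hashes(state: bytes):
    """Hash every fixed-size block of the state."""
    return [hashlib.sha256(state[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(state), BLOCK_SIZE)]

def incremental_checkpoint(state: bytes, previous_hashes):
    """Return (changed_blocks, new_hashes).

    changed_blocks maps block index -> block bytes for every block whose
    hash differs from the previous checkpoint. ASSUMPTION: the state has
    the same length as when previous_hashes was computed.
    """
    new_hashes = block_hashes(state)
    changed = {}
    for i, digest in enumerate(new_hashes):
        if previous_hashes[i] != digest:
            changed[i] = state[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
    return changed, new_hashes

def restore(base: bytes, changed):
    """Rebuild the newer state from a full base checkpoint plus changed blocks."""
    blocks = [base[i:i + BLOCK_SIZE] for i in range(0, len(base), BLOCK_SIZE)]
    for i, data in changed.items():
        blocks[i] = data
    return b"".join(blocks)

base_state = b"aaaabbbbccccdddd"              # full checkpoint
base_hashes = block_hashes(base_state)
new_state = b"aaaaBBBBccccdddd"               # only the second block changed
delta, new_hashes = incremental_checkpoint(new_state, base_hashes)
recovered = restore(base_state, delta)
```

Here only one of four blocks is written for the second checkpoint, which is the kind of reduction in commit time the abstract refers to; the actual savings depend on how much state changes between checkpoints.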
Système d'Administration Autonome Adaptable : application au Cloud (Adaptable Autonomic Management System: Application to the Cloud)
Recent years have seen the development of cloud computing. The main underlying principle is to externalize the management of companies' IT services to hosting centers that are managed by third-party companies. This externalization saves costs for the client company, since the resources required to manage these services are shared between clients and managed by the hosting company. This shift implies the management of large-scale hosting centers, whose size and complexity make them difficult to administer. With the development of computing infrastructures such as clusters or grids, systems emerged that support the automated management of these environments. We refer to these systems as Autonomic Management Systems (AMS). They aim to provide services that automate administration tasks such as software deployment, fault repair, or dynamic sizing according to load. Therefore, in this context, it is natural to consider the use of AMS for the administration of a cloud infrastructure. However, we observe that currently available AMS have mostly been designed to address the requirements of a particular application domain. It should be possible to adapt an AMS to the domain considered, in particular that of the cloud. Moreover, in the cloud computing area, different requirements have to be taken into account: those of the administrator of the hosting center and those of the user of the hosting center (who deploys applications in the cloud). Therefore, an AMS should be adaptable to fulfill such varied needs. In this thesis, we investigate the design and implementation of an adaptable AMS. Such an AMS must allow adaptation of all the services it provides, according to the domains where it is used.
We next describe the application of this adaptable AMS for the autonomic management of a cloud environment