Search CORE

2,322 research outputs found

What does fault tolerant Deep Learning need from MPI?

Author: Amatya Vinay
Daily Jeff
Siegel Charles
Vishnu Abhinav
Publication venue
Publication date: 01/01/2017
Field of study

Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) algorithm for large scale data analysis. DL algorithms are computationally expensive - even distributed DL implementations which use MPI require days of training (model learning) time on commonly studied datasets. Long running DL applications become susceptible to faults - requiring development of a fault tolerant system infrastructure, in addition to fault tolerant DL algorithms. This raises an important question: What is needed from MPI for de- signing fault tolerant DL implementations? In this paper, we address this problem for permanent faults. We motivate the need for a fault tolerant MPI specification by an in-depth consideration of recent innovations in DL algorithms and their properties, which drive the need for specific fault tolerance features. We present an in-depth discussion on the suitability of different parallelism types (model, data and hybrid); a need (or lack thereof) for check-pointing of any critical data structures; and most importantly, consideration for several fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI and their applicability to fault tolerant DL implementations. We leverage a distributed memory implementation of Caffe, currently available under the Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches by ex- tending MaTEx-Caffe for using ULFM-based implementation. Our evaluation using the ImageNet dataset and AlexNet, and GoogLeNet neural network topologies demonstrates the effectiveness of the proposed fault tolerant DL implementation using OpenMPI based ULFM

arXiv.org e-Print Archive

Crossref

A RAID reconfiguration scheme for gracefully degraded operations

Author: Hwang K
Jin H
Zhang J
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1999
Field of study

One distinct advantage of Redundant Array of Independent Disks (RAID) is fault tolerance. But the performance of a disk array in degraded mode is so poor that no one uses the RAID after failure. Continuous operation of RAID in degraded mode is very important in many real time applications, which can not be interrupted in providing continuous services. In this paper, we propose an efficient architectural reconfiguration scheme to enhance the performance of RAID-5 in degraded mode, called reconfigurable RAID-5. It reconfigures RAID-5 to RPTD-0 in degraded mode. Using this scheme, the calculation of the failure data and the generation of parity in writing the new data to the failed disk can be reduced. It also alleviates the small write problem for RAID-5 in degraded mode. We use the phase parallel model to analyze the total execution time of the RAID-5 and of the reconfigurable RAID-5. Through theoretical analysis and benchmark test, we find the performance of the reconfigurable RAID-5 can be 200 times better than conventional RAID-5.published_or_final_versio

HKU Scholars Hub

Compressive Sensing Theory for Optical Systems Described by a Continuous Model

Author: Fannjiang Albert
Publication venue
Publication date: 02/07/2015
Field of study

A brief survey of the author and collaborators' work in compressive sensing applications to continuous imaging models.Comment: Chapter 3 of "Optical Compressive Imaging" edited by Adrian Stern published by Taylor & Francis 201

arXiv.org e-Print Archive

eScholarship - University of California

Algorithm-Based Fault Tolerance for Two-Sided Dense Matrix Factorizations

Author: Jia Yulu
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/12/2015
Field of study

The mean time between failure (MTBF) of large supercomputers is decreasing, and future exascale computers are expected to have a MTBF of around 30 minutes. Therefore, it is urgent to prepare important algorithms for future machines with such a short MTBF. Eigenvalue problems (EVP) and singular value problems (SVP) are common in engineering and scientific research. Solving EVP and SVP numerically involves two-sided matrix factorizations: the Hessenberg reduction, the tridiagonal reduction, and the bidiagonal reduction. These three factorizations are computation intensive, and have long running times. They are prone to suffer from computer failures. We designed algorithm-based fault tolerant (ABFT) algorithms for the parallel Hessenberg reduction and the parallel tridiagonal reduction. The ABFT algorithms target fail-stop errors. These two fault tolerant algorithms use a combination of ABFT and diskless checkpointing. ABFT is used to protect frequently modified data . We carefully design the ABFT algorithm so the checksums are valid at the end of each iterative cycle. Diskless checkpointing is used for rarely modified data. These checkpoints are in the form of checksums, which are small in size, so the time and storage cost to store them in main memory is small. Also, there are intermediate results which need to be protected for a short time window. We store a copy of this data on the neighboring process in the process grid. We also designed algorithm-based fault tolerant algorithms for the CPU-GPU hybrid Hessenberg reduction algorithm and the CPU-GPU hybrid bidiagonal reduction algorithm. These two fault tolerant algorithms target silent errors. Our design employs both ABFT and diskless checkpointing to provide data redundancy. The low cost error detection uses two dot products and an equality test. The recovery protocol uses reverse computation to roll back the state of the matrix to a point where it is easy to locate and correct errors. We provided theoretical analysis and experimental verification on the correctness and efficiency of our fault tolerant algorithm design. We also provided mathematical proof on the numerical stability of the factorization results after fault recovery. Experimental results corroborate with the mathematical proof that the impact is mild

University of Tennessee, Knoxville: Trace

A Tutorial on RAID Storage Systems

Author: Kritzinger Prof Pieter
Perumal Mr Sameshan
Publication venue
Publication date: 01/01/2004
Field of study

RAID storage systems have been in use since the early 1990's. Recently, however, as the demand for huge amounts of on-line storage has increased, RAID has once again come into focus. This report reviews the history of RAID, as well as where and how RAID systems fit in the storage hierarchy of an Enterprize Computing System (EIS). We describe the known RAID configurations and the advantages and disadvantages of each. Since the focus of our research is on the performance of RAID systems we devote a section to the various factors which affect RAID performance. Modelling RAID systems for their performance analysis is the topic of the next section and we report on the issues as well as briefly describe one simulator, RAIDframe, which has been developed. We conclude with section which describes the current open research questions in the area

UCT Computer Science Research Document Archive

Recommended from our members

Fault Tolerance Against Design Faults

Author: Strigini L.
Publication venue: 'Wiley'
Publication date: 01/01/2005
Field of study

City Research Online

Redundant disk arrays: Reliable, parallel secondary storage

Author: Gibson Garth Alan
Publication venue
Publication date
Field of study

During the past decade, advances in processor and memory technology have given rise to increases in computational performance that far outstrip increases in the performance of secondary storage technology. Coupled with emerging small-disk technology, disk arrays provide the cost, volume, and capacity of current disk subsystems, by leveraging parallelism, many times their performance. Unfortunately, arrays of small disks may have much higher failure rates than the single large disks they replace. Redundant arrays of inexpensive disks (RAID) use simple redundancy schemes to provide high data reliability. The data encoding, performance, and reliability of redundant disk arrays are investigated. Organizing redundant data into a disk array is treated as a coding problem. Among alternatives examined, codes as simple as parity are shown to effectively correct single, self-identifying disk failures

NASA Technical Reports Server

Multicluster interleaving on paths and cycles

Author: Bruck Jehoshua
Jiang Anxiao (Andrew)
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2005
Field of study

Interleaving codewords is an important method not only for combatting burst errors, but also for distributed data retrieval. This paper introduces the concept of multicluster interleaving (MCI), a generalization of traditional interleaving problems. MCI problems for paths and cycles are studied. The following problem is solved: how to interleave integers on a path or cycle such that any m (m/spl ges/2) nonoverlapping clusters of order 2 in the path or cycle have at least three distinct integers. We then present a scheme using a "hierarchical-chain structure" to solve the following more general problem for paths: how to interleave integers on a path such that any m (m/spl ges/2) nonoverlapping clusters of order L (L/spl ges/2) in the path have at least L+1 distinct integers. It is shown that the scheme solves the second interleaving problem for paths that are asymptotically as long as the longest path on which an MCI exists, and clearly, for shorter paths as well

CiteSeerX

Caltech Authors