What does fault tolerant Deep Learning need from MPI?
Deep Learning (DL) algorithms have become the de facto Machine Learning (ML)
algorithm for large-scale data analysis. DL algorithms are computationally
expensive: even distributed DL implementations that use MPI require days of
training (model learning) time on commonly studied datasets. Long-running DL
applications become susceptible to faults, requiring the development of a
fault-tolerant system infrastructure in addition to fault-tolerant DL algorithms.
This raises an important question: What is needed from MPI for designing
fault tolerant DL implementations? In this paper, we address this problem for
permanent faults. We motivate the need for a fault tolerant MPI specification
by an in-depth consideration of recent innovations in DL algorithms and their
properties, which drive the need for specific fault tolerance features. We
present an in-depth discussion on the suitability of different parallelism
types (model, data, and hybrid); the need (or lack thereof) for checkpointing of
any critical data structures; and most importantly, consideration for several
fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI
and their applicability to fault tolerant DL implementations. We leverage a
distributed memory implementation of Caffe, currently available under the
Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches
by extending MaTEx-Caffe to use a ULFM-based implementation. Our evaluation
using the ImageNet dataset and the AlexNet and GoogLeNet neural network topologies
demonstrates the effectiveness of the proposed fault-tolerant DL implementation
using OpenMPI-based ULFM.
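The following is a minimal sketch of why data parallelism tolerates permanent faults: every worker holds a full replica of the model, so training can continue on the survivors after a rank is lost. It is a single-process toy simulation, not MPI and not MaTEx-Caffe code. The names `local_gradient`, `allreduce_mean`, and `train`, and the one-weight linear model, are assumptions made for illustration only.

```python
# Toy simulation (no MPI): data-parallel SGD that survives a permanent worker fault.
# All names are hypothetical; a real implementation would use MPI collectives.

def local_gradient(w, shard):
    """Gradient of the mean of 0.5 * (w * x - y)**2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an all-reduce average over the surviving workers."""
    return sum(values) / len(values)

def train(shards, steps, lr, fail_at=None, failed_rank=None):
    w = 0.0  # every worker holds an identical replica of this weight
    alive = list(range(len(shards)))
    for step in range(steps):
        if step == fail_at and failed_rank in alive:
            # Permanent fault: drop the rank and continue with the survivors,
            # loosely analogous to shrinking the communicator after a failure.
            alive.remove(failed_rank)
        grads = [local_gradient(w, shards[r]) for r in alive]
        w -= lr * allreduce_mean(grads)
    return w, alive

# Data generated from y = 3x, split across four simulated workers.
shards = [[(x, 3.0 * x) for x in (r + 1.0, r + 1.5)] for r in range(4)]
w, alive = train(shards, steps=200, lr=0.02, fail_at=50, failed_rank=2)
print(round(w, 3), alive)
```

In this sketch the failed worker's data shard is simply dropped. Because the model state is replicated on every worker, nothing is restored from a checkpoint. Whether losing a shard is acceptable depends on the application.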
A Unified Coded Deep Neural Network Training Strategy Based on Generalized PolyDot Codes for Matrix Multiplication
This paper has two contributions. First, we propose a novel coded matrix
multiplication technique called Generalized PolyDot codes that advances on
existing methods for coded matrix multiplication under storage and
communication constraints. This technique uses "garbage alignment," i.e.,
aligning computations in coded computing that are not a part of the desired
output. Generalized PolyDot codes bridge between Polynomial codes and MatDot
codes, trading off between recovery threshold and communication costs. Second,
we demonstrate that Generalized PolyDot can be used for training large Deep
Neural Networks (DNNs) on unreliable nodes prone to soft-errors. This requires
us to address three additional challenges: (i) prohibitively large overhead of
coding the weight matrices in each layer of the DNN at each iteration; (ii)
nonlinear operations during training, which are incompatible with linear
coding; and (iii) not assuming presence of an error-free master node, requiring
us to architect a fully decentralized implementation without any "single point
of failure." We allow all primary DNN training steps, namely, matrix
multiplication, nonlinear activation, Hadamard product, and update steps as
well as the encoding/decoding to be error-prone. We consider the case of a
mini-batch size of one, as well as greater than one, leveraging coded
matrix-vector products and matrix-matrix products, respectively. The problem of DNN training
under soft-errors also motivates an interesting, probabilistic error model
under which a real number MDS code is shown to correct errors
with probability as compared to for the
more conventional, adversarial error model. We also demonstrate that our
proposed strategy can provide unbounded gains in error tolerance over a
competing replication strategy and a preliminary MDS-code-based strategy for
both these error models.
Comment: Presented in part at the IEEE International Symposium on Information
Theory 2018 (Submission Date: Jan 12 2018); currently under review at the
IEEE Transactions on Information Theory
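To make the idea of a coded matrix-vector product concrete, here is a small sketch of a plain real-number MDS code built from a Vandermonde matrix. It is not Generalized PolyDot, and it handles only missing results (erasures), which is simpler than the soft-error setting the abstract targets. The function names, the evaluation points `p + 1`, and the use of exact fractions are all assumptions chosen for illustration.

```python
from fractions import Fraction

def matvec(rows, x):
    return [sum(a * b for a, b in zip(row, x)) for row in rows]

def encode(blocks, num_workers):
    """Worker p stores sum_q alpha_p**q * blocks[q], with alpha_p = p + 1 (assumed)."""
    coded = []
    for p in range(num_workers):
        alpha = Fraction(p + 1)
        coded.append([[sum(alpha ** q * blocks[q][i][j] for q in range(len(blocks)))
                       for j in range(len(blocks[0][0]))]
                      for i in range(len(blocks[0]))])
    return coded

def solve(M, rhs):
    """Gauss-Jordan elimination over exact fractions; each rhs row is a vector."""
    n = len(M)
    aug = [list(M[i]) + list(rhs[i]) for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = 1 / aug[col][col]
        aug[col] = [v * inv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def decode(results, num_blocks):
    """results maps worker index -> coded product; any num_blocks entries suffice."""
    workers = sorted(results)[:num_blocks]
    V = [[Fraction(p + 1) ** q for q in range(num_blocks)] for p in workers]
    parts = solve(V, [results[p] for p in workers])
    return [v for part in parts for v in part]

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, -1]
blocks = [A[0:2], A[2:4]]                # two row-blocks of A
coded = encode(blocks, num_workers=4)    # four workers, each with one coded block
results = {p: matvec(coded[p], x) for p in range(4)}
del results[0], results[2]               # two workers never respond
recovered = decode(results, num_blocks=2)
print(recovered == matvec(A, x))
```

Any two of the four coded results determine the full product because every square submatrix of a Vandermonde matrix with distinct evaluation points is invertible. Replicating each block twice uses the same number of workers but fails if both copies of one block are lost.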
Development of a neural network mathematical model for demand forecasting in fluctuating markets
Research has shown that Neural Networks (NNs), when trained appropriately, outperform conventional forecasting techniques. Research has also shown that no system accurately forecasts sudden changes in demand for a given product. This paper reports on the development of a recovery method for use once a sudden change in demand has taken place. An error in forecasting demand leads to either excessive inventories of the product or shortages of it, and can cause substantial financial losses for the company producing or marketing the product. Two recovery methods have been developed and are described in this paper: RZ recovery and Exponential Smoothing (ES). In RZ recovery, once a sudden change has taken place, a ‘soft’ Poka-Yoke (PY) system is set up, warning the company that the normal forecasting system can no longer be relied upon and that a recovery system needs to be initiated, with re-forecasting initiated
- …
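As a point of reference for the Exponential Smoothing (ES) method named in the abstract above, the sketch below shows textbook simple exponential smoothing reacting to a step change in demand. It is not the paper's recovery procedure, which the truncated abstract does not describe. The function name, the seeding with the first observation, and the demand series are assumptions for illustration.

```python
def exponential_smoothing(demand, alpha):
    """Return one-step-ahead forecasts, where forecast[t] predicts demand[t]."""
    forecast = [demand[0]]  # assumption: seed with the first observation
    for actual in demand[:-1]:
        # New forecast = alpha * latest actual + (1 - alpha) * previous forecast.
        forecast.append(alpha * actual + (1 - alpha) * forecast[-1])
    return forecast

# Hypothetical demand series with a sudden jump from 100 to 200 units.
demand = [100.0] * 5 + [200.0] * 5
slow = exponential_smoothing(demand, alpha=0.2)
fast = exponential_smoothing(demand, alpha=0.8)

for t in range(5, 10):
    print(t, demand[t], round(slow[t], 2), round(fast[t], 2))
```

A larger smoothing constant tracks the new demand level within a few periods, at the cost of reacting more strongly to noise when demand is stable.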