Search CORE

10,486 research outputs found

What does fault tolerant Deep Learning need from MPI?

Author: Amatya Vinay
Daily Jeff
Siegel Charles
Vishnu Abhinav
Publication venue
Publication date: 01/01/2017
Field of study

Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) algorithm for large scale data analysis. DL algorithms are computationally expensive - even distributed DL implementations which use MPI require days of training (model learning) time on commonly studied datasets. Long running DL applications become susceptible to faults - requiring development of a fault tolerant system infrastructure, in addition to fault tolerant DL algorithms. This raises an important question: What is needed from MPI for de- signing fault tolerant DL implementations? In this paper, we address this problem for permanent faults. We motivate the need for a fault tolerant MPI specification by an in-depth consideration of recent innovations in DL algorithms and their properties, which drive the need for specific fault tolerance features. We present an in-depth discussion on the suitability of different parallelism types (model, data and hybrid); a need (or lack thereof) for check-pointing of any critical data structures; and most importantly, consideration for several fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI and their applicability to fault tolerant DL implementations. We leverage a distributed memory implementation of Caffe, currently available under the Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches by ex- tending MaTEx-Caffe for using ULFM-based implementation. Our evaluation using the ImageNet dataset and AlexNet, and GoogLeNet neural network topologies demonstrates the effectiveness of the proposed fault tolerant DL implementation using OpenMPI based ULFM

arXiv.org e-Print Archive

Crossref

Embedding Multi-Task Address-Event- Representation Computation

Author: Civit Balcells Antón
Jiménez Moreno Gabriel
Linares Barranco Alejandro
Luján Martínez Carlos
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009
Field of study

Address-Event-Representation, AER, is a communication protocol that is intended to transfer neuronal spikes between bioinspired chips. There are several AER tools to help to develop and test AER based systems, which may consist of a hierarchical structure with several chips that transmit spikes among them in real-time, while performing some processing. Although these tools reach very high bandwidth at the AER communication level, they require the use of a personal computer to allow the higher level processing of the event information. We propose the use of an embedded platform based on a multi-task operating system to allow both, the AER communication and processing without the requirement of either a laptop or a computer. In this paper, we present and study the performance of an embedded multi-task AER tool, connecting and programming it for processing Address-Event information from a spiking generator.Ministerio de Ciencia e Innovación TEC2006-11730-C03-0

idUS. Depósito de Investigación Universidad de Sevilla