
114,879 research outputs found

Asynchronous Optimization Methods for Efficient Training of Deep Neural Networks with Guarantees

Authors: Alistarh Dan; Chatterjee Bapi; Egan Malcolm; Kungurtsev Vyacheslav
Publication date: 11/07/2020

Asynchronous distributed algorithms are a popular way to reduce synchronization costs in large-scale optimization, and in particular for neural network training. However, for nonsmooth and nonconvex objectives, few convergence guarantees exist beyond cases where closed-form proximal operator solutions are available. As most popular contemporary deep neural networks lead to nonsmooth and nonconvex objectives, there is now a pressing need for such convergence guarantees. In this paper, we analyze for the first time the convergence of stochastic asynchronous optimization for this general class of objectives. In particular, we focus on stochastic subgradient methods allowing for block variable partitioning, where the shared-memory-based model is asynchronously updated by concurrent processes. To this end, we first introduce a probabilistic model which captures key features of real asynchronous scheduling between concurrent processes; under this model, we establish convergence with probability one to an invariant set for stochastic subgradient methods with momentum. From the practical perspective, one issue with the family of methods we consider is that it is not efficiently supported by machine learning frameworks, as they mostly focus on distributed data-parallel strategies. To address this, we propose a new implementation strategy for shared-memory-based training of deep neural networks, whereby concurrent parameter servers are utilized to train a partitioned but shared model in single- and multi-GPU settings. Based on this implementation, we achieve on average a 1.2x speed-up in comparison to state-of-the-art training methods for popular image classification tasks without compromising accuracy.
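The kind of update the abstract describes can be illustrated with a toy, single-threaded simulation. This is a minimal sketch under stated assumptions, not the paper's algorithm or its probabilistic scheduling model: the objective (a sum of absolute values), the step-size schedule, the delay model (a uniformly random recent snapshot), and all function names are hypothetical choices made for illustration.

```python
import random
from collections import deque


def objective(x, target):
    # Nonsmooth toy objective: f(x) = sum_i |x_i - target_i| (an assumption).
    return sum(abs(a - b) for a, b in zip(x, target))


def noisy_block_subgradient(x, target, block, rng):
    # Stochastic subgradient of the toy objective, restricted to one block.
    grads = []
    for i in block:
        sign = 1.0 if x[i] > target[i] else -1.0 if x[i] < target[i] else 0.0
        grads.append(sign + rng.gauss(0.0, 0.1))
    return grads


def async_block_subgradient(target, blocks, steps=4000, max_delay=3,
                            beta=0.5, seed=0):
    rng = random.Random(seed)
    x = [0.0] * len(target)         # shared model, updated in place
    velocity = [0.0] * len(target)  # momentum buffer, one entry per coordinate
    # Recent snapshots stand in for stale reads by concurrent workers.
    history = deque([list(x)], maxlen=max_delay + 1)
    for k in range(1, steps + 1):
        block = rng.choice(blocks)         # which block this "worker" owns
        stale = rng.choice(list(history))  # possibly outdated view of the model
        grads = noisy_block_subgradient(stale, target, block, rng)
        lr = 1.0 / k ** 0.75               # diminishing step size (assumption)
        for g, i in zip(grads, block):
            velocity[i] = beta * velocity[i] + g
            x[i] -= lr * velocity[i]
        history.append(list(x))
    return x


target = [1.0, -2.0, 0.5, 3.0]
blocks = [[0, 1], [2, 3]]
x_final = async_block_subgradient(target, blocks)
print(objective([0.0] * 4, target), objective(x_final, target))
```

Each iteration touches only one block and computes its subgradient from a delayed copy of the model, which is the feature that distinguishes this family of methods from synchronous data-parallel training.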

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

IST Austria: PubRep (Institute of Science and Technology)

Hal-Diderot

Association for the Advancement of Artificial Intelligence: AAAI Publications

Distributed learning of CNNs on heterogeneous CPU/GPU architectures

Authors: Alexandre Luís A.; Falcao Gabriel; Marques Jose
Publication date: 07/12/2017

Convolutional Neural Networks (CNNs) have been shown to be powerful classification tools in tasks that range from check reading to medical diagnosis, reaching close to human perception, and in some cases surpassing it. However, the problems to solve are becoming larger and more complex, which translates to larger CNNs, leading to longer training times that not even the adoption of Graphics Processing Units (GPUs) could keep up with. This problem is partially solved by using more processing units and distributed training methods that are offered by several frameworks dedicated to neural network training. However, these techniques do not take full advantage of the possible parallelization offered by CNNs and the cooperative use of heterogeneous devices with different processing capabilities, clock speeds, memory size, among others. This paper presents a new method for the parallel training of CNNs that can be considered as a particular instantiation of model parallelism, where only the convolutional layer is distributed. In fact, the convolutions processed during training (forward and backward propagation included) represent from 60% to 90% of global processing time. The paper analyzes the influence of network size, bandwidth, batch size, number of devices, including their processing capabilities, and other parameters. Results show that this technique is capable of diminishing the training time without affecting the classification performance for both CPUs and GPUs. For the CIFAR-10 dataset, using a CNN with two convolutional layers, and 500 and 1500 kernels, respectively, best speedups achieve 3.28× using four CPUs and 2.45× with three GPUs. Modern imaging datasets, larger and more complex than CIFAR-10, will certainly require more than 60-90% of processing time calculating convolutions, and speedups will tend to increase accordingly.
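The link between the share of time spent in convolutions and the achievable speedup can be sketched with Amdahl's law. This is an idealized model introduced here for illustration, not the paper's analysis: it assumes the convolutional share scales perfectly across identical devices and ignores communication cost and device heterogeneity. The function name is hypothetical.

```python
def ideal_speedup(conv_fraction, n_devices):
    # Amdahl's law: only the convolutional share of the work is parallelized.
    serial_part = 1.0 - conv_fraction
    parallel_part = conv_fraction / n_devices
    return 1.0 / (serial_part + parallel_part)


# Upper bounds under the idealized model for the 60% and 90% shares.
for fraction in (0.6, 0.9):
    for devices in (2, 3, 4):
        bound = ideal_speedup(fraction, devices)
        print(f"share={fraction:.0%} devices={devices} bound={bound:.2f}x")
```

Under this model the bound grows with the convolutional share, which matches the abstract's expectation that speedups increase as convolutions take up more of the processing time.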

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

UBibliorum repositorio digital da ubi

Directory of Open Access Journals

Distributed computing methodology for training neural networks in an image-guided diagnostic application

Authors: Magoulas George D.; Plagianakos V.P.; Vrahatis M.N.
Publication venue: Elsevier BV
Publication date: 01/01/2006

Distributed computing is a process through which a set of computers connected by a network is used collectively to solve a single problem. In this paper, we propose a distributed computing methodology for training neural networks for the detection of lesions in colonoscopy. Our approach is based on partitioning the training set across multiple processors using a parallel virtual machine. In this way, interconnected computers of varied architectures can be used for the distributed evaluation of the error function and gradient values, and, thus, for training neural networks utilizing various learning methods. The proposed methodology has large granularity and low synchronization, and has been implemented and tested. Our results indicate that the parallel virtual machine implementation of the training algorithms developed leads to considerable speedup, especially when large network architectures and training sets are used.
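Partitioning the training set for distributed evaluation of the error and gradient can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a thread pool stands in for the parallel virtual machine, the model is a linear unit with squared error, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_error_and_gradient(weights, shard):
    # Squared error and its gradient for a linear model on one data shard.
    error = 0.0
    gradient = [0.0] * len(weights)
    for features, label in shard:
        residual = sum(w * f for w, f in zip(weights, features)) - label
        error += 0.5 * residual ** 2
        for j, f in enumerate(features):
            gradient[j] += residual * f
    return error, gradient


def distributed_error_and_gradient(weights, dataset, n_workers=3):
    # Split the training set into one shard per worker (round-robin).
    shards = [dataset[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(
            pool.map(lambda s: partial_error_and_gradient(weights, s), shards)
        )
    # A single reduction step combines the partial results.
    total_error = sum(err for err, _ in results)
    total_gradient = [
        sum(grad[j] for _, grad in results) for j in range(len(weights))
    ]
    return total_error, total_gradient


dataset = [([1.0, 2.0], 3.0), ([0.5, -1.0], 0.0), ([2.0, 0.0], 1.0),
           ([1.5, 1.5], 2.0), ([-1.0, 1.0], -0.5)]
print(distributed_error_and_gradient([0.1, -0.2], dataset))
```

Workers exchange data only once per evaluation, at the final sum, which is one way to read the abstract's description of large granularity and low synchronization.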

Birkbeck Institutional Research Online

Adaptive Resonance Theory

Authors: Carpenter Gail; Grossberg Stephen
Publication venue: Boston University Center for Adaptive Systems and Department of Cognitive and Neural Systems
Publication date: 01/09/1998

Boston University Institutional Repository (OpenBU)