Distributed Training Large-Scale Deep Architectures
Scale of data and scale of computation infrastructures together enable the
current deep learning renaissance. However, training large-scale deep
architectures demands both algorithmic improvement and careful system
configuration. In this paper, we focus on employing the system approach to
speed up large-scale training. Via lessons learned from our routine
benchmarking effort, we first identify bottlenecks and overheads that hinder
data parallelism. We then devise guidelines that help practitioners to
configure an effective system and fine-tune parameters to achieve desired
speedup. Specifically, we develop a procedure for setting minibatch size and
choosing computation algorithms. We also derive lemmas for determining the
quantity of key components such as the number of GPUs and parameter servers.
Experiments and examples show that these guidelines help effectively speed up
large-scale deep learning training.
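The abstract does not spell out the setting procedure or the lemmas. As a rough illustration of the kind of data-parallel configuration it discusses, the sketch below applies the widely used linear-scaling heuristic that relates the global minibatch size and learning rate to the number of GPUs; the heuristic, the function name, and all parameter values are assumptions for illustration, not the paper's method.

```python
# Illustrative sketch only: the common linear-scaling heuristic for
# data-parallel training, NOT the procedure or lemmas from the paper.
# All names and parameter values below are hypothetical.

def data_parallel_config(per_gpu_batch, num_gpus, base_lr, base_batch=256):
    """Scale the global minibatch and learning rate with the number of GPUs.

    per_gpu_batch: samples processed by each GPU per step
    num_gpus:      number of data-parallel workers
    base_lr:       learning rate tuned for a minibatch of base_batch samples
    """
    global_batch = per_gpu_batch * num_gpus          # effective minibatch size
    scaled_lr = base_lr * global_batch / base_batch  # keep the per-sample step comparable
    return global_batch, scaled_lr


if __name__ == "__main__":
    batch, lr = data_parallel_config(per_gpu_batch=32, num_gpus=8, base_lr=0.1)
    print(f"global batch = {batch}, scaled learning rate = {lr:.3f}")
```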
Image recognition with Deep Learning techniques and TensorFlow
Deep neural networks have gained popularity in recent years, obtaining outstanding results in
a wide range of applications, most notably in computer vision and natural language
processing tasks. Despite the newfound interest, research in neural networks spans many
decades, and some of today’s most used network architectures were invented many years
ago. Nevertheless, the progress made during this period cannot be understood without taking
into account the technological advances in key adjacent domains such as massive
data storage and computing systems, more specifically in the Graphics Processing Unit (GPU)
domain. These two components are responsible for the enormous performance gains in neural
networks that have made what we call Deep Learning a common term in the Artificial
Intelligence and Machine Learning community.
These kinds of networks need massive amounts of data to effectively train the millions of
parameters they contain, and this training can take days or weeks depending on the
computer architecture being used. The size of newly published datasets keeps growing, and the
tendency to create deeper networks that outperform shallower architectures means that, in
the medium and long term, the hardware needed to undertake this kind of training
can only be found in high-performance computing facilities, which house enormous clusters
of computers. However, using these machines is not straightforward, as both the framework and
the code need to be appropriately tuned to take full advantage of these distributed
environments.
For this reason, we test TensorFlow, an open-source Deep Learning framework from
Google with built-in distributed support, on the MinoTauro GPU cluster at the
Barcelona Supercomputing Center (BSC). We aim to implement a defined workload using the
distributed features the framework offers in order to speed up the training process, acquire
knowledge of the inner workings of the framework, and understand the similarities and
differences with respect to classic single-node training.
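As a point of reference, the sketch below shows how multi-worker data-parallel training is expressed with today's tf.distribute API; this is a generic illustration of the workflow rather than the actual setup used on MinoTauro, and the host names, model, and dataset are placeholders.

```python
# Minimal sketch of multi-worker data-parallel training in TensorFlow 2.x.
# Generic illustration only; not the configuration used in the thesis.
import json
import os

import tensorflow as tf

# Every node in the cluster runs the same script and exports its own role in
# TF_CONFIG before the strategy is created. Host names here are hypothetical.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["node0:12345", "node1:12345"]},
    "task": {"type": "worker", "index": 0},  # this process's index in the cluster
})

# Gradients are all-reduced across all workers listed in TF_CONFIG; training
# starts once every listed worker has joined.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables must be created inside the strategy scope so they are mirrored.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
# batch_size is the global batch, split across the workers by the strategy.
model.fit(x_train / 255.0, y_train, epochs=1, batch_size=64)
```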
- …