
    Distributed Training Large-Scale Deep Architectures

    The scale of data and the scale of computation infrastructure together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this paper, we focus on employing a systems approach to speed up large-scale training. Via lessons learned from our routine benchmarking effort, we first identify bottlenecks and overheads that hinder data parallelism. We then devise guidelines that help practitioners configure an effective system and fine-tune parameters to achieve the desired speedup. Specifically, we develop a procedure for setting the minibatch size and choosing computation algorithms. We also derive lemmas for determining the quantity of key components such as the number of GPUs and parameter servers. Experiments and examples show that these guidelines help effectively speed up large-scale deep learning training.
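    The abstract does not reproduce the paper's actual lemmas; as a rough, generic illustration of the kind of back-of-the-envelope reasoning involved in sizing a data-parallel setup, the sketch below models synchronous data-parallel speedup under a constant per-step gradient-exchange cost. The function name and timing values are purely illustrative and are not taken from the paper.

```python
def data_parallel_speedup(t_compute, t_comm, n_gpus):
    """Estimated throughput speedup over a single GPU for synchronous
    data parallelism, assuming a fixed per-GPU minibatch and a constant
    per-step cost for exchanging gradients (e.g. through parameter servers).
    """
    if n_gpus == 1:
        return 1.0  # a single GPU needs no gradient exchange
    return n_gpus * t_compute / (t_compute + t_comm)

# Illustrative numbers: 0.5 s of compute and 0.1 s of communication per step.
for n in (1, 2, 4, 8, 16):
    print(f"{n:>2} GPUs -> {data_parallel_speedup(0.5, 0.1, n):.2f}x speedup")
```

    A model of this kind makes explicit that adding GPUs pays off only while the per-step communication overhead stays small relative to the per-GPU compute time.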

    Image recognition with Deep Learning techniques and TensorFlow

    Deep neural networks have gained popularity in recent years, obtaining outstanding results in a wide range of applications, most notably in computer vision and natural language processing tasks. Despite this newly found interest, research in neural networks spans many decades, and some of today's most used network architectures were invented many years ago. Nevertheless, the progress made during this period cannot be understood without taking into account the technological advances in key contiguous domains such as massive data storage and computing systems, more specifically in the Graphics Processing Unit (GPU) domain. These two components are responsible for the enormous performance gains in neural networks, which have made what we call Deep Learning a common term in the Artificial Intelligence and Machine Learning community. These kinds of networks need massive amounts of data to effectively train the millions of parameters they contain, and this training can take days or weeks depending on the computer architecture being used. The size of newly published datasets keeps growing, and the tendency to create deeper networks that outperform shallower architectures means that, in the medium and long term, the hardware to undertake this kind of training can only be found in high-performance computing facilities, which host enormous clusters of computers. However, using these machines is not straightforward, as both the framework and the code need to be appropriately tuned to take full advantage of these distributed environments. For this reason, we test TensorFlow, an open-source Deep Learning framework from Google with built-in distributed support, on the MinoTauro GPU cluster at the Barcelona Supercomputing Center (BSC). We aim to implement a defined workload using the distributed features the framework offers, to speed up the training process, acquire knowledge of the inner workings of the framework, and understand the similarities and differences with respect to classic single-node training.
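    As a minimal sketch of what multi-node data-parallel training with TensorFlow's distributed support can look like, the example below uses the tf.distribute API from TensorFlow 2.x (a newer interface than the one available when this work was carried out). The host names, model, and dataset are illustrative; on a real cluster each worker node would run the same script with its own task index.

```python
import json
import os

import tensorflow as tf

# Illustrative two-worker cluster layout; on a real cluster such as MinoTauro
# the host names would come from the job scheduler, and every worker runs
# this script with its own "index" value.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["node1:12345", "node2:12345"]},
    "task": {"type": "worker", "index": 0},
})

# Synchronous data parallelism across the workers listed in TF_CONFIG.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # A small image-recognition model, just to keep the sketch self-contained.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0

# The batch passed to fit() is the global batch, shared across the workers.
model.fit(x_train, y_train, epochs=1, batch_size=64)
```

    Note that MultiWorkerMirroredStrategy synchronizes gradients with collective all-reduce operations, whereas the distributed API available in early TensorFlow releases exposed an explicit cluster specification with parameter-server and worker roles.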