TensorLayer: A Versatile Library for Efficient Deep Learning Development
Deep learning has enabled major advances in the fields of computer vision, natural language processing, and multimedia, among many others. Developing a deep learning system is arduous and complex, as it involves constructing neural network architectures, managing training/trained models, tuning the optimization process, and preprocessing and organizing data. TensorLayer is a versatile Python library that aims at helping researchers and engineers efficiently develop deep learning systems. It offers rich abstractions for neural networks, model and data management, and a parallel workflow mechanism. While boosting efficiency, TensorLayer maintains both performance and scalability. TensorLayer was released in September 2016 on GitHub, and has helped people from academia and industry develop real-world applications of deep learning.
Comment: ACM Multimedia 2017
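A minimal sketch of the layer-level abstraction the abstract describes, assuming the TensorLayer 1.x-era API (module and argument names may differ across releases):

```python
# Sketch of a TensorLayer-style MLP definition, assuming the 1.x API in
# which tl.layers.*Layer objects compositionally wrap TensorFlow tensors.
import tensorflow as tf
import tensorlayer as tl

x = tf.placeholder(tf.float32, shape=[None, 784], name="x")

# Each layer wraps the previous one, so the network is built compositionally.
network = tl.layers.InputLayer(x, name="input")
network = tl.layers.DenseLayer(network, n_units=800, act=tf.nn.relu, name="relu1")
network = tl.layers.DenseLayer(network, n_units=10, act=tf.identity, name="output")

y = network.outputs  # the underlying TensorFlow tensor, usable in a loss
```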
TensorFlow Estimators: Managing Simplicity vs. Flexibility in High-Level Machine Learning Frameworks
We present a framework for specifying, training, evaluating, and deploying machine learning models. Our focus is on simplifying cutting-edge machine learning for practitioners in order to bring such technologies into production. Recognizing the fast evolution of the field of deep learning, we make no attempt to capture the design space of all possible model architectures in a domain-specific language (DSL) or similar configuration language. We allow users to write code to define their models, but provide abstractions that guide developers to write models in ways conducive to productionization. We also provide a unifying Estimator interface, making it possible to write downstream infrastructure (e.g. distributed training, hyperparameter tuning) independent of the model implementation. We balance the competing demands for flexibility and simplicity by offering APIs at different levels of abstraction, making common model architectures available out of the box, while providing a library of utilities designed to speed up experimentation with model architectures. To make out-of-the-box models flexible and usable across a wide range of problems, these canned Estimators are parameterized not only over traditional hyperparameters, but also using feature columns, a declarative specification describing how to interpret input data. We discuss our experience in using this framework in research and production environments, and show the impact on code health, maintainability, and development speed.
Comment: 8 pages, appeared at KDD 2017, August 13--17, 2017, Halifax, NS, Canada
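For concreteness, here is a minimal sketch of the canned-Estimator-plus-feature-columns pattern the abstract describes, using the public tf.estimator and tf.feature_column APIs of the TF 1.x era; the feature names and toy data are invented for illustration:

```python
# Canned Estimator parameterized by feature columns (TF 1.x-era API).
# Feature names and the toy input_fn data are invented for illustration.
import tensorflow as tf

# Declarative description of how to interpret the input data.
feature_columns = [
    tf.feature_column.numeric_column("age"),
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            "country", vocabulary_list=["US", "CA", "DE"])),
]

# A "canned" model: the architecture comes out of the box, parameterized
# by traditional hyperparameters (hidden_units) and the columns above.
estimator = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[64, 32],
    n_classes=2,
)

def input_fn():
    features = {"age": [25.0, 40.0], "country": ["US", "DE"]}
    labels = [0, 1]
    return tf.data.Dataset.from_tensor_slices((features, labels)).batch(2)

# Downstream infrastructure only needs the uniform Estimator interface,
# not the model internals.
estimator.train(input_fn=input_fn, steps=10)
```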
TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments
Deep neural networks (DNNs) have become core computation components within low-latency Function as a Service (FaaS) prediction pipelines, including image recognition, object detection, natural language processing, speech synthesis, and personalized recommendation pipelines. Cloud computing, as the de facto backbone of modern computing infrastructure for both enterprise and consumer applications, has to be able to handle user-defined pipelines of diverse DNN inference workloads while maintaining isolation and latency guarantees and minimizing resource waste. The current solution for guaranteeing isolation within FaaS is suboptimal, suffering from "cold start" latency. A major cause of this inefficiency is the need to move large amounts of model data within and across servers. We propose TrIMS as a novel solution to address these issues. Our proposed solution consists of a persistent model store across the GPU, CPU, local storage, and cloud storage hierarchy; an efficient resource management layer that provides isolation; and a succinct set of application APIs and container technologies for easy and transparent integration with FaaS, Deep Learning (DL) frameworks, and user code. We demonstrate our solution by interfacing TrIMS with the Apache MXNet framework, achieving up to 24x speedup in latency for image classification models and up to 210x speedup for large models. We achieve up to 8x system throughput improvement.
Comment: In Proceedings CLOUD 2019
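The transparent sharing the abstract describes reduces, at its core, to a cache lookup across a memory hierarchy before falling back to a cold load. The sketch below is a hypothetical illustration of that idea only, not the TrIMS API; every name in it (ModelStore, Tier, get_model, _fetch_from_cloud) is invented:

```python
# Hypothetical sketch of a persistent model store spanning a
# GPU/CPU/disk/cloud hierarchy. All names are invented for illustration;
# this is not the TrIMS API.
from enum import IntEnum

class Tier(IntEnum):
    GPU = 0      # fastest: weights already resident on the accelerator
    CPU = 1      # host memory
    DISK = 2     # local storage
    CLOUD = 3    # remote object store (the "cold start" path)

class ModelStore:
    def __init__(self):
        # tier -> {model_id: handle}; handles are shared read-only, so
        # isolation is preserved while the weights are deduplicated.
        self.tiers = {t: {} for t in Tier}

    def get_model(self, model_id):
        for tier in Tier:               # search from fastest to slowest
            handle = self.tiers[tier].get(model_id)
            if handle is not None:
                self._promote(model_id, handle, tier)
                return handle
        # Full miss: pay the cold-start cost once, then cache the result.
        handle = self._fetch_from_cloud(model_id)
        self.tiers[Tier.GPU][model_id] = handle
        return handle

    def _promote(self, model_id, handle, found_tier):
        # Copy the model up the hierarchy so the next lookup is faster.
        for tier in Tier:
            if tier < found_tier:
                self.tiers[tier][model_id] = handle

    def _fetch_from_cloud(self, model_id):
        return f"weights-for-{model_id}"  # stand-in for a real download
```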
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities which TensorFlow offers for the distributed training of large ML/DL models that need computation and communication at scale. The most commonly used distributed training approaches for TF can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X, where X = (InfiniBand Verbs, Message Passing Interface, or GPUDirect RDMA), and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters, including the Piz Daint system (No. 6 on the Top500 list). We perform experiments to gain novel insights along the following vectors: 1) application-level scalability of DNN training, 2) effect of batch size on scaling efficiency, 3) impact of the MPI library used for No-gRPC approaches, and 4) type and size of DNN architectures. Based on these experiments, we present two key insights: 1) overall, No-gRPC designs achieve better performance than gRPC-based approaches for most configurations, and 2) the performance of No-gRPC is heavily influenced by the gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits CUDA kernels and pointer caching to perform large reductions efficiently. Our proposed designs offer 5-17x better performance than NCCL2 for small and medium messages, and reduce latency by 29% for large messages. The proposed optimizations help Horovod-MPI to achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8x and 3.2x higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster.
Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer review
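A minimal sketch of the Allreduce-based ("No-gRPC") data-parallel pattern characterized above, using Horovod's public TensorFlow binding in TF 1.x style; the toy model stands in for a real DNN:

```python
# Data-parallel training with Horovod (TF 1.x style). The quadratic "model"
# is a stand-in; the Horovod calls themselves are the real public API.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU, launched e.g. via `horovodrun -np 4 ...`

# Pin each process to a single local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model standing in for a real DNN.
w = tf.Variable(1.0)
loss = tf.square(w - 3.0)

# Scale the learning rate with the worker count, then wrap the optimizer.
# DistributedOptimizer averages gradients across workers with Allreduce
# (MPI or NCCL underneath): the aggregation step the paper analyzes.
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast initial variable states from rank 0 so workers start in sync.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(100):
        sess.run(train_op)
```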
TensorQuant - A Simulation Toolbox for Deep Neural Network Quantization
Recent research implies that training and inference of deep neural networks (DNNs) can be computed with low-precision numerical representations of the training/test data, weights, and gradients without a general loss in accuracy. The benefit of such compact representations is twofold: they allow a significant reduction of the communication bottleneck in distributed DNN training and faster neural network implementations on hardware accelerators like FPGAs. Several quantization methods have been proposed to map the original 32-bit floating point problem to low-bit representations. While most related publications validate the proposed approach on a single DNN topology, it appears evident that the optimal choice of quantization method and number of coding bits is topology dependent. To date, no general theory is available that would allow users to derive the optimal quantization during the design of a DNN topology. In this paper, we present a quantization toolbox for the TensorFlow framework. TensorQuant allows a transparent quantization simulation of existing DNN topologies during training and inference. TensorQuant supports generic quantization methods and allows experimental evaluation of the impact of quantization on single layers as well as on the full topology. In a first series of experiments with TensorQuant, we show an analysis of fixed-point quantizations of popular CNN topologies.
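As a worked illustration of the fixed-point quantization being simulated, here is a generic sketch (not TensorQuant's own implementation; the bit-allocation parameters are assumptions):

```python
# Generic fixed-point quantization of the kind simulated by tools like
# TensorQuant: values stay float32, but are rounded to what a signed
# fixed-point format with `int_bits` integer and `frac_bits` fractional
# bits could represent. Illustration only, not TensorQuant's code.
import numpy as np

def fixed_point_quantize(x, int_bits=4, frac_bits=12):
    """Round x to a signed fixed-point grid with the given bit allocation."""
    scale = 2.0 ** frac_bits                      # grid step is 1/scale
    max_val = 2.0 ** int_bits - 1.0 / scale       # largest representable value
    q = np.round(x * scale) / scale               # snap to the grid
    return np.clip(q, -2.0 ** int_bits, max_val)  # saturate on overflow

weights = np.random.randn(3, 3).astype(np.float32)
q_weights = fixed_point_quantize(weights, int_bits=2, frac_bits=6)
# Absent saturation, the rounding error is bounded by half a grid step, 2**-7.
print(np.abs(weights - q_weights).max())
```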