Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning (ML/DL) framework.
However, little exists in the literature that provides a thorough understanding
of the capabilities which TensorFlow offers for the distributed training of
large ML/DL models that require computation and communication at scale. The most commonly used distributed training approaches for TensorFlow (TF) can be categorized as
follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters, including the Piz Daint system (ranked 6th on the Top500 list). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of batch size on scaling efficiency, 3) Impact of the MPI library used for No-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs achieve better performance than gRPC-based approaches for most configurations, and 2) The performance of No-gRPC designs is heavily influenced by
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better performance than NCCL2 for small and medium messages, and reduce latency by 29% for large messages. The proposed optimizations help Horovod-MPI achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster.
Comment: 10 pages, 9 figures; submitted to IEEE IPDPS 2019 for peer review
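For context on where the Allreduce-based gradient aggregation sits in user code, here is a minimal sketch of data-parallel training with Horovod over MPI (a No-gRPC approach). The toy model, hyperparameters, and launch setup are illustrative placeholders, not taken from the paper, and the sketch assumes Horovod's TF1-era API.

```python
# Minimal sketch of Horovod-over-MPI data-parallel training (a No-gRPC
# approach); the toy model and hyperparameters are illustrative only.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one MPI rank per GPU, e.g. launched as `mpirun -np 64 python train.py`

# Pin each rank to its own GPU so ranks do not contend for devices.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy stand-in for a real model such as ResNet-50.
images = tf.random_normal([32, 224, 224, 3])
labels = tf.random_uniform([32], maxval=10, dtype=tf.int32)
logits = tf.layers.dense(tf.layers.flatten(images), 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)

# Scale the learning rate with the number of ranks (a common large-batch recipe).
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)

# DistributedOptimizer inserts an Allreduce over the gradients of all ranks;
# this Allreduce is exactly where the MPI vs. NCCL2 choice studied above matters.
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())

# Broadcast rank 0's initial weights so every rank starts from the same state.
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=100)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```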
TensorFlow Estimators: Managing Simplicity vs. Flexibility in High-Level Machine Learning Frameworks
We present a framework for specifying, training, evaluating, and deploying
machine learning models. Our focus is on simplifying cutting-edge machine learning for practitioners in order to bring such technologies into production.
Recognizing the fast evolution of the field of deep learning, we make no
attempt to capture the design space of all possible model architectures in a
domain-specific language (DSL) or similar configuration language. We allow users to write code to define their models, but provide abstractions that guide developers to write models in ways conducive to productionization. We also
provide a unifying Estimator interface, making it possible to write downstream
infrastructure (e.g. distributed training, hyperparameter tuning) independent
of the model implementation. We balance the competing demands for flexibility
and simplicity by offering APIs at different levels of abstraction, making
common model architectures available out of the box, while providing a library
of utilities designed to speed up experimentation with model architectures. To
make out-of-the-box models flexible and usable across a wide range of problems,
these canned Estimators are parameterized not only over traditional
hyperparameters, but also using feature columns, a declarative specification
describing how to interpret input data. We discuss our experience in using this
framework in research and production environments, and show the impact on code health, maintainability, and development speed.
Comment: 8 pages. Appeared at KDD 2017, August 13-17, 2017, Halifax, NS, Canada
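As a rough illustration of the interface described above, the sketch below builds a canned Estimator parameterized by feature columns; the feature names, vocabulary, and toy data are placeholders, not drawn from the paper.

```python
# Minimal sketch of a canned Estimator parameterized by feature columns;
# feature names, vocabulary, and data here are illustrative placeholders.
import tensorflow as tf

# Feature columns declaratively specify how raw input features are interpreted.
age = tf.feature_column.numeric_column("age")
city = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(
        "city", ["NYC", "SF", "LA"]))

def input_fn():
    # Tiny in-memory dataset standing in for a real input pipeline.
    features = {"age": [[23.0], [35.0], [41.0]],
                "city": [["NYC"], ["LA"], ["SF"]]}
    labels = [0, 1, 1]
    return tf.data.Dataset.from_tensor_slices((features, labels)).batch(3)

# A canned Estimator: the model function is supplied by the library, and
# downstream infrastructure (distributed training, hyperparameter tuning)
# sees only the uniform Estimator interface.
model = tf.estimator.DNNClassifier(feature_columns=[age, city],
                                   hidden_units=[16, 8])
model.train(input_fn=input_fn, steps=10)
```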
Towards quantum-chemical method development for arbitrary basis functions
We present the design of a flexible quantum-chemical method development
framework, which supports employing any type of basis function. This design has
been implemented in the light-weight program package molsturm, yielding a
basis-function-independent self-consistent field scheme. Versatile interfaces, making use of open standards like Python, mediate the integration of molsturm
with existing third-party packages. In this way, both rapid extension of the present set of electronic structure methods and the addition of new basis function types can be readily achieved. This makes molsturm well-suited for testing novel approaches for discretising the electronic wave function and allows them to be compared with existing methods using the same software stack. This is illustrated by two examples, an implementation of coupled-cluster doubles and a gradient-free geometry optimisation, in both of which arbitrary basis functions can be used. molsturm is open-source and can be obtained from https://molsturm.org.
Comment: 15 pages and 7 figures
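To make the Python-mediated workflow concrete, here is a hypothetical sketch of an SCF calculation feeding a coupled-cluster doubles step. The function and keyword names are assumptions for illustration only, not molsturm's documented API; consult https://molsturm.org for the actual interface.

```python
# Hypothetical sketch only: function and keyword names below are assumed for
# illustration and may not match molsturm's real API (see https://molsturm.org).
import molsturm  # module name assumed to match the package name

# SCF ground state; because the scheme is basis-function-independent, swapping
# Gaussians for another basis family would (by assumption) change only these
# keyword arguments, not the rest of the workflow.
scf = molsturm.hartree_fock(        # assumed entry point
    atoms=["Be"],                   # assumed system specification
    basis_type="gaussian",          # assumed selector for the basis family
    basis_set_name="cc-pvdz",       # assumed basis-set identifier
)

# Layer a post-HF method on top of the converged SCF state, mirroring the
# coupled-cluster doubles example mentioned in the abstract.
ccd = molsturm.posthf.ccd(scf)      # assumed interface
print(ccd["energy_ground_state"])   # assumed result key
```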