Deep Learning Inference Frameworks Benchmark
Deep learning (DL) has been widely adopted in recent years, but it is a
computationally intensive method. Therefore, scientists have proposed diverse
optimizations to accelerate its predictions for end-user applications. However, no single
inference framework currently dominates in terms of performance. This paper
takes a holistic approach to conduct an empirical comparison and analysis of
four representative DL inference frameworks. First, given a selection of
CPU-GPU configurations, we show that for a specific DL framework, different
configurations of its settings may have a significant impact on prediction
speed, memory usage, and computing power. Second, to the best of our knowledge, this
study is the first to identify opportunities for accelerating ensembles
of co-located models on the same GPU. This measurement study provides an
in-depth empirical comparison and analysis of four representative DL frameworks
and offers practical guidance for service providers to deploy and deliver DL
predictions.
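The paper's measurement harness is not reproduced here; the following is a minimal sketch of how prediction latency could be timed for a framework under one configuration. The function and parameter names are illustrative, and the prediction call is wrapped in a framework-agnostic callable.

```python
import time
import statistics


def benchmark_inference(predict_fn, batch, warmup=10, runs=100):
    """Time a prediction callable over repeated runs (illustrative sketch).

    predict_fn: any framework's inference call wrapped as a Python callable.
    batch:      a pre-built input batch matching the model's expected shape.
    """
    for _ in range(warmup):                 # warm up caches, JIT, GPU kernels
        predict_fn(batch)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(batch)
        latencies.append(time.perf_counter() - start)
    return {
        "mean_ms": 1000.0 * statistics.mean(latencies),
        "p95_ms": 1000.0 * sorted(latencies)[int(0.95 * runs) - 1],
    }
```

Repeating such a measurement across batch sizes, thread counts, and CPU/GPU placements yields the kind of configuration comparison the abstract describes.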
Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs
Deep learning frameworks have been widely deployed on GPU servers for deep
learning applications in both academia and industry. In training deep neural
networks (DNNs), there are many standard processes or algorithms, such as
convolution and stochastic gradient descent (SGD), but the running performance
of different frameworks can differ even when running the same deep model on
the same GPU hardware. In this study, we evaluate the running performance of
four state-of-the-art distributed deep learning frameworks (i.e., Caffe-MPI,
CNTK, MXNet, and TensorFlow) over single-GPU, multi-GPU, and multi-node
environments. We first build performance models of standard processes in
training DNNs with SGD, and then we benchmark the running performance of these
frameworks with three popular convolutional neural networks (i.e., AlexNet,
GoogleNet, and ResNet-50). After that, we analyze which factors result in
the performance gap among these four frameworks. Through both analytical and
experimental analysis, we identify bottlenecks and overheads which could be
further optimized. The main contribution is that the proposed performance
models and the analysis provide further optimization directions in both
algorithmic design and system configuration.
Comment: Published at DataCom'201
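As a rough illustration of the kind of per-iteration performance model the abstract refers to for synchronous SGD, the toy model below sums stage times and lets a fraction of the gradient communication overlap with the backward pass; all stage names and numbers are hypothetical, not the paper's fitted parameters.

```python
def iteration_time(t_io, t_forward, t_backward, t_comm, overlap=0.0):
    """Toy per-iteration time model for synchronous SGD (illustrative sketch).

    A fraction `overlap` of the gradient communication is assumed to be
    hidden behind the backward pass, as pipelined all-reduce allows; the
    remainder is added to the critical path.
    """
    hidden = min(t_comm * overlap, t_backward)
    return t_io + t_forward + t_backward + (t_comm - hidden)


# Hypothetical multi-node iteration (times in ms) where half of the
# all-reduce overlaps with the backward pass.
print(iteration_time(t_io=5.0, t_forward=30.0, t_backward=60.0,
                     t_comm=80.0, overlap=0.5))  # -> 135.0
```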
Characterizing Deep-Learning I/O Workloads in TensorFlow
The performance of Deep-Learning (DL) computing frameworks relies on the
performance of data ingestion and checkpointing. In fact, during training,
a considerably large number of relatively small files are first loaded and
pre-processed on CPUs and then moved to accelerators for computation. In
addition, checkpointing and restart operations are carried out to allow DL
computing frameworks to restart quickly from a checkpoint. Because of this, I/O
affects the performance of DL applications. In this work, we characterize the
I/O performance and scaling of TensorFlow, an open-source programming framework
developed by Google and specifically designed for solving DL problems. To
measure TensorFlow I/O performance, we first design a micro-benchmark to
measure TensorFlow reads, and then use a TensorFlow mini-application based on
AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow.
To improve the checkpointing performance, we design and implement a burst
buffer. We find that increasing the number of threads increases TensorFlow
bandwidth by up to 2.3x and 7.8x on our two benchmark environments. The use
of the TensorFlow prefetcher results in a complete overlap of computation on
the accelerator and the input pipeline on the CPU, eliminating the effective cost of I/O on
the overall performance. The use of a burst buffer to checkpoint to a fast
small-capacity storage and asynchronously copy the checkpoints to a slower
large-capacity storage resulted in a performance improvement of 2.6x with
respect to checkpointing directly to slower storage on our benchmark
environment.
Comment: Accepted for publication at pdsw-DISCS 201
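The prefetching behaviour described above can be approximated with the public tf.data API; the file pattern, feature schema, image size, and batch size below are placeholder assumptions rather than the paper's actual pipeline.

```python
import tensorflow as tf

# Hypothetical TFRecord shards; the feature schema below is illustrative.
files = tf.data.Dataset.list_files("train-*.tfrecord")


def parse_example(record):
    # Decode one serialized example into an (image, label) pair.
    features = tf.io.parse_single_example(record, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.io.decode_jpeg(features["image"], channels=3)
    image = tf.image.resize(image, [224, 224])   # fixed shape so batching works
    return image, features["label"]


dataset = (tf.data.TFRecordDataset(files)
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(256)
           .prefetch(tf.data.AUTOTUNE))  # overlap CPU input pipeline with accelerator compute
```

In the same spirit, the burst-buffer strategy of checkpointing to fast, small-capacity storage and draining asynchronously to slower, large-capacity storage can be sketched as follows; save_checkpoint, fast_dir, and slow_dir are assumed names, not the paper's implementation.

```python
import shutil
import threading


def checkpoint_with_burst_buffer(save_checkpoint, fast_dir, slow_dir):
    """Write a checkpoint to fast local storage, then copy it asynchronously
    to slower large-capacity storage (illustrative burst-buffer sketch)."""
    path = save_checkpoint(fast_dir)                  # synchronous but fast write
    threading.Thread(target=shutil.copy2,
                     args=(path, slow_dir),
                     daemon=True).start()             # asynchronous drain
    return path
```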
Algorithm Portfolio for Individual-based Surrogate-Assisted Evolutionary Algorithms
Surrogate-assisted evolutionary algorithms (SAEAs) are powerful optimisation
tools for computationally expensive problems (CEPs). However, a randomly
selected algorithm may fail to solve unknown problems due to the no-free-lunch
theorems, and it costs more computational resources if we re-run the
algorithm or try other algorithms to obtain a better solution, which is more serious
in CEPs. In this paper, we consider an algorithm portfolio for SAEAs to reduce
the risk of choosing an inappropriate algorithm for CEPs. We propose two
portfolio frameworks for very expensive problems in which the maximal number of
fitness evaluations is only 5 times the problem's dimension. One framework
named Par-IBSAEA runs all algorithm candidates in parallel and a more
sophisticated framework named UCB-IBSAEA employs the Upper Confidence Bound
(UCB) policy from reinforcement learning to help select the most appropriate
algorithm at each iteration. An effective reward definition is proposed for the
UCB policy. We consider three state-of-the-art individual-based SAEAs on
different problems and compare them to the portfolios built from their
instances on several benchmark problems given limited computation budgets. Our
experimental studies demonstrate that our proposed portfolio frameworks
significantly outperform any single algorithm on the set of benchmark problems
- …
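The paper's reward definition is specific to individual-based SAEAs and is not reproduced here; the sketch below shows only a generic UCB1-style selection rule over a set of algorithm candidates, with illustrative names and a per-run reward assumed to lie in [0, 1].

```python
import math


def ucb_select(rewards, counts, c=math.sqrt(2)):
    """Pick the next algorithm to run with a UCB1-style rule (illustrative).

    rewards[i]: cumulative reward observed for algorithm candidate i.
    counts[i]:  number of times candidate i has been run so far.
    """
    # Run every candidate at least once before trusting the bounds.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    scores = [rewards[i] / counts[i]
              + c * math.sqrt(math.log(total) / counts[i])
              for i in range(len(counts))]
    return max(range(len(scores)), key=scores.__getitem__)
```

At each iteration, a portfolio of this kind would run the selected SAEA for one cycle, convert the observed improvement into a reward, and update rewards and counts before the next selection.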