Fast Single Shot Detection and Pose Estimation
For applications in navigation and robotics, estimating the 3D pose of
objects is as important as detection. Many approaches to pose estimation rely
on detecting or tracking parts or keypoints [11, 21]. In this paper we build on
a recent state-of-the-art convolutional network for sliding-window detection
[10] to provide detection and rough pose estimation in a single shot, without
intermediate stages of detecting parts or initial bounding boxes. While not the
first system to treat pose estimation as a categorization problem, this is the
first attempt to combine detection and pose estimation at the same level using
a deep learning approach. The key to the architecture is a deep convolutional
network where scores for the presence of an object category, the offset for its
location, and the approximate pose are all estimated on a regular grid of
locations in the image. The resulting system is as accurate as recent work on
pose estimation (42.4% 8 View mAVP on Pascal 3D+ [21] ) and significantly
faster (46 frames per second (FPS) on a TITAN X GPU). This approach to
detection and rough pose estimation is fast and accurate enough to be widely
applied as a pre-processing step for tasks including high-accuracy pose
estimation, object tracking and localization, and vSLAM.
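As a rough illustration of the grid-based output described above, the sketch below shows per-cell predictions for class scores, location offsets, and discretized pose bins. Shapes and the decoding step are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

# Minimal sketch (assumed shapes): for each cell of an H x W grid, a
# single-shot head predicts C class scores, 4 location offsets, and
# V pose-bin scores (pose treated as categorization, e.g. V = 8 views).
H, W, C, V = 4, 4, 20, 8
rng = np.random.default_rng(0)

# Stand-in network outputs; a real model would produce these tensors.
class_scores = rng.standard_normal((H, W, C))
box_offsets = rng.standard_normal((H, W, 4))
pose_scores = rng.standard_normal((H, W, V))

def decode_cell(i, j):
    """Pick the best class and best pose bin for one grid cell."""
    cls = int(np.argmax(class_scores[i, j]))
    pose_bin = int(np.argmax(pose_scores[i, j]))
    return cls, box_offsets[i, j], pose_bin

cls, offset, pose_bin = decode_cell(0, 0)
```

Because all three outputs share one forward pass over the grid, detection and rough pose come out together in a single shot.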
Distributed Training Large-Scale Deep Architectures
Scale of data and scale of computation infrastructures together enable the
current deep learning renaissance. However, training large-scale deep
architectures demands both algorithmic improvement and careful system
configuration. In this paper, we focus on employing the system approach to
speed up large-scale training. Via lessons learned from our routine
benchmarking effort, we first identify bottlenecks and overheads that hinder
data parallelism. We then devise guidelines that help practitioners to
configure an effective system and fine-tune parameters to achieve desired
speedup. Specifically, we develop a procedure for setting minibatch size and
choosing computation algorithms. We also derive lemmas for determining the
quantity of key components such as the number of GPUs and parameter servers.
Experiments and examples show that these guidelines help effectively speed up
large-scale deep learning training.
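The trade-off the guidelines address can be sketched with a simple data-parallel speedup model. The functional form below is an assumed illustration, not the paper's exact lemmas: per-iteration time is the computation split across workers plus a fixed communication cost.

```python
# Illustrative data-parallel speedup model (assumed form): with n
# workers, one iteration costs t_compute / n for computation plus a
# communication overhead t_comm that does not shrink with n.
def speedup(n_workers, t_compute, t_comm):
    t_parallel = t_compute / n_workers + t_comm
    return t_compute / t_parallel

# Communication overhead caps the benefit of adding GPUs.
s4 = speedup(4, t_compute=100.0, t_comm=10.0)    # 100 / (25 + 10) ~= 2.86
s16 = speedup(16, t_compute=100.0, t_comm=10.0)  # 100 / (6.25 + 10) ~= 6.15
```

Under this model, 16 workers deliver well under 16x speedup, which is why quantities like the number of GPUs and parameter servers need to be chosen deliberately rather than maximized.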
Comparative Analysis of Open Source Frameworks for Machine Learning with Use Case in Single-Threaded and Multi-Threaded Modes
The basic features of some of the most versatile and popular open source
frameworks for machine learning (TensorFlow, Deep Learning4j, and H2O) are
considered and compared. Their comparative analysis was performed and
conclusions were made as to the advantages and disadvantages of these
platforms. The performance tests for the de facto standard MNIST data set were
carried out on H2O framework for deep learning algorithms designed for CPU and
GPU platforms in single-threaded and multi-threaded modes of operation.
Comment: 4 pages, 6 figures, 4 tables; XIIth International Scientific and
Technical Conference on Computer Sciences and Information Technologies (CSIT
2017), Lviv, Ukraine
Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs
Deep learning frameworks have been widely deployed on GPU servers for deep
learning applications in both academia and industry. In training deep neural
networks (DNNs), there are many standard processes or algorithms, such as
convolution and stochastic gradient descent (SGD), but the running performance
of different frameworks may differ even when running the same deep model on
the same GPU hardware. In this study, we evaluate the running performance of
four state-of-the-art distributed deep learning frameworks (i.e., Caffe-MPI,
CNTK, MXNet, and TensorFlow) over single-GPU, multi-GPU, and multi-node
environments. We first build performance models of standard processes in
training DNNs with SGD, and then we benchmark the running performance of these
frameworks with three popular convolutional neural networks (i.e., AlexNet,
GoogleNet, and ResNet-50). After that, we analyze which factors account for
the performance gap among these four frameworks. Through both analytical and
experimental analysis, we identify bottlenecks and overheads which could be
further optimized. The main contribution is that the proposed performance
models and the analysis provide further optimization directions in both
algorithmic design and system configuration.
Comment: Published at DataCom'201
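The basic measurement behind such framework comparisons is mean wall-clock time per training iteration. The helper below is a hypothetical sketch of that measurement, with `step_fn` standing in for one minibatch of training in any framework; warm-up iterations are discarded so lazy initialization and caching do not skew the mean.

```python
import time

def mean_iteration_time(step_fn, iters=10, warmup=2):
    """Time step_fn over several iterations; return mean seconds/iter."""
    for _ in range(warmup):
        step_fn()  # discard warm-up runs (lazy init, autotuning, caches)
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    return (time.perf_counter() - start) / iters

# Example with a trivial stand-in workload instead of a real SGD step.
t = mean_iteration_time(lambda: sum(range(10000)), iters=5)
```

Comparing this per-iteration time against a performance model of the standard processes (convolution, SGD, gradient exchange) is what exposes where a framework's overhead lies.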