Distributed Training Large-Scale Deep Architectures
Scale of data and scale of computation infrastructures together enable the
current deep learning renaissance. However, training large-scale deep
architectures demands both algorithmic improvement and careful system
configuration. In this paper, we focus on employing the system approach to
speed up large-scale training. Drawing on lessons learned from our routine
benchmarking efforts, we first identify the bottlenecks and overheads that hinder
data parallelism. We then devise guidelines that help practitioners
configure an effective system and fine-tune parameters to achieve desired
speedup. Specifically, we develop a procedure for setting minibatch size and
choosing computation algorithms. We also derive lemmas for determining the
quantity of key components such as the number of GPUs and parameter servers.
Experiments and examples show that these guidelines effectively speed up
large-scale deep learning training.
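To illustrate the kind of capacity reasoning such guidelines involve, here is a minimal Python sketch. The linear minibatch-scaling rule and the bandwidth-based parameter-server count below are common heuristics assumed for illustration; they are not the paper's actual lemmas, and all names and thresholds are hypothetical.

```python
# Illustrative sketch of sizing a data-parallel training system.
# The scaling rule and bandwidth model are assumptions, not the
# authors' derived lemmas.

def minibatch_size(base_batch: int, num_gpus: int, max_batch: int) -> int:
    """Scale the minibatch linearly with GPU count, capped at the
    largest batch that still converges well (a common heuristic)."""
    return min(base_batch * num_gpus, max_batch)

def num_parameter_servers(model_bytes: float,
                          updates_per_sec: float,
                          server_bandwidth_bytes: float) -> int:
    """Choose enough parameter servers that the aggregate push/pull
    traffic per second fits within their combined bandwidth."""
    traffic = model_bytes * updates_per_sec * 2   # one push + one pull
    servers = -(-traffic // server_bandwidth_bytes)  # ceiling division
    return max(1, int(servers))

if __name__ == "__main__":
    # Example: 8 GPUs, a 250 MB model, 10 updates/s, 1 GB/s per server.
    print(minibatch_size(base_batch=32, num_gpus=8, max_batch=1024))  # 256
    print(num_parameter_servers(250e6, 10.0, 1e9))                    # 5
```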
GPU Acceleration of Image Convolution using Spatially-varying Kernel
Image subtraction in astronomy is a tool for discovering transient objects such
as asteroids, extra-solar planets, and supernovae. To match the point spread
functions (PSFs) of images of the same field taken at different times, a
convolution technique is used. A computationally intensive spatially-varying
kernel is particularly suitable for large-scale images. The underlying algorithm
is inherently massively parallel because a unique kernel is generated at every
pixel location. The spatially-varying kernel cannot be computed efficiently
through the Convolution Theorem, and thus does not lend itself to acceleration
by the Fast
Fourier Transform (FFT). This work presents results of an accelerated
implementation of spatially-varying kernel image convolution on multi-core CPUs
with OpenMP and on graphics processing units (GPUs). Typical speedups were a
factor of 50 over ANSI C and a factor of 1000 over the initial IDL
implementation, demonstrating that these techniques are a practical,
high-impact path to terabyte-per-night image pipelines and petascale processing.
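To make the algorithm concrete, here is a minimal NumPy sketch of direct convolution with a spatially-varying kernel. Because the kernel changes at every pixel, the Convolution Theorem does not apply and each output pixel needs its own kernel. The position-dependent Gaussian is purely an illustrative stand-in for the PSF-matching kernels of the paper; the per-pixel loop is what maps naturally onto one GPU thread per output pixel.

```python
import numpy as np

def varying_kernel(y: int, x: int, shape: tuple, half: int) -> np.ndarray:
    """Gaussian kernel whose width grows from image center to edge
    (an assumed spatial variation, for illustration only)."""
    h, w = shape
    r = np.hypot(y - h / 2, x - w / 2) / np.hypot(h / 2, w / 2)
    sigma = 0.5 + 2.0 * r
    ax = np.arange(-half, half + 1)
    yy, xx = np.meshgrid(ax, ax, indexing="ij")
    k = np.exp(-(yy**2 + xx**2) / (2 * sigma**2))
    return k / k.sum()

def convolve_varying(img: np.ndarray, half: int = 2) -> np.ndarray:
    """Direct convolution with a per-pixel kernel; no FFT shortcut."""
    pad = np.pad(img, half, mode="reflect")
    out = np.empty_like(img, dtype=float)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            k = varying_kernel(y, x, img.shape, half)
            out[y, x] = np.sum(pad[y:y + 2*half + 1, x:x + 2*half + 1] * k)
    return out
```

Since every output pixel is independent, a GPU port simply replaces the two loops with a 2D thread grid, which is why the method parallelizes so well despite being FFT-unfriendly.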
BioEM: GPU-accelerated computing of Bayesian inference of electron microscopy images
In cryo-electron microscopy (EM), molecular structures are determined from
large numbers of projection images of individual particles. To harness the full
power of this single-molecule information, we use the Bayesian inference of EM
(BioEM) formalism. By ranking structural models using posterior probabilities
calculated for individual images, BioEM in principle addresses the challenge of
working with highly dynamic or heterogeneous systems not easily handled in
traditional EM reconstruction. However, the calculation of these posteriors for
large numbers of particles and models is computationally demanding. Here we
present highly parallelized, GPU-accelerated computer software that performs
this task efficiently. Our flexible formulation employs CUDA, OpenMP, and MPI
parallelization combined with both CPU and GPU computing. The resulting BioEM
software scales nearly ideally both on pure CPU and on CPU+GPU architectures,
thus enabling Bayesian analysis of tens of thousands of images in a reasonable
time. The general mathematical framework and robust algorithms are not limited
to cryo-electron microscopy but can be generalized to electron tomography and
other imaging experiments.
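A schematic NumPy version of this Bayesian ranking is sketched below: each structural model is scored by the posterior probability of the observed particle images under a Gaussian pixel-noise likelihood. Representing each model by a single pre-computed projection is a loud simplifying assumption; the real BioEM posterior also integrates over orientations, offsets, and normalization parameters.

```python
import numpy as np

def log_likelihood(image: np.ndarray, projection: np.ndarray,
                   sigma: float) -> float:
    """Gaussian white-noise log-likelihood of one image given one
    model projection (simplified noise model, assumed here)."""
    resid = image - projection
    n = resid.size
    return (-0.5 * np.sum(resid**2) / sigma**2
            - n * np.log(sigma * np.sqrt(2 * np.pi)))

def model_posteriors(images, projections_per_model, sigma=1.0):
    """Posterior over models with a uniform prior: sum the image
    log-likelihoods per model, then normalize via a softmax."""
    logp = np.array([
        sum(log_likelihood(img, proj, sigma) for img in images)
        for proj in projections_per_model
    ])
    logp -= logp.max()   # subtract the max for numerical stability
    p = np.exp(logp)
    return p / p.sum()
```

Because the per-image likelihoods are independent, this sum over tens of thousands of images is exactly the kind of workload that distributes cleanly across MPI ranks, OpenMP threads, and CUDA kernels, as the BioEM software does.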