Optimized dynamical control of state transfer through noisy spin chains
We propose a method of optimally controlling the tradeoff of speed and
fidelity of state transfer through a noisy quantum channel (spin-chain). This
process is treated as qubit state-transfer through a fermionic bath. We show
that dynamical modulation of the boundary qubits' levels can ensure state
transfer with the best tradeoff of speed and fidelity. This is achievable by
dynamically optimizing the transmission spectrum of the channel. The resulting
optimal control is robust against both static and fluctuating noise in the
channel's spin-spin couplings. It may also facilitate transfer in the presence
of diagonal disorder (on-site energy noise) in the channel.
Comment: 20 pages, 5 figures. arXiv admin note: text overlap with arXiv:1310.162
A context-based geoprocessing framework for optimizing meetup location of multiple moving objects along road networks
Given different types of constraints on human life, people must make
decisions that satisfy social activity needs. Minimizing costs (e.g., distance,
time, or money) associated with travel plays an important role in perceived and
realized social quality of life. Identifying optimal interaction locations on
road networks when there are multiple moving objects (MMO) with space-time
constraints remains a challenge. In this research, we formalize the problem of
finding dynamic ideal interaction locations for MMO as a spatial optimization
model and introduce a context-based geoprocessing heuristic framework to
address this problem. As a proof of concept, a case study involving
identification of a meetup location for multiple people under traffic
conditions is used to validate the proposed geoprocessing framework. Five
heuristic methods for reducing the shortest-path search space have been
tested. We find that the R*-tree-based algorithm performs best, with
high-quality solutions and low computation time. This framework is implemented in a
GIS environment to facilitate integration with external geographic contextual
information, e.g., temporary road barriers, points of interest (POI), and
real-time traffic information, when dynamically searching for ideal meetup
sites. The proposed method can be applied in trip planning, carpooling
services, collaborative interaction, and logistics management.
Comment: 34 pages, 8 figures
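The underlying objective can be illustrated with a brute-force baseline (illustrative only: the toy graph, function names, and exhaustive search below are stand-ins, not the paper's R*-tree-based heuristic). Every candidate node is scored by the total shortest-path travel time of all moving objects:

```python
import heapq

def dijkstra(adj, src):
    """Shortest travel times from src to every node of a weighted graph."""
    dist, pq = {src: 0.0}, [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def best_meetup(adj, candidates, origins):
    """Candidate site minimizing total travel time over all moving objects."""
    dists = [dijkstra(adj, o) for o in origins]
    return min(candidates, key=lambda n: sum(d.get(n, float("inf")) for d in dists))

# Toy road network: undirected edges labeled with travel times.
edges = [("A", "B", 2), ("B", "C", 3), ("A", "C", 7), ("C", "D", 1), ("B", "D", 5)]
adj = {}
for u, v, w in edges:
    adj.setdefault(u, []).append((v, w))
    adj.setdefault(v, []).append((u, w))

print(best_meetup(adj, ["A", "B", "C", "D"], ["A", "D", "B"]))  # "B"
```

The heuristics evaluated in the paper aim to avoid exactly this exhaustive search by pruning the shortest-path search space.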
PC Clusters for Lattice QCD
In the last several years, tightly coupled PC clusters have become widely
applied, cost-effective resources for lattice gauge computations. This paper
discusses the practice of building such clusters, in particular balanced design
requirements. I review and quantify the improvements over time of key
performance parameters and overall price to performance ratio. Applying these
trends and technology forecasts given by computer equipment manufacturers, I
predict the range of price to performance for lattice codes expected in the
next several years.
Comment: Talk presented at Lattice 2004 (plenary), Fermilab, June 21-26, 2004.
7 pages, 4 figures. v2 - clarified SIMD coding discussion and reference
Distributed Deep Learning Using Synchronous Stochastic Gradient Descent
We design and implement a distributed multinode synchronous SGD algorithm,
without altering hyperparameters, compressing data, or altering algorithmic
behavior. We perform a detailed analysis of scaling, and identify optimal
design points for different networks. We demonstrate scaling of CNNs on 100s of
nodes, and present what we believe to be record training throughputs. A
512-minibatch VGG-A CNN training run is scaled 90X on 128 nodes, and
256-minibatch VGG-A and OverFeat-FAST networks are scaled 53X and 42X,
respectively, on a 64-node cluster. We also demonstrate the generality of our approach via
best-in-class 6.5X scaling for a 7-layer DNN on 16 nodes. Thereafter we attempt
to democratize deep learning by training on an Ethernet-based AWS cluster and
show ~14X scaling on 16 nodes.
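The core of the approach — gradients averaged across nodes and applied identically everywhere, with hyperparameters untouched — can be sketched in a toy single-process simulation (the data, shard split, and function names below are illustrative assumptions, not the paper's multinode code):

```python
# Each "node" computes a gradient on its data shard; the gradients are
# averaged (the allreduce step) and every node applies the same update,
# so the result matches a single large-batch SGD step.

def grad(w, shard):
    """Mean gradient of 0.5*(w*x - y)^2 over one data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def sync_sgd_step(w, shards, lr):
    g = sum(grad(w, s) for s in shards) / len(shards)  # allreduce average
    return w - lr * g

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
shards = [data[:2], data[2:]]  # two simulated nodes
w = 0.0
for _ in range(200):
    w = sync_sgd_step(w, shards, lr=0.02)
print(round(w, 3))  # converges toward the true slope 2.0
```

Because the averaged gradient equals the full-batch gradient (equal shard sizes), the algorithm's convergence behavior is unchanged; only communication cost grows with node count.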
Non-Orthogonal Multiple Access for Air-to-Ground Communication
This paper investigates ground-aerial uplink non-orthogonal multiple access
(NOMA) cellular networks. A rotary-wing unmanned aerial vehicle (UAV) user and
multiple ground users (GUEs) are served by ground base stations (GBSs) by
utilizing the uplink NOMA protocol. The UAV is dispatched to upload specific
information bits to each target GBS. Specifically, our goal is to minimize the
UAV mission completion time by jointly optimizing the UAV trajectory and
UAV-GBS association order while taking into account the UAV's interference to
non-associated GBSs. The formulated problem is a mixed-integer non-convex
problem involving infinitely many variables. To tackle this problem, we efficiently
check the feasibility of the formulated problem by utilizing graph theory and
topology theory. Next, we prove that the optimal UAV trajectory needs to
satisfy the \emph{fly-hover-fly} structure. With this insight, we first design
an efficient solution with predefined hovering locations by leveraging graph
theory techniques. Furthermore, we propose an iterative UAV trajectory design
by applying successive convex approximation (SCA) technique, which is
guaranteed to converge to a locally optimal solution. We demonstrate that the
two proposed designs exhibit polynomial time complexity. Finally, numerical
results show that: 1) the SCA based design outperforms the fly-hover-fly based
design; 2) the UAV mission completion time is significantly minimized with
proposed NOMA schemes compared with the orthogonal multiple access (OMA)
scheme; 3) increasing the GUEs' quality of service (QoS) requirements
increases the UAV mission completion time.
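The fly-hover-fly structure makes the completion time for a fixed association order straightforward to evaluate: flight legs at maximum speed between hovering locations, plus a hover at each location long enough to upload the required bits. A toy sketch (the coordinates, per-location rates, and function below are illustrative assumptions, not the paper's channel model):

```python
import math

def completion_time(start, hover_pts, bits, rates, speed):
    """Fly-hover-fly mission time: straight-line flight legs at max speed,
    plus hovering at each location to upload that location's bits at its
    (assumed fixed) uplink rate."""
    t, pos = 0.0, start
    for p, b, r in zip(hover_pts, bits, rates):
        t += math.dist(pos, p) / speed  # flight leg
        t += b / r                      # hovering upload time
        pos = p
    return t

t = completion_time(start=(0, 0), hover_pts=[(0, 3), (4, 3)],
                    bits=[6, 8], rates=[2, 4], speed=2)
print(t)  # 1.5 + 3 + 2 + 2 = 8.5
```

The optimization in the paper searches over the hovering locations and the visiting (association) order that minimize this kind of total.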
Optimizing Deep Learning Recommender Systems' Training On CPU Cluster Architectures
During the last two years, the goal of many researchers has been to squeeze
the last bit of performance out of HPC systems for AI tasks. Often this
discussion is held in the context of how fast ResNet50 can be trained.
Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus,
we focus on Recommender Systems which account for most of the AI cycles in
cloud computing centers. More specifically, we focus on Facebook's DLRM
benchmark. By enabling it to run on the latest CPU hardware and software tailored
for HPC, we are able to achieve a more than two-orders-of-magnitude improvement
in performance (110x) on a single socket compared to the reference CPU
implementation, and high scaling efficiency up to 64 sockets, while fitting
ultra-large datasets. This paper discusses the optimization techniques for the
various operators in DLRM and which components of the system are stressed by
these different operators. The presented techniques are applicable to a broader
set of DL workloads that pose the same scaling challenges/characteristics as
DLRM.
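As a rough sketch of the kind of operator at stake (an illustrative pure-Python stand-in, not DLRM code), the sparse embedding-bag lookup gathers rows of a large table for each bag of indices and sum-pools them, which makes it memory-bandwidth bound rather than compute bound:

```python
def embedding_bag(table, bags):
    """Gather the table rows listed in each bag and sum-pool them into one
    dense vector per bag; memory traffic is dominated by the row gathers."""
    pooled = []
    for bag in bags:
        vec = [0.0] * len(table[0])
        for idx in bag:
            vec = [a + b for a, b in zip(vec, table[idx])]
        pooled.append(vec)
    return pooled

table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # 4 rows, dim 2
print(embedding_bag(table, [[0, 2], [3]]))  # [[6.0, 8.0], [7.0, 8.0]]
```

In production-scale recommenders the table has billions of rows, which is why this operator, not the dense MLP, stresses the memory subsystem.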
OD-SGD: One-step Delay Stochastic Gradient Descent for Distributed Training
The training of modern deep learning neural networks calls for large amounts
of computation, which is often provided by GPUs or other specific accelerators.
To scale out to achieve faster training speed, two update algorithms are mainly
applied in the distributed training process, i.e. the Synchronous SGD algorithm
(SSGD) and the Asynchronous SGD algorithm (ASGD). SSGD reaches a good
convergence point, but its training speed is limited by the synchronous
barrier. ASGD trains faster, but its convergence point is worse than SSGD's.
To combine the strengths of the two, we propose a novel method named One-step
Delay SGD (OD-SGD), which achieves a convergence point similar to SSGD's at a
training speed close to ASGD's. To the best of our knowledge,
we make the first attempt to combine the features of SSGD and ASGD to improve
distributed training performance. Each iteration of OD-SGD contains a global
update in the parameter-server node and local updates in the worker nodes; the
local update is introduced to update and compensate for the delayed local weights.
We evaluate our proposed algorithm on MNIST, CIFAR-10 and ImageNet datasets.
Experimental results show that OD-SGD can obtain similar or even slightly
better accuracy than SSGD, while its training speed is much faster, which even
exceeds the training speed of ASGD.
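The one-step-delay idea can be sketched in a toy single-worker simulation (illustrative only; the actual algorithm runs a parameter server with many workers, and the objective below is an assumed quadratic): the global update applies the gradient computed one iteration earlier, so no worker waits at a synchronous barrier, while a local update compensates for the delay.

```python
def grad(w):
    return 2.0 * (w - 3.0)  # gradient of (w - 3)^2, minimum at w = 3

lr = 0.1
w_global = 0.0   # weights held by the parameter server
w_local = 0.0    # weights the worker computes on
delayed_g = 0.0  # gradient from the previous iteration
for _ in range(100):
    g = grad(w_local)                     # worker computes on local weights
    w_global = w_global - lr * delayed_g  # server applies delayed gradient
    w_local = w_global - lr * g           # local update compensates the delay
    delayed_g = g
print(round(w_local, 4))  # approaches the optimum at 3.0
```

Even with the one-step-stale global update, the local compensation keeps the iterates converging, which is the behavior the paper reports on MNIST, CIFAR-10, and ImageNet.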
The Competition for Shortest Paths on Sparse Graphs
Optimal paths connecting randomly selected network nodes and fixed routers
are studied analytically in the presence of non-linear overlap cost that
penalizes congestion. Routing becomes increasingly more difficult as the number
of selected nodes increases and exhibits ergodicity breaking in the case of
multiple routers. A distributed linearly-scalable routing algorithm is devised.
The ground state of such systems reveals non-monotonic complex behaviors in
both average path-length and algorithmic convergence, depending on the network
topology and the densities of communicating nodes and routers.
Comment: 4 pages, 4 figures
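A simple greedy stand-in (not the paper's analytical, message-passing approach) shows how a non-linear overlap cost reroutes later paths around congestion:

```python
import heapq

def dijkstra_path(adj, weight, src, dst):
    """Shortest path under the current load-dependent edge weights."""
    dist, prev, pq = {src: 0.0}, {}, [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v in adj[u]:
            nd = d + weight(u, v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, u = [dst], dst
    while u != src:
        u = prev[u]
        path.append(u)
    return path[::-1]

def route_all(adj, pairs, alpha=3.0):
    """Route pairs one by one; traversing a loaded edge costs 1 + alpha*load,
    so later paths detour around congested edges."""
    load, paths = {}, []
    for s, t in pairs:
        w = lambda u, v: 1.0 + alpha * load.get(frozenset((u, v)), 0)
        p = dijkstra_path(adj, w, s, t)
        for u, v in zip(p, p[1:]):
            load[frozenset((u, v))] = load.get(frozenset((u, v)), 0) + 1
        paths.append(p)
    return paths

adj = {"S": ["T", "A"], "T": ["S", "A"], "A": ["S", "T"]}
paths = route_all(adj, [("S", "T"), ("S", "T")])
print(paths)  # the second path detours via A to avoid the loaded S-T edge
```

This sequential greedy scheme is myopic; the point of the paper's analysis is to characterize (and approach) the true ground state of the joint routing problem, where such greedy assignments can get stuck.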
8-Bit Approximations for Parallelism in Deep Learning
The creation of practical deep learning data-products often requires
parallelization across processors and computers to make deep learning feasible
on large data sets, but bottlenecks in communication bandwidth make it
difficult to attain good speedups through parallelism. Here we develop and test
8-bit approximation algorithms which make better use of the available bandwidth
by compressing 32-bit gradients and nonlinear activations to 8-bit
approximations. We show that these approximations do not decrease predictive
performance on MNIST, CIFAR10, and ImageNet for both model and data parallelism
and provide a data transfer speedup of 2x relative to 32-bit parallelism. We
build a predictive model for speedups based on our experimental data, verify
its validity on known speedup data, and show that we can obtain a speedup of
50x and more on a system of 96 GPUs compared to a speedup of 23x for 32-bit. We
compare our data types with other methods and show that 8-bit approximations
achieve state-of-the-art speedups for model parallelism. Thus 8-bit
approximation is an efficient method to parallelize convolutional networks on
very large systems of GPUs.
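A minimal linear-scaling 8-bit quantizer (a simplified stand-in, not the paper's 8-bit data types) already conveys the bandwidth/precision tradeoff: each 32-bit value is sent as one byte plus a shared scale, at the cost of a bounded approximation error.

```python
def quantize8(values):
    """Linearly map floats into signed 8-bit codes sharing one scale
    (a simplified stand-in for the paper's 8-bit approximation types)."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    return [round(v / scale) for v in values], scale

def dequantize8(codes, scale):
    return [c * scale for c in codes]

grads = [0.5, -0.2, 0.1, -0.05]  # stand-in 32-bit gradient values
codes, scale = quantize8(grads)  # 1 byte each instead of 4
approx = dequantize8(codes, scale)
err = max(abs(a - g) for a, g in zip(approx, grads))
print(codes)  # [127, -51, 25, -13]; error stays within half a quantization step
```

With this scheme the transfer volume drops 4x; the paper's data types refine the mapping so that the compressed gradients and activations leave predictive performance unchanged.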
Quicksilver: Fast Predictive Image Registration - a Deep Learning Approach
This paper introduces Quicksilver, a fast deformable image registration
method. Quicksilver registration for image-pairs works by patch-wise prediction
of a deformation model based directly on image appearance. A deep
encoder-decoder network is used as the prediction model. While the prediction
strategy is general, we focus on predictions for the Large Deformation
Diffeomorphic Metric Mapping (LDDMM) model. Specifically, we predict the
momentum-parameterization of LDDMM, which facilitates a patch-wise prediction
strategy while maintaining the theoretical properties of LDDMM, such as
guaranteed diffeomorphic mappings for sufficiently strong regularization. We
also provide a probabilistic version of our prediction network which can be
sampled during the testing time to calculate uncertainties in the predicted
deformations. Finally, we introduce a new correction network which greatly
increases the prediction accuracy of an already existing prediction network. We
show experimental results for uni-modal atlas-to-image as well as uni-/multi-modal
image-to-image registrations. These experiments demonstrate that our
method accurately predicts registrations obtained by numerical optimization, is
very fast, achieves state-of-the-art registration results on four standard
validation datasets, and can jointly learn an image similarity measure.
Quicksilver is freely available as open-source software.
Comment: Add new discussion
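The patch-wise prediction strategy can be sketched on a 1-D toy signal (illustrative only; Quicksilver applies a deep encoder-decoder to 3-D image patches to predict LDDMM momenta): a predictor runs on overlapping patches and the per-position outputs are averaged wherever patches overlap.

```python
def predict_patchwise(signal, patch, stride, predictor):
    """Apply `predictor` to overlapping patches and average the per-position
    outputs wherever patches overlap."""
    out = [0.0] * len(signal)
    cnt = [0] * len(signal)
    for i in range(0, len(signal) - patch + 1, stride):
        pred = predictor(signal[i:i + patch])
        for j, v in enumerate(pred):
            out[i + j] += v
            cnt[i + j] += 1
    return [o / c for o, c in zip(out, cnt)]

# Stand-in "predictor" doubles each value; in Quicksilver this role is
# played by the encoder-decoder network predicting the momentum field.
print(predict_patchwise([1.0, 2.0, 3.0, 4.0], patch=2, stride=1,
                        predictor=lambda p: [2 * v for v in p]))
# [2.0, 4.0, 6.0, 8.0]
```

Predicting the momentum rather than the deformation itself is what lets this patch-wise scheme preserve LDDMM's diffeomorphic guarantees.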