
    Optimized dynamical control of state transfer through noisy spin chains

    We propose a method of optimally controlling the tradeoff between speed and fidelity of state transfer through a noisy quantum channel (spin chain). This process is treated as qubit state transfer through a fermionic bath. We show that dynamical modulation of the boundary qubits' levels can ensure state transfer with the best tradeoff of speed and fidelity. This is achievable by dynamically optimizing the transmission spectrum of the channel. The resulting optimal control is robust against both static and fluctuating noise in the channel's spin-spin couplings. It may also facilitate transfer in the presence of diagonal disorder (on-site energy noise) in the channel.
    Comment: 20 pages, 5 figures. arXiv admin note: text overlap with arXiv:1310.162

    A context-based geoprocessing framework for optimizing meetup location of multiple moving objects along road networks

    Given the many constraints on human life, people must make decisions that satisfy social activity needs. Minimizing the costs (i.e., distance, time, or money) associated with travel plays an important role in perceived and realized social quality of life. Identifying optimal interaction locations on road networks when there are multiple moving objects (MMO) with space-time constraints remains a challenge. In this research, we formalize the problem of finding dynamic ideal interaction locations for MMO as a spatial optimization model and introduce a context-based geoprocessing heuristic framework to address it. As a proof of concept, a case study involving identification of a meetup location for multiple people under traffic conditions is used to validate the proposed geoprocessing framework. Five heuristic methods for pruning the shortest-path search space are tested. We find that the R*-tree-based algorithm performs best, with high-quality solutions and low computation time. The framework is implemented in a GIS environment to facilitate integration with external geographic contextual information, e.g., temporary road barriers, points of interest (POI), and real-time traffic information, when dynamically searching for ideal meetup sites. The proposed method can be applied in trip planning, carpooling services, collaborative interaction, and logistics management.
    Comment: 34 pages, 8 figures
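The core of the meetup-location problem above can be illustrated with a brute-force baseline: run a shortest-path search from each traveler's origin and pick the network node that minimizes total travel cost (a 1-median). This is a minimal plain-Python sketch on an invented toy graph, not the paper's R*-tree-accelerated heuristic framework.

```python
import heapq

def dijkstra(graph, src):
    """Shortest-path distances from src over a weighted adjacency dict."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def best_meetup(graph, origins):
    """Node reachable from every origin that minimizes summed travel cost."""
    dists = [dijkstra(graph, o) for o in origins]
    common = set(graph)
    for d in dists:
        common &= set(d)
    return min(common, key=lambda n: sum(d[n] for d in dists))

# Toy undirected road network (symmetric adjacency, weights = travel cost).
roads = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
spot = best_meetup(roads, ["A", "B", "D"])
```

Real implementations prune the search space (as the paper's five heuristics do) rather than exhausting every candidate node.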

    PC Clusters for Lattice QCD

    In the last several years, tightly coupled PC clusters have become widely applied, cost-effective resources for lattice gauge computations. This paper discusses the practice of building such clusters, in particular balanced design requirements. I review and quantify the improvements over time of key performance parameters and the overall price-to-performance ratio. Applying these trends and technology forecasts given by computer equipment manufacturers, I predict the range of price-to-performance for lattice codes expected in the next several years.
    Comment: Talk presented at Lattice 2004 (plenary), Fermilab, June 21-26, 2004. 7 pages, 4 figures. v2 - clarified SIMD coding discussion and references

    Distributed Deep Learning Using Synchronous Stochastic Gradient Descent

    We design and implement a distributed multi-node synchronous SGD algorithm without altering hyperparameters, compressing data, or changing algorithmic behavior. We perform a detailed analysis of scaling and identify optimal design points for different networks. We demonstrate scaling of CNNs on hundreds of nodes and present what we believe to be record training throughputs. A VGG-A training run with a minibatch of 512 is scaled 90X on 128 nodes. Similarly, VGG-A and OverFeat-FAST networks with minibatches of 256 are scaled 53X and 42X, respectively, on a 64-node cluster. We also demonstrate the generality of our approach via best-in-class 6.5X scaling for a 7-layer DNN on 16 nodes. Finally, we attempt to democratize deep learning by training on an Ethernet-based AWS cluster and show ~14X scaling on 16 nodes.
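Per iteration, the synchronous scheme above reduces to: each worker computes a gradient on its data shard, the gradients are averaged allreduce-style, and every worker applies the identical update. A minimal single-process sketch on a 1-D least-squares problem (the dataset, learning rate, and shard split are invented for illustration):

```python
def local_grad(w, shard):
    """Gradient of mean squared error 0.5*(w*x - y)**2 over one data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def sync_sgd(shards, w=0.0, lr=0.1, steps=100):
    """Data-parallel synchronous SGD: average shard gradients each step."""
    for _ in range(steps):
        grads = [local_grad(w, s) for s in shards]  # computed in parallel in practice
        g = sum(grads) / len(grads)                 # allreduce-style average
        w -= lr * g                                 # identical update on every worker
    return w

# Toy dataset with true slope 3, split across two "workers".
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
w = sync_sgd([data[:2], data[2:]])
```

Because every worker sees the same averaged gradient, the result is bitwise equivalent to single-node SGD on the full batch, which is why no hyperparameter changes are needed.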

    Non-Orthogonal Multiple Access for Air-to-Ground Communication

    This paper investigates ground-aerial uplink non-orthogonal multiple access (NOMA) cellular networks. A rotary-wing unmanned aerial vehicle (UAV) user and multiple ground users (GUEs) are served by ground base stations (GBSs) using the uplink NOMA protocol. The UAV is dispatched to upload specific information bits to each target GBS. Specifically, our goal is to minimize the UAV mission completion time by jointly optimizing the UAV trajectory and the UAV-GBS association order while taking into account the UAV's interference to non-associated GBSs. The formulated problem is a mixed-integer non-convex problem involving infinitely many variables. To tackle it, we efficiently check the feasibility of the formulated problem using graph theory and topology. Next, we prove that the optimal UAV trajectory must satisfy the fly-hover-fly structure. With this insight, we first design an efficient solution with predefined hovering locations by leveraging graph-theoretic techniques. Furthermore, we propose an iterative UAV trajectory design based on the successive convex approximation (SCA) technique, which is guaranteed to converge to a locally optimal solution. We demonstrate that both proposed designs exhibit polynomial time complexity. Finally, numerical results show that: 1) the SCA-based design outperforms the fly-hover-fly-based design; 2) the UAV mission completion time is significantly reduced by the proposed NOMA schemes compared with the orthogonal multiple access (OMA) scheme; and 3) increasing the GUEs' quality-of-service (QoS) requirements increases the UAV mission completion time.

    Optimizing Deep Learning Recommender Systems' Training On CPU Cluster Architectures

    During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC systems for AI tasks. Often this discussion is held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus, we focus on recommender systems, which account for most of the AI cycles in cloud computing centers. More specifically, we focus on Facebook's DLRM benchmark. By enabling it to run on the latest CPU hardware and software tailored for HPC, we achieve a more than two-orders-of-magnitude improvement in performance (110x) on a single socket compared to the reference CPU implementation, and high scaling efficiency up to 64 sockets, while fitting ultra-large datasets. This paper discusses the optimization techniques for the various operators in DLRM and which components of the system are stressed by these different operators. The presented techniques are applicable to a broader set of DL workloads that pose the same scaling challenges and characteristics as DLRM.
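The dominant sparse operator in DLRM-style recommenders is the sum-pooled embedding lookup ("embedding bag"): each categorical feature contributes a bag of row indices into an embedding table, pooled by summation before the interaction layers. A minimal pure-Python sketch of that operator follows; the table contents, dimensions, and bag indices are illustrative, not the benchmark's configuration.

```python
def embedding_bag_sum(table, bags):
    """Sum-pooled embedding lookup.

    table: list of embedding rows (equal-length float vectors)
    bags:  list of index lists, one bag per sample
    """
    dim = len(table[0])
    out = []
    for bag in bags:
        pooled = [0.0] * dim
        for idx in bag:          # gather rows for this sample...
            row = table[idx]
            for d in range(dim):
                pooled[d] += row[d]  # ...and reduce by summation
        out.append(pooled)
    return out

# Tiny 3-row, 2-dim table; two samples with bags of sizes 2 and 1.
table = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
result = embedding_bag_sum(table, [[0, 2], [1]])
```

The memory-bound gather-reduce access pattern of this loop is what makes the operator a natural target for the bandwidth and vectorization optimizations the paper describes.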

    OD-SGD: One-step Delay Stochastic Gradient Descent for Distributed Training

    Training modern deep neural networks calls for large amounts of computation, often provided by GPUs or other specialized accelerators. To scale out and achieve faster training, two update schemes are mainly applied in distributed training: the synchronous SGD algorithm (SSGD) and the asynchronous SGD algorithm (ASGD). SSGD reaches a good convergence point, but its training speed is slowed by the synchronization barrier; ASGD trains faster but converges to a worse point. To exploit the advantages of both, we propose a novel method named One-step Delay SGD (OD-SGD), which combines their strengths in the training process, achieving a convergence point similar to SSGD's at a training speed close to ASGD's. To the best of our knowledge, this is the first attempt to combine the features of SSGD and ASGD to improve distributed training performance. Each iteration of OD-SGD contains a global update on the parameter-server node and local updates on the worker nodes; the local update compensates for the one-step-delayed weights. We evaluate the proposed algorithm on the MNIST, CIFAR-10, and ImageNet datasets. Experimental results show that OD-SGD obtains similar or even slightly better accuracy than SSGD while training much faster, even exceeding the training speed of ASGD.
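A toy single-worker sketch of the one-step-delay idea, assuming a scalar objective f(w) = 0.5*(w - 3)^2: the "server" applies the gradient pushed in the previous iteration (hiding communication latency behind computation), while the worker compensates locally with its fresh gradient. This illustrates only the update pattern, not the paper's full parameter-server implementation.

```python
def grad(w):
    """Gradient of the toy objective f(w) = 0.5 * (w - 3)**2."""
    return w - 3.0

def od_sgd(steps=200, lr=0.1):
    w_global = 0.0
    w_local = w_global
    delayed = None                    # gradient "in flight" to the server
    for _ in range(steps):
        g = grad(w_local)             # fresh gradient at the worker
        if delayed is not None:
            w_global -= lr * delayed  # server applies the one-step-old gradient
        delayed = g                   # push fresh gradient (arrives next step)
        w_local = w_global - lr * g   # local compensation with the fresh gradient
    return w_global
```

The compensation step keeps the worker's weights close to what a fully synchronous update would produce, which is the intuition behind OD-SGD matching SSGD's accuracy.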

    The Competition for Shortest Paths on Sparse Graphs

    Optimal paths connecting randomly selected network nodes and fixed routers are studied analytically in the presence of a non-linear overlap cost that penalizes congestion. Routing becomes increasingly difficult as the number of selected nodes increases and exhibits ergodicity breaking in the case of multiple routers. A distributed, linearly-scalable routing algorithm is devised. The ground state of such systems reveals non-monotonic complex behaviors in both average path length and algorithmic convergence, depending on the network topology and the densities of communicating nodes and routers.
    Comment: 4 pages, 4 figures
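The overlap cost above can be made concrete with a greedy stand-in: route the selected nodes to the router one at a time, charging each edge the marginal increase of a quadratic congestion penalty. This sequential heuristic only illustrates how a non-linear overlap cost spreads traffic; the paper's algorithm is distributed and message-passing based, and the four-node graph is invented.

```python
import heapq

def marginal(load):
    """Marginal cost of one more path on an edge carrying `load` paths,
    under a quadratic overlap penalty cost(l) = l**2."""
    return (load + 1) ** 2 - load ** 2

def route(adj, pairs):
    """Greedily route each (src, dst) pair, penalizing shared edges."""
    load, paths = {}, []
    for src, dst in pairs:
        dist, prev = {src: 0.0}, {}
        pq = [(0.0, src)]
        while pq:                               # Dijkstra on marginal costs
            d, u = heapq.heappop(pq)
            if d > dist[u]:
                continue
            for v in adj[u]:
                e = frozenset((u, v))
                nd = d + marginal(load.get(e, 0))
                if nd < dist.get(v, float("inf")):
                    dist[v], prev[v] = nd, u
                    heapq.heappush(pq, (nd, v))
        node, path = dst, [dst]                 # reconstruct and commit the path
        while node != src:
            node = prev[node]
            path.append(node)
        path.reverse()
        for u, v in zip(path, path[1:]):
            e = frozenset((u, v))
            load[e] = load.get(e, 0) + 1
        paths.append(path)
    return paths

# Diamond graph: two equal-length routes from A to the router D.
adj = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
paths = route(adj, [("A", "D"), ("A", "D")])
```

With the quadratic penalty, the second request avoids the edges taken by the first, so the two paths come out edge-disjoint.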

    8-Bit Approximations for Parallelism in Deep Learning

    The creation of practical deep learning data products often requires parallelization across processors and computers to make deep learning feasible on large data sets, but bottlenecks in communication bandwidth make it difficult to attain good speedups through parallelism. Here we develop and test 8-bit approximation algorithms which make better use of the available bandwidth by compressing 32-bit gradients and nonlinear activations to 8-bit approximations. We show that these approximations do not decrease predictive performance on MNIST, CIFAR10, and ImageNet for both model and data parallelism, and provide a data-transfer speedup of 2x relative to 32-bit parallelism. We build a predictive model for speedups based on our experimental data, verify its validity on known speedup data, and show that we can obtain a speedup of 50x and more on a system of 96 GPUs, compared to a speedup of 23x for 32-bit. We compare our data types with other methods and show that 8-bit approximations achieve state-of-the-art speedups for model parallelism. Thus, 8-bit approximation is an efficient method to parallelize convolutional networks on very large systems of GPUs.
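The compress-transfer-decompress pattern can be sketched with plain linear 8-bit quantization: scale each gradient vector by its maximum magnitude, round to signed 8-bit integers for transmission, and rescale on the receiver. The paper's actual 8-bit data types are more elaborate than this; the sketch below is only the simplest instance of the idea, with an invented gradient vector.

```python
def quantize8(grads):
    """Map floats to signed 8-bit integers in [-127, 127] plus a scale."""
    scale = max(abs(g) for g in grads) or 1.0
    q = [round(g / scale * 127) for g in grads]  # 1 byte per value on the wire
    return q, scale

def dequantize8(q, scale):
    """Receiver side: recover approximate 32-bit values."""
    return [v * scale / 127 for v in q]

g = [0.5, -1.0, 0.25, 0.0]
q, s = quantize8(g)
g_hat = dequantize8(q, s)
```

Each value now occupies one byte instead of four, giving the 4x payload reduction that underlies the reported bandwidth savings; the rounding error per value is at most half a quantization step (scale/254).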

    Quicksilver: Fast Predictive Image Registration - a Deep Learning Approach

    This paper introduces Quicksilver, a fast deformable image registration method. Quicksilver registration for image pairs works by patch-wise prediction of a deformation model based directly on image appearance. A deep encoder-decoder network is used as the prediction model. While the prediction strategy is general, we focus on predictions for the Large Deformation Diffeomorphic Metric Mapping (LDDMM) model. Specifically, we predict the momentum parameterization of LDDMM, which facilitates a patch-wise prediction strategy while maintaining the theoretical properties of LDDMM, such as guaranteed diffeomorphic mappings for sufficiently strong regularization. We also provide a probabilistic version of our prediction network which can be sampled at test time to estimate uncertainties in the predicted deformations. Finally, we introduce a new correction network which greatly increases the prediction accuracy of an existing prediction network. We show experimental results for uni-modal atlas-to-image as well as uni-/multi-modal image-to-image registrations. These experiments demonstrate that our method accurately predicts registrations obtained by numerical optimization, is very fast, achieves state-of-the-art registration results on four standard validation datasets, and can jointly learn an image similarity measure. Quicksilver is freely available as open-source software.
    Comment: Add new discussion