3 research outputs found

    The origin of interstellar asteroidal objects like 1I/2017 U1 'Oumuamua

    We study the origin of the interstellar object 1I/2017 U1 'Oumuamua by juxtaposing estimates based on the observations with simulations. We speculate that objects like 'Oumuamua are formed in the debris disc as left over from the star and planet formation process, and subsequently liberated. The liberation process is mediated either by interaction with other stars in the parental star-cluster, by resonant interactions within the planetesimal disc or by the relatively sudden mass loss when the host star becomes a compact object. Integrating backward in time in the Galactic potential together with stars from the Gaia-TGAS catalogue we find that about 1.3 Myr ago 'Oumuamua passed the nearby star HIP 17288 within a mean distance of 1.3 pc. By comparing nearby observed L-dwarfs with simulations of the Galaxy we conclude that the kinematics of 'Oumuamua is consistent with relatively young objects of 1.1--1.7 Gyr. We just met 'Oumuamua by chance, and with a derived mean Galactic density of $\sim 3\times 10^{5}$ similarly sized objects within 100 au from the Sun or $\sim 10^{14}$ per cubic parsec we expect about 2 to 12 such visitors per year within 1 au from the Sun. Comment: MNRAS (in press)

    Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

    TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities which TensorFlow offers for the distributed training of large ML/DL models that need computation and communication at scale. Most commonly used distributed training approaches for TF can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters including the Piz Daint system (#6 on Top500). We perform experiments to gain novel insights along the following vectors: 1) Application-level scalability of DNN training, 2) Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on these experiments, we present two key insights: 1) Overall, No-gRPC designs achieve better performance compared to gRPC-based approaches for most configurations, and 2) The performance of No-gRPC is heavily influenced by the gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits CUDA kernels and pointer caching to perform large reductions efficiently. Our proposed designs offer 5-17X better performance than NCCL2 for small and medium messages, and reduce latency by 29% for large messages. The proposed optimizations help Horovod-MPI to achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster. Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer-review

    Creating the Virtual Universe

    Computational astrophysics