
    When parallel speedups hit the memory wall

    After Amdahl's trailblazing work, many other authors proposed analytical speedup models, but none considered the limiting effect of the memory wall. These models exploited aspects such as problem-size variation, memory size, communication overhead, and synchronization overhead, but they assumed data-access delays to be constant. Such delays can vary, however, with the number of cores used and with the ratio between processor and memory frequencies. Given the large number of operating-frequency and core-count configurations that current architectures offer, speedup models that describe the variation across these configurations are highly desirable for off-line or on-line scheduling decisions. This work proposes new parallel speedup models that account for variations in the average data-access delay, thereby describing the limiting effect of the memory wall on parallel speedups. Analytical results indicate that the proposed models capture the desired behavior, and hardware experiments validate the analytical results. Additionally, we show that by accounting for parameters that reflect intrinsic characteristics of the applications, such as the degree of parallelism and the susceptibility to the memory wall, our proposal has significant advantages over machine-learning-based modeling. Moreover, our experiments show that conventional machine-learning modeling, besides being a black-box approach, needs about one order of magnitude more measurements to reach the level of accuracy achieved by our models.
    Comment: 24 pages
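
    As a rough illustration of the modeling idea (a sketch under assumed parameters, not the authors' actual model), Amdahl's law can be extended with a core-count-dependent data-access-delay term. In the code below, the parallel fraction p and the linear delay model d0 + d1*n are hypothetical; the point is that speedup saturates and eventually declines as the memory term grows with the core count n.

        // Illustrative sketch only: an Amdahl-style speedup model extended with
        // an average data-access delay that grows with the core count. The
        // parameters p, d0 and d1 are hypothetical, not taken from the paper.
        #include <cstdio>

        // p : parallelizable fraction of the work (0..1)
        // n : number of cores
        // d0: baseline average data-access delay (relative to unit compute time)
        // d1: extra delay per core as memory contention grows
        double speedup(double p, int n, double d0, double d1)
        {
            double t1 = 1.0 + d0 + d1;                    // single-core time
            double tn = (1.0 - p) + p / n + d0 + d1 * n;  // n-core time + memory wall
            return t1 / tn;
        }

        int main()
        {
            for (int n = 1; n <= 64; n *= 2)
                printf("%2d cores -> speedup %.2f\n", n, speedup(0.95, n, 0.01, 0.002));
            return 0;
        }

    With these invented parameters the modelled speedup peaks between 16 and 32 cores and then declines, which is the qualitative memory-wall behavior the abstract describes.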

    Counting Triangles in Large Graphs on GPU

    The clustering coefficient and the transitivity ratio are concepts often used in network analysis, which creates a need for fast practical algorithms for counting triangles in large graphs. Previous research in this area focused on sequential algorithms, MapReduce parallelization, and fast approximations. In this paper we propose a parallel triangle counting algorithm for CUDA GPUs. We describe the implementation details necessary to achieve high performance and present an experimental evaluation of our approach. Our algorithm achieves an 8 to 15 times speedup over the CPU implementation and is capable of finding 3.8 billion triangles in a graph with 89 million edges in less than 10 seconds on the Nvidia Tesla C2050 GPU.
    Comment: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
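
    A common GPU formulation of exact triangle counting (shown here as a generic sketch, not necessarily the authors' kernel) orients the graph toward higher vertex ids, assigns one thread per edge, and intersects the two sorted adjacency lists in CSR form; every name below is illustrative.

        // Generic edge-per-thread triangle counting on a CSR graph whose
        // adjacency lists are sorted and keep only higher-numbered neighbours,
        // so each triangle is counted exactly once. Illustrative sketch only.
        #include <cstdio>
        #include <cstring>

        __global__ void countTriangles(const int *rowPtr, const int *colIdx,
                                       const int *edgeU, const int *edgeV,
                                       int numEdges, unsigned long long *count)
        {
            int e = blockIdx.x * blockDim.x + threadIdx.x;
            if (e >= numEdges) return;

            int i = rowPtr[edgeU[e]], iEnd = rowPtr[edgeU[e] + 1];
            int j = rowPtr[edgeV[e]], jEnd = rowPtr[edgeV[e] + 1];
            unsigned long long t = 0;

            // Merge-style intersection of the two sorted neighbour lists:
            // each common neighbour closes one triangle over edge (u, v).
            while (i < iEnd && j < jEnd) {
                int a = colIdx[i], b = colIdx[j];
                if (a == b)     { ++t; ++i; ++j; }
                else if (a < b)   ++i;
                else              ++j;
            }
            atomicAdd(count, t);
        }

        int main()
        {
            // Toy graph: the single triangle 0-1-2, oriented toward higher ids.
            int hRow[] = {0, 2, 3, 3}, hCol[] = {1, 2, 2};
            int hU[]   = {0, 0, 1},    hV[]   = {1, 2, 2};

            int *row, *col, *u, *v; unsigned long long *cnt;
            cudaMallocManaged(&row, sizeof hRow); cudaMallocManaged(&col, sizeof hCol);
            cudaMallocManaged(&u,   sizeof hU);   cudaMallocManaged(&v,   sizeof hV);
            cudaMallocManaged(&cnt, sizeof *cnt);
            memcpy(row, hRow, sizeof hRow); memcpy(col, hCol, sizeof hCol);
            memcpy(u, hU, sizeof hU);       memcpy(v, hV, sizeof hV);
            *cnt = 0;

            countTriangles<<<1, 32>>>(row, col, u, v, 3, cnt);
            cudaDeviceSynchronize();
            printf("triangles: %llu\n", *cnt);   // expect 1
            return 0;
        }

    In practice, high performance on graphs of this scale also depends on load balancing across skewed degree distributions, which the sketch above deliberately omits.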

    Computational Physics on Graphics Processing Units

    The use of graphics processing units for scientific computations is an emerging strategy that can significantly speed up a wide variety of algorithms. In this review, we discuss advances made in the field of computational physics, focusing on classical molecular dynamics and on quantum simulations for electronic-structure calculations using density functional theory, wave-function techniques, and quantum field theory.
    Comment: Proceedings of the 11th International Conference, PARA 2012, Helsinki, Finland, June 10-13, 2012

    GPU-Accelerated Event Reconstruction for the COMET Phase-I Experiment

    This paper discusses a parallelized event reconstruction for the COMET Phase-I experiment. The experiment aims to discover charged lepton flavor violation by observing 104.97 MeV electrons from neutrinoless muon-to-electron conversion in muonic atoms. The event reconstruction of electrons with multiple helix turns is a challenging problem because hit-to-turn classification incurs a high computational cost. The introduced algorithm finds an optimal seed of position and momentum for each turn partition by investigating the residual sum of squares based on the distance of closest approach (DCA) between hits and a track extrapolated from the seed. Hits with a DCA less than a cutoff value are classified as belonging to the turn represented by the seed. The classification performance was optimized by tuning the cutoff value and refining the set of classified hits. The workload was parallelized over the seeds and the hits by defining two GPU kernels, which record track parameters extrapolated from the seeds and find the DCAs of the hits, respectively. Reasonable efficiency and momentum resolution were obtained over a wide momentum region covering both signal and background electrons. The event reconstruction results from the CPU and GPU were identical to each other. The benchmarked GPUs achieved an order-of-magnitude speedup over a 16-core CPU, although the exact gains varied with their architectures.
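
    The hit-classification step maps naturally onto the hits-parallel kernel described above. The sketch below is a generic reconstruction of that idea with hypothetical names, not the experiment's code: for one seed, each thread takes one hit, approximates its DCA as the minimum distance to track points pre-extrapolated from the seed, and flags the hit if the DCA falls under the cutoff.

        // Illustrative sketch of DCA-based hit classification for one seed:
        // one thread per hit; trackPts holds points pre-extrapolated from the
        // seed by a separate kernel. All names and the layout are hypothetical.
        #include <cfloat>
        #include <cmath>

        struct Point { float x, y, z; };

        __global__ void classifyHits(const Point *hits, int numHits,
                                     const Point *trackPts, int numTrackPts,
                                     float cutoff, int *assigned)
        {
            int h = blockIdx.x * blockDim.x + threadIdx.x;
            if (h >= numHits) return;

            // DCA approximated as the minimum distance to the sampled track points.
            float dca = FLT_MAX;
            for (int t = 0; t < numTrackPts; ++t) {
                float dx = hits[h].x - trackPts[t].x;
                float dy = hits[h].y - trackPts[t].y;
                float dz = hits[h].z - trackPts[t].z;
                dca = fminf(dca, sqrtf(dx * dx + dy * dy + dz * dz));
            }
            // A hit belongs to this seed's turn if its DCA is below the cutoff.
            assigned[h] = (dca < cutoff) ? 1 : 0;
        }

    The per-seed residual sum of squares used to rank seeds can then be accumulated over the flagged hits with a standard parallel reduction.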

    Proximity Scheme for Instruction Caches in Tiled CMP Architectures

    Recent research results show that there is a high degree of code sharing between cores in multi-core architectures. In this paper we propose a proximity scheme for instruction caches, in which the code blocks shared among neighbouring L2 caches in tiled multi-core architectures are exploited to reduce the average cache-miss penalty and the on-chip network traffic. We evaluate the proposed proximity scheme using a full-system simulator running an n-core tiled CMP. The experimental results reveal a significant execution-time improvement of up to 91.4% for microbenchmarks whose instruction footprint does not fit in the private L2 cache. For real applications from the PARSEC benchmark suite, the proposed scheme yields speedups of up to 8%.
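
    A toy model of the lookup path conveys the idea (a hedged sketch: the latencies, mesh layout, and helper names below are invented, not the paper's mechanism): on a local L2 instruction miss, the neighbouring tiles are probed for the shared code block before falling back to the directory and memory.

        // Toy model of the proximity lookup on an instruction-fetch L2 miss.
        // Capacities, latencies and the 4x4 mesh are invented for illustration.
        #include <cstdio>

        const int NUM_TILES = 16, LINES_PER_L2 = 4;
        unsigned long l2[NUM_TILES][LINES_PER_L2];   // resident block ids per tile

        bool l2Contains(int tile, unsigned long block)
        {
            for (int i = 0; i < LINES_PER_L2; ++i)
                if (l2[tile][i] == block) return true;
            return false;
        }

        // Right/left/down/up neighbours on a 4x4 mesh (edge effects ignored).
        const int step[4] = {1, -1, 4, -4};
        int neighbourOf(int tile, int k)
        {
            int n = tile + step[k];
            return (n >= 0 && n < NUM_TILES) ? n : tile;
        }

        // Returns the modelled miss penalty (cycles) for fetching 'block'.
        int fetchInstructionBlock(int tile, unsigned long block)
        {
            if (l2Contains(tile, block)) return 10;          // local L2 hit
            for (int k = 0; k < 4; ++k)                      // proximity probe
                if (l2Contains(neighbourOf(tile, k), block)) return 25;
            return 200;                                      // directory + memory
        }

        int main()
        {
            l2[1][0] = 0xABC;  // a neighbouring tile holds the shared code block
            printf("penalty: %d cycles\n", fetchInstructionBlock(0, 0xABC));
            return 0;
        }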

    The application of GPU to molecular communication studies

    This thesis applies recent trends in parallel processing via graphics processing units (GPUs) to the field of molecular communication (MC), which investigates the communication possibilities of futuristic in vivo nanomachines. Existing MC simulations have not fully accounted for structural boundaries or for the simulation of the massive number of messenger-molecule paths needed for stochastic evaluation. These molecules are influenced by Brownian motion as well as by the flow of the blood, which is modeled using numerical methods based on the Fokker-Planck equation. By using a GPU, these paths can be calculated on a massive scale, both in the number of simulated paths and in the number of time steps. The use of a GPU also allows other obstacles and complications to be added to the molecules' paths in future work. This study should enable and expedite existing as well as future research in the MC field.
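
    A minimal sketch of the kind of path simulation described here, assuming a simple 1-D advection-diffusion model (drift velocity v for the blood flow, diffusion coefficient D for the Brownian motion); all parameter values and names are illustrative, not taken from the thesis. Each GPU thread advances one messenger-molecule path with Euler-Maruyama steps, drawing its Gaussian increments from cuRAND.

        // Minimal sketch: one thread per messenger molecule, advanced with
        // Euler-Maruyama steps x += v*dt + sqrt(2*D*dt)*N(0,1). The parameters
        // (v, D, dt, steps) are illustrative, not taken from the thesis.
        #include <cstdio>
        #include <curand_kernel.h>

        __global__ void simulatePaths(float *finalPos, int numPaths, int numSteps,
                                      float v, float D, float dt,
                                      unsigned long long seed)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= numPaths) return;

            curandState rng;
            curand_init(seed, i, 0, &rng);   // independent stream per path

            float x = 0.0f;                  // all molecules released at the origin
            float sigma = sqrtf(2.0f * D * dt);
            for (int s = 0; s < numSteps; ++s)
                x += v * dt + sigma * curand_normal(&rng);  // drift + diffusion

            finalPos[i] = x;
        }

        int main()
        {
            const int paths = 1 << 20, steps = 1000;
            float *pos;
            cudaMallocManaged(&pos, paths * sizeof(float));
            simulatePaths<<<(paths + 255) / 256, 256>>>(pos, paths, steps,
                                                        1e-3f, 1e-2f, 1e-3f, 42ULL);
            cudaDeviceSynchronize();
            printf("first molecule ended at x = %f\n", pos[0]);
            return 0;
        }

    Because every path is independent, this workload scales with both the number of paths and the number of time steps, which is exactly the scale argument the abstract makes for using GPUs.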