When parallel speedups hit the memory wall
After Amdahl's trailblazing work, many other authors proposed analytical
speedup models, but none considered the limiting effect of the memory wall.
These models exploited aspects such as problem-size variation, memory size,
communication overhead, and synchronization overhead, but assumed data-access
delays to be constant. Nevertheless, such delays can vary, for example,
according to the number of cores used and the ratio between processor and
memory frequencies. Given the large number of possible configurations of
operating frequency and number of cores that current architectures can offer,
suitable speedup models to describe such variations among these configurations
are quite desirable for off-line or on-line scheduling decisions. This work
proposes new parallel speedup models that account for variations of the average
data-access delay to describe the limiting effect of the memory wall on
parallel speedups. Analytical results indicate that the proposed modeling can
capture the desired behavior while experimental hardware results validate the
former. Additionally, we show that when accounting for parameters that reflect
the intrinsic characteristics of the applications, such as degree of
parallelism and susceptibility to the memory wall, our proposal has significant
advantages over machine-learning-based modeling. Moreover, besides being a
black-box approach, conventional machine-learning modeling needs, according to
our experiments, about one order of magnitude more measurements to reach the
same level of accuracy achieved by our modeling.
Comment: 24 pages
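To illustrate the kind of model the abstract describes, the following sketch contrasts classic Amdahl speedup with a hypothetical variant whose average data-access delay grows with the core count. This is an illustrative formulation only, not the paper's actual model; the `delay_growth` parameter and the linear penalty are assumptions.

```python
def amdahl_speedup(f, p):
    """Classic Amdahl's law: f is the parallel fraction, p the core count."""
    return 1.0 / ((1.0 - f) + f / p)

def memory_wall_speedup(f, p, delay_growth=0.05):
    """Hypothetical memory-wall-aware variant (not the paper's model):
    the average data-access delay inflates the parallel term as more
    cores contend for memory."""
    access_penalty = 1.0 + delay_growth * (p - 1)
    return 1.0 / ((1.0 - f) + (f / p) * access_penalty)

# Unlike the monotone Amdahl curve, the memory-wall-aware curve
# saturates much earlier as cores are added.
for p in (1, 4, 16, 64):
    print(p, round(amdahl_speedup(0.95, p), 2),
          round(memory_wall_speedup(0.95, p), 2))
```

Even this toy penalty term reproduces the qualitative behavior the abstract targets: speedup that falls increasingly short of Amdahl's prediction at high core counts.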
Counting Triangles in Large Graphs on GPU
The clustering coefficient and the transitivity ratio are concepts often used
in network analysis, which creates a need for fast practical algorithms for
counting triangles in large graphs. Previous research in this area focused on
sequential algorithms, MapReduce parallelization, and fast approximations.
In this paper we propose a parallel triangle counting algorithm for CUDA GPU.
We describe the implementation details necessary to achieve high performance
and present the experimental evaluation of our approach. Our algorithm achieves
8 to 15 times speedup over the CPU implementation and is capable of finding 3.8
billion triangles in a graph with 89 million edges in less than 10 seconds on
the Nvidia Tesla C2050 GPU.
Comment: 2016 IEEE International Parallel and Distributed Processing Symposium
Workshops (IPDPSW)
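The per-edge work that such GPU algorithms typically distribute across threads is neighbour-set intersection. A minimal CPU sketch of that counting scheme (not the paper's CUDA implementation) looks like this:

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as a list of edges.
    Each triangle is found once per incident edge, i.e. three times,
    so the sum of per-edge common-neighbour counts is divided by 3.
    A GPU version would assign one thread (or warp) per edge."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = sum(len(adj[u] & adj[v]) for u, v in edges)
    return total // 3

# The complete graph K4 contains 4 triangles.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(k4))  # → 4
```

The set intersections are embarrassingly parallel, which is why the edge-per-thread decomposition maps well onto CUDA.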
Computational Physics on Graphics Processing Units
The use of graphics processing units for scientific computations is an
emerging strategy that can significantly speed up various different algorithms.
In this review, we discuss advances made in the field of computational physics,
focusing on classical molecular dynamics, and on quantum simulations for
electronic structure calculations using the density functional theory, wave
function techniques, and quantum field theory.
Comment: Proceedings of the 11th International Conference, PARA 2012,
Helsinki, Finland, June 10-13, 2012
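The classical-molecular-dynamics workloads such reviews discuss are dominated by a time-integration loop that GPUs parallelize over particles. A minimal sketch of one such kernel, a velocity-Verlet integrator for a single degree of freedom, is shown below; the harmonic force and parameter values are illustrative, not taken from the review.

```python
def velocity_verlet(x, v, force, m, dt, steps):
    """Minimal velocity-Verlet integrator (illustrative sketch).
    In production MD the same update runs per particle, which is
    exactly the data parallelism GPUs exploit."""
    f = force(x)
    for _ in range(steps):
        x = x + v * dt + 0.5 * (f / m) * dt * dt   # position update
        f_new = force(x)                            # forces at new positions
        v = v + 0.5 * (f + f_new) / m * dt          # velocity update
        f = f_new
    return x, v

# Harmonic oscillator with k = m = 1: the integrator should
# (nearly) conserve the total energy E = v**2/2 + x**2/2.
k = 1.0
force = lambda x: -k * x
x, v = velocity_verlet(1.0, 0.0, force, 1.0, 0.01, 1000)
```

Velocity Verlet is the standard choice in MD because it is time-reversible and symplectic, so energy drift stays bounded over long runs.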
GPU-Accelerated Event Reconstruction for the COMET Phase-I Experiment
This paper discusses a parallelized event reconstruction of the COMET Phase-I
experiment. The experiment aims to discover charged lepton flavor violation by
observing 104.97 MeV electrons from neutrinoless muon-to-electron conversion in
muonic atoms. The event reconstruction of electrons with multiple helix turns
is a challenging problem because hit-to-turn classification requires a high
computation cost. The introduced algorithm finds an optimal seed of position
and momentum for each turn partition by investigating the residual sum of
squares based on distance-of-closest-approach (DCA) between hits and a track
extrapolated from the seed. Hits with DCA less than a cutoff value are
classified for the turn represented by the seed. The classification performance
was optimized by tuning the cutoff value and refining the set of classified
hits. The workload was parallelized over the seeds and the hits by defining two
GPU kernels, which record the track parameters extrapolated from the seeds and
find the DCAs of the hits, respectively. A reasonable efficiency and momentum
resolution were obtained for a wide momentum region that covers both signal and
background electrons. The event reconstruction results from the CPU and GPU
were identical to each other. The benchmarked GPUs achieved an order-of-magnitude
speedup over a 16-core CPU, although the exact speed gains varied with the GPU
architecture.
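The DCA-based classification can be sketched as follows. The real algorithm extrapolates a helix from a position/momentum seed; here a turn is idealised as a circle in the transverse plane, and all names, coordinates, and the cutoff value are hypothetical.

```python
import math

def seed_rss(hits, center, radius):
    """Residual sum of squares of the DCAs -- the quantity the seed
    search minimises (simplified: circle instead of a full helix)."""
    return sum((math.hypot(x - center[0], y - center[1]) - radius) ** 2
               for x, y in hits)

def classify_hits(hits, center, radius, cutoff):
    """Assign to this turn every hit whose distance of closest
    approach (DCA) to the extrapolated track is below the cutoff."""
    classified = []
    for x, y in hits:
        dca = abs(math.hypot(x - center[0], y - center[1]) - radius)
        if dca < cutoff:
            classified.append((x, y))
    return classified

# Three hits lie close to the unit circle; one lies well inside it.
hits = [(1.0, 0.0), (0.0, 1.02), (0.5, 0.5), (-0.98, 0.0)]
on_turn = classify_hits(hits, (0.0, 0.0), 1.0, 0.05)
```

In the parallelized version described above, one kernel evaluates the track extrapolation per seed and a second computes the per-hit DCAs, so both loops vanish into GPU threads.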
Proximity Scheme for Instruction Caches in Tiled CMP Architectures
Recent research results show that there is a high degree of code sharing between cores in multi-core architectures. In this paper we propose a proximity scheme for instruction caches, in which code blocks shared among neighbouring L2 caches in tiled multi-core architectures are exploited to reduce the average cache miss penalty and the on-chip network traffic. We evaluate the proposed proximity scheme using a full-system simulator running an n-core tiled CMP. The experimental results reveal a significant execution-time improvement of up to 91.4% for microbenchmarks whose instruction footprint does not fit in the private L2 cache. For real applications from the PARSEC benchmark suite, the proposed scheme yields speedups of up to 8%.
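The latency benefit of the scheme can be illustrated with a toy access model: on a private-L2 miss, neighbouring tiles are probed for the shared code block before paying the full off-chip penalty. The latency numbers and tile topology below are invented for illustration and do not come from the paper.

```python
def access_latency(block, tile, caches, neighbours,
                   t_local=2, t_neighbour=8, t_memory=100):
    """Toy model of the proximity scheme (all latencies are made up):
    a block found in a neighbouring tile's L2 costs t_neighbour
    cycles instead of the full t_memory off-chip penalty."""
    if block in caches[tile]:
        return t_local                  # hit in the private L2
    for n in neighbours[tile]:
        if block in caches[n]:
            return t_neighbour          # proximity hit in a neighbour's L2
    return t_memory                     # off-chip memory access

# Three tiles in a row; tile 0 holds block "a", tile 1 holds "b".
caches = {0: {"a"}, 1: {"b"}, 2: set()}
neighbours = {0: [1], 1: [0, 2], 2: [1]}
```

With a high degree of instruction sharing, many of the would-be `t_memory` accesses collapse to `t_neighbour`, which is where the reported miss-penalty reduction comes from.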
The application of GPU to molecular communication studies
This thesis applies recent trends in parallel processing via the graphics processing unit (GPU) to the field of molecular communications (MC), which investigates the communication possibilities of futuristic in vivo nanomachines. Existing MC simulations have not fully accounted for structural boundaries or the associated simulation of a massive number of messenger-molecule paths for stochastic evaluation. These molecules are influenced by Brownian motion as well as by blood flow, which is modeled using numerical methods based on the Fokker-Planck stochastic differential equation. By using a GPU, these paths can be calculated on a massive scale, both in the number of simulated paths and in the number of time steps. The use of a GPU also allows other obstacles and complications to be added to the paths of these molecules in future work. This study should enable and expedite existing as well as future research in the MC field.
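The path-level counterpart of the Fokker-Planck description is a drift-diffusion stochastic differential equation, which an Euler-Maruyama scheme simulates one step at a time. The sketch below runs the per-molecule walk serially; a GPU would assign one thread per path. Parameter values (drift, diffusion coefficient, step counts) are illustrative.

```python
import math
import random

def simulate_paths(n_paths, n_steps, dt, drift, diffusion, x0=0.0):
    """Euler-Maruyama sketch of 1-D drift-diffusion molecule paths:
    dx = drift*dt + sqrt(2*D*dt) * N(0, 1).
    Each path is independent, which is exactly what makes the
    simulation massively parallel on a GPU."""
    paths = []
    for _ in range(n_paths):
        x = x0
        for _ in range(n_steps):
            x += drift * dt + math.sqrt(2.0 * diffusion * dt) * random.gauss(0.0, 1.0)
        paths.append(x)
    return paths

random.seed(1)
final = simulate_paths(n_paths=2000, n_steps=100, dt=0.01, drift=1.0, diffusion=0.5)
mean = sum(final) / len(final)  # should be near drift * total_time = 1.0
```

Structural boundaries (e.g. vessel walls) would enter this loop as reflection or absorption rules applied after each step, which is the kind of per-path complication the thesis defers to future work.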