Shared-memory Graph Truss Decomposition
We present PKT, a new shared-memory parallel algorithm and OpenMP
implementation for the truss decomposition of large sparse graphs. A k-truss is
a dense subgraph definition that can be considered a relaxation of a clique.
Truss decomposition refers to a partitioning of all the edges in the graph
based on their k-truss membership. The truss decomposition of a graph has many
applications. We show that our new approach, PKT, consistently outperforms other
truss decomposition approaches on a collection of large sparse graphs, evaluated
on a 24-core shared-memory server. PKT is based on a recently proposed algorithm for
k-core decomposition.
Comment: 10 pages, conference submission
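As a rough illustration of what truss decomposition computes (not the parallel PKT algorithm itself), the following minimal serial C++ sketch peels edges whose triangle support falls below k-2 and records, for each edge, the last level at which it survived; the toy graph and all identifiers are invented for this example.

    #include <cstdio>
    #include <map>
    #include <set>
    #include <utility>
    #include <vector>

    using Edge = std::pair<int, int>;   // stored with first < second

    int main() {
        // Toy graph: a 4-clique {0,1,2,3} plus a pendant edge (3,4).
        std::vector<Edge> edges = {{0,1},{0,2},{0,3},{1,2},{1,3},{2,3},{3,4}};

        std::map<int, std::set<int>> adj;
        for (auto [u, v] : edges) { adj[u].insert(v); adj[v].insert(u); }

        // Support of an edge = number of triangles it participates in.
        auto support = [&](int u, int v) {
            int s = 0;
            for (int w : adj[u]) if (w != v && adj[v].count(w)) ++s;
            return s;
        };

        std::map<Edge, int> truss;                      // edge -> truss number
        std::set<Edge> remaining(edges.begin(), edges.end());

        for (int k = 3; !remaining.empty(); ++k) {
            bool peeled = true;
            while (peeled) {                            // peel until the k-truss is stable
                peeled = false;
                for (auto it = remaining.begin(); it != remaining.end(); ) {
                    auto [u, v] = *it;
                    if (support(u, v) < k - 2) {
                        truss[*it] = k - 1;             // in the (k-1)-truss, not the k-truss
                        adj[u].erase(v); adj[v].erase(u);
                        it = remaining.erase(it);
                        peeled = true;
                    } else {
                        ++it;
                    }
                }
            }
        }

        for (const auto& [e, t] : truss)
            printf("edge (%d,%d): truss number %d\n", e.first, e.second, t);
        return 0;
    }

On this toy graph the pendant edge receives truss number 2 and the 4-clique edges receive truss number 4; a parallel formulation such as PKT replaces the repeated rescans with bucketed, concurrent peeling.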
Open Transactions on Shared Memory
Transactional memory has emerged as an effective way to address many of the issues
of lock-based programming. However, most implementations admit only isolated
transactions, which are inadequate when communicating processes have to be
coordinated. To this end, in this paper we present OCTM, a Haskell-like language
with open transactions over shared transactional memory: processes can join
transactions at runtime simply by accessing shared variables. Thus a transaction
can cooperate with its environment through shared variables, but if it is rolled
back, all of its effects on the environment are retracted as well. To demonstrate
the expressive power of OCTM, we give an implementation of TCCS, a CCS-like
calculus with open transactions.
Implementing implicit OpenMP data sharing on GPUs
OpenMP is a shared memory programming model which supports the offloading of
target regions to accelerators such as NVIDIA GPUs. The implementation in
Clang/LLVM aims to deliver a generic GPU compilation toolchain that supports
both the native CUDA C/C++ and the OpenMP device offloading models. There are
situations where the semantics of OpenMP and those of CUDA diverge. One such
example is the policy for implicitly handling local variables. In CUDA, local
variables are implicitly mapped to thread local memory and thus become private
to a CUDA thread. In OpenMP, due to semantics that allow the nesting of regions
executed by different numbers of threads, variables need to be implicitly
shared among the threads of a contention group. In this paper we
introduce a re-design of the OpenMP device data sharing infrastructure that is
responsible for the implicit sharing of local variables in the Clang/LLVM
toolchain. We introduce a new data sharing infrastructure that lowers
implicitly shared variables to the shared memory of the GPU. We measure the
amount of shared memory used by our scheme in cases that involve scalar
variables and statically allocated arrays. The evaluation is carried out by
offloading to K40 and P100 NVIDIA GPUs. For scalar variables the pressure on
shared memory is relatively low, staying under 26% of shared memory utilization
on the K40, and does not negatively impact occupancy. The limiting occupancy
factor in that case is register pressure. The data sharing scheme offers users
a simple memory model for controlling the implicit allocation of device
shared memory.
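As a hedged sketch of the semantic difference described above (it is not the Clang/LLVM data sharing implementation), the OpenMP target region below declares a variable between the teams and parallel levels; OpenMP requires that variable to be visible to every thread of the nested parallel region, so on a GPU it must live in storage the whole contention group can reach rather than in thread-private registers or local memory. The kernel shape and values are made up, and the example assumes a compiler with OpenMP offloading support (without it, the region simply runs on the host).

    #include <cstdio>
    #include <omp.h>

    int main() {
        int result[4] = {0, 0, 0, 0};

        #pragma omp target teams num_teams(1) map(tofrom: result)
        {
            // "scale" is declared by the initial thread of the team. OpenMP
            // semantics make it visible to all threads of the nested parallel
            // region below, so on a GPU the compiler must place it in storage
            // the whole contention group can reach (e.g. device shared memory),
            // unlike a CUDA local variable, which is private to its thread.
            int scale = 10;

            #pragma omp parallel num_threads(4)
            {
                int tid = omp_get_thread_num();   // thread-private, like a CUDA local
                result[tid] = tid * scale;        // reads the implicitly shared "scale"
            }
        }

        for (int i = 0; i < 4; ++i)
            printf("result[%d] = %d\n", i, result[i]);
        return 0;
    }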
Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory
New algorithms and optimization techniques are needed to counterbalance the
accelerating trend towards bandwidth-starved multicore chips. It is well known
that the performance of stencil codes can be improved by temporal blocking,
lessening the pressure on the memory interface. We introduce a new pipelined
approach that makes explicit use of shared caches in multicore environments and
minimizes synchronization and boundary overhead. For clusters of shared-memory
nodes we demonstrate how temporal blocking can be employed successfully in a
hybrid shared/distributed-memory environment.
Comment: 9 pages, 6 figures
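The paper's pipelined, shared-cache-aware scheme is not reproduced here; the sketch below only illustrates the basic idea of temporal blocking for a 1-D Jacobi stencil: each spatial tile is extended by ghost cells, several sweeps are fused inside a cache-sized buffer, and only the tile interior is written back. Tile width, step count, and the redundant-halo strategy are illustrative choices, not those of the paper.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        const int N = 1 << 16;       // grid points
        const int TILE = 2048;       // spatial tile width
        const int TSTEPS = 8;        // time steps fused per tile (the temporal block)

        std::vector<double> u(N, 0.0), v(N, 0.0);
        u[N / 2] = 1.0;              // point source; endpoints held fixed

        // One temporal block: every tile advances TSTEPS sweeps before moving on.
        #pragma omp parallel for schedule(static)
        for (int start = 0; start < N; start += TILE) {
            // Extend the tile by TSTEPS ghost cells per side (clamped at the edges)
            // so that TSTEPS sweeps can run entirely inside this small buffer.
            int lo = std::max(0, start - TSTEPS);
            int hi = std::min(N, start + TILE + TSTEPS);
            std::vector<double> a(u.begin() + lo, u.begin() + hi), b(a.size());

            for (int t = 0; t < TSTEPS; ++t) {
                for (int i = 1; i + 1 < (int)a.size(); ++i)
                    b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;   // 3-point Jacobi sweep
                b.front() = a.front();
                b.back()  = a.back();
                std::swap(a, b);
            }

            // Write back only the tile interior; the ghost cells were redundant work
            // that kept boundary errors out of the interior.
            for (int i = start; i < std::min(N, start + TILE); ++i)
                v[i] = a[i - lo];
        }

        printf("center value after %d fused sweeps: %f\n", TSTEPS, v[N / 2]);
        return 0;
    }

Fusing TSTEPS sweeps per tile means each grid point is loaded from main memory roughly once per TSTEPS updates instead of once per update, which is the reduction in memory-interface pressure the abstract refers to.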
