An Efficient Multiway Mergesort for GPU Architectures
Sorting is a primitive operation that is a building block for countless
algorithms. As such, it is important to design sorting algorithms that approach
peak performance on a range of hardware architectures. Graphics Processing
Units (GPUs) are particularly attractive architectures as they provide massive
parallelism and computing power. However, the intricacies of their compute and
memory hierarchies make designing GPU-efficient algorithms challenging. In this
work we present GPU Multiway Mergesort (MMS), a new GPU-efficient multiway
mergesort algorithm. MMS employs a new partitioning technique that exposes the
parallelism needed by modern GPU architectures. To the best of our knowledge,
MMS is the first sorting algorithm for the GPU that is asymptotically optimal
in terms of global memory accesses and that is completely free of shared memory
bank conflicts.
We realize an initial implementation of MMS, evaluate its performance on
three modern GPU architectures, and compare it to competitive implementations
available in state-of-the-art GPU libraries. Despite these implementations
being highly optimized, MMS compares favorably, achieving performance
improvements for most random inputs. Furthermore, unlike MMS, state-of-the-art
algorithms are susceptible to bank conflicts. We find that for certain inputs
that cause these algorithms to incur large numbers of bank conflicts, MMS can
achieve up to a 37.6% speedup over its fastest competitor. Overall, even though
its current implementation is not fully optimized, due to its efficient use of
the memory hierarchy, MMS outperforms the fastest comparison-based sorting
implementations available to date.
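For orientation, the general idea of a multiway mergesort can be sketched in a few lines of sequential Python. This is only an illustration of merging k sorted runs at once. It is not the MMS algorithm, and it does not model the GPU partitioning technique, global memory access pattern, or bank-conflict behavior described in the abstract. The names `multiway_merge` and `multiway_mergesort` and the parameters `k` and `base` are assumptions made for this sketch.

```python
import heapq


def multiway_merge(runs):
    """Merge any number of already-sorted lists into one sorted list."""
    # Seed the heap with the first element of every non-empty run.
    # Each entry is (value, run index, position inside that run).
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    merged = []
    while heap:
        value, run_idx, pos = heapq.heappop(heap)
        merged.append(value)
        nxt = pos + 1
        if nxt < len(runs[run_idx]):
            # Replace the consumed element with its successor from the same run.
            heapq.heappush(heap, (runs[run_idx][nxt], run_idx, nxt))
    return merged


def multiway_mergesort(data, k=4, base=8):
    """Sort `data` by splitting it into up to k runs and merging them in one pass."""
    if k < 2:
        raise ValueError("k must be at least 2")
    if len(data) <= max(base, 1):
        return sorted(data)  # small inputs: fall back to the built-in sort
    chunk = -(-len(data) // k)  # ceiling division
    runs = [
        multiway_mergesort(data[i:i + chunk], k, base)
        for i in range(0, len(data), chunk)
    ]
    return multiway_merge(runs)
```

With a larger `k`, the recursion needs fewer merge levels over the data, which is the usual motivation for multiway merging when passes over memory are expensive.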
Distributed Graph Clustering using Modularity and Map Equation
We study large-scale, distributed graph clustering. Given an undirected
graph, our objective is to partition the nodes into disjoint sets called
clusters. A cluster should contain many internal edges while being sparsely
connected to other clusters. In the context of a social network, a cluster
could be a group of friends. Modularity and map equation are established
formalizations of this internally-dense-externally-sparse principle. We present
two versions of a simple distributed algorithm to optimize both measures. They
are based on Thrill, a distributed big data processing framework that
implements an extended MapReduce model. The algorithms for the two measures,
DSLM-Mod and DSLM-Map, differ only slightly. Adapting them for similar quality
measures is straightforward. We conduct an extensive experimental study on
real-world graphs and on synthetic benchmark graphs with up to 68 billion
edges. Our algorithms are fast while detecting clusterings similar to those
detected by other sequential, parallel and distributed clustering algorithms.
Compared to the distributed GossipMap algorithm, DSLM-Map needs less memory, is
up to an order of magnitude faster and achieves better quality.
Comment: 14 pages, 3 figures; v3: Camera ready for Euro-Par 2018, more
details, more results; v2: extended experiments to include comparison with
competing algorithms, shortened for submission to Euro-Par 201
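To make the internally-dense-externally-sparse principle from the abstract above concrete, the following sketch evaluates the standard modularity score of a given clustering on a small undirected, unweighted graph. It only scores a clustering. It is not DSLM-Mod, it does not use Thrill, and it does not cover the map equation. The function name `modularity` and the edge-list and dictionary input format are assumptions made for this sketch.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Modularity of a clustering on an undirected, unweighted graph.

    edges:      list of (u, v) pairs, each undirected edge listed once
    cluster_of: dict mapping every node to its cluster label
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    internal = defaultdict(int)  # edges with both endpoints in the cluster
    volume = defaultdict(int)    # sum of node degrees in the cluster
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        volume[cu] += 1
        volume[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Q = sum over clusters of (internal edge fraction - expected fraction).
    return sum(internal[c] / m - (volume[c] / (2 * m)) ** 2 for c in volume)


# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {node: "a" for node in range(6)}
```

On this toy graph, grouping each triangle into its own cluster scores higher than placing all nodes in one cluster, which scores zero.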
Sparse Allreduce: Efficient Scalable Communication for Power-Law Data
Many large datasets exhibit power-law statistics: the web graph, social
networks, text data, click-through data, etc. Their adjacency graphs are termed
natural graphs, and are known to be difficult to partition. As a consequence,
most distributed algorithms on these graphs are communication intensive. Many
algorithms on natural graphs involve an Allreduce: a sum or average of
partitioned data which is then shared back to the cluster nodes. Examples
include PageRank, spectral partitioning, and many machine learning algorithms
including regression, factor (topic) models, and clustering. In this paper we
describe an efficient and scalable Allreduce primitive for power-law data. We
point out scaling problems with existing butterfly and round-robin networks for
Sparse Allreduce, and show that a hybrid approach improves on both.
Furthermore, we show that Sparse Allreduce stages should be nested instead of
cascaded (as in the dense case), and that the optimum-throughput Allreduce
network should be a butterfly of heterogeneous degree, where degree decreases
with depth into the network. Finally, a simple replication scheme is introduced
to deal with node failures. We present experiments showing significant
improvements over existing systems such as PowerGraph and Hadoop.
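The semantics of a sum-based Allreduce over sparse data, as described in the abstract above, can be illustrated with a single-process simulation. This sketch shows only what result each node ends up with. It does not model the butterfly or round-robin networks, the nesting of stages, or the replication scheme. The function name `sparse_allreduce`, the dictionary representation of sparse vectors, and the idea that each node lists the indices it wants back are assumptions made for this sketch.

```python
def sparse_allreduce(contributions, requests):
    """Simulate a sum Allreduce over sparse vectors on one machine.

    contributions: one dict per node, mapping index -> value it contributes
    requests:      one set per node, holding the indices it wants back
    Returns one dict per node with the reduced values for its requested indices.
    """
    # Reduce step: sum all contributions index by index.
    totals = {}
    for vec in contributions:
        for idx, val in vec.items():
            totals[idx] = totals.get(idx, 0.0) + val
    # Share-back step: each node receives only the indices it asked for.
    # Indices nobody contributed to are reported as 0.0.
    return [
        {idx: totals.get(idx, 0.0) for idx in wanted}
        for wanted in requests
    ]


contributions = [{0: 1.0, 2: 2.0}, {2: 3.0, 5: 1.0}]
requests = [{2}, {0, 5, 7}]
result = sparse_allreduce(contributions, requests)
```

In a real cluster the reduce and share-back steps are spread over the network, and the communication volume depends on how the sparse index sets of different nodes overlap.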