An Efficient Multiway Mergesort for GPU Architectures
Sorting is a primitive operation that is a building block for countless
algorithms. As such, it is important to design sorting algorithms that approach
peak performance on a range of hardware architectures. Graphics Processing
Units (GPUs) are particularly attractive architectures as they provide massive
parallelism and computing power. However, the intricacies of their compute and
memory hierarchies make designing GPU-efficient algorithms challenging. In this
work we present GPU Multiway Mergesort (MMS), a new GPU-efficient multiway
mergesort algorithm. MMS employs a new partitioning technique that exposes the
parallelism needed by modern GPU architectures. To the best of our knowledge,
MMS is the first sorting algorithm for the GPU that is asymptotically optimal
in terms of global memory accesses and that is completely free of shared memory
bank conflicts.
We realize an initial implementation of MMS, evaluate its performance on
three modern GPU architectures, and compare it to competitive implementations
available in state-of-the-art GPU libraries. Despite these implementations
being highly optimized, MMS compares favorably, achieving performance
improvements for most random inputs. Furthermore, unlike MMS, state-of-the-art
algorithms are susceptible to bank conflicts. We find that for certain inputs
that cause these algorithms to incur large numbers of bank conflicts, MMS can
achieve up to a 37.6% speedup over its fastest competitor. Overall, even though
its current implementation is not fully optimized, due to its efficient use of
the memory hierarchy, MMS outperforms the fastest comparison-based sorting
implementations available to date.
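As a rough illustration of the general multiway-merge idea the abstract builds on, the following is a minimal sequential Python sketch. It is not the MMS algorithm: it does not model the paper's partitioning technique, GPU parallelism, or shared memory banks. The function name and the parameters `k` and `run_size` are hypothetical, chosen only for this example.

```python
import heapq


def multiway_mergesort(items, k=4, run_size=8):
    """Sort `items` by building sorted runs, then merging k runs per pass.

    Illustrative CPU-only sketch; `k` and `run_size` are assumed parameters,
    not values taken from the paper.
    """
    if k < 2 or run_size < 1:
        raise ValueError("k must be >= 2 and run_size must be >= 1")

    # Phase 1: cut the input into short runs and sort each run independently.
    runs = [sorted(items[i:i + run_size])
            for i in range(0, len(items), run_size)]

    # Phase 2: repeatedly merge groups of up to k runs into longer runs.
    # Each pass reads and writes every element once, so a larger k means
    # fewer passes over the data.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + k]))
                for i in range(0, len(runs), k)]

    return runs[0] if runs else []
```

Merging `k` runs per pass instead of two reduces the number of passes from roughly log base 2 to log base `k` of the run count, which is the usual motivation for multiway merging when each pass over memory is expensive.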
Accelerating Nearest Neighbor Search on Manycore Systems
We develop methods for accelerating metric similarity search that are
effective on modern hardware. Our algorithms factor into easily parallelizable
components, making them simple to deploy and efficient on multicore CPUs and
GPUs. Despite the simple structure of our algorithms, their search performance
is provably sublinear in the size of the database, with a factor dependent only
on its intrinsic dimensionality. We demonstrate that our methods provide
substantial speedups on a range of datasets and hardware platforms. In
particular, we present results on a 48-core server machine, on graphics
hardware, and on a multicore desktop.
Combinatorial Continuous Maximal Flows
Maximum flow (and minimum cut) algorithms have had a strong impact on
computer vision. In particular, graph cuts algorithms provide a mechanism for
the discrete optimization of an energy functional which has been used in a
variety of applications such as image segmentation, stereo, image stitching and
texture synthesis. Algorithms based on the classical formulation of max-flow
defined on a graph are known to exhibit metrication artefacts in the solution.
Therefore, a recent trend has been to instead employ a spatially continuous
maximum flow (or the dual min-cut problem) in these same applications to
produce solutions with no metrication errors. However, known fast continuous
max-flow algorithms have no stopping criteria or have not been proved to
converge. In this work, we revisit the continuous max-flow problem and show
that the analogous discrete formulation is different from the classical
max-flow problem. We then apply an appropriate combinatorial optimization
technique to this combinatorial continuous max-flow (CCMF) problem to find a
null-divergence solution that exhibits no metrication artefacts and may be
solved exactly by a fast, efficient algorithm with provable convergence.
Finally, by exhibiting the dual problem of our CCMF formulation, we clarify the
fact, already proved by Nozawa in the continuous setting, that the max-flow and
the total variation problems are not always equivalent.
Fast and Provably Convergent Algorithms for Gromov-Wasserstein in Graph Data
In this paper, we study the design and analysis of a class of efficient
algorithms for computing the Gromov-Wasserstein (GW) distance tailored to
large-scale graph learning tasks. Armed with the Luo-Tseng error bound
condition (Luo and Tseng, 1992), the two proposed algorithms, Bregman
Alternating Projected Gradient (BAPG) and hybrid Bregman Proximal Gradient
(hBPG), enjoy convergence guarantees. Based on task-specific properties, our
analysis further provides novel theoretical insights to guide the selection of the
best-fit method. As a result, we are able to provide comprehensive experiments
to validate the effectiveness of our methods on a host of tasks, including
graph alignment, graph partition, and shape matching. In terms of both
wall-clock time and modeling performance, the proposed methods achieve
state-of-the-art results.
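For orientation, the GW distance mentioned above is commonly stated, with a squared loss, as the quadratic program below. The notation is an assumption introduced here and does not come from the abstract: $C^{(1)}$ and $C^{(2)}$ denote intra-graph distance (or adjacency) matrices of the two graphs, $p$ and $q$ denote node weight vectors, and $\pi$ is the coupling being optimized.

```latex
\mathrm{GW}\bigl(C^{(1)}, C^{(2)}, p, q\bigr)
  = \min_{\pi \in \Pi(p, q)}
    \sum_{i,k} \sum_{j,l}
    \bigl( C^{(1)}_{ik} - C^{(2)}_{jl} \bigr)^{2} \, \pi_{ij} \, \pi_{kl},
\qquad
\Pi(p, q) = \bigl\{ \pi \ge 0 \;:\; \pi \mathbf{1} = p,\ \pi^{\top} \mathbf{1} = q \bigr\}.
```

The objective is quadratic in $\pi$ and generally non-convex, which is why convergence guarantees for iterative solvers are a point of interest.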