Tree Contraction, Connected Components, Minimum Spanning Trees: a GPU Path to Vertex Fitting
Standard parallel computing operations are considered in the context of algorithms for solving 3D graph problems which have applications, e.g., in vertex finding in high-energy physics (HEP). Exploiting GPUs for tree-accumulation and graph algorithms is challenging: GPUs offer extreme computational power and high memory-access bandwidth, but their model of fine-grained parallelism may not suit the irregular distribution of linked representations of graph data structures. Achieving data-race-free computations may demand serialization through atomic transactions, inevitably producing poor parallel performance. A Minimum Spanning Tree algorithm for GPUs is presented, its implementation discussed, and its efficiency evaluated on GPU and multicore architectures.
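As a rough illustration of the kind of Minimum Spanning Tree computation discussed in this abstract, the following is a minimal sequential sketch of Borůvka's algorithm. The abstract does not name the algorithm it uses; Borůvka is chosen here only as an assumption, because each round picks the cheapest outgoing edge of every component independently, which is the step that parallelizes naturally. The function name `boruvka_mst` and the `(weight, u, v)` edge format are hypothetical.

```python
def boruvka_mst(num_vertices, edges):
    """Sequential Boruvka sketch. edges: list of (weight, u, v) tuples.

    Returns (total_weight, chosen_edges). For a disconnected graph the
    result is a minimum spanning forest.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    components = num_vertices
    while components > 1:
        # Step 1: every component selects its cheapest outgoing edge.
        # Comparing whole tuples breaks weight ties consistently, which
        # prevents cycles. On a GPU this loop is the data-parallel part.
        cheapest = {}
        for edge in edges:
            _, u, v = edge
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for root in (ru, rv):
                if root not in cheapest or edge < cheapest[root]:
                    cheapest[root] = edge
        if not cheapest:
            break  # no edge leaves any component: graph is disconnected
        # Step 2: contract along the selected edges.
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                chosen.append((w, u, v))
                components -= 1
    return sum(w for w, _, _ in chosen), chosen


example_edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3), (5, 1, 3)]
total, tree = boruvka_mst(4, example_edges)
```

In a parallel setting, the contraction in step 2 is where the data races mentioned in the abstract would arise, since several components may try to merge at once.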
A Faster Distributed Single-Source Shortest Paths Algorithm
We devise new algorithms for the single-source shortest paths (SSSP) problem
with non-negative edge weights in the CONGEST model of distributed computing.
While close-to-optimal solutions, in terms of the number of rounds spent by the
algorithm, have recently been developed for computing SSSP approximately, the
fastest known exact algorithms are still far away from matching the lower bound
of rounds by Peleg and Rubinovich [SIAM
Journal on Computing 2000], where is the number of nodes in the network
and is its diameter. The state of the art is Elkin's randomized algorithm
[STOC 2017] that performs rounds. We
significantly improve upon this upper bound with our two new randomized
algorithms for polynomially bounded integer edge weights, the first performing
rounds and the second performing rounds. Our bounds also compare favorably to the
independent result by Ghaffari and Li [STOC 2018]. As side results, we obtain a
-approximation -round algorithm for directed SSSP and a new work/depth trade-off for exact
SSSP on directed graphs in the PRAM model.
Comment: Presented at the 59th Annual IEEE Symposium on Foundations of
Computer Science (FOCS 2018)
Near-Optimal Approximate Shortest Paths and Transshipment in Distributed and Streaming Models
We present a method for solving the transshipment problem - also known as
uncapacitated minimum cost flow - up to a multiplicative error of in undirected graphs with non-negative edge weights using a
tailored gradient descent algorithm. Using to hide
polylogarithmic factors in (the number of nodes in the graph), our gradient
descent algorithm takes iterations, and in each
iteration it solves an instance of the transshipment problem up to a
multiplicative error of . In particular, this allows
us to perform a single iteration by computing a solution on a sparse spanner of
logarithmic stretch. Using a randomized rounding scheme, we can further extend
the method to finding approximate solutions for the single-source shortest
paths (SSSP) problem. As a consequence, we improve upon prior work by obtaining
the following results: (1) Broadcast CONGEST model: -approximate SSSP using rounds, where is the (hop) diameter of the network.
(2) Broadcast congested clique model: -approximate
transshipment and SSSP using rounds. (3)
Multipass streaming model: -approximate transshipment and
SSSP using space and passes. The
previously fastest SSSP algorithms for these models leverage sparse hop sets.
We bypass the hop set construction; computing a spanner is sufficient with our
method. The above bounds assume non-negative edge weights that are polynomially
bounded in ; for general non-negative weights, running times scale with the
logarithm of the maximum ratio between non-zero weights.
Comment: Accepted to SIAM Journal on Computing. Preliminary version in DISC
2017. Abstract shortened to fit arXiv's limitation to 1920 characters
Matching Is as Easy as the Decision Problem, in the NC Model
Is matching in NC, i.e., is there a deterministic fast parallel algorithm for
it? This has been an outstanding open question in TCS for over three decades,
ever since the discovery of randomized NC matching algorithms [KUW85, MVV87].
Over the last five years, the theoretical computer science community has
launched a relentless attack on this question, leading to the discovery of
several powerful ideas. We give what appears to be the culmination of this line
of work: An NC algorithm for finding a minimum-weight perfect matching in a
general graph with polynomially bounded edge weights, provided it is given an
oracle for the decision problem. Consequently, for settling the main open
problem, it suffices to obtain an NC algorithm for the decision problem. We
believe this new fact has qualitatively changed the nature of this open
problem.
All known efficient matching algorithms for general graphs follow one of two
approaches: given by Edmonds [Edm65] and Lovász [Lov79]. Our oracle-based
algorithm follows a new approach and uses many of the ideas discovered in the
last five years.
The difficulty of obtaining an NC perfect matching algorithm led researchers
to study matching vis-a-vis clever relaxations of the class NC. In this vein,
recently Goldwasser and Grossman [GG15] gave a pseudo-deterministic RNC
algorithm for finding a perfect matching in a bipartite graph, i.e., an RNC
algorithm with the additional requirement that on the same graph, it should
return the same (i.e., unique) perfect matching for almost all choices of
random bits. A corollary of our reduction is an analogous algorithm for general
graphs.
Comment: Appeared in ITCS 202
Execution replay and debugging
As most parallel and distributed programs are internally non-deterministic --
consecutive runs with the same input might result in a different program flow
-- vanilla cyclic debugging techniques as such are useless. In order to use
cyclic debugging tools, we need a tool that records information about an
execution so that it can be replayed for debugging. Because recording
information interferes with the execution, we must limit the amount of
information and keep the processing of the information fast. This paper
contains a survey of existing execution replay techniques and tools.
Comment: In M. Ducasse (ed), proceedings of the Fourth International Workshop
on Automated Debugging (AADebug 2000), August 2000, Munich. cs.SE/001003
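The record-then-replay idea described in this abstract can be sketched as follows. This is a toy illustration under stated assumptions, not a technique taken from the surveyed tools: the only source of non-determinism is the order in which simulated workers are scheduled, and the names `ReplayLog` and `run` are hypothetical. Only the scheduling choices are logged, reflecting the abstract's point that the recorded information should be kept small.

```python
import random


class ReplayLog:
    """Records non-deterministic choices, or replays a previous recording."""

    def __init__(self, recorded=None):
        self.replaying = recorded is not None
        self.entries = list(recorded) if recorded is not None else []
        self.cursor = 0

    def choose(self, options):
        if self.replaying:
            # Replay mode: ignore randomness and return the logged choice.
            choice = self.entries[self.cursor]
            self.cursor += 1
            return choice
        # Record mode: make a random choice and log it.
        choice = random.choice(options)
        self.entries.append(choice)
        return choice


def run(log, workers=("A", "B", "C"), steps=3):
    """Simulate an interleaving of worker steps chosen through the log."""
    remaining = {worker: steps for worker in workers}
    trace = []
    while remaining:
        worker = log.choose(sorted(remaining))
        trace.append(worker)
        remaining[worker] -= 1
        if remaining[worker] == 0:
            del remaining[worker]
    return trace


record = ReplayLog()
original = run(record)                      # non-deterministic run
replayed = run(ReplayLog(record.entries))   # deterministic re-execution
```

Because the replayed run consumes the logged choices instead of calling the random source, it reproduces the original interleaving exactly, which is what makes cyclic debugging possible again.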
Machine Learning at Microsoft with ML .NET
Machine Learning is transitioning from an art and science into a technology
available to every developer. In the near future, every application on every
platform will incorporate trained models to encode data-based decisions that
would be impossible for developers to author. This presents a significant
engineering challenge, since currently data science and modeling are largely
decoupled from standard software development processes. This separation makes
incorporating machine learning capabilities inside applications unnecessarily
costly and difficult, and furthermore discourages developers from embracing ML
in the first place. In this paper we present ML .NET, a framework developed at
Microsoft over the last decade in response to the challenge of making it easy
to ship machine learning models in large software applications. We present its
architecture, and illuminate the application demands that shaped it.
Specifically, we introduce DataView, the core data abstraction of ML .NET which
allows it to capture full predictive pipelines efficiently and consistently
across training and inference lifecycles. We close the paper with a
surprisingly favorable performance study of ML .NET compared to more recent
entrants, and a discussion of some lessons learned.
An Efficient Multiway Mergesort for GPU Architectures
Sorting is a primitive operation that is a building block for countless
algorithms. As such, it is important to design sorting algorithms that approach
peak performance on a range of hardware architectures. Graphics Processing
Units (GPUs) are particularly attractive architectures as they provide massive
parallelism and computing power. However, the intricacies of their compute and
memory hierarchies make designing GPU-efficient algorithms challenging. In this
work we present GPU Multiway Mergesort (MMS), a new GPU-efficient multiway
mergesort algorithm. MMS employs a new partitioning technique that exposes the
parallelism needed by modern GPU architectures. To the best of our knowledge,
MMS is the first sorting algorithm for the GPU that is asymptotically optimal
in terms of global memory accesses and that is completely free of shared memory
bank conflicts.
We realize an initial implementation of MMS, evaluate its performance on
three modern GPU architectures, and compare it to competitive implementations
available in state-of-the-art GPU libraries. Despite these implementations
being highly optimized, MMS compares favorably, achieving performance
improvements for most random inputs. Furthermore, unlike MMS, state-of-the-art
algorithms are susceptible to bank conflicts. We find that for certain inputs
that cause these algorithms to incur large numbers of bank conflicts, MMS can
achieve up to a 37.6% speedup over its fastest competitor. Overall, even though
its current implementation is not fully optimized, due to its efficient use of
the memory hierarchy, MMS outperforms the fastest comparison-based sorting
implementations available to date.
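For readers unfamiliar with multiway mergesort, the following is a minimal sequential sketch of the general technique: sort small runs, then repeatedly merge up to `k` runs at a time. It is not the MMS algorithm from this abstract and does not model its GPU partitioning, global memory access pattern, or bank-conflict avoidance. The function name and the `k` and `run_size` parameters are hypothetical.

```python
import heapq


def multiway_mergesort(values, k=4, run_size=8):
    """Sort by forming sorted runs, then merging k runs per pass."""
    if k < 2 or run_size < 1:
        raise ValueError("k must be >= 2 and run_size must be >= 1")
    # Pass 0: cut the input into short runs and sort each one.
    runs = [sorted(values[i:i + run_size])
            for i in range(0, len(values), run_size)]
    # Each later pass merges groups of up to k runs, so the number of
    # passes grows with log base k of the number of initial runs.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + k]))
                for i in range(0, len(runs), k)]
    return runs[0] if runs else []


data = [29, 3, 17, 8, 42, 1, 15, 23, 4, 16, 35, 2, 11, 7, 30, 9, 21]
result = multiway_mergesort(data, k=3, run_size=2)
```

A larger `k` means fewer passes over the data, which is the usual motivation for multiway merging when memory traffic dominates the cost.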