
    Tree Contraction, Connected Components, Minimum Spanning Trees: a GPU Path to Vertex Fitting

    Standard parallel computing operations are considered in the context of algorithms for solving 3D graph problems which have applications, e.g., in vertex finding in HEP. Exploiting GPUs for tree-accumulation and graph algorithms is challenging: GPUs offer extreme computational power and high memory-access bandwidth, but their model of fine-grained parallelism is ill suited to the irregular, linked representations typical of graph data structures. Achieving data-race-free computation may demand serialization through atomic transactions, which inevitably degrades parallel performance. A Minimum Spanning Tree algorithm for GPUs is presented, its implementation discussed, and its efficiency evaluated on GPU and multicore architectures.
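
    Editorial note: as a point of reference for the contraction-based MST pattern that GPU implementations typically build on, below is a minimal sequential Python sketch of Boruvka-style MST construction. It is an illustrative assumption, not the paper's GPU algorithm; the edge-list format and the union-find helper are hypothetical choices.

    def boruvka_mst(num_vertices, edges):
        # edges: list of (weight, u, v) tuples; returns a list of MST edges.
        parent = list(range(num_vertices))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        mst = []
        components = num_vertices
        while components > 1:
            cheapest = {}  # component root -> lightest incident edge
            for w, u, v in edges:
                ru, rv = find(u), find(v)
                if ru == rv:
                    continue
                for r in (ru, rv):
                    if r not in cheapest or w < cheapest[r][0]:
                        cheapest[r] = (w, u, v)
            if not cheapest:
                break  # remaining components have no connecting edges (disconnected graph)
            for w, u, v in cheapest.values():
                ru, rv = find(u), find(v)
                if ru != rv:
                    parent[ru] = rv  # contract the two components
                    mst.append((u, v, w))
                    components -= 1
        return mst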

    A Faster Distributed Single-Source Shortest Paths Algorithm

    We devise new algorithms for the single-source shortest paths (SSSP) problem with non-negative edge weights in the CONGEST model of distributed computing. While close-to-optimal solutions, in terms of the number of rounds spent by the algorithm, have recently been developed for computing SSSP approximately, the fastest known exact algorithms are still far away from matching the lower bound of $\tilde\Omega(\sqrt{n} + D)$ rounds by Peleg and Rubinovich [SIAM Journal on Computing 2000], where $n$ is the number of nodes in the network and $D$ is its diameter. The state of the art is Elkin's randomized algorithm [STOC 2017] that performs $\tilde O(n^{2/3} D^{1/3} + n^{5/6})$ rounds. We significantly improve upon this upper bound with our two new randomized algorithms for polynomially bounded integer edge weights, the first performing $\tilde O(\sqrt{nD})$ rounds and the second performing $\tilde O(\sqrt{n}\, D^{1/4} + n^{3/5} + D)$ rounds. Our bounds also compare favorably to the independent result by Ghaffari and Li [STOC 2018]. As side results, we obtain a $(1+\epsilon)$-approximation $\tilde O((\sqrt{n}\, D^{1/4} + D)/\epsilon)$-round algorithm for directed SSSP and a new work/depth trade-off for exact SSSP on directed graphs in the PRAM model. Comment: Presented at the 59th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2018).
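
    Editorial note, a back-of-the-envelope comparison not taken from the abstract: for low-diameter networks with $D = \operatorname{polylog} n$, Elkin's bound is dominated by its $n^{5/6}$ term, whereas the first new bound becomes $\tilde O(\sqrt{n})$, matching the $\tilde\Omega(\sqrt{n} + D)$ lower bound up to polylogarithmic factors. For a larger diameter such as $D = n^{1/3}$, the second bound evaluates to $\tilde O(n^{7/12} + n^{3/5} + n^{1/3}) = \tilde O(n^{3/5})$, still well below Elkin's $\tilde O(n^{7/9} + n^{5/6}) = \tilde O(n^{5/6})$.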

    Near-Optimal Approximate Shortest Paths and Transshipment in Distributed and Streaming Models

    We present a method for solving the transshipment problem - also known as uncapacitated minimum cost flow - up to a multiplicative error of $1 + \varepsilon$ in undirected graphs with non-negative edge weights using a tailored gradient descent algorithm. Using $\tilde{O}(\cdot)$ to hide polylogarithmic factors in $n$ (the number of nodes in the graph), our gradient descent algorithm takes $\tilde O(\varepsilon^{-2})$ iterations, and in each iteration it solves an instance of the transshipment problem up to a multiplicative error of $\operatorname{polylog} n$. In particular, this allows us to perform a single iteration by computing a solution on a sparse spanner of logarithmic stretch. Using a randomized rounding scheme, we can further extend the method to finding approximate solutions for the single-source shortest paths (SSSP) problem. As a consequence, we improve upon prior work by obtaining the following results: (1) Broadcast CONGEST model: $(1 + \varepsilon)$-approximate SSSP using $\tilde{O}((\sqrt{n} + D)\varepsilon^{-3})$ rounds, where $D$ is the (hop) diameter of the network. (2) Broadcast congested clique model: $(1 + \varepsilon)$-approximate transshipment and SSSP using $\tilde{O}(\varepsilon^{-2})$ rounds. (3) Multipass streaming model: $(1 + \varepsilon)$-approximate transshipment and SSSP using $\tilde{O}(n)$ space and $\tilde{O}(\varepsilon^{-2})$ passes. The previously fastest SSSP algorithms for these models leverage sparse hop sets. We bypass the hop set construction; computing a spanner is sufficient with our method. The above bounds assume non-negative edge weights that are polynomially bounded in $n$; for general non-negative weights, running times scale with the logarithm of the maximum ratio between non-zero weights. Comment: Accepted to SIAM Journal on Computing. Preliminary version in DISC 2017. Abstract shortened to fit arXiv's limitation of 1920 characters.
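
    Editorial note: since the abstract observes that computing a sparse spanner of logarithmic stretch suffices, below is a minimal sequential Python sketch of the classic greedy $(2k-1)$-spanner construction. It is provided for orientation only, not as the paper's distributed procedure; the Dijkstra helper and function names are illustrative assumptions.

    import heapq

    def dijkstra_dist(adj, src, dst, cutoff):
        # Shortest src->dst distance in the current spanner; stops once keys exceed `cutoff`.
        dist = {src: 0.0}
        pq = [(0.0, src)]
        while pq:
            d, u = heapq.heappop(pq)
            if d > dist.get(u, float('inf')):
                continue  # stale heap entry
            if u == dst:
                return d
            if d > cutoff:
                break
            for v, w in adj[u]:
                nd = d + w
                if nd < dist.get(v, float('inf')):
                    dist[v] = nd
                    heapq.heappush(pq, (nd, v))
        return dist.get(dst, float('inf'))

    def greedy_spanner(n, edges, k):
        # edges: list of (w, u, v); returns the edges of a (2k-1)-spanner.
        stretch = 2 * k - 1
        adj = [[] for _ in range(n)]
        spanner = []
        for w, u, v in sorted(edges):
            # Keep (u, v) only if the current spanner does not already approximate it.
            if dijkstra_dist(adj, u, v, stretch * w) > stretch * w:
                adj[u].append((v, w))
                adj[v].append((u, w))
                spanner.append((u, v, w))
        return spanner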

    Matching Is as Easy as the Decision Problem, in the NC Model

    Is matching in NC, i.e., is there a deterministic fast parallel algorithm for it? This has been an outstanding open question in TCS for over three decades, ever since the discovery of randomized NC matching algorithms [KUW85, MVV87]. Over the last five years, the theoretical computer science community has launched a relentless attack on this question, leading to the discovery of several powerful ideas. We give what appears to be the culmination of this line of work: an NC algorithm for finding a minimum-weight perfect matching in a general graph with polynomially bounded edge weights, provided it is given an oracle for the decision problem. Consequently, for settling the main open problem, it suffices to obtain an NC algorithm for the decision problem. We believe this new fact has qualitatively changed the nature of this open problem. All known efficient matching algorithms for general graphs follow one of two approaches, given by Edmonds [Edm65] and Lovász [Lov79]. Our oracle-based algorithm follows a new approach and uses many of the ideas discovered in the last five years. The difficulty of obtaining an NC perfect matching algorithm led researchers to study matching vis-a-vis clever relaxations of the class NC. In this vein, recently Goldwasser and Grossman [GG15] gave a pseudo-deterministic RNC algorithm for finding a perfect matching in a bipartite graph, i.e., an RNC algorithm with the additional requirement that on the same graph, it should return the same (i.e., unique) perfect matching for almost all choices of random bits. A corollary of our reduction is an analogous algorithm for general graphs. Comment: Appeared in ITCS 2020.
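
    Editorial note: to illustrate what "given an oracle for the decision problem" means, here is the folklore sequential search-to-decision self-reduction in Python. It is only a sketch under the assumption of an oracle `has_perfect_matching`; it is inherently sequential and does not capture the paper's NC (parallel) reduction.

    def find_perfect_matching(edges, has_perfect_matching):
        # edges: set of frozenset({u, v}); has_perfect_matching: decision oracle on an edge set.
        # Greedily delete edges while a perfect matching still exists; by minimality,
        # the surviving edge set is itself a perfect matching.
        if not has_perfect_matching(edges):
            return None
        current = set(edges)
        for e in list(edges):
            trial = current - {e}
            if has_perfect_matching(trial):
                current = trial  # edge e is not needed for some perfect matching
        return current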

    Execution replay and debugging

    As most parallel and distributed programs are internally non-deterministic -- consecutive runs with the same input might result in a different program flow -- vanilla cyclic debugging techniques as such are useless. In order to use cyclic debugging tools, we need a tool that records information about an execution so that it can be replayed for debugging. Because recording information interferes with the execution, we must limit the amount of information and keep the processing of the information fast. This paper contains a survey of existing execution replay techniques and tools. Comment: In M. Ducasse (ed), proceedings of the Fourth International Workshop on Automated Debugging (AADebug 2000), August 2000, Munich. cs.SE/001003
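
    Editorial note: a toy Python sketch of the record/replay idea for a single source of non-determinism (here, random choices), assuming a simple JSON trace file. Real record/replay tools intercept far more (message ordering, system calls, shared-memory races); all names below are invented for illustration.

    import json, random

    class Recorder:
        # Records every non-deterministic value observed during a run.
        def __init__(self):
            self.log = []
        def choice(self, options):
            value = random.choice(options)   # the non-deterministic decision
            self.log.append(value)
            return value
        def save(self, path):
            with open(path, 'w') as f:
                json.dump(self.log, f)

    class Replayer:
        # Feeds back the recorded values so a re-run follows the same program flow.
        def __init__(self, path):
            with open(path) as f:
                self.log = iter(json.load(f))
        def choice(self, options):
            return next(self.log)

    def run(source):
        # The program under debugging; its control flow depends on source.choice.
        total = 0
        for _ in range(5):
            total += source.choice([1, 2, 3])
        return total

    rec = Recorder()
    first = run(rec)
    rec.save('trace.json')
    assert run(Replayer('trace.json')) == first   # deterministic replay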

    Machine Learning at Microsoft with ML .NET

    Machine Learning is transitioning from an art and science into a technology available to every developer. In the near future, every application on every platform will incorporate trained models to encode data-based decisions that would be impossible for developers to author. This presents a significant engineering challenge, since currently data science and modeling are largely decoupled from standard software development processes. This separation makes incorporating machine learning capabilities inside applications unnecessarily costly and difficult, and furthermore discourages developers from embracing ML in the first place. In this paper we present ML .NET, a framework developed at Microsoft over the last decade in response to the challenge of making it easy to ship machine learning models in large software applications. We present its architecture, and illuminate the application demands that shaped it. Specifically, we introduce DataView, the core data abstraction of ML .NET which allows it to capture full predictive pipelines efficiently and consistently across training and inference lifecycles. We close the paper with a surprisingly favorable performance study of ML .NET compared to more recent entrants, and a discussion of some lessons learned.
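
    Editorial note: ML .NET itself is a C# framework; as a rough conceptual analogy only (in Python, not ML .NET's API), a lazily composed pipeline whose transforms are applied identically at training and inference time might look like the sketch below. All class and method names are invented for illustration.

    class Pipeline:
        def __init__(self, steps=None):
            self.steps = steps or []          # list of (fit, transform) pairs
            self.fitted = []
        def append(self, fit, transform):
            return Pipeline(self.steps + [(fit, transform)])
        def fit(self, rows):
            self.fitted = []
            for fit, transform in self.steps:
                state = fit(rows)             # learn this step's parameters from the data
                self.fitted.append((state, transform))
                rows = [transform(state, r) for r in rows]
            return self
        def transform(self, rows):
            # The same chain, with the learned state, is reused at inference time.
            for state, transform in self.fitted:
                rows = [transform(state, r) for r in rows]
            return rows

    # Example: a normalization step whose scale is learned during fit.
    pipe = Pipeline().append(
        fit=lambda rows: max(abs(r) for r in rows) or 1.0,
        transform=lambda scale, r: r / scale,
    )
    pipe.fit([2.0, -4.0, 1.0])
    print(pipe.transform([8.0]))   # uses the scale learned from the training data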

    An Efficient Multiway Mergesort for GPU Architectures

    Sorting is a primitive operation that is a building block for countless algorithms. As such, it is important to design sorting algorithms that approach peak performance on a range of hardware architectures. Graphics Processing Units (GPUs) are particularly attractive architectures as they provide massive parallelism and computing power. However, the intricacies of their compute and memory hierarchies make designing GPU-efficient algorithms challenging. In this work we present GPU Multiway Mergesort (MMS), a new GPU-efficient multiway mergesort algorithm. MMS employs a new partitioning technique that exposes the parallelism needed by modern GPU architectures. To the best of our knowledge, MMS is the first sorting algorithm for the GPU that is asymptotically optimal in terms of global memory accesses and that is completely free of shared memory bank conflicts. We realize an initial implementation of MMS, evaluate its performance on three modern GPU architectures, and compare it to competitive implementations available in state-of-the-art GPU libraries. Despite these implementations being highly optimized, MMS compares favorably, achieving performance improvements for most random inputs. Furthermore, unlike MMS, state-of-the-art algorithms are susceptible to bank conflicts. We find that for certain inputs that cause these algorithms to incur large numbers of bank conflicts, MMS can achieve up to a 37.6% speedup over its fastest competitor. Overall, even though its current implementation is not fully optimized, due to its efficient use of the memory hierarchy, MMS outperforms the fastest comparison-based sorting implementations available to date.
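
    Editorial note: to ground the terminology, here is a minimal sequential multiway mergesort in Python. It illustrates only the k-way merge primitive; it does not reproduce the paper's GPU partitioning technique or its bank-conflict-free memory layout, and all function names are illustrative.

    import heapq

    def multiway_merge(runs):
        # Sequential k-way merge of sorted runs using a min-heap keyed by (value, run, index).
        heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
        heapq.heapify(heap)
        out = []
        while heap:
            value, i, j = heapq.heappop(heap)
            out.append(value)
            if j + 1 < len(runs[i]):
                heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
        return out

    def mergesort(data, k=4):
        # Multiway mergesort: split into at most k runs, sort each, then k-way merge.
        if len(data) <= 1:
            return list(data)
        step = (len(data) + k - 1) // k
        runs = [mergesort(data[i:i + step], k) for i in range(0, len(data), step)]
        return multiway_merge(runs)

    print(mergesort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0]))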