
    An efficient implementation of the Bellman-Ford algorithm for Kepler GPU architectures

    Finding the shortest paths from a single source to all other vertices is a common problem in graph analysis. The Bellman-Ford algorithm solves this single-source shortest path (SSSP) problem and lends itself well to parallelization on many-core architectures. Nevertheless, its high degree of parallelism comes at the cost of low work efficiency: compared to similar algorithms in the literature (e.g., Dijkstra's), it performs much more redundant work, with a consequent waste of power. This article presents a parallel implementation of the Bellman-Ford algorithm that exploits the architectural characteristics of recent GPU architectures (i.e., NVIDIA Kepler, Maxwell) to improve both performance and work efficiency. The article presents several optimizations, oriented both to the algorithm and to the architecture. The experimental results show that the proposed implementation provides an average speedup of 5x over the most efficient existing parallel implementations for SSSP, that it works on graphs where those implementations fail or are inefficient (e.g., graphs with negative-weight edges, sparse graphs), and that it significantly reduces the redundant work caused by the parallelization process.
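
    To make the work-efficiency point concrete, here is a minimal CPU-side Python sketch of frontier-based Bellman-Ford: each round relaxes only the edges leaving vertices whose distance changed in the previous round, which is the core idea such GPU implementations build on. The function name and graph encoding are illustrative assumptions; this is not the paper's CUDA code.

```python
from collections import defaultdict

def bellman_ford_frontier(n, edges, source):
    """Frontier-based Bellman-Ford sketch: per round, relax only edges
    leaving vertices whose distance changed in the previous round.
    Illustrative only; the paper's actual CUDA kernels are not shown."""
    INF = float("inf")
    adj = defaultdict(list)
    for u, v, w in edges:          # negative weights are allowed
        adj[u].append((v, w))
    dist = [INF] * n
    dist[source] = 0
    frontier = {source}
    for _ in range(n - 1):         # at most n-1 rounds absent negative cycles
        if not frontier:
            break                  # converged early: no redundant rounds
        next_frontier = set()
        for u in frontier:         # on a GPU, one thread per frontier vertex
            for v, w in adj[u]:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    next_frontier.add(v)
        frontier = next_frontier
    return dist

# Example: a small graph with a negative-weight edge.
edges = [(0, 1, 4), (0, 2, 1), (2, 1, -2), (1, 3, 3)]
print(bellman_ford_frontier(4, edges, 0))   # [0, -1, 1, 2]
```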

    A Simple Boosting Framework for Transshipment

    Transshipment, also known under the names of earth mover's distance, uncapacitated min-cost flow, or Wasserstein's metric, is an important and well-studied problem that asks to find a flow of minimum cost that routes a general demand vector. Adding to its importance, recent advancements in our understanding of algorithms for transshipment have led to breakthroughs for the fundamental problem of computing shortest paths. Specifically, the recent near-optimal (1+ε)-approximate single-source shortest path algorithms in the parallel and distributed settings crucially solve transshipment as a central step of their approach. The key property that differentiates transshipment from other similar problems like shortest path is the so-called boosting: one can boost a (bad) approximate solution to a near-optimal (1+ε)-approximate solution. This conceptually reduces the problem to finding an approximate solution. However, not all approximations can be boosted -- there have been several proposed approaches that were shown to be susceptible to boosting, and a few others where boosting was left as an open question. The main takeaway of our paper is that any black-box α-approximate transshipment solver that computes a dual solution can be boosted to a (1+ε)-approximate solver. Moreover, we significantly simplify and decouple previous approaches to transshipment (in the sequential, parallel, and distributed settings) by showing that all of them (implicitly) obtain approximate dual solutions. Our analysis is very simple and relies only on the well-known multiplicative weights framework. Furthermore, to keep the paper completely self-contained, we provide a new (and arguably much simpler) analysis of multiplicative weights that leverages well-known optimization tools to bypass the ad-hoc calculations used in the standard analyses.
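
    Since the analysis above rests on the multiplicative weights framework, a textbook Python sketch of that framework (in the standard experts setting) may help orient the reader. It is not the paper's transshipment-specific boosting procedure, and all names and parameters here are illustrative.

```python
import math

def multiplicative_weights(loss_fn, n_experts, T, eta):
    """Textbook multiplicative-weights sketch for the experts setting.
    loss_fn: callable t -> list of per-expert losses in [0, 1].
    Illustrative of the framework only, not the paper's boosting step."""
    weights = [1.0] * n_experts
    total_loss = 0.0
    for t in range(T):
        W = sum(weights)
        probs = [w / W for w in weights]           # play the weighted mixture
        losses = loss_fn(t)
        total_loss += sum(p * l for p, l in zip(probs, losses))
        # Penalize each expert multiplicatively by its observed loss.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    return total_loss, weights

# Toy run: expert 0 always incurs loss 0, expert 1 always incurs loss 1.
avg, w = multiplicative_weights(lambda t: [0.0, 1.0], 2, T=100, eta=0.5)
print(avg, w[0] > w[1])   # total loss stays small; weight shifts to expert 0
```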

    A parallel priority queue with fast updates for GPU architectures

    The high computational throughput of modern graphics processing units (GPUs) makes them the de facto architecture for high-performance computing applications. However, to achieve peak performance, GPUs require highly parallel workloads, as well as memory access patterns that exhibit good locality of reference. As a result, many state-of-the-art algorithms and data structures designed for GPUs sacrifice work-optimality to achieve the necessary parallelism. Furthermore, some abstract data types are avoided completely because no corresponding data structure performs well on the GPU. One such abstract data type is the priority queue. Many well-known algorithms rely on priority queue operations as a building block. While various priority queue structures have been developed that are parallel, cache-aware, or cache-oblivious, none has been shown to be efficient on GPUs. In this paper, we present the parBucketHeap, a parallel, cache-efficient data structure designed for modern GPU architectures that supports standard priority queue operations, as well as bulk update. We analyze the structure in several well-known computational models and show that it provides both optimal parallelism and cache efficiency. We implement the parBucketHeap and, using it, we solve the single-source shortest path (SSSP) problem. Experimental results indicate that, for sufficiently large, dense graphs with high diameter, we outperform current state-of-the-art SSSP algorithms on the GPU by up to a factor of 5. Unlike existing GPU SSSP algorithms, our approach is work-optimal and places significantly less load on the GPU, reducing power consumption.
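
    As a rough illustration of the abstract data type involved, here is a sequential Python sketch of a priority queue supporting decrease-key-style updates and bulk update (via lazy deletion), driving a Dijkstra-style SSSP loop. The parBucketHeap itself is a cache-efficient parallel bucket structure; this sketch only mirrors its interface, and all names are illustrative assumptions.

```python
import heapq

class LazyPQ:
    """Priority queue with lazy decrease-key and bulk update: stale heap
    entries are skipped on pop. Sketches the interface only; the actual
    parBucketHeap uses cache-efficient buckets and parallel merges."""
    def __init__(self):
        self.heap = []          # (key, item) pairs, possibly stale
        self.best = {}          # item -> current best key
    def update(self, item, key):
        if key < self.best.get(item, float("inf")):
            self.best[item] = key
            heapq.heappush(self.heap, (key, item))
    def bulk_update(self, pairs):               # batched relaxations
        for item, key in pairs:
            self.update(item, key)
    def pop_min(self):
        while self.heap:
            key, item = heapq.heappop(self.heap)
            if self.best.get(item) == key:      # skip stale entries
                del self.best[item]
                return item, key
        return None

def sssp(n, adj, source):
    """Dijkstra driven by the queue above (non-negative edge weights)."""
    dist = [float("inf")] * n
    pq = LazyPQ()
    pq.update(source, 0)
    while (top := pq.pop_min()) is not None:
        u, d = top
        dist[u] = d
        pq.bulk_update((v, d + w) for v, w in adj[u] if d + w < dist[v])
    return dist

adj = {0: [(1, 2), (2, 5)], 1: [(2, 1)], 2: []}
print(sssp(3, adj, 0))   # [0, 2, 3]
```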

    Parallel Processing of Large Graphs

    More and more large data collections are gathered worldwide in various IT systems. Many of them have a networked nature and need to be processed and analysed as graph structures. Due to their size, they very often require the parallel paradigm for efficient computation. Three parallel techniques are compared in the paper: MapReduce, its map-side join extension, and Bulk Synchronous Parallel (BSP). They are implemented for two different graph problems: calculation of single-source shortest paths (SSSP) and collective classification of graph nodes by means of relational influence propagation (RIP). The methods and algorithms are applied to several network datasets differing in size and structural profile, originating from three domains: telecommunication, multimedia, and microblogs. The results reveal that iterative graph processing with the BSP implementation always and significantly outperforms MapReduce, by up to a factor of 10, especially for algorithms with many iterations and sparse communication. The MapReduce extension based on map-side join also usually offers noticeably better efficiency, although not as much as BSP. Nevertheless, MapReduce remains a good alternative for enormous networks whose data structures do not fit in local memories.
    Comment: Preprint submitted to Future Generation Computer Systems
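
    The BSP model benchmarked above can be sketched as a Pregel-style vertex program: in each superstep, active vertices process incoming candidate distances and send improved ones to their neighbours, with a barrier between supersteps, halting when no messages remain. The following sequential Python simulation is illustrative only, not the paper's Giraph/Hama implementation.

```python
def bsp_sssp(adj, source):
    """Pregel/BSP-style SSSP, simulated sequentially: vertices exchange
    candidate distances in supersteps and halt when no messages remain.
    A sketch of the paradigm, not the paper's actual implementation."""
    INF = float("inf")
    dist = {v: INF for v in adj}
    inbox = {source: [0]}
    superstep = 0
    while inbox:                        # run until no messages are in flight
        outbox = {}
        for v, msgs in inbox.items():   # each active vertex runs compute()
            best = min(msgs)
            if best < dist[v]:
                dist[v] = best
                for u, w in adj[v]:     # send improved candidates onward
                    outbox.setdefault(u, []).append(best + w)
        inbox = outbox                  # barrier: delivery next superstep
        superstep += 1
    return dist, superstep

adj = {0: [(1, 1), (2, 4)], 1: [(2, 1)], 2: [(3, 2)], 3: []}
print(bsp_sssp(adj, 0))   # ({0: 0, 1: 1, 2: 2, 3: 4}, 4)
```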

    Distributed Approximation Algorithms for Weighted Shortest Paths

    A distributed network is modeled by a graph having n nodes (processors) and diameter D. We study the time complexity of approximating weighted (undirected) shortest paths on distributed networks with an O(log n) bandwidth restriction on edges (the standard synchronous CONGEST model). The question of whether approximation algorithms help speed up the shortest paths (more precisely, distance computation) has been raised since at least 2004 by Elkin (SIGACT News 2004). The unweighted case of this problem is well-understood, while its weighted counterpart is a fundamental problem in the area of distributed approximation algorithms and remains widely open. We present new algorithms for computing both single-source shortest paths (SSSP) and all-pairs shortest paths (APSP) in the weighted case. Our main result is an algorithm for SSSP. Previous results are the classic O(n)-time Bellman-Ford algorithm and an Õ(n^{1/2+1/2k} + D)-time (8k⌈log(k+1)⌉ − 1)-approximation algorithm, for any integer k ≥ 1, which follows from the result of Lenzen and Patt-Shamir (STOC 2013). (Note that Lenzen and Patt-Shamir in fact solve a harder problem, and we use Õ(·) to hide the O(poly log n) term.) We present an Õ(n^{1/2} D^{1/4} + D)-time (1+o(1))-approximation algorithm for SSSP. This algorithm is sublinear-time as long as D is sublinear, thus yielding a sublinear-time algorithm with an almost optimal solution. When D is small, our running time matches the lower bound of Ω̃(n^{1/2} + D) by Das Sarma et al. (SICOMP 2012), which holds even when D = Θ(log n), up to a poly log n factor.
    Comment: Full version of STOC 2014 paper
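
    The classic O(n)-round baseline mentioned above is distributed Bellman-Ford; run for only h rounds, it computes shortest distances over paths of at most h hops, a bounded-hop primitive that sublinear-time algorithms combine with overlay ("skeleton") graphs. Below is a sequential Python simulation of that h-round primitive; it is a sketch under those stated assumptions, not the paper's algorithm.

```python
def hop_limited_bf(n, edges, source, h):
    """h rounds of synchronous distributed Bellman-Ford, simulated:
    after h rounds every node knows its shortest distance over paths
    of at most h hops. With h = n-1 this is the classic O(n)-round
    baseline; the paper's sublinear-time algorithm is not shown."""
    INF = float("inf")
    dist = [INF] * n
    dist[source] = 0
    for _ in range(h):                  # one synchronous CONGEST round each
        new = dist[:]
        for u, v, w in edges:           # each edge carries one small message
            if dist[u] + w < new[v]:
                new[v] = dist[u] + w
            if dist[v] + w < new[u]:    # edges are undirected
                new[u] = dist[v] + w
        dist = new
    return dist

edges = [(0, 1, 3), (1, 2, 1), (0, 2, 10)]
print(hop_limited_bf(3, edges, 0, h=1))   # [0, 3, 10]  (1-hop distances)
print(hop_limited_bf(3, edges, 0, h=2))   # [0, 3, 4]   (true distances)
```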