
    Cache-Oblivious Data Structures and Algorithms for Undirected Breadth-First Search and Shortest Paths

    We present improved cache-oblivious data structures and algorithms for breadth-first search (BFS) on undirected graphs and the single-source shortest path (SSSP) problem on undirected graphs with non-negative edge weights. For the SSSP problem, our result closes the performance gap between the currently best cache-aware algorithm and its cache-oblivious counterpart. Our cache-oblivious SSSP algorithm takes nearly full advantage of block transfers for dense graphs. The algorithm relies on a new data structure, called the bucket heap, which is the first cache-oblivious priority queue to efficiently support a weak DECREASEKEY operation. For the BFS problem, we reduce the number of I/Os for sparse graphs by a factor of nearly sqrt{B}, where B is the cache-block size, nearly closing the performance gap between the currently best cache-aware and cache-oblivious algorithms.
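
    As a rough illustration of why a DECREASEKEY-style operation matters for SSSP, the sketch below runs Dijkstra's algorithm on a small undirected graph. It is not the bucket heap from the abstract and is not cache-oblivious: Python's in-memory `heapq` stands in for the priority queue, and decrease-key is emulated by pushing a fresh entry and skipping stale ones on extraction. The function names `sssp` and `undirected` and the example graph are hypothetical.

```python
import heapq

def undirected(edges):
    """Build an adjacency list from (u, v, weight) triples, adding both directions."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))
    return graph

def sssp(graph, source):
    """Dijkstra's algorithm for non-negative edge weights.

    Assumption: decrease-key is emulated by re-inserting the vertex with its
    smaller tentative distance; outdated heap entries are skipped when popped.
    """
    dist = {source: 0}
    settled = set()
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale entry left behind by an earlier "decrease"
        settled.add(u)
        for v, w in graph.get(u, ()):
            candidate = d + w
            if v not in settled and candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))  # emulated decrease-key
    return dist

# Hypothetical example graph.
example = undirected([("a", "b", 4), ("a", "c", 1), ("c", "b", 2), ("b", "d", 5)])
distances = sssp(example, "a")
print(distances)
```

    Re-insertion keeps the sketch short but can leave up to one heap entry per edge, which is the kind of overhead a queue with a native decrease-key or update operation is meant to avoid.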

    A parallel priority queue with fast updates for GPU architectures

    The high computational throughput of modern graphics processing units (GPUs) makes them the de facto architecture for high-performance computing applications. However, to achieve peak performance, GPUs require highly parallel workloads, as well as memory access patterns that exhibit good locality of reference. As a result, many state-of-the-art algorithms and data structures designed for GPUs sacrifice work-optimality to achieve the necessary parallelism. Furthermore, some abstract data types are avoided completely because no corresponding data structure performs well on the GPU. One such abstract data type is the priority queue. Many well-known algorithms rely on priority queue operations as a building block. While various priority queue structures have been developed that are parallel, cache-aware, or cache-oblivious, none has been shown to be efficient on GPUs. In this paper, we present the parBucketHeap, a parallel, cache-efficient data structure designed for modern GPU architectures that supports standard priority queue operations, as well as bulk update. We analyze the structure in several well-known computational models and show that it both provides optimal parallelism and is cache-efficient. We implement the parBucketHeap and use it to solve the single-source shortest path (SSSP) problem. Experimental results indicate that, for sufficiently large, dense graphs with high diameter, we outperform current state-of-the-art SSSP algorithms on the GPU by up to a factor of 5. Unlike existing GPU SSSP algorithms, our approach is work-optimal and places significantly less load on the GPU, reducing power consumption.

    I/O-optimal algorithms on grid graphs

    Given a graph whose n vertices form a regular two-dimensional grid, and in which each (possibly weighted and/or directed) edge connects a vertex to one of its eight neighbours, the following can be done in O(scan(n)) I/Os, provided M = Omega(B^2): computation of shortest paths with non-negative edge weights from a single source, breadth-first traversal, computation of a minimum spanning tree, topological sorting, time-forward processing (if the input is a plane graph), and an Euler tour (if the input graph is a tree). The minimum-spanning tree algorithm is cache-oblivious. The best previously published algorithms for these problems need Theta(sort(n)) I/Os. Estimates of the actual I/O volume show that the new algorithms may often be very efficient in practice.
    Comment: 12 pages' extended abstract plus 12 pages' appendix with details, proofs and calculations. Has not been published in and is currently not under review of any conference or journal.
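
    To give a feel for the gap between O(scan(n)) and Theta(sort(n)), the sketch below evaluates the standard I/O-model expressions scan(n) = n/B and sort(n) = (n/B) log_{M/B}(n/B) with all hidden constants dropped. The parameter values and the function names `scan_ios` and `sort_ios` are assumptions chosen for illustration; they do not come from the paper.

```python
import math

def scan_ios(n, B):
    """Blocks read when scanning n items with block size B (constants dropped)."""
    return math.ceil(n / B)

def sort_ios(n, M, B):
    """(n/B) * log_{M/B}(n/B), the sorting bound with constants dropped."""
    blocks = n / B
    return blocks * math.log(blocks, M / B)

# Hypothetical machine parameters, chosen so that M >= B^2 as the abstract requires.
B = 2**10
M = 4 * B**2

for n in (2**24, 2**30, 2**36):
    ratio = sort_ios(n, M, B) / scan_ios(n, B)
    print(f"n = 2^{int(math.log2(n))}: sort(n)/scan(n) is about {ratio:.2f}")
```

    Because constants are ignored, the printed ratios only indicate how the logarithmic factor grows with n; they are not estimates of real I/O volume.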

    External memory priority queues with decrease-key and applications to graph algorithms

    We present priority queues in the external memory model with block size B and main memory size M that support, on N elements, operation Update (a combination of operations Insert and DecreaseKey) in O((1/B) log_{M/B} (N/B)) amortized I/Os and operations ExtractMin and Delete in O(ceil[(M^epsilon/B) log_{M/B} (N/B)] log_{M/B} (N/B)) amortized I/Os, for any real epsilon in (0,1), using O((N/B) log_{M/B} (N/B)) blocks. Previous I/O-efficient priority queues either support these operations in O((1/B) log_2 (N/B)) amortized I/Os [Kumar and Schwabe, SPDP '96] or support only operations Insert, Delete and ExtractMin in optimal O((1/B) log_{M/B} (N/B)) amortized I/Os, however without supporting DecreaseKey [Fadel et al., TCS '99]. We also present buffered repository trees that support, on a multi-set of N elements, operation Insert in O((1/B) log_{M/B} (N/B)) I/Os and operation Extract on K extracted elements in O(M^epsilon log_{M/B} (N/B) + K/B) amortized I/Os, using O(N/B) blocks. Previous results achieve O((1/B) log_2 (N/B)) I/Os and O(log_2 (N/B) + K/B) I/Os, respectively [Buchsbaum et al., SODA '00]. Our results imply improved O((E/B) log_{M/B} (E/B)) I/Os for single-source shortest paths, depth-first search and breadth-first search algorithms on massive directed dense graphs (V,E) with E = Omega(V^(1+epsilon)), epsilon > 0 and V = Omega(M), which is equal to the I/O-optimal bound for sorting E values in external memory.
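
    The difference between the two amortized Update bounds quoted above is the base of the logarithm: base M/B in the new bound versus base 2 in the earlier one. The sketch below evaluates both expressions with constants dropped, so the ratio reduces to log_2(M/B). The values of N, M and B and the function names are hypothetical.

```python
import math

def update_cost_base_mb(N, M, B):
    """(1/B) * log_{M/B}(N/B): the amortized Update bound with base-M/B logarithm."""
    return math.log(N / B, M / B) / B

def update_cost_base_2(N, B):
    """(1/B) * log_2(N/B): the earlier bound with base-2 logarithm."""
    return math.log2(N / B) / B

# Hypothetical parameters for illustration only.
N, M, B = 2**30, 2**22, 2**10

improvement = update_cost_base_2(N, B) / update_cost_base_mb(N, M, B)
print(f"base-2 bound / base-(M/B) bound: {improvement:.2f}")
print(f"log2(M/B): {math.log2(M / B):.2f}")
```

    Under these assumptions both printed values agree, which reflects the change-of-base identity log_2(x) = log_{M/B}(x) * log_2(M/B) rather than any measured performance.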

    Fine-grained I/O Complexity via Reductions: New Lower Bounds, Faster Algorithms, and a Time Hierarchy

    This paper initiates the study of I/O algorithms (minimizing cache misses) from the perspective of fine-grained complexity (conditional polynomial lower bounds). Specifically, we aim to answer why sparse graph problems are so hard, and why the Longest Common Subsequence problem gets a savings of a factor of the size of cache times the length of a cache line, but no more. We take the reductions and techniques from complexity and fine-grained complexity and apply them to the I/O model to generate new (conditional) lower bounds as well as new faster algorithms. We also prove the existence of a time hierarchy for the I/O model, which motivates the fine-grained reductions.
    - Using fine-grained reductions, we give an algorithm for distinguishing 2 vs. 3 diameter and radius that runs in O(|E|^2/(MB)) cache misses, which for sparse graphs improves over the previous O(|V|^2/B) running time.
    - We give new reductions from radius and diameter to Wiener index and median. These reductions are new in both the RAM and I/O models.
    - We show meaningful reductions between problems that have linear-time solutions in the RAM model. The reductions use low I/O complexity (typically O(n/B)), and thus help to finely capture the relationship between "I/O linear time" O(n/B) and RAM linear time O(n).
    - We generate new I/O assumptions based on the difficulty of improving sparse graph problem running times in the I/O model. We create conjectures that the current best known algorithms for Single Source Shortest Paths (SSSP), diameter, and radius are optimal.
    - From these I/O-model assumptions, we show that many of the known reductions in the word-RAM model can naturally extend to hold in the I/O model as well (e.g., a lower bound on the I/O complexity of Longest Common Subsequence that matches the best known running time).
    - We prove an analog of the Time Hierarchy Theorem in the I/O model, further motivating the study of fine-grained algorithmic differences.

    Parallel Cache-Efficient Algorithms on GPUs
