30 research outputs found
Parallel Algorithmic Techniques for Combinatorial Computation
Parallel computation offers the promise of great improvements in the solution of problems that, if we were restricted to sequential computation, would take so much time that solution would be impractical. There is a drawback to the use of parallel computers, however, and that is that they seem to be harder to program. For this reason, parallel algorithms in practice are often restricted to simple problems such as matrix multiplication. Certainly this is useful, and in fact we shall see later some non-obvious uses of matrix manipulation, but many of the large problems requiring solution are of a more complex nature. In particular, an instance of a problem may be structured as an arbitrary graph or tree, rather than in the regular order of a matrix. In this paper we describe a number of algorithmic techniques that have been developed for solving such combinatorial problems. The intent of the paper is to show how the algorithmic tools we present can be used as building blocks for higher level algorithms, and to present pointers to the literature for the reader to look up the specifics of these algorithms. We make no claim to completeness; a number of techniques have been omitted for brevity or because their chief application is not combinatorial in nature. In particular we give very little attention to parallel sorting, although sorting is used as a subroutine in a number of the algorithms we describe. We also only describe algorithms, and not lower bounds, for solving problems in parallel
Recommended from our members
Efficient Linked List Ranking Algorithms and Parentheses Matching as a New Strategy for Parallel Algorithm Design
The goal of a parallel algorithm is to solve a single problem using multiple processors working together and to do so in an efficient manner. In this regard, there is a need to categorize strategies in order to solve broad classes of problems with similar structures and requirements. In this dissertation, two parallel algorithm design strategies are considered: linked list ranking and parentheses matching
Data Oblivious Algorithms for Multicores
As secure processors such as Intel SGX (with hyperthreading) become widely adopted, there is a growing appetite for private analytics on big data. Most prior works on data-oblivious algorithms adopt the classical PRAM model to capture parallelism. However, it is widely understood that PRAM does not best capture realistic multicore processors, nor does it reflect
parallel programming models adopted in practice.
In this paper, we initiate the study of parallel data oblivious algorithms on realistic multicores, best captured by the binary fork-join model of computation. We first show that data-oblivious sorting can be accomplished by a binary fork-join algorithm with optimal total work and optimal (cache-oblivious) cache complexity, and in O(log n log log n) span (i.e., parallel time) that matches the best-known insecure algorithm. Using our sorting algorithm as a core primitive, we show how to data-obliviously simulate general PRAM algorithms in the binary fork-join model with non-trivial efficiency. We also present results for several applications including list ranking, Euler tour, tree contraction, connected components, and minimum spanning forest. For a subset of these applications, our data-oblivious algorithms asymptotically outperform the best known insecure algorithms. For other applications, we show data oblivious algorithms whose performance bounds match the best known insecure algorithms.
Complementing these asymptotically efficient results, we present a practical variant of our sorting algorithm that is self-contained and potentially implementable. It has optimal caching cost, and it is only a log log n factor off from optimal work and about a log n factor off in terms of span; moreover, it achieves small constant factors in its bounds
Optimal parallel string algorithms: sorting, merching and computing the minimum
We study fundamental comparison problems on strings of characters, equipped with the usual lexicographical ordering. For each problem studied, we give a parallel algorithm that is optimal with respect to at least one criterion for which no optimal algorithm was previously known. Specifically, our main results are: % \begin{itemize} \item Two sorted sequences of strings, containing altogether ~characters, can be merged in time using operations on an EREW PRAM. This is optimal as regards both the running time and the number of operations. \item A sequence of strings, containing altogether ~characters represented by integers of size polynomial in~, can be sorted in time using operations on a CRCW PRAM. The running time is optimal for any polynomial number of processors. \item The minimum string in a sequence of strings containing altogether characters can be found using (expected) operations in constant expected time on a randomized CRCW PRAM, in time on a deterministic CRCW PRAM with a program depending on~, in time on a deterministic CRCW PRAM with a program not depending on~, in expected time on a randomized EREW PRAM, and in time on a deterministic EREW PRAM. The number of operations is optimal, and the running time is optimal for the randomized algorithms and, if the number of processors is limited to~, for the nonuniform deterministic CRCW PRAM algorithm as we
Aspects of practical implementations of PRAM algorithms
The PRAM is a shared memory model of parallel computation which abstracts away from inessential engineering details. It provides a very simple architecture independent model and provides a good programming environment. Theoreticians of the computer science community have proved that it is possible to emulate the theoretical PRAM model using current technology. Solutions have been found for effectively interconnecting processing elements, for routing data on these networks and for distributing the data among memory modules without hotspots. This thesis reviews this emulation and the possibilities it provides for large scale general purpose parallel computation. The emulation employs a bridging model which acts as an interface between the actual hardware and the PRAM model. We review the evidence that such a scheme crn achieve scalable parallel performance and portable parallel software and that PRAM algorithms can be optimally implemented on such practical models. In the course of this review we presented the following new results:
1. Concerning parallel approximation algorithms, we describe an NC algorithm for finding an approximation to a minimum weight perfect matching in a complete weighted graph. The algorithm is conceptually very simple and it is also the first NC-approximation algorithm for the task with a sub-linear performance ratio.
2. Concerning graph embedding, we describe dense edge-disjoint embeddings of the complete binary tree with n leaves in the following n-node communication networks: the hypercube, the de Bruijn and shuffle-exchange networks and the 2-dimcnsional mesh. In the embeddings the maximum distance from a leaf to the root of the tree is asymptotically optimally short. The embeddings facilitate efficient implementation of many PRAM algorithms on networks employing these graphs as interconnection networks.
3. Concerning bulk synchronous algorithmics, we describe scalable transportable algorithms for the following three commonly required types of computation; balanced tree computations. Fast Fourier Transforms and matrix multiplications
Optimal (Randomized) Parallel Algorithms in the Binary-Forking Model
In this paper we develop optimal algorithms in the binary-forking model for a
variety of fundamental problems, including sorting, semisorting, list ranking,
tree contraction, range minima, and ordered set union, intersection and
difference. In the binary-forking model, tasks can only fork into two child
tasks, but can do so recursively and asynchronously. The tasks share memory,
supporting reads, writes and test-and-sets. Costs are measured in terms of work
(total number of instructions), and span (longest dependence chain).
The binary-forking model is meant to capture both algorithm performance and
algorithm-design considerations on many existing multithreaded languages, which
are also asynchronous and rely on binary forks either explicitly or under the
covers. In contrast to the widely studied PRAM model, it does not assume
arbitrary-way forks nor synchronous operations, both of which are hard to
implement in modern hardware. While optimal PRAM algorithms are known for the
problems studied herein, it turns out that arbitrary-way forking and strict
synchronization are powerful, if unrealistic, capabilities. Natural simulations
of these PRAM algorithms in the binary-forking model (i.e., implementations in
existing parallel languages) incur an overhead in span. This
paper explores techniques for designing optimal algorithms when limited to
binary forking and assuming asynchrony. All algorithms described in this paper
are the first algorithms with optimal work and span in the binary-forking
model. Most of the algorithms are simple. Many are randomized
Realizing degree sequences in parallel
A sequence of integers is a degree sequence if there exists a (simple) graph such that the components of are equal to the degrees of the vertices of . The graph is said to be a realization of . We provide an efficient parallel algorithm to realize . Before our result, it was not known if the problem of realizing is in
Recommended from our members
Analysis and design of algorithms : double hashing and parallel graph searching
The following is in two parts, corresponding to the two separate topics in the dissertation.Probabilistic Analysis of Double HashingIn [GS78], a deep and elegant analysis shows that double hashing is asymptotically equivalent to the ideal uniform hashing up to a load factor of about 0.319. In this paper we show how a resampling technique can be used to develop a surprisingly simple proof of the result that this equivalence holds for load factors arbitrarily close to 1.Parallel Depth First Search of Planar Directed Acyclic GraphsIn 1988, Kao [Kao88] presented the first NC algorithm for the depth first search of a directed planar graph. Recently, Kao and Klein [KK90] reduced the number of processors required from O(n^4) to linear, but the time bound is O(log^8 n).We present an algorithm for the depth first search of a planar directed acyclic graph with k sources using O(n) processors and O(log k log n) time on a CRCW PRAM model. For planar dags with a single source and a single sink, we present a simple optimal algorithm which gives the depth first search in O(log n) time with O(n/log n) processors on an EREW PRAM. For a single-source multiple-sink planar dag, we have an O(log n) time O(n) processor EREW algorithm. The EREW algorithms assume that the embedding is given. A simplified variant of the depth first search of a multisource planar dag can be used to solve the single source reachability problem for a planar directed acyclic graph in O(log^2 n) time and O(n) processors on an CRCW PRAM. Since an O(log^4 n) algorithm for this problem is used as a subroutine by Kao and Klein in their depth first search for the general planar directed graph, this will lower their time bound by a factor of log^2 n. Our work uses the concept of a planar Euler tour depth first search, a depth first search in which the Euler tour around the tree is planar and crosses no tree edge. This concept may prove to be of use in other parallel algorithms for planar graphs
Efficient Algorithms and Data Structures for Massive Data Sets
For many algorithmic problems, traditional algorithms that optimise on the
number of instructions executed prove expensive on I/Os. Novel and very
different design techniques, when applied to these problems, can produce
algorithms that are I/O efficient. This thesis adds to the growing chorus of
such results. The computational models we use are the external memory model and
the W-Stream model.
On the external memory model, we obtain the following results. (1) An I/O
efficient algorithm for computing minimum spanning trees of graphs that
improves on the performance of the best known algorithm. (2) The first external
memory version of soft heap, an approximate meldable priority queue. (3) Hard
heap, the first meldable external memory priority queue that matches the
amortised I/O performance of the known external memory priority queues, while
allowing a meld operation at the same amortised cost. (4) I/O efficient exact,
approximate and randomised algorithms for the minimum cut problem, which has
not been explored before on the external memory model. (5) Some lower and upper
bounds on I/Os for interval graphs.
On the W-Stream model, we obtain the following results. (1) Algorithms for
various tree problems and list ranking that match the performance of the best
known algorithms and are easier to implement than them. (2) Pass efficient
algorithms for sorting, and the maximal independent set problems, that improve
on the best known algorithms. (3) Pass efficient algorithms for the graphs
problems of finding vertex-colouring, approximate single source shortest paths,
maximal matching, and approximate weighted vertex cover. (4) Lower bounds on
passes for list ranking and maximal matching.
We propose two variants of the W-Stream model, and design algorithms for the
maximal independent set, vertex-colouring, and planar graph single source
shortest paths problems on those models.Comment: PhD Thesis (144 pages