32 research outputs found
Aspects of practical implementations of PRAM algorithms
The PRAM is a shared memory model of parallel computation which abstracts away from inessential engineering details. It provides a very simple architecture independent model and provides a good programming environment. Theoreticians of the computer science community have proved that it is possible to emulate the theoretical PRAM model using current technology. Solutions have been found for effectively interconnecting processing elements, for routing data on these networks and for distributing the data among memory modules without hotspots. This thesis reviews this emulation and the possibilities it provides for large scale general purpose parallel computation. The emulation employs a bridging model which acts as an interface between the actual hardware and the PRAM model. We review the evidence that such a scheme crn achieve scalable parallel performance and portable parallel software and that PRAM algorithms can be optimally implemented on such practical models. In the course of this review we presented the following new results:
1. Concerning parallel approximation algorithms, we describe an NC algorithm for finding an approximation to a minimum weight perfect matching in a complete weighted graph. The algorithm is conceptually very simple and it is also the first NC-approximation algorithm for the task with a sub-linear performance ratio.
2. Concerning graph embedding, we describe dense edge-disjoint embeddings of the complete binary tree with n leaves in the following n-node communication networks: the hypercube, the de Bruijn and shuffle-exchange networks and the 2-dimcnsional mesh. In the embeddings the maximum distance from a leaf to the root of the tree is asymptotically optimally short. The embeddings facilitate efficient implementation of many PRAM algorithms on networks employing these graphs as interconnection networks.
3. Concerning bulk synchronous algorithmics, we describe scalable transportable algorithms for the following three commonly required types of computation; balanced tree computations. Fast Fourier Transforms and matrix multiplications
Deterministic Computations on a PRAM with Static Processor and Memory Faults.
We consider Parallel Random Access Machine (PRAM) which has some processors
and memory cells faulty. The faults considered are static, i.e., once the
machine starts to operate, the operational/faulty status of PRAM components
does not change. We develop a deterministic simulation of a fully operational
PRAM on a similar faulty machine which has constant fractions of faults among
processors and memory cells. The simulating PRAM has processors and
memory cells, and simulates a PRAM with processors and a constant fraction
of memory cells. The simulation is in two phases: it starts with
preprocessing, which is followed by the simulation proper performed in a
step-by-step fashion. Preprocessing is performed in time . The slowdown of a step-by-step part of the simulation is
Data Oblivious Algorithms for Multicores
As secure processors such as Intel SGX (with hyperthreading) become widely adopted, there is a growing appetite for private analytics on big data. Most prior works on data-oblivious algorithms adopt the classical PRAM model to capture parallelism. However, it is widely understood that PRAM does not best capture realistic multicore processors, nor does it reflect
parallel programming models adopted in practice.
In this paper, we initiate the study of parallel data oblivious algorithms on realistic multicores, best captured by the binary fork-join model of computation. We first show that data-oblivious sorting can be accomplished by a binary fork-join algorithm with optimal total work and optimal (cache-oblivious) cache complexity, and in O(log n log log n) span (i.e., parallel time) that matches the best-known insecure algorithm. Using our sorting algorithm as a core primitive, we show how to data-obliviously simulate general PRAM algorithms in the binary fork-join model with non-trivial efficiency. We also present results for several applications including list ranking, Euler tour, tree contraction, connected components, and minimum spanning forest. For a subset of these applications, our data-oblivious algorithms asymptotically outperform the best known insecure algorithms. For other applications, we show data oblivious algorithms whose performance bounds match the best known insecure algorithms.
Complementing these asymptotically efficient results, we present a practical variant of our sorting algorithm that is self-contained and potentially implementable. It has optimal caching cost, and it is only a log log n factor off from optimal work and about a log n factor off in terms of span; moreover, it achieves small constant factors in its bounds
Parallel Weighted Random Sampling
Data structures for efficient sampling from a set of weighted items are an important building block of many applications. However, few parallel solutions are known. We close many of these gaps both for shared-memory and distributed-memory machines. We give efficient, fast, and practicable algorithms for sampling single items, k items with/without replacement, permutations, subsets, and reservoirs. We also give improved sequential algorithms for alias table construction and for sampling with replacement. Experiments on shared-memory parallel machines with up to 158 threads show near linear speedups both for construction and queries
Design and analysis of sequential and parallel single-source shortest-paths algorithms
We study the performance of algorithms for the Single-Source Shortest-Paths (SSSP) problem on graphs with n nodes and m edges with nonnegative random weights. All previously known SSSP algorithms for directed graphs required superlinear time. Wie give the first SSSP algorithms that provably achieve linear O(n-m)average-case execution time on arbitrary directed graphs with random edge weights. For independent edge weights, the linear-time bound holds with high probability, too. Additionally, our result implies improved average-case bounds for the All-Pairs Shortest-Paths (APSP) problem on sparse graphs, and it yields the first theoretical average-case analysis for the "Approximate Bucket Implementation" of Dijkstra\u27s SSSP algorithm (ABI-Dijkstra). Futhermore, we give constructive proofs for the existence of graph classes with random edge weights on which ABI-Dijkstra and several other well-known SSSP algorithms require superlinear average-case time. Besides the classical sequential (single processor) model of computation we also consider parallel computing: we give the currently fastest average-case linear-work parallel SSSP algorithms for large graph classes with random edge weights, e.g., sparse rondom graphs and graphs modeling the WWW, telephone calls or social networks.In dieser Arbeit untersuchen wir die Laufzeiten von Algorithmen fĂŒr das KĂŒrzeste-Wege Problem (Single-Source Shortest-Paths, SSSP) auf Graphen mit n Knoten, M Kanten und nichtnegativen zufĂ€lligen Kantengewichten. Alle bisherigen SSSP Algorithmen benötigen auf gerichteten Graphen superlineare Zeit. Wir stellen den ersten SSSP Algorithmus vor, der auf beliebigen gerichteten Graphen mit zufĂ€lligen Kantengewichten eine beweisbar lineare average-case-KomplexitĂ€t
O(n+m)aufweist. Sind die Kantengewichte unabhĂ€ngig, so wird die lineare Zeitschranke auch mit hoher Wahrscheinlichkeit eingehalten. AuĂerdem impliziert unser Ergebnis verbesserte average-case-Schranken fĂŒr das All-Pairs Shortest-Paths (APSP) Problem auf dĂŒnnen Graphen und liefert die erste theoretische average-case-Analyse fĂŒr die "Approximate Bucket Implementierung" von Dijkstras SSSP Algorithmus (ABI-Dijkstra). Weiterhin fĂŒhren wir konstruktive Existenzbeweise fĂŒr Graphklassen mit zufĂ€lligen Kantengewichten, auf denen ABI-Dijkstra und mehrere andere bekannte SSSP Algorithmen durchschnittlich superlineare Zeit benötigen. Neben dem klassischen seriellen (Ein-Prozessor) Berechnungsmodell betrachten wir auch Parallelverarbeitung; fĂŒr umfangreiche Graphklassen mit zufĂ€lligen Kantengewichten wie z.B. dĂŒnne Zufallsgraphen oder Modelle fĂŒr das WWW, Telefonanrufe oder soziale Netzwerke stellen wir die derzeit schnellsten parallelen SSSP Algorithmen mit durchschnittlich linearer Arbeit vor
Oblivious Network RAM and Leveraging Parallelism to Achieve Obliviousness
Oblivious RAM (ORAM) is a cryptographic primitive that allows a trusted CPU to securely access untrusted memory, such that the access patterns reveal nothing about sensitive data. ORAM is known to have broad applications in secure processor design and secure multi-party computation for big data. Unfortunately, due to a logarithmic lower bound by Goldreich and Ostrovsky (Journal of the ACM, \u2796), ORAM is bound to incur a moderate cost in practice. In
particular, with the latest developments in ORAM constructions, we are quickly approaching this limit, and the room for performance improvement is small.
In this paper, we consider new models of computation in which the cost of obliviousness can be fundamentally reduced in comparison with the standard ORAM model. We propose the Oblivious Network RAM model of computation, where a CPU communicates with multiple
memory banks, such that the adversary observes only which bank the CPU is communicating with, but not the address oset within each memory bank. In other words, obliviousness within each bank comes for free either because the architecture prevents a malicious party from observing the address accessed within a bank, or because another solution is used to obfuscate memory accesses within each bank and hence we only need to obfuscate communication patterns between the CPU and the memory banks. We present new constructions for obliviously simulating general or parallel programs in the Network RAM model. We describe applications of our new model in secure processor design and in distributed storage applications with a network adversary
Parallel Weighted Random Sampling
Data structures for efficient sampling from a set of weighted items are an important building block of many applications. However, few parallel solutions are known. We close many of these gaps both for shared-memory and distributed-memory machines. We give efficient, fast, and practicable algorithms for sampling single items, k items with/without replacement, permutations, subsets, and reservoirs. We also give improved sequential algorithms for alias table construction and for sampling with replacement. Experiments on shared-memory parallel machines with up to 158 threads show near linear speedups both for construction and queries