
    Deterministic Computations on a PRAM with Static Processor and Memory Faults.

    We consider a Parallel Random Access Machine (PRAM) in which some processors and memory cells are faulty. The faults considered are static, i.e., once the machine starts to operate, the operational/faulty status of PRAM components does not change. We develop a deterministic simulation of a fully operational PRAM on a similar faulty machine which has constant fractions of faults among processors and memory cells. The simulating PRAM has n processors and m memory cells, and simulates a PRAM with n processors and a constant fraction of m memory cells. The simulation is in two phases: it starts with preprocessing, which is followed by the simulation proper, performed in a step-by-step fashion. Preprocessing is performed in time O((m/n + log n) log n). The slowdown of the step-by-step part of the simulation is O(log m).
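
    A tiny, hypothetical sketch of the generic idea (not the paper's actual construction): preprocessing computes, once, a mapping from logical memory cells to operational physical cells, and the step-by-step phase then routes every access through that mapping. All names and the fault representation below are invented for illustration.

        # Hypothetical sketch: simulating fault-free memory on a memory with static cell faults.
        # Preprocessing builds a map from logical cells to operational physical cells;
        # the step-by-step phase redirects every read/write through that map.

        def build_cell_map(num_logical, faulty):
            """Preprocessing: assign each logical cell a distinct operational physical cell."""
            mapping, phys = [], 0
            while len(mapping) < num_logical:
                if phys not in faulty:
                    mapping.append(phys)
                phys += 1
            return mapping

        class SimulatedMemory:
            def __init__(self, num_logical, faulty_cells):
                self.map = build_cell_map(num_logical, faulty_cells)
                self.cells = {}                      # physical memory, addressed by cell index

            def read(self, addr):                    # one simulated access per PRAM step
                return self.cells.get(self.map[addr], 0)

            def write(self, addr, value):
                self.cells[self.map[addr]] = value

        mem = SimulatedMemory(num_logical=8, faulty_cells={1, 4})
        mem.write(0, 42)
        print(mem.read(0))   # 42, physically stored in an operational cell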

    Fast Computation of Small Cuts via Cycle Space Sampling

    We describe a new sampling-based method to determine cuts in an undirected graph. For a graph (V, E), its cycle space is the family of all subsets of E that have even degree at each vertex. We prove that with high probability, sampling the cycle space identifies the cuts of a graph. This leads to simple new linear-time sequential algorithms for finding all cut edges and cut pairs (a set of 2 edges that form a cut) of a graph. In the model of distributed computing in a graph G=(V, E) with O(log V)-bit messages, our approach yields faster algorithms for several problems. The diameter of G is denoted by Diam, and the maximum degree by Delta. We obtain simple O(Diam)-time distributed algorithms to find all cut edges, 2-edge-connected components, and cut pairs, matching or improving upon previous time bounds. Under natural conditions these new algorithms are universally optimal, i.e. an Omega(Diam)-time lower bound holds on every graph. We obtain an O(Diam+Delta/log V)-time distributed algorithm for finding cut vertices; this is faster than the best previous algorithm when Delta, Diam = O(sqrt(V)). A simple extension of our work yields the first distributed algorithm with sub-linear time for 3-edge-connected components. The basic distributed algorithms are Monte Carlo, but they can be made Las Vegas without increasing the asymptotic complexity. In the model of parallel computing on the EREW PRAM, our approach yields a simple algorithm with optimal time complexity O(log V) for finding cut pairs and 3-edge-connected components. (Comment: Previous version appeared in Proc. 35th ICALP, pages 145--160, 200)
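
    For intuition, the sequential cut-edge (bridge) application can be sketched directly from the abstract's statement: an edge is a cut edge exactly when its coordinate is zero in every element of the cycle space, so edges that are zero in a few random cycle-space samples are bridges with high probability. The sketch below samples cycle-space elements via a spanning tree and subtree XORs, which is one natural realization rather than necessarily the paper's presentation; the function name is invented.

        import random

        def find_bridges_whp(n, edges, k=32):
            """Monte Carlo cut-edge (bridge) detection via cycle-space sampling.

            Each non-tree edge gets an independent k-bit label; each tree edge's
            label is the XOR of the labels of non-tree edges whose fundamental
            cycle contains it (computed as a subtree XOR). An edge is reported
            as a bridge iff its label is zero in all k sampled cycle-space
            vectors; each non-bridge slips through with probability <= 2^-k.
            """
            adj = [[] for _ in range(n)]
            for i, (u, v) in enumerate(edges):
                adj[u].append((v, i))
                adj[v].append((u, i))

            parent = [-1] * n          # parent vertex in a DFS spanning forest
            parent_edge = [-1] * n     # index of the edge to the parent
            visited = [False] * n
            order = []
            for root in range(n):
                if visited[root]:
                    continue
                visited[root] = True
                stack = [root]
                while stack:
                    u = stack.pop()
                    order.append(u)
                    for v, i in adj[u]:
                        if not visited[v]:
                            visited[v] = True
                            parent[v], parent_edge[v] = u, i
                            stack.append(v)

            tree = set(e for e in parent_edge if e != -1)
            label = [0] * len(edges)
            val = [0] * n              # XOR of incident non-tree labels
            for i, (u, v) in enumerate(edges):
                if i not in tree:
                    label[i] = random.getrandbits(k)
                    val[u] ^= label[i]
                    val[v] ^= label[i]

            sub = val[:]               # subtree XORs, filled bottom-up
            for u in reversed(order):  # children are processed before parents
                if parent[u] != -1:
                    label[parent_edge[u]] = sub[u]
                    sub[parent[u]] ^= sub[u]

            return [edges[i] for i in range(len(edges)) if label[i] == 0]

        # Example: a path 0-1-2 attached to the triangle 2-3-4; the bridges are
        # (0, 1) and (1, 2), reported correctly with probability >= 1 - 3*2^-32.
        print(find_bridges_whp(5, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 2)]))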

    Enhancing Productivity and Performance Portability of General-Purpose Parallel Programming

    This work focuses on compiler and run-time techniques for improving the productivity and the performance portability of general-purpose parallel programming. More specifically, we focus on shared-memory task-parallel languages, where the programmer explicitly exposes parallelism in the form of short tasks that may outnumber the cores by orders of magnitude. The compiler, the run-time, and the platform (henceforth the system) are responsible for harnessing this unpredictable amount of parallelism, which can vary from none to excessive, towards efficient execution. The challenge arises from the aspiration to support fine-grained irregular computations and nested parallelism. This work is even more ambitious in that it also aspires to lay the foundations to efficiently support declarative code, where the programmer exposes all available parallelism, using high-level language constructs such as parallel loops, reducers or futures. The appeal of declarative code is twofold for general-purpose programming: it is often easier for the programmer, who does not have to worry about the granularity of the exposed parallelism, and it achieves better performance portability by avoiding overfitting to a small range of platforms and inputs for which the programmer coarsens. Furthermore, PRAM algorithms, an important class of parallel algorithms, naturally lend themselves to declarative programming, so supporting it is a necessary condition for capitalizing on the wealth of the PRAM theory. Unfortunately, declarative codes often expose such an overwhelming number of fine-grained tasks that existing systems fail to deliver performance. Our contributions can be partitioned into three components. First, we tackle the issue of coarsening, which declarative code leaves to the system. We identify two goals of coarsening and advocate tackling them separately, using static compiler transformations for one and dynamic run-time approaches for the other. Additionally, we present evidence that the current practice of burdening the programmer with coarsening either leads to codes with poor performance portability or to a significantly increased programming effort. This is a "show-stopper" for general-purpose programming. To compare the performance portability among approaches, we define an experimental framework and two metrics, and we demonstrate that our approaches are preferable. We close the chapter on coarsening by presenting compiler transformations that automatically coarsen some types of very fine-grained codes. Second, we propose Lazy Scheduling, an innovative run-time scheduling technique that infers the platform load at run-time, using information already maintained. Based on the inferred load, Lazy Scheduling adapts the amount of available parallelism it exposes for parallel execution and, thus, saves parallelism overheads that existing approaches pay. We implement Lazy Scheduling and present experimental results on four different platforms. The results show that Lazy Scheduling is vastly superior for declarative codes and competitive, if not better, for coarsened codes. Moreover, Lazy Scheduling is also superior in terms of performance portability, supporting our thesis that it is possible to achieve reasonable efficiency and performance portability with declarative codes. Finally, we also implement Lazy Scheduling on XMT, an experimental manycore platform developed at the University of Maryland, which was designed to support codes derived from PRAM algorithms.
    On XMT, we manage to harness the existing hardware support for scheduling flat parallelism to compose it with Lazy Scheduling, which supports nested parallelism. In the resulting hybrid scheduler, the hardware and software work in synergy to overcome each other's weaknesses. We show the performance composability of the hardware and software schedulers, both in an abstract cost model and experimentally, as the hybrid always performs better than the software scheduler alone. Furthermore, the cost model is validated by using it to predict if it is preferable to execute code sequentially, with outer parallelism, or with nested parallelism, depending on the input, the available hardware parallelism and the calling context of the parallel code.
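
    The lazy-spawn decision described above can be illustrated with a deliberately simplified, hypothetical sketch: the "load signal" is just the length of the worker's local task deque, which a work-stealing run-time maintains anyway, and parallelism is only exposed when that signal suggests there is demand for it. This is an illustration of the idea, not the dissertation's implementation, and every name is invented.

        from collections import deque

        class Worker:
            """Single worker of a hypothetical work-stealing run-time."""
            def __init__(self):
                self.deque = deque()          # tasks not yet executed or stolen

            def spawn(self, task):
                if self.deque:                # local work is piling up: nobody is
                    task()                    # stealing, so run the task inline and
                    return                    # pay no task-creation overhead
                self.deque.append(task)       # deque empty: expose the task so an
                                              # idle worker could steal it

            def drain(self):
                while self.deque:
                    self.deque.pop()()

        w = Worker()
        w.spawn(lambda: print("queued, then run"))   # deque was empty: exposed
        w.spawn(lambda: print("run inline"))         # deque non-empty: run eagerly
        w.drain()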

    Anonymous and fault-tolerant shared-memory computing

    The vast majority of papers on distributed computing assume that processes are assigned unique identifiers before computation begins. But is this assumption necessary? What if processes do not have unique identifiers or do not wish to divulge them for reasons of privacy? We consider asynchronous shared-memory systems that are anonymous. The shared memory contains only the most common type of shared objects, read/write registers. We investigate, for the first time, what can be implemented deterministically in this model when processes can fail. We give anonymous algorithms for some fundamental problems: time-stamping, snapshots and consensus. Our solutions to the first two are wait-free and the third is obstruction-free. We also show that a shared object has an obstruction-free implementation if and only if it satisfies a simple property called idempotence. To prove the sufficiency of this condition, we give a universal construction that implements any idempotent object.
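
    One common reading of idempotence for object types (a hedged paraphrase for intuition; the paper's precise definition may differ) is that applying the same operation twice in a row returns the same response both times and leaves the object in the same state as applying it once. The small checker below makes that condition concrete for a sequential specification; all names are illustrative.

        def is_idempotent(apply, states, ops):
            """Check: apply(state, op) -> (new_state, response); re-applying op
            from the resulting state must change nothing and give the same response."""
            for s in states:
                for op in ops:
                    s1, r1 = apply(s, op)
                    s2, r2 = apply(s1, op)
                    if (s2, r2) != (s1, r1):
                        return False
            return True

        # A read/write register is idempotent:
        def register(state, op):
            kind, arg = op
            if kind == "write":
                return arg, "ack"
            return state, state                      # read

        print(is_idempotent(register, [0, 1],
                            [("write", 0), ("write", 1), ("read", None)]))   # True

        # A fetch-and-increment counter is not:
        def counter(state, op):
            return state + 1, state

        print(is_idempotent(counter, range(3), [("inc", None)]))              # False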

    D.1.2 – Modular quasi-causal data structures

    In large-scale systems such as the Internet, replicating data is an essential feature in order to provide availability and fault-tolerance. Attiya and Welch proved that using strong consistency criteria such as atomicity is costly, as each operation may need an execution time linear in the latency of the communication network. Weaker consistency criteria like causal consistency and PRAM consistency do not ensure convergence: the different replicas are not guaranteed to converge towards a unique state. Eventual consistency guarantees that all replicas eventually converge when the participants stop updating. However, it fails to fully specify the semantics of the operations on shared objects and requires additional non-intuitive and error-prone distributed specification techniques. In addition, existing consistency conditions are usually defined independently from the computing entities (nodes) that manipulate the replicated data; i.e., they do not take into account how computing entities might be linked to one another, or geographically distributed. In this deliverable, we address these issues with two novel contributions. The first contribution proposes a notion of proximity graph between computing nodes. If two nodes are connected in this graph, their operations must satisfy a strong consistency condition, while the operations invoked by other nodes are allowed to satisfy a weaker condition. We use this graph to provide a generic approach to the hybridization of data consistency conditions into the same system. Based on this proximity graph, we design a distributed algorithm which combines sequential consistency and causal consistency (the resulting condition is called fisheye consistency). The second contribution of this deliverable focuses on addressing the limitations of eventual consistency. To this end, we formalize a new consistency criterion, called update consistency, that requires the state of a replicated object to be consistent with a linearization of all the updates. In other words, whereas atomicity imposes a linearization of all of the operations, this criterion imposes this only on updates. Consequently, some read operations may return outdated values. Update consistency is stronger than eventual consistency, so we can replace eventually consistent objects with update consistent ones in any program. Finally, we prove that update consistency is universal, in the sense that any object can be implemented under this criterion in a distributed system where any number of nodes may crash.
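
    The update-consistency idea can be illustrated with a last-writer-wins register, a standard construction rather than the deliverable's algorithm: every replica eventually applies all updates in one common total order (here, timestamp then origin), so the converged state is explained by a linearization of the updates, while reads issued before convergence may return outdated values. All names below are invented for the sketch.

        class Replica:
            def __init__(self, name):
                self.name = name
                self.updates = set()          # (timestamp, origin, value) triples

            def update(self, ts, value):
                self.updates.add((ts, self.name, value))

            def deliver(self, other):
                """Anti-entropy: merge another replica's update set."""
                self.updates |= other.updates

            def read(self):
                """Register value = last update in the agreed total order (may be stale)."""
                return max(self.updates)[2] if self.updates else None

        a, b = Replica("a"), Replica("b")
        a.update(1, "x")
        b.update(2, "y")
        print(a.read(), b.read())   # x y : reads may disagree before convergence
        a.deliver(b); b.deliver(a)
        print(a.read(), b.read())   # y y : both converge to the same linearization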

    Efficient Algorithms with Asymmetric Read and Write Costs

    In several emerging technologies for computer memory (main memory), the cost of reading is significantly cheaper than the cost of writing. Such asymmetry in memory costs poses a fundamentally different model from the RAM for algorithm design. In this paper we study lower and upper bounds for various problems under such asymmetric read and write costs. We consider both the case in which all but O(1) memory has asymmetric cost, and the case of a small cache of symmetric memory. We model both cases using the (M,omega)-ARAM, in which there is a small (symmetric) memory of size M and a large unbounded (asymmetric) memory, both random access, and where reading from the large memory has unit cost, but writing has cost omega >> 1. For FFT and sorting networks we show a lower bound cost of Omega(omega*n*log_{omega*M}(n)), which indicates that it is not possible to achieve asymptotic improvements with cheaper reads when omega is bounded by a polynomial in M. Moreover, there is an asymptotic gap (of min(omega, log(n)/log(omega*M))) between the cost of sorting networks and comparison sorting in the model. This contrasts with the RAM, and most other models, in which the asymptotic costs are the same. We also show a lower bound for computations on an n*n diamond DAG of Omega(omega*n^2/M) cost, which indicates no asymptotic improvement is achievable with fast reads. However, we show that for the minimum edit distance problem (and related problems), which would seem to be a diamond DAG, we can beat this lower bound with an algorithm with only O(omega*n^2/(M*min(omega^{1/3},M^{1/2}))) cost. To achieve this we make use of a "path sketch" technique that is forbidden in a strict DAG computation. Finally, we show several interesting upper bounds for shortest path problems, minimum spanning trees, and other problems. A common theme in many of the upper bounds is that they require redundant computation and a tradeoff between reads and writes.
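
    To see why the sorting-network bound leaves no room for asymptotic gains from cheap reads when omega is polynomial in M, one can plug omega = M^c (for a constant c) into the bound; this worked instantiation is for illustration and is not taken from the paper:

    \[
    \omega\, n \log_{\omega M} n
      \;=\; \omega\, n \,\frac{\log n}{\log(\omega M)}
      \;=\; \omega\, n \,\frac{\log n}{(c+1)\log M}
      \;=\; \Theta\!\left(\omega \cdot n \log_{M} n\right),
    \]

    i.e. the total cost is omega times the corresponding symmetric-cost quantity n*log_M(n), so the full write-cost factor omega is paid and the cheap reads give no asymptotic advantage.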

    What Can Be Implemented Anonymously?

    The vast majority of papers on distributed computing assume that processes are assigned unique identifiers before computation begins. But is this assumption necessary? What if processes do not have unique identifiers or do not wish to divulge them for reasons of privacy? We consider asynchronous shared-memory systems that are anonymous. The shared memory contains only the most common type of shared objects, read/write registers. We investigate, for the first time, what can be implemented deterministically in this model when processes can fail. We give anonymous algorithms for some fundamental problems: timestamping, snapshots and consensus. Our solutions to the first two are wait-free and the third is obstruction-free. We also show that a shared object has an obstruction-free implementation if and only if it satisfies a simple property called idempotence. To prove the sufficiency of this condition, we give a universal construction that implements any idempotent object.

    Efficient Data-Oblivious Computation

    The rapid increase in the amount of data stored by cloud servers has resulted in growing privacy concerns for users. First, although keeping data encrypted at all times is an attractive approach to privacy, encryption may preclude mining and learning useful patterns from data. Second, companies are unable to distribute proprietary programs to other parties without risking the loss of their private code when those programs are reverse engineered. A challenge underlying both those problems is that how data is accessed — even when that data is encrypted — can leak secret information. Oblivious RAM is a well-studied cryptographic primitive that can be used to solve the underlying challenge of hiding data-access patterns. In this dissertation, we improve Oblivious RAMs and oblivious algorithms asymptotically. We then show how to apply our novel oblivious algorithms to build systems that enable privacy-preserving computation on encrypted data and program obfuscation. Specifically, the first part of this dissertation shows two efficient Oblivious RAM algorithms: 1) The first algorithm achieves sub-logarithmic bandwidth blowup while only incurring an inexpensive XOR computation for performing Private Information Retrieval operations, and 2) The second algorithm is the first perfectly-secure Oblivious Parallel RAM with O(log^3 N) bandwidth blowup, O((log m + log log N) log N) depth blowup, and O(1) space blowup when the PRAM has m CPUs and stores N blocks of data. The second part of this dissertation describes two systems — HOP and GraphSC — that address the problem of computing on private data and the distribution of proprietary programs. HOP is a system that achieves simulation-secure obfuscation of RAM programs assuming secure hardware. It is the first prototype implementation of a provably secure virtual black-box (VBB) obfuscation scheme in any model under any assumptions. GraphSC is a system that allows cloud servers to run a class of data-mining and machine-learning algorithms over users’ data without learning anything about that data. GraphSC brings efficient, parallel secure computation to programmers by allowing them to express computation tasks using the GraphLab abstraction. It is backed by the first non-trivial parallel oblivious algorithms that outperform generic Oblivious RAMs.
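
    The guarantee an Oblivious RAM provides, that the physical access sequence is independent of the logical one, is easiest to see in the trivial linear-scan construction. The sketch below is for intuition only and is not one of the dissertation's algorithms, which achieve polylogarithmic or sub-logarithmic bandwidth blowup instead of the O(N) blowup shown here.

        class LinearScanORAM:
            """Trivial ORAM: every logical access touches every block, so the
            server-visible access pattern reveals nothing about the address."""
            def __init__(self, num_blocks):
                self.store = [None] * num_blocks   # stands in for (encrypted) server memory

            def access(self, op, addr, value=None):
                result = None
                for i in range(len(self.store)):   # touch every block on every access
                    block = self.store[i]          # read block i from the server
                    if i == addr:
                        result = block
                        if op == "write":
                            block = value
                    self.store[i] = block          # write block i back (re-encrypted in practice)
                return result

        oram = LinearScanORAM(4)
        oram.access("write", 2, "secret")
        print(oram.access("read", 2))   # 'secret'; the server sees an identical full scan either way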

    Computer science I like proceedings of miniconference on 4.11.2011

