17 research outputs found

    Long-Lived Counters with Polylogarithmic Amortized Step Complexity

    A shared-memory counter is a well-studied and widely-used concurrent object. It supports two operations: an Inc operation that increases its value by 1 and a Read operation that returns its current value. Jayanti, Tan and Toueg [Jayanti et al., 2000] proved a linear lower bound on the worst-case step complexity of obstruction-free implementations, from read and write operations, of a large class of shared objects that includes counters. The lower bound leaves open the question of finding counter implementations with sub-linear amortized step complexity. In this paper, we address this gap. We present the first wait-free n-process counter, implemented using only read and write operations, whose amortized operation step complexity is O(log^2 n) in all executions. This is the first non-blocking read/write counter algorithm that provides sub-linear amortized step complexity in executions of arbitrary length. Since a logarithmic lower bound on the amortized step complexity of obstruction-free counter implementations exists, our upper bound is optimal up to a logarithmic factor.

    Tight Bounds for Asynchronous Renaming


    Randomized versus Deterministic Implementations of Concurrent Data Structures

    One of the key trends in computing over the past two decades has been increased distribution, both at the processor level, where multi-core architectures are now the norm, and at the system level, where many key services are currently distributed over multiple machines. Thus, understanding the power and limitations of computing in a concurrent, distributed setting is one of the major challenges in Computer Science. In this thesis, we analyze the complexity of implementing concurrent data structures in asynchronous shared memory systems. We focus on the complexity of a classic distributed coordination task called renaming, in which a set of processes needs to pick distinct names from a small set of identifiers. We present the first tight bounds for the time complexity of this problem, both for deterministic and randomized implementations, solving a long-standing open problem in the field. For deterministic algorithms, we prove a tight linear lower bound; for randomized solutions, we provide logarithmic upper and lower bounds on time complexity. Together, these results show an exponential separation between deterministic and randomized renaming solutions. Importantly, the lower bounds extend to implementations of practical shared-memory data structures, such as queues, stacks, and counters. From a technical perspective, this thesis highlights new connections between the distributed renaming problem and other fundamental objects, such as sorting networks, mutual exclusion, and counters. In particular, we show that sorting networks can be used to obtain optimal randomized solutions to renaming, and that, in turn, the existence of these solutions implies a linear lower bound on the complexity of the problem. In sum, the results in this thesis suggest that deterministic implementations of shared-memory data structures do not scale well in terms of worst-case time complexity.
On the positive side, we emphasize randomization as a natural alternative, which can circumvent the deterministic lower bounds with high probability. Thus, a promising direction for future work is to extend our randomized renaming techniques to obtain efficient implementations of concurrent data structures.

    Oblivious Message Retrieval

    Anonymous message delivery systems, such as private messaging services and privacy-preserving payment systems, need a mechanism for recipients to retrieve the messages addressed to them, without leaking metadata or letting their messages be linked. Recipients could download all posted messages and scan for those addressed to them, but communication and computation costs are excessive at scale. We show how untrusted servers can detect messages on behalf of recipients, and summarize these into a compact encrypted digest that recipients can easily decrypt. These servers operate obliviously and do not learn anything about which messages are addressed to which recipients. Privacy, soundness, and completeness hold even if everyone but the recipient is adversarial and colluding (unlike in prior schemes). Our starting point is an asymptotically-efficient approach, using Fully Homomorphic Encryption and homomorphically-encoded Sparse Random Linear Codes. We then address the concrete performance using bespoke tailoring of lattice-based cryptographic components, alongside various algebraic and algorithmic optimizations. This reduces the digest size to a few bits per message scanned. Concretely, the servers' cost is ~$1 per million messages scanned, and the resulting digests can be decoded by recipients in ~20ms. Our schemes can thus practically attain the strongest form of receiver privacy for current applications such as privacy-preserving cryptocurrencies.

    Case Studies on Optimizing Algorithms for GPU Architectures

    Modern GPUs are complex, massively multi-threaded, and high-performance. Programmers naturally gravitate towards taking advantage of this high performance for achieving faster results. However, in order to do so successfully, programmers must first understand and then master a new set of skills – writing parallel code, using different types of parallelism, adapting to GPU architectural features, and understanding issues that limit performance. In order to ease this learning process and help GPU programmers become productive more quickly, this dissertation introduces three data access skeletons (DASks) – Block, Column, and Row – and two block access skeletons (BASks) – Block-By-Block and Warp-by-Warp. Each “skeleton” provides a high-performance implementation framework that partitions data arrays into data blocks and then iterates over those blocks. The programmer must still write “body” methods on individual data blocks to solve their specific problem. These skeletons provide efficient machine-dependent data access patterns for use on GPUs. DASks group n data elements into m fixed-size data blocks. These m data blocks are then partitioned across p thread blocks using a 1D or 2D layout pattern. The fixed-size data blocks are parameterized using three C++ template parameters – nWork, WarpSize, and nWarps. Generic programming techniques use these three parameters to enable performance experiments on three different types of parallelism – instruction-level parallelism (ILP), data-level parallelism (DLP), and thread-level parallelism (TLP). These different DASks and BASks are introduced using a simple memory I/O (Copy) case study. A nearest neighbor search case study resulted in the development of DASks and BASks but does not use these skeletons itself. Three additional case studies – Reduce/Scan, Histogram, and Radix Sort – demonstrate DASks and BASks in action on parallel primitives and also provide more valuable performance lessons.

    Long-lived counters with polylogarithmic amortized step complexity

    A shared-memory counter is a widely-used and well-studied concurrent object. It supports two operations: an Inc operation that increases its value by 1 and a Read operation that returns its current value. In Jayanti et al (SIAM J Comput, 30(2), 2000), Jayanti, Tan and Toueg proved a linear lower bound on the worst-case step complexity of obstruction-free implementations, from read-write registers, of a large class of shared objects that includes counters. The lower bound leaves open the question of finding counter implementations with sub-linear amortized step complexity. In this work, we address this gap. We show that n-process, wait-free and linearizable counters can be implemented from read-write registers with O(log^2 n) amortized step complexity. This is the first counter algorithm from read-write registers that provides sub-linear amortized step complexity in executions of arbitrary length. Since a logarithmic lower bound on the amortized step complexity of obstruction-free counter implementations exists, our upper bound is within a logarithmic factor of the optimal. The worst-case step complexity of the construction remains linear, which is optimal. This is obtained thanks to a new max register construction with O(log n) amortized step complexity in executions of arbitrary length in which the value stored in the register does not grow too quickly. We then leverage an existing counter algorithm by Aspnes, Attiya and Censor-Hillel [1] in which we “plug” our max register implementation to show that it remains linearizable while achieving O(log^2 n) amortized step complexity.

    Models for Parallel Computation in Multi-Core, Heterogeneous, and Ultra Wide-Word Architectures

    Multi-core processors have become the dominant processor architecture with 2, 4, and 8 cores on a chip being widely available and an increasing number of cores predicted for the future. In addition, the decreasing costs and increasing programmability of Graphics Processing Units (GPUs) have made these an accessible source of parallel processing power in general purpose computing. Among the many research challenges that this scenario has raised are the fundamental problems related to theoretical modeling of computation in these architectures. In this thesis we study several aspects of computation in modern parallel architectures, from modeling of computation in multi-cores and heterogeneous platforms, to multi-core cache management strategies, through the proposal of an architecture that exploits bit-parallelism on thousands of bits. Observing that in practice multi-cores have a small number of cores, we propose a model for low-degree parallelism for these architectures. We argue that assuming a small number of processors (logarithmic in a problem's input size) simplifies the design of parallel algorithms. We show that in this model a large class of divide-and-conquer and dynamic programming algorithms can be parallelized with simple modifications to sequential programs, while achieving optimal parallel speedups. We further explore low-degree-parallelism in computation, providing evidence of fundamental differences in practice and theory between systems with a sublinear and linear number of processors, and suggesting a sharp theoretical gap between the classes of problems that are efficiently parallelizable in each case. Efficient strategies to manage shared caches play a crucial role in multi-core performance. We propose a model for paging in multi-core shared caches, which extends classical paging to a setting in which several threads share the cache.
We show that in this setting traditional cache management policies perform poorly, and that any effective strategy must partition the cache among threads, with a partition that adapts dynamically to the demands of each thread. Inspired by the shared cache setting, we introduce the minimum cache usage problem, an extension to classical sequential paging in which algorithms must account for the amount of cache they use. This cache-aware model seeks algorithms with good performance in terms of faults and the amount of cache used, and has applications in energy efficient caching and in shared cache scenarios. The wide availability of GPUs has added to the parallel power of multi-cores; however, most applications underutilize the available resources. We propose a model for hybrid computation in heterogeneous systems with multi-cores and GPUs, and describe strategies for generic parallelization and efficient scheduling of a large class of divide-and-conquer algorithms. Lastly, we introduce the Ultra-Wide Word architecture and model, an extension of the word-RAM model, that allows for constant time operations on thousands of bits in parallel. We show that a large class of existing algorithms can be implemented in the Ultra-Wide Word model, achieving speedups comparable to those of multi-threaded computations, while avoiding the more difficult aspects of parallel programming.

    A PROCRUSTEAN APPROACH TO STREAM PROCESSING

    The increasing demand for real-time data processing and the constantly growing data volume have contributed to the rapid evolution of Stream Processing Engines (SPEs), which are designed to continuously process data as it arrives. Low operational cost and timely delivery of results are both objectives of paramount importance for SPEs. Given the volatile and uncharted nature of data streams, achieving the aforementioned goals under fixed resources is a challenge. This calls for adaptable SPEs, which can react to fluctuations in processing demands. In the past, three techniques have been developed for improving an SPE’s ability to adapt. Those techniques are classified based on whether applications require exact or approximate results: stream partitioning and re-partitioning target exact processing, while load shedding targets approximate processing. Stream partitioning strives to balance load among processors, and previous techniques neglected hidden costs of distributed execution. Load shedding lowers the accuracy of results by dropping part of the input, and previous techniques did not cope with evolving streams. Stream re-partitioning is used to reconfigure execution while processing takes place, and previous techniques did not fully utilize window semantics. In this dissertation, we put stream processing in a procrustean bed, in terms of the manner and the degree that processing takes place. To this end, we present new approaches, for window-based aggregate operators, which are applicable to both exact and approximate stream processing in modern SPEs. Our stream partitioning, re-partitioning, and load shedding solutions offer improvements in performance and accuracy on real-world data by exploiting the semantics of both data and operations. In addition, we present SPEAr, the design of an SPE that accelerates processing by delivering approximate results with accuracy guarantees and avoiding unnecessary load.
Finally, we contribute a hybrid technique, ShedPart, which can further improve load balance and performance of an SPE.

    Hierarchical Routing in Low-Power Wireless Networks

    Steen, M.R. van [Promotor