283 research outputs found

    Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems

    Get PDF
    We present Coup, a technique to lower the cost of updates to shared data in cache-coherent systems. Coup exploits the insight that many update operations, such as additions and bitwise logical operations, are commutative: they produce the same final result regardless of the order they are performed in. Coup allows multiple private caches to simultaneously hold update-only permission to the same cache line. Caches with update-only permission can locally buffer and coalesce updates to the line, but cannot satisfy read requests. Upon a read request, Coup reduces the partial updates buffered in private caches to produce the final value. Coup integrates seamlessly into existing coherence protocols, requires inexpensive hardware, and does not affect the memory consistency model. We apply Coup to speed up single-word updates to shared data. On a simulated 128-core, 8-socket system, Coup accelerates state-of-the-art implementations of update-heavy algorithms by up to 2.4Ă—.Center for Future Architectures ResearchNational Science Foundation (U.S.) (CAREER-1452994)Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science (Grier Presidential Fellowship)Microelectronics Advanced Research CorporationUnited States. Defense Advanced Research Projects Agenc

    Plackett and Burman analysis to select effective compiler optimizations

    Get PDF
    Compiler optimizations help to make code run faster at runtime. When the compilation is done before the program is run, compilation time is less of an issue, but how do on-the-fly compilation and optimization impact the overall runtime? If the compiler must compete with the running application for resources, the running application will take more time to complete. This paper investigates the impact of specific compiler optimizations on the overall runtime of an application. A foldover Plackett and Burman design is used to choose compiler optimizations that appear to contribute to shorter overall runtimes. These selected optimizations are compared with the default optimization levels in the Jikes RVM. This method selects optimizations that result in a shorter overall runtime than the default O0, O1, and O2 levels. This shows that careful selection of compiler optimizations can have a significant, positive impact on overall runtime

    Architecting Persistent Memory Systems

    Full text link
    The imminent release of 3D XPoint memory by Intel and Micron looks set to end the long wait for affordable persistent memory. Persistent memories combine the persistence of disk with DRAM-like performance, blurring the traditional divide between a byte-addressable, volatile main memory and a block-addressable, persistent storage (e.g., SSDs). One of the most disruptive potential use cases for persistent memories is to host in-memory recoverable data structures. These recoverable data structures may be directly modified by programmers using user-level processor load and store instructions, rather than relying on performance sapping software intermediaries like the operating and file systems. Ensuring the recoverability of these data structures requires programmers to have the ability to control the order of updates to persistent memory. Current systems do not provide efficient mechanisms (if any) to enforce the order in which store instructions update the physical main memory. Recently proposed memory persistency models allow programmers to specify constraints on the order in which stores can be written-back to main memory. While ordering constraints are necessary for recoverability, they are expensive to enforce due to the high write-latencies exhibited by popular persistent memory technologies. Moreover, reasoning about recovery correctness using memory persistency models in addition to ensuring necessary concurrency control in multi-threaded programs drastically increases programming burden. This thesis aims at increasing the adoption of persistent memories through a) improving the performance of recoverable data structures and b) simplifying persistent memory programming. Software transaction abstractions developed using recently proposed memory persistency models are expected to be widely used by regular programmers to exploit the advantages of persistent memory. This thesis shows that a straightforward implementation of transactions imposes many unnecessary constraints on stores to persistent memory. This thesis also shows how to reduce these constraints through a variety of techniques, notably, deferring transaction commit until after locks are released, resulting in substantial performance improvements. Next, this thesis shows the high cost of enforcing ordering constraints using recent x86 ISA extensions to enable persistent memory programming, an ordering model referred to as synchronous ordering. Synchronous ordering tightly couples enforcing order with writing back stores to main memory, but this tight coupling is often unnecessary to ensure recoverablity. Instead, this thesis proposes delegated persist ordering, wherein ordering requirements are communicated explicitly to the persistent memory controller via novel enhancements to the cache hierarchy. Delegated persist ordering decouples store ordering from processor execution and cache management, significantly reducing processor stalls, and hence, the cost of enforcing constraints. Finally, existing memory persistency models have all been specified to be used in conjunction with ISA-level memory models. That is, programmers must reason about recovery correctness at the abstraction of assembly instructions, an approach which is error prone and places an unreasonable burden on the programmer. This thesis argues for a language-level persistency model that provides mechanisms to specify the semantics of accesses to persistent memory as an integral part of the programming language and proposes a concrete model, acquire-release persistency, that extends C++11s memory model to provide persistency semantics.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/136953/1/akolli_1.pd

    Compiler architecture using a portable intermediate language

    Get PDF
    The back end of a compiler performs machine-dependent tasks and low-level optimisations that are laborious to implement and difficult to debug. In addition, in languages that require run-time services such as garbage collection, the back end must interface with the run-time system to provide those services. The net result is that building a compiler back end entails a high implementation cost. In this dissertation I describe reusable code generation infrastructure that enables the construction of a complete programming language implementation (compiler and run-time system) with reduced effort. The infrastructure consists of a portable intermediate language, a compiler for this language and a low-level run-time system. I provide an implementation of this system and I show that it can support a variety of source programming languages, it reduces the overall eort required to implement a programming language, it can capture and retain information necessary to support run-time services and optimisations, and it produces efficient code

    Crafting Concurrent Data Structures

    Get PDF
    Concurrent data structures lie at the heart of modern parallel programs. The design and implementation of concurrent data structures can be challenging due to the demand for good performance (low latency and high scalability) and strong progress guarantees. In this dissertation, we enrich the knowledge of concurrent data structure design by proposing new implementations, as well as general techniques to improve the performance of existing ones.The first part of the dissertation present an unordered linked list implementation that supports nonblocking insert, remove, and lookup operations. The algorithm is based on a novel ``enlist\u27\u27 technique that greatly simplifies the task of achieving wait-freedom. The value of our technique is also demonstrated in the creation of other wait-free data structures such as stacks and hash tables.The second data structure presented is a nonblocking hash table implementation which solves a long-standing design challenge by permitting the hash table to dynamically adjust its size in a nonblocking manner. Additionally, our hash table offers strong theoretical properties such as supporting unbounded memory. In our algorithm, we introduce a new ``freezable set\u27\u27 abstraction which allows us to achieve atomic migration of keys during a resize. The freezable set abstraction also enables highly efficient implementations which maximally exploit the processor cache locality. In experiments, we found our lock-free hash table performs consistently better than state-of-the-art implementations, such as the split-ordered list.The third data structure we present is a concurrent priority queue called the ``mound\u27\u27. Our implementations include nonblocking and lock-based variants. The mound employs randomization to reduce contention on concurrent insert operations, and decomposes a remove operation into smaller atomic operations so that multiple remove operations can execute in parallel within a pipeline. In experiments, we show that the mound can provide excellent latency at low thread counts.Lastly, we discuss how hardware transactional memory (HTM) can be used to accelerate existing nonblocking concurrent data structure implementations. We propose optimization techniques that can significantly improve the performance (1.5x to 3x speedups) of a variety of important concurrent data structures, such as binary search trees and hash tables. The optimizations also preserve the strong progress guarantees of the original implementations

    Runtime Systems for Persistent Memories

    Full text link
    Emerging persistent memory (PM) technologies promise the performance of DRAM with the durability of disk. However, several challenges remain in existing hardware, programming, and software systems that inhibit wide-scale PM adoption. This thesis focuses on building efficient mechanisms that span hardware and operating systems, and programming languages for integrating PMs in future systems. First, this thesis proposes a mechanism to solve low-endurance problem in PMs. PMs suffer from limited write endurance---PM cells can be written only 10^7-10^9 times before they wear out. Without any wear management, PM lifetime might be as low as 1.1 months. This thesis presents Kevlar, an OS-based wear-management technique for PM, that requires no new hardware. Kevlar uses existing virtual memory mechanisms to remap pages, enabling it to perform both wear leveling---shuffling pages in PM to even wear; and wear reduction---transparently migrating heavily written pages to DRAM. Crucially, Kevlar avoids the need for hardware support to track wear at fine grain. It relies on a novel wear-estimation technique that builds upon Intel's Precise Event Based Sampling to approximately track processor cache contents via a software-maintained Bloom filter and estimate write-back rates at fine grain. Second, this thesis proposes a persistency model for high-level languages to enable integration of PMs in to future programming systems. Prior works extend language memory models with a persistency model prescribing semantics for updates to PM. These approaches require high-overhead mechanisms, are restricted to certain synchronization constructs, provide incomplete semantics, and/or may recover to state that cannot arise in fault-free program execution. This thesis argues for persistency semantics that guarantee failure atomicity of synchronization-free regions (SFRs) --- program regions delimited by synchronization operations. The proposed approach provides clear semantics for the PM state that recovery code may observe and extends C++11's "sequential consistency for data-race-free" guarantee to post-failure recovery code. To this end, this thesis investigates two designs for failure-atomic SFRs that vary in performance and the degree to which commit of persistent state may lag execution. Finally, this thesis proposes StrandWeaver, a hardware persistency model that minimally constrains ordering on PM operations. Several language-level persistency models have emerged recently to aid programming recoverable data structures in PM. The language-level persistency models are built upon hardware primitives that impose stricter ordering constraints on PM operations than the persistency models require. StrandWeaver manages PM order within a strand, a logically independent sequence of PM operations within a thread. PM operations that lie on separate strands are unordered and may drain concurrently to PM. StrandWeaver implements primitives under strand persistency to allow programmers to improve concurrency and relax ordering constraints on updates as they drain to PM. Furthermore, StrandWeaver proposes mechanisms that map persistency semantics in high-level language persistency models to the primitives implemented by StrandWeaver.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/155100/1/vgogte_1.pd

    Cornucopia: Temporal safety for CHERI heaps

    Get PDF
    Use-after-free violations of temporal memory safety continue to plague software systems, underpinning many high-impact exploits. The CHERI capability system shows great promise in achieving C and C++ language spatial memory safety, preventing out-of-bounds accesses. Enforcing language-level temporal safety on CHERI requires capability revocation, traditionally achieved either via table lookups (avoided for performance in the CHERI design) or by identifying capabilities in memory to revoke them (similar to a garbage-collector sweep). CHERIvoke, a prior feasibility study, suggested that CHERI’s tagged capabilities could make this latter strategy viable, but modeled only architectural limits and did not consider the full implementation or evaluation of the approach. Cornucopia is a lightweight capability revocation system for CHERI that implements non-probabilistic C/C++ temporal memory safety for standard heap allocations. It extends the CheriBSD virtual-memory subsystem to track capability flow through memory and provides a concurrent kernel-resident revocation service that is amenable to multi-processor and hardware acceleration. We demonstrate an average overhead of less than 2% and a worst-case of 8.9% for concurrent revocation on compatible SPEC CPU2006 benchmarks on a multi-core CHERI CPU on FPGA, and we validate Cornucopia against the Juliet test suite’s corpus of temporally unsafe programs. We test its compatibility with a large corpus of C programs by using a revoking allocator as the system allocator while booting multi-user CheriBSD. Cornucopia is a viable strategy for always-on temporal heap memory safety, suitable for production environments.This work was supported by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL), under contracts FA8750-10-C-0237 (“CTSRD”) and HR0011-18-C-0016 (“ECATS”). We also acknowledge the EPSRC REMS Programme Grant (EP/K008528/1), the ABP Grant (EP/P020011/1), the ERC ELVER Advanced Grant (789108), the Gates Cambridge Trust, Arm Limited, HP Enterprise, and Google, Inc
    • …
    corecore