    The Java Memory Model is Type Safe

    Specialization without complexity in heterogeneous memory systems

    The end of Dennard scaling and Moore's law has motivated a rise in the use of parallelism and hardware specialization in computer system design. Across all compute domains, applications have increasingly relied on specialized devices such as GPUs, DSPs, FPGAs, etc., to execute tasks faster and more efficiently, but interfacing these diverse devices within a heterogeneous system remains an important challenge. Early heterogeneous systems were loosely coupled and lacked a shared coherent memory interface, so specialization was reserved for highly regular code patterns with coarse-grained synchronization requirements. More recently, the need to accelerate applications with more irregular and fine-grained sharing patterns has led to significant research into closer integration of specialized devices. A single global address space enables improved programmability, communication efficiency, data reuse, and load balancing for emerging heterogeneous applications. Consequently, there have been many attempts to integrate specialized devices and their caches into a single coherent memory hierarchy to improve performance in future systems-on-chip (SoCs). However, coherence is particularly difficult to implement in heterogeneous systems. Differences in parallelism, locality, and synchronization in high-throughput accelerators such as GPUs mean that coherence and consistency strategies designed for CPUs are ineffective, and evaluating the performance of alternative strategies is difficult. Recent efforts to implement coherence for such devices involve a simple software-driven coherence strategy combined with complex extensions to a conventional memory consistency model, which guarantees sequential consistency (SC) for programs that are data race-free (DRF). 
The first extension, scoped synchronization, avoids coherence costs when synchronization is guaranteed to be local, but it requires the use of the heterogeneous race-free (HRF) consistency model, which limits sharing patterns and increases the burden on the programmer. The second extension, relaxed atomics, allows the programmer to avoid costly ordering constraints when they are unnecessary for functionality, but existing consistency models offer complex and often poorly specified semantics when relaxed atomics are used. Once an appropriate coherence and consistency strategy is determined for a device, interfacing it with devices using different strategies poses another critical challenge. Existing integration strategies are incremental, either sacrificing system flexibility or incurring significant added complexity to achieve this goal. A rethinking of heterogeneous coherence and protocol integration from the ground up is needed. This work lays out a path to implementing flexible and efficient heterogeneous coherence without adding complexity to the consistency model or the system design. To help understand the memory demands of emerging specialized hardware, we first describe a performance analysis tool we developed for highly parallel workloads. Insights from this tool helped guide the development of a collection of coherence and consistency innovations for high-throughput accelerators. On the coherence side, we describe two innovations, DeNovo for GPUs and heterogeneous lazy release consistency (hLRC), which demonstrate that scoped synchronization is not necessary for cache efficiency in high-throughput devices. On the consistency side, this work describes the DRFrlx consistency model, which formalizes safe use cases of atomic relaxation. Again, we offer these benefits while retaining a simple SC-centric DRF consistency model. Finally, to address the challenge of integrating diverse coherence strategies, we present the Spandex coherence interface. 
Spandex can flexibly and simply integrate devices with a broad range of memory demands in an SoC, and we show how this flexibility enables new performance optimizations that can take advantage of hints about the expected memory demands of an application. Together, these innovations establish a framework for integrating future SoCs that can dynamically adapt to serve the diverse memory demands of future accelerators without incurring complexity for hardware or software designers.

    Efficient coherence and consistency for specialized memory hierarchies

    As the benefits from transistor scaling slow down, specialization is becoming increasingly important for a wide range of applications. Although traditional heterogeneous systems work well for streaming, data parallel applications, they are inefficient for emerging applications, like graph analytics workloads, with fine-grained synchronization, relaxed atomics, and more general sharing patterns. Heterogeneous systems are also difficult to program, which makes it harder for programmers to take advantage of the potential benefits of specialization. This thesis redesigns the memory hierarchy of heterogeneous systems to make heterogeneous systems more efficient and easier to use. In particular, we focus on three key sources of inefficiency in the memory hierarchy of modern heterogeneous systems: (1) a unified global address space, (2) the cache coherence protocol, and (3) the memory consistency model. A unified global address space makes it easier to write programs for heterogeneous systems. Although industry has recently begun to provide a unified global address space across CPUs and accelerators (primarily GPUs), there are many inefficiencies. For example, emerging applications with fine-grained synchronization need better support for coherence and consistency. We find that simple coherence and complex consistency are key sources of inefficiency. To resolve this problem, we adjust the division of complexity between the cache coherence protocol and memory consistency model: we introduce DeNovo for accelerators (DeNovoA), which extends DeNovo’s hybrid, software-driven hardware coherence protocol to heterogeneous systems. Unlike current coherence protocols for heterogeneous systems, DeNovoA obtains ownership for written data, enables heterogeneous systems to use the simpler sequentially consistent for data-race-free (SC-for-DRF, or DRF) memory consistency model, and provides both efficiency and programmability. 
Across a wide variety of applications, DeNovoA with a DRF memory consistency model either outperforms or provides comparable efficiency to the state-of-the-art approach. Although DRF is easier to use and works well for most applications, there are some corner cases where its overheads are unnecessary and hurt performance. This led to the introduction of relaxed atomics in the memory consistency models for multi-core CPUs and heterogeneous systems. Although relaxed atomics can significantly improve performance, they are very difficult to use correctly. We address the impact of relaxed atomics on memory consistency models for heterogeneous systems by creating a new memory consistency model, Data-Race-Free-Relaxed or DRFrlx. DRFrlx extends the existing DRF memory consistency models to provide SC-centric semantics for all common uses of relaxed atomics in heterogeneous systems while retaining their efficiency benefits. Thus, DRFrlx makes it easier for programmers to safely use relaxed atomics. Although current heterogeneous systems are adopting unified global address spaces, specialized memories such as scratchpads still exist in disjoint, private address spaces. This increases programming complexity and causes inefficiencies that negate some of the benefits of specialization. We introduce a new memory organization, stash, that mitigates the inefficiencies of specialized memories by integrating them into the coherent, globally visible address space. Stash makes it easier for programmers to use specialized memories and retains their efficiency benefits. Finally, to better understand the tradeoffs and scalability of different coherence protocols and consistency models, we created a suite of synchronization microbenchmarks, HeteroSync. HeteroSync contains various fine-grained synchronization and relaxed atomics algorithms. 
Moreover, HeteroSync is highly configurable and provides a standard set of fine-grained synchronization microbenchmarks to compare the efficiency of different approaches. In summary, this thesis questions the state-of-the-art approaches for designing memory hierarchies of heterogeneous systems, and shows that the current techniques provide neither efficiency nor programmability for emerging workloads. We demonstrate how DeNovoA with a DRFrlx memory consistency model improves efficiency and programmability for many heterogeneous applications and makes it easier for programmers to use heterogeneous systems.

    Programming Languages and Systems

    This open access book constitutes the proceedings of the 31st European Symposium on Programming, ESOP 2022, which was held during April 5-7, 2022, in Munich, Germany, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022. The 21 regular papers presented in this volume were carefully reviewed and selected from 64 submissions. They deal with fundamental issues in the specification, design, analysis, and implementation of programming languages and systems.
