17 research outputs found

    Irregular accesses reorder unit: improving GPGPU memory coalescing for graph-based workloads

    Get PDF
    GPGPU architectures have become the dominant platform for massively parallel workloads, delivering high performance and energy efficiency for popular applications such as machine learning, computer vision or self-driving cars. However, irregular applications, such as graph processing, fail to fully exploit GPGPU resources due to their divergent memory accesses that saturate the memory hierarchy. To reduce the pressure on the memory subsystem for divergent memory-intensive applications, programmers must take into account SIMT execution model and memory coalescing in GPGPUs, devoting significant efforts in complex optimization techniques. Despite these efforts, we show that irregular graph processing still suffers from low GPGPU performance. We observe that in many irregular applications the mapping of data to threads can be safely changed. In other words, it is possible to relax the strict relationship between thread and data processed to reduce memory divergence. Based on this observation, we propose the Irregular accesses Reorder Unit (IRU), a novel hardware extension tightly integrated in the GPGPU pipeline. The IRU reorders data processed by the threads on irregular accesses to improve memory coalescing, i.e., it tries to assign data elements to threads as to produce coalesced accesses in SIMT groups. Furthermore, the IRU is capable of filtering and merging duplicated accesses, significantly reducing the workload. Programmers can easily utilize the IRU with a simple API, or let the compiler issue instructions from our extended ISA. We evaluate our proposal for state-of-the-art graph-based algorithms and a wide selection of applications. Results show that the IRU achieves a memory coalescing improvement of 1.32x and a 46% reduction in the overall traffic in the memory hierarchy, which results in 1.33x speedup and 13% energy savings on average, while incurring in a small 5.6% area overhead.This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00 and the ICREA Academia program.Peer ReviewedPostprint (published version

    A GPU Register File using Static Data Compression

    Full text link
    GPUs rely on large register files to unlock thread-level parallelism for high throughput. Unfortunately, large register files are power hungry, making it important to seek for new approaches to improve their utilization. This paper introduces a new register file organization for efficient register-packing of narrow integer and floating-point operands designed to leverage on advances in static analysis. We show that the hardware/software co-designed register file organization yields a performance improvement of up to 79%, and 18.6%, on average, at a modest output-quality degradation.Comment: Accepted to ICPP'2

    Efficient coherence and consistency for specialized memory hierarchies

    Get PDF
    As the benefits from transistor scaling slow down, specialization is becoming increasingly important for a wide range of applications. Although traditional heterogeneous systems work well for streaming, data parallel applications, they are inefficient for emerging applications, like graph analytics workloads, with fine-grained synchronization, relaxed atomics, and more general sharing patterns. Heterogeneous systems are also difficult to program, which makes it harder for programmers to take advantage of the potential benefits of specialization. This thesis redesigns the memory hierarchy of heterogeneous systems to make heterogeneous systems more efficient and easier to use. In particular, we focus on three key sources of inefficiency in the memory hierarchy of modern heterogeneous systems: (1) a unified global address space, (2) the cache coherence protocol, and (3) the memory consistency model. A unified global address space makes it easier to write programs for heterogeneous systems. Although industry has recently begun to provide a unified global address space across CPUs and accelerators (primarily GPUs), there are many inefficiencies. For example, emerging applications with fine-grained synchronization need better support for coherence and consistency. We find that simple coherence and complex consistency are key sources of inefficiency. To resolve this problem, we adjust the division of complexity between the cache coherence protocol and memory consistency model: we introduce DeNovo for accelerators (DeNovoA), which extends DeNovo’s hybrid, software-driven hardware coherence protocol to heterogeneous systems. Unlike current coherence protocols for heterogeneous systems, DeNovoA obtains ownership for written data, enables heterogeneous systems to use the simpler sequentially consistent for data-race-free (SC-for-DRF, or DRF) memory consistency model, and provides both efficiency and programmability. Across a wide variety of applications, DeNovoA with a DRF memory consistency model either outperforms or provides comparable efficiency to a the state-of-the-art approach. Although DRF is easier to use and works well for most applications, there are some corner cases where its overheads are unnecessary and hurt performance. This led to the introduction of relaxed atomics in the memory consistency models for multi-core CPUs and heterogeneous systems. Although relaxed atomics can significantly improve performance, they are very difficult to use correctly. We address the impact of relaxed atomics on memory consistency models for heterogeneous systems by creating a new memory consistency model, Data-Race-Free-Relaxed or DRFrlx. DRFrlx extends the existing DRF memory consistency models to provide SC-centric semantics for all common uses of relaxed atomics in heterogeneous systems while retaining their efficiency benefits. Thus, DRFrlx makes it easier for programmers to safely use relaxed atomics. Although current heterogeneous systems are adopting unified global address spaces, specialized memories such as scratchpads still exist in disjoint, private address spaces. This increases programming complexity and causes inefficiencies that negate some of the benefits of specialization. We introduce a new memory organization, stash, that mitigates the inefficiencies of specialized memories by integrating them into the coherent, globally visible address space. Stash makes it easier for programmers to use specialized memories and retains their efficiency benefits. Finally, to better understand the tradeoffs and scalability of different coherence protocols and consistency models, we created a suite of synchronization microbenchmarks, HeteroSync. HeteroSync contains various fine-grained synchronization and relaxed atomics algorithms. Moreover, HeteroSync is highly configurable and provides a standard set of fine-grained synchronization microbenchmarks to compare the efficiency of different approaches. In summary, this thesis questions the state-of-the-art approaches for designing memory hierarchies of heterogeneous systems, and shows that the current techniques provide neither efficiency nor programmability for emerging workloads. We demonstrate how DeNovoA with a DRFrlx memory consistency model improves efficiency and programmability for many heterogeneous applications and makes it easier for programmers to use heterogeneous systems

    Convolutional Neural Network Acceleration on GPU by Exploiting Data Reuse

    Get PDF
    Graphical processing units (GPUs) achieve high throughput with hundreds of cores for concurrent execution and a large register file for storing the context of thousands of threads. Deep learning algorithms have recently gained popularity for their capability for solving complex problems without programmer intervention. Deep learning algorithms operate with a massive amount of input data that causes high memory access overhead. In the convolutional layer of the deep learning network, there exists a unique pattern of data access and reuse, which is not effectively utilized by the GPU architecture. These abundant redundant memory accesses lead to a significant power and performance overhead. In this thesis, I maintained redundant data in a faster on-chip memory, register file, so that the data that are used by multiple neurons can be directly fetched from the register file without cumbersome system memory accesses. In this method, a neuron’s load instruction is replaced by a shuffle instruction if the data are found from the register file. To enable data sharing in the register file, a new register type was used as a destination register of load instructions. By using the unique ID of the new load destination registers, neurons can easily find their data in the register file. By exploiting the underutilized register file space, this method does not impose any area or power overhead on the register file design. The effectiveness of the new idea was evaluated through exhaustive experiments. According to the results, the new idea significantly improved performance and energy efficiency compared to baseline architecture and shared memory version solution

    Specialization without complexity in heterogeneous memory systems

    Get PDF
    The end of Dennard scaling and Moore's law has motivated a rise in the use of parallelism and hardware specialization in computer system design. Across all compute domains, applications have increasingly relied on specialized devices such as GPUs, DSPs, FPGAs, etc., to execute tasks faster and more efficiently, but interfacing these diverse devices within a heterogeneous system remains an important challenge. Early heterogeneous systems were loosely coupled and lacked a shared coherent memory interface, so specialization was reserved for highly regular code patterns with coarse-grained synchronization requirements. More recently, the need to accelerate applications with more irregular and fine-grained sharing patterns has led to significant research into closer integration of specialized devices. A single global address space enables improved programmability, communication efficiency, data reuse, and load balancing for emerging heterogeneous applications. Consequently, there have been many attempts to integrate specialized devices and their caches into a single coherent memory hierarchy to improve performance in future systems-on-chip (SoCs). However, coherence is particularly difficult to implement in heterogeneous systems. Differences in parallelism, locality, and synchronization in high-throughput accelerators such as GPUs means that coherence and consistency strategies designed for CPUs are ineffective, and evaluating the performance of alternative strategies is difficult. Recent efforts to implement coherence for such devices involve a simple software-driven coherence strategy combined with complex extensions to a conventional memory consistency model, which guarantees sequential consistency (SC) for programs that are data race-free (DRF). The first extension, scoped synchronization, avoids coherence costs when synchronization is guaranteed to be local, but it requires the use of the heterogeneous race-free (HRF) consistency model, which limits sharing patterns and increases the burden on the programmer. The second extension, relaxed atomics, allows the programmer to avoid costly ordering constraints when they are unnecessary for functionality, but existing consistency models offer complex and often poorly specified semantics when relaxed atomics are used. Once an appropriate coherence and consistency strategy is determined for a device, interfacing it with devices using different strategies poses another critical challenge. Existing integration strategies are incremental, either sacrificing system flexibility or incurring significant added complexity to achieve this goal. A rethinking of heterogeneous coherence and protocol integration from the ground up is needed. This work lays out a path to implementing flexible and efficient heterogeneous coherence without adding complexity to the consistency model or the system design. To help understand the memory demands of emerging specialized hardware, we first describe a performance analysis tool we developed for highly parallel workloads. Insights from this tool helped guide the development of a collection of coherence and consistency innovations for high-throughput accelerators. On the coherence side, we describe two innovations, DeNovo for GPUs and heterogeneous lazy release consistency (hLRC), which demonstrate that scoped synchronization is not necessary for cache efficiency in high-throughput devices. On the consistency side, this work describes the DRFrlx consistency model, which formalizes safe use cases of atomic relaxation. Again, we offer these benefits while retaining a simple SC-centric DRF consistency model. Finally, to address the challenge of integrating diverse coherence strategies, we present the Spandex coherence interface. Spandex can flexibly and simply integrate devices with a broad range of memory demands in an SoC, and we show how this flexibility enables new performance optimizations that can take advantage of hints about the expected memory demands of an application. Together, these innovations establish a framework for integrating future SoCs that can dynamically adapt to serve the diverse memory demands of future accelerators without incurring complexity for hardware or software designers

    A RECONFIGURABLE AND EXTENSIBLE EXPLORATION PLATFORM FOR FUTURE HETEROGENEOUS SYSTEMS

    Get PDF
    Accelerator-based -or heterogeneous- computing has become increasingly important in a variety of scenarios, ranging from High-Performance Computing (HPC) to embedded systems. While most solutions use sometimes custom-made components, most of today’s systems rely on commodity highend CPUs and/or GPU devices, which deliver adequate performance while ensuring programmability, productivity, and application portability. Unfortunately, pure general-purpose hardware is affected by inherently limited power-efficiency, that is, low GFLOPS-per-Watt, now considered as a primary metric. The many-core model and architectural customization can play here a key role, as they enable unprecedented levels of power-efficiency compared to CPUs/GPUs. However, such paradigms are still immature and deeper exploration is indispensable. This dissertation investigates customizability and proposes novel solutions for heterogeneous architectures, focusing on mechanisms related to coherence and network-on-chip (NoC). First, the work presents a non-coherent scratchpad memory with a configurable bank remapping system to reduce bank conflicts. The experimental results show the benefits of both using a customizable hardware bank remapping function and non-coherent memories for some types of algorithms. Next, we demonstrate how a distributed synchronization master better suits many-cores than standard centralized solutions. This solution, inspired by the directory-based coherence mechanism, supports concurrent synchronizations without relying on memory transactions. The results collected for different NoC sizes provided indications about the area overheads incurred by our solution and demonstrated the benefits of using a dedicated hardware synchronization support. Finally, this dissertation proposes an advanced coherence subsystem, based on the sparse directory approach, with a selective coherence maintenance system which allows coherence to be deactivated for blocks that do not require it. Experimental results show that the use of a hybrid coherent and non-coherent architectural mechanism along with an extended coherence protocol can enhance performance. The above results were all collected by means of a modular and customizable heterogeneous many-core system developed to support the exploration of power-efficient high-performance computing architectures. The system is based on a NoC and a customizable GPU-like accelerator core, as well as a reconfigurable coherence subsystem, ensuring application-specific configuration capabilities. All the explored solutions were evaluated on this real heterogeneous system, which comes along with the above methodological results as part of the contribution in this dissertation. In fact, as a key benefit, the experimental platform enables users to integrate novel hardware/software solutions on a full-system scale, whereas existing platforms do not always support a comprehensive heterogeneous architecture exploration
    corecore