28 research outputs found

    Architectural support for task dependence management with flexible software scheduling

    The growing complexity of multi-core architectures has motivated a wide range of software mechanisms to improve the orchestration of parallel executions. Task parallelism has become a very attractive approach thanks to its programmability, portability and potential for optimizations. However, with the expected increase in core counts, finer-grained tasking will be required to exploit the available parallelism, which will increase the overheads introduced by the runtime system. This work presents Task Dependence Manager (TDM), a hardware/software co-designed mechanism to mitigate runtime system overheads. TDM introduces a hardware unit, denoted Dependence Management Unit (DMU), and minimal ISA extensions that allow the runtime system to offload costly dependence tracking operations to the DMU while still performing task scheduling in software. With lower hardware cost, TDM outperforms hardware-based solutions and enhances the flexibility, adaptability and composability of the system. Results show that TDM improves performance by 12.3% and reduces EDP by 20.4% on average with respect to a software runtime system. Compared to a runtime system fully implemented in hardware, TDM achieves an average speedup of 4.2% with 7.3x lower area requirements and significant EDP reductions. In addition, five different software schedulers are evaluated with TDM, illustrating its flexibility and performance gains.

    This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P, TIN2016-76635-C2-2-R and TIN2016-81840-REDT), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union’s Horizon 2020 research and innovation programme under grant agreements No. 671697 and No. 671610. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047.

    Peer reviewed. Postprint (author's final draft).
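    A minimal sketch of the split TDM describes, with dependence bookkeeping hidden behind a dmu_* interface and the scheduling policy left in ordinary software. The dmu_* names and the trivial single-threaded ready queue standing in for the hardware unit are illustrative assumptions, not TDM's actual ISA extensions.

        #include <stddef.h>

        typedef void (*task_fn)(void *);

        typedef struct task {
            task_fn      fn;
            void        *args;
            struct task *next;   /* ready-queue link */
        } task_t;

        /* Software stand-in for the hardware DMU: the real unit would
         * keep this state on chip behind minimal ISA extensions. */
        static task_t *ready_head;

        void dmu_mark_ready(task_t *t)   /* models: all deps satisfied */
        {
            t->next = ready_head;
            ready_head = t;
        }

        task_t *dmu_poll_ready(void)     /* models: a new instruction  */
        {
            task_t *t = ready_head;
            if (t)
                ready_head = t->next;
            return t;
        }

        /* Scheduling stays in software: the policy here (LIFO) could be
         * swapped for a different scheduler without touching the unit. */
        void worker_loop(void)
        {
            task_t *t;
            while ((t = dmu_poll_ready()) != NULL)
                t->fn(t->args);
        }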

    Towards fair, scalable, locking

    Without care, Hardware Transactional Memory (HTM) presents several performance pathologies that can degrade its performance. Among them, writers of commonly read variables can suffer from starvation. Though different solutions have been proposed for HTM systems, hybrid systems can still suffer from this problem, given that software transactions do not interact with the mechanisms hardware uses to avoid starvation. In this paper we introduce a new per-directory-line hardware contention management mechanism that allows fairer access between software and hardware threads without the need to abort any transaction. Our mechanism is based on “reserving” directory lines, implementing a limited fair queue for the requests on each line. We adapt the mechanism to the LogTM conflict detection mechanism and show that the resulting proposal is deadlock-free. Finally, we sketch how the idea could be applied more generally to reader-writer locks.

    Postprint (published version).
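    The "reserving directory lines" idea can be mimicked in software as a bounded ticket queue, sketched below under that analogy. QUEUE_SLOTS and the spin-wait are assumptions; the actual hardware would arbitrate coherence requests (e.g., by NACK and replay) rather than spin.

        #include <stdatomic.h>
        #include <stdbool.h>

        #define QUEUE_SLOTS 8   /* "limited" queue: hardware would cap this */

        typedef struct {
            atomic_uint head;   /* ticket currently being served */
            atomic_uint tail;   /* next ticket to hand out       */
        } line_reservation_t;

        /* Try to reserve the line; fail (fall back to normal arbitration)
         * if the bounded queue is full, mirroring a limited hardware FIFO. */
        bool line_reserve(line_reservation_t *r, unsigned *my_ticket)
        {
            unsigned t = atomic_load(&r->tail);
            do {
                if (t - atomic_load(&r->head) >= QUEUE_SLOTS)
                    return false;                       /* queue full */
            } while (!atomic_compare_exchange_weak(&r->tail, &t, t + 1));
            *my_ticket = t;
            return true;
        }

        void line_wait(line_reservation_t *r, unsigned my_ticket)
        {
            while (atomic_load(&r->head) != my_ticket)
                ;   /* requests are granted strictly in arrival order */
        }

        void line_release(line_reservation_t *r)
        {
            atomic_fetch_add(&r->head, 1);  /* serve the next requester */
        }

    Because grants follow arrival order, a writer of a hot line cannot be starved by a stream of later readers, which is the fairness property the hardware mechanism targets.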

    Active memory controller

    Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of the data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of the AMC and the programming model of AMOs. We compare AMOs' performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50x faster barriers, 12x faster spinlocks, 8.5x-15x faster stream/array operations, and 3x faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high-performance processor, based on a standard-cell implementation.
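    A sketch of why AMOs accelerate barriers so dramatically: each arrival becomes one message executed at the counter's home memory controller rather than a coherence ping-pong of the cache line. amo_fetch_add() below is a software stand-in for that hypothetical one-message operation, not the AMC's real interface.

        #include <stdatomic.h>
        #include <stdint.h>

        typedef struct {
            atomic_uint_fast64_t count;       /* home-node counter          */
            uint64_t             threads;     /* number of participants     */
            atomic_uint_fast64_t generation;  /* bumped when all arrive     */
        } amo_barrier_t;

        /* Stand-in: on AMC hardware this would be a single AMO request to
         * the home controller, with no read-for-ownership at the core.   */
        static uint64_t amo_fetch_add(atomic_uint_fast64_t *p, uint64_t v)
        {
            return atomic_fetch_add(p, v);
        }

        void amo_barrier_wait(amo_barrier_t *b)
        {
            uint64_t gen = atomic_load(&b->generation);

            if (amo_fetch_add(&b->count, 1) == b->threads - 1) {
                atomic_store(&b->count, 0);          /* last arrival resets */
                atomic_fetch_add(&b->generation, 1); /* release everyone    */
            } else {
                while (atomic_load(&b->generation) == gen)
                    ;  /* spin on a locally cached line, not the counter */
            }
        }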

    UVSIM user manual: version 0.1

    Technical report. This document describes UVSIM, the UltraViolet Simulator. UVSIM is an execution-driven simulator modeling the UltraViolet system, SGI's entry in DARPA's HPCS Initiative.

    Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients

    Applying differentiable programming techniques and machine learning algorithms to foreign programs requires developers to either rewrite their code in a machine learning framework, or otherwise provide derivatives of the foreign code. This paper presents Enzyme, a high-performance automatic differentiation (AD) compiler plugin for the LLVM compiler framework capable of synthesizing gradients of statically analyzable programs expressed in the LLVM intermediate representation (IR). Enzyme synthesizes gradients for programs written in any language whose compiler targets LLVM IR, including C, C++, Fortran, Julia, Rust, Swift, MLIR, etc., thereby providing native AD capabilities in these languages. Unlike traditional source-to-source and operator-overloading tools, Enzyme performs AD on optimized IR. On a machine-learning focused benchmark suite including Microsoft's ADBench, AD on optimized IR achieves a geometric mean speedup of 4.5x over AD on IR before optimization, allowing Enzyme to achieve state-of-the-art performance. Packaging Enzyme for PyTorch and TensorFlow provides convenient access to gradients of foreign code with state-of-the-art performance, enabling foreign code to be directly incorporated into existing machine learning workflows.

    Comment: To be published in NeurIPS 2020.
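    In the style of the minimal example from Enzyme's documentation, the __enzyme_autodiff entry point asks the plugin to synthesize a gradient at compile time; the build is assumed to load the Enzyme plugin into clang.

        #include <stdio.h>

        /* Resolved by the Enzyme compiler plugin, which replaces the
         * call with a synthesized gradient of the passed function. */
        extern double __enzyme_autodiff(void *, ...);

        double square(double x) { return x * x; }

        double dsquare(double x)
        {
            /* Differentiate square with respect to its double argument. */
            return __enzyme_autodiff((void *)square, x);
        }

        int main(void)
        {
            printf("d/dx x^2 at x=3: %f\n", dsquare(3.0)); /* prints 6.0 */
            return 0;
        }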

    Characterization and modeling of atomic memory operations in Arm-based architectures

    Efficient fine-grain synchronization is a classic computer architecture challenge that has been extensively addressed in the past. Load-Link/Store-Conditional (LL/SC) became one of the few solutions to this problem and remains part of the state of the art today. However, as core counts keep growing, many Instruction Set Architectures (ISAs) have started to support other synchronization instructions that scale better, such as Atomic Memory Operations (AMOs). In this work we present a characterization of LL/SC and AMO instructions on two current Arm-based server machines. Furthermore, Arm has released its Network-on-Chip (NoC) specification, enabling different hardware implementations of how AMOs are executed in a multicore. Since the adoption of this new standard is still in its early stages, we have modeled six different AMO policies to explore the hardware design trade-offs. We find that no single implementation outperforms the rest. Therefore, we have designed a hardware solution that dynamically selects the best configuration, obtaining speed-ups of up to 1.15x on relevant benchmarks from the Splash-3 benchmark suite.
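    The two instruction classes the paper characterizes are visible from a single C11 atomic. A sketch, assuming GCC or Clang targeting Arm: with the baseline ISA the operation lowers to an LL/SC retry loop, while enabling Armv8.1-A LSE lowers it to a single atomic instruction that the interconnect may execute near or far.

        #include <stdatomic.h>

        atomic_long counter;

        long bump(void)
        {
            /* -march=armv8-a:   loop { LDXR; ADD; STXR } until the
             *                   exclusive store succeeds (LL/SC).
             * -march=armv8.1-a: single LDADD (an AMO), whose execution
             *                   placement is what the modeled NoC
             *                   policies decide. */
            return atomic_fetch_add(&counter, 1);
        }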

    Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems

    We present Coup, a technique to lower the cost of updates to shared data in cache-coherent systems. Coup exploits the insight that many update operations, such as additions and bitwise logical operations, are commutative: they produce the same final result regardless of the order in which they are performed. Coup allows multiple private caches to simultaneously hold update-only permission to the same cache line. Caches with update-only permission can locally buffer and coalesce updates to the line, but cannot satisfy read requests. Upon a read request, Coup reduces the partial updates buffered in private caches to produce the final value. Coup integrates seamlessly into existing coherence protocols, requires inexpensive hardware, and does not affect the memory consistency model. We apply Coup to speed up single-word updates to shared data. On a simulated 128-core, 8-socket system, Coup accelerates state-of-the-art implementations of update-heavy algorithms by up to 2.4×.

    Funding: Center for Future Architectures Research; National Science Foundation (U.S.) (CAREER-1452994); Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science (Grier Presidential Fellowship); Microelectronics Advanced Research Corporation; United States Defense Advanced Research Projects Agency.
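    A software analogue of Coup's insight, for illustration only: commutative additions are buffered in per-thread private slots (playing the role of caches with update-only permission) and reduced when a reader needs the true value. The real mechanism lives in the coherence protocol, not in application code.

        #include <stdatomic.h>

        #define NTHREADS 8

        typedef struct {
            /* pad to a cache line so each private buffer stays private */
            _Alignas(64) atomic_long partial;
        } partial_t;

        static partial_t buf[NTHREADS];

        /* Update path: touches only the caller's private partial sum,
         * like a cache holding update-only permission to the line.   */
        void coup_add(int tid, long v)
        {
            atomic_fetch_add_explicit(&buf[tid].partial, v,
                                      memory_order_relaxed);
        }

        /* Read path: like a read request in Coup, which forces the
         * protocol to reduce the buffered partial updates.          */
        long coup_read(void)
        {
            long sum = 0;
            for (int t = 0; t < NTHREADS; t++)
                sum += atomic_load_explicit(&buf[t].partial,
                                            memory_order_relaxed);
            return sum;
        }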

    DynAMO: Improving parallelism through dynamic placement of atomic memory operations

    With increasing core counts in modern multi-core designs, the overhead of synchronization jeopardizes the scalability and efficiency of parallel applications. To mitigate these overheads, modern cache-coherent protocols offer support for Atomic Memory Operations (AMOs) that can be executed near-core (near) or remotely in the on-chip memory hierarchy (far). This paper evaluates the static AMO execution policies currently available in multi-core Systems-on-Chip (SoC) designs, which select an AMO's execution placement (near or far) based on the coherence state of the cache block. We propose three static policies and show that the performance of static policies is application-dependent. Moreover, we show that one of our proposed static policies outperforms currently available implementations. Furthermore, we propose DynAMO, a predictor that selects the best location to execute each AMO. DynAMO identifies the different locality patterns to make informed decisions, improving AMO latency and increasing overall throughput. DynAMO outperforms the best-performing static policy and provides geometric mean speed-ups of 1.09× across all workloads and 1.31× on AMO-intensive applications with respect to executing all AMOs near.

    This research was supported by the Spanish Ministry of Science and Innovation (MCIN) through contracts [PID2019-107255GB-C21], [TED2021-132634A-I00], and [PID2019-105660RB-C21]; the Generalitat de Catalunya through contract [2021-SGR-00763]; the Government of Aragon [T5820R]; the Arm-BSC Center of Excellence; and the European Processor Initiative (EPI), which is part of the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 826647. V. Soria-Pardos has been supported through an FPU fellowship [FPU20-02132]; A. Armejach is a Serra Hunter Fellow and has been partially supported by Grant [IJCI-2017-33945] funded by MCIN/AEI/10.13039/501100011033; M. Moretó has been supported through a Ramón y Cajal fellowship [RYC-2016-21104].

    Peer reviewed. Postprint (author's final draft).
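    A sketch of what a DynAMO-style placement predictor could look like: a small table of saturating counters, indexed by cache-line address, that votes near versus far. Table size, indexing, and training signal are assumptions for illustration, not the paper's actual design.

        #include <stdint.h>
        #include <stdbool.h>

        #define PRED_ENTRIES 256

        static uint8_t pred[PRED_ENTRIES];   /* 0..3; >=2 means "execute near" */

        static unsigned pred_index(uint64_t addr)
        {
            return (addr >> 6) % PRED_ENTRIES;   /* hash by cache-line address */
        }

        bool amo_place_near(uint64_t addr)
        {
            return pred[pred_index(addr)] >= 2;
        }

        /* Training: reuse by the same core soon after the AMO suggests the
         * line is local, so bias toward near; contention from other cores
         * suggests far execution at the shared cache would be cheaper.   */
        void amo_train(uint64_t addr, bool local_reuse_observed)
        {
            unsigned i = pred_index(addr);
            if (local_reuse_observed) {
                if (pred[i] < 3) pred[i]++;
            } else {
                if (pred[i] > 0) pred[i]--;
            }
        }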