796 research outputs found

    Deterministic Consistency: A Programming Model for Shared Memory Parallelism

    Full text link
    The difficulty of developing reliable parallel software is generating interest in deterministic environments, where a given program and input can yield only one possible result. Languages or type systems can enforce determinism in new code, and runtime systems can impose synthetic schedules on legacy parallel code. To parallelize existing serial code, however, we would like a programming model that is naturally deterministic without language restrictions or artificial scheduling. We propose "deterministic consistency", a parallel programming model as easy to understand as the "parallel assignment" construct in sequential languages such as Perl and JavaScript, where concurrent threads always read their inputs before writing shared outputs. DC supports common data- and task-parallel synchronization abstractions such as fork/join and barriers, as well as non-hierarchical structures such as producer/consumer pipelines and futures. A preliminary prototype suggests that software-only implementations of DC can run applications written for popular parallel environments such as OpenMP with low (<10%) overhead for some applications.Comment: 7 pages, 3 figure

    Cache Equalizer: A Cache Pressure Aware Block Placement Scheme for Large-Scale Chip Multiprocessors

    Get PDF
    This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large scale chip multiprocessors (CMPs). Our work is motivated by large asymmetry in cache sets usages. CE decouples the physical locations of cache blocks from their addresses for the sake of reducing misses caused by destructive interferences. Temporal pressure at the on-chip last-level cache, is continuously collected at a group (comprised of cache sets) granularity, and periodically recorded at the memory controller to guide the placement process. An incoming block is consequently placed at a cache group that exhibits the minimum pressure. CE provides Quality of Service (QoS) by robustly offering better performance than the baseline shared NUCA cache. Simulation results using a full-system simulator demonstrate that CE outperforms shared NUCA caches by an average of 15.5% and by as much as 28.5% for the benchmark programs we examined. Furthermore, evaluations manifested the outperformance of CE versus related CMP cache designs

    A Comparative Analysis of STM Approaches to Reduction Operations in Irregular Applications

    Get PDF
    As a recently consolidated paradigm for optimistic concurrency in modern multicore architectures, Transactional Memory (TM) can help to the exploitation of parallelism in irregular applications when data dependence information is not available up to run- time. This paper presents and discusses how to leverage TM to exploit parallelism in an important class of irregular applications, the class that exhibits irregular reduction patterns. In order to test and compare our techniques with other solutions, they were implemented in a software TM system called ReduxSTM, that acts as a proof of concept. Basically, ReduxSTM combines two major ideas: a sequential-equivalent ordering of transaction commits that assures the correct result, and an extension of the underlying TM privatization mechanism to reduce unnecessary overhead due to reduction memory updates as well as unnecesary aborts and rollbacks. A comparative study of STM solutions, including ReduxSTM, and other more classical approaches to the parallelization of reduction operations is presented in terms of time, memory and overhead.Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech

    Investigation of LSTM Based Prediction for Dynamic Energy Management in Chip Multiprocessors

    Get PDF
    In this paper, we investigate the effectiveness of using long short-term memory (LSTM) instead of Kalman filtering to do prediction for the purpose of constructing dynamic energy management (DEM) algorithms in chip multi-processors (CMPs). Either of the two prediction methods is employed to estimate the workload in the next control period for each of the processor cores. These estimates are then used to select voltage-frequency (VF) pairs for each core of the CMP during the next control period as part of a dynamic voltage and frequency scaling (DVFS) technique. The objective of the DVFS technique is to reduce energy consumption under performance constraints that are set by the user. We conduct our investigation using a custom Sniper system simulation framework. Simulation results for 16 and 64 core network-on-chip based CMP architectures and using several benchmarks demonstrate that the LSTM is slightly better than Kalman filtering

    Investigation of LSTM Based Prediction for Dynamic Energy Management in Chip Multiprocessors

    Get PDF
    In this paper, we investigate the effectiveness of using long short-term memory (LSTM) instead of Kalman filtering to do prediction for the purpose of constructing dynamic energy management (DEM) algorithms in chip multi-processors (CMPs). Either of the two prediction methods is employed to estimate the workload in the next control period for each of the processor cores. These estimates are then used to select voltage-frequency (VF) pairs for each core of the CMP during the next control period as part of a dynamic voltage and frequency scaling (DVFS) technique. The objective of the DVFS technique is to reduce energy consumption under performance constraints that are set by the user. We conduct our investigation using a custom Sniper system simulation framework. Simulation results for 16 and 64 core network-on-chip based CMP architectures and using several benchmarks demonstrate that the LSTM is slightly better than Kalman filtering

    Energy-Efficient Hardware-Accelerated Synchronization for Shared-L1-Memory Multiprocessor Clusters

    Get PDF
    The steeply growing performance demands for highly power- and energy-constrained processing systems such as end-nodes of the Internet-of-Things (IoT) have led to parallel near-threshold computing (NTC), joining the energy-efficiency benefits of low-voltage operation with the performance typical of parallel systems. Shared-L1-memory multiprocessor clusters are a promising architecture, delivering performance in the order of GOPS and over 100 GOPS/W of energy-efficiency. However, this level of computational efficiency can only be reached by maximizing the effective utilization of the processing elements (PEs) available in the clusters. Along with this effort, the optimization of PE-to-PE synchronization and communication is a critical factor for performance. In this article, we describe a light-weight hardware-accelerated synchronization and communication unit (SCU) for tightly-coupled clusters of processors. We detail the architecture, which enables fine-grain per-PE power management, and its integration into an eight-core cluster of RISC-V processors. To validate the effectiveness of the proposed solution, we implemented the eight-core cluster in advanced 22 nm FDX technology and evaluated performance and energy-efficiency with tunable microbenchmarks and a set of rea-life applications and kernels. The proposed solution allows synchronization-free regions as small as 42 cycles, over 41 smaller than the baseline implementation based on fast test-and-set access to L1 memory when constraining the microbenchmarks to 10 percent synchronization overhead. When evaluated on the real-life DSP-applications, the proposed SCU improves performance by up to 92 and 23 percent on average and energy efficiency by up to 98 and 39 percent on average

    Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems

    Get PDF
    We present Coup, a technique to lower the cost of updates to shared data in cache-coherent systems. Coup exploits the insight that many update operations, such as additions and bitwise logical operations, are commutative: they produce the same final result regardless of the order they are performed in. Coup allows multiple private caches to simultaneously hold update-only permission to the same cache line. Caches with update-only permission can locally buffer and coalesce updates to the line, but cannot satisfy read requests. Upon a read request, Coup reduces the partial updates buffered in private caches to produce the final value. Coup integrates seamlessly into existing coherence protocols, requires inexpensive hardware, and does not affect the memory consistency model. We apply Coup to speed up single-word updates to shared data. On a simulated 128-core, 8-socket system, Coup accelerates state-of-the-art implementations of update-heavy algorithms by up to 2.4×.Center for Future Architectures ResearchNational Science Foundation (U.S.) (CAREER-1452994)Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science (Grier Presidential Fellowship)Microelectronics Advanced Research CorporationUnited States. Defense Advanced Research Projects Agenc

    Efficient Machine-Independent Programming of High-Performance Multiprocessors

    Get PDF
    Parallel computing is regarded by most computer scientists as the most likely approach for significantly improving computing power for scientists and engineers. Advances in programming languages and parallelizing compilers are making parallel computers easier to use by providing a high-level portable programming model that protects software investment. However, experience has shown that simply finding parallelism is not always sufficient for obtaining good performance from today's multiprocessors. The goal of this project is to develop advanced compiler analysis of data and computation decompositions, thread placement, communication, synchronization, and memory system effects needed in order to take advantage of performance-critical elements in modern parallel architectures
    corecore