Search CORE

4,399 research outputs found

BarrierPoint: sampled simulation of multi-threaded applications

Author: Carlson Trevor
Eeckhout Lieven
Heirman Wim
Van Craeynest Kenzo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2014
Field of study

Sampling is a well-known technique to speed up architectural simulation of long-running workloads while maintaining accurate performance predictions. A number of sampling techniques have recently been developed that extend well- known single-threaded techniques to allow sampled simulation of multi-threaded applications. Unfortunately, prior work is limited to non-synchronizing applications (e.g., server throughput workloads); requires the functional simulation of the entire application using a detailed cache hierarchy which limits the overall simulation speedup potential; leads to different units of work across different processor architectures which complicates performance analysis; or, requires massive machine resources to achieve reasonable simulation speedups. In this work, we propose BarrierPoint, a sampling methodology to accelerate simulation by leveraging globally synchronizing barriers in multi-threaded applications. BarrierPoint collects microarchitecture-independent code and data signatures to determine the most representative inter-barrier regions, called barrierpoints. BarrierPoint estimates total application execution time (and other performance metrics of interest) through detailed simulation of these barrierpoints only, leading to substantial simulation speedups. Barrierpoints can be simulated in parallel, use fewer simulation resources, and define fixed units of work to be used in performance comparisons across processor architectures. Our evaluation of BarrierPoint using NPB and Parsec benchmarks reports average simulation speedups of 24.7x (and up to 866.6x) with an average simulation error of 0.9% and 2.9% at most. On average, BarrierPoint reduces the number of simulation machine resources needed by 78x

CiteSeerX

Crossref

Ghent University Academic Bibliography

Archivsystem Ask23

SSP: Eliminating Redundant Writes in Failure-Atomic NVRAMs via Shadow Sub-Paging

Author: Bittman Daniel
Coburn Joel
Hitz Dave
Kolli Aasheesh
Kwon Youngjin
Lee Changman
Lee Se Kwon
Minh Chi Cao
Ni Yuanjiang
Pelley Steven
Talluri Madhusudhan
Venkataraman Shivaram
Volos Haris
Xu Jian
Yang Jun
Zhao Jishen
Publication venue: eScholarship, University of California
Publication date: 01/01/2019
Field of study

Crossref

eScholarship - University of California

Message passing support in the Avalanche widget

Author: Stoller Leigh B.
Swanson Mark R.
Publication venue: University of Utah
Publication date: 01/01/1996
Field of study

Journal ArticleMinimizing communication latency in message passing multiprocessing systems is critical. An emerging problem in these systems is the latency contribution costs caused by the need to percolate the message through the memory hierarchy (at both sending and receiving nodes) and the additional cost of managing consistency within the hierarchy. This paper, considers three important aspects of these costs: cache coherence, message copying, and cache miss rates. The paper then shows via a simulation study how a design called the Widget can be used with existing commercial workstation technology to significantly reduce these costs to support efficient message passing in the Avalanche multiprocessing system

The University of Utah: J. Willard Marriott Digital Library

Modeling Cache Coherence to Expose Interference

Author: Brunel Julien
Pagetti Claire
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 31st Euromicro Conference on Real-Time Systems (ECRTS 2019)
Publication date: 01/01/2019
Field of study

To facilitate programming, most multi-core processors feature automated mechanisms maintaining coherence between each core\u27s cache. These mechanisms introduce interference, that is, delays caused by concurrent access to a shared resource. This type of interference is hard to predict, leading to the mechanisms being shunned by real-time system designers, at the cost of potential benefits in both running time and system complexity. We believe that formal methods can provide the means to ensure that the effects of this interference are properly exposed and mitigated. Consequently, this paper proposes a nascent framework relying on timed automata to model and analyze the interference caused by cache coherence

Dagstuhl Research Online Publication Server

DSM64: A Distributed Shared Memory System in User-Space

Author: Holsapple Stephen Alan
Publication venue: DigitalCommons@CalPoly
Publication date: 01/05/2012
Field of study

This paper presents DSM64: a lazy release consistent software distributed shared memory (SDSM) system built entirely in user-space. The DSM64 system is capable of executing threaded applications implemented with pthreads on a cluster of networked machines without any modifications to the target application. The DSM64 system features a centralized memory manager [1] built atop Hoard [2, 3]: a fast, scalable, and memory-efficient allocator for shared-memory multiprocessors. In my presentation, I present a SDSM system written in C++ for Linux operating systems. I discuss a straight-forward approach to implement SDSM systems in a Linux environment using system-provided tools and concepts avail- able entirely in user-space. I show that the SDSM system presented in this paper is capable of resolving page faults over a local area network in as little as 2 milliseconds. In my analysis, I present the following. I compare the performance characteristics of a matrix multiplication benchmark using various memory coherency models. I demonstrate that matrix multiplication benchmark using a LRC model performs orders of magnitude quicker than the same application using a stricter coherency model. I show the effect of coherency model on memory access patterns and memory contention. I compare the effects of different locking strategies on execution speed and memory access patterns. Lastly, I provide a comparison of the DSM64 system to a non-networked version using a system-provided allocator

DigitalCommons@CalPoly

Modeling Cache Coherence to Expose

Author: Brunel Julien
Pagetti Claire
Sensfelder Nathanaël
Publication venue: HAL CCSD
Publication date: 09/07/2019
Field of study

International audienceTo facilitate programming, most multi-core processors feature automated mechanisms maintaining coherence between each core's cache. These mechanisms introduce interference, that is, delays caused by concurrent access to a shared resource. This type of interference is hard to predict, leading to the mechanisms being shunned by real-time system designers, at the cost of potential benefits in both running time and system complexity. We believe that formal methods can provide the means to ensure that the effects of this interference are properly exposed and mitigated. Consequently, this paper proposes a nascent framework relying on timed automata to model and analyze the interference caused by cache coherence

Exploring the value of supporting multiple DSM protocols in Hardware DSM Controllers

Author: Kuramkote Ravindra
Publication venue: University of Utah
Publication date: 01/01/1999
Field of study

Journal ArticleThe performance of a hardware distributed shared memory (DSM) system is largely dependent on its architect's ability to reduce the number of remote memory misses that occur. Previous attempts to solve this problem have included measures such as supporting both the CC-NUMA and S-COMA architectures is the same machine and providing a programmable DSM controller that can emulate any DSM mechanism. In this paper we first present the design of a DSM controller that supports multiple DSM protocols in custom hardware, and allows the programmer or compiler to specify on a per-variable basis what protocol to use to keep that variable coherent. This simulated performance of this DSM controller compares favorably with that of conventional single-protocol custom hardware designs, often outperforming the conventional systems by a factor of two. To achieve these promising results, that multi-protocol DSM controller needed to support only two DSM architectures (CC-NUMA and S-COMA) and three coherency protocols (both release and sequentially consistent write invalidate and release consistent write update). This work demonstrates the value of supporting a degree of flexibility in one's DSM controller design and suggests what operations such a flexible DSM controller should support

The University of Utah: J. Willard Marriott Digital Library

ISIM: The simulator for the impulse adaptable memory system

Author: Zhang Lixin
Publication venue: University of Utah
Publication date: 18/09/1998
Field of study

technical reportThis document describes ISIM, the simulator for the Impulse Adaptable Memory System. Impulse adds two new features to a conventional memory system. First, it supports a configurable, extra level of address remapping at the memory controller. Second, it supports prefetching at the memory controller. consequently, two new units, a remapping controller and a memory controller cache, are added to a traditional memory system to support the new Impulse features. ISIM is based on Paint, a PA-RISC instruction set interpreter. ISIM extends Paint with a detailed Impulse memory system model which includes a primary data cache, a secondary data cache, a system bus, an Impulse memory controller, and a renovated DRAM backend. Note that this document focuses on the Impulse extensions only. The reader should consult the Paint technical report [2] for an overview of the Paint simulation environment and terminology

Crossref

The University of Utah: J. Willard Marriott Digital Library