Using Pin as a Memory Reference Generator for Multiprocessor Simulation

Author: McCurdy C
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date: 22/10/2005

In this paper we describe how we have used Pin to generate a multithreaded reference stream for simulation of a multiprocessor on a uniprocessor. We have taken special care to model as accurately as possible the effects of cache coherence protocol state, and of lock and barrier synchronization, on the performance of multithreaded applications running on multiprocessor hardware. We first describe a simplified version of the algorithm, which uses semaphores to synchronize instrumented application threads and the simulator on every memory reference. We then describe modifications to that algorithm to model the microarchitectural features of the Itanium2 that affect the timing of memory reference issue. An experimental evaluation determines that while cycle-accurate multithreaded simulation is possible using our approach, the use of semaphores has a negative impact on the performance of the simulator.
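The simplified algorithm in the abstract above (application threads and a simulator handing off on every memory reference through semaphores) might look roughly like the sketch below. It is an illustration only and does not use Pin: the function name `run_lockstep`, the list-of-addresses input format, and the round-robin issue order are all assumptions made for this example, whereas the paper's simulator decides the order from its timing model.

```python
import threading

def run_lockstep(per_thread_refs):
    """Interleave per-thread address lists one reference at a time.

    per_thread_refs is a hypothetical input: one list of addresses per thread.
    Returns the global order seen by the simulator as (thread_id, address).
    """
    n = len(per_thread_refs)
    go = [threading.Semaphore(0) for _ in range(n)]      # simulator -> thread
    posted = [threading.Semaphore(0) for _ in range(n)]  # thread -> simulator
    slot = [None] * n   # one-entry mailbox per thread
    observed = []

    def app_thread(tid):
        for addr in per_thread_refs[tid]:
            go[tid].acquire()      # block until the simulator allows an issue
            slot[tid] = addr       # publish the memory reference
            posted[tid].release()  # wake the simulator

    threads = [threading.Thread(target=app_thread, args=(t,)) for t in range(n)]
    for t in threads:
        t.start()

    remaining = [len(refs) for refs in per_thread_refs]
    while any(remaining):
        # Assumed policy: plain round-robin over threads that still have work.
        for tid in range(n):
            if remaining[tid]:
                go[tid].release()      # let exactly one reference through
                posted[tid].acquire()  # wait for the thread to publish it
                observed.append((tid, slot[tid]))
                remaining[tid] -= 1

    for t in threads:
        t.join()
    return observed
```

Each reference costs two semaphore operations and two context switches, which gives some intuition for the slowdown the abstract attributes to semaphores.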

UNT Digital Library

Using Pin as a memory reference generator for multiprocessor simulation

Author: Charles Fischer
Collin McCurdy
Dijkstra E.
Franke H.
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref

Using Pin as a Memory Reference Generator for Multiprocessor Simulation

Publication venue: 'Office of Scientific and Technical Information (OSTI)'

Crossref

Memory hierarchy characterization of SPEC CPU2006 and SPEC CPU2017 on the Intel Xeon Skylake-SP

Author: Alastruey-Benedé Jesús
Ibáñez-Marín Pablo
Navarro-Torres Agustín
Viñals-Yúfera Víctor
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2019

SPEC CPU is one of the most common benchmark suites used in computer architecture research. CPU2017 has recently been released to replace CPU2006. In this paper we present a detailed evaluation of the memory hierarchy performance for both the CPU2006 and single-threaded CPU2017 benchmarks. The experiments were executed on an Intel Xeon Skylake-SP, which is the first Intel processor to implement a mostly non-inclusive last-level cache (LLC). We present a classification of the benchmarks according to their memory pressure and analyze the performance impact of different LLC sizes. We also test all the hardware prefetchers, showing that they improve performance in most of the benchmarks. After comprehensive experimentation, we can highlight the following conclusions: i) almost half of the SPEC CPU benchmarks have very low miss ratios in the second and third level caches, even with small LLC sizes and without hardware prefetching, ii) overall, the SPEC CPU2017 benchmarks demand even fewer memory hierarchy resources than the SPEC CPU2006 ones, iii) hardware prefetching is very effective in reducing LLC misses for most benchmarks, even with the smallest LLC size, and iv) from the memory hierarchy standpoint, the methodologies commonly used to select benchmarks or simulation points do not guarantee representative workloads.

Repositorio Universidad de Zaragoza

Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware

Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'

Crossref

Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis

Author: Donald Yeung
Meng-ju Wu
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2012

Understanding multicore memory behavior is crucial, but can be challenging due to the complex cache hierarchies employed in modern CPUs. In today’s hierarchies, performance is determined by complicated thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform extensive simulations to study these interactions, but this can be costly and not very insightful. An alternative is multicore reuse distance (RD) analysis, which can provide extremely rich information about multicore memory behavior. In this paper, we apply multicore RD analysis to better understand cache system design. We focus on loop-based parallel programs, an important class of programs for which RD analysis provides high accuracy. We propose a novel framework to identify optimal multicore cache hierarchies, and extract several new insights. We also characterize how the optimal cache hierarchies vary with core count and problem size.
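For readers unfamiliar with reuse distance, the sketch below shows the basic single-stream form: the number of distinct addresses touched between two consecutive accesses to the same address, computed with an LRU stack. It is a minimal illustration, not the multicore framework of the paper (which also accounts for thread interleaving in shared and private caches); the names `reuse_distances` and `lru_misses` are hypothetical.

```python
def reuse_distances(trace):
    """Return the reuse distance of every access in `trace`.

    A first-time (cold) access has no reuse distance and is reported as None.
    The list-based LRU stack is O(n) per access, which is fine for a sketch.
    """
    stack = []       # least recently used first, most recently used last
    distances = []
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            # Distinct addresses touched since the previous access to addr.
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(None)
        stack.append(addr)   # addr is now the most recently used
    return distances


def lru_misses(distances, capacity):
    """Misses of a fully associative LRU cache holding `capacity` lines.

    An access hits exactly when its reuse distance is below the capacity,
    so one profile answers the question for every cache size.
    """
    return sum(1 for d in distances if d is None or d >= capacity)
```

The second function hints at why RD analysis can replace many simulations: a single pass over the trace yields miss counts for all capacities at once.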

CiteSeerX

Crossref

A Simple Multi-Core Functional Cache Design Simulator

Author: Mal Rano
Publication venue: ScholarWorks @ UTRGV
Publication date: 01/07/2017

This paper presents a flexible multi-core cache memory simulator for designing and evaluating memory hierarchies for general-purpose or embedded processors. The proposed simulator works with Pin, an open-source dynamic instrumentation tool provided by Intel. Pin intercepts the execution of instructions and generates a sequence of code (traces) that is fed into the simulator for any selected benchmark programs, such as SPEC2006, SPLASH2, or PARSEC. We plan to release this simulator as open source (like Pin) to support the simulation work of the research and academic community. In addition, we expect that the research community can add and share further functions on top of this simulator.
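A trace-driven functional cache model of the kind described above can be sketched in a few lines. The example below is an assumption-laden illustration, not the simulator from the paper: it models a single set-associative cache with LRU replacement, takes a plain list of byte addresses in place of a Pin-generated trace, and the class name `SetAssociativeCache` and its parameters are hypothetical.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Functional (hit/miss only) set-associative cache with LRU replacement."""

    def __init__(self, num_sets, ways, line_bytes):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # One OrderedDict per set: keys are tags, ordered from LRU to MRU.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Simulate one access to byte address `addr`; return True on a hit."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)      # refresh LRU position
            self.hits += 1
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)   # evict the least recently used tag
        lines[tag] = True
        self.misses += 1
        return False


def run_trace(cache, trace):
    """Feed every address in `trace` to `cache`; return (hits, misses)."""
    for addr in trace:
        cache.access(addr)
    return cache.hits, cache.misses
```

A multi-core variant would, under the same assumptions, keep one such object per core for private levels and route each trace record to the cache of the core that issued it.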

Scholarworks@UTRGV Univ. of Texas RioGrande Valley

Studying the impact of multicore processor scaling on directory techniques via reuse distance analysis

Author: Donald Yeung
Minshu Zhao
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 04/12/2015

Researchers have proposed numerous directory techniques to address multicore scalability whose behavior depends on the CPU’s particular configuration, e.g. core count and cache size. As CPUs continue to scale, it is essential to explore the directory’s architecture dependences. However, this is challenging using detailed simulation given the large number of CPU configurations that are possible. This paper proposes to use multicore reuse distance analysis to study coherence directories. We develop a framework to extract the directory access stream from parallel LRU stacks, enabling rapid analysis of the directory’s accesses and contents across both core count and cache size scaling. We also implement our framework in a profiler, and apply it to gain insights into multicore scaling’s impact on the directory. Our profiling results show that directory accesses reduce by 3.5x across data cache size scaling, suggesting techniques that trade off access latency for reduced capacity or conflicts become increasingly effective as cache size scales. We also show the portion of on-chip memory devoted to the directory cache can be reduced by 53.3% across data cache size scaling, thus lowering the over-provisioning needed at large cache sizes. Finally, we validate our RD-based directory analyses, and find they are within 13% of cache simulations in terms of access count, on average.

CiteSeerX

Crossref

Using Pin as a memory reference generator for multiprocessor simulation

Author: Charles Fischer
Collin McCurdy
Publication date: 24/04/2020

CiteSeerX

Software-Oriented Distributed Shared Cache Management for Chip Multiprocessors

Author: Jin Lei
Publication date: 30/09/2010

This thesis proposes a software-oriented distributed shared cache management approach for chip multiprocessors (CMPs). Unlike hardware-based schemes, our approach offloads the cache management task to a trace analysis phase, allowing flexible management strategies. For single-threaded programs, a static 2D page coloring scheme is proposed that uses oracle trace information to derive an optimal data placement schema for a program. In addition, a dynamic 2D page coloring scheme is proposed as a practical solution, which tries to approach the performance of the static scheme. The evaluation results show that the static scheme achieves a 44.7% performance improvement over the conventional shared cache scheme on average, while the dynamic scheme performs 32.3% better than the shared cache scheme. For latency-oriented multithreaded programs, a pattern recognition algorithm based on the K-means clustering method is introduced. The algorithm tries to identify data access patterns that can be used to guide the placement of private data and the replication of shared data. The experimental results show that data placement and replication based on these access patterns lead to a 19% performance improvement over the shared cache scheme. The reduced remote cache accesses and aggregated cache miss rate result in much lower bandwidth requirements for the on-chip network and the off-chip main memory bus. Lastly, for throughput-oriented multithreaded programs, we propose a hint-guided data replication scheme to identify memory instructions of a target program that access data with a high reuse property. The derived hints are then used to guide data replication at run time. By balancing the amount of data replication and local cache pressure, the proposed scheme has the potential to help achieve performance comparable to the best existing hardware-based schemes. Our proposed software-oriented shared cache management approach is an effective way to manage program performance on CMPs. This approach provides an alternative direction for research on the distributed cache management problem. Given the known difficulties (e.g., scalability and design complexity) we face with hardware-based schemes, this software-oriented approach may receive serious consideration from researchers in the future. From this perspective, the thesis provides valuable contributions to the computer architecture research community.

D-Scholarship@Pitt