13 research outputs found

    Interval simulation: raising the level of abstraction in architectural simulation

    Detailed architectural simulators suffer from a long development cycle and extremely long evaluation times. This longstanding problem is further exacerbated in the multi-core processor era. Existing solutions address the simulation problem either by sampling the simulated instruction stream or by mapping the simulation models onto FPGAs; these approaches achieve substantial simulation speedups while simulating performance in a cycle-accurate manner. This paper proposes interval simulation, which takes a completely different approach: interval simulation raises the level of abstraction and replaces the core-level cycle-accurate simulation model with a mechanistic analytical model. The analytical model estimates core-level performance by analyzing intervals, or the timing between two miss events (branch mispredictions and TLB/cache misses); the miss events are determined through simulation of the memory hierarchy, cache coherence protocol, interconnection network and branch predictor. By raising the level of abstraction, interval simulation reduces both development time and evaluation time. Our experimental results using the SPEC CPU2000 and PARSEC benchmark suites and the M5 multi-core simulator show good accuracy up to eight cores (average error of 4.6% and maximum error of 11% for the multi-threaded full-system workloads), while achieving a one-order-of-magnitude simulation speedup compared to cycle-accurate simulation. Moreover, interval simulation is easy to implement: our implementation of the mechanistic analytical model is only about one thousand lines of code. Its high accuracy, fast simulation speed and ease of use make interval simulation a useful complement to the architect's toolbox for exploring system-level and high-level micro-architecture trade-offs.
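
    The core idea lends itself to a compact illustration. The Python sketch below is a minimal, hypothetical version of an interval-based timing model (the class names, dispatch width and penalty values are invented for the example and are not the paper's M5-based implementation): each interval is a stretch of instructions dispatched at the core's effective width, terminated by a miss event that adds its penalty.

        # Minimal sketch of interval-based core timing, assuming a simple
        # mechanistic model: cycles = instructions / dispatch_width plus the
        # penalties of the miss events that end each interval.
        from dataclasses import dataclass
        from typing import Optional

        @dataclass
        class MissEvent:
            kind: str             # e.g. "branch_misprediction", "L2_miss", "TLB_miss"
            penalty_cycles: int   # penalty charged when this event ends an interval

        @dataclass
        class Interval:
            instructions: int              # instructions executed before the miss event
            miss: Optional[MissEvent]      # None for the final interval of the trace

        def estimate_core_cycles(intervals, dispatch_width=4):
            """Estimate execution time by walking intervals, not individual cycles."""
            cycles = 0.0
            for iv in intervals:
                # Steady-state dispatch between miss events.
                cycles += iv.instructions / dispatch_width
                # The miss event that terminates the interval stalls the core.
                if iv.miss is not None:
                    cycles += iv.miss.penalty_cycles
            return cycles

        # Example: two intervals ended by a branch misprediction and an L2 miss,
        # plus a final interval with no terminating miss.
        trace = [
            Interval(400, MissEvent("branch_misprediction", 15)),
            Interval(1200, MissEvent("L2_miss", 200)),
            Interval(300, None),
        ]
        print(estimate_core_cycles(trace))  # 690.0 with dispatch width 4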

    BADCO: Behavioral Application-Dependent Superscalar Core Models

    Microarchitecture research and development rely heavily on simulators. The ideal simulator should be simple and easy to develop; it should be precise, accurate and very fast. But the ideal simulator does not exist, and microarchitects use different sorts of simulators at different stages of the development of a processor, depending on which matters more, accuracy or simulation speed. Approximate microarchitecture models, which trade accuracy for simulation speed, are very useful for research and design-space exploration, provided the loss of accuracy remains acceptable. Behavioral superscalar core modeling is a possible way to trade accuracy for simulation speed in situations where the focus of the study is not the core itself. In this approach, a superscalar core is viewed as a black box emitting requests to the uncore at certain times. A behavioral core model can be connected to a detailed uncore model. Behavioral core models are built from detailed simulations; once the time to build the model is amortized, important simulation speedups can be obtained. We describe and study a new method for defining behavioral models for modern superscalar cores. The proposed Behavioral Application-Dependent Superscalar Core model, BADCO, predicts the execution time of a thread running on a superscalar core with an error of less than 10% in most cases. We show that BADCO is qualitatively accurate, being able to predict how performance changes when we change the uncore. The simulation speedups we obtained are typically between one and two orders of magnitude.
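
    As a rough illustration of the black-box idea only (not the actual BADCO model-building machinery), the sketch below replays a trace of uncore requests recorded during one detailed simulation; each trace entry carries the core-local delay to the next request, and the uncore reply latency, which may differ from the one seen during trace collection, shifts subsequent requests in time. For simplicity every request is assumed to block the core until the reply arrives, a simplification BADCO itself does not make.

        # Hypothetical behavioral core model: the core is a black box that issues
        # memory requests to the uncore at recorded points in (core) time.
        from dataclasses import dataclass

        @dataclass
        class TraceEntry:
            delay_after_prev: int  # core cycles of work since the previous request
            address: int           # request address sent to the uncore

        class BehavioralCore:
            def __init__(self, trace, uncore_latency_fn):
                self.trace = trace
                # uncore_latency_fn(address, issue_cycle) -> reply latency in cycles;
                # swapping this function stands in for changing the uncore model.
                self.uncore_latency_fn = uncore_latency_fn

            def run(self):
                now = 0
                for entry in self.trace:
                    now += entry.delay_after_prev                       # core-local work
                    now += self.uncore_latency_fn(entry.address, now)   # uncore reply
                return now  # predicted execution time in cycles

        fast_uncore = lambda addr, cycle: 40
        slow_uncore = lambda addr, cycle: 120
        trace = [TraceEntry(100, 0x1000), TraceEntry(250, 0x2040), TraceEntry(80, 0x1000)]
        print(BehavioralCore(trace, fast_uncore).run())  # 550
        print(BehavioralCore(trace, slow_uncore).run())  # 790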

    The home-forwarding mechanism to reduce the cache coherence overhead in next-generation CMPs

    On the road to computer systems able to support the requirements of exascale applications, Chip Multi-Processors (CMPs) are equipped with an ever-increasing number of cores interconnected through fast on-chip networks. To exploit such new architectures, the parallel software must be able to scale almost linearly with the number of cores available. To this end, the overhead introduced by the run-time system of parallel programming frameworks and by the architecture itself must be small enough to enable high scalability even for very fine-grained parallel programs. One approach to reducing this overhead is to use non-conventional architectural mechanisms that prove useful when certain concurrency patterns in the running application are statically or dynamically recognized. Following this idea, this paper proposes a run-time support that reduces the effective latency of inter-thread cooperation primitives by lowering the contention on individual caches. To achieve this goal, a new home-forwarding hardware mechanism is proposed and used by our run-time support to reduce the amount of cache-to-cache interactions generated by the cache coherence protocol. Our ideas have been emulated on the Tilera TILEPro64 CMP, showing significant speedup improvements in a first set of benchmarks.
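
    The intended effect can be sketched conceptually (the hop counts and message flows below are an illustrative toy model, not the TILEPro64 mechanism or the paper's run-time support): when the producer's stores are forwarded to the line's home cache, a later load by the consumer is served by the home cache directly instead of triggering a cache-to-cache transfer from the producer's private cache.

        # Toy model: count coherence hops for one producer-consumer handshake on a
        # single cache line, with and without home-forwarding of the producer's
        # writes. Purely illustrative of the claimed reduction in
        # cache-to-cache interactions.

        def handshake_hops(home_forwarding: bool):
            hops = 0
            cache_to_cache = 0
            if home_forwarding:
                hops += 1            # producer -> home cache (forwarded write)
                hops += 2            # consumer -> home -> consumer (read served at home)
            else:
                hops += 2            # producer -> directory/home -> producer (exclusive copy)
                hops += 3            # consumer -> home -> producer -> consumer (3-hop read)
                cache_to_cache += 1  # producer's cache supplies the data
            return hops, cache_to_cache

        print(handshake_hops(False))  # (5, 1) baseline: one cache-to-cache transfer
        print(handshake_hops(True))   # (3, 0) home-forwarding: none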

    Embedded Processor Selection/Performance Estimation using FPGA-based Profiling

    In embedded systems, modeling the performance of candidate processor architectures is very important to enable the designer to estimate the capability of each architecture against the target application. Considering the large number of available embedded processors, the need has increased for an infrastructure by which it is possible to estimate the performance of a given application on a given processor with a minimum of time and resources. This dissertation presents a framework that employs the soft-core MicroBlaze processor as a reference architecture, where FPGA-based profiling is implemented to extract the functional statistics that characterize the target application. Linear regression analysis is used to map the functional statistics of the target application to the performance of each candidate processor architecture. Hence, this approach does not require running the target application on each candidate processor; instead, it is run only on the reference processor, which allows testing many processor architectures in a very short time.
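
    A hedged sketch of the regression step follows; the feature names, training numbers and coefficients are made up for the example and are not the dissertation's actual model. Profile counts collected on the reference core are mapped to a cycle estimate for a candidate processor with an ordinary least-squares fit.

        # Illustrative least-squares mapping from reference-core profile statistics
        # to a candidate core's execution time. All numbers are invented.
        import numpy as np

        # Functional statistics profiled on the reference core for training apps:
        # [ALU ops, loads/stores, branches, multiplies]
        X_train = np.array([
            [1.2e6, 4.0e5, 1.5e5, 2.0e4],
            [8.0e5, 6.5e5, 9.0e4, 1.0e4],
            [2.5e6, 3.0e5, 4.0e5, 8.0e4],
            [1.0e6, 1.0e6, 2.0e5, 5.0e3],
            [1.5e6, 2.0e5, 3.0e5, 4.0e4],
            [6.0e5, 8.0e5, 1.2e5, 2.0e4],
        ])
        # Measured cycle counts of the same apps on one candidate processor.
        y_train = np.array([3.1e6, 2.6e6, 5.9e6, 4.2e6, 3.8e6, 2.4e6])

        # Fit per-feature cycle costs (plus an intercept) by least squares.
        A = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
        coeffs, *_ = np.linalg.lstsq(A, y_train, rcond=None)

        # Predict the candidate's cycles for a new application profiled only on
        # the reference core -- no need to run it on the candidate itself.
        new_app = np.array([1.8e6, 5.0e5, 2.5e5, 3.0e4, 1.0])
        print(float(new_app @ coeffs))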

    Empirical and Statistical Application Modeling Using On-Chip Performance Monitors

    To analyze the performance of applications and architectures, both programmers and architects desire formal methods to explain anomalous behavior. To this end, we present various methods that utilize non-intrusive performance-monitoring hardware, only recently available on microprocessors, to provide further explanations of observed behavior. All the methods attempt to characterize and explain the instruction-level parallelism achieved by codes on different architectures. We also present a prototype tool automating the analysis process to exploit the advantages of the empirical and statistical methods proposed. The empirical, statistical and hybrid methods are discussed and explained, with case-study results provided. The given methods add to the wealth of tools available to programmers and architects for understanding the performance of scientific applications. Specifically, the models and tools presented provide new methods for evaluating and categorizing application performance. The empirical memory model serves to quantify the hierarchical memory performance of applications by inferring the latencies incurred by codes after the effects of latency-hiding techniques are realized. The instruction-level model and its extensions model on-chip performance analytically, giving insight into inherent performance bottlenecks in superscalar architectures. The statistical model and its hybrid extension provide other methods of categorizing codes via their statistical variations. The PTERA performance tool automates the use of performance counters across platforms for use by these methods, making the modeling process easier still. These unique methods provide alternatives to performance modeling and categorization not available previously, exploiting the inherent modeling capabilities of performance monitors on commodity processors for scientific applications.
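
    As a small illustration of counter-based modeling (generic counter names and a crude attribution formula, not the dissertation's empirical memory model or PTERA itself), the sketch below turns raw hardware-counter readings into IPC and an approximate average memory latency per load.

        # Illustrative derivation of simple performance metrics from raw on-chip
        # counter values. Counter names and the latency formula are generic
        # examples, not the models proposed in the dissertation.

        def derive_metrics(counters, stall_fraction_from_memory=0.7):
            """counters: dict of raw event counts read from performance monitors."""
            cycles = counters["cycles"]
            instrs = counters["instructions"]
            loads = counters["loads"]
            stalls = counters["stall_cycles"]

            ipc = instrs / cycles
            # Crude empirical estimate: attribute a fraction of stall cycles to the
            # memory hierarchy and spread them over the executed loads, standing in
            # for the latency actually incurred after latency hiding.
            mem_stalls = stalls * stall_fraction_from_memory
            avg_load_latency = mem_stalls / loads if loads else 0.0
            return {"IPC": ipc, "avg_load_latency_cycles": avg_load_latency}

        sample = {"cycles": 2_000_000, "instructions": 1_400_000,
                  "loads": 300_000, "stall_cycles": 600_000}
        print(derive_metrics(sample))
        # {'IPC': 0.7, 'avg_load_latency_cycles': 1.4}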

    Simulation of Manycore Architectures on Multicore Hosts

    Computer architects heavily rely on software simulation to evaluate new and existing processor designs. As target designs become more complex, a growing gap has emerged between single-threaded simulator performance and simulation requirements. Even though modern machines feature multiple cores, most host cores are typically unused or underutilized by state-of-the-art simulators. Parallel simulators are inherently limited by their need to synchronize threads for correctness. In my thesis, I study accurate and efficient parallelization techniques for architecture simulation. This thesis contains several contributions. First, I study synchronization between simulator threads simulating homogeneous hardware structures such as cores or network tiles. Based on this study, I introduce a new synchronization policy, weighted-tuple synchronization, and show that it provides a better performance-accuracy trade-off than the synchronization currently used by state-of-the-art parallel simulators. Next, I study synchronization between separate simulators responsible for modeling heterogeneous components and introduce reciprocal abstraction. Reciprocal abstraction allows asynchronous simulators to exchange information at runtime for more accurate event timing. Lastly, the reciprocal abstraction model relaxes communication latency restrictions and synchronization requirements; I show how these relaxed synchronization requirements allow for coprocessor acceleration.
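
    The synchronization trade-off at the center of this work can be sketched abstractly. The code below shows a plain fixed-quantum barrier scheme, the kind of baseline the thesis improves on, not the proposed weighted-tuple policy: simulator threads run ahead independently for a quantum of simulated cycles and then meet at a barrier, so a larger quantum buys speed at the cost of cross-thread timing accuracy. All names and constants are illustrative.

        # Minimal quantum-based synchronization between simulator threads. Each
        # worker advances its own core model by `QUANTUM` simulated cycles and
        # then waits at a barrier, bounding how far threads can drift apart.
        import threading

        NUM_THREADS = 4
        QUANTUM = 1000          # simulated cycles per synchronization interval
        TOTAL_CYCLES = 10_000

        barrier = threading.Barrier(NUM_THREADS)

        def simulate_core(core_id):
            local_cycle = 0
            while local_cycle < TOTAL_CYCLES:
                # Advance this core's model by one quantum (placeholder work).
                local_cycle += QUANTUM
                # Synchronize: no thread starts the next quantum until all
                # threads have finished the current one.
                barrier.wait()
            print(f"core {core_id} reached cycle {local_cycle}")

        threads = [threading.Thread(target=simulate_core, args=(i,))
                   for i in range(NUM_THREADS)]
        for t in threads: t.start()
        for t in threads: t.join()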

    Analytic evaluation of shared-memory systems with ILP processors
