1,812 research outputs found
The multi-program performance model: debunking current practice in multi-core simulation
Composing a representative multi-program multi-core workload is non-trivial. A multi-core processor can execute multiple independent programs concurrently, and hence, any program mix can form a potential multi-program workload. Given the very large number of possible multi-program workloads and the limited speed of current simulation methods, it is impossible to evaluate all possible multi-program workloads. This paper presents the Multi-Program Performance Model (MPPM), a method for quickly estimating multi-program multi-core performance based on single-core simulation runs. MPPM employs an iterative method to model the tight performance entanglement between co-executing programs on a multi-core processor with shared caches. Because MPPM involves analytical modeling, it is very fast, and it estimates multi-core performance for a very large number of multi-program workloads in a reasonable amount of time. In addition, it provides confidence bounds on its performance estimates. Using SPEC CPU2006 and up to 16 cores, we report an average performance prediction error of 2.3% and 2.9% for system throughput (STP) and average normalized turnaround time (ANTT), respectively, while being up to five orders of magnitude faster than detailed simulation. Subsequently, we demonstrate that randomly picking a limited number of multi-program workloads, as done in current practice, can lead to incorrect design decisions in practical design and research studies, a problem that MPPM alleviates. In addition, MPPM can be used to quickly identify multi-program workloads that stress multi-core performance through excessive conflict behavior in shared caches; these stress workloads can then be used for driving the design process further.
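The abstract reports errors on STP and ANTT without defining them. The sketch below assumes the commonly used definitions: STP sums each program's single-program CPI divided by its multi-program CPI, and ANTT averages the inverse ratio (the per-program slowdown). The function names and the CPI values are hypothetical and are not taken from the paper.

```python
def stp(cpi_single, cpi_multi):
    """System throughput: sum of per-program normalized progress (higher is better)."""
    return sum(s / m for s, m in zip(cpi_single, cpi_multi))


def antt(cpi_single, cpi_multi):
    """Average normalized turnaround time: mean per-program slowdown (lower is better)."""
    return sum(m / s for s, m in zip(cpi_single, cpi_multi)) / len(cpi_single)


# Hypothetical two-program mix: CPI when run alone vs. when co-scheduled.
cpi_single = [1.0, 2.0]
cpi_multi = [2.0, 8.0]

print(stp(cpi_single, cpi_multi))   # 0.5 + 0.25 = 0.75
print(antt(cpi_single, cpi_multi))  # (2.0 + 4.0) / 2 = 3.0
```

Under these definitions, a mix with no interference on n cores would give STP = n and ANTT = 1.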
Mechanistic modeling of architectural vulnerability factor
Reliability to soft errors is a significant design challenge in modern microprocessors owing to an exponential increase in the number of transistors on chip and the reduction in operating voltages with each process generation. Architectural Vulnerability Factor (AVF) modeling using microarchitectural simulators enables architects to make informed performance, power, and reliability tradeoffs. However, such simulators are time-consuming and do not reveal the microarchitectural mechanisms that influence AVF. In this article, we present an accurate first-order mechanistic analytical model to compute AVF, developed using the first principles of an out-of-order superscalar execution. This model provides insight into the fundamental interactions between the workload and microarchitecture that together influence AVF. We use the model to perform design space exploration, parametric sweeps, and workload characterization for AVF
RPPM : Rapid Performance Prediction of Multithreaded workloads on multicore processors
Analytical performance modeling is a useful complement to detailed cycle-level simulation for quickly exploring the design space at an early design stage. Mechanistic analytical modeling is particularly interesting because it provides deep insight and, unlike empirical modeling, does not require expensive offline profiling. Previous work in mechanistic analytical modeling, unfortunately, is limited to single-threaded applications running on single-core processors.
This work proposes RPPM, a mechanistic analytical performance model for multi-threaded applications on multicore hardware. RPPM collects microarchitecture-independent characteristics of a multi-threaded workload to predict performance on a previously unseen multicore architecture. The profile needs to be collected only once to predict a range of processor architectures. We evaluate RPPM's accuracy against simulation and report a performance prediction error of 11.2% on average (23% max). We demonstrate RPPM's usefulness for conducting design space exploration experiments as well as for analyzing parallel application performance
A Cache Model for Modern Processors
Modern processors use high-performance cache replacement policies that outperform traditional alternatives like least-recently used (LRU). Unfortunately, current cache models use stack distances to predict LRU or its variants, and cannot capture these high-performance policies. Accurate predictions of cache performance enable many optimizations in multicore systems. For example, cache partitioning uses these predictions to divide capacity among applications in order to maximize performance, guarantee quality of service, or achieve other system objectives. Without an accurate model for high-performance replacement policies, these optimizations are unavailable to modern processors. We present a new probabilistic cache model designed for high-performance replacement policies. This model uses absolute reuse distances instead of stack distances, which makes it applicable to arbitrary age-based replacement policies. We thoroughly validate our model on several high-performance policies on synthetic and real benchmarks, where its median error is less than 1%. Finally, we present two case studies showing how to use the model to improve shared and single-stream cache performance
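To illustrate the distinction the abstract draws, the sketch below computes both quantities for a short trace. It assumes one common convention: the absolute reuse distance is the number of accesses since the previous reference to the same address, and the stack distance is the number of distinct other addresses touched in between. The function name and the trace are hypothetical, and this is not the paper's model.

```python
def reuse_and_stack_distances(trace):
    """Return (address, absolute_reuse_distance, stack_distance) per access.

    Both distances are None for the first reference to an address.
    """
    last_seen = {}
    result = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            j = last_seen[addr]
            absolute = i - j                    # accesses since the last reference
            stack = len(set(trace[j + 1:i]))    # distinct addresses in between
            result.append((addr, absolute, stack))
        else:
            result.append((addr, None, None))
        last_seen[addr] = i
    return result


trace = ["A", "B", "B", "C", "A"]
for entry in reuse_and_stack_distances(trace):
    print(entry)
# The final access to "A" has absolute reuse distance 4 but stack distance 2,
# because the repeated "B" counts once for the stack distance.
```

The absolute distance needs only the time since the last reference, whereas the stack distance depends on which addresses were touched in between.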
Accelerating Non-volatile/Hybrid Processor Cache Design Space Exploration for Application Specific Embedded Systems
In this article, we propose a technique to accelerate design space exploration of nonvolatile or hybrid (volatile and nonvolatile) processor caches for application-specific embedded systems. Using a novel cache behavior modeling equation and a new accurate cache miss prediction mechanism, the proposed technique accelerates design space exploration of NVM or hybrid FIFO processor caches for SPEC CPU 2000 applications by up to 249 times compared to the conventional approach.
Analytical Modeling is Enough for High Performance BLIS
We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine tuning parameters for high-end instantiations of the matrix-matrix multiplication. This is of both practical and scientific importance, as it greatly reduces the development effort required for the implementation of the level-3 BLAS while also advancing our understanding of how hierarchically layered memories interact with high-performance software. This allows the community to move on from valuable engineering solutions (empirical autotuning) to scientific understanding (analytical insight). This research was sponsored in part by NSF grants ACI-1148125/1340293 and CCF-0917167.
Enrique S. Quintana-Ortí was supported by project TIN2011-23283 of the Ministerio de Ciencia e Innovación and FEDER. Francisco D. Igual was supported by project TIN2012-32180 of the Ministerio de Ciencia e Innovación.
Warping Cache Simulation of Polyhedral Programs
Techniques to evaluate a program's cache performance fall into two camps: (1) Traditional trace-based cache simulators precisely account for sophisticated real-world cache models and support arbitrary workloads, but their runtime is proportional to the number of memory accesses performed by the program under analysis. (2) Relying on implicit workload characterizations such as the polyhedral model, analytical approaches often achieve problem-size-independent runtimes, but so far have been limited to idealized cache models.
We introduce a hybrid approach, warping cache simulation, that aims to achieve applicability to real-world cache models and problem-size-independent runtimes. Like prior analytical approaches, we focus on programs in the polyhedral model, which allows us to reason about the sequence of memory accesses analytically. Combining this analytical reasoning with information about the cache behavior obtained from explicit cache simulation allows us to soundly fast-forward the simulation. By this process of warping, we accelerate the simulation so that its cost is often independent of the number of memory accesses.
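As a point of reference for the first camp, the sketch below is a minimal trace-based simulator for a fully associative LRU cache, driven by a simple rectangular loop nest. It does not implement warping; it only shows the baseline whose cost grows with the number of accesses. All names, the 64-byte line size, and the loop nest are assumptions chosen for illustration.

```python
from collections import OrderedDict


def simulate_lru(addresses, num_lines, line_size=64):
    """Simulate a fully associative LRU cache; return (accesses, misses)."""
    cache = OrderedDict()  # keys are cache-line numbers, ordered from LRU to MRU
    accesses = 0
    misses = 0
    for addr in addresses:  # one iteration per memory access
        accesses += 1
        line = addr // line_size
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return accesses, misses


def loop_nest(n, elem_size=8):
    """Byte addresses read by: for i in range(n): for j in range(n): read A[j]."""
    for _ in range(n):
        for j in range(n):
            yield j * elem_size


print(simulate_lru(loop_nest(16), num_lines=4))  # (256, 2)
print(simulate_lru(loop_nest(16), num_lines=1))  # (256, 32)
```

The simulator visits all n² accesses even though the access pattern repeats identically in every outer iteration, which is the kind of regularity that reasoning in the polyhedral model can exploit.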