Interval simulation: raising the level of abstraction in architectural simulation
Detailed architectural simulators suffer from a long development cycle and extremely long evaluation times. This longstanding problem is further exacerbated in the multi-core processor era. Existing solutions address the simulation problem by either sampling the simulated instruction stream or by mapping the simulation models on FPGAs; these approaches achieve substantial simulation speedups while simulating performance in a cycle-accurate manner. This paper proposes interval simulation, which takes a completely different approach: interval simulation raises the level of abstraction and replaces the core-level cycle-accurate simulation model by a mechanistic analytical model. The analytical model estimates core-level performance by analyzing intervals, or the timing between two miss events (branch mispredictions and TLB/cache misses); the miss events are determined through simulation of the memory hierarchy, cache coherence protocol, interconnection network and branch predictor. By raising the level of abstraction, interval simulation reduces both development time and evaluation time. Our experimental results using the SPEC CPU2000 and PARSEC benchmark suites and the M5 multi-core simulator show good accuracy up to eight cores (average error of 4.6% and max error of 11% for the multi-threaded full-system workloads), while achieving a one order of magnitude simulation speedup compared to cycle-accurate simulation. Moreover, interval simulation is easy to implement: our implementation of the mechanistic analytical model requires only one thousand lines of code. Its high accuracy, fast simulation speed and ease-of-use make interval simulation a useful complement to the architect's toolbox for exploring system-level and high-level micro-architecture trade-offs.
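The interval idea described in this abstract can be illustrated with a minimal Python sketch. It assumes a deliberately simplified model: instructions between two miss events stream at a fixed dispatch width, and each miss event adds a fixed penalty. The names `MissEvent`, `estimate_cycles` and `dispatch_width` are hypothetical and not taken from the paper, and effects such as overlapping misses are left out.

```python
from dataclasses import dataclass


@dataclass
class MissEvent:
    instructions_before: int  # instructions dispatched since the previous miss event
    penalty_cycles: int       # cycles lost to this miss event (assumed known)


def estimate_cycles(events, tail_instructions, dispatch_width):
    """Estimate core cycles as smooth dispatch time plus miss penalties."""
    cycles = 0
    for ev in events:
        # Ceiling division: an interval streams at the dispatch width.
        cycles += -(-ev.instructions_before // dispatch_width)
        cycles += ev.penalty_cycles
    # Instructions after the last miss event also stream at the dispatch width.
    cycles += -(-tail_instructions // dispatch_width)
    return cycles


events = [MissEvent(400, 10), MissEvent(100, 200)]
print(estimate_cycles(events, tail_instructions=40, dispatch_width=4))  # 345
```

In this simplified form, the miss events and their penalties would come from the simulated memory hierarchy and branch predictor, while the core itself is never simulated instruction by instruction.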
Accurate and Scalable Many-Node Simulation
Accurate performance estimation of future many-node machines is challenging
because it requires detailed simulation models of both node and network.
However, simulating the full system in detail is unfeasible in terms of compute
and memory resources. State-of-the-art techniques use a two-phase approach that
combines detailed simulation of a single node with network-only simulation of
the full system. We show that these techniques, where the detailed node
simulation is done in isolation, are inaccurate because they ignore two
important node-level effects: compute time variability, and inter-node
communication.
We propose a novel three-stage simulation method to allow scalable and
accurate many-node simulation, combining native profiling, detailed node
simulation and high-level network simulation. By including timing variability
and the impact of external nodes, our method leads to more accurate estimates.
We validate our technique against measurements on a multi-node cluster, and
report an average 6.7% error on 64 nodes (maximum error of 12%), compared to on
average 27% error and up to 54% when timing variability and the scaling
overhead are ignored. At higher node counts, the prediction error of ignoring
variable timings and scaling overhead continues to increase compared to our
technique, and may lead to selecting the wrong optimal cluster configuration.
Using our technique, we are able to accurately project performance to
thousands of nodes within a day of simulation time, using only a single or a
few simulation hosts. Our method can be used to quickly explore large many-node
design spaces, including node micro-architecture, node count and network
configuration.
Accurate modeling of core and memory locality for proxy generation targeting emerging applications and architectures
Designing optimal computer systems for improved performance and energy efficiency requires architects and designers to have a deep understanding of the end-user workloads. However, many end-users (e.g., large corporations, banks, defense organizations, etc.) are apprehensive to share their applications with designers due to the confidential nature of software code and data. In addition, emerging applications pose significant challenges to early design space exploration due to their long-running nature and the highly complex nature of their software stack that cannot be supported on many early performance models.
The above challenges can be overcome by using a proxy benchmark. A miniaturized proxy benchmark can be used as a substitute for the original workload to perform early computer performance evaluation. The process of generating a proxy benchmark consists of extracting a set of key statistics to summarize the behavior of end-user applications through profiling and using the collected statistics to synthesize a representative proxy benchmark. Using such proxy benchmarks can help designers to understand the behavior of end-user workloads in a reasonable time without the users having to disclose sensitive information about their workloads.
Prior proxy benchmarking schemes leverage micro-architecture independent metrics, derived from detailed simulation tools, to generate proxy benchmarks. However, many emerging workloads do not work reliably with many profiling or simulation tools, in which case it becomes impossible to apply prior proxy generation techniques to generate proxy benchmarks for such complex applications. Furthermore, these techniques model instruction pipeline-level locality in great detail, but abstract out memory locality modeling using simple stride-based models. This results in poor cloning accuracy especially for emerging applications, which have larger memory footprints and complex access patterns. A few detailed cache and memory locality modeling techniques have also been proposed in literature. However, these techniques either model limited locality metrics and suffer from poor cloning accuracy or are fairly accurate, but at the expense of significant metadata overhead. Finally, none of the prior proxy benchmarking techniques model both core and memory locality with high accuracy. As a result, they are not useful for studying system-level performance behavior. Keeping the above key limitations and shortcomings of prior work in mind, this dissertation presents several techniques that expand the frontiers of workload proxy benchmarking, thereby enabling computer designers to gain a better and faster understanding of end-user application behavior without compromising the privileged nature of software or data.
This dissertation first presents a core-level proxy benchmark generation methodology that leverages performance metrics derived from hardware performance counter measurements to create miniature proxy benchmarks targeting emerging big-data applications. The presented performance counter based characterization and associated extrapolation into generic parameters for proxy generation enables faster analysis (runs almost at native hardware speeds, unlike prior workload cloning proposals) and proxy generation for emerging applications that do not work with simulators or profiling tools. The generated proxy benchmarks are representative of the performance of the real-world big-data applications, including operating system and run-time effects, and yet converge to results quickly without needing any complex software stack support.
Next, to improve upon the accuracy and efficiency of prior memory proxy benchmarking techniques, this dissertation presents a novel memory locality modeling technique that leverages localized pattern detection to create miniature memory proxy benchmarks. The presented technique models memory reference locality by decomposing an application’s memory accesses into a set of independent streams (localized by using address region based localization property), tracking fine-grained patterns within the localized streams and, finally, chaining or interleaving accesses from different localized memory streams to create an ordered proxy memory access sequence. This dissertation further extends the workload cloning approach to Graphics Processing Units (GPUs) and presents a novel proxy generation methodology to model the inherent memory access locality of GPU applications, while also accounting for the GPU’s parallel execution model. The generated memory proxy benchmarks help to enable fast and efficient design space exploration of futuristic memory hierarchies.
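The stream decomposition described above can be sketched in Python as follows. This is only an illustration under strong assumptions: regions are fixed-size address ranges selected by a hypothetical `region_bits` parameter, each localized stream is summarized by its start address and single most frequent stride, and the full region interleaving order is kept. The dissertation's technique tracks finer-grained patterns; the function names `profile` and `generate_proxy` are invented for this sketch.

```python
from collections import Counter, defaultdict


def profile(trace, region_bits=12):
    """Split a trace into per-region streams; keep start, dominant stride, region order."""
    streams = defaultdict(list)
    order = []
    for addr in trace:
        region = addr >> region_bits  # localize the access by address region
        streams[region].append(addr)
        order.append(region)
    summary = {}
    for region, addrs in streams.items():
        strides = Counter(b - a for a, b in zip(addrs, addrs[1:]))
        dominant = strides.most_common(1)[0][0] if strides else 0
        summary[region] = (addrs[0], dominant)
    return summary, order


def generate_proxy(summary, order):
    """Replay the region interleaving, advancing each stream by its dominant stride."""
    next_addr = {region: start for region, (start, _) in summary.items()}
    proxy = []
    for region in order:
        proxy.append(next_addr[region])
        next_addr[region] += summary[region][1]
    return proxy


trace = [0x1000, 0x8000, 0x1008, 0x8040, 0x1010, 0x8080]
summary, order = profile(trace)
print(generate_proxy(summary, order) == trace)  # True for this regular trace
```

Keeping the complete interleaving order is the simplest choice but costs metadata proportional to the trace length; a practical proxy generator would summarize that order as well.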
Finally, this dissertation presents a novel technique to integrate accurate core and memory locality models to create system-level proxy benchmarks targeting emerging applications. This is a new capability that can facilitate efficient overall system (core, cache and memory subsystem) design-space exploration. This dissertation further presents a novel methodology that exploits the synthetic benchmark generation framework to create hypothetical workloads with performance behavior that does not currently exist. Such proxies can be generated to cover anticipated code trends and can represent futuristic workloads before the workloads even exist.
Fast simulation techniques for microprocessor design space exploration
Designing a microprocessor is extremely time-consuming. Computer architects heavily rely on architectural simulators, e.g., to drive high-level design decisions during early stage design space exploration. The benefit of architectural simulators is that they yield relatively accurate performance results, are highly parameterizable and are very flexible to use. The downside, however, is that they are at least three or four orders of magnitude slower than real hardware execution. The current trend towards multicore processors exacerbates the problem; as the number of cores on a multicore processor increases, simulation speed has become a major concern in computer architecture research and development.
In this dissertation, we propose and evaluate two simulation techniques that reduce the simulation time significantly: statistical simulation and interval simulation. Statistical simulation speeds up the simulation by reducing the number of dynamically executed instructions. First, we collect a number of program execution characteristics into a statistical profile. From this profile we can generate a synthetic trace that exhibits the same execution behavior but which has a much shorter trace length as compared to the original trace. Simulating this synthetic trace then yields a performance estimate. Interval simulation raises the level of abstraction in architectural simulation; it replaces the core-level cycle-accurate simulation model by a mechanistic analytical model. The analytical model builds on insights from interval analysis: miss events divide the smooth streaming of instructions into so-called intervals. The model drives the timing by analyzing the type of the miss events and their latencies, instead of tracking the individual instructions as they propagate through the pipeline stages.
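The profile-then-synthesize flow of statistical simulation can be illustrated with a small Python sketch. It assumes the statistical profile holds nothing but the instruction mix, which is far less than a real profile would capture (dependency distances, branch and cache behavior and so on). The function names and the operation labels are hypothetical.

```python
import random
from collections import Counter


def collect_profile(trace):
    """Summarize a trace as the relative frequency of each instruction type."""
    counts = Counter(trace)
    total = len(trace)
    return {op: n / total for op, n in counts.items()}


def synthesize(profile, length, seed=0):
    """Draw a shorter synthetic trace whose instruction mix follows the profile."""
    rng = random.Random(seed)  # fixed seed keeps the synthetic trace reproducible
    ops = list(profile)
    weights = [profile[op] for op in ops]
    return rng.choices(ops, weights=weights, k=length)


original = ["alu"] * 700 + ["load"] * 200 + ["branch"] * 100
profile = collect_profile(original)
synthetic = synthesize(profile, length=100)
print(len(original), len(synthetic))  # 1000 100
```

The synthetic trace is ten times shorter than the original here; feeding it to a simulator in place of the original trace is what would yield the faster performance estimate.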
Dynamic cache reconfiguration based techniques for improving cache energy efficiency
Modern multicore processors employ large last-level caches; for
example, Intel's E7-8800 processor uses a 24MB L3 cache. Further, with each CMOS
technology generation, leakage energy has been dramatically increasing and
hence, leakage energy is expected to become a major source of energy
dissipation, especially in last-level caches (LLCs). The conventional schemes
of cache energy saving either aim at saving dynamic energy or are based on
properties specific to first-level caches, and thus these schemes have limited
utility for last-level caches. Further, several other techniques require
offline profiling or per-application tuning and hence are not suitable for
product systems. In this research, we propose novel cache leakage energy saving
schemes for single-core and multicore systems; desktop, QoS, real-time and
server systems. We propose software-controlled, hardware-assisted techniques
which use dynamic cache reconfiguration to configure the cache to the most
energy efficient configuration while keeping the performance loss bounded. To
profile and test a large number of potential configurations, we utilize
low-overhead, micro-architecture components, which can be easily integrated
into modern processor chips. We adopt a system-wide approach to save energy to
ensure that cache reconfiguration does not increase energy consumption of other
components of the processor. We have compared our techniques with
state-of-the-art techniques and have found that our techniques outperform them in
energy efficiency. This research has important applications in improving
the energy efficiency of higher-end embedded, desktop and server processors and
multitasking systems. We have also proposed a performance estimation approach for
efficient design space exploration and have implemented a time-sampling based
simulation acceleration approach for full-system architectural simulators.
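The selection step implied by "the most energy efficient configuration while keeping the performance loss bounded" can be sketched in Python. This is an assumption-laden illustration, not the thesis's algorithm: it presumes that per-configuration energy and slowdown estimates are already available (for example from profiling hardware), and the names `Candidate`, `pick_configuration` and `max_slowdown` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    active_ways: int      # cache ways left powered on
    est_energy: float     # estimated system-wide energy for the next interval
    est_slowdown: float   # estimated slowdown vs. the full cache (0.03 means 3%)


def pick_configuration(candidates, max_slowdown):
    """Return the lowest-energy candidate whose slowdown stays within the bound."""
    feasible = [c for c in candidates if c.est_slowdown <= max_slowdown]
    if not feasible:
        # Assumed fallback: keep the largest cache when nothing meets the bound.
        return max(candidates, key=lambda c: c.active_ways)
    return min(feasible, key=lambda c: c.est_energy)


candidates = [
    Candidate(active_ways=16, est_energy=10.0, est_slowdown=0.00),
    Candidate(active_ways=8, est_energy=7.5, est_slowdown=0.02),
    Candidate(active_ways=4, est_energy=6.0, est_slowdown=0.09),
]
print(pick_configuration(candidates, max_slowdown=0.05).active_ways)  # 8
```

Using a system-wide energy estimate rather than cache energy alone reflects the abstract's point that a smaller cache must not raise the energy consumed by other components.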
Software Performance Engineering using Virtual Time Program Execution
In this thesis we introduce a novel approach to software performance engineering that is based
on the execution of code in virtual time. Virtual time execution models the timing-behaviour
of unmodified applications by scaling observed method times or replacing them with results
acquired from performance model simulation. This facilitates the investigation of "what-if" performance predictions of applications comprising an arbitrary combination of real code and
performance models. The ability to analyse code and models in a single framework enables
performance testing throughout the software lifecycle, without the need to extract performance
models from code. This is accomplished by forcing thread scheduling decisions to take
into account the hypothetical time-scaling or model-based performance specifications of each
method. The virtual time execution of I/O operations or multicore targets is also investigated.
We explore these ideas using a Virtual EXecution (VEX) framework, which provides performance
predictions for multi-threaded applications. The language-independent VEX core is
driven by an instrumentation layer that notifies it of thread state changes and method profiling events; it is then up to VEX to control the progress of application threads in virtual time on top of the operating system scheduler. We also describe a Java Instrumentation Environment
(JINE), demonstrating the challenges involved in virtual time execution at the JVM level.
We evaluate the VEX/JINE tools by executing client-side Java benchmarks in virtual time
and identifying the causes of deviations from observed real times. Our results show that VEX
and JINE transparently provide predictions for the response time of unmodified applications
with typically good accuracy (within 5-10%) and low simulation overheads (25-50% additional
time). We conclude this thesis with a case study that shows how models and code can be
integrated, thus illustrating our vision on how virtual time execution can support performance
testing throughout the software lifecycle.
TaskPoint: sampled simulation of task-based programs
Sampled simulation is a mature technique for reducing simulation time of single-threaded programs, but it is not directly applicable to simulation of multi-threaded architectures. Recent multi-threaded sampling techniques assume that the workload assigned to each thread does not change across multiple executions of a program. This assumption does not hold for dynamically scheduled task-based programming models. Task-based programming models allow the programmer to specify program segments as tasks which are instantiated many times and scheduled dynamically to available threads. Due to system noise and variation in scheduling decisions, two consecutive executions on the same machine typically result in different instruction streams processed by each thread. In this paper, we propose TaskPoint, a sampled simulation technique for dynamically scheduled task-based programs. We leverage task instances as sampling units and simulate only a fraction of all task instances in detail. Between detailed simulation intervals we employ a novel fast-forward mechanism for dynamically scheduled programs. We evaluate the proposed technique on a set of 19 task-based parallel benchmarks and two different architectures. Compared to detailed simulation, TaskPoint accelerates architectural simulation with 64 simulated threads by an average factor of 19.1 at an average error of 1.8% and a maximum error of 15.0%. This work has been supported by the Spanish Government (Severo Ochoa grants SEV2015-0493, SEV-2011-00067), the Spanish Ministry of Science and Innovation
(contract TIN2015-65316-P), Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), the RoMoL ERC Advanced Grant (GA 321253), the European HiPEAC Network of Excellence and the Mont-Blanc project (EU-FP7-610402 and EU-H2020-671697). M. Moreto has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship JCI-2012-15047. M. Casas is supported by the Ministry of Economy
and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the EUFP7 (contract 2013BP B 00243). T. Grass has been partially
supported by the AGAUR of the Generalitat de Catalunya (grant 2013FI B 0058).
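Using task instances as sampling units, as the TaskPoint abstract describes, can be illustrated with a rough Python sketch. It makes simplifying assumptions that are not from the paper: the first few instances of each task type are simulated in detail, every later instance is fast-forwarded at the mean of those samples, and cycles are summed as total work rather than scheduled onto threads. The names `sampled_estimate`, `detailed_sim` and `samples_per_type` are hypothetical.

```python
from collections import defaultdict


def sampled_estimate(task_instances, detailed_sim, samples_per_type=2):
    """Estimate total task cycles while simulating only a few instances in detail.

    task_instances: iterable of (task_type, instance) pairs in execution order.
    detailed_sim:   callable returning the detailed cycle count of one instance.
    Returns (estimated total cycles, number of instances simulated in detail).
    """
    sampled = defaultdict(list)
    total = 0.0
    detailed = 0
    for task_type, instance in task_instances:
        history = sampled[task_type]
        if len(history) < samples_per_type:
            cycles = detailed_sim(instance)  # detailed simulation of a sampling unit
            history.append(cycles)
            detailed += 1
        else:
            cycles = sum(history) / len(history)  # fast-forward at the sampled mean
        total += cycles
    return total, detailed


# Toy stand-in: the "instance" is simply its own detailed cycle count.
instances = [("a", 10), ("a", 12), ("a", 99), ("b", 5), ("b", 7), ("b", 1)]
print(sampled_estimate(instances, detailed_sim=lambda cycles: cycles))  # (51.0, 4)
```

In the toy run, the third instance of each type is never simulated in detail, which is where both the speedup and the estimation error come from.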