
179 research outputs found

RPPM : Rapid Performance Prediction of Multithreaded workloads on multicore processors

Authors: Akram Shoaib, De Pestel Sander, Eeckhout Lieven, Van den Steen Sam
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2019

Analytical performance modeling is a useful complement to detailed cycle-level simulation for quickly exploring the design space at an early design stage. Mechanistic analytical modeling is particularly interesting because it provides deep insight and, unlike empirical modeling, does not require expensive offline profiling. Previous work in mechanistic analytical modeling, unfortunately, is limited to single-threaded applications running on single-core processors. This work proposes RPPM, a mechanistic analytical performance model for multi-threaded applications on multicore hardware. RPPM collects microarchitecture-independent characteristics of a multi-threaded workload to predict performance on a previously unseen multicore architecture. The profile needs to be collected only once to predict performance across a range of processor architectures. We evaluate RPPM's accuracy against simulation and report a performance prediction error of 11.2% on average (23% max). We demonstrate RPPM's usefulness for conducting design space exploration experiments as well as for analyzing parallel application performance.

Crossref

Ghent University Academic Bibliography

Mechanistic analytical modeling of superscalar in-order processor performance

Authors: Breughe Maximilien, Eeckhout Lieven, Eyerman Stijn
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2015

Superscalar in-order processors form an interesting alternative to out-of-order processors because of their energy efficiency and lower design complexity. However, despite the reduced design complexity, it is nontrivial to get performance estimates or insight into the application-microarchitecture interaction without running slow, detailed cycle-level simulations, because performance depends heavily on the order of instructions within the application’s dynamic instruction stream: in-order processors stall on inter-instruction dependences and functional unit contention. To limit the number of detailed cycle-level simulations needed during design space exploration, we propose a mechanistic analytical performance model that is built from an understanding of the internal mechanisms of the processor. The mechanistic performance model for superscalar in-order processors is shown to be accurate, with an average performance prediction error of 3.2% compared to detailed cycle-accurate simulation using gem5. We also validate the model against hardware, using the ARM Cortex-A8 processor, and show that it is accurate within 10% on average. We further demonstrate the usefulness of the model through three case studies: (1) design space exploration, identifying the optimum number of functional units for achieving a given performance target; (2) program-machine interactions, providing insight into microarchitecture bottlenecks; and (3) compiler-architecture interactions, visualizing the impact of compiler optimizations on performance.

CiteSeerX

Ghent University Academic Bibliography

Integrating a floating point unit into the AT&T Hobbit(tm) microprocessor

Author: Holler Paul T.
Publication venue: Lehigh Preserve

Lehigh University: Lehigh Preserve

Investigation of a simultaneous multithreaded architecture

Author: Torrant Marc
Publication venue: RIT Scholar Works
Publication date: 01/08/1999

Many enhancements have been made to traditional general-purpose load-store computer architectures. Among these enhancements are memory hierarchy improvements, branch prediction, and multiple-issue processors. A major problem in current microprocessor design is the disparity between the large increase in CPU speed and the moderate increase in main-memory access speed. The simultaneous multithreaded architecture is an extension of the single-threaded architecture that helps hide the performance penalty created by long-latency instructions, branch mispredictions, and memory accesses. Simultaneous multithreaded architectures use a more flexible form of parallelism, which takes advantage of both instruction-level and thread-level parallelism. The goal of this project was to design, simulate, and analyze a model of a simultaneous multithreaded architecture in order to evaluate design alternatives. The simulator was created by modifying a version of the SimpleScalar toolset, developed at the University of Wisconsin. The simulations document an overall system performance improvement for a simultaneous multithreaded architecture. In early simulation results, performed with the same number of functional units, an improvement in the number of instructions per cycle (IPC) of between 43% and 58% was found using four threads versus a single thread. The horizontal waste rate, which measures the number of unused issue slots, was reduced by between 35% and 46%. The vertical waste rate, which measures the percentage of unused issue cycles (no issue slots used in a cycle), was reduced by between 46% and 61%. These results are derived from a set of four sample programs. It was also found that increasing the number of certain functional units did not improve performance, whereas increasing the number of other types of functional units had a significant positive impact on performance.

RIT Scholar Works

Performance Counter Measurements of Data Structures: Implementations for Multi-Objective Optimisation

Author: Candia S
Publication venue: Division of Chemical Information and Computer Sciences
Publication date: 30/09/2020

Solving multi-objective optimisation problems using evolutionary computation methods involves the implementation of algorithms and data structures for the storage of temporary solutions. The computational efficiency of these systems becomes important as problems increase in complexity and the number of solutions maintained becomes large. Many data structures and algorithms have been proposed to decrease computational times. The effectiveness of a data structure or algorithm can be characterised using wall-clock time. This is a widely used parameter in the literature; however, it is strongly dependent on the underlying computer architecture and hence not a reliable measure of absolute performance. A commonly used approach to avoid architectural dependencies is to compare the performance of the data structure being evaluated to the equivalent implementation using a linked list. Modern processors offer built-in hardware performance counters, giving access to a wide set of parameters that can be used to explore performance. In this dissertation we study the efficiency of a non-dominated quad-tree data structure in combination with different evolutionary algorithms using hardware performance counters. We also compare the results for the quad-tree data structure to a linked list, as is standard practice; however, we find that non-scalable hardware dependencies might appear.

Open Research Exeter

Efficient design space exploration of embedded microprocessors

Author: Breughe Maximilien
Publication venue: Ghent University. Faculty of Engineering and Architecture
Publication date: 01/01/2014

Ghent University Academic Bibliography

Empirical and Statistical Application Modeling Using On-Chip Performance Monitors

Author: Cameron Kirk William
Publication venue: LSU Digital Commons
Publication date: 01/01/2000

To analyze the performance of applications and architectures, both programmers and architects desire formal methods to explain anomalous behavior. To this end, we present various methods that utilize non-intrusive performance-monitoring hardware, only recently available on microprocessors, to provide further explanations of observed behavior. All the methods attempt to characterize and explain the instruction-level parallelism achieved by codes on different architectures. We also present a prototype tool automating the analysis process to exploit the advantages of the empirical and statistical methods proposed. The empirical, statistical and hybrid methods are discussed and explained, with case study results provided. The given methods add to the wealth of tools available to programmers and architects for generally understanding the performance of scientific applications. Specifically, the models and tools presented provide new methods for evaluating and categorizing application performance. The empirical memory model serves to quantify the hierarchical memory performance of applications by inferring the incurred latencies of codes after the effects of latency-hiding techniques are realized. The instruction-level model and its extensions model on-chip performance analytically, giving insight into inherent performance bottlenecks in superscalar architectures. The statistical model and its hybrid extension provide other methods of categorizing codes via their statistical variations. The PTERA performance tool automates the use of performance counters by these methods across platforms, making the modeling process easier still. These unique methods provide previously unavailable alternatives for performance modeling and categorization, in an attempt to utilize the inherent modeling capabilities of performance monitors on commodity processors for scientific applications.

Louisiana State University

A general framework to realize an abstract machine as an ILP processor with application to java

Author: WANG HAICHEN
Publication date: 05/05/2007

Ph.D., Doctor of Philosophy

ScholarBank@NUS

Dynamically tunable memory hierarchy

Authors: Albonesi David H., Balasubramonian Rajeev
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/10/2003

Journal Article

The widespread use of repeaters in long wires creates the possibility of dynamically sizing regular on-chip structures. We present a tunable cache and translation lookaside buffer (TLB) hierarchy that leverages repeater insertion to dynamically trade off size for speed and power consumption on a per-application phase basis using a novel configuration management algorithm. In comparison to a conventional design that is fixed at a single design point targeted to the average application, the dynamically tunable cache and TLB hierarchy can be tailored to the needs of each application phase. The configuration algorithm dynamically detects phase changes and selects a configuration based on the application's ability to tolerate different hit and miss latencies in order to improve the memory energy-delay product. We evaluate the performance and energy consumption of our approach and project the effects of technology scaling trends on our design.

The University of Utah: J. Willard Marriott Digital Library