Search CORE

1,108 research outputs found

RPPM : Rapid Performance Prediction of Multithreaded workloads on multicore processors

Author: Akram Shoaib
De Pestel Sander
Eeckhout Lieven
Van den Steen Sam
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019
Field of study

Analytical performance modeling is a useful complement to detailed cycle-level simulation to quickly explore the design space in an early design stage. Mechanistic analytical modeling is particularly interesting as it provides deep insight and does not require expensive offline profiling as empirical modeling. Previous work in mechanistic analytical modeling, unfortunately, is limited to single-threaded applications running on single-core processors. This work proposes RPPM, a mechanistic analytical performance model for multi-threaded applications on multicore hardware. RPPM collects microarchitecture-independent characteristics of a multi-threaded workload to predict performance on a previously unseen multicore architecture. The profile needs to be collected only once to predict a range of processor architectures. We evaluate RPPM's accuracy against simulation and report a performance prediction error of 11.2% on average (23% max). We demonstrate RPPM's usefulness for conducting design space exploration experiments as well as for analyzing parallel application performance

Crossref

Ghent University Academic Bibliography

COLAB:A Collaborative Multi-factor Scheduler for Asymmetric Multicore Processors

Author: Janjic Vladimir
Leather Hugh
Petoumenos Pavlos
Thomson John Donald
Yu Teng
Zhu Mingcan
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2019
Field of study

Funding: Partially funded by the UK EPSRC grants Discovery: Pattern Discovery and Program Shaping for Many-core Systems (EP/P020631/1) and ABC: Adaptive Brokerage for Cloud (EP/R010528/1); Royal Academy of Engineering under the Research Fellowship scheme.Increasingly prevalent asymmetric multicore processors (AMP) are necessary for delivering performance in the era of limited power budget and dark silicon. However, the software fails to use them efficiently. OS schedulers, in particular, handle asymmetry only under restricted scenarios. We have efficient symmetric schedulers, efficient asymmetric schedulers for single-threaded workloads, and efficient asymmetric schedulers for single program workloads. What we do not have is a scheduler that can handle all runtime factors affecting AMP for multi-threaded multi-programmed workloads. This paper introduces the first general purpose asymmetry-aware scheduler for multi-threaded multi-programmed workloads. It estimates the performance of each thread on each type of core and identifies communication patterns and bottleneck threads. The scheduler then makes coordinated core assignment and thread selection decisions that still provide each application its fair share of the processor's time. We evaluate our approach using the GEM5 simulator on four distinct big.LITTLE configurations and 26 mixed workloads composed of PARSEC and SPLASH2 benchmarks. Compared to the state-of-the art Linux CFS and AMP-aware schedulers, we demonstrate performance gains of up to 25% and 5% to 15% on average depending on the hardware setup.Postprin

Crossref

The University of Manchester - Institutional Repository

University of Dundee Online Publications

University of St. Andrews - Pure

St Andrews Research Repository

DReAM: An approach to estimate per-Task DRAM energy in multicore systems

Author: Abella Ferrer Jaume
Cazorla Almeida Francisco Javier
Liu Qixiao
Moreto Planas Miquel
Valero Cortés Mateo
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2016
Field of study

Accurate per-task energy estimation in multicore systems would allow performing per-task energy-aware task scheduling and energy-aware billing in data centers, among other applications. Per-task energy estimation is challenged by the interaction between tasks in shared resources, which impacts tasks’ energy consumption in uncontrolled ways. Some accurate mechanisms have been devised recently to estimate per-task energy consumed on-chip in multicores, but there is a lack of such mechanisms for DRAM memories. This article makes the case for accurate per-task DRAM energy metering in multicores, which opens new paths to energy/performance optimizations. In particular, the contributions of this article are (i) an ideal per-task energy metering model for DRAM memories; (ii) DReAM, an accurate yet low cost implementation of the ideal model (less than 5% accuracy error when 16 tasks share memory); and (iii) a comparison with standard methods (even distribution and access-count based) proving that DReAM is much more accurate than these other methods.Peer ReviewedPostprint (author's final draft

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

A Survey of Phase Classification Techniques for Characterizing Variable Application Behavior

Author: Adegbija Tosiron
Criswell Keeley
Publication venue
Publication date: 16/07/2019
Field of study

Adaptable computing is an increasingly important paradigm that specializes system resources to variable application requirements, environmental conditions, or user requirements. Adapting computing resources to variable application requirements (or application phases) is otherwise known as phase-based optimization. Phase-based optimization takes advantage of application phases, or execution intervals of an application, that behave similarly, to enable effective and beneficial adaptability. In order for phase-based optimization to be effective, the phases must first be classified to determine when application phases begin and end, and ensure that system resources are accurately specialized. In this paper, we present a survey of phase classification techniques that have been proposed to exploit the advantages of adaptable computing through phase-based optimization. We focus on recent techniques and classify these techniques with respect to several factors in order to highlight their similarities and differences. We divide the techniques by their major defining characteristics---online/offline and serial/parallel. In addition, we discuss other characteristics such as prediction and detection techniques, the characteristics used for prediction, interval type, etc. We also identify gaps in the state-of-the-art and discuss future research directions to enable and fully exploit the benefits of adaptable computing.Comment: To appear in IEEE Transactions on Parallel and Distributed Systems (TPDS

arXiv.org e-Print Archive

CP<sub>pf</sub>: A prefetch aware LLC partitioning approach

Author: Liu X.
Pimentel A.D.
Xiao J.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2019
Field of study

International Migration, Integration and Social Cohesion online publications

UvA-DARE

RPPM : rapid performance prediction of multithreaded applications on multicore hardware

Author: Akram Shoaib
De Pestel Sander
Eeckhout Lieven
Van den Steen Sam
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2018
Field of study

This paper proposes RPPM which, based on a microarchitecture-independent profile of a multithreaded application, predicts its performance on a previously unseen multicore platform. RPPM breaks up multithreaded program execution into epochs based on synchronization primitives, and then predicts per-epoch active execution times for each thread and synchronization overhead to arrive at a prediction for overall application performance. RPPM predicts performance within 12 percent on average (27 percent max error) compared to cycle-level simulation. We present a case study to illustrate that RPPM can be used for making accurate multicore design trade-offs early in the design cycle

Ghent University Academic Bibliography

A dynamic execution time estimation model to save energy in heterogeneous multicores running periodic tasks

Author: AlEnawy
Bartolini
Burkimsher
Buttazzo
Bux
Camarasu-Pop
Diaz
Donald
Eranian
Foglia
Gerards
Halfhill
Hardy
Houcine Hassan
Ipek
Ishihara
José Duato
José Luis March
Julio Sahuquillo
Kim
Kwon
March
March
Marwedel
Miftakhutdinov
Salvador Petit
Seth
Shah
Ubal
Ungerer
Wilhelm
Publication venue: 'Elsevier BV'
Publication date: 01/03/2016
Field of study

this is the author’s version of a work that was accepted for publication in Future Generation Computer Systems. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Future Generation Computer Systems, vol. 56 (2016). DOI 10.1016/j.future.2015.06.011.Nowadays, real-time embedded applications have to cope with an increasing demand of functionalities, which require increasing processing capabilities. With this aim real-time systems are being implemented on top of high-performance multicore processors that run multithreaded periodic workloads by allocating threads to individual cores. In addition, to improve both performance and energy savings, the industry is introducing new multicore designs such as ARM’s big.LITTLE that include heterogeneous cores in the same package. A key issue to improve energy savings in multicore embedded real-time systems and reduce the number of deadline misses is to accurately estimate the execution time of the tasks considering the supported processor frequencies. Two main aspects make this estimation difficult. First, the running threads compete among them for shared resources. Second, almost all current microprocessors implement Dynamic Voltage and Frequency Scaling (DVFS) regulators to dynamically adjust the voltage/frequency at run-time according to the workload behavior. Existing execution time estimation models rely on off-line analysis or on the assumption that the task execution time scales linearly with the processor frequency, which can bring important deviations since the memory system uses a different power supply. In contrast, this paper proposes the Processor–Memory (Proc–Mem) model, which dynamically predicts the distinct task execution times depending on the implemented processor frequencies. A power-aware EDF (Earliest Deadline First)-based scheduler using the Proc–Mem approach has been evaluated and compared against the same scheduler using a typical Constant Memory Access Time model, namely CMAT. Results on a heterogeneous multicore processor show that the average deviation of Proc–Mem is only by 5.55% with respect to the actual measured execution time, while the average deviation of the CMAT model is 36.42%. These results turn in important energy savings, by 18% on average and up to 31% in some mixes, in comparison to CMAT for a similar number of deadline misses. © 2015 Elsevier B.V. All rights reserved.This work was supported by the Spanish Ministerio de Economia y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-004-01, and by the Intel Early Career Faculty Honor Program Award.Sahuquillo Borrás, J.; Hassan Mohamed, H.; Petit Martí, SV.; March Cabrelles, JL.; Duato Marín, JF. (2016). A dynamic execution time estimation model to save energy in heterogeneous multicores running periodic tasks. Future Generation Computer Systems. 56:211-219. https://doi.org/10.1016/j.future.2015.06.011S2112195

Crossref

RiuNet

PMCTrack: Delivering performance monitoring counter support to the OS scheduler

Author: Castro F.
Pousa Adrián
Prieto Matías M.
Rodríguez Rodríguez R.
Saez J. C.
Publication venue
Publication date: 18/12/2019
Field of study

Hardware performance monitoring counters (PMCs) have proven effective in characterizing application performance. Because PMCs can only be accessed directly at the OS privilege level, kernellevel tools must be developed to enable the end-user and userspace programs to access PMCs. A large body of work has demonstrated that the OS can perform effective runtime optimizations in multicore systems by leveraging performance-counter data. Special attention has been paid to optimizations in the OS scheduler. While existing performance monitoring tools greatly simplify the collection of PMC application data from userspace, they do not provide an architecture-agnostic kernel-level mechanism that is capable of exposing high-level PMC metrics to OS components, such as the scheduler. As a result, the implementation of PMC-based OS scheduling schemes is typically tied to specific processor models. To address this shortcoming we present PMCTrack, a novel tool for the Linux kernel that provides a simple architecture-independent mechanism that makes it possible for the OS scheduler to access per-thread PMC data. Despite being an OSoriented tool, PMCTrack still allows the gathering of monitoring data from userspace, enabling kernel developers to carry out the necessary offline analysis and debugging to assist them during the scheduler design process. In addition, the tool provides both the OS and the user-space PMCTrack components with other insightful metrics available in modern processors and which are not directly exposed as PMCs, such as cache occupancy or energy consumption. This information is also of great value when it comes to analyzing the potential benefits of novel scheduling policies on real systems. In this paper, we analyze different case studies that demonstrate the flexibility, simplicity and powerful features of PMCTrack.Facultad de InformáticaInstituto de Investigación en Informátic