Predictable Performance and Fairness Through Accurate Slowdown Estimation in Shared Main Memory Systems
This paper summarizes the ideas and key concepts in MISE (Memory
Interference-induced Slowdown Estimation), which was published in HPCA 2013
[97], and examines the work's significance and future potential. Applications
running concurrently on a multicore system interfere with each other at the
main memory. This interference can slow down different applications
differently. Accurately estimating the slowdown of each application in such a
system can enable mechanisms that can enforce quality-of-service. While much
prior work has focused on mitigating the performance degradation due to
inter-application interference, there is little work on accurately estimating
slowdown of individual applications in a multi-programmed environment. Our goal
is to accurately estimate application slowdowns, towards providing predictable
performance.
To this end, we first build a simple Memory Interference-induced Slowdown
Estimation (MISE) model, which accurately estimates slowdowns caused by memory
interference. We then leverage our MISE model to develop two new memory
scheduling schemes: 1) one that provides soft quality-of-service guarantees,
and 2) another that explicitly attempts to minimize maximum slowdown (i.e.,
unfairness) in the system. Evaluations show that our techniques perform
significantly better than state-of-the-art memory scheduling approaches that
address the same problems.
Our proposed model and techniques have enabled significant research in the
development of accurate performance models [35, 59, 98, 110] and interference
management mechanisms [66, 99, 100, 108, 119, 120].
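The core MISE ratio can be illustrated with a short sketch (an illustration of the idea only, not the HPCA 2013 implementation; the function name and the rates in the example are our own): for a memory-bound application, slowdown is estimated as the ratio of the request service rate the application achieves when given highest priority at the memory controller (approximating running alone) to its service rate when sharing main memory.

```python
def mise_slowdown(alone_service_rate: float, shared_service_rate: float) -> float:
    """Estimate slowdown as rate_alone / rate_shared, the core MISE ratio
    for memory-bound applications. Rates are requests per unit time."""
    if shared_service_rate <= 0:
        raise ValueError("shared service rate must be positive")
    return alone_service_rate / shared_service_rate

# An application served at 8 requests/us with highest priority but only
# 2 requests/us when sharing memory is estimated to be slowed down 4x.
print(mise_slowdown(8.0, 2.0))  # -> 4.0
```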
Thread Progress Equalization: Dynamically Adaptive Power and Performance Optimization of Multi-threaded Applications
Dynamically adaptive multi-core architectures have been proposed as an
effective solution to optimize performance for peak power constrained
processors. In such processors, the micro-architectural parameters or
voltage/frequency of each core can be changed at run-time, thus providing a
range of power/performance operating points for each core. In this paper, we
propose Thread Progress Equalization (TPEq), a run-time mechanism for power
constrained performance maximization of multithreaded applications running on
dynamically adaptive multicore processors. Compared to existing approaches,
TPEq (i) identifies and addresses two primary sources of inter-thread
heterogeneity in multithreaded applications, (ii) determines the optimal core
configurations in polynomial time with respect to the number of cores and
configurations, and (iii) requires no modifications in the user-level source
code. Our experimental evaluations demonstrate that TPEq outperforms
state-of-the-art run-time power/performance optimization techniques proposed in
literature for dynamically adaptive multicores by up to 23%.
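A hypothetical sketch of the equalization idea (our illustration; the function, the power model, and the per-level speedup are assumptions, not TPEq's actual polynomial-time mechanism): repeatedly upgrade the core running the least-progressed thread to its next power/performance configuration, as long as the chip-wide power budget permits.

```python
def equalize_progress(progress, configs, power, budget):
    """progress[i]: current progress of thread i.
    configs[i]: current power/perf level of core i.
    power[level]: power cost of a level (monotonically increasing).
    Returns the chosen per-core levels under the power budget."""
    levels = list(configs)

    def total_power():
        return sum(power[l] for l in levels)

    while True:
        # find the laggard thread and try to boost its core one level
        laggard = min(range(len(progress)), key=lambda i: progress[i])
        nxt = levels[laggard] + 1
        if nxt >= len(power):
            break  # laggard already at the highest configuration
        if total_power() - power[levels[laggard]] + power[nxt] > budget:
            break  # upgrading would exceed the chip power budget
        levels[laggard] = nxt
        progress[laggard] *= 1.2  # assumed speedup per level step
    return levels

# Two threads, three power levels costing 1/2/3 W, 5 W budget:
print(equalize_progress([1.0, 2.0], [0, 0], [1, 2, 3], 5))
```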
Revisiting Conventional Task Schedulers to Exploit Asymmetry in ARM big.LITTLE Architectures for Dense Linear Algebra
Dealing with asymmetry in the architecture opens a plethora of questions from
the perspective of scheduling task-parallel applications, and there exist early
attempts to address this problem via ad-hoc strategies embedded into a runtime
framework. In this paper we take a different path, which consists in addressing
the complexity of the problem at the library level, via a few asymmetry-aware
fundamental kernels, hiding the architecture heterogeneity from the task
scheduler. For the specific domain of dense linear algebra, we show that this
is not only possible but delivers much higher performance than a naive approach
based on an asymmetry-oblivious scheduler. Furthermore, this solution also
outperforms an ad-hoc asymmetry-aware scheduler furnished with sophisticated
scheduling techniques.
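The library-level idea can be sketched as follows (a minimal illustration assuming throughput-proportional partitioning; the function and the numbers are hypothetical, not the paper's kernels): an asymmetry-aware fundamental kernel splits its work between the big and LITTLE clusters in proportion to their measured throughputs, so the task scheduler above it never needs to know the cores are asymmetric.

```python
def split_rows(n_rows, big_gflops, little_gflops):
    """Split a kernel's row range between the big and LITTLE clusters
    in proportion to their measured throughputs.
    Returns (rows_for_big, rows_for_little)."""
    total = big_gflops + little_gflops
    rows_big = round(n_rows * big_gflops / total)
    return rows_big, n_rows - rows_big

# A cluster 3x faster than its LITTLE counterpart gets 3/4 of the rows.
print(split_rows(1000, 3.0, 1.0))  # -> (750, 250)
```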
The Proceedings of First Work-in-Progress Session of The CSI International Symposium on Real-Time and Embedded Systems and Technologies
The present volume contains the proceedings of RTEST WiP 2018, chaired by
Marco Caccamo, University of Illinois at Urbana-Champaign. This event has been
organized by the School of Electrical and Computer Engineering at the
University of Tehran, in conjunction with the Department of Computer
Engineering at Sharif University of Technology, Tehran, Iran. The topics of
interest in RTEST WiP span all theoretical and application-oriented
aspects of real-time and embedded systems, internet-of-things, and
cyber-physical systems, covering design, analysis, implementation, evaluation,
and empirical results. The program committee of RTEST 2018 consists of 54
researchers in these fields from leading universities, industries, and
research centers around the world. RTEST 2018 received a total of 41
submissions, out of which we accepted 14 regular papers and 4
work-in-progress papers. Each submission was reviewed by 3 to 5
independent referees for its quality, originality, contribution, clarity of
presentation, and relevance to the symposium topics.
Novel Model-based Methods for Performance Optimization of Multithreaded 2D Discrete Fourier Transform on Multicore Processors
In this paper, we use multithreaded fast Fourier transforms provided in three
highly optimized packages, FFTW-2.1.5, FFTW-3.3.7, and Intel MKL FFT, to
present a novel model-based parallel computing technique as a very effective
and portable method for optimization of scientific multithreaded routines for
performance, especially in the current multicore era where processors have an
abundant number of cores. We propose two optimization methods, PFFT-FPM and
PFFT-FPM-PAD, based on this technique. They compute the 2D-DFT of a complex
signal matrix of size NxN using p abstract processors. Both algorithms take as
input discrete 3D functions of performance against problem size for the processors and
output the transformed signal matrix. Based on our experiments on a modern
Intel Haswell multicore server consisting of 36 physical cores, the average and
maximum speedups observed for PFFT-FPM using FFTW-3.3.7 are 1.9x and 6.8x
respectively and the average and maximum speedups observed using Intel MKL FFT
are 1.3x and 2x respectively. The average and maximum speedups observed for
PFFT-FPM-PAD using FFTW-3.3.7 are 2x and 9.4x respectively and the average and
maximum speedups observed using Intel MKL FFT are 1.4x and 5.9x, respectively.
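The 2D-DFT computed by both algorithms builds on the standard row-column decomposition, which is what makes partitioning rows across p abstract processors natural. A minimal pure-Python sketch of that decomposition (illustrative only; the paper's codes use highly optimized FFTW and Intel MKL routines):

```python
import cmath

def dft1d(x):
    """Naive O(n^2) 1-D DFT of a sequence x."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n)
                for k in range(n)) for j in range(n)]

def dft2d(m):
    """2-D DFT via row-column decomposition: 1-D DFTs over the rows,
    then 1-D DFTs over the columns of the intermediate result."""
    rows = [dft1d(r) for r in m]              # transform each row
    cols = [list(c) for c in zip(*rows)]      # transpose
    out_cols = [dft1d(c) for c in cols]       # transform each column
    return [list(r) for r in zip(*out_cols)]  # transpose back

# The 2-D DFT of a unit impulse at (0, 0) is the all-ones matrix.
print(dft2d([[1, 0], [0, 0]]))
```

Because the row pass is embarrassingly parallel, each abstract processor can be assigned a contiguous block of rows sized according to its performance function.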
Building real-time embedded applications on QduinoMC: a web-connected 3D printer case study
Single Board Computers (SBCs) are now emerging
with multiple cores, ADCs, GPIOs, PWM channels, integrated
graphics, and several serial bus interfaces. The low power
consumption, small form factor and I/O interface capabilities of
SBCs with sensors and actuators make them ideal for embedded
and real-time applications. However, most SBCs run non-real-time
operating systems based on Linux and Windows, and do
not provide a user-friendly API for application development. This
paper presents QduinoMC, a multicore extension to the popular
Arduino programming environment, which runs on the Quest
real-time operating system. QduinoMC is an extension of our earlier
single-core, real-time, multithreaded Qduino API. We show
the utility of QduinoMC by applying it to a specific application: a
web-connected 3D printer. This differs from existing 3D printers,
which run relatively simple firmware and lack operating system
support to spool multiple jobs, or interoperate with other devices
(e.g., in a print farm). We show how QduinoMC empowers devices with the capabilities to run new services without impacting their timing guarantees. While it is possible to modify existing operating systems to provide suitable timing guarantees, the effort to do so is cumbersome and does not provide the ease of programming afforded by QduinoMC.
http://www.cs.bu.edu/fac/richwest/papers/rtas_2017.pdf
The Blacklisting Memory Scheduler: Balancing Performance, Fairness and Complexity
In a multicore system, applications running on different cores interfere at
main memory. This inter-application interference degrades overall system
performance and unfairly slows down applications. Prior works have developed
application-aware memory schedulers to tackle this problem. State-of-the-art
application-aware memory schedulers prioritize requests of applications that
are vulnerable to interference, by ranking individual applications based on
their memory access characteristics and enforcing a total rank order.
In this paper, we observe that state-of-the-art application-aware memory
schedulers have two major shortcomings. First, such schedulers trade off
hardware complexity in order to achieve high performance or fairness, since
ranking applications with a total order leads to high hardware complexity.
Second, ranking can unfairly slow down applications that are at the bottom of
the ranking stack. To overcome these shortcomings, we propose the Blacklisting
Memory Scheduler (BLISS), which achieves high system performance and fairness
while incurring low hardware complexity, based on two observations. First, we
find that, to mitigate interference, it is sufficient to separate applications
into only two groups. Second, we show that this grouping can be efficiently
performed by simply counting the number of consecutive requests served from
each application.
We evaluate BLISS across a wide variety of workloads/system configurations
and compare its performance and hardware complexity with five state-of-the-art
memory schedulers. Our evaluations show that BLISS achieves 5% better system
performance and 25% better fairness than the best-performing previous scheduler
while greatly reducing critical path latency and hardware area cost of the
memory scheduler (by 79% and 43%, respectively), thereby achieving a good
trade-off between performance, fairness, and hardware complexity.
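The grouping observation can be sketched as follows (a simplified software illustration; the class, the threshold value, and the method names are our assumptions, not the BLISS hardware design): count consecutive requests served from each application, blacklist an application once the count crosses a threshold, and prioritize requests from applications that are not blacklisted.

```python
BLACKLIST_THRESHOLD = 4  # assumed value for illustration

class BlisslikeScheduler:
    """Separates applications into two groups by counting consecutive
    requests served from each, in the spirit of BLISS."""

    def __init__(self):
        self.last_app = None
        self.streak = 0
        self.blacklist = set()

    def on_request_served(self, app):
        # track how many consecutive requests this app has had served
        if app == self.last_app:
            self.streak += 1
        else:
            self.last_app, self.streak = app, 1
        if self.streak >= BLACKLIST_THRESHOLD:
            self.blacklist.add(app)  # likely interference-causing app

    def prioritized(self, pending_apps):
        """Prefer requests from apps not currently blacklisted."""
        ok = [a for a in pending_apps if a not in self.blacklist]
        return ok or pending_apps
```

Two groups and a small counter per application are all the state this needs, which is the source of the low hardware complexity the abstract describes.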
SysScale: Exploiting Multi-domain Dynamic Voltage and Frequency Scaling for Energy Efficient Mobile Processors
There are three domains in a modern thermally-constrained mobile
system-on-chip (SoC): compute, IO, and memory. We observe that a modern SoC
typically allocates a fixed power budget, corresponding to worst-case
performance demands, to the IO and memory domains even if they are
underutilized. The resulting unfair allocation of the power budget across
domains can cause two major issues: 1) the IO and memory domains can operate at
a higher frequency and voltage than necessary, increasing power consumption, and
2) the unused power budget of the IO and memory domains cannot be used to
increase the throughput of the compute domain, hampering performance. To avoid
these issues, it is crucial to dynamically orchestrate the distribution of the
SoC power budget across the three domains based on their actual performance
demands.
We propose SysScale, a new multi-domain power management technique to improve
the energy efficiency of mobile SoCs. SysScale is based on three key ideas.
First, SysScale introduces an accurate algorithm to predict the performance
(e.g., bandwidth and latency) demands of the three SoC domains. Second,
SysScale uses a new DVFS (dynamic voltage and frequency scaling) mechanism to
distribute the SoC power to each domain according to the predicted performance
demands. Third, in addition to using a global DVFS mechanism, SysScale uses
domain-specialized techniques to optimize the energy efficiency of each domain
at different operating points.
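A minimal sketch of the budget-redistribution idea (our illustration, assuming a simple demand-first policy; not Intel's actual algorithm): the IO and memory domains are allocated their predicted, rather than worst-case, power needs, and the freed budget flows to the compute domain.

```python
def redistribute(total_budget, demand):
    """demand: dict mapping domain -> predicted power need (watts).
    IO and memory get their predicted need; compute absorbs the rest."""
    alloc = {}
    remaining = total_budget
    # allocate IO and memory at predicted (not worst-case) demand
    for dom in ("io", "memory"):
        alloc[dom] = min(demand[dom], remaining)
        remaining -= alloc[dom]
    alloc["compute"] = remaining  # compute absorbs any freed budget
    return alloc

# With a 10 W budget, slack from lightly loaded IO/memory domains
# raises the compute domain's allocation above its nominal share.
print(redistribute(10.0, {"io": 1.0, "memory": 2.0, "compute": 6.0}))
```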
We implement SysScale on an Intel Skylake microprocessor for mobile devices
and evaluate it using a wide variety of SPEC CPU2006, graphics (3DMark), and
battery life workloads (e.g., video playback). On a 2-core Skylake, SysScale
improves the performance of SPEC CPU2006 and 3DMark workloads by up to 16% and
8.9% (9.2% and 7.9% on average), respectively.
Comment: To appear at ISCA 202
FlashAbacus: A Self-Governing Flash-Based Accelerator for Low-Power Systems
Energy efficiency and computing flexibility are some of the primary design
constraints of heterogeneous computing. In this paper, we present FlashAbacus,
a data-processing accelerator that self-governs heterogeneous kernel executions
and data storage accesses by integrating many flash modules in lightweight
multiprocessors. The proposed accelerator can simultaneously process data from
different applications with diverse types of operational functions, and it
allows multiple kernels to directly access flash without the assistance of a
host-level file system or an I/O runtime library. We prototype FlashAbacus on a
multicore-based PCIe platform that connects to FPGA-based flash controllers
with a 20 nm node process. The evaluation results show that FlashAbacus can
improve the bandwidth of data processing by 127%, while reducing energy
consumption by 78.4%, as compared to a conventional method of heterogeneous
computing. This paper is accepted by and will be published at the 2018
EuroSys conference.
REPP-H: runtime estimation of power and performance on heterogeneous data centers
Modern data centers increasingly demand improved performance with minimal power consumption. Managing the power and performance requirements of the applications is challenging because these data centers, incidentally or intentionally, have to deal with server architecture heterogeneity [19], [22]. One critical challenge that data centers have to face is how to manage system power and performance given the different application behavior across multiple different architectures.
This work has been supported by the EU FP7 program (Mont-Blanc 2, ICT-610402), by the Ministerio de Economia (CAP-VII, TIN2015-65316-P), and by the Generalitat de Catalunya (MPEXPAR, 2014-SGR-1051). The material herein is based in part upon work supported by the US NSF, grant numbers ACI-1535232 and CNS-1305220.