Predictable Performance and Fairness Through Accurate Slowdown Estimation in Shared Main Memory Systems
This paper summarizes the ideas and key concepts in MISE (Memory
Interference-induced Slowdown Estimation), which was published in HPCA 2013
[97], and examines the work's significance and future potential. Applications
running concurrently on a multicore system interfere with each other at the
main memory. This interference can slow down different applications
differently. Accurately estimating the slowdown of each application in such a
system can enable mechanisms that can enforce quality-of-service. While much
prior work has focused on mitigating the performance degradation due to
inter-application interference, there is little work on accurately estimating
slowdown of individual applications in a multi-programmed environment. Our goal
is to accurately estimate application slowdowns, towards providing predictable
performance.
To this end, we first build a simple Memory Interference-induced Slowdown
Estimation (MISE) model, which accurately estimates slowdowns caused by memory
interference. We then leverage our MISE model to develop two new memory
scheduling schemes: 1) one that provides soft quality-of-service guarantees,
and 2) another that explicitly attempts to minimize maximum slowdown (i.e.,
unfairness) in the system. Evaluations show that our techniques perform
significantly better than state-of-the-art memory scheduling approaches that
address the same problems.
Our proposed model and techniques have enabled significant research in the
development of accurate performance models [35, 59, 98, 110] and interference
management mechanisms [66, 99, 100, 108, 119, 120].
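The core MISE ratio can be illustrated with a short sketch (an illustration of the idea only, not the HPCA 2013 implementation; the function name and the rates in the example are our own): for a memory-bound application, slowdown is estimated as the ratio of the request service rate the application achieves when given highest priority at the memory controller (approximating running alone) to its service rate when sharing main memory.

```python
def mise_slowdown(alone_service_rate: float, shared_service_rate: float) -> float:
    """Estimate slowdown as rate_alone / rate_shared, the core MISE ratio
    for memory-bound applications. Rates are requests per unit time."""
    if shared_service_rate <= 0:
        raise ValueError("shared service rate must be positive")
    return alone_service_rate / shared_service_rate

# An application served at 8 requests/us with highest priority but only
# 2 requests/us when sharing memory is estimated to be slowed down 4x.
print(mise_slowdown(8.0, 2.0))  # -> 4.0
```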
Thread Progress Equalization: Dynamically Adaptive Power and Performance Optimization of Multi-threaded Applications
Dynamically adaptive multi-core architectures have been proposed as an
effective solution to optimize performance for peak power constrained
processors. In such processors, the micro-architectural parameters or
voltage/frequency of each core can be changed at run-time, thus providing a
range of power/performance operating points for each core. In this paper, we
propose Thread Progress Equalization (TPEq), a run-time mechanism for power
constrained performance maximization of multithreaded applications running on
dynamically adaptive multicore processors. Compared to existing approaches,
TPEq (i) identifies and addresses two primary sources of inter-thread
heterogeneity in multithreaded applications, (ii) determines the optimal core
configurations in polynomial time with respect to the number of cores and
configurations, and (iii) requires no modifications in the user-level source
code. Our experimental evaluations demonstrate that TPEq outperforms
state-of-the-art run-time power/performance optimization techniques proposed in
literature for dynamically adaptive multicores by up to 23%.
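A hypothetical sketch of the equalization idea (our illustration; the function, the power model, and the per-level speedup are assumptions, not TPEq's actual polynomial-time mechanism): repeatedly upgrade the core running the least-progressed thread to its next power/performance configuration, as long as the chip-wide power budget permits.

```python
def equalize_progress(progress, configs, power, budget):
    """progress[i]: current progress of thread i.
    configs[i]: current power/perf level of core i.
    power[level]: power cost of a level (monotonically increasing).
    Returns the chosen per-core levels under the power budget."""
    levels = list(configs)

    def total_power():
        return sum(power[l] for l in levels)

    while True:
        # find the laggard thread and try to boost its core one level
        laggard = min(range(len(progress)), key=lambda i: progress[i])
        nxt = levels[laggard] + 1
        if nxt >= len(power):
            break  # laggard already at the highest configuration
        if total_power() - power[levels[laggard]] + power[nxt] > budget:
            break  # upgrading would exceed the chip power budget
        levels[laggard] = nxt
        progress[laggard] *= 1.2  # assumed speedup per level step
    return levels

# Two threads, three power levels costing 1/2/3 W, 5 W budget:
print(equalize_progress([1.0, 2.0], [0, 0], [1, 2, 3], 5))
```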
Revisiting Conventional Task Schedulers to Exploit Asymmetry in ARM big.LITTLE Architectures for Dense Linear Algebra
Dealing with asymmetry in the architecture opens a plethora of questions from
the perspective of scheduling task-parallel applications, and there exist early
attempts to address this problem via ad-hoc strategies embedded into a runtime
framework. In this paper we take a different path, which consists in addressing
the complexity of the problem at the library level, via a few asymmetry-aware
fundamental kernels, hiding the architecture heterogeneity from the task
scheduler. For the specific domain of dense linear algebra, we show that this
is not only possible but delivers much higher performance than a naive approach
based on an asymmetry-oblivious scheduler. Furthermore, this solution also
outperforms an ad-hoc asymmetry-aware scheduler furnished with sophisticated
scheduling techniques.
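The library-level idea can be sketched as follows (a minimal illustration assuming throughput-proportional partitioning; the function and the numbers are hypothetical, not the paper's kernels): an asymmetry-aware fundamental kernel splits its work between the big and LITTLE clusters in proportion to their measured throughputs, so the task scheduler above it never needs to know the cores are asymmetric.

```python
def split_rows(n_rows, big_gflops, little_gflops):
    """Split a kernel's row range between the big and LITTLE clusters
    in proportion to their measured throughputs.
    Returns (rows_for_big, rows_for_little)."""
    total = big_gflops + little_gflops
    rows_big = round(n_rows * big_gflops / total)
    return rows_big, n_rows - rows_big

# A cluster 3x faster than its LITTLE counterpart gets 3/4 of the rows.
print(split_rows(1000, 3.0, 1.0))  # -> (750, 250)
```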
The Proceedings of First Work-in-Progress Session of The CSI International Symposium on Real-Time and Embedded Systems and Technologies
The present volume contains the proceedings of RTEST WiP 2018, chaired by
Marco Caccamo, University of Illinois at Urbana-Champaign. This event has been
organized by the School of Electrical and Computer Engineering at the
University of Tehran, in conjunction with the Department of Computer
Engineering at Sharif University of Technology, Tehran, Iran. The topics of
interest in RTEST WiP span all theoretical and application-oriented
aspects of real-time and embedded systems, internet-of-things, and
cyber-physical systems, covering design, analysis, implementation, evaluation,
and empirical results. The program committee of RTEST 2018 consists of 54
researchers in these fields from leading universities, industries, and
research centers around the world. RTEST 2018 received a total of 41
submissions, out of which we accepted 14 regular papers and 4
work-in-progress papers. Each submission was reviewed by 3 to 5
independent referees for its quality, originality, contribution, clarity of
presentation, and relevance to the symposium topics.
Novel Model-based Methods for Performance Optimization of Multithreaded 2D Discrete Fourier Transform on Multicore Processors
In this paper, we use multithreaded fast Fourier transforms provided in three
highly optimized packages, FFTW-2.1.5, FFTW-3.3.7, and Intel MKL FFT, to
present a novel model-based parallel computing technique as a very effective
and portable method for optimization of scientific multithreaded routines for
performance, especially in the current multicore era where processors have an
abundant number of cores. We propose two optimization methods, PFFT-FPM and
PFFT-FPM-PAD, based on this technique. They compute the 2D-DFT of a complex
signal matrix of size NxN using p abstract processors. Both algorithms take as
input discrete 3D functions of performance against problem size for the processors and
output the transformed signal matrix. Based on our experiments on a modern
Intel Haswell multicore server consisting of 36 physical cores, the average and
maximum speedups observed for PFFT-FPM using FFTW-3.3.7 are 1.9x and 6.8x
respectively and the average and maximum speedups observed using Intel MKL FFT
are 1.3x and 2x respectively. The average and maximum speedups observed for
PFFT-FPM-PAD using FFTW-3.3.7 are 2x and 9.4x respectively and the average and
maximum speedups observed using Intel MKL FFT are 1.4x and 5.9x, respectively.
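The 2D-DFT computed by both algorithms builds on the standard row-column decomposition, which is what makes partitioning rows across p abstract processors natural. A minimal pure-Python sketch of that decomposition (illustrative only; the paper's codes use highly optimized FFTW and Intel MKL routines):

```python
import cmath

def dft1d(x):
    """Naive O(n^2) 1-D DFT of a sequence x."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n)
                for k in range(n)) for j in range(n)]

def dft2d(m):
    """2-D DFT via row-column decomposition: 1-D DFTs over the rows,
    then 1-D DFTs over the columns of the intermediate result."""
    rows = [dft1d(r) for r in m]              # transform each row
    cols = [list(c) for c in zip(*rows)]      # transpose
    out_cols = [dft1d(c) for c in cols]       # transform each column
    return [list(r) for r in zip(*out_cols)]  # transpose back

# The 2-D DFT of a unit impulse at (0, 0) is the all-ones matrix.
print(dft2d([[1, 0], [0, 0]]))
```

Because the row pass is embarrassingly parallel, each abstract processor can be assigned a contiguous block of rows sized according to its performance function.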
Building real-time embedded applications on QduinoMC: a web-connected 3D printer case study
Single Board Computers (SBCs) are now emerging
with multiple cores, ADCs, GPIOs, PWM channels, integrated
graphics, and several serial bus interfaces. The low power
consumption, small form factor and I/O interface capabilities of
SBCs with sensors and actuators make them ideal for embedded
and real-time applications. However, most SBCs run non-real-time
operating systems based on Linux and Windows, and do
not provide a user-friendly API for application development. This
paper presents QduinoMC, a multicore extension to the popular
Arduino programming environment, which runs on the Quest
real-time operating system. QduinoMC is an extension of our earlier
single-core, real-time, multithreaded Qduino API. We show
the utility of QduinoMC by applying it to a specific application: a
web-connected 3D printer. This differs from existing 3D printers,
which run relatively simple firmware and lack operating system
support to spool multiple jobs, or interoperate with other devices
(e.g., in a print farm). We show how QduinoMC empowers devices with the capabilities to run new services without impacting their timing guarantees. While it is possible to modify existing operating systems to provide suitable timing guarantees, the effort to do so is cumbersome and does not provide the ease of programming afforded by QduinoMC.
http://www.cs.bu.edu/fac/richwest/papers/rtas_2017.pdf
The Blacklisting Memory Scheduler: Balancing Performance, Fairness and Complexity
In a multicore system, applications running on different cores interfere at
main memory. This inter-application interference degrades overall system
performance and unfairly slows down applications. Prior works have developed
application-aware memory schedulers to tackle this problem. State-of-the-art
application-aware memory schedulers prioritize requests of applications that
are vulnerable to interference, by ranking individual applications based on
their memory access characteristics and enforcing a total rank order.
In this paper, we observe that state-of-the-art application-aware memory
schedulers have two major shortcomings. First, such schedulers trade off
hardware complexity in order to achieve high performance or fairness, since
ranking applications with a total order leads to high hardware complexity.
Second, ranking can unfairly slow down applications that are at the bottom of
the ranking stack. To overcome these shortcomings, we propose the Blacklisting
Memory Scheduler (BLISS), which achieves high system performance and fairness
while incurring low hardware complexity, based on two observations. First, we
find that, to mitigate interference, it is sufficient to separate applications
into only two groups. Second, we show that this grouping can be efficiently
performed by simply counting the number of consecutive requests served from
each application.
We evaluate BLISS across a wide variety of workloads/system configurations
and compare its performance and hardware complexity with five state-of-the-art
memory schedulers. Our evaluations show that BLISS achieves 5% better system
performance and 25% better fairness than the best-performing previous scheduler
while greatly reducing critical path latency and hardware area cost of the
memory scheduler (by 79% and 43%, respectively), thereby achieving a good
trade-off between performance, fairness, and hardware complexity.
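The grouping observation can be sketched as follows (a simplified software illustration; the class, the threshold value, and the method names are our assumptions, not the BLISS hardware design): count consecutive requests served from each application, blacklist an application once the count crosses a threshold, and prioritize requests from applications that are not blacklisted.

```python
BLACKLIST_THRESHOLD = 4  # assumed value for illustration

class BlisslikeScheduler:
    """Separates applications into two groups by counting consecutive
    requests served from each, in the spirit of BLISS."""

    def __init__(self):
        self.last_app = None
        self.streak = 0
        self.blacklist = set()

    def on_request_served(self, app):
        # track how many consecutive requests this app has had served
        if app == self.last_app:
            self.streak += 1
        else:
            self.last_app, self.streak = app, 1
        if self.streak >= BLACKLIST_THRESHOLD:
            self.blacklist.add(app)  # likely interference-causing app

    def prioritized(self, pending_apps):
        """Prefer requests from apps not currently blacklisted."""
        ok = [a for a in pending_apps if a not in self.blacklist]
        return ok or pending_apps
```

Two groups and a small counter per application are all the state this needs, which is the source of the low hardware complexity the abstract describes.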
SysScale: Exploiting Multi-domain Dynamic Voltage and Frequency Scaling for Energy Efficient Mobile Processors
There are three domains in a modern thermally-constrained mobile
system-on-chip (SoC): compute, IO, and memory. We observe that a modern SoC
typically allocates a fixed power budget, corresponding to worst-case
performance demands, to the IO and memory domains even if they are
underutilized. The resulting unfair allocation of the power budget across
domains can cause two major issues: 1) the IO and memory domains can operate at
a higher frequency and voltage than necessary, increasing power consumption, and
2) the unused power budget of the IO and memory domains cannot be used to
increase the throughput of the compute domain, hampering performance. To avoid
these issues, it is crucial to dynamically orchestrate the distribution of the
SoC power budget across the three domains based on their actual performance
demands.
We propose SysScale, a new multi-domain power management technique to improve
the energy efficiency of mobile SoCs. SysScale is based on three key ideas.
First, SysScale introduces an accurate algorithm to predict the performance
(e.g., bandwidth and latency) demands of the three SoC domains. Second,
SysScale uses a new DVFS (dynamic voltage and frequency scaling) mechanism to
distribute the SoC power to each domain according to the predicted performance
demands. Third, in addition to using a global DVFS mechanism, SysScale uses
domain-specialized techniques to optimize the energy efficiency of each domain
at different operating points.
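A minimal sketch of the budget-redistribution idea (our illustration, assuming a simple demand-first policy; not Intel's actual algorithm): the IO and memory domains are allocated their predicted, rather than worst-case, power needs, and the freed budget flows to the compute domain.

```python
def redistribute(total_budget, demand):
    """demand: dict mapping domain -> predicted power need (watts).
    IO and memory get their predicted need; compute absorbs the rest."""
    alloc = {}
    remaining = total_budget
    # allocate IO and memory at predicted (not worst-case) demand
    for dom in ("io", "memory"):
        alloc[dom] = min(demand[dom], remaining)
        remaining -= alloc[dom]
    alloc["compute"] = remaining  # compute absorbs any freed budget
    return alloc

# With a 10 W budget, slack from lightly loaded IO/memory domains
# raises the compute domain's allocation above its nominal share.
print(redistribute(10.0, {"io": 1.0, "memory": 2.0, "compute": 6.0}))
```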
We implement SysScale on an Intel Skylake microprocessor for mobile devices
and evaluate it using a wide variety of SPEC CPU2006, graphics (3DMark), and
battery life workloads (e.g., video playback). On a 2-core Skylake, SysScale
improves the performance of SPEC CPU2006 and 3DMark workloads by up to 16% and
8.9% (9.2% and 7.9% on average), respectively.
Comment: To appear at ISCA 202
FlashAbacus: A Self-Governing Flash-Based Accelerator for Low-Power Systems
Energy efficiency and computing flexibility are some of the primary design
constraints of heterogeneous computing. In this paper, we present FlashAbacus,
a data-processing accelerator that self-governs heterogeneous kernel executions
and data storage accesses by integrating many flash modules in lightweight
multiprocessors. The proposed accelerator can simultaneously process data from
different applications with diverse types of operational functions, and it
allows multiple kernels to directly access flash without the assistance of a
host-level file system or an I/O runtime library. We prototype FlashAbacus on a
multicore-based PCIe platform that connects to FPGA-based flash controllers
with a 20 nm node process. The evaluation results show that FlashAbacus can
improve the bandwidth of data processing by 127%, while reducing energy
consumption by 78.4%, as compared to a conventional method of heterogeneous
computing. This paper is accepted by and will be published at the 2018
EuroSys conference.
REPP-H: runtime estimation of power and performance on heterogeneous data centers
Modern data centers increasingly demand improved performance with minimal power consumption. Managing the power and performance requirements of the applications is challenging because these data centers, incidentally or intentionally, have to deal with server architecture heterogeneity [19], [22]. One critical challenge that data centers have to face is how to manage system power and performance given the different application behavior across multiple different architectures.
This work has been supported by the EU FP7 program (Mont-Blanc 2, ICT-610402), by the Ministerio de Economia (CAP-VII, TIN2015-65316-P), and by the Generalitat de Catalunya (MPEXPAR, 2014-SGR-1051). The material herein is based in part upon work supported by the US NSF, grant numbers ACI-1535232 and CNS-1305220.