Fairness-aware scheduling on single-ISA heterogeneous multi-cores
Single-ISA heterogeneous multi-cores consisting of small (e.g., in-order) and big (e.g., out-of-order) cores dramatically improve energy- and power-efficiency by scheduling workloads on the most appropriate core type. A significant body of recent work has focused on improving system throughput through scheduling. However, none of the prior work has looked into fairness. Yet, guaranteeing that all threads make equal progress on heterogeneous multi-cores is of utmost importance for both multi-threaded and multi-program workloads to improve performance and quality-of-service. Furthermore, modern operating systems affinitize workloads to cores (pinned scheduling) which dramatically affects fairness on heterogeneous multi-cores. In this paper, we propose fairness-aware scheduling for single-ISA heterogeneous multi-cores, and explore two flavors for doing so. Equal-time scheduling runs each thread or workload on each core type for an equal fraction of the time, whereas equal-progress scheduling strives to get equal amounts of work done on each core type. Our experimental results demonstrate an average 14% (and up to 25%) performance improvement over pinned scheduling through fairness-aware scheduling for homogeneous multi-threaded workloads; equal-progress scheduling improves performance by 32% on average for heterogeneous multi-threaded workloads. Further, we report dramatic improvements in fairness over prior scheduling proposals for multi-program workloads, while achieving system throughput comparable to throughput-optimized scheduling, and an average 21% improvement in throughput over pinned scheduling.
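The equal-time idea described in this abstract can be illustrated with a minimal sketch. It is not the paper's scheduler: the function names, the fixed-length quantum model, and the simple rotation policy are all assumptions made for illustration. Threads take turns on the big cores so that, over a whole number of rotations, each thread spends the same fraction of time on each core type.

```python
from collections import Counter


def equal_time_schedule(threads, n_big, n_quanta):
    """Rotate which threads occupy the big cores in each time quantum.

    Hypothetical model: time is split into equal quanta, and in each
    quantum exactly n_big threads run on big cores while the remaining
    threads run on small cores. Returns one set of "big core" threads
    per quantum.
    """
    n = len(threads)
    if not 0 < n_big <= n:
        raise ValueError("n_big must be between 1 and the number of threads")
    schedule = []
    for q in range(n_quanta):
        start = (q * n_big) % n  # slide the window of big-core threads
        on_big = {threads[(start + i) % n] for i in range(n_big)}
        schedule.append(on_big)
    return schedule


def big_core_share(schedule, threads):
    """Fraction of quanta each thread spent on a big core."""
    counts = Counter(t for on_big in schedule for t in on_big)
    return {t: counts[t] / len(schedule) for t in threads}
```

When the number of quanta is a multiple of the number of threads, every thread in this sketch receives the same big-core share, n_big / len(threads). Equal-progress scheduling would additionally need a per-thread progress estimate, which is not modelled here.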
Smartlocks: Self-Aware Synchronization through Lock Acquisition Scheduling
As multicore processors become increasingly prevalent, system complexity is skyrocketing. The advent of the asymmetric multicore compounds this -- it is no longer practical for an average programmer to balance the system constraints associated with today's multicores and worry about new problems like asymmetric partitioning and thread interference. Adaptive, or self-aware, computing has been proposed as one method to help application and system programmers confront this complexity. These systems take some of the burden off programmers by monitoring themselves and optimizing or adapting to meet their goals. This paper introduces an open-source self-aware synchronization library for multicores and asymmetric multicores called Smartlocks. Smartlocks is a spin-lock library that adapts its internal implementation during execution using heuristics and machine learning to optimize toward a user-defined goal, which may relate to performance, power, or other problem-specific criteria. Smartlocks builds upon adaptation techniques from prior work like reactive locks, but introduces a novel form of adaptation designed for asymmetric multicores that we term lock acquisition scheduling. When multiple threads (or processes) are spinning on a lock, lock acquisition scheduling chooses which waiter gets the lock next so as to achieve the best long-term effect. Our results demonstrate empirically that lock scheduling is important for asymmetric multicores and that Smartlocks significantly outperform conventional and reactive locks for asymmetries like dynamic variations in processor clock frequencies caused by thermal throttling events.
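To make the notion of lock acquisition scheduling concrete, the following sketch compares two orders in which waiters could be granted a lock on cores of different speeds. It is not the Smartlocks API, and it uses a fixed heuristic instead of the machine learning the abstract mentions. The `Waiter` class, its fields, and the "fastest first" policy are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Waiter:
    name: str
    core_speed: float   # hypothetical relative clock frequency of the waiter's core
    work: float = 1.0   # hypothetical amount of work inside the critical section

    @property
    def hold_time(self):
        # A slower core holds the lock longer for the same critical section.
        return self.work / self.core_speed


def total_wait(order):
    """Sum of the time each waiter spends spinning before it gets the lock."""
    elapsed = 0.0
    waited = 0.0
    for w in order:
        waited += elapsed
        elapsed += w.hold_time
    return waited


def fifo(waiters):
    """Grant the lock in arrival order, as a simple queue lock would."""
    return list(waiters)


def fastest_first(waiters):
    """Grant the lock to the waiter with the shortest expected hold time."""
    return sorted(waiters, key=lambda w: w.hold_time)
```

With one throttled core among the waiters, granting the lock to faster cores first reduces the total spinning time in this model. A real policy would also have to guard against starving the slow waiter.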
Multicore Performance Optimization Using Partner Cores
As the push for parallelism continues to increase the number of cores on a chip, and add to the complexity of system design, the task of optimizing performance at the application level becomes nearly impossible for the programmer. Much effort has been spent on developing techniques for optimizing performance at runtime, but many techniques for modern processors rely on speculative threads or performance counters. These approaches result in stolen cycles, or the use of an extra core, and such expensive penalties put demanding constraints on the gains provided by such methods. While processors have grown in power and complexity, the technology for small, efficient cores has emerged. We introduce the concept of Partner Cores for maximizing hardware power efficiency; these are low-area, low-power cores situated on-die, tightly coupled to each main processor core. We demonstrate that such cores enable performance improvement without incurring expensive penalties, and we explore potential applications that are impossible on a traditional chip multiprocessor.
Graphs, Matrices, and the GraphBLAS: Seven Good Reasons
The analysis of graphs has become increasingly important to a wide range of
applications. Graph analysis presents a number of unique challenges in the
areas of (1) software complexity, (2) data complexity, (3) security, (4)
mathematical complexity, (5) theoretical analysis, (6) serial performance, and
(7) parallel performance. Implementing graph algorithms using matrix-based
approaches provides a number of promising solutions to these challenges. The
GraphBLAS standard (istc-bigdata.org/GraphBlas) is being developed to bring
the potential of matrix based graph algorithms to the broadest possible
audience. The GraphBLAS mathematically defines a core set of matrix-based graph
operations that can be used to implement a wide class of graph algorithms in a
wide range of programming environments. This paper provides an introduction to
the GraphBLAS and describes how the GraphBLAS can be used to address many of
the challenges associated with analysis of graphs.
Comment: 10 pages; International Conference on Computational Science workshop
on the Applications of Matrix Computational Methods in the Analysis of Modern
Dat
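As an illustration of the matrix-based approach to graph algorithms that the abstract above describes, the sketch below expresses breadth-first search as repeated matrix-vector products over a Boolean (OR, AND) semiring. It uses plain Python lists and does not use any GraphBLAS library or API. The function name and the adjacency-matrix representation are assumptions.

```python
def bfs_levels(adj, source):
    """Breadth-first search written as repeated matrix-vector products.

    adj[i][j] is True when there is an edge i -> j. Each step computes
    next_frontier = A^T * frontier over the (OR, AND) semiring, masked
    by the vertices not yet visited. Returns the BFS level of every
    vertex, or -1 for vertices unreachable from the source.
    """
    n = len(adj)
    levels = [-1] * n
    frontier = [i == source for i in range(n)]
    level = 0
    while any(frontier):
        # Record the level of every vertex in the current frontier.
        for i in range(n):
            if frontier[i]:
                levels[i] = level
        # Semiring product: OR over i of (frontier[i] AND adj[i][j]),
        # keeping only vertices that have not been assigned a level.
        frontier = [
            levels[j] == -1 and any(frontier[i] and adj[i][j] for i in range(n))
            for j in range(n)
        ]
        level += 1
    return levels
```

The loop body contains no explicit queue or per-vertex traversal logic; the whole step is one masked matrix-vector product, which is the kind of operation a matrix-based graph library could implement once and optimize for both serial and parallel execution.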
Acceleration and energy consumption optimization in cascading classifiers for face detection on low-cost ARM big.LITTLE asymmetric architectures
This paper proposes a mechanism to accelerate face detection software based on
Haar-like cascading classifiers and to optimize its energy consumption, taking
advantage of the features of low-cost Asymmetric Multicore
Processors (AMPs) with a limited power budget. A modelling and task
scheduling/allocation approach is proposed to make efficient use of the
existing features of big.LITTLE ARM processors, including: (I) source-code
adaptation for parallel computing, which enables code acceleration by applying
the OmpSs programming model, a task-based programming model that handles
data-dependencies between tasks in a transparent fashion; (II) different OmpSs
task allocation policies which take into account the processor asymmetry and
can dynamically set processing resources in a more efficient way based on their
particular features. The proposed mechanism can be efficiently applied to take
advantage of the processing elements existing on low-cost and low-energy
multi-core embedded devices executing object detection algorithms based on
cascading classifiers. Although these classifiers yield the best results for
detection algorithms in the field of computer vision, their high computational
requirements prevent them from being used on these devices under real-time
requirements. Finally, we compare the energy efficiency of a heterogeneous
architecture based on asymmetric multicore processors with a suitable task
scheduling, with that of a homogeneous symmetric architecture.
Phase-based tuning: better utilized performance asymmetric multicores
The latest trend towards performance asymmetry among cores on a single chip of a multicore processor is posing new challenges. For effective utilization of these performance-asymmetric multicore processors, code sections of a program must be assigned to cores such that the resource needs of code sections closely match resource availability at the assigned core. Determining this assignment manually is tedious, error prone, and significantly complicates software development. To solve this problem, this thesis describes a transparent and fully-automatic process called phase-based tuning which adapts an application to effectively utilize performance-asymmetric multicores. The basic idea behind this technique is to statically compute groups of program segments which are expected to behave similarly at runtime. Then, at runtime, the behavior of a few code segments is used to infer the behavior and preferred core assignment of all similar code segments with low overhead. Compared to the stock Linux scheduler, for systems asymmetric with respect to clock frequency, a 36% average process speedup is observed, while maintaining fairness and with negligible overheads.
A key component of phase-based tuning is grouping program segments with similar behavior. The importance of various similarity metrics is likely to differ for each target asymmetric multicore processor. Determining groups using too many metrics may result in a grouping that differentiates between program segments based on properties that are irrelevant for a target machine. Using too few metrics may cause relevant metrics to be ignored, so that segments with different behavior are considered similar. Therefore, to solve this problem and enable phase-based tuning for a wide range of performance-asymmetric multicores, this thesis also describes a new technique called lazy grouping. Lazy grouping statically (at compile and install times) groups program segments that are expected to have similar behavior. The basic idea is to use extensive compile time analysis with intelligent install time (when the target system is known) group assignment. The accuracy of lazy grouping for a wide range of machines is shown to be more than 90% for nearly all target machines and asymmetric multicores.
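The runtime half of phase-based tuning, as summarized in this abstract, can be sketched as follows. This is an illustration under stated assumptions and not the thesis's implementation: the function name, the use of the first segment as the group's representative, the `measure_ipc` callback, and the IPC threshold are all hypothetical. The point shown is only that one measurement per statically computed group decides the core assignment for every segment in that group.

```python
def assign_cores(groups, measure_ipc, threshold=1.0):
    """Infer a core type for every segment from one sample per group.

    groups: mapping of group id -> list of segment names, assumed to
        come from a static (compile/install time) similarity analysis.
    measure_ipc: hypothetical callable that profiles a single segment
        at runtime and returns a performance metric for it.
    threshold: hypothetical cut-off above which a segment is assumed
        to benefit from a big core.
    """
    assignment = {}
    for segments in groups.values():
        representative = segments[0]  # profile only one segment per group
        core = "big" if measure_ipc(representative) >= threshold else "small"
        for seg in segments:
            assignment[seg] = core    # similar segments inherit the decision
    return assignment
```

Because only one segment per group is profiled, the runtime overhead in this sketch grows with the number of groups and not with the number of segments, which matches the low-overhead motivation stated in the abstract.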
The Advantage of Custom Microprocessors for Stochastic Gradient Descent in Graph-Based Robot Localization and Mapping
Simultaneous Localization and Mapping (SLAM) describes a class of problems facing a large and growing field of autonomous systems -- from self-driving cars, to interplanetary rovers, to home automation products. Unfortunately, this is a complex task where sophisticated algorithms and data structures are required to navigate a wide range of uncharted environments. Furthermore, most mobile robots need to run these tasks in near real time onboard an embedded controller with limited power and compute resources. To address this problem we explore the stochastic gradient descent (SGD) variant of graph solvers for SLAM and observe a tradeoff between various execution architectures and overall execution speed. Based on these observations, we propose a custom multiprocessor design that relaxes memory-coherency constraints between parallel cores while avoiding divergent behavior. We introduce a specialized streaming-tree interconnect that provides increased performance while using fewer resources compared to state-of-the-art GPU/CPU implementations of SGD. Finally, we discuss applications of unconventional architectural paradigms like over-provisioned “dark processors” and specialized data partitioning that provided a unique performance advantage for our particular design.
Heterogeneity-aware scheduling and data partitioning for system performance acceleration
Over the past decade, heterogeneous processors and accelerators have become increasingly prevalent in modern computing systems. Compared with previous homogeneous parallel machines, the hardware heterogeneity in modern systems provides new opportunities and challenges for performance acceleration. Classic operating systems optimisation problems such as task scheduling, and application-specific optimisation techniques such as the adaptive data partitioning of parallel algorithms, are both required to work together to address hardware heterogeneity.
Significant effort has been invested in this problem, but existing work focuses either on a specific type of heterogeneous system or algorithm, or on a high-level framework without insight into how heterogeneity differs between types of system. A general software framework is required, which can not only be adapted to multiple types of systems and workloads, but is also equipped with the techniques to address a variety of hardware heterogeneity.
This thesis presents approaches to design general heterogeneity-aware software frameworks for system performance acceleration. It covers a wide variety of systems, including an OS scheduler targeting on-chip asymmetric multi-core processors (AMPs) on mobile devices, a hierarchical many-core supercomputer and multi-FPGA systems for high performance computing (HPC) centers. Considering heterogeneity from on-chip AMPs, such as thread criticality, core sensitivity, and relative fairness, it suggests a collaboration-based approach to co-design the task selector and core allocator in the OS scheduler. Considering the typical sources of heterogeneity in HPC systems, such as the memory hierarchy, bandwidth limitations and asymmetric physical connection, it proposes an application-specific automatic data partitioning method for a modern supercomputer, and a topological-ranking heuristic-based schedule for a multi-FPGA based reconfigurable cluster.
Experiments on both a full system simulator (GEM5) and real systems (Sunway Taihulight Supercomputer and Xilinx Multi-FPGA based clusters) demonstrate the significant advantages of the suggested approaches compared against the state-of-the-art on a variety of workloads.
"This work is supported by St Leonards 7th Century Scholarship and
Computer Science PhD funding from University of St Andrews; by UK
EPSRC grant Discovery: Pattern Discovery and Program Shaping for Manycore
Systems (EP/P020631/1)." -- Acknowledgement