296 research outputs found

    Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

    Get PDF
    Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest for low-power high performance computing, this type of architectures is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric--static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, expose that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to its architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency

    POSTER: Exploiting asymmetric multi-core processors with flexible system sofware

    Get PDF
    Energy efficiency has become the main challenge for high performance computing (HPC). The use of mobile asymmetric multi-core architectures to build future multi-core systems is an approach towards energy savings while keeping high performance. However, it is not known yet whether such systems are ready to handle parallel applications. This paper fills this gap by evaluating emerging parallel applications on an asymmetric multi-core. We make use of the PARSEC benchmark suite and a processor that implements the ARM big.LITTLE architecture. We conclude that these applications are not mature enough to run on such systems, as they suffer from load imbalance. Furthermore, we explore the behaviour of dynamic scheduling solutions on either the Operating System (OS) or the runtime level. Comparing these approaches shows us that the most efficient scheduling takes place in the runtime level, influencing the future research towards such solutions.This work has been supported by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project receives funding from the EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 610402 and from the EU's H2020 Framework Programme (H2020/2014-2020) under grant agreement number 671697. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243).Peer ReviewedPostprint (author's final draft

    Simulating Execution Time and Power Consumption of Real-Time Tasks on Embedded Platforms

    Get PDF
    In this paper, we present PARTSim, an open-source power/thermal-aware simulator for embedded real-time systems. This tool is a fork of the well-known RTSim simulator, which can simulate the timing behavior of a set of real-time tasks with various characteristics when running on a multi-processor platform in presence of a number of real-time scheduling policies. PARTSim extends the functionality of RTSim by introducing support for power-aware embedded platforms exhibiting frequency scaling and specific architectural patterns like the ARM big.LITTLE and DynamIQ ones. Experimental results that compare simulated data against execution profiles collected on real platforms show a simulation error under 10 % for both execution time and power consumption at 90th percentile when simulating the effects of DVFS

    Fairness-aware scheduling on single-ISA heterogeneous multi-cores

    Get PDF
    Single-ISA heterogeneous multi-cores consisting of small (e.g., in-order) and big (e.g., out-of-order) cores dramatically improve energy- and power-efficiency by scheduling workloads on the most appropriate core type. A significant body of recent work has focused on improving system throughput through scheduling. However, none of the prior work has looked into fairness. Yet, guaranteeing that all threads make equal progress on heterogeneous multi-cores is of utmost importance for both multi-threaded and multi-program workloads to improve performance and quality-of-service. Furthermore, modern operating systems affinitize workloads to cores (pinned scheduling) which dramatically affects fairness on heterogeneous multi-cores. In this paper, we propose fairness-aware scheduling for single-ISA heterogeneous multi-cores, and explore two flavors for doing so. Equal-time scheduling runs each thread or workload on each core type for an equal fraction of the time, whereas equal-progress scheduling strives at getting equal amounts of work done on each core type. Our experimental results demonstrate an average 14% (and up to 25%) performance improvement over pinned scheduling through fairness-aware scheduling for homogeneous multi-threaded workloads; equal-progress scheduling improves performance by 32% on average for heterogeneous multi-threaded workloads. Further, we report dramatic improvements in fairness over prior scheduling proposals for multi-program workloads, while achieving system throughput comparable to throughput-optimized scheduling, and an average 21% improvement in throughput over pinned scheduling

    An adaptive, utilization-based approach to schedule real-time tasks for ARM big. LITTLE architectures

    Get PDF
    ARM big.LITTLE architectures are spreading more and more in the mobile world thanks to their power-saving capabilities due to the use of two ISA-compatible islands, one focusing on energy efficiency and the other one on computational power. This architecture makes the problem of energy-aware task scheduling particularly challenging, due to the number of variables to take into account and the need for having lightweight mechanisms that can be readily computed in an operating system kernel scheduler. This paper presents a novel task scheduler for big.LITTLE platforms, combining the well-known Constant Bandwidth Server algorithm with a power-aware per-job migration policy. This achieves real-time adaptation of the CPU islands' frequencies based on the individual cores' overall utilization, as available in the scheduler thanks to the use of the resource reservation paradigm. Preliminary results obtained by simulations based on modifications to the open-source RTSim tool show that the proposed technique is able to achieve interesting performance/energy trade-offs
    • …
    corecore