1,753 research outputs found

    COLAB:A Collaborative Multi-factor Scheduler for Asymmetric Multicore Processors

    Get PDF
    Funding: Partially funded by the UK EPSRC grants Discovery: Pattern Discovery and Program Shaping for Many-core Systems (EP/P020631/1) and ABC: Adaptive Brokerage for Cloud (EP/R010528/1); Royal Academy of Engineering under the Research Fellowship scheme.Increasingly prevalent asymmetric multicore processors (AMP) are necessary for delivering performance in the era of limited power budget and dark silicon. However, the software fails to use them efficiently. OS schedulers, in particular, handle asymmetry only under restricted scenarios. We have efficient symmetric schedulers, efficient asymmetric schedulers for single-threaded workloads, and efficient asymmetric schedulers for single program workloads. What we do not have is a scheduler that can handle all runtime factors affecting AMP for multi-threaded multi-programmed workloads. This paper introduces the first general purpose asymmetry-aware scheduler for multi-threaded multi-programmed workloads. It estimates the performance of each thread on each type of core and identifies communication patterns and bottleneck threads. The scheduler then makes coordinated core assignment and thread selection decisions that still provide each application its fair share of the processor's time. We evaluate our approach using the GEM5 simulator on four distinct big.LITTLE configurations and 26 mixed workloads composed of PARSEC and SPLASH2 benchmarks. Compared to the state-of-the art Linux CFS and AMP-aware schedulers, we demonstrate performance gains of up to 25% and 5% to 15% on average depending on the hardware setup.Postprin

    Improving OpenStack Swift interaction with the I/O stack to enable software defined storage

    Get PDF
    This paper analyses how OpenStack Swift, a distributed object storage service for a globally used middleware, interacts with the I/O subsystem through the Operating System. This interaction, which seems organised and clean on the middleware side, becomes disordered on the device side when using mechanical disk drives, due to the way threads are used internally to request data. We will show that only modifying the Swift threading model we achieve an 18% mean improvement in performance with objects larger than 512 KiB and obtain a similar performance with smaller objects. Compared to the original scenario, the performance obtained on both scenarios is obtained in a fair way: the bandwidth is shared equally between concurrently accessed objects. Moreover, this threading model allows us to apply techniques for Software Defined Storage (SDS). We show an implementation of a Bandwidth Differentiation technique that can control each data stream and that guarantees a high utilization of the device.The research leading to these results has received funding from the European Community under the IOStack (H2020-ICT-2014-7-1) project, by the Spanish Ministry of Economy and Competitiveness under the TIN2015-65316-P grant and by the Catalan Government under the 2014-SGR-1051 grant. To learn more about the IOStack H2020 project, please visit http:nnwww.iostack.eu.Peer ReviewedPostprint (author's final draft

    I/O Schedulers for Proportionality and Stability on Flash-Based SSDs in Multi-Tenant Environments

    Get PDF
    The use of flash based Solid State Drives (SSDs) has expanded rapidly into the cloud computing environment. In cloud computing, ensuring the service level objective (SLO) of each server is the major criterion in designing a system. In particular, eliminating performance interference among virtual machines (VMs) on shared storage is a key challenge. However, studies on SSD performance to guarantee SLO in such environments are limited. In this paper, we present analysis of I/O behavior for a shared SSD as storage in terms of proportionality and stability. We show that performance SLOs of SSD based storage systems being shared by VMs or tasks are not satisfactory. We present and analyze the reasons behind the unexpected behavior through examining the components of SSDs such as channels, DRAM buffer, and Native Command Queuing (NCQ). We introduce two novel SSD-aware host level I/O schedulers on Linux, called A & x002B;CFQ and H & x002B;BFQ, based on our analysis and findings. Through experiments on Linux, we analyze I/O proportionality and stability in multi-tenant environments. In addition, through experiments using real workloads, we analyze the performance interference between workloads on a shared SSD. We then show that the proposed I/O schedulers almost eliminate the interference effect seen in CFQ and BFQ, while still providing I/O proportionality and stability for various I/O weighted scenarios

    Fairness-aware scheduling on single-ISA heterogeneous multi-cores

    Get PDF
    Single-ISA heterogeneous multi-cores consisting of small (e.g., in-order) and big (e.g., out-of-order) cores dramatically improve energy- and power-efficiency by scheduling workloads on the most appropriate core type. A significant body of recent work has focused on improving system throughput through scheduling. However, none of the prior work has looked into fairness. Yet, guaranteeing that all threads make equal progress on heterogeneous multi-cores is of utmost importance for both multi-threaded and multi-program workloads to improve performance and quality-of-service. Furthermore, modern operating systems affinitize workloads to cores (pinned scheduling) which dramatically affects fairness on heterogeneous multi-cores. In this paper, we propose fairness-aware scheduling for single-ISA heterogeneous multi-cores, and explore two flavors for doing so. Equal-time scheduling runs each thread or workload on each core type for an equal fraction of the time, whereas equal-progress scheduling strives at getting equal amounts of work done on each core type. Our experimental results demonstrate an average 14% (and up to 25%) performance improvement over pinned scheduling through fairness-aware scheduling for homogeneous multi-threaded workloads; equal-progress scheduling improves performance by 32% on average for heterogeneous multi-threaded workloads. Further, we report dramatic improvements in fairness over prior scheduling proposals for multi-program workloads, while achieving system throughput comparable to throughput-optimized scheduling, and an average 21% improvement in throughput over pinned scheduling

    Collaborative Heterogeneity-Aware OS Scheduler for Asymmetric Multicore Processors

    Get PDF
    Funding: This work is supported in part by the China Postdoctoral Science Foundation (Grant No. 2020TQ0169), the ShuiMu Tsinghua Scholar fellowship (2019SM131), National Key R&D Program of China (2020AAA0105200), National Natural Science Foundation of China (U20A20226), Beijing Natural Science Foundation (4202031), Beijing Academy of Artificial Intelligence BAAI), the UK EPSRC grants Discovery: Pattern Discovery and Program Shaping for Manycore Systems (EP/P020631/1). This work is also supported by the Royal Academy of Engineering under the Research Fellowship scheme.Asymmetric multicore processors (AMP) offer multiple types of cores under the same programming interface. Extracting the full potential of AMPs requires intelligent scheduling decisions, matching each thread with the right kind of core, the core that will maximize performance or minimize wasted energy for this thread. Existing OS schedulers are not up to this task. While they may handle certain aspects of asymmetry in the system, none can handle all runtime factors affecting AMPs for the general case of multi-threaded multi-programmed workloads. We address this problem by introducing COLAB, a general purpose asymmetry-aware scheduler targeting multi-threaded multi-programmed workloads. It estimates the performance and power of each thread on each type of core and identifies communication patterns and bottleneck threads. With this information, the scheduler makes coordinated core assignment and thread selection decisions that still provide each application its fair share of the processor’s time. We evaluate our approach using both the GEM5 simulator on four distinct big.LITTLE configurations and a development board with ARM Cortex-A73/A53 processors and mixed workloads composed of PARSEC and SPLASH2 benchmarks. Compared to the state-of-the art Linux CFS and AMP-aware schedulers, we demonstrate performance gains of up to 25% and 5% to 15% on average,together with an average 5% energy saving depending on the hardware setup.PostprintPeer reviewe

    Planificación consciente de la contención y gestión de recursos en arquitecturas multicore emergentes

    Get PDF
    Tesis inédita de la Universidad Complutense de Madrid, Facultad de Informática, Departamento de Arquitectura de Computadores y Automática, leída el 14-12-2021Chip multicore processors (CMPs) currently constitute the architecture of choice for mosto general-pùrpose computing systems, and they will likely continue to be dominant in the near future. Advances in technology have enabled to pack an increasing number of cores and bigger caches on the same chip. Nevertheless, contention on shared resources on CMPs -present since the advent of these architectures- still poses a big challenge. Cores in a CMP typically share a last-level cache (LLC) and other memory-related resources with the remaining cores, such as a DRAM controller and an interconnection network. This causes that co-running applications may intensively compete with each other for these shared resources, leading to substantial and uneven performance degradation...Los procesadores multinúcleo o CMPs (Chip Multicore Processors) son actualmente la arquitectura más usada por la mayoría de sistemas de computación de propósito general, y muy probablemente se mantendrían en esa posición dominante en el futuro cercano. Los avances tecnológicos han permitido integrar progresivamente en el mismo chip más cores y aumentar los tamaños de los distintos niveles de cache. No obstante, la contención de recursos compartidos en CMPs {presente desde la aparición de estas arquitecturas{ todavía representa un reto importante que afrontar. Los cores en un CMP comparten en la mayor parte de los diseños una cache de último nivel o LLC (Last-Level Cache) y otros recursos, como el controlador de DRAM o una red de interconexión. La existencia de dichos recursos compartidos provoca en ocasiones que cuando se ejecutan dos o más aplicaciones simultáneamente en el sistema, se produzca una degradación sustancial y potencialmente desigual del rendimiento entre aplicaciones...Fac. de InformáticaTRUEunpu
    corecore