138 research outputs found

    Coexisting scheduling policies boosting I/O Virtual Machines

    Get PDF
    Abstract. Deploying multiple Virtual Machines (VMs) running various types of workloads on current many-core cloud computing infrastructures raises an important issue: The Virtual Machine Monitor (VMM) has to efficiently multiplex VM accesses to the hardware. We argue that altering the scheduling concept can optimize the system’s overall performance. Currently, the Xen VMM achieves near native performance multiplexing VMs with homogeneous workloads. Yet having a mixture of VMs with different types of workloads running concurrently, it leads to poor I/O performance. Taking into account the complexity of the design and implementation of a universal scheduler, let alone the probability of being fruitless, we focus on a system with multiple scheduling policies that coexist and service VMs according to their workload characteristics. Thus, VMs can benefit from various schedulers, either existing or new, that are optimal for each specific case. In this paper, we design a framework that provides three basic coexisting scheduling policies and implement it in the Xen paravirtualized environment. Evaluating our prototype we experience 2.3 times faster I/O service and link saturation, while the CPU-intensive VMs achieve more than 80 % of current performance.

    Fast and Cost-Effective Online Load-Balancing in Distributed Range-Queriable Systems

    Full text link

    CoCoPeLia: Communication-Computation Overlap Prediction for Efficient Linear Algebra on GPUs

    Get PDF
    Graphics Processing Units (GPUs) are well established in HPC systems and frequently used to accelerate linear algebra routines. Since data transfers pose a severe bottleneck for GPU offloading, modern GPUs provide the ability to overlap communication with computation by splitting the problem to fine-grained sub-kernels that are executed in a pipelined manner. This optimization is currently underutilized by GPU BLAS libraries, since it requires an approach to select an efficient tiling size, which in turn leads to a challenging problem that needs to consider routine, system, data, and problem-specific characteristics. In this work, we introduce an elaborate 3-way concurrency model for GPU BLAS offload time that considers previously neglected features regarding data access and machine behavior. We then incorporate our model in an automated, end-to-end framework (called CoCoPeLia) that supports overlap prediction, tile selection and effective tile scheduling. We validate our model's efficacy for dgemm, sgemm, and daxpy on two testbeds, with our experimental results showing that it achieves significantly lower prediction error than previous models and provides near-optimal tiling sizes for all problems. We also demonstrate that CoCoPeLia leads to considerable performance improvements compared to the state of the art BLAS routine implementations for GPUs

    PARALiA: a performance aware runtime for auto-tuning linear algebra on heterogeneous systems

    Get PDF
    Dense linear algebra operations appear very frequently in high-performance computing (HPC) applications, rendering their performance crucial to achieve optimal scalability. As many modern HPC clusters contain multi-GPU nodes, BLAS operations are frequently offloaded on GPUs, necessitating the use of optimized libraries to ensure good performance. Unfortunately, multi-GPU systems are accompanied by two significant optimization challenges: data transfer bottlenecks as well as problem splitting and scheduling in multiple workers (GPUs) with distinct memories. We demonstrate that the current multi-GPU BLAS methods for tackling these challenges target very specific problem and data characteristics, resulting in serious performance degradation for any slightly deviating workload. Additionally, an even more critical decision is omitted because it cannot be addressed using current scheduler-based approaches: the determination of which devices should be used for a certain routine invocation. To address these issues we propose a model-based approach: using performance estimation to provide problem-specific autotuning during runtime. We integrate this autotuning into an end-to-end BLAS framework named PARALiA. This framework couples autotuning with an optimized task scheduler, leading to near-optimal data distribution and performance-aware resource utilization. We evaluate PARALiA in an HPC testbed with 8 NVIDIA-V100 GPUs, improving the average performance of GEMM by 1.7× and energy efficiency by 2.5× over the state-of-the-art in a large and diverse dataset and demonstrating the adaptability of our performance-aware approach to future heterogeneous systems
    • …
    corecore