51 research outputs found

    Effects of component-subscription network topology on large-scale data centre performance scaling

    Modern large-scale data centres, such as those used for cloud computing service provision, are becoming ever larger as their operators seek to maximise the benefits of economies of scale. With these increases in size comes a growth in system complexity, which is usually problematic. There is an increased desire for automated "self-star" configuration, management, and failure-recovery of the data-centre infrastructure, but many traditional techniques scale much worse than linearly as the number of nodes to be managed increases. As the number of nodes in a median-sized data-centre looks set to increase by two or three orders of magnitude in coming decades, it seems reasonable to explore and understand the scaling properties of data-centre middleware before such data-centres are constructed. In [1] we presented SPECI, a simulator that predicts aspects of large-scale data-centre middleware performance, concentrating on the influence of status changes such as policy updates or routine node failures. [...]. In [1] we used a first-approximation assumption that such subscriptions are distributed wholly at random across the data centre. In this present paper, we explore the effects of introducing more realistic constraints on the structure of the internal network of subscriptions. We contrast the original results [...] exploring the effects of giving the data-centre's subscription network a regular lattice-like structure, and also semi-random network structures resulting from parameterised network-generation functions that create "small-world" and "scale-free" networks. We show that for distributed middleware topologies, the structure and distribution of tasks carried out in the data centre can significantly influence the performance overhead imposed by the middleware.
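    The subscription-network structures contrasted above can be illustrated with simple generators. A minimal sketch in pure Python (all parameters are illustrative and not taken from the paper):

```python
import random

def ring_lattice(n, k):
    """Regular lattice: each node subscribes to its k nearest ring neighbours."""
    return {i: {(i + j) % n for j in range(1, k // 2 + 1)}
               | {(i - j) % n for j in range(1, k // 2 + 1)}
            for i in range(n)}

def small_world(n, k, p, rng=random):
    """Watts-Strogatz-style network: start from the regular lattice,
    then rewire each subscription to a random node with probability p."""
    net = ring_lattice(n, k)
    for node, subs in net.items():
        for target in list(subs):
            if rng.random() < p:
                subs.discard(target)
                subs.add(rng.randrange(n))  # sketch: ignores self-loops/duplicates
    return net

# p = 0 gives the regular lattice, small p a "small-world" network,
# and p = 1 approaches the wholly random baseline assumed in [1].
net = small_world(1000, 8, 0.1)
```

    Scale-free ("preferential attachment") networks would need a separate generator; the point of the sketch is only that a single rewiring parameter interpolates between the lattice and random extremes the paper compares.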

    Holistic VM Placement for Distributed Parallel Applications in Heterogeneous Clusters

    In a heterogeneous cluster, virtual machine (VM) placement for a distributed parallel application is challenging due to the numerous possible ways of placing the application and the complexity of estimating its performance. This study investigates a holistic VM placement technique for distributed parallel applications in a heterogeneous cluster, aiming to maximize the efficiency of the cluster and consequently reduce costs for service providers and users. The proposed technique accommodates, in a combined manner, various factors that have an impact on performance. First, we analyze the effects of the heterogeneity of resources, different VM configurations, and interference between VMs on the performance of distributed parallel applications with a wide diversity of characteristics, including scientific and big-data analytics applications. We then propose a placement technique that uses a machine learning algorithm to estimate the runtime of a distributed parallel application. To train a performance estimation model, a distributed parallel application is profiled against synthetic workloads that mostly utilize the dominant resource of the application, which strongly affects the application's performance; this reduces the profiling space dramatically. Through experimental and simulation studies, we show that the proposed placement technique can find good VM placement configurations for various workloads.
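    The dominant-resource profiling idea can be caricatured as follows; a hypothetical Python illustration (the paper's actual model, features, and profile format are not specified here, and a 1-nearest-neighbour lookup stands in for the real machine-learning model):

```python
def dominant_resource(profile):
    """Pick the resource the application stresses most, given a utilisation
    profile such as {'cpu': [...], 'mem': [...], 'net': [...]} (hypothetical format)."""
    return max(profile, key=lambda r: sum(profile[r]) / len(profile[r]))

def estimate_runtime(features, training_set):
    """1-nearest-neighbour runtime estimate over profiled synthetic workloads,
    a stand-in for the paper's machine-learning model. training_set is a list
    of (feature_vector, measured_runtime) pairs."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, runtime = min(training_set, key=lambda t: sqdist(t[0], features))
    return runtime
```

    Profiling only against synthetic workloads that stress the dominant resource is what shrinks the training space: one axis of interference is explored instead of all combinations.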

    Analyzing and optimizing the performance and energy efficiency of transactional scientific applications on large-scale NUMA systems with HTM support

    Hardware transactional memory (HTM) is widely supported by commodity processors. While the effectiveness of HTM has been evaluated on small-scale multi-core systems, its performance and energy efficiency for scientific workloads on large-scale NUMA systems, which are increasingly adopted in high-performance computing, remain unquantified. To bridge this gap, this work investigates the performance and energy-efficiency impact of HTM on scientific applications on large-scale NUMA systems. Specifically, we quantify the performance and energy efficiency of HTM for scientific workloads using the widely-used CLOMP-TM benchmark. We then discuss a set of generic software optimizations that effectively improve the performance and energy efficiency of transactional scientific workloads on large-scale NUMA systems. Further, we present case studies in which we apply a set of these performance and energy-efficiency optimizations to representative transactional scientific applications and investigate the potential for high-performance and energy-efficient runtime support.
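    One generic software optimisation of the kind discussed, bounding transactional retries before falling back to a global lock, can be sketched in Python (illustrative only; real HTM uses processor intrinsics such as RTM's `_xbegin`/`_xend`, and the retry bound is a tuning knob):

```python
import threading

class Abort(Exception):
    """Stands in for a hardware transaction abort (e.g., a data conflict)."""

def run_transaction(tx, fallback_lock, max_retries=8):
    """Retry the transaction a bounded number of times, then take the
    fallback lock; max_retries trades abort cost against serialisation."""
    for _ in range(max_retries):
        try:
            return tx()
        except Abort:
            continue  # aborted; retry transactionally
    with fallback_lock:  # serialised fallback path
        return tx()
```

    On NUMA systems the abort cost grows with cross-socket traffic, which is why retry policy and data placement are natural targets for the optimizations the paper discusses.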

    Hap: A heterogeneity-conscious runtime system for adaptive pipeline parallelism

    Heterogeneous multiprocessing (HMP) is a promising solution for energy-efficient computing. While pipeline parallelism is an effective technique for accelerating various workloads (e.g., streaming), relatively little work has investigated efficient runtime support for adaptive pipeline parallelism in the context of HMP. To bridge this gap, we propose a heterogeneity-conscious runtime system for adaptive pipeline parallelism (HAP). HAP dynamically controls the full HMP system resources to improve the energy efficiency of the target pipeline application. We demonstrate that HAP achieves significant energy-efficiency gains over the Linux HMP scheduler and a state-of-the-art runtime system, while incurring a low performance overhead.

    HERTI: A Reinforcement Learning-Augmented System for Efficient Real-Time Inference on Heterogeneous Embedded Systems

    Real-time inference is a key technology enabling a variety of latency-critical intelligent services, such as autonomous driving and augmented reality. Heterogeneous embedded systems, which consist of various computing devices with widely different architectural and system-level characteristics, are emerging as a promising platform for real-time inference. Despite extensive prior work, the design and implementation of a practical system that enables efficient real-time inference on heterogeneous embedded systems remains unexplored. To bridge this gap, we propose HERTI, a reinforcement-learning-augmented system for efficient real-time inference on heterogeneous embedded systems. Through reinforcement learning, HERTI efficiently explores the state space and robustly finds an efficient state that significantly improves the efficiency of the target inference workload while satisfying its deadline constraint. Our quantitative evaluation, conducted on a real heterogeneous embedded system, demonstrates the effectiveness of HERTI: it achieves high inference efficiency in multiple metrics (i.e., energy and energy-delay product) with a strong deadline guarantee, in contrast to the state-of-the-art techniques; delivers larger gains as the inference deadline and the system heterogeneity increase; provides strong generality for hyper-parameter tuning; and significantly reduces the training time through its estimation-based approach, across all the evaluated inference workloads and scenarios.
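    The deadline-constrained exploration HERTI performs can be caricatured by an epsilon-greedy choice over deadline-feasible system states; a hypothetical sketch (names and structure are illustrative, not HERTI's actual implementation):

```python
import random

def choose_state(q_values, meets_deadline, epsilon, rng=random):
    """Epsilon-greedy selection over system states (e.g., device/frequency
    configurations) whose estimated latency satisfies the deadline.
    q_values maps state -> learned value; meets_deadline is a predicate."""
    feasible = [s for s in q_values if meets_deadline(s)]
    if rng.random() < epsilon:
        return rng.choice(feasible)         # explore
    return max(feasible, key=q_values.get)  # exploit the best known state
```

    Filtering to feasible states before selection is one simple way to keep a hard deadline constraint out of the reward signal; HERTI's estimation-based approach additionally avoids measuring every state on hardware.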

    BLPP: Improving the Performance of GPGPUs with Heterogeneous Memory through Bandwidth- and Latency-Aware Page Placement

    GPGPUs with heterogeneous memory have surfaced as a promising solution to improve the programmability and flexibility of GPGPU computing. Despite extensive prior work, relatively little has been done to investigate holistic system-software support for heterogeneity-aware memory management. To bridge this gap, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP dynamically places pages across the heterogeneous memory nodes, preserving the optimal allocation ratio computed from their performance characteristics. Our experimental results show that BLPP considerably outperforms the state-of-the-art technique and performs similarly to the static-best version, which requires extensive offline profiling.
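    A simplified picture of ratio-preserving placement, here proportional to bandwidth alone (BLPP itself also factors in latency and derives the ratio from measured performance characteristics):

```python
def allocation_ratio(bandwidths):
    """Target fraction of pages per memory node, proportional to bandwidth
    (simplified; the real ratio would also account for latency)."""
    total = sum(bandwidths.values())
    return {node: bw / total for node, bw in bandwidths.items()}

def place_pages(n_pages, ratio):
    """Integer page counts preserving the target ratio; the rounding
    remainder goes to the highest-ratio node."""
    counts = {node: int(n_pages * r) for node, r in ratio.items()}
    counts[max(ratio, key=ratio.get)] += n_pages - sum(counts.values())
    return counts
```

    For example, nodes with 300 GB/s and 100 GB/s of bandwidth would receive pages in a 3:1 ratio, so both memories saturate at roughly the same time.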

    RCHC: A Holistic Runtime System for Concurrent Heterogeneous Computing

    Concurrent heterogeneous computing (CHC) is rapidly emerging as a promising solution for high-performance and energy-efficient computing. The fundamental challenges for efficient CHC are how to partition the workload of the target application across the devices in the underlying CHC system and how to control the operating frequency of each device so as to maximize overall efficiency. Despite extensive prior work on system-software techniques for CHC, efficient runtime support that robustly handles both functional and performance heterogeneity without the need for extensive offline profiling remains unexplored. To bridge this gap, we propose RCHC, a holistic runtime system for concurrent heterogeneous computing. RCHC dynamically profiles the target application and constructs performance and power estimation models from the runtime information. Guided by these models, RCHC explores the system state space, determines the best system state, i.e., the one expected to maximize the efficiency of the target application, and executes it accordingly. Our experimental results demonstrate that RCHC significantly outperforms the baseline version that employs the GPU (e.g., 61.0% higher energy efficiency on average) and achieves efficiency comparable to that of the static-best version, which requires extensive offline profiling.
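    Model-guided state selection can be sketched as a search over (partition, frequency) pairs that minimises model-estimated energy; a hypothetical illustration (RCHC's actual state space and estimation models are richer than this):

```python
def best_state(partitions, freqs, est_time, est_power):
    """Choose the (partition, frequency) state minimising estimated
    energy = time x power, where est_time and est_power are the
    estimation models built online from runtime profiling."""
    return min(((p, f) for p in partitions for f in freqs),
               key=lambda s: est_time(*s) * est_power(*s))
```

    Because the models are evaluated analytically rather than by running the application in every state, the search itself is cheap, which is what lets the runtime avoid offline profiling.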

    Quantifying the Performance and Energy-Efficiency Impact of Hardware Transactional Memory on Scientific Applications on Large-Scale NUMA Systems

    Hardware transactional memory (HTM) is supported by widely-used commodity processors. While the effectiveness of HTM has been evaluated on small-scale multi-core systems, its performance and energy efficiency for scientific workloads on large-scale NUMA systems, which are increasingly adopted in high-performance computing, remain unquantified. To bridge this gap, this work investigates the performance and energy-efficiency impact of HTM on scientific applications on large-scale NUMA systems. We first quantify the performance and energy efficiency of HTM for scientific workloads using the widely-used CLOMP-TM benchmark. We then discuss a set of generic software optimizations that can be effectively used to improve the performance and energy efficiency of transactional scientific workloads on large-scale NUMA systems. Finally, we present case studies in which we apply a set of these optimizations to representative transactional scientific applications and significantly optimize their performance and energy efficiency on large-scale NUMA systems.