
    Improving HPC system throughput and response time using memory disaggregation

    HPC clusters are cost-effective, well understood, and scalable, but the rigid boundaries between compute nodes may lead to poor utilization of compute and memory resources. HPC jobs may vary, by orders of magnitude, in memory consumption per core. Thus, even when the system is provisioned to accommodate both normal and large-capacity nodes, a mismatch between the system and the memory demands of the scheduled jobs can lead to inefficient usage of both memory and compute resources. Disaggregated memory has recently been proposed as a way to mitigate this problem by flexibly allocating memory capacity across cluster nodes. This paper presents a simulation approach for at-scale evaluation of job schedulers with disaggregated memories and introduces a new disaggregation-aware job allocation policy for the Slurm resource manager. Our results show that with disaggregated memories, depending on the imbalance between the system and the submitted jobs, similar throughput and job response time can be achieved on a system with up to 33% less total memory provisioning.

    This work is part of a project that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 754337 (EuroEXA); it has been supported by the Spanish Ministry of Science and Innovation (project TIN2015-65316-P and Ramon y Cajal fellowship RYC2018-025628-I), Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and the Severo Ochoa Programme (SEV-2015-0493).
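    A minimal sketch of what such a disaggregation-aware allocation check could look like is shown below (hypothetical names and policy, not the paper's actual Slurm plugin): a job fits on a node if its memory request is covered by the node's free local memory plus whatever can be borrowed from a shared rack-level pool.

```python
# Hypothetical sketch of a disaggregation-aware allocation check.
# Names and policy are illustrative only, not the paper's Slurm interface.
from dataclasses import dataclass

@dataclass
class Node:
    free_cores: int
    free_local_mem_gb: float

@dataclass
class MemoryPool:
    free_gb: float

def can_place(job_cores: int, job_mem_gb: float,
              node: Node, pool: MemoryPool) -> bool:
    """True if the job fits using local memory first, then pool memory."""
    if job_cores > node.free_cores:
        return False
    spill = max(0.0, job_mem_gb - node.free_local_mem_gb)
    return spill <= pool.free_gb

def place(job_cores: int, job_mem_gb: float,
          node: Node, pool: MemoryPool) -> None:
    """Commit the allocation, borrowing any shortfall from the rack-level pool."""
    assert can_place(job_cores, job_mem_gb, node, pool)
    node.free_cores -= job_cores
    local = min(job_mem_gb, node.free_local_mem_gb)
    node.free_local_mem_gb -= local
    pool.free_gb -= (job_mem_gb - local)

# Example: an 8-core, 96 GB job on a node with 64 GB free locally.
node = Node(free_cores=16, free_local_mem_gb=64.0)
pool = MemoryPool(free_gb=256.0)
if can_place(8, 96.0, node, pool):
    place(8, 96.0, node, pool)   # borrows 32 GB from the pool
```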

    A Quantitative Approach for Adopting Disaggregated Memory in HPC Systems

    Memory disaggregation has recently been adopted in data centers to improve resource utilization, motivated by cost and sustainability. Recent studies on large-scale HPC facilities have also highlighted memory underutilization. A promising and non-disruptive option for memory disaggregation is rack-scale memory pooling, where shared memory pools supplement node-local memory. This work outlines the prospects and requirements for adoption and clarifies several misconceptions. We propose a quantitative method for dissecting application requirements on the memory system from the top down in three levels, moving from general, to multi-tier memory systems, and then to memory pooling. We provide a multi-level profiling tool and LBench to facilitate the quantitative approach. We evaluate a set of representative HPC workloads on an emulated platform. Our results show that prefetching activities can significantly influence memory traffic profiles. Interference in memory pooling has varied impacts on applications, depending on their access ratios to memory tiers and arithmetic intensities. Finally, in two case studies, we show the benefits of our findings at the application and system levels, achieving a 50% reduction in remote accesses and a 13% speedup in BFS, and reducing performance variation of co-located workloads in interference-aware job scheduling.

    Comment: Accepted to SC23 (The International Conference for High Performance Computing, Networking, Storage, and Analysis, 2023)
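    As a rough illustration of the kind of top-down dissection described here (assumed metric names and thresholds, not the paper's profiling tool or LBench), a workload's sensitivity to pool interference can be judged from its remote-access ratio and arithmetic intensity:

```python
# Illustrative sketch: classify a workload's sensitivity to memory-pool
# interference from two profile numbers, the fraction of traffic that
# goes to the remote/pooled tier and the arithmetic intensity.
def pool_sensitivity(remote_bytes: float, local_bytes: float,
                     flops: float) -> str:
    total_bytes = remote_bytes + local_bytes
    if total_bytes == 0:
        return "compute-bound"
    remote_ratio = remote_bytes / total_bytes    # share of traffic hitting the pool
    arithmetic_intensity = flops / total_bytes   # flops per byte moved
    if arithmetic_intensity > 10.0:              # threshold chosen for illustration
        return "compute-bound"
    return "pool-sensitive" if remote_ratio > 0.3 else "local-bound"

print(pool_sensitivity(remote_bytes=4e9, local_bytes=6e9, flops=2e9))
# -> "pool-sensitive": low intensity and 40% of traffic goes remote
```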

    Topology-aware GPU scheduling for learning workloads in cloud environments

    Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud, are enabling deep learning in various domains including health care, autonomous vehicles, and Internet of Things. Multi-GPU systems exhibit complex connectivity among GPUs and between GPUs and CPUs. Workload schedulers must consider hardware topology and workload communication requirements in order to allocate CPU and GPU resources for optimal execution time and improved utilization in shared cloud environments. This paper presents a new topology-aware workload placement strategy to schedule deep learning jobs on multi-GPU systems. The placement strategy is evaluated with a prototype on a Power8 machine with Tesla P100 cards, showing speedups of up to ≈1.30x compared to state-of-the-art strategies; the proposed algorithm achieves this result by allocating GPUs that satisfy workload requirements while preventing interference. Additionally, a large-scale simulation shows that the proposed strategy provides higher resource utilization and performance in cloud systems.

    This project is supported by the IBM/BSC Technology Center for Supercomputing collaboration agreement. It has also received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 639595). It is also partially supported by the Ministry of Economy of Spain under contract TIN2015-65316-P and Generalitat de Catalunya under contract 2014SGR1051, by the ICREA Academia program, and by the BSC-CNS Severo Ochoa program (SEV-2015-0493). We thank our IBM Research colleagues Alaa Youssef and Asser Tantawi for the valuable discussions. We also thank SC17 committee member Blair Bethwaite of Monash University for his constructive feedback on earlier drafts of this paper.
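    A minimal sketch of the underlying idea of topology-aware placement, under assumed link-distance values and ignoring the paper's interference model, might look like this:

```python
# Hedged sketch of topology-aware GPU placement: given a pairwise
# "distance" matrix encoding the interconnect between GPUs (e.g. 1 for
# NVLink, 2 for same-socket PCIe, 4 for cross-socket), pick the set of
# GPUs whose worst pairwise link is cheapest. Illustrative only; a real
# scheduler would also weigh CPU affinity and current interference.
from itertools import combinations
from typing import Sequence

def best_gpu_set(free_gpus: Sequence[int],
                 distance: Sequence[Sequence[int]],
                 gpus_needed: int) -> tuple:
    def cost(gpu_set: tuple) -> int:
        if len(gpu_set) == 1:
            return 0
        return max(distance[a][b] for a, b in combinations(gpu_set, 2))
    return min(combinations(free_gpus, gpus_needed), key=cost)

# Example: four free GPUs, request two; GPUs 0 and 1 share NVLink.
dist = [[0, 1, 4, 4],
        [1, 0, 4, 4],
        [4, 4, 0, 1],
        [4, 4, 1, 0]]
print(best_gpu_set([0, 1, 2, 3], dist, 2))  # -> (0, 1)
```

    Here the cost of a candidate set is its worst pairwise link; this captures the intuition that tightly connected GPU groups communicate faster than groups spanning sockets.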

    EnergAt: Fine-Grained Energy Attribution for Multi-Tenancy

    In the post-Moore's Law era, relying solely on hardware advancements for automatic performance gains is no longer feasible without increased energy consumption, due to the end of Dennard scaling. Consequently, computing accounts for an increasing amount of global energy usage, contradicting the objective of sustainable computing. The lack of hardware support and the absence of a standardized, software-centric method for the precise tracing of energy provenance exacerbates the issue. Aiming to overcome this challenge, we argue that fine-grained software energy attribution is attainable, even with limited hardware support. To support our position, we present a thread-level, NUMA-aware energy attribution method for CPU and DRAM in multi-tenant environments. The evaluation of our prototype implementation, EnergAt, demonstrates the validity, effectiveness, and robustness of our theoretical model, even in the presence of the noisy-neighbor effect. We envisage a sustainable cloud environment and emphasize the importance of collective efforts to improve software energy efficiency.

    Comment: 8 pages, 4 figures; Published in HotCarbon 2023; Artifact available at https://github.com/HongyuHe/energa
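    A much-simplified sketch of the attribution principle (not EnergAt's actual NUMA-aware model, and with hypothetical inputs standing in for RAPL readings and scheduler statistics) is to split each socket's measured energy among threads in proportion to their CPU time on that socket:

```python
# Simplified sketch of proportional, per-socket energy attribution.
from collections import defaultdict

def attribute_energy(socket_energy_j: dict,
                     thread_cpu_time_s: dict) -> dict:
    """socket_energy_j: socket id -> joules measured over the interval.
    thread_cpu_time_s: (thread id, socket id) -> CPU seconds on that socket.
    Returns thread id -> attributed joules."""
    per_socket_total = defaultdict(float)
    for (_, sock), t in thread_cpu_time_s.items():
        per_socket_total[sock] += t
    attributed = defaultdict(float)
    for (tid, sock), t in thread_cpu_time_s.items():
        if per_socket_total[sock] > 0:
            share = t / per_socket_total[sock]
            attributed[tid] += share * socket_energy_j[sock]
    return dict(attributed)

# Example: two sockets, three threads; thread 2 runs on both sockets.
print(attribute_energy({0: 100.0, 1: 60.0},
                       {(1, 0): 3.0, (2, 0): 1.0, (2, 1): 2.0, (3, 1): 2.0}))
# -> {1: 75.0, 2: 55.0, 3: 30.0}
```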

    Optical Technologies and Control Methods for Scalable Data Centre Networks

    Owing to the increasing adoption of cloud services, video services, and associated machine learning applications, the traffic demand inside data centers is increasing exponentially, which necessitates an innovative networking infrastructure with high scalability and cost-efficiency. As a promising candidate to provide high-capacity, low-latency, cost-effective, and scalable interconnections, optical technologies have been introduced to data center networks (DCNs) for approximately a decade. To further improve DCN performance and meet the increasing traffic demand using photonic technologies, two current trends are a) increasing the bandwidth density of the transmission links and b) maximizing IT and network resource utilization through disaggregated topologies and architectures. Therefore, this PhD thesis focuses on introducing and applying advanced and efficient technologies in these two fields to DCNs to improve their performance.

    On the one hand, at the link level, since traditional single-mode fiber (SMF) solutions based on wavelength division multiplexing (WDM) over the C+L band may fall short in satisfying the capacity, front-panel density, power consumption, and cost requirements of high-performance DCNs, a space division multiplexing (SDM) based DCN using homogeneous multi-core fibers (MCFs) is proposed. With the exploited bi-directional model and the proposed spectrum allocation algorithms, the proposed DCN shows great benefits over the SMF solution in terms of network capacity and spatial efficiency. Meanwhile, it is found that the inter-core crosstalk (IC-XT) between adjacent cores inside the MCF is dynamic rather than static; therefore, the behaviour of the IC-XT is experimentally investigated under different transmission conditions. On the other hand, an optically disaggregated DCN is developed and, to ensure its performance, different architectures, topologies, and resource routing and allocation algorithms are proposed and compared. Compared to the traditional server-based DCN, resource utilization, scalability, and cost-efficiency are significantly improved.
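    To give a flavor of spectrum allocation on a multi-core fiber link (a hypothetical first-fit heuristic, not the thesis's proposed algorithms), one might prefer placements whose spectrum slots are also free on adjacent cores, so as to limit inter-core crosstalk:

```python
# Illustrative first-fit spectrum allocation for a multi-core fiber link.
from typing import Optional

def first_fit(core_slots: list, adjacency: dict,
              demand: int) -> Optional[tuple]:
    """core_slots[c][s] is True if slot s on core c is occupied.
    adjacency maps a core index to its physically adjacent cores.
    Returns (core, start_slot) or None if the demand cannot be placed."""
    n_slots = len(core_slots[0])
    fallback = None
    for core, slots in enumerate(core_slots):
        for start in range(n_slots - demand + 1):
            window = range(start, start + demand)
            if any(slots[s] for s in window):
                continue                      # slots busy on this core
            clash = any(core_slots[nbr][s]
                        for nbr in adjacency[core] for s in window)
            if not clash:
                return core, start            # crosstalk-friendly placement
            fallback = fallback or (core, start)
    return fallback                           # accept some crosstalk if unavoidable
```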

    Lovelock: Towards Smart NIC-hosted Clusters

    Traditional cluster designs were originally server-centric and have evolved recently to support hardware acceleration and storage disaggregation. In applications that leverage acceleration, the server CPU performs the role of orchestrating computation and data movement, and data-intensive applications stress the memory bandwidth. Applications that leverage disaggregation can be adversely affected by the increased PCIe and network bandwidth demands resulting from disaggregation. In this paper, we advocate a specialized cluster design for important data-intensive applications, such as analytics, query processing, and ML training. This design, Lovelock, replaces each server in a cluster with one or more headless smart NICs. Because smart NICs are significantly cheaper than servers per unit of bandwidth, the resulting cluster can run these applications without adversely impacting performance, while obtaining cost and energy savings.

    Spatial and Temporal Cache Sharing Analysis in Tasks

    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016), Timisoara, Romania, February 8-11, 2016.

    Understanding the performance of large-scale multicore systems is crucial for achieving faster execution times and optimizing workload efficiency, but it is becoming harder due to the increased complexity of hardware architectures. Cache sharing is a key component of performance in modern architectures, and it has been the focus of performance analysis tools and techniques in recent years. At the same time, new programming models have been introduced to help the programmer deal with the complexity of large-scale systems, simplifying the coding process and making applications more scalable regardless of resource sharing. Task-based runtime systems are one example of this that has recently become popular. In this work we develop models to tackle performance analysis of shared resources in the task-based context, and for that we study cache sharing in both temporal and spatial terms. For temporal cache sharing, the effect of data reused over time by the executed tasks is modeled to predict different scenarios, resulting in a tool called StatTask. For spatial cache sharing, the effect of tasks competing for the cache at a given point in their execution is quantified and used to model their behavior on arbitrary cache sizes. Finally, we explain how these tools provide a solid platform for improving runtime-system schedulers, maximizing the performance of large-scale task-based applications.

    The work presented in this paper has been partially supported by the EU under the COST programme Action IC1305, ‘Network for Sustainable Ultrascale Computing (NESUS)’, and by the Swedish Research Council, and was carried out within the Linnaeus centre of excellence UPMARC (Uppsala Programming for Multicore Architectures Research Center).
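    The temporal model builds on reuse behavior; a minimal, illustrative reuse-distance calculation (not StatTask itself) shows the kind of statistic such models are built from:

```python
# Minimal sketch of reuse-distance analysis: compute the stack distance of
# each access and estimate the miss ratio of a fully associative LRU cache
# of a given size as the fraction of accesses whose distance exceeds it.
def miss_ratio(trace: list, cache_lines: int) -> float:
    lru = []                       # most recently used address last
    misses = 0
    for addr in trace:
        if addr in lru:
            distance = len(lru) - lru.index(addr) - 1   # lines touched since last use
            lru.remove(addr)
            if distance >= cache_lines:
                misses += 1        # reuse too far apart: capacity miss
        else:
            misses += 1            # cold miss
        lru.append(addr)
    return misses / len(trace) if trace else 0.0

print(miss_ratio([1, 2, 3, 1, 2, 3], cache_lines=3))  # -> 0.5 (only cold misses)
```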

    ACTiCLOUD: Enabling the Next Generation of Cloud Applications

    Despite their proliferation as a dominant computing paradigm, cloud computing systems lack effective mechanisms to manage their vast amounts of resources efficiently. Resources are stranded and fragmented, ultimately limiting cloud systems' applicability to large classes of critical applications that pose substantial resource demands. Eliminating the current technological barriers to actual fluidity and scalability of cloud resources is essential to strengthen cloud computing's role as a critical cornerstone for the digital economy. ACTiCLOUD proposes a novel cloud architecture that breaks the existing scale-up and share-nothing barriers and enables the holistic management of physical resources both at the local cloud site and at distributed levels. Specifically, it advances the cloud resource management stack by extending state-of-the-art hypervisor technology beyond the physical server boundary and the localized cloud management system, providing holistic resource management within a rack, within a site, and across distributed cloud sites. On top of this, ACTiCLOUD will adapt and optimize system libraries and runtimes (e.g., the JVM) as well as ACTiCLOUD-native applications: extremely demanding, critical classes of applications that currently face severe difficulties in matching their resource requirements to state-of-the-art cloud offerings.