Improving HPC system throughput and response time using memory disaggregation
© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
HPC clusters are cost-effective, well understood, and scalable, but the rigid boundaries between compute nodes may lead to poor utilization of compute and memory resources. HPC jobs may vary, by orders of magnitude, in memory consumption per core. Thus, even when the system is provisioned to accommodate both normal- and large-capacity nodes, a mismatch between the system and the memory demands of the scheduled jobs can lead to inefficient usage of both memory and compute resources. Disaggregated memory has recently been proposed as a way to mitigate this problem by flexibly allocating memory capacity across cluster nodes. This paper presents a simulation approach for at-scale evaluation of job schedulers with disaggregated memories and introduces a new disaggregation-aware job allocation policy for the Slurm resource manager. Our results show that with disaggregated memories, depending on the imbalance between the system and the submitted jobs, similar throughput and job response time can be achieved on a system with up to 33% less total memory provisioning.
This work is part of a project that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 754337 (EuroEXA); it has been supported by the Spanish Ministry of Science and Innovation (project TIN2015-65316-P and Ramon y Cajal fellowship RYC2018-025628-I), Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and the Severo Ochoa Programme (SEV-2015-0493).
Peer Reviewed. Postprint (author's final draft).
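The disaggregation-aware allocation idea in this abstract can be sketched as a greedy placement rule: serve a job from node-local memory when possible, and borrow only the shortfall from a rack-level pool. This is a minimal illustration with hypothetical names and policy, not the paper's actual Slurm plugin:

```python
def place_job(job_mem, nodes, pool_free):
    """Pick a node for a job that needs `job_mem` GiB of memory.

    `nodes` maps node name -> free local GiB; `pool_free` is the free
    capacity of the shared, rack-scale memory pool. Returns a tuple
    (node, local_GiB, pool_GiB), or None if the job cannot be placed.
    """
    # Best fit on local memory alone: avoids touching the slower pool.
    fits = [(free, name) for name, free in nodes.items() if free >= job_mem]
    if fits:
        _, name = min(fits)
        return name, job_mem, 0
    # Otherwise place on the node with the most local memory and borrow
    # the shortfall from the pool, minimizing remote (pooled) usage.
    free, name = max((free, name) for name, free in nodes.items())
    shortfall = job_mem - free
    if shortfall <= pool_free:
        return name, free, shortfall
    return None  # neither local nor pooled capacity suffices
```

Under such a policy the pool absorbs memory-demand imbalance across jobs, which is what allows total memory provisioning to shrink without hurting throughput.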
A Quantitative Approach for Adopting Disaggregated Memory in HPC Systems
Memory disaggregation has recently been adopted in data centers to improve
resource utilization, motivated by cost and sustainability. Recent studies on
large-scale HPC facilities have also highlighted memory underutilization. A
promising and non-disruptive option for memory disaggregation is rack-scale
memory pooling, where shared memory pools supplement node-local memory. This
work outlines the prospects and requirements for adoption and clarifies several
misconceptions. We propose a quantitative method for dissecting application
requirements on the memory system from the top down in three levels, moving
from general, to multi-tier memory systems, and then to memory pooling. We
provide a multi-level profiling tool and LBench to facilitate the quantitative
approach. We evaluate a set of representative HPC workloads on an emulated
platform. Our results show that prefetching activities can significantly
influence memory traffic profiles. Interference in memory pooling has varied
impacts on applications, depending on their access ratios to memory tiers and
arithmetic intensities. Finally, in two case studies, we show the benefits of
our findings at the application and system levels, achieving 50% reduction in
remote access and 13% speedup in BFS, and reducing performance variation of
co-located workloads in interference-aware job scheduling.
Comment: Accepted to SC23 (The International Conference for High Performance Computing, Networking, Storage, and Analysis 2023).
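The abstract's observation that pooling impact depends on tier access ratios and arithmetic intensity can be illustrated with a toy roofline-style estimate (my own simplification; the bandwidth numbers and harmonic-mean mixing are assumptions, not LBench's model):

```python
def effective_bandwidth(local_bw, pool_bw, remote_ratio):
    """Harmonic-mean bandwidth seen by a workload that serves a fraction
    `remote_ratio` of its memory traffic from a slower memory pool."""
    return 1.0 / ((1.0 - remote_ratio) / local_bw + remote_ratio / pool_bw)

def attainable_gflops(peak_gflops, arithmetic_intensity, bandwidth):
    """Classic roofline bound: a kernel is either compute-bound or
    bandwidth-bound, whichever limit it hits first."""
    return min(peak_gflops, arithmetic_intensity * bandwidth)
```

Even this two-line model shows the effect: a high-intensity kernel stays compute-bound despite heavy pool traffic, while a low-intensity one slows down roughly in proportion to its remote access ratio.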
Topology-aware GPU scheduling for learning workloads in cloud environments
Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud, are enabling deep learning in various domains including health care, autonomous vehicles, and Internet of Things. Multi-GPU systems exhibit complex connectivity among GPUs and between GPUs and CPUs. Workload schedulers must consider hardware topology and workload communication requirements in order to allocate CPU and GPU resources for optimal execution time and improved utilization in shared cloud environments.
This paper presents a new topology-aware workload placement strategy to schedule deep learning jobs on multi-GPU systems. The placement strategy is evaluated with a prototype on a Power8 machine with Tesla P100 cards, showing speedups of up to ≈1.30x compared to state-of-the-art strategies; the proposed algorithm achieves this result by allocating GPUs that satisfy workload requirements while preventing interference. Additionally, a large-scale simulation shows that the proposed strategy provides higher resource utilization and performance in cloud systems.
This project is supported by the IBM/BSC Technology Center for Supercomputing
collaboration agreement. It has also received funding from the European Research Council (ERC) under the European Union’s Horizon
2020 research and innovation programme (grant agreement No 639595). It is
also partially supported by the Ministry of Economy of Spain under contract
TIN2015-65316-P and Generalitat de Catalunya under contract 2014SGR1051,
by the ICREA Academia program, and by the BSC-CNS Severo Ochoa program
(SEV-2015-0493). We thank our IBM Research colleagues Alaa Youssef
and Asser Tantawi for the valuable discussions. We also thank SC17 committee
member Blair Bethwaite of Monash University for his constructive feedback on the earlier drafts of this paper.
Peer Reviewed. Postprint (published version).
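The core of topology-aware placement can be sketched as scoring candidate GPU sets by their worst pairwise link (a hypothetical simplification; the paper's actual algorithm also models CPU-GPU connectivity and interference):

```python
from itertools import combinations

def pick_gpus(k, free_gpus, bw):
    """Choose k (k >= 2) free GPUs that maximize the minimum pairwise
    link bandwidth, so the job's collective traffic avoids slow links.

    `bw[(a, b)]` with a < b gives link bandwidth in GB/s, e.g. NVLink
    pairs faster than pairs crossing PCIe or the CPU interconnect."""
    def min_bw(gpus):
        return min(bw[tuple(sorted(pair))] for pair in combinations(gpus, 2))
    return max(combinations(free_gpus, k), key=min_bw)
```

For example, on a node where GPUs 0-1 and 2-3 form NVLink pairs and all other pairs cross PCIe, a 2-GPU job would land on one of the NVLink pairs rather than straddling the slow links.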
EnergAt: Fine-Grained Energy Attribution for Multi-Tenancy
In the post-Moore's Law era, relying solely on hardware advancements for
automatic performance gains is no longer feasible without increased energy
consumption, due to the end of Dennard scaling. Consequently, computing
accounts for an increasing amount of global energy usage, contradicting the
objective of sustainable computing. The lack of hardware support and the
absence of a standardized, software-centric method for the precise tracing of
energy provenance exacerbate the issue. Aiming to overcome this challenge, we
argue that fine-grained software energy attribution is attainable, even with
limited hardware support. To support our position, we present a thread-level,
NUMA-aware energy attribution method for CPU and DRAM in multi-tenant
environments. The evaluation of our prototype implementation, EnergAt,
demonstrates the validity, effectiveness, and robustness of our theoretical
model, even in the presence of the noisy-neighbor effect. We envisage a
sustainable cloud environment and emphasize the importance of collective
efforts to improve software energy efficiency.
Comment: 8 pages, 4 figures; Published in HotCarbon 2023; Artifact available at https://github.com/HongyuHe/energa
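EnergAt's high-level idea, attributing measured per-socket energy to the threads that actually ran there, can be sketched as a pure time-share model (hypothetical names; the real method additionally weighs DRAM accesses and NUMA locality):

```python
def attribute_energy(socket_energy, thread_time, total_time):
    """Apportion each NUMA socket's measured energy (joules) to threads
    in proportion to the CPU time each thread spent on that socket.

    socket_energy: {socket: joules measured, e.g. via RAPL}
    thread_time:   {tid: {socket: busy seconds of that thread there}}
    total_time:    {socket: total busy seconds across all threads}
    """
    return {
        tid: sum(socket_energy[s] * t / total_time[s]
                 for s, t in per_socket.items() if total_time[s] > 0)
        for tid, per_socket in thread_time.items()
    }
```

A thread that migrates across sockets accumulates shares from each, which is what makes the attribution NUMA-aware rather than a flat per-core split.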
QoS-aware mechanisms for improving cost-efficiency of datacenters
Warehouse Scale Computers (WSCs) promise high cost-efficiency by amortizing power, cooling, and management overheads. WSCs today host a large variety of jobs with two broad categories of performance requirements: latency-critical (LC) and best-effort (BE). Ideally, to fully utilize all hardware resources, WSC operators could simply fill all the nodes with computing jobs. Unfortunately, because colocated jobs contend for shared resources, systems with high loads often experience performance degradation, which negatively impacts the Quality of Service (QoS) of LC jobs. In fact, service providers usually over-provision resources to avoid any interference with LC jobs, leading to significant resource inefficiencies. In this dissertation, I explore opportunities across different system-abstraction layers to improve the cost-efficiency of datacenters by increasing resource utilization of WSCs with little or no impact on the performance of LC jobs. The dissertation has three main components. First, I explore opportunities to improve the throughput of multicore systems by reducing the performance variation of LC jobs. The main insight is that by reshaping the latency distribution curve, the performance headroom of LC jobs can be effectively converted into improved BE throughput. I develop, implement, and evaluate a runtime system that achieves this goal with existing hardware, leveraging the cache partitioning, per-core frequency scaling, and thread masking features of server processors. Evaluation results show the proposed solution enables 30% higher system throughput than solutions proposed in prior work while maintaining at least as good QoS for LC jobs. Second, I study resource contention in near-future heterogeneous memory architectures (HMAs). This study is motivated by recent developments in non-volatile memory (NVM) technologies, which enable higher storage density at the cost of some performance.
To understand the performance and QoS impact of HMAs, I design and implement a performance emulator in the Linux kernel that runs unmodified workloads with high accuracy, low overhead, and complete transparency. I further propose and evaluate multiple data and resource management QoS mechanisms, such as locality-aware page admission, occupancy management, and write-buffer jailing. Third, I focus on accelerated machine learning (ML) systems. By profiling the performance of production workloads and accelerators, I show that accelerated ML tasks are highly sensitive to main-memory interference due to fine-grained interaction between CPU and accelerator tasks. As a result, memory resource contention can significantly decrease the performance and efficiency gains of accelerators. I propose a runtime system that leverages existing hardware capabilities and shows 17% higher system efficiency compared to previous approaches. This study further exposes opportunities for future processor architectures.
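The first component's insight, converting LC latency headroom into BE throughput, boils down to a feedback loop over shared-resource shares. A minimal sketch with hypothetical thresholds and a single abstract knob (the dissertation itself drives cache ways, per-core frequency, and thread masking):

```python
def adjust_be_share(lc_tail_ms, slo_ms, be_share, step=0.1):
    """One step of a feedback controller: grow the best-effort (BE)
    resource share while the latency-critical (LC) tail latency has
    headroom under its SLO, and back off quickly on a violation."""
    if lc_tail_ms > slo_ms:           # SLO violated: retreat aggressively
        return max(0.0, be_share - 2 * step)
    if lc_tail_ms < 0.8 * slo_ms:     # ample headroom: grant BE more
        return min(1.0, be_share + step)
    return be_share                   # near the SLO: hold steady
```

The asymmetry (slow growth, fast retreat) is the usual design choice here: an SLO violation is costly, while leaving a little BE throughput on the table is not.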
Optical Technologies and Control Methods for Scalable Data Centre Networks
Owing to the increasing adoption of cloud services, video services, and associated machine learning applications, the traffic demand inside data centers is increasing exponentially, which necessitates an innovative networking infrastructure with high scalability and cost-efficiency. As a promising candidate to provide high-capacity, low-latency, cost-effective, and scalable interconnections, optical technologies have been introduced to data center networks (DCNs) for approximately a decade. To further improve DCN performance to meet the increasing traffic demand by using photonic technologies, two current trends are a) increasing the bandwidth density of the transmission links and b) maximizing IT and network resource utilization through disaggregated topologies and architectures. Therefore, this PhD thesis focuses on introducing and applying advanced and efficient technologies in these two fields to DCNs to improve their performance. On the one hand, at the link level, since the traditional single-mode fiber (SMF) solutions based on wavelength division multiplexing (WDM) over the C+L band may fall short in satisfying the capacity, front-panel density, power consumption, and cost requirements of high-performance DCNs, a space division multiplexing (SDM) based DCN using homogeneous multi-core fibers (MCFs) is proposed. With the exploited bi-directional model and the proposed spectrum allocation algorithms, the proposed DCN shows great benefits over the SMF solution in terms of network capacity and spatial efficiency. Meanwhile, it is found that the inter-core crosstalk (IC-XT) between adjacent cores inside the MCF is dynamic rather than static; therefore, the behaviour of the IC-XT is experimentally investigated under different transmission conditions. On the other hand, an optically disaggregated DCN is developed and, to ensure its performance, different architectures, topologies, and resource routing and allocation algorithms are proposed and compared.
Compared to the traditional server-based DCN, the resource utilization, scalability, and cost-efficiency are significantly improved.
Lovelock: Towards Smart NIC-hosted Clusters
Traditional cluster designs were originally server-centric, and have evolved
recently to support hardware acceleration and storage disaggregation. In
applications that leverage acceleration, the server CPU performs the role of
orchestrating computation and data movement, while data-intensive applications
stress the memory bandwidth. Applications that leverage disaggregation can be
adversely affected by the increased PCIe and network bandwidth demands resulting
from disaggregation. In this paper, we advocate for a specialized cluster design
for important data-intensive applications, such as analytics, query processing,
and ML training. This design, Lovelock, replaces each server in a cluster with
one or more headless smart NICs. Because smart NICs are significantly cheaper
than servers per unit of bandwidth, the resulting cluster can run these
applications without adversely impacting performance, while obtaining cost and
energy savings.
Spatial and Temporal Cache Sharing Analysis in Tasks
Proceedings of the First PhD Symposium on Sustainable Ultrascale
Computing Systems (NESUS PhD 2016), Timisoara, Romania, February 8-11, 2016.
Understanding the performance of large-scale multicore systems is crucial for achieving faster execution times
and optimizing workload efficiency, but it is becoming harder due to the increased complexity of hardware
architectures. Cache sharing is a key component for performance in modern architectures, and it has been
the focus of performance analysis tools and techniques in recent years. At the same time, new programming
models have been introduced to aid the programmer dealing with the complexity of large scale systems,
simplifying the coding process and making applications more scalable regardless of resource sharing. Task-based
runtime systems are one example of this that has become popular recently. In this work we develop models
to tackle performance analysis of shared resources in the task-based context, and for that we study cache
sharing both in temporal and spatial ways. In temporal cache sharing, the effect of data reused over time by
the tasks executed is modeled to predict different scenarios resulting in a tool called StatTask. In spatial
cache sharing, the effect of tasks fighting for the cache at a given point in time through their execution is
quantified and used to model their behavior on arbitrary cache sizes. Finally, we explain how these tools
set up a unique and solid platform to improve runtime systems schedulers, maximizing performance of
execution of large-scale task-based applications.
European Cooperation in Science and Technology (COST). The work presented in this paper has been partially supported by the EU under the COST programme Action IC1305, ‘Network for Sustainable Ultrascale Computing (NESUS)’, and by the Swedish Research Council, and carried out within the Linnaeus centre of excellence UPMARC (Uppsala Programming for Multicore Architectures Research Center).
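The quantity underlying such statistical cache models is the LRU stack (reuse) distance: an access hits in a fully associative LRU cache of C lines exactly when fewer than C distinct lines were touched since the previous access to the same line. A brute-force reference version (illustrative only; StatTask works from sampled reuse statistics rather than full traces):

```python
def miss_ratio(trace, cache_lines):
    """Exact miss ratio of a fully associative LRU cache, computed by
    replaying a reference trace and measuring stack distances."""
    stack, misses = [], 0          # `stack` holds lines in LRU order
    for line in trace:
        if line in stack:
            # Distinct lines touched since the last access to `line`.
            depth = len(stack) - 1 - stack.index(line)
            if depth >= cache_lines:
                misses += 1
            stack.remove(line)
        else:
            misses += 1            # cold (compulsory) miss
        stack.append(line)         # move to most-recently-used position
    return misses / len(trace)
```

A task-interleaving model then only has to predict how co-running tasks stretch each other's reuse distances to estimate the shared-cache miss ratio.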
ACTiCLOUD: Enabling the Next Generation of Cloud Applications
Despite their proliferation as a dominant computing paradigm, cloud computing systems lack effective mechanisms to manage their vast amounts of resources efficiently. Resources are stranded and fragmented, ultimately limiting cloud systems' applicability to large classes of critical applications that pose non-moderate resource demands. Eliminating current technological barriers to actual fluidity and scalability of cloud resources is essential to strengthen cloud computing's role as a critical cornerstone of the digital economy. ACTiCLOUD proposes a novel cloud architecture that breaks the existing scale-up and share-nothing barriers and enables the holistic management of physical resources both at the local cloud site and at distributed levels. Specifically, it advances the cloud resource management stack by extending state-of-the-art hypervisor technology beyond the physical server boundary and the localized cloud management system, providing holistic resource management within a rack, within a site, and across distributed cloud sites. On top of this, ACTiCLOUD will adapt and optimize system libraries and runtimes (e.g., the JVM), as well as ACTiCLOUD-native applications, which are extremely demanding, critical classes of applications that currently face severe difficulties in matching their resource requirements to state-of-the-art cloud offerings.