51 research outputs found
Effects of component-subscription network topology on large-scale data centre performance scaling
Modern large-scale data centres, such as those used for cloud computing
service provision, are becoming ever-larger as the operators of those data
centres seek to maximise the benefits from economies of scale. With these
increases in size comes a growth in system complexity, which is usually
problematic. There is an increased desire for automated "self-star"
configuration, management, and failure-recovery of the data-centre
infrastructure, but many traditional techniques scale much worse than linearly
as the number of nodes to be managed increases. As the number of nodes in a
median-sized data-centre looks set to increase by two or three orders of
magnitude in coming decades, it seems reasonable to attempt to explore and
understand the scaling properties of the data-centre middleware before such
data-centres are constructed. In [1] we presented SPECI, a simulator that
predicts aspects of large-scale data-centre middleware performance,
concentrating on the influence of status changes such as policy updates or
routine node failures. [...]. In [1] we used a first-approximation assumption
that such subscriptions are distributed wholly at random across the data
centre. In this present paper, we explore the effects of introducing more
realistic constraints to the structure of the internal network of
subscriptions. We contrast the original results [...] exploring the effects of
making the data-centre's subscription network have a regular lattice-like
structure, and also semi-random network structures resulting from parameterised
network generation functions that create "small-world" and "scale-free"
networks. We show that for distributed middleware topologies, the structure and
distribution of tasks carried out in the data centre can significantly
influence the performance overhead imposed by the middleware.
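The three subscription-network topologies the paper compares (regular lattice, small-world, scale-free) can be illustrated with standard generation procedures. The sketch below follows the Watts-Strogatz and Barabási-Albert styles named in the abstract; it is an illustrative reconstruction, not SPECI's actual implementation, and treats subscriptions as directed edges.

```python
import random

def ring_lattice(n, k):
    """Regular lattice: each node subscribes to its k nearest ring neighbours."""
    edges = set()
    for i in range(n):
        for j in range(1, k // 2 + 1):
            edges.add((i, (i + j) % n))
    return edges

def small_world(n, k, p, rng):
    """Watts-Strogatz style: start from a ring lattice and rewire each
    edge's target with probability p, creating shortcut subscriptions."""
    edges = set(ring_lattice(n, k))
    for (u, v) in list(edges):
        if rng.random() < p:
            w = rng.randrange(n)
            if w != u and (u, w) not in edges:
                edges.discard((u, v))
                edges.add((u, w))
    return edges

def scale_free(n, m, rng):
    """Barabasi-Albert style preferential attachment: each new node
    subscribes to m existing nodes, favouring highly-subscribed ones."""
    targets = list(range(m))
    repeated = []  # each node appears once per incident edge
    edges = set()
    for u in range(m, n):
        for v in set(targets):
            edges.add((u, v))
            repeated.extend([u, v])
        targets = [rng.choice(repeated) for _ in range(m)]
    return edges
```

Sweeping the rewiring probability p (small-world) and attachment count m (scale-free) gives the parameterised families of subscription networks whose middleware overhead the paper contrasts against the wholly random case.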
Holistic VM Placement for Distributed Parallel Applications in Heterogeneous Clusters
In a heterogeneous cluster, virtual machine (VM) placement for a distributed parallel application is challenging due to the numerous possible ways of placing the application and the complexity of estimating its performance. This study investigates a holistic VM placement technique for distributed parallel applications in a heterogeneous cluster, aiming to maximize the efficiency of the cluster and consequently reduce costs for service providers and users. The proposed technique accommodates, in a combined manner, the various factors that affect performance. First, we analyze the effects of resource heterogeneity, different VM configurations, and interference between VMs on the performance of distributed parallel applications with a wide diversity of characteristics, including scientific and big-data analytics applications. We then propose a placement technique that uses a machine learning algorithm to estimate the runtime of a distributed parallel application. To train the performance estimation model, a distributed parallel application is profiled against synthetic workloads that mostly utilize the application's dominant resource, which strongly affects its performance; this reduces the profiling space dramatically. Through experimental and simulation studies, we show that the proposed technique finds good VM placement configurations for various workloads.
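The profiling idea above, matching an application's resource profile against synthetic workloads that stress its dominant resource, can be sketched as a nearest-neighbour lookup. This is an illustrative stand-in for the paper's machine-learning estimator; the feature names are hypothetical.

```python
def dominant_resource(profile):
    """Return the resource with the highest normalized demand; profiling
    then only needs to sweep synthetic workloads stressing this resource."""
    return max(profile, key=profile.get)

def estimate_runtime(profile, samples):
    """Nearest-neighbour runtime estimate over profiled synthetic workloads.
    profile: dict of resource-usage features for the application
    samples: list of (feature_dict, runtime_seconds) pairs from profiling runs
    """
    def dist(a, b):
        keys = a.keys() | b.keys()
        return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys) ** 0.5
    return min(samples, key=lambda s: dist(profile, s[0]))[1]
```

Restricting the sample set to workloads that stress the dominant resource is what shrinks the profiling space: features for the other resources vary little and need not be swept.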
Analyzing and optimizing the performance and energy efficiency of transactional scientific applications on large-scale NUMA systems with HTM support
Hardware transactional memory (HTM) is widely supported by commodity processors. While the effectiveness of HTM has been evaluated on small-scale multi-core systems, the performance and energy efficiency of HTM for scientific workloads on large-scale NUMA systems, which are increasingly adopted in high-performance computing, remain unquantified. To bridge this gap, this work investigates the performance and energy-efficiency impact of HTM on scientific applications on large-scale NUMA systems. Specifically, we quantify the performance and energy efficiency of HTM for scientific workloads using the widely-used CLOMP-TM benchmark. We then discuss a set of generic software optimizations that effectively improve the performance and energy efficiency of transactional scientific workloads on large-scale NUMA systems. Further, we present case studies in which we apply these optimizations to representative transactional scientific applications and investigate the potential for high-performance, energy-efficient runtime support.
Hap: A heterogeneity-conscious runtime system for adaptive pipeline parallelism
Heterogeneous multiprocessing (HMP) is a promising solution for energy-efficient computing. While pipeline parallelism is an effective technique to accelerate various workloads (e.g., streaming), relatively little work has investigated efficient runtime support for adaptive pipeline parallelism in the context of HMP. To bridge this gap, we propose HAP, a heterogeneity-conscious runtime system for adaptive pipeline parallelism. HAP dynamically controls the full HMP system resources to improve the energy efficiency of the target pipeline application. We demonstrate that HAP achieves significant energy-efficiency gains over the Linux HMP scheduler and a state-of-the-art runtime system, while incurring a low performance overhead.
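A minimal stage-per-thread pipeline of the kind such a runtime manages can be sketched as follows. The big/little core pinning that HAP adapts at runtime is omitted; this only shows the stage-and-queue structure.

```python
import queue
import threading

def run_pipeline(items, stages):
    """Run `stages` (a list of functions) as a thread-per-stage pipeline
    connected by FIFO queues; None is used as an end-of-stream marker."""
    qs = [queue.Queue() for _ in range(len(stages) + 1)]

    def worker(fn, qin, qout):
        while True:
            x = qin.get()
            if x is None:           # propagate end-of-stream and exit
                qout.put(None)
                return
            qout.put(fn(x))

    threads = [threading.Thread(target=worker, args=(fn, qs[i], qs[i + 1]))
               for i, fn in enumerate(stages)]
    for t in threads:
        t.start()
    for x in items:                 # feed the first stage
        qs[0].put(x)
    qs[0].put(None)
    out = []
    while True:                     # drain the last stage
        y = qs[-1].get()
        if y is None:
            break
        out.append(y)
    for t in threads:
        t.join()
    return out
```

Because each stage runs on its own thread, a heterogeneity-conscious runtime can map compute-heavy stages to big cores and light stages to little cores, which is the degree of freedom HAP exploits.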
HERTI: A Reinforcement Learning-Augmented System for Efficient Real-Time Inference on Heterogeneous Embedded Systems
Real-time inference is the key technology that enables a variety of latency-critical intelligent services such as autonomous driving and augmented reality. Heterogeneous embedded systems that consist of various computing devices with widely-different architectural and system-level characteristics are emerging as a promising solution for real-time inference. Despite extensive prior work, the design and implementation of a practical system that enables efficient real-time inference on heterogeneous embedded systems remain unexplored. To bridge this gap, we propose HERTI, a reinforcement learning-augmented system for efficient real-time inference on heterogeneous embedded systems. HERTI efficiently explores the state space and, through reinforcement learning, robustly finds an efficient state that significantly improves the efficiency of the target inference workload while satisfying its deadline constraint. Our quantitative evaluation on a real heterogeneous embedded system demonstrates the effectiveness of HERTI: it achieves high inference efficiency in multiple metrics (i.e., energy and energy-delay product) with a strong deadline guarantee, in contrast to the state-of-the-art techniques; delivers larger gains as the inference deadline and the system heterogeneity increase; provides strong generality for hyper-parameter tuning; and significantly reduces the training time through its estimation-based approach across all the evaluated inference workloads and scenarios.
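The kind of state-space exploration described above can be illustrated with a generic epsilon-greedy search over discrete system states. This is a deliberately simplified sketch; HERTI's actual agent, state encoding, and reward (which incorporates the deadline constraint) are richer than the toy reward used here.

```python
import random

def epsilon_greedy_search(states, reward, episodes, eps, rng):
    """Explore a random state with probability eps, otherwise exploit the
    best state found so far; keep a running value estimate per state."""
    q = {}
    best = None
    for _ in range(episodes):
        if best is None or rng.random() < eps:
            s = rng.choice(states)      # explore
        else:
            s = best                    # exploit
        r = reward(s)                   # e.g., efficiency if deadline met
        q[s] = q.get(s, 0.0) + 0.5 * (r - q.get(s, 0.0))
        best = max(q, key=q.get)
    return best
```

In a real system each state would encode a device/frequency configuration and each reward evaluation would run (or, as in HERTI's estimation-based approach, estimate) an inference under that configuration, which is why reducing the number of reward evaluations cuts training time.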
BLPP: Improving the Performance of GPGPUs with Heterogeneous Memory through Bandwidth- and Latency-Aware Page Placement
GPGPUs with heterogeneous memory have surfaced as a promising solution to improve the programmability and flexibility of GPGPU computing. Despite extensive prior work, relatively little has been done to investigate holistic system-software support for heterogeneity-aware memory management. To bridge this gap, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP dynamically places pages across the heterogeneous memory nodes, preserving the optimal allocation ratio computed from their performance characteristics. Our experimental results show that BLPP considerably outperforms the state-of-the-art technique and performs similarly to the static-best version, which requires extensive offline profiling.
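The core placement rule, preserving a target allocation ratio across memory nodes derived from their measured performance, can be sketched as below. The node names and the pure-bandwidth ratio are illustrative assumptions; BLPP's actual ratio also accounts for latency.

```python
def allocation_ratio(bandwidths):
    """Split pages across memory nodes in proportion to measured bandwidth,
    a first-order heuristic for bandwidth-bound workloads."""
    total = sum(bandwidths.values())
    return {node: bw / total for node, bw in bandwidths.items()}

def place_pages(num_pages, ratio):
    """Assign per-node page counts that preserve the target ratio,
    giving any rounding remainder to the highest-ratio node."""
    counts = {node: int(num_pages * r) for node, r in ratio.items()}
    leftover = num_pages - sum(counts.values())
    counts[max(ratio, key=ratio.get)] += leftover
    return counts
```

A dynamic policy in this style re-evaluates the ratio as pages are allocated, so the aggregate placement tracks the target split without offline profiling.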
RCHC: A Holistic Runtime System for Concurrent Heterogeneous Computing
Concurrent heterogeneous computing (CHC) is rapidly emerging as a promising solution for high-performance and energy-efficient computing. The fundamental challenges for efficient CHC are how to partition the workload of the target application across the devices in the underlying CHC system and how to control the operating frequency of each device in order to maximize the overall efficiency. Despite the extensive prior work on system software techniques for CHC, efficient runtime support that robustly handles both functional and performance heterogeneity without extensive offline profiling still remains unexplored. To bridge this gap, we propose RCHC, a holistic runtime system for concurrent heterogeneous computing. RCHC dynamically profiles the target application and constructs performance and power estimation models from the runtime information. Guided by these models, RCHC explores the system state space, determines the system state expected to maximize the efficiency of the target application, and executes it accordingly. Our experimental results demonstrate that RCHC significantly outperforms the baseline version that employs the GPU (e.g., 61.0% higher energy efficiency on average) and achieves efficiency comparable to that of the static-best version, which requires extensive offline profiling.
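The model-guided search described above, choosing the (partition, frequency) state the models predict to be most efficient, reduces in sketch form to scoring each candidate state with estimated time and power. The toy models in the usage below are stand-ins for the estimators RCHC builds from runtime profiling.

```python
def best_state(partitions, freqs, est_time, est_power):
    """Score every (partition, frequency) candidate with the estimation
    models and return the state minimizing predicted energy = power * time."""
    return min(((p, f) for p in partitions for f in freqs),
               key=lambda s: est_time(*s) * est_power(*s))
```

With p as the fraction of work on one device and f its operating frequency, exhaustive scoring is cheap because the models, not real executions, evaluate each state; maximizing a performance-per-watt metric instead only changes the key function.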
Quantifying the Performance and Energy-Efficiency Impact of Hardware Transactional Memory on Scientific Applications on Large-Scale NUMA Systems
Hardware transactional memory (HTM) is supported by widely-used commodity processors. While the effectiveness of HTM has been evaluated on small-scale multi-core systems, the performance and energy efficiency of HTM for scientific workloads on large-scale NUMA systems, which are increasingly adopted in high-performance computing, remain unquantified. To bridge this gap, this work investigates the performance and energy-efficiency impact of HTM on scientific applications on large-scale NUMA systems. We first quantify the performance and energy efficiency of HTM for scientific workloads based on the widely-used CLOMP-TM benchmark. We then discuss a set of generic software optimizations that can effectively improve the performance and energy efficiency of transactional scientific workloads on large-scale NUMA systems. Finally, we present case studies in which we apply these optimizations to representative transactional scientific applications and significantly improve their performance and energy efficiency on large-scale NUMA systems.
- …