51 research outputs found
Effects of component-subscription network topology on large-scale data centre performance scaling
Modern large-scale data centres, such as those used for cloud computing
service provision, are becoming ever-larger as the operators of those data
centres seek to maximise the benefits from economies of scale. With these
increases in size comes a growth in system complexity, which is usually
problematic. There is an increased desire for automated "self-star"
configuration, management, and failure-recovery of the data-centre
infrastructure, but many traditional techniques scale much worse than linearly
as the number of nodes to be managed increases. As the number of nodes in a
median-sized data-centre looks set to increase by two or three orders of
magnitude in coming decades, it seems reasonable to attempt to explore and
understand the scaling properties of the data-centre middleware before such
data-centres are constructed. In [1] we presented SPECI, a simulator that
predicts aspects of large-scale data-centre middleware performance,
concentrating on the influence of status changes such as policy updates or
routine node failures. [...]. In [1] we used a first-approximation assumption
that such subscriptions are distributed wholly at random across the data
centre. In this present paper, we explore the effects of introducing more
realistic constraints to the structure of the internal network of
subscriptions. We contrast the original results [...] exploring the effects of
making the data-centre's subscription network have a regular lattice-like
structure, and also semi-random network structures resulting from parameterised
network generation functions that create "small-world" and "scale-free"
networks. We show that for distributed middleware topologies, the structure and
distribution of tasks carried out in the data centre can significantly
influence the performance overhead imposed by the middleware
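Purely as an illustration of the subscription-network classes contrasted in this abstract (random, lattice, small-world, scale-free), the sketch below generates comparable graphs with networkx; the node count and mean subscription degree are hypothetical values chosen for the example, not parameters from the paper or from SPECI.

```python
# Illustrative sketch (not from the paper): generating the four classes of
# subscription networks contrasted in the study, using networkx.
import networkx as nx

N, K = 1024, 8  # nodes and mean subscription degree (hypothetical values)

topologies = {
    # fully random subscriptions (the first-approximation assumption in [1])
    "random":      nx.gnm_random_graph(N, N * K // 2, seed=1),
    # regular ring lattice: each node subscribes to its K nearest neighbours
    "lattice":     nx.watts_strogatz_graph(N, K, p=0.0, seed=1),
    # small-world: lattice with a fraction of links rewired at random
    "small-world": nx.watts_strogatz_graph(N, K, p=0.1, seed=1),
    # scale-free: preferential attachment gives a heavy-tailed degree distribution
    "scale-free":  nx.barabasi_albert_graph(N, K // 2, seed=1),
}

for name, g in topologies.items():
    print(f"{name:11s} clustering={nx.average_clustering(g):.3f} "
          f"mean_degree={2 * g.number_of_edges() / N:.1f}")
```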
Holistic VM Placement for Distributed Parallel Applications in Heterogeneous Clusters
In a heterogeneous cluster, virtual machine (VM) placement for a distributed parallel application is challenging because of the numerous possible ways of placing the application and the complexity of estimating its performance. This study investigates a holistic VM placement technique for distributed parallel applications in a heterogeneous cluster, aiming to maximize the efficiency of the cluster and consequently reduce costs for service providers and users. The proposed technique accounts in a combined manner for the various factors that affect performance. First, we analyze the effects of resource heterogeneity, different VM configurations, and interference between VMs on the performance of distributed parallel applications with a wide diversity of characteristics, including scientific and big-data analytics applications. We then propose a placement technique that uses a machine learning algorithm to estimate the runtime of a distributed parallel application. To train the performance estimation model, the application is profiled against synthetic workloads that mostly utilize its dominant resource, the resource that most strongly affects its performance, which reduces the profiling space dramatically. Through experimental and simulation studies, we show that the proposed placement technique can find good VM placement configurations for various workloads
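As a rough illustration of the idea described in this abstract (profiling against workloads that stress the application's dominant resource to train a runtime model, which then ranks candidate placements), a minimal sketch follows; the features, numbers, and choice of regressor are assumptions for the example rather than the paper's actual design.

```python
# Minimal sketch (assumptions, not the paper's implementation): train a runtime
# model from profiles gathered while a synthetic workload stresses the
# application's dominant resource, then rank candidate VM placements with it.
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical profiling data: each row describes one profiled run as
# (vCPUs, memory_GB, host_speed_factor, dominant_resource_contention) -> runtime_s
X_profile = [
    [4, 8, 1.0, 0.0], [4, 8, 1.0, 0.5], [4, 8, 0.7, 0.5],
    [8, 16, 1.0, 0.0], [8, 16, 0.7, 0.8], [8, 16, 1.0, 0.8],
]
y_runtime = [120.0, 180.0, 240.0, 70.0, 150.0, 110.0]

model = GradientBoostingRegressor().fit(X_profile, y_runtime)

# Candidate placements on a heterogeneous cluster (hypothetical hosts).
candidates = {
    "fast host, co-located": [8, 16, 1.0, 0.8],
    "slow host, isolated":   [8, 16, 0.7, 0.0],
    "fast host, isolated":   [4, 8, 1.0, 0.0],
}
best = min(candidates, key=lambda c: model.predict([candidates[c]])[0])
print("estimated best placement:", best)
```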
Analyzing and optimizing the performance and energy efficiency of transactional scientific applications on large-scale NUMA systems with HTM support
Hardware transactional memory (HTM) is widely supported by commodity processors. While the effectiveness of HTM has been evaluated on small-scale multi-core systems, the performance and energy efficiency of HTM for scientific workloads on large-scale NUMA systems, which are increasingly adopted for high-performance computing, remain unquantified. To bridge this gap, this work investigates the performance and energy-efficiency impact of HTM on scientific applications on large-scale NUMA systems. Specifically, we quantify the performance and energy efficiency of HTM for scientific workloads using the widely used CLOMP-TM benchmark. We then discuss a set of generic software optimizations that effectively improve the performance and energy efficiency of transactional scientific workloads on large-scale NUMA systems. Further, we present case studies in which we apply these optimizations to representative transactional scientific applications and investigate the potential for high-performance and energy-efficient runtime support
Hap: A heterogeneity-conscious runtime system for adaptive pipeline parallelism
Heterogeneous multiprocessing (HMP) is a promising solution for energy-efficient computing. While pipeline parallelism is an effective technique to accelerate various workloads (e.g., streaming), relatively little work has been done to investigate efficient runtime support for adaptive pipeline parallelism in the context of HMP. To bridge this gap, we propose HAP, a heterogeneity-conscious runtime system for adaptive pipeline parallelism. HAP dynamically controls the full set of HMP system resources to improve the energy efficiency of the target pipeline application. We demonstrate that HAP achieves significant energy-efficiency gains over the Linux HMP scheduler and a state-of-the-art runtime system, while incurring a low performance overhead
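A toy sketch of the kind of decision an adaptive pipeline runtime on an HMP system must make: choosing a stage-to-cluster mapping by estimated throughput per watt. The stages, timings, and power figures below are invented for illustration and are not HAP's model.

```python
# Illustrative sketch only (not HAP itself): brute-force the mapping of pipeline
# stages to big/LITTLE clusters that maximises throughput per watt, given
# hypothetical per-stage service times and per-core power figures.
from itertools import product

stage_time = {            # seconds per item on each cluster (assumed numbers)
    "decode": {"big": 0.8, "little": 2.0},
    "filter": {"big": 0.5, "little": 1.1},
    "encode": {"big": 1.2, "little": 3.0},
}
cluster_power = {"big": 4.0, "little": 1.0}   # watts per active core (assumed)

best = None
for mapping in product(("big", "little"), repeat=len(stage_time)):
    assignment = dict(zip(stage_time, mapping))
    # pipeline throughput is limited by its slowest stage
    throughput = 1.0 / max(stage_time[s][c] for s, c in assignment.items())
    power = sum(cluster_power[c] for c in assignment.values())
    efficiency = throughput / power            # items per joule
    if best is None or efficiency > best[0]:
        best = (efficiency, assignment)

print(f"best mapping: {best[1]} ({best[0]:.3f} items/J)")
```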
BLPP: Improving the Performance of GPGPUs with Heterogeneous Memory through Bandwidth- and Latency-Aware Page Placement
GPGPUs with heterogeneous memory have surfaced as a promising solution to improve the programmability and flexibility of GPGPU computing. Despite extensive prior work, relatively little has been done to investigate holistic system software support for heterogeneity-aware memory management. To bridge this gap, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP dynamically places pages across the heterogeneous memory nodes by preserving the optimal allocation ratio computed from their performance characteristics. Our experimental results show that BLPP considerably outperforms the state-of-the-art technique and performs similarly to the static-best version, which requires extensive offline profiling
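To illustrate the notion of an allocation ratio derived from the memory nodes' performance characteristics, here is a minimal sketch; the bandwidth numbers and the simple proportional policy are assumptions for the example, not BLPP's actual algorithm.

```python
# Sketch of the underlying idea (assumed numbers, not BLPP's actual policy):
# split an application's pages across two memory nodes so that, for
# bandwidth-sensitive data, each node serves traffic in proportion to its
# bandwidth, while latency-sensitive data stays in the GPU-local node.
def allocation_ratio(bw_fast_gbps, bw_slow_gbps, bandwidth_sensitive):
    """Fraction of pages to place in the fast (GPU-local) memory node."""
    if not bandwidth_sensitive:
        return 1.0  # latency-sensitive: keep pages in the GPU-local node
    # bandwidth-sensitive: balance traffic across both nodes
    return bw_fast_gbps / (bw_fast_gbps + bw_slow_gbps)

total_pages = 100_000
ratio = allocation_ratio(bw_fast_gbps=300.0, bw_slow_gbps=60.0,
                         bandwidth_sensitive=True)
fast_pages = int(total_pages * ratio)
print(f"place {fast_pages} pages in fast memory, "
      f"{total_pages - fast_pages} in capacity memory")
```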
RCHC: A Holistic Runtime System for Concurrent Heterogeneous Computing
Concurrent heterogeneous computing (CHC) is rapidly emerging as a promising solution for high-performance and energy-efficient computing. The fundamental challenges for efficient CHC are how to partition the workload of the target application across the devices in the underlying CHC system and how to control the operating frequency of each device in order to maximize the overall efficiency. Despite extensive prior work on system software techniques for CHC, efficient runtime support that robustly handles both functional and performance heterogeneity without the need for extensive offline profiling remains unexplored. To bridge this gap, we propose RCHC, a holistic runtime system for concurrent heterogeneous computing. RCHC dynamically profiles the target application and constructs performance and power estimation models from the runtime information. Guided by these models, RCHC explores the system state space, determines the system state expected to maximize the efficiency of the target application, and executes the application accordingly. Our experimental results demonstrate that RCHC significantly outperforms the baseline version that employs the GPU (e.g., 61.0% higher energy efficiency on average) and achieves efficiency comparable to that of the static-best version, which requires extensive offline profiling
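The following sketch illustrates, with toy stand-in models, the kind of system-state search described here: enumerating workload splits and device frequencies and choosing the state with the best estimated efficiency. It is an assumption-laden illustration, not RCHC's implementation.

```python
# Illustrative sketch (assumed models, not RCHC's): enumerate candidate system
# states -- workload split between CPU and GPU plus each device's frequency --
# and pick the one whose estimated performance per watt is highest.
from itertools import product

CPU_FREQS = [1.2, 2.0, 2.8]               # GHz (hypothetical DVFS levels)
GPU_FREQS = [0.6, 1.0, 1.4]
SPLITS    = [i / 10 for i in range(11)]   # fraction of work given to the GPU

def est_runtime(split, f_cpu, f_gpu):
    # toy linear-speedup model standing in for the runtime's online profiles
    t_cpu = (1 - split) * 10.0 / f_cpu
    t_gpu = split * 10.0 / (2.5 * f_gpu)
    return max(t_cpu, t_gpu)              # devices work concurrently

def est_power(split, f_cpu, f_gpu):
    # toy power model: roughly cubic in frequency, scaled by utilisation
    return 10 + (1 - split) * 15 * f_cpu**3 + split * 20 * f_gpu**3

# efficiency = work per joule = 1 / (runtime * power)
best = max(product(SPLITS, CPU_FREQS, GPU_FREQS),
           key=lambda s: 1.0 / (est_runtime(*s) * est_power(*s)))
print("chosen state (gpu_share, cpu_GHz, gpu_GHz):", best)
```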
Quantifying the Performance and Energy-Efficiency Impact of Hardware Transactional Memory on Scientific Applications on Large-Scale NUMA Systems
Hardware transactional memory (HTM) is supported by widely used commodity processors. While the effectiveness of HTM has been evaluated on small-scale multi-core systems, the performance and energy efficiency of HTM for scientific workloads on large-scale NUMA systems, which are increasingly adopted for high-performance computing, remain unquantified. To bridge this gap, this work investigates the performance and energy-efficiency impact of HTM on scientific applications on large-scale NUMA systems. We first quantify the performance and energy efficiency of HTM for scientific workloads using the widely used CLOMP-TM benchmark. We then discuss a set of generic software optimizations that can be effectively used to improve the performance and energy efficiency of transactional scientific workloads on large-scale NUMA systems. Finally, we present case studies in which we apply these optimizations to representative transactional scientific applications and significantly improve their performance and energy efficiency on large-scale NUMA systems
RPPC: a Holistic Runtime System for Maximizing Performance under Power Capping
Maximizing performance in power-constrained computing environments is highly important in cloud and datacenter computing. To achieve the best possible performance of parallel applications under power capping, it is crucial to execute them with the optimal concurrency level and the optimal cross-component power allocation between CPUs and memory. Despite extensive prior work, efficient runtime support that maximizes the performance of parallel applications under power capping through coordinated control of the concurrency level and cross-component power allocation remains unexplored. To bridge this gap, this work proposes RPPC, a holistic runtime system for maximizing performance under power capping. In contrast to the state-of-the-art techniques, RPPC robustly controls the two performance-critical knobs (i.e., concurrency level and cross-component power allocation) in a coordinated manner to maximize the performance of parallel applications under power capping. RPPC dynamically identifies the characteristics of the target parallel application and explores the system state space to find an efficient system state. Our experimental results demonstrate that RPPC significantly outperforms the two state-of-the-art power-capping techniques, achieves performance comparable to that of the static-best version, which requires extensive per-application offline profiling, incurs small performance overheads, and provides a re-adaptation mechanism for external events such as total power budget changes
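As an illustration of the coordinated search over concurrency level and CPU/memory power split under a power cap, a toy sketch follows; the performance model and all numbers are hypothetical and are not RPPC's.

```python
# Sketch of the coordination problem RPPC addresses (all models and numbers are
# assumptions): under a total power cap, jointly pick a thread count and a
# CPU/memory power split that maximise estimated application performance.
POWER_CAP  = 120.0                  # watts, total budget (hypothetical)
THREADS    = [4, 8, 16, 32]
CPU_SHARES = [0.5, 0.6, 0.7, 0.8]   # fraction of the budget given to CPUs

def est_performance(threads, cpu_w, mem_w):
    # toy model: throughput rises with threads only while the CPU power budget
    # can feed them, and is separately capped by the memory power budget
    cpu_side = min(threads, cpu_w / 3.0)    # threads the CPU budget can sustain
    mem_side = mem_w / 2.0                  # bandwidth the memory budget sustains
    return min(cpu_side, mem_side)

best = max(((t, s) for t in THREADS for s in CPU_SHARES),
           key=lambda ts: est_performance(ts[0],
                                          ts[1] * POWER_CAP,
                                          (1 - ts[1]) * POWER_CAP))
print("chosen (threads, cpu_share):", best)
```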
- …